CS145 Lecture Notes (15) -- Data Warehousing and Data Mining



Two broad types of database activity:

OLTP: On-Line Transaction Processing

OLAP: On-Line Analytical Processing

Data Warehousing

Bring data from "operational" (OLTP) sources into a single warehouse to do analysis and mining (OLAP).

(system figure)


























Also referred to as Decision Support Systems (DSS)

=> Extremely popular in large corporations today. Many have spent millions in data warehousing projects.

Example: Victoria's Secret

Example: Wal-Mart

Example: Large internet company


Technical challenges:
  1. Extracting data from operational sources in useful format
    (add to figure)

  2. Transforming, "cleaning" ("scrubbing"), and possibly summarizing operational data
    (add to figure)

  3. Integrating data from multiple sources
    (add to figure)

  4. Keeping warehouse up-to-date as source data changes

Data at Warehouse

Most warehouse applications are of similar character with two kinds of data:

Star Schema

One fact table referencing several dimension tables
Example:
  Sales(StoreID, ItemID, CustID, qty, price)  // fact table
  Store(StoreID, city, state)
  Item(ItemID, name, brand, color, size)
  Customer(CustID, name, address)
(diagram)














In fact table: Complete star join:
  SELECT *
  FROM   Sales, Store, Item, Customer
  WHERE  Sales.StoreID = Store.StoreID
  AND    Sales.ItemID = Item.ItemID
  AND    Sales.CustID = Customer.CustID
Typical OLAP query will:
  1. Do all or part of star join
  2. Filter interesting tuples based on fact and/or dimension data
  3. Group by one or more dimensions
  4. Aggregate the result
Example: Find the sum of all sales in California of blue items with item price > 100, grouped by store and customer



















Performance:

Question: Why are materialized views appropriate in this setting?






Data Cubes

Also called "Mutidimensional OLAP"
Idea: N-dimensional version of spreadsheet

(diagram)



















Fact table uniqueness:

Question: Is date a dimension or dependent attribute?




Let's keep things simple: Sales(StoreID, ItemID, CustID, price)

Queries can "slice and dice", "drill down and roll up"

Performance:

SQL Constructs (CUBE and ROLLUP)

Adding "WITH CUBE" to a GROUP-BY query expands the query result into a full data cube:
  SELECT StoreID, ItemID, CustID, SUM(price)
  FROM Sales
  GROUP BY StoreID,ItemID,CustID WITH CUBE
Alternative syntax: "... GROUP BY CUBE(StoreID,ItemID,CustID)"

All result tuples, plus all cube summary tuples over result. For example:

CUBE queries useful for data browsing, also for materialized views:
  CREATE MATERIALIZED VIEW SalesCube AS
    SELECT StoreID, ItemID, CustID, SUM(price) as p
    FROM Sales
    GROUP BY StoreID,ItemID,CustID WITH CUBE
Example query using view: Find total sales of all blue items in California













Adding "WITH ROLLUP" to a GROUP-BY query expands the query result into a portion of the data cube:
  Sales(RegionID, StoreID, ClerkID, hourlyPay)

  SELECT RegionID, StoreID, ClerkID, AVG(hourlyPay)
  FROM Sales
  GROUP BY RegionID,StoreID,ClerkID WITH ROLLUP
Alternative syntax: "... GROUP BY ROLLUP(RegionID,StoreID,ClerkID)"

All result tuples, plus summary tuples with NULLs in right-end columns:

Notes:

Data Mining

Search for patterns in large databases Classic application: "market basket" data
  Purchase(salesID, item)
  ...
  (3, bread)
  (3, milk)
  (3, eggs)
  (3, beer)
  (4, beer)
  (4, chips)
  ....
Want to find association rules
  {L1,L2,...,Ln} -> R
"If a customer bought all the items in set {L1, L2, ..., Ln}, he is very likely to also have bought item R."

Example:

  {bread, milk} -> eggs
  {diapers}     -> beer

Question: Can we write a SQL query to find association rules?





Question: What can we write it in?



Two concepts for association rules: {L1,L2,...,Ln} -> R Goals of data mining:
Other types of data mining rules and patterns: