CS145 Lecture Notes (15) -- Data Warehousing and Data Mining
Two broad types of database activity:
OLTP: On-Line Transaction Processing
- Short transactions, both queries and updates
(e.g., update account balance, enroll in course)
- Queries are simple
(e.g., find account balance, find grade in course)
- Queries and updates touch small portion of data
(e.g., examples above)
- Updates are frequent
(e.g., movie tickets, seat reservations, shopping carts)
OLAP: On-Line Analytical Processing
- Long transactions, usually complex queries
(e.g., all statistics about all sales, grouped by dept and month)
- Queries touch large portion of the data
(e.g., as above)
- "Data mining" operations
- Infrequent updates
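A minimal sketch of the contrast, using Python's built-in sqlite3 module. The Account table and its contents are hypothetical, made up for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Account(acctID INTEGER PRIMARY KEY, balance REAL);
    INSERT INTO Account VALUES (1, 100.0), (2, 250.0), (3, 40.0);
""")

# OLTP-style: a short transaction that updates and reads a single row.
with conn:
    conn.execute("UPDATE Account SET balance = balance - 30 WHERE acctID = 1")
oltp_balance = conn.execute(
    "SELECT balance FROM Account WHERE acctID = 1").fetchone()[0]

# OLAP-style: a read-only query that aggregates over the whole table.
olap_stats = conn.execute(
    "SELECT COUNT(*), SUM(balance), AVG(balance) FROM Account").fetchone()

print(oltp_balance, olap_stats)
```

On a three-row table both run instantly. The difference matters when the OLAP query must scan millions of rows while OLTP transactions keep updating them.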
Data Warehousing
Bring data from "operational" (OLTP) sources into a single warehouse
to do analysis and mining (OLAP).
(system figure)
Also referred to as Decision Support Systems (DSS)
=> Extremely popular in large corporations today. Many have spent
millions on data warehousing projects.
Example: Victoria's Secret
- All sales information copied into data warehouse. Used to supply
appropriate merchandise (current and future) to appropriate outlets.
Data warehousing project paid off in 1-2 years.
Example: Wal-Mart
- All distributor, store, and sales information copied into
data warehouse. Used to make store layout and supplier decisions.
Hundreds (thousands?) of terabytes of data.
Example: Large internet company
- All page view, click stream, and purchase information propagates
into data warehouse. Used to make page layout and advertising
decisions (targeted and global).
Technical challenges:
- Extracting data from operational sources in useful format
(add to figure)
- Transforming, "cleaning" ("scrubbing"), and possibly summarizing
operational data
(add to figure)
- Integrating data from multiple sources
(add to figure)
- Keeping warehouse up-to-date as source data changes
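The extract / transform / integrate steps above can be sketched as follows. Everything here is an assumption for illustration: the two sources, their field names (price_usd, cents, ITEM_NAME, ...), and the target row format are hypothetical and not taken from any real system.

```python
from datetime import datetime

# Hypothetical rows extracted from two operational sources that use
# different field names, units, and date formats.
source_a = [{"item": " Blue Shirt ", "price_usd": "19.99", "sold": "2024-01-05"}]
source_b = [{"ITEM_NAME": "blue shirt", "cents": 2150, "day": "01/06/2024"}]

def from_source_a(row):
    # Clean: trim whitespace, normalize case, parse strings into typed values.
    return {"item": row["item"].strip().lower(),
            "price": float(row["price_usd"]),
            "date": datetime.strptime(row["sold"], "%Y-%m-%d").date()}

def from_source_b(row):
    # Transform: convert cents to dollars and a US-style date to a date value.
    return {"item": row["ITEM_NAME"].strip().lower(),
            "price": row["cents"] / 100,
            "date": datetime.strptime(row["day"], "%m/%d/%Y").date()}

# Integrate: both sources now share one format and can be loaded together.
warehouse = ([from_source_a(r) for r in source_a] +
             [from_source_b(r) for r in source_b])
print(warehouse)
```

The last challenge in the list, keeping the warehouse up-to-date as the sources change, is not addressed by this sketch.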
Data at Warehouse
Most warehouse applications have a similar character, with two kinds of
data:
- Fact data: sales transactions, flight arrivals, course
enrollments, page views
Updated frequently, often append-only, very large
- Dimension data: store items, store customers, students, courses, users, advertisers
Updated infrequently, not as large
Star Schema
One fact table referencing several dimension tables
Example:
Sales(StoreID, ItemID, CustID, qty, price) // fact table
Store(StoreID, city, state)
Item(ItemID, name, brand, color, size)
Customer(CustID, name, address)
(diagram)
In fact table:
- Dimension attributes: foreign keys to dimension tables
- Dependent attributes: all others, often aggregated in queries
Complete star join:
SELECT *
FROM Sales, Store, Item, Customer
WHERE Sales.StoreID = Store.StoreID
AND Sales.ItemID = Item.ItemID
AND Sales.CustID = Customer.CustID
Typical OLAP query will:
- Do all or part of star join
- Filter interesting tuples based on fact and/or dimension data
- Group by one or more dimensions
- Aggregate the result
Example: Find the sum of all sales in California of blue items with item
price > 100, grouped by store and customer
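One possible way to write this example, run against a tiny made-up instance with Python's built-in sqlite3 module. Assumptions not stated in the notes: "sum of all sales" is taken to mean SUM(qty * price), California is stored as 'CA' in Store.state, and all sample rows are hypothetical. Customer is not joined because the query needs no customer attributes (a partial star join).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Store(StoreID TEXT PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE Item(ItemID TEXT PRIMARY KEY, name TEXT, brand TEXT,
                      color TEXT, size TEXT);
    CREATE TABLE Sales(StoreID TEXT, ItemID TEXT, CustID TEXT,
                       qty INTEGER, price REAL);
    INSERT INTO Store VALUES ('S1','Palo Alto','CA'), ('S2','Reno','NV');
    INSERT INTO Item VALUES ('I1','coat','X','blue','M'),
                            ('I2','hat','Y','red','S');
    INSERT INTO Sales VALUES
        ('S1','I1','C1',1,150),   -- qualifies
        ('S1','I1','C1',2,120),   -- qualifies, same group as above
        ('S1','I1','C2',1,90),    -- price too low
        ('S1','I2','C1',1,200),   -- item is not blue
        ('S2','I1','C1',1,300);   -- store is not in California
""")

result = conn.execute("""
    SELECT Sales.StoreID, Sales.CustID, SUM(qty * price)
    FROM Sales, Store, Item
    WHERE Sales.StoreID = Store.StoreID          -- partial star join
      AND Sales.ItemID = Item.ItemID
      AND Store.state = 'CA'                     -- filter on dimension data
      AND Item.color = 'blue'
      AND Sales.price > 100                      -- filter on fact data
    GROUP BY Sales.StoreID, Sales.CustID         -- group by dimensions
""").fetchall()
print(result)
```

The query follows the four steps listed above: join, filter, group, aggregate.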
Performance:
- OLAP queries can be extremely slow
- New kinds of indexes
- New query processing techniques
- Systems make extensive use of materialized views
Question: Why are materialized views appropriate in this setting?
Data Cubes
Also called "Multidimensional OLAP"
Idea:
- Dimension data forms axes of "cube"
- Fact data in cells
- Aggregated data on sides, edges, and corners
N-dimensional version of spreadsheet
(diagram)
Fact table uniqueness:
- If dimension attributes are not a key, cells must aggregate
(e.g., sum of qty*price)
- "Date" often used to create a key:
Sales(StoreID,ItemID,CustID,date,qty,price)
Question: Is date a dimension or dependent attribute?
Let's keep things simple: Sales(StoreID, ItemID, CustID, price)
Queries can "slice and dice", "drill down and roll up"
- Slice: Select on one or more dimensions
(e.g., only store S2, only stores in California)
- Dice: Differing definitions
Constrain all dimensions (e.g., male customers buying clothes in California), or
Partition or group on one or more dimensions (e.g., stores by
state and items by color)
- Drill down ("de-aggregate"):
Examining summary data, break it out by dimension attribute
Example: Looking at sum of all California sales, break it out by
store
- Roll up ("aggregate"):
Examining data, summarize along some dimension
Example: Looking at data grouped by item and customer, aggregate so
only grouped by customer
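The operations above correspond to ordinary SQL queries. A minimal sketch over the simplified Sales schema, using Python's built-in sqlite3 module. The Store(StoreID, state) table and all rows are hypothetical, and 'CA' is assumed to stand for California.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Store(StoreID TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE Sales(StoreID TEXT, ItemID TEXT, CustID TEXT, price REAL);
    INSERT INTO Store VALUES ('S1','CA'), ('S2','CA'), ('S3','NV');
    INSERT INTO Sales VALUES
        ('S1','I1','C1',10), ('S1','I2','C1',20),
        ('S2','I1','C2',30), ('S3','I1','C1',40);
""")

def q(sql):
    return conn.execute(sql).fetchall()

# Slice: select on one dimension (only store S2).
slice_s2 = q("SELECT * FROM Sales WHERE StoreID = 'S2'")

# Drill down: start from the sum of all California sales ...
ca_total = q("""SELECT SUM(price) FROM Sales JOIN Store USING (StoreID)
                WHERE state = 'CA'""")
# ... then break it out by store.
drill_down = q("""SELECT StoreID, SUM(price)
                  FROM Sales JOIN Store USING (StoreID)
                  WHERE state = 'CA' GROUP BY StoreID ORDER BY StoreID""")

# Roll up: start from data grouped by item and customer ...
by_item_cust = q("""SELECT ItemID, CustID, SUM(price) FROM Sales
                    GROUP BY ItemID, CustID ORDER BY ItemID, CustID""")
# ... then aggregate so it is grouped by customer only.
roll_up = q("""SELECT CustID, SUM(price) FROM Sales
               GROUP BY CustID ORDER BY CustID""")

print(slice_s2, ca_total, drill_down, by_item_cust, roll_up, sep="\n")
```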
Performance:
- Data cube can be huge
- Also can be sparse
- Can compute in advance, compute on demand, or some combination
SQL Constructs (CUBE and ROLLUP)
- Textbook describes earlier proposed standard: CUBE(R) in FROM clause
- These notes cover the actual standard: CUBE and ROLLUP in GROUP-BY clause
Adding "WITH CUBE" to a GROUP-BY query expands the
query result into a full data cube:
SELECT StoreID, ItemID, CustID, SUM(price)
FROM Sales
GROUP BY StoreID,ItemID,CustID WITH CUBE
Alternative syntax: "... GROUP BY CUBE(StoreID,ItemID,CustID)"
All result tuples, plus all cube summary tuples over result. For example:
(store123, NULL, cust456, $1,000)
$1,000 is sum of price for all items bought by customer cust456 at store store123
Face of cube
(NULL, item789, NULL, $10,000)
$10,000 is sum of price for item item789 bought by any customer at any store
Edge of cube
(NULL, NULL, NULL, $1,000,000)
$1,000,000 is sum of price in entire database
Corner of cube
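To see face, edge, and corner tuples on a small instance, the sketch below emulates the cube result by running one GROUP BY per subset of the dimension attributes. It uses Python's built-in sqlite3 module, and SQLite does not provide CUBE, so this is a stand-in for the construct rather than the construct itself. The helper name cube and all sample rows are hypothetical.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales(StoreID TEXT, ItemID TEXT, CustID TEXT, price REAL);
    INSERT INTO Sales VALUES
        ('s1','i1','c1',10), ('s1','i2','c1',20),
        ('s2','i1','c1',30), ('s2','i1','c2',40);
""")

dims = ["StoreID", "ItemID", "CustID"]

def cube(conn, dims):
    """Emulate GROUP BY ... WITH CUBE: one GROUP BY per subset of dims."""
    rows = []
    for k in range(len(dims), -1, -1):
        for kept in combinations(dims, k):
            # Dimensions that are aggregated away appear as NULL.
            select = ", ".join(d if d in kept else "NULL" for d in dims)
            group = f" GROUP BY {', '.join(kept)}" if kept else ""
            sql = f"SELECT {select}, SUM(price) FROM Sales{group}"
            rows += conn.execute(sql).fetchall()
    return rows

cube_rows = cube(conn, dims)
for row in cube_rows:
    print(row)
```

With three dimensions this runs 2^3 = 8 separate GROUP BY queries, which illustrates why writing the same thing without CUBE is more complex and typically less efficient.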
CUBE queries useful for data browsing, also for materialized views:
CREATE MATERIALIZED VIEW SalesCube AS
SELECT StoreID, ItemID, CustID, SUM(price) as p
FROM Sales
GROUP BY StoreID,ItemID,CustID WITH CUBE
Example query using view: Find total sales of all blue items in California
- CUBE construct enables efficient data cube operations
built on top of conventional relational DBMS.
- Can write the same queries without CUBE. Often more
complex, less efficient.
Adding "WITH ROLLUP" to a GROUP-BY query expands the
query result into a portion of the data cube:
Sales(RegionID, StoreID, ClerkID, hourlyPay)
SELECT RegionID, StoreID, ClerkID, AVG(hourlyPay)
FROM Sales
GROUP BY RegionID,StoreID,ClerkID WITH ROLLUP
Alternative syntax: "... GROUP BY ROLLUP(RegionID,StoreID,ClerkID)"
All result tuples, plus summary tuples with NULLs in right-end columns:
(region123, store456, NULL, $25)
Average hourly pay for given store
(region123, NULL, NULL, $22)
Average hourly pay for given region
(NULL, NULL, NULL, $26)
Average hourly pay over all data
- But not, for example, (NULL, store456, NULL, $25)
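A sketch of which summary tuples ROLLUP produces, emulated with one GROUP BY per prefix of the attribute list. It uses Python's built-in sqlite3 module, and SQLite does not provide ROLLUP, so this is an emulation rather than the construct itself. The helper name rollup and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales(RegionID TEXT, StoreID TEXT, ClerkID TEXT,
                       hourlyPay REAL);
    INSERT INTO Sales VALUES
        ('r1','s1','k1',20), ('r1','s1','k2',30),
        ('r1','s2','k3',10), ('r2','s3','k4',40);
""")

dims = ["RegionID", "StoreID", "ClerkID"]

def rollup(conn, dims):
    """Emulate GROUP BY ... WITH ROLLUP: one GROUP BY per prefix of dims."""
    rows = []
    for k in range(len(dims), -1, -1):
        kept = dims[:k]  # prefixes only, so NULLs appear at the right end
        select = ", ".join(d if d in kept else "NULL" for d in dims)
        group = f" GROUP BY {', '.join(kept)}" if kept else ""
        sql = f"SELECT {select}, AVG(hourlyPay) FROM Sales{group}"
        rows += conn.execute(sql).fetchall()
    return rows

rollup_rows = rollup(conn, dims)
for row in rollup_rows:
    print(row)
```

Only n+1 groupings are computed for n attributes, compared with 2^n for CUBE.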
Notes:
- ROLLUP makes the most sense when attribute list is a hierarchy.
- Both CUBE and ROLLUP can be applied to subsets of GROUP-BY attributes
Ex: "... GROUP BY StoreID, CUBE(ItemID,CustID)"
Data Mining
Search for patterns in large databases
- Often performed over data warehouses
- Both less structured and less ad-hoc than OLAP queries
Classic application: "market basket" data
Purchase(salesID, item)
...
(3, bread)
(3, milk)
(3, eggs)
(3, beer)
(4, beer)
(4, chips)
....
Want to find association rules
{L1,L2,...,Ln} -> R
"If a customer bought all the items in set {L1, L2, ..., Ln}, they are
very likely to also have bought item R."
Example:
{bread, milk} -> eggs
{diapers} -> beer
Question: Can we write a SQL query to find association rules?
Question: What can we write it in?
Two concepts for association rules: {L1,L2,...,Ln} -> R
- Support: {L1,L2,...,Ln} must appear frequently
Support = #sales containing {L1,L2,...,Ln} / total #sales
- Confidence: probability that R appears if {L1,L2,...,Ln} does
Confidence = #sales containing {L1,L2,...,Ln,R} / #sales containing {L1,L2,...,Ln}
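A small sketch of the two formulas above, evaluated for one given rule over Purchase(salesID, item) data. The sample rows and the helper names (baskets_from, support, confidence) are hypothetical. The sketch only scores a rule that is handed to it and does not search for rules.

```python
from collections import defaultdict

# Hypothetical Purchase(salesID, item) tuples.
purchases = [
    (1, "bread"), (1, "milk"), (1, "eggs"),
    (2, "bread"), (2, "milk"),
    (3, "bread"), (3, "milk"), (3, "eggs"), (3, "beer"),
    (4, "beer"), (4, "chips"),
]

def baskets_from(purchases):
    """Group items by salesID into one set of items per sale."""
    baskets = defaultdict(set)
    for sales_id, item in purchases:
        baskets[sales_id].add(item)
    return list(baskets.values())

def support(baskets, lhs):
    """#sales containing all of lhs / total #sales."""
    lhs = set(lhs)
    return sum(lhs <= b for b in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """#sales containing lhs and rhs / #sales containing lhs.
    Assumes at least one sale contains lhs (otherwise division by zero)."""
    lhs = set(lhs)
    with_lhs = [b for b in baskets if lhs <= b]
    return sum(rhs in b for b in with_lhs) / len(with_lhs)

baskets = baskets_from(purchases)
s = support(baskets, {"bread", "milk"})             # 3 of 4 sales
c = confidence(baskets, {"bread", "milk"}, "eggs")  # 2 of those 3 sales
print(s, c)
```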
Goals of data mining:
- Quickly find association rules over extremely large data sets
(e.g., all Wal-Mart sales for a year).
- Allow user to tune support and confidence.
Other types of data mining rules and patterns:
- Classification trees (= decision trees)
Buyers(<attributes>, purchase)
Want to predict purchase from <attributes>
- Clustering
Buyers(<attributes>)
Automatically group buyers into N similar types
- Top-N items
Purchase(salesID, item)
What were the N most often purchased items? (salesID irrelevant)
Question: Can we write it in SQL?