CS346 Lecture Notes
Topics in Query Processing and Optimization
Steps in database query processing
Step | Input | Component | Output
1 | Query string | Parser | Query tree
2 | Query tree | Checker | Valid query tree
3 | Valid query tree | View expander | Valid tree without views
4 | Valid tree without views | Logical query plan generator | Logical query plan
5 | Logical query plan | Rewriter (heuristic) | Better logical query plan
6 | Logical query plan | Physical plan generator (cost-based) | Selected physical plan
7 | Physical plan | Code generator | Executable code
8 | Executable code | Execution engine | Answer!
Running example
Student(ID, Name, Major) // ID is key
Course(Num, Dept) // Num is key
Taking(ID, Num) // (ID,Num) is key
Query to find names of all EE students taking at least one CS course:
Select Distinct Name
From Student, Course, Taking
Where Taking.ID = Student.ID
And Taking.Num = Course.Num
And Major = 'EE'
And Dept = 'CS'
Parser
Produces tree representation of the query string:
Checker
Verifies query tree against database schema:
- All tables in From clause exist
- Qualified attributes exist in their tables
- Unqualified attributes exist in exactly one referenced table
(table names usually added at this point)
- No ambiguities in table references
- All comparisons, aggregations, etc. are type-compatible
View expander
Suppose Student table is actually a view:
StudName(ID, Name)
StudMajor(ID, Major)
Create View Student(ID,Name,Major) As
Select StudName.ID, StudName.Name, StudMajor.Major
From StudName, StudMajor
Where StudName.ID = StudMajor.ID
Parse tree becomes:
Question: Do we need to run the checker again?
Question: What if SQL did not allow table expressions in the From clause?
Logical and physical query plans
Both are trees representing query evaluation
Leaves represent data
Internal nodes are operators over the data
Logical plan is higher-level and algebraic
Physical plan is lower-level and operational
Logical plan operators correspond to query language constructs
Physical plan operators correspond to implemented access methods
Logical plan operators
Extended relational algebra
Leaves of logical plans are table names
Basic operators: Select, Project, Cross-Product, Union, Difference
Abbreviations: Natural-Join, Theta-Join, Intersect
Extensions: Rename, Aggregate-GroupBy, Distinct (+ others)
Logical query plan for example query (direct from SQL):
Using 2-way natural joins instead of cross-products:
Logical plan generator
Usually there's a straightforward mapping from a valid parse tree to a
naive logical query plan. This logical plan may then be
rewritten to a better one.
Example: Push selections down:
=> We will cover rewriting techniques a little later
Physical plan operators
- Leaves of physical plans are usually table names, sometimes indexes
or other information
- Access methods for single tables:
- Table Scan
- Condition-Based Index Scan
- Index Scan without Condition
(Question: why is the last operator useful?)
- Join methods:
- Nested-Loops Join (various algorithms/improvements)
- Sort-Merge Join
- Hash Join (various algorithms)
- Other important operators:
- Sort
- Group
- Aggregate (may be combined with Group)
- Union, Except (= Difference), Intersect
- Operators often merged with other ones:
- Filter (= Selection = Apply-Pred)
- Projection
- Distinct (= Dup-Elim)
- In a distributed system:
- In a parallel system:
Possible physical plan for example query
(Student no longer a view):
Another possible physical plan:
Physical plan generation
There are lots and lots of possible physical query plans for a given
logical plan. The physical plan generator (sometimes called physical
plan enumerator) tries to select the one that will be optimal,
usually with respect to response time or in some cases throughput.
=> Physical plan generation is the meatiest part of query processing!!
Will cover it in detail later.
Code generator
Translates physical query plan tree into executable code.
- Very system-specific
- Some systems may instead use a query plan interpreter
Some terminology
Term | Meaning
Query processing | entire process
Query optimization | rewriter + physical plan generator
Query compilation | entire process through code generation
Query execution | execution engine
Coming up
- Query rewriting
- Passing information between physical operators
- Cost metrics for physical query plans
- Physical query plan enumeration
- Reducing the search space
- Further optimizations
Exercise
Consider the following modification to our running example:
Student(ID, Name, Major) // ID is key
Course(Num, Dept, Units) // Num is key
Took(ID, Num, Quarter) // (ID,Num) is key
Query to find all students named "Mary" who took at least one 3-unit
course in their own major in the fall quarter 2010:
Select *
From Student, Course, Took
Where Took.ID = Student.ID
And Took.Num = Course.Num
And Course.Dept = Student.Major
And Student.Name = 'Mary'
And Course.Units = 3
And Took.Quarter = 'Fall2010'
Suppose:
- The DBMS supports Nested-Loops Join (NLJ), Sort-Merge Join, and Hash Join.
For now assume that all 3 join methods are asymmetric, i.e., joins
are noncommutative.
- Each table can be accessed either with a Table Scan or a
Condition-Based Index Scan (assume all relevant attributes are
indexed).
- All single-table selections are pushed down, i.e., they are
performed before joins.
- For now assume that the system will not perform NLJ using an index on the inner relation's join attribute.
Note: join associativity matters
Question: How many possible physical query plans are there for this query?
Question: What if the system also supports NLJ with inner index (NLIJ)?
(Note: By convention, the right operand of NLJ and NLIJ is the inner.)
Case (a): System can exploit 2 indexes on a table
Case (b): System can't exploit 2 indexes on a table
Question: Go back to the original problem (no NLIJ) but
assume that Sort-Merge Join is symmetric. Now how many plans?
Query rewriting
Rewrite the logical query plan so it's likely to produce a
better physical query plan.
Rewrites are based on heuristic rules,
not on database statistics.
Implication: enumeration of physical plans considers only those
physical plans with some correspondence to the input logical plan.
* Push selections down
Can be detrimental in the (unusual) scenario of very selective join,
unselective selection condition.
Can be detrimental in the case of expensive predicates.
* "Flatten" subqueries
Need to be careful about duplicates
Question: Why flatten subqueries?
* Insert or push down projections
Question: Why insert or push down projections?
* Many other rewrites, especially involving grouping and aggregation
=> Since rewrites are not guaranteed to be improvements, could
generate alternatives and send some or all of them to physical query
plan generator.
Passing information between physical operators
Example physical plan:
Two modes of information passing:
1. Temporary Tables
Create table for each interior node in the query plan.
Table for root
contains query result.
Pro: conceptually easy
Cons: harder to implement, often less efficient
2. Iterators
Each operator supports three methods:
- Open: set up iterator
- GetNext: get next record
- Close: destroy iterator
Iterator for non-leaf operator calls iterators for children.
Iterator for root operator produces query result.
Question: Look familiar?
Pros and cons: converse of temporary tables
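To make the interface concrete, here is a minimal Python sketch of the Open/GetNext/Close protocol for a table scan and a filter. The class and method names are illustrative, not any particular system's API.

```python
# Minimal sketch of the Open/GetNext/Close iterator protocol.
# TableScan and Filter are illustrative stand-ins, not a real DBMS API.

EOF = object()  # sentinel returned when an iterator is exhausted

class TableScan:
    def __init__(self, table):      # table: in-memory list of tuples
        self.table = table
    def open(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.table):
            return EOF
        t = self.table[self.pos]
        self.pos += 1
        return t
    def close(self):
        pass

class Filter:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()
    def get_next(self):
        # Pull from the child until some tuple satisfies the predicate.
        while (t := self.child.get_next()) is not EOF:
            if self.pred(t):
                return t
        return EOF
    def close(self):
        self.child.close()

# Tiny plan: Filter(TableScan(Student)) with predicate Major = 'EE'.
student = [(1, 'Ann', 'EE'), (2, 'Bob', 'CS'), (3, 'Cat', 'EE')]
plan = Filter(TableScan(student), lambda t: t[2] == 'EE')

# The driver loop below plays the role of the execution engine: it
# opens the root iterator and pulls tuples until EOF.
plan.open()
result = []
while (t := plan.get_next()) is not EOF:
    result.append(t)
plan.close()
```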
Physical operators -- iterators
Operator | Iterator::Open | Iterator::GetNext | Iterator::Close
Table Scan | ________ | ________ | ________
Index Scan without Condition | ________ | ________ | ________
Condition-Based Index Scan | ________ | ________ | ________
Nested-Loops Join | ________ | ________ | ________
Sort-Merge Join | ________ | ________ | ________
Hash Join | ________ | ________ | ________
Sort | ________ | ________ | ________
Group/Aggregate | ________ | ________ | ________
Union | ________ | ________ | ________
Except | ________ | ________ | ________
Intersection | ________ | ________ | ________
Filter (Selection) | ________ | ________ | ________
Projection | ________ | ________ | ________
Distinct | ________ | ________ | ________
Some terminology:
- Blocking operator: must call GetNext exhaustively
(or nearly exhaustively) on its children before returning its first
tuple.
- Nonblocking operator: expects to make only a few
GetNext calls to its children before returning its first (or
next) tuple.
Question: Which of the above physical operators are blocking?
Note: Many production DBMS's use a hybrid approach between
iterators and temporary relations: iterators with buffers between
operators
Query shapes
Example: R1 Join R2 Join R3 Join R4 Join R5
Left-deep tree:
Right-deep tree:
Bushy tree (as opposed to linear):
With temporary tables
Question: Suppose all relations are the same size and all join
selectivities are uniform. Using nested-loops joins, is there a
preference between left-deep and right-deep?
Question: Can bushy be better than left-deep or right-deep?
Question: Suppose relation sizes and join selectivities
vary. Using nested-loops joins, is there a preference between
left-deep, right-deep, and bushy?
With iterators
Question: Suppose all relations are the same size and all
join selectivities are uniform. Using nested-loops joins, is there a
preference between left-deep, right-deep, and bushy?
Exercise
Consider physical plan to execute (R NLJ S NLJ T) with no indexes
and no condition-based scan, and print the result.
Suppose |R| = 3, |S| = 6, |T| = 4, join selectivity = .5
How many calls total to any iterator GetNext method for
left-deep plan versus right-deep plan? Don't count EOFs.
This exercise touched on two simple cost metrics:
- For temporary tables: size of intermediate tables
- For iterators: number of GetNext calls
=> More on cost metrics later.
Query shapes and hash joins
Consider hash join with Build on left and Probe on right:
Algorithm
- For each tuple t1 in Build relation, hash join attribute a1 and
insert (a1,t1) into appropriate hash bucket.
- For each tuple t2 in Probe relation, hash join attribute a2,
check appropriate hash bucket for matches.
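The build/probe algorithm above can be sketched as an in-memory hash join, with a Python dictionary standing in for the hash table (relation contents and attribute positions are illustrative):

```python
from collections import defaultdict

def hash_join(build_rel, probe_rel, a1, a2):
    """In-memory hash join; a1/a2 are the join-attribute positions."""
    # Build phase: hash each Build tuple on its join attribute.
    buckets = defaultdict(list)
    for t1 in build_rel:
        buckets[t1[a1]].append(t1)
    # Probe phase: hash each Probe tuple, check its bucket for matches.
    result = []
    for t2 in probe_rel:
        for t1 in buckets.get(t2[a2], []):
            result.append(t1 + t2)
    return result

# Taking(ID, Num) as Build, Course(Num, Dept) as Probe, joining on Num.
taking = [(1, 101), (2, 202)]
course = [(101, 'CS'), (202, 'EE'), (303, 'ME')]
joined = hash_join(taking, course, 1, 0)
```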
Left-deep plan with hash joins:
Question: How many hash tables needed at once?
Right-deep plan with hash joins:
Question: How many hash tables needed at once?
Conclusion: left-deep plan requires less memory.
Question: Any advantages to right-deep plan?
Reality: as a pruning heuristic many systems limit
physical plan enumeration to left-deep join plans only.
Coming up
- Cost metrics for physical query plans
- Physical query plan enumeration
- Reducing the search space
- Further optimizations
Costing Physical Plans
Physical query plan enumerator generates multiple possible plans,
selects the one with lowest estimated cost according to some metric.
Input to cost metric:
- Physical plan
- Cardinality of relations
- Value distributions of attributes (e.g., min value, max value,
number of distinct values, histogram)
- Record sizes, page sizes, buffer size
- Clustering information
- Index information: size of keys, records, pages; depth of B-trees
- Possibly: expected concurrency, disk characteristics, other
machine-specific parameters
Output: estimated cost, especially relative to cost of other plans
Digression: Value Distributions
Usually 2nd-min and 2nd-max are used instead of min and max.
Question: Name two reasons 2nd-min and 2nd-max might be
preferred over min and max
(1)
(2)
Equi-width histograms
Equi-height histograms
Question: Which is better and why?
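A small Python sketch contrasting the two on hypothetical skewed data: equi-width fixes the bucket ranges and lets the counts vary, while equi-height fixes the counts and lets the boundaries adapt.

```python
def equi_width(values, nbuckets):
    # Bucket ranges have equal width; per-bucket counts can be skewed.
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        i = min(int((v - lo) / width), nbuckets - 1)
        counts[i] += 1
    return counts

def equi_height(values, nbuckets):
    # Buckets hold (roughly) equal counts; boundaries adapt to skew.
    s = sorted(values)
    size = len(s) // nbuckets
    return [s[i * size] for i in range(1, nbuckets)]  # interior boundaries

vals = [1, 1, 1, 2, 3, 50, 60, 70, 80, 100]  # skewed toward small values
w = equi_width(vals, 2)   # counts per fixed-width bucket
h = equi_height(vals, 2)  # boundary chosen so each bucket holds 5 values
```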
Simple cost metrics
We already discussed two simple cost metrics:
- Total size of intermediate results
- Total number of GetNext calls in iterator execution model
Notice that estimating the first is required in order to estimate the second.
Examples: size of intermediate results
Example: selection over R of tuples with R.A < 5
Question: estimated number of tuples in result?
Example: join of R and S with R.A = S.A
Question: estimated number of tuples in result?
Example: group R by R.A and compute aggregate over R.B
Question: estimated number of tuples in result?
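Common textbook estimates for these three cases can be sketched as follows, writing T(R) for the cardinality of R and V(R,A) for the number of distinct A-values. The exact formulas vary by system, so treat these as illustrative assumptions rather than definitive answers.

```python
def est_range_selection(T_R):
    # Rule-of-thumb estimate for R.A < c with no histogram:
    # assume the condition keeps about 1/3 of R's tuples.
    return T_R / 3

def est_equijoin(T_R, T_S, V_R_A, V_S_A):
    # Standard estimate for R join S on R.A = S.A, assuming the
    # smaller value set is contained in the larger one.
    return T_R * T_S / max(V_R_A, V_S_A)

def est_groupby(V_R_A):
    # One output tuple per distinct grouping value.
    return V_R_A

sel = est_range_selection(3000)          # about 1000 tuples
jn = est_equijoin(1000, 2000, 50, 100)   # 20000 tuples
grp = est_groupby(50)                    # 50 groups
```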
Examples: number of GetNext calls
Recall left-deep versus right-deep 3-way join examples from earlier.
More complex cost metrics
Model buffer pool and try to estimate pages fetched
Take CPU usage (e.g., # of comparisons) into account:
cost = (k1 * IOcost) + (k2 * CPUcost)
Could even model disk arm movement, disk latency, transfer rate
Note that estimating size of intermediate results still needed.
Example:
R join S
R occupies B(R) blocks, S occupies B(S) blocks
Nested-block join with rocking
Initially empty LRU buffer pool with K pages
Question: estimated number of page fetches?
Case (a): Allocate 1 block to R
Case (b): Allocate r blocks to R (1 <= r <= B(R) <= K) to minimize page fetches
Throughput versus response time
Most cost metrics are based on maximizing throughput.
An alternative metric is to minimize response time.
Example:
Select *
From R, S
Where R.A = S.A
Order By R.A
Suppose R is sorted on attribute A.
Question: What kind of join will minimize response time?
Question: What kind of join will maximize throughput?
Memory usage
A physical query plan may require a certain amount of memory to be
available at run-time, e.g., for hash joins or certain sort
algorithms.
Plan enumeration may take into account how much memory is expected to
be available and select feasible plans accordingly, but run-time
guarantees about memory availability usually cannot be made.
If not enough memory is available at run-time, possibilities are:
- Default to disk-based version of operators (e.g., for hash join)
- Use virtual memory and hope for decent performance
- Choose alternate plan requiring less memory
- Wait for memory to become available
Physical query plan enumeration
Many possible algorithms:
- Heuristic generation of one plan, with or without considering cost
- Naive exhaustive enumeration
- Transformation-based (rule-based, top-down) search, heuristic or
exhaustive
- "Branch and bound": heuristic plan followed by search
- Bottom-up (dynamic programming) exhaustive search
- Bottom-up greedy search
- Selinger bottom-up algorithm
- Randomized search: simulated annealing, iterative improvement,
genetic algorithms, many others [not discussed further]
Running example
Query block:
Sort (Project (Group/Aggregate (Join (Filter(R1), Filter(R2), ..., Filter(Rn)))))
Corresponds to:
Select attributes, aggregates
From R1, R2, ..., Rn
Where join conditions and local selections
Group By attributes
Order By attributes
Most significant decisions are join ordering, relation access methods,
and join methods, so let's consider simple query blocks:
Join (Filter(R1), Filter(R2), ..., Filter(Rn))
or in SQL:
Select attributes
From R1, R2, ..., Rn
Where join conditions and local selections
1. Heuristic plan generation
Generate just one plan based on a set of heuristics
Without cost:
- Decision based on available indexes, existence of local and join
conditions in query, types of conditions (=, <, etc.)
- General goals:
- Avoid scans (use indexes)
- Keep intermediate results small
- Heuristics:
- Join smaller operands first
- Do equijoins first
- Exploit sorted operands for sort-merge joins
- Arrange join order so
indexes can be used on inners of nested-loops joins
- Others?
With cost:
- Heuristics as above but can estimate and minimize intermediate
result sizes
2. Naive exhaustive enumeration
for each possible ordering of R1, R2, ..., Rn:
for each possible associativity of selected order:
for each possible join method for Join-1:
for each possible join method for Join-2:
...
for each possible join method for Join-(n-1):
for each possible access method for Filter(R1):
for each possible access method for Filter(R2):
...
for each possible access method for Filter(Rn):
construct physical plan;
compute estimated cost;
save plan and cost if cheapest so far
Some simplifications don't compromise finding cheapest plan, e.g.,
eliminating incompatible join/access methods, not considering
commutative cases of commutative joins (recall exercises).
Real implementation would not recompute subresults.
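The nested loops above multiply out to a large number of plans. As an illustrative sketch (assuming, hypothetically, that every join method applies to every join, every access method to every table, and counting all operand orders and binary tree shapes), the plan count for an n-way join is:

```python
from math import comb, factorial

def num_plans(n, j, a):
    """Plans for an n-way join: n! operand orders x Catalan(n-1)
    binary tree shapes x j^(n-1) join methods x a^n access methods."""
    catalan = comb(2 * (n - 1), n - 1) // n  # binary trees with n leaves
    return factorial(n) * catalan * j ** (n - 1) * a ** n

count = num_plans(4, 3, 2)  # 24 * 5 * 27 * 16 = 51840
```

Even at n = 4 with 3 join methods and 2 access methods, over fifty thousand plans: this is why the simplifications and subresult reuse mentioned above matter.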
3. Transformation-based search
- Define set of "rules" for generating and permuting physical plans
- Scheduler (or search engine) applies rules and does bookkeeping
- Also called top-down
- Example rules:
- Join can be replaced with Nested-Loops Join, Sort-Merge Join, or Hash Join
- (Exp1 Join Exp2) can be replaced with (Exp2 Join Exp1)
- Table Scan can be replaced with Index Scan
- Sort can be added anywhere
- Can stop anytime after all logical operators are replaced, or can
generate and cost every possible physical plan
- Simple pruning: only apply a rule that may result in a cheaper plan
4. Branch and bound
- Generate first plan using heuristic approach (no-cost or cost).
- Permute the plan using transformation rules. Cost of plan
from step 1 provides bound for pruning search space.
Can stop anytime in step 2, or can generate and cost all feasible
plans
5. Exhaustive bottom-up plan enumeration
Dynamic programming approach:
- For each pair {Ri, Rj}, generate all possible plans and costs for
Filter(Ri) Join Filter(Rj), including commutative case.
- Using plans and costs from step 1, for each possible triple
{Ri, Rj, Rk}, generate all possible plans and costs for Filter(Ri) Join
Filter(Rj) Join Filter(Rk), including different associativities
and commutativities.
- Continue until plans and costs are generated for n-way join, pick
cheapest.
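A compact Python sketch of this dynamic program, using a deliberately crude cost model (cost of a plan = total size of intermediate results, with a single flat join selectivity). Real enumerators cost physical operators and track far more state; the relation names and numbers below are made up.

```python
from itertools import combinations

def dp_join_order(sizes, sel):
    """Bottom-up DP over relation subsets. sizes: name -> cardinality;
    sel: flat join selectivity applied per join (illustrative)."""
    rels = sorted(sizes)
    # best[subset] = (estimated result size, total cost, plan string)
    best = {frozenset([r]): (sizes[r], 0, r) for r in rels}
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            S = frozenset(combo)
            for split in range(1, k):
                for left in combinations(combo, split):
                    L = frozenset(left)
                    R = S - L
                    lsize, lcost, lplan = best[L]
                    rsize, rcost, rplan = best[R]
                    size = lsize * rsize * sel
                    cost = lcost + rcost + size  # cost = intermediate size
                    if S not in best or cost < best[S][1]:
                        best[S] = (size, cost, f"({lplan} JOIN {rplan})")
    return best[frozenset(rels)]

size, cost, plan = dp_join_order({'R1': 1000, 'R2': 10, 'R3': 100}, 0.01)
```

Here the DP prefers joining the two small relations first, keeping the large R1 out of the intermediate result.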
6. Greedy bottom-up plan enumeration
Same as exhaustive, except at each step save only the cheapest plan.
Question: Will the selected plan always be optimal?
7. Selinger / System R algorithm
Same as greedy, except at each step save all plans that represent an
interesting order.
An intermediate result has an interesting order if it is sorted by
any of:
- Final SORT attribute
- Final grouping attribute(s)
- Later join attributes
(First two cases are for general query blocks)
Question: Will the selected plan always be optimal?
Reducing the search space
Lots of possibilities, all with potential to inadvertently prune away
the optimal plan. Examples:
- Consider left-deep join trees only
- Always favor joins over cross-products
- Always use hash join if build operand is expected to fit in memory
- Always use sort-merge join if one operand is sorted
- Always favor selection index over join index (or vice-versa)
- Others?
The Real World
Most commercial optimizers use either:
- 4. Branch and bound: heuristic plan followed by transformation-based
search, or
- 7. Selinger bottom-up algorithm
Question: Advantages of branch and bound?
Question: Advantages of bottom-up?
Some Tricks to Further Speed Up Query Processing
Index-only operations
Compute query answer using indexes only
Example:
Select price
From product
Where price < 25
Assume B+ tree index on price
Question: How can we execute the query without accessing data?
Example:
Select gpa
From student
Where status = 'MS' And gpa > 3.8
Assume indexes on status and gpa
Question: How can we execute the query without accessing data?
Example: Same as previous except select name instead of gpa
Question: How can we execute the query minimizing access to data?
Aside: Using Indexes
We've been assuming it's always a good idea to use an index
when possible
Very simple example: Select * From R Where R.A=5
Question: If there is an index on R.A, is it always a good idea to use it?
Storing Relations as Indexes
Combine index on relation and relation itself into one structure (sometimes called clustered relation or clustered index)
B-tree or hash index, just replace RIDs with full records
Question: Advantages?
Question: Disadvantages?
Join processing
Joins are the most complex, frequent, studied operation
2-way join problem definition:
R JOIN[pred] S
pred is R.A op S.B
op is one of =, <, <=, >, >=, <>
Note: join predicates other than pred are residual
Standard algorithms:
Nested loops/blocks
Sort-merge join
Hash join (various flavors)
Nested-loops join with key
for each t1 in R
    for each t2 in S
        if pred(t1,t2) then output combine(t1,t2)
Suppose pred is R.A=S.B, B is key for S, no relevant indexes.
Question: How to improve basic algorithm?
Note: could choose commutative ordering of R JOIN S based on
knowledge of key
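One standard improvement (which the question above may be after), sketched below with illustrative data: since B is a key of S, at most one inner tuple can match, so the inner loop can stop as soon as a match is found.

```python
def nlj_key_inner(R, S, a, b):
    """Nested-loops join on R.A = S.B where B is a key of S."""
    out = []
    for t1 in R:
        for t2 in S:
            if t1[a] == t2[b]:
                out.append(t1 + t2)
                break  # B is a key of S: no second match can exist
    return out

R = [(1, 'x'), (2, 'y'), (3, 'z')]
S = [('p', 1), ('q', 2)]            # column 1 (S.B) is a key of S
joined = nlj_key_inner(R, S, 0, 1)  # tuple (3, 'z') has no match
```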
RID-based join
Suppose have "scanable" index on R.A, index on S.B
for each <a,rid1> in index on R.A
    use index on S.B to find all <b,rid2> such that pred(a,b)
    save <rid1,rid2> in temporary table
fetch actual tuples
Question: Why is this approach more efficient than nested-loops index join?
(1)
(2)
Question: What happens in the special case of the following query?
Select R.A, S.B
From R, S
Where R.A op S.B
Auxiliary structures
Store extra information to speed up join.
Question: What classic trade-off is being made here?
Pointer-based joins
Store RIDs for joining S tuples within R tuples:
Question: What is the disadvantage besides the update cost?
Question: When can we avoid variable-length records?
Note: this structure also works well for query conditions such as
R.A=S.B And R.C < 10
Join index
Keep table of all RIDs participating in join:
More complex -- keep table of all RIDs participating in join and
satisfying certain selection conditions
Even more complex -- keep table of all RIDs in result of some
arbitrary query
Common variation: RIDs for joining tuples in leaves of B-tree index:
Looks like pointer-based join with table stored as index.
=> In reality these are points in a wide spectrum of interrelated techniques
Bitmap index
Similar to conventional index except store bitmaps instead of RIDs:
Typically used for star schemas
Example:
sales(storeID, itemID, ...) // "fact table"
store(storeID, state, ...) // "dimension table"
item(itemID, type, ...) // "dimension table"
Select sales.*
From sales, store, item
Where sales.storeID = store.storeID
And sales.itemID = item.itemID
And store.state = 'CA'
And item.type = 'clothing'
Bitmap indexes typically created on dimension foreign-key attributes of fact tables:
create bitmap index on sales(storeID)
create bitmap index on sales(itemID)
Execution of above query:
- Get relevant storeID's from table store
- Perform lookups in storeID index, bitmap Or the results (bitmap B1)
- Do the same for itemID's (bitmap B2)
- Bitmap And on B1 and B2
- Fetch sales tuples
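These steps can be sketched with Python integers standing in for bitmaps (one bit per sales row; all table data and the "matching" store and item IDs below are made up):

```python
# Bitmaps as Python ints: bit i is set iff sales row i has that value.
sales = [(1, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (1, 'a')]  # (storeID, itemID)

def build_bitmap_index(rows, col):
    idx = {}
    for i, row in enumerate(rows):
        idx[row[col]] = idx.get(row[col], 0) | (1 << i)
    return idx

store_idx = build_bitmap_index(sales, 0)  # bitmap index on sales(storeID)
item_idx = build_bitmap_index(sales, 1)   # bitmap index on sales(itemID)

# Pretend the dimension-table lookups said stores {1, 3} are in 'CA'
# and item 'a' has type 'clothing':
B1 = store_idx[1] | store_idx[3]  # bitmap Or over matching storeIDs
B2 = item_idx['a']                # bitmap Or over matching itemIDs
hits = B1 & B2                    # bitmap And
result = [sales[i] for i in range(len(sales)) if hits >> i & 1]
```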
Question: In what environment do bitmaps make sense?
Bitmap join index
Recall join index B-tree variant, but replace joining RIDs with bitmaps:
Typically created on dimension tables. (Note: made-up syntax)
create bitmap join index (storeID = sales.storeID) on store(state)
create bitmap join index (itemID = sales.itemID) on item(type)
Question: What are the execution steps for the above query?
=> Here too there is a wide spectrum of interrelated techniques. We've only scratched the surface.
Summary of Query Processing Material
- Steps
- parse
- validate
- expand views
- generate logical plan
- rewrite logical plan
- generate physical plan
- generate code
- execute
- Structure of:
- parse trees
- logical plans
- physical plans
- Query rewriting (including view expansion)
- Temporary table approach to executing physical plan trees
- Iterator approach to executing physical plan trees
- Query shapes: left-deep, right-deep, bushy
- Relationship between query shape and efficiency (nested-loops and
hash joins)
- Assigning cost to physical plans
- relevant statistics
- size of intermediate results
- number of GetNext calls
- number of page fetches
- Throughput versus response time
- Memory usage
- Physical query plan enumeration
- heuristic, with or without cost
- naive exhaustive
- transformation-based (rule-based), heuristic or exhaustive
- branch and bound
- bottom-up exhaustive
- bottom-up greedy
- bottom-up System R (Selinger)
- randomized search
- Reducing the search space
- Index-only and "index-heavy" operations, when not to use indexes
- Improved join algorithms
- nested-block with rocking
- nested-loops with key
- RID-based
- Auxiliary structures for joins
- pointer-based joins
- join index
- bitmap index
- bitmap join index
- Star schemas