CS346 Lecture Notes
Topics in Query Processing and Optimization


Steps in database query processing

Step Input Component Output
1 Query string Parser Query tree
2 Query tree Checker Valid query tree
3 Valid query tree View expander Valid tree without views
4 Valid tree without views Logical query plan generator Logical query plan
5 Logical query plan Rewriter (heuristic) Better logical query plan
6 Logical query plan Physical plan generator (cost-based) Selected physical plan
7 Physical plan Code generator Executable code
8 Executable code Execution engine Answer!


Running example

   Student(ID, Name, Major)  // ID is key
   Course(Num, Dept)         // Num is key
   Taking(ID, Num)           // (ID,Num) is key
Query to find names of all EE students taking at least one CS course:
   Select Distinct Name
   From   Student, Course, Taking
   Where  Taking.ID = Student.ID
   And    Taking.Num = Course.Num
   And    Major = 'EE'
   And    Dept = 'CS'

Parser

Produces tree representation of the query string:













Checker

Verifies query tree against database schema:

View expander

Suppose Student table is actually a view:
   StudName(ID, Name)
   StudMajor(ID, Major)

   Create View Student(ID,Name,Major) As
      Select StudName.ID, StudName.Name, StudMajor.Major
      From   StudName, StudMajor
      Where  StudName.ID = StudMajor.ID
Parse tree becomes:
















Question: Do we need to run the checker again?



Question: What if SQL did not allow table expressions in the From clause?






Logical and physical query plans

  • Both are trees representing query evaluation
  • Leaves represent data
  • Internal nodes are operators over the data
    
    
    
    
    
    
    
    
    
    
    
  • Logical plan is higher-level and algebraic
  • Physical plan is lower-level and operational

  • Logical plan operators correspond to query language constructs
  • Physical plan operators correspond to implemented access methods

    Logical plan operators

  • Extended relational algebra
  • Leaves of logical plans are table names

  • Basic operators: Select, Project, Cross-Product, Union, Difference


  • Abbreviations: Natural-Join, Theta-Join, Intersect


  • Extensions: Rename, Aggregate-GroupBy, Distinct (+ others)



    Logical query plan for example query (direct from SQL):

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Using 2-way natural joins instead of cross-products:
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

    Logical plan generator

    Usually there's a straightforward mapping from a valid parse tree to a naive logical query plan. This logical plan may then be rewritten to a better one.

    Example: Push selections down:

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    => We will cover rewriting techniques a little later

    Physical plan operators

    Possible physical plan for example query (Student no longer a view):
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Another possible physical plan:
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

    Physical plan generation

    There are lots and lots of possible physical query plans for a given logical plan. The physical plan generator (sometimes called physical plan enumerator) tries to select the one that will be optimal, usually with respect to response time or in some cases throughput.

    => Physical plan generation is the meatiest part of query processing!! Will cover it in detail later.


    Code generator

    Translates physical query plan tree into executable code.

    Some terminology

    Term Meaning
    Query processing entire process
    Query optimization rewriter + physical plan generator
    Query compilation entire process through code generation
    Query execution execution engine

    Coming up


    Exercise

    Consider the following modification to our running example:
       Student(ID, Name, Major)  // ID is key
       Course(Num, Dept, Units)  // Num is key
       Took(ID, Num, Quarter)    // (ID,Num) is key
    
    Query to find all students named "Mary" who took at least one 3-unit course in their own major in the fall quarter 2010:
       Select *
       From   Student, Course, Took
       Where  Took.ID = Student.ID
       And    Took.Num = Course.Num
       And    Course.Dept = Student.Major
       And    Student.Name = 'Mary'
       And    Course.Units = 3
       And    Took.Quarter = 'Fall07'
    
    Suppose: Note: join associativity matters

    Question: How many possible physical query plans are there for this query?

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Question: What if the system also supports NLJ with inner index (NLIJ)?
    (Note: By convention, the right operand of NLJ and NLIJ is the inner.)

    Case (a): System can exploit 2 indexes on a table

    
    
    
    
    
    
    
    
    
    
    
    Case (b): System can't exploit 2 indexes on a table
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Question: Go back to the original problem (no NLIJ) but assume that Sort-Merge Join is symmetric. Now how many plans?
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

    Query rewriting

  • Rewrite the logical query plan so it's likely to produce a better physical query plan.
  • Rewrites are based on heuristic rules, not on database statistics.

  • Implication: enumeration of physical plans considers only those physical plans with some correspondence to the input logical plan.

    * Push selections down

    
    
    
    
    
    
    
    
    
    
    
    
    
    Can be detrimental in the (unusual) scenario of very selective join, unselective selection condition.
    
    
    
    
    
    
    
    
    
    
    
    Can be detrimental in the case of expensive predicates.
    
    
    
    
    
    
    
    
    
    
    

    * "Flatten" subqueries

    
    
    
    
    
    
    
    
    
    
    
    Need to be careful about duplicates
    
    
    
    Question: Why flatten subqueries?
    
    
    
    

    * Insert or push down projections

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Question: Why insert or push down projections?
    
    
    
    

    * Many other rewrites, especially involving grouping and aggregation

    => Since rewrites are not guaranteed to be improvements, could generate alternatives and send some or all of them to physical query plan generator.


    Passing information between physical operators

    Example physical plan:
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Two modes of information passing:

    1. Temporary Tables

    Create table for each interior node in the query plan.
    Table for root contains query result.

  • Pro: conceptually easy
  • Cons: harder to implement, often less efficient

    2. Iterators

    Each operator supports three methods:

    Iterator for non-leaf operator calls iterators for children.
    Iterator for root operator produces query result.

    Question: Look familiar?

    
    
    
    
    Pros and cons: converse of temporary tables

    Physical operators -- iterators

    Operator Iterator::Open Iterator::GetNext Iterator::Close

    Table Scan
    .


    ......................................................


    ......................................................


    ......................................................

    Index Scan without Condition
    .


    ......................................................


    ......................................................


    ......................................................

    Condition-Based Index Scan
    .


    ......................................................


    ......................................................


    ......................................................

    Nested-Loops Join
    .


    ......................................................


    ......................................................


    ......................................................

    Sort-Merge Join
    .


    ......................................................


    ......................................................


    ......................................................

    Hash Join
    .


    ......................................................


    ......................................................


    ......................................................

    Sort
    .


    ......................................................


    ......................................................


    ......................................................

    Group/Aggregate
    .


    ......................................................


    ......................................................


    ......................................................

    Union
    .


    ......................................................


    ......................................................


    ......................................................

    Except
    .


    ......................................................


    ......................................................


    ......................................................

    Intersection
    .


    ......................................................


    ......................................................


    ......................................................

    Filter (Selection)
    .


    ......................................................


    ......................................................


    ......................................................

    Projection
    .


    ......................................................


    ......................................................


    ......................................................

    Distinct
    .


    ......................................................


    ......................................................


    ......................................................

    Some terminology:

    Question: Which of the above physical operators are blocking?

    
    
    
    Note: Many production DBMS's use a hybrid approach between iterators and temporary relations: iterators with buffers between operators


    Query shapes

    Example: R1 Join R2 Join R3 Join R4 Join R5

    Left-deep tree:

    
    
    
    
    
    
    
    
    
    
    
    Right-deep tree:
    
    
    
    
    
    
    
    
    
    
    
    
    
    Bushy tree (as opposed to linear):
    
    
    
    
    
    
    
    
    
    
    
    
    
    

    With temporary tables

    Question: Suppose all relations are the same size and all join selectivities are uniform. Using nested-loops joins, is there a preference between left-deep and right-deep?
    
    
    
    
    Question: Can bushy be better than left-deep or right-deep?
    
    
    
    
    

    Question: Suppose relation sizes and join selectivities vary. Using nested-loops joins, is there a preference between left-deep, right-deep, and bushy?

    
    
    
    
    

    With iterators

    Question: Suppose all relations are the same size and all join selectivities are uniform. Using nested-loops joins, is there a preference between left-deep, right-deep, and bushy?

    
    
    
    
    
    
    
    
    
    
    
    

    Exercise

    Consider physical plan to execute (R NLJ S NLJ T) with no indexes and no condition-based scan, and print the result.

    Suppose |R| = 3, |S| = 6, |T| = 4, join selectivity = .5

    How many calls total to any iterator GetNext method for left-deep plan versus right-deep plan? Don't count EOFs.

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    


    Touched on two simple cost metrics:
    1. For temporary tables: size of intermediate tables
    2. For iterators: number of GetNext calls
    => More on cost metrics later.

    Query shapes and hash joins

    Consider hash join with Build on left and Probe on right:
    
    
    
    
    
    
    
    
    
    
    

    Algorithm

    1. For each tuple t1 in Build relation, hash join attribute a1 and insert (a1,t1) into appropriate hash bucket.
    2. For each tuple t2 in Probe relation, hash join attribute a2, check appropriate hash bucket for matches.

    Left-deep plan with hash joins:

    
    
    
    
    
    
    
    
    
    
    
    Question: How many hash tables needed at once?
    
    
    
    Right-deep plan with hash joins:
    
    
    
    
    
    
    
    
    
    
    
    
    
    Question: How many hash tables needed at once?
    
    
    
    Conclusion: left-deep plan requires less memory.
    Question: Any advantages to right-deep plan?
    
    
    
    
    
    Reality: as a pruning heuristic many systems limit physical plan enumeration to left-deep join plans only.


    Coming up


    Costing Physical Plans

    Physical query plan enumerator generates multiple possible plans, selects the one with lowest estimated cost according to some metric.

    Input to cost metric:

    Output: estimated cost, especially relative to cost of other plans

    Digression: Value Distributions

    Usually 2nd-min and 2nd-max are used instead of min and max.

    Question: Name two reasons 2nd-min and 2nd-max might be preferred over min and max

    (1)
    
    
    (2)
    
    

    Equi-width histograms

    
    
    
    
    
    
    

    Equi-height histograms

    
    
    
    
    
    
    

    Question: Which is better and why?

    
    
    
    
    

    Simple cost metrics

    We already discussed two simple cost metrics:
    1. Total size of intermediate results

    2. Total number of GetNext calls in iterator execution model
    Notice that estimating (1) is required in order to estimate (2).

    Examples: size of intermediate results

    Example: selection over R of tuples with R.A < 5

    Question: estimated number of tuples in result?

    
    
    
    
    
    
    
    
    Example: join of R and S with R.A = S.A

    Question: estimated number of tuples in result?

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Example: group R by R.A and compute aggregate over R.B

    Question: estimated number of tuples in result?

    
    
    
    

    Examples: number of GetNext calls

    Recall left-deep versus right-deep 3-way join examples from earlier.

    More complex cost metrics

  • Model buffer pool and try to estimate pages fetched

  • Take CPU usage (e.g., # of comparisons) into account: cost = (k1 * IOcost) + (k2 * CPUcost)

  • Could even model disk arm movement, disk latency, transfer rate

    Note that estimating size of intermediate results still needed.

    Example:

  • R join S
  • R occupies B(R) blocks, S occupies B(S) blocks
  • Nested-block join with rocking
  • Initially empty LRU buffer pool with K pages

    Question: estimated number of page fetches?

    Case (a): Allocate 1 block to R

    
    
    
    
    
    
    
    
    
    
    
    
    
    

    Case (b): Allocate r blocks to R (1 <= r <= B(R) <= K) to minimize page fetches

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

    Throughput versus response time

  • Most cost metrics are based on maximizing throughput.
  • An alternative metric is to minimize response time.

    Example:

       Select   *
       From     R, S
       Where    R.A = S.A
       Order By R.A
    
    Suppose R is sorted on attribute A.

    Question: What kind of join will minimize response time?

    
    
    
    Question: What kind of join will maximize throughput?
    
    
    
    

    Memory usage

    A physical query plan may require a certain amount of memory to be available at run-time, e.g., for hash joins or certain sort algorithms.

    Plan enumeration may take into account how much memory is expected to be available and select feasible plans accordingly, but run-time guarantees about memory availability usually cannot be made.

    If not enough memory is available at run-time, possibilities are:

    1. Default to disk-based version of operators (e.g., for hash join)

    2. Use virtual memory and hope for decent performance

    3. Choose alternate plan requiring less memory

    4. Wait for memory to become available

    Physical query plan enumeration

    Many possible algorithms:
    1. Heuristic generation of one plan, with or without considering cost

    2. Naive exhaustive enumeration

    3. Transformation-based (rule-based, top-down) search, heuristic or exhaustive

    4. "Branch and bound": heuristic plan followed by search

    5. Bottom-up (dynamic programming) exhaustive search

    6. Bottom-up greedy search

    7. Selinger bottom-up algorithm

    8. Randomized search: simulated annealing, iterative improvement, genetic algorithms, many others [not discussed further]

    Running example

    Query block:

    Sort (Project (Group/Aggregate (Join (Filter(R1), Filter(R2), ..., Filter(Rn))))

    Corresponds to:

       Select   attributes, aggregates
       From     R1, R2, ..., Rn
       Where    join conditions and local selections
       Group By attributes
       Order By attributes
    
    Most significant decisions are join ordering, relation access methods, and join methods, so let's consider simple query blocks:

    Join (Filter(R1), Filter(R2), ..., Filter(Rn))

    or in SQL:

       Select attributes
       From   R1, R2, ..., Rn
       Where  join conditions and local selections
    

    1. Heuristic plan generation

    Generate just one plan based on a set of heuristics

    Without cost:

    With cost:

    2. Naive exhaustive enumeration

       for each possible ordering of R1, R2, ..., Rn:
         for each possible associativity of selected order:
           for each possible join method for Join-1:
             for each possible join method for Join-2:
               ...
                 for each possible join method for Join-(n-1):
                   for each possible access method for Filter(R1):
                     for each possible access method for Filter(R2):
                       ...
                         for each possible access method for Filter(Rn):
                        construct physical plan;
                        compute estimated cost;
                        save plan and cost if cheapest so far
    
    Some simplifications don't compromise finding cheapest plan, e.g., eliminating incompatible join/access methods, not considering commutative cases of commutative joins (recall exercises).

    Real implementation would not recompute subresults.

    3. Transformation-based search


    4. Branch and bound

    1. Generate first plan using heuristic approach (no-cost or cost).

    2. Permute the plan using transformation rules. Cost of plan from step 1 provides bound for pruning search space.
    Can stop anytime in step 2, or can generate and cost all feasible plans

    5. Exhaustive bottom-up plan enumeration

    Dynamic programming approach:
    1. For each pair {Ri, Rj}, generate all possible plans and costs for Filter(Ri) Join Filter(Rj), including commutative case.

    2. Using plans and costs from step 1, for each possible triple {Ri, Rj, Rk}, generate all possible plans and costs for Filter(Ri) Join Filter(Rj) Join Filter(Rk), including different associativities and commutativities.

    3. Continue until plans and costs are generated for n-way join, pick cheapest.
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

    6. Greedy bottom-up plan enumeration

    Same as exhaustive, except at each step save only the cheapest plan.

    Question: Will the selected plan always be optimal?

    
    
    
    
    

    7. Selinger / System R algorithm

    Same as greedy, except at each step save all plans that represent an interesting order.

    An intermediate result has an interesting order if it is sorted by any of:

    (First two cases are for general query blocks)

    Question: Will the selected plan always be optimal?

    
    
    
    
    

    Reducing the search space

    Lots of possibilities, all with potential to inadvertently prune away the optimal plan. Examples:

    The Real World

    Most commercial optimizers use either:

    Question: Advantages of branch and bound?

    
    
    
    
    
    
    
    
    
    
    
    
    
    Question: Advantages of bottom-up?
    
    
    
    
    
    
    
    
    
    
    
    

    Some Tricks to Further Speed Up Query Processing

    Index-only operations

    Compute query answer using indexes only

    Example:

       Select price
       From   product
       Where  price < 25
    
    Assume B+ tree index on price
    Question: How can we execute the query without accessing data?
    
    
    

    Example:

       Select gpa
       From   student
       Where  status = 'MS' And gpa > 3.8
    
    Assume indexes on status and gpa
    Question: How can we execute the query without accessing data?
    
    
    
    
    
    
    Example: Same as previous except select name instead of gpa
    Question: How can we execute the query minimizing access to data?
    
    
    
    
    
    

    Aside: Using Indexes

    We've been assuming it's always a good idea to use an index when possible

    Very simple example: Select * From R Where R.A=5
    Question: If there is an index on R.A, is it always a good idea to use it?

    
    
    
    
    
    
    
    
    
    
    
    
    

    Storing Relations as Indexes

  • Combine index on relation and relation itself into one structure (sometimes called clustered relation or clustered index)
  • B-tree or hash index, just replace RIDs with full records

    Question: Advantages?

    
    
    
    
    
    Question: Disadvantages?
    
    
    
    
    
    
    
    
    
    
    

    Join processing

    Joins are the most complex, frequent, studied operation

    2-way join problem definition:

  • R JOIN[pred] S
  • pred is R.A op S.B
  • op is one of =, <, <=, >, >=, <>

    Note: join predicates other than pred are residual

    Standard algorithms:

  • Nested loops/blocks
  • Sort-merge join
  • Hash join (various flavors)

    Nested-loops join with key

       for each t1 in R
         for each t2 in S
           if pred(t1,t2) then return combine(t1,t2)
    
    Suppose pred is R.A=S.B, B is key for S, no relevant indexes.
    Question: How to improve basic algorithm?
    
    
    
    
    Note: could choose commutative ordering of R JOIN S based on knowledge of key
    
    
    


    RID-based join

    Suppose have "scanable" index on R.A, index on S.B
       for each <a,rid1> in index on R.A
         use index on S.B to find all <b,rid2> such that pred(a,b)
           save <rid1,rid2> in temporary table
       fetch actual tuples
    
    Question: Why is this approach more efficient than nested-loops index join?
    (1)
    
    (2)
    
    
    Question: What happens in the special case of the following query?
       Select R.A, S.B
       From   R, S
       Where  R.A op S.B
    
    
    
    
    

    Auxiliary structures

    Store extra information to speed up join.
    Question: What classic trade-off is being made here?
    
    
    
    

    Pointer-based joins

    Store RIDs for joining S tuples within R tuples:
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Question: What is the disadvantage besides the update cost?
    
    
    
    
    Question: When can we avoid variable-length records?
    
    
    
    
    Note: this structure also works well for query conditions such as R.A=S.B And R.C < 10

    Join index

    Keep table of all RIDs participating in join:
    
    
    
    
    
    
    
    
    
    
    
  • More complex -- keep table of all RIDs participating in join and satisfying certain selection conditions

  • Even more complex -- keep table of all RIDs in result of some arbitrary query

    Common variation: RIDs for joining tuples in leaves of B-tree index:

    
    
    
    
    
    
    
    
    
    
    Looks like pointer-based join with table stored as index.

    => In reality these are points in a wide spectrum of interrelated techniques

    Bitmap index

    Similar to conventional index except store bitmaps instead of RIDs:
    
    
    
    
    
    
    
    
    
    
    
    
    
    Typically used for star schemas

    Example:

       sales(storeID, itemID, ...)   // "fact table"
       store(storeID, state, ...)    // "dimension table"
       item(itemID, type, ...)       // "dimension table"
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
       Select sales.*
       From   sales, store, item
       Where  sales.storeID = store.storeID
       And    sales.itemID = item.itemID
       And    store.state = 'CA'
       And    item.type = 'clothing'
    
    Bitmap indexes typically created on dimension foreign-key attributes of fact tables:
      create bitmap index on sales(storeID)
      create bitmap index on sales(itemID)
    
    
    
    
    
    
    
    
    
    
    
    
    
    Execution of above query:
    1. Get relevant storeID's from table store
    2. Perform lookups in storeID index, bitmap Or the results (bitmap B1)
    3. Do the same for itemID's (bitmap B2)
    4. Bitmap And on B1 and B2
    5. Fetch sales tuples
    Question: In what environment do bitmaps make sense?
    
    
    
    
    
    
    
    
    


    Bitmap join index

    Recall join index B-tree variant, but replace joining RIDs with bitmaps:
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    Typically created on dimension tables. (Note: made-up syntax)
      create bitmap join index (storeID = sales.storeID) on store(state)
      create bitmap join index (itemID = sales.itemID) on item(type)
    

    Question:What are execution steps for above query?
    
    
    
    
    
    
    
    
    
    
    
    => Here too there is a wide spectrum of interrelated techniques. We've only scratched the surface.


    Summary of Query Processing Material