===========================================================================
CS 346 Fall '98 - Prof. Widom - Handout #30
LECTURE NOTES - Query processing lecture #3
============================================================================

Left to cover
-------------

- Cost metrics for physical query plans
- Physical query plan enumeration
- Reducing the search space
- Further optimizations


Costing physical plans
----------------------

Physical query plan enumerator generates multiple possible plans,
selects the one with lowest estimated cost according to some "metric".

INPUT to cost metric:

- Physical plan

- Cardinality of relations

- Value distributions of attributes (e.g., min value, max value,
  number of distinct values, histograms, inverted histograms, ...)

- Record sizes, page sizes, buffer size

- Clustering information

- Index information: size of keys, records, pages; depth of B-trees

- Available memory (possibly)

OUTPUT: estimated cost, especially relative to cost of other plans


Simple cost metrics
-------------------

We already discussed two simple cost metrics:

  (1) Total size of intermediate results
  
  (2) Total number of "get-next" calls in iterator execution model

Notice that estimating (1) is required in order to estimate (2).


Examples: size of intermediate results
--------------------------------------

Ex: selection over R of tuples with R.A < 5

Q: estimated number of tuples in result?


Ex: join of R and S with R.A = S.A

Q: estimated number of tuples in result?


Ex: group R by R.A and compute aggregate over R.B

Q: estimated number of tuples in result?


Examples: number of get-next calls
----------------------------------

Recall left-deep versus right-deep 3-way join examples from previous
lecture.


More complex cost metrics
-------------------------

- Model buffer pool and try to estimate pages fetched

- Take CPU usage (e.g., # of comparisons) into account:
  cost = k1*I/O-cost + k2*CPU-cost

- Could even model disk arm movement, disk latency, transfer rate

Note that estimating size of intermediate results still needed


Ex: R join S
    R occupies B(R) blocks, S occupies B(S) blocks
    Naive nested-block join with "rocking"
    Initially empty LRU buffer pool with K pages

Q: estimated number of page fetches?


Throughput versus response time
-------------------------------

Most cost metrics are based on maximizing throughput.

An alternative metric is to minimize response time.

Ex: SELECT *
    FROM R, S
    WHERE R.A = S.A
    ORDER BY R.A

Suppose R is sorted on attribute A.

Q: What kind of join will minimize response time?

A:


Q: What kind of join will maximize throughput?

A:


Memory usage
------------

A physical query plan may require a certain amount of memory to be
available at run-time, e.g., for hash joins or certain sort
algorithms.

Plan enumeration may take into account how much memory is expected to
be available and select feasible plans accordingly, but run-time
guarantees about memory availability usually cannot be made.

If not enough memory is available at run-time, possibilities are:

(1) default to disk-based version of operators (e.g., for hash join)

(2) use virtual memory and hope for decent performance

(3) choose alternate plan requiring less memory

(4) wait for memory to become available


Physical query plan enumeration
-------------------------------

Many possible algorithms:

(1) Heuristic based on set of rules, with or without considering cost

(2) Exhaustive top-down search

(3) "Branch and bound": heuristic followed by search

(4) Bottom-up exhaustive search

(5) Bottom-up greedy search

(6) Selinger bottom-up algorithm

(7) Randomized search: simulated annealing, iterative improvement,
    genetic algorithms, many others


Running example:

"Query block": SORT (PROJECT (GROUP/AGGREGATE (JOIN
                 (SELECTION(R1), SELECTION(R2), ..., SELECTION(Rn))))

Corresponds to: SELECT attributes, aggregates
                FROM R1, R2, ..., Rn
		WHERE join conditions and local selections
                GROUP BY attributes
                ORDER BY attributes

Most significant decision is join ordering, relation access methods,
and join methods.


Heuristic plan enumeration
--------------------------

Generate just one plan based on a set of rules.

Without cost:

- Decision based on available indexes, existence of local and join
  conditions in query, types of conditions (=, <, etc.)

- General goals: avoid scans (use indexes), keep intermediate results
  small

- Heuristics: join smaller operands first, do equijoins first,
  exploit sorted operands for sort-merge joins, arrange join order so
  indexes can be used on inners of nested-loop joins

With cost:

- Heuristics as above but can estimate and minimize intermediate
  result sizes


Top-down plan enumeration
-------------------------

for each possible ordering of R1, R2, ..., Rn:
  for each possible associativity of selected order:
    for each possible join method for join 1:
      for each possible join method for join 2:
        ...
          for each possible join method for join (n-1):
            for each possible access method for SELECTION(R1):
              for each possible access method for SELECTION(R2):
                ...
                  for each possible access method for SELECTION(Rn):
                    construct physical plan including aggregation,
                      projection, and ordering;
                    compute estimated cost;
                    save plan and cost if cheapest so far

Some simplifications don't compromise finding cheapest plan, e.g.,
eliminating incompatible join/access methods, not considering
commutative cases of commutative joins.

Implementation would not recompute subresults.


Branch-and-bound plan enumeration
---------------------------------

(1) Generate first plan using heuristic approach (no-cost or cost).

(2) Permute the plan looking for lower-cost alternatives.  During
    permutations can stop estimating cost for a plan when higher than
    best so far.

Can stop optimization at any point with expectation of good plan,
otherwise searches entire space.


Exhaustive bottom-up plan enumeration
-------------------------------------

(1) For each pair Ri/Rj, generate all possible plans and costs for
SELECTION(Ri) JOIN SELECTION(Rj), including commutative case.

(2) Using plans and costs from step (1), for each possible triple
Ri/Rj/Rk, generate all possible plans and costs for SELECTION(Ri) JOIN
SELECTION(Rj) JOIN SELECTION(Rk), including different associativities
and commutativities.

(3) Continue until plans and costs are generated for n-way join.

(4) Add aggregation, projection, and ordering; pick cheapest plan.


Greedy bottom-up plan enumeration
---------------------------------

Same as exhaustive, except at each step save only the cheapest plan.

Q: Will the selected plan always be optimal?

A:


Selinger / System R algorithm
-----------------------------

Same as greedy, except at each step save all plans that represent an
"interesting order"

- An intermediate result has an "interesting order" if it is sorted by
  any of:

  * Final SORT attribute
  * Grouping attribute(s)
  * Later join attributes


Q: Will the selected plan always be optimal?

A:


Reducing the search space
-------------------------

Lots of possibilities, all with potential to prune away the optimal
plan.  Examples:

- Consider left-deep join trees only

- Always favor joins over cross-products

- Always use hash join if build operand is expected to fit in memory

- Always use sort-merge join if one operand is sorted

- Always favor selection index over join index (or vice-versa)

Q: others?