=========================================================================== CS 346 Fall '98 - Prof. Widom - Handout #30 LECTURE NOTES - Query processing lecture #3 ============================================================================ Left to cover ------------- - Cost metrics for physical query plans - Physical query plan enumeration - Reducing the search space - Further optimizations Costing physical plans ---------------------- Physical query plan enumerator generates multiple possible plans, selects the one with lowest estimated cost according to some "metric". INPUT to cost metric: - Physical plan - Cardinality of relations - Value distributions of attributes (e.g., min value, max value, number of distinct values, histograms, inverted histograms, ...) - Record sizes, page sizes, buffer size - Clustering information - Index information: size of keys, records, pages; depth of B-trees - Available memory (possibly) OUTPUT: estimated cost, especially relative to cost of other plans Simple cost metrics ------------------- We already discussed two simple cost metrics: (1) Total size of intermediate results (2) Total number of "get-next" calls in iterator execution model Notice that estimating (1) is required in order to estimate (2). Examples: size of intermediate results -------------------------------------- Ex: selection over R of tuples with R.A < 5 Q: estimated number of tuples in result? Ex: join of R and S with R.A = S.A Q: estimated number of tuples in result? Ex: group R by R.A and compute aggregate over R.B Q: estimated number of tuples in result? Examples: number of get-next calls ---------------------------------- Recall left-deep versus right-deep 3-way join examples from previous lecture. More complex cost metrics ------------------------- - Model buffer pool and try to estimate pages fetched - Take CPU usage (e.g., # of comparisons) into account: cost = k1*I/O-cost + k2*CPU-cost - Could even model disk arm movement, disk latency, transfer rate Note that estimating size of intermediate results still needed Ex: R join S R occupies B(R) blocks, S occupies B(S) blocks Naive nested-block join with "rocking" Initially empty LRU buffer pool with K pages Q: estimated number of page fetches? Throughput versus response time ------------------------------- Most cost metrics are based on maximizing throughput. An alternative metric is to minimize response time. Ex: SELECT * FROM R, S WHERE R.A = S.A ORDER BY R.A Suppose R is sorted on attribute A. Q: What kind of join will minimize response time? A: Q: What kind of join will maximize throughput? A: Memory usage ------------ A physical query plan may require a certain amount of memory to be available at run-time, e.g., for hash joins or certain sort algorithms. Plan enumeration may take into account how much memory is expected to be available and select feasible plans accordingly, but run-time guarantees about memory availability usually cannot be made. If not enough memory is available at run-time, possibilities are: (1) default to disk-based version of operators (e.g., for hash join) (2) use virtual memory and hope for decent performance (3) choose alternate plan requiring less memory (4) wait for memory to become available Physical query plan enumeration ------------------------------- Many possible algorithms: (1) Heuristic based on set of rules, with or without considering cost (2) Exhaustive top-down search (3) "Branch and bound": heuristic followed by search (4) Bottom-up exhaustive search (5) Bottom-up greedy search (6) Selinger bottom-up algorithm (7) Randomized search: simulated annealing, iterative improvement, genetic algorithms, many others Running example: "Query block": SORT (PROJECT (GROUP/AGGREGATE (JOIN (SELECTION(R1), SELECTION(R2), ..., SELECTION(Rn)))) Corresponds to: SELECT attributes, aggregates FROM R1, R2, ..., Rn WHERE join conditions and local selections GROUP BY attributes ORDER BY attributes Most significant decision is join ordering, relation access methods, and join methods. Heuristic plan enumeration -------------------------- Generate just one plan based on a set of rules. Without cost: - Decision based on available indexes, existence of local and join conditions in query, types of conditions (=, <, etc.) - General goals: avoid scans (use indexes), keep intermediate results small - Heuristics: join smaller operands first, do equijoins first, exploit sorted operands for sort-merge joins, arrange join order so indexes can be used on inners of nested-loop joins With cost: - Heuristics as above but can estimate and minimize intermediate result sizes Top-down plan enumeration ------------------------- for each possible ordering of R1, R2, ..., Rn: for each possible associativity of selected order: for each possible join method for join 1: for each possible join method for join 2: ... for each possible join method for join (n-1): for each possible access method for SELECTION(R1): for each possible access method for SELECTION(R2): ... for each possible access method for SELECTION(Rn): construct physical plan including aggregation, projection, and ordering; compute estimated cost; save plan and cost if cheapest so far Some simplifications don't compromise finding cheapest plan, e.g., eliminating incompatible join/access methods, not considering commutative cases of commutative joins. Implementation would not recompute subresults. Branch-and-bound plan enumeration --------------------------------- (1) Generate first plan using heuristic approach (no-cost or cost). (2) Permute the plan looking for lower-cost alternatives. During permutations can stop estimating cost for a plan when higher than best so far. Can stop optimization at any point with expectation of good plan, otherwise searches entire space. Exhaustive bottom-up plan enumeration ------------------------------------- (1) For each pair Ri/Rj, generate all possible plans and costs for SELECTION(Ri) JOIN SELECTION(Rj), including commutative case. (2) Using plans and costs from step (1), for each possible triple Ri/Rj/Rk, generate all possible plans and costs for SELECTION(Ri) JOIN SELECTION(Rj) JOIN SELECTION(Rk), including different associativities and commutativities. (3) Continue until plans and costs are generated for n-way join. (4) Add aggregation, projection, and ordering; pick cheapest plan. Greedy bottom-up plan enumeration --------------------------------- Same as exhaustive, except at each step save only the cheapest plan. Q: Will the selected plan always be optimal? A: Selinger / System R algorithm ----------------------------- Same as greedy, except at each step save all plans that represent an "interesting order" - An intermediate result has an "interesting order" if it is sorted by any of: * Final SORT attribute * Grouping attribute(s) * Later join attributes Q: Will the selected plan always be optimal? A: Reducing the search space ------------------------- Lots of possibilities, all with potential to prune away the optimal plan. Examples: - Consider left-deep join trees only - Always favor joins over cross-products - Always use hash join if build operand is expected to fit in memory - Always use sort-merge join if one operand is sorted - Always favor selection index over join index (or vice-versa) Q: others?