============================================================================ CS 346 Fall '98 - Prof. Widom - Handout #14 LECTURE NOTES - Query processing lecture #2 ============================================================================ Review: steps in database query processing ------------------------------------------ Query string --> Parser Query tree --> Checker Valid query tree --> View expander Valid tree w/o views --> Logical query plan generator Logical query plan --> Query rewriter (heuristic) Better logical plan --> Physical query plan generator (cost-based) Selected physical plan --> Code generator Executable code --> Execution engine Left to cover ------------- - Query rewriting - Passing information between physical operators - Cost metrics for physical query plans - Physical query plan enumeration - Reducing the search space - Further optimizations Query rewriting --------------- - Idea: rewrite the logical query plan so it's likely to produce a better physical query plan. Rewrites are based on heuristic rules, not on database statistics. (Implication: enumeration of physical plans considers only those with some correspondence to the input logical plan) (1) Push selections down - Can be bad in the (unusual) scenario of very selective join, unselective selection condition. - Can be bad in the case of expensive predicates. (2) "Flatten" subqueries - Need to be careful about duplicates (3) Insert or push projections Q: Why bother? A: (4) Many others, especially with aggregates * Since rewrites are not guaranteed to be improvements, could generate alternatives and send all of them to physical query plan generator. Passing information between physical operators ---------------------------------------------- Example physical plan: Two modes of information passing: (1) TEMPORARY TABLES Create table for each interior node in the query plan. Table for root contains query result. - Pro: conceptually easy - Cons: harder to implement, often less efficient (2) ITERATORS Each operator supports three methods: Open: set up iterator GetNext: get next record Close: destroy iterator Iterator for root operator produces query result. Q: Look familiar? A: Pros and cons: converse of temporary tables Physical operators ------------------ Iterator: Open Next Close ------------------------------------------------------------------------------ TABLE SCAN | | | | ------------------------------------------------------------------------------ INDEX SCAN W/O CONDITION | | | | ------------------------------------------------------------------------------ CONDITION-BASED INDEX SCAN | | | | ------------------------------------------------------------------------------ NESTED-LOOP JOIN | | | | ------------------------------------------------------------------------------ SORT-MERGE JOIN | | | | ------------------------------------------------------------------------------ HASH JOIN | | | | ------------------------------------------------------------------------------ SORT | | | | ------------------------------------------------------------------------------ GROUP/AGGREGATE | | | | ------------------------------------------------------------------------------ UNION | | | | ------------------------------------------------------------------------------ EXCEPT | | | | ------------------------------------------------------------------------------ INTERSECT | | | | ------------------------------------------------------------------------------ SELECTION (APPLY-PRED) | | | | ------------------------------------------------------------------------------ PROJECTION | | | | ------------------------------------------------------------------------------ DISTINCT | | | | ------------------------------------------------------------------------------ Query shapes ------------ Example: R1 JOIN R2 JOIN R3 JOIN R4 JOIN R5 "Left-deep" tree: "Right-deep" tree: "Bushy" tree (as opposed to "linear"): *** With temporary tables: Q: Suppose all relations are the same size and all join selectivities are uniform. Using nested-loop joins, is there a preference between left-deep, right-deep, and bushy? A: Q: Suppose relation sizes and join selectivities vary. Using nested-loop joins, is there a preference between left-deep, right-deep, and bushy? A: *** With iterators: Q: Suppose all relations are the same size and all join selectivities are uniform. Using nested-loop joins, is there a preference between left-deep, right-deep, and bushy? A: Exercise: Consider physical plan to execute (R NLJ S NLJ T) with no indexes and print the result. Suppose |R| = 3, |S| = 6, |T| = 4, join selectivity = .5 How many calls total to an iterator GetNext method for left-deep plan versus right-deep plan? Don't count EOFs * Touched on two simple "cost metrics": (1) For temporary tables: size of intermediate tables and result (2) For iterators: number of GetNext calls More on cost metrics later... Query shapes and hash joins --------------------------- Consider hash join with "build" on left and "probe" on right: Algorithm: 1. For each tuple t1 in "build" relation, hash join attribute a1 and insert (a1,t1) into appropriate hash bucket. 2. For each tuple t2 in "probe" relation, hash join attribute a2, check appropriate hash bucket for matches Left-deep plan with hash joins: Q: How many hash tables needed at once? A: Right-deep plan with hash joins: Q: How many hash tables needed at once? A: Summary: as a pruning heuristic one can limit physical plan enumeration to consider left-deep join plans only. More on pruning heuristics later...