### The cost and constraint in query processing

• Introductory example on processing cost

• Consider the relations:

 ``` Employee SSN fname salary dno --------- ------ --------- ----------- 123456789 John 20900.00 5 333445555 Frankl 50000.00 5 999887777 Alicia 25000.00 4 987654321 Jennif 43000.00 4 666884444 Ramesh 38000.00 5 453453453 Joyce 25000.00 5 987987987 Ahmad 25000.00 4 888665555 James 55000.00 1 Department: dnumber dname ----------- --------------- 5 Research 4 Administration 1 Headquarters ```

Note:

 There are no indexes on any of the relations

• Find all employee in the Research department who earn > \$50,000:

 ``` σ salary > 50000 ∧ dname='Research' ( employee ⋈ department ) ```

• Query plan 1:

• In tree form:

• Execution:

 ``` for ( tuple t1 ∈ employee ) do for ( tuple t2 ∈ department ) do if ( t1.dno = t2.dnumber ) { if ( sal > 5000 && dname = 'Research' ) output t1 . t2 } ```

• Cost of execution:

 Accesses n2 tuples

• Query plan 2:

• In tree form:

• Execution:

 ``` for ( tuple t1 ∈ employee ) do if ( sal > 50000 ) output t1 to tmpRel; for ( tuple t1 ∈ tmpRel ) for ( tuple t2 ∈ department ) do if ( t1.dno = t2.dnumber ) { if ( dname = 'Research' ) output t1 . t2 } ```

• Cost of execution:

 Accesses n + k×n tuples

Note: k << n        (k is much smaller than n) !!!

• Query plan 3:

• In tree form:

• Execution:

 ``` for ( tuple t1 ∈ employee ) do if ( sal > 50000 ) output t1 to tmpRel1; for ( tuple t2 ∈ department ) do if ( dname = 'Research' ) output t2 to tmpRel2; for ( tuple t1 ∈ tmpRel ) for ( tuple t2 ∈ tmpRel2 ) do if ( t1.dno = t2.dnumber ) { output t1 . t2 } ```

• Cost of execution:

 Accesses n + n + k×k tuples

Note: k << n !!!

• Execution of a Phyiscal Query Plan

• Physical Query Plan:

 Physical Query Plan consists of a number of Physical Query Plan operators

• The execution of a physical query plan will need to:

• Allocate a number of memory buffers to store disk blocks

(i.e.: require some resources)

Resource constraint:

 The availability of the memory buffers constitues a constraint on the execution of the physical query plan

• Access (= delay) a number of disk blocks (to process the query)

Performance cost:

• A high the number of disk block accesed results in:

 Longer query processing time !!

• The # disk block accesses constitutes the cost of the physical query plan

• Cost and constraint of a Phyiscal Query Plan

• Cost of a physical query plan:

 Cost of a physical query plan = the number of disk blocks that are accessed by the execution of the physical query plan

• Constraint on a physical query plan:

• Memory limitation:

 The operators used in the physical query plan will require a certain minimum buffer allocation requirement

• The amount of memory available can prohibit the choice of certain (= more efficient) algorithms for an operator !!!

• Query optimization:

• Query optimization = find the least cost (physical) query plan such that:

 ``` Total memory used by the operators ≤ Total available memory buffers ```

(It's a constrainted optimilization problem)

• Assumption in calculating the cost of an operator

• Recall:

 Cost of an operator = the number of disk I/O operations (= # disk blocks) performed by the operator

• Assumption in computing the cost of an operator:

• The output of the operator is left in memory !!!

• I.e.:

 The cost of an operation does not include the disk I/O's to write result (to disk).... Unless the execution plan needs to write the result to disk (to save memory space)

Reason:

• Query plans often use pipelining to execute the operations in the query plans

Pipelining passes the output tuples using memory buffers

Cost computation:

 Passing tuples using memory buffers requires no disk access !!!

• Parameters (statistics) used in expressing the cost and constraints of operations

• Notation:

• M = the number of memory buffers in main memory that is available to an operator

Note:

 We do not include the memory space used to hold the output of the operator in the calculation !!!

• B(R) = the number of disk blocks accessed/used for the relation R

Notes:

• We use B for short - when it is clear which relation is referred to.

• B(R) is a good estimate for the # disk IO operations to scan (read) a cluster relation R

• T(R) = the number of tuples accessed/used for the relation R

Notes:

• We use T for short - when it is clear which relation is referred to.

• T(R) is a good estimate for the # disk IO operations to scan (read) a uncluster relation R

• V(R, attr) = the number of distinct values found in attribute "attr" in relation R

Example:

 ``` Employee: fname lname dno ------ -------- ----------- John Smith 5 Frankl Wong 5 Alicia Zelaya 4 Jennif Wallace 4 Ramesh Narayan 5 Joyce English 5 Ahmad Jabbar 4 James Borg 1 V(Emp, dno) = 3 (3 distinct values: 1, 4, 5) ```

Note:

 V(R, attr) = | δ ( πattr(R) ) |

• V(R, [a1,a2, ..,an]) = the number of distinct values found in attributes "a1,a2, .., an" in relation R

Note:

 V(R, [a1,a2, .., an] ) = | δ ( π a1,a2, .., an (R) ) |