Data management I, Autumn 1999, Exercise 6 (8.-11.11.)

The minimum requirement for active attendance: 2 tasks done

The first exam will be as indicated: Nov 3rd at 15:00-18:00 in Porthania I. The requirements of the exam: pages 1-181 in lecture notes and exercises 1-5. (The corresponding parts in E&N are given on the English course page.) On week 44 (Nov 1-5) we have neither lectures nor exercise groups.

1. Consider the following query

select e.LNAME, s.LNAME
from EMPLOYEE e, EMPLOYEE s, DEPARTMENT d
where e.SUPERSSN = s.SSN and e.DNO = d.DNUMBER and d.DNAME = "Research"
and e.SALARY > 30000.

a) Give the projection-selection-product relational expression
directly corresponding to the query. Also give the corresponding query
tree. Associate the cartesian products in such a way that presumably
will result in the best optimization effect.

b) Optimize the query using heuristics "apply the selections as early as
possible". Give the query tree and the corresponding relational
expression for the optimized query.

2. How would you evaluate the query of the previous problem
when there is an index on attribute SSN to relation EMPLOYEE and an
index on attribute DNAME to relation DEPARTMENT?

3. In the nested (inner-outer) loop approach of computing
the join of relations r and s one may choose r as the outer
relation and s as the inner relation, or vice versa.
Is it better to choose the larger or the smaller relation as the outer
relation?    

4. Consider computing the natural join of relations
r(A,B) and s(B,C) using the sort-merge join algorithm,
when the join attribute $B$ is a unique key of s.
Sketch the algorithm of the merge phase of the join.
The algorithm uses only three pages of buffer memory.

5. Consider the evaluation of the relational expression

        P  (r * s * t)                  (P is the projection operator)
         BD

when the relations r(AB), s(BC) and t(CD) are as follows.

r = {(i,1) | i = 1,...,n},
s = {(1,j) | j = 1,...,n},
t = {(k,2) | k = 1,...,n}.
                          
What is the size of the result, that is, how many tuples does the
value of the expression contain? 
How large intermediary results are generated, when the expression
is evaluated as

a)      P  (r * P  (s * t)),    b)   P  (P  (r * s) * t)
         BD      BD                   BD  BC
  
Assume that catalog information about the sizes of relations and the
sizes of attribute domains are available for the query optimizer.
What kind of heuristics could be used to choose the more efficient
of the evaluation strategies?