CS457 Syllabus & Progress

• Some complex queries using groups

1. Query: combining a group condition and a "all" condition

 Find the fname and lname of the employees with more than 2 dependents and work on all projects controlled by department #1

Solution:

 ``` Make the conditions explicit first: SELECT fname, lname FROM employee WHERE employee has > 2 dependent AND employee works on all projects controlled by dept 1 In other words: SELECT fname, lname FROM employee WHERE ssn IN "set of ssn of employees with > 2 dependents" AND ssn IN "set of ssn of employees who work on all projects controlled by dept 1" Now you have 2 subqueries that you can handle separately (divide and conquer) ! First subquery: "set of ssn of employees with > 2 dependents" Solution: SELECT essn FROM dependent GROUP BY essn HAVING count(name) > 2 Substitute into the SQL query above: SELECT fname, lname FROM employee WHERE ssn IN ( SELECT essn FROM dependent GROUP BY essn HAVING count(name) > 2 ) AND ssn IN "set of ssn of employees who work on all projects controlled by dept 1" Deal with second (tougher) subquery separately: Find: "set of ssn of employees who work on all projects controlled by dept 1" Query in pseudo code: SELECT ssn FROM employee E WHERE "set of projects worked on by E.ssn" contains "set of projects controlled by dept 1" Use the set different technique: SELECT ssn FROM employee E WHERE NOT EXISTS ("set of projects controlled by dept 1" - "set of projects worked on by E.ssn" ) Formulate the 2 sub-queries separately: SELECT ssn FROM employee E WHERE NOT EXISTS ( SELECT pnumber FROM project WHERE pnumber IN ( SELECT pnumber FROM project WHERE dnum = 1 ) AND pnumber NOT IN ( SELECT pno FROM works_on WHERE essn = E.ssn ) ) Finally, substitute the second subquery into the partial (pseudo) solution: Final Solution: SELECT fname, lname FROM employee WHERE ssn IN ( SELECT essn /* This set of emps has > 2 dependents */ FROM dependent GROUP BY essn HAVING count(name) > 2 ) AND ssn IN ( SELECT ssn /* This set of emps works on all proj controlled by dept 1 */ FROM employee E WHERE NOT EXISTS ( SELECT pnumber FROM project WHERE pnumber IN ( SELECT pnumber FROM project WHERE dnum = 1 ) AND pnumber NOT IN ( SELECT pno FROM works_on WHERE essn = E.ssn ) ) ) fname lname ------ -------- Frankl Wong ```

2. Query:

 Find fname, lname of employees who work on ≥ 2 projects that John Smith works on

Solution:

 ``` 1. Query in pseudo code: SELECT fname, lname FROM employee E WHERE # { projs worked on by employee E and by John Smith } >= 2 Or: SELECT fname, lname FROM employee E WHERE # { projs worked on by (1) employee E and by (2) John Smith } >= 2 2. Re-write the set cardinality condition using count(): SELECT fname, lname FROM employee E WHERE (SELECT COUNT(pnumber) FROM project WHERE pnumber IN { proj worked on by employee E } and pnumber IN { proj worked on by John Smith } ) >= 2 3. Work out the inner sets: A. The set { proj worked on by employee E } is obtained with this query: (SELECT pno FROM works_on WHERE essn = E.ssn) B. The set { proj worked on by John Smith } is obtained with this query: (SELECT pno FROM works_on, employee WHERE essn = ssn AND fname='John' AND lname='Smith') 4. Substitute these sub-queries in the pseudo code: SELECT fname, lname FROM employee E WHERE (SELECT COUNT(pnumber) FROM project WHERE pnumber IN (SELECT pno FROM works_on WHERE essn = E.ssn) and pnumber IN (SELECT pno FROM works_on, employee WHERE essn = ssn AND fname='John' AND lname='Smith') ) >= 2 ```

Alternate solution: (try to figure it out what this query is about)

 ``` select E.essn from (select pno /* Projects w.o. by John Smith */ from works_on, employee where essn = ssn and fname='John' and lname='Smith') A, works_on E where A.pno = E.pno group by E.essn having count(*) >= 2 ```

(You can get fname and lname easily using the essn's)

3. Query:

Intro:

• Previously, we have seen how to solve this query:

 Find the (names of) employees who work on all projects controlled by Department 4

• Find departments who has 2 or more employees working on all projects controlled by the 'Research' department

Solution:

 ``` 1. Solution in pseudo code: SELECT dname FROM department D WHERE # { employees in D working on all proj's controlled by 'Research' department } >= 2 2. Re-written using the count() function: SELECT dname FROM department D WHERE COUNT { employees in D working on all proj's controlled by 'Research' department } >= 2 Make the condition explicit: SELECT dname FROM department D WHERE COUNT { employees who (1) are in dept D && who (2) work on all proj's controlled by 'Research' department } >= 2 Re-written with SQL's select count( ) syntax: SELECT dname FROM department D WHERE ( select COUNT(E.ssn) from employee E where employee E is in dept D and employee E works on all projs controlled by 'Research' department ) >= 2 3. E is in department D if: E's dno is equal to D's dnumber: SELECT dname FROM department D WHERE ( select COUNT(E.ssn) from employee E where E.dno = D.dnumber // E works in department D !! and employee E works on all projs controlled by 'Research' department ) >= 2 4. We work out the second condition: E works on all proj's controlled by 'Research' department using a subset/superset relationship: { proj's worked on by E.ssn } contains { proj's controlled by 'Research' department } After the substitution: SELECT dname FROM department D WHERE ( select COUNT(E.ssn) from employee E where E.dno = D.dnumber and ({ proj's worked on by E.ssn } contains { proj's controlled by 'Research' department } ) ) >= 2 5. Apply the set subtraction technqiue and work out the details: SELECT dname FROM department D WHERE ( select COUNT(E.ssn) from employee E where E.dno = D.dnumber /* Employee E is in dept D */ and (NOT EXISTS { proj's controlled by 'Research' department } - { proj's worked on by E.ssn } ) ) >= 2 Or: SELECT dname FROM department D WHERE ( select COUNT(E.ssn) from employee E where E.dno = D.dnumber /* Employee E is in dept D */ and NOT EXISTS ( select * from project where pnumber in { proj's controlled by 'Research' department } and pnumber not in { proj's worked on by E.ssn } ) ) >= 2 6. Work out the inner 2 sets and we obtain the solution: SELECT dname FROM department D WHERE (SELECT COUNT(E.ssn) FROM employee E WHERE E.dno = D.dnumber /* E is in dept D */ AND NOT EXISTS (SELECT * FROM project WHERE pnumber IN (SELECT pnumber FROM project, department where dnum=dnumber AND dname='Research' ) AND pnumber NOT IN (SELECT pno FROM works_on WHERE essn = E.ssn ) ) ) >= 2 ```

4. Query: (Tricky !!)

 Find the department name, and the number of employees in that department that earns > \$40000 for departments with >= 2 employees,

Let us look at an example of a correct answer:

 ``` Suppose that this is the content of the database: DNAME FNAME LNAME SALARY --------------- ------ -------- ---------- Headquarters James Borg 55000 Administration Alice Miller 25000 Administration Jack Wallace 43000 Administration Jake Jones 25000 Research John Smith 30000 Research Joyce English 25000 Research John Doe 38000 Research Frank Wong 40000 These departments with >= 2 employees are: Administration Research The number of employees earning > 40000 in these departments are: Administration 1 Research 0 The correct answer to the query is: Administration 1 (1 employee) (The Research department is not selected because its set of employee is empty !!!) ```

Incorrect solution:

 ``` 1. Here are the departments and their employees: select dname, fname, lname, salary from department, employee where dno=dnumber dname fname lname salary --------------- ------ -------- --------- Administration Alicia Zelaya 25000.00 Administration Jennif Wallace 43000.00 Administration Ahmad Jabbar 25000.00 Headquarters James Borg 55000.00 Research Frankl Wong 40000.00 Research Ramesh Narayan 38000.00 Research Joyce English 25000.00 Research John Smith 30000.00 2. Here are the departments with >= 2 employees: SELECT dname, count(ssn) FROM employee, department WHERE dnumber = dno GROUP BY dname HAVING COUNT(ssn) >= 2 dname count(ssn) --------------- --------------------- Administration 3 Research 4 \$64,000 question: Is this query the correct solution ??? SELECT dname, COUNT(ssn) FROM department, employee WHERE dnumber = dno AND salary > 40000 GROUP BY dname HAVING COUNT(ssn) >= 2 Output: (empty !!!) dname count(ssn) --------------- --------------------- ```

• What went wrong ???

Important principle: in group by query formulation:

 You must first find them (tuples) before you can count them (tuples) !!!

We have violated this principle:

 We counted them (tuples) and then we go back (to the where clause) and found a different set of tuples !!!

Execution of the new (modified) query is as follows:

 ``` Fact: The TUPLE selection part (WHERE) of the query is executed first !!! This is the TUPLE selection part (WHERE): SELECT dname, fname, lname, salary FROM department, employee WHERE dnumber = dno AND salary > 40000 Output: dname fname lname salary --------------- ------ -------- --------- Administration Jennif Wallace 43000.00 Headquarters James Borg 55000.00 Meaning of output: all employees with salary > 40000 !!! THEN, grouping is applied to this set: SELECT dname, COUNT(ssn) FROM department, employee WHERE dnumber = dno AND salary > 40000 GROUP BY dname HAVING COUNT(ssn) >= 2 Meaning of the groups: departments with >= 2 employees earning > \$40000 !!! ```

Moral of this example:

• Do not change the WHERE condition when there is a GROUP BY clause in your query !!!

(Because the new condition will select a different set of tuples that you cannot see due to the group by clause !!!)

• If you change the WHERE condition, make sure to check the result set without a GROUP BY clause to see if you still have the correct set of tuples:

 ```1. Here are the departments and their employees: select dname, fname, lname, salary from department, employee where dno=dnumber dname fname lname salary --------------- ------ -------- --------- Administration Alicia Zelaya 25000.00 Administration Jennif Wallace 43000.00 Administration Ahmad Jabbar 25000.00 Headquarters James Borg 55000.00 Research Frankl Wong 40000.00 Research Ramesh Narayan 38000.00 Research Joyce English 25000.00 Research John Smith 30000.00 2. Here are the tuples with the additional condition: SELECT dname, count(ssn) FROM employee, department WHERE dnumber = dno AND salary > 40000 (OMIT the GROUP BY !!! You want to see the actual tuples) dname fname lname salary --------------- ------ -------- --------- Administration Jennif Wallace 43000.00 Headquarters James Borg 55000.00 ```

Observation: You cannot use these tuples to answer the query:

 Find the department name, and the number of employees in that department that earns > \$40000 for departments with >= 2 employees !!!

because you eliminated some employees in departments !!!!

The correct solution

Recall: the query

 Find the department name, and the number of employees in that department that earns > \$40000 for departments with >= 2 employees,

Correct solution:

 ``` 1. Here are the departments and their employees: select dname, fname, lname, salary from department, employee where dno=dnumber dname fname lname salary --------------- ------ -------- --------- Administration Alicia Zelaya 25000.00 Administration Jennif Wallace 43000.00 Administration Ahmad Jabbar 25000.00 Headquarters James Borg 55000.00 Research Frankl Wong 40000.00 Research Ramesh Narayan 38000.00 Research Joyce English 25000.00 Research John Smith 30000.00 2. Here are the departments with >= 2 employees: SELECT dname, count(ssn) FROM employee, department WHERE dnumber = dno GROUP BY dname HAVING COUNT(ssn) >= 2 dname count(ssn) --------------- --------------------- Administration 3 Research 4 The query can be re-stated as follows: Find (1) the department name, and (2) the number of employees in the department earning > \$40000 for departments 3 and 4 !!! 3. We first find the employees IN these departments ! SELECT dname, fname, lname, salary FROM employee, department WHERE dno=dnumber AND dno IN ( SELECT dno FROM employee GROUP BY dno HAVING COUNT(ssn) >= 2 ) Output: dname fname lname salary --------------- ------ -------- --------- Administration Ahmad Jabbar 25000.00 Administration Alicia Zelaya 25000.00 Administration Jennif Wallace 43000.00 Research John Smith 30000.00 Research Frankl Wong 40000.00 Research Ramesh Narayan 38000.00 Research Joyce English 25000.00 4. Filter the employees with salary > 40000 in these departments SELECT dname, fname, lname, salary FROM employee, department WHERE dno=dnumber AND salary > 40000 AND dno IN ( SELECT dno FROM employee GROUP BY dno HAVING COUNT(ssn) >= 2 ) Output: dname fname lname salary --------------- ------ -------- --------- Administration Jennif Wallace 43000.00 5. Finally, count how many employees in each group: SELECT dname, COUNT(ssn) FROM employee, department WHERE dno=dnumber AND salary > 40000 AND dno IN ( SELECT dno FROM employee GROUP BY dno HAVING COUNT(ssn) >= 2 ) GROUP BY dname Output: dname COUNT(ssn) --------------- --------------------- Administration 1 ```