Saturday, November 2, 2013

Find Answers Faster


Analytic functions provide powerful ways to view your data.
It's 5:00 p.m. You're tired, hungry, and ready to go home for the day, when your manager calls you into his office to request a new report he needs "right now."
"Find the highest salary in each department, and give me a list of employees earning that amount. Can you do that before you leave? Just that one thing?"
You've heard the "just that one thing" line before. You phone home to say you'll be late for dinner, and then you get back to your keyboard to think about that query. "Find the highest salary in each department" is easy enough. A simple GROUP BY provides that information: 
SELECT department_id, MAX(salary)
FROM employees
GROUP BY department_id;


But you also need to list the highest-paid employees in each department. Perhaps a subquery is what you need. Correlated or uncorrelated? It's late, and you really don't want to think through subqueries right now.
Fortunately, there is an easier way. Although you can solve this query problem by using a subquery, you can solve it much more easily by using the analytic function syntax that Oracle has supported since the release of Oracle8i Database. Listing 1 shows the analytic function solution and some of its output. (The examples in this article use the hr sample schema.) Not only is the solution in Listing 1 easier to think through than a subquery solution but verifying at a glance that it's correct is also easier, in our opinion.
Code Listing 1: Listing the highest salary and highest-paid employee 
SELECT department_id dept, first_name || ' ' || last_name name, salary
FROM ( 
SELECT department_id, first_name, last_name,
       MAX(salary) OVER (PARTITION BY department_id) dept_max_sal, salary 
FROM employees e
) WHERE salary = dept_max_sal;

      DEPT       NAME                       SALARY
  ---------     ------------------------    --------
        10      Jennifer Whalen             4400
        20      Michael Hartstein           13000
        30      Den Raphaely                11000
...


The query in Listing 1 uses a two-step process to solve our reporting problem:
1. The inner query—in boldface text in Listing 1—generates a list of employees. Each row returned by this query contains a salary column with the employee's salary and also a dept_max_salcolumn with the maximum salary paid in the employee's department (more on the dept_max_salcolumn in a moment).
2. The outer query simply compares salary with dept_max_sal , returning only those rows where the two amounts match.
The key to this solution is the analytic function call that computes the maximum salary for each employee's department: 
MAX(salary) OVER (
   PARTITION BY department_id
   )


MAX is an aggregate function, and aggregate functions serve double duty as analytic functions. When you see the OVER keyword, you know you are dealing with the analytic version of a function. In this case, the PARTITION BY department_id clause causes MAX to return the maximum salary for each department_id value. Think of department_id as defining a window of rows over which MAX is computed. The GROUP BY solution to the first part of your manager's request collapses the query results into one row per group, but the analytic solution returns the maximum salary value for each detail row. Thus, you have both detail and summary data available at the same time, making it easy to compare each employee's salary with the maximum for that person's department.
Analytic functions are powerful, and they make possible some elegant solutions to problems that previously were difficult to solve in SQL. Your manager's request highlights one problem scenario that can benefit from analytic syntax:
You need access to both detail and summary data at the same time.
Aside from being elegant, analytic function solutions tend to be less brittle and more changeable than their nonanalytic counterparts. Suppose you've taken the report from Listing 1 and handed it to your manager, only to be told, "You know, what I really need is for you to find the maximum salary for each job and then report on the employees in each position that are earning that maximum. Second, please include the maximum department salary for each employee you list."
The first part of this new request is quite easy to carry out. Simply begin with Listing 1 and change all occurrences of department_id to job_id . Also change the column aliases; for example, change dept_max_sal to job_max_sal . The revised query and solution to the first part of your manager's latest request are in Listing 2.
Code Listing 2: Listing the highest-paid employees in each job 
SELECT job_id job, first_name || ' ' || last_name name, salary
FROM ( 
SELECT job_id, first_name, last_name,
       MAX(salary) OVER (PARTITION BY job_id) job_max_sal, salary 
       FROM employees e
) WHERE salary = job_max_sal;


The second part of this new request is also straightforward. All you need to do is add the following column to the inner query and include it in the outer query's select list: 
 
 MAX(salary) OVER (
    PARTITION BY department_id
    ) dept_max_sal


Listing 3 shows the final query for this new request from your manager, along with some sample output. For good measure, Listing 3 also selects the department_id column for each employee. The query in Listing 3 highlights another problem scenario to which analytic syntax can be applied:
You need to summarize data over two or more dimensions at the same time.
Code Listing 3 Same as Listing 2, but with departmental maximun salary 
                               
SELECT department_id dept, job_id job, first_name || ' ' || last_name name, 
      salary, dept_max_sal
FROM (
SELECT department_id, job_id, first_name, last_name,
       MAX(salary) OVER (
          PARTITION BY job_id) job_max_sal, salary,
       MAX(salary) OVER (
          PARTITION BY department_id) dept_max_sal
FROM employees e
) WHERE salary = job_max_sal;

      DEPT     JOB             NAME             SALARY       DEPT_MAX_SAL
 ----------    --------------  --------------   --------     ------------
       110     AC_ACCOUNT      William Gietz        8300         12000
       110     AC_MGR          Shelley Higgins     12000         12000
        10     AD_ASST         Jennifer Whalen      4400          4400

                            


The amount of effort required to go from Listing 2 to Listing 3 is minimal.
Happy that it was so easy to get through "that one thing," you present the report to your manager, who looks it over and then (of course) asks, "Would it be possible to report on the employees with the top five salaries in each department? I really do need to look by department, not by job. And I'd like the salaries to be ranked."
Walking away and muttering under your breath, you stop and remember that there is a RANKfunction that assigns ranks to rows within a window. You write the query in Listing 4 and get ready to give the results to your manager. The part of Listing 4 that generates the ranking is the following function call: 
RANK() OVER (
   PARTITION BY department_id 
   ORDER BY salary DESC
) dept_rank


Code Listing 4: Listing the employees using RANK 
                               
SELECT department_id dept, job_id, first_name || ' ' || last_name name,
       salary, dept_rank
FROM (
SELECT department_id, job_id, first_name, last_name, salary,
       RANK() OVER (
         PARTITION BY department_id 
         ORDER BY salary DESC
       ) dept_rank
  FROM employees
) WHERE dept_rank <= 5
  ORDER BY dept, dept_rank DESC;

      DEPT     JOB_ID            NAME                 SALARY      DEPT_RANK
 ---------     ---------         ----------------   --------    ------------
        10     AD_ASST           Jennifer Whalen      4400               1
        20     MK_REP            Pat Fay              6000               2
        20     MK_MAN            Michael Hartstein    13000              1
...
        60     IT_PROG           Diana Lorentz        4200               5
        60     IT_PROG           David Austin         4800               3
        60     IT_PROG           Valli Pataballa      4800               3
        60     IT_PROG           Bruce Ernst          6000               2
        60     IT_PROG           Alexander Hunold     9000               1

                            


Here is, conceptually, how this RANK function call is evaluated:
1. Rows are grouped by department ( PARTITION BY department_id ). 
2. Within each group, rows are sorted in descending order by salary ( ORDER BY salary DESC ). 
3. Each row in each group is given a number to indicate its rank in the group.
There's no argument to RANK , because the value returned is a function of how the rows are ordered in preparation for ranking. RANK returns 1 for the topmost row, 2 for the next row, and so forth. The ORDER BY clause is mandatory for use of any of the analytic ranking functions, because you use it to specify the basis for ranking. Keep in mind that the ORDER BY for ranking the employees by salary is not the same as the final ORDER BY that ranks the entire result set. In Listing 4, the final result set is in ascending order by department (the dept column) and then by descending order of salary rank. The PARTITION BY clause in Listing 4 causes separate ranking of each department's employees.
Listing 4 highlights yet another problem scenario for which you should consider an analytic solution:
You need to assign a numerical rank to rows as a result of sorting them by one or more columns.
You put on your coat and present the report to your manager, who looks it over and seems quite pleased until ... "Hey, what's with the ranking for department 60?" Examining the report more closely, you notice that employees David Austin and Valli Pataballa have the same salary and are tied for the third-highest in the department. The next employee, Diana Lorentz, is ranked fifth, due to the tie of the previous two employees' salaries. That is the way RANK works. Whenever n rows are tied, n- 1 ranks are skipped, so that the next row's ranking will be what it would have been without the tie. Your manager then says, "So, I don't want to see any gaps in the rankings, OK?"
Standing there, you recall the DENSE_RANK function, for just such a request.
You go back to your desk and, making a small adjustment to the query, come up with Listing 5. You race back to your manager's office and present the report. Your manager looks it over and is pleased, "This is perfect; thank you very much." But as you are leaving, he asks, "For each employee in this report, do the rankings correspond to when the employee was hired? I assume that if one employee earns more than another, that employee has most likely been here longer. The report confirms that, right?"
Code Listing 5: Using DENSE_RANK instead of RANK 
                               
SELECT department_id dept, job_id, first_name || ' ' || last_name name,
       salary, dept_rank
  FROM (
SELECT department_id, job_id, first_name, last_name, salary,
       DENSE_RANK() OVER (
          PARTITION BY department_id ORDER BY salary DESC) dept_rank
  FROM employees
) WHERE dept_rank <= 5
  ORDER BY dept, dept_rank DESC;

      DEPT     JOB_ID           NAME                 SALARY      DEPT_RANK
 ----------    ----------       ----------------   --------    ------------
        10     AD_ASST          Jennifer Whalen        4400               1
        20     MK_REP           Pat Fay                6000               2
        20     MK_MAN           Michael Hartstein     13000               1
...
        60     IT_PROG          Diana Lorentz          4200               4
        60     IT_PROG          David Austin           4800               3
        60     IT_PROG          Valli Pataballa        4800               3
        60     IT_PROG          Bruce Ernst            6000               2
        60     IT_PROG          Alexander Hunold       9000               1

                            


Stomach growling, you go back to your cube one more time.
To answer your manager's query, you will need to be able to compare the salary of each employee with that of the employee hired immediately afterward. You could write a self-join for this purpose, but you're too tired and hungry to want to think through doing that. Fortunately, you recall that theLAG and LEAD functions let you look at the preceding and following rows. After giving the matter a few moments' thought, you come up with the query in Listing 6, which brings us to our final problem scenario for which analytic functions provide a solution:
You need to compare values that lie in separate rows.
Code Listing 6: Using LEAD 
                               
SELECT department_id dept, job_id, first_name || ' ' || last_name name,
       salary, dept_rank, hire_date,
       CASE WHEN salary < LEAD(salary) OVER(
                          PARTITION BY department_id ORDER BY hire_date) 
            THEN 'No'
            ELSE 'Yes'
       END seniority_salary 
  FROM (
SELECT department_id,
       job_id, first_name, last_name, hire_date, salary,
       DENSE_RANK() OVER(
          PARTITION BY department_id ORDER BY salary DESC) dept_rank
  FROM employees
) WHERE dept_rank <= 5
  ORDER BY dept, dept_rank DESC;

 DEPT     JOB_ID         NAME                SALARY    DEPT_RANK    HIRE_DATE     SEN
------    ----------     ------------------- --------  ----------   ------------  ----
    10      AD_ASST        Jennifer Whalen      4400        1          17-SEP-87   Yes
    20      MK_REP         Pat Fay              6000        2          17-AUG-97   Yes
    20      MK_MAN         Michael Hartstein   13000        1          17-FEB-96   Yes
    30      PU_CLERK       Guy Himuro           2600        5          15-NOV-98   Yes
    30      PU_CLERK       Sigal Tobias         2800        4          24-JUL-97    No
    30      PU_CLERK       Shelli Baida         2900        3          24-DEC-97   Yes
    30      PU_CLERK       Alexander Khoo       3100        2          18-MAY-95   Yes
    30      PU_MAN         Den Raphaely        11000        1          07-DEC-94   Yes
...

                            


Key to your solution is the following function in the outermost SELECT list in Listing 6: 
LEAD(salary) OVER(
   PARTITION BY department_id 
   ORDER BY hire_date) 


For the LAG and LEAD functions to work, you must impose order on the rows you are working with. Otherwise there's no sense of "preceding" or "following" among the results. Thus, as with RANKand DENSE_RANK , the ORDER BY clause is mandatory for the LAG and LEAD functions. What's great about analytic syntax is that the order you choose for one function is completely independent of the order you choose for another. Thus, your call to LEAD looks at the next row sorted byhire_date , whereas your call to DENSE_RANK still ranks rows by salary.
The CASE expression in Listing 6 compares each employee's salary with that of the next employee hired ( ORDER BY hire_date ) within the same department. Anytime the newer employee has a higher salary, the expression returns 'No' . Your manager can tell at a glance whenever salary progression in a department does not reflect strict seniority.
Anticipating your manager's likely next request, you decide to enhance your query even further to return the answer to the following question: "How does each employee's salary compare to the average salary of all employees hired within a one-year moving window centered on that one employee?
To answer this question, you need a moving average. In effect, you need a one-year-wide partition that is redefined for each employee row the query returns. This is another problem scenario to which you can apply analytic syntax:
You need the results of an aggregate function computed over a moving window.
To generate the one-year moving average you seek, you add the following function call to your query: 
ROUND(AVG(salary) OVER (
   ORDER BY hire_date
   RANGE BETWEEN 
      hire_date - ADD_MONTHS(
         hire_date,-6) PRECEDING
   AND 
      ADD_MONTHS(hire_date,6) 
         - hire_date FOLLOWING)
   ) moving_average


The key aspects of this analytic function call are as follows:
1. The RANGE BETWEEN clause defines a moving window that is computed anew for each row the query returns.
2. The ORDER BY hire_date clause specifies that hire_date is to be the basis for the moving window.

Next Steps


READ 
more about analytics
Oracle Database Data Warehousing Guide
3. The two expressions in theRANGE BETWEEN clause, one forPRECEDING and one forFOLLOWING , will both yield values of approximately 180. The numbers will vary somewhat, depending on the currenthire_date , because the number of days in a month varies. 
When the query is executed, the database engine subtracts thePRECEDING value from hire_date to determine the beginning of the moving window. Likewise, the database adds the FOLLOWING value to hire_date to determine the end of the moving window. This subtraction and addition is done for each row returned by the query.
4. There is no PARTITION BY clause, because you want the moving average to consider all salaries within the window. You could add PARTITION BY department_id if you wanted each moving average to consider only employees in the current department.
5. The ROUND function rounds the result to an integer.
Listing 7 shows the final query, with some sample results. Confident that this is the last time, you present the report to your manager. He is pleased with the report and your initiative. Happily, it's only 5:30 p.m. You've done all this work in a mere 30 minutes. You'll be home for dinner after all.
Code Listing 7: Adding a 12-month, moving-average 
                               
SELECT department_id dept, job_id, first_name || ' ' || last_name name,
       salary, dept_rank, hire_date,
       CASE WHEN salary < LEAD(salary) OVER(
                          PARTITION BY department_id ORDER BY hire_date) 
            THEN 'No'
            ELSE 'Yes'
       END seniority_salary, moving_average
  FROM (
SELECT department_id, job_id, first_name, last_name, hire_date, salary,
       DENSE_RANK() OVER(PARTITION BY department_id ORDER BY salary DESC) dept_rank,
       ROUND(AVG(salary) OVER (
         ORDER BY hire_date
         RANGE BETWEEN hire_date - ADD_MONTHS(hire_date,-6) PRECEDING
         AND ADD_MONTHS(hire_date,6) - hire_date FOLLOWING)
         ) moving_average
  FROM employees
) WHERE dept_rank <= 5
  ORDER BY dept,dept_rank DESC;

 DEPT   JOB_ID       NAME              SALARY   DEPT_RANK   HIRE_DATE   SEN    MOVING_AVE
 ----   ----------   ----------------- -------- ---------- ------------ ----  - ---------
  10    AD_ASST      Jennifer Whalen    4400            1  17-SEP-87   Yes         14200
  20    MK_REP       Pat Fay            6000            2  17-AUG-97   Yes          6146
  20    MK_MAN       Michael Hartstein 13000            1  17-FEB-96   Yes          7550
  30    PU_CLERK     Guy Himuro         2600            5  15-NOV-98   Yes          4244
  30    PU_CLERK     Sigal Tobias       2800            4  24-JUL-97    No          6214
  30    PU_CLERK     Shelli Baida       2900            3  24-DEC-97   Yes          5427
  30    PU_CLERK     Alexander Khoo     3100            2  18-MAY-95   Yes          5820
  30    PU_MAN       Den Raphaely      11000            1  07-DEC-94   Yes          8867
...

No comments:

Post a Comment