8- Advanced Analytic SQL

 
1. Analytic SQL Overview
The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Analytic queries are staples of DSS, and are used by managers, analysts, marketing executives, and others to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile ground for researching and formulating business decisions.
 
For the examples in this chapter, we use a simple star schema consisting of a single fact table (called "orders") containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. The first example looks for customers whose sales last year exceeded 20% of the total sales for their region. There are two main facets to this query, each requiring a different level of aggregation of the same data:
    1. Sum all sales per region last year.
    2. Sum all sales per customer last year.
 
Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer. We can then call an analytic function that performs a second level of aggregation to generate total sales per region:
 
    SELECT o.region_id region_id, o.cust_nbr cust_nbr,
           SUM(o.tot_sales) tot_sales,
           SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
      FROM orders o
     WHERE o.year = 2001
     GROUP BY o.region_id, o.cust_nbr;

     REGION_ID   CUST_NBR  TOT_SALES REGION_SALES
    ---------- ---------- ---------- ------------
             5          1    1151162      6584167
             5          2    1223518      6584167
             5          3    1161286      6584167
             5          4    1878275      6584167
 
The aggregate function SUM(o.tot_sales) in line 2 of the query generates the total sales per customer and region, as directed by the GROUP BY clause. The analytic function in line 3 then aggregates these sums for each region, thereby computing the aggregate sales per region. The value of the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region.
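The two-level aggregation described above can be tried out on a toy data set. The following is a minimal sketch, not the chapter's own setup: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later, which supports window functions) as a stand-in for Oracle, and the rows in the miniature orders table are invented for illustration.

```python
import sqlite3

# Hypothetical miniature of the chapter's "orders" fact table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (region_id INT, cust_nbr INT, year INT, tot_sales INT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (5, 1, 2001, 100), (5, 1, 2001, 50),  # customer 1 has two rows in region 5
        (5, 2, 2001, 250),
        (6, 3, 2001, 350),
        (6, 4, 2001, 100),
        (6, 4, 2000, 999),                    # removed by the WHERE clause
    ],
)

# Inner SUM: one total per (region, customer) group.
# Outer SUM ... OVER: adds those group totals up again within each region.
region_rows = conn.execute(
    """
    SELECT region_id, cust_nbr,
           SUM(tot_sales) AS tot_sales,
           SUM(SUM(tot_sales)) OVER (PARTITION BY region_id) AS region_sales
      FROM orders
     WHERE year = 2001
     GROUP BY region_id, cust_nbr
     ORDER BY region_id, cust_nbr
    """
).fetchall()

for row in region_rows:
    print(row)
```

Every row of a region carries the same region_sales value, which is what makes the later comparison of a customer's total against a fraction of the regional total possible within a single pass over the data.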
   
With the regional totals available on every row, the query can be wrapped in an inline view and joined to the region and customer tables to find the customers responsible for more than 20% of their region's sales:

    SELECT c.name cust_name, cust_sales.tot_sales cust_sales, r.name region_name,
           100 * ROUND(cust_sales.tot_sales / cust_sales.region_sales, 2) percent_of_region
      FROM region r, customer c,
           (SELECT o.region_id region_id, o.cust_nbr cust_nbr,
                   SUM(o.tot_sales) tot_sales,
                   SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
              FROM orders o
             WHERE o.year = 2001
             GROUP BY o.region_id, o.cust_nbr) cust_sales
     WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2)
       AND cust_sales.region_id = r.region_id
       AND cust_sales.cust_nbr = c.cust_nbr;

    CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION
    ---------------------- ---------- -------------------- -----------------
    Flowtech Inc.             1878275 New England                         29
    Spartan Industries        1788836 Mid-Atlantic                        28
    Madden Industries         1929774 SouthEast US                        28
    Evans Supply Corp.        1944281 SouthWest US                        28
 
Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle's analytic functions cannot be used in the WHERE, GROUP BY, or HAVING clauses of a query; they are valid only in the SELECT list and the ORDER BY clause. This is because analytic functions are executed only after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query's ORDER BY clause is evaluated to sort the final result set, so the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions.
 
2. Ranking Functions
Beginning with Oracle8i, developers can take advantage of several functions that either generate rankings for each row in a result set or group rows into buckets for percentile calculations.
 
The RANK, DENSE_RANK, and ROW_NUMBER functions generate an integer value from 1 to N for each row, where N is less than or equal to the number of rows in the result set. The differences in the values returned by these functions revolve around how each one handles ties:
 
.ROW_NUMBER returns a unique number for each row, starting with 1. Rows that have duplicate values still receive different numbers, assigned arbitrarily among the tied rows.
 
.DENSE_RANK assigns a number to each row, starting with 1. Rows that have duplicate values receive the same ranking, and no gap follows the tie.
 
.RANK assigns a number to each row, starting with 1. Rows that have duplicate values receive the same ranking, and a gap appears in the sequence after each tie (for example, two rows tied for 2nd place are followed by 4th place, with no 3rd place).
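The tie-handling rules above can be checked on a handful of rows. This is a small sketch that assumes Python's sqlite3 module (SQLite 3.25 or later) rather than Oracle; the cust_totals table is a hypothetical helper holding pre-aggregated totals, with one three-way tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of already-aggregated customer totals.
conn.execute("CREATE TABLE cust_totals (cust_nbr INT, cust_sales INT)")
conn.executemany(
    "INSERT INTO cust_totals VALUES (?, ?)",
    [(18, 1253840), (2, 1224992), (23, 1224992), (24, 1224992), (30, 1216858)],
)

# All three functions rank the same descending ordering of cust_sales.
tie_rows = conn.execute(
    """
    SELECT cust_sales,
           RANK()       OVER (ORDER BY cust_sales DESC) AS sales_rank,
           DENSE_RANK() OVER (ORDER BY cust_sales DESC) AS sales_dense_rank,
           ROW_NUMBER() OVER (ORDER BY cust_sales DESC) AS sales_number
      FROM cust_totals
     ORDER BY sales_number
    """
).fetchall()

for row in tie_rows:
    print(row)
```

The three tied rows share rank 2 under both RANK and DENSE_RANK; the row after them is ranked 5 by RANK (a gap of two) and 3 by DENSE_RANK (no gap), while ROW_NUMBER simply counts 1 through 5.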
 
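Example: total each customer's sales by region and customer, and rank each customer across the entire result set:

    SELECT region_id, cust_nbr,
           SUM(tot_sales) cust_sales,
           RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank,
           DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
           ROW_NUMBER( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_number
      FROM orders
     WHERE year = 2001
     GROUP BY region_id, cust_nbr
     ORDER BY 6; -- sort by the sixth column of the result set

The result (excerpt):

     REGION_ID   CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER
    ---------- ---------- ---------- ---------- ---------------- ------------
             8         18    1253840         11               11           11
             5          2    1224992         12               12           12
             9         23    1224992         12               12           13
             9         24    1224992         12               12           14
            10         30    1216858         15               13           15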
 
Don't be confused by the ORDER BY clause at the end of the query and the ORDER BY clauses within each function call; the functions use their ORDER BY clause internally to sort the results for the purpose of applying a ranking. Thus, each of the three functions applies its ranking algorithm to the sum of each customer's sales in descending order. The final ORDER BY clause specifies the results of the ROW_NUMBER function as the sort key for the final result set, but we could have picked any of the six columns as our sort key.
 
Deciding which of the three functions to use depends on the desired outcome. If we want to identify the top 13 customers from this result set, we would use:

.ROW_NUMBER if we want exactly 13 rows without regard to ties. In this case, the tie is cut arbitrarily: of the three customers tied at 1224992, only two make the list, and one who might otherwise be included is excluded from the final set.

.RANK if we want at least 13 rows but don't want to include rows that would have been excluded had there been no ties. In this case, we would retrieve 14 rows.

.DENSE_RANK if we want all customers with a ranking of 13 or less, including all duplicates. In this case, we would retrieve 15 rows.
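The effect of the three choices on a top-N cutoff can be reproduced at a smaller scale. The sketch below assumes Python's sqlite3 module (SQLite 3.25 or later) in place of Oracle; cust_totals and top_n_count are hypothetical names introduced for the illustration, and the cutoff is 3 rather than 13.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated totals: one leader, a three-way tie, one trailer.
conn.execute("CREATE TABLE cust_totals (cust_nbr INT, cust_sales INT)")
conn.executemany(
    "INSERT INTO cust_totals VALUES (?, ?)",
    [(18, 1253840), (2, 1224992), (23, 1224992), (24, 1224992), (30, 1216858)],
)

def top_n_count(rank_fn, n):
    """Count the rows whose rank_fn value is <= n."""
    if rank_fn not in ("ROW_NUMBER", "RANK", "DENSE_RANK"):
        raise ValueError(rank_fn)  # only known function names reach the SQL text
    sql = f"""
        SELECT COUNT(*)
          FROM (SELECT {rank_fn}() OVER (ORDER BY cust_sales DESC) AS rnk
                  FROM cust_totals)
         WHERE rnk <= ?
    """
    return conn.execute(sql, (n,)).fetchone()[0]

top3_counts = {fn: top_n_count(fn, 3) for fn in ("ROW_NUMBER", "RANK", "DENSE_RANK")}
print(top3_counts)
```

With a cutoff of 3, ROW_NUMBER keeps exactly 3 rows, RANK keeps 4 (the whole tie), and DENSE_RANK keeps all 5, mirroring the 13, 14, and 15 rows discussed above.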

While the previous query generates rankings across the entire result set, it is also possible to generate independent sets of rankings across multiple partitions of the result set. The following query generates rankings for customer sales within each region rather than across all regions. Note the addition of the PARTITION BY clause:
 
Example: total each customer's sales by region and customer, and rank each customer within that customer's own region:

    SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
           RANK( ) OVER (PARTITION BY region_id
                         ORDER BY SUM(tot_sales) DESC) sales_rank,
           DENSE_RANK( ) OVER (PARTITION BY region_id
                               ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
           ROW_NUMBER( ) OVER (PARTITION BY region_id
                               ORDER BY SUM(tot_sales) DESC) sales_number
      FROM orders
     WHERE year = 2001
     GROUP BY region_id, cust_nbr
     ORDER BY 1,6;
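To see the ranking restart in each partition, here is a minimal sketch that assumes Python's sqlite3 module (SQLite 3.25 or later) instead of Oracle, with invented rows for two regions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the "orders" fact table with two regions.
conn.execute(
    "CREATE TABLE orders (region_id INT, cust_nbr INT, year INT, tot_sales INT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (5, 1, 2001, 100),
        (5, 2, 2001, 200), (5, 2, 2001, 100),
        (6, 3, 2001, 200),
        (6, 4, 2001, 150), (6, 4, 2001, 50),  # ties customer 3 at 200
        (6, 5, 2001, 50),
    ],
)

# PARTITION BY restarts the ranking at 1 for every region.
partitioned = conn.execute(
    """
    SELECT region_id, cust_nbr, SUM(tot_sales) AS cust_sales,
           RANK() OVER (PARTITION BY region_id
                        ORDER BY SUM(tot_sales) DESC) AS sales_rank
      FROM orders
     WHERE year = 2001
     GROUP BY region_id, cust_nbr
     ORDER BY region_id, sales_rank, cust_nbr
    """
).fetchall()

for row in partitioned:
    print(row)
```

Each region gets its own rank 1, and the tie in region 6 produces a gap (1, 1, 3) without affecting region 5.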
 
The PARTITION BY clause used in ranking functions divides a result set into pieces so that rankings can be applied within each subset. This is completely different from the PARTITION BY RANGE/HASH/LIST clauses, introduced in Chapter 10, for breaking a table or index into multiple pieces.
 
All ranking functions allow the caller to specify where in the ranking order NULL values should appear. This is accomplished by appending either NULLS FIRST or NULLS LAST after the ORDER BY clause of the function, as in:
 
    SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
           RANK( ) OVER (ORDER BY SUM(tot_sales) DESC NULLS LAST) sales_rank
      FROM orders
     WHERE year = 2001
     GROUP BY region_id, cust_nbr;
 
One of the most common uses of a ranked data set is to identify the top-N or bottom-N performers. Since we can't call analytic functions from the WHERE or HAVING clauses, we are forced to generate the rankings for all the rows and then use an outer query to filter out the unwanted rankings. For example, the following query uses an inline view to identify the top-5 salespersons for 2001:

    SELECT s.name, sp.sp_sales total_sales
      FROM salesperson s,
        (SELECT salesperson_id, SUM(tot_sales) sp_sales,
                RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank
           FROM orders
          WHERE year = 2001
          GROUP BY salesperson_id) sp
     WHERE sp.sales_rank <= 5
       AND sp.salesperson_id = s.salesperson_id
     ORDER BY sp.sales_rank;
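The inline-view pattern can be exercised end to end on a few rows. The sketch below assumes Python's sqlite3 module (SQLite 3.25 or later) as a stand-in for Oracle; the salesperson names and sales figures are invented, and the cutoff is 2 instead of 5 to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature versions of the salesperson and orders tables.
conn.executescript(
    """
    CREATE TABLE salesperson (salesperson_id INT, name TEXT);
    CREATE TABLE orders (salesperson_id INT, year INT, tot_sales INT);
    INSERT INTO salesperson VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol'), (4, 'Dave');
    INSERT INTO orders VALUES
        (1, 2001, 300), (1, 2001, 200),
        (2, 2001, 300),
        (3, 2001, 100), (3, 2001, 200),
        (4, 2001, 100),
        (4, 2000, 900);
    """
)

# The inline view ranks every salesperson; the outer query filters on the rank.
top2 = conn.execute(
    """
    SELECT s.name, sp.sp_sales
      FROM salesperson s
      JOIN (SELECT salesperson_id, SUM(tot_sales) AS sp_sales,
                   RANK() OVER (ORDER BY SUM(tot_sales) DESC) AS sales_rank
              FROM orders
             WHERE year = 2001
             GROUP BY salesperson_id) sp
        ON sp.salesperson_id = s.salesperson_id
     WHERE sp.sales_rank <= 2
     ORDER BY sp.sales_rank, s.name
    """
).fetchall()

print(top2)
```

Because RANK is used, the two salespeople tied for second place are both returned, so a "top 2" filter yields three rows here.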
 
While there is no function for returning only the top-N or bottom-N rows from a ranked result set, Oracle provides functionality for identifying the first (top 1) or last (bottom 1) records in a ranked set. This is useful for queries such as the following: "Find the regions with the best and worst total sales last year." Unlike the top-5 salespeople example from the previous section, this query needs an additional piece of information, the size of the result set, in order to answer the question. The FIRST/LAST syntax used for this purpose has three parts:
 
    .An ORDER BY clause that specifies how to rank the result set.
 
    .The keywords FIRST and LAST to specify whether to use the top or bottom-ranked row.
 
    .An aggregate function (e.g., MIN, MAX, AVG, COUNT) used as a tiebreaker in case more than one row of the result set ties for the FIRST or LAST spot in the ranking.
 
The following query uses the MIN aggregate function to find the regions that rank FIRST and LAST by total sales: 

    SELECT MIN(region_id)
             KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) best_region,
           MIN(region_id)
             KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) worst_region
      FROM orders
     WHERE year = 2001
     GROUP BY region_id;
 
The use of the MIN function in the previous query is a bit confusing: it is only used if more than one region ties for either first or last place in the ranking. If there were a tie, the row with the minimum value for region_id would be chosen.
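
The role of MIN as a tiebreaker may be easier to see outside SQL. The following is a rough Python emulation of what the query computes, not Oracle's implementation: keep_first_last and the per-region totals are hypothetical, and the totals are assumed to have been summed per region already.

```python
def keep_first_last(region_totals):
    """Emulate MIN(region_id) KEEP (DENSE_RANK FIRST/LAST ORDER BY total DESC)."""
    best_total = max(region_totals.values())   # FIRST in descending order
    worst_total = min(region_totals.values())  # LAST in descending order
    # MIN is applied only to the rows that tie for the first (or last) spot.
    best_region = min(r for r, t in region_totals.items() if t == best_total)
    worst_region = min(r for r, t in region_totals.items() if t == worst_total)
    return best_region, worst_region


# Hypothetical totals: regions 5 and 7 tie for the best total.
sample_totals = {5: 900, 6: 700, 7: 900, 8: 400}
best_worst = keep_first_last(sample_totals)
print(best_worst)
```

Regions 5 and 7 tie for first place, so MIN selects region 5; with MAX in its place, region 7 would be reported instead.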