8- Advanced Analytic SQL

pengpenglin

Article link: https://blog.csdn.net/pengpenglin/article/details/1743209

1. Analytic SQL Overview

The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Queries that aggregate, compare, and rank data are staples of DSS, and are used by managers, analysts, marketing executives, etc. to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile ground for researching and formulating business decisions.

For this and other examples in this chapter, we use a simple star schema consisting of a single fact table (called "orders") containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. The first example finds the customers whose total sales last year exceeded 20% of the total sales for their region. There are two main facets to this query, each requiring a different level of aggregation of the same data:

1. Sum all sales per region last year.
2. Sum all sales per customer last year.

Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer. We can then call an analytic function that performs a second level of aggregation to generate total sales per region:

SELECT o.region_id region_id, o.cust_nbr cust_nbr,
  SUM(o.tot_sales) tot_sales,
  SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id, o.cust_nbr;

 REGION_ID   CUST_NBR  TOT_SALES REGION_SALES
---------- ---------- ---------- ------------
         5          1    1151162      6584167
         5          2    1223518      6584167
         5          3    1161286      6584167
         5          4    1878275      6584167

The aggregate function (SUM(o.tot_sales)) in line 2 generates the total sales per customer and region as directed by the GROUP BY clause, and the analytic function in line 3 aggregates these sums for each region, thereby computing the aggregate sales per region. The value for the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region.
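The two levels of aggregation can be tried out without an Oracle instance. The following is only a minimal sketch: it runs the same query shape through Python's built-in sqlite3 module (assuming a bundled SQLite of version 3.25 or later, which supports window functions) against a handful of made-up rows. The table contents are hypothetical and are not the data behind the output shown above.

```python
import sqlite3

# Hypothetical sample rows: (region_id, cust_nbr, tot_sales).
# Customer 1 has two order rows, so GROUP BY has something to aggregate.
rows = [
    (5, 1, 100), (5, 1, 50),
    (5, 2, 250),
    (6, 3, 300),
    (6, 4, 200),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (region_id INTEGER, cust_nbr INTEGER, tot_sales INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Inner SUM: per-customer total produced by GROUP BY.
# Outer SUM ... OVER: adds those per-customer totals up within each region.
query = """
SELECT region_id, cust_nbr,
       SUM(tot_sales) AS tot_sales,
       SUM(SUM(tot_sales)) OVER (PARTITION BY region_id) AS region_sales
FROM orders
GROUP BY region_id, cust_nbr
ORDER BY region_id, cust_nbr
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Every row of a region carries the same region_sales value, which is what makes the later percentage comparison possible in a single pass.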

This query can then be wrapped in an inline view and joined to the region and customer tables to list the customers whose sales exceed 20% of their region's total:

SELECT c.name cust_name, cust_sales.tot_sales cust_sales, r.name region_name,
  100 * ROUND(cust_sales.tot_sales / cust_sales.region_sales, 2) percent_of_region
FROM region r, customer c,
  (SELECT o.region_id region_id, o.cust_nbr cust_nbr,
     SUM(o.tot_sales) tot_sales,
     SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
   FROM orders o
   WHERE o.year = 2001
   GROUP BY o.region_id, o.cust_nbr) cust_sales
WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2)
  AND cust_sales.region_id = r.region_id
  AND cust_sales.cust_nbr = c.cust_nbr;

CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION
---------------------- ---------- -------------------- -----------------
Flowtech Inc.             1878275 New England                         29
Spartan Industries        1788836 Mid-Atlantic                        28
Madden Industries         1929774 SouthEast US                        28
Evans Supply Corp.        1944281 SouthWest US                        28

Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle's suite of analytic functions can be used only in the SELECT list of a query (or in its ORDER BY clause). This is because analytic functions are executed only after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query's ORDER BY clause is evaluated to sort the final result set, so the ORDER BY clause is allowed to reference columns in the SELECT list generated via analytic functions.

2. Ranking Functions

Beginning with Oracle8i, developers can take advantage of several new functions either to generate rankings for each row in a result set or to group rows into buckets for percentile calculations.

The RANK, DENSE_RANK, and ROW_NUMBER functions generate an integer value from 1 to N for each row, where N is less than or equal to the number of rows in the result set. The differences in the values returned by these functions revolve around how each one handles ties:

- ROW_NUMBER returns a unique number for each row starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.

- DENSE_RANK assigns a number to each row starting with 1; rows that have duplicate values receive the same ranking, and no gaps appear in the sequence.

- RANK assigns a number to each row starting with 1; rows that have duplicate values receive the same ranking, and a gap appears in the sequence after each set of duplicates (for example, two rows tied at rank 2 are followed by rank 4, with no rank 3).
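The tie-handling rules above can be checked on a tiny data set. This is a sketch only, run through Python's sqlite3 module (assuming SQLite 3.25 or later) rather than Oracle; the sales table and its four values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (cust_nbr INTEGER, cust_sales INTEGER)")
# Hypothetical totals with one two-way tie at 90.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 100), (2, 90), (3, 90), (4, 80)],
)

ranked = conn.execute("""
    SELECT cust_sales,
           RANK()       OVER (ORDER BY cust_sales DESC) AS sales_rank,
           DENSE_RANK() OVER (ORDER BY cust_sales DESC) AS sales_dense_rank,
           ROW_NUMBER() OVER (ORDER BY cust_sales DESC) AS sales_number
    FROM sales
    ORDER BY sales_number
""").fetchall()

for row in ranked:
    print(row)
# (100, 1, 1, 1)
# (90, 2, 2, 2)
# (90, 2, 2, 3)
# (80, 4, 3, 4)
```

Only the sales value is selected, not the customer number, because which of the two tied customers receives ROW_NUMBER 2 and which receives 3 is arbitrary.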

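Example: compute each customer's total sales, grouped by region and customer, together with the customer's ranking across the entire result set:

SELECT region_id, cust_nbr,
  SUM(tot_sales) cust_sales,
  RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank,
  DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
  ROW_NUMBER( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_number
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY 6; -- sort by the sixth column of the result set (sales_number)

The results are as follows:

 REGION_ID   CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER
---------- ---------- ---------- ---------- ---------------- ------------
…
         8         18    1253840         11               11           11
         5          2    1224992         12               12           12
         9         23    1224992         12               12           13
         9         24    1224992         12               12           14
        10         30    1216858         15               13           15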

Don't be confused by the ORDER BY clause at the end of the query and the ORDER BY clauses within each function call; the functions use their ORDER BY clause internally to sort the results for the purpose of applying a ranking. Thus, each of the three functions applies its ranking algorithm to the sum of each customer's sales in descending order. The final ORDER BY clause specifies the results of the ROW_NUMBER function as the sort key for the final result set, but we could have picked any of the six columns as our sort key.

Deciding which of the three functions to use depends on the desired outcome. If we want to identify the top 13 customers from this result set, we would use:

- ROW_NUMBER if we want exactly 13 rows without regard to ties. In this case, one of the customers who might otherwise be included in the list will be excluded from the final set.

- RANK if we want at least 13 rows but don't want to include rows that would have been excluded had there been no ties. In this case, we would retrieve 14 rows.

- DENSE_RANK if we want all customers with a ranking of 13 or less, including all duplicates. In this case, we would retrieve 15 rows.
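The row counts quoted above (13, 14, and 15) follow directly from the tie pattern in the result set: eleven distinct totals, a three-way tie, then one more row. A small sketch reproduces that pattern with invented totals and counts how many rows each function lets through a "ranking <= 13" filter. It uses Python's sqlite3 module (assuming SQLite 3.25 or later); the table name cust_totals and the helper top_n_count are hypothetical.

```python
import sqlite3

# Hypothetical totals: 11 distinct values, a three-way tie at 40, then 30.
totals = [150, 140, 130, 120, 110, 100, 90, 80, 70, 60, 50, 40, 40, 40, 30]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust_totals (cust_nbr INTEGER, cust_sales INTEGER)")
conn.executemany(
    "INSERT INTO cust_totals VALUES (?, ?)", list(enumerate(totals, start=1))
)

def top_n_count(func, n=13):
    """Count the rows whose ranking under `func` is at most n."""
    # `func` comes from the fixed tuple below, so formatting it into SQL is safe.
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT {func}() OVER (ORDER BY cust_sales DESC) AS rnk
            FROM cust_totals
        ) WHERE rnk <= ?
    """
    return conn.execute(sql, (n,)).fetchone()[0]

counts = {f: top_n_count(f) for f in ("ROW_NUMBER", "RANK", "DENSE_RANK")}
print(counts)  # {'ROW_NUMBER': 13, 'RANK': 14, 'DENSE_RANK': 15}
```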

While the previous query generates rankings across the entire result set, it is also possible to generate independent sets of rankings across multiple partitions of the result set. The following query generates rankings for customer sales within each region rather than across all regions. Note the addition of the PARTITION BY clause:

Example: compute each customer's total sales by region and customer, together with the customer's ranking within their own region:

SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
  RANK( ) OVER (PARTITION BY region_id
    ORDER BY SUM(tot_sales) DESC) sales_rank,
  DENSE_RANK( ) OVER (PARTITION BY region_id
    ORDER BY SUM(tot_sales) DESC) sales_dense_rank,
  ROW_NUMBER( ) OVER (PARTITION BY region_id
    ORDER BY SUM(tot_sales) DESC) sales_number
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr
ORDER BY 1,6;

In ranking functions, the PARTITION BY clause divides a result set into pieces so that rankings can be applied within each subset. This is completely different from the PARTITION BY RANGE/HASH/LIST clauses introduced in Chapter 10 for breaking a table or index into multiple pieces.
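To see the ranking restart within each partition, here is a minimal sketch using Python's sqlite3 module (assuming SQLite 3.25 or later) in place of Oracle. The rows are hypothetical; the query keeps only RANK for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (region_id INTEGER, cust_nbr INTEGER, tot_sales INTEGER)"
)
# Hypothetical rows: two regions with two customers each.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(5, 1, 300), (5, 2, 100), (5, 2, 150), (6, 3, 400), (6, 4, 500)],
)

per_region = conn.execute("""
    SELECT region_id, cust_nbr,
           SUM(tot_sales) AS cust_sales,
           RANK() OVER (PARTITION BY region_id
                        ORDER BY SUM(tot_sales) DESC) AS sales_rank
    FROM orders
    GROUP BY region_id, cust_nbr
    ORDER BY region_id, cust_sales DESC
""").fetchall()

for row in per_region:
    print(row)
# (5, 1, 300, 1)
# (5, 2, 250, 2)
# (6, 4, 500, 1)
# (6, 3, 400, 2)
```

Customer 4 has the highest total overall, yet customer 1 also receives rank 1, because each region is ranked independently.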

All ranking functions allow the caller to specify where in the ranking order NULL values should appear. This is accomplished by appending either NULLS FIRST or NULLS LAST after the ORDER BY clause of the function, as in:

SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
  RANK( ) OVER (ORDER BY SUM(tot_sales) DESC NULLS LAST) sales_rank
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr;
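The effect of the two keywords is easiest to see side by side. In Oracle, NULLs sort as the largest values, so a descending ranking places them first unless NULLS LAST is given. The sketch below spells out both options explicitly so that the result does not depend on any engine's default; it uses Python's sqlite3 module and assumes SQLite 3.30 or later, the first release to accept NULLS FIRST and NULLS LAST. The sales table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (cust_nbr INTEGER, cust_sales INTEGER)")
# Hypothetical rows: customer 2 has no sales figure at all.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(1, 100), (2, None), (3, 50)]
)

null_ranks = conn.execute("""
    SELECT cust_nbr,
           RANK() OVER (ORDER BY cust_sales DESC NULLS LAST)  AS rank_nulls_last,
           RANK() OVER (ORDER BY cust_sales DESC NULLS FIRST) AS rank_nulls_first
    FROM sales
    ORDER BY cust_nbr
""").fetchall()

for row in null_ranks:
    print(row)
# (1, 1, 2)
# (2, 3, 1)
# (3, 2, 3)
```

With NULLS LAST the customer without a figure drops to the bottom of the ranking; with NULLS FIRST that customer takes rank 1 and pushes everyone else down.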

One of the most common uses of a ranked data set is to identify the top-N or bottom-N performers. Since we can't call analytic functions from the WHERE or HAVING clauses, we are forced to generate the rankings for all the rows and then use an outer query to filter out the unwanted rankings. For example, the following query uses an inline view to identify the top-5 salespersons for 2001:

SELECT s.name, sp.sp_sales total_sales
FROM salesperson s,
  (SELECT salesperson_id, SUM(tot_sales) sp_sales,
     RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank
   FROM orders
   WHERE year = 2001
   GROUP BY salesperson_id) sp
WHERE sp.sales_rank <= 5
  AND sp.salesperson_id = s.salesperson_id
ORDER BY sp.sales_rank;
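The inline-view pattern can be exercised on a toy data set. This is a sketch under stated assumptions: it runs in SQLite through Python's sqlite3 module (version 3.25 or later assumed), the three salespeople and their orders are invented, the join is written in ANSI style, and the cutoff is 2 rather than 5 to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesperson (salesperson_id INTEGER, name TEXT);
    CREATE TABLE orders (salesperson_id INTEGER, tot_sales INTEGER);
    -- Hypothetical data: Ann totals 300, Bob 500, Cy 50.
    INSERT INTO salesperson VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 100), (1, 200), (2, 500), (3, 50);
""")

# The ranking is computed for every salesperson in the inline view;
# the outer WHERE clause can then filter on it like any other column.
top_two = conn.execute("""
    SELECT s.name, sp.sp_sales AS total_sales
    FROM salesperson s
    JOIN (SELECT salesperson_id,
                 SUM(tot_sales) AS sp_sales,
                 RANK() OVER (ORDER BY SUM(tot_sales) DESC) AS sales_rank
          FROM orders
          GROUP BY salesperson_id) sp
      ON sp.salesperson_id = s.salesperson_id
    WHERE sp.sales_rank <= 2
    ORDER BY sp.sales_rank
""").fetchall()

print(top_two)  # [('Bob', 500), ('Ann', 300)]
```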

While there is no function for returning only the top or bottom-N from a ranked result set, Oracle provides functionality for identifying the first (top 1) or last (bottom 1) records in a ranked set. This is useful for queries such as the following: "Find the regions with the best and worst total sales last year." Unlike the top-5 salespeople example from the previous section, this query needs an additional piece of information (the size of the result set) in order to answer the question. A call to the FIRST or LAST function is made up of three parts:

- An ORDER BY clause that specifies how to rank the result set.

- The keywords FIRST and LAST to specify whether to use the top or bottom-ranked row.

- An aggregate function (e.g., MIN, MAX, AVG, COUNT) used as a tiebreaker in case more than one row of the result set ties for the FIRST or LAST spot in the ranking.

The following query uses the MIN aggregate function to find the regions that rank FIRST and LAST by total sales:

SELECT
  MIN(region_id) KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) best_region,
  MIN(region_id) KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) worst_region
FROM orders
WHERE year = 2001
GROUP BY region_id;

The use of the MIN function in the previous query is a bit confusing: it is only used if more than one region ties for either first or last place in the ranking. If there were a tie, the row with the minimum value for region_id would be chosen.
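The tiebreaker role of MIN can be illustrated with an emulation. The KEEP (DENSE_RANK FIRST/LAST ...) syntax is specific to Oracle, so the sketch below does not use it; it reproduces the described behavior with plain subqueries in SQLite through Python's sqlite3 module. The data is hypothetical and deliberately contains a tie for both first and last place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region_id INTEGER, tot_sales INTEGER)")
# Hypothetical rows: regions 5 and 6 tie for best (300 each),
# regions 7 and 8 tie for worst (50 each).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(5, 100), (5, 200), (6, 300), (7, 50), (8, 50)],
)

# Emulation, not Oracle syntax: find the regions at the top (or bottom)
# total, then let MIN(region_id) pick one of them when several tie.
best_region, worst_region = conn.execute("""
    WITH region_totals AS (
        SELECT region_id, SUM(tot_sales) AS total
        FROM orders
        GROUP BY region_id
    )
    SELECT
        (SELECT MIN(region_id) FROM region_totals
          WHERE total = (SELECT MAX(total) FROM region_totals)),
        (SELECT MIN(region_id) FROM region_totals
          WHERE total = (SELECT MIN(total) FROM region_totals))
""").fetchone()

print(best_region, worst_region)  # 5 7
```

Swapping MIN(region_id) for MAX(region_id) in either subquery would return the other tied region (6 or 8), which is the only situation in which the choice of aggregate changes the answer.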