Related to Oracle 关于ORACLE中的分析函数与窗口函数

Oracle Analytic Functions: An Introduction 
//建议读此文章时打开任何一个SQL工具并连接到测试数据库,使用下边提示中的脚本建立测试表及数据,我就是边执行示例SQL边看文章,感觉读完这篇文章对窗口函数以及分析函数有了更透彻的了解
Oracle 分析函数:介绍



Note: A script to create and populate the sample schema can be found here

提示:点击 这里 可以找到建表与初始化下边SQL示例使用数据的脚本。 网站可能需要翻墙。。

Introduction:

介绍:

SQL is a compact yet versatile language - a surprisingly large variety of questions can be answered using standard SQL. However, certain classes of questions are extremely hard to answer using standard SQL. For example consider the following:

SQL 是一种已经成型并且通用的语言 - 使用标准的SQL可以解答相当多并且各式各样的问题。然而,有些类型的问题使用标准SQL来解答是相当困难的。例如试想下边的情况:

  • Ranking data within subsets of a data set.
  • 对数据集下的每个子集数据进行排行操作。
  • Aggregating data within subsets of a data set.
  • 对数据集下的每个子集数据使用聚合操作。 
  • Performing aggregations over moving windows.
  • 在一个移动的窗口上应用聚集操作。 // 窗口可以理解为一个范围 比如,对当前行,以及它的前一行,后一行应用聚集操作。
  • Displaying and comparing aggregates to individual entries within a single query.
  • 使用单个的查询比较并显示特定的数据行
  • Comparing two or more rows within a given data set.
  • 对给定数据集中两行或多行数据进行比较

Typically one would have to tackle such calculations using complicated (and hence poorly performing) SQL or via procedural methods. In version 8i Oracle introduced analytic functions - extensions to standard SQL - which are designed to tackle such problems . The present article introduces analytic functions, and shows how they can be used to answer such questions. A schema creation script to accompany the discussion can be found here. You may want to create and populate the tables in a test schema so that you can run and explore variations of the SQL discussed in this article.

大多数人通常会使用一个复杂(并且不高效)的SQL或者通过存储过程来解决这些情况。 在Oracle 8i中就引进了分析函数,作为对标准SQL的扩展,用来解决以上难题。这篇文章介绍了分析函数,并且展示如何使用分析函数解决上边这些问题。用于此篇文章的建表的脚本可以在点击这里查看。你可以建立包含测试数据的临时数据库表用来运行,测试这篇文行所讨论的SQL。

Ranking functions:

排名函数:

These functions rank records within a set or subset of data according to specified criteria. Their utility is best explained by working through a set of examples. Consider the following: we want to rank divisions and regions by sales for 2004. The traditional SQL solution to the problem would be as follows:

这类函数都可以根据特定的排名规则对数据集或其子集进行排名操作。通过一组示例可以很好的解释这些函数的应用。试想下边的情况:我们想对各大区及分部根据2004年的销售额进行排名。使用传统的SQL解决方法如下:

--query (1) - region / division ranked by sales for 2004

select
  region_name,
  division_name,
  sum_amount,
  rownum
from
  (select
    t1.region_name,
    t2.division_name,
    sum(t3.amount) sum_amount
  from
    region t1,
    division t2,
    sales t3
  where
    t1.region_id=t3.region_id
  and
    t2.division_id=t3.division_id
  and
    t3.year=2004
  group by
    t1.region_name,
    t2.division_name
  order by 3 desc)

(这里存在一个问题就是,如果出现销售额相同的情况,因为使用rownum作为排序名次,所以不会有并列的情况出现,下边使用了rank()函数的则不同)


A subquery is required because Oracle assigns row numbers before the ordering is done. The inner query does the ordering and the outer query does the ranking using rownum. Now compare the above query to the following, much more compact, one which uses an analytical ranking function:

上边语句中使用了一个子查询,因为Oracle在进行排序之前就对数据行分配了rownum,子查询负责了排序,外层的查询使用rownum进行了排名。现在比较一下上面和下面这条使用了分析排名函数的更为精简的查询语句。

--query (2) - region / division ranked by sales for 2004

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
   rank() over (order by sum(t3.amount) desc) sales_rank
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name


As you'll probably agree, query (2) is far more compact than query (1).
你很可能会赞同,query(2)比query(1)要简单明了的多。
Now to deconstruct the unfamiliar bits in query (2):

现在对query(2)中我们不熟悉的部分进行说明。

  1. As its name suggests, rank() is the ranking function. The over .. syntax associated with rank() indicates that it is an analytic function. This syntax is common to all analytic functions as we'll see in the course of this discussion.
    正如其名,rank()是用来进行排名的函数。联合rank()函数一起使用over的这种语法表明它是一个分析函数。在我们接下来的讨论中会看到,over这样的句法在分析函数的语句中是一个共同点。
  2. The order by sum(t3.amount) desc tells Oracle that the data is to be ranked by total sales for the year.
    order by sum(t3.amount) desc这一段告诉Oracle数据按照年份销售总额进行排名。

So far so good. However, ranking functions really come into their own when there's a need to rank data within subgroups of a result set. Here's the SQL to rank sales by division within each region:

到目前为止,一切进行顺利。然而,当有对数据集的子集进行排名的需求时,排名函数才真正显示出它的威力。下面是对每个单独的大区内部门按销售额进行排名的SQL。

--query (3) - division ranked by 2004 sales for each region

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
  rank() over ( partition by t1.region_name order by sum(t3.amount) desc) sales_rank_by_region
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name

The only difference between query (2) and (3) is the partition by clause, which tells Oracle that the data is to be ranked within each region. In general, the partition by clause defines subgroups within which the analytic operation (aggregation, ranking etc.) is to be performed.

query(2)和query(3)的唯一区别是partition by 条件,partition by条件用来告诉Oracle数据需要按单独的每个大区进行排名。一般来说,partition by 条件定义了分析操作执行中划分的子集的条件。


Other functions with usage similar to rank() are: dense_rank(), percent_rank() cume_dist() and ntile(). Check the Oracle documentation for details on these. The relevant book is the Oracle Database Data Warehousing Guide, available from the Oracle documentation site.

其他一些类似rank()的函数:dense_rank(), percent_rank() cume_dist() and ntile()。在Oracle文档中可以找到关于这些函数更详细的说明。相关书籍有Oracle Database Data Warehousing Guide,在Oracle documentation site可以找到。

Windowing aggregate functions:

窗口聚集函数:

These functions can be used to perform aggregate operations - sums, counts, averages etc - on subgroups of data. Further, these aggregates can be performed over user specified windows or ranges. Again, the syntax is best illustrated through some examples, so here we go. The SQL below returns the monthly and year-to-date sales for 2004 by region and division:

这类函数可以用来对数据集下的所有子集分别执行聚集操作-求和,计数, 求平均值等等。这些聚集函数可以在用户定义的窗口或者范围内执行。我们还是要通过一些例子来说明语法的使用,我们开始。下边的SQL返回了每个大区下各个部门2004年度月销售额以及当前月累计销售总额。

--query (4) - YTD sales by region and division

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount month_sales,
   sum(t3.amount) over (partition by t1.region_name, t2.division_name
    order by t3.month rows unbounded preceding) YTD_sales

from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

The syntax, as you'll notice, is very similar to that used for ranking functions. The various analytic clauses are explained below:

你会注意到,语法与排名函数十分相似,下面会解释一下各个分析条件(指的partition by,order by ,rows unbounded preceding)的作用:

  1. The partition by clause defines subgroups within which the aggregation is to be done. In the above example, the sum is to be calculated for each region/division combination.
    partition by 条件划分了需要应用聚集函数的子集,在上边的例子中,SUM用来计算每个 大区/部门组成的一个组合 的销售总额。
  2. The order by defines the order in which data is to be cumulated within each subgroup. In the above, data is to be summed year-to-date (also see next item).
    order by 定义了每个子集中的排序规则,在上边数据按照累计至当前日期的销售额进行求和计算。
  3. The new element, rows unbounded preceding, is a windowing clause. The windowing clause defines the aggregation window -i.e. how many rows are to be included in the aggregation. In the above example, rows unbounded preceding tells Oracle that the data is to be aggregated from the start of the subgroup to the current row. As it happens this is the default, so we could have omitted the rows unbounded preceding clause.
    一个新的元素, rows unbounded preceding, 是一个划分窗口(开窗)的条件,这个开窗条件定义了一个执行聚集函数的窗口,例如,有多少行数据被包含在聚集函数中。在上边的例子,rows unbounded preceding 告诉Oracle 在数据第一行与当前行这个范围内执行聚集函数。其实在数据第一行与当前行这个范围内执行聚集函数这个条件是默认的,所以我们可以省略rows unbounded preceding。

Moving on, the next statement calculates a rolling three monthly sum based on the current and previous two months:

我们继续,下一条语句自上而下的计算了当前月与前两个月的销售额。

--query (5) - 3 month rolling sum

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount month_sales,
   sum(t3.amount) over (partition by t1.region_name, t2.division_name
    order by t3.month rows 2 preceding)
three_mth_sum
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

In contrast, the following SQL calculates a centered three month rolling sum:

作为对照,下边的SQL自上而下的计算了当前月与前一月和后一月三个月的销售额

--query (6) - centered 3 month rolling sum

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount month_sales,
  sum(t3.amount) over (partition by t1.region_name, t2.division_name
    order by t3.month rows between 1 preceding and 1 following) ctrd_three_mth_sum
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

The windowing clause has other variations based on ranges instead of number of rows. Further, it is possible to vary the window size for each row. Check the Oracle documentation for details.

开窗条件还有一些其他的基于范围的条件来代替我们定义的行数范围。甚至它可以改变每行数据窗口的大小。可以在Oracle documentation中查询详细内容。

With the above syntax one can write single SQL statements to answer more complicated questions such as quarterly sales for 2004 by region and division. Here's the SQL:

有了上边的语法,我们可以写一条单独的SQL语句来回答像“计算2004年各个大区及分公司季度销售额”这样复杂的问题,下边是SQL:

--query (7) - quarterly sales for 2004 by region and division

select
  region_name,
  division_name,
  month/3 quarter,
  sum_3_months quarterly_sales
from
  (select
    t1.region_name,
    t2.division_name,
    t3.month,
    sum(t3.amount) over (partition by t1.region_name, t2.division_name
      order by t3.month rows 2 preceding) sum_3_months
  from
    region t1,
    division t2,
    sales t3
  where
    t1.region_id=t3.region_id
  and
    t2.division_id=t3.division_id
  and
    t3.year=2004)
where
  month in (3,6,9,12)

Here we used analytical SQL to calculate a three monthly sum in the subquery, and then picked out the quarter end months in the outer query.

这里我们使用了分析函数SQL在子查询中自上而下的计算了当前月与前两个月总共3个月销售额的总和,然后在外层查询中挑出月份为季度时间点的数据。如3月 6月 9月 12月。

We have dealt exclusively with sum in this discussion. There are quite a few other aggregating functions. Some commonly used ones include: avg, max, min and count. Check the Oracle documentation for more details.

我们在这次讨论中进行的处理只使用了SUM函数,除此之外还有一些其他常用的聚集函数,包括: avg, max, min and count。可以在Oracle documentation中查询详细内容。



Reporting aggregate functions:

报告聚集函数:

Reporting aggregate functions return an aggregated value for the entire partition. This value, which is the same for each row in the partition, is returned with each row in the partition. Available functions include: sum, avg, max, min and count. The essential difference between reporting and windowing aggregations is the absence of a windowing clause (order by) in the former. Here's an example that calculates the maximum monthly sales by region and division:

报告聚集函数返回针对整个分组的聚集值。这个值在分组中的每个数据行中都是相同的,可用的函数包括sum, avg, max, min and count。报告聚集函数和窗口聚集函数本质的区别在于报告聚集函数没有 order by 条件。这里是一个按大区和部门分组计算最大月份销售额的例子。

--query (8) - maximum monthly sales for 2004 by region and division (all months)

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount mthly_sales,
   max(t3.amount) over (partition by t1.region_name, t2.division_name)
    max_mthly_sales
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

This returns the maximum sales by region and division with every row - the same value is returned with each row. We can now use the above query to get just the maximum sales months like so:

这个语句执行在每行数据后显示了按大区和部门进行分组月销售额最大的值-每行数据显示的月销售额最大值是相同的。我们可以像这样使用上边的语句得到销售额最大的月份。

--query (9) - maximum monthly sales for 2004 by region / division (max mths only)

select   region_name,
  division_name,
  month max_sales_month,
  max_mthly_sales
from
  (select
    t1.region_name,
    t2.division_name,
    t3.month,
    t3.amount mthly_sales,
    max(t3.amount) over (partition by t1.region_name, t2.division_name)
      max_mthly_sales
  from
    region t1,
    division t2,
    sales t3
  where
    t1.region_id=t3.region_id
  and
    t2.division_id=t3.division_id
  and
    t3.year=2004)
where
    mthly_sales=max_mthly_sales

Here analytical SQL calculates the maximum in the subquery (same as query (8)), and we pick out the rows for which the maximum equals monthly sales in the outer query.

这里分析SQL计算出了partition by划分出的每个子集中的月销售额的最大值(同query(8)一样),我们在外层查询中又挑出了月销售额与最大月销售额相等的那些数据。

Oracle also offers the ratio_to_report function, which calculates ratios based on partition-wide aggregates. Here's an example that reports the ratio of sales by region and division to total regional sales:

Oracle 还提供了ratio_to_report 函数,用来计算子集范围内的比率,这是一个计算各个部门销售额占部门所在大区销售额的比率。

--query (10) - sales by region/div to total region sales

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
   sum(sum(t3.amount)) over (partition by t1.region_name) region_total,
   ratio_to_report(sum(t3.amount)) over (partition by t1.region_name) region_ratio
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name

A couple of points to note in query (10):

  1. Here the ratio_to_report computes the ratio of region/division sum to the regional sum. the partition by clause determines the denominator of the ratio.
    这里ratio_to_report 计算了部门年度销售总额相对所在大区销售总额的比率。 partition by 条件决定了,计算比率的分母。
  2. Although not required for calculating the ratio, I have also shown how the regional sum can be obtained via summing the region/division sum - this is also an example of a reporting aggregate since there is no windowing clause present.
    即使这里不需要计算比率,我也展示了怎么样在计算大区/部门销售额总和的同时得到大区的销售总额。并且这也是报告聚合的例子。

Finally in case you want to treat the entire data set as a one partition for reporting aggregates and ratios:

最后使用报告聚集函数和 ratios比率计算,我们把整个数据集作为一个分组。

--query (11) - sales by region/div to total sales

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
   sum(sum(t3.amount)) over () region_total,
   ratio_to_report(sum(t3.amount)) over () region_ratio
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name

Here we've simply omitted the partition by clause, thereby instructing Oracle to treat the entire data set as a single partition.

这里我们省略了 partition by 条件,从而指示Oracle把整个数据集作为一个单独的分组。


There is, of course, much more to reporting aggregates. Check the documentation for lots more.

当然,还有很多报告聚集函数,可以在Oracle documentation中查询详细内容。


Lag/Lead functions:

Lag/lead 函数:

Comparing data across rows is hard using standard SQL. The lag and lead analytical functions are designed for just this type of problem. Here's an example:

使用标准的SQL来比较数据集中的行与行的信息是很困难的。lag和lead两个分析函数被设计用作解决这类问题。这里是一个例子:

--query (12) - yearly sales by region/division and comparison to previous year

select
  t1.region_name,
  t2.division_name,
  t3.year,
  sum(t3.amount) yearly_sales,
  sum(t3.amount)- lag(sum(t3.amount),1) over (partition by t1.region_name, t2.division_name
    order by t3.year) diff_prev_year
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
group by
  t1.region_name,
  t2.division_name,
  t3.year

Here the lag function compares sales for 2004 to that for 2003. The usage for lead is very similar - see the documentation for some examples.

这里lag函数比较了2004与2003年的销售额,lead的使用也非常简单,可以查询oracle文档中的一些例子。

End Note:

Analytic functions are a very powerful extension to standard SQL. With it we can now answer questions that could previously be done only through procedural methods or via convoluted, poorly performing SQL. If you are a developer writing complex queries, always check to see if analytic functions can be used. I have often been surprised at how easily complex questions can be answered using analytics. If you are a DBA, do yourself a favour and ensure developers working on your databases are aware of analytic functions.

 

结语:

分析函数是对标准SQL的强大的扩展。有了它我们现在就可以回答先前只能通过存储过程或复杂低效的嵌套查询来解答的问题了。如果你是一个开发复杂查询的人员,你总是需要查证是否可以使用分析函数来解决问题。我总惊讶于使用分析函数能那么简单的来解决复杂的问题。如果你是DBA,帮你自己一个忙,确保你的开发人员了解分析函数。

 

感谢作者,原文:http://www.orafusion.com/art_anlytc.htm

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值