原文地址:
http://www.cnblogs.com/skyEva/p/5730531.html
http://blog.csdn.net/qq_34941023/article/details/51773367
分析函数是什么?
分析函数是Oracle专门用于解决复杂报表统计需求的功能强大的函数,它可以在数据中进行分组然后计算基于组的某种统计值,并且每一组的每一行都可以返回一个统计值。
分析函数和聚合函数的不同之处是什么?
普通的聚合函数用group by分组,每个分组返回一个统计值,而分析函数采用partition by分组,并且每组每行都可以返回一个统计值。
分析函数的形式
分析函数带有一个开窗函数over(),包含三个分析子句:分组(partition by), 排序(order by), 窗口(rows) ,他们的使用形式如下:over(partition by xxx order by yyy rows between zzz)。
-----first_value()与last_value():求最值对应的其他属性
FIRST_VALUE(AREA_CODE) OVER(PARTITION BY BILL_MONTH ORDER BY SUM(LOCAL_FARE) DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FIRSTVAL,
LAST_VALUE(AREA_CODE) OVER(PARTITION BY BILL_MONTH ORDER BY SUM(LOCAL_FARE) DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) LASTVAL
-----rank(),dense_rank()与row_number():求排序
rank()值相同时排名相同,其后排名跳跃不连续
RANK() OVER(PARTITION BY DEPTNO ORDER BY SAL DESC) RW
dense_rank()值相同时排名相同,其后排名连续不跳跃
DENSE_RANK() OVER(PARTITION BY DEPTNO ORDER BY SAL DESC) RW
row_number()值相同时排名不相等,其后排名连续不跳跃
ROW_NUMBER() OVER(PARTITION BY DEPTNO ORDER BY SAL DESC) RW
-----------lag()与lead():求之前或之后的第N行
lag和lead函数可以在一次查询中取出同一字段的前n行的数据和后n行的值。这种操作可以使用对相同表的表连接来实现,不过使用lag和lead有更高的效率。
lag(arg1,arg2,arg3)
第一个参数是列名,
第二个参数是偏移的offset,
第三个参数是超出记录窗口时的默认值。
举例如下:
SQL> select * from kkk;
ID NAME
---------- --------------------
1 1name
2 2name
3 3name
4 4name
5 5name
SQL> select id,name,lag(name,1,0) over(order by id) from kkk;
ID NAME LAG(NAME,1,0)OVER(ORDERBYID)
---------- -------------------- ----------------------------
1 1name 0
2 2name 1name
3 3name 2name
4 4name 3name
5 5name 4name
SQL> select id,name,lead(name,1,0) over(order by id) from kkk;
ID NAME LEAD(NAME,1,0)OVER(ORDERBYID)
---------- -------------------- -----------------------------
1 1name 2name
2 2name 3name
3 3name 4name
4 4name 5name
5 5name 0
SQL> select id,name,lead(name,2,0) over(order by id) from kkk;
ID NAME LEAD(NAME,2,0)OVER(ORDERBYID)
---------- -------------------- -----------------------------
1 1name 3name
2 2name 4name
3 3name 5name
4 4name 0
5 5name 0
SQL> select id,name,lead(name,1,'linjiqin') over(order by id) from kkk;
ID NAME LEAD(NAME,1,'ALSDFJLASDJFSAF')
---------- -------------------- ------------------------------
1 1name 2name
2 2name 3name
3 3name 4name
4 4name 5name
5 5name linjiqin
-------------rollup()与cube():排列组合分组
1)、group by rollup(a, b, c):
首先会对(a、b、c)进行group by,
然后再对(a、b)进行group by,
其后再对(a)进行group by,
最后对全表进行汇总操作。
2)、group by cube(a, b, c):
则首先会对(a、b、c)进行group by,
然后依次是(a、b),(a、c),(a),(b、c),(b),(c),
最后对全表进行汇总操作。
-----------max(),min(),sum()与avg():求移动的最值总和与平均值
SUM(LOCAL_FARE) OVER(PARTITION BY AREA_CODE ORDER BY TO_NUMBER(BILL_MONTH) RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING)
AVG(LOCAL_FARE) OVER(PARTITION BY AREA_CODE ORDER BY TO_NUMBER(BILL_MONTH) RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING)
MAX(LOCAL_FARE) OVER(PARTITION BY AREA_CODE ORDER BY TO_NUMBER(BILL_MONTH) RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING)
MIN(LOCAL_FARE) OVER(PARTITION BY AREA_CODE ORDER BY TO_NUMBER(BILL_MONTH) RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING)
----unbounded preceding and unbouned following针对当前所有记录的前一条、后一条记录,也就是表中的所有记录
--unbounded:不受控制的,无限的
--preceding:在...之前
--following:在...之后
----ROWS BETWEEN unbounded preceding AND current row 是指第一行至当前行的汇总
----ROWS BETWEEN current row AND unbounded following 指当前行到最后一行的汇总
----ROWS BETWEEN 1 preceding AND current row 是指当前行的上一行(rownum-1)到当前行的汇总
----ROWS BETWEEN 1 preceding AND 1 following 是指当前行的上一行(rownum-1)到当前行的下辆行(rownum+2)的汇总
----------累加
SELECT AREA_CODE,
BILL_MONTH,
LOCAL_FARE,
SUM(LOCAL_FARE)OVER(PARTITIONBY AREA_CODE
ORDERBY BILL_MONTHASC) "last_sum_value"
FROM (SELECT T.AREA_CODE, T.BILL_MONTH,SUM(T.LOCAL_FARE) LOCAL_FARE
FROM T
GROUPBY T.AREA_CODE, T.BILL_MONTH)
ORDERBY AREA_CODE, BILL_MONTH
------------排序
ROW_NUMBER() over (PARTITION byhexuncookieid order by cycookied) order
-------------无需排序给行号
select row_number() over () as rowid, cust_idfrom dms.fund_trans_log2 limit 10;
1 分析函数:用于等级、百分点、n分片等
Ntile 是Hive很强大的一个分析函数。
- 可以看成是:它把有序的数据集合 平均分配 到 指定的数量(num)个桶中, 将桶号分配给每一行。如果不能平均分配,则优先分配较小编号的桶,并且各个桶中能放的行数最多相差1。
- 语法是:
ntile (num) over ([partition_clause] order_by_clause) as your_bucket_num
- 然后可以根据桶号,选取前或后 n分之几的数据。
例子:
给了用户和每个用户对应的消费信息表, 计算花费前50%的用户的平均消费;
-- 把用户和消费表,按消费下降顺序平均分成2份
drop tableif exists test_by_payment_ntile;
create table test_by_payment_ntileas
select
nick,
payment ,
NTILE(2)OVER(ORDERBY payment desc)AS rn
from test_nick_payment;
-- 分别对每一份计算平均值,就可以得到消费靠前50%和后50%的平均消费
select
'avg_payment'as inf,
t1.avg_payment_up_50 as avg_payment_up_50,
t2.avg_payment_down_50 as avg_payment_down_50
from
(select
avg(payment)as avg_payment_up_50
from test_by_payment_ntile
where rn=1
)t1
join
(select
avg(payment)as avg_payment_down_50
from test_by_payment_ntile
where rn=2
)t2
on (t1.dp_id=t2.dp_id);
Rank,Dense_Rank, Row_Number
SQL很熟悉的3个组内排序函数了。语法一样:
R() over (partion by col1... order by col2... desc/asc)
select
class1,
score,
rank() over(partitionby class1 orderby score desc) rk1,
dense_rank() over(partitionby class1 orderby score desc) rk2,
row_number() over(partitionby class1 orderby score desc) rk3
from zyy_test1;
如上图所示,rank 会对相同数值,输出相同的序号,而且下一个序号不间断;
dense_rank 会对相同数值,输出相同的序号,但下一个序号,间断
row_number 会对所有数值输出不同的序号,序号唯一连续;
2. 窗口函数 Lag, Lead, First_value,Last_value
Lag, Lead
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值, 与LAG相反
-- 组内排序后,向后或向前偏移
-- 如果省略掉第三个参数,默认为NULL,否则补上。
select
dp_id,
mt,
payment,
LAG(mt,2)over(partition by dp_idorder by mt) mt_new
from test2;
-- 组内排序后,向后或向前偏移
-- 如果省略掉第三个参数,默认为NULL,否则补上。
select
dp_id,
mt,
payment,
LEAD(mt,2,'1111-11')over(partition by dp_idorder by mt) mt_new
from test2;
FIRST_VALUE, LAST_VALUE
first_value: 取分组内排序后,截止到当前行,第一个值
last_value: 取分组内排序后,截止到当前行,最后一个值
-- FIRST_VALUE 获得组内当前行往前的首个值
-- LAST_VALUE 获得组内当前行往前的最后一个值
-- FIRST_VALUE(DESC) 获得组内全局的最后一个值
select
dp_id,
mt,
payment,
FIRST_VALUE(payment) over(partitionby dp_id orderby mt) payment_g_first,
LAST_VALUE(payment) over(partitionby dp_id orderby mt) payment_g_last,
FIRST_VALUE(payment) over(partitionby dp_id orderby mt desc) payment_g_last_global
from test2
ORDER BY dp_id,mt;
####
Windowing and Analytics Functions
- Windowing and Analytics Functions
- Enhancements to Hive QL
- Examples
- PARTITION BY with one partitioning column, no ORDER BY or window specification
- PARTITION BY with two partitioning columns, no ORDER BY or window specification
- PARTITION BY with one partitioning column, one ORDER BY column, and no window specification
- PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification
- PARTITION BY with partitioning, ORDER BY, and window specification
- WINDOW clause
- LEAD using default 1 row lead and not specifying default value
- LAG specifying a lag of 3 rows and default value of 0
- Distinct counting for each partition
Enhancements to Hive QL
Version
Introduced in Hive version 0.11.
This section introduces the Hive QL enhancements for windowing and analytics functions. See "Windowing Specifications in HQL" (attached to HIVE-4197) for details. HIVE-896 has more information, including links to earlier documentation in the initial comments.
All of the windowing and analytics functions operate as per the SQL standard.
The current release supports the following functions for windowing and analytics:
- Windowing functions
- LEAD
- The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
- Returns null when the lead for the current row extends beyond the end of the window.
- LAG
- The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
- Returns null when the lag for the current row extends before the beginning of the window.
- FIRST_VALUE
- LAST_VALUE
- LEAD
- The OVER clause
- OVER with standard aggregates:
- COUNT
- SUM
- MIN
- MAX
- AVG
- OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
- OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
- OVER with a window specification. Windows can be defined separately in a WINDOW clause. Window specifications support the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) - (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
- (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
The OVER clause supports the following functions, but it does not support a window with them (see HIVE-4797):
Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.
Lead and Lag functions.
- OVER with a window specification. Windows can be defined separately in a WINDOW clause. Window specifications support the following formats:
- OVER with standard aggregates:
- Analytics functions
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
- Distinct support in Hive 2.1.0 and later (see HIVE-9534)
Distinct is supported for aggregation functions including SUM, COUNT and AVG, which aggregate over the distinct values within each partition. Current implementation has the limitation that no ORDER BY or window specification can be supported in the partitioning clause for performance reason. The supported syntax is as follows.- COUNT(DISTINCT a) OVER (PARTITION BY c)
ORDER BY and window specification is supported for distinct in Hive 2.2.0 (see HIVE-13453). An example is as follows.
- COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
- COUNT(DISTINCT a) OVER (PARTITION BY c)
- Aggregate functions in OVER clause support in Hive 2.1.0 and later (see HIVE-13475)
Support to reference aggregate functions within the OVER clause has been added. For instance, currently we can use the SUM aggregation function within the OVER clause as follows.- SELECT rank() OVER (ORDER BY sum(b))
FROM T
GROUP BY a;
Examples
This section provides examples of how to use the Hive QL windowing and analytics functions in SELECT statements. See HIVE-896 for additional examples.
PARTITION BY with one partitioning column, no ORDER BY or window specification
SELECT a, COUNT(b) OVER (PARTITION BY c)
FROM T;
PARTITION BY with two partitioning columns, no ORDER BY or window specification
SELECT a, COUNT(b) OVER (PARTITION BY c, d)
FROM T;
PARTITION BY with one partitioning column, one ORDER BY column, and no window specification
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d)
FROM T;
PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification
SELECT a, SUM(b) OVER (PARTITION BY c, d ORDER BY e, f)
FROM T;
PARTITION BY with partitioning, ORDER BY, and window specification
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM T;
There can be multiple OVER clauses in a single query. A single OVER clause only applies to the immediately preceding function call. In this example, the first OVER clause applies to COUNT(b) and the second OVER clause applies to SUM(b):
SELECT
a,
COUNT(b) OVER (PARTITION BY c),
SUM(b) OVER (PARTITION BY c)
FROM T;
Aliases can be used as well, with or without the keyword AS:
SELECT
a,
COUNT(b) OVER (PARTITION BY c) AS b_count,
SUM(b) OVER (PARTITION BY c) b_sum
FROM T;
WINDOW clause
SELECT a, SUM(b) OVER w
FROM T
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING);
LEAD using default 1 row lead and not specifying default value
SELECT a, LEAD(a) OVER (PARTITION BY b ORDER BY C ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
FROM T;
LAG specifying a lag of 3 rows and default value of 0
SELECT a, LAG(a, 3, 0) OVER (PARTITION BY b ORDER BY C ROWS 3 PRECEDING)
FROM T;
Distinct counting for each partition
SELECT a, COUNT(distinct a) OVER (PARTITION BY b)
FROM T;
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"> <rdf:Description rdf:about="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics" dc:identifier="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics" dc:title="LanguageManual WindowingAndAnalytics" trackback:ping="https://cwiki.apache.org/confluence/rpc/trackback/31819589"/> </rdf:RDF>
4 people like this
- SELECT rank() OVER (ORDER BY sum(b))