概述
Hive从0.11.0开始支持窗口函数,这是一组特殊的函数,可以扫描多个输入行来计算每个输出值。窗口函数通常和OVER, PARTITION BY, ORDER BY 以及 windowing specification一起使用。不同于和GROUP BY一起使用的常规聚合函数(每组只有一个结果值),窗口函数在窗口上工作,其中的输入行会被排序并且通过OVER 和 PARTITION子句使用灵活的条件表达进行分组。窗口函数也会提供一个聚合结果,但不对结果集进行分组,它们会随每条记录多次返回group值。相比常规的GROUP BY子句,窗口函数提供了极大的灵活性和功能性,使HQL的特殊聚合更加简单、强大。
窗口函数的语法格式如下:
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_expression>])
Function (arg1,…, argn) 是下面几类函数中的任意一个
- 窗口聚合函数 :也就是常规聚合函数,如sum(…)、max(…)、min(…)、avg(…)、count(…)等
- 窗口排序函数:用以对数据进行排序,如rank(…)、row_number(…)等
- 窗口分析函数:用以统计和比较的函数,如lead(…)、lag(…)、first_value(…)、last_value(…)等
OVER [PARTITION BY <…>] 子句类似于GROUP BY子句,它按列将行划分为一个或多个分区包含相同值的组。这些逻辑组被称为分区(与分区表中的分区概念不同)。忽略PARTITION BY语句的操作会作用于表中的所有行上。
[ORDER BY <…>] 子句和常规的ORDER BY是一样的。它确保由 PARTITION BY 子句生成的行是按照特定方式排序的,如升序或降序。
窗口聚合函数
在窗口函数中使用聚合函数会有比GROUP BY更好的灵活性。从Hive 2.2.0开始(见HIVE-13453),DISTINCT 支持和窗口函数中的聚合函数一起使用。
示例
- 准备测试表及测试数据
> CREATE TABLE IF NOT EXISTS employee_contract
(
name string,
dept_num int,
employee_id int,
salary int,
type string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
测试数据如下:
cat employee_contract.txt
Michael|1000|100|5000|full|2014-01-29
Will|1000|101|4000|full|2013-10-02
Will|1000|101|4000|part|2014-10-02
Steven|1000|102|6400|part|2012-11-03
Lucy|1000|103|5500|full|2010-01-03
Lily|1001|104|5000|part|2014-11-29
Jess|1001|105|6000|part|2014-12-02
Mike|1001|106|6400|part|2013-11-03
Wei|1002|107|7000|part|2010-04-03
Yun|1002|108|5500|full|2014-01-29
Richard|1002|109|8000|full|2013-09-01
加载数据
> LOAD DATA INPATH
'/tmp/employee_contract.txt'
OVERWRITE INTO TABLE employee_contract;
- 常规聚合函数作为窗口函数使用
> SELECT
name,
dept_num as deptno,
salary,
count(*) OVER (PARTITION BY dept_num) as cnt,
count(distinct dept_num) OVER (PARTITION BY dept_num) as dcnt,
sum(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) as sum1,
sum(salary) OVER(ORDER BY dept_num) as sum2,
sum(salary) OVER(ORDER BY dept_num, name) as sum3
FROM employee_contract
ORDER BY deptno, name;
+----------+---------+---------+------+-------+--------+--------+--------+
| name | deptno | salary | cnt | dcnt | sum1 | sum2 | sum3 |
+----------+---------+---------+------+-------+--------+--------+--------+
| Lucy | 1000 | 5500 | 5 | 1 | 24900 | 24900 | 5500 |
| Michael | 1000 | 5000 | 5 | 1 | 24900 | 24900 | 10500 |
| Steven | 1000 | 6400 | 5 | 1 | 24900 | 24900 | 16900 |
| Will | 1000 | 4000 | 5 | 1 | 24900 | 24900 | 24900 |
| Will | 1000 | 4000 | 5 | 1 | 24900 | 24900 | 24900 |
| Jess | 1001 | 6000 | 3 | 1 | 17400 | 42300 | 30900 |
| Lily | 1001 | 5000 | 3 | 1 | 17400 | 42300 | 35900 |
| Mike | 1001 | 6400 | 3 | 1 | 17400 | 42300 | 42300 |
| Richard | 1002 | 8000 | 3 | 1 | 20500 | 62800 | 50300 |
| Wei | 1002 | 7000 | 3 | 1 | 20500 | 62800 | 57300 |
| Yun | 1002 | 5500 | 3 | 1 | 20500 | 62800 | 62800 |
+----------+---------+---------+------+-------+--------+--------+--------+
窗口排序函数
窗口排序函数在返回的数据中提供特定组内的排序数据信息,如行号和列组。常用的窗口排序函数包括:
- RANK:对组中的项进行排序,排名相等会在名次中留下空位,例如根据特定条件查找前N行。
- ROW_NUMBER: 根据分区和顺序规范,为每行指定一个从1开始的唯一序列号。
- DENSE_RANK:类似于RANK,但排名相等不会在名次中留下空位,例如,如果使用DENSE_RANK对一场比赛进行排名,并且有两名选手并列第二,我们将看到这两名选手都排在第二位,而下一名选手则排在第三位。但是,RANK函数将两个人排在第二位,而下一个人将排在第四位
- CUME_DIST:小于等于当前值的行数/分组内总行数
- PERCENT_RANK:返回(current rank - 1)/(total number of rows - 1)的百分比值。
- NTILE:将一个有序的数据集划分为若干个bucket,并为每一行分配一个适当的bucket号。它可用于将行分成相等的集合,并为每行指定一个数字。
示例
> SELECT
name,
dept_num as deptno,
salary,
row_number() OVER () as rnum,
rank() OVER (PARTITION BY dept_num ORDER BY salary) as rk,
dense_rank() OVER (PARTITION BY dept_num ORDER BY salary) as drk,
percent_rank() OVER(PARTITION BY dept_num ORDER BY salary) as prk,
ntile(4) OVER(PARTITION BY dept_num ORDER BY salary) as ntile
FROM employee_contract
ORDER BY deptno, name;
+----------+---------+---------+-------+-----+------+-------+--------+
| name | deptno | salary | rnum | rk | drk | prk | ntile |
+----------+---------+---------+-------+-----+------+-------+--------+
| Lucy | 1000 | 5500 | 7 | 4 | 3 | 0.75 | 3 |
| Michael | 1000 | 5000 | 11 | 3 | 2 | 0.5 | 2 |
| Steven | 1000 | 6400 | 8 | 5 | 4 | 1.0 | 4 |
| Will | 1000 | 4000 | 9 | 1 | 1 | 0.0 | 1 |
| Will | 1000 | 4000 | 10 | 1 | 1 | 0.0 | 1 |
| Jess | 1001 | 6000 | 5 | 2 | 2 | 0.5 | 2 |
| Lily | 1001 | 5000 | 6 | 1 | 1 | 0.0 | 1 |
| Mike | 1001 | 6400 | 4 | 3 | 3 | 1.0 | 3 |
| Richard | 1002 | 8000 | 1 | 3 | 3 | 1.0 | 3 |
| Wei | 1002 | 7000 | 3 | 2 | 2 | 0.5 | 2 |
| Yun | 1002 | 5500 | 2 | 1 | 1 | 0.0 | 1 |
+----------+---------+---------+-------+-----+------+-------+--------+
从Hive 2.1.0开始,可以在OVER子句中使用聚合函数
> SELECT
dept_num,
rank() OVER (PARTITION BY dept_num ORDER BY sum(salary)) as rk
FROM employee_contract
GROUP BY dept_num;
+-----------+-----+
| dept_num | rk |
+-----------+-----+
| 1000 | 1 |
| 1001 | 1 |
| 1002 | 1 |
+-----------+-----+
窗口分析函数
窗口分析函数提供了扩展的数据分析功能,如获取有序集中的 lag, lead, 最后一行或第一行。常用的分析函数如下:
- CUME_DIST:计算当前值小于等于(number of rows ≤ current row)/(total number of rows)值的行数
- LEAD(value_expr[,offset[,default]]): 用于从下一行返回数据。第一个参数为列名,第二个参数指定要lead的行数(偏移量),可选,默认为1;第三个参数表示如果未指定默认值,则函数返回[,default]或如不指定default时则返回NULL。
- LAG(value_expr[,offset[,default]]) : 用于从上一行接收数据。第一个参数为列名,第二个参数指定要lag的行数(偏移量),可选,默认为1;第三个参数表示如果未指定默认值,则函数返回[,default]或如不指定default时则返回NULL。
- FIRST_VALUE : 取分组内排序后的第一行。最多接收2个参数,第一个参数为列名,第二个参数是一个布尔值(可选),默认是false,如果设置为true,则忽略NULL值。
- LAST_VALUE : 取分组内排序后的最后一行。最多接收2个参数,第一个参数为列名,第二个参数是一个布尔值(可选),默认是false,如果设置为true,则忽略NULL值。
示例
> SELECT
name,
dept_num as deptno,
salary,
round(cume_dist() OVER (PARTITION BY dept_num ORDER BY salary), 2) as cume,
lead(salary, 2) OVER (PARTITION BY dept_num ORDER BY salary) as lead,
lag(salary, 2, 0) OVER (PARTITION BY dept_num ORDER BY salary) as lag,
first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval,
last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as lvalue,
last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS lvalue2
FROM employee_contract
ORDER BY deptno, salary;
+----------+---------+---------+-------+-------+-------+-------+---------+----------+
| name | deptno | salary | cume | lead | lag | fval | lvalue | lvalue2 |
+----------+---------+---------+-------+-------+-------+-------+---------+----------+
| Will | 1000 | 4000 | 0.4 | 5500 | 0 | 4000 | 4000 | 6400 |
| Will | 1000 | 4000 | 0.4 | 5000 | 0 | 4000 | 4000 | 6400 |
| Michael | 1000 | 5000 | 0.6 | 6400 | 4000 | 4000 | 5000 | 6400 |
| Lucy | 1000 | 5500 | 0.8 | NULL | 4000 | 4000 | 5500 | 6400 |
| Steven | 1000 | 6400 | 1.0 | NULL | 5000 | 4000 | 6400 | 6400 |
| Lily | 1001 | 5000 | 0.33 | 6400 | 0 | 5000 | 5000 | 6400 |
| Jess | 1001 | 6000 | 0.67 | NULL | 0 | 5000 | 6000 | 6400 |
| Mike | 1001 | 6400 | 1.0 | NULL | 5000 | 5000 | 6400 | 6400 |
| Yun | 1002 | 5500 | 0.33 | 8000 | 0 | 5500 | 5500 | 8000 |
| Wei | 1002 | 7000 | 0.67 | NULL | 0 | 5500 | 7000 | 8000 |
| Richard | 1002 | 8000 | 1.0 | NULL | 5500 | 5500 | 8000 | 8000 |
+----------+---------+---------+-------+-------+-------+-------+---------+----------+
对于last_value,结果(字段 lval)有点出乎意料,这是因为所使用的默认窗口子句是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,这在示例中意味着当前行将始终是最后一个值。将窗口子句改为 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 就可以得到预期的结果(见列lvalue2)。
窗口表达式
[<window_expression>] 用于进一步细分结果并应用窗口函数,有2种类型的窗口:Row Type 和 Range Type。
注意:根据JIRA https://issues.apache.org/jira/browse/HIVE-4797, rank(…), ntile(…), dense_rank(…), cume_dist(…), percent_rank(…), lead(…), lag(…) 和 row_number(…) 函数还不支持和窗口表达式一起使用。
Row Type
对于row type的窗口,其以当前行之前或之后的行号为处理单元。常用的语法如下:
ROWS BETWEEN <start_expr> AND <end_expr>
<start_expr> 是如下任一语句:
- UNBOUNDED PRECEDING
- CURRENT ROW
- N PRECEDING or FOLLOWING
<end_expr> 是如下任一语句: - UNBOUNDED FOLLOWING
- CURRENT ROW
- N PRECEDING or FOLLOWING
下面详细介绍如何使用窗口表达式及其组合:
- BETWEEN … AND:用以指定窗口的起始点和结束点。第一个表达式指定起始点,第二个指定结束点。如果省略了 BETWEEN…AND(如ROWS * N PRECEDING 或 ROWS UNBOUNDED PRECEDING),hive会将其看作起始点,而结束点默认是当前行。
- N PRECEDING or FOLLOWING: 用以表示当前行之前或之后的N行
- UNBOUNDED PRECEDING: 表示窗口从分区的第一行开始。这是起始点的说明规范,不能用以结束点说明规范
- UNBOUNDED FOLLOWING:表示窗口以分区的最后一行作为结束点。这是结束点的说明规范,不能用以起始点的说明规范
- UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: 表示每行的第一行和最后一行,也就是表中的所有行
- CURRENT ROW: 作为起始点,CURRENT ROW指定从当前行或值开始,这取决于是否指定了ROW或RANGE,这种情况下,结束点不能是 M * PRECEDING;而作为结束点,CURRENT ROW指定当前行或值作为窗口的结束,这取决于是否指定了ROW或RANGE,这种情况下,起始点不能是 M FOLLOWING
下图可以帮助我们更清楚地理解前面的定义:
示例:window expression preceding and following
> SELECT
name, dept_num as dno, salary AS sal,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS 2 PRECEDING) win6,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS UNBOUNDED PRECEDING) win7
FROM employee_contract
ORDER BY dno, name;
+----------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| name | dno | sal | win1 | win2 | win3 | win4 | win5 | win6 | win7 |
+----------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| Lucy | 1000 | 5500 | 5500 | 6400 | 6400 | NULL | 6400 | 5500 | 5500 |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 | 5500 | 6400 | 5500 | 5500 |
| Steven | 1000 | 6400 | 6400 | 6400 | 6400 | 5500 | 4000 | 6400 | 6400 |
| Will | 1000 | 4000 | 6400 | 6400 | 4000 | 6400 | NULL | 6400 | 6400 |
| Will | 1000 | 4000 | 6400 | 6400 | 6400 | 6400 | 4000 | 6400 | 6400 |
| Jess | 1001 | 6000 | 6000 | 6400 | 6400 | NULL | 6400 | 6000 | 6000 |
| Lily | 1001 | 5000 | 6000 | 6400 | 6400 | 6000 | 6400 | 6000 | 6000 |
| Mike | 1001 | 6400 | 6400 | 6400 | 6400 | 6000 | NULL | 6400 | 6400 |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 | NULL | 7000 | 8000 | 8000 |
| Wei | 1002 | 7000 | 8000 | 8000 | 8000 | 8000 | 5500 | 8000 | 8000 |
| Yun | 1002 | 5500 | 8000 | 8000 | 7000 | 8000 | NULL | 8000 | 8000 |
+----------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
示例:window expression current_row
> SELECT
name, dept_num as dno, salary AS sal,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win8,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win9,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win10,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) win11,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win12,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win13,
max(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win14
FROM employee_contract
ORDER BY dno, name;
+----------+-------+-------+-------+-------+--------+--------+--------+--------+--------+
| name | dno | sal | win8 | win9 | win10 | win11 | win12 | win13 | win14 |
+----------+-------+-------+-------+-------+--------+--------+--------+--------+--------+
| Lucy | 1000 | 5500 | 5500 | 5500 | 6400 | NULL | 5500 | 5500 | 6400 |
| Michael | 1000 | 5000 | 5000 | 6400 | 6400 | 5500 | 5500 | 6400 | 6400 |
| Steven | 1000 | 6400 | 6400 | 6400 | 6400 | 5500 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 4000 | 4000 | 4000 | 6400 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 4000 | 4000 | 4000 | 6400 | 6400 | 6400 | 6400 |
| Jess | 1001 | 6000 | 6000 | 6000 | 6400 | NULL | 6000 | 6000 | 6400 |
| Lily | 1001 | 5000 | 5000 | 6400 | 6400 | 6000 | 6000 | 6400 | 6400 |
| Mike | 1001 | 6400 | 6400 | 6400 | 6400 | 6000 | 6400 | 6400 | 6400 |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 | NULL | 8000 | 8000 | 8000 |
| Wei | 1002 | 7000 | 7000 | 7000 | 7000 | 8000 | 8000 | 8000 | 8000 |
| Yun | 1002 | 5500 | 5500 | 5500 | 5500 | 8000 | 8000 | 8000 | 8000 |
+----------+-------+-------+-------+-------+--------+--------+--------+--------+--------+
此外,窗口可以在单独的 window 子句中定义,也可以由其他 windows 引用
示例:window reference
> SELECT name, dept_num, salary,
MAX(salary) OVER w1 AS win1,
MAX(salary) OVER w2 AS win2,
MAX(salary) OVER w3 AS win3
FROM employee_contract
WINDOW
w1 as (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
w2 as w3,
w3 as (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING);
+----------+-----------+---------+-------+-------+-------+
| name | dept_num | salary | win1 | win2 | win3 |
+----------+-----------+---------+-------+-------+-------+
| Lucy | 1000 | 5500 | 5500 | 6400 | 6400 |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 |
| Steven | 1000 | 6400 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 6400 | 4000 | 4000 |
| Jess | 1001 | 6000 | 6000 | 6400 | 6400 |
| Lily | 1001 | 5000 | 6000 | 6400 | 6400 |
| Mike | 1001 | 6400 | 6400 | 6400 | 6400 |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 |
| Wei | 1002 | 7000 | 8000 | 8000 | 8000 |
| Yun | 1002 | 5500 | 8000 | 7000 | 7000 |
+----------+-----------+---------+-------+-------+-------+
range type
相比以行为单位的row type窗口,range type窗口是在窗口表达式中指定值的范围值。如下示例, max(salary) RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING 语句会计算分区内的相对于当前行值从-500到+1000中的max(salary),假如当前行的salary值为4000,那么max(salary)的计算范围就是每个由dept_num指定的分区内的salary值在3500和5000之间的数据。
> SELECT
dept_num, start_date, name, salary,
max(salary) OVER (PARTITION BY dept_num ORDER BY salary
RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING) win1,
max(salary) OVER (PARTITION BY dept_num ORDER BY salary
RANGE BETWEEN 500 PRECEDING AND CURRENT ROW) win2
FROM employee_contract
order by dept_num, start_date;
+-----------+-------------+----------+---------+-------+-------+
| dept_num | start_date | name | salary | win1 | win2 |
+-----------+-------------+----------+---------+-------+-------+
| 1000 | 2010-01-03 | Lucy | 5500 | 6400 | 5500 |
| 1000 | 2012-11-03 | Steven | 6400 | 6400 | 6400 |
| 1000 | 2013-10-02 | Will | 4000 | 5000 | 4000 |
| 1000 | 2014-01-29 | Michael | 5000 | 5500 | 5000 |
| 1000 | 2014-10-02 | Will | 4000 | 5000 | 4000 |
| 1001 | 2013-11-03 | Mike | 6400 | 6400 | 6400 |
| 1001 | 2014-11-29 | Lily | 5000 | 6000 | 5000 |
| 1001 | 2014-12-02 | Jess | 6000 | 6400 | 6000 |
| 1002 | 2010-04-03 | Wei | 7000 | 8000 | 7000 |
| 1002 | 2013-09-01 | Richard | 8000 | 8000 | 8000 |
| 1002 | 2014-01-29 | Yun | 5500 | 5500 | 5500 |
+-----------+-------------+----------+---------+-------+-------+
注意:如果省略了整个窗口表达式语句,那么默认的窗口语句就是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW。如果ORDER BY 和 WINDOW 语句同时都没有,则默认的窗口语句是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING。
参考
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
书籍 Apache Hive Essentials Second Edition (by Dayong Du) Chapter 6