Window functions (also known as analytic or OLAP functions) compute an aggregate over a group of related rows. They differ from ordinary aggregate functions in that they return a value for every row of the group instead of collapsing each group into a single row, which makes them a powerful tool for complex reporting requirements.
1、Basic window usage
1.1 The over keyword
A window function is normally applied by opening a window with over().
select
ename,
sal,
sum(sal) over() salsum,
concat(round(sal / sum(sal) over() * 100, 1), '%') ratiosal
from emp;
+---------+-------+---------+-----------+
| ename | sal | salsum | ratiosal |
+---------+-------+---------+-----------+
| MILLER | 1300 | 29025 | 4.5% |
| FORD | 3000 | 29025 | 10.3% |
| JAMES | 950 | 29025 | 3.3% |
| ADAMS | 1100 | 29025 | 3.8% |
| TURNER | 1500 | 29025 | 5.2% |
| KING | 5000 | 29025 | 17.2% |
| SCOTT | 3000 | 29025 | 10.3% |
| CLARK | 2450 | 29025 | 8.4% |
| BLAKE | 2850 | 29025 | 9.8% |
| MARTIN | 1250 | 29025 | 4.3% |
| JONES | 2975 | 29025 | 10.2% |
| WARD | 1250 | 29025 | 4.3% |
| ALLEN | 1600 | 29025 | 5.5% |
| SMITH | 800 | 29025 | 2.8% |
+---------+-------+---------+-----------+
The window function is evaluated for every row; when over() has no arguments, the window defaults to the entire result set.
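Hive is not always at hand, but the same semantics hold in any engine with window-function support. A minimal sketch using Python's built-in sqlite3 (assumes SQLite >= 3.25; the rows are made up, not the emp table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table emp(ename text, sal integer)")
con.executemany("insert into emp values (?, ?)",
                [("SMITH", 800), ("KING", 5000), ("MILLER", 1300)])

# With an empty over(), the window is the whole result set:
# every row sees the same grand total.
rows = con.execute("""
    select ename, sal,
           sum(sal) over() as salsum,
           round(sal * 100.0 / sum(sal) over(), 1) as ratio
    from emp
""").fetchall()
print(rows)  # every row carries salsum = 7100
```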
1.2 The partition by clause
partition by splits the over() window into partitions by a column and computes statistics per partition; the window size equals the partition size.
select
deptno,
ename,
sal,
sum(sal) over(partition by deptno) salsum
from emp order by deptno limit 20;
+---------+---------+-------+---------+
| deptno | ename | sal | salsum |
+---------+---------+-------+---------+
| 10 | MILLER | 1300 | 8750 |
| 10 | KING | 5000 | 8750 |
| 10 | CLARK | 2450 | 8750 |
| 20 | ADAMS | 1100 | 10875 |
| 20 | SCOTT | 3000 | 10875 |
| 20 | FORD | 3000 | 10875 |
| 20 | JONES | 2975 | 10875 |
| 20 | SMITH | 800 | 10875 |
| 30 | BLAKE | 2850 | 9400 |
| 30 | WARD | 1250 | 9400 |
| 30 | MARTIN | 1250 | 9400 |
| 30 | TURNER | 1500 | 9400 |
| 30 | JAMES | 950 | 9400 |
| 30 | ALLEN | 1600 | 9400 |
+---------+---------+-------+---------+
1.3 The order by clause
select
ename,
sal,
deptno,
sum(sal) over(partition by deptno) salsum1,
sum(sal) over(partition by deptno order by sal desc) salsum
from emp;
+---------+-------+---------+----------+---------+
| ename | sal | deptno | salsum1 | salsum |
+---------+-------+---------+----------+---------+
| KING | 5000 | 10 | 8750 | 5000 |
| CLARK | 2450 | 10 | 8750 | 7450 |
| MILLER | 1300 | 10 | 8750 | 8750 |
| FORD | 3000 | 20 | 10875 | 6000 |
| SCOTT | 3000 | 20 | 10875 | 6000 |
| JONES | 2975 | 20 | 10875 | 8975 |
| ADAMS | 1100 | 20 | 10875 | 10075 |
| SMITH | 800 | 20 | 10875 | 10875 |
| BLAKE | 2850 | 30 | 9400 | 2850 |
| ALLEN | 1600 | 30 | 9400 | 4450 |
| TURNER | 1500 | 30 | 9400 | 5950 |
| MARTIN | 1250 | 30 | 9400 | 8450 |
| WARD | 1250 | 30 | 9400 | 8450 |
| JAMES | 950 | 30 | 9400 | 9400 |
+---------+-------+---------+----------+---------+
# deptno 20: FORD and SCOTT both get 6000
#| FORD | 3000 | 20 | 10875 | 6000 |
#| SCOTT | 3000 | 20 | 10875 | 6000 |
Without an order by clause, over() yields the total over the whole partition;
with order by, it yields a running total accumulated row by row within the partition.
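The tie above (FORD and SCOTT both showing 6000) is because with order by and no explicit frame, the default frame is range between unbounded preceding and current row, so peer rows with the same sort key share one running total. SQLite uses the same default, so the difference from an explicit rows frame can be sketched with Python's sqlite3 (made-up rows, not the full emp table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table emp(ename text, sal integer)")
con.executemany("insert into emp values (?, ?)",
                [("FORD", 3000), ("SCOTT", 3000), ("JONES", 2975)])

# Default frame with order by is RANGE ... CURRENT ROW:
# tied rows (same sal) are peers and share the running sum.
range_sum = dict(con.execute("""
    select ename, sum(sal) over(order by sal desc) from emp
""").fetchall())

# An explicit ROWS frame accumulates strictly row by row instead.
rows_sum = dict(con.execute("""
    select ename, sum(sal) over(order by sal desc
        rows between unbounded preceding and current row) from emp
""").fetchall())

print(range_sum)  # FORD and SCOTT both 6000
print(rows_sum)   # one of them 3000, the other 6000
```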
2、The window clause
rows between ... and ...
To divide the window result at a finer granularity, use a window (frame) clause, which takes the following options:
- unbounded preceding: the first row of the partition
- n preceding: the n rows before the current row
- current row: the current row
- n following: the n rows after the current row
- unbounded following: the last row of the partition
-- Equivalent queries: the running sum from the first row of the partition to the current row
select ename,
sal,
deptno,
sum(sal) over(partition by deptno order by ename)
from emp;
select
ename,
sal,
deptno,
sum(sal) over(partition by deptno order by ename
              rows between unbounded preceding and current row)
from emp;
+---------+-------+---------+---------------+
| ename | sal | deptno | sum_window_0 |
+---------+-------+---------+---------------+
| CLARK | 2450 | 10 | 2450 |
| KING | 5000 | 10 | 7450 |
| MILLER | 1300 | 10 | 8750 |
| ADAMS | 1100 | 20 | 1100 |
| FORD | 3000 | 20 | 4100 |
| JONES | 2975 | 20 | 7075 |
| SCOTT | 3000 | 20 | 10075 |
| SMITH | 800 | 20 | 10875 |
| ALLEN | 1600 | 30 | 1600 |
| BLAKE | 2850 | 30 | 4450 |
| JAMES | 950 | 30 | 5400 |
| MARTIN | 1250 | 30 | 6650 |
| TURNER | 1500 | 30 | 8150 |
| WARD | 1250 | 30 | 9400 |
+---------+-------+---------+---------------+
-- Sum from the first row to the last row of the partition
-- Equivalent to sum(sal) over(partition by deptno)
select
ename,
sal,
deptno,
sum(sal) over(partition by deptno order by ename
rows between unbounded preceding and unbounded following
)
from emp;
+---------+-------+---------+---------------+
| ename | sal | deptno | sum_window_0 |
+---------+-------+---------+---------------+
| CLARK | 2450 | 10 | 8750 |
| KING | 5000 | 10 | 8750 |
| MILLER | 1300 | 10 | 8750 |
| ADAMS | 1100 | 20 | 10875 |
| FORD | 3000 | 20 | 10875 |
| JONES | 2975 | 20 | 10875 |
| SCOTT | 3000 | 20 | 10875 |
| SMITH | 800 | 20 | 10875 |
| ALLEN | 1600 | 30 | 9400 |
| BLAKE | 2850 | 30 | 9400 |
| JAMES | 950 | 30 | 9400 |
| MARTIN | 1250 | 30 | 9400 |
| TURNER | 1500 | 30 | 9400 |
| WARD | 1250 | 30 | 9400 |
+---------+-------+---------+---------------+
-- Sum of the previous row, the current row, and the next row within the partition
select
ename,
sal,
deptno,
sum(sal) over(partition by deptno order by ename
rows between 1 preceding and 1 following
)
from emp;
+---------+-------+---------+---------------+
| ename | sal | deptno | sum_window_0 |
+---------+-------+---------+---------------+
| CLARK | 2450 | 10 | 7450 |
| KING | 5000 | 10 | 8750 |
| MILLER | 1300 | 10 | 6300 |
| ADAMS | 1100 | 20 | 4100 |
| FORD | 3000 | 20 | 7075 |
| JONES | 2975 | 20 | 8975 |
| SCOTT | 3000 | 20 | 6775 |
| SMITH | 800 | 20 | 3800 |
| ALLEN | 1600 | 30 | 4450 |
| BLAKE | 2850 | 30 | 5400 |
| JAMES | 950 | 30 | 5050 |
| MARTIN | 1250 | 30 | 3700 |
| TURNER | 1500 | 30 | 4000 |
| WARD | 1250 | 30 | 2750 |
+---------+-------+---------+---------------+
3、Ranking functions
All start from 1 and produce the rank of each row within its partition.
- row_number(): consecutive ranks, never repeated, e.g. 1, 2, 3, 4, ...
- rank(): tied rows share a rank and leave a gap in the sequence, e.g. 1, 2, 2, 4, 5, ...
- dense_rank(): tied rows share a rank with no gap, e.g. 1, 2, 2, 3, 4, ...
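Before the Hive example below, the three functions can be compared side by side in any engine with window support; a small sketch via Python's sqlite3 on made-up scores:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t(score integer)")
con.executemany("insert into t values (?)", [(100,), (100,), (99,), (98,)])

rows = con.execute("""
    select score,
           row_number() over w as rn,  -- 1,2,3,4: never repeats
           rank()       over w as rk,  -- 1,1,3,4: ties leave a gap
           dense_rank() over w as dr   -- 1,1,2,3: ties leave no gap
    from t window w as (order by score desc)
    order by rn
""").fetchall()
for r in rows:
    print(r)
```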
-- Data
class1,s01,100
class1,s03,100
class1,s05,100
class1,s07,99
class1,s09,98
class1,s02,98
class1,s04,97
class2,s21,100
class2,s24,99
class2,s27,99
class2,s22,98
class2,s25,98
class2,s28,97
class2,s26,96
-- Create the table and load the data
create table rank(
cname string,
sname string,
score int
) row format delimited fields terminated by ',';
load data local inpath '/root/bigdata/test_data/hive/rank.txt' into table
rank;
select cname, sname, score,
       row_number() over (partition by cname order by score desc) rank1,
       rank() over (partition by cname order by score desc) rank2,
       dense_rank() over (partition by cname order by score desc) rank3
from rank;
+---------+--------+--------+--------+--------+--------+
| cname | sname | score | rank1 | rank2 | rank3 |
+---------+--------+--------+--------+--------+--------+
| class1 | s01 | 100 | 1 | 1 | 1 |
| class1 | s03 | 100 | 2 | 1 | 1 |
| class1 | s05 | 100 | 3 | 1 | 1 |
| class1 | s07 | 99 | 4 | 4 | 2 |
| class1 | s09 | 98 | 5 | 5 | 3 |
| class1 | s02 | 98 | 6 | 5 | 3 |
| class1 | s04 | 97 | 7 | 7 | 4 |
| class2 | s21 | 100 | 1 | 1 | 1 |
| class2 | s24 | 99 | 2 | 2 | 2 |
| class2 | s27 | 99 | 3 | 2 | 2 |
| class2 | s22 | 98 | 4 | 4 | 3 |
| class2 | s25 | 98 | 5 | 4 | 3 |
| class2 | s28 | 97 | 6 | 6 | 4 |
| class2 | s26 | 96 | 7 | 7 | 5 |
+---------+--------+--------+--------+--------+--------+
4、Sequence functions
- lag: returns the value from the previous row of the current row
- lead: returns the value from the following row of the current row
- first_value: the first value in the ordered partition, up to the current row
- last_value: the last value in the ordered partition, up to the current row
- ntile: splits the ordered partition into n buckets and returns the bucket number of the current row
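A common gotcha with last_value: with order by and the default frame, the frame ends at the current row, so last_value just returns the current row's own value. That is why the example further below spells out rows between unbounded preceding and unbounded following. A sketch with Python's sqlite3 on made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table pv(d text, v integer)")
con.executemany("insert into pv values (?, ?)",
                [("d1", 1), ("d2", 5), ("d3", 7)])

# Default frame ends at the current row, so last_value is just
# the current row's own value ...
default = [r[0] for r in con.execute(
    "select last_value(v) over(order by d) from pv order by d")]

# ... while an explicit full frame yields the true last value
# of the whole partition.
full = [r[0] for r in con.execute("""
    select last_value(v) over(order by d
        rows between unbounded preceding and unbounded following)
    from pv order by d
""")]

print(default)  # [1, 5, 7]
print(full)     # [7, 7, 7]
```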
-- Data
cookie1,2019-04-10,1
cookie1,2019-04-11,5
cookie1,2019-04-12,7
cookie1,2019-04-13,3
cookie1,2019-04-14,2
cookie1,2019-04-15,4
cookie1,2019-04-16,4
cookie2,2019-04-10,2
cookie2,2019-04-11,3
cookie2,2019-04-12,5
cookie2,2019-04-13,6
cookie2,2019-04-14,3
cookie2,2019-04-15,9
cookie2,2019-04-16,7
-- Create the table
create table userpv(
cid string,
ctime date,
pv int
)
row format delimited fields terminated by ",";
-- Load the data
load data local inpath '/home/hadoop/data/userpv.dat' into table userpv;
select * from userpv;
+-------------+---------------+------------+
| userpv.cid | userpv.ctime | userpv.pv |
+-------------+---------------+------------+
| cookie1 | 2019-04-10 | 1 |
| cookie1 | 2019-04-11 | 5 |
| cookie1 | 2019-04-12 | 7 |
| cookie1 | 2019-04-13 | 3 |
| cookie1 | 2019-04-14 | 2 |
| cookie1 | 2019-04-15 | 4 |
| cookie1 | 2019-04-16 | 4 |
| cookie2 | 2019-04-10 | 2 |
| cookie2 | 2019-04-11 | 3 |
| cookie2 | 2019-04-12 | 5 |
| cookie2 | 2019-04-13 | 6 |
| cookie2 | 2019-04-14 | 3 |
| cookie2 | 2019-04-15 | 9 |
| cookie2 | 2019-04-16 | 7 |
+-------------+---------------+------------+
-- lag: returns the value from the previous row
-- lead: returns the value from the following row
select cid, ctime, pv,
lag(pv) over(partition by cid order by ctime) lagpv,
lead(pv) over(partition by cid order by ctime) leadpv
from userpv;
+----------+-------------+-----+--------+---------+
| cid | ctime | pv | lagpv | leadpv |
+----------+-------------+-----+--------+---------+
| cookie1 | 2019-04-10 | 1 | NULL | 5 |
| cookie1 | 2019-04-11 | 5 | 1 | 7 |
| cookie1 | 2019-04-12 | 7 | 5 | 3 |
| cookie1 | 2019-04-13 | 3 | 7 | 2 |
| cookie1 | 2019-04-14 | 2 | 3 | 4 |
| cookie1 | 2019-04-15 | 4 | 2 | 4 |
| cookie1 | 2019-04-16 | 4 | 4 | NULL |
| cookie2 | 2019-04-10 | 2 | NULL | 3 |
| cookie2 | 2019-04-11 | 3 | 2 | 5 |
| cookie2 | 2019-04-12 | 5 | 3 | 6 |
| cookie2 | 2019-04-13 | 6 | 5 | 3 |
| cookie2 | 2019-04-14 | 3 | 6 | 9 |
| cookie2 | 2019-04-15 | 9 | 3 | 7 |
| cookie2 | 2019-04-16 | 7 | 9 | NULL |
+----------+-------------+-----+--------+---------+
select cid, ctime, pv,
       first_value(pv) over (partition by cid order by ctime
           rows between unbounded preceding and unbounded following) as firstpv,
       last_value(pv) over (partition by cid order by ctime
           rows between unbounded preceding and unbounded following) as lastpv
from userpv;
+----------+-------------+-----+----------+---------+
| cid | ctime | pv | firstpv | lastpv |
+----------+-------------+-----+----------+---------+
| cookie1 | 2019-04-10 | 1 | 1 | 4 |
| cookie1 | 2019-04-11 | 5 | 1 | 4 |
| cookie1 | 2019-04-12 | 7 | 1 | 4 |
| cookie1 | 2019-04-13 | 3 | 1 | 4 |
| cookie1 | 2019-04-14 | 2 | 1 | 4 |
| cookie1 | 2019-04-15 | 4 | 1 | 4 |
| cookie1 | 2019-04-16 | 4 | 1 | 4 |
| cookie2 | 2019-04-10 | 2 | 2 | 7 |
| cookie2 | 2019-04-11 | 3 | 2 | 7 |
| cookie2 | 2019-04-12 | 5 | 2 | 7 |
| cookie2 | 2019-04-13 | 6 | 2 | 7 |
| cookie2 | 2019-04-14 | 3 | 2 | 7 |
| cookie2 | 2019-04-15 | 9 | 2 | 7 |
| cookie2 | 2019-04-16 | 7 | 2 | 7 |
+----------+-------------+-----+----------+---------+
-- ntile: partition by cid and split each group's rows into 2 buckets in order
select cid, ctime, pv,
ntile(2) over(partition by cid order by ctime) ntile
from userpv;
+----------+-------------+-----+--------+
| cid | ctime | pv | ntile |
+----------+-------------+-----+--------+
| cookie1 | 2019-04-10 | 1 | 1 |
| cookie1 | 2019-04-11 | 5 | 1 |
| cookie1 | 2019-04-12 | 7 | 1 |
| cookie1 | 2019-04-13 | 3 | 1 |
| cookie1 | 2019-04-14 | 2 | 2 |
| cookie1 | 2019-04-15 | 4 | 2 |
| cookie1 | 2019-04-16 | 4 | 2 |
| cookie2 | 2019-04-10 | 2 | 1 |
| cookie2 | 2019-04-11 | 3 | 1 |
| cookie2 | 2019-04-12 | 5 | 1 |
| cookie2 | 2019-04-13 | 6 | 1 |
| cookie2 | 2019-04-14 | 3 | 2 |
| cookie2 | 2019-04-15 | 9 | 2 |
| cookie2 | 2019-04-16 | 7 | 2 |
+----------+-------------+-----+--------+
5、GROUPING SETS, GROUPING__ID, CUBE, and ROLLUP
-- Data
2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1
create table cookie5(month string, day string, cookieid string)
row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/cookie5.txt" into table cookie5;
select * from cookie5;
+----------------+--------------+-------------------+
| cookie5.month | cookie5.day | cookie5.cookieid |
+----------------+--------------+-------------------+
| 2015-03 | 2015-03-10 | cookie1 |
| 2015-03 | 2015-03-10 | cookie5 |
| 2015-03 | 2015-03-12 | cookie7 |
| 2015-04 | 2015-04-12 | cookie3 |
| 2015-04 | 2015-04-13 | cookie2 |
| 2015-04 | 2015-04-13 | cookie4 |
| 2015-04 | 2015-04-16 | cookie4 |
| 2015-03 | 2015-03-10 | cookie2 |
| 2015-03 | 2015-03-10 | cookie3 |
| 2015-04 | 2015-04-12 | cookie5 |
| 2015-04 | 2015-04-13 | cookie6 |
| 2015-04 | 2015-04-15 | cookie3 |
| 2015-04 | 2015-04-15 | cookie2 |
| 2015-04 | 2015-04-16 | cookie1 |
+----------------+--------------+-------------------+
-- In a single GROUP BY query, GROUPING SETS aggregates by several dimension combinations at once; it is equivalent to a UNION ALL of the individual GROUP BY result sets, and GROUPING__ID identifies which grouping set a result row belongs to.
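SQLite has no GROUPING SETS, but the equivalence itself can be run directly as the UNION ALL form; a sketch with Python's sqlite3 and a few of the rows above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table c5(month text, day text, cookieid text)")
con.executemany("insert into c5 values (?, ?, ?)", [
    ("2015-03", "2015-03-10", "cookie1"),
    ("2015-03", "2015-03-10", "cookie5"),
    ("2015-04", "2015-04-12", "cookie3"),
])

# GROUPING SETS (month, day) is just the UNION ALL of the two
# single-dimension GROUP BY queries, tagged with a grouping id.
rows = con.execute("""
    select month, null as day, count(distinct cookieid) as uv, 1 as gid
    from c5 group by month
    union all
    select null, day, count(distinct cookieid), 2 from c5 group by day
    order by gid
""").fetchall()
for r in rows:
    print(r)
```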
select
month,
day,
count(distinct cookieid) as uv,
GROUPING__ID
from cookie5
group by month, day
grouping sets (month, day)
order by GROUPING__ID;
-- Equivalent to
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day
+----------+-------------+-----+---------------+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+
| 2015-04 | NULL | 6 | 1 |
| 2015-03 | NULL | 5 | 1 |
| NULL | 2015-04-13 | 3 | 2 |
| NULL | 2015-03-10 | 4 | 2 |
| NULL | 2015-04-16 | 2 | 2 |
| NULL | 2015-04-15 | 2 | 2 |
| NULL | 2015-04-12 | 2 | 2 |
| NULL | 2015-03-12 | 1 | 2 |
+----------+-------------+-----+---------------+
SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM cookie5
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID limit 100;
-- Equivalent to (the GROUPING__ID values match the output below)
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM cookie5 GROUP BY month,day
+----------+-------------+-----+---------------+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+
| 2015-03 | 2015-03-12 | 1 | 0 |
| 2015-03 | 2015-03-10 | 4 | 0 |
| 2015-04 | 2015-04-13 | 3 | 0 |
| 2015-04 | 2015-04-15 | 2 | 0 |
| 2015-04 | 2015-04-12 | 2 | 0 |
| 2015-04 | 2015-04-16 | 2 | 0 |
| 2015-03 | NULL | 5 | 1 |
| 2015-04 | NULL | 6 | 1 |
| NULL | 2015-03-12 | 1 | 2 |
| NULL | 2015-04-12 | 2 | 2 |
| NULL | 2015-04-13 | 3 | 2 |
| NULL | 2015-04-15 | 2 | 2 |
| NULL | 2015-04-16 | 2 | 2 |
| NULL | 2015-03-10 | 4 | 2 |
+----------+-------------+-----+---------------+
-- CUBE aggregates over every combination of the GROUP BY dimensions
SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM cookie5
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;
-- Equivalent to (the GROUPING__ID values match the output below)
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM cookie5 GROUP BY month,day
+----------+-------------+-----+---------------+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+
| 2015-03 | 2015-03-10 | 4 | 0 |
| 2015-03 | 2015-03-12 | 1 | 0 |
| 2015-04 | 2015-04-12 | 2 | 0 |
| 2015-04 | 2015-04-13 | 3 | 0 |
| 2015-04 | 2015-04-15 | 2 | 0 |
| 2015-04 | 2015-04-16 | 2 | 0 |
| 2015-04 | NULL | 6 | 1 |
| 2015-03 | NULL | 5 | 1 |
| NULL | 2015-04-12 | 2 | 2 |
| NULL | 2015-03-10 | 4 | 2 |
| NULL | 2015-04-13 | 3 | 2 |
| NULL | 2015-04-15 | 2 | 2 |
| NULL | 2015-04-16 | 2 | 2 |
| NULL | 2015-03-12 | 1 | 2 |
| NULL | NULL | 7 | 3 |
+----------+-------------+-----+---------------+
-- ROLLUP
-- A subset of CUBE: hierarchical aggregation keyed on the leftmost dimension
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM cookie5
GROUP BY month,day WITH ROLLUP ORDER BY GROUPING__ID;
-- Equivalent to (the GROUPING__ID values match the output below)
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM cookie5 GROUP BY month,day
+----------+-------------+-----+---------------+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+
| 2015-04 | 2015-04-16 | 2 | 0 |
| 2015-04 | 2015-04-13 | 3 | 0 |
| 2015-04 | 2015-04-12 | 2 | 0 |
| 2015-04 | 2015-04-15 | 2 | 0 |
| 2015-03 | 2015-03-12 | 1 | 0 |
| 2015-03 | 2015-03-10 | 4 | 0 |
| 2015-04 | NULL | 6 | 1 |
| 2015-03 | NULL | 5 | 1 |
| NULL | NULL | 7 | 3 |
+----------+-------------+-----+---------------+
Reference: https://www.cnblogs.com/qingyunzong/p/8798987.html