窗口函数
背景
平常我们使用 hive或者 mysql时,一般聚合函数用的比较多。但对于某些偏分析的需求,group by可能很费力,子查询很多,这个时候就需要使用窗口分析函数了
相关函数说明
OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化
CURRENT ROW:当前行
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
over函数写法
over(partition by cookieid order by createtime) 先根据cookieid字段分区,相同的cookieid分为一区,每个分区内根据createtime字段排序(默认升序)
注:不加 partition by 的话则把整个数据集当作一个分区,不加 order by的话会对某些函数统计结果产生影响,如sum()
数据准备
a 2017-12-01 3
b 2017-12-00 3
cookie1 2017-12-10 1
cookie1 2017-12-11 5
cookie1 2017-12-12 7
cookie1 2017-12-13 3
cookie1 2017-12-14 2
cookie1 2017-12-15 4
cookie1 2017-12-16 4
cookie2 2017-12-12 7
cookie2 2017-12-16 6
cookie2 2017-12-24 1
cookie3 2017-12-22 5
建表导数据
create table test2(cookieid string,createtime string,pv int)
row format delimited fields terminated by ' ';
load data local inpath '/root/datas/data.txt' into table test2;
窗口含义
SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3, --当前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4, --当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5 ---当前行+往后所有行
FROM test2;
cookieid | createtime | pv | pv1 | pv2 | pv3 | pv4 | pv5 |
---|---|---|---|---|---|---|---|
a | 2017-12-1 | 3 | 3 | 3 | 3 | 3 | 3 |
b | 2017-12-00 | 3 | 3 | 3 | 3 | 3 | 3 |
cookie1 | 2017-12-10 | 1 | 1 | 1 | 1 | 6 | 26 |
cookie1 | 2017-12-11 | 5 | 6 | 6 | 6 | 13 | 25 |
cookie1 | 2017-12-12 | 7 | 13 | 13 | 13 | 16 | 20 |
cookie1 | 2017-12-13 | 3 | 16 | 16 | 16 | 18 | 13 |
cookie1 | 2017-12-14 | 2 | 18 | 18 | 17 | 21 | 10 |
cookie1 | 2017-12-15 | 4 | 22 | 22 | 16 | 20 | 8 |
cookie1 | 2017-12-16 | 4 | 26 | 26 | 13 | 13 | 4 |
cookie2 | 2017-12-12 | 7 | 7 | 7 | 7 | 13 | 14 |
cookie2 | 2017-12-16 | 6 | 13 | 13 | 13 | 14 | 7 |
cookie2 | 2017-12-24 | 1 | 14 | 14 | 14 | 14 | 1 |
cookie3 | 2017-12-22 | 5 | 5 | 5 | 5 | 5 | 5 |
注:这些窗口的划分都是在分区内部!超过分区大小就无效了
SUM 函数
select cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid ORDER BY createtime) as pv1
FROM test2;
可以看到各分区内的时间升序排列
cookieid | createtime | pv | pv1 |
---|---|---|---|
a | 2017-12-1 | 3 | 3 |
b | 2017-12-00 | 3 | 3 |
cookie1 | 2017-12-10 | 1 | 1 |
cookie1 | 2017-12-11 | 5 | 6 |
cookie1 | 2017-12-12 | 7 | 13 |
cookie1 | 2017-12-13 | 3 | 16 |
cookie1 | 2017-12-14 | 2 | 18 |
cookie1 | 2017-12-15 | 4 | 22 |
cookie1 | 2017-12-16 | 4 | 26 |
cookie2 | 2017-12-12 | 7 | 7 |
cookie2 | 2017-12-16 | 6 | 13 |
cookie2 | 2017-12-24 | 1 | 14 |
cookie3 | 2017-12-22 | 5 | 5 |
我们可以清晰地看到,窗口函数和聚合函数的不同,sum()函数可以根据每一行的窗口返回各自行对应的值,有多少行记录就有多少个sum值,而group by只能计算每一组的sum,每组只有一个值!
其中sum()计算的是分区内排序后一个个叠加的值,和order by有关!
如果不加 order by会咋样:
select cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid) as pv1
FROM test2;
可以看到,如果没有order by,不仅分区内没有排序,sum()计算的pv也是整个分区的pv
cookieid | createtime | pv | pv1 |
---|---|---|---|
a | 2017-12-1 | 3 | 3 |
b | 2017-12-00 | 3 | 3 |
cookie1 | 2017-12-14 | 2 | 26 |
cookie1 | 2017-12-16 | 4 | 26 |
cookie1 | 2017-12-15 | 4 | 26 |
cookie1 | 2017-12-13 | 3 | 26 |
cookie1 | 2017-12-12 | 7 | 26 |
cookie1 | 2017-12-11 | 5 | 26 |
cookie1 | 2017-12-10 | 1 | 26 |
cookie2 | 2017-12-16 | 6 | 14 |
cookie2 | 2017-12-12 | 7 | 14 |
cookie2 | 2017-12-24 | 1 | 14 |
cookie3 | 2017-12-22 | 5 | 5 |
NTILE 函数
NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
*注1:如果切片不均匀,*默认增加第一个切片的分布
注2:NTILE不支持ROWS BETWEEN
SELECT cookieid,createtime,pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS ntile1, --分组内将数据分成2片
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS ntile2, --分组内将数据分成3片
NTILE(4) OVER(PARTITION BY cookieid ORDER BY createtime) AS ntile3 --将所有数据分成4片
FROM test2;
cookieid | createtime | pv | ntile1 | ntile2 | ntile3 |
---|---|---|---|---|---|
a | 2017-12-1 | 3 | 1 | 1 | 1 |
b | 2017-12-00 | 3 | 1 | 1 | 1 |
cookie1 | 2017-12-10 | 1 | 1 | 1 | 1 |
cookie1 | 2017-12-11 | 5 | 1 | 1 | 1 |
cookie1 | 2017-12-12 | 7 | 1 | 1 | 2 |
cookie1 | 2017-12-13 | 3 | 1 | 2 | 2 |
cookie1 | 2017-12-14 | 2 | 2 | 2 | 3 |
cookie1 | 2017-12-15 | 4 | 2 | 3 | 3 |
cookie1 | 2017-12-16 | 4 | 2 | 3 | 4 |
cookie2 | 2017-12-12 | 7 | 1 | 1 | 1 |
cookie2 | 2017-12-16 | 6 | 1 | 2 | 2 |
cookie2 | 2017-12-24 | 1 | 2 | 3 | 3 |
cookie3 | 2017-12-22 | 5 | 1 | 1 | 1 |
用法举例:
统计一个cookie,pv数最多的前1/3的天:
SELECT cookieid,createtime,pv,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS ntile
FROM test2;
取 ntile = 1 的记录,就是我们想要的结果!
ROW_NUMBER 函数
ROW_NUMBER() 从1开始,按照顺序,生成分组内记录的序列
ROW_NUMBER() 的应用场景非常多,比如获取分组内排序第一的记录、获取一个session中的第一条refer等。
SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM test2;
cookieid | createtime | pv | rn |
---|---|---|---|
a | 2017-12-1 | 3 | 1 |
b | 2017-12-00 | 3 | 1 |
cookie1 | 2017-12-12 | 7 | 1 |
cookie1 | 2017-12-11 | 5 | 2 |
cookie1 | 2017-12-16 | 4 | 3 |
cookie1 | 2017-12-15 | 4 | 4 |
cookie1 | 2017-12-13 | 3 | 5 |
cookie1 | 2017-12-14 | 2 | 6 |
cookie1 | 2017-12-10 | 1 | 7 |
cookie2 | 2017-12-12 | 7 | 1 |
cookie2 | 2017-12-16 | 6 | 2 |
cookie2 | 2017-12-24 | 1 | 3 |
cookie3 | 2017-12-22 | 5 | 1 |
RANK 和 DENSE_RANK 函数
RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位
我们把 rank、dense_rank、row_number三者对比,这样比较清晰
SELECT cookieid,createtime,pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rank1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS d_rank2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM test2;
cookieid | createtime | pv | rank1 | d_rank2 | rn3 |
---|---|---|---|---|---|
a | 2017-12-1 | 3 | 1 | 1 | 1 |
b | 2017-12-00 | 3 | 1 | 1 | 1 |
cookie1 | 2017-12-12 | 7 | 1 | 1 | 1 |
cookie1 | 2017-12-11 | 5 | 2 | 2 | 2 |
cookie1 | 2017-12-16 | 4 | 3 | 3 | 3 |
cookie1 | 2017-12-15 | 4 | 3 | 3 | 4 |
cookie1 | 2017-12-13 | 3 | 5 | 4 | 5 |
cookie1 | 2017-12-14 | 2 | 6 | 5 | 6 |
cookie1 | 2017-12-10 | 1 | 7 | 6 | 7 |
cookie2 | 2017-12-12 | 7 | 1 | 1 | 1 |
cookie2 | 2017-12-16 | 6 | 2 | 2 | 2 |
cookie2 | 2017-12-24 | 1 | 3 | 3 | 3 |
cookie3 | 2017-12-22 | 5 | 1 | 1 | 1 |
CUME_DIST 函数
cume_dist 返回小于等于当前值的行数/分组内总行数
比如,我们可以统计小于等于当前薪水的人数,所占总人数的比例
SELECT cookieid,createtime,pv,
round(CUME_DIST() OVER(ORDER BY pv),2) AS cd1,
round(CUME_DIST() OVER(PARTITION BY cookieid ORDER BY pv),2) AS cd2
FROM test2;
cookieid | createtime | pv | cd1 | cd2 |
---|---|---|---|---|
a | 2017-12-1 | 3 | 0.46 | 1 |
b | 2017-12-00 | 3 | 0.46 | 1 |
cookie1 | 2017-12-10 | 1 | 0.15 | 0.14 |
cookie1 | 2017-12-14 | 2 | 0.23 | 0.29 |
cookie1 | 2017-12-13 | 3 | 0.46 | 0.43 |
cookie1 | 2017-12-15 | 4 | 0.62 | 0.71 |
cookie1 | 2017-12-16 | 4 | 0.62 | 0.71 |
cookie1 | 2017-12-11 | 5 | 0.77 | 0.86 |
cookie1 | 2017-12-12 | 7 | 1 | 1 |
cookie2 | 2017-12-24 | 1 | 0.15 | 0.33 |
cookie2 | 2017-12-16 | 6 | 0.85 | 0.67 |
cookie2 | 2017-12-12 | 7 | 1 | 1 |
cookie3 | 2017-12-22 | 5 | 0.77 | 1 |
注:cd1没有partition,所有数据均为1组!
PERCENT_RANK 函数
percent_rank 分组内当前行的RANK值-1/分组内总行数-1
注:一般不会用到该函数,可能在一些特殊算法的实现中可以用到吧
SELECT cookieid,createtime,pv,
PERCENT_RANK() OVER(ORDER BY pv) AS rn1
from test2;
cookieid | createtime | pv | rn1 |
---|---|---|---|
cookie2 | 2017-12-24 | 1 | 0 |
cookie1 | 2017-12-10 | 1 | 0 |
cookie1 | 2017-12-14 | 2 | 0.166666667 |
a | 2017-12-1 | 3 | 0.25 |
cookie1 | 2017-12-13 | 3 | 0.25 |
b | 2017-12-00 | 3 | 0.25 |
cookie1 | 2017-12-16 | 4 | 0.5 |
cookie1 | 2017-12-15 | 4 | 0.5 |
cookie3 | 2017-12-22 | 5 | 0.666666667 |
cookie1 | 2017-12-11 | 5 | 0.666666667 |
cookie2 | 2017-12-16 | 6 | 0.833333333 |
cookie2 | 2017-12-12 | 7 | 0.916666667 |
cookie1 | 2017-12-12 | 7 | 0.916666667 |
LAG 和 LEAD 函数
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01') OVER(PARTITION BY cookieid ORDER BY createtime) AS lag1,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS lag2
FROM test2;
cookieid | createtime | pv | rn | lag1 | lag2 |
---|---|---|---|---|---|
a | 2017-12-1 | 3 | 1 | 1970-1-1 | NULL |
b | 2017-12-00 | 3 | 1 | 1970-1-1 | NULL |
cookie1 | 2017-12-10 | 1 | 1 | 1970-1-1 | NULL |
cookie1 | 2017-12-11 | 5 | 2 | 2017-12-10 | NULL |
cookie1 | 2017-12-12 | 7 | 3 | 2017-12-11 | 2017-12-10 |
cookie1 | 2017-12-13 | 3 | 4 | 2017-12-12 | 2017-12-11 |
cookie1 | 2017-12-14 | 2 | 5 | 2017-12-13 | 2017-12-12 |
cookie1 | 2017-12-15 | 4 | 6 | 2017-12-14 | 2017-12-13 |
cookie1 | 2017-12-16 | 4 | 7 | 2017-12-15 | 2017-12-14 |
cookie2 | 2017-12-12 | 7 | 1 | 1970-1-1 | NULL |
cookie2 | 2017-12-16 | 6 | 2 | 2017-12-12 | NULL |
cookie2 | 2017-12-24 | 1 | 3 | 2017-12-16 | 2017-12-12 |
cookie3 | 2017-12-22 | 5 | 1 | 1970-1-1 | NULL |
LEAD 函数则与 LAG 相反:
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
FIRST_VALUE 和 LAST_VALUE 函数
FIRST_VALUE 取分组内排序后,截止到当前行,第一个值
SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS first
FROM test2;
cookieid | createtime | pv | rn | first |
---|---|---|---|---|
a | 2017-12-1 | 3 | 1 | 1900-1-3 |
b | 2017-12-00 | 3 | 1 | 1900-1-3 |
cookie1 | 2017-12-10 | 1 | 1 | 1900-1-1 |
cookie1 | 2017-12-11 | 5 | 2 | 1900-1-1 |
cookie1 | 2017-12-12 | 7 | 3 | 1900-1-1 |
cookie1 | 2017-12-13 | 3 | 4 | 1900-1-1 |
cookie1 | 2017-12-14 | 2 | 5 | 1900-1-1 |
cookie1 | 2017-12-15 | 4 | 6 | 1900-1-1 |
cookie1 | 2017-12-16 | 4 | 7 | 1900-1-1 |
cookie2 | 2017-12-12 | 7 | 1 | 1900-1-7 |
cookie2 | 2017-12-16 | 6 | 2 | 1900-1-7 |
cookie2 | 2017-12-24 | 1 | 3 | 1900-1-7 |
cookie3 | 2017-12-22 | 5 | 1 | 1900-1-5 |
LAST_VALUE 函数则相反:
LAST_VALUE 取分组内排序后,截止到当前行,最后一个值
案例实战
1.数据准备:name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
2.需求
(1)查询在2017年4月份购买过的顾客及总人数
(2)查询顾客的购买明细及月购买总额
(3)上述的场景,要将cost按照日期进行累加
(4)查询顾客上次的购买时间
(5)查询前20%时间的订单信息
3.创建本地business.txt,导入数据
[root@h1 datas]# vi business.txt
4.创建hive表并导入数据
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath '/root/datas/business.txt' into table business;
5.按需求查询数据
(1)查询在2017年4月份购买过的顾客及总人数
select name,count(*) over ()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;
(2)查询顾客的购买明细及月购买总额
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;
解析:sum(cost) over(partition by month(orderdate))
这就是每一行在sum运算时都启动了按月份的窗口,使得sum的运算范围始终在这个月内
(3)上述的场景,要将cost按照日期进行累加
select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from business;
(4)查看顾客上次的购买时间
select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2
from business;
(5)查询前20%时间的订单信息
select * from (
select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
from business
) t
where sorted = 1;