Hive 窗口函数

窗口函数

背景

平常我们使用 hive或者 mysql时,一般聚合函数用的比较多。但对于某些偏分析的需求,group by可能很费力,子查询很多,这个时候就需要使用窗口分析函数了

相关函数说明

OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化

CURRENT ROW:当前行

n PRECEDING:往前n行数据

n FOLLOWING:往后n行数据

UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点

over函数写法

over(partition by cookieid order by createtime) 先根据cookieid字段分区,相同的cookieid分为一区,每个分区内根据createtime字段排序(默认升序)

注:不加 partition by 的话则把整个数据集当作一个分区,不加 order by的话会对某些函数统计结果产生影响,如sum()

数据准备

a 2017-12-01 3
b 2017-12-00 3
cookie1 2017-12-10 1
cookie1 2017-12-11 5
cookie1 2017-12-12 7
cookie1 2017-12-13 3
cookie1 2017-12-14 2
cookie1 2017-12-15 4
cookie1 2017-12-16 4
cookie2 2017-12-12 7
cookie2 2017-12-16 6
cookie2 2017-12-24 1
cookie3 2017-12-22 5

建表导数据

create table test2(cookieid string,createtime string,pv int) 
row format delimited fields terminated by ' ';

load data local inpath '/root/datas/data.txt' into table test2;

窗口含义

SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3,   --当前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4,    --当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5   ---当前行+往后所有行  
FROM test2;
cookieidcreatetimepvpv1pv2pv3pv4pv5
a2017-12-1333333
b2017-12-00333333
cookie12017-12-101111626
cookie12017-12-1156661325
cookie12017-12-1271313131620
cookie12017-12-1331616161813
cookie12017-12-1421818172110
cookie12017-12-154222216208
cookie12017-12-164262613134
cookie22017-12-1277771314
cookie22017-12-166131313147
cookie22017-12-241141414141
cookie32017-12-22555555

注:这些窗口的划分都是在分区内部!超过分区大小就无效了

SUM 函数

select cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid ORDER BY createtime) as pv1 
FROM test2;

可以看到各分区内的时间升序排列

cookieidcreatetimepvpv1
a2017-12-133
b2017-12-0033
cookie12017-12-1011
cookie12017-12-1156
cookie12017-12-12713
cookie12017-12-13316
cookie12017-12-14218
cookie12017-12-15422
cookie12017-12-16426
cookie22017-12-1277
cookie22017-12-16613
cookie22017-12-24114
cookie32017-12-2255

我们可以清晰地看到,窗口函数和聚合函数的不同,sum()函数可以根据每一行的窗口返回各自行对应的值,有多少行记录就有多少个sum值,而group by只能计算每一组的sum,每组只有一个值!

其中sum()计算的是分区内排序后一个个叠加的值,和order by有关!

如果不加 order by会咋样:

select cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid) as pv1 
FROM test2;

可以看到,如果没有order by,不仅分区内没有排序,sum()计算的pv也是整个分区的pv

cookieidcreatetimepvpv1
a2017-12-133
b2017-12-0033
cookie12017-12-14226
cookie12017-12-16426
cookie12017-12-15426
cookie12017-12-13326
cookie12017-12-12726
cookie12017-12-11526
cookie12017-12-10126
cookie22017-12-16614
cookie22017-12-12714
cookie22017-12-24114
cookie32017-12-2255

NTILE 函数

NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值

*注1:如果切片不均匀,*默认增加第一个切片的分布

注2:NTILE不支持ROWS BETWEEN

SELECT cookieid,createtime,pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS ntile1,	--分组内将数据分成2片
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS ntile2,  --分组内将数据分成3片
NTILE(4) OVER(PARTITION BY cookieid ORDER BY createtime) AS ntile3   --将所有数据分成4片
FROM test2;
cookieidcreatetimepvntile1ntile2ntile3
a2017-12-13111
b2017-12-003111
cookie12017-12-101111
cookie12017-12-115111
cookie12017-12-127112
cookie12017-12-133122
cookie12017-12-142223
cookie12017-12-154233
cookie12017-12-164234
cookie22017-12-127111
cookie22017-12-166122
cookie22017-12-241233
cookie32017-12-225111

用法举例:

统计一个cookie,pv数最多的前1/3的天:

SELECT cookieid,createtime,pv,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS ntile 
FROM test2;

取 ntile = 1 的记录,就是我们想要的结果!

ROW_NUMBER 函数

ROW_NUMBER() 从1开始,按照顺序,生成分组内记录的序列

ROW_NUMBER() 的应用场景非常多,比如获取分组内排序第一的记录、获取一个session中的第一条refer等。

SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn  
FROM test2;
cookieidcreatetimepvrn
a2017-12-131
b2017-12-0031
cookie12017-12-1271
cookie12017-12-1152
cookie12017-12-1643
cookie12017-12-1544
cookie12017-12-1335
cookie12017-12-1426
cookie12017-12-1017
cookie22017-12-1271
cookie22017-12-1662
cookie22017-12-2413
cookie32017-12-2251

RANK 和 DENSE_RANK 函数

RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位

DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位

我们把 rankdense_rankrow_number三者对比,这样比较清晰

SELECT cookieid,createtime,pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rank1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS d_rank2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM test2; 

cookieidcreatetimepvrank1d_rank2rn3
a2017-12-13111
b2017-12-003111
cookie12017-12-127111
cookie12017-12-115222
cookie12017-12-164333
cookie12017-12-154334
cookie12017-12-133545
cookie12017-12-142656
cookie12017-12-101767
cookie22017-12-127111
cookie22017-12-166222
cookie22017-12-241333
cookie32017-12-225111

CUME_DIST 函数

cume_dist 返回小于等于当前值的行数/分组内总行数

比如,我们可以统计小于等于当前薪水的人数,所占总人数的比例

SELECT cookieid,createtime,pv,
round(CUME_DIST() OVER(ORDER BY pv),2) AS cd1,
round(CUME_DIST() OVER(PARTITION BY cookieid ORDER BY pv),2) AS cd2  
FROM test2;

cookieidcreatetimepvcd1cd2
a2017-12-130.461
b2017-12-0030.461
cookie12017-12-1010.150.14
cookie12017-12-1420.230.29
cookie12017-12-1330.460.43
cookie12017-12-1540.620.71
cookie12017-12-1640.620.71
cookie12017-12-1150.770.86
cookie12017-12-12711
cookie22017-12-2410.150.33
cookie22017-12-1660.850.67
cookie22017-12-12711
cookie32017-12-2250.771

注:cd1没有partition,所有数据均为1组!

PERCENT_RANK 函数

percent_rank 分组内当前行的RANK值-1/分组内总行数-1

注:一般不会用到该函数,可能在一些特殊算法的实现中可以用到吧

SELECT  cookieid,createtime,pv,
PERCENT_RANK() OVER(ORDER BY pv) AS rn1 
from test2;

cookieidcreatetimepvrn1
cookie22017-12-2410
cookie12017-12-1010
cookie12017-12-1420.166666667
a2017-12-130.25
cookie12017-12-1330.25
b2017-12-0030.25
cookie12017-12-1640.5
cookie12017-12-1540.5
cookie32017-12-2250.666666667
cookie12017-12-1150.666666667
cookie22017-12-1660.833333333
cookie22017-12-1270.916666667
cookie12017-12-1270.916666667

LAG 和 LEAD 函数

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01') OVER(PARTITION BY cookieid ORDER BY createtime) AS lag1,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS lag2 
FROM test2;

cookieidcreatetimepvrnlag1lag2
a2017-12-1311970-1-1NULL
b2017-12-00311970-1-1NULL
cookie12017-12-10111970-1-1NULL
cookie12017-12-11522017-12-10NULL
cookie12017-12-12732017-12-112017-12-10
cookie12017-12-13342017-12-122017-12-11
cookie12017-12-14252017-12-132017-12-12
cookie12017-12-15462017-12-142017-12-13
cookie12017-12-16472017-12-152017-12-14
cookie22017-12-12711970-1-1NULL
cookie22017-12-16622017-12-12NULL
cookie22017-12-24132017-12-162017-12-12
cookie32017-12-22511970-1-1NULL

LEAD 函数则与 LAG 相反:

LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值

第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

FIRST_VALUE 和 LAST_VALUE 函数

FIRST_VALUE 取分组内排序后,截止到当前行,第一个值

SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS first  
FROM test2;

cookieidcreatetimepvrnfirst
a2017-12-1311900-1-3
b2017-12-00311900-1-3
cookie12017-12-10111900-1-1
cookie12017-12-11521900-1-1
cookie12017-12-12731900-1-1
cookie12017-12-13341900-1-1
cookie12017-12-14251900-1-1
cookie12017-12-15461900-1-1
cookie12017-12-16471900-1-1
cookie22017-12-12711900-1-7
cookie22017-12-16621900-1-7
cookie22017-12-24131900-1-7
cookie32017-12-22511900-1-5

LAST_VALUE 函数则相反:

LAST_VALUE 取分组内排序后,截止到当前行,最后一个值

案例实战

1.数据准备:name,orderdate,cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

2.需求

(1)查询在2017年4月份购买过的顾客及总人数

(2)查询顾客的购买明细及月购买总额

(3)上述的场景,要将cost按照日期进行累加

(4)查询顾客上次的购买时间

(5)查询前20%时间的订单信息

3.创建本地business.txt,导入数据

[root@h1 datas]# vi business.txt 

4.创建hive表并导入数据

create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
 
load data local inpath '/root/datas/business.txt' into table business;

5.按需求查询数据

(1)查询在2017年4月份购买过的顾客及总人数

select name,count(*) over ()
from business 
where substring(orderdate,1,7) = '2017-04'
group by name;

(2)查询顾客的购买明细及月购买总额

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;

解析:sum(cost) over(partition by month(orderdate))
这就是每一行在sum运算时都启动了按月份的窗口,使得sum的运算范围始终在这个月内

(3)上述的场景,要将cost按照日期进行累加

select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from business;

(4)查看顾客上次的购买时间

select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2
from business;

(5)查询前20%时间的订单信息

select * from (
    select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
    from business
) t
where sorted = 1;

已标记关键词 清除标记
一、关系运算: 4 1. 等值比较: = 4 2. 不等值比较: <> 4 3. 小于比较: < 4 4. 小于等于比较: <= 4 5. 大于比较: > 5 6. 大于等于比较: >= 5 7. 空值判断: IS NULL 5 8. 非空判断: IS NOT NULL 6 9. LIKE比较: LIKE 6 10. JAVA的LIKE操作: RLIKE 6 11. REGEXP操作: REGEXP 7 二、数学运算: 7 1. 加法操作: + 7 2. 减法操作: - 7 3. 乘法操作: * 8 4. 除法操作: / 8 5. 取余操作: % 8 6. 位与操作: & 9 7. 位或操作: | 9 8. 位异或操作: ^ 9 9.位取反操作: ~ 10 三、逻辑运算: 10 1. 逻辑与操作: AND 10 2. 逻辑或操作: OR 10 3. 逻辑非操作: NOT 10 四、数值计算 11 1. 取整函数: round 11 2. 指定精度取整函数: round 11 3. 向下取整函数: floor 11 4. 向上取整函数: ceil 12 5. 向上取整函数: ceiling 12 6. 取随机数函数: rand 12 7. 自然指数函数: exp 13 8. 以10为底对数函数: log10 13 9. 以2为底对数函数: log2 13 10. 对数函数: log 13 11. 幂运算函数: pow 14 12. 幂运算函数: power 14 13. 开平方函数: sqrt 14 14. 二进制函数: bin 14 15. 十六进制函数: hex 15 16. 反转十六进制函数: unhex 15 17. 进制转换函数: conv 15 18. 绝对值函数: abs 16 19. 正取余函数: pmod 16 20. 正弦函数: sin 16 21. 反正弦函数: asin 16 22. 余弦函数: cos 17 23. 反余弦函数: acos 17 24. positive函数: positive 17 25. negative函数: negative 17 五、日期函数 18 1. UNIX时间戳转日期函数: from_unixtime 18 2. 获取当前UNIX时间戳函数: unix_timestamp 18 3. 日期转UNIX时间戳函数: unix_timestamp 18 4. 指定格式日期转UNIX时间戳函数: unix_timestamp 18 5. 日期时间转日期函数: to_date 19 6. 日期转年函数: year 19 7. 日期转月函数: month 19 8. 日期转天函数: day 19 9. 日期转小时函数: hour 20 10. 日期转分钟函数: minute 20 11. 日期转秒函数: second 20 12. 日期转周函数: weekofyear 20 13. 日期比较函数: datediff 21 14. 日期增加函数: date_add 21 15. 日期减少函数: date_sub 21 六、条件函数 21 1. If函数: if 21 2. 非空查找函数: COALESCE 22 3. 条件判断函数:CASE 22 4. 条件判断函数:CASE 22 七、字符串函数 23 1. 字符串长度函数:length 23 2. 字符串反转函数:reverse 23 3. 字符串连接函数:concat 23 4. 带分隔符字符串连接函数:concat_ws 23 5. 字符串截取函数:substr,substring 24 6. 字符串截取函数:substr,substring 24 7. 字符串转大写函数:upper,ucase 24 8. 字符串转小写函数:lower,lcase 25 9. 去空格函数:trim 25 10. 左边去空格函数:ltrim 25 11. 右边去空格函数:rtrim 25 12. 正则表达式替换函数:regexp_replace 26 13. 正则表达式解析函数:regexp_extract 26 14. URL解析函数:parse_url 26 15. json解析函数:get_json_object 27 16. 空格字符串函数:space 27 17. 重复字符串函数:repeat 27 18. 首字符ascii函数:ascii 28 19. 左补足函数:lpad 28 20. 右补足函数:rpad 28 21. 分割字符串函数: split 28 22. 集合查找函数: find_in_set 29 八、集合统计函数 29 1. 个数统计函数: count 29 2. 总和统计函数: sum 29 3. 平均值统计函数:
相关推荐
©️2020 CSDN 皮肤主题: 大白 设计师:CSDN官方博客 返回首页