提示:文章写完后,目录可以自动生成,如何生成可参考右边的帮助文档
hive 窗口函数 实战
一、数据准备
1. 创建本地business.txt,导入数据 name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
2.创建hive表并导入数据
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;
二、需求
用步骤
1.查询在2017年4月份购买过的顾客及总人数
sql: select name,count(*) over()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;
------------------------------------
结果:
mart 2
jack 2
2. 查询顾客的购买明细及月购买总额
select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) )
from business
结果:
name orderdate cost sum_window_0
jack 2017-01-05 46 111
jack 2017-01-08 55 111
jack 2017-01-01 10 111
jack 2017-02-03 23 23
jack 2017-04-06 42 42
mart 2017-04-13 94 299
mart 2017-04-11 75 299
mart 2017-04-09 68 299
mart 2017-04-08 62 299
neil 2017-05-10 12 12
neil 2017-06-12 80 80
tony 2017-01-04 29 94
tony 2017-01-02 15 94
tony 2017-01-07 50 94
3.查询顾客的购买明细要将cost按照日期进行累加
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate )
from business
name orderdate cost sum_window_0
jack 2017-01-01 10 10
jack 2017-01-05 46 56
jack 2017-01-08 55 111
jack 2017-02-03 23 134
jack 2017-04-06 42 176
mart 2017-04-08 62 62
mart 2017-04-09 68 130
mart 2017-04-11 75 205
mart 2017-04-13 94 299
neil 2017-05-10 12 12
neil 2017-06-12 80 92
tony 2017-01-02 15 15
tony 2017-01-04 29 44
tony 2017-01-07 50 94
4.查询顾客的购买明细及顾客上次的购买时间
select name,orderdate,cost,lag(orderdate,1,‘无数据’) over(partition by name order by orderdate )
from business
name orderdate cost lag_window_0
jack 2017-01-01 10 无数据
jack 2017-01-05 46 2017-01-01
jack 2017-01-08 55 2017-01-05
jack 2017-02-03 23 2017-01-08
jack 2017-04-06 42 2017-02-03
mart 2017-04-08 62 无数据
mart 2017-04-09 68 2017-04-08
mart 2017-04-11 75 2017-04-09
mart 2017-04-13 94 2017-04-11
neil 2017-05-10 12 无数据
neil 2017-06-12 80 2017-05-10
tony 2017-01-02 15 无数据
tony 2017-01-04 29 2017-01-02
tony 2017-01-07 50 2017-01-04
5.查询顾客的购买明细及顾客下次的购买时间
select name,orderdate,cost,lead(orderdate,1,‘无数据’) over(partition by name order by orderdate )
from business
6. 查询顾客的购买明细及顾客本月第一次购买的时间
select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate )
from business
7.查询顾客的购买明细及顾客本月最后一次购买的时间
select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT row and UNBOUNDED FOLLOWING)
from business
8.查询顾客的购买明细及顾客最近三次cost花费
最近三次: 当前和之前两次 或 当前+前一次+后一次
当前和之前两次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT row)
from business
当前+前一次+后一次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
from business
或
select name,orderdate,cost,cost+
lag(cost,1,0) over(partition by name order by orderdate )+
lead(cost,1,0) over(partition by name order by orderdate )
from business
9. 查询前20%时间的订单信息
select *
from
(select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
from business) tmp
where cdnum<=0.2
10.sum汇总
select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from business;
执行结果:
hive (default)> select * from business;
OK
business.name business.orderdate business.cost
jack 2017-01-01 10
tony 2017-01-02 15
jack 2017-02-03 23
tony 2017-01-04 29
jack 2017-01-05 46
jack 2017-04-06 42
tony 2017-01-07 50
jack 2017-01-08 55
mart 2017-04-08 62
mart 2017-04-09 68
neil 2017-05-10 12
mart 2017-04-11 75
neil 2017-06-12 80
mart 2017-04-13 94
name orderdate c
ost sample1 sample2 sample3 sample4 sample5 sample6 sample7
jack 2017-01-01 10 661 176 10 10 10 56 176
jack 2017-01-05 46 661 176 56 56 56 111 166
jack 2017-01-08 55 661 176 111 111 101 124 120
jack 2017-02-03 23 661 176 134 134 78 120 65
jack 2017-04-06 42 661 176 176 176 65 65 42
mart 2017-04-08 62 661 299 62 62 62 130 299
mart 2017-04-09 68 661 299 130 130 130 205 237
mart 2017-04-11 75 661 299 205 205 143 237 169
mart 2017-04-13 94 661 299 299 299 169 169 94
neil 2017-05-10 12 661 92 12 12 12 92 92
neil 2017-06-12 80 661 92 92 92 92 92 80
tony 2017-01-02 15 661 94 15 15 15 44 94
tony 2017-01-04 29 661 94 44 44 44 94 79
tony 2017-01-07 50 661 94 94 94 79 79 50
Time taken: 94.075 seconds, Fetched: 14 row(s)
三、 窗口函数说明
官方文档地址:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
oracle,sqlserver都提供了窗口函数,但是在mysql5.5和5.6都没有提供窗口函数!
窗口函数: 窗口+函数
窗口: 函数运行时计算的数据集的范围
函数: 运行的函数!
仅仅支持以下函数:
Windowing functions:
LEAD:
LEAD (scalar_expression [,offset] [,default]): 返回当前行以下N行的指定列的列值!
如果找不到,就采用默认值
LAG:
LAG (scalar_expression [,offset] [,default]): 返回当前行以上N行的指定列的列值!
如果找不到,就采用默认值
FIRST_VALUE:
FIRST_VALUE(列名,[false(默认)]): 返回当前窗口指定列的第一个值,
第二个参数如果为true,代表加入第一个值为null,跳过空值,继续寻找!
LAST_VALUE:
LAST_VALUE(列名,[false(默认)]): 返回当前窗口指定列的最后一个值,
第二个参数如果为true,代表加入第一个值为null,跳过空值,继续寻找!
统计类的函数(一般都需要结合over使用): min,max,avg,sum,count
排名分析:
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE
注意:不是所有的函数在运行都是可以通过改变窗口的大小,来控制计算的数据集的范围!
所有的排名函数和LAG,LEAD,支持使用over(),但是在over()中不能定义 window_clause
格式: 函数 over( partition by 字段 ,order by 字段 window_clause )
窗口的大小可以通过windows_clause来指定:
(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between current row and (current row | (unbounded | [num]) following)
(rows | range) between [num] following and (unbounded | [num]) following
特殊情况: ①在over()中既没有出现windows_clause,也没有出现order by,窗口默认为rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
②在over()中(没有出现windows_clause),指定了order by,窗口默认为rows between UNBOUNDED PRECEDING and CURRENT ROW
窗口函数和分组有什么区别
- 如果是分组操作,select后只能写分组后的字段
- 如果是窗口函数,窗口函数是在指定的窗口内,对每条记录都执行一次函数
- 如果是分组操作,有去重效果,而partition不去重!