hive 窗口函数 实战

提示:文章写完后,目录可以自动生成,如何生成可参考右边的帮助文档


一、数据准备

1. 创建本地business.txt,导入数据 name,orderdate,cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

2.创建hive表并导入数据

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/datas/business.txt" into table business;

二、需求

用步骤

1.查询在2017年4月份购买过的顾客及总人数

sql: select name,count(*) over() 
from business 
where substring(orderdate,1,7) = '2017-04' 
group by name;
------------------------------------
结果:
mart	2
jack	2

2. 查询顾客的购买明细及月购买总额

select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) )
from business
结果:

name	orderdate	cost	sum_window_0
jack	2017-01-05	46	111
jack	2017-01-08	55	111
jack	2017-01-01	10	111
jack	2017-02-03	23	23
jack	2017-04-06	42	42
mart	2017-04-13	94	299
mart	2017-04-11	75	299
mart	2017-04-09	68	299
mart	2017-04-08	62	299
neil	2017-05-10	12	12
neil	2017-06-12	80	80
tony	2017-01-04	29	94
tony	2017-01-02	15	94
tony	2017-01-07	50	94

3.查询顾客的购买明细要将cost按照日期进行累加

select name,orderdate,cost,sum(cost) over(partition by name order by orderdate )
from business

name	orderdate	cost	sum_window_0
jack	2017-01-01	10	10
jack	2017-01-05	46	56
jack	2017-01-08	55	111
jack	2017-02-03	23	134
jack	2017-04-06	42	176
mart	2017-04-08	62	62
mart	2017-04-09	68	130
mart	2017-04-11	75	205
mart	2017-04-13	94	299
neil	2017-05-10	12	12
neil	2017-06-12	80	92
tony	2017-01-02	15	15
tony	2017-01-04	29	44
tony	2017-01-07	50	94

4.查询顾客的购买明细及顾客上次的购买时间

select name,orderdate,cost,lag(orderdate,1,‘无数据’) over(partition by name order by orderdate )
from business

name	orderdate	cost	lag_window_0
jack	2017-01-01	10	无数据
jack	2017-01-05	46	2017-01-01
jack	2017-01-08	55	2017-01-05
jack	2017-02-03	23	2017-01-08
jack	2017-04-06	42	2017-02-03
mart	2017-04-08	62	无数据
mart	2017-04-09	68	2017-04-08
mart	2017-04-11	75	2017-04-09
mart	2017-04-13	94	2017-04-11
neil	2017-05-10	12	无数据
neil	2017-06-12	80	2017-05-10
tony	2017-01-02	15	无数据
tony	2017-01-04	29	2017-01-02
tony	2017-01-07	50	2017-01-04

5.查询顾客的购买明细及顾客下次的购买时间

select name,orderdate,cost,lead(orderdate,1,‘无数据’) over(partition by name order by orderdate )
from business

6. 查询顾客的购买明细及顾客本月第一次购买的时间

select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate )
from business

7.查询顾客的购买明细及顾客本月最后一次购买的时间

select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT row and UNBOUNDED FOLLOWING)
from business

8.查询顾客的购买明细及顾客最近三次cost花费

最近三次: 当前和之前两次 或 当前+前一次+后一次

当前和之前两次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT row)
from business

当前+前一次+后一次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
from business


select name,orderdate,cost,cost+
lag(cost,1,0) over(partition by name order by orderdate )+
lead(cost,1,0) over(partition by name order by orderdate )
from business

9. 查询前20%时间的订单信息

select *
from
(select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
from business) tmp
where cdnum<=0.2

10.sum汇总

select name,orderdate,cost, 
sum(cost) over() as sample1,--所有行相加 
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加 
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加 
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 
from business;

执行结果:


hive (default)> select * from business;
OK
business.name	business.orderdate	business.cost
jack	2017-01-01	10
tony	2017-01-02	15
jack	2017-02-03	23
tony	2017-01-04	29
jack	2017-01-05	46
jack	2017-04-06	42
tony	2017-01-07	50
jack	2017-01-08	55
mart	2017-04-08	62
mart	2017-04-09	68
neil	2017-05-10	12
mart	2017-04-11	75
neil	2017-06-12	80
mart	2017-04-13	94

name	orderdate	c
ost	sample1	sample2	sample3	sample4	sample5	sample6	sample7
jack	2017-01-01	10	661	176	10	10	10	56	176
jack	2017-01-05	46	661	176	56	56	56	111	166
jack	2017-01-08	55	661	176	111	111	101	124	120
jack	2017-02-03	23	661	176	134	134	78	120	65
jack	2017-04-06	42	661	176	176	176	65	65	42
mart	2017-04-08	62	661	299	62	62	62	130	299
mart	2017-04-09	68	661	299	130	130	130	205	237
mart	2017-04-11	75	661	299	205	205	143	237	169
mart	2017-04-13	94	661	299	299	299	169	169	94
neil	2017-05-10	12	661	92	12	12	12	92	92
neil	2017-06-12	80	661	92	92	92	92	92	80
tony	2017-01-02	15	661	94	15	15	15	44	94
tony	2017-01-04	29	661	94	44	44	44	94	79
tony	2017-01-07	50	661	94	94	94	79	79	50
Time taken: 94.075 seconds, Fetched: 14 row(s)

三、 窗口函数说明

官方文档地址:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
		oracle,sqlserver都提供了窗口函数,但是在mysql5.5和5.6都没有提供窗口函数!
		
		窗口函数:  窗口+函数
			窗口: 函数运行时计算的数据集的范围
			函数: 运行的函数!
				仅仅支持以下函数:
					Windowing functions:
							LEAD:
									LEAD (scalar_expression [,offset] [,default]): 返回当前行以下N行的指定列的列值!
											如果找不到,就采用默认值
							LAG:
									LAG (scalar_expression [,offset] [,default]): 返回当前行以上N行的指定列的列值!
											如果找不到,就采用默认值
							FIRST_VALUE:
									FIRST_VALUE(列名,[false(默认)]):  返回当前窗口指定列的第一个值,
											第二个参数如果为true,代表加入第一个值为null,跳过空值,继续寻找!
							LAST_VALUE:
									LAST_VALUE(列名,[false(默认)]):  返回当前窗口指定列的最后一个值,
											第二个参数如果为true,代表加入第一个值为null,跳过空值,继续寻找!
					统计类的函数(一般都需要结合over使用): min,max,avg,sum,count
					排名分析:  
							RANK
							ROW_NUMBER
							DENSE_RANK
							CUME_DIST
							PERCENT_RANK
							NTILE
			
			注意:不是所有的函数在运行都是可以通过改变窗口的大小,来控制计算的数据集的范围!
					所有的排名函数和LAG,LEAD,支持使用over(),但是在over()中不能定义 window_clause
			
			格式:   函数   over( partition by 字段 ,order by 字段  window_clause )
			
			
			窗口的大小可以通过windows_clause来指定:
				(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
				(rows | range) between current row and (current row | (unbounded | [num]) following)
				(rows | range) between [num] following and (unbounded | [num]) following
				
				特殊情况: ①在over()中既没有出现windows_clause,也没有出现order by,窗口默认为rows between UNBOUNDED  PRECEDING and UNBOUNDED  FOLLOWING
						   ②在over()中(没有出现windows_clause),指定了order by,窗口默认为rows between UNBOUNDED  PRECEDING and CURRENT ROW

窗口函数和分组有什么区别

  1. 如果是分组操作,select后只能写分组后的字段
  2. 如果是窗口函数,窗口函数是在指定的窗口内,对每条记录都执行一次函数
  3. 如果是分组操作,有去重效果,而partition不去重!
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值