hive 窗口函数　实战

最新推荐文章于 2024-09-22 19:45:45 发布

*星星之火*

最新推荐文章于 2024-09-22 19:45:45 发布

阅读量190

点赞数

分类专栏： hive 文章标签： Hive 窗口函数数据统计分组操作购买分析

本文链接：https://blog.csdn.net/spark_dev/article/details/116200661

版权

hive 专栏收录该内容

8 篇文章 0 订阅

订阅专栏

提示：文章写完后，目录可以自动生成，如何生成可参考右边的帮助文档

hive 窗口函数　实战

一、数据准备
- 1. 创建本地business.txt，导入数据 name，orderdate，cost
- 2.创建hive表并导入数据
二、需求
三、窗口函数说明
窗口函数和分组有什么区别

一、数据准备

1. 创建本地business.txt，导入数据 name，orderdate，cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

2.创建hive表并导入数据

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/datas/business.txt" into table business;

二、需求

用步骤

1.查询在2017年4月份购买过的顾客及总人数

sql: select name,count(*) over() 
from business 
where substring(orderdate,1,7) = '2017-04' 
group by name;
------------------------------------
结果：
mart	2
jack	2

2. 查询顾客的购买明细及月购买总额

select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) )
from business
结果：

name	orderdate	cost	sum_window_0
jack	2017-01-05	46	111
jack	2017-01-08	55	111
jack	2017-01-01	10	111
jack	2017-02-03	23	23
jack	2017-04-06	42	42
mart	2017-04-13	94	299
mart	2017-04-11	75	299
mart	2017-04-09	68	299
mart	2017-04-08	62	299
neil	2017-05-10	12	12
neil	2017-06-12	80	80
tony	2017-01-04	29	94
tony	2017-01-02	15	94
tony	2017-01-07	50	94

3.查询顾客的购买明细要将cost按照日期进行累加

select name,orderdate,cost,sum(cost) over(partition by name order by orderdate )
from business

name	orderdate	cost	sum_window_0
jack	2017-01-01	10	10
jack	2017-01-05	46	56
jack	2017-01-08	55	111
jack	2017-02-03	23	134
jack	2017-04-06	42	176
mart	2017-04-08	62	62
mart	2017-04-09	68	130
mart	2017-04-11	75	205
mart	2017-04-13	94	299
neil	2017-05-10	12	12
neil	2017-06-12	80	92
tony	2017-01-02	15	15
tony	2017-01-04	29	44
tony	2017-01-07	50	94

4.查询顾客的购买明细及顾客上次的购买时间

select name,orderdate,cost,lag(orderdate,1,‘无数据’) over(partition by name order by orderdate )
from business

name	orderdate	cost	lag_window_0
jack	2017-01-01	10	无数据
jack	2017-01-05	46	2017-01-01
jack	2017-01-08	55	2017-01-05
jack	2017-02-03	23	2017-01-08
jack	2017-04-06	42	2017-02-03
mart	2017-04-08	62	无数据
mart	2017-04-09	68	2017-04-08
mart	2017-04-11	75	2017-04-09
mart	2017-04-13	94	2017-04-11
neil	2017-05-10	12	无数据
neil	2017-06-12	80	2017-05-10
tony	2017-01-02	15	无数据
tony	2017-01-04	29	2017-01-02
tony	2017-01-07	50	2017-01-04

5.查询顾客的购买明细及顾客下次的购买时间

select name,orderdate,cost,lead(orderdate,1,‘无数据’) over(partition by name order by orderdate )
from business

6. 查询顾客的购买明细及顾客本月第一次购买的时间

select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate )
from business

7.查询顾客的购买明细及顾客本月最后一次购买的时间

select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT row and UNBOUNDED FOLLOWING)
from business

８.查询顾客的购买明细及顾客最近三次cost花费

最近三次：当前和之前两次或当前+前一次+后一次

当前和之前两次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT row)
from business

当前+前一次+后一次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
from business

或
select name,orderdate,cost,cost+
lag(cost,1,0) over(partition by name order by orderdate )+
lead(cost,1,0) over(partition by name order by orderdate )
from business

9. 查询前20%时间的订单信息

select *
from
(select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
from business) tmp
where cdnum<=0.2

1０.ｓｕｍ汇总

select name,orderdate,cost, 
sum(cost) over() as sample1,--所有行相加 
sum(cost) over(partition by name) as sample2,--按name分组，组内数据相加 
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组，组内数据累加 
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 
from business;

执行结果：


hive (default)> select * from business;
OK
business.name	business.orderdate	business.cost
jack	2017-01-01	10
tony	2017-01-02	15
jack	2017-02-03	23
tony	2017-01-04	29
jack	2017-01-05	46
jack	2017-04-06	42
tony	2017-01-07	50
jack	2017-01-08	55
mart	2017-04-08	62
mart	2017-04-09	68
neil	2017-05-10	12
mart	2017-04-11	75
neil	2017-06-12	80
mart	2017-04-13	94

name	orderdate	c
ost	sample1	sample2	sample3	sample4	sample5	sample6	sample7
jack	2017-01-01	10	661	176	10	10	10	56	176
jack	2017-01-05	46	661	176	56	56	56	111	166
jack	2017-01-08	55	661	176	111	111	101	124	120
jack	2017-02-03	23	661	176	134	134	78	120	65
jack	2017-04-06	42	661	176	176	176	65	65	42
mart	2017-04-08	62	661	299	62	62	62	130	299
mart	2017-04-09	68	661	299	130	130	130	205	237
mart	2017-04-11	75	661	299	205	205	143	237	169
mart	2017-04-13	94	661	299	299	299	169	169	94
neil	2017-05-10	12	661	92	12	12	12	92	92
neil	2017-06-12	80	661	92	92	92	92	92	80
tony	2017-01-02	15	661	94	15	15	15	44	94
tony	2017-01-04	29	661	94	44	44	44	94	79
tony	2017-01-07	50	661	94	94	94	79	79	50
Time taken: 94.075 seconds, Fetched: 14 row(s)

三、窗口函数说明

官方文档地址：https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
		oracle,sqlserver都提供了窗口函数，但是在mysql5.5和5.6都没有提供窗口函数！
		
		窗口函数：  窗口+函数
			窗口： 函数运行时计算的数据集的范围
			函数： 运行的函数！
				仅仅支持以下函数：
					Windowing functions：
							LEAD:
									LEAD (scalar_expression [,offset] [,default])： 返回当前行以下N行的指定列的列值！
											如果找不到，就采用默认值
							LAG:
									LAG (scalar_expression [,offset] [,default])： 返回当前行以上N行的指定列的列值！
											如果找不到，就采用默认值
							FIRST_VALUE:
									FIRST_VALUE(列名,[false(默认)])：  返回当前窗口指定列的第一个值，
											第二个参数如果为true,代表加入第一个值为null，跳过空值，继续寻找！
							LAST_VALUE:
									LAST_VALUE(列名,[false(默认)])：  返回当前窗口指定列的最后一个值，
											第二个参数如果为true,代表加入第一个值为null，跳过空值，继续寻找！
					统计类的函数(一般都需要结合over使用)： min,max,avg,sum,count
					排名分析：  
							RANK
							ROW_NUMBER
							DENSE_RANK
							CUME_DIST
							PERCENT_RANK
							NTILE
			
			注意：不是所有的函数在运行都是可以通过改变窗口的大小，来控制计算的数据集的范围！
					所有的排名函数和LAG,LEAD，支持使用over()，但是在over()中不能定义 window_clause
			
			格式：   函数   over( partition by 字段 ,order by 字段  window_clause )
			
			
			窗口的大小可以通过windows_clause来指定：
				(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
				(rows | range) between current row and (current row | (unbounded | [num]) following)
				(rows | range) between [num] following and (unbounded | [num]) following
				
				特殊情况： ①在over()中既没有出现windows_clause，也没有出现order by，窗口默认为rows between UNBOUNDED  PRECEDING and UNBOUNDED  FOLLOWING
						   ②在over()中(没有出现windows_clause)，指定了order by，窗口默认为rows between UNBOUNDED  PRECEDING and CURRENT ROW