Hive Window Functions

Latest recommended article published on 2024-06-04 14:39:18

TT15751097576

Category: Big Data Fundamentals. Tags: Hive window functions

Article link: https://blog.csdn.net/tt15751097576/article/details/102829997

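Window functions:

A window function computes a value for each row over a window of rows defined by OVER(). If OVER() contains no constraints, the window is the entire table.

over(): specifies the window of rows that the analytic function operates on; the window can change from row to row.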

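current row: the current row. UNBOUNDED PRECEDING means the window starts at the first row of the partition; UNBOUNDED FOLLOWING means it ends at the last row of the partition.

n preceding: the n rows before the current row

n following: the n rows after the current row

unbounded: no limit in that direction (the start of the partition with PRECEDING, the end of the partition with FOLLOWING)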

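lag(col, n): the value of col from the n-th row before the current row

lead(col, n): the value of col from the n-th row after the current row

ntile(n): distributes the rows of an ordered partition into n groups, numbered starting from 1; for each row, ntile returns the number of the group that row belongs to.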
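As a rough sketch of how these frame keywords combine, the query below computes several sums side by side. It assumes the business table (name, orderdate, cost) that this article creates in the next step; the column aliases are illustrative and not part of the original.

```sql
-- Assumes the business(name, orderdate, cost) table from this article.
-- Aliases (total_all, running_by_name, ...) are made up for illustration.
select name, orderdate, cost,
       -- no constraint: the window is the whole table
       sum(cost) over() as total_all,
       -- one window per customer
       sum(cost) over(partition by name) as total_by_name,
       -- from the first row of the customer's partition up to the current row
       sum(cost) over(partition by name order by orderdate
                      rows between unbounded preceding and current row) as running_by_name,
       -- previous row, current row and next row
       sum(cost) over(partition by name order by orderdate
                      rows between 1 preceding and 1 following) as prev_cur_next,
       -- from the current row to the last row of the partition
       sum(cost) over(partition by name order by orderdate
                      rows between current row and unbounded following) as cur_to_end
from business;
```

Comparing the columns row by row is a quick way to see how each frame grows or shrinks as the current row moves.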

//Create the table
hive> create table business(
    > name string, 
    > orderdate string,
    > cost int
    > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
//Load the data
load data local inpath "/opt/db/business.txt" into table business;
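The article does not show the contents of business.txt. Given the comma delimiter in the table definition, each line would look like jack,2017-01-01,10. As an alternative to loading a file, the sketch below inserts the same 14 rows directly; the rows are reconstructed from the result listings later in this article, the original file order is unknown, and insert ... values assumes Hive 0.14 or later.

```sql
-- Rows reconstructed from the query results shown in this article.
-- Requires a Hive version that supports INSERT ... VALUES (0.14+).
insert into table business values
  ('jack', '2017-01-01', 10),
  ('tony', '2017-01-02', 15),
  ('tony', '2017-01-04', 29),
  ('jack', '2017-01-05', 46),
  ('tony', '2017-01-07', 50),
  ('jack', '2017-01-08', 55),
  ('jack', '2017-02-03', 23),
  ('jack', '2017-04-06', 42),
  ('mart', '2017-04-08', 62),
  ('mart', '2017-04-09', 68),
  ('mart', '2017-04-11', 75),
  ('mart', '2017-04-13', 94),
  ('neil', '2017-05-10', 12),
  ('neil', '2017-06-12', 80);
```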

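//Requirements
(1) Find the customers who made a purchase in April 2017, and the total number of such customers
(2) Show each customer's purchase details together with the monthly purchase total
(3) For the scenario above, accumulate cost in date order
(4) Find each customer's previous purchase date
(5) Find the orders in the earliest 20% by date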

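//Answer (1): customers who made a purchase in April 2017, and the total number of such customers
//count(*) over() is evaluated after group by, so it counts the grouped names (2 here) rather than each customer's orders
select name,count(*) over()
from business
where substring(orderdate,1,7)="2017-04"
group by name;
//Result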
mart	2
jack	2
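The two queries below sketch how the head count in that result can be obtained. The first keeps one row per customer and attaches the count to each row; the second returns only the number. The alias customers is an illustrative name. On the article's data both report 2, for mart and jack.

```sql
-- One row per April 2017 customer; the window runs over the grouped rows,
-- so every row carries the number of distinct customers.
select name, count(*) over() as customers
from business
where substring(orderdate, 1, 7) = '2017-04'
group by name;

-- Only the head count, without the names.
select count(distinct name) as customers
from business
where substring(orderdate, 1, 7) = '2017-04';
```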

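//Answer (2): each customer's purchase details together with the monthly purchase total
//partition by month(orderdate) puts the rows of the same month into one window
select *,sum(cost) over(partition by month(orderdate)) from business;
//or, equivalently:
select *,sum(cost) over(distribute by month(orderdate)) from business;
//Result: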
jack	2017-01-01	10	205
jack	2017-01-08	55	205
tony	2017-01-07	50	205
jack	2017-01-05	46	205
tony	2017-01-04	29	205
tony	2017-01-02	15	205
jack	2017-02-03	23	23
mart	2017-04-13	94	341
jack	2017-04-06	42	341
mart	2017-04-11	75	341
mart	2017-04-09	68	341
mart	2017-04-08	62	341
neil	2017-05-10	12	12
neil	2017-06-12	80	80
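One thing to keep in mind is that month() returns only the month number, so rows from April 2017 and April 2018 would land in the same window. The sample data covers a single year, so the totals above are unaffected. A possible variant for multi-year data partitions on the year-month prefix instead; the aliases ym and month_total are illustrative.

```sql
-- Partition on 'yyyy-MM' so that the same month in different years stays separate.
select name, orderdate, cost,
       substring(orderdate, 1, 7) as ym,
       sum(cost) over(partition by substring(orderdate, 1, 7)) as month_total
from business;
```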

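//Answer (3): for the scenario above, accumulate cost in date order

hive> select *,sum(cost) over(sort by orderdate rows between unbounded preceding and current row) from business;
//In the window clause, sort by orderdate rows between unbounded preceding and current row sorts the rows by orderdate and sums from the first row up to the current row (rows means the frame is counted in rows)
//Result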
jack	2017-01-01	10	10
tony	2017-01-02	15	25
tony	2017-01-04	29	54
jack	2017-01-05	46	100
tony	2017-01-07	50	150
jack	2017-01-08	55	205
jack	2017-02-03	23	228
jack	2017-04-06	42	270
mart	2017-04-08	62	332
mart	2017-04-09	68	400
mart	2017-04-11	75	475
mart	2017-04-13	94	569
neil	2017-05-10	12	581
neil	2017-06-12	80	661
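The frame can also be left out. According to the Hive documentation, when a window has an ORDER BY but no frame clause, the frame defaults to range between unbounded preceding and current row. The sketch below puts the two forms side by side; the aliases are illustrative. Every orderdate in the sample data is distinct, so both columns should match the result above. With duplicate dates, the range form would include all rows sharing the current row's date, while the rows form would not.

```sql
select name, orderdate, cost,
       -- explicit row-based frame, as in the answer above
       sum(cost) over(order by orderdate
                      rows between unbounded preceding and current row) as running_rows,
       -- no frame clause: defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       sum(cost) over(order by orderdate) as running_default
from business;
```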

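//Answer (4): each customer's previous purchase date
//lag gives the previous purchase date; lead, shown for comparison, gives the next one
select *,
lag(orderdate,1) over(distribute by name sort by orderdate),
lead(orderdate,1) over(distribute by name sort by orderdate) from business;
//Result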
jack	2017-01-01	10	NULL	2017-01-05
jack	2017-01-05	46	2017-01-01	2017-01-08
jack	2017-01-08	55	2017-01-05	2017-02-03
jack	2017-02-03	23	2017-01-08	2017-04-06
jack	2017-04-06	42	2017-02-03	NULL
mart	2017-04-08	62	NULL	2017-04-09
mart	2017-04-09	68	2017-04-08	2017-04-11
mart	2017-04-11	75	2017-04-09	2017-04-13
mart	2017-04-13	94	2017-04-11	NULL
neil	2017-05-10	12	NULL	2017-06-12
neil	2017-06-12	80	2017-05-10	NULL
tony	2017-01-02	15	NULL	2017-01-04
tony	2017-01-04	29	2017-01-02	2017-01-07
tony	2017-01-07	50	2017-01-04	NULL
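The NULL values mark each customer's first and last purchase, where there is no earlier or later row. As a sketch of two common follow-ups, the query below passes a third argument to lag as a default value and uses datediff to turn the previous date into a gap in days. The placeholder 'none' and the aliases are assumptions made for illustration.

```sql
select name, orderdate, cost,
       -- third argument: value returned when there is no previous row
       lag(orderdate, 1, 'none') over(partition by name order by orderdate) as prev_date,
       -- days since the previous purchase; NULL for a customer's first order
       datediff(orderdate,
                lag(orderdate, 1) over(partition by name order by orderdate)) as days_since_prev
from business;
```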

//Answer (5): orders in the earliest 20% by date
hive> select * from(
    > select name,orderdate,cost,ntile(5) over(order by orderdate) sorted
    > from business
    > ) t
    > where sorted = 1;
//Result
jack	2017-01-01	10	1
tony	2017-01-02	15	1
tony	2017-01-04	29	1
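Note that ntile(5) splits the rows by count, not by elapsed time, so "the first 20%" here means the first fifth of the orders when sorted by date. With 14 rows the five groups cannot all be the same size. A quick way to inspect how the rows were distributed is to aggregate per group, as in this sketch (the aliases orders, first_date and last_date are illustrative):

```sql
-- Count the rows in each ntile group and show the date range it covers.
select sorted,
       count(*)       as orders,
       min(orderdate) as first_date,
       max(orderdate) as last_date
from (
  select orderdate, ntile(5) over(order by orderdate) as sorted
  from business
) t
group by sorted;
```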

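Difference between count() and sum():
Given a fruit table with these two rows:
1 apple 1.00
2 pear 2.00
select count(price) from fruit; -- result: 2 (there are 2 records)
select sum(price) from fruit; -- result: 3.00 (the sum of the price field over all records)
count counts values, sum adds them up
substring() extracts part of a string. In Hive, substring(str, pos, len) returns len characters starting at position pos, counted from 1, so substring(orderdate,1,7) yields the year and month, for example "2017-04".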
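A few queries that make these points concrete. The fruit table is the two-row table described above; only the price column is named in the article, so the table layout is otherwise an assumption. One detail worth knowing: count(price) skips rows where price is NULL, whereas count(*) counts every row.

```sql
-- count(*) counts rows, count(price) counts non-NULL prices, sum(price) adds them up.
select count(*), count(price), sum(price) from fruit;

-- substring(str, pos, len): pos is 1-based, len is the number of characters.
select substring('2017-04-06', 1, 7);   -- 2017-04
select substring('2017-04-06', 6, 2);   -- 04
```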

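Sorting: 4 kinds

//Global sort (Order By): the whole job uses a single Reduce; ascending (ASC) by default, descending with DESC
//Per-reducer sort (Sort By): each Reduce sorts its own output, so the full result set is not globally sorted. (Set the number of Reduces, ideally matching the number of partitions)
//Partitioning (Distribute By): similar to the partitioner in MR; it decides which Reduce each row is sent to and is normally combined with Sort By. It must run with more than one Reduce, otherwise the effect of the partitioning is not visible. (Set the number of Reduces, ideally matching the number of partitions)
//Cluster By: when distribute by and sort by use the same column, cluster by can be used instead. (Ascending order only)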
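The four kinds can be sketched against the business table as follows. The reducer count of 3 is an arbitrary value chosen for illustration, and the choice of sort columns is likewise an assumption.

```sql
-- Order By: a single reducer, globally sorted output.
select * from business order by cost desc;

-- Sort By: raise the reducer count first; each reducer sorts only its own rows.
set mapreduce.job.reduces=3;
select * from business sort by cost desc;

-- Distribute By + Sort By: rows with the same name go to the same reducer,
-- and each reducer sorts its rows by orderdate. DISTRIBUTE BY comes before SORT BY.
select * from business distribute by name sort by orderdate;

-- Cluster By: shorthand for "distribute by name sort by name" (ascending only).
select * from business cluster by name;
```

Writing the Sort By or Distribute By results to a directory with several reducers makes the effect easier to see, since each reducer produces its own output file.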