hive 窗口函数应用场景与实战详解

最新推荐文章于 2024-02-11 16:48:58 发布

bitcarmanlee

最新推荐文章于 2024-02-11 16:48:58 发布

阅读量3.5k

点赞数 2

分类专栏： hive 文章标签： hive 窗口函数详解实战

本文链接：https://blog.csdn.net/bitcarmanlee/article/details/51745926

版权

hive 专栏收录该内容

28 篇文章 13 订阅

订阅专栏

项目github地址：bitcarmanlee easy-algorithm-interview-and-practice
欢迎大家star，留言，一起学习进步

hive中的窗口函数，功能非常强大，使用也比较方便，可以给我们的查询提供相当多的遍历。下面就结合具体的需求与实例，一一为大家讲解窗口函数的使用方法。

1.数据准备

先在hive数据库中建一张表，表的结构如下：

hive (tmp)> desc phone_test;
OK
calling_num         	string
called_num          	string

准备测试文件：

将文件put到hdfs中hive表对应的位置：

hadoop fs -put phone /data/hive/warehouse/tmp.db/phone_test

至此，数据准备工作完毕。

2.row_num()方法，最有用的窗口函数，或许没有之一

表phone_test的第一列为主叫电话，第二列为被叫电话，我们经常有这种需求：对于同一主叫电话，按通话次数的多少（即被叫电话）进行排序。这个时候row_num()方法就派上了用场。请看：

select a.calling_num,called_num,count,
row_number() OVER (distribute BY calling_num sort BY count DESC) rn
from
(select calling_num,called_num,count(*) as count
from phone_test group by calling_num,called_num)a

130	131	6	1
130	133	2	2
130	132	2	3
130	134	1	4
132	130	5	1
132	133	3	2
132	131	3	3
134	135	4	1
134	136	3	2
138	137	3	1
138	136	2	2
135	130	4	1
135	132	2	2

最后一列，就是在同一主叫电话中，被叫电话与主叫电话通话次数的排序。怎么样，很方便吧。
如果要取通话最多的那个，在此基础上稍作改动：

select b.calling_num,called_num,count
from
(select a.calling_num,called_num,count,
row_number() OVER (distribute BY calling_num sort BY count DESC) rn
from
(select calling_num,called_num,count(*) as count
from phone_test group by calling_num,called_num)a)b
where b.rn = 1

130	131	6
132	130	5
134	135	4
138	137	3
135	130	4

这样就把每组中rn=1的那一行给选了出来。

换一种写法，还可以这么写：

select b.calling_num,called_num,count
from
(select a.calling_num,called_num,count,
row_number() OVER (partition BY calling_num order BY count DESC) rn
from
(select calling_num,called_num,count(*) as count
from phone_test group by calling_num,called_num)a)b
where b.rn = 1

130	131	6
132	130	5
134	135	4
138	137	3
135	130	4

效果与前面是一样一样滴！

3.rank() dense_rank()

rank，顾名思义，就是排序啦。这个比排序更高级一点的是，返回的数据项是在分组中的排名，排名相等的会在名词中留下对应的空位。而dense_rank与rank唯一的不同，就是排名相等的时候不会留下对应的空位。
看例子：

select a.calling_num,called_num,count,
rank() over (partition by calling_num order by count desc) rank,
dense_rank() over (partition by calling_num order by count desc) drank,
row_number() OVER (distribute BY calling_num sort BY count DESC) rn
from
(select calling_num,called_num,count(*) as count
from phone_test group by calling_num,called_num)a

对应结果的列为：calling_num called_num count rank drank rn
130	131	6	1	1	1
130	133	2	2	2	2
130	132	2	2	2	3
130	134	1	4	3	4
132	130	5	1	1	1
132	133	3	2	2	2
132	131	3	2	2	3
134	135	4	1	1	1
134	136	3	2	2	2
138	137	3	1	1	1
138	136	2	2	2	2
135	130	4	1	1	1
135	132	2	2	2	2

聪明的你，是不是已经明白这两个函数的用法了？

4.ntile()

ntile是按层次查询。其作用是将数据分成几部分，例如我们想把数据总共分为十份，我们想取前10%来做分析。请看:

select calling_num,called_num,count(*),ntile(5) over(order by count(*) desc) til
from phone_test group by calling_num,called_num
130	131	6	1
132	130	5	1
135	130	4	1
134	135	4	2
138	137	3	2
132	133	3	2
132	131	3	3
134	136	3	3
138	136	2	3
130	132	2	4
135	132	2	4
130	133	2	5
130	134	1	5

我们先将所有数据分为了5部分。如果只想查看其中某一部分：

select a.calling_num,a.called_num,count,til from
(select calling_num,called_num,count(*) count,ntile(5) over(order by count(*) desc) til
from phone_test group by calling_num,called_num)a
where til = 1

130	131	6	1
132	130	5	1
135	130	4	1

5.一定范围内的聚合

实际应用场景中，我们有各种数据聚合的要求，而且还比一般的计算要复杂。例如对与电商来说，经常需要看当月与今年之前所有月的累计订单量。对于销售人员来说，也经常需要看当月与今年之前或这个季度之前的累计销售额。对于本文中的例子，我们想计算当前号码的通话总和，以及与之前所有号码通话总和的累计：

select calling_num,sum(num),
sum(sum(num)) over(order by calling_num ROWS between unbounded preceding and current row) as cumulative_sum
from
(select calling_num,called_num,count(*) as num
from phone_test group by calling_num,called_num)a
group by calling_num

130	11	11
132	11	22
134	7	29
135	6	35
138	5	40

sum(sum(num))这种写法中，里面的sum(num)表示需要累加的和。
重点看下over里面的内容：
order by calling_num表示按主叫电话排序
ROWS between unbounded preceding and current row 肯定就是表示聚合的起始位置与终止位置了。unbounded preceding是起点，表示从第一行开始；current row为默认值，表示到当前行。
以下写法，能达到同样的效果：

select calling_num,sum(num),
sum( sum(num)) over(order by calling_num ROWS unbounded preceding) as cumulative_sum
from
(select calling_num,called_num,count(*) as num
from phone_test group by calling_num,called_num)a
group by calling_num

130	11	11
132	11	22
134	7	29
135	6	35
138	5	40

如果只想在当前行对前两行聚合，总共计算前两行+当前行=3行的值，可以这样写：

select calling_num,sum(num),
sum( sum(num)) over(order by calling_num ROWS between 2 preceding and current row) as cumulative_sum
from
(select calling_num,called_num,count(*) as num
from phone_test group by calling_num,called_num)a
group by calling_num

130	11	11
132	11	22
134	7	29
135	6	24
138	5	18

也可以这么写：

select calling_num,sum(num),
sum( sum(num)) over(order by calling_num ROWS 2 preceding) as cumulative_sum
from
(select calling_num,called_num,count(*) as num
from phone_test group by calling_num,called_num)a
group by calling_num

130	11	11
132	11	22
134	7	29
135	6	24
138	5	18

如果想对之前一行后面一行进行聚合，总共前面一行+当前行+后面一行=3行计算结果，可以这么写：

select calling_num,sum(num),
sum(sum(num)) over(order by calling_num ROWS between 1 preceding and 1 following) as cumulative_sum
from
(select calling_num,called_num,count(*) as num
from phone_test group by calling_num,called_num)a
group by calling_num

130	11	22
132	11	29
134	7	24
135	6	18
138	5	11

bitcarmanlee

关注

2
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
hive 窗口函数应用场景与实战详解

hive中的窗口函数，功能非常强大，使用也比较方便，可以给我们的查询提供相当多的遍历。下面就结合具体的需求与实例，一一为大家讲解窗口函数的使用方法。1.数据准备先在hive数据库中建一张表，表的结构如下：hive (tmp)> desc phone_test;OKcalling_num stringcalled_num string准备测试文件
复制链接

扫一扫