HIVE实战处理（八） hive窗口函数

最新推荐文章于 2024-01-08 10:45:14 发布

sheep8521

最新推荐文章于 2024-01-08 10:45:14 发布

阅读量801

点赞数

分类专栏： hive

本文链接：https://blog.csdn.net/sheep8521/article/details/113772843

版权

hive 专栏收录该内容

43 篇文章 10 订阅

订阅专栏

1、 count、sum、avg、max、min

# 按照 year 来分组，统计每一年的总和
# 结果：每个月的值都是本年的总和
sum(val) over(partition by year)

# 按照 year 来分组，按照 month 来排序
# 结果：n 月的值是本年 1月到 n 月的累计值
sum(val) over(partition by year order by month)

通过 explain select ... 来查看语句解析，可以简单理解为，在每一次 order by 之后，会执行一次 sum 的 reduce 过程，也就导致结果计算的是 1 - n 月的累计值

2、rows between

注明:该函数用法只能适合sum累加计算的，不合适去重计算

1）准备每天的活跃用户和这天的新增用户数据。
样例数据截图：
temp.tmp_20210209_visit_user_list
在这里插入图片描述
2）月累计去重用户数

----错误示范-----

--按天出月累计用户数，mtd
select 
month,dt
,count(distinct udid)over(partition by month order by dt rows between unbounded preceding and current row) mtd_uv 
,count(distinct new_udid)over(partition by month order by dt rows between unbounded preceding and current row) as mtd_nv
from 
(select  udid,new_udid,dt,substr(dt,1,6) as month 
from  temp.tmp_20210209_visit_user_list 
order by month asc,dt asc
--如果数据是要按天一次累计的话，那这个顺序必须重新排序
)s 
group by month,dt

以上线排序后累计去重，都已测算过，但是数据仍是不对，没有按照dt依次累计用户数，所以判定该方法不适合去重的数据统计。
在这里插入图片描述
-------------错误示范end------------

-------------正确月累计示范start------------


select 
select 
substr(a.dt,0,6) month,
a.dt,
count(distinct b.udid) mtd_uv,
count(distinct b.new_udid) mtd_nu
from temp.tmp_20210209_visit_user_list  a
join temp.tmp_20210209_visit_user_list  b
on substr(a.dt,0,6)=substr(b.dt,0,6)
where a.dt>=b.dt 
group by substr(a.dt,0,6),a.dt

数据样例：

3）其他聚合函数的使用

# 按照 year 分组，按照 month 排序，计算前3行和后1行的总和
sum(val) over(partition by year order by month rows between 3 preceding and 1 following)

preceding：往前
following：往后
current row：当前行
unbounded：起点
unbounded preceding：表示从前面的起点
unbounded following：表示到后面的终点

# 以下两种方式是等效的
sum(val) over(partition by year)
sum(val) over(partition by year rows between unbounded preceding and unbounded following)

# 以下两种方式是等效的
sum(val) over(partition by year order by month)
sum(val) over(partition by year order by month rows between unbounded preceding and current row)
--从前面的起点累计到当前行
# 以下两种方式不等效
#按年分区，从前面的年份起点累计到当前行
sum(val) over(partition by year rows between unbounded preceding and current row)
#按年分区，从前面的月份起点累计到当前行
sum(val) over(partition by year order by month rows between unbounded preceding and current row)
# current row 应该是和 order by 同时出现，要不然会导致数据错位

3、ntile

切片：用于将分组数据按照顺序切分成n片，返回当前切片值；不支持 rows between；如果切片不均匀，默认增加第一个切片的分布（比如有6条数据，分4组，数量依次为2 2 1 1）

# 统计一个月内，val 最多的前 1/n 
ntile(n) over(partition by month order by val desc) as rn
rn = 1 就是最终想要的结果，前提是数据可以被均匀分片

数据样例：


## 查询原始数据2月的各个天的数据条数
select month,dt,count(*) cnt
from temp.tmp_20210209_visit_user_list 
where month=202102

## 将202102的数据切片为n=10份，取前1/n

select month,dt, ntile(10) over(partition by month order by dt desc) as rn
from temp.tmp_20210209_visit_user_list 
where month=202102

4、row_number、rank、dense_rank#

row_number：行号
rank：排名——结果中可能有空位 eg：1 2 2 4
dense_rank：排名——结果中无空位 eg：1 2 2 3

5、 cume_dist#

计算公式：(小于等于当前值的行数 / 分组内的总行数)

# 统计小于等于当前薪水的人占部门内总人数的比例
cume_dist() over(partition by dept order by salary)

6、 cube#

根据 group by 的维度的所有组合进行聚合。

select year, month, count(1)a, grouping__id 
from ...
group by year, month
with cube
order by grouping__id

等价于以下四种情况 union all
1. 相当于直接 count(1)a
2. 按照 year 来分组
3. 按照 month 来分组
4. 按照 year&month 来分组

7、grouping sets#

select year, month, count(1)a, grouping__id 
from ...
group by year, month
grouping sets(year, month, (year, month))
order by grouping__id

等价于

select year, 'null' as month, count(1)a, 1 as grouping__id 
from ...
group by year, month
union all
select 'null' as year, month, count(1)a, 2 as grouping__id 
from ...
group by month
union all
select year, month, count(1)a, 3 as grouping__id 
from ...
group by year, month

在一个 group by 查询中，根据不同的维度组合进行聚合，等价于将不同维度的 group by 结果集进行 union all

8、 rollup#

是 cube 的子集，以最左侧的维度为主，从该维度进行层级聚合。

select year, month, count(1)a, grouping__id 
from ...
group by year, month
with rollup
order by grouping__id

等价于先进行 with cube操作，即以下四种情况 union all
1. 相当于直接 count(1)a
2. 按照 year 来分组
3. 按照 month 来分组
4. 按照 year&month 来分组

然后 year 是最左侧的维度，则按照 year 来进行层级聚合，过滤掉 year 为 NULL 的记录（但是第1中情况对所有数据进行count(1)的这一条数据会依旧保存）

9、cume_dist#

计算公式：(小于等于当前值的行数 / 分组内的总行数)

# 统计小于等于当前薪水的人占部门内总人数的比例
cume_dist() over(partition by dept order by salary)

样例数据:

SELECT 
cookieid,
createtime,
pv,
CUME_DIST() OVER(ORDER BY pv) AS rn1,
CUME_DIST() OVER(PARTITION BY cookieid ORDER BY pv) AS rn2 
FROM cookies

sheep8521

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
HIVE实战处理（八） hive窗口函数

1、 count、sum、avg、max、min# 按照 year 来分组，统计每一年的总和# 结果：每个月的值都是本年的总和sum(val) over(partition by year)# 按照 year 来分组，按照 month 来排序# 结果：n 月的值是本年 1月到 n 月的累计值sum(val) over(partition by year order by month)通过 explain select ... 来查看语句解析，可以简单理解为，在每一次 order by 之后
复制链接

扫一扫

专栏目录