Hive--开窗函数--窗口聚合函数:SUM、AVG、COUNT、MAX、MIN

本文介绍窗口聚合函数的基本概念及应用场景,包括SUM、AVG、COUNT等函数如何在不同窗口设置下进行数据聚合运算,如累计求和、指定区间求和等,并通过具体实例展示其在数据处理中的应用。

窗口聚合函数:SUM、AVG、COUNT、MAX、MIN

以SUM为例

总结

  • 1.如果不使用 over(),窗口大小是针对查询产生的所有数据
  • 功能:用于实现数据分区后的聚合
  • 语法:fun_name(col1) over (partition by col2 order by col3)
  • 示例:实现分区内的累加,其他的原理类似
  • 不指定窗口时,不排序默认第一行到最后一行,排序默认第一行到当前行
  • 指定窗口时 --rows between 起始位置 and 结束位置
    • N preceding:往前多少行
    • N following:往后多少行
    • current row:当前行
    • unbounded:起点或者终点,没有边界
    • unbounded preceding 表示从前面的起点
    • unbounded following:表示到后面的终点
  • 注意
    • 分组列表,组内按排列顺序求累加和, 每组的相同level(排序字段)分享同一累加值。
    • 如果level(排序字段)有相同,需要注意是否是你想要的结果,是否需要考虑指定窗口位置

实现分区内起始到当前行的pv累加,指定排序默认窗口:取分区第一行开始到当前行的和

--创建文件:vim /export/datas/window.txt
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie2,2018-04-13,3
cookie2,2018-04-14,2
cookie2,2018-04-15,4
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,1
cookie2,2018-04-11,5
cookie2,2018-04-12,7
--创建表
create database db_function;
use db_function;
create table hanjiaxiaozhi_f1(
cookieid string,
daystr string,
pv int
) row format delimited fields terminated by ',';
--加载数据
load data local inpath '/export/datas/window.txt' into table hanjiaxiaozhi_f1;
set hive.exec.mode.local.auto=true;
--实现分区内起始到当前行的pv累加,默认窗口:取分区第一行开始到当前行的和
select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr) as pv1 
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv3  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 1    |
| cookie1   | 2018-04-11  | 5   | 6    |
| cookie1   | 2018-04-12  | 7   | 13   |
| cookie1   | 2018-04-13  | 3   | 16   |
| cookie1   | 2018-04-14  | 2   | 18   |
| cookie1   | 2018-04-15  | 4   | 22   |
| cookie1   | 2018-04-16  | 4   | 26   |
| cookie2   | 2018-04-10  | 1   | 1    |
| cookie2   | 2018-04-11  | 5   | 6    |
| cookie2   | 2018-04-12  | 7   | 13   |
| cookie2   | 2018-04-13  | 3   | 16   |
| cookie2   | 2018-04-14  | 2   | 18   |
| cookie2   | 2018-04-15  | 4   | 22   |
+-----------+-------------+-----+------+--+

实现分区内起始到当前行的pv累加,指定排序默认窗口:取分区第一行开始到当前行的和,每组的相同level分享同一累加值。

select id,score,
sum(score) over(patition by id order by score) cumm_scores  
from GRADE

idscorecumm_scores
19898
1100198
295190
295190

实现分区内所有pv的累加,不指定排序默认窗口:从第一行到最后一行

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid ) as pv2
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv2  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-16  | 4   | 26   |
| cookie1   | 2018-04-15  | 4   | 26   |
| cookie1   | 2018-04-13  | 3   | 26   |
| cookie1   | 2018-04-12  | 7   | 26   |
| cookie1   | 2018-04-11  | 5   | 26   |
| cookie1   | 2018-04-14  | 2   | 26   |
| cookie1   | 2018-04-10  | 1   | 26   |
| cookie2   | 2018-04-12  | 7   | 22   |
| cookie2   | 2018-04-11  | 5   | 22   |
| cookie2   | 2018-04-10  | 1   | 22   |
| cookie2   | 2018-04-14  | 2   | 22   |
| cookie2   | 2018-04-13  | 3   | 22   |
| cookie2   | 2018-04-15  | 4   | 22   |
+-----------+-------------+-----+------+--+
  

手动指定窗口的大小:分区

rows between 起始位置 and 结束位置
指定窗口的关键字:rows betweenand- N preceding:往前多少行
- N following:往后多少行
- current row:当前行
- unbounded:起点或者终点,没有边界
- unbounded preceding 表示从前面的起点
- unbounded following:表示到后面的终点
rows between unbounded preceding and current row
--实现分区内从第一行到当前行的pv累加
--手动指定窗口的大小:分区
rows between 起始位置 and 结束位置
rows between unbounded preceding and current row

实现分区内起始到当前行的pv累加,手动指定窗口大小从第一行到当前行

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between unbounded preceding and current row) as pv3
from 
  hanjiaxiaozhi_f1;
  
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv3  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 1    |
| cookie1   | 2018-04-11  | 5   | 6    |
| cookie1   | 2018-04-12  | 7   | 13   |
| cookie1   | 2018-04-13  | 3   | 16   |
| cookie1   | 2018-04-14  | 2   | 18   |
| cookie1   | 2018-04-15  | 4   | 22   |
| cookie1   | 2018-04-16  | 4   | 26 
| cookie2   | 2018-04-10  | 1   | 1    |
| cookie2   | 2018-04-11  | 5   | 6    |
| cookie2   | 2018-04-12  | 7   | 13   |
| cookie2   | 2018-04-13  | 3   | 16   |
| cookie2   | 2018-04-14  | 2   | 18   |
| cookie2   | 2018-04-15  | 4   | 22   |
+-----------+-------------+-----+------+--+

实现分区内指定前N行到当前行的pv累加

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between 3 preceding and current row) as pv4
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv4  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 1    |
| cookie1   | 2018-04-11  | 5   | 6    |
| cookie1   | 2018-04-12  | 7   | 13   |
| cookie1   | 2018-04-13  | 3   | 16   |
| cookie1   | 2018-04-14  | 2   | 17   |
| cookie1   | 2018-04-15  | 4   | 16   |
| cookie1   | 2018-04-16  | 4   | 13   |
| cookie2   | 2018-04-10  | 1   | 1    |
| cookie2   | 2018-04-11  | 5   | 6    |
| cookie2   | 2018-04-12  | 7   | 13   |
| cookie2   | 2018-04-13  | 3   | 16   |
| cookie2   | 2018-04-14  | 2   | 17   |
| cookie2   | 2018-04-15  | 4   | 16   |
+-----------+-------------+-----+------+--+

实现分区内指定前N行到后N行的pv累加

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between 3 preceding and 1 following) as pv5
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv5  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 6    |
| cookie1   | 2018-04-11  | 5   | 13   |
| cookie1   | 2018-04-12  | 7   | 16   |
| cookie1   | 2018-04-13  | 3   | 18   |
| cookie1   | 2018-04-14  | 2   | 21   |
| cookie1   | 2018-04-15  | 4   | 20   |
| cookie1   | 2018-04-16  | 4   | 13   |
| cookie2   | 2018-04-10  | 1   | 6    |
| cookie2   | 2018-04-11  | 5   | 13   |
| cookie2   | 2018-04-12  | 7   | 16   |
| cookie2   | 2018-04-13  | 3   | 18   |
| cookie2   | 2018-04-14  | 2   | 21   |
| cookie2   | 2018-04-15  | 4   | 16   |
+-----------+-------------+-----+------+--+

实现分区内指定当前行到后N行的pv累加

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between current row and unbounded following) as pv6
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv6  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 26   |
| cookie1   | 2018-04-11  | 5   | 25   |
| cookie1   | 2018-04-12  | 7   | 20   |
| cookie1   | 2018-04-13  | 3   | 13   |
| cookie1   | 2018-04-14  | 2   | 10   |
| cookie1   | 2018-04-15  | 4   | 8    |
| cookie1   | 2018-04-16  | 4   | 4    |
| cookie2   | 2018-04-10  | 1   | 22   |
| cookie2   | 2018-04-11  | 5   | 21   |
| cookie2   | 2018-04-12  | 7   | 16   |
| cookie2   | 2018-04-13  | 3   | 9    |
| cookie2   | 2018-04-14  | 2   | 6    |
| cookie2   | 2018-04-15  | 4   | 4    |
+-----------+-------------+-----+------+--+

注意:窗口函数不支持distinct

  • 会报错Error in query: Distinct window functions are not supported: 。。。。
  • 解决:size( collect_set(xxx)over( ) )
评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值