文章目录
场景
如果我们需要想对一组数据进行分组,我们通过会使用聚合函数group by,但是它有个缺点,每组数据只能返回一行数据,且只能返回分组字段和聚合字段,如果我们想在分组的同时还保留数据详细的信息,这时我们就可以使用窗口分析函数
演示数据
user1,2018-04-10,1
user1,2018-04-11,5
user1,2018-04-12,7
user1,2018-04-13,3
user1,2018-04-14,2
user1,2018-04-15,4
user1,2018-04-16,4
user2,2018-04-10,2
user2,2018-04-11,3
user2,2018-04-12,5
user2,2018-04-13,6
user2,2018-04-14,3
user2,2018-04-15,9
user2,2018-04-16,7
创建表并加载数据
create table open_window(
userid string,
createtime string,
pv int
)
row format delimited
fields terminated by ',';
load data local inpath '/export/testdatas/openwindow.txt' into table open_window;
查询验证
select * from open_window;
分组排名
row_number
从1开始按顺序排序,不间断,无重复
select *,row_number() over (partition by userid order by pv desc ) as rk from open_window;
rank
从1开始按顺序排序,有间断,有重复,碰到相同数据会显示相同排名,但是排名序号会有断层
select *,rank() over (partition by userid order by pv desc ) as rk from open_window;
dense_rank
从1开始按顺序排序,无间断,有重复,碰到相同数据会显示相同排名,排名序号不会断层
select *,rank() over (partition by userid order by pv desc ) as rk from open_window;
分组求和,平均,最大,最小
sum
需要指定计算字段,排序方式,计算窗口(范围),前两者都会影响计算结果
over中常用关键字
partition by
: 指定分区(组)字段order by
: 指定排序方式,不指定默认整个分组为计算窗口rows between
指定计算窗口(范围), 不指定默认从第一行到当前行
rows between 子句,也叫window 子句
preceding
:表示往前几行current row
:表示当前行following
:表示往后几行unbounded preceding
:表示从第一行开始unbounded following
:表示到最后一行终止
只指定分区字段 partition by
默认计算范围,整个分组
select *, sum(pv) over (partition by userid) as sum from open_window;
不指定rows between
默认计算范围,第一行到当前行
select *, sum(pv) over (partition by userid order by createtime) as sum from open_window;
-- 等效写法
select *, sum(pv) over (partition by userid order by createtime rows unbounded preceding) as sum from open_window;
select *, sum(pv) over (partition by userid order by createtime rows between unbounded preceding and current row ) as sum from open_window;
指定rows between
从当前行的前3行开始到后1行结束
指定计算范围从当前行的前3行开始到后1行结束
select *, sum(pv) over (partition by userid order by createtime rows between 3 preceding and 1 following) as sum from open_window;
从当前行到最后一行
select *, sum(pv) over (partition by userid order by createtime rows between current row and unbounded following) as sum from open_window;