Hive Window Functions: A Detailed Summary

jokertiger

Last modified 2024-04-07 19:43:54

Tags: hive, hadoop, data warehouse

First published 2023-08-02 14:28:39

Article link: https://blog.csdn.net/jokertiger/article/details/120140267

Hive window functions in detail

Explanation
Syntax
Hive window functions

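Explanation

A window function defines a window for each row (the set of rows the computation operates on) and computes a value over that set. No GROUP BY clause is needed, so a single result row can carry both the columns of the underlying row and the aggregated value.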

Syntax

function() over(partition by col1
          order by col2
          rows between
          		[[unbounded|num] preceding | current row]
          and
          		[[unbounded|num] following | current row])

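 rows between: sets the boundaries of the window frame
 unbounded preceding: the frame has no upper bound; it starts at the first row of the partition
 num preceding: the frame starts num rows before the current row
 current row: the current row
 num following: the frame ends num rows after the current row
 unbounded following: the frame has no lower bound; it ends at the last row of the partition
 rows between unbounded preceding and unbounded following: the frame covers every row in the partition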
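
To make the frame clauses concrete, here is a minimal Python sketch (not Hive code) that mimics how a rows between frame is cut out for each row of one partition. The helper name rows_between and the sample values are hypothetical, and None stands in for unbounded.

```python
def rows_between(values, preceding, following):
    """Return the frame (a slice of values) for every row.

    preceding / following are row counts; None plays the role of UNBOUNDED.
    Assumes values are already in ORDER BY order within one partition.
    """
    frames = []
    last = len(values) - 1
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        frames.append(values[start:end + 1])
    return frames


scores = [10, 20, 30, 40]  # hypothetical sample values

# rows between 1 preceding and 1 following
moving_sum = [sum(f) for f in rows_between(scores, 1, 1)]
# rows between unbounded preceding and current row (a running total)
running_sum = [sum(f) for f in rows_between(scores, None, 0)]
# rows between unbounded preceding and unbounded following (whole partition)
total = [sum(f) for f in rows_between(scores, None, None)]

print(moving_sum)   # [30, 60, 90, 70]
print(running_sum)  # [10, 30, 60, 100]
print(total)        # [100, 100, 100, 100]
```

The same aggregate (sum) yields a moving sum, a running total, or a partition-wide total depending only on where the frame boundaries fall.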

Hive window functions

Ranking window functions

Sample data

select * from test;
>name score  subject
 A     90     Chinese
 A     90     Math
 A     98     English
 B     93     Chinese
 B     90     Math
 B     94     English

RANK()

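When rows tie, RANK() gives them the same rank and skips the ranks that follow. With three rows tied for first place, the ranks are 1, 1, 1, 4, ...
Example:

-- partition by name and rank each person's scores from lowest to highest
select name, score, subject, rank() over(partition by name order by score) rk from test;
>name score   subject  rk
> A     90     Chinese  1
> A     90     Math     1
> A     98     English  3
> B     90     Math     1
> B     93     Chinese  2
> B     94     English  3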

DENSE_RANK()

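When rows tie, DENSE_RANK() gives them the same rank and does not skip the ranks that follow. With three rows tied for first place, the ranks are 1, 1, 1, 2, ...
Example:

-- partition by name and rank each person's scores from lowest to highest
select name, score, subject, dense_rank() over(partition by name order by score) rk from test;
>name score   subject  rk
> A     90     Chinese  1
> A     90     Math     1
> A     98     English  2
> B     90     Math     1
> B     93     Chinese  2
> B     94     English  3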

ROW_NUMBER()

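ROW_NUMBER() assigns a unique, consecutive number to every row. Even with three rows tied for first place, the numbers are 1, 2, 3, 4, ... The order among tied rows is not guaranteed.
Example:

-- partition by name and number each person's scores from lowest to highest
select name, score, subject, row_number() over(partition by name order by score) rk from test;
>name score   subject  rk
> A     90     Chinese  1
> A     90     Math     2
> A     98     English  3
> B     90     Math     1
> B     93     Chinese  2
> B     94     English  3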
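
The difference between the three ranking functions is easiest to see side by side. The following is a small Python sketch (an illustration of the semantics, not how Hive implements them) that computes all three values for the sample data; the function name rank_partition is hypothetical and the subject names are translated.

```python
from itertools import groupby

rows = [
    ("A", 90, "Chinese"), ("A", 90, "Math"), ("A", 98, "English"),
    ("B", 93, "Chinese"), ("B", 90, "Math"), ("B", 94, "English"),
]


def rank_partition(scores):
    """Given scores sorted ascending, return (rank, dense_rank, row_number) per row."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that equals no score
    for row_number, score in enumerate(scores, start=1):
        if score != prev:
            rank = row_number  # rank jumps to the row's position, leaving gaps after ties
            dense += 1         # dense_rank only ever moves up by one
            prev = score
        out.append((rank, dense, row_number))
    return out


result = {}
# partition by name, order by score
ordered = sorted(rows, key=lambda r: (r[0], r[1]))
for name, group in groupby(ordered, key=lambda r: r[0]):
    result[name] = rank_partition([score for _, score, _ in group])

print(result["A"])  # [(1, 1, 1), (1, 1, 2), (3, 2, 3)]
print(result["B"])  # [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
```

For partition A the two scores of 90 tie, so rank yields 1, 1, 3, dense_rank yields 1, 1, 2, and row_number yields 1, 2, 3.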

Analytic window functions

Sample data:

select * from test;
  RN      ADDRESS     ARRIVAL_TIME          USERID
 ------  ----------  --------------------  ---------
 1       A1          2012-7-9 12:03:21 PM  1
 (null)  A2          2012-7-9 12:04:21 PM  2
 (null)  A3          2012-7-9 12:05:21 PM  3
 2       A1          2012-7-9 12:08:21 PM  4
 (null)  A2          2012-7-9 12:09:21 PM  5
 (null)  A3          2012-7-9 12:10:21 PM  6
 3       A1          2012-7-9 12:13:21 PM  7
 (null)  A3          2012-7-9 12:15:21 PM  8
 4       A1          2012-7-9 12:18:23 PM  9
 5       A1          2012-7-9 12:19:21 PM  10
 (null)  A2          2012-7-9 12:20:21 PM  11
 (null)  A3          2012-7-9 12:21:21 PM  12
 6       A1          2012-7-9 12:23:23 PM  13
 (null)  A2          2012-7-9 12:24:21 PM  14

last_value

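Returns the last value in the window frame.
The first argument is the column. The second argument is an optional boolean that defaults to FALSE; passing true skips NULL values.
Because the query below has an order by but no rows between clause, the frame defaults to the rows from the start of the partition through the current row, so each row receives the most recent non-NULL rn seen so far.

select rn,address,arrival_time,userid,last_value(rn,true) over(order by userid) group_t from test;
 Query result:
  RN      ADDRESS     ARRIVAL_TIME          USERID     GROUP_T
 ------  ----------  --------------------  ---------  ----------
 1       A1          2012-7-9 12:03:21 PM  1          1
 (null)  A2          2012-7-9 12:04:21 PM  2          1
 (null)  A3          2012-7-9 12:05:21 PM  3          1
 2       A1          2012-7-9 12:08:21 PM  4          2
 (null)  A2          2012-7-9 12:09:21 PM  5          2
 (null)  A3          2012-7-9 12:10:21 PM  6          2
 3       A1          2012-7-9 12:13:21 PM  7          3
 (null)  A3          2012-7-9 12:15:21 PM  8          3
 4       A1          2012-7-9 12:18:23 PM  9          4
 5       A1          2012-7-9 12:19:21 PM  10         5
 (null)  A2          2012-7-9 12:20:21 PM  11         5
 (null)  A3          2012-7-9 12:21:21 PM  12         5
 6       A1          2012-7-9 12:23:23 PM  13         6
 (null)  A2          2012-7-9 12:24:21 PM  14         6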

first_value

Returns the first value in the window frame.
The first argument is the column. The second argument is an optional boolean that defaults to FALSE; passing true skips NULL values.

select rn,address,arrival_time,userid,first_value(rn,true) over(order by userid) group_t from test;
 Query result:
  RN      ADDRESS     ARRIVAL_TIME          USERID     GROUP_T
 ------  ----------  --------------------  ---------  ----------
 1       A1          2012-7-9 12:03:21 PM  1          1
 (null)  A2          2012-7-9 12:04:21 PM  2          1
 (null)  A3          2012-7-9 12:05:21 PM  3          1
 2       A1          2012-7-9 12:08:21 PM  4          1
 (null)  A2          2012-7-9 12:09:21 PM  5          1
 (null)  A3          2012-7-9 12:10:21 PM  6          1
 3       A1          2012-7-9 12:13:21 PM  7          1
 (null)  A3          2012-7-9 12:15:21 PM  8          1
 4       A1          2012-7-9 12:18:23 PM  9          1
 5       A1          2012-7-9 12:19:21 PM  10         1
 (null)  A2          2012-7-9 12:20:21 PM  11         1
 (null)  A3          2012-7-9 12:21:21 PM  12         1
 6       A1          2012-7-9 12:23:23 PM  13         1
 (null)  A2          2012-7-9 12:24:21 PM  14         1
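
The GROUP_T columns for last_value and first_value can be reproduced with a short Python sketch. It assumes the frame for each row runs from the first row through the current row (which matches the results shown); the function names and the skip_nulls parameter are hypothetical stand-ins for the Hive functions and their boolean second argument.

```python
# The RN column of the sample data in userid order; None represents NULL.
rn = [1, None, None, 2, None, None, 3, None, 4, 5, None, None, 6, None]


def first_value(frame, skip_nulls=False):
    """First value of the frame, optionally ignoring None entries."""
    candidates = [v for v in frame if v is not None] if skip_nulls else frame
    return candidates[0] if candidates else None


def last_value(frame, skip_nulls=False):
    """Last value of the frame, optionally ignoring None entries."""
    candidates = [v for v in frame if v is not None] if skip_nulls else frame
    return candidates[-1] if candidates else None


# Assumed frame: from the first row through the current row.
frames = [rn[: i + 1] for i in range(len(rn))]

lasts = [last_value(f, skip_nulls=True) for f in frames]
firsts = [first_value(f, skip_nulls=True) for f in frames]
lasts_with_nulls = [last_value(f) for f in frames]

print(lasts)   # [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6]
print(firsts)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(lasts_with_nulls == rn)  # True
```

Without skipping NULLs, the last value of a frame that ends at the current row is simply the current row's own value, which is why the boolean argument matters for this kind of forward fill.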

lag

LAG(col, n, DEFAULT) returns the value of col from the row n rows before the current row in the window. The third argument is the value returned when no such row exists (the offset runs past the start of the window); it does not replace NULL values stored in the column itself. Example:

select rn,address,arrival_time,userid,lag(rn,2,0) over(order by userid) group_t from test;
 Query result:
  RN      ADDRESS     ARRIVAL_TIME          USERID     GROUP_T
 ------  ----------  --------------------  ---------  ----------
 1       A1          2012-7-9 12:03:21 PM  1          0
 (null)  A2          2012-7-9 12:04:21 PM  2          0
 (null)  A3          2012-7-9 12:05:21 PM  3          1
 2       A1          2012-7-9 12:08:21 PM  4          null
 (null)  A2          2012-7-9 12:09:21 PM  5          null
 (null)  A3          2012-7-9 12:10:21 PM  6          2
 3       A1          2012-7-9 12:13:21 PM  7          null
 (null)  A3          2012-7-9 12:15:21 PM  8          null
 4       A1          2012-7-9 12:18:23 PM  9          3
 5       A1          2012-7-9 12:19:21 PM  10         null
 (null)  A2          2012-7-9 12:20:21 PM  11         4
 (null)  A3          2012-7-9 12:21:21 PM  12         5
 6       A1          2012-7-9 12:23:23 PM  13         null
 (null)  A2          2012-7-9 12:24:21 PM  14         null

lead

LEAD(col, n, DEFAULT) returns the value of col from the row n rows after the current row in the window. DEFAULT is returned when no such row exists (the offset runs past the end of the window).

select rn,address,arrival_time,userid,lead(rn,2,0) over(order by userid) group_t from test;
 Query result:
  RN      ADDRESS     ARRIVAL_TIME          USERID     GROUP_T
 ------  ----------  --------------------  ---------  ----------
 1       A1          2012-7-9 12:03:21 PM  1          null
 (null)  A2          2012-7-9 12:04:21 PM  2          2
 (null)  A3          2012-7-9 12:05:21 PM  3          null
 2       A1          2012-7-9 12:08:21 PM  4          null
 (null)  A2          2012-7-9 12:09:21 PM  5          3
 (null)  A3          2012-7-9 12:10:21 PM  6          null
 3       A1          2012-7-9 12:13:21 PM  7          4
 (null)  A3          2012-7-9 12:15:21 PM  8          5
 4       A1          2012-7-9 12:18:23 PM  9          null
 5       A1          2012-7-9 12:19:21 PM  10         null
 (null)  A2          2012-7-9 12:20:21 PM  11         6
 (null)  A3          2012-7-9 12:21:21 PM  12         null
 6       A1          2012-7-9 12:23:23 PM  13         0
 (null)  A2          2012-7-9 12:24:21 PM  14         0
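
The distinction between the default value and a NULL stored in the column is worth spelling out. Below is a rough Python sketch of the lag and lead semantics over a single ordered partition; the function names mirror the Hive functions but are plain illustrative helpers, and None represents NULL.

```python
# The RN column of the sample data in userid order; None represents NULL.
rn = [1, None, None, 2, None, None, 3, None, 4, 5, None, None, 6, None]


def lag(values, n=1, default=None):
    """Value n rows back; default only when that row does not exist."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows ahead; default only when that row does not exist."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


lagged = lag(rn, 2, 0)
led = lead(rn, 2, 0)

print(lagged)  # [0, 0, 1, None, None, 2, None, None, 3, None, 4, 5, None, None]
print(led)     # [None, 2, None, None, 3, None, 4, 5, None, None, 6, None, 0, 0]
```

The default 0 appears only in the first two rows for lag and the last two rows for lead, where the offset row does not exist. Everywhere else a None in the output is a NULL copied from the source column.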

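Other window functions

ntile

NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the lowest-numbered buckets, starting with the first. NTILE does not support ROWS BETWEEN.

select a,ntile(2)over(order by a) nt from (select 1 a union select 2 union select 3)t;
 a  nt
>1	1
>2	1
>3	2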
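
As an illustration of how uneven row counts are distributed, here is a minimal Python sketch of the bucketing rule. It assumes that each of the first (row_count mod n) buckets receives one extra row, which is consistent with the result above; the helper name ntile is only a stand-in for the Hive function.

```python
def ntile(row_count, n):
    """Bucket number (1-based) for each of row_count ordered rows split into n buckets.

    Assumption: the first (row_count % n) buckets each take one extra row.
    """
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


print(ntile(3, 2))   # [1, 1, 2]
print(ntile(10, 3))  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

With 3 rows and 2 buckets, the first bucket holds two rows and the second holds one, matching the nt column shown.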

cume_dist

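This function is less commonly used. It returns (number of rows whose value is less than or equal to the current row's value) / (total number of rows in the partition).

select r, a ,cume_dist() over( order by a  ) col from (
select 'cc' r, 1  a union all select 'aa',2 union all select 'bb', 3
) t;
 r  a       col
>cc	1	0.3333333333333333  #1/3
>aa	2	0.6666666666666666  #2/3
>bb	3	1                   #3/3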
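
The formula can be checked with a few lines of Python. This is only a sketch of the definition given above (rows less than or equal to the current value, divided by total rows) over one partition; the second input list is a hypothetical example added to show how ties behave.

```python
def cume_dist(values):
    """For each value: (rows with value <= current value) / (total rows)."""
    total = len(values)
    return [sum(1 for other in values if other <= v) / total for v in values]


no_ties = cume_dist([1, 2, 3])
with_ties = cume_dist([1, 2, 2, 4])  # hypothetical data with a tie

print(no_ties)    # [0.3333333333333333, 0.6666666666666666, 1.0]
print(with_ties)  # [0.25, 0.75, 0.75, 1.0]
```

Tied rows share the same result because each of them counts every row in the tie as less than or equal to itself.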

percent_rank

percent_rank: (RANK of the current row within the window - 1) / (total number of rows in the window - 1), where RANK is the value the rank() function would return.

select r, a ,percent_rank() over( order by a  ) col from (
select 'cc' r, 1  a union all select 'aa',2 union all select 'bb', 3
) t;
 r  a       col
>cc	1	0     #(1-1)/(3-1)
>aa	2	0.5   #(2-1)/(3-1)
>bb	3	1     #(3-1)/(3-1)
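
The parentheses in the formula matter, so here is a small Python sketch of it. The rank is computed as one plus the number of strictly smaller values, which is what rank() returns for an ascending order. Returning 0.0 for a single-row partition is an assumption made here to avoid dividing by zero, and the second input list is a hypothetical example with a tie.

```python
def percent_rank(values):
    """For each value: (rank - 1) / (total rows - 1), with rank as in rank()."""
    total = len(values)
    if total == 1:
        return [0.0]  # assumption: a single-row partition yields 0
    ranks = [1 + sum(1 for other in values if other < v) for v in values]
    return [(r - 1) / (total - 1) for r in ranks]


no_ties = percent_rank([1, 2, 3])
with_ties = percent_rank([1, 2, 2, 4])  # ranks are 1, 2, 2, 4

print(no_ties)    # [0.0, 0.5, 1.0]
print(with_ties)  # [0.0, 0.3333333333333333, 0.3333333333333333, 1.0]
```

Unlike cume_dist, the first row always gets 0 because its rank is 1.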
