Hive window functions: an introduction to over(partition by)

Article link: https://blog.csdn.net/weixin_45417821/article/details/119303288

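Window functions (also called analytic functions) compute an aggregate-style value over a group of rows. They differ from aggregate functions in that they return a value for every row of the group, whereas an aggregate function returns a single row per group. The over clause specifies the window of rows the function works on, and that window can change from one row to the next.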

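-- row_number(): consecutive numbering; rows with equal values still get distinct numbers
select row_number() over(order by name) as seq, * from t2_temp;

-- rank(): equal values share a rank and the following ranks are skipped, e.g. 1,1,3,4
select rank() over(order by name) as seq, * from t2_temp;

-- dense_rank(): equal values share a rank and no ranks are skipped, e.g. 1,1,2,2,3,4,5
select dense_rank() over(order by name) as seq, * from t2_temp;

-- ntile(n): split the ordered rows into n buckets (here 2)
select ntile(2) over(order by name) as seq, * from t2_temp;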
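To compare the four functions side by side, the following sketch applies them over the same window to the t2_temp table created in the example below. The aliases rn, rnk, dense_rnk and bucket are illustrative names chosen here, not part of the original queries.

```sql
-- Same partition and ordering for all four functions,
-- so the differences come only from the functions themselves.
select name, class, sroce,
       row_number() over(partition by class order by sroce desc) as rn,
       rank()       over(partition by class order by sroce desc) as rnk,
       dense_rank() over(partition by class order by sroce desc) as dense_rnk,
       ntile(2)     over(partition by class order by sroce desc) as bucket
from t2_temp;
```

For class 3 (scores 99, 99, 78, 55, 45) rnk is 1, 1, 3, 4, 5 and dense_rnk is 1, 1, 2, 3, 4. rn runs from 1 to 5, with the order of the two 99s left to the engine, and ntile(2) puts the first three rows in bucket 1 and the remaining two in bucket 2.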

An example

Create the table

create table t2_temp(
    name string,
    class string,
    sroce int 
);

Insert the data

insert into t2_temp values ('cfe', '2', 74);
insert into t2_temp values ('dss', '1', 95);
insert into t2_temp values ('ffd', '1', 95);
insert into t2_temp values ('fda', '1', 80);
insert into t2_temp values ('gds', '2', 92);
insert into t2_temp values ('gf', '3', 99);
insert into t2_temp values ('ddd', '3', 99);
insert into t2_temp values ('adf', '3', 45);
insert into t2_temp values ('asdf', '3', 55);
insert into t2_temp values ('3dd', '3', 78);

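1. Writing the over clause:

over(partition by class order by sroce)
Rows are partitioned by class and sorted by sroce within each partition. Because order by is given without an explicit frame, the default window runs from the first row of the partition to the current row (together with any rows tied with it), so an aggregate accumulates in sroce order.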
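As a sketch of that default, the two expressions below should return identical values on t2_temp. This assumes Hive's documented default frame of range between unbounded preceding and current row when order by is present; the aliases implicit_frame and explicit_frame are illustrative.

```sql
-- The second sum spells out the frame that the first one gets by default.
select name, class, sroce,
       sum(sroce) over(partition by class order by sroce) as implicit_frame,
       sum(sroce) over(partition by class order by sroce
                       range between unbounded preceding and current row) as explicit_frame
from t2_temp;
```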

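2. The window frame:

over(order by sroce range between 5 preceding and 5 following):
the window contains every row whose sroce lies between the current row's sroce minus 5 and the current row's sroce plus 5.

over(order by sroce rows between 5 preceding and 5 following):
the window contains the 5 rows before the current row, the current row itself, and the 5 rows after it.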
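The difference between the two frames is easiest to see by counting the rows each one covers. The query below is a minimal sketch against t2_temp; the aliases in_range and in_rows are illustrative.

```sql
-- in_range counts rows by value distance, in_rows counts rows by position.
select name, sroce,
       count(*) over(order by sroce range between 5 preceding and 5 following) as in_range,
       count(*) over(order by sroce rows  between 5 preceding and 5 following) as in_rows
from t2_temp;
```

For the row with sroce 78, the range frame covers the values 74, 78 and 80, so in_range is 3. That row is fourth in ascending order among the ten rows, so the rows frame reaches from the first row to the ninth and in_rows is 9.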

3. Functions used together with over()

Examples

1. Find the top student in each class

First, label every student with a rank inside the class

select name,class,sroce,rank() over(partition by class order by sroce desc) mm from t2_temp;

Then keep only the rows whose rank equals 1

SELECT * FROM (select t.name,t.class,t.sroce,rank() over(partition by t.class order by t.sroce desc) mm from t2_temp t) t where mm = 1;

Result

t.name	t.class	t.sroce	t.mm
dss	1	95	1
ffd	1	95	1
gds	2	92	1
gf	3	99	1
ddd	3	99	1

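Note: do not use row_number() to find the top score. If two students in the same class tie for first place, row_number() still gives them different numbers, so only one of them is returned.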

select * from (select t.name,t.class,t.sroce,row_number() over(partition by t.class order by t.sroce desc) mm from t2_temp t) t where mm = 1;

Result

t.name	t.class	t.sroce	t.mm
dss	1	95	1
gds	2	92	1
gf	3	99	1
Time taken: 30.707 seconds, Fetched: 3 row(s)

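2. Rank the scores within each class

rank() and dense_rank() both give tied rows the same rank, so every student tied for first place is returned.

The difference between rank() and dense_rank(): rank() skips ranks after a tie, so two students in second place are followed by fourth place, whereas dense_rank() leaves no gaps and continues with third place.

Method 1: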

select t.name,t.class,t.sroce,rank() over(partition by t.class order by t.sroce desc) mm from t2_temp t;

Result

t.name	t.class	t.sroce	mm
dss	1	95	1
ffd	1	95	1
fda	1	80	3
gds	2	92	1
cfe	2	74	2
gf	3	99	1
ddd	3	99	1
3dd	3	78	3
asdf	3	55	4
adf	3	45	5

Method 2:

select t.name,t.class,t.sroce,dense_rank() over(partition by t.class order by t.sroce desc) mm from t2_temp t;

Result

t.name	t.class	t.sroce	mm
dss	1	95	1
ffd	1	95	1
fda	1	80	2
gds	2	92	1
cfe	2	74	2
gf	3	99	1
ddd	3	99	1
3dd	3	78	2
asdf	3	55	3
adf	3	45	4

In the results, look at fda: with rank() it jumps straight to third place after the two students tied for first, while with dense_rank() it is second.

3. Using sum() over()

Cumulative sum of scores within each class

select t.name,t.class,t.sroce,sum(t.sroce) over(partition by t.class order by t.sroce desc) mm from t2_temp t;

Result

t.name	t.class	t.sroce	mm
dss	1	95	190	-- the two 95s are tied for first, so both rows show the sum of the two
ffd	1	95	190
fda	1	80	270	-- the tied first-place scores plus the next score
gds	2	92	92
cfe	2	74	166
gf	3	99	198
ddd	3	99	198
3dd	3	78	276
asdf	3	55	331
adf	3	45	376
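Tied rows share one cumulative value above because the default frame is defined by value (range), so it includes all peers of the current row. If a strictly row-by-row running total is wanted, a rows frame can be spelled out, as in this sketch; the alias running_total is illustrative.

```sql
-- rows frame: accumulate one physical row at a time, even across ties.
select t.name, t.class, t.sroce,
       sum(t.sroce) over(partition by t.class order by t.sroce desc
                         rows between unbounded preceding and current row) as running_total
from t2_temp t;
```

For class 1 this gives 95, 190, 270 instead of 190, 190, 270. Which of the two tied students receives 95 depends on how the engine orders the tie.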

4. Using first_value() over() and last_value() over()

These return the first and the last score in the window, respectively.

First score

select t.name,t.class,t.sroce,first_value(t.sroce) over(partition by t.class order by t.sroce desc) mm from t2_temp t;

Result

t.name	t.class	t.sroce	mm
dss	1	95	95
ffd	1	95	95
fda	1	80	95
gds	2	92	92
cfe	2	74	92
gf	3	99	99
ddd	3	99	99
3dd	3	78	99
asdf	3	55	99
adf	3	45	99

Last score

select t.name,t.class,t.sroce,last_value(t.sroce) over(partition by t.class order by t.sroce desc) mm from t2_temp t;
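One caveat, sketched here on the assumption that Hive applies its default frame (up to the current row and its peers): with only order by, last_value() returns the current row's own sroce rather than the lowest score of the class. Extending the frame to the whole partition gives the last value of the partition; the alias last_sroce is illustrative.

```sql
-- Explicit full-partition frame so last_value() sees every row of the class.
select t.name, t.class, t.sroce,
       last_value(t.sroce) over(partition by t.class order by t.sroce desc
                                rows between unbounded preceding and unbounded following) as last_sroce
from t2_temp t;
```

With this frame every row of class 1 shows 80, every row of class 2 shows 74, and every row of class 3 shows 45.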

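5. Using sum() over()

Total score of each class. Omitting order by makes the window the whole partition, so every row shows the class total instead of a running sum.

select t.name,t.class,t.sroce,sum(t.sroce) over(partition by t.class) mm from t2_temp t;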

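There are many more functions that are used in the same way as the ones above, so they are only listed briefly here. For the aggregates, including order by makes the result cumulative up to the current row; omit it to cover the whole partition.

1. count() over(partition by … order by …): the number of rows in the group.
2. max() over(partition by … order by …): the maximum value in the group.
3. min() over(partition by … order by …): the minimum value in the group.
4. avg() over(partition by … order by …): the average value in the group.
5. lag() over(partition by … order by …): the value from the row n rows before the current row.
6. lead() over(partition by … order by …): the value from the row n rows after the current row.
7. ratio_to_report() over(partition by … order by …): an Oracle analytic function; the argument of ratio_to_report() is the numerator, and the sum of that argument over the over() window is the denominator.
8. percent_rank() over(partition by … order by …): the relative rank of the current row, (rank - 1) / (number of rows in the partition - 1).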
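As a rough illustration of items 5 to 7, the query below fetches the neighbouring scores with lag() and lead() and computes each student's share of the class total with a plain sum() over(), which does not need a ratio_to_report function. The aliases prev_sroce, next_sroce and share_of_class are illustrative.

```sql
select name, class, sroce,
       -- score of the student ranked just above / just below within the class
       lag(sroce, 1)  over(partition by class order by sroce desc) as prev_sroce,
       lead(sroce, 1) over(partition by class order by sroce desc) as next_sroce,
       -- numerator: this row's score; denominator: the class total
       sroce / sum(sroce) over(partition by class) as share_of_class
from t2_temp;
```

prev_sroce is NULL for the first row of each class and next_sroce is NULL for the last row, because no neighbouring row exists there. For gds in class 2 the share is 92 / 166.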

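6. The difference between over(partition by) and group by:

group by simply groups the rows kept by the query and collapses each group into one output row; it is generally used together with aggregate functions such as max, min, sum, avg and count. partition by also groups rows, but it keeps every input row and additionally supports ordering within the group and window frames.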
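The contrast can be sketched with two queries on t2_temp that compute the same class totals; the alias total is illustrative.

```sql
-- group by: one row per class
select class, sum(sroce) as total
from t2_temp
group by class;

-- partition by: all ten rows are kept, each carrying its class total
select name, class, sroce,
       sum(sroce) over(partition by class) as total
from t2_temp;
```

The first query returns three rows (270 for class 1, 166 for class 2, 376 for class 3). The second returns all ten students, with the same totals repeated on every row of the corresponding class.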