hive窗口函数总结

最新推荐文章于 2024-08-05 17:32:23 发布

windy_bober

最新推荐文章于 2024-08-05 17:32:23 发布

阅读量2.8k

点赞数 4

分类专栏：大数据

本文链接：https://blog.csdn.net/bjl0915/article/details/111402452

版权

大数据专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Hive preceding and following理解

在了解hive开窗函数前我们来看看Hive窗口函数preceding and following是怎么回事呢.
Hive窗口函数中,有一个功能是统计当前行之前或之后指定行作为一个聚合,关键字是 preceding 和 following,举例说明其使用方法.
常规的窗口函数比较简单,这里介绍一下分组的,重点是分组排序之后的rows between用法。
关键是理解rows between中关键字含义：

关键字	含义
preceding	往前
following	往后
current row	当前行
unbounded	开始行
unbounded preceding	表示从前面的起点
unbounded following	表示到后面的终点

案例

select country,time,charge,
max(charge) over (partition by country order by time) as normal,
max(charge) over (partition by country order by time rows between unbounded preceding and current row) as unb_pre_cur,
max(charge) over (partition by country order by time rows between 2 preceding and 1 following) as pre2_fol1,
max(charge) over (partition by country order by time rows between current row and unbounded following) as cur_unb_fol 
from temp

注意:默认是在分组类的当前行之前的行中计算。
rows between unbounded preceding and current row和默认的一样
rows between 2 preceding and 1 following表示在当前行的前2行和后1行中计算
rows between current row and unbounded following表示在当前行和到最后行中计算
rows between对于avg、min、max、sum这几个窗口函数的含义基本是一致的,注意查看当前结果
注意查看分组后窗口函数统计结果
在 hive 环境中创建临时表

create table tmp_student
(
   name           string,
   class          tinyint,
   cooperator_name   string,
   score          tinyint
)row format delimited fields terminated by '|';

加载测试数据
load data local inpath ‘text.txt’ into table tmp_student;
其中text.txt中内容为:

adf|3|测试公司1|45
xx|3|测试公司2|55
cfe|2|测试公司2|74
3dd|3|测试公司5|n
fda|1|测试公司7|80
gds|2|测试公司9|92
ffd|1|测试公司10|95
dss|1|测试公司4|95
ddd|3|测试公司3|99
gf|3|测试公司9|99

查看是否加载成功

hive> select * from tmp_student;
adf	3	测试公司1	45
xx 3	测试公司2	55
cfe	2	测试公司2	74
3dd	3	测试公司5	NULL
fda	1	测试公司7	80
gds	2	测试公司9	92
ffd	1	测试公司10	95
dss	1	测试公司4	95
ddd	3	测试公司3	99
gf	3	测试公司9	99
Time taken: 1.314 seconds, Fetched: 10 row(s)

下面来了解preceding and following函数用法,执行下面sql

select
    name,
    score,
    sum(score) over(order by score range between 2 preceding and 2 following) s1, -- 当前行的score值加减2的范围内的所有行
    sum(score) over(order by score rows between 2 preceding and 2 following) s2, -- 当前行+前后2行,一共5行
    sum(score) over(order by score range between unbounded preceding and unbounded following) s3, -- 全部行,不做限制
    sum(score) over(order by score rows between unbounded preceding and unbounded following) s4, -- 全部行,不做限制
    sum(score) over(order by score) s5, -- 第一行到当前行（和当前行相同score值的所有行都会包含进去）
    sum(score) over(order by score rows between unbounded preceding and current row) s6, -- 第一行到当前行（和当前行相同score值的其他行不会包含进去,这是和上面的区别）
    sum(score) over(order by score rows between 3 preceding and current row) s7, -- 当前行+往前3行
    sum(score) over(order by score rows between 3 preceding and 1 following) s8, --当前行+往前3行+往后1行
    sum(score) over(order by score rows between current row and unbounded following) s9 --当前行+往后所有行
from
    tmp.tmp_student
order by 
    score;

注意:
当ORDER BY后面缺少窗口从句条件，窗口规范默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
当ORDER BY和窗口从句都缺失, 窗口规范默认是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
rows是物理窗口，是哪一行就是哪一行，与当前行的值（order by key的key的值）无关，只与排序后的行号相关，就是我们常规理解的那样。
range是逻辑窗口，与当前行的值有关（order by key的key的值）,在key上操作range范围。
得到相关结果如下
注意查看窗口函数统计结果通过上面的练习我们主要是对preceding and following有了一个比较全面的理解,所谓开窗函数其实就相当于flink中的滚动窗口,统计分析都是基于这个滚动窗口内完成的,所有的聚合计算统计都需要先根据range或者row确定窗口内的数据,然后就很容易得到正确的计算结果,在确定行数的过程中需要根据range和row确定是逻辑范围还是物理范围,最终都可以看作是第N行到第M行内数据的聚合统计.
Flink窗口说明

窗口函Windowing functions

FIRST_VALUE(col, bool DEFAULT)

返回分组窗口内第一行col的值,DEFAULT默认为false,如果指定为true,则跳过NULL后再取值,对于FIRST_VALUE每个分组第一行数据的FIRST_VALUE(col, bool DEFAULT) 就等于col,接下来几行数据会参考第一行数据是否为NULL根据True/False进行取舍.

WITH tmp AS (
		SELECT 1 AS group_id, 'a' AS col
		UNION ALL
		SELECT 1 AS group_id, 'b' AS col
		UNION ALL
		SELECT 1 AS group_id, 'c' AS col
		UNION ALL
		SELECT 2 AS group_id, NULL AS col
		UNION ALL
		SELECT 2 AS group_id, 'e' AS col
	)
SELECT group_id, col, FIRST_VALUE(col) OVER (PARTITION BY group_id ORDER BY col) AS col_new
FROM tmp;
返回结果为:
group_id col col_new  
1 a a 
1 b a 
1 c a 
2 NULL NULL  
2 e NULL
如果是True
WITH tmp AS (
		SELECT 1 AS group_id, NULL AS col
		UNION ALL
		SELECT 1 AS group_id, 'b' AS col
		UNION ALL
		SELECT 1 AS group_id, 'c' AS col
		UNION ALL
		SELECT 2 AS group_id, NULL AS col
		UNION ALL
		SELECT 2 AS group_id, 'e' AS col
	)
SELECT group_id, col, FIRST_VALUE(col, true) OVER (PARTITION BY group_id ORDER BY col) AS col_new
FROM tmp;
返回结果为：  
group_id col col_new  
1 NULL NULL  
1 b b  
1 c b  
2 NULL NULL  
2 e e

LAST_VALUE(col, bool DEFAULT)
返回分组窗口内第后一行col的值,DEFAULT默认为false,如果指定为true,则跳过NULL后再取值.

WITH tmp AS (
		SELECT 1 AS group_id, 'a' AS col
		UNION ALL
		SELECT 1 AS group_id, NULL AS col
		UNION ALL
		SELECT 1 AS group_id, 'c' AS col
		UNION ALL
		SELECT 2 AS group_id, 'd' AS col
		UNION ALL
		SELECT 2 AS group_id, 'e' AS col
	)
SELECT group_id, col, LAST_VALUE(col) OVER (PARTITION BY group_id ORDER BY col DESC) AS col_new FROM tmp; 
返回结果为： 
group_id col col_new  
1 c c  
1 a a  
1 NULL NULL  
2 e e  
2 d d
如果是True
WITH tmp AS (
		SELECT 1 AS group_id, 'a' AS col
		UNION ALL
		SELECT 1 AS group_id, NULL AS col
		UNION ALL
		SELECT 1 AS group_id, 'c' AS col
		UNION ALL
		SELECT 2 AS group_id, 'd' AS col
		UNION ALL
		SELECT 2 AS group_id, 'e' AS col
	)
SELECT group_id, col, LAST_VALUE(col, true) OVER (ORDER BY group_id,col DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS col_new FROM tmp;
返回结果为：  
group_id col col_new  
1 c a  
1 a a  
1 NULL e  
2 e d  
2 d d

开窗函数不同于group by函数,开窗函数能够把所有的记录都显示出来,一般select所选择的列也都与over里面的分组和排序字段相同,这样才能比较清楚地看到当前记录在聚合函数中的区别和贡献,上面两个窗口函数我们针对最后一个案例进行说明下.
使用了开窗函数首先要确定窗口的大小,根据上面的PRECEDING和FOLLOWING讲解我们可以知道在分析时候窗口大小为[前一行,当前行,后一行],那么对于第一行1 c取出last_value就是从[空值,c,a]取出集合中最后一个就是a,同理对于第二行1 a取出last_value就是从[c,a,Null]中取出最后一个Null跳过再取得到a,对于2 e从集合[Null,e,d]last_value=d*

LEAD(col, n, DEFAULT)

返回分组窗口内往下第n行col的值,n默认为1,往下第n没有时返回DEFAULT（DEFAULT默认为NULL）使用分组后那么分组之间就不交叉计算.

WITH tmp AS
(
 SELECT 1 AS group_id, 'a' AS col 
 UNION ALL SELECT 1 AS group_id,  'b' AS col 
 UNION ALL SELECT 1 AS group_id,  'c' AS col 
 UNION ALL SELECT 2 AS group_id,  'd' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,
      col,
      LEAD(col) over(partition by group_id order by col) as col_new
FROM tmp;

返回结果

group_id col col_new
1 a b
1 b c
1 c NULL
2 d e
2 e NULL

等同于

WITH tmp AS
(
 SELECT 1 AS group_id, 'a' AS col 
 UNION ALL SELECT 1 AS group_id,  'b' AS col 
 UNION ALL SELECT 1 AS group_id,  'c' AS col 
 UNION ALL SELECT 2 AS group_id,  'd' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,
      col,
      LAST_VALUE(col) over(partition by group_id order by col rows between 1 FOLLOWING and 1 FOLLOWING) as col_new
FROM tmp;

其中rows between 1 FOLLOWING and 1 FOLLOWING为从往后一行开始到往后一行结束=往后一行
返回结果

group_id col col_new
1 a b
1 b c
1 c NULL
2 d e
2 e NULL

使用LEAD默认值

WITH tmp AS
(
 SELECT 1 AS group_id, 'a' AS col 
 UNION ALL SELECT 1 AS group_id,  'b' AS col 
 UNION ALL SELECT 1 AS group_id,  'c' AS col 
 UNION ALL SELECT 2 AS group_id,  'd' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,
      col,
      LEAD(col, 2, 'z') over(partition by group_id order by col) as col_new
FROM tmp;

返回结果

group_id col col_new
1 a c
1 b z
1 c z
2 d z
2 e z

LAG(col, n, DEFAULT)
返回分组窗口内往上第n行col的值,n默认为1,往上第n没有时返回DEFAULT（DEFAULT默认为NULL）

WITH tmp AS
(
 SELECT 1 AS group_id, 'a' AS col 
 UNION ALL SELECT 1 AS group_id,  'b' AS col 
 UNION ALL SELECT 1 AS group_id,  'c' AS col 
 UNION ALL SELECT 2 AS group_id,  'd' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,
      col,
      LAG(col) over(partition by group_id order by col) as col_new
FROM tmp;

等同于

WITH tmp AS
(
 SELECT 1 AS group_id, 'a' AS col 
 UNION ALL SELECT 1 AS group_id,  'b' AS col 
 UNION ALL SELECT 1 AS group_id,  'c' AS col 
 UNION ALL SELECT 2 AS group_id,  'd' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,
      col,
      FIRST_VALUE(col) over(partition by group_id order by col rows BETWEEN 1 PRECEDING and 1 PRECEDING) as col_new
FROM tmp;

返回结果都是

group_id col col_new
1 a NULL
1 b a
1 c b
2 d NULL
2 e d

使用默认值

WITH tmp AS
(
 SELECT 1 AS group_id, 'a' AS col 
 UNION ALL SELECT 1 AS group_id,  'b' AS col 
 UNION ALL SELECT 1 AS group_id,  'c' AS col 
 UNION ALL SELECT 2 AS group_id,  'd' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,
      col,
      LAG(col, 2, 'zz') over(partition by group_id order by col) as col_new
FROM tmp;

返回结果

group_id col col_new
1 a zz
1 b zz
1 c a
2 d zz
2 e zz

OVER详解 The OVER clause

** FUNCTION(expr) OVER([PARTITION BY statement] [ORDER BY statement] [window clause]) **
中括号为可选参数
FUNCTION：包括标准聚合函数(COUNT/SUM/MIN/MAX/AVG)和一些分析函数(RANK/ROW_NUMBER/DENSE_RANK等)
PARTITION BY：可以由一个或者多个列组成
ORDER BY：可以由一个或者多个列组成
window clause:(ROWS | RANGE) BETWEEN (UNBOUNDED PRECEDING | num PRECEDING | CURRENT ROW) AND (UNBOUNDED PRECEDING | num PRECEDING | CURRENT ROW)
当window clause 未指定时默认为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,即分组内第一行至当前行作为窗口
当 window clause和ORDER BY都未指定时,默认为ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
**即分组内第一行至最后一行作为窗口.**

标准聚合函数

COUNT(expr) OVER()
返回窗口内行数
WITH tmp AS
(
 SELECT 1 AS group_id, 'a' AS col 
 UNION ALL SELECT 1 AS group_id,  'b' AS col 
 UNION ALL SELECT 1 AS group_id,  'c' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col 
 UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,
      col,
      count(col) over(partition by group_id) as cnt1,
      count(col) over(partition by group_id order by col) as cnt2,
      count(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as cnt3,
      count(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as cnt4
FROM tmp;
返回结果为
group_id col cnt1 cnt2 cnt3 cnt4
1 a 3 1 3 3
1 b 3 2 2 2
1 c 3 3 1 1
2 e 2 2 2 1
2 e 2 2 1 1

SUM(expr) OVER()
返回窗口内求和值
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  2 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,
      col,
      SUM(col) over(partition by group_id) as sum1,
      SUM(col) over(partition by group_id order by col) as sum2,
      SUM(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as sum3,
      SUM(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as sum4
FROM tmp;
返回结果为
group_id col sum1 sum2 sum3 sum4
1 1 6 1 6 6
1 2 6 3 5 5
1 3 6 6 3 3
2 4 8 8 8 4
2 4 8 8 4 4

MIN(expr) OVER()
返回窗口内最小值
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  2 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      MIN(col) over(partition by group_id) as min1,
      MIN(col) over(partition by group_id order by col) as min2,
      MIN(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as min3
FROM tmp;
group_id col min1 min2 min3
1 1 1 1 1
1 2 1 1 2
1 3 1 1 3
2 4 4 4 4
2 5 4 4 5

MAX(expr) OVER()
返回窗口内最大值
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  2 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      MAX(col) over(partition by group_id) as max1,
      MAX(col) over(partition by group_id order by col) as max2,
      MAX(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as max3
FROM tmp;
返回结果为
group_id col max1 max2 max3
1 1 3 1 3
1 2 3 2 3
1 3 3 3 3
2 4 5 4 5
2 5 5 5 5

AVG(expr) OVER()
返回窗口内平均值
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  2 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,
      col,
      AVG(col) over(partition by group_id) as avg1,
      AVG(col) over(partition by group_id order by col) as avg2,
      AVG(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg3,
      AVG(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg4
FROM tmp;
返回结果为
|group_id|col|avg1|avg2|avg3|avg4|
|1|1|2.0|1.0|2.0|2.0|
|1|2|2.0|1.5|2.5|2.5|
|1|3|2.0|2.0|3.0|3.0|
|2|4|4.0|4.0|4.0|4.0|
|2|4|4.0|4.0|4.0|4.0|

分析函数 Analytics functions
RANK() OVER()
返回分组内排名（不支持自定义窗口）
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      RANK() over(partition by group_id order by col desc) as r
FROM tmp;
返回结果为
|group_id|col|r|
|1|3|1|
|1|3|1|
|1|1|3|
|2|5|1|
|2|4|2|

ROW_NUMBER() OVER()
返回分组内行号（不支持自定义窗口）
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      ROW_NUMBER() over(partition by group_id order by col desc) as r
FROM tmp;
返回结果为
|group_id|col|r|
|1|3|1|
|1|3|2|
|1|1|3|
|2|5|1|
|2|4|2|

DENSE_RANK() OVER()
返回分组内排名（排名相等不会留下空位,不支持自定义窗口）
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      DENSE_RANK() over(partition by group_id order by col desc) as r
FROM tmp;
返回结果为
|group_id|col|r|
|1|3|1|
|1|3|1|
|1|1|2|
|2|5|1|
|2|4|2|

CUME_DIST() OVER()
返回分组内累计分布值,即分组内小于(或者大于)等于当前值行数/分组内总行数
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      CUME_DIST() over(partition by group_id order by col asc) as d1,
      CUME_DIST() over(partition by group_id order by col desc) as d2
FROM tmp;

返回结果为
|group_id|col|d1|d2|
|1|3|1.0|0.6666666666666666|
|1|3|1.0|0.6666666666666666|
|1|1|0.3333333333333333|1.0|
|2|5|1.0|0.5|
|2|4|0.5|1.0|

PERCENT_RANK() OVER()
返回百分比排序值,即分组内当前行的RANK值-1/分组内总行数-1
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      RANK() over(partition by group_id order by col asc) as r1,
      PERCENT_RANK() over(partition by group_id order by col asc) as p1,
      RANK() over(partition by group_id order by col desc) as r2,
      PERCENT_RANK() over(partition by group_id order by col desc) as p2
FROM tmp;

返回结果为
|group_id|col|r1|p1|r2|p2|
|1|3|2|0.5|1|0.0|
|1|3|2|0.5|1|0.0|
|1|1|1|0.0|3|1.0|
|2|5|2|1.0|1|0.0|
|2|4|1|0.0|2|1.0|

NTILE(INTEGER x) OVER()
返回分区编号（将有序分区划分为x个组,称为bucket,并为分区中的每一行分配一个bucket编号）
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      col,
      NTILE(2) over(partition by group_id order by col asc) as bucket_id
FROM tmp;

返回结果为
|group_id|col|bucket_id|
|1|1|1|
|1|3|1|
|1|3|2|
|1|3|2|
|2|4|1|
|2|5|2|
OVER子句也支持聚合函数
Hive 2.1.0及之后版本,OVER子句也支持聚合函数,如：
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,
      RANK() over(order by sum(col) desc) as r
FROM tmp
group by group_id;
结果为
|group_id|r|
|2|1|
|1|2|

window clause 的另一种写法
将window子句写在from后面,在over后使用别名进行引用,如下:
WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  2 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,
      col,
      AVG(col) over w1 as avg1,
      AVG(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg2
FROM tmp
WINDOW w1 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following);

结果为
|group_id|col|avg1|avg2|
|1|1|2.0|2.0|
|1|2|2.5|2.5|
|1|3|3.0|3.0|
|2|4|4.0|4.0|
|2|4|4.0|4.0|

WITH tmp AS
(
 SELECT 1 AS group_id, 1 AS col 
 UNION ALL SELECT 1 AS group_id,  2 AS col 
 UNION ALL SELECT 1 AS group_id,  3 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col 
 UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,
      col,
      AVG(col) over w1 as avg1,
      AVG(distinct col) over w2 as avg2
FROM tmp
WINDOW w1 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following),
w2 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following);

结果为
|group_id|col|avg1|avg2|
|1|1|2.0|2.0|
|1|2|2.5|2.5|
|1|3|3.0|3.0|
|2|4|4.0|4.0|
|2|4|4.0|4.0|

本文完.
Any suggestions and criticisms will be sincerely welcomed.
资料

https://blog.csdn.net/happyrocking/article/details/105369558
https://docs.aws.amazon.com/redshift/latest/dg/redshift
https://www.jianshu.com/p/3f3cf58472ca
https://www.cnblogs.com/hyunbar/p/13524855.html
https://blog.csdn.net/weixin_42307036/article/details/112381387