hive sql——窗口函数使用实战小结

16 篇文章 1 订阅
12 篇文章 2 订阅

hive推出的窗口函数功能是对hive sql的功能增强,确实目前用于离线数据分析逻辑日趋复杂,很多场景都需要用到。用于实现分组内所有和连续累积的统计。

一、什么是窗口函数

1、窗口函数指定了函数工作的数据窗口大小(当前行的上下多少行),这个数据窗口大小可能会随着行的变化而变化。
2、窗口函数对于每个组返回多行,组内每一行对应返回一行值。

二、窗口函数介绍

()、聚合函数:
	1.sum(col) over() :  分组对col累计求和,over() 中的语法如下
	2.count(col) over() : 分组对col累计,over() 中的语法如下
	3.min(col) over() : 分组对col求最小
	4.max(col) over() : 分组求col的最大值
	5.avg(col) over() : 分组求col列的平均值

()、取值函数:
	1.first_value(col) over() : 某分区排序后的第一个col值
	2.last_value(col) over() : 某分区排序后的最后一个col值
	3.lag(col,n,DEFAULT) : 统计往前n行的col值,n可选,默认为1DEFAULT当往上第n行为NULL时候,取默认值,如不指定,则为NULL
	4.lead(col,n,DEFAULT) : 统计往后n行的col值,n可选,默认为1DEFAULT当往下第n行为NULL时候,取默认值,如不指定,则为NULL()、排序函数:
	1.row_number() over() : 排名函数,不会重复,适合于生成主键或者不并列排名
	2.rank() over() :  排名函数,有并列名次,名次不连续。如:1,1,3
	3.dense_rank() over() : 排名函数,有并列名次,名次连续。如:114.ntile(n) : 用于将分组数据按照顺序切分成n片,返回当前切片值。注意:n必须为int类型。

()、比例函数:
	1.cume_dist() 小于等于当前值的行数/分组内总行数比如,统计小于等于当前薪水的人数,所占总人数的比例
	2.percent_rank() 计算给定行的百分比排名。分组内当前行的RANK值-1/分组内总行数-1,可以用来计算超过了百分之多少的人。
	
()、增强GROUP BY
    1.GROUPING SETS
    2.GROUPING__ID
    3.CUBE
    4.ROLLUP

三、窗口函数基本语法

<窗口函数>()
OVER
(
	[PARTITION BY <COLUMN 1 , COLUMN 2,COLUMN 3 ...>]
	[ORDER BY <排序用的清单列>][ASC/DESC]
	(ROWS | RANGE) <范围条件>
)

窗口函数的语法分为四个部分:

  • 函数子句:指明具体操作,如sum-求和,first_value-取第一个值;
  • partition by子句:指明分区字段,如果没有,则将所有数据作为一个分区;
  • order by子句:指明了每个分区排序的字段和方式,也是可选的,没有就是按照表中的顺序;
  • 窗口子句:指明相对当前记录的计算范围,可以向上(preceding),可以向下(following),也可以使用between指明,上下边界的值,没有的话默认为当前分区。ROWS BETWEEN,也叫做window子句。数字+PRECEDING 向前n条,数字+FOLLOWING 向后n条, CURRENT ROW 当前行,UNBOUNDED 无边界。 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING表示从最前面的起点开始,表示到最后面的终点,UNBOUNDED PRECEDING 向前无边界,UNBOUNDED FOLLOWING向后无边界。

四、理解Window子句

1. 理解下什么是WINDOW子句(灵活控制窗口的子集)

  • PRECEDING:往前
  • FOLLOWING:往后
  • CURRENT ROW:当前行
  • UNBOUNDED:无边界(一般结合PRECEDING,FOLLOWING使用)
  • UNBOUNDED PRECEDING:表示该窗口最前面的行(起点)
  • UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)

比如说:

  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行)
  • ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
  • ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW(表示往前2行到当前行)
  • ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示当前行到终点)

2. 官网有一段话列出了如下窗口函数是不支持window子句的:

  • rank(): 排名函数,有并列名次,名次不连续。如:1,1,3。
  • ntile(n): 用于将分组数据按照顺序切分成n片,返回当前切片值。注意:n必须为int类型。
  • dense_rank() over(): 排名函数,有并列名次,名次连续。如:1,1。
  • cume_dist():小于等于当前值的行数/分组内总行数比如,统计小于等于当前薪水的人数,所占总人数的比例。
  • percent_rank(): 计算给定行的百分比排名。分组内当前行的RANK值-1/分组内总行数-1,可以用来计算超过了百分之多少的人。

Hive窗口函数实例:

聚合函数

实例表准备:
CREATE EXTERNAL TABLE test.student_score (
`student_id` string,
`date_key` string,
`school_id` string,
`grade` string,
`class` string,
`score` string
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
stored as textfile 
location '/tmp/test/student_score/';
数据准备
10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
SUM()

SUM — 注意,结果和ORDER BY相关,默认为升序

测试代码:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id) AS scores1, 
    --从起点到当前行,结果同pv1 
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS scores2, 
    --分组内所有行
    SUM(score) OVER(PARTITION BY school_id) AS scores3,
     --当前行+往前3行                             
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS scores4,  
    --当前行+往前3行+往后1行
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS scores5,   
    --当前行+往后所有行  
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS scores6    
from test.student_score;

结果:

school_idstudent_idscorescores1scores2scores3scores4scores5scores6
1001100011111.011.0396.011.066.0396.0
1001100025566.066.0396.066.0143.0385.0
10011000377143.0143.0396.0143.0176.0330.0
10011000433176.0176.0396.0176.0198.0253.0
10011000522198.0198.0396.0187.0286.0220.0
10011000699297.0297.0396.0231.0330.0198.0
10011000799396.0396.0396.0253.0253.099.0

如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;

COUNT()

在Window子句上与sum()的理解不同,结果和ORDER BY相关,默认为升序

测试代码:

select 
    school_id,
    grade,
    class,
     -- 默认为从起点到当前行,明细数据会按照排序主键排序来一行加一行,如果排序主键有多行数据一样,也就是多行数据顺序一致,则会一起被加上。
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class) AS sv1,
    --从起点到当前行,在分组内,按照排序建顺序,来一行加一行,就算是排序主键一样也会一次加,不会一起被加。
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sv2, 
    --分组内所有行
    COUNT(student_id) OVER(PARTITION BY grade) AS sv3, 
    --当前行+往前3行,规则同sv2
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS sv4,  
     --当前行+往前3行+往后1行,规则同sv2 
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS sv5,   
     ---当前行+往后所有行,规则同sv2
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sv6  
from test.student_score;

注意:窗口函数不支持COUNT(DISTINCT xxx)操作,这种情况可以利用子查询先对数据进行去重之后再进行统计,可以参见我的文章:HIVE如何实现如何实现 COUNT(DISTINCT ) OVER (PARTITION BY )?

结果:

school_idgradeclasssv1sv2sv3sv4sv5sv6
1001初二2212122
1001初二2222221
1001初一1213123
1001初一1223232
1001初一3333331
1001初三1112122
1001初三2222221
MAX()

在Window子句上与sum()的理解相同,结果和ORDER BY相关,默认为升序。

代码测试:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id) AS mas1, 
    --从起点到当前行,结果同pv1 
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mas2, 
    --分组内所有行
    MAX(score) OVER(PARTITION BY school_id) AS mas3,
     --当前行+往前3行                             
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mas4,  
    --当前行+往前3行+往后1行
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS mas5,   
    --当前行+往后所有行  
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mas6    
from test.student_score;

结果:

school_idstudent_idscoremas1mas2mas3mas4mas5mas6
10011000111111199115599
10011000255555599557799
10011000377777799777799
10011000433777799777799
10011000522777799779999
10011000699999999999999
10011000799999999999999
MIN()

在Window子句上与sum()的理解相同,结果和ORDER BY相关,默认为升序

代码测试:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id) AS mis1, 
    --从起点到当前行,结果同pv1 
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mis2, 
    --分组内所有行
    MIN(score) OVER(PARTITION BY school_id) AS mis3,
     --当前行+往前3行                             
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mis4,  
    --当前行+往前3行+往后1行
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS mis5,   
    --当前行+往后所有行  
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mis6    
from test.student_score;

结果:

school_idstudent_idscoremis1mis2mis3mis4mis5mis6
10011000111111111111111
10011000255111111111122
10011000377111111111122
10011000433111111111122
10011000522111111222222
10011000699111111222299
10011000799111111222299
AVG()

在Window子句上与sum()的理解相同,结果和ORDER BY相关,默认为升序

测试代码:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id) AS as1, 
    --从起点到当前行,结果同pv1 
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS as2, 
    --分组内所有行
    AVG(score) OVER(PARTITION BY school_id) AS as3,
     --当前行+往前3行                             
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS as4,  
    --当前行+往前3行+往后1行
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS as5,   
    --当前行+往后所有行  
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS as6    
from test.student_score;

结果:

school_idstudent_idscoremis1mis2mis3mis4mis5mis6
1001100011111.011.056.5714285714285711.033.056.57142857142857
1001100025533.033.056.5714285714285733.047.66666666666666464.16666666666667
1001100037747.66666666666666447.66666666666666456.5714285714285747.66666666666666444.066.0
1001100043344.044.056.5714285714285744.039.663.25
1001100052239.639.656.5714285714285746.7557.273.33333333333333
1001100069949.549.556.5714285714285757.7566.099.0
1001100079956.5714285714285756.5714285714285756.5714285714285763.2563.2599.0

排序函数

实例表准备
CREATE EXTERNAL TABLE test.student_score(
    `student_id` STRING, 
    `date_key` STRING, 
    `school_id` STRING, 
    `grade` STRING, 
    `class` STRING, 
    `score` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile
LOCATION '/tmp/test/student_score';
数据准备
10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
10001,2021-05-20,1001,初一,1,22
10002,2021-05-21,1001,初二,2,66
10003,2021-05-23,1001,初三,1,88
10004,2021-05-24,1001,初一,3,44
10005,2021-05-25,1001,初一,1,33
10006,2021-05-26,1001,初三,2,33
10007,2021-05-27,1001,初二,2,11
NTILE()

NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY school_id ORDER BY date_key ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)。如果切片不均匀,默认增加第一个切片的分布

SELECT 
    school_id,
    student_id,
    score,
    --分组内将数据分成2片
    NTILE(2) OVER(PARTITION BY school_id ORDER BY student_id) AS rn1,
    --分组内将数据分成3片
    NTILE(3) OVER(PARTITION BY school_id ORDER BY student_id) AS rn2,
    --将所有数据分成4片
    NTILE(4) OVER(ORDER BY school_id) AS rn3
FROM test.student_score;

结果:

school_idstudent_idscorern1rn2rn3
10011000111111
10011000255111
10011000377111
10011000433121
10011000522222
10011000699232
10011000799232
10021000122112
10021000266113
10021000388113
10021000444123
10021000533224
10021000633234
10021000711234

统计不同学校school,成绩最高的前1/3的学生:

--rn1 = 1 的记录,就是我们想要的结果
SELECT 
school_id,
student_id,
score,
NTILE(3) OVER(PARTITION BY school_id ORDER BY student_id DESC) AS rn1
FROM test.student_score;

结果:

school_idstudent_idscorern1
100210007111
100210006331
100210005331
100210004442
100210003882
100210002663
100210001223
100110007991
100110006991
100110005221
100110004332
100110003772
100110002553
100110001113
ROW_NUMBER()\RANK()\DENSE_RANK()
SELECT
    school_id,
    student_id,
    score,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY score) AS rank1,
    RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank2,
    DENSE_RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank3
FROM test.student_score
ORDER BY school_id,score;

结果:

school_idstudent_idscorerank1rank2rank3
10011000111111
10011000522222
10011000433333
10011000255444
10011000377555
10011000699666
10011000799766
10021000711111
10021000122222
10021000533333
10021000633433
10021000444554
10021000266665
10021000388776

比例函数

CUME_DIST()

CUME_DIST 小于等于当前值的行数占分组内总行数的比例。

--–比如,统计小于等于当前成绩的人数,所占总人数的比例
SELECT 
    school_id,
    student_id,
    score,
    CUME_DIST() OVER(ORDER BY score) AS rn1,
    CUME_DIST() OVER(PARTITION BY school_id ORDER BY score) AS rn2 
FROM test.student_score;

结果:

school_idstudent_idscorern1rn2
100110001110.142857142857142850.14285714285714285
100110005220.28571428571428570.2857142857142857
100110004330.50.42857142857142855
100110002550.64285714285714290.5714285714285714
100110003770.78571428571428570.7142857142857143
100110006991.01.0
100110007991.01.0
100210007110.142857142857142850.14285714285714285
100210001220.28571428571428570.2857142857142857
100210005330.50.5714285714285714
100210006330.50.5714285714285714
100210004440.57142857142857140.7142857142857143
100210002660.71428571428571430.8571428571428571
100210003880.85714285714285711.0
PERCENT_RANK()

PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1

SELECT 
    school_id,
    student_id,
    score,
    PERCENT_RANK() OVER(ORDER BY score) AS rn1,   --分组内
    RANK() OVER(ORDER BY score) AS rank1,          --分组内RANK值
    SUM(1) OVER(PARTITION BY NULL) AS sum1,     --分组内总行数
    PERCENT_RANK() OVER(PARTITION BY school_id ORDER BY score)
     AS rn2,
     RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank2,
    SUM(1) OVER(PARTITION BY school_id) AS sum2
FROM test.student_score;

结果:

school_idstudent_idscorern1rank1sum1rn2rank2sum2
100110001110.01140.017
100110005220.153846153846153853140.1666666666666666627
100110004330.30769230769230775140.333333333333333337
100110002550.61538461538461549140.547
100110003770.769230769230769311140.666666666666666657
100110006990.923076923076923113140.833333333333333467
100110007990.923076923076923113140.833333333333333467
100210007110.01140.017
100210001220.153846153846153853140.1666666666666666627
100210005330.30769230769230775140.333333333333333337
100210006330.30769230769230775140.333333333333333337
100210004440.53846153846153848140.666666666666666657
100210002660.692307692307692310140.833333333333333467
100210003880.846153846153846112141.077

取值函数

数据和表准备同排序函数。

LAG
  • 作用:
    LAG(col,n,DEFAULT)用于统计窗口内往上第n行值
  • 参数:
    第一个参数为列名,
    第二个参数为往上第n行(可选,默认为1),
    第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT
    school_id,
    date_key,
    grade,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
    LAG(date_key,1,'1970-01-01') OVER(PARTITION BY school_id ORDER BY date_key) AS last_day_1,
    LAG(date_key,2) OVER(PARTITION BY school_id ORDER BY date_key) AS last_day_2 
FROM test.student_score;

结果:

school_iddate_keygradernlast_day_1last_day_2
10022021-05-20初一11970-01-01NULL
10022021-05-21初二22021-05-20NULL
10022021-05-23初三32021-05-212021-05-20
10022021-05-24初一42021-05-232021-05-21
10022021-05-25初一52021-05-242021-05-23
10022021-05-26初三62021-05-252021-05-24
10022021-05-27初二72021-05-262021-05-25
10012021-05-20初一11970-01-01NULL
10012021-05-21初二22021-05-20NULL
10012021-05-23初三32021-05-212021-05-20
10012021-05-24初一42021-05-232021-05-21
10012021-05-25初一52021-05-242021-05-23
10012021-05-26初三62021-05-252021-05-24
10012021-05-27初二72021-05-262021-05-25

结果分析:

last_day_1: 指定了往上第1行的值,default'1970-01-01'  
            1002第一行,往上1行为NULL,因此取默认值 1970-01-01
            1002第三行,往上1行值为第二行值,2021-05-20
            1002第六行,往上1行值为第五行值,2021-05-25
last_day_2: 指定了往上第2行的值,为指定默认值
			1002第一行,往上2行为NULL
			1002第二行,往上2行为NULL
		    1002第四行,往上2行为第二行值,2021-05-21
		    1002第七行,往上2行为第五行值,2021-05-25
LEAD()
  • 作用:
    LAG相反。LEAD(col,n,DEFAULT)用于统计窗口内往下第n行值。
  • 参数:
    第一个参数为列名.
    第二个参数为往下第n行(可选,默认为1).
    第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT
    school_id,
    date_key,
    grade,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
    LEAD(date_key,1,'1970-01-01') OVER(PARTITION BY school_id ORDER BY date_key) AS next_day_1,
    LEAD(date_key,2) OVER(PARTITION BY school_id ORDER BY date_key) AS next_day_2 
FROM test.student_score;

结果:

school_iddate_keygradernnext_day_1next_day_2
10022021-05-20初一12021-05-212021-05-23
10022021-05-21初二22021-05-232021-05-24
10022021-05-23初三32021-05-242021-05-25
10022021-05-24初一42021-05-252021-05-26
10022021-05-25初一52021-05-262021-05-27
10022021-05-26初三62021-05-27NULL
10022021-05-27初二71970-01-01NULL
10012021-05-20初一12021-05-212021-05-23
10012021-05-21初二22021-05-232021-05-24
10012021-05-23初三32021-05-242021-05-25
10012021-05-24初一42021-05-252021-05-26
10012021-05-25初一52021-05-262021-05-27
10012021-05-26初三62021-05-27NULL
10012021-05-27初二71970-01-01NULL

结果分析:

next_day_1: 指定了往下第1行的值,default'1970-01-01'  
            1002第七行,往下行为NULL,因此取默认值 1970-01-01
            1002第四行,往下1行值为第二行值,2021-05-25
            1002第二行,往下1行值为第五行值,2021-05-23
next_day_2: 指定了往下第2行的值,为指定默认值
			1002第七行,往下2行为NULL
			1002第六行,往下2行为NULL
		    1002第三行,往下2行为第二行值,2021-05-25
		    1002第一行,往上2行为第五行值,2021-05-23
FIRST_VALUE()、LAST_VALUE()
  1. FIRST_VALUE()取分组内排序后,截止到当前行,第一个值。
  2. LAST_VALUE()取分组内排序后,截止到当前行,最后一个值。
  3. 如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果。
  4. 如果想要取分组内排序后最后一个值,则需要变通一下,如实例中last_3last_4
SELECT
    school_id,
    date_key,
    grade,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
    FIRST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key) AS first_1,
    -- 如果不指定`ORDER BY`,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果。
    FIRST_VALUE(date_key) OVER(PARTITION BY school_id) AS first_2,
    LAST_VALUE(date_key) OVER(PARTITION BY school_id) AS last_2,
    --我们发现LAST_VALUE并不能取最后一个值而是默认取从起点到当前行的最后一个值
    LAST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key) AS last_3,
    --如果想要取分组内排序后最后一个值,则需要变通一下
    FIRST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key DESC) AS last_4 
FROM test.student_score;

结果:

school_iddate_keygradernfirst_1first_2last_2last_3last_4
10022021-05-20初一12021-05-202021-05-202021-05-272021-05-202021-05-27
10022021-05-21初二22021-05-202021-05-202021-05-272021-05-212021-05-27
10022021-05-23初三32021-05-202021-05-202021-05-272021-05-232021-05-27
10022021-05-24初一42021-05-202021-05-202021-05-272021-05-242021-05-27
10022021-05-25初一52021-05-202021-05-202021-05-272021-05-252021-05-27
10022021-05-26初三62021-05-202021-05-202021-05-272021-05-262021-05-27
10022021-05-27初二72021-05-202021-05-202021-05-272021-05-272021-05-27
10012021-05-20初一12021-05-202021-05-202021-05-272021-05-202021-05-27
10012021-05-21初二22021-05-202021-05-202021-05-272021-05-212021-05-27
10012021-05-23初三32021-05-202021-05-202021-05-272021-05-232021-05-27
10012021-05-24初一42021-05-202021-05-202021-05-272021-05-242021-05-27
10012021-05-25初一52021-05-202021-05-202021-05-272021-05-252021-05-27
10012021-05-26初三62021-05-202021-05-202021-05-272021-05-262021-05-27
10012021-05-27初二72021-05-202021-05-202021-05-272021-05-272021-05-27

增强GROUP

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP,这几个分析函数通常用于OLAP中,使用于多指标需要根据不同的维度上钻和下钻的指标统计,比如,分学校、年级、班级的统计不同维度组合数据。

GROUPING SETS

在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL。其中的 GROUPING__ID,表示结果属于哪一个分组集合。

SELECT 
    school_id,
    grade,
    class,
    COUNT(DISTINCT student_id) AS uv,
    GROUPING__ID 
FROM test.student_score
GROUP BY school_id,grade,class 
GROUPING SETS (school_id,grade,class,(school_id,grade,class),(school_id,grade),(grade,class),(school_id,class)) 
ORDER BY GROUPING__ID;

结果:

school_idgradeclassuvGROUPING__ID
1001初一310
1002初三110
1001初一120
1002初一310
1001初二220
1001初三210
1002初一120
1002初三210
1002初二220
1001初三110
1001初一NULL31
1002初二NULL21
1001初二NULL21
1002初一NULL31
1001初三NULL21
1002初三NULL21
1002NULL232
1002NULL312
1001NULL132
1001NULL312
1001NULL232
1002NULL132
1001NULLNULL73
1002NULLNULL73
NULL初三114
NULL初三214
NULL初一124
NULL初二224
NULL初一314
NULL初三NULL25
NULL初一NULL35
NULL初二NULL25
NULLNULL316
NULLNULL236
NULLNULL136

经过结果分析我们发现,使用 GROUPING SETS可以再同一层SQL查询进行多维度统计并将结果合并(union all)在一起。其效果等价于如下:

SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,0 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,1 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade
UNION ALL
SELECT school_id,NULL AS class,class,COUNT(DISTINCT student_id) AS uv,2 AS GROUPING__ID FROM test.student_score GROUP BY school_id,class
UNION ALL
SELECT NULL AS school_id,grade,class,COUNT(DISTINCT student_id) AS uv,3 AS GROUPING__ID FROM test.student_score GROUP BY grade,class
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,4 AS GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,5 AS GROUPING__ID FROM test.student_score GROUP BY grade
UNION ALL
SELECT NULL AS school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,6 AS GROUPING__ID FROM test.student_score GROUP BY class
CUBE

根据GROUP BY的维度的所有组合进行聚合

SELECT 
    school_id,
    grade,
    class,
    COUNT(DISTINCT student_id) AS uv,
    GROUPING__ID 
FROM test.student_score 
GROUP BY school_id,grade,class 
WITH CUBE 
ORDER BY GROUPING__ID;

结果:

school_idgradeclassuvGROUPING__ID
1001初三110
1002初一310
1001初三210
1001初一120
1002初一120
1001初一310
1002初二220
1001初二220
1002初三110
1002初三210
1002初二NULL21
1002初一NULL31
1001初二NULL21
1001初三NULL21
1002初三NULL21
1001初一NULL31
1001NULL312
1002NULL312
1002NULL132
1001NULL132
1002NULL232
1001NULL232
1002NULLNULL74
1001NULLNULL74
NULL初三213
NULL初一313
NULL初一123
NULL初三113
NULL初二223
NULL初二NULL25
NULL初一NULL35
NULL初三NULL25
NULLNULL316
NULLNULL136
NULLNULL236
NULLNULLNULL77

经过结果分析我们发现,使用 CUBE可以再同一层SQL,GROUP BY的维度的所有组合进行多维度统计并将结果合并(union all)在一起。其效果等价于如下:

SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,0 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,1 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade
UNION ALL
SELECT school_id,NULL AS class,class,COUNT(DISTINCT student_id) AS uv,2 AS GROUPING__ID FROM test.student_score GROUP BY school_id,class
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,4 AS GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,grade,class,COUNT(DISTINCT student_id) AS uv,3 AS GROUPING__ID FROM test.student_score GROUP BY grade,class
UNION ALL
SELECT NULL AS school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,5 AS GROUPING__ID FROM test.student_score GROUP BY grade
UNION ALL
SELECT NULL AS school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,6 AS GROUPING__ID FROM test.student_score GROUP BY class
UNION ALL
SELECT NULL AS school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,7 AS GROUPING__ID FROM test.student_score
ROLLUP

CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
group by rollup(school_id,grade,class), 可以理解为从右到左以一次少一列的方式依次进行group by。

学校、年级、班级的UV->学校、年级的UV->学校的UV->总的UV:

SELECT 
    school_id,
    grade,
    class,
    COUNT(DISTINCT student_id) AS uv,
    GROUPING__ID 
FROM test.student_score 
GROUP BY school_id,grade,class 
WITH ROLLUP 
ORDER BY GROUPING__ID;

结果:

school_idgradeclassuvGROUPING__ID
1001初一310
1002初三110
1001初一120
1002初一310
1001初二220
1001初三210
1002初一120
1002初三210
1002初二220
1001初三110
1001初一NULL31
1002初二NULL21
1001初二NULL21
1002初一NULL31
1001初三NULL21
1002初三NULL21
1001NULLNULL73
1002NULLNULL73
NULLNULLNULL77

经过对结果分析: group by rollup(school_id,grade,class) 则以group by(school_id,grade,class) -> group by(school_id,grade) -> group by(school_id) -> group by null(最终汇总)的顺序进行分组相当于:

SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class 
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score GROUP BY school_id,grade 
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score
  • 2
    点赞
  • 16
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

扫地增

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值