Hive分析窗口函数系列文章

分析窗口函数应用场景:

(1)用于分区排序

(2)动态Group By

(3)Top N

(4)累计计算

(5)层次查询

 

Hive分析窗口函数(一) SUM,AVG,MIN,MAX

Hive中提供了越来越多的分析函数,用于完成负责的统计分析。抽时间将所有的分析窗口函数理一遍,将陆续发布。

今天先看几个基础的,SUM、AVG、MIN、MAX。

用于实现分组内所有和连续累积的统计。

数据准备:

 
  1. CREATE EXTERNAL TABLE lxw1234 (

  2. cookieid string,

  3. createtime string, --day

  4. pv INT

  5. ) ROW FORMAT DELIMITED

  6. FIELDS TERMINATED BY ','

  7. stored as textfile location '/tmp/lxw11/';

  8.  
  9. DESC lxw1234;

  10. cookieid STRING

  11. createtime STRING

  12. pv INT

  13.  
  14. hive> select * from lxw1234;

  15. OK

  16. cookie1 2015-04-10 1

  17. cookie1 2015-04-11 5

  18. cookie1 2015-04-12 7

  19. cookie1 2015-04-13 3

  20. cookie1 2015-04-14 2

  21. cookie1 2015-04-15 4

  22. cookie1 2015-04-16 4

SUM — 注意,结果和ORDER BY相关,默认为升序

 
  1. SELECT cookieid,

  2. createtime,

  3. pv,

  4. SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

  5. SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

  6. SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

  7. SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

  8. SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

  9. SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

  10. FROM lxw1234;

  11.  
  12. cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  13. -----------------------------------------------------------------------------

  14. cookie1 2015-04-10 1 1 1 26 1 6 26

  15. cookie1 2015-04-11 5 6 6 26 6 13 25

  16. cookie1 2015-04-12 7 13 13 26 13 16 20

  17. cookie1 2015-04-13 3 16 16 26 16 18 13

  18. cookie1 2015-04-14 2 18 18 26 17 21 10

  19. cookie1 2015-04-15 4 22 22 26 16 20 8

  20. cookie1 2015-04-16 4 26 26 26 13 13 4

pv1: 分组内从起点到当前行的pv累积,如,11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号
pv2: 同pv1
pv3: 分组内(cookie1)所有的pv累加
pv4: 分组内当前行+往前3行,如,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10

 

如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

–其他AVG,MIN,MAX,和SUM用法一样。

 
  1. --AVG

  2. SELECT cookieid,

  3. createtime,

  4. pv,

  5. AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

  6. AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

  7. AVG(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

  8. AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

  9. AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

  10. AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

  11. FROM lxw1234;

  12. cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  13. -----------------------------------------------------------------------------

  14. cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144

  15. cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667

  16. cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0

  17. cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25

  18. cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335

  19. cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0

  20. cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0

 
  1. --MIN

  2. SELECT cookieid,

  3. createtime,

  4. pv,

  5. MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

  6. MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

  7. MIN(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

  8. MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

  9. MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

  10. MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

  11. FROM lxw1234;

  12.  
  13. cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  14. -----------------------------------------------------------------------------

  15. cookie1 2015-04-10 1 1 1 1 1 1 1

  16. cookie1 2015-04-11 5 1 1 1 1 1 2

  17. cookie1 2015-04-12 7 1 1 1 1 1 2

  18. cookie1 2015-04-13 3 1 1 1 1 1 2

  19. cookie1 2015-04-14 2 1 1 1 2 2 2

  20. cookie1 2015-04-15 4 1 1 1 2 2 4

  21. cookie1 2015-04-16 4 1 1 1 2 2 4

 
  1. --MAX

  2. SELECT cookieid,

  3. createtime,

  4. pv,

  5. MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行

  6. MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1

  7. MAX(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行

  8. MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行

  9. MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行

  10. MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行

  11. FROM lxw1234;

  12.  
  13. cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6

  14. -----------------------------------------------------------------------------

  15. cookie1 2015-04-10 1 1 1 7 1 5 7

  16. cookie1 2015-04-11 5 5 5 7 5 7 7

  17. cookie1 2015-04-12 7 7 7 7 7 7 7

  18. cookie1 2015-04-13 3 7 7 7 7 7 4

  19. cookie1 2015-04-14 2 7 7 7 7 7 4

  20. cookie1 2015-04-15 4 7 7 7 7 7 4

  21. cookie1 2015-04-16 4 7 7 7 4 4 4

 

Hive分析窗口函数(二) NTILE,ROW_NUMBER,RANK,DENSE_RANK

本文中介绍前几个序列函数,NTILE,ROW_NUMBER,RANK,DENSE_RANK,下面会一一解释各自的用途。

注意: 序列函数不支持WINDOW子句。(什么是WINDOW子句,点此查看前面的文章

数据准备:

 
  1. CREATE EXTERNAL TABLE lxw1234 (

  2. cookieid string,

  3. createtime string, --day

  4. pv INT

  5. ) ROW FORMAT DELIMITED

  6. FIELDS TERMINATED BY ','

  7. stored as textfile location '/tmp/lxw11/';

  8.  
  9. DESC lxw1234;

  10. cookieid STRING

  11. createtime STRING

  12. pv INT

  13.  
  14. hive> select * from lxw1234;

  15. OK

  16. cookie1 2015-04-10 1

  17. cookie1 2015-04-11 5

  18. cookie1 2015-04-12 7

  19. cookie1 2015-04-13 3

  20. cookie1 2015-04-14 2

  21. cookie1 2015-04-15 4

  22. cookie1 2015-04-16 4

  23. cookie2 2015-04-10 2

  24. cookie2 2015-04-11 3

  25. cookie2 2015-04-12 5

  26. cookie2 2015-04-13 6

  27. cookie2 2015-04-14 3

  28. cookie2 2015-04-15 9

  29. cookie2 2015-04-16 7

 

NTILE

NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
如果切片不均匀,默认增加第一个切片的分布

 
  1. SELECT

  2. cookieid,

  3. createtime,

  4. pv,

  5. NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, --分组内将数据分成2片

  6. NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, --分组内将数据分成3片

  7. NTILE(4) OVER(ORDER BY createtime) AS rn3 --将所有数据分成4片

  8. FROM lxw1234

  9. ORDER BY cookieid,createtime;

  10.  
  11. cookieid day pv rn1 rn2 rn3

  12. -------------------------------------------------

  13. cookie1 2015-04-10 1 1 1 1

  14. cookie1 2015-04-11 5 1 1 1

  15. cookie1 2015-04-12 7 1 1 2

  16. cookie1 2015-04-13 3 1 2 2

  17. cookie1 2015-04-14 2 2 2 3

  18. cookie1 2015-04-15 4 2 3 3

  19. cookie1 2015-04-16 4 2 3 4

  20. cookie2 2015-04-10 2 1 1 1

  21. cookie2 2015-04-11 3 1 1 1

  22. cookie2 2015-04-12 5 1 1 2

  23. cookie2 2015-04-13 6 1 2 2

  24. cookie2 2015-04-14 3 2 2 3

  25. cookie2 2015-04-15 9 2 3 4

  26. cookie2 2015-04-16 7 2 3 4

 

–比如,统计一个cookie,pv数最多的前1/3的天

 
  1. SELECT

  2. cookieid,

  3. createtime,

  4. pv,

  5. NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn

  6. FROM lxw1234;

  7.  
  8. --rn = 1 的记录,就是我们想要的结果

  9.  
  10. cookieid day pv rn

  11. ----------------------------------

  12. cookie1 2015-04-12 7 1

  13. cookie1 2015-04-11 5 1

  14. cookie1 2015-04-15 4 1

  15. cookie1 2015-04-16 4 2

  16. cookie1 2015-04-13 3 2

  17. cookie1 2015-04-14 2 3

  18. cookie1 2015-04-10 1 3

  19. cookie2 2015-04-15 9 1

  20. cookie2 2015-04-16 7 1

  21. cookie2 2015-04-13 6 1

  22. cookie2 2015-04-12 5 2

  23. cookie2 2015-04-14 3 2

  24. cookie2 2015-04-11 3 3

  25. cookie2 2015-04-10 2 3

 

ROW_NUMBER

ROW_NUMBER() –从1开始,按照顺序,生成分组内记录的序列
–比如,按照pv降序排列,生成分组内每天的pv名次
ROW_NUMBER() 的应用场景非常多,再比如,获取分组内排序第一的记录;获取一个session中的第一条refer等。

 
  1. SELECT

  2. cookieid,

  3. createtime,

  4. pv,

  5. ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn

  6. FROM lxw1234;

  7.  
  8. cookieid day pv rn

  9. -------------------------------------------

  10. cookie1 2015-04-12 7 1

  11. cookie1 2015-04-11 5 2

  12. cookie1 2015-04-15 4 3

  13. cookie1 2015-04-16 4 4

  14. cookie1 2015-04-13 3 5

  15. cookie1 2015-04-14 2 6

  16. cookie1 2015-04-10 1 7

  17. cookie2 2015-04-15 9 1

  18. cookie2 2015-04-16 7 2

  19. cookie2 2015-04-13 6 3

  20. cookie2 2015-04-12 5 4

  21. cookie2 2015-04-14 3 5

  22. cookie2 2015-04-11 3 6

  23. cookie2 2015-04-10 2 7

 

RANK 和 DENSE_RANK

—RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
—DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位

 
  1. SELECT

  2. cookieid,

  3. createtime,

  4. pv,

  5. RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,

  6. DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,

  7. ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3

  8. FROM lxw1234

  9. WHERE cookieid = 'cookie1';

  10.  
  11. cookieid day pv rn1 rn2 rn3

  12. --------------------------------------------------

  13. cookie1 2015-04-12 7 1 1 1

  14. cookie1 2015-04-11 5 2 2 2

  15. cookie1 2015-04-15 4 3 3 3

  16. cookie1 2015-04-16 4 3 3 4

  17. cookie1 2015-04-13 3 5 4 5

  18. cookie1 2015-04-14 2 6 5 6

  19. cookie1 2015-04-10 1 7 6 7

  20.  
  21. rn1: 15号和16号并列第3, 13号排第5

  22. rn2: 15号和16号并列第3, 13号排第4

  23. rn3: 如果相等,则按记录值排序,生成唯一的次序,如果所有记录值都相等,或许会随机排吧。

 

Hive分析窗口函数(三) CUME_DIST,PERCENT_RANK

这两个序列分析函数不是很常用,这里也介绍一下。

注意: 序列函数不支持WINDOW子句。(什么是WINDOW子句,点此查看前面的文章

数据准备:

 
  1. CREATE EXTERNAL TABLE lxw1234 (

  2. dept STRING,

  3. userid string,

  4. sal INT

  5. ) ROW FORMAT DELIMITED

  6. FIELDS TERMINATED BY ','

  7. stored as textfile location '/tmp/lxw11/';

  8.  
  9.  
  10. hive> select * from lxw1234;

  11. OK

  12. d1 user1 1000

  13. d1 user2 2000

  14. d1 user3 3000

  15. d2 user4 4000

  16. d2 user5 5000

 

CUME_DIST

–CUME_DIST 小于等于当前值的行数/分组内总行数
–比如,统计小于等于当前薪水的人数,所占总人数的比例

 
  1. SELECT

  2. dept,

  3. userid,

  4. sal,

  5. CUME_DIST() OVER(ORDER BY sal) AS rn1,

  6. CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2

  7. FROM lxw1234;

  8.  
  9. dept userid sal rn1 rn2

  10. -------------------------------------------

  11. d1 user1 1000 0.2 0.3333333333333333

  12. d1 user2 2000 0.4 0.6666666666666666

  13. d1 user3 3000 0.6 1.0

  14. d2 user4 4000 0.8 0.5

  15. d2 user5 5000 1.0 1.0

  16.  
  17. rn1: 没有partition,所有数据均为1组,总行数为5,

  18. 第一行:小于等于1000的行数为1,因此,1/5=0.2

  19. 第三行:小于等于3000的行数为3,因此,3/5=0.6

  20. rn2: 按照部门分组,dpet=d1的行数为3,

  21. 第二行:小于等于2000的行数为2,因此,2/3=0.6666666666666666

 

PERCENT_RANK

–PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
应用场景不了解,可能在一些特殊算法的实现中可以用到吧。

 
  1. SELECT

  2. dept,

  3. userid,

  4. sal,

  5. PERCENT_RANK() OVER(ORDER BY sal) AS rn1, --分组内

  6. RANK() OVER(ORDER BY sal) AS rn11, --分组内RANK值

  7. SUM(1) OVER(PARTITION BY NULL) AS rn12, --分组内总行数

  8. PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2

  9. FROM lxw1234;

  10.  
  11. dept userid sal rn1 rn11 rn12 rn2

  12. ---------------------------------------------------

  13. d1 user1 1000 0.0 1 5 0.0

  14. d1 user2 2000 0.25 2 5 0.5

  15. d1 user3 3000 0.5 3 5 1.0

  16. d2 user4 4000 0.75 4 5 0.0

  17. d2 user5 5000 1.0 5 5 1.0

  18.  
  19. rn1: rn1 = (rn11-1) / (rn12-1)

  20. 第一行,(1-1)/(5-1)=0/4=0

  21. 第二行,(2-1)/(5-1)=1/4=0.25

  22. 第四行,(4-1)/(5-1)=3/4=0.75

  23. rn2: 按照dept分组,

  24. dept=d1的总行数为3

  25. 第一行,(1-1)/(3-1)=0

  26. 第三行,(3-1)/(3-1)=1

 

Hive分析窗口函数(四) LAG,LEAD,FIRST_VALUE,LAST_VALUE

继续学习这四个分析函数。

注意: 这几个函数不支持WINDOW子句。(什么是WINDOW子句,点此查看前面的文章

数据准备:

 
  1. CREATE EXTERNAL TABLE lxw1234 (

  2. cookieid string,

  3. createtime string, --页面访问时间

  4. url STRING --被访问页面

  5. ) ROW FORMAT DELIMITED

  6. FIELDS TERMINATED BY ','

  7. stored as textfile location '/tmp/lxw11/';

  8.  
  9.  
  10. hive> select * from lxw1234;

  11. OK

  12. cookie1 2015-04-10 10:00:02 url2

  13. cookie1 2015-04-10 10:00:00 url1

  14. cookie1 2015-04-10 10:03:04 1url3

  15. cookie1 2015-04-10 10:50:05 url6

  16. cookie1 2015-04-10 11:00:00 url7

  17. cookie1 2015-04-10 10:10:00 url4

  18. cookie1 2015-04-10 10:50:01 url5

  19. cookie2 2015-04-10 10:00:02 url22

  20. cookie2 2015-04-10 10:00:00 url11

  21. cookie2 2015-04-10 10:03:04 1url33

  22. cookie2 2015-04-10 10:50:05 url66

  23. cookie2 2015-04-10 11:00:00 url77

  24. cookie2 2015-04-10 10:10:00 url44

  25. cookie2 2015-04-10 10:50:01 url55

 

LAG

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

 
  1. SELECT cookieid,

  2. createtime,

  3. url,

  4. ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  5. LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,

  6. LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time

  7. FROM lxw1234;

  8.  
  9.  
  10. cookieid createtime url rn last_1_time last_2_time

  11. -------------------------------------------------------------------------------------------

  12. cookie1 2015-04-10 10:00:00 url1 1 1970-01-01 00:00:00 NULL

  13. cookie1 2015-04-10 10:00:02 url2 2 2015-04-10 10:00:00 NULL

  14. cookie1 2015-04-10 10:03:04 1url3 3 2015-04-10 10:00:02 2015-04-10 10:00:00

  15. cookie1 2015-04-10 10:10:00 url4 4 2015-04-10 10:03:04 2015-04-10 10:00:02

  16. cookie1 2015-04-10 10:50:01 url5 5 2015-04-10 10:10:00 2015-04-10 10:03:04

  17. cookie1 2015-04-10 10:50:05 url6 6 2015-04-10 10:50:01 2015-04-10 10:10:00

  18. cookie1 2015-04-10 11:00:00 url7 7 2015-04-10 10:50:05 2015-04-10 10:50:01

  19. cookie2 2015-04-10 10:00:00 url11 1 1970-01-01 00:00:00 NULL

  20. cookie2 2015-04-10 10:00:02 url22 2 2015-04-10 10:00:00 NULL

  21. cookie2 2015-04-10 10:03:04 1url33 3 2015-04-10 10:00:02 2015-04-10 10:00:00

  22. cookie2 2015-04-10 10:10:00 url44 4 2015-04-10 10:03:04 2015-04-10 10:00:02

  23. cookie2 2015-04-10 10:50:01 url55 5 2015-04-10 10:10:00 2015-04-10 10:03:04

  24. cookie2 2015-04-10 10:50:05 url66 6 2015-04-10 10:50:01 2015-04-10 10:10:00

  25. cookie2 2015-04-10 11:00:00 url77 7 2015-04-10 10:50:05 2015-04-10 10:50:01

  26.  
  27.  
  28. last_1_time: 指定了往上第1行的值,default为'1970-01-01 00:00:00'

  29. cookie1第一行,往上1行为NULL,因此取默认值 1970-01-01 00:00:00

  30. cookie1第三行,往上1行值为第二行值,2015-04-10 10:00:02

  31. cookie1第六行,往上1行值为第五行值,2015-04-10 10:50:01

  32. last_2_time: 指定了往上第2行的值,为指定默认值

  33. cookie1第一行,往上2行为NULL

  34. cookie1第二行,往上2行为NULL

  35. cookie1第四行,往上2行为第二行值,2015-04-10 10:00:02

  36. cookie1第七行,往上2行为第五行值,2015-04-10 10:50:01

 

LEAD

与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

 
  1. SELECT cookieid,

  2. createtime,

  3. url,

  4. ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  5. LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,

  6. LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time

  7. FROM lxw1234;

  8.  
  9.  
  10. cookieid createtime url rn next_1_time next_2_time

  11. -------------------------------------------------------------------------------------------

  12. cookie1 2015-04-10 10:00:00 url1 1 2015-04-10 10:00:02 2015-04-10 10:03:04

  13. cookie1 2015-04-10 10:00:02 url2 2 2015-04-10 10:03:04 2015-04-10 10:10:00

  14. cookie1 2015-04-10 10:03:04 1url3 3 2015-04-10 10:10:00 2015-04-10 10:50:01

  15. cookie1 2015-04-10 10:10:00 url4 4 2015-04-10 10:50:01 2015-04-10 10:50:05

  16. cookie1 2015-04-10 10:50:01 url5 5 2015-04-10 10:50:05 2015-04-10 11:00:00

  17. cookie1 2015-04-10 10:50:05 url6 6 2015-04-10 11:00:00 NULL

  18. cookie1 2015-04-10 11:00:00 url7 7 1970-01-01 00:00:00 NULL

  19. cookie2 2015-04-10 10:00:00 url11 1 2015-04-10 10:00:02 2015-04-10 10:03:04

  20. cookie2 2015-04-10 10:00:02 url22 2 2015-04-10 10:03:04 2015-04-10 10:10:00

  21. cookie2 2015-04-10 10:03:04 1url33 3 2015-04-10 10:10:00 2015-04-10 10:50:01

  22. cookie2 2015-04-10 10:10:00 url44 4 2015-04-10 10:50:01 2015-04-10 10:50:05

  23. cookie2 2015-04-10 10:50:01 url55 5 2015-04-10 10:50:05 2015-04-10 11:00:00

  24. cookie2 2015-04-10 10:50:05 url66 6 2015-04-10 11:00:00 NULL

  25. cookie2 2015-04-10 11:00:00 url77 7 1970-01-01 00:00:00 NULL

  26.  
  27. --逻辑与LAG一样,只不过LAG是往上,LEAD是往下。

 

FIRST_VALUE

取分组内排序后,截止到当前行,第一个值

 
  1. SELECT cookieid,

  2. createtime,

  3. url,

  4. ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  5. FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1

  6. FROM lxw1234;

  7.  
  8. cookieid createtime url rn first1

  9. ---------------------------------------------------------

  10. cookie1 2015-04-10 10:00:00 url1 1 url1

  11. cookie1 2015-04-10 10:00:02 url2 2 url1

  12. cookie1 2015-04-10 10:03:04 1url3 3 url1

  13. cookie1 2015-04-10 10:10:00 url4 4 url1

  14. cookie1 2015-04-10 10:50:01 url5 5 url1

  15. cookie1 2015-04-10 10:50:05 url6 6 url1

  16. cookie1 2015-04-10 11:00:00 url7 7 url1

  17. cookie2 2015-04-10 10:00:00 url11 1 url11

  18. cookie2 2015-04-10 10:00:02 url22 2 url11

  19. cookie2 2015-04-10 10:03:04 1url33 3 url11

  20. cookie2 2015-04-10 10:10:00 url44 4 url11

  21. cookie2 2015-04-10 10:50:01 url55 5 url11

  22. cookie2 2015-04-10 10:50:05 url66 6 url11

  23. cookie2 2015-04-10 11:00:00 url77 7 url11

 

LAST_VALUE

取分组内排序后,截止到当前行,最后一个值

 
  1. SELECT cookieid,

  2. createtime,

  3. url,

  4. ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  5. LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1

  6. FROM lxw1234;

  7.  
  8.  
  9. cookieid createtime url rn last1

  10. -----------------------------------------------------------------

  11. cookie1 2015-04-10 10:00:00 url1 1 url1

  12. cookie1 2015-04-10 10:00:02 url2 2 url2

  13. cookie1 2015-04-10 10:03:04 1url3 3 1url3

  14. cookie1 2015-04-10 10:10:00 url4 4 url4

  15. cookie1 2015-04-10 10:50:01 url5 5 url5

  16. cookie1 2015-04-10 10:50:05 url6 6 url6

  17. cookie1 2015-04-10 11:00:00 url7 7 url7

  18. cookie2 2015-04-10 10:00:00 url11 1 url11

  19. cookie2 2015-04-10 10:00:02 url22 2 url22

  20. cookie2 2015-04-10 10:03:04 1url33 3 1url33

  21. cookie2 2015-04-10 10:10:00 url44 4 url44

  22. cookie2 2015-04-10 10:50:01 url55 5 url55

  23. cookie2 2015-04-10 10:50:05 url66 6 url66

  24. cookie2 2015-04-10 11:00:00 url77 7 url77

如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果

 
  1. SELECT cookieid,

  2. createtime,

  3. url,

  4. FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2

  5. FROM lxw1234;

  6.  
  7. cookieid createtime url first2

  8. ----------------------------------------------

  9. cookie1 2015-04-10 10:00:02 url2 url2

  10. cookie1 2015-04-10 10:00:00 url1 url2

  11. cookie1 2015-04-10 10:03:04 1url3 url2

  12. cookie1 2015-04-10 10:50:05 url6 url2

  13. cookie1 2015-04-10 11:00:00 url7 url2

  14. cookie1 2015-04-10 10:10:00 url4 url2

  15. cookie1 2015-04-10 10:50:01 url5 url2

  16. cookie2 2015-04-10 10:00:02 url22 url22

  17. cookie2 2015-04-10 10:00:00 url11 url22

  18. cookie2 2015-04-10 10:03:04 1url33 url22

  19. cookie2 2015-04-10 10:50:05 url66 url22

  20. cookie2 2015-04-10 11:00:00 url77 url22

  21. cookie2 2015-04-10 10:10:00 url44 url22

  22. cookie2 2015-04-10 10:50:01 url55 url22

  23.  
  24. SELECT cookieid,

  25. createtime,

  26. url,

  27. LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2

  28. FROM lxw1234;

  29.  
  30. cookieid createtime url last2

  31. ----------------------------------------------

  32. cookie1 2015-04-10 10:00:02 url2 url5

  33. cookie1 2015-04-10 10:00:00 url1 url5

  34. cookie1 2015-04-10 10:03:04 1url3 url5

  35. cookie1 2015-04-10 10:50:05 url6 url5

  36. cookie1 2015-04-10 11:00:00 url7 url5

  37. cookie1 2015-04-10 10:10:00 url4 url5

  38. cookie1 2015-04-10 10:50:01 url5 url5

  39. cookie2 2015-04-10 10:00:02 url22 url55

  40. cookie2 2015-04-10 10:00:00 url11 url55

  41. cookie2 2015-04-10 10:03:04 1url33 url55

  42. cookie2 2015-04-10 10:50:05 url66 url55

  43. cookie2 2015-04-10 11:00:00 url77 url55

  44. cookie2 2015-04-10 10:10:00 url44 url55

  45. cookie2 2015-04-10 10:50:01 url55 url55

如果想要取分组内排序后最后一个值,则需要变通一下:

 
  1. SELECT cookieid,

  2. createtime,

  3. url,

  4. ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,

  5. LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,

  6. FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2

  7. FROM lxw1234

  8. ORDER BY cookieid,createtime;

  9.  
  10. cookieid createtime url rn last1 last2

  11. -------------------------------------------------------------

  12. cookie1 2015-04-10 10:00:00 url1 1 url1 url7

  13. cookie1 2015-04-10 10:00:02 url2 2 url2 url7

  14. cookie1 2015-04-10 10:03:04 1url3 3 1url3 url7

  15. cookie1 2015-04-10 10:10:00 url4 4 url4 url7

  16. cookie1 2015-04-10 10:50:01 url5 5 url5 url7

  17. cookie1 2015-04-10 10:50:05 url6 6 url6 url7

  18. cookie1 2015-04-10 11:00:00 url7 7 url7 url7

  19. cookie2 2015-04-10 10:00:00 url11 1 url11 url77

  20. cookie2 2015-04-10 10:00:02 url22 2 url22 url77

  21. cookie2 2015-04-10 10:03:04 1url33 3 1url33 url77

  22. cookie2 2015-04-10 10:10:00 url44 4 url44 url77

  23. cookie2 2015-04-10 10:50:01 url55 5 url55 url77

  24. cookie2 2015-04-10 10:50:05 url66 6 url66 url77

  25. cookie2 2015-04-10 11:00:00 url77 7 url77 url77

<span style="font-weight: bold; color: rgb(255, 0, 0); font-family: Arial, Helvetica, sans-serif; background-color: rgb(255, 255, 255);">提示:在使用分析函数的过程中,要特别注意ORDER BY子句,用的不恰当,统计出的结果就不是你所期望的。</span>

Hive分析窗口函数(五) GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。

数据准备:

 
  1. CREATE EXTERNAL TABLE lxw1234 (

  2. month STRING,

  3. day STRING,

  4. cookieid STRING

  5. ) ROW FORMAT DELIMITED

  6. FIELDS TERMINATED BY ','

  7. stored as textfile location '/tmp/lxw11/';

  8.  
  9.  
  10. hive> select * from lxw1234;

  11. OK

  12. 2015-03 2015-03-10 cookie1

  13. 2015-03 2015-03-10 cookie5

  14. 2015-03 2015-03-12 cookie7

  15. 2015-04 2015-04-12 cookie3

  16. 2015-04 2015-04-13 cookie2

  17. 2015-04 2015-04-13 cookie4

  18. 2015-04 2015-04-16 cookie4

  19. 2015-03 2015-03-10 cookie2

  20. 2015-03 2015-03-10 cookie3

  21. 2015-04 2015-04-12 cookie5

  22. 2015-04 2015-04-13 cookie6

  23. 2015-04 2015-04-15 cookie3

  24. 2015-04 2015-04-15 cookie2

  25. 2015-04 2015-04-16 cookie1

 

GROUPING SETS

在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL

 
  1. SELECT

  2. month,

  3. day,

  4. COUNT(DISTINCT cookieid) AS uv,

  5. GROUPING__ID

  6. FROM lxw1234

  7. GROUP BY month,day

  8. GROUPING SETS (month,day)

  9. ORDER BY GROUPING__ID;

  10.  
  11. month day uv GROUPING__ID

  12. ------------------------------------------------

  13. 2015-03 NULL 5 1

  14. 2015-04 NULL 6 1

  15. NULL 2015-03-10 4 2

  16. NULL 2015-03-12 1 2

  17. NULL 2015-04-12 2 2

  18. NULL 2015-04-13 3 2

  19. NULL 2015-04-15 2 2

  20. NULL 2015-04-16 2 2

  21.  
  22.  
  23. 等价于

  24. SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

  25. UNION ALL

  26. SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day

 

再如:

 
  1. SELECT

  2. month,

  3. day,

  4. COUNT(DISTINCT cookieid) AS uv,

  5. GROUPING__ID

  6. FROM lxw1234

  7. GROUP BY month,day

  8. GROUPING SETS (month,day,(month,day))

  9. ORDER BY GROUPING__ID;

  10.  
  11. month day uv GROUPING__ID

  12. ------------------------------------------------

  13. 2015-03 NULL 5 1

  14. 2015-04 NULL 6 1

  15. NULL 2015-03-10 4 2

  16. NULL 2015-03-12 1 2

  17. NULL 2015-04-12 2 2

  18. NULL 2015-04-13 3 2

  19. NULL 2015-04-15 2 2

  20. NULL 2015-04-16 2 2

  21. 2015-03 2015-03-10 4 3

  22. 2015-03 2015-03-12 1 3

  23. 2015-04 2015-04-12 2 3

  24. 2015-04 2015-04-13 3 3

  25. 2015-04 2015-04-15 2 3

  26. 2015-04 2015-04-16 2 3

  27.  
  28.  
  29. 等价于

  30. SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

  31. UNION ALL

  32. SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day

  33. UNION ALL

  34. SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day

 

其中的 GROUPING__ID,表示结果属于哪一个分组集合。

CUBE

根据GROUP BY的维度的所有组合进行聚合。

 
  1. SELECT

  2. month,

  3. day,

  4. COUNT(DISTINCT cookieid) AS uv,

  5. GROUPING__ID

  6. FROM lxw1234

  7. GROUP BY month,day

  8. WITH CUBE

  9. ORDER BY GROUPING__ID;

  10.  
  11.  
  12. month day uv GROUPING__ID

  13. --------------------------------------------

  14. NULL NULL 7 0

  15. 2015-03 NULL 5 1

  16. 2015-04 NULL 6 1

  17. NULL 2015-04-12 2 2

  18. NULL 2015-04-13 3 2

  19. NULL 2015-04-15 2 2

  20. NULL 2015-04-16 2 2

  21. NULL 2015-03-10 4 2

  22. NULL 2015-03-12 1 2

  23. 2015-03 2015-03-10 4 3

  24. 2015-03 2015-03-12 1 3

  25. 2015-04 2015-04-16 2 3

  26. 2015-04 2015-04-12 2 3

  27. 2015-04 2015-04-13 3 3

  28. 2015-04 2015-04-15 2 3

  29.  
  30.  
  31.  
  32. 等价于

  33. SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM lxw1234

  34. UNION ALL

  35. SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month

  36. UNION ALL

  37. SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day

  38. UNION ALL

  39. SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day

 

ROLLUP

是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。

 
  1. 比如,以month维度进行层级聚合:

  2. SELECT

  3. month,

  4. day,

  5. COUNT(DISTINCT cookieid) AS uv,

  6. GROUPING__ID

  7. FROM lxw1234

  8. GROUP BY month,day

  9. WITH ROLLUP

  10. ORDER BY GROUPING__ID;

  11.  
  12. month day uv GROUPING__ID

  13. ---------------------------------------------------

  14. NULL NULL 7 0

  15. 2015-03 NULL 5 1

  16. 2015-04 NULL 6 1

  17. 2015-03 2015-03-10 4 3

  18. 2015-03 2015-03-12 1 3

  19. 2015-04 2015-04-12 2 3

  20. 2015-04 2015-04-13 3 3

  21. 2015-04 2015-04-15 2 3

  22. 2015-04 2015-04-16 2 3

  23.  
  24. 可以实现这样的上钻过程:

  25. 月天的UV->月的UV->总UV

 
  1. --把month和day调换顺序,则以day维度进行层级聚合:

  2.  
  3. SELECT

  4. day,

  5. month,

  6. COUNT(DISTINCT cookieid) AS uv,

  7. GROUPING__ID

  8. FROM lxw1234

  9. GROUP BY day,month

  10. WITH ROLLUP

  11. ORDER BY GROUPING__ID;

  12.  
  13.  
  14. day month uv GROUPING__ID

  15. -------------------------------------------------------

  16. NULL NULL 7 0

  17. 2015-04-13 NULL 3 1

  18. 2015-03-12 NULL 1 1

  19. 2015-04-15 NULL 2 1

  20. 2015-03-10 NULL 4 1

  21. 2015-04-16 NULL 2 1

  22. 2015-04-12 NULL 2 1

  23. 2015-04-12 2015-04 2 3

  24. 2015-03-10 2015-03 4 3

  25. 2015-03-12 2015-03 1 3

  26. 2015-04-13 2015-04 3 3

  27. 2015-04-15 2015-04 2 3

  28. 2015-04-16 2015-04 2 3

  29.  
  30. 可以实现这样的上钻过程:

  31. 天月的UV->天的UV->总UV

  32. (这里,根据天和月进行聚合,和根据天聚合结果一样,因为有父子关系,如果是其他维度组合的话,就会不一样)

 

这种函数,需要结合实际场景和数据去使用和研究,只看说明的话,很难理解。

官网的介绍: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值