需要指定一个窗口的边界,语法是这样的:
ROWS betweenCURRENT ROW | UNBOUNDED PRECEDING | [num] PRECEDING AND UNBOUNDED FOLLOWING | [num] FOLLOWING| CURRENT ROW
或
RANGE between [num] PRECEDING AND [num]FOLLOWING
如下图:
ROWS是物理窗口,从行数上控制窗口的尺寸的;
RANGE是逻辑窗口,从列值上控制窗口的尺寸。这个比较难理解,但说白了就简单了,具体解释如下面栗子1.
结合order by子句使用,如果在order by子句后面没有指定窗口子句,则默认为:range between unbounded preceding and current row
------------------------------------------------------------
rt_data表存放了当天每半小时的店铺销售数据
字段stat_date格式yyMMddHHmm
------------------------------------------------------------
1.统计截止到当前时间段的店铺累计销售金额
- select shop_id, stat_date, ordamt, sum(ordamt) over(partition by shop_id order by stat_date)
- from rt_data
- where dt = '2015-01-11' and shop_id = 10026;
-
- 10026 201501120030 5170 5170
- 10026 201501120100 5669 10839
- 10026 201501120130 2396 13235
- 10026 201501120200 1498 14733
- 10026 201501120230 1997 16730
- 10026 201501120300 1188 17918
- 10026 201501120330 598 18516
- 10026 201501120400 479 18995
- 10026 201501120430 1587 20582
- 10026 201501120530 799 21381
这里order by 后面省去了窗口子句,还原起来就是order by stat_date range between unbounded preceding and current row. 这是什么意思?
就是先按stat_date排序,窗口就是比当前stat_date小的行到当前stat_date的行。
其实,这里写成order by stat_date rows between unbounded preceding and current row是一样的。
2.统计每个时间段的销售占比
- select shop_id, stat_date, ordamt, ordamt / sum(ordamt) over(partition by shop_id)
- from rt_data
- where dt = '2015-01-11' and shop_id = 10026
- order by stat_date;
-
- 10026 201501120900 5406 0.014227
- 10026 201501120930 13098 0.034471
- 10026 201501121000 15679 0.041263
- 10026 201501121030 12134 0.031934
- 10026 201501121100 26259 0.069107
- 10026 201501121130 16516 0.043466
- 10026 201501121200 20127 0.052969
partition by之后没有使用order by,窗口大小默认就是整个分组。
3.找出2点的销售金额及前半小时的销售金额和后1个小时的销售金额
- select * from (
- select
- shop_id,
- stat_date,
- ordamt,
- lag(stat_date,1) over(partition by shop_id order by stat_date),
- lag(ordamt,1) over(partition by shop_id order by stat_date),
- lead(stat_date,2) over(partition by shop_id order by stat_date),
- lead(ordamt,2) over(partition by shop_id order by stat_date)
- from rt_data
- where dt = '2015-01-11' and shop_id = 10026
- ) t
- where stat_date = 201501120200;
-
- 10026 201501120200 1498 201501120130 2396 201501120300 1188
lag(field, N)是取前N行的值,lead(field, N)是取后N行的值。
4.按照销售金额进行排名,金额最大的排最前(limit可以取topn的数)
- select shop_id, stat_date, ordamt, dense_rank() over(partition by shop_id order by ordamt desc)
- from rt_data
- where dt = '2015-01-11' and shop_id = 10026;
-
- 10026 201501121100 26259.0 1
- 10026 201501121200 20127.0 2
- 10026 201501121300 18138.0 3
- 10026 201501121130 16516.0 4
- 10026 201501121400 15956.0 5
- 10026 201501121000 15679.0 6
- 10026 201501122200 14290.0 7
- 10026 201501121430 14203.0 8
- 10026 201501121330 13290.0 9
- 10026 201501120930 13098.0 10
dense_rank和rank都是排名函数,区别在于dense_rank是连续排名,rank遇到排名并列时,下一列排名跳空。
percent_rank,介于0和1直接的小数形式表示,计算方法是(rank - 1) / (n-1),其中rank为行的序号,n为组的行数,这个暂时不知道有什么地方可以运用,欢迎指导。
5.按销售金额排序,取出前20%的时间段和相应金额
- select * from (
- select shop_id, stat_date, ordamt, ntile(5) over(partition by shop_id order by ordamt desc) sorted
- from rt_data
- where dt = '2015-01-11' and shop_id = 10026
- ) t
- where sorted = 1;
-
- 10026 201501121100 26259.0 1
- 10026 201501121200 20127.0 1
- 10026 201501121300 18138.0 1
- 10026 201501121130 16516.0 1
- 10026 201501121400 15956.0 1
- 10026 201501121000 15679.0 1
- 10026 201501122200 14290.0 1
- 10026 201501121430 14203.0 1
- 10026 201501121330 13290.0 1
NTILE这个很强大,以前要获取一定比例的数据是非常困难的,NTILE就是把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。
---------------------------------------------------------------------
over语句还可以独立出来,用window重写,但需要注意的是,如下sort by使用了多个字段,如果用range指定窗口的话会出错,需要用rows来指定窗口,因为range是对列的比较,不支持多列比较。
- select shop_id, stat_date, ordamt, sum(ordamt) over win as t
- from rt_data
- where dt = '2015-01-11' and shop_id = 10026
- window win as (distribute by shop_id sort by shop_id, ordamt desc rows between unbounded preceding and current row);