目录
窗口函数的概述与总结:
1.什么时候用开窗函数?开窗函数常结合聚合函数使用,一般来讲聚合后的行数要少于聚合前的行数,但是有时我们既想显示聚集前的数据,又要显示聚集后的数据,这时我们便引入了窗口函数.如下:
+-------+-------------+-------+---------------+--+
| name | orderdate | cost | sum_window_0 |
+-------+-------------+-------+---------------+--+
| jack | 2017-01-01 | 10 | 205 |
| jack | 2017-01-08 | 55 | 205 |
| tony | 2017-01-07 | 50 | 205 |
| jack | 2017-01-05 | 46 | 205 |
| tony | 2017-01-04 | 29 | 205 |
| tony | 2017-01-02 | 15 | 205 |
| jack | 2017-02-03 | 23 | 23 |
| mart | 2017-04-13 | 94 | 341 |
| jack | 2017-04-06 | 42 | 341 |
| mart | 2017-04-11 | 75 | 341 |
| mart | 2017-04-09 | 68 | 341 |
| mart | 2017-04-08 | 62 | 341 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 80 |
+-------+-------------+-------+---------------+--
2.窗口函数的语法:
UDAF() over (PARTITION By col1,col2 order by col3 窗口子句(rows between .. and ..)) AS 列别名
注意:PARTITION By后可跟多个字段,order By只跟一个字段。
partition by子句:
一旦指定了partition by子句,聚合函数的作用范围就是分区之后的数据,这一点和group by 有些类似
order by子句:
order by子句对字段进行排序,如果order by子句后面没有跟rows between ** and ** 则表示起点到当前行
的聚合。order by后的rows子句近一步限制聚合函数的作用范围。
窗口子句
CURRENT ROW:当前行
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
窗口子句对聚合函数的聚合范围作进一步的动态划分,没有指定的时候,默认为起点到当前行的聚合
注意:
(1)order by必须跟在partition by后;
(2)Rows必须跟在Order by子;
(3)(partition by .. order by)可替换为(distribute by .. sort by ..)
可上手案例实操:
数据的准备:
[isea@hadoop108 datas]$ cat business.txt
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
需求:
(1)查询在2017年4月份购买过的顾客及总人数
(2)查询顾客的购买明细及月购买总额
(3)上述的场景, 将每个顾客的cost按照日期进行累加
(4)查询每个顾客上次的购买时间
(5)查询前20%时间的订单信息
创建hive表并导入数据:
create table business(name string,orderdate string,cost int)
row format delimited
fields terminated by ',';
load data local inpath '/opt/module/datas/business.txt' into table business;
0: jdbc:hive2://hadoop108:10000> select * from business;
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost |
+----------------+---------------------+----------------+--+
| jack | 2017-01-01 | 10 |
| tony | 2017-01-02 | 15 |
| jack | 2017-02-03 | 23 |
| tony | 2017-01-04 | 29 |
| jack | 2017-01-05 | 46 |
| jack | 2017-04-06 | 42 |
| tony | 2017-01-07 | 50 |
| jack | 2017-01-08 | 55 |
| mart | 2017-04-08 | 62 |
| mart | 2017-04-09 | 68 |
| neil | 2017-05-10 | 12 |
| mart | 2017-04-11 | 75 |
| neil | 2017-06-12 | 80 |
| mart | 2017-04-13 | 94 |
+----------------+---------------------+----------------+--+
(1)查询在2017年4月份购买过的顾客及总人数:
分析过程:四月份的数据如下:
| jack | 2017-04-06 | 42 |
| mart | 2017-04-08 | 62 |
| mart | 2017-04-09 | 68 |
| mart | 2017-04-11 | 75 |
| mart | 2017-04-13 | 94 |
最后的输出结果应该是长成这个样子:
jack 2
mart 2
我们一起来理解一下聚合函数,下面的这个聚合函数,将business表中的所有内容作为输入,输入到聚合函数中去
select count(*) from business;
+------+--+
| _c0 |
+------+--+
| 14 |
+------+--+
下面的这个聚合函数,作用在分组的数据中,这样一来,聚合函数的作用对象就是组,即如果是同一个组,会作为输入,输入到聚合函数中去。
select name,count(*)
from business
where orderdate like '2017-04%'
group by name;
+-------+------+--+
| name | _c1 |
+-------+------+--+
| jack | 1 |
| mart | 4 |
+-------+------+--+
上面的这个结果很明显并不是我们想要的,因为这里是将一个组的内容作为输入输入到聚合函数中去的,所以此时统计的是name的个数。我们
想要的结果是:将这两行的内容作为聚合函数的输入。我们可以使用窗口函数实现:
select name,count(*) over()
from business
where orderdate like '2017-04%'
group by name;
+-------+-----------------+--+
| name | count_window_0 |
+-------+-----------------+--+
| mart | 2 |
| jack | 2 |
+-------+-----------------+--+
(2)查询顾客的购买明细及月购买总额
这里面是月购买总额,所以应该以月份来进行分组,同样月份的进入聚合函数,这里既要显示原来的数据,
又要显示聚合之后的数据,所以使用窗口函数。
select *,sum(cost) over(partition by month(orderdate))
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | 205 |
| jack | 2017-01-08 | 55 | 205 |
| tony | 2017-01-07 | 50 | 205 |
| jack | 2017-01-05 | 46 | 205 |
| tony | 2017-01-04 | 29 | 205 |
| tony | 2017-01-02 | 15 | 205 |
| jack | 2017-02-03 | 23 | 23 |
| mart | 2017-04-13 | 94 | 341 |
| jack | 2017-04-06 | 42 | 341 |
| mart | 2017-04-11 | 75 | 341 |
| mart | 2017-04-09 | 68 | 341 |
| mart | 2017-04-08 | 62 | 341 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 80 |
+----------------+---------------------+----------------+---------------+--+
(3) 将每个顾客的cost按照日期进行累加
这里面需要将每一个顾客的cost累加,所以要对name进行分组,按照日期进行累加,应该对于日期进行排序,
这样才好一行一行的累加。
select *,sum(cost) over(partition by name order by orderdate)
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | 10 |
| jack | 2017-01-05 | 46 | 56 |
| jack | 2017-01-08 | 55 | 111 |
| jack | 2017-02-03 | 23 | 134 |
| jack | 2017-04-06 | 42 | 176 |
| mart | 2017-04-08 | 62 | 62 |
| mart | 2017-04-09 | 68 | 130 |
| mart | 2017-04-11 | 75 | 205 |
| mart | 2017-04-13 | 94 | 299 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 92 |
| tony | 2017-01-02 | 15 | 15 |
| tony | 2017-01-04 | 29 | 44 |
| tony | 2017-01-07 | 50 | 94 |
+----------------+---------------------+----------------+---------------+--+
除了上面的这种方式,我们还可以使用窗口子句来实现:
select *,sum(cost) over(partition by name order by orderdate
rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | 10 |
| jack | 2017-01-05 | 46 | 56 |
| jack | 2017-01-08 | 55 | 111 |
| jack | 2017-02-03 | 23 | 134 |
| jack | 2017-04-06 | 42 | 176 |
| mart | 2017-04-08 | 62 | 62 |
| mart | 2017-04-09 | 68 | 130 |
| mart | 2017-04-11 | 75 | 205 |
| mart | 2017-04-13 | 94 | 299 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 92 |
| tony | 2017-01-02 | 15 | 15 |
| tony | 2017-01-04 | 29 | 44 |
| tony | 2017-01-07 | 50 | 94 |
+----------------+---------------------+----------------+---------------+--+
(4)查询每个顾客上次的购买时间
对name 分区,对时间排序,例如下面的这个样子
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost | date
+----------------+---------------------+----------------+--+
| jack | 2017-01-01 | 10 | null
| jack | 2017-01-05 | 46 | 2017-01-01
| jack | 2017-02-03 | 23 | 2017-01-05
| jack | 2017-04-06 | 42 | 2017-02-03
| tony | 2017-01-02 | 15 | null
| tony | 2017-01-04 | 29 | 2017-01-02
select *,lag(orderdate,1,'-1') over(partition by name order by orderdate)
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | lag_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | -1 |
| jack | 2017-01-05 | 46 | 2017-01-01 |
| jack | 2017-01-08 | 55 | 2017-01-05 |
| jack | 2017-02-03 | 23 | 2017-01-08 |
| jack | 2017-04-06 | 42 | 2017-02-03 |
| mart | 2017-04-08 | 62 | -1 |
| mart | 2017-04-09 | 68 | 2017-04-08 |
| mart | 2017-04-11 | 75 | 2017-04-09 |
| mart | 2017-04-13 | 94 | 2017-04-11 |
| neil | 2017-05-10 | 12 | -1 |
| neil | 2017-06-12 | 80 | 2017-05-10 |
| tony | 2017-01-02 | 15 | -1 |
| tony | 2017-01-04 | 29 | 2017-01-02 |
| tony | 2017-01-07 | 50 | 2017-01-04 |
+----------------+---------------------+----------------+---------------+--+
(5)查询前20%时间的订单信息
20%需要对时间进行排序,取到其中的20%,输入全部,得到20%,使用Ntail聚合函数
t1:
select *,NTILE(5) over(order by orderdate) num
from business ;
+----------------+---------------------+----------------+------+--+
| business.name | business.orderdate | business.cost | num |
+----------------+---------------------+----------------+------+--+
| jack | 2017-01-01 | 10 | 1 |
| tony | 2017-01-02 | 15 | 1 |
| tony | 2017-01-04 | 29 | 1 |
| jack | 2017-01-05 | 46 | 2 |
| tony | 2017-01-07 | 50 | 2 |
| jack | 2017-01-08 | 55 | 2 |
| jack | 2017-02-03 | 23 | 3 |
| jack | 2017-04-06 | 42 | 3 |
| mart | 2017-04-08 | 62 | 3 |
| mart | 2017-04-09 | 68 | 4 |
| mart | 2017-04-11 | 75 | 4 |
| mart | 2017-04-13 | 94 | 4 |
| neil | 2017-05-10 | 12 | 5 |
| neil | 2017-06-12 | 80 | 5 |
+----------------+---------------------+----------------+------+--+
select * from
(select *,NTILE(5) over(order by orderdate) num
from business ) t1
where num = 1;
+----------+---------------+----------+---------+--+
| t1.name | t1.orderdate | t1.cost | t1.num |
+----------+---------------+----------+---------+--+
| jack | 2017-01-01 | 10 | 1 |
| tony | 2017-01-02 | 15 | 1 |
| tony | 2017-01-04 | 29 | 1 |
+----------+---------------+----------+---------+--+
总结:
①理解窗口函数的前提是深入理解聚合函数,理解聚合函数,就是要理解聚合函数的作用范围,首先没有任何修饰的聚合函数的作用范围是全体的数据;其次有group by的聚合函数,聚合函数对同组的数据聚合;有了partition by 的范围也是组内的数据;有了窗口子句之后,窗口子句会进一步限制聚合函数的作用范围。②既想显示聚集前的数据,又要显示聚集后的数据,使用窗口函数,因为select 后面的字段必须是聚合函数和group by 字段,如果想显示其他字段,group by做不到,就得使用窗口函数。