hive 开窗函数理解区别range和row以及默认省略语句

最新推荐文章于 2024-07-29 20:37:43 发布

新奇酷

最新推荐文章于 2024-07-29 20:37:43 发布

阅读量2.6k

点赞数 2

本文链接：https://blog.csdn.net/xinqiku/article/details/96634579

版权

数据准备：name，orderdate，cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-01,29
jack,2017-01-01,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

需求

查询在2017年4月份购买过的顾客及总人数
查询顾客的购买明细及月购买总额
上述的场景,要将cost按照日期进行累加
查询顾客上次的购买时间
查询前20%时间的订单信息

创建hive表并导入数据

create table business(

name string,

orderdate string,

cost int

) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/datas/business.txt" into table business;

按需求查询数据

查询在2017年4月份购买过的顾客及总人数

select name,count(*) over ()

from business

where substring(orderdate,1,7) = '2017-04'

group by name;

查询每个月购买总额及顾客的购买明细

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;

上述的场景,要将cost按照日期进行累加

select name,orderdate,cost,
sum(cost) over() as s1,
sum(cost) over(partition by name) as s2,
sum(cost) over(partition by name order by orderdate ) as s3,
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as s4
from business;

+-------+-------------+-------+------+------+------+------+
| name | orderdate | cost | s1 | s2 | s3 | s4 |
+-------+-------------+-------+------+------+------+------+
| jack | 2017-01-01 | 10 | 661 | 176 | 56 | 10 |
| jack | 2017-01-01 | 46 | 661 | 176 | 56 | 56 |
| jack | 2017-01-08 | 55 | 661 | 176 | 111 | 111 |
| jack | 2017-02-03 | 23 | 661 | 176 | 134 | 134 |
| jack | 2017-04-06 | 42 | 661 | 176 | 176 | 176 |
| mart | 2017-04-08 | 62 | 661 | 299 | 62 | 62 |
| mart | 2017-04-09 | 68 | 661 | 299 | 130 | 130 |
| mart | 2017-04-11 | 75 | 661 | 299 | 205 | 205 |
| mart | 2017-04-13 | 94 | 661 | 299 | 299 | 299 |
| neil | 2017-05-10 | 12 | 661 | 92 | 12 | 12 |
| neil | 2017-06-12 | 80 | 661 | 92 | 92 | 92 |
| tony | 2017-01-01 | 29 | 661 | 94 | 29 | 29 |
| tony | 2017-01-02 | 15 | 661 | 94 | 44 | 44 |
| tony | 2017-01-07 | 50 | 661 | 94 | 94 | 94 |
+-------+-------------+-------+------+------+------+------+

select name,orderdate,cost,
sum(cost) over(range between unbounded preceding and current row) as s1,
sum(cost) over(partition by name range between unbounded preceding and current row) as s2,
sum(cost) over(partition by name order by orderdate range between unbounded preceding and current row) as s3,
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as s4,

sum(cost) over(partition by name rows between UNBOUNDED PRECEDING and current row ) as s5
from business;

两者运行结果一致, 说明省略的即range between unbounded preceding and current row

官方给出了说省略的是row between unbounded preceding and unbounded following

如果有order by 则是省略的range between unbounded preceding and unbounded following

个人解释:

s1 range没有partition 所以current row 没有界限和unbounded following 一致结果(此处不再显示演示结果)

s2 unbounded preceding 即同一个partition中的首行不受order by 影响, current row 即当前行 ,当前行即name=jack , 因为此处为range开窗范围

range 代表name=jack的所有行, 所以结果为5个jack行之和, 关于range和row的区别请参照 https://blog.csdn.net/xinqiku/article/details/96362796

s3 unbounded preceding 即同一个partition中的首行不受order by 影响, current row 受order by orderdate影响 , 当前行即 name=jack, orderdate =2017-01-01 联合, 又因为是range开窗所以结果为2行name=jack, orderdate =2017-01-01之和 (如果current row 改成unbounded following , unbounded following 则不受order by 影响而直接指向partition的末行结果为jack五行之和 ,此处不再显示演示结果)

s4 unbounded preceding 即同一个partition中的首行 , current row 因为是row 开窗范围, 即当前行name=jack这一行的和, 如果当前行移到下一行,即为当前行之前的所有行之和

此为刚接触hive 总结规律所写, 不当之处请大佬留言赐教!

查询顾客上次的购买时间

select
name, orderdate, cost,
lag(orderdate, 1 ,'1970-01-01') over(partition by name order by orderdate)
from
business

查询前20%时间的订单信息

SELECT *
FROM (
   SELECT *
       ,ntile(5) OVER (
           ORDER BY orderdate
           ) n
   FROM orders
   ) t
WHERE t.n = 1;

解释:

SELECT *
       ,ntile(5) OVER (
           ORDER BY orderdate
           ) n
   FROM orders

+----------------+---------------------+----------------+----+
| business.name | business.orderdate | business.cost | n |
+----------------+---------------------+----------------+----+
| jack | 2017-01-01 | 10 | 1 |
| tony | 2017-01-01 | 29 | 1 |
| jack | 2017-01-01 | 46 | 1 |
| tony | 2017-01-02 | 15 | 2 |
| tony | 2017-01-07 | 50 | 2 |
| jack | 2017-01-08 | 55 | 2 |
| jack | 2017-02-03 | 23 | 3 |
| jack | 2017-04-06 | 42 | 3 |
| mart | 2017-04-08 | 62 | 3 |
| mart | 2017-04-09 | 68 | 4 |
| mart | 2017-04-11 | 75 | 4 |
| mart | 2017-04-13 | 94 | 4 |
| neil | 2017-05-10 | 12 | 5 |
| neil | 2017-06-12 | 80 | 5 |
+----------------+---------------------+----------------+----+