数据准备:name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-01,29
jack,2017-01-01,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
需求
- 查询在2017年4月份购买过的顾客及总人数
- 查询顾客的购买明细及月购买总额
- 上述的场景,要将cost按照日期进行累加
- 查询顾客上次的购买时间
- 查询前20%时间的订单信息
创建hive表并导入数据
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;
按需求查询数据
查询在2017年4月份购买过的顾客及总人数
select name,count(*) over ()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;
查询每个月购买总额及顾客的购买明细
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;
上述的场景,要将cost按照日期进行累加
select name,orderdate,cost,
sum(cost) over() as s1,
sum(cost) over(partition by name) as s2,
sum(cost) over(partition by name order by orderdate ) as s3,
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as s4
from business;
+-------+-------------+-------+------+------+------+------+
| name | orderdate | cost | s1 | s2 | s3 | s4 |
+-------+-------------+-------+------+------+------+------+
| jack | 2017-01-01 | 10 | 661 | 176 | 56 | 10 |
| jack | 2017-01-01 | 46 | 661 | 176 | 56 | 56 |
| jack | 2017-01-08 | 55 | 661 | 176 | 111 | 111 |
| jack | 2017-02-03 | 23 | 661 | 176 | 134 | 134 |
| jack | 2017-04-06 | 42 | 661 | 176 | 176 | 176 |
| mart | 2017-04-08 | 62 | 661 | 299 | 62 | 62 |
| mart | 2017-04-09 | 68 | 661 | 299 | 130 | 130 |
| mart | 2017-04-11 | 75 | 661 | 299 | 205 | 205 |
| mart | 2017-04-13 | 94 | 661 | 299 | 299 | 299 |
| neil | 2017-05-10 | 12 | 661 | 92 | 12 | 12 |
| neil | 2017-06-12 | 80 | 661 | 92 | 92 | 92 |
| tony | 2017-01-01 | 29 | 661 | 94 | 29 | 29 |
| tony | 2017-01-02 | 15 | 661 | 94 | 44 | 44 |
| tony | 2017-01-07 | 50 | 661 | 94 | 94 | 94 |
+-------+-------------+-------+------+------+------+------+
select name,orderdate,cost,
sum(cost) over(range between unbounded preceding and current row) as s1,
sum(cost) over(partition by name range between unbounded preceding and current row) as s2,
sum(cost) over(partition by name order by orderdate range between unbounded preceding and current row) as s3,
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as s4,
sum(cost) over(partition by name rows between UNBOUNDED PRECEDING and current row ) as s5
from business;
+-------+-------------+-------+------+------+------+------+
| name | orderdate | cost | s1 | s2 | s3 | s4 |
+-------+-------------+-------+------+------+------+------+
| jack | 2017-01-01 | 10 | 661 | 176 | 56 | 10 |
| jack | 2017-01-01 | 46 | 661 | 176 | 56 | 56 |
| jack | 2017-01-08 | 55 | 661 | 176 | 111 | 111 |
| jack | 2017-02-03 | 23 | 661 | 176 | 134 | 134 |
| jack | 2017-04-06 | 42 | 661 | 176 | 176 | 176 |
| mart | 2017-04-08 | 62 | 661 | 299 | 62 | 62 |
| mart | 2017-04-09 | 68 | 661 | 299 | 130 | 130 |
| mart | 2017-04-11 | 75 | 661 | 299 | 205 | 205 |
| mart | 2017-04-13 | 94 | 661 | 299 | 299 | 299 |
| neil | 2017-05-10 | 12 | 661 | 92 | 12 | 12 |
| neil | 2017-06-12 | 80 | 661 | 92 | 92 | 92 |
| tony | 2017-01-01 | 29 | 661 | 94 | 29 | 29 |
| tony | 2017-01-02 | 15 | 661 | 94 | 44 | 44 |
| tony | 2017-01-07 | 50 | 661 | 94 | 94 | 94 |
+-------+-------------+-------+------+------+------+------+
两者运行结果一致, 说明省略的 即range between unbounded preceding and current row
官方给出了说省略 的是row between unbounded preceding and unbounded following
如果有order by 则是省略的range between unbounded preceding and unbounded following
个人解释:
s1 range没有partition 所以current row 没有界限 和unbounded following 一致 结果(此处不再显示演示结果)
s2 unbounded preceding 即 同一个partition中的首行 不受order by 影响, current row 即当前行 ,当前行即name=jack , 因为此处为range开窗范围
range 代表name=jack的所有行, 所以结果为5个jack行之和, 关于range和row的区别请参照 https://blog.csdn.net/xinqiku/article/details/96362796
s3 unbounded preceding 即 同一个partition中的首行 不受order by 影响, current row 受order by orderdate影响 , 当前行即 name=jack, orderdate =2017-01-01 联合, 又因为是range开窗 所以结果为2行name=jack, orderdate =2017-01-01之和 (如果current row 改成unbounded following , unbounded following 则不受order by 影响 而直接指向partition的末行 结果为jack五行之和 ,此处不再显示演示结果)
s4 unbounded preceding 即 同一个partition中的首行 , current row 因为是row 开窗范围, 即当前行name=jack这一行的和, 如果当前行移到下一行,即为当前行之前的所有行之和
此为刚接触hive 总结规律所写, 不当之处请大佬留言赐教!
查询顾客上次的购买时间
select
name, orderdate, cost,
lag(orderdate, 1 ,'1970-01-01') over(partition by name order by orderdate)
from
business
查询前20%时间的订单信息
SELECT *
FROM (
SELECT *
,ntile(5) OVER (
ORDER BY orderdate
) n
FROM orders
) t
WHERE t.n = 1;
解释:
SELECT *
,ntile(5) OVER (
ORDER BY orderdate
) n
FROM orders
+----------------+---------------------+----------------+----+
| business.name | business.orderdate | business.cost | n |
+----------------+---------------------+----------------+----+
| jack | 2017-01-01 | 10 | 1 |
| tony | 2017-01-01 | 29 | 1 |
| jack | 2017-01-01 | 46 | 1 |
| tony | 2017-01-02 | 15 | 2 |
| tony | 2017-01-07 | 50 | 2 |
| jack | 2017-01-08 | 55 | 2 |
| jack | 2017-02-03 | 23 | 3 |
| jack | 2017-04-06 | 42 | 3 |
| mart | 2017-04-08 | 62 | 3 |
| mart | 2017-04-09 | 68 | 4 |
| mart | 2017-04-11 | 75 | 4 |
| mart | 2017-04-13 | 94 | 4 |
| neil | 2017-05-10 | 12 | 5 |
| neil | 2017-06-12 | 80 | 5 |
+----------------+---------------------+----------------+----+
输出为后续加了 n字段数字分为5桶 每桶为一样数字, 让后再查询此表 得出前20%数据
ntile 参考 https://blog.csdn.net/BabyFish13/article/details/78550601