一、 函数说明
over():指定分析函数工作的数据窗口大小,这个数据窗口大小 可能会随着行的变化而变化。
current row:当前行
n preceding:往前n行数据
n following:往后n行数据
unbounded:起点
unbounded preceding:表示从前面的起点
unbounded following:表示到后面的终点
LAG(col,n,default_val):往前第n行数据
LEAD(col,n,default_val):往后第n行数据
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n
必须为int类型。
二、数据准备:name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
三、创建表并导入数据
create table business(
name string,
orderdate string,
cost int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;
四、按需求查询数据
(1)查询在2017年4月份购买过的顾客及总人数
select name,count(*) over()
from business
where substring(orderdate,1,7) = '2017-04'
group by name
(2)查询顾客的购买明细及月购总额
select name,orderdate,cost,
sum(cost) over(partition by month(orderdate))
from business;
(3)上述的场景,将每个顾客的cost按照日期进行累加
select name,orderdate,cost,
sum(cost) over() as sample1, --所有行相加
sum(cost) over(partition by name) as sample2, --按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,
--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate
rows between UNBOUNDED PRECEDING and current row) as sample4,
--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate
rows between 1 PRECEDING and current row) as sample5,
--当前行和前面一行做聚合
sum(cost) over(partition by name order by orderedate
rows between 1 PRECEDING and 1 FOLLOWING) as sample6,
--当前行和前面一行及后面一行
sum(cost) over(partition by name order by orderdatep
rows between current row and UNBOUNDED FOLLOWING) as sample7
--当前行和后面所有行
from business;
rows必须跟在order by 子句之后,对排序的结果进行限制,使用固定的行数来限制分区中的数据行数量。
(4)查看顾客上次的购买时间
select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate) as time1,
lag(orderdate,2) over(partition by name order by orderdate) as time2
from busibness;
(5)查询前20%时间的订单信息
select * from (
select name,orderdate,cost,ntile(5) over(order by orderdate) sorted from business
) t
where sorted = 1;
五、Rank函数
1、函数说明
RANK() 排序相同时会重复,总数不会变
DENSE_RANK() 排序相同时会重复,总数会减少
ROW_NUMBER() 会根据顺序计算
2、数据准备
name | subject | score |
---|---|---|
孙悟空 | 语文 | 87 |
孙悟空 | 数学 | 90 |
孙悟空 | 英语 | 68 |
猪八戒 | 语文 | 94 |
猪八戒 | 数学 | 56 |
猪八戒 | 英语 | 84 |
唐僧 | 语文 | 64 |
唐僧 | 数学 | 86 |
唐僧 | 英语 | 84 |
沙悟净 | 语文 | 65 |
沙悟净 | 数学 | 85 |
沙悟净 | 英语 | 78 |
3、创建hive表并导入数据
create table score(
name string,
subject string,
score int)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/score.txt' into table score;
4、计算每门学科成绩排名
select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rnp,
from score;