一、函数介绍
1、OVER():
指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化,例如分区排序;
over函数相关参数:
partition by:分区,窗口大小为分区
order by:一般与分区相结合使用,窗口大小为排序后的数据起点到当前行
CURRENT ROW:当前行
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
2、其他函数
LAG(col,n):往前第n行数据
LAG(col,n):往前第n行数据
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n必须为int类型。
此三个函数都需要over() 开窗来确定数据范围大小,不然单独无法使用
3、排序函数介绍
RANK() 排序相同时会重复,总数不会变
DENSE_RANK() 排序相同时会重复,总数会减少
ROW_NUMBER() 会根据顺序计算
一样需要over函数来开窗确定数据范围大小。
二、over使用介绍
1、数据准备
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
建表
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
导入
load data local inpath "/opt/module/datas/business.txt" into table business;
2、查询购买顾客姓名及购买总人数–窗口大小为筛选后的数据集
2.1 分组查询购买顾客姓名,但是没有顾客总数
hive (default)> select name from business group by name;
2.2 使用over计算购买总人数
hive (default)> select name,count(*) over() from business group by name;
name count_window_0
jack 4
mart 4
neil 4
tony 4
2.3 sql讲解:count(*) over()
count(*) 为合计
over为当前count函数确定需要合计的数据窗口大小,由于over中没有参数,name就是筛选后的数据全集为窗口大小
3、查询顾客的购买明细及 月购买总额-窗口分区大小
数据窗口分区:按照月份分区
3.1 购买明细
select name,orderdate,cost from business;
3.2 购买明细及总额
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;
name orderdate cost sum_window_0
jack 2017-01-01 10 205
tony 2017-01-02 15 205
tony 2017-01-04 29 205
jack 2017-01-05 46 205
tony 2017-01-07 50 205
jack 2017-01-08 55 205
jack 2017-02-03 23 23
mart 2017-04-13 94 341
mart 2017-04-08 62 341
mart 2017-04-09 68 341
mart 2017-04-11 75 341
jack 2017-04-06 42 341
neil 2017-05-10 12 12
neil 2017-06-12 80 80
3.3 总额计算解析
over(partition by month(orderdate)) 按照月份分区,每个窗口大小为所在分区范围大小
4、求每个用户购买明细及将cost按照日期进行累加
hive (default)> select name,orderdate,cost,sum(cost) over(partition by name order by orderdate) from business;
name orderdate cost sum_window_0
jack 2017-01-01 10 10
jack 2017-01-05 46 56
jack 2017-01-08 55 111
jack 2017-02-03 23 134
jack 2017-04-06 42 176
mart 2017-04-08 62 62
mart 2017-04-09 68 130
mart 2017-04-11 75 205
mart 2017-04-13 94 299
neil 2017-05-10 12 12
neil 2017-06-12 80 92
tony 2017-01-02 15 15
tony 2017-01-04 29 44
tony 2017-01-07 50 94
Time taken: 38.788 seconds, Fetched: 14 row(s)
sql解析:sum(cost) over(partition by name order by orderdate)
按照姓名分区,并且区内按照日期排序
4、其他参数介绍
select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from business;
5、查看顾客上次的购买时间-lag函数使用
select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,
lag(orderdate,2) over (partition by name order by orderdate) as time2
from business;
lag(orderdate,1,‘1900-01-01’):求orderdate列的前一条数据此列值,为null则用默认值1900-01-01
6、查看顾客下一次购买时间-lead函数使用
select name,orderdate,cost,
LEAD(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,
LEAD(orderdate,2) over (partition by name order by orderdate) as time2
from business;
7、查询前20%时间的订单信息-ntile函数使用
7.1 为数据分区
select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
from business;
7.2 查询分区编号为1的数据就是前20%的数据
select * from (
select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
from business
) t
where sorted = 1;
三、排序介绍
1、数据准备
数据
孙悟空 语文 87
孙悟空 数学 95
孙悟空 英语 68
大海 语文 94
大海 数学 56
大海 英语 84
宋宋 语文 64
宋宋 数学 86
宋宋 英语 84
婷婷 语文 65
婷婷 数学 85
婷婷 英语 78
建表
create table score(
name string,
subject string,
score int)
row format delimited fields terminated by "\t";
导入数据
load data local inpath '/opt/module/datas/score.txt' into table score;
2、三种排序方式比较
select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;