大数据之hive_hive的窗口函数

最新推荐文章于 2022-08-25 21:34:15 发布

普罗米修斯之火

最新推荐文章于 2022-08-25 21:34:15 发布

阅读量262

点赞数

分类专栏： hive 文章标签： hive

本文链接：https://blog.csdn.net/WuBoooo/article/details/108391885

版权

hive 专栏收录该内容

13 篇文章 0 订阅

订阅专栏

窗口函数

我们在对表的某个字段中的相同的值进行分组聚合时,我们还想对聚合之前的这些相同的值所对应的一些数据进行一些局部运算的操作,这时候就需要使用到窗口函数.

1．相关函数说明

OVER()：指定分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的变而变化

CURRENT ROW：当前行  current row

n PRECEDING：往前n行数据  n  preceding

n FOLLOWING：往后n行数据  n following

UNBOUNDED：起点， unbounded

UNBOUNDED PRECEDING 表示从前面的起点， unbounded preceding

 UNBOUNDED FOLLOWING表示到后面的终点  unbound following

LAG(col,n)：往前第n行数据  lag  参数一 字段  n

LEAD(col,n)：往后第n行数据 lead

NTILE(n)：把有序分区中的行分发到指定数据的组中，各个组有编号，编号从1开始，对于每一行，
NTILE返回此行所属的组的编号。注意：n必须为int类型。  ntile(5)

实例1:

select name,orderdate,cost, 
--所有行相加 
sum(cost) over() as sample1,

--按name分组，组内所有cost数据相加 
sum(cost) over(partition by name) as sample2,

--按name分组，先组内按orderdate数据排序,然后组内的每行cost都从当前行的位置累加到起始行 
sum(cost) over(partition by name order by orderdate) as sample3,

--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 , 

--当前行和前面一行做聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, 

--当前行和前边一行及后面一行 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,

--当前行及后面所有行 
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 
from business;

实例2:

数据准备：name，ctime，money
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

需求
（1）查询在2017年4月份购买过的顾客及总人数

select
a.name,
count(1) over ()  --over()内不加参数,就以全表为窗口
from
(select
name,ctime
from
tb_orders
--substr(ctime,0,7)取日期的前七位
where substr(ctime,0,7)="2017-04")a
group by a.name
;

（2）查询顾客的购买明细及月购买总额

select
name,
ctime,
--按顾客名和日期的前七位进行窗口sum运算
sum(money) over (partition by name,substr(ctime,0,7))
from tb_orders;

（3）上述的场景,要将money按照日期进行累加

select
*,
sum(money) over (partition by name,substr(ctime,0,7) order by ctime)
from tb_orders;

（4）查询顾客上次的购买时间

select
name,
ctime,
--lag(ctime,1,ctime)在窗口内求当前行的上一行ctime,没有就默认当前行的ctime
lag(ctime,1,ctime) over (partition by name order by ctime)
from tb_orders;

（5）查询前20%时间的订单信息

select
*
from
(select
*,
--以全表为窗口按时间排序,并分成5份,拿出第一份
ntile(5) over(order by ctime) num
from tb_orders) a
where a.num=1;

实例3:

有以下数据：
a,2017-02-05,200
a,2017-02-06,300
a,2017-02-07,200
a,2017-02-08,400
a,2017-02-10,600
b,2017-02-05,200
b,2017-02-06,300
b,2017-02-08,200
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,300
c,2017-02-02,200
c,2017-02-03,400
c,2017-02-10,600
a,2017-03-01,200
a,2017-03-02,300
a,2017-03-03,200
a,2017-03-04,400
a,2017-03-05,600

求连续三天以上销售记录的店铺名称:

select
distinct cc.name
from
(select
b.name,
count(1) c
from
(select
a.name,
--date_sub(a.ctime,a.row_number)当前行的ctim时间减去row_number天
date_sub(a.ctime,a.row_number) diff
from (
select
name,
ctime,
--按名字分组,在组内按时间排序,然后使用row_number()给组每行数据按1,2,3,~打上标记
row_number() over (partition by name order by ctime) row_number
from
tb_shop) a) b
group by b.name,b.diff
having c>3) cc;

实例4:

打地鼠游戏:
求连续三次都命中的姓名
数据:
在这里插入图片描述

select
b.uid,
count(1) con
from
(select
a.uid,
a.num-a.ro rr
from
(select
uid,
num,
tof,
--对筛选出的命中了的记录按uid在组内进行排序和,打上递增数字标记
row_number() over ( partition by uid order by num) ro
from
tb_dds
--筛查出命中了的记录
where tof=1)a
)b
group by b.uid,b.rr
having con>3
;

Rank

1．函数说明
RANK() 排序相同时会重复，总数不会变
DENSE_RANK() 排序相同时会重复，总数会减少
ROW_NUMBER() 会根据顺序计算

2．数据准备
表6-7 数据准备
name subject score
孙悟空语文 87
孙悟空数学 95
孙悟空英语 68
娜娜语文 94
娜娜数学 56
娜娜英语 84
宋宋语文 64
宋宋数学 86
宋宋英语 84
婷婷语文 65
婷婷数学 85
婷婷英语 78
3．需求
计算每门学科成绩排名。
4．创建本地score.txt，导入数据
[doit@hadoop102 datas]$ vi score.txt
5．创建hive表并导入数据

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/score.txt' into table score;

6．按需求查询数据

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;

name subject score rp drp rmp
孙悟空数学 95 1 1 1
宋宋数学 86 2 2 2
婷婷数学 85 3 3 3
娜娜数学 56 4 4 4
宋宋英语 84 1 1 1
娜娜英语 84 1 1 2
婷婷英语 78 3 2 3
孙悟空英语 68 4 3 4
娜娜语文 94 1 1 1
孙悟空语文 87 2 2 2
婷婷语文 65 3 3 3
宋宋语文 64 4 4 4