Hive-开窗函数的使用案例

最新推荐文章于 2024-08-01 08:57:45 发布

HelloWorld闯天涯

最新推荐文章于 2024-08-01 08:57:45 发布

阅读量654

点赞数 1

分类专栏： Hive 文章标签： Hive

本文链接：https://blog.csdn.net/qq_18800463/article/details/103870893

版权

Hive 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

题目一：

表t_user_event结构及数据如下：

请写出sql (hive sql)语句得到：

目标字段

DIFF_TS: 是同一用户id之间相邻EVENT_TS 时间间隔

GROUP_ID: 同一用户id，相邻时间间隔>50的时候，分成一个小组

构建数据样本：

create table t_order
(oid int ,
uid int ,
otime date,
oamount int
)partitioned by (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

insert into table t_order partition(dt ='2018-01-01') values(1003,2,'2018-01-01',100);
insert into table t_order partition(dt ='2018-01-02') values(1004,2,'2018-01-02',20);
insert into table t_order partition(dt ='2018-01-02') values(1005,2,'2018-01-02',100);
insert into table t_order partition(dt ='2018-01-02') values(1006,4,'2018-01-02',30);
insert into table t_order partition(dt ='2018-01-03') values(1007,1,'2018-01-03',130);
insert into table t_order partition(dt ='2018-01-03') values(1008,2,'2018-01-03',5);
insert into table t_order partition(dt ='2018-01-03') values(1009,2,'2018-01-03',5);

insert into table t_order partition(dt ='2018-02-01') values(1001,5,'2018-02-01',110);
insert into table t_order partition(dt ='2018-02-01') values(1002,3,'2018-02-01',110);
insert into table t_order partition(dt ='2018-02-03') values(1003,3,'2018-02-03',100);
insert into table t_order partition(dt ='2018-02-03') values(1004,3,'2018-02-03',20);
insert into table t_order partition(dt ='2018-02-04') values(1005,3,'2018-02-04',30);
insert into table t_order partition(dt ='2018-02-04') values(1006,6,'2018-02-04',100);
insert into table t_order partition(dt ='2018-02-04') values(1007,6,'2018-02-04',130);

insert into table t_order partition(dt ='2018-03-01') values(1001,1,'2018-03-01',120);
insert into table t_order partition(dt ='2018-03-03') values(1002,2,'2018-03-03',5);
insert into table t_order partition(dt ='2018-03-03') values(1003,2,'2018-03-03',11);
insert into table t_order partition(dt ='2018-03-03') values(1004,3,'2018-03-03',1);
insert into table t_order partition(dt ='2018-03-04') values(1005,3,'2018-03-04',20);
insert into table t_order partition(dt ='2018-03-04') values(1006,4,'2018-03-04',30);
insert into table t_order partition(dt ='2018-03-04') values(1007,1,'2018-03-04',50);

查询语句：

with
lag_tmp
as
(
	select 
		user_id, event_ts, 
		event_ts - lag(event_ts, 1, event_ts) over(partition by user_id order by event_ts) diff_ts   
	from t_user_event
)
select  
	user_id, 
	event_ts, 
	diff_ts, 
	1 + sum(if(diff_ts>50,1,0)) over(partition by user_id order by event_ts) group_id
from lag_tmp;

难点分析：

lag函数

向前取相邻列的值，用法 lag(参数1，参数2，参数3)

参数1：要取的列名

参数2：向前取几个，也就是步长

参数3：超出记录窗口时的默认值

这里的 lag(event_ts, 1, event_ts) 取前一个相邻的值，如果前一个没有，就默认当前值

类似函数：lead向后取

over(partition by)函数

over(partition by ......)主要和聚合函数sum()、count()、avg()等结合使用，实现分组聚合的功能

这里的 lag(event_ts, 1, event_ts) over(partition by user_id order by event_ts) 分组排序后，求固定步长的值
sum(if(diff_ts>50,1,0)) over(partition by user_id order by event_ts) 分组排序后，如果diff_ts属性的值大于50，求和统计就加1

题目二：

t_order表结构

字段名	含义
oid	订单编号
uid	用户id
otime	订单时间（yyyy-MM-dd）
oamount	订单金额(元)

所有在2019年7月下过单并且在8月没有下过单的用户，在9月份的下单情况：

目标字段名	含义
uid	用户id
big_order_count	当月订单金额超过1000元的订单个数
first_order_amount	当月首次下单金额
last_order_amount	当月末次下单金额

要求：用sql (hive sql) 编写，t_order表的扫描次数不超过2次。

查询语句：

with
rk_ct_tmp
as (
	select 
		*,
		row_number()over(partition by uid,date_format(  otime,'yyyy-MM') order by otime ) rk,
		count(*)over(partition by uid,date_format(  otime,'yyyy-MM') )  ct 
	from  ttt_order order by otime,uid
 )
select 
	uid,
	sum(if( date_format( o.otime,'yyyy-MM')='2018-01' ,1,0))  mon7_count,
	sum(if( date_format( o.otime,'yyyy-MM')='2018-02' ,1,0)) mon8_count,
	sum(if( date_format( o.otime,'yyyy-MM')='2018-03' and o.oamount>10 ,1,0)) mon9_count ,
	sum(if(rk =1 and date_format( o.otime,'yyyy-MM')='2018-03',o.oamount,0)) first_amount,
	sum(if(rk =ct and date_format( o.otime,'yyyy-MM')='2018-03',o.oamount,0)) last_amount
from  rk_ct_tmp o
group by uid
having mon7_count>0 and mon8_count=0

难点分析：

row_number() over(partition by)