题目一 :
表t_user_event结构及数据如下:
请写出sql (hive sql)语句得到:
目标字段
DIFF_TS: 是同一用户id之间相邻EVENT_TS 时间间隔
GROUP_ID: 同一用户id,相邻时间间隔>50的时候,分成一个小组
构建数据样本:
create table t_order
(oid int ,
uid int ,
otime date,
oamount int
)partitioned by (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
insert into table t_order partition(dt ='2018-01-01') values(1003,2,'2018-01-01',100);
insert into table t_order partition(dt ='2018-01-02') values(1004,2,'2018-01-02',20);
insert into table t_order partition(dt ='2018-01-02') values(1005,2,'2018-01-02',100);
insert into table t_order partition(dt ='2018-01-02') values(1006,4,'2018-01-02',30);
insert into table t_order partition(dt ='2018-01-03') values(1007,1,'2018-01-03',130);
insert into table t_order partition(dt ='2018-01-03') values(1008,2,'2018-01-03',5);
insert into table t_order partition(dt ='2018-01-03') values(1009,2,'2018-01-03',5);
insert into table t_order partition(dt ='2018-02-01') values(1001,5,'2018-02-01',110);
insert into table t_order partition(dt ='2018-02-01') values(1002,3,'2018-02-01',110);
insert into table t_order partition(dt ='2018-02-03') values(1003,3,'2018-02-03',100);
insert into table t_order partition(dt ='2018-02-03') values(1004,3,'2018-02-03',20);
insert into table t_order partition(dt ='2018-02-04') values(1005,3,'2018-02-04',30);
insert into table t_order partition(dt ='2018-02-04') values(1006,6,'2018-02-04',100);
insert into table t_order partition(dt ='2018-02-04') values(1007,6,'2018-02-04',130);
insert into table t_order partition(dt ='2018-03-01') values(1001,1,'2018-03-01',120);
insert into table t_order partition(dt ='2018-03-03') values(1002,2,'2018-03-03',5);
insert into table t_order partition(dt ='2018-03-03') values(1003,2,'2018-03-03',11);
insert into table t_order partition(dt ='2018-03-03') values(1004,3,'2018-03-03',1);
insert into table t_order partition(dt ='2018-03-04') values(1005,3,'2018-03-04',20);
insert into table t_order partition(dt ='2018-03-04') values(1006,4,'2018-03-04',30);
insert into table t_order partition(dt ='2018-03-04') values(1007,1,'2018-03-04',50);
查询语句:
with
lag_tmp
as
(
select
user_id, event_ts,
event_ts - lag(event_ts, 1, event_ts) over(partition by user_id order by event_ts) diff_ts
from t_user_event
)
select
user_id,
event_ts,
diff_ts,
1 + sum(if(diff_ts>50,1,0)) over(partition by user_id order by event_ts) group_id
from lag_tmp;
难点分析:
lag函数
向前取相邻列的值,用法 lag(参数1,参数2,参数3)
参数1:要取的列名
参数2:向前取几个,也就是步长
参数3:超出记录窗口时的默认值
这里的 lag(event_ts, 1, event_ts) 取前一个相邻的值,如果前一个没有,就默认当前值
类似函数:lead向后取
over(partition by)函数
over(partition by ......)主要和聚合函数sum()、count()、avg()等结合使用,实现分组聚合的功能
这里的 lag(event_ts, 1, event_ts) over(partition by user_id order by event_ts) 分组排序后,求固定步长的值
sum(if(diff_ts>50,1,0)) over(partition by user_id order by event_ts) 分组排序后,如果diff_ts属性的值大于50,求和统计就加1
题目二:
t_order表结构
字段名 | 含义 |
oid | 订单编号 |
uid | 用户id |
otime | 订单时间(yyyy-MM-dd) |
oamount | 订单金额(元) |
所有在2019年7月下过单并且在8月没有下过单的用户,在9月份的下单情况:
目标字段名 | 含义 |
uid | 用户id |
big_order_count | 当月订单金额超过1000元的订单个数 |
first_order_amount | 当月首次下单金额 |
last_order_amount | 当月末次下单金额 |
要求:用sql (hive sql) 编写 ,t_order表的扫描次数不超过2次。
查询语句:
with
rk_ct_tmp
as (
select
*,
row_number()over(partition by uid,date_format( otime,'yyyy-MM') order by otime ) rk,
count(*)over(partition by uid,date_format( otime,'yyyy-MM') ) ct
from ttt_order order by otime,uid
)
select
uid,
sum(if( date_format( o.otime,'yyyy-MM')='2018-01' ,1,0)) mon7_count,
sum(if( date_format( o.otime,'yyyy-MM')='2018-02' ,1,0)) mon8_count,
sum(if( date_format( o.otime,'yyyy-MM')='2018-03' and o.oamount>10 ,1,0)) mon9_count ,
sum(if(rk =1 and date_format( o.otime,'yyyy-MM')='2018-03',o.oamount,0)) first_amount,
sum(if(rk =ct and date_format( o.otime,'yyyy-MM')='2018-03',o.oamount,0)) last_amount
from rk_ct_tmp o
group by uid
having mon7_count>0 and mon8_count=0
难点分析:
row_number() over(partition by)
分组后排序
count(*) over(partition by)
分组求和
data_format
日期格式化
if()
条件判断取值
其它函数整理:
(1)row_number() over(partition by)
此方法不管排名是否有相同的,都按照顺序1,2,3…..n
(2)rank() over(partition by)
跳跃排序,此方法进行排序时,相同的排序是一样的,而且下一个不同值是跳着排序的。如1,1,3
(3)dense_rank() over(partition by)
不跳跃排序,此方法对于排名相同的名次一样,且后面名次不跳跃。如1,1,2
(4)first_value() over(partition by)
分组取第一个值
(5)last_value() over(partition by)
分组取最后一个值
(6)更多参考
https://blog.csdn.net/sherri_du/article/details/53312085