Problem 1: Consecutive-days problem
The following data shows the carbon-emission reductions users collected in Ant Forest (蚂蚁森林):
id dt lowcarbon
1001 2021-12-12 123
1002 2021-12-12 45
1001 2021-12-13 43
1001 2021-12-13 45
1001 2021-12-13 23
1002 2021-12-14 45
1001 2021-12-14 230
1002 2021-12-15 45
1001 2021-12-15 23
… …
1. Find users whose daily total reduction is at least 100 on 3 or more consecutive days
2. Find users who have reduction records on 3 or more consecutive days
LAG
LAG(col, n, DEFAULT) returns the value from the n-th row above the current row within the window.
Argument 1 is the column name; argument 2 is the offset n (optional, default 1); argument 3 is the default value returned when the n-th preceding row does not exist (if omitted, NULL is returned).
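The LAG semantics above can be sketched in plain Python (a hypothetical lag helper over an already-ordered partition, not a Hive API):

```python
def lag(values, n=1, default=None):
    """Return a list where element i holds values[i - n], or `default`
    when fewer than n rows precede row i (mirrors LAG(col, n, DEFAULT))."""
    return [values[i - n] if i >= n else default for i in range(len(values))]

print(lag([10, 20, 30, 40]))        # → [None, 10, 20, 30]
print(lag([10, 20, 30, 40], 2, 0))  # → [0, 0, 10, 20]
```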
1. Hive table creation and data loading
CREATE EXTERNAL TABLE IF NOT EXISTS demo1(
id string comment 'id',
dt string comment 'date',
lowcarbon int comment 'carbon reduction'
)
comment 'Ant Forest carbon-reduction test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ' ' -- space-separated
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the first (header) line
2. Load the data (first upload the data file to HDFS; to load a local file, use load data local):
load data inpath '/libangqin/data1.txt' overwrite into table demo1 partition(pdt='2022-05-14');
3. Analysis
1. First aggregate each user's daily total (keeping only days with a total of at least 100), then apply the lag() window function ordered by date:
select e.id,e.dt,sum(e.lowcarbon) as lowcarbon from demo1 e group by id,dt having sum(e.lowcarbon) >=100
select d.id,d.dt,d.lowcarbon,lag(d.dt,1,date_sub(d.dt,1)) over( partition by id order by dt ) as ldt from (select e.id,e.dt,sum(e.lowcarbon) as lowcarbon from demo1 e group by id,dt having sum(e.lowcarbon) >=100 ) d
The intermediate output below illustrates the lag() result (note it is shown on the raw, un-aggregated rows):
d.id d.dt d.lowcarbon ldt
1001 2021-12-12 123 NULL
1001 2021-12-13 43 2021-12-12
1001 2021-12-13 45 2021-12-13
1001 2021-12-13 23 2021-12-13
1001 2021-12-14 230 2021-12-13
1001 2021-12-15 23 2021-12-14
1001 2021-12-15 33 2021-12-15
1001 2021-12-16 33 2021-12-15
1001 2021-12-17 53 2021-12-16
1002 2021-12-12 45 NULL
1002 2021-12-14 45 2021-12-12
1002 2021-12-15 45 2021-12-14
1002 2021-12-18 63 2021-12-15
1003 2021-12-19 73 NULL
1004 2021-12-16 23 NULL
Compute diff = datediff(dt, ldt). Within a run of consecutive days every diff is 1; identify each run, and a run with count >= 3 means the user had 3 or more consecutive qualifying days.
Final result:
id lowcarbon minDate maxDate
1001 464 2021-12-12 2021-12-14
Time taken: 2.252 seconds, Fetched: 1 row(s)
Note: grouping directly by the raw diff value would merge separate runs that happen to share the same diff, so a running sum of gap flags is used as the group key instead.
select t3.id,sum(t3.lowcarbon) as lowcarbon,min(t3.dt) as minDate,max(t3.dt) as maxDate from (
select t2.id,t2.dt,t2.lowcarbon,sum(if(t2.df>1,1,0)) over (partition by t2.id order by t2.dt) as grp from (
select d.id,d.dt,d.lowcarbon,datediff(d.dt,lag(d.dt,1,date_sub(d.dt,1)) over (partition by d.id order by d.dt)) as df from (select e.id,e.dt,sum(e.lowcarbon) as lowcarbon from demo1 e group by e.id,e.dt having sum(e.lowcarbon) >= 100 ) d ) t2
) t3 group by t3.id,t3.grp having count(t3.id) >= 3;
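As a sanity check on the gaps-and-islands logic above, here is a small Python simulation (using the sample daily totals that pass the >= 100 filter; the runs helper is made up for illustration):

```python
from datetime import date
from itertools import groupby

# Daily totals that already passed the having sum(lowcarbon) >= 100 filter
# (derived from the sample rows above; 2021-12-13 totals 43 + 45 + 23 = 111).
rows = [
    ("1001", date(2021, 12, 12), 123),
    ("1001", date(2021, 12, 13), 111),
    ("1001", date(2021, 12, 14), 230),
]

def runs(days):
    """Assign an island id to (date, amount) pairs, mirroring the
    running sum(if(df > 1, 1, 0)) group key in the SQL."""
    grp, prev, out = 0, None, []
    for d, amt in days:
        if prev is not None and (d - prev).days > 1:
            grp += 1                      # a gap > 1 day starts a new island
        out.append((grp, d, amt))
        prev = d
    return out

result = []
for uid, recs in groupby(rows, key=lambda r: r[0]):      # rows sorted by id
    days = [(d, amt) for _, d, amt in recs]
    for _, island in groupby(runs(days), key=lambda r: r[0]):
        island = list(island)
        if len(island) >= 3:              # 3+ consecutive qualifying days
            result.append((uid, sum(a for _, _, a in island),
                           island[0][1], island[-1][1]))
print(result)
```

This reproduces the final result row (1001, 464, 2021-12-12, 2021-12-14).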
Problem 2: Grouping problem
The following is user-visit timestamp data from an e-commerce company:
id ts(seconds)
1001 17523641234
1001 17523641256
1002 17523641278
1001 17523641334
1002 17523641434
1001 17523641534
1001 17523641544
1002 17523641634
1001 17523641638
1001 17523641654
For each user, consecutive visits less than 60 seconds apart belong to the same group; the expected result is:
id ts(seconds) group
1001 17523641234 1
1001 17523641256 1
1001 17523641334 2
1001 17523641534 3
1001 17523641544 3
1001 17523641638 4
1001 17523641654 4
1002 17523641278 1
1002 17523641434 2
1002 17523641634 3
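No SQL is given for this problem; the intended lag-and-flag technique can be sketched in Python (a gap of 60 seconds or more starts a new group; variable names are illustrative):

```python
from collections import defaultdict

# (id, ts) visit records from the table above
visits = [
    ("1001", 17523641234), ("1001", 17523641256), ("1002", 17523641278),
    ("1001", 17523641334), ("1002", 17523641434), ("1001", 17523641534),
    ("1001", 17523641544), ("1002", 17523641634), ("1001", 17523641638),
    ("1001", 17523641654),
]

by_user = defaultdict(list)
for uid, ts in visits:
    by_user[uid].append(ts)

groups = {}
for uid, stamps in by_user.items():
    stamps.sort()                         # order by ts within each user
    g, out = 1, []
    for i, ts in enumerate(stamps):
        if i > 0 and ts - stamps[i - 1] >= 60:
            g += 1                        # gap >= 60 s starts a new group
        out.append((ts, g))
    groups[uid] = out
```

In SQL the same idea is lag(ts) to get the previous timestamp, a flag if(ts - lts >= 60, 1, 0), and a running sum of that flag as the group id.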
Problem 3: Gapped-consecutive problem
Daily login data recorded by a game company:
id dt
1001 2021-12-12
1002 2021-12-12
1001 2021-12-13
1001 2021-12-14
1001 2021-12-16
1002 2021-12-16
1001 2021-12-19
1002 2021-12-17
1001 2021-12-20
Compute each user's longest consecutive login streak, where a one-day gap is allowed. Explanation: if a user logs in on days 1, 3, 5 and 6, that counts as a 6-day streak.
1. Hive table creation and data loading
CREATE EXTERNAL TABLE IF NOT EXISTS demo2(
id string comment 'id',
dt string comment 'date'
)
comment 'gapped consecutive-login test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ' ' -- space-separated
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the first (header) line
2. Load the data (first upload the data file to HDFS; to load a local file, use load data local):
load data inpath '/libangqin/data2.txt' overwrite into table demo2 partition(pdt='2022-05-14');
3. Analysis
1. Use lag() to fetch the previous login date and take the difference; intermediate output:
1001 2021-12-12 2021-12-11 1
1001 2021-12-13 2021-12-12 1
1001 2021-12-14 2021-12-13 1
1001 2021-12-16 2021-12-14 2
1001 2021-12-19 2021-12-16 3
1001 2021-12-20 2021-12-19 1
1002 2021-12-12 2021-12-11 1
1002 2021-12-16 2021-12-12 4
1002 2021-12-17 2021-12-16 1
The SQL for the lag step:
select id,dt,datediff(t1.dt,t1.ldt) df from (select id,dt,lag(dt,1,date_sub(dt,1)) over (partition by id order by dt) ldt from demo2 ) t1
2. Derive a group id from df (a running sum of gap flags):
select id,dt,df,sum(if(df>2,1,0)) over (partition by id order by dt) sumCount from (select id,dt,datediff(t1.dt,t1.ldt) df from (select id,dt,lag(dt,1,date_sub(dt,1)) over (partition by id order by dt) ldt from demo2 ) t1)t2;
3. group by to compute the streak length maxDay:
select t3.id,(datediff(max(dt),min(dt)) + 1) as maxDay,max(t3.dt) as maxDate,min(t3.dt) as minDate,sumCount from (select id,dt,df,sum(if(df>2,1,0)) over (partition by id order by dt) sumCount from (select id,dt,datediff(t1.dt,t1.ldt) df from (select id,dt,lag(dt,1,date_sub(dt,1)) over (partition by id order by dt) ldt from demo2 ) t1)t2)t3 group by id,sumCount;
4. Take the maximum streak per user:
select t4.id,max(t4.maxDay) as maxDay from (
select t3.id,(datediff(max(t3.dt),min(t3.dt)) + 1) as maxDay,max(t3.dt) as maxDate,min(t3.dt) as minDate,sumCount from (
select id,dt,df,sum(if(df>2,1,0)) over (partition by id order by dt) sumCount from (
select id,dt,datediff(t1.dt,t1.ldt) df from (
select id,dt,lag(dt,1,date_sub(dt,1)) over (partition by id order by dt) ldt from demo2 )t1)t2)t3 group by t3.id,sumCount
)t4 group by t4.id;
This run failed with the error:
hive return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Workaround: switch the execution engine to MR and enable local-mode execution:
set hive.execution.engine=mr;
set hive.exec.mode.local.auto=true;
Result:
1001 5 2021-12-12 2021-12-16
1001 9 2021-12-19 2021-12-27
1002 1 2021-12-12 2021-12-12
1002 2 2021-12-16 2021-12-17
Group by the result of sum(if(diff>2,1,0)):
1001 2021-12-12 2021-12-11 1 0
1001 2021-12-13 2021-12-12 1 0
1001 2021-12-14 2021-12-13 1 0
1001 2021-12-16 2021-12-14 2 0
1001 2021-12-19 2021-12-16 3 1
1001 2021-12-20 2021-12-19 1 1
Within each group, count the rows and compute maxDate - minDate + 1; the largest value per user is the final consecutive-day count.
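The gapped-streak logic can be simulated in Python as a sanity check (using only the login rows listed above; max_streak is a hypothetical helper mirroring the sum(if(df>2,1,0)) grouping):

```python
from datetime import date

# Login dates per user, taken from the table above
logins = {
    "1001": [date(2021, 12, 12), date(2021, 12, 13), date(2021, 12, 14),
             date(2021, 12, 16), date(2021, 12, 19), date(2021, 12, 20)],
    "1002": [date(2021, 12, 12), date(2021, 12, 16), date(2021, 12, 17)],
}

def max_streak(days, max_gap=2):
    """Longest calendar-day span in which every gap between successive
    logins is <= max_gap days (one skipped day allowed), computed as
    datediff(max, min) + 1 within each run."""
    days = sorted(days)
    best, start = 1, days[0]
    for prev, cur in zip(days, days[1:]):
        if (cur - prev).days > max_gap:   # gap > 2 days breaks the streak
            start = cur
        best = max(best, (cur - start).days + 1)
    return best
```

On the listed rows this yields 5 for user 1001 and 2 for user 1002; the longer 1001 streak in the pasted run output comes from additional rows in the author's data file that are not listed here.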
Problem 4: Overlapping discount-date problem
The following is platform promotion data; the fields are brand, discount start date, and discount end date:
brand stt edt
oppo 2021-06-05 2021-06-09
oppo 2021-06-11 2021-06-21
vivo 2021-06-05 2021-06-15
vivo 2021-06-09 2021-06-21
redmi 2021-06-05 2021-06-21
redmi 2021-06-09 2021-06-15
redmi 2021-06-17 2021-06-26
huawei 2021-06-05 2021-06-26
huawei 2021-06-09 2021-06-15
huawei 2021-06-17 2021-06-21
Compute the total number of discount days per brand, counting overlapping dates only once. For example, vivo's first promotion runs from 2021-06-05 to 2021-06-15 and its second from 2021-06-09 to 2021-06-21; the overlap from the 9th to the 15th is counted once, so vivo's total covers 2021-06-05 to 2021-06-21, i.e. 17 days.
1. Hive table creation and data loading
CREATE EXTERNAL TABLE IF NOT EXISTS demo3(
id string comment 'id',
stt string comment 'discount start date',
edt string comment 'discount end date'
)
comment 'brand discount-date test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ' ' -- space-separated
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the first (header) line
2. Load the data (first upload the data file to HDFS; to load a local file, use load data local):
load data inpath '/libangqin/data3.txt' overwrite into table demo3 partition(pdt='2022-05-14');
3. Analysis
1. Bring the maximum edt of all preceding rows (per brand, ordered by stt) onto the current row.
Window frame: rows between unbounded preceding and 1 preceding
select id,stt,edt,max(edt) over (partition by id order by stt rows between unbounded preceding and 1 preceding ) as mtt from demo3;
2. Clip the effective start date to one day past the running maximum end date, then compute df = edt - mt:
select id,stt,edt,mt,datediff(edt,mt) as df from(select id,stt,edt,if(mtt is null,stt,if(stt>mtt,stt,date_add(mtt,1))) mt from(select id,stt,edt,max(edt) over (partition by id order by stt rows between unbounded preceding and 1 preceding ) as mtt from demo3) t1)t2
3. sum(if(df>=0,df+1,0)) gives the total discount days:
select id,sum(if(df>=0,df+1,0)) as dayCount from (select id,stt,edt,mt,datediff(edt,mt) as df from(select id,stt,edt,if(mtt is null,stt,if(stt>mtt,stt,date_add(mtt,1))) mt from(select id,stt,edt,max(edt) over (partition by id order by stt rows between unbounded preceding and 1 preceding ) as mtt from demo3) t1)t2)t3 group by t3.id;
Result
OK
huawei 22
oppo 16
redmi 22
vivo 17
xiaomi 23
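The running-max clipping used above can be simulated in Python (the discount_days helper is illustrative, not part of any library; data taken from the listed rows):

```python
from datetime import date, timedelta
from collections import defaultdict

promos = [
    ("oppo",   date(2021, 6, 5),  date(2021, 6, 9)),
    ("oppo",   date(2021, 6, 11), date(2021, 6, 21)),
    ("vivo",   date(2021, 6, 5),  date(2021, 6, 15)),
    ("vivo",   date(2021, 6, 9),  date(2021, 6, 21)),
    ("redmi",  date(2021, 6, 5),  date(2021, 6, 21)),
    ("redmi",  date(2021, 6, 9),  date(2021, 6, 15)),
    ("redmi",  date(2021, 6, 17), date(2021, 6, 26)),
    ("huawei", date(2021, 6, 5),  date(2021, 6, 26)),
    ("huawei", date(2021, 6, 9),  date(2021, 6, 15)),
    ("huawei", date(2021, 6, 17), date(2021, 6, 21)),
]

def discount_days(intervals):
    """Running-max technique: clip each interval's start to one day past
    the maximum end date seen so far, then sum the surviving lengths."""
    total, max_edt = 0, None
    for stt, edt in sorted(intervals):          # order by stt
        if max_edt is not None and stt <= max_edt:
            stt = max_edt + timedelta(days=1)   # drop the overlapping prefix
        if stt <= edt:                          # df >= 0: something survives
            total += (edt - stt).days + 1
        max_edt = edt if max_edt is None else max(max_edt, edt)
    return total

by_brand = defaultdict(list)
for brand, stt, edt in promos:
    by_brand[brand].append((stt, edt))
totals = {brand: discount_days(iv) for brand, iv in by_brand.items()}
```

This reproduces the brand totals above for the listed rows (the xiaomi row in the pasted output comes from extra data in the author's file).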
Problem 5: Concurrent-online problem
The table below shows streamers' start and end times on a live-streaming platform. Based on this data, compute the peak number of streamers simultaneously online.
id stt edt
1001 2021-06-14 12:12:12 2021-06-14 18:12:12
1003 2021-06-14 13:12:12 2021-06-14 16:12:12
1004 2021-06-14 13:15:12 2021-06-14 20:12:12
1002 2021-06-14 15:12:12 2021-06-14 16:12:12
1005 2021-06-14 15:18:12 2021-06-14 20:12:12
1001 2021-06-14 20:12:12 2021-06-14 23:12:12
1006 2021-06-14 21:12:12 2021-06-14 23:15:12
1007 2021-06-14 22:12:12 2021-06-14 23:10:12
1. Hive table creation and data loading
CREATE EXTERNAL TABLE IF NOT EXISTS demo4(
id string comment 'id',
stt string comment 'stream start time',
edt string comment 'stream end time'
)
comment 'concurrent-online test table'
PARTITIONED BY (pdt STRING)
row format delimited fields terminated by ',' -- comma-separated
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the first (header) line
2. Load the data (first upload the data file to HDFS; to load a local file, use load data local):
load data inpath '/libangqin/data4.txt' overwrite into table demo4 partition(pdt='2022-05-14');
3. Analysis
1. Mark each start time as +1 (streamer online) and each end time as -1 (streamer offline), union the two, then sort by time and run a windowed sum:
select id,tt,p from (select id,stt as tt,1 as p from demo4
union all
select id,edt as tt,-1 as p from demo4 )t1 order by tt
2. Window analysis
No partition by is needed here, because we want the platform-wide count:
select id,tt,sum(p) over(order by tt ) as sumP from (select id,tt,p from (select id,stt as tt,1 as p from demo4 union all select id,edt as tt,-1 as p from demo4 )t1 order by tt)t2;
Result:
1001 2021-06-14 12:12:12 1
1003 2021-06-14 13:12:12 2
1004 2021-06-14 13:15:12 3
1002 2021-06-14 15:12:12 4
1005 2021-06-14 15:18:12 5
1002 2021-06-14 16:12:12 3
1003 2021-06-14 16:12:12 3
1001 2021-06-14 18:12:12 2
1001 2021-06-14 20:12:12 1
1004 2021-06-14 20:12:12 1
1005 2021-06-14 20:12:12 1
1006 2021-06-14 21:12:12 2
1007 2021-06-14 22:12:12 3
1007 2021-06-14 23:10:12 2
1001 2021-06-14 23:12:12 1
1006 2021-06-14 23:15:12 0
3. Take the maximum:
select max(t3.sumP) as maxOnline from (select id,tt,sum(p) over(order by tt) as sumP from (select id,stt as tt,1 as p from demo4 union all select id,edt as tt,-1 as p from demo4) t1) t3;
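The +1/-1 sweep can be checked with a short Python simulation (data taken from the table above; timestamps stay as strings since they sort lexicographically within one day):

```python
streams = [
    ("1001", "2021-06-14 12:12:12", "2021-06-14 18:12:12"),
    ("1003", "2021-06-14 13:12:12", "2021-06-14 16:12:12"),
    ("1004", "2021-06-14 13:15:12", "2021-06-14 20:12:12"),
    ("1002", "2021-06-14 15:12:12", "2021-06-14 16:12:12"),
    ("1005", "2021-06-14 15:18:12", "2021-06-14 20:12:12"),
    ("1001", "2021-06-14 20:12:12", "2021-06-14 23:12:12"),
    ("1006", "2021-06-14 21:12:12", "2021-06-14 23:15:12"),
    ("1007", "2021-06-14 22:12:12", "2021-06-14 23:10:12"),
]

# +1 at each start, -1 at each end, sorted by timestamp; the running sum
# is the number of streamers online, and its maximum is the answer.
events = [(stt, 1) for _, stt, _ in streams] + [(edt, -1) for _, _, edt in streams]
online = peak = 0
for _, p in sorted(events):
    online += p
    peak = max(peak, online)
print(peak)
```

This matches the windowed-sum column above, which peaks at 5 (at 15:18:12).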
Part 6: grouping sets test
Create the table and load the data
CREATE EXTERNAL TABLE IF NOT EXISTS demo5(
school string comment 'school',
grade string comment 'grade',
name string comment 'student name'
)
comment 'student table'
row format delimited fields terminated by ',' -- comma-separated
STORED AS TEXTFILE -- stored as TEXTFILE
tblproperties("skip.header.line.count"="1"); -- skip the first (header) line
load data inpath '/libangqin/data5.txt' overwrite into table demo5;
An error occurred when loading the data:
Hive FAILED: ParseException line 1:4 character ' ' not supported here
Fix: this error typically means the statement contains an unsupported whitespace character (e.g. a full-width space pasted in from an editor). Retype the CREATE TABLE sql with plain ASCII spaces, then re-run the load.
grouping sets test
select nvl(school,'=school=') as school,nvl(grade,'==grade==') as grade,count(1) as cnt from demo5
group by school,grade grouping sets((school,grade),school,grade);
Result
=school= grade1 3
=school= grade2 3
=school= grade3 3
=school= grade4 3
=school= grade5 3
=school= grade6 3
school ==grade== 6
school grade1 1
school grade2 1
school grade3 1
school grade4 1
school grade5 1
school grade6 1
school2 ==grade== 6
school2 grade1 1
school2 grade2 1
school2 grade3 1
school2 grade4 1
school2 grade5 1
school2 grade6 1
school3 ==grade== 6
school3 grade1 1
school3 grade2 1
school3 grade3 1
school3 grade4 1
school3 grade5 1
school3 grade6 1
Rollup test
select nvl(school,'=school=') as school,nvl(grade,'==grade==') as grade,count(1) as cnt from demo5 group by school,grade with rollup;
Result
=school= ==grade== 18
school ==grade== 6
school grade1 1
school grade2 1
school grade3 1
school grade4 1
school grade5 1
school grade6 1
school2 ==grade== 6
school2 grade1 1
school2 grade2 1
school2 grade3 1
school2 grade4 1
school2 grade5 1
school2 grade6 1
school3 ==grade== 6
school3 grade1 1
school3 grade2 1
school3 grade3 1
school3 grade4 1
school3 grade5 1
school3 grade6 1
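grouping sets((school, grade), school, grade) is simply the union of three separate group-bys, which the rollup variant then restricts to prefixes of the group-by list. A Python sketch over toy rows reconstructed from the output above (the exact row values are assumptions):

```python
from collections import Counter

# 3 schools x 6 grades, one student each, mirroring the counts above
rows = [(s, g) for s in ("school", "school2", "school3")
        for g in ("grade1", "grade2", "grade3", "grade4", "grade5", "grade6")]

# grouping sets((school, grade), school, grade) = union of three group-bys
by_both   = Counter(rows)                  # group by (school, grade)
by_school = Counter(s for s, _ in rows)    # group by school (grade is NULL)
by_grade  = Counter(g for _, g in rows)    # group by grade (school is NULL)
```

In the SQL output, nvl() fills the NULL side of each partial grouping with the '=school=' / '==grade==' placeholders.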