用户观看节目hql练习
感觉很好的题目,记录下,
数据如下
date user age programid Playtime
20190421 u1 30 a 4min
20190421 u1 30 b 10min
20190421 u2 27 a 2min
20190422 u3 35 c 3min
20190422 u2 27 d 1min
问题:
1.统计:用户总量,用户平均年龄,用户平均观看时长
2.统计:每10岁一个分段,统计每个区间的用户总量,用户平均观看时长
3.统计:每个用户最喜欢的节目
4.统计:观看时长大于5min的用户总量,只要有一个节目用户观看时间小于5min就不能算
-- 建表
create table program(
date bigint,
user string,
age int,
programid string,
playtime int
)
row format delimited
fields terminated by '\t'
stored as textfile
;
题目:
1.统计:用户总量,用户平均年龄,用户平均观看时长
2.统计:每10岁一个分段,统计每个区间的用户总量,用户平均观看时长
3.统计:每个用户最喜欢的节目
4.统计:观看时长大于5min的用户总量,只要有一个节目用户观看时间小于5min就不能算
解答:
第一问:统计:用户总量,用户平均年龄,用户平均观看时长
select count(*),avg(age),avg(spt)
from
(select uname,age,sum(playtime) as spt
from program group by uname,age) t;
第二问:每10岁一个分段,统计每个区间的用户总量,用户平均观看时长
一开始做错了:
select
sum(if(age<=10,1,0)) as `0-10岁用户总数`,
coalesce(Round(sum(if(age<=10,playtime,0))/sum(if(age<=10,1,0)),2),0) as `0-10岁平均时长`,
sum(if(age>10 and age<=20,1,0)) as `10-20岁用户总数`,
coalesce(Round(sum(if(age>10 and age<=20,playtime,0))/sum(if(age>10 and age<=20,1,0)),2),0) as `10-20岁平均时长`,
sum(if(age>20 and age<=30,1,0)) as `20-30岁用户总数`,
coalesce(Round(sum(if(age>20 and age<=30,playtime,0))/sum(if(age>20 and age<=30,1,0)),2),0) as `20-30岁平均时长`,
sum(if(age>30 and age<=40,1,0)) as `30-40岁用户总数`,
coalesce(Round(sum(if(age>30 and age<=40,playtime,0))/sum(if(age>30 and age<=40,1,0)),2),0) as `30-40岁平均时长`,
sum(if(age>40 and age<=50,1,0)) as `40-50岁用户总数`,
coalesce(Round(sum(if(age>40 and age<=50,playtime,0))/sum(if(age>40 and age<=50,1,0)),2),0) as `40-50岁平均时长`
from program;
错了!!因为,因为计算人数的时候根本没对用户进行去重
----------------------
一个job
OK
0-10岁用户总数 0-10岁平均时长 10-20岁用户总数 10-20岁平均时长 20-30岁用户总数 20-30岁平均时长 30-40岁用户总数 30-40岁平均时长40-50岁用户总数 40-50岁平均时长
0 0.0 0 0.0 4 4.25 1 3.0 0 0.0
Time taken: 51.096 seconds, Fetched: 1 row(s)
----------------------------
正确答案:
解法一:
select flag,
count(uname),
Round(avg(spt),2)
from
(
select
uname,
(case when age<10 then 0
when age>=10 and age<20 then 1
when age>=20 and age<30 then 2
when age>=30 and age<40 then 3
when age>=40 and age<50 then 4
end
) as flag,
sum(playtime) spt
from program
group by uname,age
)t
group by flag
;
----------------
2个job
OK
flag _c1 _c2
2 1 3.0
3 2 8.5
Time taken: 83.925 seconds, Fetched: 2 row(s)
------------------------------------------------------
解法二:
大佬使用floor函数,更加简洁,学习了
select
flag,count(uname),round(avg(spt),2)
from
(select
uname,
floor(age/10) flag,
sum(playtime) as spt
from program
group by uname,age
)t
group by flag
;
---------------------------
OK
flag _c1 _c2
2 1 3.0
3 2 8.5
Time taken: 88.804 seconds, Fetched: 2 row(s)
第三问:统计:每个用户最喜欢的节目
我的麻烦的写法:
select
uname,programid
from(
select
uname,programid,
row_number() over(partition by uname order by spt desc) as rn
from
(select uname,programid,sum(playtime) spt from program group by uname,programid) t1
)t2
where t2.rn=1;
2JOB
OK
uname programid
u1 b
u2 a
u3 c
Time taken: 85.238 seconds, Fetched: 3 row(s)
----------------------------
人家写的:
少了一个嵌套,order by后面只写sum函数,之前不知道可以这样写
select
uname,programid
from(
select
uname,programid,
sum(playtime),
row_number() over(partition by uname order by sum(playtime) desc) as rn
from program
group by uname,programid
)t2
where t2.rn=1;
ps:有个小插曲,
如果在order by 后面直接写sum(playtime),那么要在select后面,写sum(play time)
否则报错:
//在orderby后面加sum函数报错,因为select后面没有加上sum(playtime)
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:0 Expression not in GROUP BY key 'playtime'
第四问:观看时长大于5min的用户总量,只要有一个节目用户观看时间小于5min就不能算
select
count(uname)
from(
select
uname
from program
group by uname
having max(playtime)>=5) t;
原来的错误思路:
select
count(uname)
from program
group by uname
having max(playtime)>=5;
这样的得到的是每个分组的各自的数量,不是有多少个分组的数量