目录
数据准备-表结构
gulivideo_orc
字段 | 备注 | 详细描述 |
---|---|---|
videoId | 视频唯一id(STRING) | 11位字符串 |
uploader | 视频上传者(STRING) | 上传视频的用户名STRING |
age | 视频年龄(INT) | 视频在平台上的整数天 |
category | 视频类别(Array) | 上传视频指定的视频分类 |
LENGTH | 视频长度(INT) | 整形数字标识的视频长度 |
views | 观看次数(INT) | 视频被浏览的次数 |
rate | 视频评分(DOUBLE) | 满分5分 |
Ratings | 流量(INT) | 视频的流量,整型数字 |
conments | 评论数(INT) | 一个视频的整数评论数 |
relatedId | 相关视频id(Array) | 相关视频的id,最多20个 |
gulivideo_user_orc
字段 | 备注 | 字段类型 |
---|---|---|
uploader | 上传者用户名 | STRING |
videos | 上传视频数 | INT |
friends | 朋友数量 | INT |
需求一:统计视频观看数最高的 top10
SELECT
videoId,views
FROM
gulivideo_orc
ORDER BY views DESC
LIMIT 10;
需求二:统计视频类别热度Top10
– 热度:总视频数 总流量 总观看数。。。
– 约定: 按照类别下包含的总视频个数来衡量热度
- a. 炸开每个视频的类别
SELECT videoId,category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name ===>t1
- b.按照category_name分组,count 每个组视频的个数,倒序,limit 10个。
SELECT category_name,COUNT(t1.videoId) hot
FROM t1
GROUP BY category_name
ORDER BY hot DESC
LIMIT 10;
- c. 组合:
SELECT category_name,COUNT(t1.videoId) hot
FROM
(SELECT videoId,category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name ) t1
GROUP BY category_name
ORDER BY hot DESC
LIMIT 10;
================================================================
需求三:统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
- a.统计视频观看数最高的20个视频
SELECT videoId,category,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20 ===>t1
- b.炸开每个视频的类别
SELECT t1.videoId,category_name
FROM t1
literal VIEW explode(t1.category) gulivideo_orc_tmp AS category_name ==>t2
- 组合
SELECT t1.videoId,category_name
FROM
(SELECT videoId,category,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20) t1
lateral VIEW explode(t1.category) gulivideo_orc_tmp AS category_name ===》t2
- c.按照类别分组,求每个类别中的视频个数
SELECT t2.category_name,COUNT(t2.videoId) AS videos
FROM t2
GROUP BY t2.category_name
- d.组合
SELECT t2.category_name,COUNT(t2.videoId) AS videos
FROM (SELECT t1.videoId,category_name
FROM
(SELECT videoId,category,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20) t1
lateral VIEW explode(t1.category) gulivideo_orc_tmp AS category_name)
t2
GROUP BY t2.category_name
========================================================
需求四: 统计视频观看数Top50所关联视频的所属类别Rank
- a.视频观看数Top50所关联视频
SELECT videoId,relatedId,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 50 ====》t1
- b.炸开关联视频
SELECT relatedId_id
FROM t1
latera VIEW explode(t1.relatedId) gulivideo_orc_temp AS relatedId_id
- 组合
SELECT relatedId_id
FROM
(SELECT videoId,relatedId,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 50 )t1
lateral VIEW explode(t1.relatedId) gulivideo_orc_temp AS relatedId_id
===>t2
- c. t2 JOIN gulivideo_orc表,找到每个关联视频对应的类别
SELECT t2.relatedId_id,t3.category
FROM t2 JOIN gulivideo_orc t3
ON t2.relatedId_id = t3.videoId
===>t4
- d.炸开类别
SELECT t4.category_name,t4.relatedId_id
FROM t4
lateral VIEW explode(t4.category) t4_tmp AS category_name
===》t5
- e.按照类别分组,求每个类别出现的次数
SELECT t5.category_name,COUNT(t5.relaredId_id) category_count
FROM t5
GROUP BY t5.category_name
===》t6
- f.排名
SELECT t6.category_name,t6.category_count,
rank() over(ORDER BY t6.category_count DESC) rk
form t6
需求五:统计每个类别中的视频热度Top10,以Music为例
- a.炸开每个视频的类别
SELECT videoId,category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
===>t1
- b.过滤Music类别的数据,按照views倒叙,取top10
SELECT t1.videoId,t1.views,t1.category_name
FROM t1
WHERE category_name = 'Music'
ORDER BY t1.views DESC
LIMIT 10
- c.组合
SELECT t1.videoId,t1.views,t1.category_name
FROM
(SELECT videoId,category_name,views
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name)t1
WHERE category_name = 'Music'
ORDER BY t1.views DESC
LIMIT 10
需求五扩展: 统计每个类别中的视频热度Top10
- a.炸开每个视频的类别
SELECT videoId,category_name,views
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
===>t1
- b.开窗,按照类别分区,观看数倒序排序,求排名
SELECT t1.videoId,t1.views,t1.category_name,
rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
FROM t1
===>t2
- c.取出每个类别的top10
SELECT t2.videoId,t2.views,t2.category_name,t2.rk
FROM t2
WHERE t2.rk <= 10
- d.组合
SELECT t2.videoId,t2.views,t2.category_name,t2.rk
FROM
(SELECT t1.videoId,t1.views,t1.category_name,
rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
FROM
(SELECT videoId,category_name,views
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name) t1) t2
WHERE t2.rk <= 10
================================================================
需求六:统计每个类别视频观看数Top10 参考需求五拓展
====================================================================
需求七:统计上传视频最多的用户Top10以及他们上传的视频观看次数在前20的视频
约定1:取top10中每个人上传视频的前20
- a.top10用户
SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
==>t1
- b.关联gulivideo_orc,取到者10个用户上传的视频
SELECT t2.uploader,t2.videoId,t2.views
FROM t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
===>t3
- c. 开窗 ,按照uplodaer分区,views排序,求排名
SELECT t3.uploader,t3.videoId,t3.views,
rank() over(PARTITION BY t3.uploader ORDER BY t3.views DESC) rk
FROM t3
==>t4
- d.取每个用户的前20
SELECT t4.uploader,t4.videoId,t4.views,t4.rk
FROM t4
WHERE t4.rk <= 20
- e.组合
SELECT t4.uploader,t4.videoId,t4.views,t4.rk
FROM (SELECT t3.uploader,t3.videoId,t3.views,
rank() over(PARTITION BY t3.uploader ORDER BY t3.views DESC) rk
FROM (SELECT t2.uploader,t2.videoId,t2.views
FROM (SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10)t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader) t3 )t4
WHERE t4.rk <= 20
约定2:取top10中所有人上传视频的前20
- a.top10用户
SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
==>t1
- b.关联gulivideo_orc,取到者10个用户上传的视频观看数前20
SELECT t2.uploader,t2.videoId,t2.views
FROM t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;
- c.组合:
SELECT t2.uploader,t2.videoId,t2.views
FROM
(SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
)t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;
约定3:取top10用户上传视频在所有人上传的视频的前20
- a.top10用户
SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
==>t1
- b.关联gulivideo_orc,取到者10个用户上传的视频
SELECT t2.uploader,t2.videoId,t2.views
FROM t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
===>t3
- c.找到所有视频中观看数前20的
SELECT videoId
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20
==>t4
- d.关联
SELECT t3.videoId,t3.uploader,t3.views
FROM t3 JOIN t4
ON t3.videoId = t4.videoId
- e.组合
SELECT t3.videoId,t3.uploader,t3.views
FROM
(SELECT t2.uploader,t2.videoId,t2.views
FROM
(SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10)t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader)t3 JOIN
(SELECT videoId,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20)t4
ON t3.videoId = t4.videoId
总结
1.写Hive脚本开始阶段,分步骤写。一步一步分析清楚,化繁为简。多点子查询没关系,后期优化。
2.多思考,多总结,多分析。