Hive 实战

数据准备-表结构

gulivideo_orc

字段备注详细描述
videoId视频唯一id(STRING)11位字符串
uploader视频上传者(STRING)上传视频的用户名STRING
age视频年龄(INT)视频在平台上的整数天
category视频类别(Array)上传视频指定的视频分类
LENGTH视频长度(INT)整形数字标识的视频长度
views观看次数(INT)视频被浏览的次数
rate视频评分(DOUBLE)满分5分
Ratings流量(INT)视频的流量,整型数字
conments评论数(INT)一个视频的整数评论数
relatedId相关视频id(Array)相关视频的id,最多20个

gulivideo_user_orc

字段备注字段类型
uploader上传者用户名STRING
videos上传视频数INT
friends朋友数量INT

需求一:统计视频观看数最高的 top10

SELECT 
videoId,views
FROM 
gulivideo_orc
ORDER BY views DESC
LIMIT 10

需求二:统计视频类别热度Top10

– 热度:总视频数 总流量 总观看数。。。
– 约定: 按照类别下包含的总视频个数来衡量热度

  • a. 炸开每个视频的类别
SELECT videoId,category_name
FROM gulivideo_orc
lateral VIEW explode(category)   gulivideo_orc_tmp AS category_name  ===>t1
  • b.按照category_name分组,count 每个组视频的个数,倒序,limit 10个。
SELECT  category_name,COUNT(t1.videoId) hot
FROM t1
GROUP BY category_name
ORDER BY hot DESC
LIMIT 10;
  • c. 组合:
SELECT  category_name,COUNT(t1.videoId) hot
FROM
(SELECT videoId,category_name
FROM gulivideo_orc
lateral VIEW explode(category)   gulivideo_orc_tmp AS category_name ) t1
GROUP BY category_name
ORDER BY hot DESC
LIMIT 10;

================================================================

需求三:统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

  • a.统计视频观看数最高的20个视频
SELECT videoId,category,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20 ===>t1
  • b.炸开每个视频的类别
SELECT t1.videoId,category_name
FROM t1
literal VIEW explode(t1.category)  gulivideo_orc_tmp AS category_name ==>t2
  • 组合
SELECT t1.videoId,category_name
FROM 
(SELECT videoId,category,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20) t1
lateral VIEW explode(t1.category)  gulivideo_orc_tmp AS category_name  ===》t2
  • c.按照类别分组,求每个类别中的视频个数
SELECT t2.category_name,COUNT(t2.videoId) AS videos
FROM t2
GROUP BY t2.category_name
  • d.组合
SELECT t2.category_name,COUNT(t2.videoId) AS videos
FROM (SELECT t1.videoId,category_name
FROM 
(SELECT videoId,category,views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 20) t1
lateral VIEW explode(t1.category)  gulivideo_orc_tmp AS category_name)
t2
GROUP BY t2.category_name

========================================================

需求四: 统计视频观看数Top50所关联视频的所属类别Rank

  • a.视频观看数Top50所关联视频
SELECT videoId,relatedId,views
FROM gulivideo_orc
ORDER BY views DESC 
LIMIT 50   ====》t1
  • b.炸开关联视频
SELECT relatedId_id
FROM t1
latera VIEW explode(t1.relatedId) gulivideo_orc_temp AS relatedId_id
  • 组合
SELECT relatedId_id
FROM 
(SELECT videoId,relatedId,views
FROM gulivideo_orc
ORDER BY views DESC 
LIMIT 50 )t1
lateral VIEW explode(t1.relatedId) gulivideo_orc_temp AS relatedId_id 
===>t2
  • c. t2 JOIN gulivideo_orc表,找到每个关联视频对应的类别
SELECT t2.relatedId_id,t3.category
FROM t2 JOIN gulivideo_orc t3
ON t2.relatedId_id = t3.videoId
===>t4
  • d.炸开类别
SELECT t4.category_name,t4.relatedId_id
FROM t4
lateral VIEW explode(t4.category) t4_tmp AS category_name
===》t5
  • e.按照类别分组,求每个类别出现的次数
SELECT t5.category_name,COUNT(t5.relaredId_id) category_count
FROM t5
GROUP BY t5.category_name
===》t6
  • f.排名
SELECT t6.category_name,t6.category_count,
rank() over(ORDER BY t6.category_count DESC) rk
form t6

需求五:统计每个类别中的视频热度Top10,以Music为例

  • a.炸开每个视频的类别
SELECT videoId,category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
===>t1
  • b.过滤Music类别的数据,按照views倒叙,取top10
SELECT t1.videoId,t1.views,t1.category_name
FROM t1
WHERE category_name = 'Music'
ORDER BY t1.views DESC
LIMIT 10
  • c.组合
SELECT t1.videoId,t1.views,t1.category_name
FROM 
(SELECT videoId,category_name,views
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name)t1
WHERE category_name = 'Music'
ORDER BY t1.views DESC
LIMIT 10

需求五扩展: 统计每个类别中的视频热度Top10

  • a.炸开每个视频的类别
SELECT videoId,category_name,views
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
===>t1
  • b.开窗,按照类别分区,观看数倒序排序,求排名
SELECT t1.videoId,t1.views,t1.category_name,
rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
FROM t1
===>t2
  • c.取出每个类别的top10
SELECT t2.videoId,t2.views,t2.category_name,t2.rk
FROM t2
WHERE t2.rk <= 10
  • d.组合
SELECT t2.videoId,t2.views,t2.category_name,t2.rk
FROM 
(SELECT t1.videoId,t1.views,t1.category_name,
rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
FROM 
(SELECT videoId,category_name,views
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name) t1) t2
WHERE t2.rk <= 10

================================================================

需求六:统计每个类别视频观看数Top10 参考需求五拓展

====================================================================

需求七:统计上传视频最多的用户Top10以及他们上传的视频观看次数在前20的视频

约定1:取top10中每个人上传视频的前20

  • a.top10用户
SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
==>t1
  • b.关联gulivideo_orc,取到者10个用户上传的视频
SELECT t2.uploader,t2.videoId,t2.views
FROM t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
===>t3
  • c. 开窗 ,按照uplodaer分区,views排序,求排名
SELECT t3.uploader,t3.videoId,t3.views,
rank() over(PARTITION BY t3.uploader ORDER BY t3.views DESC) rk
FROM t3 
==>t4
  • d.取每个用户的前20
SELECT t4.uploader,t4.videoId,t4.views,t4.rk
FROM t4
WHERE t4.rk <= 20
  • e.组合
SELECT t4.uploader,t4.videoId,t4.views,t4.rk
FROM (SELECT t3.uploader,t3.videoId,t3.views,
rank() over(PARTITION BY t3.uploader ORDER BY t3.views DESC) rk
FROM (SELECT t2.uploader,t2.videoId,t2.views
FROM (SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10)t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader) t3 )t4
WHERE t4.rk <= 20

约定2:取top10中所有人上传视频的前20

  • a.top10用户
SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
==>t1
  • b.关联gulivideo_orc,取到者10个用户上传的视频观看数前20
SELECT t2.uploader,t2.videoId,t2.views
FROM t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;
  • c.组合:
SELECT t2.uploader,t2.videoId,t2.views
FROM 
(SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
)t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;

约定3:取top10用户上传视频在所有人上传的视频的前20

  • a.top10用户
SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10
==>t1
  • b.关联gulivideo_orc,取到者10个用户上传的视频
SELECT t2.uploader,t2.videoId,t2.views
FROM t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
===>t3
  • c.找到所有视频中观看数前20的
SELECT videoId
FROM gulivideo_orc 
ORDER BY views DESC
LIMIT 20
==>t4
  • d.关联
SELECT t3.videoId,t3.uploader,t3.views
FROM t3 JOIN t4
ON t3.videoId = t4.videoId
  • e.组合
SELECT t3.videoId,t3.uploader,t3.views
FROM 
(SELECT t2.uploader,t2.videoId,t2.views
FROM 
(SELECT uploader,videos
FROM gulivideo_user_orc
ORDER BY videos DESC
LIMIT 10)t1 JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader)t3 JOIN 
(SELECT videoId,views
FROM gulivideo_orc 
ORDER BY views DESC
LIMIT 20)t4
ON t3.videoId = t4.videoId

总结

1.写Hive脚本开始阶段,分步骤写。一步一步分析清楚,化繁为简。多点子查询没关系,后期优化。
2.多思考,多总结,多分析。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值