1. 需求描述
统计硅谷影音视频网站的常规指标,各种 TopN 指标:
- 统计视频观看数 Top10
- 统计视频类别热度 Top10
- 统计视频观看数 Top20 所属类别以及类别包含的 Top20 的视频个数
- 统计视频观看数 Top50 所关联视频的所属类别 Rank
- 统计上传视频最多的用户 Top10 以及他们上传的观看次数在前 20 视频
- 统计每个类别中的视频热度 Top10
1.1数据结构
(1)视频表
字段 | 备注 | 详细描述 |
---|---|---|
video id | 视频唯一 id | 11 位字符串 |
uploader | 视频上传者 | 上传视频的用户名 String |
age | 视频年龄 | 视频在平台上的整数天 |
category | 视频类别 | 上传视频指定的视频分类 |
length | 视频长度 | 整形数字标识的视频长度 |
views | 观看次数 | 视频被浏览的次数 |
rate | 视频评分 | 满分 5 分 |
ratings | 流量 视频的流量,整型数字 | |
conments | 评论数 | 一个视频的整数评论数 |
related ids | 相关视频 id | 相关视频的 id,最多 20 个 |
(2)用户表
字段 | 备注 | 详细描述 |
---|---|---|
字段 | 备注 | 字段类型 |
uploader | 上传者用户名 | string |
videos | 上传视频数 | int |
friends | 朋友数量 | int |
2.准备工作
2.1创建表
创建表:gulivideo_ori,gulivideo_user_ori,
创建表:gulivideo_orc,gulivideo_user_orc
gulivideo_ori:
create table gulivideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
gulivideo_user_ori:
create table gulivideo_user_ori(
uploader string,
videos int,
friends int)
row format delimited fields terminated by "\t"
stored as textfile;
把原始数据插入到 orc 表中
gulivideo_orc:
create table gulivideo_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;
gulivideo_user_orc:
create table gulivideo_user_orc(
uploader string,
videos int,
friends int)
row format delimited fields terminated by "\t"
stored as orc;
2.1导入 ETL 后的数据
gulivideo_ori:
load data inpath "/gulivideo/output/video/" into table gulivideo_ori;
gulivideo_user_ori:
load data inpath "/gulivideo/user" into table gulivideo_user_ori;
2.3向 ORC 表插入数据
gulivideo_orc:
insert into table gulivideo_orc select * from gulivideo_ori;
gulivideo_user_orc:
insert into table gulivideo_user_orc select * from gulivideo_user_ori;
3.业务分析
3.1统计视频观看数 Top10
思路:
使用 order by 按照 views 字段做一个全局排序即可,同时设置只显示前 10 条。
select
videoId,
uploader,
age,
category,
length,
views,
rate,
ratings,
comments
from
gulivideo_orc
order by
views desc
limit
10;
结果:
dMH0bHeiRNg | 42513417 |
---|---|
0XxI-hvPRRA | 20282464 |
1dmVU08zVpA | 16087899 |
RB-wUgnyGv0 | 15712924 |
QjA5faZF1A8 | 15256922 |
-_CSo1gOd48 | 13199833 |
49IDp76kjPw | 11970018 |
tYnn51C3X_w | 11823701 |
pv5zWaTEVkI | 11672017 |
D2kJZOfq7zk | 11184051 |
3.2 统计视频类别热度 Top10
某类视频的个数作为视频类别热度
思路:
1.使用UDTF函数将类别列炸开
2.按照category_name进行分组,统计每种类别视频的总数,同时按照该总数进行倒序排名,取前10;
3.合并(嵌套)
1.使用UDTF函数将类别列炸开
select
videoId,
category_name
from
gulivideo_orc
lateral view explode(category) tmp_category as category_name;t1
2.按照category_name进行分组,统计每种类别视频的总数,同时按照该总数进行倒序排名,取前10;
select
category_name,
count(*) category_count
from
t1
group by
category_name
order by
category_count desc
limit 10;
3.合并(嵌套)
select
category_name,
count(*) category_count
from
(select
videoId,
category_name
from
gulivideo_orc
lateral view explode(category) tmp_category as category_name)t1
group by
category_name
order by
category_count desc
limit 10;
结果:
Music | 179049 |
---|---|
Entertainment | 127674 |
Comedy | 87818 |
Animation | 73293 |
Film | 73293 |
Sports | 67329 |
Gadgets | 59817 |
Games | 59817 |
Blogs | 48890 |
People | 48890 |
3.3统计出视频观看数最高的 20 个视频的所属类别以及类别包含Top20 视频的个数
思路:
1.统计视频观看数Top20。
2.对t1表中的category进行炸裂。
3.对t2表进行分组(category_name)求和(总数)
4.合并(嵌套)
1.统计视频观看数Top20
select
videoId,
views,
category
from
gulivideo_orc
order by
views desc
limit 20;t1
2.对t1表中的category进行炸裂
select
videoId,
category_name
from
t1
lateral view explode(category) tmp_category as category_name;t2
3.对t2表进行分组(category_name)求和(总数)
select
category_name,
count(*) category_count
from
t2
group by
category_name
order by
category_count desc;
4.合并(嵌套)
select
category_name,
count(*) category_count
from
(select
videoId,
category_name
from
(select
videoId,
views,
category
from
gulivideo_orc
order by
views desc
limit 20)t1
lateral view explode(category) tmp_category as category_name)t2
group by
category_name
order by
category_count desc;
结果:
Entertainment | 6 |
---|---|
Comedy | 6 |
Music | 5 |
People | 2 |
Blogs | 2 |
UNA | 1 |
3.4 统计视频观看数 Top50 所关联视频的所属类别 Rank
思路:
1.统计视频观看数Top50。
2.对t1表中的relatedId进行炸裂并去重。
3.取出观看数前50视频关联ID视频的类别。
4.对t3表中的category进行炸裂。
5.分组(类别)求和(总数)。
6.合并(嵌套)。
1.统计视频观看数Top50
select
relatedId,
views
from
gulivideo_orc
order by
views desc
limit 50;t1
2.对t1表中的relatedId进行炸裂并去重
select
related_id
from
t1
lateral view explode(relatedId) tmp_related as related_id
group by related_id;t2
3.取出观看数前50视频关联ID视频的类别
select
category
from
t2
join gulivideo_orc orc
on t2.related_id=orc.videoId;t3
4.对t3表中的category进行炸裂
select
explode(category) category_name
from
t3;t4
5.分组(类别)求和(总数)
select
category_name,
count(*) category_count
from
t4
group by
category_name
order by
category_count desc;
6.合并(嵌套)
select
category_name,
count(*) category_count
from
(select
explode(category) category_name
from
(select
category
from
(select
related_id
from
(select
relatedId,
views
from
gulivideo_orc
order by
views desc
limit 50)t1
lateral view explode(relatedId) tmp_related as related_id
group by related_id)t2
join gulivideo_orc orc
on t2.related_id=orc.videoId)t3)t4
group by
category_name
order by
category_count desc;
结果:
Comedy | 232 |
---|---|
Entertainment | 216 |
Music | 195 |
Blogs | 51 |
People | 51 |
Film | 47 |
Animation | 47 |
News | 22 |
Politics | 22 |
Games | 20 |
Gadgets | 20 |
Sports | 19 |
Howto | 14 |
DIY | 14 |
UNA | 13 |
Places | 12 |
Travel | 12 |
Animals | 11 |
Pets | 11 |
Autos | 4 |
Vehicles | 4 |
3.6 统计上传视频最多的用户Top10以及他们上传的观看次数在前20视频
Top10
思路:
1.统计上传视频最多的用户Top10。
2.取出这10个人上传的所有视频,按照观看次数进行排名,取前20。
3.合并(嵌套)
1.统计上传视频最多的用户Top10。
select
uploader,
videos
from
gulivideo_user_orc
order by
videos desc
limit 10;t1
2.取出这10个人上传的所有视频,按照观看次数进行排名,取前20
select
video.videoId,
video.views
from
t1
join
gulivideo_orc video
on
t1.uploader=video.uploader
order by
views desc
limit 20;
3.合并(嵌套)
select
video.videoId,
video.views
from
(select
uploader,
videos
from
gulivideo_user_orc
order by
videos desc
limit 10)t1
join
gulivideo_orc video
on
t1.uploader=video.uploader
order by
views desc
limit 20;
3.7 统计每个类别中的视频热度Top10
思路:
1.给每一种类别根据视频观看数添加rank值(倒序)
2.过滤前十
1.给每一种类别根据视频观看数添加rank值(倒序)
select
categoryId,
videoId,
views,
rank() over(partition by categoryId order by views desc) rk
from
gulivideo_category;
2.过滤前十
select
categoryId,
videoId,
views
from
(select
categoryId,
videoId,
views,
rank() over(partition by categoryId order by views desc) rk
from
gulivideo_category)t1
where
rk<=10;