13:Hive从0到1系列学习:几个Hive小案例,连接Hive知识点,加深巩固

46 篇文章 2 订阅

Hive实战小项目

需求分析:

统计硅谷影音视频网站的常规指标,各种TopN指标:

– 统计视频观看数Top10

– 统计视频类别热度Top10

– 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

– 统计视频观看数Top50所关联视频的所属类别Rank

– 统计每个类别中的视频热度Top10,以Music为例

– 统计每个类别视频观看数Top10

– 统计上传视频最多的用户Top10以及他们上传的视频观看次数在前20的视频

数据结构:

1、视频表

字段备注详细描述
videoId视频唯一id(String)11位字符串
uploader视频上传者(String)上传视频的用户名String
age视频年龄(int)视频在平台上的整数天
category视频类别(Array)上传视频指定的视频分类
length视频长度(Int)整形数字标识的视频长度
views观看次数(Int)视频被浏览的次数
rate视频评分(Double)满分5分
Ratings流量(Int)视频的流量,整型数字
conments评论数(Int)一个视频的整数评论数
relatedId相关视频id(Array)相关视频的id,最多20个

2、用户表

字段备注字段类型
uploader上传者用户名string
videos上传视频数int
friends朋友数量int

1、数据准备

①创建原始表

视频表

create table gulivideo_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

用户表

create table gulivideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as textfile;

②创建orc存储格式带snappy压缩的表

视频表

create table gulivideo_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");

用户表

create table gulivideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as orc
tblproperties("orc.compress"="SNAPPY");

③向ori表中插入数据

load data local inpath "/opt/module/hive/datas/video" into table gulivideo_ori;
load data local inpath "/opt/module/hive/datas/guliVideo/user/user.txt" into table gulivideo_user_ori;

④向orc表中插入数据

insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;

2、业务分析

①统计视频观看数Top10

使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。

select
    videoId,
    `views`
from gulivideo_orc
order by `views` desc 
limit 10;

在这里插入图片描述

②统计视频类别热度Top10

思路:就是统计出每个视频类别中包含的视频数,将数量最多的十个输出

因为一个视频对应一个或者多个类别,所以还要来一个列转行

1)首先进行列转行

select videoId, category_col from gulivideo_ori
lateral view explode(category) t as category_col
limit 10;

在这里插入图片描述

2)然后按照类别进行分组,然后计算count()即可,排序输出前10

select 
    tmp01.category_col,
    count(tmp01.videoId) nums
from
(
select videoId, category_col from gulivideo_ori
lateral view explode(category) t as category_col
) tmp01
group by tmp01.category_col
order by nums desc
limit 10;

在这里插入图片描述

③统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

1)首先,视频观看数最高的20个视频所属的类别

select 
    videoId,
    views,
    category
from gulivideo_ori
order by views desc
limit 20;

在这里插入图片描述

2)通过上述的结果,将其列转行

select
category_col,
count(tmp01.videoId) nums
from(
    select 
    videoId,
    views,
    category
    from gulivideo_ori
    order by views desc
    limit 20
) tmp01
lateral view
explode(tmp01.category) t as category_col
group by category_col
order by nums
limit 10;

在这里插入图片描述

④统计视频观看数Top50所关联视频的所属类别排序

1)首先查看视频观看数Top50的视频

select
    videoId,
    relatedId,
    `views`
from gulivideo_orc
order by `views` desc
limit 50

在这里插入图片描述

2)将查询结果的列转为行

select
	realte_id
from (
select
    videoId,
    relatedId,
    `views`
from gulivideo_orc
order by `views` desc
limit 50
) t1
lateral view
explode(t1.relatedId) tmp as realte_id;

在这里插入图片描述

3)上述就得到了所有的视频类型结果,然后找到这些视频的关联类别

select
    t2.realte_id,
    t2.cate
from (
    select
    realte_id,
    t1.category cate
    from (
        select
            videoId,
            relatedId,
            category,
            `views`
        from gulivideo_orc
        order by `views` desc
        limit 50
    ) t1
    lateral view
    explode(t1.relatedId) tmp as realte_id
) t2 join gulivideo_orc g on t2.realte_id = g.videoId;

在这里插入图片描述

4)找到了视频对应的类型,还要将相应的视频类型列转行

select
    t3.realte_id,
    category_name
from (
select
    t2.realte_id,
    g.category
from (
select
	realte_id
from (
select
    videoId,
    relatedId,
    `views`
from gulivideo_orc
order by `views` desc
limit 50
) t1
lateral view
explode(t1.relatedId) tmp as realte_id
) t2 join gulivideo_orc g on t2.realte_id = g.videoId
) t3
lateral view
explode(t3.category) tmp as category_name;

在这里插入图片描述

5)按照类别名进行分组,统计每组的数量,进行排序

select 
    t5.category_name,
    t5.num,
    rank() over(order by t5.num desc ) rk
from (
     select
         t4.category_name,
         count(t4.realte_id) num
     from (
              select
                  t3.realte_id,
                  category_name
              from (
                       select
                           t2.realte_id,
                           g.category
                       from (
                                select
                                    realte_id
                                from (
                                         select
                                             videoId,
                                             relatedId,
                                             `views`
                                         from gulivideo_orc
                                         order by `views` desc
                                         limit 50
                                     ) t1
                                         lateral view
                                             explode(t1.relatedId) tmp as realte_id
                            ) t2 join gulivideo_orc g on t2.realte_id = g.videoId
                   ) t3
                       lateral view
                           explode(t3.category) tmp as category_name
          ) t4
     group by t4.category_name        
) t5 ;

在这里插入图片描述

⑤统计类别视频观看数Top10

1)查找出视频的观看数以及所属于的类别

select 
	videoId, 
	views, 
	category 
from gulivideo_orc;

在这里插入图片描述

2)将类别进行分行

select 
	videoId, 
	views, 
	category_name 
from gulivideo_orc
lateral view 
explode(category) tmp as category_name;

在这里插入图片描述

3)按照类别分组,求和观看数

select 
	sum(t1.views)
    sum_views, 
    t1.category_name 
from (
    select 
    	views, 
    	category_name
    from gulivideo_orc 
    lateral view explode(category) tmp as category_name) t1
group by t1.category_name limit 10;

在这里插入图片描述

4)排序得到Top10

select 
    t2.sum_views, 
    t2.category_name
from(
	select 
    	sum(t1.views) sum_views,
    	t1.category_name 
   	from (
        select 
        	views, 
        	category_name 
        from gulivideo_orc 
        lateral view explode(category) tmp as category_name) t1
	group by t1.category_name
    ) t2
order by t2.sum_views desc limit 10;

在这里插入图片描述

⑥统计每个类别视频观看数Top10

注意,这个和上题的区别是统计每个类别中视频观看数的Top10

使用窗口函数,按照视频类别进行开窗,每个分区内按照视频播放量进行排序,取Top10

1)将视频类型的列转为行

select 
	views, 
	category_name
from gulivideo_orc 
lateral view explode(category) tmp as category_name;

在这里插入图片描述

2)开窗,按照视频类别进行分区,视频播放量进行排序,取Top10

select 
	t2.views,
	t2.category_name,
	t2.rk
from( 
    select 
        t1.views,
        t1.category_name,
        rank() over(partition by t1.category_name order by t1.views desc) rk
    from( 
        select
            views,
            category_name
        from gulivideo_orc
        lateral view explode(category) tmp as category_name
        ) t1
    ) t2
where rk<=10;

在这里插入图片描述

⑦统计上传视频最多的用户Top10以及他们上传的视频观看次数在前20的视频

1)首先需要查找上传视频最多的用户Top10

select 
	uploader
from gulivideo_user_orc
order by videos desc
limit 10;

在这里插入图片描述

2)查找出其上传的视频

select
    t1.uploader,
    g.videoId,
    g.`views`
from (
    select
    	uploader
    from gulivideo_user_orc
    order by videos desc
    limit 10
    ) t1 
    join gulivideo_orc g on t1.uploader = g.uploader

在这里插入图片描述

3)开窗按照uploader分组,观看量进行排序

select
    t3.uploader,
    t3.videoId,
    t3.views,
    t3.rk
from(
    select
        t2.uploader,
        t2.videoId,
        t2.views,
        rank() over(partition by t2.uploader order by t2.views desc) rk
    from(
        select
            t1.uploader,
            g.videoId,
            g.`views`
        from (
            select
            	uploader
            from gulivideo_user_orc
            order by videos desc
            limit 10
        ) t1 
        join gulivideo_orc g on t1.uploader = g.uploader
    ) t2
) t3
where rk<=20;

在这里插入图片描述

  • 3
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 6
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 6
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

牧码文

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值