接https://georgedage.blog.csdn.net/article/details/102996488
ETL后的数据处理!!!
建表:
影音信息数据:
create table gulivideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile
用户信息数据:
create table gulivideo_user_ori(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as textfile;
show tables:
+---------------------+--+
| tab_name |
+---------------------+--+
| gulivideo_ori |
| gulivideo_user_ori |
+---------------------+--+
导入数据:
load data inpath "/guiliVideo/output/video/2008/0222" into table
gulivideo_ori;
load data inpath "/guiliVideo/user/2008/0903" into table gulivideo_user_ori;
验证一下:
业务分析:
先来个图示:
#1、统计视频观看数 Top10
select videoId,views from gulivideo_ori order by views desc limit 10;
结果展示:
#2、统计视频类别热度 Top10
即统计每个类别有多少个视频,显示出包含视频最多的前 10 个类别。
所以根据之前https://georgedage.blog.csdn.net/article/details/102905208的经验
需要列转行
select videoId,category_name from gulivideo_ori lateral view explode(category) table_tmp as category_name limit 10;
列转行结果:
然后根据这条语句在进行操作
select category_name as category,count(t.videoId) as hot
from (select videoId,category_name from gulivideo_ori lateral view explode(category) table_tmp as category_name) t
group by t.category_name
order by hot desc
limit 10;
结果展示:
出现了问题,目前没有解决,后续再更!!!
beeline下提交任务,报出Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)