影评项目（hive）

最新推荐文章于 2020-06-05 09:41:58 发布

brz_em

最新推荐文章于 2020-06-05 09:41:58 发布

阅读量1.2k

点赞数 1

分类专栏：云计算、大数据文章标签：影评项目（hive） hive 影评项目

本文链接：https://blog.csdn.net/qq_35180983/article/details/83239407

版权

本文介绍了使用Hive对影评数据进行清洗、建表、导入，并解决一系列数据分析问题，包括求评分次数最多电影、男女评分最高电影、特定电影各年龄段平均评分、女性最高分电影平均评分等复杂查询。

摘要由CSDN通过智能技术生成

现有如此三份数据：
1、users.dat
数据格式为： 2::M::56::16::70072
对应字段为：UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String
对应字段中文解释：用户id，性别，年龄，职业，邮政编码

2、movies.dat
数据格式为： 2::Jumanji (1995)::Adventure|Children’s|Fantasy
对应字段为：MovieID BigInt, Title String, Genres String
对应字段中文解释：电影ID，电影名字，电影类型

3、ratings.dat
数据格式为： 1::1193::5::978300760
对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释：用户ID，电影ID，评分，评分时间戳

set hive.cli.print.current.db=true; //显示当前库
set hive.exec.mode.local.auto=true; //设置hive执行的本机模式
set hive.mapred.mode=nonstrict;

影评项目数据：
链接：https://pan.baidu.com/s/1Aq28cvfaSgdPe_TmsFqPaQ 密码：nlqo

题目要求：

数据要求：

（1）写shell脚本清洗数据。（hive不支持解析多字节的分隔符，也就是说hive只能解析’:’, 不支持解析’::’，所以用普通方式建表来使用是行不通的，要求对数据做一次简单清洗）

#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m
for i in $'*.dat'
do
echo $i
sed -i "s/::/:/g" $i
done
echo "have finished!"

或者：

#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m
sed -i "s/::/:/g" `grep "qwe" -rl ./`
echo "have finished!"

（2）使用Hive能解析的方式进行
注：建表时处理

Hive要求：

1、正确建表，导入数据（三张表，三份数据），并验证是否正确

create table users(UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/users.dat' INTO TABLE users;
select * from users limit 10;
 
create table movies(MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/movies.dat' INTO TABLE movies;
select * from movies limit 10;
 
create table ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/ratings.dat' INTO TABLE ratings;
select * from ratings limit 10;

2、求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）

分析：
	表：       movies    ratings   
	要求的字段：title     count(userid)

select a.title, count(b.userid) counts 
from movies a join ratings b 
on a.movieid = b.movieid 
group by a.title,b.movieid 
order by counts desc 
limit 10
;

American Beauty (1999)	3428
Star Wars	2991
Star Wars	2990
Star Wars	2883
Jurassic Park (1993)	2672
Saving Private Ryan (1998)	2653
Terminator 2	2649
Matrix, The (1999)	2590
Back to the Future (1985)	2583
Silence of the Lambs, The (1991)	2578
Time taken: 125.833 seconds, Fetched: 10 row(s)

3、分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）

分析：
表： users movies ratings
要求的字段：gender title avg(rating)

select a.gender,a.title,avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
order by avgs desc
limit 10
;

FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key ‘gender’
这里出错：参考，
https://blog.csdn.net/zhoujj303030/article/details/38424469

select  collect_set(a.gender),collect_set(b.title),avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;

["F"]	["Close Shave, A (1995)"]	4.644444444444445	180
["F"]	["Wrong Trousers, The (1993)"]	4.588235294117647	238
["F"]	["Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)"]	4.572649572649572	117
["F"]	["Wallace & Gromit"]	4.563106796116505	103
["F"]	["Schindler's List (1993)"]	4.56260162601626	615
["F"]	["Shawshank Redemption, The (1994)"]	4.539074960127592	627
["F"]	["Grand Day Out, A (1992)"]	4.537878787878788	132
["F"]	["To Kill a Mockingbird (1962)"]	4.536666666666667	300
["F"]	["Creature Comforts (1990)"]	4.513888888888889	72
["F"]	["Usual Suspects, The (1995)"]	4.513317191283293	413

结果是集合，改进：

select  collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on

最低0.47元/天解锁文章

brz_em

关注

1
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
影评项目（hive）

现有如此三份数据：1、users.dat数据格式为： 2::M::56::16::70072对应字段为：UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String对应字段中文解释：用户id，性别，年龄，职业，邮政编码2、movies.dat数据格式为： 2::Jumanji (1995)::Adventur...
复制链接

扫一扫

专栏目录