影评项目(hive)

本文介绍了使用Hive对影评数据进行清洗、建表、导入,并解决一系列数据分析问题,包括求评分次数最多电影、男女评分最高电影、特定电影各年龄段平均评分、女性最高分电影平均评分等复杂查询。
摘要由CSDN通过智能技术生成

现有如此三份数据:
1、users.dat
数据格式为: 2::M::56::16::70072
对应字段为:UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码

2、movies.dat
数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型

3、ratings.dat
数据格式为: 1::1193::5::978300760
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳

set hive.cli.print.current.db=true; //显示当前库
set hive.exec.mode.local.auto=true; //设置hive执行的本机模式
set hive.mapred.mode=nonstrict;

影评项目数据:
链接:https://pan.baidu.com/s/1Aq28cvfaSgdPe_TmsFqPaQ 密码:nlqo

题目要求:

数据要求:

(1)写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析’:’, 不支持解析’::’,所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗)

#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m
for i in $'*.dat'
do
echo $i
sed -i "s/::/:/g" $i
done
echo "have finished!"

或者:

#!/bin/bash
echo "Wait for a moment"
cd /home/movetest/ml-1m
sed -i "s/::/:/g" `grep "qwe" -rl ./`
echo "have finished!"

(2)使用Hive能解析的方式进行
注:建表时处理

Hive要求:

1、正确建表,导入数据(三张表,三份数据),并验证是否正确
create table users(UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/users.dat' INTO TABLE users;
select * from users limit 10;
 
create table movies(MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/movies.dat' INTO TABLE movies;
select * from movies limit 10;
 
create table ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/movetest/ml-1m/ratings.dat' INTO TABLE ratings;
select * from ratings limit 10;
2、求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
分析:
	表:       movies    ratings   
	要求的字段:title     count(userid)

select a.title, count(b.userid) counts 
from movies a join ratings b 
on a.movieid = b.movieid 
group by a.title,b.movieid 
order by counts desc 
limit 10
;

American Beauty (1999)	3428
Star Wars	2991
Star Wars	2990
Star Wars	2883
Jurassic Park (1993)	2672
Saving Private Ryan (1998)	2653
Terminator 2	2649
Matrix, The (1999)	2590
Back to the Future (1985)	2583
Silence of the Lambs, The (1991)	2578
Time taken: 125.833 seconds, Fetched: 10 row(s)
3、分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)

分析:
表: users movies ratings
要求的字段:gender title avg(rating)

select a.gender,a.title,avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
order by avgs desc
limit 10
;

FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key ‘gender’
这里出错:参考,
https://blog.csdn.net/zhoujj303030/article/details/38424469

select  collect_set(a.gender),collect_set(b.title),avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on c.userid = a.userid
join movies b on c.movieid = b.movieid
where a.gender = 'F'
group by b.movieid,b.title
having counts >= 60
order by avgs desc
limit 10
;

["F"]	["Close Shave, A (1995)"]	4.644444444444445	180
["F"]	["Wrong Trousers, The (1993)"]	4.588235294117647	238
["F"]	["Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)"]	4.572649572649572	117
["F"]	["Wallace & Gromit"]	4.563106796116505	103
["F"]	["Schindler's List (1993)"]	4.56260162601626	615
["F"]	["Shawshank Redemption, The (1994)"]	4.539074960127592	627
["F"]	["Grand Day Out, A (1992)"]	4.537878787878788	132
["F"]	["To Kill a Mockingbird (1962)"]	4.536666666666667	300
["F"]	["Creature Comforts (1990)"]	4.513888888888889	72
["F"]	["Usual Suspects, The (1995)"]	4.513317191283293	413

结果是集合,改进:

select  collect_set(a.gender)[0],collect_set(b.title)[0],avg(c.rating) avgs,count(c.rating) counts
from ratings c 
join users a on 
  • 1
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值