Hive 之 影评案例

现有如此三份数据:
1、users.dat 数据格式为: 2::M::56::16::70072
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码

2、movies.dat 数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型

3、ratings.dat 数据格式为: 1::1193::5::978300760
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳

题目要求:

数据要求:
(1)写shell脚本清洗数据。
(2)使用Hive能解析的方式进行

Hive要求:
(1)正确建表,导入数据(三张表,三份数据),并验证是否正确
思路:重点在于要把 ‘::’ 给替换掉

第一种方法,使用sed 进行文本处理:
#!/bin/bash
sed 's/::/%/g' users.dat >> newusers.dat
sed 's/::/%/g' movies.dat >> newmovies.dat
sed 's/::/%/g' ratings.dat >> newratings.dat

create table users(userid BigInt, gender String, age Int, occupation String, zipcode String)
row format delimited fields terminated by '%' stored as textfile;

create table movies(movieID BigInt, title String, genres String) 
row format delimited fields terminated by '%' stored as textfile;

create table ratings(userID BigInt, movieID BigInt, rating Double, timestamped String) 
row format delimited fields terminated by '%' stored as textfile;

load data local inpath '/home/user/newusers.dat' into table users;
load data local inpath '/home/user/newmovies.dat' into table movies;
load data local inpath '/home/user/newratings.dat' into table ratings;

select * from users limit 10;
select * from movies limit 10;
select * from ratings limit 10;
第二种方法
create table users(userID BigInt, gender String, age Int, occupation String, zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/user/users.dat' INTO TABLE users;
select * from users limit 5;

create table movies(movieID BigInt, title String, genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/user/movies.dat' INTO TABLE movies;
select * from movies limit 5;

create table ratings(userID BigInt, movieID BigInt, rating Double, timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/user/ratings.dat' INTO TABLE ratings;
select * from ratings limit 5;


得到数据如下::

1、users.dat    数据格式为:  2::M::56::16::70072

1       F       1       10      48067
2       M       56      16      70072
3       M       25      15      55117

2、movies.dat		数据格式为: 2::Jumanji (1995)::Adventure|Children's|Fantasy
1       Toy Story (1995)        Animation|Children's|Comedy
2       Jumanji (1995)  Adventure|Children's|Fantasy
3       Grumpier Old Men (1995) Comedy|Romance

3、ratings.dat		数据格式为:  1::1193::5::978300760

1       1193    5.0     978300760
1       661     3.0     978302109
1       914     3.0     978301968

(2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)

select a.title as title , b.count as counts from 
movies a join 
(
select movieid as MovieID,count(*) as count from ratings group by movieid  sort by count desc  limit 10 
) b
on a.movieid = b.MovieID;

得到结果:

title   counts
American Beauty (1999)  3428
Star Wars: Episode IV - A New Hope (1977)       2991
Star Wars: Episode V - The Empire Strikes Back (1980)   2990
Star Wars: Episode VI - Return of the Jedi (1983)       2883
Jurassic Park (1993)    2672
Saving Private Ryan (1998)      2653
Terminator 2: Judgment Day (1991)       2649
Matrix, The (1999)      2590
Back to the Future (1985)       2583
Silence of the Lambs, The (1991)        2578

select movieid as MovieID,count(*) as count from ratings group by movieid  sort by count desc  limit 10 

(3)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
求男性评分最高的10部电影

1、首先根据电影评分高低进行排序  

select userid as userid ,movieid as movieid,avg(rating) as avgrate from ratings group by movieid,userid sort by avgrate desc  

2、
select collect_set(c.gender)[0] gender, a.title title , avg(b.rating) rating,count(b.rating) counts from movies a
join ratings b on a.movieid = b.movieid
join users c on b.userid = c.userid
where c.gender = "M"
group by b.movieid,a.title
having counts>70
order by rating desc 
limit 10 
;

得到如下结果:
gender  title   rating  counts
M       Godfather, The (1972)   4.583333333333333       1740
M       Seven Samurai (The Magnificent Seven) (Shichin
  • 2
    点赞
  • 15
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值