现有如此三份数据:
1、users.dat 数据格式为: 2::M::56::16::70072
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码
2、movies.dat 数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型
3、ratings.dat 数据格式为: 1::1193::5::978300760
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳
题目要求:
数据要求:
(1)写shell脚本清洗数据。
(2)使用Hive能解析的方式进行
Hive要求:
(1)正确建表,导入数据(三张表,三份数据),并验证是否正确
思路:重点在于要把 ‘::’ 给替换掉
第一种方法,使用sed 进行文本处理:
#!/bin/bash
sed 's/::/%/g' users.dat >> newusers.dat
sed 's/::/%/g' movies.dat >> newmovies.dat
sed 's/::/%/g' ratings.dat >> newratings.dat
create table users(userid BigInt, gender String, age Int, occupation String, zipcode String)
row format delimited fields terminated by '%' stored as textfile;
create table movies(movieID BigInt, title String, genres String)
row format delimited fields terminated by '%' stored as textfile;
create table ratings(userID BigInt, movieID BigInt, rating Double, timestamped String)
row format delimited fields terminated by '%' stored as textfile;
load data local inpath '/home/user/newusers.dat' into table users;
load data local inpath '/home/user/newmovies.dat' into table movies;
load data local inpath '/home/user/newratings.dat' into table ratings;
select * from users limit 10;
select * from movies limit 10;
select * from ratings limit 10;
第二种方法
create table users(userID BigInt, gender String, age Int, occupation String, zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/user/users.dat' INTO TABLE users;
select * from users limit 5;
create table movies(movieID BigInt, title String, genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/user/movies.dat' INTO TABLE movies;
select * from movies limit 5;
create table ratings(userID BigInt, movieID BigInt, rating Double, timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/user/ratings.dat' INTO TABLE ratings;
select * from ratings limit 5;
得到数据如下::
1、users.dat 数据格式为: 2::M::56::16::70072
1 F 1 10 48067
2 M 56 16 70072
3 M 25 15 55117
2、movies.dat 数据格式为: 2::Jumanji (1995)::Adventure|Children's|Fantasy
1 Toy Story (1995) Animation|Children's|Comedy
2 Jumanji (1995) Adventure|Children's|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
3、ratings.dat 数据格式为: 1::1193::5::978300760
1 1193 5.0 978300760
1 661 3.0 978302109
1 914 3.0 978301968
(2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
select a.title as title , b.count as counts from
movies a join
(
select movieid as MovieID,count(*) as count from ratings group by movieid sort by count desc limit 10
) b
on a.movieid = b.MovieID;
得到结果:
title counts
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
select movieid as MovieID,count(*) as count from ratings group by movieid sort by count desc limit 10
(3)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
求男性评分最高的10部电影
1、首先根据电影评分高低进行排序
select userid as userid ,movieid as movieid,avg(rating) as avgrate from ratings group by movieid,userid sort by avgrate desc
2、
select collect_set(c.gender)[0] gender, a.title title , avg(b.rating) rating,count(b.rating) counts from movies a
join ratings b on a.movieid = b.movieid
join users c on b.userid = c.userid
where c.gender = "M"
group by b.movieid,a.title
having counts>70
order by rating desc
limit 10
;
得到如下结果:
gender title rating counts
M Godfather, The (1972) 4.583333333333333 1740
M Seven Samurai (The Magnificent Seven) (Shichin