转载至: https://www.cnblogs.com/qingyunzong/p/8727264.html#_label2_1
1. 案例说明
1. 现有如此三份数据:
-
users.dat
- 数据格式为: 2::M::56::16::70072,
- 共有6040条数据
- 对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
- 对应字段中文解释:用户id,性别,年龄,职业,邮政编码
-
movies.dat
- 数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy,
- 共有3883条数据
- 对应字段为:MovieID BigInt, Title String, Genres String
- 对应字段中文解释:电影ID,电影名字,电影类型
-
ratings.dat 数据格式为: 1::1193::5::978300760,
- 共有1000209条数据
- 对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
- 对应字段中文解释:用户ID,电影ID,评分,评分时间戳
2. 题目要求
-
数据要求:
- 写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析’:’, 不支持解析’::’,所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗)
- 使用Hive能解析的方式进行
-
Hive要求:
- 正确建表,导入数据(三张表,三份数据),并验证是否正确
- 求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
- 分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
- 求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
- 求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
- 求好片(评分>=4.0)最多的那个年份的最好看的10部电影
- 求1997年上映的电影中,评分最高的10部Comedy类电影
- 该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
- 各年评分最高的电影类型(年份,类型,影评分)
- 每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)
2. 数据下载
3. 前提准备
1. 正确建表,导入数据(三张表,三份数据),并验证是否正确
-
分析需求
需要创建一个数据库movie,在movie数据库中创建3张表,t_user,t_movie,t_ratingt_user: userid bigint, sex string, age int, occupation string, zipcode string t_movie: movieid bigint, moviename string, movietype string t_rating: userid bigint, movieid bigint, rate double, times string
原始数据是以::进行切分的,所以需要使用能解析多字节分隔符的Serde即可
使用RegexSerde, 需要两个参数:
input.regex = "(.*)::(.*)::(.*)" output.format.string = "%1$s %2$s %3$s"
-
创建数据库
drop database if exists movie; create database if not exists movie; use movie;
-
创建t_user表
create table t_user( userid bigint, sex string, age int, occupation string, zipcode string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s') stored as textfile;
-
创建t_movie表
create table t_movie( movieid bigint, moviename string, movietype string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s') stored as textfile;
-
创建t_rating表
create table t_rating( userid bigint, movieid bigint, rate double, times string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s') stored as textfile;
-
导入数据
2. 求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
hive> select t1.moviename, count(t1.moviename) as total
> from t_movie t1 join t_rating t2 on t1.movieid=t2.movieid
> group by t1.moviename
> order by total desc
> limit 10;
3. 分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
hive> select "F" as sex, moviename as name, avg(rate) as average, count(t2.movieid) as total
> from t_user t1
> join t_rating t2 on t1.userid=t2.userid
> join t_movie t3 on t2.movieid=t3.movieid
> where sex="F"
> group by moviename
> having total >= 50
> order by average desc
> limit 10;
hive> select "M" as sex, moviename as name, avg(rate) as average, count(t2.movieid) as total
> from t_user t1
> join t_rating t2 on t1.userid=t2.userid
> join t_movie t3 on t2.movieid=t3.movieid
> where sex="M"
> group by moviename
> having total >= 50
> order by average desc
> limit 10;
4. 求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评
hive> select age, avg(rate) as avg_rate
> from t_user t1 join t_rating t2 on t1.userid=t2.userid
> where movieid=2116
> group by age;
5. 求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
-
先求出观影次数最多的女用户的userid
hive> select userid, count(t2.movieid) as total > from t_user t1 join t_rating t2 on t1.userid=t2.userid > where sex="F" > group by t2.userid > order by total desc > limit 1;
-
再求出该用户评分最高的10部电影的movieid
hive> create table tmp1 as > select t1.movieid,moviename, rate > from t_movie t1 join t_rating t2 on t1.movieid=t2.movieid > where t2.userid=1150 > order by rate desc > limit 10;
-
求出这10部电影的平均评分
hive> select t2.movieid, moviename, avg(t2.rate) as average > from tmp1 t1 > join t_rating t2 on t1.movieid=t2.movieid > join t_movie t3 on t2.movieid=t3.movieid > group by t2.movieid, t3.moviename;
6. 求好片(评分>=4.0)最多的那个年份的最好看的10部电影
-
截取出年份, 并计算出每部电影的平均分, 并将结果写入之间表中:
hive> create table answer6_1 as > select t1.movieid, moviename, substr(moviename, -5, 4) as years, avg(rate) as avgrate > from t_movie t1 join t_rating t2 on t1.movieid=t2.movieid > group by t1.movieid, moviename;
-
求出好片(评分>=4.0)最多的那个年份
hive> create table answer6_2 as > select years, count(movieid) as total > from answer6_1 t1 > where avgrate >=4 > group by years > order by total desc > limit 1;
-
求出1998年评分最高的10部电影
hive> select movieid, moviename, avgrate > from answer6_1 > where years=1998 > order by avgrate desc > limit 10;
7. 求1997年上映的电影中,评分最高的10部Comedy类电影
instr(string str, string substr): 查找字符串str中子字符串substr出现的位置,如果查找失败将返回0
lcase/lower: 字符串转小写函数
ucase/upper: 字符串转大写函数
-
需要电影类型,所有可以将第六步中求出answer6_1表和t_movie表进行联合查询
hive> create table answer7_1 as > select b.movieid as id, b.moviename as name, b.years as years, b.avgrate as rate, a.movietype as type > from t_movie a join answer6_1 b on a.movieid=b.movieid;
-
从answer7_1按照电影类型中是否包含Comedy和按照评分>=4.0作为where过滤条件,按照评分作为排序条件进行查询,将结果保存到answer7_2中
hive> create table answer7_2 as > select t.id as id, t.name as name, t.rate as rate > from answer7_1 t > where t.years=1997 and instr(lcase(t.type),'comedy') >0 > order by rate desc > limit 10;
-
综合SQL
hive> create table answer7_3 as > select t1.movieid as id, t2.moviename, t2.movietype, avgrate > from answer6_1 t1 join t_movie t2 on t1.movieid=t2.movieid > where years=1997 and instr(lcase(t2.movietype),'comedy') > 0 > order by avgrate desc > limit 10;
8. 该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
-
需要电影类型,所有需要将answer7_1中的type字段进行裂变,将结果保存到answer8_2中
hive> create table answer8_1 as > select a.id as id, a.name as name, a.years as years, a.rate as rate, tv.type as type > from answer7_1 a > lateral view explode(split(a.type,"\\|")) tv as type;
-
求TopN,按照type分组,需要添加一列来记录每组的顺序,将结果保存到answer8_2中
hive> create table answer8_2 as > select id,name,years,rate,type,row_number() > over(distribute by type sort by rate desc ) as num > from answer8_1; select * from answer8_2 where num <= 5;
9. 各年评分最高的电影类型(年份,类型,影评分)
-
需要按照电影类型和上映年份进行分组,按照影评分进行排序,将结果保存到answer9_A中
hive> create table answer9_1 as > select a.years as years, a.type as type, avg(a.rate) as rate > from answer8_1 a > group by a.years,a.type > order by rate desc;
-
求TopN,按照years分组,需要添加一列来记录每组的顺序,将结果保存到answer9_2中
hive> create table answer9_2 as > select years,type,rate,row_number() > over (distribute by years sort by rate) as num > from answer9_1;
-
按照num=1作为where过滤条件取出结果数据
hive> select * from answer9_2 where num=1;
10. 每个地区最高评分的电影名(地区,电影名,影评分)
-
需要把三张表进行联合查询,取出电影id、电影名称、影评分、地区,将结果保存到answer10_1表中
hive> create table answer10_1 as > select c.movieid, c.moviename, avg(b.rate) as avgrate, a.zipcode > from t_user a > join t_rating b on a.userid=b.userid > join t_movie c on b.movieid=c.movieid > group by a.zipcode,c.movieid, c.moviename;
-
求TopN,按照地区分组,按照平均排序,添加一列num用来记录地区排名,将结果保存到answer10_2表中
hive> create table answer10_2 as > select movieid,moviename,avgrate,zipcode, row_number() > over (distribute by zipcode sort by avgrate) as num > from answer10_1;
-
按照num=1作为where过滤条件取出结果数据
hive> select t.* from answer10_2 t where t.num=1;
转载至: https://www.cnblogs.com/qingyunzong/p/8727264.html#_label2_1