Hive案例:影评

数据准备

数据下载链接:

https://github.com/qianhonglinIT/HiveMovieCaseStudy

在数据库中创建3张表,t_user,t_movie,t_rating

create table t_user(
userid bigint,
sex string,
age int,
occupation string,
zipcode string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;

create table t_movie(
movieid bigint,
moviename string,
movietype string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;


create table t_rating(
userid bigint,
movieid bigint,
rate double,
times string) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;

练习

(1)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)

按照电影名进行分组,计算每组的评论次数,并且降序排列,取前10条。

create table question1 as select m.moviename,count(m.moviename) as total from t_movie m join t_rating r on m.movieid=r.movieid group by m.moviename order by total desc limit 10;

0: jdbc:hive2://Hbase:10000> select * from question1;
+--------------------------------------------------------+------------------+--+
|                  question1.moviename                   | question1.total  |
+--------------------------------------------------------+------------------+--+
| American Beauty (1999)                                 | 3428             |
| Star Wars: Episode IV - A New Hope (1977)              | 2991             |
| Star Wars: Episode V - The Empire Strikes Back (1980)  | 2990             |
| Star Wars: Episode VI - Return of the Jedi (1983)      | 2883             |
| Jurassic Park (1993)                                   | 2672             |
| Saving Private Ryan (1998)                             | 2653             |
| Terminator 2: Judgment Day (1991)                      | 2649             |
| Matrix, The (1999)                                     | 2590             |
| Back to the Future (1985)                              | 2583             |
| Silence of the Lambs, The (1991)                       | 2578             |
+--------------------------------------------------------+------------------+--+
10 rows selected (0.147 seconds)

2)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
三表联合查询,按照性别过滤条件,电影名作为分组条件,影评分作为排序条件进行查询

create table question2 as select "F" as sex,m.moviename,count(m.moviename) as total from t_rating r join t_user u on r.userid=u.userid join t_movie m on r.movieid=m.movieid where sex="F" group by m.moviename order by total desc limit 10;

0: jdbc:hive2://Hbase:10000> select * from question2;
+----------------+--------------------------------------------------------+------------------+--+
| question2.sex  |                  question2.moviename                   | question2.total  |
+----------------+--------------------------------------------------------+------------------+--+
| F              | American Beauty (1999)                                 | 946              |
| F              | Shakespeare in Love (1998)                             | 798              |
| F              | Silence of the Lambs, The (1991)                       | 706              |
| F              | Sixth Sense, The (1999)                                | 664              |
| F              | Groundhog Day (1993)                                   | 658              |
| F              | Fargo (1996)                                           | 657              |
| F              | Star Wars: Episode VI - Return of the Jedi (1983)      | 653              |
| F              | Star Wars: Episode V - The Empire Strikes Back (1980)  | 648              |
| F              | Star Wars: Episode IV - A New Hope (1977)              | 647              |
| F              | Forrest Gump (1994)                                    | 644              |
+----------------+--------------------------------------------------------+------------------+--+
10 rows selected (0.177 seconds)

(3)求movieid = 2116这部电影各年龄段的平均影评(年龄段,影评分)
先以movieid = 2116为过滤条件,再按年龄分组,进行查询

create table question3 as select u.age,avg(r.rate) rate from t_rating r join t_user u on r.userid=u.userid where r.movieid=2116 group by u.age;
0: jdbc:hive2://Hbase:10000> select * from question3;
+----------------+---------------------+--+
| question3.age  |   question3.rate    |
+----------------+---------------------+--+
| 1              | 3.2941176470588234  |
| 18             | 3.3580246913580245  |
| 25             | 3.436548223350254   |
| 35             | 3.2278481012658227  |
| 45             | 2.8275862068965516  |
| 50             | 3.32                |
| 56             | 3.5                 |
+----------------+---------------------+--+
7 rows selected (0.134 seconds)

(4)求最喜欢看电影(影评次数最多)的那位女性评分最高的10部电影的影评分(观影者,电影名,影评分)
分析:
1)首先求出最喜欢看电影的女性

select u.userid,count(u.userid) from t_rating r join t_user u on r.userid=u.userid where u.sex="F" group by u.userid limit 1;
+-----------+--------+--+
| u.userid  | total  |
+-----------+--------+--+
| 1150      | 1302   |
+-----------+--------+--+
1 row selected (113.353 seconds)

查询用户id为1150评分最高的10部电影

select r.userid,m.moviename,r.rate from t_rating r join t_movie m on r.movieid=m.movieid where r.userid=1150 order by r.rate desc limit 10;

+-----------+------------------------------+---------+--+
| r.userid  |         m.moviename          | r.rate  |
+-----------+------------------------------+---------+--+
| 1150      | His Girl Friday (1940)       | 5.0     |
| 1150      | Blazing Saddles (1974)       | 5.0     |
| 1150      | City Lights (1931)           | 5.0     |
| 1150      | Annie Hall (1977)            | 5.0     |
| 1150      | Rear Window (1954)           | 5.0     |
| 1150      | Crumb (1994)                 | 5.0     |
| 1150      | White Christmas (1954)       | 5.0     |
| 1150      | Metropolitan (1990)          | 5.0     |
| 1150      | Topsy-Turvy (1999)           | 5.0     |
| 1150      | Hands on a Hard Body (1996)  | 5.0     |
+-----------+------------------------------+---------+--+
10 rows selected (66.002 seconds)

求出2)中10部电影的平均影评分

create table question3_2 as select m.moviename,avg(r.rate) as avgrate from question3_1 q join t_movie m on q.moviename=m.moviename join t_rating r on m.movieid=r.movieid group by m.moviename;

0: jdbc:hive2://Hbase:10000> select * from question3_2;
+------------------------------+----------------------+--+
|    question3_2.moviename     | question3_2.avgrate  |
+------------------------------+----------------------+--+
| Annie Hall (1977)            | 4.14167916041979     |
| Blazing Saddles (1974)       | 4.047363717605005    |
| City Lights (1931)           | 4.387453874538745    |
| Crumb (1994)                 | 4.063136456211812    |
| Hands on a Hard Body (1996)  | 4.163043478260869    |
| His Girl Friday (1940)       | 4.249370277078086    |
| Metropolitan (1990)          | 3.6464646464646466   |
| Rear Window (1954)           | 4.476190476190476    |
| Topsy-Turvy (1999)           | 3.7039473684210527   |
| White Christmas (1954)       | 3.8265682656826567   |
+------------------------------+----------------------+--+
10 rows selected (0.126 seconds)

(5)求好片(评分>=4.0)最多的那个年份的最好看的10部电影
分析:

1)需要将t_rating和t_movie表进行联合查询,将电影名当中的上映年份截取出来,保存到临时表question5_1中

create table question5_1 as select m.movieid as movieid, m.moviename as moviename,substr(m.moviename,-5,4) as years,avg(r.rate) as avgrate from t_rating r join t_movie m on r.movieid=m.movieid group by m.movieid,m.moviename;

+----------------------+-------------------------------------------------------------------------------------+--------------------+----------------------+--+
| 3947                 | Get Carter (1971)                                                                   | 1971               | 3.4727272727272727   |
| 3948                 | Meet the Parents (2000)                                                             | 2000               | 3.6357308584686776   |
| 3949                 | Requiem for a Dream (2000)                                                          | 2000               | 4.115131578947368    |
| 3950                 | Tigerland (2000)                                                                    | 2000               | 3.6666666666666665   |
| 3951                 | Two Family House (2000)                                                             | 2000               | 3.9                  |
| 3952                 | Contender, The (2000)                                                               | 2000               | 3.7809278350515463   |
+----------------------+-------------------------------------------------------------------------------------+--------------------+----------------------+--+
3,706 rows selected (0.894 seconds)

2)求拥有好片最多的那一年。

create table question5_2 as select years,count(years) as cot from question5_1 where avgrate > 4 group by years order by cot desc limit 1;
+--------------------+------------------+--+
| question5_2.years  | question5_2.cot  |
+--------------------+------------------+--+
| 1998               | 27               |
+--------------------+------------------+--+

3)求出这一年最好的10部电影。

create table question5_3 as select moviename,years,avgrate from question5_1 where years=1998 order by avgrate desc limit 10;

+---------------------------------------------+--------------------+----------------------+--+
|            question5_3.moviename            | question5_3.years  | question5_3.avgrate  |
+---------------------------------------------+--------------------+----------------------+--+
| Follow the Bitch (1998)                     | 1998               | 5.0                  |
| Apple, The (Sib) (1998)                     | 1998               | 4.666666666666667    |
| Inheritors, The (Die Siebtelbauern) (1998)  | 1998               | 4.5                  |
| Return with Honor (1998)                    | 1998               | 4.4                  |
| Saving Private Ryan (1998)                  | 1998               | 4.337353938937053    |
| Celebration, The (Festen) (1998)            | 1998               | 4.3076923076923075   |
| West Beirut (West Beyrouth) (1998)          | 1998               | 4.3                  |
| Central Station (Central do Brasil) (1998)  | 1998               | 4.283720930232558    |
| 42 Up (1998)                                | 1998               | 4.2272727272727275   |
| American History X (1998)                   | 1998               | 4.2265625            |
+---------------------------------------------+--------------------+----------------------+--+
  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
(1)创建相应数据库和数据表,上传数据 首先需要在Hive中创建数据库和数据表,然后将数据上传到对应的数据表中。 ```sql -- 创建数据库 CREATE DATABASE IF NOT EXISTS movie_db; -- 切换到该数据库 USE movie_db; -- 创建数据表 CREATE TABLE IF NOT EXISTS movies ( movieid INT, title STRING, genres STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::' STORED AS TEXTFILE; CREATE TABLE IF NOT EXISTS ratings ( userid INT, movieid INT, rating FLOAT, timestamp STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::' STORED AS TEXTFILE; CREATE TABLE IF NOT EXISTS users ( userid INT, gender STRING, age INT, occupation INT, zipcode STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::' STORED AS TEXTFILE; -- 将数据上传到对应的数据表中 LOAD DATA LOCAL INPATH '/path/to/ml-1m/movies.dat' OVERWRITE INTO TABLE movies; LOAD DATA LOCAL INPATH '/path/to/ml-1m/ratings.dat' OVERWRITE INTO TABLE ratings; LOAD DATA LOCAL INPATH '/path/to/ml-1m/users.dat' OVERWRITE INTO TABLE users; ``` (2)评分数最多的10电影,并给出评分数(电影评分数) ```sql SELECT m.title, COUNT(r.movieid) AS rating_count FROM movies m JOIN ratings r ON m.movieid = r.movieid GROUP BY m.title ORDER BY rating_count DESC LIMIT 10; ``` (3)男性女性当中评分最高(按平均分)的10电影性别电影影评评论数大于等于50 ```sql SELECT u.gender, m.title, AVG(r.rating) AS avg_rating FROM movies m JOIN ratings r ON m.movieid = r.movieid JOIN users u ON r.userid = u.userid WHERE u.gender = 'F' AND r.userid IN ( SELECT userid FROM ratings GROUP BY userid HAVING COUNT(movieid) >= 50 ) OR u.gender = 'M' AND r.userid IN ( SELECT userid FROM ratings GROUP BY userid HAVING COUNT(movieid) >= 50 ) GROUP BY u.gender, m.title HAVING COUNT(r.movieid) >= 50 ORDER BY u.gender, avg_rating DESC LIMIT 10; ``` (4)movieid = 2116这电影各年龄段(因为年龄就只有7个,就按这个7个就好了)的平均影评(年龄段,影评) ```sql SELECT CASE WHEN age BETWEEN 1 AND 18 THEN '1-18' WHEN age BETWEEN 18 AND 24 THEN '18-24' WHEN age BETWEEN 25 AND 34 THEN '25-34' WHEN age BETWEEN 35 AND 44 THEN '35-44' WHEN age BETWEEN 45 AND 49 THEN '45-49' WHEN age BETWEEN 50 AND 55 THEN '50-55' ELSE '56+' END AS age_group, AVG(r.rating) AS avg_rating FROM ratings r JOIN users u ON r.userid = u.userid WHERE r.movieid = 2116 GROUP BY age_group; ``` (5)最喜欢看电影影评数最多)的那位女性最高10电影的平均影评(观影者,电影影评) ```sql SELECT u.userid, m.title, AVG(r.rating) AS avg_rating FROM movies m JOIN ratings r ON m.movieid = r.movieid JOIN users u ON r.userid = u.userid WHERE u.gender = 'F' AND r.userid IN ( SELECT userid FROM ratings GROUP BY userid ORDER BY COUNT(movieid) DESC LIMIT 1 ) GROUP BY u.userid, m.title HAVING COUNT(r.movieid) >= 50 ORDER BY avg_rating DESC LIMIT 10; ``` (6)好片(平均评分>=4.0)最多的那个年份的最好看的10电影 ```sql SELECT m.title, AVG(r.rating) AS avg_rating, COUNT(r.movieid) AS rating_count, SUBSTR(r.timestamp, 1, 4) AS year FROM movies m JOIN ratings r ON m.movieid = r.movieid WHERE AVG(r.rating) >= 4.0 GROUP BY m.title, year ORDER BY rating_count DESC, avg_rating DESC LIMIT 10; ``` (7)1997年上映的电影中,评分最高10Comedy类电影 ```sql SELECT m.title, AVG(r.rating) AS avg_rating FROM movies m JOIN ratings r ON m.movieid = r.movieid WHERE m.genres LIKE '%Comedy%' AND SUBSTR(r.timestamp, 1, 4) = '1997' GROUP BY m.title ORDER BY avg_rating DESC LIMIT 10; ``` (8)该影评库中各种类型电影中评价最高的5电影(类型,电影,平均影评) ```sql SELECT m.genres, m.title, AVG(r.rating) AS avg_rating FROM movies m JOIN ratings r ON m.movieid = r.movieid GROUP BY m.genres, m.title HAVING COUNT(r.movieid) >= 100 ORDER BY m.genres, avg_rating DESC LIMIT 5; ``` (9)各年评分最高电影类型(年份,类型,影评) ```sql SELECT SUBSTR(r.timestamp, 1, 4) AS year, m.genres, MAX(avg_rating) AS max_rating FROM ( SELECT movieid, AVG(rating) AS avg_rating FROM ratings GROUP BY movieid ) r JOIN movies m ON r.movieid = m.movieid GROUP BY year, m.genres; ``` (10)每个地区最高评分电影,把结果存入HDFS(地区,电影影评) ```sql INSERT OVERWRITE DIRECTORY '/path/to/output' SELECT u.zipcode, m.title, AVG(r.rating) AS avg_rating FROM movies m JOIN ratings r ON m.movieid = r.movieid JOIN users u ON r.userid = u.userid GROUP BY u.zipcode, m.title HAVING COUNT(r.movieid) >= 50 ORDER BY u.zipcode, avg_rating DESC; ```

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值