Hive笔记——影评项目

最新推荐文章于 2023-03-09 05:11:39 发布

嘉平11

最新推荐文章于 2023-03-09 05:11:39 发布

阅读量817

点赞数

分类专栏： Hive

本文链接：https://blog.csdn.net/zgm12/article/details/104593642

版权

Hive 专栏收录该内容

11 篇文章 0 订阅

订阅专栏

题目要求

有如此三份数据：
1、users.dat 数据格式为： 2::M::56::16::70072
对应字段为：UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释：用户id，性别，年龄，职业，邮政编码

2、movies.dat 数据格式为： 2::Jumanji (1995)::Adventure|Children's|Fantasy
对应字段为：MovieID BigInt, Title String, Genres String
对应字段中文解释：电影ID，电影名字，电影类型

3、ratings.dat 数据格式为： 1::1193::5::978300760
对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释：用户ID，电影ID，评分，评分时间戳

题目要求：

数据要求：
（1）写shell脚本清洗数据。（hive不支持解析多字节的分隔符，也就是说hive只能解析':', 不支持解析'::'，所以用普通方式建表来使用是行不通的，要求对数据做一次简单清洗）
（2）使用Hive能解析的方式进行

Hive要求：
1、正确建表，导入数据（三张表，三份数据），并验证是否正确

2、求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）

3、分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）

4、求movieid = 2116这部电影各年龄段（因为年龄就只有7个，
就按这个7个分就好了）的平均影评（年龄段，影评分）

5、求最喜欢看电影（影评次数最多）的那位女性评最高分的10部电影的
平均影评分（观影者，电影名，影评分）

6、求好片（评分>=4.0）最多的那个年份的最好看的10部电影

7、求1997年上映的电影中，评分最高的10部Comedy类电影

8、该影评库中各种类型电影中评价最高的5部电影（类型，电影名，平均影评分）

9、各年评分最高的电影类型（年份，类型，影评分）

10、每个地区（邮政编码）最高评分的电影名，把结果存入HDFS（地区，电影名，影评分）

一、数据清洗、建表

因为元数据的字段之间用：：分割，所以我们使用shell进行一下清洗,将：：都转换成逗号

vi change1.sh


#!/bin/bash
sed "s/::/,/g" /zgm/movies.dat>/zgm/movies2.dat
sed "s/::/,/g" /zgm/ratings.dat>/zgm/ratings2.dat
sed "s/::/,/g" /zgm/users.dat>/zgm/users2.dat


chmod +x change1.sh

/bin/sh change1.sh

或者在建表时：

 create external table if not exists ratings
      (userid bigint,
      movieid bigint,
      rating double,
      timestamped string)
      partitioned by (dt string,city string)
      row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
      with serdeProperties ('input.regex'='([0-9]*)::([0-9]*)::([0-9]*)::([0-9]*)')
      location '/zgm/ratings'
      ;

建表、load数据的完整shell

#! /bin/bash
yesterday=$(date -d 'yesterday' +%Y%m%d)
logdir=/Log/movieLog/$yesterday

if(!-d "$logdir");then
mkdir "$logdir"
fi

hql1="
      use myhive;
          drop table users;
           create external table if not exists users
      (userid bigint,
      gender string,
      age int,
      occupation string,
      zipcode string
      )
      partitioned by (dt string,city string)
      row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
      with serdeProperties ('input.regex'='([0-9]*)::([A-Z]*)::([0-9]*)::([0-9]*)::([0-9]*)')
      location '/zgm/users'
      ;

       load data local inpath '/zgm/users.dat' overwrite into table users partition(dt='${yesterday}',city='shandong');

        "
         $HIVE_HOME/bin/hive --database myhive -e "${hql1}"


hql2="
       drop table movies;
       create external table if not exists movies
           (movieid bigint,
           title string,
           genres string
           )
           partitioned by (dt string,city string)
           row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
           with serdeProperties ('input.regex'='([0-9]*)::([^@]*)::([^ ]*)')
       location '/zgm/movies'
      ;
        load data local inpath '/zgm/movies.dat' overwrite into table movies partition(dt='${yesterday}',city='shandong');
     "

       $HIVE_HOME/bin/hive --database myhive -e "${hql2}"
hql3="
      drop table ratings;
      create external table if not exists ratings
      (userid bigint,
      movieid bigint,
      rating double,
      timestamped string)
      partitioned by (dt string,city string)
      row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
      with serdeProperties ('input.regex'='([0-9]*)::([0-9]*)::([0-9]*)::([0-9]*)')
      location '/zgm/ratings'
      ;

      load data local inpath '/zgm/ratings.dat' overwrite into table ratings partition(dt='${yesterday}',city='shandong');
     "

         $HIVE_HOME/bin/hive --database myhive -e "${hql3}"

二、实现分析hql