Hive Example

Hive Example

此实例主要学习Hive的基本操作

准备数据

  • 下载数据

    wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
  • 解压数据

    unzip ml-100k.zip
  • 数据说明(具体请查看解压目录中README)
u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of
                 user id | item id | rating | timestamp.
              The time stamps are unix seconds since 1/1/1970 UTC

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

创建表

接下来使用上述数据中的其中三个数据做接下来的练习,创建文件 hiveExample.sql,hiveExample.sql内容如下:

CREATE TABLE IF NOT EXISTS u_data (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS u_user(
        userid INT,
        age  INT,
        gender STRING,
        occupation STRING ,
        zipcode INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS u_item(
        movieid INT,
        movietitle STRING,
        releasedate  STRING,
        videoreleasedate STRING,
        IMDbURL STRING,
        unknown  INT,
        Action INT,
        Adventure  INT,
        Animation INT,
        Childrens INT,
        Comedy INT,
        Crime INT,
        Documentary INT,
        Drama INT,
        Fantasy INT,
        FilmNoir INT,
        Horror INT,
        Musical INT,
        Mystery INT,
        Romance INT,
        SciFi INT,
        Thriller INT,
        War INT,
        Western INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

执行如下命令,创建表:
hive -f hiveExample.sql

导入数据:

创建文件 importData.sql,importData.sql内容如下:

LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.data' OVERWRITE INTO TABLE u_data;

LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.user' OVERWRITE INTO TABLE u_user;

LOAD DATA LOCAL INPATH '/home/dev/storeFile/ml-100k/u.item' OVERWRITE INTO TABLE u_item;

其中/home/dev/storeFile/ml-100k/u.data为本地文件路径。
执行如下命令,导入数据:
hive -f importData.sql
导入成功后打印如下信息:

Loading data to table default.u_data
OK
Time taken: 3.852 seconds
Loading data to table default.u_user
OK
Time taken: 0.563 seconds
Loading data to table default.u_item
OK
Time taken: 0.706 seconds

HiveQL查询实例

select / order by /limit

将 u_user表中用户按照age降序排列,并查询出前5位用户信息:

hive> select * from u_user u order by u.age desc limit 5 ;

结果:

481 73  M   retired 37771
860 70  F   retired 48322
767 70  M   engineer    0
803 70  M   administrator   78212
585 69  M   librarian   98501

group by

  • group by 按照一个或者多个列对结果进行分组。
  • count / distinct 一同使用 DISTINCT 和 COUNT 关键词,来计算非重复结果的数目
    统计u_user表中不同性别的人数。
select u.gender, count(distinct u.userid) from u_user u group by u.gender;

结果:

OK
F   273
M   670
Time taken: 19.786 seconds, Fetched: 2 row(s)

参考资料

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值