1. 首先下载测试数据,数据也可以创建
http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
2. 数据类型与字段名称
movies.csv(电影元数据)
movieId,title,genres
ratings.csv(用户打分数据)
userId,movieId,rating,timestamp
3. 先把数据存放到HDFS上
hdfs dfs -mkdir /hive_operate
hdfs dfs-mkdir /hive_operate/movie_table
hdfs dfs-mkdir /hive_operate/rating_table
hdfs dfs-put movies.csv /hive_operate/movie_table
hdfs dfs-put ratings.csv /hive_operate/rating_table
4. 创建movie_table和rating_table
]$ cat create_movie_table.sql
create external table movie_table
(
movieId STRING,
title STRING,
genres STRING
)
row format delimited fields terminated by','storedastextfile
location'/hive_operate/movie_table';
]$ cat create_rating_table.sql
create external table rating_table
(userId STRING,
movieId STRING,
rating STRING,
ts STRING
)
row format delimited fields terminated by','storedastextfile
location'/hive_operate/rating_table';
其中字段名为timestamp为hive的保留字段,执行的时候会报错,需用反引号或者修改字段名,我这边修改的字段名
5. 执行
可以通过复制命令到终端执行,也可以通过hive -f movie_table_e来创建表
6. 查看
hive>show tables;
OK
movie_table
rating_table
hive> select * from rating_table limit 10;
OK1 31 2.5 1260759144
1 1029 3.0 1260759179
1 1061 3.0 1260759182
1 1129 2.0 1260759185
1 1172 4.0 1260759205
1 1263 2.0 1260759151
1 1287 2.0 1260759187
1 1293 2.0 1260759148
1 1339 3.5 1260759125
1 1343 2.0 1260759131
7. 生成新表(行为表)
create table behavior_table as
selectB.userid, A.movieid, B.rating, A.titlefrommovie_table A
join rating_table B
on A.movieid== B.movieid;
8. 把Hive表数据导入到本地
table->local file
insert overwrite local directory'/root/hive_test/1.txt' select * from behavior_table;
9. 把Hive表数据导入到HDFS上
table->hdfs file
insert overwrite directory'/root/hive_test/1.txt' select * from behavior_table;
10. 把本地数据导入到Hive表中
local file ->table
LOAD DATA LOCAL INPATH'/root/hive_test/a.txt' OVERWRITE INTO TABLE behavior_table;
11. 把HDFS上的数导入到HIve表中
hdfs file ->table
LOAD DATA INPATH'/a.txt' OVERWRITE INTO TABLE behavior_table;