Contents
1. Creating tables
The key difference between internal (managed) and external tables: dropping an internal table deletes both the data and the metadata (the table owns its files), while dropping an external table deletes only the metadata and leaves the underlying files in place (the table merely points at them, like a symlink); recreating the table restores access to the data.
//Create an internal (managed) table
CREATE TABLE movie_table
(movieid STRING,
title STRING,
genres STRING
)
row format delimited fields terminated by ','
stored as textfile;
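For contrast, an external table differs only in the EXTERNAL keyword and (usually) an explicit LOCATION; a minimal sketch, where the table name and path below are hypothetical examples:

```sql
-- External table: DROP TABLE on it removes only the metadata and leaves
-- the files under LOCATION untouched. The path is a made-up example.
CREATE EXTERNAL TABLE movie_table_ext
(movieid STRING,
title STRING,
genres STRING
)
row format delimited fields terminated by ','
stored as textfile
location '/hive/movie_ext';
```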
2. Four ways to load data
2.1 Loading from the local filesystem or HDFS with LOAD DATA
//Create the table:
hive> create table test(
id string,
age string,
score string)
row format delimited fields terminated by ','
stored as textfile;
//Load data from the local file /home/badou/hive_test/ac.txt
hive> load data local inpath '/home/badou/hive_test/ac.txt' into table test;
//Load data from HDFS
hive> load data inpath '/hive/ab.txt' into table test;
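One detail worth knowing: LOAD DATA LOCAL copies the file from the client machine, whereas LOAD DATA without LOCAL moves the HDFS file into the table directory, so the original /hive/ab.txt disappears from its old location. Adding OVERWRITE replaces the table's existing files instead of appending, as in this sketch:

```sql
-- Replace the current contents of test instead of appending to them
hive> load data local inpath '/home/badou/hive_test/ac.txt' overwrite into table test;
```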
2.2 Loading data with LOCATION
//Create the table, using location to point at a directory path
CREATE TABLE movie_table
(movieid STRING,
title STRING,
genres STRING
)
row format delimited fields terminated by ','
stored as textfile
location '/hive';
//put the data file from the local filesystem onto HDFS
[root@master ml-latest-small]# hadoop fs -put movies.csv /hive
//Inspect the data
hive> select * from movie_table limit 10;
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 Father of the Bride Part II (1995) Comedy
6 Heat (1995) Action|Crime|Thriller
7 Sabrina (1995) Comedy|Romance
8 Tom and Huck (1995) Adventure|Children
9 Sudden Death (1995) Action
Time taken: 0.144 seconds, Fetched: 10 row(s)
2.3 Copying data between tables with INSERT INTO
//Append the query result to the test table; to replace the existing contents, use insert overwrite instead
hive> insert into table test select * from movie_table limit 10;
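A minimal sketch of the overwrite variant mentioned above:

```sql
-- insert overwrite discards test's existing rows before writing the result
hive> insert overwrite table test select * from movie_table limit 10;
```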
2.4 Creating a new table with CREATE TABLE ... AS SELECT
hive> create table behavior_table as
select /*+ MAPJOIN(B) */ B.*, A.title
from movie_table A
join rating_table_ex B
on A.movieid = B.id2
limit 10;
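The /*+ MAPJOIN(B) */ hint asks Hive to load the hinted table into memory and perform the join on the map side, avoiding a reduce phase. Hive can also make this decision automatically; the settings below exist in stock Hive, and the threshold shown is just the documented default:

```sql
-- Let Hive convert joins to map joins automatically when one side is small
hive> set hive.auto.convert.join=true;
-- Tables below this size (in bytes) count as "small"; 25000000 is the default
hive> set hive.mapjoin.smalltable.filesize=25000000;
```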
3. Two ways to export data
3.1 Exporting to the local filesystem
hive> insert overwrite local directory '/home/badou/hive_test' select * from behavior_table;
3.2 Exporting to HDFS
hive> insert overwrite directory '/hive' select * from behavior_table;
4. Partitioned tables
//Create the table; partitioned by defines the partition column
hive> CREATE TABLE rating_table_p
(userid STRING,
movieid STRING,
rating STRING
)
partitioned by(dt STRING)
row format delimited fields terminated by '\t'
lines terminated by '\n';
//When loading into a partitioned table, add a partition(...) clause; the data file itself need not contain the value 2003-09, Hive fills it in automatically
hive> LOAD DATA LOCAL INPATH '/home/badou/hive_test/data/ml-latest-small/2003-09.data' OVERWRITE INTO TABLE rating_table_p partition(dt='2003-09');
//Query the data
hive> select * from rating_table_p;
OK
15 1281 1.0 2003-09
15 3267 3.0 2003-09
15 3421 1.5 2003-09
15 3462 1.5 2003-09
166 34 4.0 2003-09
166 147 4.5 2003-09
166 163 3.5 2003-09
166 293 4.5 2003-09
166 342 4.0 2003-09
166 356 4.0 2003-09
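On HDFS each partition is stored as a subdirectory named after the partition column, e.g. .../rating_table_p/dt=2003-09/. Partitions can also be managed directly with DDL; a sketch, where the dt value 2003-10 is a made-up example:

```sql
-- List the table's existing partitions
hive> show partitions rating_table_p;
-- Add and drop a partition without loading data (the value is hypothetical)
hive> alter table rating_table_p add partition(dt='2003-10');
hive> alter table rating_table_p drop partition(dt='2003-10');
```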
5. Bucketed tables
//Create the table with clustered by (column) INTO (bucket count) buckets
hive> CREATE TABLE rating_table_b
(userid STRING,
movieid STRING,
rating STRING
)
clustered by (userid) INTO 16 buckets;
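Note: in Hive versions before 2.x, inserts only honour the bucket definition when bucketing enforcement is switched on (otherwise the reducer count would have to be set to match the bucket count by hand):

```sql
-- Make inserts respect the CLUSTERED BY ... INTO 16 BUCKETS definition
hive> set hive.enforce.bucketing=true;
```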
//Load data
hive> from rating_table_ex
insert overwrite table rating_table_b
select id1,id2,id3;
//HDFS now contains 16 bucket files
[root@master ml-latest-small]# hadoop fs -ls /user/hive/warehouse/rating_table_b
Found 16 items
-rwxr-xr-x 3 root supergroup 74544 2020-07-13 14:06 /user/hive/warehouse/rating_table_b/000000_1
-rwxr-xr-x 3 root supergroup 119542 2020-07-13 14:05 /user/hive/warehouse/rating_table_b/000001_1
-rwxr-xr-x 3 root supergroup 83047 2020-07-13 14:06 /user/hive/warehouse/rating_table_b/000002_1
-rwxr-xr-x 3 root supergroup 125551 2020-07-13 14:02 /user/hive/warehouse/rating_table_b/000003_0
-rwxr-xr-x 3 root supergroup 113070 2020-07-13 14:05 /user/hive/warehouse/rating_table_b/000004_0
-rwxr-xr-x 3 root supergroup 85340 2020-07-13 14:06 /user/hive/warehouse/rating_table_b/000005_1
-rwxr-xr-x 3 root supergroup 82565 2020-07-13 14:04 /user/hive/warehouse/rating_table_b/000006_0
-rwxr-xr-x 3 root supergroup 34999 2020-07-13 14:05 /user/hive/warehouse/rating_table_b/000007_0
-rwxr-xr-x 3 root supergroup 100112 2020-07-13 14:05 /user/hive/warehouse/rating_table_b/000008_0
-rwxr-xr-x 3 root supergroup 54563 2020-07-13 14:03 /user/hive/warehouse/rating_table_b/000009_0
-rwxr-xr-x 3 root supergroup 42638 2020-07-13 14:06 /user/hive/warehouse/rating_table_b/000010_0
-rwxr-xr-x 3 root supergroup 57536 2020-07-13 14:02 /user/hive/warehouse/rating_table_b/000011_0
-rwxr-xr-x 3 root supergroup 73950 2020-07-13 14:04 /user/hive/warehouse/rating_table_b/000012_0
-rwxr-xr-x 3 root supergroup 82775 2020-07-13 14:06 /user/hive/warehouse/rating_table_b/000013_1
-rwxr-xr-x 3 root supergroup 75989 2020-07-13 14:04 /user/hive/warehouse/rating_table_b/000014_0
-rwxr-xr-x 3 root supergroup 66814 2020-07-13 14:04 /user/hive/warehouse/rating_table_b/000015_0
//Bucketed tables also support sampling. tablesample(bucket 3 out of 4 on userid) splits the 16 buckets into 16/4 = 4 groups and reads the 3rd bucket of each group, i.e. the files 000002, 000006, 000010 and 000014 listed above.
select * from rating_table_b tablesample(bucket 3 out of 4 on userid) limit 10;
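For reference, tablesample(bucket x out of y on col) reads roughly N/y of the table's N buckets: those whose 1-based number n satisfies n mod y = x mod y. Setting y equal to the bucket count is the cheapest case, a sketch:

```sql
-- With y = 16 and 16 buckets, exactly one bucket file is scanned
hive> select * from rating_table_b tablesample(bucket 3 out of 16 on userid) limit 10;
```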