一、概念
1、定义
对Hive(Inceptor)表分桶可以将表中记录按分桶键的哈希值分散进多个文件中,这些小文件称为桶。
2、为什么进行分桶
(1)获得更高的查询处理效率。比如JOIN操作。对于JOIN操作两个表有一个相同的列,如果对这两个表都进行了桶操作。那么将保存相同列值的桶进行JOIN操作就可以,可以大大较少JOIN的数据量。
(2)使取样(sampling)更高效。在处理大规模数据集时,在开发和修改查询的阶段,如果能在数据集的一小部分数据上试运行查询,会带来很多方便。
二、操作
1、创建表
hive> create table student(id int,name string) clustered by (id) into 4 buckets row format delimited fields terminated by ',';
OK
Time taken: 0.367 seconds
hive> desc formatted student;
OK
# col_name data_type comment
id int
name string
# Detailed Table Information
Database: default
Owner: root
CreateTime: Fri Mar 20 10:56:26 CST 2020
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/student
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1584672986
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: 4
Bucket Columns: [id]
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.072 seconds, Fetched: 32 row(s)
2、设置
将hive.enforce.bucketing的值设置为true,这个参数将强制控制ruduce的个数去和我们指定的分桶个数相适配。
hive> set hive.enforce.bucketing = true;
3、导入数据
注意:
不要直接load数据到此表,这样分桶表是创建不出来的,创建的是假的分桶;
要先创建一个不带分桶的结构一样的临时表,然后借助此临时表为分桶表导入数据;
(1)创建临时表
hive> create table stu_temp(id int,name string) row format delimited fields terminated by ',';
OK
Time taken: 0.058 seconds
hive> desc stu_temp;
OK
id int
name string
Time taken: 0.034 seconds, Fetched: 2 row(s)
(2)为临时表导数据
hive> load data local inpath'/home/test/stu.txt' into table stu_temp;
Loading data to table default.stu_temp
OK
Time taken: 1.104 seconds
hive> select * from stu_temp;
OK
1 xiaohong
2 xiaolan
3 zhangfei
4 liubie
5 machao
6 caocao
7 zhouyu
8 chaogai
9 songjiang
10 laobai
11 laozhang
12 laoliu
13 xiaoma
14 xiaofei
15 xiaofang
Time taken: 0.985 seconds, Fetched: 15 row(s)
(3)临时表数据导入分桶表
hive> insert into table student select * from stu_temp cluster by id;
hive> select * from student;
OK
4 liubie
8 chaogai
12 laoliu
1 xiaohong
5 machao
9 songjiang
13 xiaoma
2 xiaolan
6 caocao
10 laobai
14 xiaofei
3 zhangfei
7 zhouyu
11 laozhang
15 xiaofang
Time taken: 0.09 seconds, Fetched: 15 row(s)
4、分桶排序问题
hive> insert into table student select * from stu_temp cluster by id;
在上面的select语句中,我们使用了cluster by语句执行分桶的方式;我们发现其实桶内的数据是按照id字段进行升序排列的.其实cluster by相当于distribute by+sort by;sort by默认按照升序进行排列,distribute by+sort by的组合会更加的灵活,因此我们可以去按照id分桶,按照name去进行排序;
hive> insert into table student select * from stu_temp distribute by id sort by name desc;
hive> select * from student;
OK
4 liubie
12 laoliu
8 chaogai
13 xiaoma
1 xiaohong
9 songjiang
5 machao
2 xiaolan
14 xiaofei
10 laobai
6 caocao
7 zhouyu
3 zhangfei
15 xiaofang
11 laozhang
Time taken: 0.053 seconds, Fetched: 15 row(s)
5、对分桶表进行查询
(1)语法
select * from t_bucket tablesample(bucket n out of m on d);
n表示从哪个bucket进行抽样,桶计数从1开始。m用来计算抽取数据的量,计算方式为分桶数/m。假设我们一共分了128个桶,m设置为32,则表示要抽取4个bucket;m的值可以不为桶个数的公约数,可以为任意值。
(2)查询所有桶的数据
hive> select * from student tablesample(bucket 1 out of 1 on id);
OK
4 liubie
12 laoliu
8 chaogai
13 xiaoma
1 xiaohong
9 songjiang
5 machao
2 xiaolan
14 xiaofei
10 laobai
6 caocao
7 zhouyu
3 zhangfei
15 xiaofang
11 laozhang
Time taken: 0.17 seconds, Fetched: 15 row(s)
(3)查询一个桶的数据
hive> select * from student tablesample(bucket 1 out of 4 on id);
OK
4 liubie
12 laoliu
8 chaogai
Time taken: 0.076 seconds, Fetched: 3 row(s)
hive> select * from student tablesample(bucket 3 out of 4 on id);
OK
2 xiaolan
14 xiaofei
10 laobai
6 caocao
Time taken: 0.081 seconds, Fetched: 4 row(s)