1. 什么是分桶表
分桶表是按照某列属性值,把数据打散存储在不同文件的Hive表.
2. 分桶的原理
Hive官网解释:
How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.
翻译过来就是:
分桶列的hash值对分桶数取模,余数相同的会被分到同一个桶内.
在hdfs上,每个桶对应一个数据文件.
比如按照name属性分为3个桶,就是对name属性值的hash值对3取模,按照取模结果对数据分桶。
-
取模结果为0的数据记录存放到一个文件
-
取模结果为1的数据记录存放到一个文件
-
取模结果为2的数据记录存放到一个文件
3. 分桶表的优点
- 避免数据倾斜
- 有利于抽样
- 使map-side JOIN更高效
4. 如何创建分桶表
-- 创建分桶表
create table db_hive.user_buckets_demo(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
注意:
* clustered by: 指定分桶列
* into x buckets: 指定分桶数
5. 分桶表加载数据
注意:
* 需要先把数据加载到普通表,再通过insert into...select...from 插入数据.
否则,将不会产生分桶.
* hive2.0以前版本,插入数据前需要设置:
set hive.enforce.bucketing=true;
或者
set mapreduce.job.reduces=4;
创建普通表user_demo
create table db_hive.user_demo(id int, name string)
row format delimited fields terminated by '\t';
准备数据文件 buckets.txt
1 xiaoming1
2 xiaoming2
3 xiaoming3
4 xiaoming4
5 xiaoming5
6 xiaoming6
7 xiaoming7
8 xiaoming8
9 xiaoming9
10 xiaoming10
加载数据到普通表user_demo
load data local inpath '/home/hadoop/hive/data/user_bucket.txt' overwrite into table user_demo;
加载数据到分桶表user_buckets_demo
set hive.enforce.bucketing=true;
insert into user_buckets_demo select * from user_demo;
页面查看数据文件