【Hive】分桶表

1. 什么是分桶表

分桶表是按照某列属性值,把数据打散存储在不同文件的Hive表.

2. 分桶的原理

Hive官网解释:

How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

翻译过来就是:

分桶列的hash值对分桶数取模,余数相同的会被分到同一个桶内.
在hdfs上,每个桶对应一个数据文件.

比如按照name属性分为3个桶,就是对name属性值的hash值对3取模,按照取模结果对数据分桶。

  • 取模结果为0的数据记录存放到一个文件

  • 取模结果为1的数据记录存放到一个文件

  • 取模结果为2的数据记录存放到一个文件

3. 分桶表的优点
  • 避免数据倾斜
  • 有利于抽样
  • 使map-side JOIN更高效
4. 如何创建分桶表
-- 创建分桶表
create table db_hive.user_buckets_demo(id int, name string)
clustered by(id) 
into 4 buckets 
row format delimited fields terminated by '\t';

注意:
* clustered by: 指定分桶列
* into x buckets: 指定分桶数
5. 分桶表加载数据
注意:
* 需要先把数据加载到普通表,再通过insert into...select...from 插入数据.
  否则,将不会产生分桶.
* hive2.0以前版本,插入数据前需要设置:  
  set hive.enforce.bucketing=true; 
  或者 
  set mapreduce.job.reduces=4; 

创建普通表user_demo

create table db_hive.user_demo(id int, name string)
row format delimited fields terminated by '\t';

准备数据文件 buckets.txt

1	xiaoming1
2	xiaoming2
3	xiaoming3
4	xiaoming4
5	xiaoming5
6	xiaoming6
7	xiaoming7
8	xiaoming8
9	xiaoming9
10	xiaoming10

加载数据到普通表user_demo

load data local inpath '/home/hadoop/hive/data/user_bucket.txt' overwrite into table user_demo;

加载数据到分桶表user_buckets_demo

set hive.enforce.bucketing=true;
insert into user_buckets_demo select * from user_demo;

页面查看数据文件
在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值