【Hive】分桶表

最新推荐文章于 2024-09-08 18:21:03 发布

卜塔

最新推荐文章于 2024-09-08 18:21:03 发布

阅读量383

点赞数

分类专栏： Hive 文章标签： hive

本文链接：https://blog.csdn.net/NextAction/article/details/105168142

版权

Hive 专栏收录该内容

23 篇文章 1 订阅

订阅专栏

1. 什么是分桶表

分桶表是按照某列属性值,把数据打散存储在不同文件的Hive表.

2. 分桶的原理

Hive官网解释:

How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

翻译过来就是:

分桶列的hash值对分桶数取模,余数相同的会被分到同一个桶内.
在hdfs上,每个桶对应一个数据文件.

比如按照name属性分为3个桶，就是对name属性值的hash值对3取模，按照取模结果对数据分桶。

取模结果为0的数据记录存放到一个文件
取模结果为1的数据记录存放到一个文件
取模结果为2的数据记录存放到一个文件

3. 分桶表的优点

避免数据倾斜
有利于抽样
使map-side JOIN更高效

4. 如何创建分桶表

-- 创建分桶表
create table db_hive.user_buckets_demo(id int, name string)
clustered by(id) 
into 4 buckets 
row format delimited fields terminated by '\t';

注意:
* clustered by: 指定分桶列
* into x buckets: 指定分桶数

5. 分桶表加载数据

注意:
* 需要先把数据加载到普通表,再通过insert into...select...from 插入数据.
  否则,将不会产生分桶.
* hive2.0以前版本,插入数据前需要设置:  
  set hive.enforce.bucketing=true; 
  或者 
  set mapreduce.job.reduces=4;

创建普通表user_demo

create table db_hive.user_demo(id int, name string)
row format delimited fields terminated by '\t';

准备数据文件 buckets.txt

1	xiaoming1
2	xiaoming2
3	xiaoming3
4	xiaoming4
5	xiaoming5
6	xiaoming6
7	xiaoming7
8	xiaoming8
9	xiaoming9
10	xiaoming10

加载数据到普通表user_demo

load data local inpath '/home/hadoop/hive/data/user_bucket.txt' overwrite into table user_demo;

加载数据到分桶表user_buckets_demo

set hive.enforce.bucketing=true;
insert into user_buckets_demo select * from user_demo;

页面查看数据文件
在这里插入图片描述

卜塔

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录