What Hive Bucketing Is For

墨染百城

Article link: https://blog.csdn.net/mrbcy/article/details/68490074

The main purpose of partitioning is to let a query scan only part of the data, which speeds the query up.

What is bucketing

Suppose we have a table t_buck.

create table t_buck(id string,name string)
clustered by (id) sorted by (id) into 4 buckets;

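This declares that the table is split into 4 buckets by id.

The statement only declares that the table is bucketed; the buckets themselves are produced when data is loaded. The best way to load the data is insert into table, because the insert runs a job that hashes the rows into buckets.

At first all of our data sits together. After bucketing as declared above, several files (one per bucket) are produced under the table directory: /user/hive/warehouse/test_db/t_buck/

Each file holds the rows whose id hashes to that bucket.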
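
To make the hashing step concrete, here is a minimal Python sketch of how ids might be spread over 4 buckets. It is a simulation, not Hive code. The function names are made up for illustration, and the 31-based string hash is only an assumption standing in for whatever hash your Hive version applies to the column, so the bucket numbers on a real cluster may differ.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # mirrors "into 4 buckets" in the table definition


def java_style_hash(text: str) -> int:
    """31-based rolling hash over ASCII bytes, wrapped to a signed 32-bit int.

    Assumption: a stand-in for the hash Hive applies to a string column.
    """
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Clear the sign bit so the value is non-negative, then take the remainder.
    return (java_style_hash(key) & 0x7FFFFFFF) % num_buckets


def assign_buckets(ids, num_buckets: int = NUM_BUCKETS):
    """Group ids by bucket number; each bucket corresponds to one file."""
    buckets = defaultdict(list)
    for key in ids:
        buckets[bucket_for(key, num_buckets)].append(key)
    return {b: sorted(keys) for b, keys in buckets.items()}


if __name__ == "__main__":
    ids = [str(n) for n in range(1, 15)]  # the ids used in the experiment
    for bucket, keys in sorted(assign_buckets(ids).items()):
        print(bucket, keys)
```

The point to take away is that the bucket depends only on the id and the bucket count, so the same id always lands in the same file.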

Experiment

Create the tables with the following code:

create table t_p(id string,name string)
row format delimited fields terminated by ',';

load data local inpath '/root/buck.data' overwrite into table t_p;

create table t_buck(id string,name string)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by ',';

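-- Turn on enforced bucketing (needed on older Hive releases; from Hive 2.0 on it is always enforced)
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;

-- The column named in cluster by is the key used to partition rows across reducers.
-- Within each partition the rows are sorted by id.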
insert into table t_buck
select * from t_p cluster by(id);

Now let's look at what sort by produces.

set mapreduce.job.reduces=4;
select * from t_p sort by id;

The output is:

+---------+-----------+--+
| t_p.id  | t_p.name  |
+---------+-----------+--+
| 12      | 12        |
| 13      | 13        |
| 4       | 4         |
| 8       | 8         |
| 14      | 14        |
| 2       | 2         |
| 6       | 6         |
| 1       | 1         |
| 10      | 10        |
| 11      | 11        |
| 3       | 3         |
| 5       | 5         |
| 7       | 7         |
| 9       | 9         |
+---------+-----------+--+

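Clearly the rows are ordered within each reducer, not globally. Because id is a string, each run is in lexicographic order (for example 12, 13, 4, 8).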

cluster by(id) = distribute by(id) sort by(id)

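distribute by(id) names the column used to distribute rows across reducers; sort by names the column used to sort the rows within each reducer.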
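
The equivalence above can be sketched as a small Python simulation. This is an illustration only: the function names are hypothetical, and zlib.crc32 is an assumed stand-in for Hive's own hash, so the reducer each id lands on will not match a real job.

```python
import zlib


def reducer_for(key: str, num_reducers: int) -> int:
    # crc32 is a deterministic stand-in for the hash Hive would use (assumption).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_by(rows, num_reducers):
    """Send each (id, name) row to the reducer chosen by hashing its id."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[reducer_for(row[0], num_reducers)].append(row)
    return reducers


def sort_by(reducers):
    """Sort rows inside each reducer by id; reducers stay independent."""
    return [sorted(part, key=lambda row: row[0]) for part in reducers]


def cluster_by(rows, num_reducers):
    """cluster by(id) behaves like distribute by(id) followed by sort by(id)."""
    return sort_by(distribute_by(rows, num_reducers))


if __name__ == "__main__":
    rows = [(str(n), str(n)) for n in range(1, 15)]
    for i, part in enumerate(cluster_by(rows, 4)):
        print(i, [row[0] for row in part])
```

Each printed list is sorted on its own (as strings), but concatenating the lists does not in general give a globally sorted result, which is the same behaviour seen in the sort by output.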

What bucketing is for

Consider the following statement.

select a.id,a.name,b.addr from a join b on a.id = b.id;

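If tables a and b are both bucketed, and both are bucketed on the id column, this join no longer needs a full-table Cartesian product: rows with the same id sit in the same-numbered bucket of each table (given the same bucket count), so the join can proceed bucket by bucket. But if a table is declared as bucketed while its data was never actually bucketed, the result will be wrong.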
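
The following Python sketch illustrates both halves of that statement: joining bucket against bucket, and what goes wrong when the files do not match the declared bucketing. It is a simulation under stated assumptions, not how Hive implements the join: the sample rows, the function names, and the use of zlib.crc32 as the hash are all made up for illustration.

```python
import zlib

NUM_BUCKETS = 4


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # crc32 is an assumed stand-in for Hive's bucketing hash.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Place each (id, value) row into the bucket chosen by hashing its id."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    return buckets


def bucket_join(a_buckets, b_buckets):
    """Join bucket i of a only against bucket i of b, never across buckets."""
    joined = []
    for a_part, b_part in zip(a_buckets, b_buckets):
        addrs = {}
        for b_id, addr in b_part:
            addrs.setdefault(b_id, []).append(addr)
        for a_id, name in a_part:
            for addr in addrs.get(a_id, []):
                joined.append((a_id, name, addr))
    return joined


if __name__ == "__main__":
    # Hypothetical sample data for tables a(id, name) and b(id, addr).
    a = [("1", "alice"), ("2", "bob"), ("3", "carol")]
    b = [("1", "Beijing"), ("3", "Shanghai"), ("4", "Shenzhen")]
    a_buckets, b_buckets = bucketize(a), bucketize(b)
    print(sorted(bucket_join(a_buckets, b_buckets)))

    # Simulate a table declared as bucketed whose files are in the wrong buckets.
    misplaced = b_buckets[1:] + b_buckets[:1]
    print(bucket_join(a_buckets, misplaced))
```

With correctly bucketed inputs the bucket-wise join finds the same matches as a full comparison would. With the misplaced buckets the matches are silently lost, which is the kind of wrong result the paragraph above warns about.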
