Hive Bucketing

璀璨下的一点星辰

Category: hive

Article link: https://blog.csdn.net/cuicanxingchen123456/article/details/88088682

This example splits a table into four buckets by id.

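set hive.enforce.bucketing = true;   -- enable bucketing (Hive 2.x and later removed this property and always enforce bucketing)

set mapreduce.job.reduces=4;   -- the number of reducers must match the number of buckets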

create table t_buck(id string,name string)
clustered by(id)
sorted by(id)
into 4 buckets
row format delimited fields terminated by '.';
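
To confirm that the table was created with the intended bucketing, the table metadata can be inspected. This is a minimal sketch that only assumes the t_buck table defined above exists in the current database.

```sql
-- Show the table metadata. The output includes the
-- "Num Buckets", "Bucket Columns" and "Sort Columns" fields.
describe formatted t_buck;
```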

Create a new table t_p and load data into it, then insert from t_p into the bucketed table:
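
The post does not show how t_p is built, so the following is only a sketch of one way to do it. It assumes t_p has the same two columns and the same field delimiter as t_buck, and the file path is hypothetical. A plain staging table is used because load only moves files into the table directory and does not split them into buckets.

```sql
-- Staging table; column layout and delimiter are assumed to match t_buck.
create table t_p(id string, name string)
row format delimited fields terminated by '.';

-- Hypothetical local path; replace it with the real data file.
load data local inpath '/tmp/t_p.txt' into table t_p;
```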
insert into table t_buck
select id, name from t_p distribute by (id) sort by (id);   -- distribute by id, sort by id

Result: the data directory of t_buck should now contain four bucket files, one per reducer.
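
One way to check the result from the Hive CLI is sketched below. The warehouse path is an assumption (default warehouse directory, default database); adjust it to the actual table location.

```sql
-- List the files written for the table; path assumes the default
-- warehouse directory and the default database.
dfs -ls /user/hive/warehouse/t_buck;

-- Read only the first of the four buckets.
select * from t_buck tablesample(bucket 1 out of 4 on id);
```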

Using cluster by has the same effect as distribute by (id) sort by (id). Whenever distribute by and sort by use the same column, cluster by can be used instead.

insert into table t_buck
select id, name from t_p cluster by (id);

cluster by = distribute by + sort by (when distribute by and sort by use the same column)

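Notes: 1. order by sorts the input globally, so it runs with a single reducer. For large inputs this can take a long time.

2. sort by is not a global sort; the data is sorted before it enters each reducer. If you use sort by with mapred.reduce.tasks > 1, sort by only guarantees that the output of each reducer is sorted, not that the overall result is sorted.

3. distribute by sends rows that have the same value in the distribute by column to the same reducer.

4. cluster by does everything distribute by does and also sorts by that column. It is therefore commonly treated as cluster by = distribute by + sort by.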
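
The four clauses in the notes above can be compared side by side. This is an illustrative sketch against the t_p table from earlier, not output from the original post; the choice of name as a secondary sort column is only for demonstration.

```sql
-- order by: global ordering, all rows pass through a single reducer.
select id, name from t_p order by id;

-- sort by: each reducer sorts its own rows; the combined output
-- is not globally ordered when more than one reducer runs.
select id, name from t_p sort by id;

-- distribute by + sort by: rows with the same id go to the same reducer,
-- and each reducer sorts its rows by a different column, here descending.
select id, name from t_p distribute by id sort by name desc;

-- cluster by: shorthand for "distribute by id sort by id".
select id, name from t_p cluster by id;
```

Note that cluster by does not accept a sort direction and always sorts ascending, so a descending or multi-column sort needs the explicit distribute by ... sort by form.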
