Hive Bucketed Tables in Practice

LSB19930706

Article link: https://blog.csdn.net/lsb19930706/article/details/109315781

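I. What bucketed tables are for

1. They make sampling convenient; 2. They make join queries more efficient.

II. Sampling queries on bucketed tables

There are two ways to have Hive write data into buckets. The first, shown below, is the recommended one (as far as I know, from Hive 2.0 on this property was removed and bucketing is always enforced, so the statement only matters on older versions)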

set hive.enforce.bucketing = true;

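When the target table is a bucketed table with 4 buckets, an insert automatically launches 4 reducers

The other way is to set the number of reducers by hand and to bucket the rows in the query that loads the bucketed table (with distribute by or cluster by).

set mapreduce.job.reduces = num; --num must equal the number of buckets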
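As a rough sketch of this second way, assuming the source table test and the 4-bucket table test_bucket that are created in the preparation steps further down, and assuming hive.enforce.bucketing is left off, the load could look like this. The reducer count is set to the bucket count, and cluster by id (shorthand for distribute by id sort by id) sends each row to the reducer that matches its bucket:

```sql
-- Sketch only: assumes tables test and test_bucket
-- (clustered by (id) into 4 buckets) already exist
set mapreduce.job.reduces = 4;   -- one reducer per bucket
insert overwrite table test_bucket
select id, buck_type
from test
cluster by id;                   -- same as distribute by id sort by id
```

The important part is that the column in cluster by is the same column the table declares in clustered by; otherwise the files would not match the table's bucketing metadata.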

Preparation

Create a table stored as textfile, with a comma as the column delimiter

create table test(
id int
,buck_type string
)
row format delimited fields terminated by ','
stored as textfile
;

Edit the table data on CentOS and upload it to the table's directory on HDFS

Check that the table returns data
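A minimal sketch of these two steps from inside the Hive CLI, assuming a local file /tmp/test.txt (a hypothetical path) with one id,buck_type pair per line. It uses load data instead of a manual HDFS upload; either way the file ends up in the table's directory.

```sql
-- /tmp/test.txt is a hypothetical local file with lines such as: 1,a
load data local inpath '/tmp/test.txt' into table test;

-- Check that rows come back and that the columns were split on the comma
select * from test limit 10;
select count(*) from test;
```

If buck_type comes back as NULL for every row, the delimiter in the file probably does not match the one in the table definition.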

Create the bucketed table

create table test_bucket(
id int
,buck_type string
)
clustered by (id) into 4 buckets 
row format delimited fields terminated by ','
stored as textfile 
;

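Insert the data from test into the bucketed table

set hive.enforce.bucketing = true;
insert overwrite table test_bucket
select * from test distribute by id sort by id;

Query the bucketed table. The statement above distributes the rows by id, the column declared in clustered by (id), and sorts them by id within each bucket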

The table directory now holds 4 files, one for each of the 4 buckets
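One way to see this from the Hive CLI is sketched below. The warehouse path is an assumption (the common default location); the Location value printed by the first statement is the one to use.

```sql
-- Prints Location, Num Buckets and Bucket Columns for the table
describe formatted test_bucket;

-- Lists the bucket files; /user/hive/warehouse is an assumed default,
-- replace the path with the Location shown by the statement above
dfs -ls /user/hive/warehouse/test_bucket;
```

With MapReduce as the execution engine the files are typically named 000000_0 through 000003_0, one per reducer.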

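Now the sampling query on the bucketed table can be tried out

SELECT * FROM test_bucket
TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);

Explanation of the numerator and denominator

In BUCKET x OUT OF y, the denominator y determines how much data is taken and the numerator x determines where sampling starts. When the ON column is the table's bucket column and y is a factor of the bucket count, Hive reads (bucket count / y) buckets. This table has 4 buckets and y is 2, so 4 / 2 = 2 buckets are read. With x = 1, these are bucket 1 and bucket 1+2 = 3.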
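One way to check which buckets a sample actually touches is to group the sampled rows by Hive's INPUT__FILE__NAME virtual column, which holds the path of the file each row was read from. The sketch below assumes the test_bucket table from above, bucketed on id. The second query uses a denominator equal to the bucket count, which should select a single bucket.

```sql
-- Which bucket files contribute rows to the 1-out-of-2 sample?
select INPUT__FILE__NAME as bucket_file, count(*) as row_count
from test_bucket tablesample(bucket 1 out of 2 on id)
group by INPUT__FILE__NAME;

-- Denominator equal to the bucket count (4): sample the 2nd bucket only
select INPUT__FILE__NAME as bucket_file, count(*) as row_count
from test_bucket tablesample(bucket 2 out of 4 on id)
group by INPUT__FILE__NAME;
```

If the ON column is not the column the table is bucketed on, Hive cannot skip files: it has to scan the whole table and filter the rows by hash, so the sample is still valid but no faster than a full scan.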

III. Join queries on bucketed tables

Prepare the data

create table test_bucket2(
id int
,col2 string
,col3 string
,col4 string
,col5 string
)
clustered by (id) into 10 buckets 
row format delimited fields terminated by ','
stored as textfile 
;
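--generate 1,000 rows (assumption: test2, which is not defined in this article,
--holds 10 rows, so the three-way cross join yields 10*10*10 rows)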
insert overwrite table test_bucket2
select a.id,'col2','col3','col4','col5' from test2 a,test2 b,test2 c
;
create table test_bucket3(
id int
,col2 string
,col3 string
,col4 string
,col5 string
)
clustered by (id) into 10 buckets 
row format delimited fields terminated by ','
stored as textfile 
;
insert overwrite table test_bucket3 select * from test_bucket2
;
create table test_join2(
id int
,col2 string
,col3 string
,col4 string
,col5 string
)
;
insert overwrite table test_join2 
select * from test_bucket2;
create table test_join3 as select * from test_join2
;

test_join2 and test_join3 are ordinary managed tables; test_bucket2 and test_bucket3 are bucketed tables, bucketed by id

select count(*) from test_bucket2 a
join test_bucket3 b on a.id=b.id
;
select count(*) from test_join2 a
join test_join3 b on a.id=b.id
;

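Running the two statements above, the join on the bucketed tables took 36 s and the join on the ordinary managed tables took 110 s (measured on a single-node virtual machine on a laptop, with 1 core and 3.2 GB of memory). In this test, then, the bucketed join was roughly three times faster.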
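The article does not say which join settings were active during the test. As a hedged sketch, the settings below are ones commonly used to let Hive exploit bucketing in a join, followed by explain extended to inspect the plan Hive would choose. Whether they change the plan depends on the Hive version and execution engine, so this is something to try, not the configuration behind the timings above.

```sql
-- Both tables are bucketed on the join key id with the same bucket count,
-- which is the precondition for matching buckets one-to-one in a map join
set hive.auto.convert.join = true;
set hive.optimize.bucketmapjoin = true;

-- Print the plan without running the query, then look at the join operator
explain extended
select count(*) from test_bucket2 a
join test_bucket3 b on a.id=b.id
;
```

Running the same explain on the test_join2 / test_join3 query gives a plan to compare against.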

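IV. Summary of lessons learned

1. A bucket column is one of the table's existing columns, whereas a partition column is declared separately from the regular columns;

2. On HDFS, a bucketed table has one file per bucket under the table directory (the 128 MB block size should not change this, since a file larger than one block simply spans several blocks, but I have no data set large enough to confirm it), while a partitioned table has partition directories under the table directory. The number of buckets is fixed when the table is created; the number of partitions is not;

3. A partitioned table places rows by the exact value of the partition column, while a bucketed table places rows by the hash of the bucket key modulo the number of buckets (not by key range);

4. In terms of purpose, partitioning narrows the range of data that is scanned, and bucketing can speed up joins;

5. Partitioning and bucketing are both, in the end, job optimizations. They can be combined, and the actual performance should be judged by testing.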
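As an illustration of point 5, here is a minimal sketch of a table that is both partitioned and bucketed. The table name test_part_bucket, the partition column dt and the date value are hypothetical and not taken from the tests above; the source table is the test table from section II.

```sql
-- Hypothetical table: partitioned by dt, bucketed by id inside each partition
create table test_part_bucket(
id int
,buck_type string
)
partitioned by (dt string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
stored as textfile
;

-- On Hive versions before 2.0, first run: set hive.enforce.bucketing = true;
-- Static partition insert; '2020-10-27' is only an example value
insert overwrite table test_part_bucket partition (dt='2020-10-27')
select id, buck_type from test;
```

Each partition directory then holds its own set of 4 bucket files, so a query can first prune partitions on dt and then sample or join on id within them.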
