Hive Bucketing (Bucket): The Core Concepts in One Diagram

Latest recommended article published on 2025-05-03 10:29:55

License: CC 4.0 BY-SA

Tags: #hive #big data

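This article introduces the concept of bucketing in Hive and explains what bucketing is for: improving query efficiency and making sampling more efficient. It shows how to create a bucketed table in Hive by setting a parameter and running the create statements, and walks through the steps for inserting data and querying bucketed data. Finally, it uses an example to explain how `tablesample` is used when querying bucketed data.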

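What is bucketing?

Hive assigns rows to buckets based on a hash value: it takes the hash of the bucketing column and uses the remainder after dividing by the number of buckets (bucket_id = column.hashcode % bucket.num). Rows with the same value in the bucketing column therefore always land in the same bucket.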
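The formula above can be tried out directly in a Hive session. The following is a minimal sketch, assuming Hive's built-in `hash()` and `pmod()` functions and a bucket count of 2; the literal values 1001 and 1002 are made up for illustration. Newer Hive versions may use a different hash function internally for bucketed tables, so treat this as a way to see the modulo step, not as a guarantee of where a row is physically stored.

```sql
-- Sketch: apply "hash of the column, modulo the number of buckets" to sample values.
-- hash() and pmod() are Hive built-ins; pmod keeps the remainder non-negative.
-- 1001 and 1002 are hypothetical employee_id values; 2 is the assumed bucket count.
select pmod(hash(1001), 2) as bucket_for_1001,
       pmod(hash(1002), 2) as bucket_for_1002;
```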

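What is bucketing for?

1. More efficient query processing, for example when joining tables on the bucketing column
2. More efficient sampling, because a sample can read a subset of buckets instead of the whole table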

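How to bucket a table

1. Before bucketing, run the command set hive.enforce.bucketing=true; so that Hive enforces bucketing when data is written to the table. (In Hive 2.0 and later this property was removed and bucketing is always enforced.)

2. Create the bucketed table
First, create a regular table that will be used to feed data into the bucketed table: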

create table employee_id(
name string,
employee_id int,
work_place array<string>,
gender_age struct<gender:string,age:int>,
skills_score map<string,int>,
depart_title map<string,array<string>>
)
row format delimited fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';

Creating the bucketed table:

create table employee_id_buckets(
name string,
employee_id int,
work_place array<string>,
gender_age struct<gender:string,age:int>,
skills_score map<string,int>,
depart_title map<string,array<string>>
)
-- create two buckets
clustered by(employee_id) into 2 buckets
row format delimited fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
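
One way to confirm that the table was created with the intended bucketing is to inspect its metadata. This is a small sketch, assuming the `employee_id_buckets` table above already exists in the current database; the exact layout of the output can vary between Hive versions.

```sql
-- Sketch: show the metadata of the bucketed table created above.
-- In the output, look for the number of buckets and the bucket columns,
-- which should reflect "clustered by(employee_id) into 2 buckets".
describe formatted employee_id_buckets;
```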