Hive Bucketed Tables

SuperDoge

Article link: https://blog.csdn.net/qq_43579121/article/details/86591463

How to use bucketed tables

1. Create a bucketed table:

create table teacher(name string) clustered by (name) into 3 buckets row format delimited fields terminated by ' ';

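2. Enable bucketing enforcement (needed on Hive 1.x; from Hive 2.0 onward bucketing is always enforced and this property no longer exists):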

set hive.enforce.bucketing = true;

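3. Insert data into the table:

insert overwrite table teacher select * from temp; -- temp must be prepared in advance; rows are read from temp and written into teacher

Note: teacher is a bucketed table. A bucketed table should not be filled by loading an external file directly, because LOAD DATA only moves the file and does not hash rows into buckets; populate it from another table with INSERT ... SELECT instead. Bucketed tables are normally created as managed (internal) tables, as here.

Sample data for temp: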
java zhang
web wang
java zhao
java qin
web liu
web zheng
ios li
linux chen
ios yang
ios duan
linux ma
linux xu
java wen
web wu

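Purpose and how it works

Bucketing computes a hash of the specified column and takes it modulo the number of buckets; rows with the same result are stored together in the same bucket. This makes data sampling convenient:

select * from teacher tablesample(bucket 1 out of 3 on name);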
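The hash-and-modulo rule can be sketched in Python. This is a simulation under stated assumptions, not Hive's actual code: `java_style_hash` and `bucket_index` are hypothetical helper names, and a Java-style string hash stands in for Hive's hash function, which depends on the column type and the Hive version. The keys are taken from the first field of the sample data.

```python
def java_style_hash(text: str) -> int:
    """Java-style 31-based string hash kept in 32 bits (assumed stand-in for Hive's hash)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_index(key: str, num_buckets: int) -> int:
    """Zero-based bucket for a key: non-negative hash modulo the bucket count."""
    return (java_style_hash(key) & 0x7FFFFFFF) % num_buckets


# Group a few keys from the sample data into 3 buckets, as in the teacher table.
buckets = {i: [] for i in range(3)}
for key in ["java", "web", "ios", "linux"]:
    buckets[bucket_index(key, 3)].append(key)

for index, keys in buckets.items():
    print(f"bucket {index + 1}: {keys}")
```

Rows with equal keys always land in the same bucket, which is what allows a sample on the bucketing column to read only the relevant bucket instead of the whole table.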

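Note: the sampling syntax is TABLESAMPLE(BUCKET x OUT OF y).

y should be a multiple or a factor of the table's total number of buckets. Hive uses y to determine the sampling ratio. For example, if the table has 3 buckets, then y=3 samples (3/3) = 1 bucket of data, and y=6 samples (3/6) = 1/2 of a bucket.

x indicates which bucket to start sampling from.

For example, with 3 buckets in total, tablesample(bucket 3 out of 3) samples (3/3) = 1 bucket of data, namely the 3rd bucket. With 32 buckets in total, tablesample(bucket 3 out of 16) samples (32/16) = 2 buckets of data, namely the 3rd bucket and the (3+16) = 19th bucket.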
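The arithmetic in these examples can be checked with a small Python sketch. `sampled_buckets` is a hypothetical helper written for illustration only; it models the rule described above (y a factor or a multiple of the bucket count) and rejects other values, which is an assumption of the sketch rather than documented Hive behavior.

```python
from fractions import Fraction


def sampled_buckets(x: int, y: int, num_buckets: int):
    """Return (1-based bucket numbers read, fraction of each bucket sampled)
    for TABLESAMPLE(BUCKET x OUT OF y) on a table with num_buckets buckets."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if y <= num_buckets:
        # y is a factor: whole buckets x, x+y, x+2y, ... are read.
        if num_buckets % y != 0:
            raise ValueError("y must be a factor of the bucket count")
        return list(range(x, num_buckets + 1, y)), Fraction(1)
    # y is a multiple: only part of a single bucket is read.
    if y % num_buckets != 0:
        raise ValueError("y must be a multiple of the bucket count")
    return [(x - 1) % num_buckets + 1], Fraction(num_buckets, y)


print(sampled_buckets(3, 3, 3))    # the 3rd bucket, whole
print(sampled_buckets(3, 16, 32))  # the 3rd and 19th buckets, whole
print(sampled_buckets(1, 6, 3))    # half of the 1st bucket
```

The number of buckets read is num_buckets / y when y is a factor, and the fraction of one bucket read is num_buckets / y when y is a multiple, which matches the (3/3), (32/16) and (3/6) figures in the text.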

Query the data in the first bucket and return about half of it:

select * from teacher tablesample(bucket 1 out of 6 on name);
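Why this returns only part of the first bucket can be illustrated with a row-level sketch in Python. It assumes a row is kept when its non-negative hash modulo y equals x - 1, and it uses a Java-style string hash as a stand-in for Hive's hash function; `java_style_hash` and `in_sample` are hypothetical names, and the keys come from the first field of the sample data.

```python
def java_style_hash(text: str) -> int:
    """Java-style 31-based string hash kept in 32 bits (assumed stand-in for Hive's hash)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def in_sample(key: str, x: int, y: int) -> bool:
    """True if the key falls in TABLESAMPLE(BUCKET x OUT OF y) under the assumed rule."""
    return (java_style_hash(key) & 0x7FFFFFFF) % y == x - 1


keys = ["java", "web", "ios", "linux"]

# Bucket 1 of the 3-bucket table corresponds to bucket 1 out of 3.
first_bucket = [k for k in keys if in_sample(k, 1, 3)]
# Bucket 1 out of 6 keeps only the keys whose hash is also 0 modulo 6.
half_sample = [k for k in keys if in_sample(k, 1, 6)]

print("bucket 1 out of 3:", first_bucket)
print("bucket 1 out of 6:", half_sample)
```

Every key selected by bucket 1 out of 6 also belongs to bucket 1 out of 3, so the query reads only the first bucket. The "half" refers to half of the hash values that map to that bucket, so the number of rows returned depends on how the keys are distributed and need not be exactly half.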