Hive Sampling Queries (Bucketed Tables)

Latest recommended article published 2023-09-13 16:29:35

鸷鸟之不群

Category: Hadoop相关 (Hadoop-related). Tags: hive

Article link: https://blog.csdn.net/weixin_64420247/article/details/129789682

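This article introduces the bucketed table, the structure Hive uses to make sampling more efficient. A bucketed table resembles a partitioned table: clustered by and sorted by split the data into blocks and speed up queries. Bucketing can be enforced by setting the hive.enforce.bucketing parameter. The article also shows how to sample data with the tablesample clause, both by row count and by bucket ratio.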


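1. The structure used for sampling

Bucketed table (bucket)

A structure designed for sampling: its purpose is to make sampling faster.

A bucketed table is very similar to a partitioned table. Both split one body of data into several pieces, which also speeds up queries. Bucketing splits the data with sampling in mind, so a sample can be read from a subset of the pieces instead of the whole table.

(A partitioned table splits data by the value of a partition column.)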

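Enforce bucketing

set hive.enforce.bucketing=true;

(This setting applies to Hive 1.x. From Hive 2.0 the property was removed and bucketing is always enforced.)

create table t1(
    id int,
    name string,
    age int
)
clustered by (id) sorted by (id) into 8 buckets
row format delimited fields terminated by ',';

clustered by (id) distributes the rows across the 8 buckets by id, and sorted by (id) keeps the rows inside each bucket sorted by id.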

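Load the data

load data only moves files into the table directory and does not distribute rows into buckets. So first load the raw file into a plain staging table t2, then insert from t2 into t1 and let Hive do the bucketing.

create table t2(
    id int,
    name string,
    age int
);
load data inpath ... into table t2;

insert into table t1 select * from t2;

During the insert, each row is written to the bucket given by hash(id) % 8.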
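The hash(id) % 8 rule can be tried out in a small Python sketch. This is only a simulation, not Hive's code: it assumes that the hash of an int is the int itself (masked to be non-negative), which matches Hive's older bucketing scheme but may not hold for newer bucketing versions. The names bucket_of and fill_buckets are made up for this illustration, and buckets are numbered from 1 to match the numbering that tablesample uses.

```python
from collections import defaultdict

NUM_BUCKETS = 8  # matches "into 8 buckets" in the t1 definition


def bucket_of(id_value: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 1-based bucket number for an integer id.

    Assumption: the hash of an int is the int itself, masked to be non-negative.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets + 1


def fill_buckets(ids):
    """Group ids by bucket and sort inside each bucket,
    imitating clustered by (id) sorted by (id)."""
    buckets = defaultdict(list)
    for i in ids:
        buckets[bucket_of(i)].append(i)
    return {b: sorted(rows) for b, rows in sorted(buckets.items())}


if __name__ == "__main__":
    for bucket, rows in fill_buckets([17, 3, 8, 11, 1, 9, 16, 25]).items():
        print(bucket, rows)
    # 1 [8, 16]
    # 2 [1, 9, 17, 25]
    # 4 [3, 11]
```

Ids that share the same remainder modulo 8 end up in the same bucket, and each bucket is sorted on its own. There is no global order across buckets.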

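2. The sampling clause

tablesample()

select * from t1 tablesample(10 rows);

This form samples by row count. The count is applied to each input split, so the total number of rows returned can exceed 10 when the table is read as several splits.

select * from t1 tablesample(bucket 2 out of 4);

This form samples by bucket, written as tablesample(bucket x out of y):

x: the bucket to start from. It must not exceed y.
y: the number of groups the buckets are divided into. It should be a factor or a multiple of the table's total bucket count.

t1 has 8 buckets:

1 2 3 4 5 6 7 8

bucket 2 out of 4 reads 8 / 4 = 2 buckets, starting at bucket 2 and stepping by 4, so it picks buckets 2 and 6.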
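The result for buckets 2 and 6 can be checked with a short Python simulation. It is a sketch under two assumptions: a row is kept by bucket x out of y when hash(id) % y equals x - 1, and the hash of a non-negative int id is the id itself. The function name sampled_buckets is hypothetical and is not part of Hive.

```python
def sampled_buckets(num_buckets: int, x: int, y: int, ids) -> set:
    """Return the 1-based buckets that hold the rows kept by 'bucket x out of y'.

    Assumptions: ids are non-negative ints, hash(id) == id, and a row is
    kept when hash(id) % y == x - 1.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    kept = [i for i in ids if i % y == x - 1]
    # Map every kept row back to the bucket it was stored in.
    return {i % num_buckets + 1 for i in kept}


if __name__ == "__main__":
    # 8 buckets, bucket 2 out of 4
    print(sorted(sampled_buckets(8, 2, 4, range(1000))))  # [2, 6]
```

Because 4 divides 8, every kept row lives in bucket 2 or bucket 6, so only those two buckets need to be read. When y is a multiple of the bucket count instead, for example bucket 3 out of 16 on 8 buckets, the same function returns a single bucket, of which only part of the rows are kept.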
