Using Hive Bucketing


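Hive Bucketing

Every table in Hive, and every partition of a table, can be bucketed. A table or partition is stored as files on HDFS, and bucketing physically splits that data into a fixed number of files, one per bucket, according to the hash of the bucketing column. Bucketing is intended for large datasets.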
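As a rough illustration of how rows map to buckets, the query below computes an approximate bucket index per row with Hive's built-in hash and pmod functions. It assumes a source table student(id, name, age), as used in the insert example later on, and 4 buckets. The hash Hive applies internally depends on the Hive version and the column type, so treat the result as an approximation of the real assignment, not a guarantee.

```sql
-- Sketch: approximate bucket index for each row, assuming 4 buckets on age.
-- Assumption: student has columns (id, name, age).
SELECT id,
       name,
       age,
       pmod(hash(age), 4) AS approx_bucket  -- a value between 0 and 3
FROM student;
```

Rows with the same age always get the same index, which is the property that sampling and bucketed joins rely on.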

Using buckets

1. Declare the buckets when creating the table

create table student_bucket(id INT, name STRING, age INT)
clustered by (age) into 4 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
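To check that the bucketing declaration was recorded, the table metadata can be inspected. This is a small sketch; the exact layout of the output varies between Hive versions.

```sql
-- Show the full table metadata for the bucketed table.
-- Look for the "Num Buckets" and "Bucket Columns" entries,
-- which should report 4 and [age] for this table.
DESCRIBE FORMATTED student_bucket;
```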

If the rows in each bucket also need to be sorted, use a CREATE TABLE statement like the following:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING, ip STRING )
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

Here the rows within each bucket are sorted by viewTime.

2. Enable bucketing enforcement

set hive.enforce.bucketing=true;

What bucketing does

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.

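In other words, declaring buckets and a sort order on a table does not by itself bucket or sort the data being written; it only tells Hive how to read it. When inserting by hand, set the number of reduce tasks equal to the number of buckets and use CLUSTER BY (and SORT BY where needed) on the bucketing column in the insert query.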

insert into student_bucket select id,name,age from student cluster by (age);
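The two ways of populating a bucketed table can be sketched side by side. This assumes the student_bucket table defined earlier (4 buckets on age) and a source table student with the same columns. The property mapred.reduce.tasks is the older name for the reducer count and may appear as mapreduce.job.reduces on newer Hadoop versions. The hive.enforce.bucketing property applies to Hive 0.x and 1.x; Hive 2.x removed it and always enforces bucketing, so check your version.

```sql
-- Option 1: let Hive pick the reducer count and clustering automatically
-- (Hive 0.x / 1.x; later versions behave this way without the setting).
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE student_bucket
SELECT id, name, age FROM student;

-- Option 2: do it by hand. The reducer count must match the 4 buckets,
-- and CLUSTER BY must use the bucketing column (age).
SET mapred.reduce.tasks = 4;
INSERT OVERWRITE TABLE student_bucket
SELECT id, name, age FROM student
CLUSTER BY age;
```

Either way, the table directory should end up with one file per bucket.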

Use cases

1. Sampling data for analysis
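A minimal sampling sketch, assuming the student_bucket table defined earlier (clustered by age into 4 buckets):

```sql
-- Read roughly one quarter of the rows by sampling bucket 1 of 4.
-- Because the sampling column (age) is the bucketing column and the
-- table has 4 buckets, Hive can read just the matching bucket file
-- instead of scanning the whole table.
SELECT id, name, age
FROM student_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON age);
```

Bucket numbers in TABLESAMPLE start at 1. Sampling on a column other than the bucketing column still works, but it requires a full scan.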

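2. Bucketing can speed up joins, provided both tables are bucketed on the join column and have the same number of buckets (or bucket counts where one is a multiple of the other)

select a.id, a.age, b.name from a join b on a.id = b.id;

If a and b are both bucketed on id, Hive does not need to match each row of a against all of b. Rows with the same id fall into the corresponding bucket in both tables, so each bucket of a only has to be joined with the matching bucket of b.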
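A bucketed map join along these lines might be requested as shown below. This is a sketch under assumptions: a and b are both bucketed on id with compatible bucket counts, and b is small enough per bucket to be loaded into memory. Some Hive versions ignore the MAPJOIN hint by default, so it is worth confirming the chosen plan with EXPLAIN.

```sql
-- Ask Hive to join bucket-by-bucket instead of loading all of b
-- for every mapper.
SET hive.optimize.bucketmapjoin = true;

-- The hint names b as the table to be loaded into memory, one bucket
-- at a time, while a is streamed.
SELECT /*+ MAPJOIN(b) */ a.id, a.age, b.name
FROM a JOIN b ON a.id = b.id;
```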
