Hive分区表与分桶表详解

Natsu爱学习

已于 2023-09-14 19:14:01 修改

阅读量69

点赞数

分类专栏： hive 文章标签： hive hadoop 数据仓库

于 2023-09-14 19:13:19 首次发布

本文链接：https://blog.csdn.net/Natsu_natsu/article/details/132885582

版权

hive 专栏收录该内容

3 篇文章 0 订阅

订阅专栏

一:分区表

1.概念

Hive 中的表对应为 HDFS 上的指定目录，在查询数据时候，默认会对全表进行扫描，这样时间和性能的消耗都非常大。

分区为 HDFS 上表目录的子目录，数据按照分区存储在子目录中。如果查询的 where 字句的中包含分区条件，则直接从该分区去查找，而不是扫描整个表目录，合理的分区设计可以极大提高查询速度和性能。

2.基本语法

create table if not exist tb1 (
  id int.
  name string,
  gender string
)partition by (gender string)

3.加载数据语法

3.1:上传数据

load data inpath 'hdfs文件系统文件目录' into table tb1
partition (gender = 'male')

3.2:写入数据

insert overwrite table t1 partition(gender='female')
select * from t1 where gender = 'male'

4.分区表相关操作

4.1展示分区

show partitions tb1

4.2修改分区

-- 增加分区
alter table t1
add partition(day='')

-- 删除分区
alter table t1
drop partition(day='')

5.动态分区

5.1概念

动态分区是分区在数据插入的时候,根据某一列的列值动态生成.

往hive分区表中插入数据时，如果需要创建的分区很多，比如以表中某个字段进行分区存储，则需要复制粘贴修改很多sql去执行，效率低。因为hive是批处理系统，所以hive提供了一个动态分区功能，其可以基于查询参数的位置去推断分区的名称，从而建立分区。

静态分区与动态分区的主要区别在于静态分区是手动指定，而动态分区是通过数据来进行判断。详细来说，静态分区的列实在编译时期，通过用户传递来决定的；动态分区只有在SQL执行时才能决定。

5.2相关参数

-- 动态分区开关(默认true,开启)
set hive.exec.dynamic.partition=true

-- 开启mode非严格模式(默认strict严格模式)
set hive.exec.dynamic.partition.mode=nonstrict

5.3分区表实战

以此student表为例,要将age和gender分区

(1)首先,创建一个studenttp表,以展示student.txt所有内容

create table studenttp(
    id int,
    name string,
    age int,
    gender string,
    hobbies array<string>,
    address map<string,string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

(2)加载本地文件到此表,此时可以查看表数据

load data local inpath '/opt/kb23/student2.txt' into table studenttp;
select * from studenttp;

(3)创建一张新表,表内包含studenttp表内除了age和gender字段的所有字段,并按age和gender分区

create table studenttp1(
    id int,
    name string,
    hobbies array<string>,
    address map<string,string>
)
partitioned by (age int,gender string)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

(4)开启动态分区和mode等级

set hive.exec.dynamic.partition=true; 
set hive.exec.dynamic.partition.mode=nonstrict;

(5)将studenttp表的数据通过分区插入到studenttp1

insert into studenttp1 partititon(age,gender)
select id,name,hobbies,address,age,gender from studenttp;

(6)此时就能看到studenttp1表的数据和分区了

select * from studenttp1;
show partitions studenttp1;

二:分桶表

1.概念

分区提供了一个隔离数据和优化查询的可行方案，但是并非所有的数据集都可以形成合理的分区，分区的数量也不是越多越好，过多的分区条件可能会导致很多分区上没有数据。同时 Hive 会限制动态分区可以创建的最大分区数，用来避免过多分区文件对文件系统产生负担。鉴于以上原因，Hive 还提供了一种更加细粒度的数据拆分方案：分桶表 (bucket Table)。

分桶表会将指定列的值进行哈希散列，并对 bucket（桶数量）取余，然后存储到对应的 bucket（桶）中。

2.基本语法

create table t2(
  id int,
  name string
)clustered by(id)
into 4 buckets;

3.创建分区分桶排序表语法

create table t2(
  id int,
  name string,
  gender string
)partition by(gender) clustered by(id) sorted by(id)
into 4 buckets;

Natsu爱学习

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
1
评论
Hive分区表与分桶表详解

分区与分桶表
复制链接

扫一扫

专栏目录

Hive分区表与分桶表详解

一:分区表

1.概念

2.基本语法

3.加载数据语法

4.分区表相关操作

5.动态分区

二:分桶表

1.概念

2.基本语法

3.创建分区分桶排序表语法

“相关推荐”对你有帮助么？