Hive: Partitioned Tables and Bucketed Tables

勤奋的ls丶

Category: Hive. Tags: hive, etl, hadoop

Article link: https://blog.csdn.net/lslslslslss/article/details/122101411

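I. Partitioned Tables

A partitioned table in Hive is simply a table split into directories: each partition corresponds to a separate directory on HDFS that holds that partition's data files. Partitioning divides one large dataset into several smaller ones according to some condition, so a query that filters on the partition column only has to read the matching directories.

1. Basic operations on partitioned tables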

-- Create a partitioned table. The partition column (day) must not also appear in the regular column list.
create table dept_partition(
    deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
-- Load data into a specific partition
load data local inpath '/opt/module/hive/datas/dept_20211223.log' into table dept_partition partition(day='20211223');
-- Query the data in one partition
select * from dept_partition where day='20211223';
-- List the partitions of the table
show partitions dept_partition;
-- Add partitions
-- Add a single partition
alter table dept_partition add partition(day='20211224');
-- Add several partitions at once (separate the partition specs with spaces, not commas)
alter table dept_partition add partition(day='20211225') partition(day='20211226');
-- Drop partitions
-- Drop a single partition
alter table dept_partition drop partition (day='20211224');
-- Drop several partitions at once (the partition specs must be separated by commas)
alter table dept_partition drop partition (day='20211225'), partition(day='20211226');
-- Show the table structure, including its partition columns
desc formatted dept_partition;
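
To make the "one partition, one directory" idea concrete, here is a minimal sketch that inspects where a partition is stored. It assumes the `dept_partition` table and its `day='20211223'` partition from the statements above already exist. The warehouse path in the `dfs` command is an assumption based on the common default of `hive.metastore.warehouse.dir` and may differ on your cluster.

```sql
-- Show the metadata of a single partition; the "Location" row in the
-- output is the HDFS directory backing that partition.
desc formatted dept_partition partition(day='20211223');

-- List the table directory from inside the Hive CLI. Each partition
-- appears as a subdirectory named day=<value>.
-- NOTE: /user/hive/warehouse is an assumed warehouse root; adjust as needed.
dfs -ls /user/hive/warehouse/dept_partition;

-- Show the query plan for a query that filters on the partition column.
explain select * from dept_partition where day='20211223';
```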

2. Creating a two-level partitioned table

-- Create the table and declare two partition columns
create table dept_partition2(deptno int, dname string, loc string)
      partitioned by (day string, hour string)
      row format delimited fields terminated by '\t';
-- Load data into one (day, hour) partition
load data local inpath '/opt/module/hive/datas/dept_20211223.log' into table
dept_partition2 partition(day='20211223', hour='12');
-- Query that partition
select * from dept_partition2 where day='20211223' and hour='12';
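
With two partition columns, each `day` directory contains one subdirectory per `hour`. The following sketch, which assumes the `dept_partition2` table above exists, shows a few ways to work with the nested levels. The hour value `'13'` is only an illustrative choice.

```sql
-- Add another hour under the same day; both partition columns must be given.
alter table dept_partition2 add partition(day='20211223', hour='13');

-- List every partition; each one is shown as day=.../hour=...
show partitions dept_partition2;

-- List only the partitions of one day by giving a partial partition spec.
show partitions dept_partition2 partition(day='20211223');

-- Filtering on the first-level column alone selects all hours of that day.
select * from dept_partition2 where day='20211223';
```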

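Data repair

If data files are uploaded directly into a partition directory on HDFS, the Hive metastore has no record of the new partition, so queries cannot see the data. Run the following command to resynchronize the partition metadata with the directories on HDFS: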

msck repair table dept_partition;
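
The situation can be reproduced with the sketch below, which uploads a file behind Hive's back and then registers the partition in two alternative ways. The warehouse path is an assumption (the common default of `hive.metastore.warehouse.dir`), and the partition value `20211227` is an arbitrary example.

```sql
-- Create the partition directory and upload a file directly on HDFS,
-- bypassing Hive. NOTE: the warehouse root below is an assumption.
dfs -mkdir -p /user/hive/warehouse/dept_partition/day=20211227;
dfs -put /opt/module/hive/datas/dept_20211223.log /user/hive/warehouse/dept_partition/day=20211227;

-- The metastore has no record of day=20211227 yet, so this query is
-- expected to return no rows at this point.
select * from dept_partition where day='20211227';

-- Option 1: scan the table directory and register every missing partition.
msck repair table dept_partition;

-- Option 2: register just this one partition explicitly.
alter table dept_partition add if not exists partition(day='20211227');
```

Option 2 is usually the cheaper choice when you know exactly which partition was added, since `msck repair` has to walk the whole table directory.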

3. Dynamic partitioning

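Method 1: insert ... select with dynamic partitioning configured

(1) Enable dynamic partitioning (default: true, i.e. enabled).

set hive.exec.dynamic.partition=true;

(2) Switch to non-strict mode. The default mode, strict, requires at least one partition column to be given a static value; nonstrict allows every partition column to be dynamic.

set hive.exec.dynamic.partition.mode=nonstrict;

(3) The maximum number of dynamic partitions that can be created in total across all nodes running the MR job. Default: 1000.

set hive.exec.max.dynamic.partitions=1000;

(4) The maximum number of dynamic partitions that can be created on each node running the MR job. Set this according to the actual data. For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter must be raised to at least 365; with the default of 100 the job fails.

set hive.exec.max.dynamic.partitions.pernode=100;

(5) The maximum number of HDFS files that can be created by the whole MR job. Default: 100000.

set hive.exec.max.created.files=100000;

(6) Whether to throw an exception when an empty partition is generated. This usually does not need to be set. Default: false.

set hive.error.on.empty.partition=false;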
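
The difference between the two modes in step (2) can be sketched as follows. The example assumes the `dept_partition2` table created earlier and a source table `dept` with columns `deptno`, `dname` and `loc`; the literal day and hour values are purely illustrative.

```sql
-- Strict mode: at least one partition column must be static.
set hive.exec.dynamic.partition.mode=strict;

-- day is static, hour is dynamic. Static partition columns must come
-- before dynamic ones, and the dynamic value is taken from the last
-- column of the select list.
insert into table dept_partition2 partition(day='20211223', hour)
select deptno, dname, loc, '12' as hour from dept;

-- Non-strict mode: every partition column may be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;

-- Both day and hour come from the last two select columns, matched by
-- position in the order the partition columns were declared.
insert into table dept_partition2 partition(day, hour)
select deptno, dname, loc, '20211224' as day, '12' as hour from dept;
```

Dynamic partition values are matched by position, not by name, so the order of the trailing select columns matters.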

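Method 2: load data (Hive 3.x)

In Hive 3.x, load data can also populate partitions dynamically. When the target table is partitioned but the load statement names no partition, Hive rewrites the load as an insert ... select and takes the partition values from the last columns of the file.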

-- Create the table
create table dept_partition3(
    deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
-- Method 1: insert ... select; the last select column (day) supplies the partition value
insert into table dept_partition3 partition(day) select deptno, dname, loc, day from dept1;
-- Method 2 (Hive 3.x): load without a partition clause; the last field of each row in the file is used as the partition value
load data local inpath '/opt/module/hive/datas/dept_20211223.log' into table dept_partition3;
-- Two-level partitioning
create table dept_partition_dy2(id int) partitioned by (name string, loc int) row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive/datas/dept.txt' into table dept_partition_dy2;

insert into table dept_partition_dy2 partition(name, loc) select deptno, dname, loc from dept;
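
After a dynamic-partition insert it is worth checking which partitions were actually created. A minimal sketch, assuming the `dept_partition_dy2` table above has been populated:

```sql
-- List the (name, loc) partitions that the dynamic insert created.
show partitions dept_partition_dy2;

-- Count the rows in each partition to see how the data was distributed.
select name, loc, count(*) as row_count
from dept_partition_dy2
group by name, loc;
```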

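II. Bucketed Tables

Bucketing breaks a dataset into a number of more manageable parts.

Partitioning works on the storage path of the data (directories); bucketing works on the data files themselves.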

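Bucketing rule:

Hive hashes the value of the bucketing column and takes the remainder after dividing by the number of buckets; that remainder decides which bucket a row is stored in.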
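
The hash-then-modulo rule can be illustrated with Hive's built-in `hash` and `pmod` functions. This is only a sketch of the idea: the ids are made-up sample values, and the hash function Hive uses internally to place rows may differ depending on the Hive version and the table's bucketing settings, so the numbers computed here are not guaranteed to match the actual bucket files.

```sql
-- Illustrative only: compute a bucket number for a few sample ids
-- using hash-then-modulo with 4 buckets.
select id,
       pmod(hash(id), 4) as bucket_no  -- pmod keeps the remainder non-negative
from (
    select explode(array(1001, 1002, 1003, 1004, 1005)) as id
) t;
```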

Basic operations

-- Create a bucketed table with 4 buckets on id
create table stu_buck(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
-- Load data into the bucketed table
load data local inpath '/opt/module/hive/datas/student.txt' into table stu_buck;
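
Depending on the Hive version and settings, `load data` may simply copy the file into the table directory without distributing rows across buckets. A commonly used alternative, sketched here, is to load into a plain staging table first and then populate the bucketed table with insert ... select. The staging table name `stu` is hypothetical.

```sql
-- Hypothetical plain staging table with the same columns as stu_buck.
create table stu(id int, name string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive/datas/student.txt' into table stu;

-- insert ... select runs a job that distributes the rows across the
-- 4 buckets according to id.
insert into table stu_buck select id, name from stu;

-- Read the rows of one bucket out of the four.
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```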

Creating a table that is both partitioned and bucketed

create table stu_buck_part(id int, name string)
partitioned by (day string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';
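
A table like this can be filled one partition at a time; within each partition directory the rows are then split into bucket files by `id`. A minimal sketch, assuming `stu_buck` from the previous example already holds data and using an illustrative day value:

```sql
-- Write into a single partition; rows inside it are bucketed by id.
insert into table stu_buck_part partition(day='20211223')
select id, name from stu_buck;

-- Confirm the partition exists and read it back.
show partitions stu_buck_part;
select * from stu_buck_part where day='20211223';
```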
