Data Warehouse (8): Hive Performance Optimization: Hive Dynamic Partitioning

张小凡vip

Category: Data Warehouse. Tags: hive, performance optimization, dynamic partitioning

Article link: https://blog.csdn.net/zzq900503/article/details/79385581

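The previous article covered how to partition a table by hand (static partitioning):
Data Warehouse (7): Hive Performance Optimization: Hive Partitioning and Bucketing
However, once a table is partitioned, newly inserted data is not routed to partitions automatically; the target partition has to be specified by hand on every insert.
In a relational database such as Oracle, inserting into a partitioned table places each row in the matching partition automatically, based on the value of the partition column. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but it has to be enabled through configuration first.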

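Partition types

There are two kinds of partitioning:
Static partitioning (static partition)
Dynamic partitioning (dynamic partition)
The difference lies in how the partition is chosen when data is loaded: with static partitioning you type the partition name by hand, while with dynamic partitioning Hive derives the partition from the data itself. For bulk loads of large data sets, dynamic partitioning is clearly simpler and more convenient.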

Configuring dynamic partitioning

Change Hive's default settings to enable dynamic partitioning:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Other parameters

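hive.exec.dynamic.partition
Default: false (before Hive 0.9.0; true from Hive 0.9.0 onward)
Whether dynamic partitioning is enabled.
This parameter must be true to use dynamic partitioning.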

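hive.exec.dynamic.partition.mode
Default: strict
The dynamic partitioning mode. strict requires at least one partition column to be static; nonstrict allows every partition column to be dynamic.
It usually needs to be set to nonstrict.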

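hive.exec.max.dynamic.partitions.pernode
Default: 100
The maximum number of dynamic partitions that each node running the MR job (each mapper or reducer) may create.
Set this according to the actual data.
For example, if the source data covers one year, so that the day column has 365 distinct values, this parameter must be set higher than 365; with the default of 100 the job fails with an error.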

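hive.exec.max.dynamic.partitions
Default: 1000
The maximum total number of dynamic partitions that may be created across all nodes running the MR job.
The same sizing advice as for the previous parameter applies.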

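hive.exec.max.created.files
Default: 100000
The maximum number of HDFS files that the whole MR job may create.
The default is usually sufficient; raise it only if the data volume is large enough that more than 100000 files need to be created.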

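hive.error.on.empty.partition
Default: false
Whether to throw an exception when a dynamic partition insert produces an empty partition.
This usually does not need to be changed.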
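
As a rough sketch of the sizing advice for the one-year example above, the per-node limit could be raised as follows. The table name t_log_info and the value 400 are assumptions for illustration only; count the distinct partition values in your own data first.

```sql
-- Hypothetical source table t_log_info with a day column (not from the original article).
-- Step 1: find out how many partitions the load will create.
select count(distinct day) from t_log_info;

-- Step 2: enable dynamic partitioning and raise the per-node limit above that count.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=400;

-- The total limit (hive.exec.max.dynamic.partitions, default 1000) already covers
-- 365 partitions, so it is left unchanged in this sketch.
```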

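Examples

Static partitioning

Create a statically partitioned table t_student, then load the source table t_student_info_25 into the partition age=25. With 30-odd age groups, a similar statement has to be run 30-odd times, once per group. Because the partition value is fixed in the PARTITION clause, the SELECT list contains only the non-partition columns.
create table if not exists t_student(id int,name string,tel string) partitioned by(age int) row format delimited fields terminated by ',' stored as textfile;

-- overwrite replaces the existing data; into appends
insert into table t_student partition(age=25) select id,name,tel from t_student_info_25;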
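
To make the repetition concrete, here is a sketch of loading two more age groups statically. The source tables t_student_info_26 and t_student_info_27 are hypothetical, named by analogy with t_student_info_25.

```sql
-- One statement per partition: the partition value is hard-coded each time.
insert into table t_student partition(age=26) select id,name,tel from t_student_info_26;
insert into table t_student partition(age=27) select id,name,tel from t_student_info_27;

-- List the partitions created so far; each one is shown as age=<value>.
show partitions t_student;
```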

Dynamic partitioning

Set to true to enable dynamic partitioning (the default is false):
set hive.exec.dynamic.partition=true;

Set to nonstrict to allow all partition columns to be dynamic (the default is strict):
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite replaces the existing data; insert into appends:
insert overwrite table t_student partition(age) select id,name,tel,age from t_student_info;
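
After the dynamic insert, one way to check the result is to list the partitions and query one of them. This is a minimal sketch; the value 25 is only an example and assumes t_student_info contains rows with that age.

```sql
-- Expect one partition per distinct age value found in t_student_info.
show partitions t_student;

-- Filtering on the partition column lets Hive read only the matching partition directory.
select id,name,tel from t_student where age=25;
```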

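Notes

The columns selected from the source table must be in the same order as the columns of the partitioned table. Hive matches them by position rather than by name, and a partitioned table places its partition columns last, so the dynamic partition columns must come last in the SELECT list. Inserting with select * directly will misalign the columns unless the source table happens to use the same column order.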
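
A small sketch of the column-order pitfall, assuming a hypothetical source table t_student_src whose columns are declared as (id, age, name, tel), so that age is not last:

```sql
-- Risky: select * emits (id, age, name, tel), so tel would be used as the partition
-- value and the remaining columns would shift by one position.
-- insert overwrite table t_student partition(age) select * from t_student_src;

-- Safer: list the columns explicitly, with the partition column last.
insert overwrite table t_student partition(age)
select id, name, tel, age from t_student_src;
```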

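If the type of a selected column differs from the type of the target column, Hive converts the value, but the conversion is not guaranteed to succeed. An int selected into a string column converts cleanly. A string selected into an int column may fail, because a value containing letters cannot be converted to int; values that fail to convert become NULL.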
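
The conversion behavior can be tried directly with cast. The second statement is a sketch of screening rows before loading; t_student_raw and its string column age_str are hypothetical names.

```sql
-- A numeric string converts; a string containing letters yields NULL rather than an error.
-- (A select without FROM requires Hive 0.13 or later.)
select cast('25' as int), cast('abc' as int);

-- Keep only rows whose age text converts to int before inserting.
insert overwrite table t_student partition(age)
select id, name, tel, cast(age_str as int) as age
from t_student_raw
where cast(age_str as int) is not null;
```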

Mixing static and dynamic partitions

All partitions dynamic (DP, dynamic partition)

INSERT OVERWRITE TABLE t_student PARTITION (time, age)  
SELECT id, name, time, age FROM t_student_info WHERE time is not null and age>10;
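
The statements in this section insert into a table partitioned by both time and age, which differs from the single-level t_student created earlier. A sketch of a matching definition follows; the name t_student_tp is hypothetical (chosen to avoid clashing with the earlier table), and the column types mirror the CTAS example later in the article.

```sql
-- Two-level partitioning: time is the outer directory level, age the inner one,
-- giving paths of the form .../time=2018-02-27/age=11/ under the table directory.
-- Backticks are used around time in case your Hive version treats it as a reserved keyword.
create table if not exists t_student_tp (id int, name string)
partitioned by (`time` string, age int)
row format delimited fields terminated by ',' stored as textfile;
```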

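Combining DP and SP

Here time is a static partition (SP) and age is a dynamic partition (DP). The static value comes from the PARTITION clause, so time is left out of the SELECT list; only the dynamic partition column age is selected, in last position.
INSERT OVERWRITE TABLE t_student PARTITION (time='2018-02-27', age)
SELECT id, name, age FROM t_student_info WHERE time is not null and age>10;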
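
Because this form keeps one static partition column, it should also satisfy strict mode as described for hive.exec.dynamic.partition.mode. A sketch of that check, with the filter narrowed to the day being loaded (an assumption about the intent) and backticks around time in case your Hive version reserves that word:

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;

-- Accepted in strict mode: the leading partition column time is static.
insert overwrite table t_student partition (`time`='2018-02-27', age)
select id, name, age from t_student_info where `time`='2018-02-27' and age>10;

-- List only the sub-partitions under the static value.
show partitions t_student partition (`time`='2018-02-27');
```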

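Note
When the SP is a sub-partition of a DP, the following DML fails. The order of the partition columns determines the directory hierarchy in HDFS, and that cannot be changed, so a static partition may not appear below a dynamic one.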

INSERT OVERWRITE TABLE t_student PARTITION (time, age = 11)  
SELECT id, name, time,age FROM t_student_info WHERE time is not null and age=11;
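
Two ways to express the intent of the failing statement without placing a static partition below a dynamic one are sketched here. The first requires nonstrict mode; the date in the second is only an example value.

```sql
-- Option 1: make both columns dynamic and restrict age with a filter.
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table t_student partition (`time`, age)
select id, name, `time`, age from t_student_info where `time` is not null and age=11;

-- Option 2: fix the outer level and leave the inner level dynamic.
insert overwrite table t_student partition (`time`='2018-02-27', age)
select id, name, age from t_student_info where `time`='2018-02-27' and age=11;
```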

Multi-table insert

The source table is named once in the leading FROM clause, and each INSERT branch supplies its own SELECT list and WHERE filter. Static partition columns are left out of each SELECT list.
FROM t_student_info
INSERT OVERWRITE TABLE t_student PARTITION (time='2018-02-27', age)
SELECT id, name, age WHERE time is not null and age>10
INSERT OVERWRITE TABLE t_student_12 PARTITION (time='2018-02-27', age=12)
SELECT id, name WHERE time is not null and age = 12;

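CTAS
With CREATE TABLE AS SELECT (CTAS), the syntax for DP and SP differs slightly from the usual form, because the schema of the target table cannot be fully derived from the SELECT statement. The partition columns therefore have to be declared in the CREATE statement. This syntax follows the design of the dynamic partition feature; at least some Hive releases reject CTAS into a partitioned target table, so check the behavior of your version before relying on it.

CREATE TABLE t_student (id int, name string) PARTITIONED BY (time string, age int) AS
SELECT id, name, time, age+1 age1 FROM t_student_info WHERE time is not null and age>10;

The statement above shows CTAS with DP. To use a constant of your own for a partition column, write it like this:

CREATE TABLE t_student (id int, name string) PARTITIONED BY (time string, age int) AS
SELECT id, name, "2018-02-27", age+1 age1 FROM t_student_info WHERE time is not null and age>10;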
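
If your Hive version does not accept CTAS with partition columns, the same result can be reached in two steps: create the partitioned table, then fill it with a dynamic partition insert. A sketch, using the hypothetical table name t_student_plus1 and backticks around time in case it is a reserved keyword in your version:

```sql
create table if not exists t_student_plus1 (id int, name string)
partitioned by (`time` string, age int);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Dynamic values for both levels; age+1 becomes the age partition value.
insert overwrite table t_student_plus1 partition (`time`, age)
select id, name, `time`, age+1 as age1 from t_student_info where `time` is not null and age>10;

-- Constant for the outer level: make it static and select only the dynamic column.
insert overwrite table t_student_plus1 partition (`time`='2018-02-27', age)
select id, name, age+1 as age1 from t_student_info where `time` is not null and age>10;
```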