Hive in Practice (4): Handling Dynamic Partition Errors When Importing Large Data Volumes into Hive

sheep8521

Article link: https://blog.csdn.net/sheep8521/article/details/105974927

I. When to Use Partitioned Tables

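Partitioning is a common technique when working with large fact tables.
Its benefit is that it narrows the range a query has to scan, which speeds the query up.
There are two kinds of partitioning: static partition and dynamic partition.
They differ in how the partition is chosen when data is loaded: with a static partition you type the partition name by hand, while with a dynamic partition Hive derives the partition from the data itself. For bulk imports of large data sets, dynamic partitioning is clearly the simpler and more convenient option.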
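
The difference is easiest to see side by side. The following is a minimal sketch, not taken from the original migration: the tables `demo.events_raw` and `demo.events` and their columns are hypothetical names used only for illustration.

```sql
-- Hypothetical tables, assumed to exist for this illustration:
--   demo.events_raw (id string, payload string, event_day string)
--   demo.events     (id string, payload string) partitioned by (dt string)

-- Static partition: the partition value is written by hand in the statement.
insert overwrite table demo.events partition (dt='20200507')
select id, payload
from demo.events_raw
where event_day = '20200507';

-- Dynamic partition: Hive takes the partition value from the last column
-- of the select list, so one statement can fill many partitions.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table demo.events partition (dt)
select id, payload, event_day as dt
from demo.events_raw;
```

With the static form, one statement is needed per partition; the dynamic form loads every `dt` value found in the source in a single pass.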

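1. The overall data flow
The problem came up while importing historical HBase data into Phoenix during a migration. We used the hive2phoenix approach: export the HBase data in batches to a designated NAS path, then create a Hive external table over those files, declared with the same field delimiter.
If the data files are very large, it is advisable to compress them before storing them on the NAS. The compression formats Linux supports out of the box are gzip and bzip2. For details, see:
Hive in Practice (3): Hive compression formats and importing compressed files into Hive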

1) Create an external table mapped onto the NAS files (the delimiter must match the one used in the NAS files)
create external table temp.tmp_hive1
(statis_day string,
…
)row format delimited fields terminated by '|'
;

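2) Because the Phoenix table is required to be partitioned, the data on the Hive side is kept partitioned in the same way
First, create the partitioned table we need:
create table temp.temp_kefu_user_visit_1h_delta_hourly
(statis_day string,
…
) partitioned by (dt string,hour string)
;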

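3) Load the data from the hive1 table into the partitioned table
Next, change Hive's default settings to enable dynamic partitioning:

-- Parameter details:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- nonstrict mode is needed only when every partition column is dynamic; strict mode requires at least one static partition column.
-- Then load the data with Hive's insert statement. Note that the dynamic partition columns must follow all the regular columns in the select list.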

II. The Dynamic Partition Error

Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions.
The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode.
Maximum was set to: 1000

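The import creates more dynamic partitions than the configured maximum allows, so the parameters named in the error message have to be raised.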

-- The full script that was run
set mapreduce.job.name=temp.temp_kefu_user_visit_1h_delta_hourly_partition_0_again;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table temp.temp_kefu_user_visit_1h_delta_hourly_partition_0 partition (dt,hour) 
select 
statis_day 
,search_time   
,serv_number   
,prov_id       
,region_id     
,node_id       
,sup_node_id   
,url_detail    
,sup_url_detail
,client_id     
,chn_id        
,chn_id_source 
,cp_id         
,cp_name       
,node_type     
,net_type      
,term_type     
,gate_ip       
,session_id    
,page_id       
,term_prod_id  
,business_id   
,sub_busi_id   
,virt_busi_id  
,client_code   
,rowkey 
,case when nvl(statis_day, '') = '' then '2018' else statis_day end as dt
,case when nvl(substr(search_time,9,2), '') = '' then '00' else substr(search_time,9,2) end as hour
from temp.temp_kefu_user_visit_1h_delta_hourly;

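For a small source Hive table, enabling dynamic partitioning and keeping the default parameter values should be enough for the insert to succeed.
If the data volume is large, or the statement above fails, read on.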
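
Before raising any limit, it can help to estimate how many partitions the insert would create. The query below is a sketch that reuses the `dt` and `hour` expressions from the insert statement in this post and counts their distinct combinations. The alias `partition_cnt` is introduced here for illustration, and the query itself scans the whole source table.

```sql
-- Count the distinct (dt, hour) pairs the dynamic-partition insert would produce.
select count(*) as partition_cnt
from (
  select distinct
    case when nvl(statis_day, '') = '' then '2018' else statis_day end as dt,
    case when nvl(substr(search_time,9,2), '') = '' then '00' else substr(search_time,9,2) end as hour
  from temp.temp_kefu_user_visit_1h_delta_hourly
) p;
```

If the result is above the limit reported in the error message (1000 in the log shown earlier), the insert cannot succeed until the limits are raised.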

III. Parameter Settings and Code Optimization

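In Hive you sometimes want results to be written automatically to different directories according to a key in the input. Dynamic partitioning does this by treating each key as a partition. In that case, use distribute by to limit the number of files that get generated.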
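
A minimal sketch of the pattern follows. The tables `demo.events_raw` and `demo.events` and their columns are hypothetical, used only for illustration. The idea is that without distribute by, every task may write a file into every partition it encounters, so the file count can approach tasks times partitions. With distribute by on the partition key, all rows of one partition are sent to the same reducer.

```sql
-- Hypothetical tables, assumed to exist for this illustration:
--   demo.events_raw (id string, payload string, event_day string)
--   demo.events     (id string, payload string) partitioned by (dt string)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table demo.events partition (dt)
select id, payload, dt
from (
  -- The partition key is computed in the subquery so that it can be
  -- both selected and used by distribute by in the outer query.
  select id, payload, event_day as dt
  from demo.events_raw
) t
distribute by dt;
```

Note that the distribute by key (`dt` here) also appears in the select list, as the last column, which is where Hive expects the dynamic partition column.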

But that alone is not enough. When the number of dynamic partitions may be very large, a few more settings need adjusting:

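-- Maximum number of dynamic partitions that each node (mapper or reducer) may create; in the worst case it should equal the overall maximum
hive.exec.max.dynamic.partitions.pernode
-- Maximum total number of dynamic partitions
hive.exec.max.dynamic.partitions
-- Maximum number of files that may be created (more partitions inevitably means more files)
hive.exec.max.created.files
-- Finally, note that the select statement must also select the distribute by keys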

The working code is as follows:

set mapreduce.job.name=temp.temp_kefu_user_visit_1h_delta_hourly_partition_0_two;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
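-- nonstrict mode is needed only when every partition column is dynamic; strict mode requires at least one static partition column.
-- Then load the data with Hive's insert statement. Note that the dynamic partition columns must follow all the regular columns in the select list.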
set hive.exec.max.dynamic.partitions.pernode=600000;
set hive.exec.max.dynamic.partitions=6000000;
set hive.exec.max.created.files=6000000;

insert overwrite table temp.temp_kefu_user_visit_1h_delta_hourly_partition_0 partition (dt,hour) 
select
statis_day 
,search_time   
,serv_number   
,prov_id       
,region_id     
,node_id       
,sup_node_id   
,url_detail    
,sup_url_detail
,client_id     
,chn_id        
,chn_id_source 
,cp_id         
,cp_name       
,node_type     
,net_type      
,term_type     
,gate_ip       
,session_id    
,page_id       
,term_prod_id  
,business_id   
,sub_busi_id   
,virt_busi_id  
,client_code   
,rowkey
,dt
,hour
from
(select 
statis_day 
,search_time   
,serv_number   
,prov_id       
,region_id     
,node_id       
,sup_node_id   
,url_detail    
,sup_url_detail
,client_id     
,chn_id        
,chn_id_source 
,cp_id         
,cp_name       
,node_type     
,net_type      
,term_type     
,gate_ip       
,session_id    
,page_id       
,term_prod_id  
,business_id   
,sub_busi_id   
,virt_busi_id  
,client_code   
,rowkey 
,case when nvl(statis_day, '') = '' then '2018' else statis_day end as dt
-- Dynamic partition columns may be computed with functions. The result is one large partitioned Hive table.
,case when nvl(substr(search_time,9,2), '') = '' then '00' else substr(search_time,9,2) end as hour
from temp.temp_kefu_user_visit_1h_delta_hourly
) t
distribute by dt,hour

;

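The map phase as a whole runs quickly. During execution a single mapper is started but a great many reducers. This is the effect of distribute by, which splits the rows across reducers according to the partition keys.
The computation itself does not take long. Once it finishes, Stage-0 (Starting task [Stage-0:MOVE] in serial mode) moves the results into the partitioned table's directory on HDFS. Because there are as many files as reducers, the move takes a fairly long time, so be patient and wait for the job to complete.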

INFO  : Starting Job = job_1561342945833_12383421, Tracking URL = http://ddp-nn-01.cmdmp.com:8088/proxy/application_1561342945833_12383421/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/bin/hadoop job  -kill job_1561342945833_12383421
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 20000
......
INFO  : 2020-05-07 16:44:12,109 Stage-1 map = 100%,  reduce = 96%, Cumulative CPU 39204.32 sec
INFO  : 2020-05-07 16:44:16,612 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 39608.4 sec
INFO  : 2020-05-07 16:44:21,157 Stage-1 map = 100%,  reduce = 98%, Cumulative CPU 39932.05 sec
INFO  : 2020-05-07 16:44:24,536 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 40408.81 sec
INFO  : 2020-05-07 16:44:31,246 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 40889.07 sec
INFO  : MapReduce Total cumulative CPU time: 0 days 11 hours 21 minutes 29 seconds 70 msec
INFO  : Ended Job = job_1561342945833_12383421
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table temp.temp_kefu_user_visit_1h_delta_hourly_partition_0 partition (dt=null, hour=null) from
<HDFS root directory of the partitioned table>/temp.db/temp_kefu_user_visit_1h_delta_hourly_partition_0/.hive-staging_hive_2020-05-07_16-35-45_383_5179510197282700947-417785/-ext-10000
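
Once the MOVE stage has finished, the result can be checked with a couple of quick queries. This is a sketch of one possible check, not part of the original run. The alias `row_cnt` and the `limit 20` are arbitrary choices made here.

```sql
-- List the partitions that the dynamic-partition insert created.
show partitions temp.temp_kefu_user_visit_1h_delta_hourly_partition_0;

-- Spot-check row counts for the first few (dt, hour) partitions.
select dt, hour, count(*) as row_cnt
from temp.temp_kefu_user_visit_1h_delta_hourly_partition_0
group by dt, hour
order by dt, hour
limit 20;
```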