前言
大家好,我是明哥!
本片博文是“大数据线上问题排查系列”大类别之一,讲述前段时间我司某产品在某券商遇到的一个问题及解决方案,其背后涉及到 hive 的一个 BUG,在 hive 3.0 才修复。以下是正文。
问题现象
cdh6.2.1中,开启 kerberos 和 sentry 的hive 中,使用 dml 语句 insert overwrite 插入数据到分区表的新分区时,会报错。同时注意到:
-
查看底层 hdfs 发现对应的目录和数据都有了;
-
使用 show partitions xxx 查看不到对应的新分区;
-
使用 select 语句也查不到该新分区下的数据;
-
只要是插入数据到新的分区,不管使用静态分区还是动态分区模式,都会报错;
问题分析
为确定问题,首先尝试了不同场景,发现如下现象:
-
当使用 insert overwrite 插入数据到分区表的已有分区时,不会报错;(不管该分区下有没有数据);
-
使用 insert into 插入数据到分区表的新分区时,不会报错;
-
对于非分区表,insert overwrite 和 insert into 都不会报错;
-
当使用insert overwrite 插入数据到分区表的新分区报错后,可以使用 msck repair test0317 修复hive metastore中相关元数据,修复后 select, show partitions 等语句可以查询到分区和该分区的底层数据,没有问题;
为进一步分析问题,常看了相关日志,包括客户端日志,hiveserver2 日志和 hive metastore 日志分别如下所示。
客户端日志(beeline)
INFO : Loading data to table apollo_ods_jzfix.test0317 partition (ptdate=1) from hdfs://dev-dw-nn01:8020/user/hive/warehouse/apollo_ods_jzfix.db/test0317/ptdate=1/.hive-staging_hive_2021-03-17_15-09-13_232_1543365768355672834-7333/-ext-10000
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. org.apache.thrift.TApplicationException: Internal error processing fire_listener_event
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Cumulative CPU: 2.56 sec HDFS Read: 5344 HDFS Write: 609 HDFS EC Read: 0 SUCCESS
INFO : Total MapReduce CPU Time Spent: 2 seconds 560 msec
INFO : Completed executing command(queryId=hive_20210317150913_9d734c54-f0cf-4dc7-9117-bc7f59c2cb61); Time taken: 17.758 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. org.apache.thrift.TApplicationException: Internal error processing fire_listener_event (state=08S01,code=1)
hive metastore 日志
2021-03-17 15:09:30,039 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: [pool-9-thread-122]: ugi=hive/dev-dw-nn01@GYDW.COM ip=10.2.91.100 cmd=get_table : db=apollo_ods_jzfix tbl=test0317
2021-03-17 15:09:30,044 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: [pool-9-thread-122]: 121: get_partition : db=apollo_ods_jzfix tbl=test0317[1]
2021-03-17 15:09:30,044 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: [pool-9-thread-122]: ugi=hive/dev-dw-nn01@GYDW.COM ip=10.2.91.100 cmd=get_partition : db=apollo_ods_jzfix tbl=test0317[1]
2021-03-17 15:09:30,053 ERROR org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-9-thread-122]: NoSuchObjectException(message:partition values=[1])
at org.apache.hadoop.hive.metastore.ObjectStore.getPartition(ObjectStore.java:2003)
at sun.reflect.GeneratedMethodAccessor97.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
at com.sun.proxy.$Proxy26.getPartition(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partition(HiveMetaStore.java:3553)
at org.apache.hadoop.hive.metastore.events.InsertEvent.<init>(InsertEvent.java:62)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.fire_listener_event(HiveMetaStore.java:6737)
at sun.reflect.GeneratedMethodAccessor100.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:140)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy28.fire_listener_event(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$fire_listener_event.getResult(ThriftHiveMetastore.java:14208)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$fire_listener_event.getResult(ThriftHiveMetastore.java:14193)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBa