This morning an Oozie failure alert came in: a query over the previous day's data in one table had failed.
The exact error was:
2016-10-20 00:53:58,751 ERROR [main] SessionState (SessionState.java:printError(932)) - Status: Failed
2016-10-20 00:53:58,752 ERROR [main] SessionState (SessionState.java:printError(932)) - Vertex failed, vertexName=Map 1, vertexId=vertex_1476863283340_1590_1_00, diagnostics=[Task failed, taskId=task_1476863283340_1590_1_00_000018, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:
java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1476145958-10.50.23.210-1451352868941:blk_1084642778_10904425; getBlockSize()=2195997; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.50.23.215:50010,DS-2da851b4-66b1-443c-8434-2fd812182a5e,DISK], DatanodeInfoWithStorage[10.50.23.209:50010,DS-99ce413a-5ff0-47ae-a182-d1c57b0f5c30,DISK], DatanodeInfoWithStorage[10.50.23.217:50010,DS-0cc11cf2-404e-475a-9d56-e0e53987f264,DISK], DatanodeInfoWithStorage[10.50.23.214:50010,DS-699ba29c-3594-4b6b-8603-cde89a8c4f16,DISK]]}
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1476145958-10.50.23.210-1451352868941:blk_1084642778_10904425; getBlockSize()=2195997; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.50.23.215:50010,DS-2da851b4-66b1-443c-8434-2fd812182a5e,DISK], DatanodeInfoWithStorage[10.50.23.209:50010,DS-99ce413a-5ff0-47ae-a182-d1c57b0f5c30,DISK], DatanodeInfoWithStorage[10.50.23.217:50010,DS-0cc11cf2-404e-475a-9d56-e0e53987f264,DISK], DatanodeInfoWithStorage[10.50.23.214:50010,DS-699ba29c-3594-4b6b-8603-cde89a8c4f16,DISK]]}
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
	at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
	at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
	at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
	at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
	... 14 more
The message makes it clear that the target table contains an incomplete file. The day before, the Hadoop cluster's nodes had been adjusted and the HDFS service restarted. Files that Flume was writing to HDFS at that moment were never properly closed, so some of that day's files under the table are damaged. A simple query that happens not to touch the bad files succeeds without error, but any query that scans all of that day's data fails.
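To confirm the diagnosis, `hdfs fsck` can list files that are still open for write (i.e. never closed because of the restart) and map the block id from the stack trace back to the file that owns it. A sketch, with a placeholder warehouse path standing in for the real table directory:

```shell
# Placeholder -- substitute the table's actual warehouse directory.
TABLE_DIR=/apps/hive/warehouse/mydb.db/mytable

# Files under the table that are still open for write
# (left unclosed by the HDFS restart).
hdfs fsck "$TABLE_DIR" -files -openforwrite

# Map the block id from the error message back to its file.
hdfs fsck "$TABLE_DIR" -files -blocks | grep blk_1084642778
```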
Workaround: 1. Inspect the table's directory in HDFS and manually delete the files with a .tmp suffix.
2. Query the data around the service-restart window to narrow down the time range of the corruption, then manually delete the smallest file(s) within that range.
After each of the steps above, re-run a validation query against the table; once the table is queryable again, the fault can be considered resolved.
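The steps above can be sketched as follows; the table path, database, and partition value are placeholders, and the `.tmp` suffix is the default left behind by the Flume HDFS sink when a file is not closed cleanly:

```shell
TABLE_DIR=/apps/hive/warehouse/mydb.db/mytable   # placeholder path

# Step 1: find leftover .tmp files, then delete them.
hadoop fs -ls -R "$TABLE_DIR" | awk '/\.tmp$/{print $NF}'
hadoop fs -ls -R "$TABLE_DIR" | awk '/\.tmp$/{print $NF}' | xargs -r -n1 hadoop fs -rm

# Step 2/validation: after each deletion, force a full scan of the
# affected day (illustrative query) to check whether the table is usable.
hive -e "SELECT COUNT(*) FROM mydb.mytable WHERE dt='2016-10-19';"
```

`COUNT(*)` over the whole day is used for validation precisely because, as noted above, only a full scan is guaranteed to touch every file.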
This workaround does lose data, and at present there is no good way to avoid losing the data Flume is writing during an HDFS restart. Stopping Flume collection before the restart would avoid the corrupt files, but whatever arrives while collection is stopped is lost just the same.
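One alternative worth trying before deleting anything (not used in this incident): `hdfs debug recoverLease`, available since Hadoop 2.7, asks the NameNode to force-close a file that was left open and finalize its last block. This often clears the "Cannot obtain block length" error while keeping the data that had already been flushed. The file name below is a hypothetical example:

```shell
# Force-close a file left open by the restart; retry a few times
# because lease recovery is asynchronous.
hdfs debug recoverLease \
  -path /apps/hive/warehouse/mydb.db/mytable/FlumeData.1476889999 \
  -retries 3
```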