Summary of a NodeManager startup failure:
The error log is as follows:
11:36:07.339 PM INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource
Resource hdfs://ns1/user/hdfs/.staging/job_1451440500748_16896/libjars/htrace-core-2.04.jar(->/dfs/data3/yarn/nm/usercache/hdfs/filecache/1168/htrace-core-2.04.jar) transitioned from INIT to LOCALIZED
11:36:07.339 PM INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
Recovering localized resource { hdfs://ns1/user/hdfs/.staging/job_1451440500748_16896/libjars/hbase-hadoop-compat.jar, 1452201034053, FILE, null } at /dfs/data3/yarn/nm/usercache/hdfs/filecache/1170/hbase-hadoop-compat.jar
11:36:07.339 PM INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource
Resource hdfs://ns1/user/hdfs/.staging/job_1451440500748_16896/libjars/hbase-hadoop-compat.jar(->/dfs/data3/yarn/nm/usercache/hdfs/filecache/1170/hbase-hadoop-compat.jar) transitioned from INIT to LOCALIZED
11:36:07.369 PM INFO org.apache.hadoop.service.AbstractService
Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
11:36:07.380 PM INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
11:36:07.381 PM INFO org.apache.hadoop.service.AbstractService
Service NodeManager failed in state INITED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
11:36:07.382 PM INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl
Stopping NodeManager metrics system...
11:36:07.383 PM INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl
NodeManager metrics system stopped.
11:36:07.383 PM INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl
NodeManager metrics system shutdown complete.
11:36:07.383 PM FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager
Error starting NodeManager
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
11:36:07.386 PM INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at DN-BJ-MXY-2-220/10.26.2.220
************************************************************/
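The likely cause is that tasks were still running in the cluster when YARN was stopped. This can happen when the machine hosting the cluster's NameNode fails, or when someone stops YARN manually. The stack trace is consistent with this: the NullPointerException is thrown in ContainerManagerImpl.recoverContainer, that is, while the NodeManager is recovering container state during startup.
The fix is to delete the files left behind by the tasks that were running.
Delete the two folders under the local directory /tmp/hadoop-yarn/yarn-nm-recovery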
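The cleanup step can be sketched in Python as follows. This is a minimal sketch, not an official tool: the function name clear_nm_recovery and its dry_run parameter are hypothetical, and the default path is the one given in this note (on other clusters the location is whatever yarn.nodemanager.recovery.dir points to). It assumes the NodeManager is already stopped on the host.

```python
import shutil
from pathlib import Path

# Path taken from this note; adjust it if yarn.nodemanager.recovery.dir
# is configured differently on your cluster (assumption to verify).
DEFAULT_RECOVERY_DIR = Path("/tmp/hadoop-yarn/yarn-nm-recovery")


def clear_nm_recovery(recovery_dir=DEFAULT_RECOVERY_DIR, dry_run=True):
    """Remove every subdirectory of recovery_dir and return their paths, sorted.

    With dry_run=True nothing is deleted; the paths are only reported.
    Plain files and symlinks inside recovery_dir are left untouched.
    """
    recovery_dir = Path(recovery_dir)
    if not recovery_dir.is_dir():
        return []
    targets = sorted(
        p for p in recovery_dir.iterdir() if p.is_dir() and not p.is_symlink()
    )
    for path in targets:
        if dry_run:
            print(f"would remove {path}")
        else:
            shutil.rmtree(path)
            print(f"removed {path}")
    return targets
```

Calling clear_nm_recovery() first only prints what would be deleted; after reviewing the output, clear_nm_recovery(dry_run=False) performs the deletion. Note that removing this state means containers that were running before the shutdown are not recovered.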
Reference URL: http://stackoverflow.com/questions/27065011/cdh-5-2-error-starting-nodemanager-service-nodemanager-failed-in-state-inited-c