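Project scenario:
Recently, while setting up a new Hadoop cluster, I found after the build that there were not enough NodeManagers, so I added one more NodeManager node. After adding it and starting the NodeManager, however, the process went down again after a short while.
Problem description: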
2019-04-03 16:51:06,517 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: registered UNIX signal handlers for [TERM, HUP, INT]
2019-04-03 16:51:07,698 INFO org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: Using state database at /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state for recovery
2019-04-03 16:51:07,802 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:966)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:953)
    ....................................................................................................................................................
2019-04-03 16:51:07,818 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at jfh1d024/134.130.131.157
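The NodeManager log shows that the failure is a checksum error: the LevelDB state store used for recovery reports "Corruption: checksum mismatch" when it is opened.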
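To check quickly whether a NodeManager log contains this error, the log text can be scanned for the LevelDB exception. The following is a minimal sketch; the function name find_state_store_corruption and the sample text are assumptions made for illustration, and only the exception string comes from the log above.

```python
import re

# Signature of the LevelDB error seen in the NodeManager log above.
CORRUPTION_PATTERN = re.compile(r"NativeDB\$DBException: Corruption: (.+)")


def find_state_store_corruption(log_text):
    """Return the distinct LevelDB corruption messages found in log_text.

    Assumes one log record per line, as in a normal NodeManager log file.
    """
    messages = []
    for match in CORRUPTION_PATTERN.finditer(log_text):
        message = match.group(1).strip()
        if message not in messages:
            messages.append(message)
    return messages


# Hypothetical sample built from two lines of the log above.
sample = (
    "cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: "
    "Corruption: checksum mismatch\n"
    "org.fusesource.leveldbjni.internal.NativeDB$DBException: "
    "Corruption: checksum mismatch\n"
)
print(find_state_store_corruption(sample))  # prints ['checksum mismatch']
```

An empty list means the log does not contain this particular error, and the shutdown probably has a different cause.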
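Root cause analysis:
After some googling, the explanation I found is as follows. A new NodeManager was added to the cluster while this NodeManager was stopped, which increased the number of roles. When the failed NodeManager starts, it tries to recover from its stored state, and the check against the state database finds a mismatch, so startup fails and the node goes offline.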
Solution:
On the node that hosts the failing NodeManager, delete the files under /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state.
I removed those files and restarted the NodeManager. It started successfully and has stayed up since.
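The cleanup step can be sketched in Python as below. This sketch moves the files into a backup folder instead of deleting them outright, which is a deliberate change from the fix described above so that the old state can still be inspected. The function name move_recovery_state_aside and the backup location are assumptions; only the state directory path comes from the log. The NodeManager should be stopped before the files are moved.

```python
import shutil
import time
from pathlib import Path


def move_recovery_state_aside(state_dir, backup_root):
    """Move every entry in state_dir into a timestamped folder under backup_root.

    The state directory itself is kept, but left empty.
    Returns the path of the backup folder that was created.
    """
    state_dir = Path(state_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = Path(backup_root) / f"yarn-nm-state-{stamp}"
    backup_dir.mkdir(parents=True, exist_ok=False)
    for entry in state_dir.iterdir():
        shutil.move(str(entry), str(backup_dir / entry.name))
    return backup_dir


# Example call (the backup path is hypothetical; run as a user that can write both paths):
# move_recovery_state_aside(
#     "/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state",
#     "/tmp/yarn-nm-state-backup",
# )
```

Discarding the recovery state means the NodeManager starts from a clean state and cannot recover anything recorded in the old state store, which is acceptable for a freshly added node like the one here.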