NodeManager启动报错

Nodeemanager异常 org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


系统环境:ubuntu14.04 server cloudera CDH 5.10 共计28个节点。单节点内存128G。有6台机器是24核,22台机器是32核。有10台机器磁盘容量是1.8T 18台机器磁盘容量是12.8T。
resourceManager,namenode 在同一节点上,且该节点不包括datanode和nodeManager服务。
剩余28节点均包含了datanode和nodeManager服务
最近集群做TB级数据随机化实验,hdfs各节点数据分布不太均衡,10台磁盘容量为1.8T的机器磁盘全部占满,导致这些节点上yarn的nodemanager 没有足够的空间,致使9个nodemanager 停止接受服务,1个nodemanager 服务终止,且无法启动。

考虑将10台低容量磁盘机器datanode角色停止授权,数据分发到其他大容量机器上,并删除hdfs数据。

在停止授权期间,我给集群又添加了一台32核,128G内存 1.8T磁盘的机器,分配了nodeManager角色。
上述工作顺利完成后,重新启动yarn.

大部分机器正常工作,只有上述服务终止的nodemanager依然无法启动,查询日志报错如下:
FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager
Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more

google搜索 org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch 给出的解决方案:
网址:https://community.hortonworks.com/questions/86301/nodemanager-fails-to-start.html
Maybe a sst file got corrupt can you try to remove the folder of /var/log/hadoop-yarn/nodemanager/recovery-state from failed nodemanagers and check if starts?
These files stays in the system even if you decomission the nodes.
上述文件在节点上并不存在。故继续查看日志:

上午9点56:15.339分 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager
registered UNIX signal handlers for [TERM, HUP, INT]
4月 14, 上午9点56:16.426分 INFO org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService
Using state database at /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state for recovery
4月 14, 上午9点56:16.457分 INFO org.apache.hadoop.service.AbstractService
Service org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)
4月 14, 上午9点56:16.466分 INFO org.apache.hadoop.service.AbstractService
Service NodeManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more

发现疑似目录:/var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state下存在:
005615.sst  005638.log  005640.log  CURRENT  LOCK  MANIFEST-004397
移除所有文件。重启nodemanager 成功。
回顾错误原因可能是,我在该nodemanager终止情况下,在集群中添加了新的nodemanager,使得角色数目增加,而启动失败的nodemanager时,它使用存储的状态来恢复,在和数据库校验过程中发现数目不符合而启动失败。因此删除上述目录下的文件。


系统环境:ubuntu14.04 server cloudera CDH 5.10 共计28个节点。单节点内存128G。有6台机器是24核,22台机器是32核。有10台机器磁盘容量是1.8T 18台机器磁盘容量是12.8T。
resourceManager,namenode 在同一节点上,且该节点不包括datanode和nodeManager服务。
剩余28节点均包含了datanode和nodeManager服务
最近集群做TB级数据随机化实验,hdfs各节点数据分布不太均衡,10台磁盘容量为1.8T的机器磁盘全部占满,导致这些节点上yarn的nodemanager 没有足够的空间,致使9个nodemanager 停止接受服务,1个nodemanager 服务终止,且无法启动。

考虑将10台低容量磁盘机器datanode角色停止授权,数据分发到其他大容量机器上,并删除hdfs数据。

在停止授权期间,我给集群又添加了一台32核,128G内存 1.8T磁盘的机器,分配了nodeManager角色。
上述工作顺利完成后,重新启动yarn.

大部分机器正常工作,只有上述服务终止的nodemanager依然无法启动,查询日志报错如下:
FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager
Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more

google搜索 org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch 给出的解决方案:
网址:https://community.hortonworks.com/questions/86301/nodemanager-fails-to-start.html
Maybe a sst file got corrupt can you try to remove the folder of /var/log/hadoop-yarn/nodemanager/recovery-state from failed nodemanagers and check if starts?
These files stays in the system even if you decomission the nodes.
上述文件在节点上并不存在。故继续查看日志:

上午9点56:15.339分 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager
registered UNIX signal handlers for [TERM, HUP, INT]
4月 14, 上午9点56:16.426分 INFO org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService
Using state database at /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state for recovery
4月 14, 上午9点56:16.457分 INFO org.apache.hadoop.service.AbstractService
Service org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)
4月 14, 上午9点56:16.466分 INFO org.apache.hadoop.service.AbstractService
Service NodeManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more

发现疑似目录:/var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state下存在:
005615.sst  005638.log  005640.log  CURRENT  LOCK  MANIFEST-004397
移除所有文件。重启nodemanager 成功。
回顾错误原因可能是,我在该nodemanager终止情况下,在集群中添加了新的nodemanager,使得角色数目增加,而启动失败的nodemanager时,它使用存储的状态来恢复,在和数据库校验过程中发现数目不符合而启动失败。因此删除上述目录下的文件。
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
HadoopNodeManager无法启动时,可能是由于以下一些常见问题引起的: 1. 配置错误:请检查NodeManager的配置文件,确保文件中的属性和值都是正确的。例如,确保NodeManager所在的机器的IP地址和端口号正确设置。 2. 权限问题:确保NodeManager所在的目录和文件的权限正确设置,并确保Hadoop用户有足够的权限来读取和写入相关的文件和目录。 3. 网络问题:请确保NodeManager所在的机器可以与Hadoop集群的其他节点进行通信。如果NodeManager无法连接到其他节点,则可能会导致启动失败。 4. 资源限制:请确保NodeManager所在的机器上有足够的内存和CPU资源来启动NodeManager进程。如果机器资源不足,则可能会导致启动失败。 如果您检查了上述问题,并且问题仍然存在,请查看NodeManager的日志文件以获取更多详细信息,以确定问题的原因。您好!Hadoop NodeManagerHadoop 分布式计算框架中的一个重要组件,它负责管理运行在每个节点上的容器(Container),监控它们的资源使用情况,并与 ResourceManager 交互以获取分配给该节点的任务信息。 如果您无法启动 NodeManager,可以尝试以下几个步骤来排查问题: 1. 检查 NodeManager 的配置文件是否正确:NodeManager 的配置文件通常位于 Hadoop 的安装目录下的 /etc/hadoop 目录中,文件名为 yarn-site.xml。确保配置文件中的参数设置正确,特别是关于 ResourceManager 的地址和端口号等参数。 2. 检查 NodeManager 的日志文件:NodeManager 的日志文件通常位于 Hadoop 的安装目录下的 logs 目录中,文件名为 yarn-yarn-nodemanager-<hostname>.log。检查日志文件中是否有任何错误或异常信息,以确定问题的原因。 3. 检查节点的网络连接和状态:NodeManager 需要与 ResourceManager 进行通信,因此请确保节点可以与 ResourceManager 正确通信。您可以尝试使用 telnet 命令测试节点是否可以连接到 ResourceManager 的端口,例如 telnet <ResourceManager IP> <ResourceManager Port>。 4. 检查节点的资源使用情况:NodeManager 运行的容器需要消耗一定的系统资源,如 CPU、内存等。请确保节点的资源使用情况正常,并且没有其他进程或服务占用了过多的资源,导致 NodeManager 无法启动。 希望这些建议对您有所帮助!如果问题仍然存在,请提供更多详细信息,以便我们更好地帮助您解决问题。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值