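NodeManager and ResourceManager fail to start in a CDH environment
1. Possible reasons the NodeManager fails to start
1.1 Other NodeManagers may have been added to the cluster while this NodeManager was stopped, so its validation fails when it starts again
Error messages you may see: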
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 2 missing files; e.g.: /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/xxxxx.sst
2022-05-05 11:24:11,415 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 2 missing files; e.g.: /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000003.sst
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 2 missing files; e.g.: /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000003.sst
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:281)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:354)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:869)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:942)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 2 missing files; e.g.: /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000003.sst
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:1517)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:1504)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:342)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
... 5 more
Solution: on the machine that hosts this NodeManager, delete everything under /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This discards the NodeManager's saved recovery state.
rm -rf /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/*
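Because `rm -rf` cannot be undone, it may be worth keeping a copy of the state directory before clearing it. The following is a minimal sketch and not part of the original procedure; the function name `backup_nm_state` and the `.bak.<timestamp>` suffix are assumptions made for illustration.

```shell
# Hypothetical helper (not part of CDH): copy the recovery state directory
# to a timestamped sibling before its contents are deleted.
backup_nm_state() {
    state_dir="$1"
    if [ ! -d "$state_dir" ]; then
        echo "no such directory: $state_dir" >&2
        return 1
    fi
    # Backup name is an assumption, e.g. yarn-nm-state.bak.20220505112411
    backup_dir="${state_dir%/}.bak.$(date +%Y%m%d%H%M%S)"
    # -a keeps permissions and timestamps (and ownership when run as root)
    cp -a "$state_dir" "$backup_dir" || return 1
    # Print the backup location so it can be removed later
    echo "$backup_dir"
}

# Usage (on the affected host, while the NodeManager is stopped):
#   backup_nm_state /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state
```

Once the NodeManager is running normally again, the backup directory can be removed.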
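Before starting the NodeManager, check whether port 8041 is in use. No output means the port is free. If the port is held by a NodeManager process, kill that process. If it is held by some other process, find out what it is and whether it can be stopped, or give the NodeManager a different port. To change the default port, search the configuration for yarn.nodemanager.address.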
lsof -i:8041
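A plain `lsof -i:8041` also lists established connections that merely touch the port. Restricting the output to listening TCP sockets may make the owner easier to spot. This is a sketch under that assumption; `port_listener` is a hypothetical name, while `-n`, `-P`, `-iTCP:<port>` and `-sTCP:LISTEN` are standard lsof options.

```shell
# Hypothetical helper: show which process is listening on a TCP port.
# Returns 2 on a bad argument; otherwise returns lsof's own exit status
# (lsof prints nothing and exits non-zero when no listener is found).
port_listener() {
    port="$1"
    case "$port" in
        ''|*[!0-9]*)
            echo "usage: port_listener <port>" >&2
            return 2
            ;;
    esac
    # -n/-P: skip host and port name lookups; -sTCP:LISTEN: listening sockets only
    lsof -nP -iTCP:"$port" -sTCP:LISTEN
}

# Usage (as root, so that processes of other users are visible):
#   port_listener 8041
```

The PID column of the output identifies the process to inspect before deciding whether to stop it.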
1.2 The startup port may already be in use
Error messages you may see:
java.net.BindException: Address already in use;
INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [master01:8041] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
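Solution: as described in section 1.1, check whether the port configured in yarn.nodemanager.address is in use.
Then restart the affected NodeManager from the CDH web UI.
2. ResourceManager fails to start
The errors I ran into here were all port-in-use errors, like the one below. The fix is the same as above, applied to the port named in the message (8031 in this case).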
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [master01:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
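When several roles fail at once, pulling the address out of each "Problem binding to [...]" message shows quickly which ports to check. The following is a sketch that relies only on the message format shown above; the function name `bind_conflicts` is hypothetical.

```shell
# Hypothetical helper: list the distinct host:port pairs that a log reports
# as "Problem binding to [host:port]". Reads the given files, or stdin if none.
bind_conflicts() {
    # Keep only the text between the square brackets, then de-duplicate
    sed -n 's/.*Problem binding to \[\([^]]*\)].*/\1/p' "$@" | sort -u
}

# Usage:
#   bind_conflicts /var/log/hadoop-yarn/*.log.out
```

Each port it prints can then be checked with lsof as in section 1.1.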
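3. How to view CDH logs
3.1 View the service's logs in the web UI. This sometimes does not work because of problems with the service itself.
3.2 View the log files on the server
Log in to the machine that hosts the service you want to inspect.
# Log files for services installed by CDH are mostly located here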
[root@slave01 ~]# cd /var/log/
[root@slave01 log]# ll
total 3160
......
drwxrwxr-x 3 hdfs hadoop 4096 May 5 14:29 hadoop-hdfs
drwxrwxr-x 3 yarn hadoop 4096 May 5 14:31 hadoop-yarn
......
# The NodeManager discussed above belongs to YARN, so go into hadoop-yarn
[root@slave01 log]# cd hadoop-yarn/
[root@slave01 hadoop-yarn]# ll
total 2868
-rw-r--r-- 1 yarn yarn 2925441 May 5 14:31 hadoop-cmf-yarn-NODEMANAGER-slave01.log.out
-rw-r--r-- 1 yarn yarn 0 May 4 12:59 SecurityAuth-yarn.audit
drwxr-xr-x 2 yarn hadoop 4096 May 4 14:11 stacks
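In /var/log/hadoop-yarn there is a log file with NODEMANAGER in its name. Reading it shows exactly what caused the error, so you can apply the matching fix. Logs for other services can be found the same way, by looking for the service name in the log file name.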
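The NodeManager log in the listing above is close to 3 MB, so filtering for failure lines is usually faster than paging through it. This is a minimal sketch; the function name `recent_errors` is hypothetical, and the search pattern is an assumption based on the error messages quoted earlier.

```shell
# Hypothetical helper: print the last N log lines that look like failures,
# each prefixed with its line number in the file. N defaults to 20.
recent_errors() {
    log_file="$1"
    count="${2:-20}"
    # Pattern is an assumption: log levels plus phrases seen in the errors above
    grep -nE 'ERROR|FATAL|Exception|failed in state' "$log_file" | tail -n "$count"
}

# Usage:
#   recent_errors /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-slave01.log.out
```

The line numbers in the output make it easy to open the file at the right place and read the full stack trace.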