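Overview
While machines were being decommissioned in the production environment, some dependencies required restarting the NameNode.
The restart failed with: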
Command aborted because of exception: Command timed-out after 150 seconds
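Troubleshooting
The CM (Cloudera Manager Server), CA (Cloudera Manager Agent), and NameNode logs were checked in turn. The CM log showed: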
cloudera-scm-server/cloudera-scm-server.log:
2016-05-12 10:21:25,626 INFO CommandPusher:com.cloudera.cmf.service.AbstractBringUpBringDownCommands: Aborting BringUp command (8885) on service DbService{id=24, name=hdfs} role DbRole{id=151, name=hdfs-NAMENODE-c39e25d60b1d0837a21ea44240cc36fc, hostName=<hostname>}.
2016-05-12 10:21:39,450 WARN 1831299335@agentServer-348302:com.cloudera.server.cmf.AgentProtocolImpl: Received Process Heartbeat for unknown (or duplicate) process. Ignoring. This is expected to happen once after old process eviction or process deletion (as happens in restarts). id=3314 name=null host=4c0b8ad5-8230-4fdf-9db3-f84fedda29cb/<hostname>
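- Check the CM monitoring processes: they were running normally
- Check the hosts configuration
- Inspect /etc/hosts: is IPv6 (the ::1 entry) disabled?
- Check hostname: does the name match the hosts configuration?
- Inspect /etc/sysconfig/network: is IPv6 disabled?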
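The /etc/hosts checks in the list above can be sketched in Python. This is a minimal illustration rather than a Cloudera tool: the function names `find_ipv6_loopback` and `hostname_is_listed` are hypothetical, and the code assumes the usual hosts layout of an address followed by names, with `#` starting a comment.

```python
def find_ipv6_loopback(hosts_text):
    """Return (line_number, line) pairs for active entries whose address is ::1."""
    hits = []
    for number, raw in enumerate(hosts_text.splitlines(), start=1):
        # Drop anything after '#', then split the entry into whitespace-separated fields.
        fields = raw.split("#", 1)[0].split()
        if fields and fields[0] == "::1":
            hits.append((number, raw))
    return hits


def hostname_is_listed(hosts_text, hostname):
    """Return True if hostname appears as a name on any active hosts entry."""
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        # fields[0] is the address; the remaining fields are names and aliases.
        if hostname in fields[1:]:
            return True
    return False
```

To use it, read /etc/hosts into a string and pass it to both functions, with `socket.gethostname()` as the hostname argument. A non-empty result from `find_ipv6_loopback` corresponds to the condition this note goes on to fix.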
It turned out that /etc/hosts contained a ::1 entry. The fix:
- Delete or comment out the ::1 entry in /etc/hosts
- Restart the CM and CA services
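Commenting out the ::1 entry can be sketched as a pure text transformation, so the result can be reviewed before anything is written back to /etc/hosts. The function name `comment_out_ipv6_loopback` is hypothetical, and the sketch assumes `#` is the comment character and that the address is the first field on the line.

```python
def comment_out_ipv6_loopback(hosts_text):
    """Return hosts_text with every active ::1 entry prefixed by '# '."""
    result = []
    for raw in hosts_text.splitlines():
        # Ignore trailing comments when deciding whether the line is a ::1 entry.
        fields = raw.split("#", 1)[0].split()
        if fields and fields[0] == "::1":
            result.append("# " + raw)
        else:
            result.append(raw)
    return "\n".join(result) + "\n"
```

Lines that are already commented out are left untouched, so running the function a second time does not change the output.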
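The NameNode was then restarted again from the Cloudera Manager Admin Console.
Reminder: do not perform this restart from the command line; doing so can cause other unexpected errors.
This time the console reported: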
Service did not start successfully; not all of the required roles started: Service has only 0 NameNode roles running instead of minimum required 1.
Restart CM and CA:
- service cloudera-scm-server restart
- service cloudera-scm-agent restart
With that, the whole problem was resolved.
Summary
For one thing, when configuring hosts we must strictly follow Cloudera's networking requirements [1].