Symptom: at first, the namenode log kept printing the following error over and over:
2014-01-27 17:55:59,388 WARN resources.ExceptionHandler (ExceptionHandler.java:toResponse(92)) - INTERNAL_SERVER_ERROR
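What followed was similar to the case described in the post "Hadoop运维笔记 之 Namenode异常停止后无法正常启动" (Hadoop Ops Notes: Namenode fails to start normally after an abnormal stop).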
It is the same bug in the Hadoop 2.1.0-beta release ("testNamenodeRestart fails with NullPointerException in trunk"), which the bug report explains as follows:
This is actually due to a bug in the NN. The http services are started before the image is loaded, the edits are processed, and the rpc server is started. During image loading and edits processing, webhdfs will NPE on the rpc server.
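The node could not be started, so the only option was to rebuild the Standby. The steps are as follows:
1. First, run the following commands on the Active, then manually back up the entire name directory:
# Stop the automatic failover controller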
hadoop-daemon.sh stop zkfc
# Enter safe mode
hdfs dfsadmin -safemode enter
# Flush the edit log into the fsimage
hdfs dfsadmin -saveNamespace
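The Active-side preparation above could be wrapped in a small script. What follows is a minimal sketch, not a tested runbook: the `run` helper, the `DRY_RUN` switch, and the `NAME_DIR` and `BACKUP_DIR` variables are hypothetical names introduced here, `/data/hadoop/name` is assumed from the `scp` command used later in this post, and `tar` is just one possible way to do the manual backup. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: prepare the Active NameNode before rebuilding the Standby.
# DRY_RUN, NAME_DIR and BACKUP_DIR are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"
NAME_DIR="${NAME_DIR:-/data/hadoop/name}"
BACKUP_DIR="${BACKUP_DIR:-/data/hadoop/backup}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

prepare_active() {
  stamp=$(date +%Y%m%d-%H%M%S)
  # Stop the failover controller so no failover happens mid-procedure.
  run hadoop-daemon.sh stop zkfc || return 1
  # Enter safe mode, then merge the edit log into a fresh fsimage.
  run hdfs dfsadmin -safemode enter || return 1
  run hdfs dfsadmin -saveNamespace || return 1
  # Archive the whole name directory (assumed backup method).
  run mkdir -p "$BACKUP_DIR" || return 1
  run tar -czf "$BACKUP_DIR/name-$stamp.tar.gz" \
    -C "$(dirname "$NAME_DIR")" "$(basename "$NAME_DIR")"
}

prepare_active
```

Running it with `DRY_RUN=0` would execute the commands for real; reviewing the printed dry-run output first seems the safer order on a production cluster.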
2. Then, on the Standby, first back up the entire name and journal directories, then run:
hadoop-daemon.sh stop zkfc
hdfs namenode -bootstrapStandby
If this fails with the error:
FATAL ha.BootstrapStandby: Unable to read transaction ids 10-100 from the configured shared edits storage qjournal://1.1.1.1:8485;1.1.1.2:8485/sec-hdfs-cluster. Please copy these logs into the shared edits storage or call saveNamespace on the active node.
Error: Gap in transactions. Expected to be able to read up until at least txid 10 but unable to find any edit logs containing txid 10
then copy the entire name directory from the Active to the Standby and simply start the namenode:
scp -r /data/hadoop/name/ $standby_ip:/data/hadoop
hadoop-daemon.sh start namenode
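To see the gap that the "Gap in transactions" error describes, one can check whether any finalized edit log segment in a storage directory covers the missing txid. The sketch below relies on the HDFS naming of finalized segments, `edits_<first txid>-<last txid>` with zero-padded numbers; the helper names `strip_zeros` and `find_segment`, the `EDITS_DIR` and `TXID` variables, and the default directory `/data/hadoop/name/current` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: check whether a directory of edit log segments covers a txid.
# Finalized segments are named edits_<first txid>-<last txid>, zero-padded.
# In-progress segments (edits_inprogress_<txid>) are deliberately ignored.

# Strip zero padding so the shell compares plain decimal numbers.
strip_zeros() {
  n=$(printf '%s' "$1" | sed 's/^0*//')
  printf '%s\n' "${n:-0}"
}

# Print the segment file in directory $1 that contains txid $2, if any.
find_segment() {
  dir=$1
  txid=$2
  for f in "$dir"/edits_*-*; do
    [ -e "$f" ] || continue
    range=${f##*/edits_}
    first=$(strip_zeros "${range%-*}")
    last=$(strip_zeros "${range#*-}")
    if [ "$first" -le "$txid" ] && [ "$txid" -le "$last" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}

# Example call; the directory and txid defaults are assumptions.
find_segment "${EDITS_DIR:-/data/hadoop/name/current}" "${TXID:-10}" \
  || echo "no finalized segment contains txid ${TXID:-10}"
```

The same check could be pointed at a JournalNode's storage directory by setting `EDITS_DIR`; its location depends on `dfs.journalnode.edits.dir` and is not given in this post.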
3. Note: there is no need to run "bootstrapStandby" at this point; doing so would rebuild and wipe the name directory that was just copied over.
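Once the Standby namenode is up, it may be worth confirming that the pair reports one active and one standby node. The following is a rough sketch under stated assumptions: `hdfs haadmin -getServiceState` is the standard HA admin command, but the NameNode IDs `nn1` and `nn2`, the `NN_ACTIVE` and `NN_STANDBY` variables, and the `ha_pair_ok` helper are hypothetical; the real IDs come from `dfs.ha.namenodes.<nameservice>` in `hdfs-site.xml`.

```shell
#!/bin/sh
# Sketch: sanity-check the HA pair after the Standby has been rebuilt.
# nn1/nn2 are placeholder NameNode IDs, not values from the original post.
NN_ACTIVE="${NN_ACTIVE:-nn1}"
NN_STANDBY="${NN_STANDBY:-nn2}"

# Return 0 when one NameNode reports "active" and the other "standby".
ha_pair_ok() {
  case "$1/$2" in
    active/standby|standby/active) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the cluster when the hdfs client is actually installed.
if command -v hdfs >/dev/null 2>&1; then
  s1=$(hdfs haadmin -getServiceState "$NN_ACTIVE")
  s2=$(hdfs haadmin -getServiceState "$NN_STANDBY")
  if ha_pair_ok "$s1" "$s2"; then
    echo "HA pair looks healthy: $NN_ACTIVE=$s1 $NN_STANDBY=$s2"
  else
    echo "unexpected HA state: $NN_ACTIVE=$s1 $NN_STANDBY=$s2" >&2
  fi
else
  echo "hdfs not found in PATH; skipping live check"
fi
```

If the states look right, the remaining cleanup would presumably be to restart the failover controller on both nodes (`hadoop-daemon.sh start zkfc`) and to leave safe mode on the Active (`hdfs dfsadmin -safemode leave`), since both were changed in step 1; the original steps do not spell this out.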
References:
- Hadoop HA: how to repair a damaged standby node
- HDFS environment setup: a record of building NameNode HA
- "namenode -bootstrapStandby" failed always
- HADOOP 2.2.0 HA Setup Manual V1.0