Environment: Hadoop 2.4.1 + ZooKeeper 3.4.0 + HA with automatic failover.
Everything still started normally two days ago, but today startup suddenly failed and the NameNode process shut itself down.
The log prints the following:
2014-08-14 19:20:03,388 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: NameNode1/172.16.168.134:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:03,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: NameNode2/172.16.168.144:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:03,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: DataNode2/172.16.186.84:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:04,399 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: NameNode1/172.16.168.134:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:04,399 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: NameNode2/172.16.168.144:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:04,400 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: DataNode2/172.16.186.84:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:05,401 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: NameNode1/172.16.168.134:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:05,412 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: NameNode2/172.16.168.144:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-08-14 19:20:12,445 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [172.16.168.134:8485, 172.16.168.144:8485, 172.16.177.183:8485, 172.16.186.84:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/4.
So the client keeps retrying the connections and failing, although one of the nodes does connect normally.
I checked the ConnectionRefused page on the Hadoop wiki, which lists many possible causes.
It turned out that my host's DNS server setting had been changed: a DHCP server had silently overwritten it.
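A quick way to confirm this kind of resolver problem is to check whether each JournalNode hostname from the log still resolves to the IP the NameNode expects. A minimal sketch, assuming a Linux host with `getent` available (the hostname/IP pairs are taken from the log above; `check_host` is just a hypothetical helper name):

```shell
# Verify that each JournalNode hostname resolves to the IP recorded
# in the NameNode log. A hijacked resolver shows up here immediately.

check_host() {
  host=$1; want=$2
  # getent follows /etc/nsswitch.conf (/etc/hosts first, then DNS),
  # so it reflects what the NameNode's JVM would actually see
  got=$(getent hosts "$host" | awk '{print $1; exit}')
  if [ "$got" = "$want" ]; then
    echo "OK   $host -> $got"
  else
    echo "BAD  $host -> ${got:-<no answer>} (expected $want)"
  fi
}

# Hostname/IP pairs from the retry messages in the log
check_host NameNode1 172.16.168.134
check_host NameNode2 172.16.168.144
check_host DataNode2 172.16.186.84
```

Any `BAD` line means the host either no longer resolves or resolves to the wrong address, which matches the endless connection retries seen above.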
Solution:
Edit /etc/resolv.conf and set the nameserver back:
nameserver 8.8.8.8
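After fixing resolv.conf, it is worth confirming both name resolution and JournalNode reachability before restarting the NameNode. A rough sketch, assuming `getent` and `nc` are installed (hostnames and port 8485 are taken from the log above; adjust to your cluster):

```shell
# Check that each JournalNode host resolves again and that its
# journal RPC port (8485, from the log above) accepts connections.
for h in NameNode1 NameNode2 DataNode2; do
  if ! getent hosts "$h" >/dev/null; then
    echo "$h: still does not resolve"
    continue
  fi
  ip=$(getent hosts "$h" | awk '{print $1; exit}')
  # nc -z only probes the port without sending data; -w 2 = 2s timeout
  if nc -z -w 2 "$ip" 8485; then
    echo "$h ($ip): port 8485 reachable"
  else
    echo "$h ($ip): port 8485 NOT reachable"
  fi
done
```

Once all three JournalNodes resolve and answer on 8485, the QJM quorum (3 of 4 here) can be reached and the NameNode should come up again.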