HDFS HA: NameNode Starts Successfully, Then Disappears Shortly Afterward

MiHu8123

Last modified: 2024-01-03 22:55:38

Tags: hdfs, hadoop, big data

First published: 2024-01-03 22:07:50

Article link: https://blog.csdn.net/MiHu8123/article/details/135373849

1. Observed problem:

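After HA was configured as required, the NameNode would not stay up at startup. Right after starting, jps listed NameNode, but a short while later a second jps showed that the NameNode process had disappeared.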

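To locate the problem, I checked the NameNode log and found the following errors (拒绝连接 is the localized message for "Connection refused"):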

2023-12-26 22:02:05,287 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:06,080 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12021 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2023-12-26 22:02:06,265 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:06,272 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:06,290 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:07,086 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13026 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2023-12-26 22:02:07,268 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:07,275 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:07,299 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:07,306 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.31.136:8485, 192.168.31.130:8485, 192.168.31.229:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.31.136:8485: Call From master.com/192.168.31.136 to master:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.229:8485: Call From master.com/192.168.31.136 to slave02:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.130:8485: Call From master.com/192.168.31.136 to slave01:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
   at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
   at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
   at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
   at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
   at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
   at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1508)
   at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1532)
   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:652)
   at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
   at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
   at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:644)
   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
   at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
   at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
2023-12-26 22:02:07,403 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: No edit log streams selected.
2023-12-26 22:02:07,910 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 1 INodes.
2023-12-26 22:02:08,104 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 0 seconds.
2023-12-26 22:02:08,104 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 0 from /home/jxlgzwh/hadoop-2.7.2/data/tmp/dfs/name/current/fsimage_0000000000000000000
2023-12-26 22:02:08,113 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2023-12-26 22:02:08,130 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups
2023-12-26 22:02:08,130 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 21814 msecs
2023-12-26 22:02:11,546 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to master:8020
2023-12-26 22:02:11,610 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2023-12-26 22:02:11,781 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2023-12-26 22:02:12,209 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
2023-12-26 22:02:12,435 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Number of blocks under construction: 0
2023-12-26 22:02:12,436 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Number of blocks under construction: 0
2023-12-26 22:02:12,436 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 30 secs
2023-12-26 22:02:12,437 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
2023-12-26 22:02:12,437 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2023-12-26 22:02:12,574 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2023-12-26 22:02:12,859 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode RPC up at: master/192.168.31.136:8020
2023-12-26 22:02:12,860 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
2023-12-26 22:02:12,864 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at slave01/192.168.31.130:8020 every 120 seconds.
2023-12-26 22:02:12,912 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at http://slave01:50070
Serving checkpoints at http://master:50070
2023-12-26 22:02:12,878 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2023-12-26 22:02:12,878 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2023-12-26 22:02:14,015 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:14,016 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:14,016 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:15,018 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:15,033 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:15,034 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:16,038 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:16,039 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:16,039 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:17,056 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:17,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:17,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:18,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:18,060 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:18,060 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:19,004 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6003 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2023-12-26 22:02:19,060 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:19,062 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:19,063 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:20,014 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7013 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2023-12-26 22:02:20,064 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:20,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:20,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:21,016 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8014 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2023-12-26 22:02:21,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:21,067 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:21,068 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:22,017 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9016 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2023-12-26 22:02:22,068 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:22,072 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:22,074 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:23,019 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10017 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2023-12-26 22:02:23,070 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:23,074 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:23,076 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:23,079 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.31.136:8485, 192.168.31.130:8485, 192.168.31.229:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.31.136:8485: Call From master.com/192.168.31.136 to master:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.229:8485: Call From master.com/192.168.31.136 to slave02:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.130:8485: Call From master.com/192.168.31.136 to slave01:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
   at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
   at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
   at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
   at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
   at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
   at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1508)
   at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1532)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:214)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
   at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2023-12-26 22:02:24,714 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2023-12-26 22:02:24,718 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:347)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
   at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2023-12-26 22:02:24,795 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2023-12-26 22:02:24,905 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2023-12-26 22:02:25,970 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:25,974 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:25,992 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:26,971 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:26,976 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:26,994 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:27,989 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:27,991 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:27,995 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:28,990 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:28,994 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:28,998 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:29,992 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:29,996 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:29,999 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:30,993 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:30,998 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:31,002 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:31,999 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:32,000 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:32,011 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:32,999 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:33,004 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:33,012 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:34,000 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:34,004 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:34,013 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:35,001 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.31.136:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:35,005 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave01/192.168.31.130:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:35,014 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave02/192.168.31.229:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2023-12-26 22:02:35,016 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.31.136:8485, 192.168.31.130:8485, 192.168.31.229:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.31.229:8485: Call From master.com/192.168.31.136 to slave02:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.130:8485: Call From master.com/192.168.31.136 to slave01:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.136:8485: Call From master.com/192.168.31.136 to master:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
   at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
   at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
   at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
   at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:182)
   at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:436)
   at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
   at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
   at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
   at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1439)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1112)
   at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1710)
   at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
   at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
   at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
   at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1583)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1478)
   at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
   at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2023-12-26 22:02:35,018 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2023-12-26 22:02:35,021 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master.com/192.168.31.136
************************************************************/

2. Locating the problem:

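The log above shows the NameNode repeatedly trying to reach the JournalNodes over RPC (port 8485 on master, slave01, and slave02), with every attempt refused. It also shows that the retry limit is 10 attempts, spaced 1000 ms apart: retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS).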

From this, the cause can be identified:

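The NameNode uses up its maximum number of connection attempts before the JournalNodes have finished starting. It therefore cannot reach a quorum (2 of 3), logs a FATAL error, and exits, so the system cannot run normally.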
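The timing can be checked with a little arithmetic. The sketch below, in Python, multiplies the retry count by the sleep time taken from the logged policy string, then compares that budget with the gap between the first retry of the final round (22:02:25,970) and the FATAL line (22:02:35,016) in the log above. The function names are illustrative and not part of Hadoop.

```python
import re
from datetime import datetime, timedelta

# Policy string copied from the log above.
POLICY = ("RetryUpToMaximumCountWithFixedSleep"
          "(maxRetries=10, sleepTime=1000 MILLISECONDS)")

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def retry_window_ms(policy: str) -> int:
    """Return maxRetries * sleepTime, the rough time budget in ms."""
    match = re.search(r"maxRetries=(\d+), sleepTime=(\d+) MILLISECONDS", policy)
    if match is None:
        raise ValueError("unrecognized retry policy string")
    return int(match.group(1)) * int(match.group(2))


def elapsed_ms(start: str, end: str) -> int:
    """Return the whole milliseconds between two log timestamps."""
    delta = (datetime.strptime(end, LOG_TIME_FORMAT)
             - datetime.strptime(start, LOG_TIME_FORMAT))
    return delta // timedelta(milliseconds=1)


if __name__ == "__main__":
    budget = retry_window_ms(POLICY)
    observed = elapsed_ms("2023-12-26 22:02:25,970", "2023-12-26 22:02:35,016")
    print(f"retry budget: {budget} ms, observed before FATAL: {observed} ms")
```

With the values in effect, the NameNode gives the JournalNodes roughly ten seconds to come up, which matches how quickly it exits in the log.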

3. Solution:

Modify the IPC parameters in core-site.xml:

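<property>
<name>ipc.client.connect.max.retries</name>
<value><!-- retry count; raise it above the 10 shown in the log --></value>
</property>
<property>
<name>ipc.client.connect.retry.interval</name>
<value><!-- interval in milliseconds; raise it above the 1000 shown in the log --></value>
</property>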
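One way to sanity-check the edit is to read the two properties back out of the file. The following Python sketch parses Hadoop-style configuration XML and returns the two IPC retry settings. The sample values (100 retries, 10000 ms) are assumptions chosen purely for illustration, since the values actually used are not stated here, and `read_retry_settings` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict

RETRY_KEYS = (
    "ipc.client.connect.max.retries",
    "ipc.client.connect.retry.interval",
)

# Hypothetical values for illustration only; not the author's settings.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>100</value>
  </property>
  <property>
    <name>ipc.client.connect.retry.interval</name>
    <value>10000</value>
  </property>
</configuration>"""


def read_retry_settings(xml_text: str) -> Dict[str, int]:
    """Return the IPC retry properties found in the given config text."""
    root = ET.fromstring(xml_text)
    found: Dict[str, int] = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in RETRY_KEYS:
            # Raises ValueError if the value is missing or not an integer.
            found[name] = int((prop.findtext("value") or "").strip())
    return found


if __name__ == "__main__":
    settings = read_retry_settings(SAMPLE_CORE_SITE)
    window_s = settings[RETRY_KEYS[0]] * settings[RETRY_KEYS[1]] / 1000
    print(settings)
    print(f"approximate retry window: {window_s:.0f} s")
```

To check a real node, pass the contents of that node's core-site.xml to `read_retry_settings` instead of the sample string.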

Then sync the modified file to the other nodes in the cluster (for example, copy it to each node with scp).

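This increases both the retry interval and the retry count for the NameNode's connection requests to the JournalNodes. After restarting the cluster, jps shows that the NameNode now starts and stays up, so the problem is resolved.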
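As a complementary diagnostic (not part of the fix described above), it can help to confirm that enough JournalNodes are accepting connections before starting the NameNode. The Python sketch below tests TCP reachability and applies a simple majority rule. The host names and port 8485 are taken from the log, while `reachable`, `up_nodes`, and `has_quorum` are hypothetical helper names.

```python
import socket
from typing import List, Sequence, Tuple

Node = Tuple[str, int]

# JournalNode addresses as they appear in the log above.
JOURNAL_NODES: List[Node] = [
    ("master", 8485),
    ("slave01", 8485),
    ("slave02", 8485),
]


def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def up_nodes(nodes: Sequence[Node]) -> List[Node]:
    """Return the subset of nodes that currently accept connections."""
    return [(host, port) for host, port in nodes if reachable(host, port)]


def has_quorum(up_count: int, total: int) -> bool:
    """Majority rule: 2 of 3 nodes, 3 of 5, and so on."""
    return up_count >= total // 2 + 1
```

Running `has_quorum(len(up_nodes(JOURNAL_NODES)), len(JOURNAL_NODES))` on the NameNode host returns False while the JournalNodes are still starting, which is the situation the log records.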
