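The process was painful, and the conclusion that follows is unsettling.
The analysis in the previous post established at least two conclusions:
1. If the active NN's writes to the JNs fail overall, the active NN itself calls terminate and the process exits;
2. There is a JN-related configuration item, dfs.namenode.shared.edits.dir, and the JN information that appears in this item is always "required" as far as the NN is concerned.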
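The two conclusions above can be sketched as a small model. This is only an illustration under stated assumptions, not the Hadoop source: the Journal interface, the journal(...) helper and TerminateException are hypothetical names introduced here, and the exception stands in for the real process exit so that the example stays runnable.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RequiredJournalSketch {
    /** Hypothetical stand-in for one journal the NN writes edits to. */
    interface Journal {
        boolean isRequired();
        void flush() throws IOException;
    }

    /** Stands in for the NN terminating itself; thrown instead of exiting the JVM. */
    static class TerminateException extends RuntimeException {
        TerminateException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Flushes every journal and returns the optional journals that failed. */
    static List<Journal> flushAll(List<Journal> journals) {
        List<Journal> bad = new ArrayList<>();
        for (Journal j : journals) {
            try {
                j.flush();
            } catch (IOException e) {
                if (j.isRequired()) {
                    // A failure on a required journal is fatal for the whole NN.
                    throw new TerminateException("flush failed for required journal", e);
                }
                bad.add(j); // An optional journal is only marked as bad.
            }
        }
        return bad;
    }

    /** Hypothetical helper that builds a journal which succeeds or fails on flush. */
    static Journal journal(boolean required, boolean fails) {
        return new Journal() {
            public boolean isRequired() { return required; }
            public void flush() throws IOException {
                if (fails) throw new IOException("simulated flush failure");
            }
        };
    }

    public static void main(String[] args) {
        // Optional journal fails: tolerated, reported as bad.
        System.out.println(flushAll(Arrays.asList(journal(true, false), journal(false, true))).size());
        try {
            // Required journal fails: the model "terminates".
            flushAll(Arrays.asList(journal(true, true)));
        } catch (TerminateException e) {
            System.out.println("would terminate: " + e.getMessage());
        }
    }
}
```

In this model the whole QJM (all JNs together) plays the role of a single required journal, which is why one failed flush on it is enough to bring the NN down.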
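This follow-up explains what "the active NN's writes to the JNs fail overall" actually means. It works in the opposite direction from the previous post: it analyzes how the problem arises, and checks whether the code is consistent with QJM's quorum mechanism (the answer, of course, has to be yes).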
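To make "quorum" concrete, here is the arithmetic as a minimal sketch. The class and method names are made up for this post; the only assumption is that a quorum means a strict majority of the configured JNs.

```java
final class QuorumMathSketch {
    /** Smallest number of JNs that forms a strict majority of n. */
    static int majoritySize(int n) {
        return n / 2 + 1;
    }

    /** How many JNs may be down or slow while a write can still succeed. */
    static int toleratedFailures(int n) {
        return n - majoritySize(n);
    }

    public static void main(String[] args) {
        int jns = 3; // the cluster in this post lists three JNs in the FATAL log
        System.out.println("majority=" + majoritySize(jns)
                + " tolerated=" + toleratedFailures(jns));
    }
}
```

With the three JNs of this cluster, a write needs two acknowledgements and can survive one unresponsive JN. "Failing overall" therefore means that fewer than two JNs answered in time.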
Once again, start from the active NN's FATAL log.
2015-11-16 07:36:50,478 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 11830 Total time for transactions(ms): 394 Number of transactions batched in Syncs: 7342 Number of syncs: 350 SyncTimes(ms): 735 30792 26555
2015-11-16 07:36:50,481 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485], stream=QuorumOutputStream starting at txid 4776804880))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:499)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:495)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:623)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3001)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:647)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
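The IOException at the top of the trace is raised when the wait for a write quorum gives up after 20000ms. The following is a minimal sketch of such a timed wait, not the Hadoop implementation: the class, the ackDelaysMs parameter (one simulated response delay per JN, negative meaning "never answers") and the use of a CountDownLatch are all assumptions made for illustration. Only the wording of the exception message is taken from the log.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class QuorumWaitSketch {
    /**
     * Waits until a majority of simulated JNs has acknowledged, or fails
     * with an IOException once timeoutMs has elapsed.
     */
    static void waitForWriteQuorum(long[] ackDelaysMs, long timeoutMs)
            throws IOException, InterruptedException {
        int majority = ackDelaysMs.length / 2 + 1;
        CountDownLatch acks = new CountDownLatch(majority);
        for (long delay : ackDelaysMs) {
            if (delay < 0) {
                continue; // this simulated JN never responds
            }
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(delay); // simulated network and disk latency
                    acks.countDown();    // one acknowledgement received
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.setDaemon(true);
            t.start();
        }
        // await returns false if the count did not reach zero within the timeout.
        if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new IOException("Timed out waiting " + timeoutMs
                    + "ms for a quorum of nodes to respond.");
        }
    }

    public static void main(String[] args) throws Exception {
        waitForWriteQuorum(new long[] {10, 20, -1}, 2000); // two of three answer
        try {
            waitForWriteQuorum(new long[] {10, -1, -1}, 200); // only one answers
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that a single slow or dead JN does not trigger the timeout; the exception only appears when fewer than a majority respond within the limit.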
Supplementary related log entries:
2015-11-16 07:36:26,770 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 4670ms to send a batch of 78 edits (12198 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:50,383 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2126