org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 51 is less than the last promised epoch 52
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal
(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod
(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy12.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:357)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:350)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-06-17 23:56:56,082 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) -
Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.8.14:8485, 192.168.8.15:8485, 192.168.8.16:8485],
stream=QuorumOutputStream starting at txid 939982784))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve
quorum size 2/3. 3 exceptions thrown:
172.16.8.15:8485: IPC's epoch 51 is less than the last promised epoch 52
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal
(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod
(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2016-06-17 23:56:56,082 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
starting at txid 939982784
2016-06-17 23:56:56,117 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-06-17 23:56:56,172 INFO namenode.NameNode (StringUtils.java:run(640)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at nn01/192.168.8.4
************************************************************/
I. Troubleshooting
1. The key part of the error is "IPC's epoch is less than the last promised epoch". Most answers online attribute this to network problems.
Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.8.14:8485, 192.168.8.15:8485, 192.168.8.16:8485]
The NameNode failed to communicate with the JournalNodes.
2. Accordingly, the logs showed that every time the other NameNode started, it probed port 8485 on the three JournalNode servers, and the probe failed. This made a network problem the most likely cause, so I checked as follows:
ifconfig -a to check the NICs for dropped packets
Check /etc/sysconfig/selinux to confirm SELINUX=disabled is set correctly
/etc/init.d/iptables status to check whether the firewall is running; our Hadoop cluster runs on an internal network, and I remembered the firewall had been disabled at deployment time
Checked the firewalls on all three JournalNode servers one after another; all were disabled
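The checks above can be scripted. As a sketch: probe a JournalNode's 8485 port, then parse the dropped-packet counter out of ifconfig output. The sample ifconfig line below is embedded so the parsing is reproducible; on a real node you would pipe the live `ifconfig -a` output instead, and the JournalNode address is taken from the log in this post.

```shell
# Probe one JournalNode RPC port from the NameNode host (address from the log above):
#   nc -z -w 3 192.168.8.14 8485 && echo "8485 open"

# Parse the dropped-packet counter from an ifconfig RX line.
# Sample output embedded for reproducibility; replace with: ifconfig -a | grep 'RX packets'
sample='          RX packets:184079544 errors:0 dropped:3 overruns:0 frame:0'
dropped=$(echo "$sample" | grep -o 'dropped:[0-9]*' | cut -d: -f2)
echo "dropped=$dropped"   # a non-zero, growing count suggests NIC-level packet loss
```

Note that newer distributions format ifconfig output differently (or ship `ip -s link` instead), so the grep pattern may need adjusting.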
II. Summary
If a Hadoop failure is network-related:
1. Check the NICs for dropped packets
2. Check that the firewall configuration is correct
In the current cluster, both NameNodes depend on the JournalNode service. If communication with it fails, NameNode HA becomes unavailable and the service goes down; restarting a NameNode then takes some time to sync the edit-log files.
So a Hadoop cluster must be configured for HA, because you cannot guarantee the environment will never hit an abnormal state.
Solutions found online:
1) Increase the JournalNode write timeout,
e.g. dfs.qjournal.write-txns.timeout.ms = 90000
This kind of timeout is easy to hit in real production environments, so the default 20 s should be raised to a larger value, such as 60 s or 90 s.
We can add a property to hdfs-site.xml under hadoop/etc/hadoop:
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
This configuration method comes from someone else's blog. Curiously, the official hdfs-default.xml reference on the Hadoop website does not document this property at all:
http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
2) Tune the NameNode's JVM parameters to trigger full GC earlier, so that each full GC pause is shorter.
3) The NameNode's default full-GC collector is the parallel GC, which is stop-the-world; change it to CMS. Adjust the NameNode startup parameters:
-XX:+UseCompressedOops
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-XX:SoftRefLRUPolicyMSPerMB=0
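These flags would typically be applied through HADOOP_NAMENODE_OPTS in hadoop-env.sh, so they affect only the NameNode and not every Hadoop daemon. A sketch follows; the -Xms/-Xmx heap sizes are illustrative assumptions, not values from this cluster, and should be sized to your NameNode's actual heap.

```shell
# hadoop-env.sh (sketch): apply the CMS flags above to the NameNode only.
# -Xms/-Xmx are illustrative assumptions; size them for your metadata footprint.
export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0 \
  ${HADOOP_NAMENODE_OPTS}"
```

Appending the previous value of HADOOP_NAMENODE_OPTS preserves any options the distribution already sets. Note these CMS flags apply to the JDK 7-era JVMs contemporary with this log; several of them were deprecated or removed in later JDKs.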