CDH: Recovering a cluster when JournalNode data files are corrupted ("Can't scan a pre-transactional edit log", "Timed out waiting 120000ms")

Overview:
On a CDH 5.11 cluster, a power outage (or a full disk) knocked out every node. After restarting, HDFS reported errors, and since HDFS was down, applications that depend on it (HBase and so on) failed as well. The recovery procedure follows.

The errors:
Relevant excerpts from my logs are shown below.
The NameNode reported:

2017-07-03 13:53:10,377 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.60.43:8485, 192.168.60.45:8485, 192.168.60.46:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1478)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1236)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1771)
at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1644)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1378)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
2017-07-03 13:53:10,382 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-07-03 13:53:10,384 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

The JournalNode log reported:

2017-07-03 14:06:36,898 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /home/journal/nameservice1/current/edits_inprogress_0000000000002539938 while determining its valid length. Position was 1048576
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4610)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193)
at org.apache.hadoop.hdfs.qjournal.server.Journal.&lt;init&gt;(Journal.java:153)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
2017-07-03 14:14:59,948 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: STARTUP_MSG:

Recovery procedure:
1. Initial analysis: Going by the logs, my read was that the edits files maintained by the JournalNodes had been corrupted. I have three JournalNodes; two of them were logging the JournalNode error above, and the third showed no errors. (The "Position was 1048576" in that log is exactly 1 MB, which likely means the edits_inprogress file contains only the padding the edit-log writer preallocates, i.e. the real edits were never flushed before the power loss.) So the working theory: the two noisy nodes hold damaged data, and copying the intact data from the third JournalNode over to them should restore the cluster. The steps follow.
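
Before copying anything, it is worth confirming which JournalNode actually holds the most complete data. A minimal sketch, assuming the bookkeeping files QJM normally keeps in the journal directory (adjust the path to your own dfs.journalnode.edits.dir); run it on each JournalNode and compare the output across nodes:

cd /home/journal/nameservice1/current
ls -l | tail -5                            # newest edits segments on this node
cat committed-txid                         # highest transaction id committed here; the healthy node should not be behind
cat last-promised-epoch last-writer-epoch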
2. The actual steps:
① First, stop all cluster services.
② To be safe, back up the data that every JournalNode maintains. My JournalNode data directory is /home/journal, so I packed it up with tar (do this on every JournalNode):

cd /home
tar -zcvf journal.bak.tar.gz ./journal

③ Delete the damaged data: on each JournalNode that was logging errors, go into the data directory and delete its contents (do this only on the JournalNodes whose logs show the error above):

cd  /home/journal/nameservice1/current
rm -rf *

④ Copy the data: use scp to push the data from the healthy JournalNode to the failed ones. Run the following on the healthy node (192.168.60.45 is one of my failed nodes; repeat the two scp commands for the other failed node):

cd  /home/journal/nameservice1/current
scp ./* root@192.168.60.45:/home/journal/nameservice1/current/
scp -r ./paxos/ root@192.168.60.45:/home/journal/nameservice1/current/
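
Optionally, before touching permissions, a quick checksum comparison can confirm the copy landed intact; a small sketch to run on both the healthy node and each repaired node (the output should match):

cd /home/journal/nameservice1/current
md5sum edits_* VERSION committed-txid | sort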

⑤ Fix permissions: the files need the same owner and group as before, because data sent over scp as root arrives owned by root. Check the file owner and group on the healthy JournalNode, then set the repaired nodes to match, as sketched below.
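
A minimal sketch, assuming the files belong to hdfs:hdfs as on a typical CDH install (this is an assumption: check the healthy node first and substitute whatever owner and group you actually see there):

ls -l /home/journal/nameservice1/current        # on the healthy node: note the owner and group
chown -R hdfs:hdfs /home/journal/nameservice1   # on each repaired node: apply the same owner and group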
⑥ Restart the services. Everything should now come back up.
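
After the restart, a couple of standard HDFS checks can confirm the recovery worked (run as the hdfs user; the NameNode should leave safe mode on its own once enough blocks have been reported):

sudo -u hdfs hdfs dfsadmin -safemode get   # should eventually report "Safe mode is OFF"
sudo -u hdfs hdfs dfsadmin -report         # all DataNodes should be listed as live
sudo -u hdfs hadoop fs -ls /               # basic sanity read of the namespace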
