Troubleshooting a JournalNode That Fails to Start
1. Problem description
1.1 After a restart, the JournalNode kept failing and would not come back up. Its log showed the error below; the root cause turned out to be a corrupted edit log.
2018-05-28 16:06:07,896 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(359)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/DHTestCluster/current/edits_inprogress_0000000000019770365 while determining its valid length. Position was 1044480
java.io.IOException: Can't scan a pre-transactional edit log.
	at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4974)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
	at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:189)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
2018-05-28 16:06:07,896 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(364)) - After resync, position is 1044480
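Before replacing any files, it can help to confirm that the edits file named in the log really is unreadable. Hadoop ships an offline edits viewer (`hdfs oev`) that parses an edits file outside the JournalNode process; a sketch, using the path from the log above (substitute your own journal directory):

```shell
# The edits_inprogress file named in the WARN line above.
EDITS=/hadoop/hdfs/journal/DHTestCluster/current/edits_inprogress_0000000000019770365

# Dump the file as XML; on a corrupted file the viewer typically fails
# partway through instead of producing a complete document.
hdfs oev -i "$EDITS" -o /tmp/edits.xml -p xml
```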
2. Solution
2.1 Copy the edit logs from a JournalNode machine that is running normally
2.2 On the healthy machine, change to the journal directory that contains the edit logs
cd /hadoop/hdfs/journal/DHTestCluster (locate this directory from your own configuration, e.g. dfs.journalnode.edits.dir)
2.3 Archive the current directory
tar -zcvf current.tar.gz ./current
2.4 On the failing machine, delete the corrupted edit logs
cd /hadoop/hdfs/journal/DHTestCluster/
rm -rf current/
2.5 Copy current.tar.gz to the failing machine, then unpack it there
scp current.tar.gz hdfs@hadoop:/hadoop/hdfs/journal/DHTestCluster
tar -zxvf current.tar.gz
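For reference, steps 2.2 through 2.5 can be rehearsed end to end with throwaway directories before touching the real journal path (the directory names and edits file name below are hypothetical stand-ins, not from the cluster above):

```shell
# Local dry run of the copy-and-restore procedure (steps 2.2-2.5).
# HEALTHY stands in for the good JournalNode's journal directory, BROKEN
# for the failing one; in production both would be
# /hadoop/hdfs/journal/DHTestCluster on their respective hosts.
HEALTHY=$(mktemp -d)
BROKEN=$(mktemp -d)

# Pretend the healthy node holds one finalized edits segment.
mkdir -p "$HEALTHY/current"
echo demo > "$HEALTHY/current/edits_0000000000000000001-0000000000000000100"

# Steps 2.2/2.3: archive the current directory on the healthy node.
(cd "$HEALTHY" && tar -zcf current.tar.gz current)

# Step 2.4: remove the corrupted directory on the broken node.
rm -rf "$BROKEN/current"

# Step 2.5: transfer the archive (scp in production, cp in this dry run)
# and unpack it in the journal directory.
cp "$HEALTHY/current.tar.gz" "$BROKEN/"
(cd "$BROKEN" && tar -zxf current.tar.gz)

ls "$BROKEN/current"
```

In production, also check that the restored files end up owned by the HDFS service user before restarting the JournalNode.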
2.6 Restart the JournalNode