Hadoop Source Code Analysis (23)
Loading the EditLog
Document (22) analyzed the method that loads the FSImage. That method ultimately turns the information stored in the FSImage into INodeFile and INodeDirectory objects and builds the directory tree from them. This document continues with the method that loads the editlog.
The editlog-loading logic is likewise invoked from the loadFSImage method.
Like the FSImage, the editlog is a binary file.
It, too, can be converted into an XML file with the offline edits viewer, for example:
hdfs oev -i edits_0000000000000075943-0000000000000075947 -o edits.xml
The converted XML looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>75943</TXID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75944</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/anonymous/4ed5f9a3-a9c5-4f5b-b474-32ff863ebe96</PATH>
      <TIMESTAMP>1598873879942</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5127</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75945</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/anonymous/5d068eba-5bb5-4250-aca0-a6850b414099</PATH>
      <TIMESTAMP>1598873880144</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5129</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75946</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/root/20452750-7ee8-4e63-8fa2-ef0ac94fe6f9</PATH>
      <TIMESTAMP>1598873880186</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5131</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75947</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/anonymous/47a71231-d282-4e8b-997c-ae2057b18328</PATH>
      <TIMESTAMP>1598873880309</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5133</RPC_CALLID>
    </DATA>
  </RECORD>
</EDITS>
As the XML shows, the data is organized in units of RECORD tags: each RECORD node represents one operation. Inside a RECORD, the OPCODE tag names the operation and the DATA tag carries the operation's data.
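To make this structure concrete, the sketch below (a toy illustration, not Hadoop code; the class name EditsXmlSketch is made up) uses the JDK's built-in DOM parser to pull the OPCODE out of every RECORD in such a file:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class EditsXmlSketch {
    /** Returns the OPCODE of every RECORD element, in document order. */
    public static List<String> opcodes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = doc.getElementsByTagName("RECORD");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            out.add(record.getElementsByTagName("OPCODE").item(0).getTextContent());
        }
        return out;
    }
}
```

Run over the sample file above, opcodes would return one OP_START_LOG_SEGMENT followed by four OP_DELETE entries.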
With the file format covered, we can analyze the loadEdits method that loads the editlog. Its content is as follows:
private long loadEdits(Iterable<EditLogInputStream> editStreams,
    FSNamesystem target, StartupOption startOpt, MetaRecoveryContext recovery)
    throws IOException {
  LOG.debug("About to load edits:\n " + Joiner.on("\n ").join(editStreams));
  StartupProgress prog = NameNode.getStartupProgress();
  prog.beginPhase(Phase.LOADING_EDITS);
  long prevLastAppliedTxId = lastAppliedTxId;
  try {
    FSEditLogLoader loader = new FSEditLogLoader(target, lastAppliedTxId);
    // Load latest edits
    for (EditLogInputStream editIn : editStreams) {
      LOG.info("Reading " + editIn + " expecting start txid #" +
          (lastAppliedTxId + 1));
      try {
        loader.loadFSEdits(editIn, lastAppliedTxId + 1, startOpt, recovery);
      } finally {
        // Update lastAppliedTxId even in case of error, since some ops may
        // have been successfully applied before the error.
        lastAppliedTxId = loader.getLastAppliedTxId();
      }
      // If we are in recovery mode, we may have skipped over some txids.
      if (editIn.getLastTxId() != HdfsConstants.INVALID_TXID) {
        lastAppliedTxId = editIn.getLastTxId();
      }
    }
  } finally {
    FSEditLog.closeAllStreams(editStreams);
    // update the counts
    updateCountForQuota(target.getBlockManager().getStoragePolicySuite(),
        target.dir.rootDir, quotaInitThreads);
  }
  prog.endPhase(Phase.LOADING_EDITS);
  return lastAppliedTxId - prevLastAppliedTxId;
}
This method first creates an FSEditLogLoader object, then iterates over the editlog input streams obtained earlier, calling the loader's loadFSEdits method on each stream to load its data. Note that lastAppliedTxId is updated in a finally block, since some ops may have been applied successfully before an error. The loadFSEdits method is as follows:
long loadFSEdits(EditLogInputStream edits, long expectedStartingTxId,
    StartupOption startOpt, MetaRecoveryContext recovery) throws IOException {
  StartupProgress prog = NameNode.getStartupProgress();
  Step step = createStartupProgressStep(edits);
  prog.beginStep(Phase.LOADING_EDITS, step);
  fsNamesys.writeLock();
  try {
    long startTime = monotonicNow();
    FSImage.LOG.info("Start loading edits file " + edits.getName());
    long numEdits = loadEditRecords(edits, false, expectedStartingTxId,
        startOpt, recovery);
    FSImage.LOG.info("Edits file " + edits.getName()
        + " of size " + edits.length() + " edits # " + numEdits
        + " loaded in " + (monotonicNow()-startTime)/1000 + " seconds");
    return numEdits;
  } finally {
    edits.close();
    fsNamesys.writeUnlock("loadFSEdits");
    prog.endStep(Phase.LOADING_EDITS, step);
  }
}
The key call here is the loadEditRecords method, whose content is as follows:
long loadEditRecords(EditLogInputStream in, boolean closeOnExit,
    long expectedStartingTxId, StartupOption startOpt,
    MetaRecoveryContext recovery) throws IOException {
  FSDirectory fsDir = fsNamesys.dir;
  EnumMap<FSEditLogOpCodes, Holder<Integer>> opCounts =
      new EnumMap<FSEditLogOpCodes, Holder<Integer>>(FSEditLogOpCodes.class);
  if (LOG.isTraceEnabled()) {
    LOG.trace("Acquiring write lock to replay edit log");
  }
  fsNamesys.writeLock();
  fsDir.writeLock();
  long recentOpcodeOffsets[] = new long[4];
  Arrays.fill(recentOpcodeOffsets, -1);
  long expectedTxId = expectedStartingTxId;
  long numEdits = 0;
  long lastTxId = in.getLastTxId();
  long numTxns = (lastTxId - expectedStartingTxId) + 1;
  StartupProgress prog = NameNode.getStartupProgress();
  Step step = createStartupProgressStep(in);
  prog.setTotal(Phase.LOADING_EDITS, step, numTxns);
  Counter counter = prog.getCounter(Phase.LOADING_EDITS, step);
  long lastLogTime = monotonicNow();
  long lastInodeId = fsNamesys.dir.getLastInodeId();
  try {
    while (true) {
      try {
        FSEditLogOp op;
        try {
          op = in.readOp();
          if (op == null) {
            break;
          }
        } catch (Throwable e) {
          // Handle a problem with our input
          check203UpgradeFailure(in.getVersion(true), e);
          String errorMessage =
              formatEditLogReplayError(in, recentOpcodeOffsets, expectedTxId);
          FSImage.LOG.error(errorMessage, e);
          if (recovery == null) {
            // We will only try to skip over problematic opcodes when in
            // recovery mode.
            throw new EditLogInputException(errorMessage, e, numEdits);
          }
          MetaRecoveryContext.editLogLoaderPrompt(
              "We failed to read txId " + expectedTxId,
              recovery, "skipping the bad section in the log");
          in.resync();
          continue;
        }
        recentOpcodeOffsets[(int)(numEdits % recentOpcodeOffsets.length)] =
            in.getPosition();
        if (op.hasTransactionId()) {
          if (op.getTransactionId() > expectedTxId) {
            MetaRecoveryContext.editLogLoaderPrompt("There appears " +
                "to be a gap in the edit log. We expected txid " +
                expectedTxId + ", but got txid " +
                op.getTransactionId() + ".", recovery, "ignoring missing " +
                " transaction IDs");
          } else if (op.getTransactionId() < expectedTxId) {
            MetaRecoveryContext.editLogLoaderPrompt("There appears " +
                "to be an out-of-order edit in the edit log. We " +
                "expected txid " + expectedTxId + ", but got txid " +
                op.getTransactionId() + ".", recovery,
                "skipping the out-of-order edit");
            continue;
          }
        }
        try {
          if (LOG.isTraceEnabled()) {
            LOG.trace("op=" + op + ", startOpt=" + startOpt
                + ", numEdits=" + numEdits + ", totalEdits=" + totalEdits);
          }
          long inodeId = applyEditLogOp(op, fsDir, startOpt,
              in.getVersion(true), lastInodeId);
          if (lastInodeId < inodeId) {
            lastInodeId = inodeId;
          }
        } catch (RollingUpgradeOp.RollbackException e) {
          throw e;
        } catch (Throwable e) {
          LOG.error("Encountered exception on operation " + op, e);
          if (recovery == null) {
            throw e instanceof IOException? (IOException)e: new IOException(e);
          }
          MetaRecoveryContext.editLogLoaderPrompt("Failed to " +
              "apply edit log operation " + op + ": error " +
              e.getMessage(), recovery, "applying edits");
        }
        // Now that the operation has been successfully decoded and
        // applied, update our bookkeeping.
        incrOpCount(op.opCode, opCounts, step, counter);
        if (op.hasTransactionId()) {
          lastAppliedTxId = op.getTransactionId();
          expectedTxId = lastAppliedTxId + 1;
        } else {
          expectedTxId = lastAppliedTxId = expectedStartingTxId;
        }
        // log progress
        if (op.hasTransactionId()) {
          long now = monotonicNow();
          if (now - lastLogTime > REPLAY_TRANSACTION_LOG_INTERVAL) {
            long deltaTxId = lastAppliedTxId - expectedStartingTxId + 1;
            int percent = Math.round((float) deltaTxId / numTxns * 100);
            LOG.info("replaying edit log: " + deltaTxId + "/" + numTxns
                + " transactions completed. (" + percent + "%)");
            lastLogTime = now;
          }
        }
        numEdits++;
        totalEdits++;
      } catch (RollingUpgradeOp.RollbackException e) {
        LOG.info("Stopped at OP_START_ROLLING_UPGRADE for rollback.");
        break;
      } catch (MetaRecoveryContext.RequestStopException e) {
        MetaRecoveryContext.LOG.warn("Stopped reading edit log at " +
            in.getPosition() + "/" + in.length());
        break;
      }
    }
  } finally {
    fsNamesys.dir.resetLastInodeId(lastInodeId);
    if (closeOnExit) {
      in.close();
    }
    fsDir.writeUnlock();
    fsNamesys.writeUnlock("loadEditRecords");
    if (LOG.isTraceEnabled()) {
      LOG.trace("replaying edit log finished");
    }
    if (FSImage.LOG.isDebugEnabled()) {
      dumpOpCounts(opCounts);
    }
  }
  return numEdits;
}
This method loads the data in the editlog. The heart of it is the while loop, which reads every RECORD and applies its content: the readOp method reads a RECORD, and the applyEditLogOp method executes it.
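Before diving into those two methods, the txid bookkeeping of the loop is worth isolating. The toy replay below (ReplaySketch is a hypothetical name; real ops are FSEditLogOp objects, reduced here to bare txids) mimics how the loop behaves in recovery mode: out-of-order edits with a txid smaller than expected are skipped, gaps are tolerated, and expectedTxId is kept at lastApplied + 1 after every applied op:

```java
import java.util.List;

public class ReplaySketch {
    /**
     * Replays a sequence of transaction IDs the way loadEditRecords does in
     * recovery mode: out-of-order txids (smaller than expected) are skipped,
     * gaps (larger than expected) are tolerated, and the expected txid is
     * advanced to lastApplied + 1 after each applied op.
     * Returns the last applied txid, or -1 if nothing was applied.
     */
    public static long replay(List<Long> txids, long expectedStartingTxId) {
        long expectedTxId = expectedStartingTxId;
        long lastAppliedTxId = -1;
        for (long txid : txids) {
            if (txid < expectedTxId) {
                continue;               // out-of-order edit: skip it
            }
            // txid == expectedTxId is the normal case;
            // txid > expectedTxId is a gap, which recovery mode tolerates.
            lastAppliedTxId = txid;     // "apply" the op
            expectedTxId = lastAppliedTxId + 1;
        }
        return lastAppliedTxId;
    }
}
```

In the real code the non-recovery path throws instead of skipping; this sketch only models the recovery behavior.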
First, the readOp method, whose content is as follows:
public FSEditLogOp readOp() throws IOException {
  FSEditLogOp ret;
  if (cachedOp != null) {
    ret = cachedOp;
    cachedOp = null;
    return ret;
  }
  return nextOp();
}
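The cachedOp field used here is a one-element pushback cache: a caller can read an op, decide not to consume it yet, and stash it back so that the next readOp returns it again. A minimal sketch of the pattern (CachedOpSketch is a made-up name; the real stream produces FSEditLogOp objects, replaced here by strings):

```java
import java.util.Iterator;

public class CachedOpSketch {
    private final Iterator<String> source;
    private String cachedOp; // at most one op pushed back, as in EditLogInputStream

    public CachedOpSketch(Iterator<String> source) {
        this.source = source;
    }

    /** Returns the cached op if present, otherwise reads the next one. */
    public String readOp() {
        if (cachedOp != null) {
            String ret = cachedOp;
            cachedOp = null;
            return ret;
        }
        return source.hasNext() ? source.next() : null;
    }

    /** Reads an op and caches it, so the next readOp() returns it again. */
    public String peekOp() {
        String op = readOp();
        cachedOp = op;
        return op;
    }
}
```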
This method first checks whether there is a cached op; if so, it returns the cached op directly. Otherwise it calls the nextOp method. The implementation class here is EditLogFileInputStream, whose nextOp reads:
protected FSEditLogOp nextOp() throws IOException {
  return nextOpImpl(false);
}
The nextOpImpl method it calls is as follows:
private FSEditLogOp nextOpImpl(boolean skipBrokenEdits) throws IOException {
  FSEditLogOp op = null;
  switch (state) {
  case UNINIT:
    try {
      init(true);
    } catch (Throwable e) {
      LOG.error("caught exception initializing " + this, e);
      if (skipBrokenEdits) {
        return null;
      }
      Throwables.propagateIfPossible(e, IOException.class);
    }
    Preconditions.checkState(state != State.UNINIT);
    return nextOpImpl(skipBrokenEdits);
  case OPEN:
    op = reader.readOp(skipBrokenEdits);
    if ((op != null) && (op.hasTransactionId())) {
      long txId = op.getTransactionId();
      if ((txId >= lastTxId) &&
          (lastTxId != HdfsConstants.INVALID_TXID)) {
        //
        // Sometimes, the NameNode crashes while it's writing to the
        // edit log. In that case, you can end up with an unfinalized edit log
        // which has some garbage at the end.
        // JournalManager#recoverUnfinalizedSegments will finalize these
        // unfinished edit logs, giving them a defined final transaction
        // ID. Then they will be renamed, so that any subsequent
        // readers will have this information.
        //
        // Since there may be garbage at the end of these "cleaned up"
        // logs, we want to be sure to skip it here if we've read everything
        // we were supposed to read out of the stream.
        // So we force an EOF on all subsequent reads.
        //
        long skipAmt = log.length() - tracker.getPos();
        if (skipAmt > 0) {
          if (LOG.isDebugEnabled()) {
            LOG.debug("skipping " + skipAmt + " bytes at the end " +
                "of edit log '" + getName() + "': reached txid " + txId +
                " out of " + lastTxId);
          }
          tracker.clearLimit();
          IOUtils.skipFully(tracker, skipAmt);
        }
      }
    }
    break;
  case CLOSED:
    break; // return null
  }
  return op;
}
This method dispatches on the value of state, whose initial value is UNINIT. In that case it calls the init method, which moves state to OPEN, and then recursively calls nextOpImpl to run the OPEN branch. The most important part of the OPEN branch is the call to the reader's readOp method.
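The UNINIT/OPEN/CLOSED handling is a small lazy-initialization state machine: the first read triggers init and then retries itself recursively. A stripped-down sketch of that pattern (LazyInitStreamSketch is a hypothetical name; ints stand in for decoded ops, with -1 as end-of-stream):

```java
public class LazyInitStreamSketch {
    enum State { UNINIT, OPEN, CLOSED }

    private State state = State.UNINIT;
    private final int[] ops;  // stands in for the decoded edit log
    private int pos;

    public LazyInitStreamSketch(int[] ops) {
        this.ops = ops;
    }

    /** Mirrors nextOpImpl: UNINIT triggers init() and a retry, OPEN reads, CLOSED yields -1. */
    public int nextOp() {
        switch (state) {
        case UNINIT:
            init();              // moves state to OPEN
            return nextOp();     // recurse, exactly as nextOpImpl does
        case OPEN:
            return pos < ops.length ? ops[pos++] : -1;
        case CLOSED:
        default:
            return -1;
        }
    }

    private void init() {
        state = State.OPEN;
    }

    public void close() {
        state = State.CLOSED;
    }
}
```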
The init method itself looks like this:
private void init(boolean verifyLayoutVersion)
    throws LogHeaderCorruptException, IOException {
  Preconditions.checkState(state == State.UNINIT);
  BufferedInputStream bin = null;
  try {
    fStream = log.getInputStream();
    bin = new BufferedInputStream(fStream);
    tracker = new FSEditLogLoader.PositionTrackingInputStream(bin);
    dataIn = new DataInputStream(tracker);
    try {
      logVersion = readLogVersion(dataIn, verifyLayoutVersion);
    } catch (EOFException eofe) {
      throw new LogHeaderCorruptException("No header found in log");
    }
    // We assume future layout will also support ADD_LAYOUT_FLAGS
    if (NameNodeLayoutVersion.supports(
        LayoutVersion.Feature.ADD_LAYOUT_FLAGS, logVersion) ||
        logVersion < NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION) {
      try {
        LayoutFlags.read(dataIn);
      } catch (EOFException eofe) {
        throw new LogHeaderCorruptException("EOF while reading layout " +
            "flags from log");
      }
    }
    reader = new FSEditLogOp.Reader(dataIn, tracker, logVersion);
    reader.setMaxOpSize(maxOpSize);
    state = State.OPEN;
  } finally {
    if (reader == null) {
      IOUtils.cleanup(LOG, dataIn, tracker, bin, fStream);
      state = State.CLOSED;
    }
  }
}
The key steps here are the creation of the Reader on top of the stream chain built just above, and the transition of state to OPEN.
In the OPEN state, the Reader's readOp method is then called. Its content is as follows:
public FSEditLogOp readOp(boolean skipBrokenEdits) throws IOException {
  while (true) {
    try {
      return decodeOp();
    } catch (IOException e) {
      in.reset();
      if (!skipBrokenEdits) {
        throw e;
      }
    } catch (RuntimeException e) {
      // FSEditLogOp#decodeOp is not supposed to throw RuntimeException.
      // However, we handle it here for recovery mode, just to be more
      // robust.
      in.reset();
      if (!skipBrokenEdits) {
        throw e;
      }
    } catch (Throwable e) {
      in.reset();
      if (!skipBrokenEdits) {
        throw new IOException("got unexpected exception " +
            e.getMessage(), e);
      }
    }
    // Move ahead one byte and re-try the decode process.
    if (in.skip(1) < 1) {
      return null;
    }
  }
}
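The recovery behavior here, which on any decode failure resets to the marked position, skips a single byte, and tries again, is a classic resynchronization loop. The sketch below (ResyncSketch is a made-up name; a "record" is reduced to a 0x42 marker byte followed by one payload byte) shows the same skip-one-byte strategy over a buffer with garbage in the middle:

```java
public class ResyncSketch {
    /**
     * Decodes records of the form [0x42, payload] from buf. When skipBroken
     * is true, garbage is skipped one byte at a time, the same "skip(1) and
     * re-try" strategy Reader.readOp uses in recovery mode; otherwise the
     * first broken record aborts the read.
     * Returns the payloads of all records found.
     */
    public static java.util.List<Byte> readAll(byte[] buf, boolean skipBroken)
            throws java.io.IOException {
        java.util.List<Byte> out = new java.util.ArrayList<>();
        int pos = 0;
        while (pos < buf.length) {
            if (buf[pos] == 0x42 && pos + 1 < buf.length) {
                out.add(buf[pos + 1]);  // decode succeeded
                pos += 2;
            } else if (skipBroken) {
                pos += 1;               // move ahead one byte and re-try
            } else {
                throw new java.io.IOException("broken record at offset " + pos);
            }
        }
        return out;
    }
}
```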
The key line here is the call to decodeOp; everything else is exception handling. The decodeOp method's content is as follows:
private FSEditLogOp decodeOp() throws IOException {
  limiter.setLimit(maxOpSize);
  in.mark(maxOpSize);
  if (checksum != null) {
    checksum.reset();
  }
  byte opCodeByte;
  try {
    opCodeByte = in.readByte();
  } catch (EOFException eof) {
    // EOF at an opcode boundary is expected.
    return null;
  }
  FSEditLogOpCodes opCode = FSEditLogOpCodes.fromByte(opCodeByte);
  if (opCode == OP_INVALID) {
    verifyTerminator();
    return null;
  }
  FSEditLogOp op = cache.get(opCode);
  if (op == null) {
    throw new IOException("Read invalid opcode " + opCode);
  }
  if (supportEditLogLength) {
    in.readInt();
  }
  if (NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.STORED_TXIDS, logVersion)) {
    // Read the txid
    op.setTransactionId(in.readLong());
  } else {
    op.setTransactionId(HdfsConstants.INVALID_TXID);
  }
  op.readFields(in, logVersion);
  validateChecksum(in, checksum, op.txid);
  return op;
}
This method first reads the opcode byte from the input stream and converts it into an FSEditLogOpCodes value, then looks up the corresponding op object in the cache. Finally it reads the remaining data (the record length, the transaction id, and the op-specific fields) into the op and validates the checksum.
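The record layout decodeOp walks through can be imitated with the JDK's data streams. The sketch below (OpWireSketch is a made-up name; it assumes the opcode byte, a 4-byte length, and an 8-byte txid are always present, and it omits the trailing checksum and layout-version conditionals the real format has) round-trips one record:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class OpWireSketch {
    /** A decoded record: opcode byte, txid, and opaque op-specific payload. */
    public static class Op {
        public final byte opCode;
        public final long txid;
        public final byte[] body;
        Op(byte opCode, long txid, byte[] body) {
            this.opCode = opCode;
            this.txid = txid;
            this.body = body;
        }
    }

    /** Writes one record: opcode byte, 4-byte total length, 8-byte txid, op fields. */
    public static byte[] encode(byte opCode, long txid, byte[] body) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(opCode);
        out.writeInt(1 + 4 + 8 + body.length); // total record length (an assumption here)
        out.writeLong(txid);
        out.write(body);
        return bos.toByteArray();
    }

    public static Op decode(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        byte opCode = in.readByte();  // 1. opcode byte
        int length = in.readInt();    // 2. record length
        long txid = in.readLong();    // 3. transaction id
        byte[] body = new byte[length - 13];
        in.readFully(body);           // 4. op-specific fields
        return new Op(opCode, txid, body);
    }
}
```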
Having read one RECORD, we return to loadEditRecords, which calls the applyEditLogOp method to execute the op's content. That method reads as follows:
private long applyEditLogOp(FSEditLogOp op, FSDirectory fsDir,
    StartupOption startOpt, int logVersion, long lastInodeId) throws IOException {
  long inodeId = INodeId.GRANDFATHER_INODE_ID;
  if (LOG.isTraceEnabled()) {
    LOG.trace("replaying edit log: " + op);
  }
  final boolean toAddRetryCache = fsNamesys.hasRetryCache() && op.hasRpcIds();
  switch (op.opCode) {
  ...
  case OP_DELETE: {
    DeleteOp deleteOp = (DeleteOp)op;
    FSDirDeleteOp.deleteForEditLog(
        fsDir, renameReservedPathsOnUpgrade(deleteOp.path, logVersion),
        deleteOp.timestamp);
    if (toAddRetryCache) {
      fsNamesys.addCacheEntry(deleteOp.rpcClientId, deleteOp.rpcCallId);
    }
    break;
  }
  ...
  default:
    throw new IOException("Invalid operation read " + op.opCode);
  }
  return inodeId;
}
This method is essentially one large switch statement. The full switch in the source is very long and covers all of the NameNode's write operations; only the handling of the OP_DELETE operation discussed earlier is kept above.
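The shape of that dispatch, one case per write operation applied against the in-memory namespace, can be sketched against a mock namespace (ApplyOpSketch and its two-opcode enum are invented for illustration; the real namespace is the INode tree, reduced here to a set of paths):

```java
import java.io.IOException;
import java.util.Set;

public class ApplyOpSketch {
    public enum OpCode { OP_MKDIR, OP_DELETE }

    /**
     * Applies one op to a mock namespace (a set of paths), mirroring the
     * shape of applyEditLogOp's switch: one case per write operation,
     * default throwing for unknown opcodes.
     */
    public static void apply(OpCode opCode, String path, Set<String> namespace)
            throws IOException {
        switch (opCode) {
        case OP_MKDIR:
            namespace.add(path);
            break;
        case OP_DELETE:
            namespace.remove(path);
            break;
        default:
            throw new IOException("Invalid operation read " + opCode);
        }
    }
}
```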