Hadoop Source Code Analysis (23)

Loading the EditLog

  Document (22) analyzed the method that loads the FSImage; that method ultimately turns the information in the FSImage into INodeFile and INodeDirectory objects and builds the directory tree out of them. This article continues with the method that loads the editlog.

  The editlog, too, is loaded from within the loadFSImage method; the relevant call is shown below:

Editlog-loading code snippet

  Like the FSImage, the editlog is a binary file; its raw content looks like this:

Editlog file content

  Likewise, this file can be converted to an XML file with a command:

hdfs oev -i edits_0000000000000075943-0000000000000075947 -o edits.xml

  The converted XML content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>75943</TXID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75944</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/anonymous/4ed5f9a3-a9c5-4f5b-b474-32ff863ebe96</PATH>
      <TIMESTAMP>1598873879942</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5127</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75945</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/anonymous/5d068eba-5bb5-4250-aca0-a6850b414099</PATH>
      <TIMESTAMP>1598873880144</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5129</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75946</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/root/20452750-7ee8-4e63-8fa2-ef0ac94fe6f9</PATH>
      <TIMESTAMP>1598873880186</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5131</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>75947</TXID>
      <LENGTH>0</LENGTH>
      <PATH>/tmp/hive/anonymous/47a71231-d282-4e8b-997c-ae2057b18328</PATH>
      <TIMESTAMP>1598873880309</TIMESTAMP>
      <RPC_CLIENTID>cb655db7-f8e8-49c6-93a0-79716e60fac6</RPC_CLIENTID>
      <RPC_CALLID>5133</RPC_CALLID>
    </DATA>
  </RECORD>
</EDITS>

  As this XML shows, the data is organized in units of RECORD tags: each RECORD node represents one operation. Inside a RECORD, the OPCODE tag names the operation and the DATA tag carries the operation's payload.
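
  As a quick way to inspect such a file programmatically, here is a minimal sketch (assuming the edits.xml produced by the oev command above; EditsXmlSummary is just an illustrative name) that parses the XML with the JDK's DOM API and tallies operations per OPCODE:

import java.io.File;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Summarize an oev-generated edits.xml: count operations per OPCODE.
public class EditsXmlSummary {
  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new File("edits.xml"));
    NodeList records = doc.getElementsByTagName("RECORD");
    Map<String, Integer> counts = new TreeMap<>();
    for (int i = 0; i < records.getLength(); i++) {
      Element record = (Element) records.item(i);
      // Every RECORD carries exactly one OPCODE naming the operation
      String opcode = record.getElementsByTagName("OPCODE")
          .item(0).getTextContent();
      counts.merge(opcode, 1, Integer::sum);
    }
    counts.forEach((op, n) -> System.out.println(op + " x " + n));
  }
}

  For the file shown above this prints OP_DELETE x 4 and OP_START_LOG_SEGMENT x 1.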

  With the editlog file format covered, we can move on to the loadEdits method that loads it. Its content is as follows:

 private long loadEdits(Iterable<EditLogInputStream> editStreams,
      FSNamesystem target, StartupOption startOpt, MetaRecoveryContext recovery)
      throws IOException {
    LOG.debug("About to load edits:\n  " + Joiner.on("\n  ").join(editStreams));
    StartupProgress prog = NameNode.getStartupProgress();
    prog.beginPhase(Phase.LOADING_EDITS);

    long prevLastAppliedTxId = lastAppliedTxId;  
    try {    
      FSEditLogLoader loader = new FSEditLogLoader(target, lastAppliedTxId);

      // Load latest edits
      for (EditLogInputStream editIn : editStreams) {
        LOG.info("Reading " + editIn + " expecting start txid #" +
              (lastAppliedTxId + 1));
        try {
          loader.loadFSEdits(editIn, lastAppliedTxId + 1, startOpt, recovery);
        } finally {
          // Update lastAppliedTxId even in case of error, since some ops may
          // have been successfully applied before the error.
          lastAppliedTxId = loader.getLastAppliedTxId();
        }
        // If we are in recovery mode, we may have skipped over some txids.
        if (editIn.getLastTxId() != HdfsConstants.INVALID_TXID) {
          lastAppliedTxId = editIn.getLastTxId();
        }
      }
    } finally {
      FSEditLog.closeAllStreams(editStreams);
      // update the counts
      updateCountForQuota(target.getBlockManager().getStoragePolicySuite(),
          target.dir.rootDir, quotaInitThreads);
    }
    prog.endPhase(Phase.LOADING_EDITS);
    return lastAppliedTxId - prevLastAppliedTxId;
  }

  This method first creates an FSEditLogLoader object, then iterates over the editlog input streams obtained earlier, calling the loader's loadFSEdits method on each stream to load its data. That method's content is as follows:

long loadFSEdits(EditLogInputStream edits, long expectedStartingTxId,
      StartupOption startOpt, MetaRecoveryContext recovery) throws IOException {
    StartupProgress prog = NameNode.getStartupProgress();
    Step step = createStartupProgressStep(edits);
    prog.beginStep(Phase.LOADING_EDITS, step);
    fsNamesys.writeLock();
    try {
      long startTime = monotonicNow();
      FSImage.LOG.info("Start loading edits file " + edits.getName());
      long numEdits = loadEditRecords(edits, false, expectedStartingTxId,
          startOpt, recovery);
      FSImage.LOG.info("Edits file " + edits.getName() 
          + " of size " + edits.length() + " edits # " + numEdits 
          + " loaded in " + (monotonicNow()-startTime)/1000 + " seconds");
      return numEdits;
    } finally {
      edits.close();
      fsNamesys.writeUnlock("loadFSEdits");
      prog.endStep(Phase.LOADING_EDITS, step);
    }
  }

  The key call here is loadEditRecords, whose content is as follows:

 long loadEditRecords(EditLogInputStream in, boolean closeOnExit,
      long expectedStartingTxId, StartupOption startOpt,
      MetaRecoveryContext recovery) throws IOException {
    FSDirectory fsDir = fsNamesys.dir;

    EnumMap<FSEditLogOpCodes, Holder<Integer>> opCounts =
      new EnumMap<FSEditLogOpCodes, Holder<Integer>>(FSEditLogOpCodes.class);

    if (LOG.isTraceEnabled()) {
      LOG.trace("Acquiring write lock to replay edit log");
    }

    fsNamesys.writeLock();
    fsDir.writeLock();

    long recentOpcodeOffsets[] = new long[4];
    Arrays.fill(recentOpcodeOffsets, -1);

    long expectedTxId = expectedStartingTxId;
    long numEdits = 0;
    long lastTxId = in.getLastTxId();
    long numTxns = (lastTxId - expectedStartingTxId) + 1;
    StartupProgress prog = NameNode.getStartupProgress();
    Step step = createStartupProgressStep(in);
    prog.setTotal(Phase.LOADING_EDITS, step, numTxns);
    Counter counter = prog.getCounter(Phase.LOADING_EDITS, step);
    long lastLogTime = monotonicNow();
    long lastInodeId = fsNamesys.dir.getLastInodeId();

    try {
      while (true) {
        try {
          FSEditLogOp op;
          try {
            op = in.readOp();
            if (op == null) {
              break;
            }
          } catch (Throwable e) {
            // Handle a problem with our input
            check203UpgradeFailure(in.getVersion(true), e);
            String errorMessage =
              formatEditLogReplayError(in, recentOpcodeOffsets, expectedTxId);
            FSImage.LOG.error(errorMessage, e);
            if (recovery == null) {
               // We will only try to skip over problematic opcodes when in
               // recovery mode.
              throw new EditLogInputException(errorMessage, e, numEdits);
            }
            MetaRecoveryContext.editLogLoaderPrompt(
                "We failed to read txId " + expectedTxId,
                recovery, "skipping the bad section in the log");
            in.resync();
            continue;
          }
          recentOpcodeOffsets[(int)(numEdits % recentOpcodeOffsets.length)] =
            in.getPosition();
          if (op.hasTransactionId()) {
            if (op.getTransactionId() > expectedTxId) { 
              MetaRecoveryContext.editLogLoaderPrompt("There appears " +
                  "to be a gap in the edit log.  We expected txid " +
                  expectedTxId + ", but got txid " +
                  op.getTransactionId() + ".", recovery, "ignoring missing " +
                  " transaction IDs");
            } else if (op.getTransactionId() < expectedTxId) { 
              MetaRecoveryContext.editLogLoaderPrompt("There appears " +
                  "to be an out-of-order edit in the edit log.  We " +
                  "expected txid " + expectedTxId + ", but got txid " +
                  op.getTransactionId() + ".", recovery,
                  "skipping the out-of-order edit");
              continue;
            }
          }
          try {
            if (LOG.isTraceEnabled()) {
              LOG.trace("op=" + op + ", startOpt=" + startOpt
                  + ", numEdits=" + numEdits + ", totalEdits=" + totalEdits);
            }
            long inodeId = applyEditLogOp(op, fsDir, startOpt,
                in.getVersion(true), lastInodeId);
            if (lastInodeId < inodeId) {
              lastInodeId = inodeId;
            }
          } catch (RollingUpgradeOp.RollbackException e) {
            throw e;
          } catch (Throwable e) {
            LOG.error("Encountered exception on operation " + op, e);
            if (recovery == null) {
              throw e instanceof IOException? (IOException)e: new IOException(e);
            }

            MetaRecoveryContext.editLogLoaderPrompt("Failed to " +
             "apply edit log operation " + op + ": error " +
             e.getMessage(), recovery, "applying edits");
          }
          // Now that the operation has been successfully decoded and
          // applied, update our bookkeeping.
          incrOpCount(op.opCode, opCounts, step, counter);
          if (op.hasTransactionId()) {
            lastAppliedTxId = op.getTransactionId();
            expectedTxId = lastAppliedTxId + 1;
          } else {
            expectedTxId = lastAppliedTxId = expectedStartingTxId;
          }
          // log progress
          if (op.hasTransactionId()) {
            long now = monotonicNow();
            if (now - lastLogTime > REPLAY_TRANSACTION_LOG_INTERVAL) {
              long deltaTxId = lastAppliedTxId - expectedStartingTxId + 1;
              int percent = Math.round((float) deltaTxId / numTxns * 100);
              LOG.info("replaying edit log: " + deltaTxId + "/" + numTxns
                  + " transactions completed. (" + percent + "%)");
              lastLogTime = now;
            }
          }
          numEdits++;
          totalEdits++;
        } catch (RollingUpgradeOp.RollbackException e) {
          LOG.info("Stopped at OP_START_ROLLING_UPGRADE for rollback.");
          break;
        } catch (MetaRecoveryContext.RequestStopException e) {
          MetaRecoveryContext.LOG.warn("Stopped reading edit log at " +
              in.getPosition() + "/"  + in.length());
          break;
        }
      }
    } finally {
      fsNamesys.dir.resetLastInodeId(lastInodeId);
      if(closeOnExit) {
        in.close();
      }
      fsDir.writeUnlock();
      fsNamesys.writeUnlock("loadEditRecords");

      if (LOG.isTraceEnabled()) {
        LOG.trace("replaying edit log finished");
      }

      if (FSImage.LOG.isDebugEnabled()) {
        dumpOpCounts(opCounts);
      }
    }
    return numEdits;
  }

  This method loads the data in the editlog. The heart of it is the while loop, which reads every RECORD and applies its content: the readOp method reads one RECORD, and the applyEditLogOp method applies it.

  First, the readOp method, whose content is as follows:

  public FSEditLogOp readOp() throws IOException {
    FSEditLogOp ret;
    if (cachedOp != null) {
      ret = cachedOp;
      cachedOp = null;
      return ret;
    }
    return nextOp();
  }

  This first checks whether there is a cached op; if one exists it is returned directly and the cache is cleared, otherwise nextOp is called. This one-element cache is what lets a caller peek at the next operation without consuming it, as the sketch below illustrates.
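
  Here is a minimal, illustrative sketch of that lookahead pattern (PeekableOpReader is a hypothetical name, not Hadoop's API; Hadoop's EditLogInputStream keeps the same one-element cache in its cachedOp field):

import java.io.IOException;

// Illustrative sketch only: a one-element lookahead cache like the
// cachedOp field in EditLogInputStream.
abstract class PeekableOpReader<T> {
  private T cachedOp;

  /** Read the next op from the underlying stream, or null at EOF. */
  protected abstract T nextOp() throws IOException;

  /** Consume and return the next op, preferring the cached one. */
  public T readOp() throws IOException {
    if (cachedOp != null) {
      T ret = cachedOp;
      cachedOp = null;
      return ret;
    }
    return nextOp();
  }

  /** Look at the next op without consuming it. */
  public T peekOp() throws IOException {
    if (cachedOp == null) {
      cachedOp = nextOp();
    }
    return cachedOp;
  }
}

  The nextOp method itself is implemented in EditLogFileInputStream; its content is as follows: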

  protected FSEditLogOp nextOp() throws IOException {
    return nextOpImpl(false);
  }

  The nextOpImpl method it calls is as follows:

private FSEditLogOp nextOpImpl(boolean skipBrokenEdits) throws IOException {
    FSEditLogOp op = null;
    switch (state) {
    case UNINIT:
      try {
        init(true);
      } catch (Throwable e) {
        LOG.error("caught exception initializing " + this, e);
        if (skipBrokenEdits) {
          return null;
        }
        Throwables.propagateIfPossible(e, IOException.class);
      }
      Preconditions.checkState(state != State.UNINIT);
      return nextOpImpl(skipBrokenEdits);
    case OPEN:
      op = reader.readOp(skipBrokenEdits);
      if ((op != null) && (op.hasTransactionId())) {
        long txId = op.getTransactionId();
        if ((txId >= lastTxId) &&
            (lastTxId != HdfsConstants.INVALID_TXID)) {
          //
          // Sometimes, the NameNode crashes while it's writing to the
          // edit log.  In that case, you can end up with an unfinalized edit log
          // which has some garbage at the end.
          // JournalManager#recoverUnfinalizedSegments will finalize these
          // unfinished edit logs, giving them a defined final transaction 
          // ID.  Then they will be renamed, so that any subsequent
          // readers will have this information.
          //
          // Since there may be garbage at the end of these "cleaned up"
          // logs, we want to be sure to skip it here if we've read everything
          // we were supposed to read out of the stream.
          // So we force an EOF on all subsequent reads.
          //
          long skipAmt = log.length() - tracker.getPos();
          if (skipAmt > 0) {
            if (LOG.isDebugEnabled()) {
                LOG.debug("skipping " + skipAmt + " bytes at the end " +
                  "of edit log  '" + getName() + "': reached txid " + txId +
                  " out of " + lastTxId);
            }
            tracker.clearLimit();
            IOUtils.skipFully(tracker, skipAmt);
          }
        }
      }
      break;
    case CLOSED:
      break; // return null
    }
    return op;
  }

  This method branches on the value of state, whose initial value is UNINIT. In that case it calls the init method, which changes state to OPEN, and then recursively calls nextOpImpl so that the OPEN branch executes. The most important part of the OPEN branch is the call to reader.readOp.

  The init method's content is as follows:

private void init(boolean verifyLayoutVersion)
      throws LogHeaderCorruptException, IOException {
    Preconditions.checkState(state == State.UNINIT);
    BufferedInputStream bin = null;
    try {
      fStream = log.getInputStream();
      bin = new BufferedInputStream(fStream);
      tracker = new FSEditLogLoader.PositionTrackingInputStream(bin);
      dataIn = new DataInputStream(tracker);
      try {
        logVersion = readLogVersion(dataIn, verifyLayoutVersion);
      } catch (EOFException eofe) {
        throw new LogHeaderCorruptException("No header found in log");
      }
      // We assume future layout will also support ADD_LAYOUT_FLAGS
      if (NameNodeLayoutVersion.supports(
          LayoutVersion.Feature.ADD_LAYOUT_FLAGS, logVersion) ||
          logVersion < NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION) {
        try {
          LayoutFlags.read(dataIn);
        } catch (EOFException eofe) {
          throw new LogHeaderCorruptException("EOF while reading layout " +
              "flags from log");
        }
      }
      reader = new FSEditLogOp.Reader(dataIn, tracker, logVersion);
      reader.setMaxOpSize(maxOpSize);
      state = State.OPEN;
    } finally {
      if (reader == null) {
        IOUtils.cleanup(LOG, dataIn, tracker, bin, fStream);
        state = State.CLOSED;
      }
    }
  }

  The key steps here are the chain of stream wrappers built over the log's input stream (a BufferedInputStream, then a PositionTrackingInputStream, then a DataInputStream), the construction of a Reader over them, and finally the transition of state to OPEN.
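
  The PositionTrackingInputStream used above is not shown in this article; conceptually it is a FilterInputStream that counts every byte consumed, so the loader can report file offsets (in.getPosition()) and skip past corrupt regions. A simplified sketch of such a wrapper (TrackingInputStream is an illustrative name; the real class additionally enforces the read limit that decodeOp's limiter relies on):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Simplified position-tracking stream: counts bytes as they are read
// so callers can ask how far into the edit log file they are.
// (skip and mark/reset handling omitted for brevity)
class TrackingInputStream extends FilterInputStream {
  private long pos;

  TrackingInputStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int b = super.read();
    if (b != -1) {
      pos++;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      pos += n;
    }
    return n;
  }

  /** Current offset in the underlying file. */
  long getPos() {
    return pos;
  }
}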

  Then, in the OPEN state, the Reader's readOp method is called; its content is as follows:

public FSEditLogOp readOp(boolean skipBrokenEdits) throws IOException {
      while (true) {
        try {
          return decodeOp();
        } catch (IOException e) {
          in.reset();
          if (!skipBrokenEdits) {
            throw e;
          }
        } catch (RuntimeException e) {
          // FSEditLogOp#decodeOp is not supposed to throw RuntimeException.
          // However, we handle it here for recovery mode, just to be more
          // robust.
          in.reset();
          if (!skipBrokenEdits) {
            throw e;
          }
        } catch (Throwable e) {
          in.reset();
          if (!skipBrokenEdits) {
            throw new IOException("got unexpected exception " +
                e.getMessage(), e);
          }
        }
        // Move ahead one byte and re-try the decode process.
        if (in.skip(1) < 1) {
          return null;
        }
      }
    }

  The heart of this method is the single call to decodeOp; the rest is error handling (on failure the stream is reset and, in recovery mode, the reader skips ahead one byte and retries). The decodeOp method's content is as follows:

private FSEditLogOp decodeOp() throws IOException {
      limiter.setLimit(maxOpSize);
      in.mark(maxOpSize);

      if (checksum != null) {
        checksum.reset();
      }

      byte opCodeByte;
      try {
        opCodeByte = in.readByte();
      } catch (EOFException eof) {
        // EOF at an opcode boundary is expected.
        return null;
      }

      FSEditLogOpCodes opCode = FSEditLogOpCodes.fromByte(opCodeByte);
      if (opCode == OP_INVALID) {
        verifyTerminator();
        return null;
      }

      FSEditLogOp op = cache.get(opCode);
      if (op == null) {
        throw new IOException("Read invalid opcode " + opCode);
      }

      if (supportEditLogLength) {
        in.readInt();
      }

      if (NameNodeLayoutVersion.supports(
          LayoutVersion.Feature.STORED_TXIDS, logVersion)) {
        // Read the txid
        op.setTransactionId(in.readLong());
      } else {
        op.setTransactionId(HdfsConstants.INVALID_TXID);
      }

      op.readFields(in, logVersion);

      validateChecksum(in, checksum, op.txid);
      return op;
    }

  decodeOp first reads the opcode byte from the input stream, converts that byte into an FSEditLogOpCodes value, and looks up the corresponding op object in the cache. The rest of the method reads the remaining data into the op: the op length (when the layout supports stored lengths), the transaction id (when STORED_TXIDS is supported), the op-specific fields via readFields, and finally the checksum, which is validated.
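
  Taken together, for recent layout versions the serialized form of one record is roughly: one opcode byte, a 4-byte length, an 8-byte transaction id, the op-specific fields, and a checksum. A hedged sketch that reads just this fixed prefix with a DataInputStream (RecordPrefix is an illustrative name; real files must go through the version checks decodeOp performs):

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Sketch: read the fixed prefix of one edit-log record, mirroring the
// readByte/readInt/readLong calls in decodeOp above. Versions that
// predate stored txids or stored lengths omit those fields, so this
// only applies to recent layouts.
class RecordPrefix {
  final byte opCode;   // maps to an FSEditLogOpCodes value
  final int length;    // op length, when supportEditLogLength
  final long txId;     // transaction id, when STORED_TXIDS is supported

  RecordPrefix(byte opCode, int length, long txId) {
    this.opCode = opCode;
    this.length = length;
    this.txId = txId;
  }

  /** Returns null at a clean end-of-file, like decodeOp does. */
  static RecordPrefix read(DataInputStream in) throws IOException {
    byte opCode;
    try {
      opCode = in.readByte();
    } catch (EOFException eof) {
      return null; // EOF at an opcode boundary is expected
    }
    return new RecordPrefix(opCode, in.readInt(), in.readLong());
  }
}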

  Once a RECORD has been read, control returns to loadEditRecords, which calls the applyEditLogOp method to apply the op's content. That method is as follows:

 private long applyEditLogOp(FSEditLogOp op, FSDirectory fsDir,
      StartupOption startOpt, int logVersion, long lastInodeId) throws IOException {
    long inodeId = INodeId.GRANDFATHER_INODE_ID;
    if (LOG.isTraceEnabled()) {
      LOG.trace("replaying edit log: " + op);
    }
    final boolean toAddRetryCache = fsNamesys.hasRetryCache() && op.hasRpcIds();

    switch (op.opCode) {
    ...
    case OP_DELETE: {
      DeleteOp deleteOp = (DeleteOp)op;
      FSDirDeleteOp.deleteForEditLog(
          fsDir, renameReservedPathsOnUpgrade(deleteOp.path, logVersion),
          deleteOp.timestamp);

      if (toAddRetryCache) {
        fsNamesys.addCacheEntry(deleteOp.rpcClientId, deleteOp.rpcCallId);
      }
      break;
    }
    ...
    default:
      throw new IOException("Invalid operation read " + op.opCode);
    }
    return inodeId;
  }

  This method is essentially one big switch statement. The original switch is very long, covering all of the NameNode's write operations; only the handling of the DELETE operation discussed earlier is kept above.
