HDFS Source Code Analysis: Anatomy of the HDFS Metadata Write Path (Part 7)

1. Anatomy of the HDFS Metadata Write Path

1.1 How Is HDFS Metadata Written to Memory?

Picking up from the previous post's question of what actually happens inside mkdirs: at this point the new directory has already been added to the in-memory directory tree, so the next thing to look at is how that metadata gets written out to disk.

1.1.1 createSingleDirectory

private static INodesInPath createSingleDirectory(FSDirectory fsd,
      INodesInPath existing, String localName, PermissionStatus perm)
      throws IOException {
    assert fsd.hasWriteLock();

    //TODO Update the directory tree; this tree lives entirely in memory
    //i.e. this call only mutates in-memory data
    existing = unprotectedMkdir(fsd, fsd.allocateNewInodeId(), existing,
        localName.getBytes(Charsets.UTF_8), perm, null, now());
    if (existing == null) {
      return null;
    }

    final INode newNode = existing.getLastINode();
    // Directory creation also count towards FilesCreated
    // to match count of FilesDeleted metric.
    NameNode.getNameNodeMetrics().incrFilesCreated();

    String cur = existing.getPath();

    //TODO Record the metadata to disk (but it is written to memory first)
    //append an edit-log record destined for disk
    fsd.getEditLog().logMkDir(cur, newNode);
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("mkdirs: created directory " + cur);
    }
    return existing;
  }
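Note the ordering: the in-memory tree is mutated first (unprotectedMkdir), and only then is the operation recorded through the edit log, all while the FSDirectory write lock is held. A minimal sketch of that mutate-then-log shape, with purely hypothetical names:

  // Minimal sketch (hypothetical names): mutate the in-memory namespace,
  // then buffer the corresponding edit record.
  class NamespaceSketch {
    private final java.util.Map<String, Long> tree = new java.util.HashMap<>();   // stand-in for the INode tree
    private final java.util.List<String> editLog = new java.util.ArrayList<>();   // stand-in for FSEditLog

    synchronized void mkdir(String path, long inodeId) {
      tree.put(path, inodeId);           // step 1: update the in-memory directory tree
      editLog.add("OP_MKDIR " + path);   // step 2: buffer the edit; durability comes later, at sync
    }
  }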
1.1.2 fsd.getEditLog().logMkDir(cur, newNode)
  /** 
   * Add create directory record to edit log
   */
  public void logMkDir(String path, INode newNode) {
    PermissionStatus permissions = newNode.getPermissionStatus();

    //TODO Create the log op (builder pattern over a cached instance)
    MkdirOp op = MkdirOp.getInstance(cache.get())
      .setInodeId(newNode.getId())
      .setPath(path)
      .setTimestamp(newNode.getModificationTime())
      .setPermissionStatus(permissions);

    AclFeature f = newNode.getAclFeature();
    if (f != null) {
      op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
    }

    XAttrFeature x = newNode.getXAttrFeature();
    if (x != null) {
      op.setXAttrs(x.getXAttrs());
    }

    //TODO Record the edit
    logEdit(op);
  }
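One detail worth calling out: cache.get() hands back a per-thread cache of reusable op objects, so logging a mkdir does not allocate a fresh MkdirOp every time; the builder-style setters then fill the reused instance in. A rough sketch of that thread-local reuse pattern, with hypothetical names (the real code keeps a ThreadLocal op-instance cache inside FSEditLog):

  // Hypothetical sketch of a thread-local, builder-style op cache.
  class MkdirOpSketch {
    private static final ThreadLocal<MkdirOpSketch> CACHE =
        ThreadLocal.withInitial(MkdirOpSketch::new);

    private long inodeId;
    private String path;

    static MkdirOpSketch getInstance() {
      return CACHE.get();                   // reuse this thread's instance, no new allocation
    }

    MkdirOpSketch setInodeId(long id) { this.inodeId = id; return this; }
    MkdirOpSketch setPath(String p)   { this.path = p;     return this; }
  }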

1.1.3 logEdit

 /**
   * Write an operation to the edit log. Do not sync to persistent
   * store yet.
   */
  void logEdit(final FSEditLogOp op) {
    //the lock is taken as soon as we enter
    synchronized (this) {
      assert isOpenForWrite() :
        "bad state: " + state;
      
      // wait if an automatic sync is scheduled
      //on the first call there is nothing to wait for
      waitIfAutoSyncScheduled();

      //TODO Step 1: obtain the current, globally unique transaction ID
      long start = beginTransaction();
      op.setTransactionId(txid);

      try {
        /**
         * The op is written to:
         * 1. the NameNode's local edit-log file buffer, and
         * 2. the JournalNodes' in-memory buffers,
         * both behind a JournalSetOutputStream
         */
        //TODO QuorumJournalManager -> QuorumOutputStream
        //TODO FileJournalManager -> EditLogFileOutputStream

        //TODO Step 2: write the op into the in-memory buffer
        editLogStream.write(op);
      } catch (IOException ex) {
        // All journals failed, it is handled in logSync.
      } finally {
        op.reset();
      }

      endTransaction(start);
      
      // check if it is time to schedule an automatic sync
      if (!shouldForceSync()) {
        return;
      }
      isAutoSyncScheduled = true;
    }
    
    // sync buffered edit log entries to persistent store
    //TODO flush the buffered metadata to disk
    logSync();
  }
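logEdit only buffers; durability comes from logSync. This is the "segmented locking plus double buffering" the summary at the end refers to: the buffer swap happens under the lock, while the slow flush happens outside it, so other handler threads can keep appending to bufCurrent while bufReady is being flushed. A heavily simplified sketch of that shape (not the real logSync, which also tracks synctxid and coordinates concurrent syncers):

  class SyncSketch {
    private StringBuilder bufCurrent = new StringBuilder();  // being written
    private StringBuilder bufReady = new StringBuilder();    // being flushed

    synchronized void logEdit(String op) {
      bufCurrent.append(op).append('\n');  // cheap in-memory append, under the lock
    }

    void logSync() {
      StringBuilder toFlush;
      synchronized (this) {
        toFlush = bufCurrent;              // swap under the lock: a cheap reference exchange
        bufCurrent = bufReady;
        bufReady = toFlush;
      }
      System.out.print(toFlush);           // stand-in for the slow I/O, done outside the lock
      toFlush.setLength(0);                // buffer is empty again, ready for the next swap
    }
  }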

1.1.4 QuorumOutputStream's write

  @Override
  public void write(FSEditLogOp op) throws IOException {
    buf.writeOp(op);
  }

1.1.5 EditsDoubleBuffer's writeOp

    public void writeOp(FSEditLogOp op) throws IOException {
      if (firstTxId == HdfsConstants.INVALID_TXID) {
        firstTxId = op.txid;
      } else {
        assert op.txid > firstTxId;
      }
      writer.writeOp(op);
      numTxns++;
    }

1.1.6 EditsDoubleBuffer's key fields

  //the first buffer
  private TxnBuffer bufCurrent; // current buffer for writing
  //the second buffer
  private TxnBuffer bufReady; // buffer ready for flushing
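Each TxnBuffer is a growable byte buffer that also remembers which transactions it currently holds. A reduced sketch of that bookkeeping (field names mirror the real TxnBuffer; ByteArrayOutputStream stands in for Hadoop's DataOutputBuffer):

  static class TxnBufferSketch extends java.io.ByteArrayOutputStream {
    long firstTxId;   // txid of the first op written into this buffer
    int numTxns;      // number of ops the buffer currently holds
  }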

1.1.7 FSEditLogOp.Writer's writeOp

    /**
     * Write an operation to the output stream
     * 
     * @param op The operation to write
     * @throws IOException if an error occurs during writing.
     */
    public void writeOp(FSEditLogOp op) throws IOException {
      int start = buf.getLength();
      // write the op code first to make padding and terminator verification
      // work
      buf.writeByte(op.opCode.getOpCode());
      buf.writeInt(0); // write 0 for the length first
      buf.writeLong(op.txid);
      op.writeFields(buf);
      int end = buf.getLength();
      
      // write the length back: content of the op + 4 bytes checksum - op_code
      int length = end - start - 1;
      buf.writeInt(length, start + 1);

      checksum.reset();
      checksum.update(buf.getData(), start, end-start);
      int sum = (int)checksum.getValue();
      buf.writeInt(sum);
    }


Everything so far is still in-memory buffering: writeOp just appends bytes into the current buffer; the actual disk write happens later, via NIO, when the buffer is flushed.
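So the record framing is: a 1-byte opcode, a 4-byte length field (covering the txid, the op fields, and the trailing checksum), an 8-byte txid, the op-specific fields, and a 4-byte CRC computed over everything from the opcode through the fields. A sketch of reading one record back under that framing (simplified; real deserialization goes through FSEditLogOp.Reader, which also verifies the checksum):

  static void readOneRecord(java.io.DataInputStream in) throws java.io.IOException {
    byte opCode = in.readByte();            // 1 byte: opcode
    int length = in.readInt();              // 4 bytes: txid + fields + checksum
    long txid = in.readLong();              // 8 bytes: transaction id
    byte[] fields = new byte[length - 8 - 4];
    in.readFully(fields);                   // op-specific payload
    int checksum = in.readInt();            // 4 bytes: CRC over opcode..fields
    System.out.printf("op=%d txid=%d payload=%d bytes checksum=%08x%n",
        opCode, txid, fields.length, checksum);
  }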

The journal path serializes ops through exactly the same Writer#writeOp shown above: both the local edit-log file stream and the QuorumOutputStream write into an EditsDoubleBuffer using this framing.

1.1.8 Swapping the Double Buffers

  public void setReadyToFlush() {
    assert isFlushed() : "previous data not flushed yet";
    TxnBuffer tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }
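Note the assert: the swap is only legal once the previous bufReady has been fully flushed, which is exactly what logSync may have to wait for. A hypothetical usage sketch of the full cycle (method names follow EditsDoubleBuffer, but record and out are assumed to exist; this is illustrative, not a real test):

  EditsDoubleBuffer buf = new EditsDoubleBuffer(512 * 1024);
  buf.writeRaw(record, 0, record.length);  // 1) bytes land in bufCurrent
  buf.setReadyToFlush();                   // 2) swap: bufCurrent <-> bufReady
  buf.flushTo(out);                        // 3) drain bufReady while writers keep appending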

1.2 How Is HDFS Metadata Written to Disk?

1.2.1 Writing bytes on the local-file path: java.io.OutputStream's write

What follows is the generic java.io.OutputStream#write(byte[], int, int) contract; concrete file streams override it with a more efficient bulk write, as the Javadoc itself notes.

    /**
     * Writes <code>len</code> bytes from the specified byte array
     * starting at offset <code>off</code> to this output stream.
     * The general contract for <code>write(b, off, len)</code> is that
     * some of the bytes in the array <code>b</code> are written to the
     * output stream in order; element <code>b[off]</code> is the first
     * byte written and <code>b[off+len-1]</code> is the last byte written
     * by this operation.
     * <p>
     * The <code>write</code> method of <code>OutputStream</code> calls
     * the write method of one argument on each of the bytes to be
     * written out. Subclasses are encouraged to override this method and
     * provide a more efficient implementation.
     * <p>
     * If <code>b</code> is <code>null</code>, a
     * <code>NullPointerException</code> is thrown.
     * <p>
     * If <code>off</code> is negative, or <code>len</code> is negative, or
     * <code>off+len</code> is greater than the length of the array
     * <code>b</code>, then an <tt>IndexOutOfBoundsException</tt> is thrown.
     *
     * @param      b     the data.
     * @param      off   the start offset in the data.
     * @param      len   the number of bytes to write.
     * @exception  IOException  if an I/O error occurs. In particular,
     *             an <code>IOException</code> is thrown if the output
     *             stream is closed.
     */
    public void write(byte b[], int off, int len) throws IOException {
        if (b == null) {
            throw new NullPointerException();
        } else if ((off < 0) || (off > b.length) || (len < 0) ||
                   ((off + len) > b.length) || ((off + len) < 0)) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return;
        }
        for (int i = 0 ; i < len ; i++) {
            write(b[off + i]);
        }
    }

1.2.2 QuorumOutputStream's flushAndSync

 protected void flushAndSync(boolean durable) throws IOException {
    int numReadyBytes = buf.countReadyBytes();
    if (numReadyBytes > 0) {
      int numReadyTxns = buf.countReadyTxns();
      long firstTxToFlush = buf.getFirstReadyTxId();

      assert numReadyTxns > 0;

      // Copy from our double-buffer into a new byte array. This is for
      // two reasons:
      // 1) The IPC code has no way of specifying to send only a slice of
      //    a larger array.
      // 2) because the calls to the underlying nodes are asynchronous, we
      //    need a defensive copy to avoid accidentally mutating the buffer
      //    before it is sent.
      DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
      buf.flushTo(bufToSend);
      assert bufToSend.getLength() == numReadyBytes;
      byte[] data = bufToSend.getData();
      assert data.length == bufToSend.getLength();

      //send the batch off to be written to the JournalNodes' disks
      QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
          segmentTxId, firstTxToFlush,
          numReadyTxns, data);
      //TODO this call blocks, waiting for the JournalNode cluster's write result (a quorum of acks)
      loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
      
      // Since we successfully wrote this batch, let the loggers know. Any future
      // RPCs will thus let the loggers know of the most recent transaction, even
      // if a logger has fallen behind.
      loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
    }
  }

1.2.3 AsyncLoggerSet's sendEdits

  
  public QuorumCall<AsyncLogger, Void> sendEdits(
      long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
    Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
    for (AsyncLogger logger : loggers) {

      //TODO iterate over all the loggers;
      // each AsyncLogger object represents one JournalNode
      ListenableFuture<Void> future =
              // send the edits batch to that JournalNode
        logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
      calls.put(logger, future);
    }
    return QuorumCall.create(calls);
  }
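QuorumCall.create just bundles up one future per JournalNode; waitForWriteQuorum (seen in flushAndSync above) then blocks until a majority of them succeed, or fails if the quorum cannot be reached in time. A reduced sketch of the majority-wait idea (illustrative only; the real QuorumCall watches all calls concurrently rather than one by one):

  static void waitForMajority(java.util.List<java.util.concurrent.Future<Void>> calls,
      long timeoutMs) throws java.io.IOException {
    int needed = calls.size() / 2 + 1;     // majority, e.g. 2 of 3 JournalNodes
    int succeeded = 0;
    long deadline = System.currentTimeMillis() + timeoutMs;
    for (java.util.concurrent.Future<Void> f : calls) {
      try {
        long remaining = Math.max(1, deadline - System.currentTimeMillis());
        f.get(remaining, java.util.concurrent.TimeUnit.MILLISECONDS);
        succeeded++;
      } catch (Exception e) {
        // a slow or failed JournalNode is tolerated as long as a majority acks
      }
      if (succeeded >= needed) {
        return;                            // quorum reached; stragglers finish in the background
      }
    }
    throw new java.io.IOException("did not reach a write quorum of JournalNodes");
  }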

1.2.4 IPCLoggerChannel's sendEdits

@Override
  public ListenableFuture<Void> sendEdits(
      final long segmentTxId, final long firstTxnId,
      final int numTxns, final byte[] data) {
    try {
      reserveQueueSpace(data.length);
    } catch (LoggerTooFarBehindException e) {
      return Futures.immediateFailedFuture(e);
    }
    
    // When this batch is acked, we use its submission time in order
    // to calculate how far we are lagging.
    final long submitNanos = System.nanoTime();
    
    ListenableFuture<Void> ret = null;
    try {
      ret = singleThreadExecutor.submit(new Callable<Void>() {
        @Override
        public Void call() throws IOException {
          throwIfOutOfSync();

          long rpcSendTimeNanos = System.nanoTime();
          try {
            //TODO get the RPC proxy for this JournalNode and send the batch
            getProxy().journal(createReqInfo(),
                segmentTxId, firstTxnId, numTxns, data);
          } catch (IOException e) {
            QuorumJournalManager.LOG.warn(
                "Remote journal " + IPCLoggerChannel.this + " failed to " +
                "write txns " + firstTxnId + "-" + (firstTxnId + numTxns - 1) +
                ". Will try to write to this JN again after the next " +
                "log roll.", e); 
            synchronized (IPCLoggerChannel.this) {
              outOfSync = true;
            }
            throw e;
          } finally {
            long now = System.nanoTime();
            long rpcTime = TimeUnit.MICROSECONDS.convert(
                now - rpcSendTimeNanos, TimeUnit.NANOSECONDS);
            long endToEndTime = TimeUnit.MICROSECONDS.convert(
                now - submitNanos, TimeUnit.NANOSECONDS);
            metrics.addWriteEndToEndLatency(endToEndTime);
            metrics.addWriteRpcLatency(rpcTime);
            if (rpcTime / 1000 > WARN_JOURNAL_MILLIS_THRESHOLD) {
              QuorumJournalManager.LOG.warn(
                  "Took " + (rpcTime / 1000) + "ms to send a batch of " +
                  numTxns + " edits (" + data.length + " bytes) to " +
                  "remote journal " + IPCLoggerChannel.this);
            }
          }
          synchronized (IPCLoggerChannel.this) {
            highestAckedTxId = firstTxnId + numTxns - 1;
            lastAckNanos = submitNanos;
          }
          return null;
        }
      });
    } finally {
      if (ret == null) {
        // it didn't successfully get submitted,
        // so adjust the queue size back down.
        unreserveQueueSpace(data.length);
      } else {
        // It was submitted to the queue, so adjust the length
        // once the call completes, regardless of whether it
        // succeeds or fails.
        Futures.addCallback(ret, new FutureCallback<Void>() {
          @Override
          public void onFailure(Throwable t) {
            unreserveQueueSpace(data.length);
          }

          @Override
          public void onSuccess(Void t) {
            unreserveQueueSpace(data.length);
          }
        });
      }
    }
    return ret;
  }
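Two details make this method robust under failure. Each channel owns a singleThreadExecutor, so batches to a given JournalNode are sent strictly in order; and reserveQueueSpace/unreserveQueueSpace implement per-channel backpressure, refusing new batches once a lagging JournalNode has too many bytes queued. A minimal sketch of that queued-bytes accounting (hypothetical; the real channel throws LoggerTooFarBehindException):

  class QueueSpaceSketch {
    private final int queueSizeLimitBytes;
    private int queuedEditsSizeBytes = 0;

    QueueSpaceSketch(int limitBytes) { this.queueSizeLimitBytes = limitBytes; }

    synchronized void reserveQueueSpace(int size) throws java.io.IOException {
      if (queuedEditsSizeBytes + size > queueSizeLimitBytes && queuedEditsSizeBytes > 0) {
        // this JournalNode has fallen too far behind; fail fast instead of buffering forever
        throw new java.io.IOException("logger too far behind: "
            + queuedEditsSizeBytes + " bytes already queued");
      }
      queuedEditsSizeBytes += size;
    }

    synchronized void unreserveQueueSpace(int size) {
      queuedEditsSizeBytes -= size;
    }
  }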

1.2.5 Journal's journal method, invoked via JournalNodeRpcServer (writes to the JournalNode's disk)

 /**
   * Write a batch of edits to the journal.
   * {@see QJournalProtocol#journal(RequestInfo, long, long, int, byte[])}
   */
  synchronized void journal(RequestInfo reqInfo,
      long segmentTxId, long firstTxnId,
      int numTxns, byte[] records) throws IOException {
    checkFormatted();
    checkWriteRequest(reqInfo);

    checkSync(curSegment != null,
        "Can't write, no segment open");
    
    if (curSegmentTxId != segmentTxId) {
      // Sanity check: it is possible that the writer will fail IPCs
      // on both the finalize() and then the start() of the next segment.
      // This could cause us to continue writing to an old segment
      // instead of rolling to a new one, which breaks one of the
      // invariants in the design. If it happens, abort the segment
      // and throw an exception.
      JournalOutOfSyncException e = new JournalOutOfSyncException(
          "Writer out of sync: it thinks it is writing segment " + segmentTxId
          + " but current segment is " + curSegmentTxId);
      abortCurSegment();
      throw e;
    }
      
    checkSync(nextTxId == firstTxnId,
        "Can't write txid " + firstTxnId + " expecting nextTxId=" + nextTxId);
    
    long lastTxnId = firstTxnId + numTxns - 1;
    if (LOG.isTraceEnabled()) {
      LOG.trace("Writing txid " + firstTxnId + "-" + lastTxnId);
    }

    // If the edit has already been marked as committed, we know
    // it has been fsynced on a quorum of other nodes, and we are
    // "catching up" with the rest. Hence we do not need to fsync.
    boolean isLagging = lastTxnId <= committedTxnId.get();
    boolean shouldFsync = !isLagging;
    
    curSegment.writeRaw(records, 0, records.length);
    curSegment.setReadyToFlush();
    StopWatch sw = new StopWatch();
    sw.start();
    curSegment.flush(shouldFsync);
    sw.stop();

    long nanoSeconds = sw.now();
    metrics.addSync(
        TimeUnit.MICROSECONDS.convert(nanoSeconds, TimeUnit.NANOSECONDS));
    long milliSeconds = TimeUnit.MILLISECONDS.convert(
        nanoSeconds, TimeUnit.NANOSECONDS);

    if (milliSeconds > WARN_SYNC_MILLIS_THRESHOLD) {
      LOG.warn("Sync of transaction range " + firstTxnId + "-" + lastTxnId +
               " took " + milliSeconds + "ms");
    }

    if (isLagging) {
      // This batch of edits has already been committed on a quorum of other
      // nodes. So, we are in "catch up" mode. This gets its own metric.
      metrics.batchesWrittenWhileLagging.incr(1);
    }
    
    metrics.batchesWritten.incr(1);
    metrics.bytesWritten.incr(records.length);
    metrics.txnsWritten.incr(numTxns);
    
    highestWrittenTxId = lastTxnId;
    nextTxId = lastTxnId + 1;
  }

1.3 Summary of the Metadata Write Path


  • In client code we first obtain a FileSystem and call its directory-creation method; under the hood this invokes mkdirs on the NameNodeRpcServer through an RPC proxy.
  • Inside NameNodeRpcServer, the FSNamesystem object that manages the metadata resolves the directory tree, wraps each path component to be created as an INode (INodeDirectory or INodeFile), and attaches it to the tree. Once the tree is updated, the new directory is already visible, for example through the NameNode web UI on port 50070.
  • The edit is then recorded through the FSEditLog object. This is the most elegant part of the HDFS write path: a double-buffer scheme with segmented locking. The flow forks toward two destinations, yet both are driven by the same method.
  • The disk path writes to memory first; when the buffer fills or a sync condition is met, the two buffers are swapped and bufReady is flushed to disk via NIO, producing the numbered edits files you see in the NameNode's metadata directory.
  • The JournalNode path uses the same double buffer, but the flush differs because it crosses process boundaries, which requires RPC or HTTP; RPC is used here. The AsyncLoggerSet is obtained first and the batch is sent to every JournalNode asynchronously; each JournalNode's journal method (exposed through JournalNodeRpcServer) then flushes it to the JournalNode's local disk, and the NameNode waits for a quorum of acks.