2021-12-30 Hadoop 3 Write Data Flow (Part 5): BlockReceiver

Based on the hadoop-3.3.0 source code. If you spot any misunderstandings, please point them out; it would be greatly appreciated.

1 Overview

As described in the previous article, when a write-block request arrives at a Datanode over the streaming interface, the DataXceiverServer listening for streaming requests on that Datanode accepts the request, constructs a DataXceiver object, and calls DataXceiver.writeBlock() on it to handle the request. The current Datanode's DataXceiver.writeBlock() then cascades the write-block request to the next Datanode in the pipeline, and the request keeps propagating down the pipeline until it reaches the last Datanode.

Inside DataXceiver#writeBlock, BlockReceiver#receiveBlock is called to receive the block data, and receiveBlock() in turn delegates most of the actual data reception to the receivePacket() method. The call site is sketched below.
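
To make the entry point concrete, here is a heavily condensed sketch of the relevant part of DataXceiver#writeBlock (abridged and simplified, not the exact 3.3.0 source; error handling, the mirror connection setup, and several parameters are omitted):

// Condensed sketch of DataXceiver#writeBlock (abridged)
// 1. Build a BlockReceiver for this block (only when packet data is expected)
blockReceiver = getBlockReceiver(block, storageType, in,
    peer.getRemoteAddressString(), peer.getLocalAddressString(),
    stage, latestGenerationStamp, minBytesRcvd, maxBytesRcvd,
    clientname, srcDataNode, datanode, requestedChecksum,
    cachingStrategy, allowLazyPersist, pinning, storageId);

// 2. Cascade the writeBlock request to the next datanode in the pipeline;
//    mirrorOut/mirrorIn are the streams to/from that downstream node.

// 3. Receive the block: read packets from upstream, forward them downstream,
//    and relay acks back upstream.
if (blockReceiver != null) {
    blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
        mirrorAddr, throttler, targets, false);
}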

2 BlockReceiver Source Code

The BlockReceiver class receives a block from the upstream node in the pipeline, saves the block to the current datanode's storage, and forwards it to the downstream node in the pipeline. BlockReceiver also receives acknowledgements from the downstream node and relays them to the upstream node.

2.1 Constructor

As shown in the previous code, DataXceiver#writeBlock calls getBlockReceiver() to create a BlockReceiver. The constructor mainly initializes a number of fields:

BlockReceiver(final ExtendedBlock block, final StorageType storageType,
              final DataInputStream in,
              final String inAddr, final String myAddr,
              final BlockConstructionStage stage, 
              final long newGs, final long minBytesRcvd, final long maxBytesRcvd, 
              final String clientname, final DatanodeInfo srcDataNode,
              final DataNode datanode, DataChecksum requestedChecksum,
              CachingStrategy cachingStrategy,
              final boolean allowLazyPersist,
              final boolean pinning,
              final String storageId) throws IOException {
    try{
        this.block = block;
        this.in = in;
        this.inAddr = inAddr;
        this.myAddr = myAddr;
        this.srcDataNode = srcDataNode;
        this.datanode = datanode;

        this.clientname = clientname;
        this.isDatanode = clientname.length() == 0;
        this.isClient = !this.isDatanode;
        this.restartBudget = datanode.getDnConf().restartReplicaExpiry;
        this.datanodeSlowLogThresholdMs =
            datanode.getDnConf().getSlowIoWarningThresholdMs();
        // For replaceBlock() calls response should be sent to avoid socketTimeout
        // at clients. So sending with the interval of 0.5 * socketTimeout
        final long readTimeout = datanode.getDnConf().socketTimeout;
        this.responseInterval = (long) (readTimeout * 0.5);
        //for datanode, we have
        //1: clientName.length() == 0, and
        //2: stage == null or PIPELINE_SETUP_CREATE
        this.stage = stage;
        this.isTransfer = stage == BlockConstructionStage.TRANSFER_RBW
            || stage == BlockConstructionStage.TRANSFER_FINALIZED;

        this.pinning = pinning;
        this.lastSentTime.set(Time.monotonicNow());
        // Downstream will timeout in readTimeout on receiving the next packet.
        // If there is no data traffic, a heartbeat packet is sent at
        // the interval of 0.5*readTimeout. Here, we set 0.9*readTimeout to be
        // the threshold for detecting congestion.
        this.maxSendIdleTime = (long) (readTimeout * 0.9);
        if (LOG.isDebugEnabled()) {
            LOG.debug(getClass().getSimpleName() + ": " + block
                      + "\n storageType=" + storageType + ", inAddr=" + inAddr
                      + ", myAddr=" + myAddr + "\n stage=" + stage + ", newGs=" + newGs
                      + ", minBytesRcvd=" + minBytesRcvd
                      + ", maxBytesRcvd=" + maxBytesRcvd + "\n clientname=" + clientname
                      + ", srcDataNode=" + srcDataNode
                      + ", datanode=" + datanode.getDisplayName()
                      + "\n requestedChecksum=" + requestedChecksum
                      + "\n cachingStrategy=" + cachingStrategy
                      + "\n allowLazyPersist=" + allowLazyPersist + ", pinning=" + pinning
                      + ", isClient=" + isClient + ", isDatanode=" + isDatanode
                      + ", responseInterval=" + responseInterval
                      + ", storageID=" + (storageId != null ? storageId : "null")
                     );
        }

        //
        // Open local disk out
        //
        if (isDatanode) { //replication or move
            // create a temporary replica
            replicaHandler =
                datanode.data.createTemporary(storageType, storageId, block, false);
        } else {
            switch (stage) {
                case PIPELINE_SETUP_CREATE:
                    replicaHandler = datanode.data.createRbw(storageType, storageId,
                                                             block, allowLazyPersist);
                    datanode.notifyNamenodeReceivingBlock(
                        block, replicaHandler.getReplica().getStorageUuid());
                    break;
                case PIPELINE_SETUP_STREAMING_RECOVERY:
                    replicaHandler = datanode.data.recoverRbw(
                        block, newGs, minBytesRcvd, maxBytesRcvd);
                    block.setGenerationStamp(newGs);
                    break;
                case PIPELINE_SETUP_APPEND:
                    replicaHandler = datanode.data.append(block, newGs, minBytesRcvd);
                    block.setGenerationStamp(newGs);
                    datanode.notifyNamenodeReceivingBlock(
                        block, replicaHandler.getReplica().getStorageUuid());
                    break;
                case PIPELINE_SETUP_APPEND_RECOVERY:
                    replicaHandler = datanode.data.recoverAppend(block, newGs, minBytesRcvd);
                    block.setGenerationStamp(newGs);
                    datanode.notifyNamenodeReceivingBlock(
                        block, replicaHandler.getReplica().getStorageUuid());
                    break;
                case TRANSFER_RBW:
                case TRANSFER_FINALIZED:
                    // this is a transfer destination
                    replicaHandler = datanode.data.createTemporary(storageType, storageId,
                                                                   block, isTransfer);
                    break;
                default: throw new IOException("Unsupported stage " + stage + 
                                               " while receiving block " + block + " from " + inAddr);
            }
        }
        replicaInfo = replicaHandler.getReplica();
        this.dropCacheBehindWrites = (cachingStrategy.getDropBehind() == null) ?
            datanode.getDnConf().dropCacheBehindWrites :
        cachingStrategy.getDropBehind();
        this.syncBehindWrites = datanode.getDnConf().syncBehindWrites;
        this.syncBehindWritesInBackground = datanode.getDnConf().
            syncBehindWritesInBackground;

        final boolean isCreate = isDatanode || isTransfer 
            || stage == BlockConstructionStage.PIPELINE_SETUP_CREATE;
        streams = replicaInfo.createStreams(isCreate, requestedChecksum);
        assert streams != null : "null streams!";

        // read checksum meta information
        this.clientChecksum = requestedChecksum;
        this.diskChecksum = streams.getChecksum();
        this.needsChecksumTranslation = !clientChecksum.equals(diskChecksum);
        this.bytesPerChecksum = diskChecksum.getBytesPerChecksum();
        this.checksumSize = diskChecksum.getChecksumSize();

        this.checksumOut = new DataOutputStream(new BufferedOutputStream(
            streams.getChecksumOut(), DFSUtilClient.getSmallBufferSize(
                datanode.getConf())));
        // write data chunk header if creating a new replica
        if (isCreate) {
            BlockMetadataHeader.writeHeader(checksumOut, diskChecksum);
        } 
    } catch (ReplicaAlreadyExistsException bae) {
        throw bae;
    } catch (ReplicaNotFoundException bne) {
        throw bne;
    } catch(IOException ioe) {
        if (replicaInfo != null) {
            replicaInfo.releaseAllBytesReserved();
        }
        IOUtils.closeStream(this);
        cleanupBlock();

        // check if there is a disk error
        IOException cause = DatanodeUtil.getCauseIfDiskError(ioe);
        DataNode.LOG
            .warn("IOException in BlockReceiver constructor :" + ioe.getMessage()
                  + (cause == null ? "" : ". Cause is "), cause);
        if (cause != null) {
            ioe = cause;
            // Volume error check moved to FileIoProvider
        }

        throw ioe;
    }
}

2.2 receiveBlock

As noted above, once the datanode has received the request and built its input and output streams, it calls BlockReceiver#receiveBlock(). The logic of receiveBlock() is fairly simple: it first starts the PacketResponder thread, which receives the ACKs sent by downstream nodes and relays them upstream. It then calls receivePacket() in a loop to receive packets written by the upstream node and forward them to the downstream node. After the whole block has been written successfully, receiveBlock() shuts down the PacketResponder thread.

void receiveBlock(
    DataOutputStream mirrOut, // output to next datanode
    DataInputStream mirrIn,   // input from next datanode
    DataOutputStream replyOut,  // output to previous datanode
    String mirrAddr, DataTransferThrottler throttlerArg,
    DatanodeInfo[] downstreams,
    boolean isReplaceBlock) throws IOException {

    syncOnClose = datanode.getDnConf().syncOnClose;
    dirSyncOnFinalize = syncOnClose;
    boolean responderClosed = false;
    mirrorOut = mirrOut;
    mirrorAddr = mirrAddr;
    initPerfMonitoring(downstreams);
    throttler = throttlerArg;

    this.replyOut = replyOut;
    this.isReplaceBlock = isReplaceBlock;

    // Build the PacketResponder and
    // start the PacketResponder thread to receive and relay packet ACKs
    try {
        if (isClient && !isTransfer) {
            responder = new Daemon(datanode.threadGroup, 
                                   new PacketResponder(replyOut, mirrIn, downstreams));
            responder.start(); // start thread to processes responses
        }

        // Receive and process packets until the last packet has been handled
        while (receivePacket() >= 0) { /* Receive until the last packet */ }

        // wait for all outstanding packet responses. And then
        // indicate responder to gracefully shutdown.
        // Mark that responder has been closed for future processing
        if (responder != null) {
            // After the block has been fully written, shut down the PacketResponder thread
            ((PacketResponder)responder.getRunnable()).close();
            responderClosed = true;
        }

        // If this write is for a replication or transfer-RBW/Finalized,
        // then finalize block or convert temporary to RBW.
        // For client-writes, the block is finalized in the PacketResponder.
        if (isDatanode || isTransfer) {
            // Hold a volume reference to finalize block.
            try (ReplicaHandler handler = claimReplicaHandler()) {
                // close the block/crc files
                close();
                block.setNumBytes(replicaInfo.getNumBytes());

                if (stage == BlockConstructionStage.TRANSFER_RBW) {
                    // for TRANSFER_RBW, convert temporary to RBW
                    datanode.data.convertTemporaryToRbw(block);
                } else {
                    // for isDatanode or TRANSFER_FINALIZED
                    // Finalize the block.
                    datanode.data.finalizeBlock(block, dirSyncOnFinalize);
                }
            }
            datanode.metrics.incrBlocksWritten();
        }

    } catch (IOException ioe) {
        replicaInfo.releaseAllBytesReserved();
        if (datanode.isRestarting()) {
            // Do not throw if shutting down for restart. Otherwise, it will cause
            // premature termination of responder.
            LOG.info("Shutting down for restart (" + block + ").");
        } else {
            LOG.info("Exception for " + block, ioe);
            throw ioe;
        }
    } finally {
        // Clear the previous interrupt state of this thread.
        Thread.interrupted();

        // If a shutdown for restart was initiated, upstream needs to be notified.
        // There is no need to do anything special if the responder was closed
        // normally.
        if (!responderClosed) { // Data transfer was not complete.
            if (responder != null) {
                // In case this datanode is shutting down for quick restart,
                // send a special ack upstream.
                if (datanode.isRestarting() && isClient && !isTransfer) {
                    try (Writer out = new OutputStreamWriter(
                        replicaInfo.createRestartMetaStream(), "UTF-8")) {
                        // write out the current time.
                        out.write(Long.toString(Time.now() + restartBudget));
                        out.flush();
                    } catch (IOException ioe) {
                        // The worst case is not recovering this RBW replica. 
                        // Client will fall back to regular pipeline recovery.
                    } finally {
                        IOUtils.closeStream(streams.getDataOut());
                    }
                    try {              
                        // Even if the connection is closed after the ack packet is
                        // flushed, the client can react to the connection closure 
                        // first. Insert a delay to lower the chance of client 
                        // missing the OOB ack.
                        Thread.sleep(1000);
                    } catch (InterruptedException ie) {
                        // It is already going down. Ignore this.
                    }
                }
                responder.interrupt();
            }
            IOUtils.closeStream(this);
            cleanupBlock();
        }
        if (responder != null) {
            try {
                responder.interrupt();
                // join() on the responder should timeout a bit earlier than the
                // configured deadline. Otherwise, the join() on this thread will
                // likely timeout as well.
                long joinTimeout = datanode.getDnConf().getXceiverStopTimeout();
                joinTimeout = joinTimeout > 1  ? joinTimeout*8/10 : joinTimeout;
                responder.join(joinTimeout);
                if (responder.isAlive()) {
                    String msg = "Join on responder thread " + responder
                        + " timed out";
                    LOG.warn(msg + "\n" + StringUtils.getStackTrace(responder));
                    throw new IOException(msg);
                }
            } catch (InterruptedException e) {
                responder.interrupt();
                // do not throw if shutting down for restart.
                if (!datanode.isRestarting()) {
                    throw new InterruptedIOException("Interrupted receiveBlock");
                }
            }
            responder = null;
        }
    }
}

2.3 receivePacket

receiveBlock() actually hands most of the work over to receivePacket(). receivePacket() first calls packetReceiver.receiveNextPacket() to read one packet from the input stream and place it in the ByteBuffer curPacketBuf.

receiveNextPacket() simply reads data from the input stream according to the packet format and places it into the designated ByteBuffer.
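
For reference, the on-the-wire layout that receiveNextPacket() parses is roughly the following (based on my reading of PacketReceiver and PacketHeader; treat the field list as an illustrative sketch):

PLEN (4 bytes)  : payload length = 4 + length(CHECKSUMS) + length(DATA)
HLEN (2 bytes)  : length of the serialized packet header
HEADER          : PacketHeaderProto (offsetInBlock, seqno, lastPacketInBlock,
                  dataLen, syncBlock)
CHECKSUMS       : one checksum per bytesPerChecksum-sized chunk of DATA
DATA            : the packet payload itself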

After successfully receiving a packet, receivePacket() checks whether the current node is the last node in the pipeline, or whether the packet carries the sync flag (syncBlock) requesting that the Datanode sync the data to disk immediately. In these two cases the Datanode writes the data to disk first and only then notifies the PacketResponder to handle the acknowledgement (ACK); otherwise, receivePacket() notifies the PacketResponder as soon as the packet has been received.

Next, receivePacket() forwards the packet to the downstream node in the pipeline, and then writes the block file and checksum file to the datanode's disk. After writing, receivePacket() calls flushOrSync() to flush (and, if requested, sync) the buffered output to disk, and finally calls manageWriterOsCache() to manage the data held in the operating system cache. Note that if the current node is the last node in the pipeline, all packet data must be checksum-verified before it is written to disk.

Because in the two cases above (the current Datanode is the last node in the pipeline, or the packet carries the sync flag) receivePacket() has not yet notified the PacketResponder, the end of receivePacket() enqueues the acknowledgement for these two cases. The overall ordering is summarized in the sketch below.
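
Putting these rules together, the ordering inside receivePacket() can be summarized as the following simplified pseudocode (not the actual method body; writeDataAndChecksumToDisk() and verifyChunksIfLastNode() are placeholder names for logic that is inlined in the real method):

// Simplified ordering inside receivePacket(); 'responder' is the PacketResponder
packetReceiver.receiveNextPacket(in);     // 1. read one packet from upstream
if (!syncBlock && !shouldVerifyChecksum()) {
    responder.enqueue(seqno, ...);        // 2a. intermediate node, no sync: ack early
}
packetReceiver.mirrorPacketTo(mirrorOut); // 3. forward the packet downstream
verifyChunksIfLastNode();                 //    last node only: verify checksums
writeDataAndChecksumToDisk();             // 4. write the block and meta files
flushOrSync(syncBlock);                   // 5. flush, and fsync if sync requested
manageWriterOsCache(offsetInBlock);       // 6. manage the OS page cache
if (syncBlock || shouldVerifyChecksum()) {
    responder.enqueue(seqno, ...);        // 2b. last node or sync: ack after the disk write
}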

private int receivePacket() throws IOException {
    // read the next packet
    packetReceiver.receiveNextPacket(in);

    PacketHeader header = packetReceiver.getHeader();
    if (LOG.isDebugEnabled()){
        LOG.debug("Receiving one packet for block " + block +
                  ": " + header);
    }

    // Sanity check the header
    if (header.getOffsetInBlock() > replicaInfo.getNumBytes()) {
        throw new IOException("Received an out-of-sequence packet for " + block + 
                              "from " + inAddr + " at offset " + header.getOffsetInBlock() +
                              ". Expecting packet starting at " + replicaInfo.getNumBytes());
    }
    if (header.getDataLen() < 0) {
        throw new IOException("Got wrong length during writeBlock(" + block + 
                              ") from " + inAddr + " at offset " + 
                              header.getOffsetInBlock() + ": " +
                              header.getDataLen()); 
    }

    long offsetInBlock = header.getOffsetInBlock();
    long seqno = header.getSeqno();
    boolean lastPacketInBlock = header.isLastPacketInBlock();
    final int len = header.getDataLen();
    boolean syncBlock = header.getSyncBlock();

    // avoid double sync'ing on close
    if (syncBlock && lastPacketInBlock) {
        this.syncOnClose = false;
        // sync directory for finalize irrespective of syncOnClose config since
        // sync is requested.
        this.dirSyncOnFinalize = true;
    }

    // update received bytes
    final long firstByteInBlock = offsetInBlock;
    offsetInBlock += len;
    if (replicaInfo.getNumBytes() < offsetInBlock) {
        replicaInfo.setNumBytes(offsetInBlock);
    }

    // put in queue for pending acks, unless sync was requested
    // If sync was not requested and this is not the last datanode in the pipeline,
    // enqueue the ack immediately
    if (responder != null && !syncBlock && !shouldVerifyChecksum()) {
        ((PacketResponder) responder.getRunnable()).enqueue(seqno,
                                                            lastPacketInBlock, offsetInBlock, Status.SUCCESS);
    }

    // Drop heartbeat for testing.
    if (seqno < 0 && len == 0 &&
        DataNodeFaultInjector.get().dropHeartbeatPacket()) {
        return 0;
    }

    //First write the packet to the mirror:
    // Forward the packet to the downstream node
    if (mirrorOut != null && !mirrorError) {
        try {
            long begin = Time.monotonicNow();
            // For testing. Normally no-op.
            DataNodeFaultInjector.get().stopSendingPacketDownstream(mirrorAddr);
            packetReceiver.mirrorPacketTo(mirrorOut);
            mirrorOut.flush();
            long now = Time.monotonicNow();
            this.lastSentTime.set(now);
            long duration = now - begin;
            DataNodeFaultInjector.get().logDelaySendingPacketDownstream(
                mirrorAddr,
                duration);
            trackSendPacketToLastNodeInPipeline(duration);
            if (duration > datanodeSlowLogThresholdMs && LOG.isWarnEnabled()) {
                LOG.warn("Slow BlockReceiver write packet to mirror took " + duration
                         + "ms (threshold=" + datanodeSlowLogThresholdMs + "ms), "
                         + "downstream DNs=" + Arrays.toString(downstreamDNs)
                         + ", blockId=" + replicaInfo.getBlockId());
            }
        } catch (IOException e) {
            handleMirrorOutError(e);
        }
    }

    ByteBuffer dataBuf = packetReceiver.getDataSlice();
    ByteBuffer checksumBuf = packetReceiver.getChecksumSlice();

    if (lastPacketInBlock || len == 0) {
        if(LOG.isDebugEnabled()) {
            LOG.debug("Receiving an empty packet or the end of the block " + block);
        }
        // sync block if requested
        // If the full block has been received and the sync flag is set,
        // sync the data to disk immediately
        if (syncBlock) {
            flushOrSync(true);
        }
    } else {
        // If this is the last node in the pipeline, verify the packet's checksums
        final int checksumLen = diskChecksum.getChecksumSize(len);
        final int checksumReceivedLen = checksumBuf.capacity();

        if (checksumReceivedLen > 0 && checksumReceivedLen != checksumLen) {
            throw new IOException("Invalid checksum length: received length is "
                                  + checksumReceivedLen + " but expected length is " + checksumLen);
        }

        if (checksumReceivedLen > 0 && shouldVerifyChecksum()) {
            try {
                // verify the packet checksums via verifyChunks()
                verifyChunks(dataBuf, checksumBuf);
            } catch (IOException ioe) {
                // checksum error detected locally. there is no reason to continue.
                // On checksum failure, send a checksum-error status back upstream to the client
                if (responder != null) {
                    try {
                        ((PacketResponder) responder.getRunnable()).enqueue(seqno,
                                                                            lastPacketInBlock, offsetInBlock,
                                                                            Status.ERROR_CHECKSUM);
                        // Wait until the responder sends back the response
                        // and interrupt this thread.
                        Thread.sleep(3000);
                    } catch (InterruptedException e) { }
                }
                throw new IOException("Terminating due to a checksum error." + ioe);
            }

            // If the client's checksum type differs from this datanode's, translate the checksums
            if (needsChecksumTranslation) {
                // overwrite the checksums in the packet buffer with the
                // appropriate polynomial for the disk storage.
                translateChunks(dataBuf, checksumBuf);
            }
        }

        if (checksumReceivedLen == 0 && !streams.isTransientStorage()) {
            // checksum is missing, need to calculate it
            checksumBuf = ByteBuffer.allocate(checksumLen);
            diskChecksum.calculateChunkedSums(dataBuf, checksumBuf);
        }

        // by this point, the data in the buffer uses the disk checksum

        final boolean shouldNotWriteChecksum = checksumReceivedLen == 0
            && streams.isTransientStorage();
        try {
            long onDiskLen = replicaInfo.getBytesOnDisk();
            if (onDiskLen<offsetInBlock) {
                // Normally the beginning of an incoming packet is aligned with the
                // existing data on disk. If the beginning packet data offset is not
                // checksum chunk aligned, the end of packet will not go beyond the
                // next chunk boundary.
                // When a failure-recovery is involved, the client state and the
                // the datanode state may not exactly agree. I.e. the client may
                // resend part of data that is already on disk. Correct number of
                // bytes should be skipped when writing the data and checksum
                // buffers out to disk.
                long partialChunkSizeOnDisk = onDiskLen % bytesPerChecksum;
                long lastChunkBoundary = onDiskLen - partialChunkSizeOnDisk;
                boolean alignedOnDisk = partialChunkSizeOnDisk == 0;
                boolean alignedInPacket = firstByteInBlock % bytesPerChecksum == 0;

                // If the end of the on-disk data is not chunk-aligned, the last
                // checksum needs to be overwritten.
                boolean overwriteLastCrc = !alignedOnDisk && !shouldNotWriteChecksum;
                // If the starting offset of the packet data is at the last chunk
                // boundary of the data on disk, the partial checksum recalculation
                // can be skipped and the checksum supplied by the client can be used
                // instead. This reduces disk reads and cpu load.
                boolean doCrcRecalc = overwriteLastCrc &&
                    (lastChunkBoundary != firstByteInBlock);

                // If this is a partial chunk, then verify that this is the only
                // chunk in the packet. If the starting offset is not chunk
                // aligned, the packet should terminate at or before the next
                // chunk boundary.
                if (!alignedInPacket && len > bytesPerChecksum) {
                    throw new IOException("Unexpected packet data length for "
                                          +  block + " from " + inAddr + ": a partial chunk must be "
                                          + " sent in an individual packet (data length = " + len
                                          +  " > bytesPerChecksum = " + bytesPerChecksum + ")");
                }

                // If the last portion of the block file is not a full chunk,
                // then read in pre-existing partial data chunk and recalculate
                // the checksum so that the checksum calculation can continue
                // from the right state. If the client provided the checksum for
                // the whole chunk, this is not necessary.
                Checksum partialCrc = null;
                if (doCrcRecalc) {
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("receivePacket for " + block 
                                  + ": previous write did not end at the chunk boundary."
                                  + " onDiskLen=" + onDiskLen);
                    }
                    long offsetInChecksum = BlockMetadataHeader.getHeaderSize() +
                        onDiskLen / bytesPerChecksum * checksumSize;
                    partialCrc = computePartialChunkCrc(onDiskLen, offsetInChecksum);
                }

                // The data buffer position where write will begin. If the packet
                // data and on-disk data have no overlap, this will not be at the
                // beginning of the buffer.
                int startByteToDisk = (int)(onDiskLen-firstByteInBlock) 
                    + dataBuf.arrayOffset() + dataBuf.position();

                // Actual number of data bytes to write.
                int numBytesToDisk = (int)(offsetInBlock-onDiskLen);

                // Write data to disk.
                long begin = Time.monotonicNow();
                streams.writeDataToDisk(dataBuf.array(),
                                        startByteToDisk, numBytesToDisk);
                long duration = Time.monotonicNow() - begin;
                if (duration > datanodeSlowLogThresholdMs && LOG.isWarnEnabled()) {
                    LOG.warn("Slow BlockReceiver write data to disk cost:" + duration
                             + "ms (threshold=" + datanodeSlowLogThresholdMs + "ms), "
                             + "volume=" + getVolumeBaseUri()
                             + ", blockId=" + replicaInfo.getBlockId());
                }

                if (duration > maxWriteToDiskMs) {
                    maxWriteToDiskMs = duration;
                }

                final byte[] lastCrc;
                if (shouldNotWriteChecksum) {
                    lastCrc = null;
                } else {
                    int skip = 0;
                    byte[] crcBytes = null;

                    // First, prepare to overwrite the partial crc at the end.
                    if (overwriteLastCrc) { // not chunk-aligned on disk
                        // prepare to overwrite last checksum
                        adjustCrcFilePosition();
                    }

                    // The CRC was recalculated for the last partial chunk. Update the
                    // CRC by reading the rest of the chunk, then write it out.
                    if (doCrcRecalc) {
                        // Calculate new crc for this chunk.
                        int bytesToReadForRecalc =
                            (int)(bytesPerChecksum - partialChunkSizeOnDisk);
                        if (numBytesToDisk < bytesToReadForRecalc) {
                            bytesToReadForRecalc = numBytesToDisk;
                        }

                        partialCrc.update(dataBuf.array(), startByteToDisk,
                                          bytesToReadForRecalc);
                        byte[] buf = FSOutputSummer.convertToByteStream(partialCrc,
                                                                        checksumSize);
                        crcBytes = copyLastChunkChecksum(buf, checksumSize, buf.length);
                        checksumOut.write(buf);
                        if(LOG.isDebugEnabled()) {
                            LOG.debug("Writing out partial crc for data len " + len +
                                      ", skip=" + skip);
                        }
                        skip++; //  For the partial chunk that was just read.
                    }

                    // Determine how many checksums need to be skipped up to the last
                    // boundary. The checksum after the boundary was already counted
                    // above. Only count the number of checksums skipped up to the
                    // boundary here.
                    long skippedDataBytes = lastChunkBoundary - firstByteInBlock;

                    if (skippedDataBytes > 0) {
                        skip += (int)(skippedDataBytes / bytesPerChecksum) +
                            ((skippedDataBytes % bytesPerChecksum == 0) ? 0 : 1);
                    }
                    skip *= checksumSize; // Convert to number of bytes

                    // write the rest of checksum
                    final int offset = checksumBuf.arrayOffset() +
                        checksumBuf.position() + skip;
                    final int end = offset + checksumLen - skip;
                    // If offset >= end, there is no more checksum to write.
                    // I.e. a partial chunk checksum rewrite happened and there is no
                    // more to write after that.
                    if (offset >= end && doCrcRecalc) {
                        lastCrc = crcBytes;
                    } else {
                        final int remainingBytes = checksumLen - skip;
                        lastCrc = copyLastChunkChecksum(checksumBuf.array(),
                                                        checksumSize, end);
                        checksumOut.write(checksumBuf.array(), offset, remainingBytes);
                    }
                }

                /// flush entire packet, sync if requested
                flushOrSync(syncBlock);

                replicaInfo.setLastChecksumAndDataLen(offsetInBlock, lastCrc);

                datanode.metrics.incrBytesWritten(len);
                datanode.metrics.incrTotalWriteTime(duration);

                manageWriterOsCache(offsetInBlock);
            }
        } catch (IOException iex) {
            // Volume error check moved to FileIoProvider
            throw iex;
        }
    }

    // if sync was requested, put in queue for pending acks here
    // (after the fsync finished)
    if (responder != null && (syncBlock || shouldVerifyChecksum())) {
        ((PacketResponder) responder.getRunnable()).enqueue(seqno,
                                                            lastPacketInBlock, offsetInBlock, Status.SUCCESS);
    }

    /*
     * Send in-progress responses for the replaceBlock() calls back to caller to
     * avoid timeouts due to balancer throttling. HDFS-6247
     */
    if (isReplaceBlock
        && (Time.monotonicNow() - lastResponseTime > responseInterval)) {
        BlockOpResponseProto.Builder response = BlockOpResponseProto.newBuilder()
            .setStatus(Status.IN_PROGRESS);
        response.build().writeDelimitedTo(replyOut);
        replyOut.flush();

        lastResponseTime = Time.monotonicNow();
    }

    if (throttler != null) { // throttle I/O
        throttler.throttle(len);
    }

    return lastPacketInBlock?-1:len;
}

2.4 PacketResponder

BlockReceiver receives packets from the upstream datanode and forwards them to the downstream datanode. If the current node is the last node in the pipeline, BlockReceiver is also responsible for verifying the packet checksums. Beyond that, the datanode must receive packet acknowledgements from the downstream node and relay them to the upstream node; BlockReceiver delegates this part of the work to its inner class PacketResponder.

PacketResponder is an inner class of BlockReceiver and runs as a separate thread. It works together with the thread running BlockReceiver to complete the block write. The reason packet handling and ack handling are split across two threads is that a single thread would have to listen on both the upstream input stream and the downstream input stream at once, and a block on either stream would delay processing of messages on the other.

After BlockReceiver finishes processing a packet, it triggers PacketResponder to handle the response for that packet. PacketResponder listens on the downstream input stream; once it receives the acknowledgement for the packet, it adds the current datanode's status to the acknowledgement and sends it to the upstream node.

Concretely, after processing a packet, BlockReceiver calls enqueue() to notify PacketResponder that the response for this packet needs to be handled.

ackQueue is a classic producer-consumer queue. enqueue() is the producer: it adds the packet awaiting acknowledgement to ackQueue, and every packet in ackQueue is eventually processed by PacketResponder's run() method. After successfully adding a packet to ackQueue, enqueue() calls notifyAll() to wake up run() so it can process the packet. (The consumer side, waitForAckHead()/removeAckHead(), is sketched after the enqueue() listing below.)

2.4.1 enqueue

/**
     * enqueue the seqno that is still to be acked by the downstream datanode.
     * @param seqno sequence number of the packet
     * @param lastPacketInBlock if true, this is the last packet in block
     * @param offsetInBlock offset of this packet in block
     */
void enqueue(final long seqno, final boolean lastPacketInBlock,
             final long offsetInBlock, final Status ackStatus) {
    final Packet p = new Packet(seqno, lastPacketInBlock, offsetInBlock,
                                System.nanoTime(), ackStatus);
    LOG.debug("{}: enqueue {}", this, p);
    synchronized (ackQueue) {
        if (running) {
            ackQueue.add(p);
            ackQueue.notifyAll();
        }
    }
}
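
For completeness, the consumer side of ackQueue is the pair waitForAckHead()/removeAckHead(), which the run() method shown in 2.4.2 calls. The sketch below is a lightly simplified version of these two helpers:

/** Wait for a packet to appear at the head of ackQueue (simplified sketch). */
Packet waitForAckHead(long seqno) throws InterruptedException {
    synchronized (ackQueue) {
        while (isRunning() && ackQueue.size() == 0) {
            ackQueue.wait();   // woken up by enqueue()'s notifyAll()
        }
        return isRunning() ? ackQueue.getFirst() : null;
    }
}

/** Remove the packet whose ack has been handled (simplified sketch). */
private void removeAckHead() {
    synchronized (ackQueue) {
        ackQueue.removeFirst();
        ackQueue.notifyAll();
    }
}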

2.4.2 run

PacketResponder.run() loops over all packets of the block, executing the acknowledgement logic for each. If an exception is thrown, PacketResponder.running is set to false, so the isRunning() check of the while loop returns false and the PacketResponder thread exits. If the PacketResponder thread is blocked, it is stopped by interrupting it via interrupt().

run() first reads a response from the downstream node's input stream and checks whether it carries an OOB message (when a Datanode is restarted during a write, it sends an OOB response backwards through the pipeline to the client, which then handles the restart of that Datanode in the pipeline); if so, the message is immediately relayed to the upstream node. Next, run() waits on ackQueue for a packet to process, and verifies that the acknowledgement received from downstream matches the pending packet taken from ackQueue, throwing an exception if it does not. If reading the ack from the downstream node fails, the mirrorError field is set to true, and run() includes the error in the responses it subsequently sends upstream.

PacketResponder then checks whether this acknowledgement is for the last packet of the block; if so, it calls finalizeBlock() to commit the block to the Namenode. Next, run() calls sendAckUpstream(), which copies the downstream acknowledgement, adds the current node's status, builds a new acknowledgement, and sends it to the upstream node. Once the acknowledgement has been handled, the packet is removed from ackQueue.

/**
     * Thread to process incoming acks.
     * @see java.lang.Runnable#run()
     */
@Override
public void run() {
    boolean lastPacketInBlock = false;
    final long startTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime() : 0;
    // Loop, processing packet acknowledgements
    while (isRunning() && !lastPacketInBlock) {
        long totalAckTimeNanos = 0;
        boolean isInterrupted = false;
        try {
            // the packet currently being processed
            Packet pkt = null;
            long expected = -2;
            // construct a pipeline ack for the packet
            PipelineAck ack = new PipelineAck();
            // sequence number of the acked packet
            long seqno = PipelineAck.UNKOWN_SEQNO;
            long ackRecvNanoTime = 0;
            try {
                if (type != PacketResponderType.LAST_IN_PIPELINE && !mirrorError) {
                    DataNodeFaultInjector.get().failPipeline(replicaInfo, mirrorAddr);
                    // read an ack from downstream datanode
                    // read an ack from the input stream of the downstream node
                    ack.readFields(downstreamIn);
                    ackRecvNanoTime = System.nanoTime();
                    if (LOG.isDebugEnabled()) {
                        LOG.debug(myString + " got " + ack);
                    }
                    // Process an OOB ACK.
                    // Check whether the ack is an OOB message, e.g. a downstream node restarting
                    Status oobStatus = ack.getOOBStatus();
                    if (oobStatus != null) {
                        LOG.info("Relaying an out of band ack of type " + oobStatus);
                        // Relay the OOB message to the upstream node
                        sendAckUpstream(ack, PipelineAck.UNKOWN_SEQNO, 0L, 0L,
                                        PipelineAck.combineHeader(datanode.getECN(),
                                                                  Status.SUCCESS));
                        continue;
                    }
                    seqno = ack.getSeqno();
                }
                if (seqno != PipelineAck.UNKOWN_SEQNO
                    || type == PacketResponderType.LAST_IN_PIPELINE) {
                    // Take the pending packet from the head of ackQueue
                    pkt = waitForAckHead(seqno);
                    if (!isRunning()) {
                        break;
                    }
                    expected = pkt.seqno;
                    // Check that the received ack matches the pending packet
                    if (type == PacketResponderType.HAS_DOWNSTREAM_IN_PIPELINE
                        && seqno != expected) {
                        throw new IOException(myString + "seqno: expected=" + expected
                                              + ", received=" + seqno);
                    }
                    if (type == PacketResponderType.HAS_DOWNSTREAM_IN_PIPELINE) {
                        // The total ack time includes the ack times of downstream
                        // nodes.
                        // The value is 0 if this responder doesn't have a downstream
                        // DN in the pipeline.
                        totalAckTimeNanos = ackRecvNanoTime - pkt.ackEnqueueNanoTime;
                        // Report the elapsed time from ack send to ack receive minus
                        // the downstream ack time.
                        long ackTimeNanos = totalAckTimeNanos
                            - ack.getDownstreamAckTimeNanos();
                        if (ackTimeNanos < 0) {
                            if (LOG.isDebugEnabled()) {
                                LOG.debug("Calculated invalid ack time: " + ackTimeNanos
                                          + "ns.");
                            }
                        } else {
                            datanode.metrics.addPacketAckRoundTripTimeNanos(ackTimeNanos);
                        }
                    }
                    lastPacketInBlock = pkt.lastPacketInBlock;
                }
            } catch (InterruptedException ine) {
                isInterrupted = true;
            } catch (IOException ioe) {
                if (Thread.interrupted()) {
                    isInterrupted = true;
                } else if (ioe instanceof EOFException && !packetSentInTime()) {
                    // The downstream error was caused by upstream including this
                    // node not sending packet in time. Let the upstream determine
                    // who is at fault.  If the immediate upstream node thinks it
                    // has sent a packet in time, this node will be reported as bad.
                    // Otherwise, the upstream node will propagate the error up by
                    // closing the connection.
                    LOG.warn("The downstream error might be due to congestion in " +
                             "upstream including this node. Propagating the error: ",
                             ioe);
                    throw ioe;
                } else {
                    // continue to run even if can not read from mirror
                    // notify client of the error
                    // and wait for the client to shut down the pipeline
                    // If reading from the downstream node failed, set mirrorError to true
                    mirrorError = true;
                    LOG.info(myString, ioe);
                }
            }

            if (Thread.interrupted() || isInterrupted) {
                /*
             * The receiver thread cancelled this thread. We could also check
             * any other status updates from the receiver thread (e.g. if it is
             * ok to write to replyOut). It is prudent to not send any more
             * status back to the client because this datanode has a problem.
             * The upstream datanode will detect that this datanode is bad, and
             * rightly so.
             *
             * The receiver thread can also interrupt this thread for sending
             * an out-of-band response upstream.
             */
                LOG.info(myString + ": Thread is interrupted.");
                // On interruption, set running to false to stop this thread
                running = false;
                continue;
            }

            if (lastPacketInBlock) {
                // Finalize the block and close the block file
                // If this is the ack for the last packet of the block,
                // finalize the block and close the block file
                finalizeBlock(startTime);
            }

            Status myStatus = pkt != null ? pkt.ackStatus : Status.SUCCESS;
            // Copy the downstream ack, add this datanode's status, and send it upstream
            sendAckUpstream(ack, expected, totalAckTimeNanos,
                            (pkt != null ? pkt.offsetInBlock : 0),
                            PipelineAck.combineHeader(datanode.getECN(), myStatus));
            if (pkt != null) {
                // remove the packet from the ack queue
                // The packet's ack has been handled; remove it from ackQueue
                removeAckHead();
            }
        } catch (IOException e) {
            LOG.warn("IOException in PacketResponder.run(): ", e);
            if (running) {
                // Volume error check moved to FileIoProvider
                LOG.info(myString, e);
                running = false;
                if (!Thread.interrupted()) { // failure not caused by interruption
                    receiverThread.interrupt();
                }
            }
        } catch (Throwable e) {
            if (running) {
                LOG.info(myString, e);
                running = false;
                receiverThread.interrupt();
            }
        }
    }
    LOG.info(myString + " terminating");
}

2.4.3 sendAckUpstreamUnprotected

sendAckUpstream() serializes concurrent senders (via the sending flag) and then calls sendAckUpstreamUnprotected() to actually send the response to the upstream node.

The response sent to the upstream node is built in one of the following ways:

  • If ack is null, this node is originating a new OOB response of its own (for example, when the datanode is shutting down for a restart); regardless of downstream nodes, the reply contains only this node's header.
  • If reading the ack from the downstream node failed (mirrorError is true), the reply records SUCCESS for this node and ERROR for the downstream node.
  • Otherwise, this Datanode's status is put first in the reply, followed by the statuses reported by the downstream nodes.

/**
     * The wrapper for the unprotected version. This is only called by
     * the responder's run() method.
     *
     * @param ack Ack received from downstream
     * @param seqno sequence number of ack to be sent upstream
     * @param totalAckTimeNanos total ack time including all the downstream
     *          nodes
     * @param offsetInBlock offset in block for the data in packet
     * @param myHeader the local ack header
     */
private void sendAckUpstream(PipelineAck ack, long seqno,
                             long totalAckTimeNanos, long offsetInBlock,
                             int myHeader) throws IOException {
    try {
        // Wait for other sender to finish. Unless there is an OOB being sent,
        // the responder won't have to wait.
        synchronized(this) {
            while(sending) {
                wait();
            }
            sending = true;
        }

        try {
            if (!running) return;
            sendAckUpstreamUnprotected(ack, seqno, totalAckTimeNanos,
                                       offsetInBlock, myHeader);
        } finally {
            synchronized(this) {
                sending = false;
                notify();
            }
        }
    } catch (InterruptedException ie) {
        // The responder was interrupted. Make it go down without
        // interrupting the receiver(writer) thread.  
        running = false;
    }
}

/**
     * @param ack Ack received from downstream
     * @param seqno sequence number of ack to be sent upstream
     * @param totalAckTimeNanos total ack time including all the downstream
     *          nodes
     * @param offsetInBlock offset in block for the data in packet
     * @param myHeader the local ack header
     */
private void sendAckUpstreamUnprotected(PipelineAck ack, long seqno,
                                        long totalAckTimeNanos, long offsetInBlock, int myHeader)
    throws IOException {
    final int[] replies;
    // ack == null: this node is originating a new OOB response of its own
    if (ack == null) {
        // A new OOB response is being sent from this node. Regardless of
        // downstream nodes, reply should contain one reply.
        replies = new int[] { myHeader };
    } else if (mirrorError) { // ack read error
        // Reading the ack from the downstream node failed;
        // record SUCCESS for this node and ERROR for the downstream node
        int h = PipelineAck.combineHeader(datanode.getECN(), Status.SUCCESS);
        int h1 = PipelineAck.combineHeader(datanode.getECN(), Status.ERROR);
        replies = new int[] {h, h1};
    } else {
        // Normal case
        short ackLen = type == PacketResponderType.LAST_IN_PIPELINE ? 0 : ack
            .getNumOfReplies();
        replies = new int[ackLen + 1];
        // Put this datanode's status first in the reply
        replies[0] = myHeader;
        for (int i = 0; i < ackLen; ++i) {
            // Append the statuses of the downstream nodes
            replies[i + 1] = ack.getHeaderFlag(i);
        }
        // If the mirror has reported that it received a corrupt packet,
        // do self-destruct to mark myself bad, instead of making the
        // mirror node bad. The mirror is guaranteed to be good without
        // corrupt data on disk.
        if (ackLen > 0 && PipelineAck.getStatusFromHeader(replies[1]) ==
            Status.ERROR_CHECKSUM) {
            throw new IOException("Shutting down writer and responder "
                                  + "since the down streams reported the data sent by this "
                                  + "thread is corrupt");
        }
    }
    // Build a new pipeline ack from the collected replies
    PipelineAck replyAck = new PipelineAck(seqno, replies,
                                           totalAckTimeNanos);
    if (replyAck.isSuccess()
        && offsetInBlock > replicaInfo.getBytesAcked()) {
        replicaInfo.setBytesAcked(offsetInBlock);
    }
    // send my ack back to upstream datanode
    // Send the ack to the upstream node
    long begin = Time.monotonicNow();
    /* for test only, no-op in production system */
    DataNodeFaultInjector.get().delaySendingAckToUpstream(inAddr);
    replyAck.write(upstreamOut);
    upstreamOut.flush();
    long duration = Time.monotonicNow() - begin;
    DataNodeFaultInjector.get().logDelaySendingAckToUpstream(
        inAddr,
        duration);
    if (duration > datanodeSlowLogThresholdMs) {
        LOG.warn("Slow PacketResponder send ack to upstream took " + duration
                 + "ms (threshold=" + datanodeSlowLogThresholdMs + "ms), " + myString
                 + ", replyAck=" + replyAck
                 + ", downstream DNs=" + Arrays.toString(downstreamDNs)
                 + ", blockId=" + replicaInfo.getBlockId());
    } else if (LOG.isDebugEnabled()) {
        LOG.debug(myString + ", replyAck=" + replyAck);
    }

    // If a corruption was detected in the received data, terminate after
    // sending ERROR_CHECKSUM back.
    // If this node detected a checksum error, stop the BlockReceiver and PacketResponder threads
    Status myStatus = PipelineAck.getStatusFromHeader(myHeader);
    if (myStatus == Status.ERROR_CHECKSUM) {
        throw new IOException("Shutting down writer and responder "
                              + "due to a checksum error in received data. The error "
                              + "response has been sent upstream.");
    }
}

2.4.4 finalizeBlock: Reporting the Result

After a Datanode finishes writing a block, it must report the new block to the Namenode so that the Namenode can update the namespace. Once PacketResponder has confirmed that the acknowledgements for all packets of the block have been handled correctly, it calls its finalizeBlock() method to notify the Namenode that this Datanode has successfully received the block.

if (lastPacketInBlock) {
    // Finalize the block and close the block file
    // If this is the ack for the last packet of the block,
    // finalize the block and close the block file
    finalizeBlock(startTime);
}
/**
     * Finalize the block and close the block file
     * @param startTime time when BlockReceiver started receiving the block
     */
private void finalizeBlock(long startTime) throws IOException {
    long endTime = 0;
    // Hold a volume reference to finalize block.
    try (ReplicaHandler handler = BlockReceiver.this.claimReplicaHandler()) {
        BlockReceiver.this.close();
        endTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime() : 0;
        block.setNumBytes(replicaInfo.getNumBytes());
        // datanode.data is the FsDatasetImpl object; this call finalizes the block write
        datanode.data.finalizeBlock(block, dirSyncOnFinalize);
    }

    // Pin the block on this datanode so it cannot be moved by the balancer/mover
    if (pinning) {
        datanode.data.setPinning(block);
    }

    // The datanode closes the block
    datanode.closeBlock(block, null, replicaInfo.getStorageUuid(),
                        replicaInfo.isOnTransientStorage());
    if (ClientTraceLog.isInfoEnabled() && isClient) {
        long offset = 0;
        DatanodeRegistration dnR = datanode.getDNRegistrationForBP(block
                                                                   .getBlockPoolId());
        ClientTraceLog.info(String.format(DN_CLIENTTRACE_FORMAT, inAddr,
                                          myAddr, block.getNumBytes(), "HDFS_WRITE", clientname, offset,
                                          dnR.getDatanodeUuid(), block, endTime - startTime));
    } else {
        LOG.info("Received " + block + " size " + block.getNumBytes()
                 + " from " + inAddr);
    }
}
