Understanding Flink Data Exchange: A Walkthrough of the Netty Communication Source Code

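Flink uses the Netty framework for data communication, with a client side and a server side. Every TaskManager acts as both: the client requests data from upstream tasks, while the server accepts requests from downstream clients and sends data to the downstream tasks. All data-processing logic lives in ChannelHandlers, and the client and the server use different ChannelHandlers.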

1. Netty initialization

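When a TaskManager is initialized, it creates the network service NetworkEnvironment and starts it. NetworkEnvironment in turn starts NettyConnectionManager, which manages the Netty connections and contains a NettyServer and a NettyClient, i.e. the Netty server and client. Starting NettyConnectionManager initializes both NettyServer and NettyClient.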

// NetworkEnvironment class
public void start() throws IOException {
   synchronized (lock) {
      Preconditions.checkState(!isShutdown, "The NetworkEnvironment has already been shut down.");

      LOG.info("Starting the network environment and its components.");

      try {
         LOG.debug("Starting network connection manager");
         connectionManager.start(resultPartitionManager, taskEventDispatcher);
      } catch (IOException t) {
         throw new IOException("Failed to instantiate network connection manager.", t);
      }

      ...
   }
}
// NettyConnectionManager class
public void start(ResultPartitionProvider partitionProvider, TaskEventDispatcher taskEventDispatcher) throws IOException {
   NettyProtocol partitionRequestProtocol = new NettyProtocol(
      partitionProvider,
      taskEventDispatcher,
      client.getConfig().isCreditBasedEnabled());
   // Initialize the NettyClient
   client.init(partitionRequestProtocol, bufferPool);
   // Initialize the NettyServer
   server.init(partitionRequestProtocol, bufferPool);
}

2. NettyClient initialization

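Let's look at the initialization of the client, NettyClient, first.
Client initialization is fairly simple: it only creates a bootstrap and sets a few options. No ChannelHandler is installed yet and no connection to a server is made, because at this point no Task has started and the client does not yet know which node address it will need to connect to.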

// NettyClient class
void init(final NettyProtocol protocol, NettyBufferPool nettyBufferPool) throws IOException {
   checkState(bootstrap == null, "Netty client has already been initialized.");

   this.protocol = protocol;

   final long start = System.nanoTime();
   // Create the bootstrap
   bootstrap = new Bootstrap();

   // --------------------------------------------------------------------
   // Transport-specific configuration
   // --------------------------------------------------------------------

   switch (config.getTransportType()) {
      case NIO: // NIO is the default transport type
         // Set up the EventLoopGroup, including the number of NIO threads
         initNioBootstrap();
         break;

      case EPOLL:
         initEpollBootstrap();
         break;

      case AUTO:
         if (Epoll.isAvailable()) {
            initEpollBootstrap();
            LOG.info("Transport type 'auto': using EPOLL.");
         }
         else {
            initNioBootstrap();
            LOG.info("Transport type 'auto': using NIO.");
         }
   }

   // --------------------------------------------------------------------
   // Configuration
   // --------------------------------------------------------------------

   bootstrap.option(ChannelOption.TCP_NODELAY, true);
   bootstrap.option(ChannelOption.SO_KEEPALIVE, true);

   // Timeout for new connections
   bootstrap.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, config.getClientConnectTimeoutSeconds() * 1000);

   // Pooled allocator for Netty's ByteBuf instances
   bootstrap.option(ChannelOption.ALLOCATOR, nettyBufferPool);

   // Receive and send buffer size
   int receiveAndSendBufferSize = config.getSendAndReceiveBufferSize();
   if (receiveAndSendBufferSize > 0) {
      bootstrap.option(ChannelOption.SO_SNDBUF, receiveAndSendBufferSize);
      bootstrap.option(ChannelOption.SO_RCVBUF, receiveAndSendBufferSize);
   }

   try {
      clientSSLFactory = config.createClientSSLEngineFactory();
   } catch (Exception e) {
      throw new IOException("Failed to initialize SSL Context for the Netty client", e);
   }

   final long duration = (System.nanoTime() - start) / 1_000_000;
   LOG.info("Successful initialization (took {} ms).", duration);
}

3. NettyServer initialization

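Now for the initialization of the server, NettyServer.
Server initialization is more complete: it creates the server-side bootstrap, registers the server-side ChannelHandlers, and also starts the server.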

// NettyServer class
void init(final NettyProtocol protocol, NettyBufferPool nettyBufferPool) throws IOException {
   checkState(bootstrap == null, "Netty server has already been initialized.");

   final long start = System.nanoTime();
   // Create the bootstrap
   bootstrap = new ServerBootstrap();

   // --------------------------------------------------------------------
   // Transport-specific configuration
   // --------------------------------------------------------------------

   switch (config.getTransportType()) {
      // Same as on the client side
      case NIO:
         initNioBootstrap();
         break;

      case EPOLL:
         initEpollBootstrap();
         break;

      case AUTO:
         if (Epoll.isAvailable()) {
            initEpollBootstrap();
            LOG.info("Transport type 'auto': using EPOLL.");
         }
         else {
            initNioBootstrap();
            LOG.info("Transport type 'auto': using NIO.");
         }
   }

   // --------------------------------------------------------------------
   // Configuration
   // --------------------------------------------------------------------

   // Server bind address
   bootstrap.localAddress(config.getServerAddress(), config.getServerPort());

   // Pooled allocators for Netty's ByteBuf instances
   bootstrap.option(ChannelOption.ALLOCATOR, nettyBufferPool);
   bootstrap.childOption(ChannelOption.ALLOCATOR, nettyBufferPool);

   if (config.getServerConnectBacklog() > 0) {
      bootstrap.option(ChannelOption.SO_BACKLOG, config.getServerConnectBacklog());
   }

   // Receive and send buffer size
   int receiveAndSendBufferSize = config.getSendAndReceiveBufferSize();
   if (receiveAndSendBufferSize > 0) {
      bootstrap.childOption(ChannelOption.SO_SNDBUF, receiveAndSendBufferSize);
      bootstrap.childOption(ChannelOption.SO_RCVBUF, receiveAndSendBufferSize);
   }

   // Low and high water marks for flow control
   // hack around the impossibility (in the current netty version) to set both watermarks at
   // the same time:
   final int defaultHighWaterMark = 64 * 1024; // from DefaultChannelConfig (not exposed)
   final int newLowWaterMark = config.getMemorySegmentSize() + 1;
   final int newHighWaterMark = 2 * config.getMemorySegmentSize();
   if (newLowWaterMark > defaultHighWaterMark) {
      bootstrap.childOption(ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK, newHighWaterMark);
      bootstrap.childOption(ChannelOption.WRITE_BUFFER_LOW_WATER_MARK, newLowWaterMark);
   } else { // including (newHighWaterMark < defaultLowWaterMark)
      bootstrap.childOption(ChannelOption.WRITE_BUFFER_LOW_WATER_MARK, newLowWaterMark);
      bootstrap.childOption(ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK, newHighWaterMark);
   }

   // SSL related configuration
   final SSLHandlerFactory sslHandlerFactory;
   try {
      sslHandlerFactory = config.createServerSSLEngineFactory();
   } catch (Exception e) {
      throw new IOException("Failed to initialize SSL Context for the Netty Server", e);
   }

   // --------------------------------------------------------------------
   // Child channel pipeline for accepted connections
   // --------------------------------------------------------------------

   bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
      @Override
      public void initChannel(SocketChannel channel) throws Exception {
         if (sslHandlerFactory != null) {
            channel.pipeline().addLast("ssl", sslHandlerFactory.createNettySSLHandler());
         }
         // Add the data-processing ChannelHandlers to the channel's pipeline
         channel.pipeline().addLast(protocol.getServerChannelHandlers());
      }
   });

   // --------------------------------------------------------------------
   // Start Server
   // --------------------------------------------------------------------
   // Start the server
   bindFuture = bootstrap.bind().syncUninterruptibly();

   localAddress = (InetSocketAddress) bindFuture.channel().localAddress();

   final long duration = (System.nanoTime() - start) / 1_000_000;
   LOG.info("Successful initialization (took {} ms). Listening on SocketAddress {}.", duration, localAddress);
}

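Let's see which ChannelHandlers the server installs.
The source shows a message encoder and a message decoder and, most importantly, PartitionRequestServerHandler and PartitionRequestQueue. The core data-handling logic lives in these two ChannelHandlers.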

// NettyProtocol class
public ChannelHandler[] getServerChannelHandlers() {
   PartitionRequestQueue queueOfPartitionQueues = new PartitionRequestQueue();
   PartitionRequestServerHandler serverHandler = new PartitionRequestServerHandler(
      partitionProvider, taskEventDispatcher, queueOfPartitionQueues, creditBasedEnabled);

   return new ChannelHandler[] {
      messageEncoder,
      new NettyMessage.NettyMessageDecoder(!creditBasedEnabled),
      serverHandler,
      queueOfPartitionQueues
   };
}

4. The client requests data

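As mentioned above, during initialization the client only creates the bootstrap and does nothing else. The article "Task Data Exchange: Reading Data" (《Task数据交互之数据读》) explained that the consuming task thread reads its data from an InputGate, and that the InputGate calls requestPartitions() to request data from the upstream nodes, so that method is where we start.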

// SingleInputGate class
public void requestPartitions() throws IOException, InterruptedException {
   synchronized (requestLock) {
      if (!requestedPartitionsFlag) {
         if (isReleased) {
            throw new IllegalStateException("Already released.");
         }

         // Sanity checks
         if (numberOfInputChannels != inputChannels.size()) {
            throw new IllegalStateException("Bug in input gate setup logic: mismatch between" +
                  "number of total input channels and the currently set number of input " +
                  "channels.");
         }
         // Every InputChannel issues its own data request
         for (InputChannel inputChannel : inputChannels.values()) {
            inputChannel.requestSubpartition(consumedSubpartitionIndex);
         }
      }

      requestedPartitionsFlag = true;
   }
}

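As you can see, an InputGate holds many InputChannels, and each InputChannel issues its own data request. Here we only look at RemoteInputChannel, the case where data is requested from a remote node. We can also see that an InputGate issues its data requests only once (guarded by requestedPartitionsFlag); after that, the exchange is driven by credits. Finally, all RemoteInputChannels in the same InputGate request the same subpartition index. In other words, a Reduce task requests the ResultSubpartition with the same index from every upstream Map task, for example the second ResultSubpartition of each.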
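
The one-shot behaviour can be illustrated with a minimal Java sketch. This is not Flink's code: OneShotGateSketch and its sentRequests list are hypothetical names introduced here, and channels are reduced to plain indices. Only the flag-under-lock idea and the shared subpartition index mirror requestPartitions().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SingleInputGate; not Flink's real class.
class OneShotGateSketch {
    private final Object requestLock = new Object();
    private boolean requestedPartitionsFlag = false;
    private final int numberOfChannels;
    private final int consumedSubpartitionIndex;
    // Records every request that would be sent, so the behaviour is observable.
    final List<String> sentRequests = new ArrayList<>();

    OneShotGateSketch(int numberOfChannels, int consumedSubpartitionIndex) {
        this.numberOfChannels = numberOfChannels;
        this.consumedSubpartitionIndex = consumedSubpartitionIndex;
    }

    void requestPartitions() {
        synchronized (requestLock) {
            if (!requestedPartitionsFlag) {
                // Every channel asks for the SAME subpartition index.
                for (int channel = 0; channel < numberOfChannels; channel++) {
                    sentRequests.add("channel-" + channel + " -> subpartition " + consumedSubpartitionIndex);
                }
            }
            // Later calls become no-ops.
            requestedPartitionsFlag = true;
        }
    }

    public static void main(String[] args) {
        OneShotGateSketch gate = new OneShotGateSketch(3, 1);
        gate.requestPartitions();
        gate.requestPartitions(); // second call sends nothing
        gate.sentRequests.forEach(System.out::println);
    }
}
```

Calling requestPartitions() twice still records only one request per channel, which is the property the article relies on when it says later traffic is driven by credits rather than by new requests.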

RemoteInputChannel creates a partitionRequestClient and sends its request to the server through it. Every RemoteInputChannel holds a partitionRequestClient, because each RemoteInputChannel corresponds to one upstream task, and those upstream tasks are spread across different nodes.

// RemoteInputChannel class
public void requestSubpartition(int subpartitionIndex) throws IOException, InterruptedException {
   if (partitionRequestClient == null) {
      // Create a client and request the partition
      partitionRequestClient = connectionManager
         .createPartitionRequestClient(connectionId);

      partitionRequestClient.requestSubpartition(partitionId, subpartitionIndex, this, 0);
   }
}

5. Creating the PartitionRequestClient

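Creating the partitionRequestClient eventually calls NettyClient.connect(). This is where the ChannelHandlers are installed on the NettyClient and the connection to the server is established. Because there are multiple RemoteInputChannels, connections are made to multiple upstream TaskManagers, and each TaskManager connection has its own ChannelHandler. By default the client-side ChannelHandler is CreditBasedPartitionRequestClientHandler, the credit-based handler. Later Flink versions keep only this ChannelHandler.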

// PartitionRequestClientFactory class
PartitionRequestClient createPartitionRequestClient(ConnectionID connectionId) throws IOException, InterruptedException {
   Object entry;
   PartitionRequestClient client = null;

   while (client == null) {
      entry = clients.get(connectionId);

      if (entry != null) {
         // Existing channel or connecting channel
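         // If a channel to this address already exists, reuse it.
         // For example, when two Map tasks run on the same TaskManager, one connection is enough and they share one partitionRequestClient.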
        if (entry instanceof PartitionRequestClient) {
           client = (PartitionRequestClient) entry;
        }
        else {
           ConnectingChannel future = (ConnectingChannel) entry;
           client = future.waitForChannel();

           clients.replace(connectionId, future, client);
        }
      }
      else {
         // No channel yet. Create one, but watch out for a race.
         ConnectingChannel connectingChannel = new ConnectingChannel(connectionId, this);
         Object old = clients.putIfAbsent(connectionId, connectingChannel);

         if (old == null) {
            // Use the NettyClient to connect to the server
            nettyClient.connect(connectionId.getAddress()).addListener(connectingChannel);
            // Wait until the connection is established
            client = connectingChannel.waitForChannel();

            clients.replace(connectionId, connectingChannel, client);
         }
         ...
   }

   return client;
}
// NettyClient class
ChannelFuture connect(final InetSocketAddress serverSocketAddress) {
   checkState(bootstrap != null, "Client has not been initialized yet.");

   // --------------------------------------------------------------------
   // Child channel pipeline for accepted connections
   // --------------------------------------------------------------------

   bootstrap.handler(new ChannelInitializer<SocketChannel>() {
      @Override
      public void initChannel(SocketChannel channel) throws Exception {

         // SSL handler should be added first in the pipeline
         if (clientSSLFactory != null) {
            SslHandler sslHandler = clientSSLFactory.createNettySSLHandler(
                  serverSocketAddress.getAddress().getCanonicalHostName(),
                  serverSocketAddress.getPort());
            channel.pipeline().addLast("ssl", sslHandler);
         }
         // Add the data-processing ChannelHandlers to the channel's pipeline
         channel.pipeline().addLast(protocol.getClientChannelHandlers());
      }
   });

   try {
      // Connect to the server (NettyServer)
      return bootstrap.connect(serverSocketAddress);
   }
   ...
}
// NettyProtocol class
public ChannelHandler[] getClientChannelHandlers() {
   // CreditBasedPartitionRequestClientHandler is the default
   NetworkClientHandler networkClientHandler =
      creditBasedEnabled ? new CreditBasedPartitionRequestClientHandler() :
         new PartitionRequestClientHandler();
   return new ChannelHandler[] {
      messageEncoder,
      new NettyMessage.NettyMessageDecoder(!creditBasedEnabled),
      networkClientHandler};
}
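
The connection sharing in createPartitionRequestClient() can be sketched as follows. This is a simplified illustration under stated assumptions, not Flink's implementation: SharedClientSketch, Client and connectCount are hypothetical names, connection IDs are plain strings, and ConcurrentHashMap.computeIfAbsent replaces the putIfAbsent plus ConnectingChannel handshake shown above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for PartitionRequestClientFactory; not Flink's real class.
class SharedClientSketch {
    // Hypothetical stand-in for PartitionRequestClient.
    static final class Client {
        final String address;
        Client(String address) { this.address = address; }
    }

    private final ConcurrentMap<String, Client> clients = new ConcurrentHashMap<>();
    // Counts how often a "connection" is actually opened.
    final AtomicInteger connectCount = new AtomicInteger();

    Client createPartitionRequestClient(String connectionId) {
        // Assumption: computeIfAbsent is used here for brevity; it runs the
        // connect step at most once per key and makes concurrent callers wait.
        return clients.computeIfAbsent(connectionId, id -> {
            connectCount.incrementAndGet(); // stands in for nettyClient.connect(...)
            return new Client(id);
        });
    }

    public static void main(String[] args) {
        SharedClientSketch factory = new SharedClientSketch();
        Client a = factory.createPartitionRequestClient("tm-1:6121");
        Client b = factory.createPartitionRequestClient("tm-1:6121");
        Client c = factory.createPartitionRequestClient("tm-2:6121");
        System.out.println("same client for same TaskManager: " + (a == b));
        System.out.println("same client for different TaskManager: " + (a == c));
        System.out.println("connections opened: " + factory.connectCount.get());
    }
}
```

Two input channels that point at the same TaskManager end up with the same client object, so only one connection per remote address is opened.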

6. Sending the partition data request

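Once the partitionRequestClient has been created, the data request is sent. The implementation is in PartitionRequestClient.requestSubpartition().
The source shows that a PartitionRequest instance is created first. It specifies which ResultSubpartition is being requested and carries the initial credit of the current RemoteInputChannel. A credit is a consumption token; see "Flink Credit-Based Data Transfer and Backpressure" (《Flink基于Credit的数据传输和背压》) for details. In short, the producer may send the consumer exactly as many data buffers as the consumer has credits, and the consumer's credit equals the number of free buffers it has available for receiving data.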

// PartitionRequestClient class
public ChannelFuture requestSubpartition(
      final ResultPartitionID partitionId,
      final int subpartitionIndex,
      final RemoteInputChannel inputChannel,
      int delayMs) throws IOException {

    ...
   // Register the InputChannel with the CreditBasedPartitionRequestClientHandler
   clientHandler.addInputChannel(inputChannel);
   // Create the PartitionRequest message
   final PartitionRequest request = new PartitionRequest(
         partitionId, subpartitionIndex, inputChannel.getInputChannelId(), inputChannel.getInitialCredit());

   ...

   if (delayMs == 0) {
      ChannelFuture f = tcpChannel.writeAndFlush(request);
      f.addListener(listener);
      return f;
   } 
   ...
}

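After the request instance (which can also be called a message) is created, it is sent to the server. Next comes the server-side logic that runs once the request message arrives.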

7. The server handles the data request

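The server calls ChannelHandler.channelRead() to read incoming messages. PartitionRequestServerHandler.channelRead() delegates to channelRead0(), which contains the actual logic. Roughly:
1) After receiving the client's PartitionRequest, the server creates a reader for that request. Here we only look at the credit-based CreditBasedSequenceNumberingViewReader. Each reader starts with an initial credit equal to the number of exclusive buffers of the consumer's RemoteInputChannel.
2) The reader then creates a ResultSubpartitionView, through which it reads data from the corresponding ResultSubpartition. In streaming jobs this ResultSubpartitionView is a PipelinedSubpartitionView instance.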

// PartitionRequestServerHandler class
protected void channelRead0(ChannelHandlerContext ctx, NettyMessage msg) throws Exception {
   try {
      Class<?> msgClazz = msg.getClass();
      
      if (msgClazz == PartitionRequest.class) {
         PartitionRequest request = (PartitionRequest) msg;

         LOG.debug("Read channel on {}: {}.", ctx.channel().localAddress(), request);

         try {
            NetworkSequenceViewReader reader;
            if (creditBasedEnabled) {
               reader = new CreditBasedSequenceNumberingViewReader(
                  request.receiverId,
                  request.credit,
                  outboundQueue);
            } else {
               reader = new SequenceNumberingViewReader(
                  request.receiverId,
                  outboundQueue);
            }

            reader.requestSubpartitionView(
               partitionProvider,
               request.partitionId,
               request.queueIndex);

            outboundQueue.notifyReaderCreated(reader);
         } catch (PartitionNotFoundException notFound) {
            respondWithError(ctx, notFound, request.receiverId);
         }
      }
      ...
}

// CreditBasedSequenceNumberingViewReader class
public void requestSubpartitionView(
   ResultPartitionProvider partitionProvider,
   ResultPartitionID resultPartitionId,
   int subPartitionIndex) throws IOException {

   synchronized (requestLock) {
      if (subpartitionView == null) {
         this.subpartitionView = partitionProvider.createSubpartitionView(
            resultPartitionId,
            subPartitionIndex,
            this);
      } else {
         throw new IllegalStateException("Subpartition already requested");
      }
   }
}

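Let's put this together. One upstream (Map) task corresponds to one ResultPartition, each ResultPartition has multiple ResultSubpartitions, and each ResultSubpartition corresponds to one downstream (Reduce) task. Each downstream task requests one ResultSubpartition from the ResultPartition. For example, with 10 ResultSubpartitions there are 10 Reduce tasks; each Reduce task sends one PartitionRequest, so the Map side creates 10 readers, each reading one ResultSubpartition.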

8. Creating the ResultSubpartitionView

The ResultSubpartitionView is ultimately created by PipelinedSubpartition.createReadView(). If the subpartition already holds buffers when the view is created, sending is triggered immediately through PipelinedSubpartition.notifyDataAvailable(). That call ends up in CreditBasedSequenceNumberingViewReader.notifyDataAvailable(), which fires PartitionRequestQueue.userEventTriggered(). As noted above, PartitionRequestQueue is also a ChannelHandler.

// PipelinedSubpartition class
public PipelinedSubpartitionView createReadView(BufferAvailabilityListener availabilityListener) throws IOException {
   final boolean notifyDataAvailable;
   synchronized (buffers) {
      ...
      readView = new PipelinedSubpartitionView(this, availabilityListener);
      notifyDataAvailable = !buffers.isEmpty();
   }
   if (notifyDataAvailable) {
      notifyDataAvailable();
   }

   return readView;
}

// PipelinedSubpartition class
private void notifyDataAvailable() {
   if (readView != null) {
      readView.notifyDataAvailable();
   }
}

// PipelinedSubpartitionView class
public void notifyDataAvailable() {
   availabilityListener.notifyDataAvailable();
}

// CreditBasedSequenceNumberingViewReader class
public void notifyDataAvailable() {
   requestQueue.notifyReaderNonEmpty(this);
}

// PartitionRequestQueue class
void notifyReaderNonEmpty(final NetworkSequenceViewReader reader) {
   ctx.executor().execute(() -> ctx.pipeline().fireUserEventTriggered(reader));
}

PartitionRequestQueue.userEventTriggered() adds this reader (a CreditBasedSequenceNumberingViewReader) to the queue of available readers and triggers reading and writing of data. The implementation is in writeAndFlushNextMessageIfPossible().

// PartitionRequestQueue class
public void userEventTriggered(ChannelHandlerContext ctx, Object msg) throws Exception {
    
   if (msg instanceof NetworkSequenceViewReader) {
      enqueueAvailableReader((NetworkSequenceViewReader) msg);
   } 
   ...
}

private void enqueueAvailableReader(final NetworkSequenceViewReader reader) throws Exception {
   if (reader.isRegisteredAsAvailable() || !reader.isAvailable()) {
      return;
   }
   // Queue an available reader for consumption. If the queue is empty,
   // we try trigger the actual write. Otherwise this will be handled by
   // the writeAndFlushNextMessageIfPossible calls.
   boolean triggerWrite = availableReaders.isEmpty();
   // Add the reader to the queue of available readers
   registerAvailableReader(reader);

   if (triggerWrite) {
      // Trigger sending of data
      writeAndFlushNextMessageIfPossible(ctx.channel());
   }
}

private void registerAvailableReader(NetworkSequenceViewReader reader) {
   availableReaders.add(reader);
   reader.setRegisteredAsAvailable(true);
}

9. Sending buffers

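The key method is writeAndFlushNextMessageIfPossible(). Its logic is roughly:
1) Take a reader from the queue of available readers. The reader reads one data buffer from the ResultSubpartition, and every data buffer read decrements the credit by 1.
2) If the ResultSubpartition still has consumable data, add the reader back to the queue of available readers so that it can keep reading.
3) Wrap the buffer just obtained and write it to the socket, i.e. send it downstream. One important value here is buffersInBacklog, the number of buffers backed up in the ResultSubpartition. It matters for Flink's backpressure handling, because this backlog determines how many buffers the Reduce side should use to receive data. Regarding step 2: if the consumer no longer has enough buffers to receive data, the reader is not re-added and no more data is sent to that consumer.
4) Repeat the steps above until no reader is available. In the code, the method returns after each writeAndFlush and the next round is driven by the write callback (writeListener). Running out of readers means the Map side has nothing it can send for now: for example, the data has not yet filled a buffer, the periodic flush has not fired yet, or the consumer has no free buffer to receive data (that is, the credit on the producer side has dropped to 0).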

private void writeAndFlushNextMessageIfPossible(final Channel channel) throws IOException {
   ...

   BufferAndAvailability next = null;
   try {
      while (true) {
         // Take a reader from the queue of available readers
         NetworkSequenceViewReader reader = pollAvailableReader();

         // No queue with available data. We allow this here, because
         // of the write callbacks that are executed after each write.
         if (reader == null) {
            return;
         }
         // The reader reads one data buffer from the ResultSubpartition
         next = reader.getNextBuffer();
         if (next == null) {
            ... // exceptional case: no buffer was returned
         } else {
            // This channel was now removed from the available reader queue.
            // We re-add it into the queue if it is still available
            if (next.moreAvailable()) {
               // Add the reader back to the queue of available readers
               registerAvailableReader(reader);
            }

            BufferResponse msg = new BufferResponse(
               next.buffer(),
               reader.getSequenceNumber(),
               reader.getReceiverId(),
               next.buffersInBacklog());

            ...

            // Write and flush and wait until this is done before
            // trying to continue with the next buffer.
            // Send the buffer
            channel.writeAndFlush(msg).addListener(writeListener);

            return;
         }
      }
   } 
   ...
}
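
The interplay between the available-reader queue and credits can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not Flink's implementation: CreditSendLoopSketch, Reader, wire and addCredit are hypothetical names, buffers are plain strings, and the loop drains synchronously instead of continuing from Netty's write callback.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for PartitionRequestQueue; not Flink's real class.
class CreditSendLoopSketch {
    // Hypothetical reader: pending buffers plus the credit granted by the consumer.
    static final class Reader {
        final String receiverId;
        final Queue<String> buffers = new ArrayDeque<>();
        int credit;
        boolean registered;

        Reader(String receiverId, int credit, String... data) {
            this.receiverId = receiverId;
            this.credit = credit;
            for (String d : data) {
                buffers.add(d);
            }
        }

        // Available only if there is data AND the consumer still has credit.
        boolean isAvailable() { return !buffers.isEmpty() && credit > 0; }

        String getNextBuffer() {
            credit--; // every buffer sent costs one credit
            return buffers.poll();
        }
    }

    private final Queue<Reader> availableReaders = new ArrayDeque<>();
    // Everything "written to the socket", in order.
    final List<String> wire = new ArrayList<>();

    void enqueueAvailableReader(Reader reader) {
        if (reader.registered || !reader.isAvailable()) {
            return;
        }
        boolean triggerWrite = availableReaders.isEmpty();
        register(reader);
        if (triggerWrite) {
            drain();
        }
    }

    // The consumer announced more credit: the reader may become available again.
    void addCredit(Reader reader, int credit) {
        reader.credit += credit;
        enqueueAvailableReader(reader);
    }

    private void register(Reader reader) {
        availableReaders.add(reader);
        reader.registered = true;
    }

    private void drain() {
        Reader reader;
        while ((reader = availableReaders.poll()) != null) {
            reader.registered = false;
            String buffer = reader.getNextBuffer();
            if (reader.isAvailable()) {
                register(reader); // more data and credit left: keep it in the queue
            }
            wire.add(reader.receiverId + ":" + buffer + " backlog=" + reader.buffers.size());
        }
    }

    public static void main(String[] args) {
        CreditSendLoopSketch queue = new CreditSendLoopSketch();
        Reader reader = new Reader("ch-0", 2, "b0", "b1", "b2");
        queue.enqueueAvailableReader(reader); // sends b0 and b1, then runs out of credit
        queue.addCredit(reader, 1);           // one more credit releases b2
        queue.wire.forEach(System.out::println);
    }
}
```

With two initial credits and three pending buffers, the third buffer stays on the producer until the consumer announces another credit, which is how backpressure propagates upstream.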

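Now look at how the reader reads data from the ResultSubpartition: it obtains a buffer from the PipelinedSubpartition through the subpartitionView, and each data buffer obtained decrements the credit by 1. Fetching a buffer essentially means taking one from the ResultSubpartition's buffers queue.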

// CreditBasedSequenceNumberingViewReader class
public BufferAndAvailability getNextBuffer() throws IOException, InterruptedException {
   // Get a buffer through the subpartitionView
   BufferAndBacklog next = subpartitionView.getNextBuffer();
   if (next != null) {
      sequenceNumber++;
      // Each data buffer taken decrements the credit by 1
      if (next.buffer().isBuffer() && --numCreditsAvailable < 0) {
         throw new IllegalStateException("no credit available");
      }
      // Wrap the buffer
      return new BufferAndAvailability(
         next.buffer(), isAvailable(next), next.buffersInBacklog());
   } else {
      return null;
   }
}
// PipelinedSubpartitionView class
public BufferAndBacklog getNextBuffer() {
   return parent.pollBuffer();
}

// PipelinedSubpartition class
BufferAndBacklog pollBuffer() {
   synchronized (buffers) {
      Buffer buffer = null;

      if (buffers.isEmpty()) {
         flushRequested = false;
      }

      while (!buffers.isEmpty()) {
         // Look at the buffer at the head of the PipelinedSubpartition's buffers queue
         BufferConsumer bufferConsumer = buffers.peek();
         buffer = bufferConsumer.build();

         checkState(bufferConsumer.isFinished() || buffers.size() == 1,
            "When there are multiple buffers, an unfinished bufferConsumer can not be at the head of the buffers queue.");

         if (buffers.size() == 1) {
            // turn off flushRequested flag if we drained all of the available data
            flushRequested = false;
         }
         // If the buffer is finished (fully written, not half-filled), it can be removed from the buffers queue
         if (bufferConsumer.isFinished()) {
            buffers.pop().close();
            decreaseBuffersInBacklogUnsafe(bufferConsumer.isBuffer());
         }

         if (buffer.readableBytes() > 0) {
            break;
         }
         buffer.recycleBuffer();
         buffer = null;
         if (!bufferConsumer.isFinished()) {
            break;
         }
      }

      if (buffer == null) {
         return null;
      }
      // Update the statistics of the PipelinedSubpartition
      updateStatistics(buffer);
      // Do not report last remaining buffer on buffers as available to read (assuming it's unfinished).
      // It will be reported for reading either on flush or when the number of buffers in the queue
      // will be 2 or more.
      return new BufferAndBacklog(
         buffer,
         isAvailableUnsafe(),
         getBuffersInBacklog(),
         nextBufferIsEventUnsafe());
   }
}

10. The producer triggers data sending

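The process above covers the downstream consumer requesting data from upstream at startup, which triggers the producer to send data downstream. But suppose that at some point the producer temporarily has nothing to send, as in step 4 above. Once new data is produced, how is sending triggered again?
The answer: whenever a buffer fills up, or whenever the periodic flush runs, PipelinedSubpartition.notifyDataAvailable() may be triggered, which leads back into the writeAndFlushNextMessageIfPossible() logic described above.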

// PipelinedSubpartition class
private boolean add(BufferConsumer bufferConsumer, boolean finish) {
   checkNotNull(bufferConsumer);

   final boolean notifyDataAvailable;
   synchronized (buffers) {
      if (isFinished || isReleased) {
         bufferConsumer.close();
         return false;
      }

      // Add the bufferConsumer and update the stats
      buffers.add(bufferConsumer);
      updateStatistics(bufferConsumer);
      increaseBuffersInBacklog(bufferConsumer);
      notifyDataAvailable = shouldNotifyDataAvailable() || finish;

      isFinished |= finish;
   }
   // Decide whether to trigger consumption, i.e. sending to the consumer
   if (notifyDataAvailable) {
      notifyDataAvailable();
   }

   return true;
}
// PipelinedSubpartition class
public void flush() {
   final boolean notifyDataAvailable;
   synchronized (buffers) {
      if (buffers.isEmpty()) {
         return;
      }
      // if there is more then 1 buffer, we already notified the reader
      // (at the latest when adding the second buffer)
      notifyDataAvailable = !flushRequested && buffers.size() == 1;
      flushRequested = true;
   }
   // Decide whether to trigger consumption, i.e. sending to the consumer
   if (notifyDataAvailable) {
      notifyDataAvailable();
   }
}
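
The flushRequested flag keeps repeated flushes from notifying the reader more than once for the same pending buffer. The sketch below isolates that idea. It is a simplification, not Flink's code: FlushNotifySketch and the notifications counter are hypothetical, buffers are strings, and add() here never notifies, whereas Flink's add() also consults shouldNotifyDataAvailable(), which this article does not show.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for PipelinedSubpartition; not Flink's real class.
class FlushNotifySketch {
    private final Deque<String> buffers = new ArrayDeque<>();
    private boolean flushRequested;
    // Counts how often the reader would have been notified.
    int notifications;

    // Simplification: adding a buffer never notifies in this sketch.
    void add(String buffer) {
        synchronized (buffers) {
            buffers.add(buffer);
        }
    }

    void flush() {
        final boolean shouldNotify;
        synchronized (buffers) {
            if (buffers.isEmpty()) {
                return;
            }
            // Notify only for the first flush of a single pending buffer.
            shouldNotify = !flushRequested && buffers.size() == 1;
            flushRequested = true;
        }
        if (shouldNotify) {
            notifications++; // stands in for notifyDataAvailable()
        }
    }

    String pollBuffer() {
        synchronized (buffers) {
            // Draining the last buffer (or finding none) re-arms the flag.
            if (buffers.size() <= 1) {
                flushRequested = false;
            }
            return buffers.poll();
        }
    }

    public static void main(String[] args) {
        FlushNotifySketch subpartition = new FlushNotifySketch();
        subpartition.add("a");
        subpartition.flush();
        subpartition.flush(); // duplicate flush, no second notification
        System.out.println("notifications after two flushes: " + subpartition.notifications);
        subpartition.pollBuffer();
        subpartition.add("b");
        subpartition.flush();
        System.out.println("notifications after draining and flushing again: " + subpartition.notifications);
    }
}
```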

11. The client receives data

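Now for the client (consumer) side.
The consumer's entry point for receiving and processing data is CreditBasedPartitionRequestClientHandler.channelRead(), which first decodes the received message. CreditBasedPartitionRequestClientHandler looks up the InputChannel the message belongs to, takes a buffer from that InputChannel, and copies the received data into the buffer.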

// CreditBasedPartitionRequestClientHandler class
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
   try {
      decodeMsg(msg);
   } catch (Throwable t) {
      notifyAllChannelsOfErrorAndClose(t);
   }
}

private void decodeMsg(Object msg) throws Throwable {
   final Class<?> msgClazz = msg.getClass();

   // ---- Buffer --------------------------------------------------------
   if (msgClazz == NettyMessage.BufferResponse.class) {
      NettyMessage.BufferResponse bufferOrEvent = (NettyMessage.BufferResponse) msg;

      RemoteInputChannel inputChannel = inputChannels.get(bufferOrEvent.receiverId);
      ...

      decodeBufferOrEvent(inputChannel, bufferOrEvent);

   } 
    ...
}

private void decodeBufferOrEvent(RemoteInputChannel inputChannel, NettyMessage.BufferResponse bufferOrEvent) throws Throwable {
   try {
      ByteBuf nettyBuffer = bufferOrEvent.getNettyBuffer();
      final int receivedSize = nettyBuffer.readableBytes();
      if (bufferOrEvent.isBuffer()) {
         // ---- Buffer ------------------------------------------------

         // Early return for empty buffers. Otherwise Netty's readBytes() throws an
         // IndexOutOfBoundsException.
         if (receivedSize == 0) {
            inputChannel.onEmptyBuffer(bufferOrEvent.sequenceNumber, bufferOrEvent.backlog);
            return;
         }
         // Take a buffer from the InputChannel
         Buffer buffer = inputChannel.requestBuffer();
         if (buffer != null) {
            // Copy the data into the buffer
            nettyBuffer.readBytes(buffer.asByteBuf(), receivedSize);
            // onBuffer() contains the core processing logic
            inputChannel.onBuffer(buffer, bufferOrEvent.sequenceNumber, bufferOrEvent.backlog);
         } else if (inputChannel.isReleased()) {
            cancelRequestFor(bufferOrEvent.receiverId);
         } else {
            throw new IllegalStateException("No buffer available in credit-based input channel.");
         }
      } 
      ...
}

12. Receiving data into free buffers

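RemoteInputChannel takes buffers from its bufferQueue. As analyzed in "Task Data Exchange: Reading Data" (《Task数据交互之数据读》), a RemoteInputChannel has two kinds of buffers: exclusive buffers and floating buffers. Floating buffers are requested from the LocalBufferPool and can be shared by all RemoteInputChannels. Both kinds are kept in bufferQueue.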

// RemoteInputChannel class
public Buffer requestBuffer() {
   synchronized (bufferQueue) {
      return bufferQueue.takeBuffer();
   }
}

// AvailableBufferQueue class
Buffer takeBuffer() {
   // Prefer floating buffers
   if (floatingBuffers.size() > 0) {
      return floatingBuffers.poll();
   } else {
      return exclusiveBuffers.poll();
   }
}
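
The floating-first order of takeBuffer() is easy to check in isolation. The following is a minimal sketch, assuming buffers are represented by plain strings; BufferQueueSketch is a hypothetical name, and only the two-queue layout and the take order follow the snippet above.

```java
import java.util.ArrayDeque;

// Hypothetical stand-in for AvailableBufferQueue; not Flink's real class.
class BufferQueueSketch {
    private final ArrayDeque<String> floatingBuffers = new ArrayDeque<>();
    private final ArrayDeque<String> exclusiveBuffers = new ArrayDeque<>();

    void addFloatingBuffer(String buffer) { floatingBuffers.add(buffer); }

    void addExclusiveBuffer(String buffer) { exclusiveBuffers.add(buffer); }

    // Floating buffers are handed out first; exclusive ones are the fallback.
    String takeBuffer() {
        if (!floatingBuffers.isEmpty()) {
            return floatingBuffers.poll();
        }
        return exclusiveBuffers.poll(); // null when both queues are empty
    }

    int getAvailableBufferSize() {
        return floatingBuffers.size() + exclusiveBuffers.size();
    }

    public static void main(String[] args) {
        BufferQueueSketch queue = new BufferQueueSketch();
        queue.addExclusiveBuffer("exclusive-0");
        queue.addExclusiveBuffer("exclusive-1");
        queue.addFloatingBuffer("floating-0");
        while (queue.getAvailableBufferSize() > 0) {
            System.out.println(queue.takeBuffer());
        }
    }
}
```

Even though the floating buffer was added last, it is taken first; the exclusive buffers follow in insertion order.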

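After the message is decoded, inputChannel.onBuffer() is called to do the core processing. Roughly:
1) Add the decoded buffer to the RemoteInputChannel's list of received buffers (receivedBuffers).
2) Adding it to the list is not enough. As analyzed in "Task Data Exchange: Reading Data" (《Task数据交互之数据读》), when receivedBuffers is empty the InputChannel is removed from the InputGate's inputChannelsWithData queue and is no longer polled by the InputGate. So once the InputChannel has data again, it must re-enqueue itself in the InputGate so that the InputGate keeps consuming it.
3) Based on the producer's backlog, decide how many free buffers to request from the LocalBufferPool. The number of newly obtained buffers is reported back to the producer, increasing the producer's credit so that it can keep sending data to the consumer.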

// RemoteInputChannel class
public void onBuffer(Buffer buffer, int sequenceNumber, int backlog) throws IOException {
   boolean recycleBuffer = true;

   try {

      final boolean wasEmpty;
      synchronized (receivedBuffers) {
         ...

         wasEmpty = receivedBuffers.isEmpty();
         // Add the buffer to this channel's queue of received buffers
         receivedBuffers.add(buffer);
         recycleBuffer = false;
      }

      ++expectedSequenceNumber;

      if (wasEmpty) {
         // Make this channel consumable by the InputGate again
         notifyChannelNonEmpty();
      }

      if (backlog >= 0) {
         // Report credit back to the producer
         onSenderBacklog(backlog);
      }
   } finally {
      if (recycleBuffer) {
         buffer.recycleBuffer();
      }
   }
}

A quick look at the logic of step 2 above: it eventually calls SingleInputGate.queueChannel().

// SingleInputGate class
private void queueChannel(InputChannel channel) {
   int availableChannels;

   synchronized (inputChannelsWithData) {
      if (enqueuedInputChannelsWithData.get(channel.getChannelIndex())) {
         return;
      }
      availableChannels = inputChannelsWithData.size();
      // Add this channel to inputChannelsWithData so the InputGate keeps polling it
      inputChannelsWithData.add(channel);
      enqueuedInputChannelsWithData.set(channel.getChannelIndex());

      if (availableChannels == 0) {
         // Wake up the blocked InputGate
         inputChannelsWithData.notifyAll();
      }
   }

  ...
}
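
The de-duplication in queueChannel() can be reproduced with a queue plus a BitSet. The sketch below is illustrative only: ChannelQueueSketch, wakeUps and pollChannel() are hypothetical (the polling side is not shown in this article), and channels are reduced to their integer index.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

// Hypothetical stand-in for the channel queue inside SingleInputGate.
class ChannelQueueSketch {
    private final ArrayDeque<Integer> inputChannelsWithData = new ArrayDeque<>();
    private final BitSet enqueuedInputChannelsWithData = new BitSet();
    // Counts how often a blocked consumer would have been woken up.
    int wakeUps;

    void queueChannel(int channelIndex) {
        synchronized (inputChannelsWithData) {
            if (enqueuedInputChannelsWithData.get(channelIndex)) {
                return; // already queued, nothing to do
            }
            int availableChannels = inputChannelsWithData.size();
            inputChannelsWithData.add(channelIndex);
            enqueuedInputChannelsWithData.set(channelIndex);
            if (availableChannels == 0) {
                wakeUps++; // only the empty -> non-empty transition wakes the consumer
                inputChannelsWithData.notifyAll();
            }
        }
    }

    // Assumed consumer side: take the next channel and clear its "enqueued" bit.
    Integer pollChannel() {
        synchronized (inputChannelsWithData) {
            Integer channelIndex = inputChannelsWithData.poll();
            if (channelIndex != null) {
                enqueuedInputChannelsWithData.clear(channelIndex);
            }
            return channelIndex;
        }
    }

    public static void main(String[] args) {
        ChannelQueueSketch gate = new ChannelQueueSketch();
        gate.queueChannel(2);
        gate.queueChannel(2); // duplicate, ignored
        gate.queueChannel(0);
        System.out.println("wake-ups: " + gate.wakeUps);
        System.out.println("first channel: " + gate.pollChannel());
        System.out.println("second channel: " + gate.pollChannel());
    }
}
```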

13. Reporting credit back

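Here we focus on step 3 above, implemented in RemoteInputChannel.onSenderBacklog():
1) Buffers required by the consumer = producer's backlog + initial credit. The consumer therefore asks for slightly more buffers than the producer's backlog.
2) If the InputChannel does not have enough free buffers, it requests more from the LocalBufferPool. These are floating buffers, and the credit is increased by the number of buffers obtained. If none can be obtained, the InputChannel registers itself as a listener on the LocalBufferPool; when the pool has a free buffer again, it hands the buffer to this InputChannel.
3) Send the credit to the producer.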

// RemoteInputChannel class
void onSenderBacklog(int backlog) throws IOException {
   int numRequestedBuffers = 0;

   synchronized (bufferQueue) {
      // Similar to notifyBufferAvailable(), make sure that we never add a buffer
      // after releaseAllResources() released all buffers (see above for details).
      if (isReleased.get()) {
         return;
      }

      numRequiredBuffers = backlog + initialCredit;
      while (bufferQueue.getAvailableBufferSize() < numRequiredBuffers && !isWaitingForFloatingBuffers) {
         Buffer buffer = inputGate.getBufferPool().requestBuffer();
         if (buffer != null) {
            // A buffer was obtained: keep it as a floating buffer
            bufferQueue.addFloatingBuffer(buffer);
            numRequestedBuffers++;
         } else if (inputGate.getBufferProvider().addBufferListener(this)) {
            // If the channel has not got enough buffers, register it as listener to wait for more floating buffers.
            // No buffer available: register a listener on the LocalBufferPool and wait for one
            isWaitingForFloatingBuffers = true;
            break;
         }
      }
   }

   if (numRequestedBuffers > 0 && unannouncedCredit.getAndAdd(numRequestedBuffers) == 0) {
      // Send the credit to the producer
      notifyCreditAvailable();
   }
}
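
To make the arithmetic of the three steps concrete, here is a minimal, self-contained sketch of the same bookkeeping. It is not Flink code: the class BacklogSketch, its integer counters (which stand in for the buffer queue and the LocalBufferPool) and the notifications counter (which stands in for notifyCreditAvailable()) are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified model of the backlog handling; not a Flink class. */
final class BacklogSketch {
    private final int initialCredit;            // exclusive buffers assigned to the channel
    private int availableBuffers;               // free buffers the channel currently holds
    private int poolBuffers;                    // free buffers left in the modelled LocalBufferPool
    private boolean waitingForFloatingBuffers;  // true once the pool could not serve a request
    private int notifications;                  // how often the producer would be notified
    private final AtomicInteger unannouncedCredit = new AtomicInteger();

    BacklogSketch(int initialCredit, int availableBuffers, int poolBuffers) {
        this.initialCredit = initialCredit;
        this.availableBuffers = availableBuffers;
        this.poolBuffers = poolBuffers;
    }

    void onSenderBacklog(int backlog) {
        // Step 1: required buffers = producer backlog + initial credit.
        int required = backlog + initialCredit;
        int requested = 0;
        // Step 2: take floating buffers from the pool until satisfied or the pool is empty.
        while (availableBuffers < required && !waitingForFloatingBuffers) {
            if (poolBuffers > 0) {
                poolBuffers--;
                availableBuffers++;
                requested++;
            } else {
                // In this model "registering a listener" is just a flag.
                waitingForFloatingBuffers = true;
            }
        }
        // Step 3: announce the new credit, but only if none was pending before.
        if (requested > 0 && unannouncedCredit.getAndAdd(requested) == 0) {
            notifications++;
        }
    }

    int getUnannouncedCredit() { return unannouncedCredit.get(); }
    int getNotifications() { return notifications; }
    boolean isWaitingForFloatingBuffers() { return waitingForFloatingBuffers; }

    public static void main(String[] args) {
        // Initial credit 2, 2 free buffers in the channel, 3 free buffers in the pool.
        BacklogSketch channel = new BacklogSketch(2, 2, 3);
        channel.onSenderBacklog(4);
        System.out.println("unannounced credit = " + channel.getUnannouncedCredit());
        System.out.println("waiting for floating buffers = " + channel.isWaitingForFloatingBuffers());
    }
}
```

In this example a backlog of 4 with an initial credit of 2 requires 6 buffers. The channel already holds 2, obtains the 3 that the pool has left, and then waits for the last one.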

14. Other Times When Credit Is Fed Back

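The above only covers how RemoteInputChannel.onSenderBacklog() turns the buffers it has just obtained into credit. When else does the credit grow?
There are two places. The first is the one mentioned above: when the LocalBufferPool has a free buffer again, it hands that buffer to the waiting InputChannel, which increases the credit.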

// LocalBufferPool class
public void recycle(MemorySegment segment) {
   BufferListener listener;
   NotificationResult notificationResult = NotificationResult.BUFFER_NOT_USED;
   while (!notificationResult.isBufferUsed()) {
      synchronized (availableMemorySegments) {
         if (isDestroyed || numberOfRequestedMemorySegments > currentPoolSize) {
            returnMemorySegment(segment);
            return;
         } else {
            // If a listener is waiting for buffers, hand the recycled buffer to that RemoteInputChannel
            listener = registeredListeners.poll();
            if (listener == null) {
               availableMemorySegments.add(segment);
               availableMemorySegments.notify();
               return;
            }
         }
      }
      notificationResult = fireBufferAvailableNotification(listener, segment);
   }
}

// RemoteInputChannel class
public NotificationResult notifyBufferAvailable(Buffer buffer) {
   NotificationResult notificationResult = NotificationResult.BUFFER_NOT_USED;
   try {
      synchronized (bufferQueue) {
         ...
         // Add the buffer to the floating buffers
         bufferQueue.addFloatingBuffer(buffer);

         ...
      }
      // Also increase the credit
      if (unannouncedCredit.getAndAdd(1) == 0) {
         notifyCreditAvailable();
      }
   } catch (Throwable t) {
      setError(t);
   }
   return notificationResult;
}
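
The hand-off between the pool and a waiting channel can be sketched in isolation. The following is a simplified model, not Flink's LocalBufferPool: the class PoolSketch, its Listener interface and the boolean return value (which stands in for NotificationResult) are assumptions, and unlike the real code a listener that wants more buffers is not re-registered here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical, simplified model of a buffer pool with waiting listeners. */
final class PoolSketch {
    /** A waiting channel; returns true if it took the buffer. */
    interface Listener {
        boolean onBufferAvailable(int segment);
    }

    private final Deque<Integer> available = new ArrayDeque<>();
    private final Deque<Listener> listeners = new ArrayDeque<>();

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    void recycle(int segment) {
        while (true) {
            Listener listener = listeners.poll();
            if (listener == null) {
                // Nobody is waiting: the segment goes back into the pool.
                available.add(segment);
                return;
            }
            if (listener.onBufferAvailable(segment)) {
                // A waiting channel took the segment; it never reaches the pool.
                return;
            }
            // The listener declined (for example, it was released); try the next one.
        }
    }

    int availableCount() {
        return available.size();
    }

    public static void main(String[] args) {
        PoolSketch pool = new PoolSketch();
        List<Integer> received = new ArrayList<>();
        pool.addListener(segment -> received.add(segment));
        pool.recycle(7);
        System.out.println("received by listener: " + received);
        System.out.println("left in pool: " + pool.availableCount());
    }
}
```

The point of the loop is that a recycled segment is offered to waiting listeners first and only lands in the pool when none of them takes it.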

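The second place is when an exclusive buffer of the RemoteInputChannel has been consumed and is recycled, which also increases the credit. In other words, whenever a free buffer is added to the RemoteInputChannel, the credit increases.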

// RemoteInputChannel class
public void recycle(MemorySegment segment) {
   int numAddedBuffers;

   synchronized (bufferQueue) {
      ...
      numAddedBuffers = bufferQueue.addExclusiveBuffer(new NetworkBuffer(segment, this), numRequiredBuffers);
   }

   if (numAddedBuffers > 0 && unannouncedCredit.getAndAdd(numAddedBuffers) == 0) {
      notifyCreditAvailable();
   }
}
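
All three call sites share the check unannouncedCredit.getAndAdd(n) == 0. A small sketch shows its effect: only the addition that moves the counter away from zero triggers a notification, and later additions are batched until the pending credit is collected. The class CreditCoalescingSketch and its notifications counter are assumptions for illustration. The reset via getAndSet(0) is likewise an assumption about how getAndResetUnannouncedCredit() behaves.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of how credit announcements are coalesced. */
final class CreditCoalescingSketch {
    private final AtomicInteger unannouncedCredit = new AtomicInteger();
    private int notifications; // stands in for calls to notifyCreditAvailable()

    void addCredit(int numAddedBuffers) {
        // getAndAdd returns the previous value: notify only when nothing was pending.
        if (numAddedBuffers > 0 && unannouncedCredit.getAndAdd(numAddedBuffers) == 0) {
            notifications++;
        }
    }

    /** Called when the credit message is built: collect and clear the pending credit. */
    int getAndResetUnannouncedCredit() {
        return unannouncedCredit.getAndSet(0);
    }

    int getNotifications() {
        return notifications;
    }

    public static void main(String[] args) {
        CreditCoalescingSketch channel = new CreditCoalescingSketch();
        channel.addCredit(1); // first addition notifies
        channel.addCredit(2); // batched with the pending credit, no new notification
        System.out.println("notifications = " + channel.getNotifications());
        System.out.println("credit sent = " + channel.getAndResetUnannouncedCredit());
    }
}
```

Batching like this keeps the number of credit messages low when several buffers are recycled before the previous announcement has been written.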

notifyCreditAvailable() eventually triggers CreditBasedPartitionRequestClientHandler.userEventTriggered(), and writeAndFlushNextMessageIfPossible() then sends the credit to the producer as an AddCredit message.

// CreditBasedPartitionRequestClientHandler class
public void userEventTriggered(ChannelHandlerContext ctx, Object msg) throws Exception {
   if (msg instanceof RemoteInputChannel) {
      boolean triggerWrite = inputChannelsWithCredit.isEmpty();

      inputChannelsWithCredit.add((RemoteInputChannel) msg);

      if (triggerWrite) {
         writeAndFlushNextMessageIfPossible(ctx.channel());
      }
   } else {
      ctx.fireUserEventTriggered(msg);
   }
}

private void writeAndFlushNextMessageIfPossible(Channel channel) {
   if (channelError.get() != null || !channel.isWritable()) {
      return;
   }

   while (true) {
      RemoteInputChannel inputChannel = inputChannelsWithCredit.poll();

      // The input channel may be null because of the write callbacks
      // that are executed after each write.
      if (inputChannel == null) {
         return;
      }

      //It is no need to notify credit for the released channel.
      if (!inputChannel.isReleased()) {
         AddCredit msg = new AddCredit(
            inputChannel.getPartitionId(),
            inputChannel.getAndResetUnannouncedCredit(),
            inputChannel.getInputChannelId());

         // Write and flush and wait until this is done before
         // trying to continue with the next input channel.
         // Send the credit message to the producer
         channel.writeAndFlush(msg).addListener(writeListener);

         return;
      }
   }
}

15. The Server Receives the AddCredit Message

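Back on the producer side, that is, the Netty server, the message again arrives in PartitionRequestServerHandler.channelRead0(). After receiving the credit, the producer adds it to the corresponding reader, which means it may send that many more buffers to the consumer. PartitionRequestQueue re-adds the reader to the list of available readers so that it is polled again to read data from the ResultSubPartition. If the reader is already in that list, the added credit simply lets it read that many additional data buffers.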

// PartitionRequestServerHandler class
protected void channelRead0(ChannelHandlerContext ctx, NettyMessage msg) throws Exception {
   try {
      Class<?> msgClazz = msg.getClass();

      ...
      } else if (msgClazz == AddCredit.class) {
         AddCredit request = (AddCredit) msg;

         outboundQueue.addCredit(request.receiverId, request.credit);
      } 
      ...
}

// PartitionRequestQueue class
void addCredit(InputChannelID receiverId, int credit) throws Exception {
   if (fatalError) {
      return;
   }

   NetworkSequenceViewReader reader = allReaders.get(receiverId);
   if (reader != null) {
      // Add the credit to the reader
      reader.addCredit(credit);
      // Re-enqueue the reader
      enqueueAvailableReader(reader);
   } else {
      throw new IllegalStateException("No reader for receiverId = " + receiverId + " exists.");
   }
}

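On the producer side, credit takes effect in reader.isAvailable(): the reader can continue reading data from the ResultSubPartition only while its credit is greater than 0 or the next buffer is an event.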

// CreditBasedSequenceNumberingViewReader class
public boolean isAvailable() {
   // BEWARE: this must be in sync with #isAvailable(BufferAndBacklog)!
   return hasBuffersAvailable() &&
      (numCreditsAvailable > 0 || subpartitionView.nextBufferIsEvent());
}
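
The gating condition can be tried out with a small stand-alone model. The class ReaderSketch below is hypothetical and not Flink's reader: the queue of booleans stands in for the subpartition (true for an event, false for a data buffer). The rule that only data buffers consume credit is an assumption of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a credit-gated reader on the producer side. */
final class ReaderSketch {
    private final Deque<Boolean> queue = new ArrayDeque<>(); // true = event, false = data buffer
    private int credit;

    void add(boolean isEvent) {
        queue.add(isEvent);
    }

    void addCredit(int creditDelta) {
        credit += creditDelta;
    }

    /** Same shape as the condition above: data needs credit, events do not. */
    boolean isAvailable() {
        return !queue.isEmpty() && (credit > 0 || queue.peek());
    }

    /** Sends everything that is currently allowed and returns how many items were sent. */
    int drain() {
        int sent = 0;
        while (isAvailable()) {
            boolean isEvent = queue.poll();
            if (!isEvent) {
                credit--; // assumption: only data buffers consume credit
            }
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        ReaderSketch reader = new ReaderSketch();
        reader.add(false);
        reader.add(false);
        reader.add(true);
        reader.add(false);
        System.out.println("sent without credit: " + reader.drain());
        reader.addCredit(2);
        System.out.println("sent with 2 credits: " + reader.drain());
    }
}
```

With no credit nothing is sent, because the first item is a data buffer. After two credits arrive, the two data buffers and the event behind them go out, and the last data buffer waits for the next AddCredit.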

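From then on, the producer and the consumer repeat this exchange: the producer sends data to the consumer and the consumer feeds credit back to the producer, so that data can be produced and consumed continuously.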
