Understanding Flink Task Data Exchange: The Data Read Path (Source Code Walkthrough)
This article analyzes the data read path on the Reduce side and the Reduce-side data model. For now it only covers how the Reduce-side task processing thread reads data; the network exchange and data requests to upstream tasks are out of scope. Data exchange between Tasks is mainly credit-based Netty network communication, which will be analyzed in a later article.
1. OneInputStreamTask
Let's start from the downstream task type and its execution flow. The downstream task type is mainly OneInputStreamTask. Executing the task means calling its run() method, which simply calls inputProcessor.processInput() in a loop to process data.
// OneInputStreamTask
protected void run() throws Exception {
// cache processor reference on the stack, to make the code more JIT friendly
final StreamInputProcessor<IN> inputProcessor = this.inputProcessor;
while (running && inputProcessor.processInput()) {
// all the work happens in the "processInput" method
}
}
// StreamTask (the mailbox-based processing loop in newer Flink versions)
protected void processInput(MailboxDefaultAction.Controller controller) throws Exception {
// this calls inputProcessor.processInput
InputStatus status = inputProcessor.processInput();
if (status == InputStatus.MORE_AVAILABLE && recordWriter.isAvailable()) {
return;
}
if (status == InputStatus.END_OF_INPUT) {
controller.allActionsCompleted();
return;
}
CompletableFuture<?> jointFuture = getInputOutputJointFuture(status);
MailboxDefaultAction.Suspension suspendedDefaultAction = controller.suspendDefaultAction();
jointFuture.thenRun(suspendedDefaultAction::resume);
}
Inside inputProcessor.processInput(), the data-fetching logic is: get a buffer from the barrierHandler, then pull records out of that buffer one by one and process them; once the buffer is fully consumed, fetch the next buffer from the barrierHandler, and so on in a loop.
// StreamOneInputProcessor
@Override
public InputStatus processInput() throws Exception {
// this calls the input.emitNext method
InputStatus status = input.emitNext(output);
if (status == InputStatus.END_OF_INPUT) {
operatorChain.endHeadOperatorInput(1);
}
return status;
}
// StreamTaskNetworkInput
@Override
public InputStatus emitNext(DataOutput<T> output) throws Exception {
while (true) {
// get the stream element from the deserializer
if (currentRecordDeserializer != null) {
// first deserialize a record from the buffer fetched earlier
DeserializationResult result = currentRecordDeserializer.getNextRecord(deserializationDelegate);
if (result.isBufferConsumed()) {
currentRecordDeserializer.getCurrentBuffer().recycleBuffer();
currentRecordDeserializer = null;
}
if (result.isFullRecord()) {
// process the record here
processElement(deserializationDelegate.getInstance(), output);
return InputStatus.MORE_AVAILABLE;
}
}
// if the previous buffer has been fully consumed, fetch a new one from the barrier handler
Optional<BufferOrEvent> bufferOrEvent = checkpointedInputGate.pollNext();
if (bufferOrEvent.isPresent()) {
// return to the mailbox after receiving a checkpoint barrier to avoid processing of
// data after the barrier before checkpoint is performed for unaligned checkpoint mode
if (bufferOrEvent.get().isEvent() && bufferOrEvent.get().getEvent() instanceof CheckpointBarrier) {
return InputStatus.MORE_AVAILABLE;
}
processBufferOrEvent(bufferOrEvent.get());
} else {
if (checkpointedInputGate.isFinished()) {
checkState(checkpointedInputGate.getAvailableFuture().isDone(), "Finished BarrierHandler should be available");
return InputStatus.END_OF_INPUT;
}
return InputStatus.NOTHING_AVAILABLE;
}
}
}
private void processElement(StreamElement recordOrMark, DataOutput<T> output) throws Exception {
if (recordOrMark.isRecord()) {
// the actual record processing
output.emitRecord(recordOrMark.asRecord());
} else if (recordOrMark.isWatermark()) {
// handle Watermark elements
statusWatermarkValve.inputWatermark(recordOrMark.asWatermark(), lastChannel);
} else if (recordOrMark.isLatencyMarker()) {
output.emitLatencyMarker(recordOrMark.asLatencyMarker());
} else if (recordOrMark.isStreamStatus()) {
statusWatermarkValve.inputStreamStatus(recordOrMark.asStreamStatus(), lastChannel);
} else {
throw new UnsupportedOperationException("Unknown type of StreamElement");
}
}
2. CheckpointBarrierHandler
Now we need to look deeper into how the barrierHandler fetches data. We already met this component when analyzing checkpoints: the barrierHandler is the barrier processor, and what it processes are checkpoint barriers. Depending on the checkpoint mode, a different barrierHandler is created: for EXACTLY_ONCE a BarrierBuffer is created, which performs barrier alignment; for AT_LEAST_ONCE a BarrierTracker is created, which does not align barriers. If no checkpointing is configured in the program, the default is also AT_LEAST_ONCE, so the barrierHandler is a BarrierTracker instance. We won't go into the details of checkpoint handling here; we only care about how data buffers are fetched.
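Before that, here is a minimal sketch of the mode-to-handler selection just described (the shape loosely follows InputProcessorUtil.createCheckpointBarrierHandler in Flink 1.8; constructor details are simplified and createBufferBlocker is a hypothetical stand-in for the real buffer-caching setup):
// Hedged sketch: how the checkpoint mode picks the barrier handler type
import org.apache.flink.runtime.io.network.partition.consumer.InputGate;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.runtime.io.BarrierBuffer;
import org.apache.flink.streaming.runtime.io.BarrierTracker;
import org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler;
public static CheckpointBarrierHandler createBarrierHandler(CheckpointingMode mode, InputGate inputGate) throws java.io.IOException {
    switch (mode) {
        case EXACTLY_ONCE:
            // aligns barriers: a channel whose barrier has already arrived is blocked,
            // and its buffers are cached, until the alignment completes
            return new BarrierBuffer(inputGate, createBufferBlocker(inputGate)); // createBufferBlocker is hypothetical
        case AT_LEAST_ONCE:
            // only counts barriers per checkpoint id, never blocks a channel
            return new BarrierTracker(inputGate);
        default:
            throw new UnsupportedOperationException("Unrecognized CheckpointingMode: " + mode);
    }
}
With the handler type settled, let's see how each one fetches buffers.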
// BarrierBuffer
public BufferOrEvent getNextNonBlocked() throws Exception {
while (true) {
// process buffered BufferOrEvents before grabbing new ones
Optional<BufferOrEvent> next;
//currentBuffered holds the buffers of blocked channels that were cached while aligning checkpoint barriers
//currentBuffered becomes non-null right after a checkpoint completes, when the buffers in bufferBlocker are moved into currentBuffered
if (currentBuffered == null) {
//in the common case, fetch a buffer from the inputGate
next = inputGate.getNextBufferOrEvent();
}
else {
next = Optional.ofNullable(currentBuffered.getNext());
...
}
...
BufferOrEvent bufferOrEvent = next.get();
if (isBlocked(bufferOrEvent.getChannelIndex())) {
// if the channel is blocked, we just store the BufferOrEvent
//if a channel is blocked (its checkpoint barrier has already arrived), cache the BufferOrEvent in bufferBlocker; it is not put into currentBuffered yet
bufferBlocker.add(bufferOrEvent);
checkSizeLimit();
}
else if (bufferOrEvent.isBuffer()) {
return bufferOrEvent;
}
...//other checkpoint event handling
}
}
// BarrierTracker
public BufferOrEvent getNextNonBlocked() throws Exception {
while (true) {
//much simpler: fetch directly from the inputGate
Optional<BufferOrEvent> next = inputGate.getNextBufferOrEvent();
if (!next.isPresent()) {
// buffer or input exhausted
return null;
}
BufferOrEvent bufferOrEvent = next.get();
if (bufferOrEvent.isBuffer()) {
return bufferOrEvent;
}
...//other checkpoint events
}
}
As the code above shows, both BarrierBuffer and BarrierTracker fetch their buffers from the inputGate. This brings us to the most important model of the Reduce-side input path: the InputGate. Before analyzing inputGate.getNextBufferOrEvent(), let's first look at the InputGate data structures.
3. SingleInputGate
There are two implementations of InputGate: SingleInputGate and UnionInputGate. The common one is SingleInputGate; a UnionInputGate joins multiple SingleInputGates together, e.g. a join operator, which takes input from two streams, reads through a UnionInputGate. Here it is enough to analyze SingleInputGate.
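As a quick user-level illustration (a minimal sketch using the standard DataStream API): when a downstream operator consumes the union of two upstream streams, its task reads the two upstream intermediate results through one logical input, which the runtime wires up as a UnionInputGate over multiple SingleInputGates.
// Minimal example: the map task below consumes two upstream results through one union-ed input
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class UnionGateExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Long> a = env.generateSequence(0, 1000);
        DataStream<Long> b = env.generateSequence(1001, 2000);
        // the downstream map reads from both upstream partitions
        a.union(b).map(v -> v * 2).print();
        env.execute("union-input-gate-example");
    }
}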
public class SingleInputGate implements InputGate {
/** The type of the partition the input gate is consuming. */
private final ResultPartitionType consumedPartitionType;
/**
* The index of the consumed subpartition of each consumed partition. This index depends on the
* {@link DistributionPattern} and the subtask indices of the producing and consuming task.
*/
private final int consumedSubpartitionIndex;
/** The number of input channels (equivalent to the number of consumed partitions). */
private final int numberOfInputChannels;
/**
* Input channels. There is one input channel for each consumed intermediate result partition.
* We store this in a map for runtime updates of single channels.
*/
private final Map<IntermediateResultPartitionID, InputChannel> inputChannels;
/** Channels, which notified this input gate about available data. */
private final ArrayDeque<InputChannel> inputChannelsWithData = new ArrayDeque<>();
/**
* Buffer pool for incoming buffers. Incoming data from remote channels is copied to buffers
* from this pool.
*/
private BufferPool bufferPool;
/** Global network buffer pool to request and recycle exclusive buffers (only for credit-based). */
private NetworkBufferPool networkBufferPool;
private final boolean isCreditBased;
/** Flag indicating whether partitions have been requested. */
private boolean requestedPartitionsFlag;
/** Number of network buffers to use for each remote input channel. */
private int networkBuffersPerChannel;
//other non-core members omitted
...
}
Key members:
consumedPartitionType: the ResultPartitionType, i.e. the type of data exchange: BLOCKING, PIPELINED, or PIPELINED_BOUNDED. The exact semantics were explained in 《Task数据交互之数据写》; in streaming jobs it is always PIPELINED_BOUNDED, a bounded pipelined mode in which the downstream consumes while the upstream produces, with a limited number of buffers caching the data in between.
consumedSubpartitionIndex: as described in 《Task数据交互之数据写》, each ResultPartition has multiple ResultSubPartitions, one per downstream task, and each ResultSubPartition is consumed by exactly one downstream task. consumedSubpartitionIndex identifies which upstream ResultSubPartition this downstream task consumes.
numberOfInputChannels: the number of InputChannels. An InputChannel is the data channel to one upstream task, so there is one InputChannel per upstream task; e.g. with 10 upstream Map tasks, every Reduce task has 10 InputChannels. If a map task and the reduce task run on the same node, the channel is a LocalInputChannel, otherwise a RemoteInputChannel. If the InputGate is the receiving counterpart of a ResultPartition, then an InputChannel is the counterpart of a ResultSubPartition.
inputChannels: all InputChannels belonging to this InputGate.
inputChannelsWithData: the InputChannels that have notified this gate of available data from upstream. If an upstream task produces no data for a long time, its channel will not sit in this queue.
bufferPool: a LocalBufferPool; InputChannels can take floating buffers from this pool to cache data arriving from upstream tasks.
isCreditBased: whether data transfer is credit-based; true by default.
networkBuffersPerChannel: the number of buffers used to receive data on each InputChannel, treated as the channel's exclusive buffers; by default each InputChannel caches incoming data with two exclusive buffers.
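Both of these sizes are configurable (a minimal sketch; the two keys below are the Flink 1.8-era option names, normally set in flink-conf.yaml rather than in code):
// Hedged sketch: the two network-buffer knobs discussed above
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
Configuration config = new Configuration();
// exclusive buffers per incoming channel (default 2)
config.setInteger("taskmanager.network.memory.buffers-per-channel", 2);
// floating buffers shared by all channels of one gate (default 8)
config.setInteger("taskmanager.network.memory.floating-buffers-per-gate", 8);
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(4, config);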
So where is this SingleInputGate created?
When deploying a Task, the JobMaster builds a TaskDeploymentDescriptor (from the ExecutionGraph), which contains the InputGate description, an InputGateDeploymentDescriptor. After the JobMaster sends the TaskDeploymentDescriptor to the TaskManager, the TaskManager constructs the Task from it, and that is when the InputGates are created:
//Task constructor
public Task(
...
Collection<ResultPartitionDeploymentDescriptor> resultPartitionDeploymentDescriptors,
Collection<InputGateDeploymentDescriptor> inputGateDeploymentDescriptors,
...) {
...
counter = 0;
for (InputGateDeploymentDescriptor inputGateDeploymentDescriptor: inputGateDeploymentDescriptors) {
SingleInputGate gate = SingleInputGate.create(
taskNameWithSubtaskAndId,
jobId,
executionId,
inputGateDeploymentDescriptor,
networkEnvironment,
this,
metricGroup.getIOMetricGroup());
inputGates[counter] = gate;
inputGatesById.put(gate.getConsumedResultId(), gate);
++counter;
}
...
}
Within the structure of SingleInputGate, the most important members are inputChannels and bufferPool. The bufferPool is created much like on the Map side. In 《Task数据交互之数据写》 we saw that on the map side the maximum number of buffers in the LocalBufferPool = the number of downstream tasks * buffers per downstream task + extra buffers. On the reduce side, however, the maximum number of buffers in the LocalBufferPool = the extra buffers alone; by default 8 extra buffers are allocated, and they serve as the floating buffers. If a job has a large number of tasks, this parameter should be increased.
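A back-of-the-envelope sketch of that budget (my own illustration, assuming the defaults of 2 exclusive buffers per channel and 8 floating buffers per gate):
// Toy calculation: buffer budget of one reduce task with 100 upstream map tasks
int upstreamTasks = 100;        // number of InputChannels
int exclusivePerChannel = 2;    // networkBuffersPerChannel (default)
int floatingPerGate = 8;        // max buffers of the reduce-side LocalBufferPool (default)
// exclusive buffers come from the NetworkBufferPool, one batch per channel
int exclusiveTotal = upstreamTasks * exclusivePerChannel; // 200 buffers
// the 8 floating buffers are shared by all 100 channels of this gate,
// which is why this parameter should grow with the number of tasks
System.out.println("exclusive=" + exclusiveTotal + ", floating=" + floatingPerGate);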
4. InputChannel
Next, let's look at the InputChannel.
As noted above, each InputChannel corresponds to one upstream task. If a map task and the reduce task run on the same node, the channel is a LocalInputChannel, otherwise a RemoteInputChannel. In production, RemoteInputChannel is the common case, because every downstream task node has to exchange data with many upstream nodes. So let's look at RemoteInputChannel first.
4.1 RemoteInputChannel
public class RemoteInputChannel extends InputChannel implements BufferRecycler, BufferListener {
/** The connection manager to use connect to the remote partition provider. */
// manages the connections to remote partition providers (other nodes)
private final ConnectionManager connectionManager;
/**
* The received buffers. Received buffers are enqueued by the network I/O thread and the queue
* is consumed by the receiving task thread.
*/
// queue of buffers received from the upstream task node; filled by the network I/O thread, consumed by the task processing thread
private final ArrayDeque<Buffer> receivedBuffers = new ArrayDeque<>();
/** Client to establish a (possibly shared) TCP connection and request the partition.
*/
// the client for talking to the upstream node, i.e. the netty client
private volatile PartitionRequestClient partitionRequestClient;
/** The initial number of exclusive buffers assigned to this channel. */
// the initial credit; data transfer between flink nodes is credit-based by default, see 《Flink基于Credit的数据传输和背压》 for the concept
private int initialCredit;
/** The available buffer queue wraps both exclusive and requested floating buffers. */
/**
The queue of available (free) buffers. When this channel receives data from the upstream task, it takes a free buffer
from this queue to cache the data and enqueues it into receivedBuffers. bufferQueue wraps two kinds of buffers:
exclusive buffers owned by this RemoteInputChannel, as many as networkBuffersPerChannel; they are private to the
channel and are returned to the channel's own available queue when recycled. The other kind are buffers shared among
multiple RemoteInputChannels, called floating buffers; their total equals the number of buffers in the LocalBufferPool.
When a RemoteInputChannel runs out of free buffers (e.g. when the processing thread is slow), it can request floating
buffers from the LocalBufferPool to cache incoming data; floating buffers go back to the LocalBufferPool when recycled.
When receiving data, floating buffers are used before exclusive buffers.
*/
private final AvailableBufferQueue bufferQueue = new AvailableBufferQueue();
/** The number of required buffers that equals to sender's backlog plus initial credit. */
@GuardedBy("bufferQueue")
/**
The number of buffers this RemoteInputChannel needs for receiving, equal to the sender's backlog + initialCredit,
where initialCredit equals the channel's number of exclusive buffers. numRequiredBuffers is therefore larger
than the sender's backlog, which leaves a safety margin.
*/
private int numRequiredBuffers;
/** The tag indicates whether this channel is waiting for additional floating buffers from the buffer pool. */
@GuardedBy("bufferQueue")
/**
Whether this channel is waiting for floating buffers. When the RemoteInputChannel requests a floating buffer and
finds that the LocalBufferPool has none left either, this flag marks the channel as waiting for a free buffer;
once the LocalBufferPool recycles one, it is handed to this RemoteInputChannel.
*/
private boolean isWaitingForFloatingBuffers;
//other non-core members omitted
...
}
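The floating-first policy mentioned above can be modeled with two deques (a toy sketch of AvailableBufferQueue, entirely my own simplification; Object stands in for Flink's Buffer type):
// Toy model: floating buffers are handed out before exclusive ones,
// and recycling routes each kind back to where it came from
import java.util.ArrayDeque;
class AvailableBufferQueueSketch {
    private final ArrayDeque<Object> floatingBuffers = new ArrayDeque<>();
    private final ArrayDeque<Object> exclusiveBuffers = new ArrayDeque<>();
    // take a free buffer for incoming data, preferring the shared floating buffers
    Object takeBuffer() {
        return !floatingBuffers.isEmpty() ? floatingBuffers.poll() : exclusiveBuffers.poll();
    }
    // an exclusive buffer is private to the channel: recycling returns it right here
    void recycleExclusiveBuffer(Object buffer) {
        exclusiveBuffers.add(buffer);
    }
    // a floating buffer would instead be returned to the LocalBufferPool,
    // where any channel of the gate can grab it again (not modeled here)
    void addFloatingBuffer(Object buffer) {
        floatingBuffers.add(buffer);
    }
    int availableBuffers() {
        return floatingBuffers.size() + exclusiveBuffers.size();
    }
}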
4.2 LocalInputChannel
Having analyzed RemoteInputChannel, let's now look at LocalInputChannel, which represents an upstream task running on the same node (same JVM) as the downstream task.
LocalInputChannel's structure is comparatively simple; there is no local buffer queue or the like, because within the same node and JVM it can read the data produced by the upstream task directly.
public class LocalInputChannel extends InputChannel implements BufferAvailabilityListener {
private final Object requestLock = new Object();
/** The local partition manager. */
private final ResultPartitionManager partitionManager;
/** Task event dispatcher for backwards events. */
private final TaskEventDispatcher taskEventDispatcher;
/** The consumed subpartition. */
// a view of the consumed ResultSubpartition; through it the channel directly reads the buffers
// the upstream task wrote into the ResultSubPartition, with no further data exchange needed
private volatile ResultSubpartitionView subpartitionView;
private volatile boolean isReleased;
}
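The direct read works because, on its first partition request, the channel asks the ResultPartitionManager for an in-JVM reader view instead of opening a network connection (a hedged sketch; the shape follows LocalInputChannel.requestSubpartition in Flink 1.8, with error handling and retries elided):
// Hedged sketch of LocalInputChannel.requestSubpartition
void requestSubpartition(int subpartitionIndex) throws java.io.IOException {
    synchronized (requestLock) {
        if (subpartitionView == null) {
            // registers this channel as the BufferAvailabilityListener, so the producer
            // can notify it when new buffers become available in the subpartition
            subpartitionView = partitionManager.createSubpartitionView(partitionId, subpartitionIndex, this);
        }
    }
}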
5. How SingleInputGate fetches data from the InputChannels
Back to the main thread: the task processing thread ultimately reads data via inputGate.getNextBufferOrEvent(). The reason so much space was spent on SingleInputGate and InputChannel is that knowing these two components makes the architecture of the receiving side much easier to understand. Now let's look at the implementation of inputGate.getNextBufferOrEvent().
//SingleInputGate
public Optional<BufferOrEvent> getNextBufferOrEvent() throws IOException, InterruptedException {
return getNextBufferOrEvent(true);
}
private Optional<BufferOrEvent> getNextBufferOrEvent(boolean blocking) throws IOException, InterruptedException {
...
//request the partitions, i.e. send data requests to the upstream task nodes (the servers)
requestPartitions();
InputChannel currentChannel;
boolean moreAvailable;
Optional<BufferAndAvailability> result = Optional.empty();
do {
synchronized (inputChannelsWithData) {
while (inputChannelsWithData.size() == 0) {
if (isReleased) {
throw new IllegalStateException("Released");
}
if (blocking) {
//if no InputChannel has received data yet, the thread blocks here
inputChannelsWithData.wait();
}
else {
return Optional.empty();
}
}
//dequeue an InputChannel from inputChannelsWithData, the queue of channels that have received data
currentChannel = inputChannelsWithData.remove();
enqueuedInputChannelsWithData.clear(currentChannel.getChannelIndex());
moreAvailable = !inputChannelsWithData.isEmpty();
}
//fetch a buffer from that InputChannel
result = currentChannel.getNextBuffer();
} while (!result.isPresent());
// this channel was now removed from the non-empty channels queue
// we re-add it in case it has more data, because in that case no "non-empty" notification
// will come for that channel
//if the InputChannel still has more data, put it back into the inputChannelsWithData queue
if (result.get().moreAvailable()) {
queueChannel(currentChannel);
moreAvailable = true;
}
final Buffer buffer = result.get().buffer();
if (buffer.isBuffer()) {
//wrap the buffer into a BufferOrEvent and return it
return Optional.of(new BufferOrEvent(buffer, currentChannel.getChannelIndex(), moreAvailable));
}
else {
... //events
}
}
The source shows roughly the following logic:
1. First request the upstream ResultPartitions. This usually happens on the first fetch: the task sends data requests to the upstream task nodes and establishes TCP connections, which then stay open.
2. Dequeue an InputChannel from inputChannelsWithData, the queue of channels that have received data, and fetch a buffer from that channel. If no InputChannel has received any data, the task processing thread blocks until one does.
3. If the InputChannel from step 2 still has more data, it is re-enqueued into inputChannelsWithData so that its remaining data can be fetched later; the notification side of this queue is sketched below.
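For completeness, here is the other half of that hand-off: when a channel is notified of new data, queueChannel enqueues it and wakes the task thread up (a hedged sketch; the shape follows SingleInputGate.queueChannel in Flink 1.8, with the deduplication BitSet handling and error paths simplified):
// Hedged sketch of SingleInputGate.queueChannel, called from the network I/O thread
private void queueChannel(InputChannel channel) {
    synchronized (inputChannelsWithData) {
        // skip channels that are already enqueued (tracked in a BitSet)
        if (enqueuedInputChannelsWithData.get(channel.getChannelIndex())) {
            return;
        }
        inputChannelsWithData.add(channel);
        enqueuedInputChannelsWithData.set(channel.getChannelIndex());
        if (inputChannelsWithData.size() == 1) {
            // first non-empty channel: wake up the task thread blocked in wait()
            inputChannelsWithData.notifyAll();
        }
    }
}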
Next, let's see how an InputChannel hands out buffers.
5.1 RemoteInputChannel
//RemoteInputChannel
Optional<BufferAndAvailability> getNextBuffer() throws IOException {
final Buffer next;
final boolean moreAvailable;
synchronized (receivedBuffers) {
next = receivedBuffers.poll();
moreAvailable = !receivedBuffers.isEmpty();
}
numBytesIn.inc(next.getSizeUnsafe());
numBuffersIn.inc();
return Optional.of(new BufferAndAvailability(next, moreAvailable, getSenderBacklog()));
}
The logic is straightforward: just poll a buffer from the queue of received buffers.
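The buffers in receivedBuffers are enqueued by the Netty I/O thread when data arrives from upstream (a hedged sketch of that producer side; the shape follows RemoteInputChannel.onBuffer in Flink 1.8, with sequence-number checks and credit bookkeeping elided):
// Hedged sketch of RemoteInputChannel.onBuffer, running on the network I/O thread
public void onBuffer(Buffer buffer, int sequenceNumber, int backlog) throws java.io.IOException {
    boolean wasEmpty;
    synchronized (receivedBuffers) {
        // the real code also verifies that sequenceNumber matches the expected one
        wasEmpty = receivedBuffers.isEmpty();
        receivedBuffers.add(buffer);
    }
    if (wasEmpty) {
        // ends up calling SingleInputGate.queueChannel(this) for this channel
        notifyChannelNonEmpty();
    }
    // the sender's backlog is used to decide how many floating buffers to request (elided)
}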
5.2 LocalInputChannel
//LocalInputChannel
Optional<BufferAndAvailability> getNextBuffer() throws IOException, InterruptedException {
checkError();
ResultSubpartitionView subpartitionView = this.subpartitionView;
if (subpartitionView == null) {
...
subpartitionView = checkAndWaitForSubpartitionView();
}
//fetch through the subpartitionView, locally within the JVM rather than remotely
BufferAndBacklog next = subpartitionView.getNextBuffer();
...
return Optional.of(new BufferAndAvailability(next.buffer(), next.isMoreAvailable(), next.buffersInBacklog()));
}
//PipelinedSubpartitionView
public BufferAndBacklog getNextBuffer() {
return parent.pollBuffer();
}
//PipelinedSubpartition
BufferAndBacklog pollBuffer() {
synchronized (buffers) {
Buffer buffer = null;
if (buffers.isEmpty()) {
flushRequested = false;
}
while (!buffers.isEmpty()) {
//peek at the head buffer of the PipelinedSubpartition's buffers queue
BufferConsumer bufferConsumer = buffers.peek();
buffer = bufferConsumer.build();
checkState(bufferConsumer.isFinished() || buffers.size() == 1,
"When there are multiple buffers, an unfinished bufferConsumer can not be at the head of the buffers queue.");
if (buffers.size() == 1) {
// turn off flushRequested flag if we drained all of the available data
flushRequested = false;
}
//if the buffer has been completely filled (rather than only half written), it can be removed from the buffers queue
if (bufferConsumer.isFinished()) {
buffers.pop().close();
decreaseBuffersInBacklogUnsafe(bufferConsumer.isBuffer());
}
if (buffer.readableBytes() > 0) {
break;
}
buffer.recycleBuffer();
buffer = null;
if (!bufferConsumer.isFinished()) {
break;
}
}
if (buffer == null) {
return null;
}
//update the PipelinedSubpartition's statistics
updateStatistics(buffer);
// Do not report last remaining buffer on buffers as available to read (assuming it's unfinished).
// It will be reported for reading either on flush or when the number of buffers in the queue
// will be 2 or more.
return new BufferAndBacklog(
buffer,
isAvailableUnsafe(),
getBuffersInBacklog(),
nextBufferIsEventUnsafe());
}
}
The overall logic is also straightforward: fetch a buffer from the ResultSubPartition produced by the Map task; everything the Map task produces is put into the ResultSubPartition's buffers queue.
One subtle detail: the buffer the Reduce task fetches is not necessarily removed from the ResultSubPartition's buffers queue; the fetch may only return part of the buffer. The reason is that when the Map task produces data slowly and a buffer is not filled within 100ms (the default buffer timeout), the data is flushed so that the downstream task can read it (or it is pushed downstream). In that case not a whole buffer is returned, but only the data produced into that buffer so far; the Map side keeps writing into the same buffer until it is full, and only then moves on to a new one.
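That flush interval is a user-facing knob (a minimal example; setBufferTimeout is the standard DataStream API, and 0 would flush after every record at the cost of throughput):
// Example: trade some throughput for lower latency by flushing buffers more often
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class BufferTimeoutExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // flush partially filled output buffers every 50 ms instead of the default 100 ms
        env.setBufferTimeout(50);
        env.fromElements(1, 2, 3).map(i -> i * 10).print();
        env.execute("buffer-timeout-example");
    }
}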
Back to the earlier discussion: after the task thread fetches a buffer from the inputGate, it deserializes records from it and processes them. To summarize, the data read path is roughly OneInputStreamTask -> StreamInputProcessor -> CheckpointBarrierHandler -> InputGate -> InputChannel.
With that, the Task data read path is essentially covered. A later article will analyze the data transfer and interaction between the Reduce side and the Map side, i.e. how a RemoteInputChannel receives its data, and how the data in a ResultSubPartition is sent downstream.