After the client builds the JobGraph, Flink submits it to the JobMaster, where the ExecutionGraph is created.
The JobMaster constructor contains this statement:
this.executionGraph = this.createAndRestoreExecutionGraph(this.jobManagerJobMetricGroup);
Following this call leads to ExecutionGraphBuilder#buildGraph.
One statement in that method matters most:
// build the execution graph from the list of JobVertex
executionGraph.attachJobGraph(sortedTopology);
Most of the logic that turns the JobGraph into the ExecutionGraph lives in this method (attachJobGraph):
for (JobVertex jobVertex : topologiallySorted) {
if (jobVertex.isInputVertex() && !jobVertex.isStoppable()) {
this.isStoppable = false;
}
// create the execution job vertex and attach it to the graph
ExecutionJobVertex ejv = new ExecutionJobVertex(
this,
jobVertex,
1,
rpcTimeout,
globalModVersion,
createTimestamp);
ejv.connectToPredecessors(this.intermediateResults);
ExecutionJobVertex previousTask = this.tasks.putIfAbsent(jobVertex.getID(), ejv);
if (previousTask != null) {
throw new JobException(String.format("Encountered two job vertices with ID %s : previous=[%s] / new=[%s]",
jobVertex.getID(), ejv, previousTask));
}
The logic of the code above:
It iterates over all JobVertex instances and creates one ExecutionJobVertex for each of them. The key part is the ExecutionJobVertex constructor.
Important snippets:
this.producedDataSets = new IntermediateResult[jobVertex.getNumberOfProducedIntermediateDataSets()];
for (int i = 0; i < jobVertex.getProducedDataSets().size(); i++) {
final IntermediateDataSet result = jobVertex.getProducedDataSets().get(i);
this.producedDataSets[i] = new IntermediateResult(
result.getId(),
this,
numTaskVertices,
result.getResultType());
}
First a producedDataSets array is created. It is then filled from the JobVertex's produced data sets, one IntermediateResult per IntermediateDataSet.
for (int i = 0; i < numTaskVertices; i++) {
ExecutionVertex vertex = new ExecutionVertex(
this,
i,
producedDataSets,
timeout,
initialGlobalModVersion,
createTimestamp,
maxPriorAttemptsHistoryLength);
this.taskVertices[i] = vertex;
}
Here the ExecutionVertex instances are created according to the parallelism: each parallel subtask gets its own ExecutionVertex, which is stored in the taskVertices array.
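As a rough illustration of this expansion, the sketch below uses a toy model rather than Flink's real classes. The class JobVertexExpansionSketch and its string labels are hypothetical. It also assumes that every IntermediateResult ends up with one partition per parallel subtask, which matches how the later connectSource code reads source.getPartitions().

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these are NOT Flink's ExecutionJobVertex / IntermediateResult classes.
final class JobVertexExpansionSketch {

    // One label per parallel subtask, mirroring the taskVertices array.
    static List<String> executionVertices(String vertexName, int parallelism) {
        List<String> vertices = new ArrayList<>();
        for (int subtask = 0; subtask < parallelism; subtask++) {
            vertices.add(vertexName + "[" + subtask + "]");
        }
        return vertices;
    }

    // One label per (produced data set, subtask) pair.
    // Assumption: every IntermediateResult holds one partition per subtask.
    static List<String> resultPartitions(String vertexName, int numProducedDataSets, int parallelism) {
        List<String> partitions = new ArrayList<>();
        for (int dataSet = 0; dataSet < numProducedDataSets; dataSet++) {
            for (int subtask = 0; subtask < parallelism; subtask++) {
                partitions.add(vertexName + ".result[" + dataSet + "].partition[" + subtask + "]");
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(executionVertices("map", 3));
        System.out.println(resultPartitions("map", 1, 3));
    }
}
```

A vertex with parallelism 3 and one produced data set therefore yields three execution vertices and three partitions in this model.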
Once the ExecutionJobVertex has been created, execution continues with
ejv.connectToPredecessors(this.intermediateResults);
Following this call leads to ExecutionVertex#connectSource:
public void connectSource(int inputNumber, IntermediateResult source, JobEdge edge, int consumerNumber) {
final DistributionPattern pattern = edge.getDistributionPattern();
final IntermediateResultPartition[] sourcePartitions = source.getPartitions();
ExecutionEdge[] edges;
switch (pattern) {
case POINTWISE:
edges = connectPointwise(sourcePartitions, inputNumber);
break;
case ALL_TO_ALL:
edges = connectAllToAll(sourcePartitions, inputNumber);
break;
default:
throw new RuntimeException("Unrecognized distribution pattern.");
}
The switch has two branches. Start with the first one, connectPointwise:
private ExecutionEdge[] connectPointwise(IntermediateResultPartition[] sourcePartitions, int inputNumber) {
final int numSources = sourcePartitions.length;
final int parallelism = getTotalNumberOfParallelSubtasks();
// simple case same number of sources as targets
if (numSources == parallelism) {
return new ExecutionEdge[] { new ExecutionEdge(sourcePartitions[subTaskIndex], this, inputNumber) };
}
else if (numSources < parallelism) {
int sourcePartition;
// check if the pattern is regular or irregular
// we use int arithmetics for regular, and floating point with rounding for irregular
if (parallelism % numSources == 0) {
// same number of targets per source
int factor = parallelism / numSources;
sourcePartition = subTaskIndex / factor;
}
else {
// different number of targets per source
float factor = ((float) parallelism) / numSources;
sourcePartition = (int) (subTaskIndex / factor);
}
return new ExecutionEdge[] { new ExecutionEdge(sourcePartitions[sourcePartition], this, inputNumber) };
}
else {
if (numSources % parallelism == 0) {
// same number of targets per source
int factor = numSources / parallelism;
int startIndex = subTaskIndex * factor;
ExecutionEdge[] edges = new ExecutionEdge[factor];
for (int i = 0; i < factor; i++) {
edges[i] = new ExecutionEdge(sourcePartitions[startIndex + i], this, inputNumber);
}
return edges;
}
else {
float factor = ((float) numSources) / parallelism;
int start = (int) (subTaskIndex * factor);
int end = (subTaskIndex == getTotalNumberOfParallelSubtasks() - 1) ?
sourcePartitions.length :
(int) ((subTaskIndex + 1) * factor);
ExecutionEdge[] edges = new ExecutionEdge[end - start];
for (int i = 0; i < edges.length; i++) {
edges[i] = new ExecutionEdge(sourcePartitions[start + i], this, inputNumber);
}
return edges;
}
}
}
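The logic is fairly involved, so here is a summary.
The method compares the parallelism of the consuming ExecutionVertex with the number of upstream IntermediateResultPartitions and picks a strategy accordingly:
(1) If the parallelism equals the number of partitions (numSources == parallelism), the connection is one-to-one: subtask i reads partition i.
(2) If the parallelism is greater than the number of partitions (numSources < parallelism), the connection is one-to-many: one partition feeds several subtasks, and each subtask reads exactly one partition. When parallelism % numSources == 0, every partition serves the same number of subtasks; otherwise the partition index is computed with floating-point division and truncation.
(3) If the parallelism is smaller than the number of partitions (numSources > parallelism), the connection is many-to-one: each subtask reads several partitions. When numSources % parallelism == 0, every subtask reads the same number of partitions; otherwise the ranges are computed with floating-point arithmetic and the last subtask takes all remaining partitions.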
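To make the three cases concrete, here is a minimal sketch that reproduces only the index arithmetic of connectPointwise and returns the partition indices a given subtask would read. The class PointwiseSketch and the method sourceIndices are hypothetical names; no Flink classes are involved.

```java
import java.util.Arrays;

// Standalone re-statement of the index arithmetic quoted above; not Flink code.
final class PointwiseSketch {

    // Returns the indices of the source partitions read by the given consumer subtask.
    static int[] sourceIndices(int numSources, int parallelism, int subTaskIndex) {
        if (numSources == parallelism) {
            // case (1): one-to-one
            return new int[] {subTaskIndex};
        }
        if (numSources < parallelism) {
            // case (2): several subtasks share one partition
            int sourcePartition;
            if (parallelism % numSources == 0) {
                sourcePartition = subTaskIndex / (parallelism / numSources);
            } else {
                float factor = ((float) parallelism) / numSources;
                sourcePartition = (int) (subTaskIndex / factor);
            }
            return new int[] {sourcePartition};
        }
        // case (3): one subtask reads a contiguous range of partitions
        int start;
        int end;
        if (numSources % parallelism == 0) {
            int factor = numSources / parallelism;
            start = subTaskIndex * factor;
            end = start + factor;
        } else {
            float factor = ((float) numSources) / parallelism;
            start = (int) (subTaskIndex * factor);
            // the last subtask picks up whatever is left
            end = (subTaskIndex == parallelism - 1) ? numSources : (int) ((subTaskIndex + 1) * factor);
        }
        int[] indices = new int[end - start];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = start + i;
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sourceIndices(4, 4, 2)));
        System.out.println(Arrays.toString(sourceIndices(2, 4, 3)));
        System.out.println(Arrays.toString(sourceIndices(5, 2, 1)));
    }
}
```

With 5 partitions and parallelism 2, for instance, subtask 0 reads partitions 0 and 1 while subtask 1 reads partitions 2, 3 and 4.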
Now look at connectAllToAll, the fully connected mode:
ExecutionEdge[] edges = new ExecutionEdge[sourcePartitions.length];
for (int i = 0; i < sourcePartitions.length; i++) {
IntermediateResultPartition irp = sourcePartitions[i];
edges[i] = new ExecutionEdge(irp, this, inputNumber);
}
return edges;
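This is somewhat like the Cartesian product of a SQL join: every consumer subtask is connected to every upstream partition.
Suppose an ExecutionVertex has two different inputs, A and B, where input A has 1 partition and input B has 8 partitions. The two-dimensional inputEdges array then looks as follows (for brevity, irp stands for IntermediateResultPartition):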
[ ExecutionEdge[ A.irp[0]] ]
[ ExecutionEdge[ B.irp[0], B.irp[1], ..., B.irp[7] ] ]
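The shape of that array can be reproduced with a small sketch. It models each edge as a plain string and assumes every input is connected all-to-all, so one row holds one entry per upstream partition. The class InputEdgesSketch and the method inputEdges are hypothetical names.

```java
import java.util.Arrays;

// Toy model of the two-dimensional inputEdges array; edges are plain strings here.
final class InputEdgesSketch {

    // Row i holds one edge label per partition of input i (all-to-all assumed).
    static String[][] inputEdges(String[] inputNames, int[] partitionCounts) {
        String[][] edges = new String[inputNames.length][];
        for (int input = 0; input < inputNames.length; input++) {
            edges[input] = new String[partitionCounts[input]];
            for (int p = 0; p < partitionCounts[input]; p++) {
                edges[input][p] = inputNames[input] + ".irp[" + p + "]";
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        String[][] edges = inputEdges(new String[] {"A", "B"}, new int[] {1, 8});
        System.out.println(Arrays.deepToString(edges));
    }
}
```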
------------------------------------
Next, the physical execution graph.
Find the ExecutionGraph#scheduleForExecution method.
The branch usually taken is
case EAGER:
newSchedulingFuture = scheduleEager(slotProvider, allocationTimeout);
that is, the EAGER mode, which ends up in scheduleEager:
private CompletableFuture<Void> scheduleEager(SlotProvider slotProvider, final Time timeout) {
checkState(state == JobStatus.RUNNING, "job is not running currently");
// Important: reserve all the space we need up front.
// that way we do not have any operation that can fail between allocating the slots
// and adding them to the list. If we had a failure in between there, that would
// cause the slots to get lost
final boolean queued = allowQueuedScheduling;
// collecting all the slots may resize and fail in that operation without slots getting lost
final ArrayList<CompletableFuture<Execution>> allAllocationFutures = new ArrayList<>(getNumberOfExecutionJobVertices());
// allocate the slots (obtain all their futures
for (ExecutionJobVertex ejv : getVerticesTopologically()) {
// these calls are not blocking, they only return futures
Collection<CompletableFuture<Execution>> allocationFutures = ejv.allocateResourcesForAll(
slotProvider,
queued,
LocationPreferenceConstraint.ALL,
allocationTimeout);
allAllocationFutures.addAll(allocationFutures);
}
// this future is complete once all slot futures are complete.
// the future fails once one slot future fails.
final ConjunctFuture<Collection<Execution>> allAllocationsFuture = FutureUtils.combineAll(allAllocationFutures);
final CompletableFuture<Void> currentSchedulingFuture = allAllocationsFuture
.thenAccept(
(Collection<Execution> executionsToDeploy) -> {
for (Execution execution : executionsToDeploy) {
try {
execution.deploy();
} catch (Throwable t) {
throw new CompletionException(
new FlinkException(
String.format("Could not deploy execution %s.", execution),
t));
}
}
})
// Generate a more specific failure message for the eager scheduling
.exceptionally(
(Throwable throwable) -> {
final Throwable strippedThrowable = ExceptionUtils.stripCompletionException(throwable);
final Throwable resultThrowable;
if (strippedThrowable instanceof TimeoutException) {
int numTotal = allAllocationsFuture.getNumFuturesTotal();
int numComplete = allAllocationsFuture.getNumFuturesCompleted();
String message = "Could not allocate all requires slots within timeout of " +
timeout + ". Slots required: " + numTotal + ", slots allocated: " + numComplete;
resultThrowable = new NoResourceAvailableException(message);
} else {
resultThrowable = strippedThrowable;
}
throw new CompletionException(resultThrowable);
});
return currentSchedulingFuture;
}
This code relies heavily on CompletableFuture, which was added in JDK 8. It is not covered here; plenty of articles online describe it.
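The method walks the ExecutionJobVertex instances in topological order and requests slots for all of their executions. These calls do not block; they only return futures. Once every slot future has completed, it calls execution.deploy() on each Execution.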
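The collect, combine and deploy pattern can be sketched with plain JDK classes. This is not Flink's code: EagerSchedulingSketch is a hypothetical class, slots are modelled as strings, CompletableFuture.allOf stands in for FutureUtils.combineAll, and an IllegalStateException stands in for NoResourceAvailableException.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

// JDK-only sketch of "wait for all slot futures, then deploy"; not Flink code.
final class EagerSchedulingSketch {

    static CompletableFuture<List<String>> scheduleEager(List<CompletableFuture<String>> slotFutures) {
        // completes once all slot futures complete, fails as soon as the first failure is observed
        CompletableFuture<Void> all =
            CompletableFuture.allOf(slotFutures.toArray(new CompletableFuture[0]));

        return all
            .thenApply(ignored -> {
                List<String> deployed = new ArrayList<>();
                for (CompletableFuture<String> slot : slotFutures) {
                    // join() does not block here: every future is already complete
                    deployed.add("deployed on " + slot.join());
                }
                return deployed;
            })
            .exceptionally((Throwable t) -> {
                // strip the CompletionException wrapper, as the original code does
                Throwable cause = (t instanceof CompletionException && t.getCause() != null)
                    ? t.getCause() : t;
                if (cause instanceof TimeoutException) {
                    // stand-in for NoResourceAvailableException
                    throw new CompletionException(
                        new IllegalStateException("Could not allocate all required slots", cause));
                }
                throw new CompletionException(cause);
            });
    }

    public static void main(String[] args) {
        CompletableFuture<String> slot = CompletableFuture.completedFuture("slot-0");
        System.out.println(scheduleEager(Collections.singletonList(slot)).join());
    }
}
```

Because nothing is deployed until the combined future completes, a single failed slot request prevents every deployment in this sketch, which is the point of eager scheduling.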
Inside deploy(), the main statements are:
final TaskDeploymentDescriptor deployment = vertex.createDeploymentDescriptor(
attemptId,
slot,
taskRestore,
attemptNumber);
final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
final CompletableFuture<Acknowledge> submitResultFuture = taskManagerGateway.submitTask(deployment, rpcTimeout);
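The first statement performs the conversion from the ExecutionGraph to the actual physical execution graph. For example, an IntermediateResultPartition becomes a ResultPartition, and an ExecutionEdge becomes an InputChannelDeploymentDescriptor (which is eventually turned into an InputGate at execution time).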
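Finally the task is submitted through an RPC call, which ends up in TaskExecutor.submitTask.
That method creates the actual Task and then calls task.startTaskThread() to start executing it.
The Task constructor creates the InputGate, ResultPartition, ResultPartitionWriter and so on from the arguments it receives.
startTaskThread in turn calls executingThread.start, which invokes Task.run.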
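The thread hand-off can be pictured with a small sketch: a task object that is itself a Runnable, owns its executing thread, and starts it on request. TaskThreadSketch and its state strings are hypothetical and only mirror the call chain startTaskThread, executingThread.start, run.

```java
// Toy model of a task that runs on its own thread; not Flink's Task class.
final class TaskThreadSketch implements Runnable {

    private final Thread executingThread;
    private volatile String state = "CREATED";

    TaskThreadSketch(String taskName) {
        // the task hands itself to the thread, so Thread.start() ends up in run()
        this.executingThread = new Thread(this, taskName);
    }

    void startTaskThread() {
        executingThread.start();
    }

    @Override
    public void run() {
        state = "RUNNING";
        // a real task would load and invoke the user code here
        state = "FINISHED";
    }

    String currentState() {
        return state;
    }

    // Waits for the task thread to end and returns the last state it set.
    String awaitFinalState() {
        try {
            executingThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return state;
    }

    public static void main(String[] args) {
        TaskThreadSketch task = new TaskThreadSketch("demo-task");
        task.startTaskThread();
        System.out.println(task.awaitFinalState());
    }
}
```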
Step into TaskExecutor#submitTask
and find task.startTaskThread();
Inside it, find
executingThread.start();
What matters is the run method that executingThread executes, i.e. Task.run.
A few of its core statements:
invokable = loadAndInstantiateInvokable(userCodeClassLoader, nameOfInvokableClass, env);
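The invokable here is the task object that will be executed (an AbstractInvokable such as OneInputStreamTask, which in turn holds the operators). It is created via reflection.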
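Reflective instantiation of this kind can be sketched as follows. The names ReflectiveLoadSketch, Invokable and WordCountTask are hypothetical stand-ins, and the sketch uses a no-argument constructor for simplicity, whereas the call above also passes an environment object.

```java
// Sketch of loading a class by name and instantiating it reflectively; not Flink code.
final class ReflectiveLoadSketch {

    // Hypothetical stand-in for the invokable base type.
    interface Invokable {
        String invoke();
    }

    // Hypothetical task class that is only ever referenced by name.
    static final class WordCountTask implements Invokable {
        public WordCountTask() {}

        @Override
        public String invoke() {
            return "running";
        }
    }

    static Invokable loadAndInstantiate(ClassLoader classLoader, String className) {
        try {
            // resolve the class through the given class loader and check its type
            Class<? extends Invokable> clazz =
                Class.forName(className, true, classLoader).asSubclass(Invokable.class);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Invokable invokable = loadAndInstantiate(
            ReflectiveLoadSketch.class.getClassLoader(), WordCountTask.class.getName());
        System.out.println(invokable.invoke());
    }
}
```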
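So where does the logic the user actually wrote end up? For example, where does the Tokenizer from word count go?
StreamTask, the base class of OneInputStreamTask, holds a headOperator and an operatorChain. Calling dataStream.flatMap(new Tokenizer()) creates a StreamFlatMap operator. That operator is an AbstractUdfStreamOperator, and the user's new Tokenizer() is its userFunction.
Putting it together, with OneInputStreamTask as the example: the core of task execution is OneInputStreamTask.invoke (inherited from StreamTask), which calls StreamTask.run. run is an abstract method, so the call lands in the run method of the concrete subclass, such as OneInputStreamTask or SourceStreamTask.
The run method of OneInputStreamTask looks like this: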
final OneInputStreamOperator<IN, OUT> operator = this.headOperator;
final StreamInputProcessor<IN> inputProcessor = this.inputProcessor;
final Object lock = getCheckpointLock();
while (running && inputProcessor.processInput(operator, lock)) {
// all the work happens in the "processInput" method
}
It simply keeps calling inputProcessor.processInput(operator, lock) in a loop, i.e. StreamInputProcessor.processInput:
public boolean processInput(OneInputStreamOperator<IN, ?> streamOperator, final Object lock) throws Exception {
// ...
while (true) {
if (currentRecordDeserializer != null) {
// ...
if (result.isFullRecord()) {
StreamElement recordOrMark = deserializationDelegate.getInstance();
// watermarks are handled by the framework
if (recordOrMark.isWatermark()) {
// watermark handling logic
// ...
continue;
} else if(recordOrMark.isLatencyMarker()) {
// latency markers are also handled by the framework
synchronized (lock) {
streamOperator.processLatencyMarker(recordOrMark.asLatencyMarker());
}
continue;
} else {
// ***** this is where the user logic actually runs *****
StreamRecord<IN> record = recordOrMark.asRecord();
synchronized (lock) {
numRecordsIn.inc();
streamOperator.setKeyContextElement1(record);
streamOperator.processElement(record);
}
return true;
}
}
}
// other processing logic
// ...
}
}
In the code above, streamOperator.processElement(record) is the call that actually runs the user logic. Taking StreamFlatMap as the example, this is its processElement method:
public void processElement(StreamRecord<IN> element) throws Exception {
collector.setTimestamp(element);
userFunction.flatMap(element.getValue(), collector);
}
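The delegation from operator to user function can be sketched without any Flink dependency. In the sketch below, UdfOperatorSketch, FlatMapFn and FlatMapOperator are hypothetical stand-ins for the real operator classes, a java.util.function.Consumer plays the role of the collector, and the tokenizer is a simplified word splitter rather than the actual word count Tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Toy model of an operator that wraps a user function; not Flink's StreamFlatMap.
final class UdfOperatorSketch {

    // Hypothetical stand-in for the user-facing flat-map function interface.
    interface FlatMapFn<I, O> {
        void flatMap(I value, Consumer<O> out);
    }

    // The operator owns the user function and forwards each element to it.
    static final class FlatMapOperator<I, O> {
        private final FlatMapFn<I, O> userFunction;
        private final Consumer<O> collector;

        FlatMapOperator(FlatMapFn<I, O> userFunction, Consumer<O> collector) {
            this.userFunction = userFunction;
            this.collector = collector;
        }

        void processElement(I element) {
            userFunction.flatMap(element, collector);
        }
    }

    // Feeds every line through the operator, the way the task loop feeds records.
    static List<String> run(List<String> lines) {
        List<String> output = new ArrayList<>();
        FlatMapFn<String, String> tokenizer = (line, out) -> {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.accept(word);
                }
            }
        };
        FlatMapOperator<String, String> operator = new FlatMapOperator<>(tokenizer, output::add);
        for (String line : lines) {
            operator.processElement(line);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("To be", "or not")));
    }
}
```

The framework side only ever sees processElement; the user code is reached through the userFunction field, which is the same indirection the StreamFlatMap snippet shows.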