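Reading the Flink Source: How the ExecutionGraph Is Generated
Both the StreamGraph and the JobGraph are generated on the client side. Compared with the StreamGraph, the JobGraph has already been optimized to some degree. When the client submits the JobGraph to the JobManager, the JobManager builds the corresponding ExecutionGraph from it. The TaskManagers ultimately run the tasks that are deployed from the ExecutionGraph.
Flink Runtime Architecture
Before looking at how the ExecutionGraph is generated, it helps to understand Flink's runtime architecture, shown below.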
![](https://i-blog.csdnimg.cn/blog_migrate/f3a270407f7155240327b86757099a89.png)
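It consists mainly of two components: the JobManager and the TaskManager.
JobManager
The main responsibilities of the JobManager are:
- It decides when to schedule the next task (or set of tasks) and reacts to finished tasks or execution failures.
- It coordinates checkpoints and coordinates recovery from failures.
The JobManager contains three different components that work together.
- ResourceManager: The ResourceManager handles resource allocation and deallocation in a Flink cluster by managing task slots, which are the basic unit of resource scheduling. Flink implements multiple ResourceManagers for different environments (such as YARN and Kubernetes). In standalone mode, however, the ResourceManager can only distribute the slots of available TaskManagers and cannot start new TaskManagers on its own.
- Dispatcher: The Dispatcher provides a REST interface for submitting Flink applications for execution and starts a new JobMaster for each submitted job. It also runs the Flink WebUI to provide information about job executions.
- JobMaster: A JobMaster is responsible for managing the execution of a single JobGraph. Multiple jobs can run simultaneously in a Flink cluster, each with its own JobMaster.
TaskManager
A TaskManager acts as a worker node: it executes tasks, buffers data locally, and exchanges data streams with other TaskManagers.
Overall Submission Process of a Flink Job
The flow from submitting a Flink job on the client to executing it on the Flink cluster is shown in the figure below.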
![](https://i-blog.csdnimg.cn/blog_migrate/a6bd19bb59bce3cee8ca6fc609124c95.png)
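- The client creates the StreamGraph, translates it into a JobGraph, and finally submits the application to the Flink cluster.
- In local mode, a MiniCluster is started, and its Dispatcher handles the job.
- The Dispatcher starts a JobManagerRunner. JobManagerRunners take part in a leader election, and once a leader has been elected, a DefaultScheduler is created. Constructing the DefaultScheduler also runs the constructor of its parent class SchedulerBase, and the ExecutionGraph is created during this step.
- After the ExecutionGraph has been generated during SchedulerBase initialization, tasks are deployed to the TaskManagers and run according to the ExecutionGraph.
The call flow of the above process in the source code is as follows.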
![](https://i-blog.csdnimg.cn/blog_migrate/9b26bbc471f66d864d42f530baf8085a.png)
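Details of ExecutionGraph Generation
First, look at the diagram from the official documentation that shows how a JobGraph is turned into an ExecutionGraph.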
![](https://i-blog.csdnimg.cn/blog_migrate/32c44c0a9c395b66d138d7ba5af8c809.png)
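Components of the ExecutionGraph
As the figure above shows, the ExecutionGraph differs considerably from the JobGraph in structure. The components of the ExecutionGraph are described below.
- ExecutionJobVertex: The nodes of the ExecutionGraph are ExecutionJobVertex objects, which correspond one-to-one to the JobVertex objects in the JobGraph.
- ExecutionVertex: An ExecutionJobVertex has a taskVertices field, an array of ExecutionVertex whose length equals the parallelism of the JobVertex. Creating an ExecutionJobVertex creates that many ExecutionVertex objects. At scheduling time, one ExecutionVertex is effectively one task: it is one parallel subtask of the ExecutionJobVertex.
- IntermediateResult: In the JobGraph, an IntermediateDataSet represents an output stream of a JobVertex, and one JobVertex may have several output streams. In the ExecutionGraph, the counterpart is the IntermediateResult object.
- IntermediateResultPartition: Because an ExecutionJobVertex may have several parallel subtasks, each IntermediateResult may have several producers. The output of each producer into a given IntermediateResult corresponds to one IntermediateResultPartition object, so an IntermediateResultPartition represents one output partition of an ExecutionVertex.
- ExecutionEdge: An ExecutionEdge represents an input of an ExecutionVertex. It links an ExecutionVertex to an IntermediateResultPartition and thereby establishes the connection between the consumer and the partition it reads.
- Execution: An Execution is one execution attempt of an ExecutionVertex, uniquely identified by an ExecutionAttemptID. An ExecutionVertex may be executed more than once, for example after a failure or when the data of the task has to be recomputed.
From these concepts it follows that:
- JobVertex = ExecutionJobVertex = (parallelism of the JobVertex) * ExecutionVertex
- IntermediateDataSet = IntermediateResult = (parallelism of the JobVertex) * IntermediateResultPartition
- Each JobVertex may have several IntermediateDataSets, so each ExecutionJobVertex may have several IntermediateResults, and therefore each ExecutionVertex may also hold several IntermediateResultPartitions.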
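The counting relationships above can be made concrete with a small toy model. This is only a sketch: `GraphExpansionSketch` and `ToyJobVertex` are hypothetical names invented for illustration and are not Flink classes. It assumes a vertex is fully described by its parallelism and the number of data sets it produces.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these are NOT Flink classes. All names are hypothetical.
class GraphExpansionSketch {

    /** Stand-in for a JobVertex: a name, a parallelism, and a number of produced data sets. */
    static final class ToyJobVertex {
        final String name;
        final int parallelism;
        final int producedDataSets;

        ToyJobVertex(String name, int parallelism, int producedDataSets) {
            this.name = name;
            this.parallelism = parallelism;
            this.producedDataSets = producedDataSets;
        }
    }

    /** One ExecutionVertex per parallel subtask. */
    static int executionVertexCount(ToyJobVertex v) {
        return v.parallelism;
    }

    /** One IntermediateResult per IntermediateDataSet. */
    static int intermediateResultCount(ToyJobVertex v) {
        return v.producedDataSets;
    }

    /** Every IntermediateResult has one partition per producing subtask. */
    static int partitionCount(ToyJobVertex v) {
        return v.producedDataSets * v.parallelism;
    }

    /** Labels of the partitions owned by one subtask: one per IntermediateResult. */
    static List<String> partitionsOfSubtask(ToyJobVertex v, int subtaskIndex) {
        List<String> partitions = new ArrayList<>();
        for (int r = 0; r < v.producedDataSets; r++) {
            partitions.add(v.name + "/result-" + r + "/partition-" + subtaskIndex);
        }
        return partitions;
    }

    public static void main(String[] args) {
        ToyJobVertex map = new ToyJobVertex("map", 4, 2);
        System.out.println("ExecutionVertices: " + executionVertexCount(map));
        System.out.println("IntermediateResults: " + intermediateResultCount(map));
        System.out.println("Partitions in total: " + partitionCount(map));
        System.out.println("Partitions of subtask 1: " + partitionsOfSubtask(map, 1));
    }
}
```

With a parallelism of 4 and 2 produced data sets, the model yields 4 ExecutionVertices, 2 IntermediateResults, and 8 partitions, of which each subtask owns 2.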
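Creating the ExecutionGraph through SchedulerBase
As the call-flow diagram shows, DefaultScheduler extends SchedulerBase, so creating a DefaultScheduler also runs the SchedulerBase constructor. That constructor calls createAndRestoreExecutionGraph() to start building the ExecutionGraph, as shown below.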
public SchedulerBase(
......
// start creating the ExecutionGraph
this.executionGraph = createAndRestoreExecutionGraph(jobManagerJobMetricGroup, checkNotNull(shuffleMaster), checkNotNull(partitionTracker), checkNotNull(executionDeploymentTracker));
......
}
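Next, attachJobGraph() is called to finish building the ExecutionGraph. This involves two steps:
- Create the ExecutionJobVertex objects.
- Call connectToPredecessors() to create the ExecutionEdges that connect each ExecutionVertex to its upstream IntermediateResultPartitions.
The source of attachJobGraph() is shown below.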
public void attachJobGraph(List<JobVertex> topologiallySorted) throws JobException {
assertRunningInJobMasterMainThread();
LOG.debug("Attaching {} topologically sorted vertices to existing job graph with {} " +
"vertices and {} intermediate results.",
topologiallySorted.size(),
tasks.size(),
intermediateResults.size());
final ArrayList<ExecutionJobVertex> newExecJobVertices = new ArrayList<>(topologiallySorted.size());
final long createTimestamp = System.currentTimeMillis();
for (JobVertex jobVertex : topologiallySorted) {
if (jobVertex.isInputVertex() && !jobVertex.isStoppable()) {
this.isStoppable = false;
}
// create the execution job vertex and attach it to the graph
ExecutionJobVertex ejv = new ExecutionJobVertex(
this,
jobVertex,
1,
maxPriorAttemptsHistoryLength,
rpcTimeout,
globalModVersion,
createTimestamp);
ejv.connectToPredecessors(this.intermediateResults);
ExecutionJobVertex previousTask = this.tasks.putIfAbsent(jobVertex.getID(), ejv);
if (previousTask != null) {
throw new JobException(String.format("Encountered two job vertices with ID %s : previous=[%s] / new=[%s]",
jobVertex.getID(), ejv, previousTask));
}
for (IntermediateResult res : ejv.getProducedDataSets()) {
IntermediateResult previousDataSet = this.intermediateResults.putIfAbsent(res.getId(), res);
if (previousDataSet != null) {
throw new JobException(String.format("Encountered two intermediate data set with ID %s : previous=[%s] / new=[%s]",
res.getId(), res, previousDataSet));
}
}
this.verticesInCreationOrder.add(ejv);
this.numVerticesTotal += ejv.getParallelism();
newExecJobVertices.add(ejv);
}
......
}
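The loop above relies on the vertices arriving in topological order: each vertex looks up its inputs in a map that only earlier vertices have filled, and `putIfAbsent` is used to detect duplicate IDs. The following is a minimal sketch of that idea with plain strings as IDs. `AttachOrderSketch` and `ToyVertex` are hypothetical names, not Flink classes, and the error messages are only modeled on the ones in the excerpt.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Flink code. All names are hypothetical.
class AttachOrderSketch {

    /** Stand-in for a JobVertex: its id, the result ids it consumes, and the result ids it produces. */
    static final class ToyVertex {
        final String id;
        final List<String> inputResultIds;
        final List<String> producedResultIds;

        ToyVertex(String id, List<String> inputResultIds, List<String> producedResultIds) {
            this.id = id;
            this.inputResultIds = inputResultIds;
            this.producedResultIds = producedResultIds;
        }
    }

    /**
     * Attaches the vertices in the given order and returns their ids in creation order.
     * Throws if an input has not been produced yet (order is not topological)
     * or if a vertex id or result id appears twice.
     */
    static List<String> attach(List<ToyVertex> sorted) {
        Map<String, ToyVertex> tasks = new HashMap<>();
        Map<String, String> results = new HashMap<>(); // result id -> producer id
        List<String> created = new ArrayList<>();
        for (ToyVertex v : sorted) {
            // connect to predecessors: every input must already be registered
            for (String in : v.inputResultIds) {
                if (!results.containsKey(in)) {
                    throw new IllegalStateException(
                            "No previous intermediate result found for ID " + in);
                }
            }
            // putIfAbsent returns the previous value, so non-null means a duplicate
            if (tasks.putIfAbsent(v.id, v) != null) {
                throw new IllegalStateException("Encountered two job vertices with ID " + v.id);
            }
            for (String out : v.producedResultIds) {
                if (results.putIfAbsent(out, v.id) != null) {
                    throw new IllegalStateException(
                            "Encountered two intermediate data sets with ID " + out);
                }
            }
            created.add(v.id);
        }
        return created;
    }

    public static void main(String[] args) {
        List<ToyVertex> sorted = new ArrayList<>();
        sorted.add(new ToyVertex("source", new ArrayList<>(), List.of("r1")));
        sorted.add(new ToyVertex("map", List.of("r1"), List.of("r2")));
        sorted.add(new ToyVertex("sink", List.of("r2"), new ArrayList<>()));
        System.out.println(attach(sorted));
    }
}
```

Passing the same vertices in reverse order fails on the first vertex, because the result it consumes has not been registered yet.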
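Creating the ExecutionJobVertex Objects
attachJobGraph iterates over the list of JobVertex objects and creates an ExecutionJobVertex for each one. The ExecutionJobVertex constructor mainly does the following:
- Based on the results of the JobVertex (a list of IntermediateDataSet), it creates the corresponding IntermediateResult objects, one for each IntermediateDataSet.
- Based on the parallelism of the JobVertex, it creates the same number of ExecutionVertex objects. At scheduling time, each ExecutionVertex is effectively one task.
- While the IntermediateResult and ExecutionVertex objects are being created, the relationships between them are recorded.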
public ExecutionJobVertex(
ExecutionGraph graph,
JobVertex jobVertex,
int defaultParallelism,
int maxPriorAttemptsHistoryLength,
Time timeout,
long initialGlobalModVersion,
long createTimestamp) throws JobException {
if (graph == null || jobVertex == null) {
throw new NullPointerException();
}
this.graph = graph;
this.jobVertex = jobVertex;
int vertexParallelism = jobVertex.getParallelism();
int numTaskVertices = vertexParallelism > 0 ? vertexParallelism : defaultParallelism;
final int configuredMaxParallelism = jobVertex.getMaxParallelism();
this.maxParallelismConfigured = (VALUE_NOT_SET != configuredMaxParallelism);
// if no max parallelism was configured by the user, we calculate and set a default
setMaxParallelismInternal(maxParallelismConfigured ?
configuredMaxParallelism : KeyGroupRangeAssignment.computeDefaultMaxParallelism(numTaskVertices));
// verify that our parallelism is not higher than the maximum parallelism
if (numTaskVertices > maxParallelism) {
throw new JobException(
String.format("Vertex %s's parallelism (%s) is higher than the max parallelism (%s). Please lower the parallelism or increase the max parallelism.",
jobVertex.getName(),
numTaskVertices,
maxParallelism));
}
this.parallelism = numTaskVertices;
this.resourceProfile = ResourceProfile.fromResourceSpec(jobVertex.getMinResources(), MemorySize.ZERO);
// taskVertices holds one ExecutionVertex per parallel subtask of this task
this.taskVertices = new ExecutionVertex[numTaskVertices];
this.inputs = new ArrayList<>(jobVertex.getInputs().size());
// take the sharing group
this.slotSharingGroup = jobVertex.getSlotSharingGroup();
this.coLocationGroup = jobVertex.getCoLocationGroup();
// setup the coLocation group
if (coLocationGroup != null && slotSharingGroup == null) {
throw new JobException("Vertex uses a co-location constraint without using slot sharing");
}
// create the intermediate results
this.producedDataSets = new IntermediateResult[jobVertex.getNumberOfProducedIntermediateDataSets()];
// iterate over the IntermediateDataSets of the jobVertex and create the matching IntermediateResult in producedDataSets
for (int i = 0; i < jobVertex.getProducedDataSets().size(); i++) {
final IntermediateDataSet result = jobVertex.getProducedDataSets().get(i);
this.producedDataSets[i] = new IntermediateResult(
result.getId(),
this,
numTaskVertices,
result.getResultType());
}
// create all task vertices
for (int i = 0; i < numTaskVertices; i++) {
ExecutionVertex vertex = new ExecutionVertex(
this,
i,
producedDataSets,
timeout,
initialGlobalModVersion,
createTimestamp,
maxPriorAttemptsHistoryLength);
this.taskVertices[i] = vertex;
}
// sanity check for the double referencing between intermediate result partitions and execution vertices
for (IntermediateResult ir : this.producedDataSets) {
if (ir.getNumberOfAssignedPartitions() != parallelism) {
throw new RuntimeException("The intermediate result's partitions were not correctly assigned.");
}
}
final List<SerializedValue<OperatorCoordinator.Provider>> coordinatorProviders = getJobVertex().getOperatorCoordinators();
if (coordinatorProviders.isEmpty()) {
this.operatorCoordinators = Collections.emptyList();
} else {
final ArrayList<OperatorCoordinatorHolder> coordinators = new ArrayList<>(coordinatorProviders.size());
try {
for (final SerializedValue<OperatorCoordinator.Provider> provider : coordinatorProviders) {
coordinators.add(OperatorCoordinatorHolder.create(provider, this, graph.getUserClassLoader()));
}
} catch (Exception | LinkageError e) {
IOUtils.closeAllQuietly(coordinators);
throw new JobException("Cannot instantiate the coordinator for operator " + getName(), e);
}
this.operatorCoordinators = Collections.unmodifiableList(coordinators);
}
// set up the input splits, if the vertex has any
try {
@SuppressWarnings("unchecked")
InputSplitSource<InputSplit> splitSource = (InputSplitSource<InputSplit>) jobVertex.getInputSplitSource();
if (splitSource != null) {
Thread currentThread = Thread.currentThread();
ClassLoader oldContextClassLoader = currentThread.getContextClassLoader();
currentThread.setContextClassLoader(graph.getUserClassLoader());
try {
inputSplits = splitSource.createInputSplits(numTaskVertices);
if (inputSplits != null) {
splitAssigner = splitSource.getInputSplitAssigner(inputSplits);
}
} finally {
currentThread.setContextClassLoader(oldContextClassLoader);
}
}
else {
inputSplits = null;
}
}
catch (Throwable t) {
throw new JobException("Creating the input splits caused an error: " + t.getMessage(), t);
}
}
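The first part of the constructor resolves the effective parallelism and, if the user configured none, a default max parallelism. The sketch below reproduces that arithmetic in isolation. The fallback rule comes straight from the excerpt. The default max parallelism formula (round `p + p / 2` up to a power of two, clamped to the range 128 to 32768) is how I understand `KeyGroupRangeAssignment.computeDefaultMaxParallelism` to behave; treat the formula and both bounds as assumptions and check them against the Flink version you are reading. `ParallelismSketch` is a hypothetical name.

```java
// Sketch only: NOT Flink code. The bounds and the rounding rule are assumptions.
class ParallelismSketch {

    static final int LOWER_BOUND_MAX_PARALLELISM = 1 << 7;  // 128 (assumed)
    static final int UPPER_BOUND_MAX_PARALLELISM = 1 << 15; // 32768 (assumed)

    /** Uses the vertex parallelism if it is set (> 0), otherwise the default. */
    static int resolveParallelism(int vertexParallelism, int defaultParallelism) {
        return vertexParallelism > 0 ? vertexParallelism : defaultParallelism;
    }

    /** Smallest power of two that is >= x, for x >= 1. */
    static int roundUpToPowerOfTwo(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    /** Assumed default: 1.5 times the parallelism, rounded up to a power of two and clamped. */
    static int defaultMaxParallelism(int parallelism) {
        int candidate = roundUpToPowerOfTwo(parallelism + parallelism / 2);
        return Math.min(Math.max(candidate, LOWER_BOUND_MAX_PARALLELISM), UPPER_BOUND_MAX_PARALLELISM);
    }

    /** Mirrors the sanity check in the constructor: parallelism must not exceed max parallelism. */
    static void checkParallelism(int parallelism, int maxParallelism) {
        if (parallelism > maxParallelism) {
            throw new IllegalArgumentException(String.format(
                    "parallelism (%d) is higher than the max parallelism (%d)",
                    parallelism, maxParallelism));
        }
    }

    public static void main(String[] args) {
        int parallelism = resolveParallelism(-1, 1); // unset on the vertex, so the default 1 is used
        int maxParallelism = defaultMaxParallelism(parallelism);
        checkParallelism(parallelism, maxParallelism);
        System.out.println(parallelism + " / " + maxParallelism);
        System.out.println(defaultMaxParallelism(200));
    }
}
```

Note that attachJobGraph passes 1 as the default parallelism, so a vertex without an explicit parallelism ends up with a single subtask.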
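Creating the ExecutionVertex Objects
Creating an ExecutionJobVertex also creates its ExecutionVertex objects. The ExecutionVertex constructor does the following:
- Based on the producedDataSets of the ExecutionJobVertex (an array of IntermediateResult), it creates an IntermediateResultPartition for each IntermediateResult. Each one represents one partition of that IntermediateResult.
- It calls setPartition() on the IntermediateResult to record the relationship between the IntermediateResult and the IntermediateResultPartition.
- It creates an Execution object for this ExecutionVertex. If the ExecutionVertex is rescheduled (for example, when recovering from a failure), the attemptNumber of the new Execution is incremented by 1. At initialization its value is 0.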
public ExecutionVertex(
ExecutionJobVertex jobVertex,
int subTaskIndex,
IntermediateResult[] producedDataSets,
Time timeout,
long initialGlobalModVersion,
long createTimestamp,
int maxPriorExecutionHistoryLength) {
this.jobVertex = jobVertex;
this.subTaskIndex = subTaskIndex;
this.executionVertexId = new ExecutionVertexID(jobVertex.getJobVertexId(), subTaskIndex);
this.taskNameWithSubtask = String.format("%s (%d/%d)",
jobVertex.getJobVertex().getName(), subTaskIndex + 1, jobVertex.getParallelism());
this.resultPartitions = new LinkedHashMap<>(producedDataSets.length, 1);
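// iterate over the producedDataSets of the ExecutionJobVertex; the ExecutionVertex gets one partition per IntermediateResult, so the count depends on how many data sets the JobVertex produces, not on its parallelism
for (IntermediateResult result : producedDataSets) {
// an ExecutionVertex is one subtask; it owns the partition at index subTaskIndex in each IntermediateResult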
IntermediateResultPartition irp = new IntermediateResultPartition(result, this, subTaskIndex);
result.setPartition(subTaskIndex, irp);
resultPartitions.put(irp.getPartitionId(), irp);
}
this.inputEdges = new ExecutionEdge[jobVertex.getJobVertex().getInputs().size()][];
this.priorExecutions = new EvictingBoundedList<>(maxPriorExecutionHistoryLength);
this.currentExecution = new Execution(
getExecutionGraph().getFutureExecutor(),
this,
0,
initialGlobalModVersion,
createTimestamp,
timeout);
// create a co-location scheduling hint, if necessary
CoLocationGroup clg = jobVertex.getCoLocationGroup();
if (clg != null) {
this.locationConstraint = clg.getLocationConstraint(subTaskIndex);
}
else {
this.locationConstraint = null;
}
getExecutionGraph().registerExecution(currentExecution);
this.timeout = timeout;
this.inputSplits = new ArrayList<>();
}
Creating the ExecutionEdges
After an ExecutionJobVertex has been created, connectToPredecessors is called on it to create the ExecutionEdges. The source is shown below.
public void connectToPredecessors(Map<IntermediateDataSetID, IntermediateResult> intermediateDataSets) throws JobException {
List<JobEdge> inputs = jobVertex.getInputs();
if (LOG.isDebugEnabled()) {
LOG.debug(String.format("Connecting ExecutionJobVertex %s (%s) to %d predecessors.", jobVertex.getID(), jobVertex.getName(), inputs.size()));
}
for (int num = 0; num < inputs.size(); num++) {
JobEdge edge = inputs.get(num);
if (LOG.isDebugEnabled()) {
if (edge.getSource() == null) {
LOG.debug(String.format("Connecting input %d of vertex %s (%s) to intermediate result referenced via ID %s.",
num, jobVertex.getID(), jobVertex.getName(), edge.getSourceId()));
} else {
LOG.debug(String.format("Connecting input %d of vertex %s (%s) to intermediate result referenced via predecessor %s (%s).",
num, jobVertex.getID(), jobVertex.getName(), edge.getSource().getProducer().getID(), edge.getSource().getProducer().getName()));
}
}
// fetch the intermediate result via ID. if it does not exist, then it either has not been created, or the order
// in which this method is called for the job vertices is not a topological order
IntermediateResult ires = intermediateDataSets.get(edge.getSourceId());
if (ires == null) {
throw new JobException("Cannot connect this job graph to the previous graph. No previous intermediate result found for ID "
+ edge.getSourceId());
}
this.inputs.add(ires);
int consumerIndex = ires.registerConsumer();
for (int i = 0; i < parallelism; i++) {
ExecutionVertex ev = taskVertices[i];
ev.connectSource(num, ires, edge, consumerIndex);
}
}
}
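For each JobEdge of the ExecutionJobVertex, connectSource is called once per parallel subtask to connect that ExecutionVertex to its upstream partitions. There are two DistributionPattern values:
- ALL_TO_ALL: each producing subtask is connected to every subtask of the consuming task (all-to-all).
- POINTWISE: each producing subtask is connected to one or more subtasks of the consuming task (one-to-one or one-to-few).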
public void connectSource(int inputNumber, IntermediateResult source, JobEdge edge, int consumerNumber) {
final DistributionPattern pattern = edge.getDistributionPattern();
final IntermediateResultPartition[] sourcePartitions = source.getPartitions();
ExecutionEdge[] edges;
// the pattern is POINTWISE only for forward/rescale partitioning; in all other cases it is ALL_TO_ALL
switch (pattern) {
case POINTWISE:
edges = connectPointwise(sourcePartitions, inputNumber);
break;
case ALL_TO_ALL:
edges = connectAllToAll(sourcePartitions, inputNumber);
break;
default:
throw new RuntimeException("Unrecognized distribution pattern.");
}
inputEdges[inputNumber] = edges;
// add the consumers to the source
// for now (until the receiver initiated handshake is in place), we need to register the
// edges as the execution graph
for (ExecutionEdge ee : edges) {
ee.getSource().addConsumer(ee, consumerNumber);
}
}
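To make the difference between the two patterns tangible, the sketch below computes which source partitions a given consumer subtask would read. It is a simplification: the POINTWISE branch splits the sources into contiguous, roughly even ranges, which captures the idea but is not claimed to match the exact index arithmetic of Flink's connectPointwise. `DistributionPatternSketch` and `sourcesFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Flink code. The pointwise split is a simplified even-range scheme.
class DistributionPatternSketch {

    enum Pattern { ALL_TO_ALL, POINTWISE }

    /** Returns the indices of the source partitions read by one consumer subtask. */
    static List<Integer> sourcesFor(Pattern pattern, int numSources, int consumerParallelism, int subtaskIndex) {
        List<Integer> sources = new ArrayList<>();
        if (pattern == Pattern.ALL_TO_ALL) {
            // every consumer subtask reads every source partition
            for (int i = 0; i < numSources; i++) {
                sources.add(i);
            }
            return sources;
        }
        // POINTWISE: the consumer reads a contiguous slice of the sources.
        int start = subtaskIndex * numSources / consumerParallelism;
        int end = (subtaskIndex + 1) * numSources / consumerParallelism;
        if (end == start) {
            // more consumers than sources: several consumers share one source
            end = start + 1;
        }
        for (int i = start; i < end; i++) {
            sources.add(i);
        }
        return sources;
    }

    public static void main(String[] args) {
        // 2 producers, 4 consumers
        for (int i = 0; i < 4; i++) {
            System.out.println("subtask " + i
                    + " all-to-all=" + sourcesFor(Pattern.ALL_TO_ALL, 2, 4, i)
                    + " pointwise=" + sourcesFor(Pattern.POINTWISE, 2, 4, i));
        }
    }
}
```

The number of edges is what matters in practice: ALL_TO_ALL creates producers times consumers edges, while POINTWISE stays close to the larger of the two parallelisms.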
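Once the ExecutionEdges have been created, construction of the ExecutionGraph is complete.
Submitting the ExecutionGraph and Running Tasks Afterwards
Once the ExecutionGraph has been generated, job scheduling starts based on it. The scheduling flow in the source code is as follows:
- Because this is a streaming job, the eager scheduling mode is used (batch jobs use the lazy mode).
- Before deployment, the scheduler checks that the Execution of each ExecutionVertex is in the CREATED state. If so, it moves the Executions to be deployed to SCHEDULED and starts allocating slots for the ExecutionVertices.
- The ExecutionVertices are deployed one by one, asynchronously. Slots are assigned according to the slot provider strategy in use.
- Slot allocation first goes to the SlotPool in the JobMaster: it collects all slots in the SlotPool and tries to pick the most suitable one. There are two selection strategies, one that prefers location and one that prefers previously allocated slots. If the SlotPool cannot satisfy the request, a slot is requested from the ResourceManager via RPC. If the ResourceManager is not connected yet, the request is buffered and sent once the connection is established.
- When the ResourceManager receives a slot request, it throws an exception if the requesting JobManager is not registered. Otherwise it forwards the request to the SlotManager. The SlotManager tracks all free slots in the cluster (TaskManagers report themselves to the ResourceManager, and the SlotManager inside the ResourceManager keeps the mapping between slots and TaskManagers). It picks a matching slot and sends an RPC request to the TaskManager to claim it.
- After all slot requests have completed, the Execution of each ExecutionVertex is assigned to its slot, that is, the slot's resources are allocated to the Execution. Deployment can start after that.
- Every time an ExecutionVertex is scheduled, there is an Execution for it. In this phase the Execution moves to the DEPLOYING state, a deployment descriptor is generated for the ExecutionVertex, and the TaskManagerGateway is obtained from the assigned slot so that the task can be submitted to the corresponding TaskManager.
- submitTask is called, which submits the task to the TaskManager via RPC.
- After receiving the submit request, the TaskManager (TaskExecutor) performs some initialization (such as fetching files from the BlobServer, deserializing the job and task information, and setting up the LibraryCacheManager). This information is used to build a Task (a Runnable), which is then started.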
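The state changes mentioned in the steps above (CREATED to SCHEDULED to DEPLOYING) can be pictured as guarded transitions: a step only succeeds if the Execution is currently in the expected state. The following is a minimal sketch of that pattern using a compare-and-set. `ExecutionStateSketch` is a hypothetical class, the state list is shortened, and this is not how Flink's Execution class is actually written.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: NOT Flink's Execution class. Names and the state list are simplified.
class ExecutionStateSketch {

    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED }

    private final AtomicReference<State> state = new AtomicReference<>(State.CREATED);

    State current() {
        return state.get();
    }

    /** Atomically moves from expected to next; returns false if the current state differs. */
    boolean transition(State expected, State next) {
        return state.compareAndSet(expected, next);
    }

    /** Only an Execution in CREATED may be scheduled. */
    void schedule() {
        if (!transition(State.CREATED, State.SCHEDULED)) {
            throw new IllegalStateException("Cannot schedule from state " + current());
        }
    }

    /** Only an Execution in SCHEDULED may be deployed. */
    void deploy() {
        if (!transition(State.SCHEDULED, State.DEPLOYING)) {
            throw new IllegalStateException("Cannot deploy from state " + current());
        }
    }

    public static void main(String[] args) {
        ExecutionStateSketch execution = new ExecutionStateSketch();
        execution.schedule();
        execution.deploy();
        System.out.println(execution.current());
    }
}
```

Guarding each step this way means a second attempt to schedule or deploy the same attempt is rejected instead of silently repeating the work.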