Reading the Flink Source Code: How the ExecutionGraph Is Generated

Latest recommended article published on 2024-04-10 09:44:46

瓜不田

Category: BigData. Tags: flink

Article link: https://blog.csdn.net/Jerseywwwwei/article/details/108312983

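Both the StreamGraph and the JobGraph are generated on the client side. Compared with the StreamGraph, the JobGraph has already been optimized to some degree (for example, chainable operators are merged into a single vertex). When the client submits the JobGraph to the JobManager, the JobManager generates the corresponding ExecutionGraph from it. The TaskManagers ultimately run the tasks according to the ExecutionGraph.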

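Flink runtime architecture

Before looking at how the ExecutionGraph is generated, it helps to understand Flink's runtime architecture, which is shown in the figure below.

It consists mainly of two components: the JobManager and the TaskManager.

JobManager

The main responsibilities of the JobManager are:

It decides when to schedule the next task (or set of tasks), reacts to finished tasks or execution failures, coordinates checkpoints, and coordinates recovery from failures.

The JobManager contains three different components that work together.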

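ResourceManager

The ResourceManager is responsible for allocating and releasing resources in a Flink cluster. It does so by managing task slots, which are the basic unit of resource scheduling in a Flink cluster. Flink implements multiple ResourceManagers for different environments (such as YARN and Kubernetes). In standalone mode, however, the ResourceManager can only distribute the slots of the available TaskManagers and cannot start new TaskManagers on its own.

Dispatcher

The Dispatcher provides a REST interface for submitting Flink applications for execution and starts a new JobMaster for each submitted job. It also runs the Flink WebUI, which provides information about job executions.

JobMaster

The JobMaster is responsible for managing the execution of a single JobGraph. Multiple jobs can run simultaneously in a Flink cluster, and each job has its own JobMaster.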

TaskManager

The TaskManager acts as the worker node: it executes tasks, buffers data locally, and exchanges data streams with other TaskManagers.

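Overall job submission process

The flow of a Flink job from submission on the client to execution on the Flink cluster is shown in the figure below.

The client creates the StreamGraph, converts it into a JobGraph, and finally submits the application to the Flink cluster.
In local mode, a MiniCluster is started, and the Dispatcher handles the submitted job.
The Dispatcher starts a JobManagerRunner, and the JobManagerRunners take part in leader election. Once a leader has been elected, a DefaultScheduler is created. Constructing the DefaultScheduler also runs the constructor of its parent class, SchedulerBase, and the ExecutionGraph is created in that step.
After the ExecutionGraph has been built during SchedulerBase initialization, the tasks run on the TaskManagers according to the ExecutionGraph.

The call flow of this process in the source code is as follows.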

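Details of ExecutionGraph generation

First, look at the diagram from the official documentation that shows how a JobGraph is converted into an ExecutionGraph.

Components of the ExecutionGraph

As the figure above shows, the ExecutionGraph differs considerably from the JobGraph in structure. The components of the ExecutionGraph are described next.

ExecutionJobVertex: In the ExecutionGraph, a node is an ExecutionJobVertex, which corresponds one-to-one to a JobVertex in the JobGraph.
ExecutionVertex: An ExecutionJobVertex has a taskVertices field, an array of ExecutionVertex whose length is the parallelism of the JobVertex. When an ExecutionJobVertex is created, it creates as many ExecutionVertex objects as its parallelism. At scheduling time, one ExecutionVertex is effectively one task: it is one parallel subtask of the ExecutionJobVertex.
IntermediateResult: In the JobGraph, an IntermediateDataSet represents an output stream of a JobVertex, and one JobVertex may have several output streams. In the ExecutionGraph, the corresponding object is the IntermediateResult.
IntermediateResultPartition: Because an ExecutionJobVertex may have several parallel subtasks, each IntermediateResult may have several producers. The output of each producer on a given IntermediateResult corresponds to one IntermediateResultPartition, which represents one output partition of an ExecutionVertex.
ExecutionEdge: An ExecutionEdge represents an input of an ExecutionVertex. It connects an ExecutionVertex to an IntermediateResultPartition, which links the consuming subtask to the upstream partition it reads.
Execution: An Execution is one attempt to execute an ExecutionVertex and is uniquely identified by an ExecutionAttemptID. An ExecutionVertex may be executed several times, for example after a failure or when the task's data needs to be recomputed.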

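From these definitions it follows that:

One JobVertex corresponds to one ExecutionJobVertex, which contains as many ExecutionVertex objects as the parallelism of the JobVertex.
One IntermediateDataSet corresponds to one IntermediateResult, which contains as many IntermediateResultPartition objects as the parallelism of the producing JobVertex.
Each JobVertex may have several IntermediateDataSets, so each ExecutionJobVertex may have several IntermediateResults, and each ExecutionVertex may therefore own several IntermediateResultPartitions.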
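
The counting relationships above can be checked with a small model. This is only a sketch under the assumptions stated in the text: GraphCountSketch, VertexSpec and all method names are hypothetical and are not Flink classes.

```java
/** Hypothetical counting model; none of these types are Flink classes. */
final class GraphCountSketch {

    /** Stand-in for a JobVertex: its parallelism and how many IntermediateDataSets it produces. */
    static final class VertexSpec {
        final int parallelism;
        final int producedDataSets;

        VertexSpec(int parallelism, int producedDataSets) {
            this.parallelism = parallelism;
            this.producedDataSets = producedDataSets;
        }
    }

    /** One ExecutionVertex per parallel subtask. */
    static int executionVertexCount(VertexSpec v) {
        return v.parallelism;
    }

    /** One IntermediateResult per IntermediateDataSet. */
    static int intermediateResultCount(VertexSpec v) {
        return v.producedDataSets;
    }

    /** Each IntermediateResult has one partition per producing subtask. */
    static int partitionsPerResult(VertexSpec v) {
        return v.parallelism;
    }

    /** Each ExecutionVertex owns one partition in every IntermediateResult. */
    static int partitionsPerExecutionVertex(VertexSpec v) {
        return v.producedDataSets;
    }

    /** Total number of IntermediateResultPartition objects for this vertex. */
    static int totalPartitions(VertexSpec v) {
        return v.producedDataSets * v.parallelism;
    }

    public static void main(String[] args) {
        VertexSpec vertex = new VertexSpec(4, 2);
        System.out.println("ExecutionVertex objects: " + executionVertexCount(vertex));
        System.out.println("IntermediateResult objects: " + intermediateResultCount(vertex));
        System.out.println("Partitions in total: " + totalPartitions(vertex));
    }
}
```

The total can be reached either way: results times partitions per result, or subtasks times partitions per subtask.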

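Creating the ExecutionGraph through SchedulerBase

As the call flow shows, DefaultScheduler extends SchedulerBase, so creating a DefaultScheduler also runs the SchedulerBase constructor. That constructor calls createAndRestoreExecutionGraph() to start creating the ExecutionGraph, as shown in the source below.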

public SchedulerBase(
		......
		// start creating the ExecutionGraph
		this.executionGraph = createAndRestoreExecutionGraph(jobManagerJobMetricGroup, checkNotNull(shuffleMaster), checkNotNull(partitionTracker), checkNotNull(executionDeploymentTracker));
      	......
	}
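
The reason the graph already exists by the time the subclass is ready is plain Java constructor chaining. The sketch below illustrates only that mechanism; BaseScheduler and SimpleScheduler are hypothetical stand-ins for SchedulerBase and DefaultScheduler, and the graph is reduced to a string.

```java
/** Hypothetical illustration of constructor chaining; not Flink code. */
final class SchedulerConstructionSketch {

    /** Stand-in for SchedulerBase: it builds the graph inside its constructor. */
    static class BaseScheduler {
        final String executionGraph;

        BaseScheduler(String jobName) {
            // This runs before the body of any subclass constructor.
            this.executionGraph = createAndRestoreGraph(jobName);
        }

        private static String createAndRestoreGraph(String jobName) {
            return "ExecutionGraph(" + jobName + ")";
        }
    }

    /** Stand-in for DefaultScheduler. */
    static final class SimpleScheduler extends BaseScheduler {
        final boolean graphReadyInSubclassConstructor;

        SimpleScheduler(String jobName) {
            super(jobName);
            // The field set by the parent constructor is already populated here.
            this.graphReadyInSubclassConstructor = executionGraph != null;
        }
    }

    public static void main(String[] args) {
        SimpleScheduler scheduler = new SimpleScheduler("wordcount");
        System.out.println(scheduler.executionGraph);
        System.out.println("ready in subclass constructor: " + scheduler.graphReadyInSubclassConstructor);
    }
}
```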

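attachJobGraph() is then called to finish building the ExecutionGraph. It involves two steps:

Create an ExecutionJobVertex for each JobVertex.
Call connectToPredecessors() to create the ExecutionEdges that connect each ExecutionVertex to its upstream IntermediateResultPartitions.

The source of attachJobGraph() is shown below.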

public void attachJobGraph(List<JobVertex> topologiallySorted) throws JobException {

		assertRunningInJobMasterMainThread();

		LOG.debug("Attaching {} topologically sorted vertices to existing job graph with {} " +
				"vertices and {} intermediate results.",
			topologiallySorted.size(),
			tasks.size(),
			intermediateResults.size());

		final ArrayList<ExecutionJobVertex> newExecJobVertices = new ArrayList<>(topologiallySorted.size());
		final long createTimestamp = System.currentTimeMillis();

		for (JobVertex jobVertex : topologiallySorted) {

			if (jobVertex.isInputVertex() && !jobVertex.isStoppable()) {
				this.isStoppable = false;
			}

			// create the execution job vertex and attach it to the graph
			ExecutionJobVertex ejv = new ExecutionJobVertex(
					this,
					jobVertex,
					1,
					maxPriorAttemptsHistoryLength,
					rpcTimeout,
					globalModVersion,
					createTimestamp);

			ejv.connectToPredecessors(this.intermediateResults);

			ExecutionJobVertex previousTask = this.tasks.putIfAbsent(jobVertex.getID(), ejv);
			if (previousTask != null) {
				throw new JobException(String.format("Encountered two job vertices with ID %s : previous=[%s] / new=[%s]",
					jobVertex.getID(), ejv, previousTask));
			}

			for (IntermediateResult res : ejv.getProducedDataSets()) {
				IntermediateResult previousDataSet = this.intermediateResults.putIfAbsent(res.getId(), res);
				if (previousDataSet != null) {
					throw new JobException(String.format("Encountered two intermediate data set with ID %s : previous=[%s] / new=[%s]",
						res.getId(), res, previousDataSet));
				}
			}

			this.verticesInCreationOrder.add(ejv);
			this.numVerticesTotal += ejv.getParallelism();
			newExecJobVertices.add(ejv);
		}

		// ... (the rest of the method is omitted in this excerpt)
	}
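
Two details of attachJobGraph() are easy to miss: the vertices must arrive in topological order, because each vertex looks up its inputs by ID in a map that upstream vertices have filled, and putIfAbsent doubles as a duplicate check. The sketch below models just those two points with plain strings; AttachOrderSketch and its Vertex type are hypothetical and much simpler than the Flink classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the attach loop; IDs are plain strings here. */
final class AttachOrderSketch {

    /** Stand-in for a JobVertex: its ID, the result IDs it consumes and the result IDs it produces. */
    static final class Vertex {
        final String id;
        final List<String> inputs;
        final List<String> outputs;

        Vertex(String id, List<String> inputs, List<String> outputs) {
            this.id = id;
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    private final Map<String, Vertex> tasks = new HashMap<>();
    /** Maps a result ID to the ID of the vertex that produces it. */
    private final Map<String, String> intermediateResults = new HashMap<>();

    void attach(List<Vertex> topologicallySorted) {
        for (Vertex v : topologicallySorted) {
            // Every consumed result must already have been registered by an upstream vertex.
            for (String input : v.inputs) {
                if (!intermediateResults.containsKey(input)) {
                    throw new IllegalStateException("No previous intermediate result found for ID " + input);
                }
            }
            // putIfAbsent returns the existing mapping, so a non-null value means a duplicate ID.
            if (tasks.putIfAbsent(v.id, v) != null) {
                throw new IllegalStateException("Encountered two job vertices with ID " + v.id);
            }
            for (String output : v.outputs) {
                if (intermediateResults.putIfAbsent(output, v.id) != null) {
                    throw new IllegalStateException("Encountered two intermediate data sets with ID " + output);
                }
            }
        }
    }

    int vertexCount() {
        return tasks.size();
    }

    public static void main(String[] args) {
        Vertex source = new Vertex("source", Collections.<String>emptyList(), Arrays.asList("r1"));
        Vertex sink = new Vertex("sink", Arrays.asList("r1"), Collections.<String>emptyList());
        AttachOrderSketch graph = new AttachOrderSketch();
        graph.attach(Arrays.asList(source, sink));
        System.out.println("attached vertices: " + graph.vertexCount());
    }
}
```

Passing the same two vertices in reverse order fails at the sink, since "r1" has not been registered yet.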

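Creating the ExecutionJobVertex objects

attachJobGraph() iterates over the JobVertex list and creates an ExecutionJobVertex for each one. The ExecutionJobVertex constructor mainly does the following:

It creates an IntermediateResult for each entry in the JobVertex's results (its list of IntermediateDataSets), so every IntermediateDataSet has exactly one corresponding IntermediateResult.
It creates as many ExecutionVertex objects as the parallelism of the JobVertex; at scheduling time, each ExecutionVertex is effectively one task.
While creating the IntermediateResult and ExecutionVertex objects, it records the relationships between them.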

public ExecutionJobVertex(
			ExecutionGraph graph,
			JobVertex jobVertex,
			int defaultParallelism,
			int maxPriorAttemptsHistoryLength,
			Time timeout,
			long initialGlobalModVersion,
			long createTimestamp) throws JobException {

		if (graph == null || jobVertex == null) {
			throw new NullPointerException();
		}

		this.graph = graph;
		this.jobVertex = jobVertex;

		int vertexParallelism = jobVertex.getParallelism();
		int numTaskVertices = vertexParallelism > 0 ? vertexParallelism : defaultParallelism;

		final int configuredMaxParallelism = jobVertex.getMaxParallelism();

		this.maxParallelismConfigured = (VALUE_NOT_SET != configuredMaxParallelism);

		// if no max parallelism was configured by the user, we calculate and set a default
		setMaxParallelismInternal(maxParallelismConfigured ?
				configuredMaxParallelism : KeyGroupRangeAssignment.computeDefaultMaxParallelism(numTaskVertices));

		// verify that our parallelism is not higher than the maximum parallelism
		if (numTaskVertices > maxParallelism) {
			throw new JobException(
				String.format("Vertex %s's parallelism (%s) is higher than the max parallelism (%s). Please lower the parallelism or increase the max parallelism.",
					jobVertex.getName(),
					numTaskVertices,
					maxParallelism));
		}

		this.parallelism = numTaskVertices;
		this.resourceProfile = ResourceProfile.fromResourceSpec(jobVertex.getMinResources(), MemorySize.ZERO);
		// taskVertices holds one ExecutionVertex per parallel subtask of this task
		this.taskVertices = new ExecutionVertex[numTaskVertices];

		this.inputs = new ArrayList<>(jobVertex.getInputs().size());

		// take the sharing group
		this.slotSharingGroup = jobVertex.getSlotSharingGroup();
		this.coLocationGroup = jobVertex.getCoLocationGroup();

		// setup the coLocation group
		if (coLocationGroup != null && slotSharingGroup == null) {
			throw new JobException("Vertex uses a co-location constraint without using slot sharing");
		}

		// create the intermediate results
		this.producedDataSets = new IntermediateResult[jobVertex.getNumberOfProducedIntermediateDataSets()];
		// iterate over the IntermediateDataSets of the jobVertex and add one IntermediateResult per data set to producedDataSets
		for (int i = 0; i < jobVertex.getProducedDataSets().size(); i++) {
			final IntermediateDataSet result = jobVertex.getProducedDataSets().get(i);

			this.producedDataSets[i] = new IntermediateResult(
					result.getId(),
					this,
					numTaskVertices,
					result.getResultType());
		}

		// create all task vertices
		for (int i = 0; i < numTaskVertices; i++) {
			ExecutionVertex vertex = new ExecutionVertex(
					this,
					i,
					producedDataSets,
					timeout,
					initialGlobalModVersion,
					createTimestamp,
					maxPriorAttemptsHistoryLength);

			this.taskVertices[i] = vertex;
		}

		// sanity check for the double referencing between intermediate result partitions and execution vertices
		for (IntermediateResult ir : this.producedDataSets) {
			if (ir.getNumberOfAssignedPartitions() != parallelism) {
				throw new RuntimeException("The intermediate result's partitions were not correctly assigned.");
			}
		}

		final List<SerializedValue<OperatorCoordinator.Provider>> coordinatorProviders = getJobVertex().getOperatorCoordinators();
		if (coordinatorProviders.isEmpty()) {
			this.operatorCoordinators = Collections.emptyList();
		} else {
			final ArrayList<OperatorCoordinatorHolder> coordinators = new ArrayList<>(coordinatorProviders.size());
			try {
				for (final SerializedValue<OperatorCoordinator.Provider> provider : coordinatorProviders) {
					coordinators.add(OperatorCoordinatorHolder.create(provider, this, graph.getUserClassLoader()));
				}
			} catch (Exception | LinkageError e) {
				IOUtils.closeAllQuietly(coordinators);
				throw new JobException("Cannot instantiate the coordinator for operator " + getName(), e);
			}
			this.operatorCoordinators = Collections.unmodifiableList(coordinators);
		}

		// set up the input splits, if the vertex has any
		try {
			@SuppressWarnings("unchecked")
			InputSplitSource<InputSplit> splitSource = (InputSplitSource<InputSplit>) jobVertex.getInputSplitSource();

			if (splitSource != null) {
				Thread currentThread = Thread.currentThread();
				ClassLoader oldContextClassLoader = currentThread.getContextClassLoader();
				currentThread.setContextClassLoader(graph.getUserClassLoader());
				try {
					inputSplits = splitSource.createInputSplits(numTaskVertices);

					if (inputSplits != null) {
						splitAssigner = splitSource.getInputSplitAssigner(inputSplits);
					}
				} finally {
					currentThread.setContextClassLoader(oldContextClassLoader);
				}
			}
			else {
				inputSplits = null;
			}
		}
		catch (Throwable t) {
			throw new JobException("Creating the input splits caused an error: " + t.getMessage(), t);
		}
	}
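
The first part of the constructor resolves the parallelism and the max parallelism. The sketch below replays that logic in isolation. It is an approximation, not the Flink implementation: ParallelismSketch is a hypothetical class, and the rounding rule and the bounds 128 and 32768 reflect my understanding of KeyGroupRangeAssignment.computeDefaultMaxParallelism, which should be verified against the Flink version in use.

```java
/** Hypothetical re-implementation for illustration; verify the constants against your Flink version. */
final class ParallelismSketch {

    // Assumed bounds for the default max parallelism.
    static final int LOWER_BOUND_MAX_PARALLELISM = 1 << 7;  // 128
    static final int UPPER_BOUND_MAX_PARALLELISM = 1 << 15; // 32768

    /** Uses the vertex parallelism if it is set (positive), otherwise the default. */
    static int resolveParallelism(int vertexParallelism, int defaultParallelism) {
        return vertexParallelism > 0 ? vertexParallelism : defaultParallelism;
    }

    /** Smallest power of two that is >= x, for x >= 1. */
    static int roundUpToPowerOfTwo(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    /** Assumed rule: round 1.5 * parallelism up to a power of two, then clamp to the bounds. */
    static int defaultMaxParallelism(int parallelism) {
        int candidate = roundUpToPowerOfTwo(parallelism + parallelism / 2);
        return Math.min(Math.max(candidate, LOWER_BOUND_MAX_PARALLELISM), UPPER_BOUND_MAX_PARALLELISM);
    }

    /** Mirrors the check that the parallelism must not exceed the max parallelism. */
    static void checkParallelism(int parallelism, int maxParallelism) {
        if (parallelism > maxParallelism) {
            throw new IllegalArgumentException(
                "parallelism (" + parallelism + ") is higher than the max parallelism (" + maxParallelism + ")");
        }
    }

    public static void main(String[] args) {
        int parallelism = resolveParallelism(-1, 1);
        int maxParallelism = defaultMaxParallelism(parallelism);
        checkParallelism(parallelism, maxParallelism);
        System.out.println("parallelism=" + parallelism + ", maxParallelism=" + maxParallelism);
    }
}
```

Note that attachJobGraph() passes 1 as defaultParallelism, so a vertex without an explicit parallelism ends up with a single ExecutionVertex.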

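Creating the ExecutionVertex objects

The ExecutionVertex objects are created while the ExecutionJobVertex is being constructed. The ExecutionVertex constructor does the following:

For each entry in the ExecutionJobVertex's producedDataSets (an array of IntermediateResult), it creates an IntermediateResultPartition for this ExecutionVertex, which represents one partition of that IntermediateResult.
It calls setPartition() on the IntermediateResult to record the relationship between the IntermediateResult and the IntermediateResultPartition.
It creates an Execution object for this ExecutionVertex. The attemptNumber starts at 0 and is incremented by 1 each time the ExecutionVertex is rescheduled (for example when recovering from a failure).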

public ExecutionVertex(
			ExecutionJobVertex jobVertex,
			int subTaskIndex,
			IntermediateResult[] producedDataSets,
			Time timeout,
			long initialGlobalModVersion,
			long createTimestamp,
			int maxPriorExecutionHistoryLength) {

		this.jobVertex = jobVertex;
		this.subTaskIndex = subTaskIndex;
		this.executionVertexId = new ExecutionVertexID(jobVertex.getJobVertexId(), subTaskIndex);
		this.taskNameWithSubtask = String.format("%s (%d/%d)",
				jobVertex.getJobVertex().getName(), subTaskIndex + 1, jobVertex.getParallelism());

		this.resultPartitions = new LinkedHashMap<>(producedDataSets.length, 1);
		// iterate over the producedDataSets of the ExecutionJobVertex; there is one IntermediateResult per IntermediateDataSet of the JobVertex
		for (IntermediateResult result : producedDataSets) {
			// one ExecutionVertex is one subtask, and it owns one IntermediateResultPartition in each IntermediateResult
			IntermediateResultPartition irp = new IntermediateResultPartition(result, this, subTaskIndex);
			result.setPartition(subTaskIndex, irp);

			resultPartitions.put(irp.getPartitionId(), irp);
		}

		this.inputEdges = new ExecutionEdge[jobVertex.getJobVertex().getInputs().size()][];

		this.priorExecutions = new EvictingBoundedList<>(maxPriorExecutionHistoryLength);

		this.currentExecution = new Execution(
			getExecutionGraph().getFutureExecutor(),
			this,
			0,
			initialGlobalModVersion,
			createTimestamp,
			timeout);

		// create a co-location scheduling hint, if necessary
		CoLocationGroup clg = jobVertex.getCoLocationGroup();
		if (clg != null) {
			this.locationConstraint = clg.getLocationConstraint(subTaskIndex);
		}
		else {
			this.locationConstraint = null;
		}

		getExecutionGraph().registerExecution(currentExecution);

		this.timeout = timeout;
		this.inputSplits = new ArrayList<>();
	}
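
The relationship between an ExecutionVertex and its Execution attempts can be pictured as a current attempt plus a history. The sketch below is a hypothetical model of that idea only: ExecutionAttemptSketch, Attempt, Vertex and resetForNewAttempt are illustrative names, and the real class keeps a bounded history and does far more bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of attempt numbering; not the Flink classes. */
final class ExecutionAttemptSketch {

    /** Stand-in for an Execution: one attempt to run a subtask. */
    static final class Attempt {
        final int subtaskIndex;
        final int attemptNumber;

        Attempt(int subtaskIndex, int attemptNumber) {
            this.subtaskIndex = subtaskIndex;
            this.attemptNumber = attemptNumber;
        }
    }

    /** Stand-in for an ExecutionVertex. */
    static final class Vertex {
        private final int subtaskIndex;
        private final List<Attempt> priorAttempts = new ArrayList<>();
        private Attempt current;

        Vertex(int subtaskIndex) {
            this.subtaskIndex = subtaskIndex;
            // The first attempt is created with the vertex and has number 0.
            this.current = new Attempt(subtaskIndex, 0);
        }

        /** Archives the current attempt and starts a new one with the next number. */
        Attempt resetForNewAttempt() {
            priorAttempts.add(current);
            current = new Attempt(subtaskIndex, current.attemptNumber + 1);
            return current;
        }

        Attempt currentAttempt() {
            return current;
        }

        int priorAttemptCount() {
            return priorAttempts.size();
        }
    }

    public static void main(String[] args) {
        Vertex vertex = new Vertex(0);
        vertex.resetForNewAttempt();
        System.out.println("current attempt: " + vertex.currentAttempt().attemptNumber);
        System.out.println("prior attempts: " + vertex.priorAttemptCount());
    }
}
```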

Creating the ExecutionEdges

After an ExecutionJobVertex has been created, connectToPredecessors() is called on it to create the ExecutionEdges. The source is shown below.

public void connectToPredecessors(Map<IntermediateDataSetID, IntermediateResult> intermediateDataSets) throws JobException {

		List<JobEdge> inputs = jobVertex.getInputs();

		if (LOG.isDebugEnabled()) {
			LOG.debug(String.format("Connecting ExecutionJobVertex %s (%s) to %d predecessors.", jobVertex.getID(), jobVertex.getName(), inputs.size()));
		}

		for (int num = 0; num < inputs.size(); num++) {
			JobEdge edge = inputs.get(num);

			if (LOG.isDebugEnabled()) {
				if (edge.getSource() == null) {
					LOG.debug(String.format("Connecting input %d of vertex %s (%s) to intermediate result referenced via ID %s.",
							num, jobVertex.getID(), jobVertex.getName(), edge.getSourceId()));
				} else {
					LOG.debug(String.format("Connecting input %d of vertex %s (%s) to intermediate result referenced via predecessor %s (%s).",
							num, jobVertex.getID(), jobVertex.getName(), edge.getSource().getProducer().getID(), edge.getSource().getProducer().getName()));
				}
			}

			// fetch the intermediate result via ID. if it does not exist, then it either has not been created, or the order
			// in which this method is called for the job vertices is not a topological order
			IntermediateResult ires = intermediateDataSets.get(edge.getSourceId());
			if (ires == null) {
				throw new JobException("Cannot connect this job graph to the previous graph. No previous intermediate result found for ID "
						+ edge.getSourceId());
			}

			this.inputs.add(ires);

			int consumerIndex = ires.registerConsumer();

			for (int i = 0; i < parallelism; i++) {
				ExecutionVertex ev = taskVertices[i];
				ev.connectSource(num, ires, edge, consumerIndex);
			}
		}
	}

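For each JobEdge of the ExecutionJobVertex, connectSource() is called on every one of its parallel ExecutionVertex objects to connect that ExecutionVertex to the upstream result. There are two DistributionPattern values:

ALL_TO_ALL: each producing subtask is connected to every subtask of the consuming task (all-to-all).
POINTWISE: each producing subtask is connected to one or more subtasks of the consuming task.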

public void connectSource(int inputNumber, IntermediateResult source, JobEdge edge, int consumerNumber) {

		final DistributionPattern pattern = edge.getDistributionPattern();
		final IntermediateResultPartition[] sourcePartitions = source.getPartitions();

		ExecutionEdge[] edges;

		// the pattern is POINTWISE only for forward/rescale partitioning; otherwise it is ALL_TO_ALL
		switch (pattern) {
			case POINTWISE:
				edges = connectPointwise(sourcePartitions, inputNumber);
				break;

			case ALL_TO_ALL:
				edges = connectAllToAll(sourcePartitions, inputNumber);
				break;

			default:
				throw new RuntimeException("Unrecognized distribution pattern.");

		}

		inputEdges[inputNumber] = edges;

		// add the consumers to the source
		// for now (until the receiver initiated handshake is in place), we need to register the
		// edges as the execution graph
		for (ExecutionEdge ee : edges) {
			ee.getSource().addConsumer(ee, consumerNumber);
		}
	}
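
To make the two patterns concrete, the sketch below computes which upstream partition indices a given consumer subtask would read. It is a simplified model: DistributionPatternSketch is a hypothetical class, and the pointwise mapping uses plain integer arithmetic of my own rather than the exact code of connectPointwise(), so results for parallelism ratios that do not divide evenly may differ from Flink.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the pointwise arithmetic is a simplification, not Flink's code. */
final class DistributionPatternSketch {

    /** ALL_TO_ALL: every consumer subtask reads every upstream partition. */
    static List<Integer> allToAll(int numSourcePartitions) {
        List<Integer> sources = new ArrayList<>();
        for (int p = 0; p < numSourcePartitions; p++) {
            sources.add(p);
        }
        return sources;
    }

    /** POINTWISE: every consumer subtask reads only a contiguous slice of the upstream partitions. */
    static List<Integer> pointwise(int numSourcePartitions, int consumerParallelism, int subtaskIndex) {
        List<Integer> sources = new ArrayList<>();
        if (numSourcePartitions <= consumerParallelism) {
            // As many or fewer partitions than consumers: each consumer reads exactly one partition.
            sources.add(subtaskIndex * numSourcePartitions / consumerParallelism);
        } else {
            // More partitions than consumers: each consumer reads a range of partitions.
            int start = subtaskIndex * numSourcePartitions / consumerParallelism;
            int end = (subtaskIndex + 1) * numSourcePartitions / consumerParallelism;
            for (int p = start; p < end; p++) {
                sources.add(p);
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println("ALL_TO_ALL, 3 partitions: " + allToAll(3));
        System.out.println("POINTWISE, 4 partitions, 2 consumers, subtask 1: " + pointwise(4, 2, 1));
        System.out.println("POINTWISE, 2 partitions, 4 consumers, subtask 3: " + pointwise(2, 4, 3));
    }
}
```

With ALL_TO_ALL the number of edges is the product of both parallelisms, while POINTWISE keeps it close to the larger of the two, which is why forward and rescale connections are cheaper to build and deploy.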

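Once the ExecutionEdges have been created, the ExecutionGraph is complete.

Submitting the ExecutionGraph and running the tasks

After the ExecutionGraph has been generated, the job is scheduled based on it. The scheduling flow in the source code is shown in the figure below.

Because this is a streaming job, the eager scheduling mode is used (batch jobs use the lazy mode).
Before deployment, the scheduler checks that the Execution of each ExecutionVertex is in the CREATED state. If it is, the Execution is moved to SCHEDULED and slot allocation for the ExecutionVertex begins.
The ExecutionVertex objects are deployed one by one, asynchronously, and their slots are assigned according to the slot provider strategy in use.
When allocating a slot, the JobMaster first tries its SlotPool: it collects the slots available in the pool and tries to pick the best match. Two selection strategies exist: one prefers slots by location, the other prefers previously allocated slots. If the SlotPool cannot satisfy the request, a slot is requested from the ResourceManager over RPC. If the ResourceManager is not connected yet, the request is cached and sent once the connection is established.
When the ResourceManager receives a slot request, it fails the request with an exception if the requesting JobManager is not registered. Otherwise it forwards the request to the SlotManager. The SlotManager tracks all free slots in the cluster (TaskManagers report themselves to the ResourceManager, and the SlotManager keeps the mapping between slots and TaskManagers). It picks a matching slot and sends an RPC to the owning TaskManager to request it.
Once all slot requests have completed, each Execution is assigned to its slot, that is, the slot's resources are allocated to the Execution. Deployment can start after that.
Each scheduling of an ExecutionVertex has a corresponding Execution. In this phase the Execution moves to the DEPLOYING state, a deployment descriptor is generated for the ExecutionVertex, and the TaskManagerGateway is obtained from the assigned slot so that submitTask can be called, which submits the Task to the TaskManager over RPC.
After the TaskManager (TaskExecutor) receives the submitTask request, it performs some initialization (such as fetching files from the BlobServer, deserializing the job and task information, and setting up the LibraryCacheManager). The results are used to create a Task (a Runnable object), which is then started.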
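
The state changes in the steps above (CREATED, then SCHEDULED, then DEPLOYING) can be pictured as guarded transitions. The sketch below is a hypothetical model, not the Flink Execution class: ExecutionStateSketch and its methods are illustrative names, and the final RUNNING state is an assumption added to round off the flow once the task has started on the TaskManager.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical state machine; the real Execution has more states and side effects. */
final class ExecutionStateSketch {

    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING }

    private final AtomicReference<State> state = new AtomicReference<>(State.CREATED);

    /** Moves to the next state only if the current state is the expected one. */
    private void transition(State expected, State next) {
        if (!state.compareAndSet(expected, next)) {
            throw new IllegalStateException(
                "Cannot move to " + next + " from " + state.get() + ", expected " + expected);
        }
    }

    /** Called before slot allocation starts. */
    void schedule() {
        transition(State.CREATED, State.SCHEDULED);
    }

    /** Called once a slot is assigned and the task is about to be submitted. */
    void deploy() {
        transition(State.SCHEDULED, State.DEPLOYING);
    }

    /** Assumed final step: the TaskManager has started the task. */
    void markRunning() {
        transition(State.DEPLOYING, State.RUNNING);
    }

    State current() {
        return state.get();
    }

    public static void main(String[] args) {
        ExecutionStateSketch execution = new ExecutionStateSketch();
        execution.schedule();
        execution.deploy();
        execution.markRunning();
        System.out.println("final state: " + execution.current());
    }
}
```

Using compareAndSet makes an out-of-order call, such as scheduling an Execution twice, fail instead of silently overwriting the state.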
