Reading the Flink Source: How the ExecutionGraph Is Generated

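Both the StreamGraph and the JobGraph are generated on the client side. Compared with the StreamGraph, the JobGraph has already been optimized to some extent. When the client submits the JobGraph to the JobManager, the JobManager generates the corresponding ExecutionGraph from it, and the TaskManagers ultimately execute tasks according to the ExecutionGraph.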

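Flink runtime architecture

Before looking at how the ExecutionGraph is generated, it helps to understand Flink's runtime architecture, which is shown below.

It consists mainly of two components: the JobManager and the TaskManager.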

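JobManager

The main responsibilities of the JobManager are:

  • It decides when to schedule the next task (or set of tasks) and reacts to finished tasks or execution failures.
  • It coordinates checkpoints and coordinates recovery from failures.

The JobManager contains three different components that work together.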

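  • ResourceManager

    The ResourceManager allocates and deallocates resources in a Flink cluster by managing task slots, which are the basic unit of resource scheduling in Flink. Flink implements several ResourceManagers for different environments (YARN, Kubernetes). In standalone mode, however, the ResourceManager can only hand out the slots of the TaskManagers that are already available; it cannot start new TaskManagers on its own.

  • Dispatcher

    The Dispatcher provides a REST interface for submitting Flink applications for execution and starts a new JobMaster for each submitted job. It also runs the Flink WebUI, which provides information about job execution.

  • JobMaster

    A JobMaster manages the execution of a single JobGraph. Multiple jobs can run at the same time in a Flink cluster, and each job has its own JobMaster.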

TaskManager

A TaskManager plays the role of a worker (slave) node: it executes tasks, buffers data locally, and exchanges data streams with other TaskManagers.

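The overall job submission process

The flow of a Flink job from submission on the client to execution on the Flink cluster is shown in the figure below.

  1. The client creates the StreamGraph, converts it into a JobGraph, and finally submits the application to the Flink cluster.
  2. In local mode, a MiniCluster is started, and the Dispatcher handles the submitted job.
  3. The Dispatcher starts a JobManagerRunner. The JobManagerRunners take part in leader election; once a leader has been elected, a DefaultScheduler is created. Constructing the DefaultScheduler also runs the constructor of its parent class SchedulerBase, and the ExecutionGraph is created during that step.
  4. After SchedulerBase has generated the ExecutionGraph during initialization, the TaskManagers run the tasks according to the ExecutionGraph.

The call flow of this process in the source code is as follows.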

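Details of ExecutionGraph generation

First, look at the diagram from the official documentation that illustrates how a JobGraph is converted into an ExecutionGraph.

Components of the ExecutionGraph

As the diagram shows, the ExecutionGraph differs considerably from the JobGraph in structure. The components of the ExecutionGraph are described next.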

  • ExecutionJobVertex: the nodes of the ExecutionGraph are ExecutionJobVertex objects, which correspond one-to-one to the JobVertex objects of the JobGraph.
  • ExecutionVertex: an ExecutionJobVertex has a taskVertices field, an array of ExecutionVertex whose length equals the parallelism of the JobVertex. When an ExecutionJobVertex is created, as many ExecutionVertex objects as the parallelism are created as well. At scheduling time, one ExecutionVertex is effectively one task: it is one parallel subtask of the ExecutionJobVertex.
  • IntermediateResult: in the JobGraph, an IntermediateDataSet represents an output stream of a JobVertex, and a JobVertex may have several output streams. Its counterpart in the ExecutionGraph is the IntermediateResult.
  • IntermediateResultPartition: because an ExecutionJobVertex may have several parallel subtasks, each IntermediateResult may have several producers. The output of each producer on a given IntermediateResult corresponds to one IntermediateResultPartition, which represents one output partition of an ExecutionVertex.
  • ExecutionEdge: an ExecutionEdge represents an input of an ExecutionVertex. It connects an ExecutionVertex to an IntermediateResultPartition and thereby establishes the link between the two.
  • Execution: an Execution is one attempt to execute an ExecutionVertex and is uniquely identified by an ExecutionAttemptID. An ExecutionVertex may be executed several times, for example after a failure or when the data of the task has to be recomputed.

From these concepts it follows that:

  • One JobVertex = one ExecutionJobVertex = (parallelism of the JobVertex) × ExecutionVertex
  • One IntermediateDataSet = one IntermediateResult = (parallelism of the JobVertex) × IntermediateResultPartition
  • Each JobVertex may have several IntermediateDataSets, so each ExecutionJobVertex may have several IntermediateResults, and therefore each ExecutionVertex may own several IntermediateResultPartitions.
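
The counting rules above can be checked with a small model. This is only a sketch: `SimpleJobVertex` and the helper methods are hypothetical stand-ins invented for illustration, not Flink classes, and the vertex with parallelism 4 and 2 produced data sets is an assumed example.

```java
// Hypothetical, simplified stand-ins; these are not Flink classes.
class GraphCardinalityDemo {

    /** A JobVertex reduced to the two numbers that matter for counting. */
    static final class SimpleJobVertex {
        final String name;
        final int parallelism;
        final int producedDataSets;

        SimpleJobVertex(String name, int parallelism, int producedDataSets) {
            this.name = name;
            this.parallelism = parallelism;
            this.producedDataSets = producedDataSets;
        }
    }

    /** One ExecutionVertex per parallel subtask. */
    static int executionVertexCount(SimpleJobVertex v) {
        return v.parallelism;
    }

    /** One IntermediateResult per IntermediateDataSet. */
    static int intermediateResultCount(SimpleJobVertex v) {
        return v.producedDataSets;
    }

    /** One IntermediateResultPartition per (data set, subtask) pair. */
    static int partitionCount(SimpleJobVertex v) {
        return v.producedDataSets * v.parallelism;
    }

    /** Each subtask owns one partition of every produced result. */
    static int partitionsPerExecutionVertex(SimpleJobVertex v) {
        return v.producedDataSets;
    }

    public static void main(String[] args) {
        SimpleJobVertex map = new SimpleJobVertex("map", 4, 2);
        System.out.println(map.name + " -> ExecutionVertex: " + executionVertexCount(map));
        System.out.println(map.name + " -> IntermediateResult: " + intermediateResultCount(map));
        System.out.println(map.name + " -> IntermediateResultPartition: " + partitionCount(map));
    }
}
```

With these assumed numbers, the vertex expands into 4 ExecutionVertex objects and 8 IntermediateResultPartition objects, 2 per subtask.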

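Creating the ExecutionGraph through SchedulerBase

As the call-flow diagram shows, DefaultScheduler extends SchedulerBase, so creating a DefaultScheduler also runs the constructor of its parent class SchedulerBase. That constructor calls createAndRestoreExecutionGraph() to start building the ExecutionGraph, as shown in the source below.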

public SchedulerBase(
		......
		// start creating the ExecutionGraph
		this.executionGraph = createAndRestoreExecutionGraph(jobManagerJobMetricGroup, checkNotNull(shuffleMaster), checkNotNull(partitionTracker), checkNotNull(executionDeploymentTracker));
      	......
	}
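
Why constructing the subclass is enough to trigger graph creation can be shown with plain Java constructor chaining. The classes below are hypothetical miniatures, not Flink's SchedulerBase or DefaultScheduler; they only illustrate that the superclass constructor runs first and that a field it assigns is already set when the subclass constructor body executes.

```java
// Hypothetical miniatures; NOT Flink's SchedulerBase / DefaultScheduler.
class SchedulerConstructionDemo {

    static class SimpleExecutionGraph {
        final String jobName;

        SimpleExecutionGraph(String jobName) {
            this.jobName = jobName;
        }
    }

    abstract static class SimpleSchedulerBase {
        protected final SimpleExecutionGraph executionGraph;

        SimpleSchedulerBase(String jobName) {
            // The graph is built while the superclass constructor runs.
            this.executionGraph = createExecutionGraph(jobName);
        }

        private SimpleExecutionGraph createExecutionGraph(String jobName) {
            return new SimpleExecutionGraph(jobName);
        }
    }

    static class SimpleDefaultScheduler extends SimpleSchedulerBase {
        final boolean graphReadyInSubclassConstructor;

        SimpleDefaultScheduler(String jobName) {
            super(jobName); // always executes before the statements below
            this.graphReadyInSubclassConstructor = (executionGraph != null);
        }
    }

    public static void main(String[] args) {
        SimpleDefaultScheduler scheduler = new SimpleDefaultScheduler("wordcount");
        System.out.println("graph ready: " + scheduler.graphReadyInSubclassConstructor);
        System.out.println("job: " + scheduler.executionGraph.jobName);
    }
}
```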

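Next, attachJobGraph() is called to finish building the ExecutionGraph. This involves two steps:

  1. Create the ExecutionJobVertex objects.
  2. Call connectToPredecessors() to create the ExecutionEdges that connect each ExecutionVertex to its IntermediateResultPartitions.

The source of attachJobGraph() is shown below.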

public void attachJobGraph(List<JobVertex> topologiallySorted) throws JobException {

		assertRunningInJobMasterMainThread();

		LOG.debug("Attaching {} topologically sorted vertices to existing job graph with {} " +
				"vertices and {} intermediate results.",
			topologiallySorted.size(),
			tasks.size(),
			intermediateResults.size());

		final ArrayList<ExecutionJobVertex> newExecJobVertices = new ArrayList<>(topologiallySorted.size());
		final long createTimestamp = System.currentTimeMillis();

		for (JobVertex jobVertex : topologiallySorted) {

			if (jobVertex.isInputVertex() && !jobVertex.isStoppable()) {
				this.isStoppable = false;
			}

			// create the execution job vertex and attach it to the graph
			ExecutionJobVertex ejv = new ExecutionJobVertex(
					this,
					jobVertex,
					1,
					maxPriorAttemptsHistoryLength,
					rpcTimeout,
					globalModVersion,
					createTimestamp);

			ejv.connectToPredecessors(this.intermediateResults);

			ExecutionJobVertex previousTask = this.tasks.putIfAbsent(jobVertex.getID(), ejv);
			if (previousTask != null) {
				throw new JobException(String.format("Encountered two job vertices with ID %s : previous=[%s] / new=[%s]",
					jobVertex.getID(), ejv, previousTask));
			}

			for (IntermediateResult res : ejv.getProducedDataSets()) {
				IntermediateResult previousDataSet = this.intermediateResults.putIfAbsent(res.getId(), res);
				if (previousDataSet != null) {
					throw new JobException(String.format("Encountered two intermediate data set with ID %s : previous=[%s] / new=[%s]",
						res.getId(), res, previousDataSet));
				}
			}

			this.verticesInCreationOrder.add(ejv);
			this.numVerticesTotal += ejv.getParallelism();
			newExecJobVertices.add(ejv);
		}
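
The duplicate-ID check in the loop relies on the contract of `Map.putIfAbsent`, which returns the previously stored value, or `null` if the key was absent. A minimal sketch of that pattern follows; the class name, the use of `String` for IDs and vertices, and the `IllegalStateException` are illustrative assumptions, whereas the Flink code above throws JobException.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry that mimics only the duplicate-detection pattern.
class DuplicateVertexCheckDemo {

    private final Map<String, String> tasks = new HashMap<>();

    /** Registers a vertex and rejects a second vertex with the same ID. */
    void register(String vertexId, String vertexName) {
        // putIfAbsent returns the existing value and leaves the map unchanged
        // when the key is already present.
        String previous = tasks.putIfAbsent(vertexId, vertexName);
        if (previous != null) {
            throw new IllegalStateException(String.format(
                    "Encountered two job vertices with ID %s : previous=[%s] / new=[%s]",
                    vertexId, previous, vertexName));
        }
    }

    int size() {
        return tasks.size();
    }

    public static void main(String[] args) {
        DuplicateVertexCheckDemo registry = new DuplicateVertexCheckDemo();
        registry.register("v1", "source");
        registry.register("v2", "map");
        try {
            registry.register("v1", "sink");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("registered vertices: " + registry.size());
    }
}
```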

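Creating the ExecutionJobVertex objects

attachJobGraph() iterates over the list of JobVertex objects and creates an ExecutionJobVertex for each of them. The ExecutionJobVertex constructor mainly does the following:

  1. It creates the IntermediateResult objects from the results of the JobVertex (a list of IntermediateDataSet); each IntermediateDataSet corresponds to one IntermediateResult.
  2. It creates as many ExecutionVertex objects as the parallelism of the JobVertex; at scheduling time each ExecutionVertex is effectively one task.
  3. While creating the IntermediateResult and ExecutionVertex objects, it records the relationships between them.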
public ExecutionJobVertex(
			ExecutionGraph graph,
			JobVertex jobVertex,
			int defaultParallelism,
			int maxPriorAttemptsHistoryLength,
			Time timeout,
			long initialGlobalModVersion,
			long createTimestamp) throws JobException {

		if (graph == null || jobVertex == null) {
			throw new NullPointerException();
		}

		this.graph = graph;
		this.jobVertex = jobVertex;

		int vertexParallelism = jobVertex.getParallelism();
		int numTaskVertices = vertexParallelism > 0 ? vertexParallelism : defaultParallelism;

		final int configuredMaxParallelism = jobVertex.getMaxParallelism();

		this.maxParallelismConfigured = (VALUE_NOT_SET != configuredMaxParallelism);

		// if no max parallelism was configured by the user, we calculate and set a default
		setMaxParallelismInternal(maxParallelismConfigured ?
				configuredMaxParallelism : KeyGroupRangeAssignment.computeDefaultMaxParallelism(numTaskVertices));

		// verify that our parallelism is not higher than the maximum parallelism
		if (numTaskVertices > maxParallelism) {
			throw new JobException(
				String.format("Vertex %s's parallelism (%s) is higher than the max parallelism (%s). Please lower the parallelism or increase the max parallelism.",
					jobVertex.getName(),
					numTaskVertices,
					maxParallelism));
		}

		this.parallelism = numTaskVertices;
		this.resourceProfile = ResourceProfile.fromResourceSpec(jobVertex.getMinResources(), MemorySize.ZERO);
		// taskVertices holds one ExecutionVertex per parallel subtask of this task
		this.taskVertices = new ExecutionVertex[numTaskVertices];

		this.inputs = new ArrayList<>(jobVertex.getInputs().size());

		// take the sharing group
		this.slotSharingGroup = jobVertex.getSlotSharingGroup();
		this.coLocationGroup = jobVertex.getCoLocationGroup();

		// setup the coLocation group
		if (coLocationGroup != null && slotSharingGroup == null) {
			throw new JobException("Vertex uses a co-location constraint without using slot sharing");
		}

		// create the intermediate results
		this.producedDataSets = new IntermediateResult[jobVertex.getNumberOfProducedIntermediateDataSets()];
		// iterate over the IntermediateDataSets of the jobVertex and add a matching IntermediateResult to producedDataSets
		for (int i = 0; i < jobVertex.getProducedDataSets().size(); i++) {
			final IntermediateDataSet result = jobVertex.getProducedDataSets().get(i);

			this.producedDataSets[i] = new IntermediateResult(
					result.getId(),
					this,
					numTaskVertices,
					result.getResultType());
		}

		// create all task vertices
		for (int i = 0; i < numTaskVertices; i++) {
			ExecutionVertex vertex = new ExecutionVertex(
					this,
					i,
					producedDataSets,
					timeout,
					initialGlobalModVersion,
					createTimestamp,
					maxPriorAttemptsHistoryLength);

			this.taskVertices[i] = vertex;
		}

		// sanity check for the double referencing between intermediate result partitions and execution vertices
		for (IntermediateResult ir : this.producedDataSets) {
			if (ir.getNumberOfAssignedPartitions() != parallelism) {
				throw new RuntimeException("The intermediate result's partitions were not correctly assigned.");
			}
		}

		final List<SerializedValue<OperatorCoordinator.Provider>> coordinatorProviders = getJobVertex().getOperatorCoordinators();
		if (coordinatorProviders.isEmpty()) {
			this.operatorCoordinators = Collections.emptyList();
		} else {
			final ArrayList<OperatorCoordinatorHolder> coordinators = new ArrayList<>(coordinatorProviders.size());
			try {
				for (final SerializedValue<OperatorCoordinator.Provider> provider : coordinatorProviders) {
					coordinators.add(OperatorCoordinatorHolder.create(provider, this, graph.getUserClassLoader()));
				}
			} catch (Exception | LinkageError e) {
				IOUtils.closeAllQuietly(coordinators);
				throw new JobException("Cannot instantiate the coordinator for operator " + getName(), e);
			}
			this.operatorCoordinators = Collections.unmodifiableList(coordinators);
		}

		// set up the input splits, if the vertex has any
		try {
			@SuppressWarnings("unchecked")
			InputSplitSource<InputSplit> splitSource = (InputSplitSource<InputSplit>) jobVertex.getInputSplitSource();

			if (splitSource != null) {
				Thread currentThread = Thread.currentThread();
				ClassLoader oldContextClassLoader = currentThread.getContextClassLoader();
				currentThread.setContextClassLoader(graph.getUserClassLoader());
				try {
					inputSplits = splitSource.createInputSplits(numTaskVertices);

					if (inputSplits != null) {
						splitAssigner = splitSource.getInputSplitAssigner(inputSplits);
					}
				} finally {
					currentThread.setContextClassLoader(oldContextClassLoader);
				}
			}
			else {
				inputSplits = null;
			}
		}
		catch (Throwable t) {
			throw new JobException("Creating the input splits caused an error: " + t.getMessage(), t);
		}
	}
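
The parallelism handling at the top of this constructor can be isolated into a few lines. The sketch below is a simplified model with assumptions: `FALLBACK_MAX_PARALLELISM` is a made-up placeholder (the constructor above derives the default from the parallelism via KeyGroupRangeAssignment.computeDefaultMaxParallelism), and `IllegalArgumentException` stands in for JobException.

```java
// Simplified model of the parallelism checks; not Flink code.
class ParallelismResolutionDemo {

    static final int VALUE_NOT_SET = -1;

    // Placeholder only: Flink computes the default from the parallelism instead.
    static final int FALLBACK_MAX_PARALLELISM = 128;

    /** Uses the vertex's own parallelism if it is positive, otherwise the default. */
    static int resolveParallelism(int vertexParallelism, int defaultParallelism) {
        return vertexParallelism > 0 ? vertexParallelism : defaultParallelism;
    }

    /** Uses the configured maximum if one was set, otherwise the placeholder. */
    static int resolveMaxParallelism(int configuredMax) {
        return configuredMax != VALUE_NOT_SET ? configuredMax : FALLBACK_MAX_PARALLELISM;
    }

    /** Returns the effective parallelism or fails if it exceeds the maximum. */
    static int validate(int vertexParallelism, int defaultParallelism, int configuredMax) {
        int parallelism = resolveParallelism(vertexParallelism, defaultParallelism);
        int maxParallelism = resolveMaxParallelism(configuredMax);
        if (parallelism > maxParallelism) {
            throw new IllegalArgumentException(String.format(
                    "parallelism (%d) is higher than the max parallelism (%d)",
                    parallelism, maxParallelism));
        }
        return parallelism;
    }

    public static void main(String[] args) {
        System.out.println(validate(-1, 1, VALUE_NOT_SET)); // falls back to the default
        System.out.println(validate(4, 1, 8));              // explicit parallelism is kept
        try {
            validate(16, 1, 8);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that attachJobGraph() passes 1 as defaultParallelism, so a JobVertex without an explicit parallelism ends up with a single ExecutionVertex.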

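Creating the ExecutionVertex objects

The ExecutionVertex objects are created while the ExecutionJobVertex is being constructed. The ExecutionVertex constructor does the following:

  1. Based on the producedDataSets of the ExecutionJobVertex (an array of IntermediateResult), it creates one IntermediateResultPartition per IntermediateResult for this ExecutionVertex; each one represents a partition of that IntermediateResult.
  2. It calls setPartition() on the IntermediateResult to record the relationship between the IntermediateResult and the IntermediateResultPartition.
  3. It creates an Execution object for this ExecutionVertex. If the ExecutionVertex is scheduled again (for example when recovering from a failure), the attemptNumber of the new Execution is incremented by 1; at initialization its value is 0.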
public ExecutionVertex(
			ExecutionJobVertex jobVertex,
			int subTaskIndex,
			IntermediateResult[] producedDataSets,
			Time timeout,
			long initialGlobalModVersion,
			long createTimestamp,
			int maxPriorExecutionHistoryLength) {

		this.jobVertex = jobVertex;
		this.subTaskIndex = subTaskIndex;
		this.executionVertexId = new ExecutionVertexID(jobVertex.getJobVertexId(), subTaskIndex);
		this.taskNameWithSubtask = String.format("%s (%d/%d)",
				jobVertex.getJobVertex().getName(), subTaskIndex + 1, jobVertex.getParallelism());

		this.resultPartitions = new LinkedHashMap<>(producedDataSets.length, 1);
		// iterate over the producedDataSets of the ExecutionJobVertex; this ExecutionVertex gets one partition
		// per IntermediateResult, so the count equals the number of produced data sets
		for (IntermediateResult result : producedDataSets) {
			// an ExecutionVertex is one subtask; it owns the partition with its own subTaskIndex in each IntermediateResult
			IntermediateResultPartition irp = new IntermediateResultPartition(result, this, subTaskIndex);
			result.setPartition(subTaskIndex, irp);

			resultPartitions.put(irp.getPartitionId(), irp);
		}

		this.inputEdges = new ExecutionEdge[jobVertex.getJobVertex().getInputs().size()][];

		this.priorExecutions = new EvictingBoundedList<>(maxPriorExecutionHistoryLength);

		this.currentExecution = new Execution(
			getExecutionGraph().getFutureExecutor(),
			this,
			0,
			initialGlobalModVersion,
			createTimestamp,
			timeout);

		// create a co-location scheduling hint, if necessary
		CoLocationGroup clg = jobVertex.getCoLocationGroup();
		if (clg != null) {
			this.locationConstraint = clg.getLocationConstraint(subTaskIndex);
		}
		else {
			this.locationConstraint = null;
		}

		getExecutionGraph().registerExecution(currentExecution);

		this.timeout = timeout;
		this.inputSplits = new ArrayList<>();
	}
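
The double bookkeeping between results and vertices, which the sanity check in the ExecutionJobVertex constructor verifies, can be modelled in isolation. `SimpleResult`, `SimplePartition` and `SimpleVertex` below are hypothetical reductions of the corresponding Flink classes, kept only to show that each subtask registers exactly one partition in every result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reductions of IntermediateResult / IntermediateResultPartition / ExecutionVertex.
class PartitionWiringDemo {

    static class SimpleResult {
        final String id;
        final SimplePartition[] partitions;
        int assignedPartitions;

        SimpleResult(String id, int producerParallelism) {
            this.id = id;
            this.partitions = new SimplePartition[producerParallelism];
        }

        /** Records the partition produced by the subtask with the given index. */
        void setPartition(int subtaskIndex, SimplePartition partition) {
            if (partitions[subtaskIndex] != null) {
                throw new IllegalStateException("partition " + subtaskIndex + " already assigned");
            }
            partitions[subtaskIndex] = partition;
            assignedPartitions++;
        }
    }

    static class SimplePartition {
        final SimpleResult result;
        final int subtaskIndex;

        SimplePartition(SimpleResult result, int subtaskIndex) {
            this.result = result;
            this.subtaskIndex = subtaskIndex;
        }
    }

    static class SimpleVertex {
        final int subtaskIndex;
        final List<SimplePartition> resultPartitions = new ArrayList<>();

        SimpleVertex(int subtaskIndex, SimpleResult[] producedDataSets) {
            this.subtaskIndex = subtaskIndex;
            // One partition per produced result, registered on both sides.
            for (SimpleResult result : producedDataSets) {
                SimplePartition partition = new SimplePartition(result, subtaskIndex);
                result.setPartition(subtaskIndex, partition);
                resultPartitions.add(partition);
            }
        }
    }

    /** Creates one vertex per subtask, as the job-vertex constructor does. */
    static SimpleVertex[] createVertices(int parallelism, SimpleResult[] producedDataSets) {
        SimpleVertex[] vertices = new SimpleVertex[parallelism];
        for (int i = 0; i < parallelism; i++) {
            vertices[i] = new SimpleVertex(i, producedDataSets);
        }
        return vertices;
    }

    public static void main(String[] args) {
        int parallelism = 3;
        SimpleResult[] results = {
            new SimpleResult("r0", parallelism), new SimpleResult("r1", parallelism)
        };
        SimpleVertex[] vertices = createVertices(parallelism, results);
        System.out.println("partitions per vertex: " + vertices[0].resultPartitions.size());
        System.out.println("partitions assigned in r0: " + results[0].assignedPartitions);
    }
}
```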

Creating the ExecutionEdges

After an ExecutionJobVertex has been created, connectToPredecessors() is called on it to create the ExecutionEdges. The source is as follows.

public void connectToPredecessors(Map<IntermediateDataSetID, IntermediateResult> intermediateDataSets) throws JobException {

		List<JobEdge> inputs = jobVertex.getInputs();

		if (LOG.isDebugEnabled()) {
			LOG.debug(String.format("Connecting ExecutionJobVertex %s (%s) to %d predecessors.", jobVertex.getID(), jobVertex.getName(), inputs.size()));
		}

		for (int num = 0; num < inputs.size(); num++) {
			JobEdge edge = inputs.get(num);

			if (LOG.isDebugEnabled()) {
				if (edge.getSource() == null) {
					LOG.debug(String.format("Connecting input %d of vertex %s (%s) to intermediate result referenced via ID %s.",
							num, jobVertex.getID(), jobVertex.getName(), edge.getSourceId()));
				} else {
					LOG.debug(String.format("Connecting input %d of vertex %s (%s) to intermediate result referenced via predecessor %s (%s).",
							num, jobVertex.getID(), jobVertex.getName(), edge.getSource().getProducer().getID(), edge.getSource().getProducer().getName()));
				}
			}

			// fetch the intermediate result via ID. if it does not exist, then it either has not been created, or the order
			// in which this method is called for the job vertices is not a topological order
			IntermediateResult ires = intermediateDataSets.get(edge.getSourceId());
			if (ires == null) {
				throw new JobException("Cannot connect this job graph to the previous graph. No previous intermediate result found for ID "
						+ edge.getSourceId());
			}

			this.inputs.add(ires);

			int consumerIndex = ires.registerConsumer();

			for (int i = 0; i < parallelism; i++) {
				ExecutionVertex ev = taskVertices[i];
				ev.connectSource(num, ires, edge, consumerIndex);
			}
		}
	}

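For each JobEdge of the ExecutionJobVertex, connectSource() is called once per parallel subtask (ExecutionVertex) to connect that ExecutionVertex to the upstream node. There are two DistributionPattern values:

  • ALL_TO_ALL: every producing subtask is connected to every subtask of the consuming task (a full, many-to-many connection).
  • POINTWISE: each producing subtask is connected to one or more subtasks of the consuming task, depending on the parallelism of the two sides.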
public void connectSource(int inputNumber, IntermediateResult source, JobEdge edge, int consumerNumber) {

		final DistributionPattern pattern = edge.getDistributionPattern();
		final IntermediateResultPartition[] sourcePartitions = source.getPartitions();

		ExecutionEdge[] edges;

		// the pattern is POINTWISE only for forward/rescale partitioning; in all other cases it is ALL_TO_ALL
		switch (pattern) {
			case POINTWISE:
				edges = connectPointwise(sourcePartitions, inputNumber);
				break;

			case ALL_TO_ALL:
				edges = connectAllToAll(sourcePartitions, inputNumber);
				break;

			default:
				throw new RuntimeException("Unrecognized distribution pattern.");

		}

		inputEdges[inputNumber] = edges;

		// add the consumers to the source
		// for now (until the receiver initiated handshake is in place), we need to register the
		// edges as the execution graph
		for (ExecutionEdge ee : edges) {
			ee.getSource().addConsumer(ee, consumerNumber);
		}
	}
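
The difference between the two patterns is easiest to see by listing which source partitions a given consumer subtask reads. The sketch below is an illustration under assumptions, not Flink's connectPointwise() or connectAllToAll(): the POINTWISE branch uses a simple proportional index mapping chosen for this example, which may distribute sources differently from Flink when the two parallelisms do not divide evenly.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; the index mapping is an assumption, not Flink's algorithm.
class DistributionPatternDemo {

    enum Pattern { ALL_TO_ALL, POINTWISE }

    /** Returns the indices of the source partitions read by one consumer subtask. */
    static List<Integer> sourcesFor(
            Pattern pattern, int consumerIndex, int consumerParallelism, int numSources) {
        List<Integer> sources = new ArrayList<>();
        switch (pattern) {
            case ALL_TO_ALL:
                // Every consumer reads every source partition.
                for (int s = 0; s < numSources; s++) {
                    sources.add(s);
                }
                break;
            case POINTWISE:
                // Each consumer reads a contiguous slice; at least one partition.
                int start = consumerIndex * numSources / consumerParallelism;
                int end = (consumerIndex + 1) * numSources / consumerParallelism;
                if (end <= start) {
                    end = start + 1;
                }
                for (int s = start; s < end; s++) {
                    sources.add(s);
                }
                break;
            default:
                throw new IllegalArgumentException("Unrecognized distribution pattern.");
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(sourcesFor(Pattern.ALL_TO_ALL, 0, 2, 3));
        System.out.println(sourcesFor(Pattern.POINTWISE, 1, 2, 4));
        System.out.println(sourcesFor(Pattern.POINTWISE, 3, 4, 2));
    }
}
```

In this model the number of ExecutionEdges per consumer subtask equals the number of source partitions for ALL_TO_ALL, while for POINTWISE it stays close to the ratio of the two parallelisms.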

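Once the ExecutionEdges have been created, the construction of the ExecutionGraph is complete.

Submitting the ExecutionGraph and the task execution flow afterwards

After the ExecutionGraph has been generated, job scheduling starts based on it. The scheduling call flow in the source is shown in the figure below.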

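  1. Because this is a streaming job, the Eager scheduling mode is chosen (batch jobs use the lazy mode).
  2. Before deployment, the scheduler checks that the Execution of each ExecutionVertex is in state CREATED. If so, the state of each Execution to be deployed is changed to SCHEDULED, and slot allocation for the ExecutionVertex begins.
  3. The ExecutionVertex objects are deployed asynchronously, one by one; slots are assigned according to the slot provider strategy in use.
  4. Slot allocation is first attempted in the SlotPool of the JobMaster: all slots in the SlotPool are fetched and the most suitable one is selected. There are two selection strategies, one that prefers location and one that prefers previously allocated slots. If the SlotPool cannot satisfy the request, a slot is requested from the ResourceManager via RPC. If the ResourceManager is not connected yet, the request is buffered and sent once the connection is established.
  5. When the ResourceManager receives a slot request, it throws an exception if the requesting JobManager is not registered; otherwise it forwards the request to the SlotManager. The SlotManager keeps track of all free slots in the cluster (TaskManagers report their information to the ResourceManager, where the SlotManager stores the mapping between slots and TaskManagers). It picks a matching slot and then sends an RPC request to the TaskManager to allocate it.
  6. After all slot requests have completed, the Execution of each ExecutionVertex is assigned to its slot, that is, the resources of the slot are given to the Execution. Once the assignment is done, deployment can start.
  7. Every time an ExecutionVertex is scheduled there is a corresponding Execution. At this stage the state of the Execution is changed to DEPLOYING, a deployment descriptor is generated for the ExecutionVertex, and the TaskManagerGateway is obtained from the assigned slot so that the task can be submitted to the right TaskManager.
  8. submitTask is called, which submits the task to the TaskManager via RPC.
  9. After receiving the request, the TaskManager (TaskExecutor) performs some initialization (for example fetching files from the BlobServer, deserializing the job and task information, and setting up the LibraryCacheManager). The resulting information is used to create the Task (a Runnable object), which is then started.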
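
The state changes named in these steps (CREATED, then SCHEDULED, then DEPLOYING) can be summarised as a small guarded state machine. This is a hypothetical model, not Flink's Execution class: the RUNNING state and the method names are assumptions added to round off the example, and the many other states and failure paths of a real Execution are left out.

```java
// Hypothetical model of the state changes described above; not Flink's Execution.
class ExecutionStateDemo {

    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING }

    static class SimpleExecution {
        private State state = State.CREATED;

        State getState() {
            return state;
        }

        /** Moves to the next state only if the current state is the expected one. */
        private void transition(State expected, State next) {
            if (state != expected) {
                throw new IllegalStateException(
                        "Cannot go to " + next + " from " + state + ", expected " + expected);
            }
            state = next;
        }

        /** Step 2: only an Execution in CREATED may be scheduled. */
        void schedule() {
            transition(State.CREATED, State.SCHEDULED);
        }

        /** Step 7: deployment starts once a slot has been assigned. */
        void deploy() {
            transition(State.SCHEDULED, State.DEPLOYING);
        }

        /** Assumed final step for this sketch: the task has started on the TaskManager. */
        void markRunning() {
            transition(State.DEPLOYING, State.RUNNING);
        }
    }

    public static void main(String[] args) {
        SimpleExecution execution = new SimpleExecution();
        execution.schedule();
        execution.deploy();
        execution.markRunning();
        System.out.println("final state: " + execution.getState());
    }
}
```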
