Flink Source Code - Generating the Execution

In the previous section we saw that, after the JobMaster starts, it converts the JobGraph into an ExecutionGraph, passes the checkpoint-related configuration to the ExecutionGraph, and creates the CheckpointCoordinator. Let's continue from where we left off.

1.start

After the JobMaster is started and leader election completes, its start method is called; start in turn calls startJobExecution, which kicks off execution of the job.
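For reference, in the Flink version these snippets appear to come from (around 1.12), JobMaster#start is roughly the following; treat it as a sketch, the exact body may differ slightly across versions:

public CompletableFuture<Acknowledge> start(final JobMasterId newJobMasterId) throws Exception {
		// make sure we receive RPC and async calls
		start();

		// run startJobExecution in the main thread, without fencing, since the fencing token is set inside
		return callAsyncWithoutFencing(() -> startJobExecution(newJobMasterId), RpcUtils.INF_TIMEOUT);
	}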

2.startJobExecution

The startJobExecution method then calls two of its own methods:

        startJobMasterServices: this is where the JobMaster's services are actually started

        resetAndStartScheduler: resets and starts the scheduler

private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {

		validateRunsInMainThread();

		checkNotNull(newJobMasterId, "The new JobMasterId must not be null.");

		if (Objects.equals(getFencingToken(), newJobMasterId)) {
			log.info("Already started the job execution with JobMasterId {}.", newJobMasterId);

			return Acknowledge.get();
		}

		setNewFencingToken(newJobMasterId);

		/*TODO actually start the JobMaster services*/
		startJobMasterServices();

		log.info("Starting execution of job {} ({}) under job master id {}.", jobGraph.getName(), jobGraph.getJobID(), newJobMasterId);

		/*TODO reset and start the scheduler*/
		resetAndStartScheduler();

		return Acknowledge.get();
	}

3.startJobMasterServices

This method does the following:

        1. Start the heartbeat services towards the TaskManagers and the ResourceManager

        2. Start the SlotPool. The SlotPool is the component on the JobMaster side that manages slot resources and keeps track of the slots held by this job. When the slots are not sufficient, the SlotPool requests more from the ResourceManager (Flink's own ResourceManager, not YARN's). On receiving a slot request from the SlotPool, the ResourceManager in turn asks the TaskManagers for free slots.

        3. Establish the connection to the ResourceManager, i.e. the same ResourceManager described in point 2.

private void startJobMasterServices() throws Exception {
		/*TODO start the heartbeat services: taskmanager, resourcemanager*/
		startHeartbeatServices();

		// start the slot pool make sure the slot pool now accepts messages for this leader
		/*TODO start the slotpool*/
		slotPool.start(getFencingToken(), getAddress(), getMainThreadExecutor());

		//TODO: Remove once the ZooKeeperLeaderRetrieval returns the stored address upon start
		// try to reconnect to previously known leader
		reconnectToResourceManager(new FlinkException("Starting JobMaster component."));

		// job is ready to go, try to establish connection with resource manager
		//   - activate leader retrieval for the resource manager
		//   - on notification of the leader, the connection will be established and
		//     the slot pool will start requesting slots
		/*TODO establish the connection with the ResourceManager; the slotpool then starts requesting slots*/
		resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
	}

4.resetAndStartScheduler

This method makes sure a scheduler is created and assigned for the job to be scheduled, and then calls startScheduling to begin scheduling.

private void resetAndStartScheduler() throws Exception {
		validateRunsInMainThread();

		final CompletableFuture<Void> schedulerAssignedFuture;

		if (schedulerNG.requestJobStatus() == JobStatus.CREATED) {
			schedulerAssignedFuture = CompletableFuture.completedFuture(null);
			schedulerNG.setMainThreadExecutor(getMainThreadExecutor());
		} else {
			suspendAndClearSchedulerFields(new FlinkException("ExecutionGraph is being reset in order to be rescheduled."));
			final JobManagerJobMetricGroup newJobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);

            //create the scheduler and assign it to this job
			final SchedulerNG newScheduler = createScheduler(executionDeploymentTracker, newJobManagerJobMetricGroup);

			schedulerAssignedFuture = schedulerNG.getTerminationFuture().handle(
				(ignored, throwable) -> {
					newScheduler.setMainThreadExecutor(getMainThreadExecutor());
					assignScheduler(newScheduler, newJobManagerJobMetricGroup);
					return null;
				}
			);
		}
    
		//once the scheduler has been assigned, call startScheduling asynchronously to schedule the job
		FutureUtils.assertNoException(schedulerAssignedFuture.thenRun(this::startScheduling));
	}

In JobMaster#startScheduling, a job status listener is registered for the job, and then the scheduler's startScheduling method is called to start scheduling.
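As a rough sketch (again based on Flink around 1.12; details may vary between versions), JobMaster#startScheduling looks like this:

private void startScheduling() {
		checkState(jobStatusListener == null);
		// register self as job status change listener
		jobStatusListener = new JobManagerJobStatusListener();
		schedulerNG.registerJobStatusListener(jobStatusListener);

		// hand over to the scheduler (SchedulerNG)
		schedulerNG.startScheduling();
	}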

In the scheduler's startScheduling method, it first asserts that the current thread is the main thread, registers the job metrics, starts the operator coordinators, and then calls startSchedulingInternal to begin scheduling.
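That method lives in SchedulerBase; roughly (sketch based on the same Flink version as the snippets above):

@Override
	public final void startScheduling() {
		// scheduling must happen on the JobMaster main thread
		mainThreadExecutor.assertRunningInMainThread();
		registerJobMetrics();
		startAllOperatorCoordinators();
		startSchedulingInternal();
	}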

 

Then, in startSchedulingInternal, prepareExecutionGraphForNgScheduling is called to set a few properties on the ExecutionGraph, for example transitioning the job status from CREATED to RUNNING, and after that schedulingStrategy.startScheduling is called to begin scheduling. The scheduling strategy has several implementations; here we follow PipelinedRegionSchedulingStrategy, which is the default scheduling strategy for streaming jobs in recent versions.
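DefaultScheduler#startSchedulingInternal is short; roughly (sketch, same version caveat as above):

@Override
	protected void startSchedulingInternal() {
		log.info("Starting scheduling with scheduling strategy [{}]", schedulingStrategy.getClass().getName());
		// e.g. transition the job status from CREATED to RUNNING
		prepareExecutionGraphForNgScheduling();
		// delegate to the configured SchedulingStrategy, here PipelinedRegionSchedulingStrategy
		schedulingStrategy.startScheduling();
	}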

 

5.startScheduling

The schedulingTopology is the topology built from the ExecutionVertices. First all pipelined regions are obtained, then the source regions are collected, and maybeScheduleRegions is called to schedule them.
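Roughly, PipelinedRegionSchedulingStrategy#startScheduling picks out the regions that consume no upstream results (i.e. the source regions) and hands them to maybeScheduleRegions (sketch based on the Flink sources around 1.12):

@Override
	public void startScheduling() {
		// source regions are the pipelined regions without any consumed results
		final Set<SchedulingPipelinedRegion> sourceRegions = IterableUtils
			.toStream(schedulingTopology.getAllPipelinedRegions())
			.filter(region -> !region.getConsumedResults().iterator().hasNext())
			.collect(Collectors.toSet());
		maybeScheduleRegions(sourceRegions);
	}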

6.maybeScheduleRegions

The source regions are sorted in topological order, and then each region in the set is iterated over, calling maybeScheduleRegion to schedule it:

private void maybeScheduleRegions(final Set<SchedulingPipelinedRegion> regions) {
		final List<SchedulingPipelinedRegion> regionsSorted =
			SchedulingStrategyUtils.sortPipelinedRegionsInTopologicalOrder(schedulingTopology, regions);
		for (SchedulingPipelinedRegion region : regionsSorted) {
			maybeScheduleRegion(region);
		}
	}

7.maybeScheduleRegion

This is where slots are obtained and the corresponding ExecutionVertices are prepared for deployment:

	private void maybeScheduleRegion(final SchedulingPipelinedRegion region) {
		if (!areRegionInputsAllConsumable(region)) {
			return;
		}

		checkState(areRegionVerticesAllInCreatedState(region), "BUG: trying to schedule a region which is not in CREATED state");

		final List<ExecutionVertexDeploymentOption> vertexDeploymentOptions =
			SchedulingStrategyUtils.createExecutionVertexDeploymentOptions(
				regionVerticesSorted.get(region),
				id -> deploymentOption);
		schedulerOperations.allocateSlotsAndDeploy(vertexDeploymentOptions);
	}

Below is the code path for allocating resources. It is fairly involved, so I will only list the call chain briefly; interested readers can dig into it themselves:

schedulerOperations # allocateSlotsAndDeploy

defaultScheduler # allocateSlotsAndDeploy

defaultScheduler # waitForAllSlotsAndDeploy

defaultScheduler # deployAll

defaultScheduler # deployOrHandleError

defaultScheduler # deployTaskSafe

DefaultExecutionVertexOperations # deploy

ExecutionVertex # deploy

Execution # deploy
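The last two hops of the chain are thin delegations that simply forward to the current execution attempt; roughly (sketch, same Flink version caveat as above):

// DefaultExecutionVertexOperations
	@Override
	public void deploy(final ExecutionVertex executionVertex) throws JobException {
		executionVertex.deploy();
	}

	// ExecutionVertex: delegate to the current execution attempt
	public void deploy() throws JobException {
		currentExecution.deploy();
	}

Execution#deploy, which does the real work, is shown below: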

public void deploy() throws JobException {
		assertRunningInJobMasterMainThread();

        //the slot that was previously assigned to this execution via tryAssignResource
		final LogicalSlot slot  = assignedResource;

		checkNotNull(slot, "In order to deploy the execution we first have to assign a resource via tryAssignResource.");

		// Check if the TaskManager died in the meantime
		// This only speeds up the response to TaskManagers failing concurrently to deployments.
		// The more general check is the rpcTimeout of the deployment call
		if (!slot.isAlive()) {
			throw new JobException("Target slot (TaskManager) for deployment is no longer alive.");
		}

		// make sure exactly one deployment call happens from the correct state
		// note: the transition from CREATED to DEPLOYING is for testing purposes only
		ExecutionState previous = this.state;
        
        //transition the execution's state to DEPLOYING
		if (previous == SCHEDULED || previous == CREATED) {
			if (!transitionState(previous, DEPLOYING)) {
				// race condition, someone else beat us to the deploying call.
				// this should actually not happen and indicates a race somewhere else
				throw new IllegalStateException("Cannot deploy task: Concurrent deployment call race.");
			}
		}
		else {
			// vertex may have been cancelled, or it was already scheduled
			throw new IllegalStateException("The vertex must be in CREATED or SCHEDULED state to be deployed. Found state " + previous);
		}

		if (this != slot.getPayload()) {
			throw new IllegalStateException(
				String.format("The execution %s has not been assigned to the assigned slot.", this));
		}

		try {

			// race double check, did we fail/cancel and do we need to release the slot?
			if (this.state != DEPLOYING) {
				slot.releaseSlot(new FlinkException("Actual state of execution " + this + " (" + state + ") does not match expected state DEPLOYING."));
				return;
			}

			LOG.info("Deploying {} (attempt #{}) with attempt id {} to {} with allocation id {}", vertex.getTaskNameWithSubtaskIndex(),
				attemptNumber, vertex.getCurrentExecutionAttempt().getAttemptId(), getAssignedResourceLocation(), slot.getAllocationId());

			if (taskRestore != null) {
				checkState(taskRestore.getTaskStateSnapshot().getSubtaskStateMappings().stream().allMatch(entry ->
					entry.getValue().getInputRescalingDescriptor().equals(InflightDataRescalingDescriptor.NO_RESCALE) &&
					entry.getValue().getOutputRescalingDescriptor().equals(InflightDataRescalingDescriptor.NO_RESCALE)),
					"Rescaling from unaligned checkpoint is not yet supported.");
			}

			// turn the IntermediateResultPartitions into ResultPartition descriptors
			// turn the ExecutionEdges into InputChannelDeploymentDescriptors (converted into InputGates at runtime)
			final TaskDeploymentDescriptor deployment = TaskDeploymentDescriptorFactory
				.fromExecutionVertex(vertex, attemptNumber)
				.createDeploymentDescriptor(
					slot.getAllocationId(),
					slot.getPhysicalSlotNumber(),
					taskRestore,
					producedPartitions.values());

			// null taskRestore to let it be GC'ed
			taskRestore = null;

			final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();

			final ComponentMainThreadExecutor jobMasterMainThreadExecutor =
				vertex.getExecutionGraph().getJobMasterMainThreadExecutor();

			getVertex().notifyPendingDeployment(this);
			// We run the submission in the future executor so that the serialization of large TDDs does not block
			// the main thread and sync back to the main thread once submission is completed.
            //submit the corresponding task to the TaskManager
			CompletableFuture.supplyAsync(() -> taskManagerGateway.submitTask(deployment, rpcTimeout), executor)
				.thenCompose(Function.identity())
				.whenCompleteAsync(
					(ack, failure) -> {
						if (failure == null) {
							vertex.notifyCompletedDeployment(this);
						} else {
							if (failure instanceof TimeoutException) {
								String taskname = vertex.getTaskNameWithSubtaskIndex() + " (" + attemptId + ')';

								markFailed(new Exception(
									"Cannot deploy task " + taskname + " - TaskManager (" + getAssignedResourceLocation()
										+ ") not responding after a rpcTimeout of " + rpcTimeout, failure));
							} else {
								markFailed(failure);
							}
						}
					},
					jobMasterMainThreadExecutor);

		}
		catch (Throwable t) {
			markFailed(t);

			if (isLegacyScheduling()) {
				ExceptionUtils.rethrow(t);
			}
		}
	}

At this point the job has been deployed. The Executions here are what we usually call the physical execution graph; what remains is submitting the corresponding tasks.
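The submitTask call in Execution#deploy goes through the TaskManagerGateway, which is a thin RPC wrapper around the TaskExecutorGateway; roughly (RpcTaskManagerGateway, sketch based on the Flink sources around 1.12):

@Override
	public CompletableFuture<Acknowledge> submitTask(TaskDeploymentDescriptor tdd, Time timeout) {
		// RPC call to the TaskExecutor that owns the allocated slot
		return taskExecutorGateway.submitTask(tdd, jobMasterId, timeout);
	}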
