Flink 1.12 Source Code Reading: Physical Execution Graph (Part 1), Building the TaskManager Instance

Apache Flink, one of the most popular big-data compute engines in China, natively supports high throughput, low latency, exactly-once semantics, stateful streams, and more; reading its source code helps deepen your understanding of the framework.

Flink has four kinds of execution graphs; this chapter covers the first half of how the physical execution plan is generated.

This chapter and the following ones assume a production-style setup: run mode: on YARN; HA mode: ZooKeeper; execution mode: Streaming. Explanations of the key logic are annotated as comments directly in the code; don't skip them.

The components and frameworks touched in this chapter depend on parts of the big-data ecosystem, including YARN, Akka, and ZooKeeper. Some familiarity with these will make the walkthrough easier to follow.

 

Since the whole construction process is fairly complex, I'll split it across two chapters. This one covers how the TaskManager instance is built up to the point just before Flink submits the TaskDeploymentDescriptor to the TM, that is, after slot deployment. For the external dependencies used along the way, such as Akka-based RPC, ZooKeeper-based HA leader election, and the YARN mechanics, see my previous chapter on the ExecutionGraph construction process: https://blog.csdn.net/ws0owws0ow/article/details/113991593?spm=1001.2014.3001.5501. Flink's memory management model and the Netty communication involved when the TM is finally created will be covered in later chapters.

Picking up from the previous chapter: once the SlotPool, Scheduler, ExecutionGraph, and other services have been created, JobManager initialization is complete and leader election begins.

 

//Create the JM. The JM is one of Flink's most central components: it maintains the SlotPool, which handles internal slot scheduling, and the Scheduler, which maintains the CheckpointCoordinator and builds the ExecutionGraph.
	CompletableFuture<JobManagerRunner> createJobManagerRunner(JobGraph jobGraph, long initializationTimestamp) {
		//Get the clusterEntrypoint's rpcService (actorSystem), used to create the JobManagerRunner's RPC endpoint
		final RpcService rpcService = getRpcService();
		//Create the JM -> create the SchedulerNG -> build the ExecutionGraph -> build the ExecutionJobVertexes
		//-> build the ExecutionVertexes and IntermediateResults/partitions -> create the CheckpointCoordinator
		//-> JM leader election -> start the SchedulerNG to dispatch tasks and checkpoints
		return CompletableFuture.supplyAsync(
			() -> {
				try {
					JobManagerRunner runner = jobManagerRunnerFactory.createJobManagerRunner(
						jobGraph,
						configuration,
						rpcService,
						highAvailabilityServices,
						heartbeatServices,
						jobManagerSharedServices,
						new DefaultJobManagerJobMetricGroupFactory(jobManagerMetricGroup),
						fatalErrorHandler,
						initializationTimestamp);
					//Begin electing the JM leader
					runner.start();
					return runner;
				} catch (Exception e) {
					throw new CompletionException(new JobInitializationException(jobGraph.getJobID(), "Could not instantiate JobManager.", e));
				}
			},
			ioExecutor); // do not use main thread executor. Otherwise, Dispatcher is blocked on JobManager creation
	}

public void grantLeadership(final UUID leaderSessionID) {
		synchronized (lock) {
			if (shutdown) {
				log.debug("JobManagerRunner cannot be granted leadership because it is already shut down.");
				return;
			}

			leadershipOperation = leadershipOperation.thenCompose(
				(ignored) -> {
					synchronized (lock) {//about to start the JM
						return verifyJobSchedulingStatusAndStartJobManager(leaderSessionID);
					}
				});

			handleException(leadershipOperation, "Could not start the job manager.");
		}
	}
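
Under the hood, the leadership grant comes from the ZK-based leader election service. As a rough illustration of the mechanism, here is a minimal standalone sketch using Curator's LeaderLatch recipe, which Flink's ZooKeeperLeaderElectionService builds on (the ZK address and latch path are made-up values, and the listener bodies only mark where Flink's callbacks would fire):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class JmLeaderElectionSketch {
	public static void main(String[] args) throws Exception {
		CuratorFramework client = CuratorFrameworkFactory.newClient(
			"zk-host:2181", new ExponentialBackoffRetry(1000, 3));
		client.start();

		// All contenders create a latch under the same path; ZK grants
		// leadership to exactly one of them.
		LeaderLatch latch = new LeaderLatch(client, "/leader/jobmanager");
		latch.addListener(new LeaderLatchListener() {
			@Override
			public void isLeader() {
				// This is where grantLeadership(leaderSessionID) above fires.
			}

			@Override
			public void notLeader() {
				// Leadership revoked: the JM would suspend the job here.
			}
		});
		latch.start();
	}
}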

Once leadership is granted, the JM starts the SlotPool, transitions the JobStatus on the JM to RUNNING, and triggers the Scheduler to schedule checkpoints, allocate slots, and deploy tasks.

 

Note: this is where checkpointing gets started. For Flink's checkpoint execution mechanism, see my earlier write-up on the detailed checkpoint execution process.

 

private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {

		validateRunsInMainThread();

		checkNotNull(newJobMasterId, "The new JobMasterId must not be null.");

		if (Objects.equals(getFencingToken(), newJobMasterId)) {
			log.info("Already started the job execution with JobMasterId {}.", newJobMasterId);

			return Acknowledge.get();
		}

		setNewFencingToken(newJobMasterId);

		//Start the SlotPool and connect to the RM
		startJobMasterServices();

		log.info("Starting execution of job {} ({}) under job master id {}.", jobGraph.getName(), jobGraph.getJobID(), newJobMasterId);

		//Verify that the JobStatus is CREATED (it was set to CREATED when the ExecutionGraph was initialized earlier);
		//otherwise recreate the SchedulerNG, repeating the earlier JM setup steps
		//Finally, start the CheckpointCoordinator through the SchedulerNG, request slot resources, and deploy tasks
		resetAndStartScheduler();

		return Acknowledge.get();
	}

private void resetAndStartScheduler() throws Exception {
		validateRunsInMainThread();

		final CompletableFuture<Void> schedulerAssignedFuture;

		//When the ExecutionGraph is created, JobStatus is set to JobStatus.CREATED;
		//it only switches to RUNNING in executionGraph.transitionToRunning
		if (schedulerNG.requestJobStatus() == JobStatus.CREATED) {
			schedulerAssignedFuture = CompletableFuture.completedFuture(null);
			schedulerNG.setMainThreadExecutor(getMainThreadExecutor());
		} else {
			//Recreate the SchedulerNG and repeat the earlier JM setup steps
			suspendAndClearSchedulerFields(new FlinkException("ExecutionGraph is being reset in order to be rescheduled."));
			final JobManagerJobMetricGroup newJobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);
			final SchedulerNG newScheduler = createScheduler(executionDeploymentTracker, newJobManagerJobMetricGroup);

			schedulerAssignedFuture = schedulerNG.getTerminationFuture().handle(
				(ignored, throwable) -> {
					newScheduler.setMainThreadExecutor(getMainThreadExecutor());
					assignScheduler(newScheduler, newJobManagerJobMetricGroup);
					return null;
				}
			);
		}

		//Start the Scheduler
		schedulerAssignedFuture.thenRun(this::startScheduling);
	}

protected void startSchedulingInternal() {
		log.info("Starting scheduling with scheduling strategy [{}]", schedulingStrategy.getClass().getName());
		//Notify the ExecutionGraph's CheckpointCoordinator to switch its state to RUNNING and get ready to run checkpoints
		//The generated ScheduledTrigger runnable mainly holds the periodic checkpoint logic (CheckpointCoordinator.startCheckpointScheduler)
		//The JM's checkpointing checks whether each Execution's assignedResource is set; if not, it does not submit checkpoints to the TM
		//assignedResource is only assigned once the slot deployment on the TM succeeds; only then does the JM's periodic checkpoint thread go on to trigger checkpoints on the TM's tasks
		prepareExecutionGraphForNgScheduling();
		schedulingStrategy.startScheduling();
	}

The JM's slot request flow falls back through three levels, roughly as follows (condensed into pseudocode in the sketch after this list):

  1. The JM first asks its local SlotPool for slot resources. If some are available, the SlotPool returns the info of a TM with free slots (its location, usable resources, and so on), and the request is then submitted to that TM's SlotTable.
  2. If not, the SlotPool submits the slot request to the SlotManager on Flink's RM. If the SlotManager finds a TM with free slots, it likewise returns that TM's location and resource info, and the request is submitted to that TM's SlotTable.
  3. If the SlotManager cannot match a TM with free resources either, the RM requests a container from the YARN RM and issues the command to launch it. Once launched, the container runs the main entry class specified at submission time, initializes the TaskManager instance along with services such as the NettyShuffleEnvironment, and finally establishes connections to the JM's RM and other services.
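
A hypothetical condensation of that fallback chain (the service and method names such as tryAllocateFromAvailable and requestNewWorker mirror but simplify the real calls walked through below; this is pseudocode, not Flink's actual API):

	// Pseudocode sketch of the three-level fallback; not Flink's real signatures.
	PhysicalSlot allocate(SlotPool slotPool, SlotManager slotManager, YarnDriver yarnDriver, SlotProfile profile) {
		// 1. JM-local SlotPool: slots already registered with this job and free.
		Optional<PhysicalSlot> local = slotPool.tryAllocateFromAvailable(profile);
		if (local.isPresent()) {
			return local.get(); // submit the request to that TM's SlotTable
		}

		// 2. SlotManager on Flink's RM: match a registered TM with a free slot.
		Optional<PhysicalSlot> managed = slotManager.findMatchingSlot(profile);
		if (managed.isPresent()) {
			return managed.get();
		}

		// 3. Last resort: ask the YARN RM for a container and start a new TM in it.
		return yarnDriver.requestNewWorker(profile).join();
	}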

 

public void startScheduling() {
		allocateSlotsAndDeploy(SchedulingStrategyUtils.getAllVertexIdsFromTopology(schedulingTopology));
	}

public void allocateSlotsAndDeploy(final List<ExecutionVertexDeploymentOption> executionVertexDeploymentOptions) {
		
        ....
		
		//What is returned here is a LogicalSlot; the actual task submission happens in the next step, waitForAllSlotsAndDeploy
		final List<SlotExecutionVertexAssignment> slotExecutionVertexAssignments =
			allocateSlots(executionVertexDeploymentOptions);

		final List<DeploymentHandle> deploymentHandles = createDeploymentHandles(
			requiredVersionByVertex,
			deploymentOptionsByVertex,
			slotExecutionVertexAssignments);

		//Begin submitting tasks to the TMs
		waitForAllSlotsAndDeploy(deploymentHandles);
	}

The slot request begins. We assume the default case here, where vertices belong to the same slot sharing group, so different ExecutionVertexes may be assigned to the same slot.
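
On the user-facing API, this grouping is what slotSharingGroup(...) controls; unless set otherwise, all operators belong to the group "default". A minimal usage sketch (host and port are placeholders):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingDemo {
	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		env.socketTextStream("host", 9999)
			.map(String::toUpperCase)     // in the "default" sharing group: may share a slot
			.filter(s -> !s.isEmpty())
			.slotSharingGroup("isolated") // this operator and those downstream use a separate group
			.print();
		env.execute("slot-sharing-demo");
	}
}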

 

private List<SlotExecutionVertexAssignment> allocateSlots(final List<ExecutionVertexDeploymentOption> executionVertexDeploymentOptions) {
		//Extract the scheduling requirements from executionVertexDeploymentOptions, then allocate slots for them
		return executionSlotAllocator.allocateSlotsFor(executionVertexDeploymentOptions
			.stream()
			.map(ExecutionVertexDeploymentOption::getExecutionVertexId)
			.map(this::getExecutionVertex)
			.map(ExecutionVertexSchedulingRequirementsMapper::from)
			.collect(Collectors.toList()));
	}

First, the local SlotPool is queried for available resources; this mainly checks whether the JM's SlotPool already has a ready-to-use slot. If it does, the matching multiTaskSlotLocality is returned; it mainly wraps the usable TM's info, such as its address and available ResourceProfile. Once returned, the resource/deployment request goes straight to that TM. Otherwise the current SlotPool has no free resources, and the slot request is submitted to the SlotManager on the JM's RM.

 

private SlotSharingManager.MultiTaskSlotLocality allocateMultiTaskSlot(
			AbstractID groupId,
			SlotSharingManager slotSharingManager,
			SlotProfile slotProfile,
			@Nullable Time allocationTimeout) {

		Collection<SlotSelectionStrategy.SlotInfoAndResources> resolvedRootSlotsInfo =
				slotSharingManager.listResolvedRootSlotInfo(groupId);

		SlotSelectionStrategy.SlotInfoAndLocality bestResolvedRootSlotWithLocality =
			slotSelectionStrategy.selectBestSlotForProfile(resolvedRootSlotsInfo, slotProfile).orElse(null);

		final SlotSharingManager.MultiTaskSlotLocality multiTaskSlotLocality = bestResolvedRootSlotWithLocality != null ?
			new SlotSharingManager.MultiTaskSlotLocality(
				slotSharingManager.getResolvedRootSlot(bestResolvedRootSlotWithLocality.getSlotInfo()),
				bestResolvedRootSlotWithLocality.getLocality()) :
			null;

		if (multiTaskSlotLocality != null && multiTaskSlotLocality.getLocality() == Locality.LOCAL) {
			return multiTaskSlotLocality;
		}

		final SlotRequestId allocatedSlotRequestId = new SlotRequestId();
		final SlotRequestId multiTaskSlotRequestId = new SlotRequestId();

		//First ask our own (the JM's) SlotPool for available resources
		//This mainly checks whether the SlotPool's AvailableSlots instance (essentially a map of available TMs and slots) is non-empty (i.e. has free resources)
		//If it is non-empty, some TM currently has a free slot, and a non-empty Optional<SlotAndLocality> is returned (containing the TM address, available ResourceProfile, and so on)
		Optional<SlotAndLocality> optionalPoolSlotAndLocality = tryAllocateFromAvailable(allocatedSlotRequestId, slotProfile);

		//If one was obtained, create the task slot and return right away
		if (optionalPoolSlotAndLocality.isPresent()) {
			SlotAndLocality poolSlotAndLocality = optionalPoolSlotAndLocality.get();
			if (poolSlotAndLocality.getLocality() == Locality.LOCAL || bestResolvedRootSlotWithLocality == null) {

				//The physical slot currently available in the SlotPool
				final PhysicalSlot allocatedSlot = poolSlotAndLocality.getSlot();
				//Create the multiTaskSlot
				final SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.createRootSlot(
					multiTaskSlotRequestId,
					CompletableFuture.completedFuture(poolSlotAndLocality.getSlot()),
					allocatedSlotRequestId);

				//This merely tags the available physical slot with this request's multiTaskSlot as its payload
				if (allocatedSlot.tryAssignPayload(multiTaskSlot)) {
					//Return it wrapped as a MultiTaskSlotLocality
					return SlotSharingManager.MultiTaskSlotLocality.of(multiTaskSlot, poolSlotAndLocality.getLocality());
				} else {
					multiTaskSlot.release(new FlinkException("Could not assign payload to allocated slot " +
						allocatedSlot.getAllocationId() + '.'));
				}
			}
		}

		//If multiTaskSlotLocality is non-null, the slot sharing group already offers a usable slot; return and use it directly
		if (multiTaskSlotLocality != null) {
			// prefer slot sharing group slots over unused slots
			if (optionalPoolSlotAndLocality.isPresent()) {
				slotPool.releaseSlot(
					allocatedSlotRequestId,
					new FlinkException("Locality constraint is not better fulfilled by allocated slot."));
			}
			return multiTaskSlotLocality;
		}

		// there is no slot immediately available --> check first for uncompleted slots at the slot sharing group
		// Reaching this point means no usable slot was found; check next whether there are still unresolved (not yet allocated) slots
		SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.getUnresolvedRootSlot(groupId);

		//If this is null, the steps above found no usable slot at all
		if (multiTaskSlot == null) {
			// it seems as if we have to request a new slot from the resource manager, this is always the last resort!!!
			//Request a new physical slot from the RM
			final CompletableFuture<PhysicalSlot> slotAllocationFuture = requestNewAllocatedSlot(
				allocatedSlotRequestId,
				slotProfile,
				allocationTimeout);

			....
	}

If the local SlotPool has no matching resources, the slot request is submitted to the resourceManagerGateway; once notified, the RM forwards the request to its SlotManager.

 

private CompletableFuture<AllocatedSlot> requestNewAllocatedSlotInternal(PendingRequest pendingRequest) {

		if (resourceManagerGateway == null) {
			//If the RM is not connected yet, stash the request and submit it once the connection is up
			stashRequestWaitingForResourceManager(pendingRequest);
		} else {
			//Submit the request to the resourceManagerGateway
			requestSlotFromResourceManager(resourceManagerGateway, pendingRequest);
		}

		return pendingRequest.getAllocatedSlotFuture();
	}

private void requestSlotFromResourceManager(
			final ResourceManagerGateway resourceManagerGateway,
			final PendingRequest pendingRequest) {

		....

		//Submit the request via the resourceManagerGateway
		CompletableFuture<Acknowledge> rmResponse = resourceManagerGateway.requestSlot(
			jobMasterId,
			new SlotRequest(jobId, allocationId, pendingRequest.getResourceProfile(), jobManagerAddress),
			rpcTimeout);

		...
	}

public CompletableFuture<Acknowledge> requestSlot(
			JobMasterId jobMasterId,
			SlotRequest slotRequest,
			final Time timeout) {

		...

				try {//The RM forwards the slot request to its internal SlotManager
					slotManager.registerSlotRequest(slotRequest);
				} catch (ResourceManagerException e) {
					return FutureUtils.completedExceptionally(e);
				}

				...
	}

The SlotManager then checks whether it has available resources locally.

 

private void internalRequestSlot(PendingSlotRequest pendingSlotRequest) throws ResourceManagerException {
		final ResourceProfile resourceProfile = pendingSlotRequest.getResourceProfile();

		//The SlotManager matches the requested resourceProfile (how many cores, how much memory, etc.) against the TaskManagers to see whether any free slot can satisfy it
		OptionalConsumer.of(findMatchingSlot(resourceProfile))
			//If satisfied, a TM has a free slot: ask that TaskManager to allocate the slot
			.ifPresent(taskManagerSlot -> allocateSlot(taskManagerSlot, pendingSlotRequest))
			//If not: the SlotManager asks the YARN RM for a container to deploy a TaskManager and initialize new slots
			.ifNotPresent(() -> fulfillPendingSlotRequestWithPendingTaskManagerSlot(pendingSlotRequest));
	}

If the SlotManager finds no matching resources locally, the RM calls the resourceManagerDriver to request resources from the cluster.

Here we assume the default YarnResourceManagerDriver.

 

private void requestNewWorker(WorkerResourceSpec workerResourceSpec) {
		final TaskExecutorProcessSpec taskExecutorProcessSpec =
				TaskExecutorProcessUtils.processSpecFromWorkerResourceSpec(flinkConfig, workerResourceSpec);
		final int pendingCount = pendingWorkerCounter.increaseAndGet(workerResourceSpec);

		log.info("Requesting new worker with resource spec {}, current pending count: {}.",
				workerResourceSpec,
				pendingCount);

		//External resource-management drivers available to Flink: KubernetesResourceManagerDriver, MesosResourceManagerDriver, YarnResourceManagerDriver
		CompletableFuture<WorkerType> requestResourceFuture = resourceManagerDriver.requestResource(taskExecutorProcessSpec);
		FutureUtils.assertNoException(
				requestResourceFuture.handle((worker, exception) -> {
					if (exception != null) {
						final int count = pendingWorkerCounter.decreaseAndGet(workerResourceSpec);
						log.warn("Failed requesting worker with resource spec {}, current pending count: {}, exception: {}",
								workerResourceSpec,
								count,
								exception);
						requestWorkerIfRequired();
					} else {
						final ResourceID resourceId = worker.getResourceID();
						workerNodeMap.put(resourceId, worker);
						currentAttemptUnregisteredWorkers.put(resourceId, workerResourceSpec);
						log.info("Requested worker {} with resource spec {}.",
								resourceId.getStringWithMetadata(),
								workerResourceSpec);
					}
					return null;
				}));
	}

Flink does not wrap YARN's components much: it directly uses the AMRMClientAsync client under the org.apache.hadoop.yarn.client.api package to submit resource requests to the YARN cluster and have containers allocated.
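
Outside of Flink, the bare AMRMClientAsync handshake looks roughly like this (a minimal sketch; the heartbeat interval, resource sizes, and the callbackHandler parameter, an AMRMClientAsync.CallbackHandler implementation, are assumptions):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRmSketch {
	static void requestOneContainer(AMRMClientAsync.CallbackHandler callbackHandler) throws Exception {
		AMRMClientAsync<AMRMClient.ContainerRequest> rmClient =
			AMRMClientAsync.createAMRMClientAsync(1000, callbackHandler); // 1s heartbeat
		rmClient.init(new YarnConfiguration());
		rmClient.start();
		rmClient.registerApplicationMaster("am-host", 0, ""); // the AM registers itself first

		// Ask the YARN RM for one container; the grant arrives asynchronously via
		// callbackHandler.onContainersAllocated(...), as in Flink's callback below.
		Resource capability = Resource.newInstance(2048 /* MB */, 1 /* vcores */);
		rmClient.addContainerRequest(
			new AMRMClient.ContainerRequest(capability, null, null, Priority.newInstance(1)));
	}
}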

 

public CompletableFuture<YarnWorkerNode> requestResource(TaskExecutorProcessSpec taskExecutorProcessSpec) {
		checkInitialized();

		final CompletableFuture<YarnWorkerNode> requestResourceFuture = new CompletableFuture<>();

		//The resource profile Flink needs, to be submitted to the YARN RM
		final Optional<TaskExecutorProcessSpecContainerResourcePriorityAdapter.PriorityAndResource> priorityAndResourceOpt =
			taskExecutorProcessSpecContainerResourcePriorityAdapter.getPriorityAndResource(taskExecutorProcessSpec);

		if (!priorityAndResourceOpt.isPresent()) {
			requestResourceFuture.completeExceptionally(
				new ResourceManagerException(
					String.format("Could not compute the container Resource from the given TaskExecutorProcessSpec %s. " +
							"This usually indicates the requested resource is larger than Yarn's max container resource limit.",
						taskExecutorProcessSpec)));
		} else {
			final Priority priority = priorityAndResourceOpt.get().getPriority();
			final Resource resource = priorityAndResourceOpt.get().getResource();

			//Request a container from the RM; if the RM has free NM resources to grant, it replies with success (invoking the user-side onContainersAllocated callback)
			resourceManagerClient.addContainerRequest(getContainerRequest(resource, priority));

			// make sure we transmit the request fast and receive fast news of granted allocations
			resourceManagerClient.setHeartbeatInterval(containerRequestHeartbeatIntervalMillis);

			requestResourceFutures.computeIfAbsent(taskExecutorProcessSpec, ignore -> new LinkedList<>()).add(requestResourceFuture);

			log.info("Requesting new TaskExecutor container with resource {}, priority {}.", taskExecutorProcessSpec, priority);
		}

		return requestResourceFuture;
	}

static AMRMClient.ContainerRequest getContainerRequest(Resource containerResource, Priority priority) {
		//Submit the request to YARN; the container is initialized according to containerResource
		return new AMRMClient.ContainerRequest(
			containerResource,
			null,
			null,
			priority);
	}

After resourceManagerClient.addContainerRequest has asked for container resources and the RM has replied that an NM has free capacity, the client-side onContainersAllocated callback is invoked and container initialization begins.

 

public void onContainersAllocated(List<Container> containers) {
			runAsyncWithFatalHandler(() -> {
				checkInitialized();
				log.info("Received {} containers.", containers.size());

				//Iterate over all the containers YARN successfully allocated
				for (Map.Entry<Priority, List<Container>> entry : groupContainerByPriority(containers).entrySet()) {
					//and prepare to initialize them
					onContainersOfPriorityAllocated(entry.getKey(), entry.getValue());
				}

				// if we are waiting for no further containers, we can go to the
				// regular heartbeat interval
				if (getNumRequestedNotAllocatedWorkers() <= 0) {
					resourceManagerClient.setHeartbeatInterval(yarnHeartbeatIntervalMillis);
				}
			});
		}

private void onContainersOfPriorityAllocated(Priority priority, List<Container> containers) {
		....

		//Iterate over the containers that need to be started
		while (containerIterator.hasNext() && pendingContainerRequestIterator.hasNext()) {
			final Container container = containerIterator.next();
			final AMRMClient.ContainerRequest pendingRequest = pendingContainerRequestIterator.next();
			final ResourceID resourceId = getContainerResourceId(container);

			final CompletableFuture<YarnWorkerNode> requestResourceFuture = pendingRequestResourceFutures.poll();
			Preconditions.checkState(requestResourceFuture != null);

			if (pendingRequestResourceFutures.isEmpty()) {
				requestResourceFutures.remove(taskExecutorProcessSpec);
			}

			//Start the container
			startTaskExecutorInContainerAsync(container, taskExecutorProcessSpec, resourceId, requestResourceFuture);
			removeContainerRequest(pendingRequest);

			...
	}

Flink then submits the container initialization to the NodeManager. The submission resembles the JobGraph submission described earlier: the user jars, required hardware resources, Kerberos credentials, the YarnTaskExecutorRunner.class entry main function, and the rest of the request info are added to a LaunchContext, and then the NMClientAsync client is invoked to ask the NM to start the container.
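
Stripped of Flink's packaging, what gets handed to the NM is roughly the following (a sketch: the nmClient, container, localResources, environment, and tokens parameters are assumed to be prepared elsewhere, and the launch command is abbreviated):

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;

public class NmLaunchSketch {
	static void launchTaskExecutor(
			NMClientAsync nmClient,
			Container container,
			Map<String, LocalResource> localResources, // flink/user jars to localize
			Map<String, String> environment,           // e.g. classpath, ENV_FLINK_NODE_ID
			ByteBuffer tokens) {                       // Kerberos/security tokens

		// The launch context bundles everything the NM needs to start the JVM,
		// including the launch command whose main class is YarnTaskExecutorRunner.
		ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
			localResources,
			environment,
			Collections.singletonList(
				"$JAVA_HOME/bin/java ... org.apache.flink.yarn.YarnTaskExecutorRunner ..."),
			null,     // service data (unused here)
			tokens,
			null);    // application ACLs

		nmClient.startContainerAsync(container, ctx); // async submission to the NM
	}
}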
 

private void startTaskExecutorInContainerAsync(
		Container container,
		TaskExecutorProcessSpec taskExecutorProcessSpec,
		ResourceID resourceId,
		CompletableFuture<YarnWorkerNode> requestResourceFuture) {
		final CompletableFuture<ContainerLaunchContext> containerLaunchContextFuture =
			
			//Wrap the resources Flink needs to submit into a ContainerLaunchContext
			FutureUtils.supplyAsync(() -> createTaskExecutorLaunchContext(
				resourceId, container.getNodeId().getHost(), taskExecutorProcessSpec), getIoExecutor());

		FutureUtils.assertNoException(
			containerLaunchContextFuture.handleAsync((context, exception) -> {
				if (exception == null) {
					//Use NMClientAsync to ask the NM to start the container
					//Once the NM has started the container, the YarnTaskExecutorRunner.class main function runs; it mainly creates the TaskExecutor instance, which then waits for the JM to submit tasks
					nodeManagerClient.startContainerAsync(container, context);
					requestResourceFuture.complete(new YarnWorkerNode(container, resourceId));
				} else {
					requestResourceFuture.completeExceptionally(exception);
				}
				return null;
			}, getMainThreadExecutor()));
	}

Once the NM has started the container, it runs the main entry class specified in the context Flink submitted: YarnTaskExecutorRunner.class.

Entering the main entry point on the TM side:

 

private ContainerLaunchContext createTaskExecutorLaunchContext(
		ResourceID containerId,
		String host,
		TaskExecutorProcessSpec taskExecutorProcessSpec) throws Exception {

		.....

        //Create the context designating YarnTaskExecutorRunner.class as the main entry class
		final ContainerLaunchContext taskExecutorLaunchContext = Utils.createTaskExecutorContext(
			flinkConfig,
			yarnConfig,
			configuration,
			taskManagerParameters,
			taskManagerDynamicProperties,
			currDir,
			YarnTaskExecutorRunner.class,
			log);

		taskExecutorLaunchContext.getEnvironment()
			.put(ENV_FLINK_NODE_ID, host);
		return taskExecutorLaunchContext;
	}

public class YarnTaskExecutorRunner {

	....

    //Main entry function on the executor (TM) side
	public static void main(String[] args) {
		EnvironmentInformation.logEnvironmentInfo(LOG, "YARN TaskExecutor runner", args);
		SignalHandler.register(LOG);
		JvmShutdownSafeguard.installAsShutdownHook(LOG);

        //Start the TM
		runTaskManagerSecurely(args);
	}

TM instance initialization now begins: registering with the JM and creating the Akka actor, creating the NettyShuffleEnvironment and NetworkBufferPool responsible for data exchange, the KvStateManager that maintains state, the SlotTable, and so on; establishing connections to the RM and other services on the JM side; and then waiting for the JM to submit tasks and trigger execution.

 

public static void runTaskManager(Configuration configuration, PluginManager pluginManager) throws Exception {

		final TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, pluginManager, TaskManagerRunner::createTaskExecutorService);

		taskManagerRunner.start();
	}

//Register the Akka-based RpcEndpoint (an Akka actor)
public class TaskExecutor extends RpcEndpoint implements TaskExecutorGateway {...}

public static TaskExecutor startTaskManager(
			Configuration configuration,
			ResourceID resourceID,
			RpcService rpcService,
			HighAvailabilityServices highAvailabilityServices,
			HeartbeatServices heartbeatServices,
			MetricRegistry metricRegistry,
			BlobCacheService blobCacheService,
			boolean localCommunicationOnly,
			ExternalResourceInfoProvider externalResourceInfoProvider,
			FatalErrorHandler fatalErrorHandler) throws Exception {

		checkNotNull(configuration);
		checkNotNull(resourceID);
		checkNotNull(rpcService);
		checkNotNull(highAvailabilityServices);

		LOG.info("Starting TaskManager with ResourceID: {}", resourceID.getStringWithMetadata());

		//External address of the rpcService (actorSystem) this TaskManager runs on
		String externalAddress = rpcService.getAddress();

		//The CPU, memory, and other hardware resources configured for this TM
		final TaskExecutorResourceSpec taskExecutorResourceSpec = TaskExecutorResourceUtils.resourceSpecFromConfig(configuration);

		//Configuration parameters for the NettyShuffleEnvironment, checkpointing, and other services
		TaskManagerServicesConfiguration taskManagerServicesConfiguration =
			TaskManagerServicesConfiguration.fromConfiguration(
				configuration,
				resourceID,
				externalAddress,
				localCommunicationOnly,
				taskExecutorResourceSpec);

		//Mainly wraps the hostname, TM ID, and so on
		Tuple2<TaskManagerMetricGroup, MetricGroup> taskManagerMetricGroup = MetricUtils.instantiateTaskManagerMetricGroup(
			metricRegistry,
			externalAddress,
			resourceID,
			taskManagerServicesConfiguration.getSystemResourceMetricsProbingInterval());

		//Create a fixed thread pool whose size follows the cluster.io-pool.size option (by default 4 * number of CPU cores)
		final ExecutorService ioExecutor = Executors.newFixedThreadPool(
			taskManagerServicesConfiguration.getNumIoThreads(),
			new ExecutorThreadFactory("flink-taskexecutor-io"));

		//Create the TaskManagerServices, which include the NettyShuffleEnvironment, SlotTable, and other services
		TaskManagerServices taskManagerServices = TaskManagerServices.fromConfiguration(
			taskManagerServicesConfiguration,
			blobCacheService.getPermanentBlobService(),
			taskManagerMetricGroup.f1,
			ioExecutor,
			fatalErrorHandler);

		//Initialize the TM's memory-metrics monitoring
		MetricUtils.instantiateFlinkMemoryMetricGroup(
			taskManagerMetricGroup.f1,
			taskManagerServices.getTaskSlotTable(),
			taskManagerServices::getManagedMemorySize);

		//Build the TM's configuration holder: slot-related settings, the hardware resources derived above, temp directory paths, log paths, and so on
		TaskManagerConfiguration taskManagerConfiguration =
			TaskManagerConfiguration.fromConfiguration(configuration, taskExecutorResourceSpec, externalAddress);

		String metricQueryServiceAddress = metricRegistry.getMetricQueryServiceGatewayRpcAddress();

		//Create the TaskExecutor (an RpcEndpoint)
		return new TaskExecutor(
			rpcService,
			taskManagerConfiguration,
			highAvailabilityServices,
			taskManagerServices,
			externalResourceInfoProvider,
			heartbeatServices,
			taskManagerMetricGroup.f0,
			metricQueryServiceAddress,
			blobCacheService,
			fatalErrorHandler,
			new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
			createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));
	}

The TM creates quite a few services, such as the kvStateService, taskSlotTable, and broadcastVariableManager.

Here we mainly walk through the construction of the NettyShuffleEnvironment, the core service handling data exchange between Executions.

Creating the NettyShuffleEnvironment centers on three core components (the objects involved will be analyzed in later chapters; a simplified model follows this list):

  • the networkBufferPool, which manages this TM's on-heap/off-heap network memory
  • the resultPartitionFactory, responsible for sending data to downstream Executions
  • the inputGateFactory, responsible for pulling data from upstream Executions
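
As a simplified model of the first component (illustrative classes only, not Flink's real NetworkBufferPool/LocalBufferPool API): a global pool pre-allocates fixed-size segments once at TM startup, and each task-side pool only borrows from and returns to it, so the TM-wide network memory budget is shared across tasks and exhaustion surfaces as backpressure:

import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: a global pool of fixed-size memory segments.
class GlobalBufferPool {
	private final Queue<byte[]> segments = new ArrayDeque<>();

	GlobalBufferPool(int numSegments, int segmentSize) { // e.g. 32 KB segments
		for (int i = 0; i < numSegments; i++) {
			segments.add(new byte[segmentSize]);         // allocated up front
		}
	}

	synchronized byte[] requestSegment() { return segments.poll(); } // null -> wait/backpressure
	synchronized void recycle(byte[] segment) { segments.add(segment); }
}

// A task-side pool (what a ResultPartition or InputGate would hold) only
// borrows from and returns to the global pool; it owns no memory itself.
class TaskBufferPool {
	private final GlobalBufferPool global;
	TaskBufferPool(GlobalBufferPool global) { this.global = global; }
	byte[] request()         { return global.requestSegment(); }
	void release(byte[] seg) { global.recycle(seg); }
}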

 

private static ShuffleEnvironment<?, ?> createShuffleEnvironment(
			TaskManagerServicesConfiguration taskManagerServicesConfiguration,
			TaskEventDispatcher taskEventDispatcher,
			MetricGroup taskManagerMetricGroup,
			Executor ioExecutor) throws FlinkException {

		//Create the context holding all the configuration
		final ShuffleEnvironmentContext shuffleEnvironmentContext = new ShuffleEnvironmentContext(
			taskManagerServicesConfiguration.getConfiguration(),
			taskManagerServicesConfiguration.getResourceID(),
			taskManagerServicesConfiguration.getNetworkMemorySize(),
			taskManagerServicesConfiguration.isLocalCommunicationOnly(),
			taskManagerServicesConfiguration.getBindAddress(),
			taskEventDispatcher,
			taskManagerMetricGroup,
			ioExecutor);

		return ShuffleServiceLoader
			.loadShuffleServiceFactory(taskManagerServicesConfiguration.getConfiguration())
			//Create the networkBufferPool, resultPartitionFactory, and inputGateFactory
			.createShuffleEnvironment(shuffleEnvironmentContext);
	}

static NettyShuffleEnvironment createNettyShuffleEnvironment(
			NettyShuffleEnvironmentConfiguration config,
			ResourceID taskExecutorResourceId,
			TaskEventPublisher taskEventPublisher,
			ResultPartitionManager resultPartitionManager,
			MetricGroup metricGroup,
			Executor ioExecutor) {
		checkNotNull(config);
		checkNotNull(taskExecutorResourceId);
		checkNotNull(taskEventPublisher);
		checkNotNull(resultPartitionManager);
		checkNotNull(metricGroup);

		NettyConfig nettyConfig = config.nettyConfig();

		FileChannelManager fileChannelManager = new FileChannelManagerImpl(config.getTempDirs(), DIR_NAME_PREFIX);

		ConnectionManager connectionManager = nettyConfig != null ?
			new NettyConnectionManager(resultPartitionManager, taskEventPublisher, nettyConfig) :
			new LocalConnectionManager();

		//Handles the memory partitioning and management of this TM; each task's LocalBufferPool requests segments from it
		NetworkBufferPool networkBufferPool = new NetworkBufferPool(
			config.numNetworkBuffers(),//how many segments can be allocated, computed earlier in fromConfiguration
			config.networkBufferSize(),//segment size, 32 KB by default
			config.getRequestSegmentsTimeout());//timeout for segment requests

		registerShuffleMetrics(metricGroup, networkBufferPool);

		//Mainly responsible for creating ResultPartitions (containing ResultSubpartitions) and their LocalBufferPools
		ResultPartitionFactory resultPartitionFactory = new ResultPartitionFactory(
			resultPartitionManager,
			fileChannelManager,
			networkBufferPool,
			config.getBlockingSubpartitionType(),
			config.networkBuffersPerChannel(),
			config.floatingNetworkBuffersPerGate(),
			config.networkBufferSize(),
			config.isBlockingShuffleCompressionEnabled(),
			config.getCompressionCodec(),
			config.getMaxBuffersPerChannel());

		//Mainly responsible for creating InputGates (containing InputChannels) and their LocalBufferPools
		SingleInputGateFactory singleInputGateFactory = new SingleInputGateFactory(
			taskExecutorResourceId,
			config,
			connectionManager,
			resultPartitionManager,
			taskEventPublisher,
			networkBufferPool);

		//Create the NettyShuffleEnvironment instance
		return new NettyShuffleEnvironment(
			taskExecutorResourceId,
			config,
			networkBufferPool,
			connectionManager,
			resultPartitionManager,
			fileChannelManager,
			resultPartitionFactory,
			singleInputGateFactory,
			ioExecutor);
	}

Once all the service objects above are created, the TM starts (its AkkaRpcServer) and connects to the RM and other services on the JM side.

 

public void start() throws Exception {
		taskExecutorService.start();
	}

//Start the TM
	public void onStart() throws Exception {
		try {
			startTaskExecutorServices();
		} catch (Throwable t) {
			final TaskManagerException exception = new TaskManagerException(String.format("Could not start the TaskExecutor %s", getAddress()), t);
			onFatalError(exception);
			throw exception;
		}

		startRegistrationTimeout();
	}

private void startTaskExecutorServices() throws Exception {
		try {
			// start by connecting to the ResourceManager
			//Establish the connection to the RM on the JM side
			resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());

			// tell the task slot table who's responsible for the task slot actions
			//Start the taskSlotTable
			taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());

			// start the job leader service
			//Start the AkkaRpcService, the ZK-based haServices, and the JobLeaderListener service
			jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());

			fileCache = new FileCache(taskManagerConfiguration.getTmpDirectories(), blobCacheService.getPermanentBlobService());
		} catch (Exception e) {
			handleStartTaskExecutorServicesException(e);
		}
	}

At this point, we've gone from JM startup through the slot request and container deployment to TM instantiation and startup. The whole process is mostly about acquiring resources, carving up configuration, and initializing the various services; the cluster is not yet fully up and running.

Later chapters will cover the final step of the physical execution graph: Flink submitting tasks to the TMs, starting the Task threads, and the exchange of user data. Thanks.

 

 

 

 

 
