Apache Flink is one of the most popular big data compute engines. It natively supports high throughput, low latency, exactly-once semantics, and stateful stream processing, and reading its source code helps deepen one's understanding of the framework.
Flink has four kinds of execution graphs; this chapter walks through the first half of how the physical execution graph is generated.
This chapter and the following source-code walkthroughs assume a production-style environment: run mode: on YARN; HA mode: ZooKeeper; execution mode: Streaming. Key logic is explained in comments I have added to the code, so do not skip them.
The components and frameworks involved depend on parts of the big data ecosystem, including YARN, Akka, and ZooKeeper; some background in these will make the walkthrough easier to follow.
Since the whole construction process is fairly complex, I will split it across two chapters. This chapter covers the build-up before Flink submits the TaskDescriptor to the TM, i.e., from slot deployment to the creation of the TM instance. For the external dependencies invoked along the way, such as Akka-based RPC, ZooKeeper-based HA leader election, and the YARN mechanics, see my previous chapter on the ExecutionGraph construction process: https://blog.csdn.net/ws0owws0ow/article/details/113991593?spm=1001.2014.3001.5501. Flink's memory management model and the Netty communication involved in creating the TM will be covered in later chapters.
Continuing from the previous chapter: after services such as the SlotPool, Scheduler, and ExecutionGraph have been created, the JobManager is fully initialized and leader election begins.
//Create the JM. The JM is one of Flink's core components: it maintains the SlotPool, which handles internal slot scheduling, and the Scheduler, which maintains the CheckpointCoordinator and builds the ExecutionGraph.
CompletableFuture<JobManagerRunner> createJobManagerRunner(JobGraph jobGraph, long initializationTimestamp) {
//Get the ClusterEntrypoint's actorSystem-based RpcService, used to create the JobManagerRunner's RpcEndpoint
final RpcService rpcService = getRpcService();
//Create JM -> create SchedulerNG -> build ExecutionGraph -> create ExecutionJobVertex
//-> create ExecutionVertex, IntermediateResult/partition -> create CheckpointCoordinator
//-> JM leader election -> start SchedulerNG to dispatch tasks and checkpoints
return CompletableFuture.supplyAsync(
() -> {
try {
JobManagerRunner runner = jobManagerRunnerFactory.createJobManagerRunner(
jobGraph,
configuration,
rpcService,
highAvailabilityServices,
heartbeatServices,
jobManagerSharedServices,
new DefaultJobManagerJobMetricGroupFactory(jobManagerMetricGroup),
fatalErrorHandler,
initializationTimestamp);
//Start JM leader election
runner.start();
return runner;
} catch (Exception e) {
throw new CompletionException(new JobInitializationException(jobGraph.getJobID(), "Could not instantiate JobManager.", e));
}
},
ioExecutor); // do not use main thread executor. Otherwise, Dispatcher is blocked on JobManager creation
}
public void grantLeadership(final UUID leaderSessionID) {
synchronized (lock) {
if (shutdown) {
log.debug("JobManagerRunner cannot be granted leadership because it is already shut down.");
return;
}
leadershipOperation = leadershipOperation.thenCompose(
(ignored) -> {
synchronized (lock) {//about to start the JM
return verifyJobSchedulingStatusAndStartJobManager(leaderSessionID);
}
});
handleException(leadershipOperation, "Could not start the job manager.");
}
}
After leader election succeeds, the JM starts the SlotPool, then transitions the job's JobStatus to RUNNING and triggers the Scheduler to schedule checkpoints, allocate slots, and deploy tasks.
Note: this is where the checkpoint machinery gets started; for Flink's checkpoint execution mechanism, see my earlier write-up on the detailed checkpoint execution process.
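As a quick reference, this is the user-facing DataStream API that configures what the CheckpointCoordinator will later schedule (the interval and timeout values are illustrative, not Flink defaults):
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE); // trigger a checkpoint every 60s
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000L); // at least 10s between two checkpoints
        env.getCheckpointConfig().setCheckpointTimeout(120_000L); // abort a checkpoint that runs longer than 120s
        // ... define the job topology here, then call env.execute(...)
    }
}
Now back to the JM-side source: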
private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {
validateRunsInMainThread();
checkNotNull(newJobMasterId, "The new JobMasterId must not be null.");
if (Objects.equals(getFencingToken(), newJobMasterId)) {
log.info("Already started the job execution with JobMasterId {}.", newJobMasterId);
return Acknowledge.get();
}
setNewFencingToken(newJobMasterId);
//Start the SlotPool and connect to the RM
startJobMasterServices();
log.info("Starting execution of job {} ({}) under job master id {}.", jobGraph.getName(), jobGraph.getJobID(), newJobMasterId);
//Verify that the JobStatus is CREATED (it was set to CREATED when the ExecutionGraph was initialized earlier)
//otherwise recreate the SchedulerNG, repeating the earlier JM setup steps
//finally start the CheckpointCoordinator through the SchedulerNG, request slot resources, and deploy tasks
resetAndStartScheduler();
return Acknowledge.get();
}
private void resetAndStartScheduler() throws Exception {
validateRunsInMainThread();
final CompletableFuture<Void> schedulerAssignedFuture;
//When the ExecutionGraph is created, JobStatus is set to JobStatus.CREATED;
//executionGraph.transitionToRunning later switches it to RUNNING
if (schedulerNG.requestJobStatus() == JobStatus.CREATED) {
schedulerAssignedFuture = CompletableFuture.completedFuture(null);
schedulerNG.setMainThreadExecutor(getMainThreadExecutor());
} else {
//Recreate the SchedulerNG, repeating the earlier JM setup steps
suspendAndClearSchedulerFields(new FlinkException("ExecutionGraph is being reset in order to be rescheduled."));
final JobManagerJobMetricGroup newJobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);
final SchedulerNG newScheduler = createScheduler(executionDeploymentTracker, newJobManagerJobMetricGroup);
schedulerAssignedFuture = schedulerNG.getTerminationFuture().handle(
(ignored, throwable) -> {
newScheduler.setMainThreadExecutor(getMainThreadExecutor());
assignScheduler(newScheduler, newJobManagerJobMetricGroup);
return null;
}
);
}
//Start the Scheduler
schedulerAssignedFuture.thenRun(this::startScheduling);
}
protected void startSchedulingInternal() {
log.info("Starting scheduling with scheduling strategy [{}]", schedulingStrategy.getClass().getName());
//Notify the ExecutionGraph's CheckpointCoordinator to switch to RUNNING and prepare to run checkpoints.
//The ScheduledTrigger Runnable produced here contains the periodic checkpoint logic (CheckpointCoordinator.startCheckpointScheduler).
//The JM-side checkpoint checks whether each Execution's assignedResource is set; if not, no checkpoint is submitted to the TM.
//assignedResource is only set once slot deployment on a TM has succeeded; only then does the JM's periodic checkpoint thread go on to trigger checkpoints on the TM's tasks.
prepareExecutionGraphForNgScheduling();
schedulingStrategy.startScheduling();
}
The JM's slot request flow is a two-phase process, roughly:
- The JM first asks its local SlotPool for a slot. If one is available, the SlotPool returns the info of a TM with a free slot (location, usable resources, etc.), and the request is then submitted to that TM's SlotTable.
- If not, the SlotPool submits the slot request to the SlotManager on Flink's RM. If the SlotManager finds a TM with a free slot, it returns that TM's info, and the request is submitted to that TM's SlotTable.
- If the SlotManager cannot match a TM with free resources either, the RM requests a container from the YARN RM and issues the command to launch it. The container then runs the entry main class specified at submission time, initializes a TaskManager, creates services such as the NettyShuffleEnvironment, and finally establishes connections with the JM, RM, and other services. (A toy sketch of this fallback chain follows the list.)
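To make the fallback order concrete, here is a toy, self-contained model of the three phases; the types and methods (Slot, tryLocalSlotPool, trySlotManager, requestYarnContainerAndWait) are stand-ins, not Flink classes:
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
public class SlotRequestFallbackSketch {
    interface Slot {}
    //Phase 1: a readily available slot in the JM-local SlotPool
    static Optional<Slot> tryLocalSlotPool() { return Optional.empty(); }
    //Phase 2: a free slot on a TM registered with the RM's SlotManager
    static Optional<Slot> trySlotManager() { return Optional.empty(); }
    //Phase 3: last resort, start a brand-new TM in a YARN container
    static CompletableFuture<Slot> requestYarnContainerAndWait() {
        return CompletableFuture.completedFuture(new Slot() {});
    }
    static CompletableFuture<Slot> requestSlot() {
        Optional<Slot> local = tryLocalSlotPool();
        if (local.isPresent()) {
            return CompletableFuture.completedFuture(local.get());
        }
        Optional<Slot> fromRm = trySlotManager();
        if (fromRm.isPresent()) {
            return CompletableFuture.completedFuture(fromRm.get());
        }
        return requestYarnContainerAndWait();
    }
    public static void main(String[] args) {
        requestSlot().thenAccept(slot -> System.out.println("slot granted: " + slot));
    }
}
Back to the actual scheduling entry point: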
public void startScheduling() {
allocateSlotsAndDeploy(SchedulingStrategyUtils.getAllVertexIdsFromTopology(schedulingTopology));
}
public void allocateSlotsAndDeploy(final List<ExecutionVertexDeploymentOption> executionVertexDeploymentOptions) {
....
//What is returned here are LogicalSlots; the actual task submission happens in the next step, waitForAllSlotsAndDeploy
final List<SlotExecutionVertexAssignment> slotExecutionVertexAssignments =
allocateSlots(executionVertexDeploymentOptions);
final List<DeploymentHandle> deploymentHandles = createDeploymentHandles(
requiredVersionByVertex,
deploymentOptionsByVertex,
slotExecutionVertexAssignments);
//Start submitting tasks to the TMs
waitForAllSlotsAndDeploy(deploymentHandles);
}
Slot allocation begins. Here we assume the default setup where operators share the same slot sharing group, so different ExecutionVertices may be assigned to the same slot; the snippet below shows the user-facing API that controls this.
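For reference, a minimal example of assigning an operator to a named slot sharing group (standard DataStream API; the group name "heavy" is arbitrary):
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class SlotSharingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c")
            .map(String::toUpperCase)
            //Downstream operators inherit this group unless they set their own;
            //subtasks of operators in the same group may share one slot.
            .slotSharingGroup("heavy")
            .print();
        env.execute("slot-sharing-example");
    }
}
Back to the source: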
private List<SlotExecutionVertexAssignment> allocateSlots(final List<ExecutionVertexDeploymentOption> executionVertexDeploymentOptions) {
//Map the executionVertexDeploymentOptions to scheduling requirements, then allocate slots for them
return executionSlotAllocator.allocateSlotsFor(executionVertexDeploymentOptions
.stream()
.map(ExecutionVertexDeploymentOption::getExecutionVertexId)
.map(this::getExecutionVertex)
.map(ExecutionVertexSchedulingRequirementsMapper::from)
.collect(Collectors.toList()));
}
The local SlotPool is queried first: the JM's SlotPool checks whether a ready-to-use slot already exists. If so, the matching multiTaskSlotLocality is returned; it wraps the info of the usable TM, such as its address and available ResourceProfile, and the resource request is then deployed directly on that TM. Otherwise the current SlotPool has no free resources, and the slot request is submitted to the SlotManager on Flink's RM.
private SlotSharingManager.MultiTaskSlotLocality allocateMultiTaskSlot(
AbstractID groupId,
SlotSharingManager slotSharingManager,
SlotProfile slotProfile,
@Nullable Time allocationTimeout) {
Collection<SlotSelectionStrategy.SlotInfoAndResources> resolvedRootSlotsInfo =
slotSharingManager.listResolvedRootSlotInfo(groupId);
SlotSelectionStrategy.SlotInfoAndLocality bestResolvedRootSlotWithLocality =
slotSelectionStrategy.selectBestSlotForProfile(resolvedRootSlotsInfo, slotProfile).orElse(null);
final SlotSharingManager.MultiTaskSlotLocality multiTaskSlotLocality = bestResolvedRootSlotWithLocality != null ?
new SlotSharingManager.MultiTaskSlotLocality(
slotSharingManager.getResolvedRootSlot(bestResolvedRootSlotWithLocality.getSlotInfo()),
bestResolvedRootSlotWithLocality.getLocality()) :
null;
if (multiTaskSlotLocality != null && multiTaskSlotLocality.getLocality() == Locality.LOCAL) {
return multiTaskSlotLocality;
}
final SlotRequestId allocatedSlotRequestId = new SlotRequestId();
final SlotRequestId multiTaskSlotRequestId = new SlotRequestId();
//First try to allocate from the JM's own SlotPool.
//This mainly checks whether the SlotPool's AvailableSlots instance (a map of available TMs and slots) is non-empty (i.e., resources exist).
//If so, some TM currently has a free slot, and a non-empty Optional<SlotAndLocality> (containing the TM address, available ResourceProfile, etc.) is returned.
Optional<SlotAndLocality> optionalPoolSlotAndLocality = tryAllocateFromAvailable(allocatedSlotRequestId, slotProfile);
//If the allocation succeeded, create the task slot and return directly
if (optionalPoolSlotAndLocality.isPresent()) {
SlotAndLocality poolSlotAndLocality = optionalPoolSlotAndLocality.get();
if (poolSlotAndLocality.getLocality() == Locality.LOCAL || bestResolvedRootSlotWithLocality == null) {
//The physical slot resource currently available in the SlotPool
final PhysicalSlot allocatedSlot = poolSlotAndLocality.getSlot();
//Create the root multiTaskSlot
final SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.createRootSlot(
multiTaskSlotRequestId,
CompletableFuture.completedFuture(poolSlotAndLocality.getSlot()),
allocatedSlotRequestId);
//This merely tags the available physical slot with this request's multiTaskSlot as its payload
if (allocatedSlot.tryAssignPayload(multiTaskSlot)) {
//Return it wrapped as a MultiTaskSlotLocality
return SlotSharingManager.MultiTaskSlotLocality.of(multiTaskSlot, poolSlotAndLocality.getLocality());
} else {
multiTaskSlot.release(new FlinkException("Could not assign payload to allocated slot " +
allocatedSlot.getAllocationId() + '.'));
}
}
}
//If non-null, a resolved root slot was found for the sharing group earlier; return and use that multiTaskSlotLocality directly
if (multiTaskSlotLocality != null) {
// prefer slot sharing group slots over unused slots
if (optionalPoolSlotAndLocality.isPresent()) {
slotPool.releaseSlot(
allocatedSlotRequestId,
new FlinkException("Locality constraint is not better fulfilled by allocated slot."));
}
return multiTaskSlotLocality;
}
// there is no slot immediately available --> check first for uncompleted slots at the slot sharing group
// Reaching here means no resolved slot was found; check whether the sharing group still has unresolved (pending) slots
SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.getUnresolvedRootSlot(groupId);
//If this is null, the steps above found no usable slot at all
if (multiTaskSlot == null) {
// it seems as if we have to request a new slot from the resource manager, this is always the last resort!!!
//Request a new physical slot from the RM
final CompletableFuture<PhysicalSlot> slotAllocationFuture = requestNewAllocatedSlot(
allocatedSlotRequestId,
slotProfile,
allocationTimeout);
....
}
If the local SlotPool has no matching resources, the slot request is submitted to the resourceManagerGateway; on receiving it, the RM forwards the request to its SlotManager.
private CompletableFuture<AllocatedSlot> requestNewAllocatedSlotInternal(PendingRequest pendingRequest) {
if (resourceManagerGateway == null) {
//If the RM is not yet connected, stash the request and submit it once the connection is up
stashRequestWaitingForResourceManager(pendingRequest);
} else {
//Submit the request via the resourceManagerGateway
requestSlotFromResourceManager(resourceManagerGateway, pendingRequest);
}
return pendingRequest.getAllocatedSlotFuture();
}
private void requestSlotFromResourceManager(
final ResourceManagerGateway resourceManagerGateway,
final PendingRequest pendingRequest) {
....
//Call the resourceManagerGateway to submit the request
CompletableFuture<Acknowledge> rmResponse = resourceManagerGateway.requestSlot(
jobMasterId,
new SlotRequest(jobId, allocationId, pendingRequest.getResourceProfile(), jobManagerAddress),
rpcTimeout);
...
}
public CompletableFuture<Acknowledge> requestSlot(
JobMasterId jobMasterId,
SlotRequest slotRequest,
final Time timeout) {
...
try {//The RM hands the slot request to its internal SlotManager
slotManager.registerSlotRequest(slotRequest);
} catch (ResourceManagerException e) {
return FutureUtils.completedExceptionally(e);
}
...
}
The SlotManager then checks whether any registered resources can satisfy the request:
private void internalRequestSlot(PendingSlotRequest pendingSlotRequest) throws ResourceManagerException {
final ResourceProfile resourceProfile = pendingSlotRequest.getResourceProfile();
//The SlotManager matches the requested resourceProfile (cores, memory, etc.) against TaskManagers with free slots
OptionalConsumer.of(findMatchingSlot(resourceProfile))
//If matched, a TM has a free slot: ask that TaskManager to allocate it
.ifPresent(taskManagerSlot -> allocateSlot(taskManagerSlot, pendingSlotRequest))
//If not matched: the SlotManager asks the YARN RM for a container to deploy a new TaskManager with fresh slots
.ifNotPresent(() -> fulfillPendingSlotRequestWithPendingTaskManagerSlot(pendingSlotRequest));
}
If the SlotManager finds no matching resources, the RM calls the resourceManagerDriver to request resources from the cluster.
Here we assume the YarnResourceManagerDriver is in use.
private void requestNewWorker(WorkerResourceSpec workerResourceSpec) {
final TaskExecutorProcessSpec taskExecutorProcessSpec =
TaskExecutorProcessUtils.processSpecFromWorkerResourceSpec(flinkConfig, workerResourceSpec);
final int pendingCount = pendingWorkerCounter.increaseAndGet(workerResourceSpec);
log.info("Requesting new worker with resource spec {}, current pending count: {}.",
workerResourceSpec,
pendingCount);
//The external resource management drivers available to Flink: KubernetesResourceManagerDriver, MesosResourceManagerDriver, YarnResourceManagerDriver
CompletableFuture<WorkerType> requestResourceFuture = resourceManagerDriver.requestResource(taskExecutorProcessSpec);
FutureUtils.assertNoException(
requestResourceFuture.handle((worker, exception) -> {
if (exception != null) {
final int count = pendingWorkerCounter.decreaseAndGet(workerResourceSpec);
log.warn("Failed requesting worker with resource spec {}, current pending count: {}, exception: {}",
workerResourceSpec,
count,
exception);
requestWorkerIfRequired();
} else {
final ResourceID resourceId = worker.getResourceID();
workerNodeMap.put(resourceId, worker);
currentAttemptUnregisteredWorkers.put(resourceId, workerResourceSpec);
log.info("Requested worker {} with resource spec {}.",
resourceId.getStringWithMetadata(),
workerResourceSpec);
}
return null;
}));
}
Flink does not wrap YARN's components heavily: it directly uses the AMRMClientAsync client from the org.apache.hadoop.yarn.client.api package to submit resource requests to the YARN cluster and receive allocated containers. A minimal standalone sketch of that raw API follows.
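Here is a bare-bones, hypothetical AM-side usage of AMRMClientAsync (Hadoop 2.x API; registration teardown and error handling omitted, and the mostly-empty callback handler stands in for Flink's YarnResourceManagerDriver):
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
void requestOneTaskManagerContainer() throws Exception {
    AMRMClientAsync.CallbackHandler handler = new AMRMClientAsync.CallbackHandler() {
        @Override public void onContainersAllocated(List<Container> containers) {
            //Flink would launch a TaskExecutor in each granted container here
        }
        @Override public void onContainersCompleted(List<ContainerStatus> statuses) {}
        @Override public void onShutdownRequest() {}
        @Override public void onNodesUpdated(List<NodeReport> nodes) {}
        @Override public void onError(Throwable e) {}
        @Override public float getProgress() { return 0f; }
    };
    AMRMClientAsync<AMRMClient.ContainerRequest> rmClient =
            AMRMClientAsync.createAMRMClientAsync(1000, handler); //1s AM<->RM heartbeat
    rmClient.init(new YarnConfiguration());
    rmClient.start();
    rmClient.registerApplicationMaster("am-host", 0, ""); //must register before requesting
    //4 GB / 2 vcores at priority 1 -- mirrors Flink's getContainerRequest below
    rmClient.addContainerRequest(new AMRMClient.ContainerRequest(
            Resource.newInstance(4096, 2), null, null, Priority.newInstance(1)));
}
Flink's actual request path: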
public CompletableFuture<YarnWorkerNode> requestResource(TaskExecutorProcessSpec taskExecutorProcessSpec) {
checkInitialized();
final CompletableFuture<YarnWorkerNode> requestResourceFuture = new CompletableFuture<>();
//The resource info Flink needs, to be submitted to the YARN RM
final Optional<TaskExecutorProcessSpecContainerResourcePriorityAdapter.PriorityAndResource> priorityAndResourceOpt =
taskExecutorProcessSpecContainerResourcePriorityAdapter.getPriorityAndResource(taskExecutorProcessSpec);
if (!priorityAndResourceOpt.isPresent()) {
requestResourceFuture.completeExceptionally(
new ResourceManagerException(
String.format("Could not compute the container Resource from the given TaskExecutorProcessSpec %s. " +
"This usually indicates the requested resource is larger than Yarn's max container resource limit.",
taskExecutorProcessSpec)));
} else {
final Priority priority = priorityAndResourceOpt.get().getPriority();
final Resource resource = priorityAndResourceOpt.get().getResource();
//Request a container from the YARN RM; if the RM finds free NM resources, it grants the request (the client's onContainersAllocated callback fires)
resourceManagerClient.addContainerRequest(getContainerRequest(resource, priority));
// make sure we transmit the request fast and receive fast news of granted allocations
resourceManagerClient.setHeartbeatInterval(containerRequestHeartbeatIntervalMillis);
requestResourceFutures.computeIfAbsent(taskExecutorProcessSpec, ignore -> new LinkedList<>()).add(requestResourceFuture);
log.info("Requesting new TaskExecutor container with resource {}, priority {}.", taskExecutorProcessSpec, priority);
}
return requestResourceFuture;
}
static AMRMClient.ContainerRequest getContainerRequest(Resource containerResource, Priority priority) {
//Submit the request to YARN; the container will be sized according to containerResource
return new AMRMClient.ContainerRequest(
containerResource,
null,
null,
priority);
}
After resourceManagerClient.addContainerRequest has been issued and the YARN RM has replied that free NM resources exist, the client-side onContainersAllocated callback fires and container initialization begins.
public void onContainersAllocated(List<Container> containers) {
runAsyncWithFatalHandler(() -> {
checkInitialized();
log.info("Received {} containers.", containers.size());
//Iterate over all containers YARN successfully allocated, grouped by priority
for (Map.Entry<Priority, List<Container>> entry : groupContainerByPriority(containers).entrySet()) {
//and prepare to launch them
onContainersOfPriorityAllocated(entry.getKey(), entry.getValue());
}
// if we are waiting for no further containers, we can go to the
// regular heartbeat interval
if (getNumRequestedNotAllocatedWorkers() <= 0) {
resourceManagerClient.setHeartbeatInterval(yarnHeartbeatIntervalMillis);
}
});
}
private void onContainersOfPriorityAllocated(Priority priority, List<Container> containers) {
....
//Iterate over the containers that need to be started
while (containerIterator.hasNext() && pendingContainerRequestIterator.hasNext()) {
final Container container = containerIterator.next();
final AMRMClient.ContainerRequest pendingRequest = pendingContainerRequestIterator.next();
final ResourceID resourceId = getContainerResourceId(container);
final CompletableFuture<YarnWorkerNode> requestResourceFuture = pendingRequestResourceFutures.poll();
Preconditions.checkState(requestResourceFuture != null);
if (pendingRequestResourceFutures.isEmpty()) {
requestResourceFutures.remove(taskExecutorProcessSpec);
}
//Start the container
startTaskExecutorInContainerAsync(container, taskExecutorProcessSpec, resourceId, requestResourceFuture);
removeContainerRequest(pendingRequest);
...
}
Flink then submits the container launch to the NodeManager. The submission resembles the JobGraph submission described earlier: the user jar, required resources, Kerberos credentials, and the entry main class YarnTaskExecutorRunner.class are packed into the LaunchContext, and the NMClientAsync client is invoked to ask the NM to start the container. A minimal standalone sketch of that raw API follows.
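For reference, a bare-bones, hypothetical launch via the raw YARN NM client (Hadoop 2.x API; localized jars, environment variables, and security tokens are omitted, and the launch command is simplified):
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
import org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
void launchTaskExecutor(Container container, NMClientAsync.CallbackHandler handler) {
    NMClientAsync nmClient = new NMClientAsyncImpl(handler);
    nmClient.init(new YarnConfiguration());
    nmClient.start();
    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
            Collections.emptyMap(),    //local resources: Flink ships its jars and config here
            Collections.emptyMap(),    //environment variables
            Collections.singletonList( //launch command: run the TM entry class
                    "java org.apache.flink.yarn.YarnTaskExecutorRunner 1> out.log 2> err.log"),
            null, null, null);         //service data, tokens, ACLs
    //The NM runs the command; YarnTaskExecutorRunner.main then starts the TaskExecutor
    nmClient.startContainerAsync(container, ctx);
}
Flink's actual launch path: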
private void startTaskExecutorInContainerAsync(
Container container,
TaskExecutorProcessSpec taskExecutorProcessSpec,
ResourceID resourceId,
CompletableFuture<YarnWorkerNode> requestResourceFuture) {
final CompletableFuture<ContainerLaunchContext> containerLaunchContextFuture =
//Package the resources Flink needs to submit into a ContainerLaunchContext
FutureUtils.supplyAsync(() -> createTaskExecutorLaunchContext(
resourceId, container.getNodeId().getHost(), taskExecutorProcessSpec), getIoExecutor());
FutureUtils.assertNoException(
containerLaunchContextFuture.handleAsync((context, exception) -> {
if (exception == null) {
//Call NMClientAsync to ask the NM to start the container.
//Once the NM has started it, the container runs the main method of YarnTaskExecutorRunner.class, which creates the TaskExecutor instance and waits for the JM to submit tasks
nodeManagerClient.startContainerAsync(container, context);
requestResourceFuture.complete(new YarnWorkerNode(container, resourceId));
} else {
requestResourceFuture.completeExceptionally(exception);
}
return null;
}, getMainThreadExecutor()));
}
After the NM starts the container, it executes the entry main class specified in the context Flink submitted, YarnTaskExecutorRunner.class.
Entering the TM-side main entry point:
private ContainerLaunchContext createTaskExecutorLaunchContext(
ResourceID containerId,
String host,
TaskExecutorProcessSpec taskExecutorProcessSpec) throws Exception {
.....
//Create the launch context that designates YarnTaskExecutorRunner.class as the entry main class
final ContainerLaunchContext taskExecutorLaunchContext = Utils.createTaskExecutorContext(
flinkConfig,
yarnConfig,
configuration,
taskManagerParameters,
taskManagerDynamicProperties,
currDir,
YarnTaskExecutorRunner.class,
log);
taskExecutorLaunchContext.getEnvironment()
.put(ENV_FLINK_NODE_ID, host);
return taskExecutorLaunchContext;
}
public class YarnTaskExecutorRunner {
....
//Main entry point on the executor side
public static void main(String[] args) {
EnvironmentInformation.logEnvironmentInfo(LOG, "YARN TaskExecutor runner", args);
SignalHandler.register(LOG);
JvmShutdownSafeguard.installAsShutdownHook(LOG);
//Start the TM
runTaskManagerSecurely(args);
}
TM initialization begins. This includes registering with the JM and creating the Akka actor; creating the NettyShuffleEnvironment and NetworkBufferPool responsible for data exchange, the kvStateService that maintains state, the TaskSlotTable, and more; establishing connections to the RM and other JM-side services; and then waiting for the JM to submit tasks that trigger execution.
public static void runTaskManager(Configuration configuration, PluginManager pluginManager) throws Exception {
final TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, pluginManager, TaskManagerRunner::createTaskExecutorService);
taskManagerRunner.start();
}
//The TaskExecutor is an Akka-based RpcEndpoint (Akka actor)
public class TaskExecutor extends RpcEndpoint implements TaskExecutorGateway {...}
public static TaskExecutor startTaskManager(
Configuration configuration,
ResourceID resourceID,
RpcService rpcService,
HighAvailabilityServices highAvailabilityServices,
HeartbeatServices heartbeatServices,
MetricRegistry metricRegistry,
BlobCacheService blobCacheService,
boolean localCommunicationOnly,
ExternalResourceInfoProvider externalResourceInfoProvider,
FatalErrorHandler fatalErrorHandler) throws Exception {
checkNotNull(configuration);
checkNotNull(resourceID);
checkNotNull(rpcService);
checkNotNull(highAvailabilityServices);
LOG.info("Starting TaskManager with ResourceID: {}", resourceID.getStringWithMetadata());
//External address of the rpcService (actorSystem) the TaskExecutor endpoint will run on
String externalAddress = rpcService.getAddress();
//The CPU, memory, and other resources configured for this TM
final TaskExecutorResourceSpec taskExecutorResourceSpec = TaskExecutorResourceUtils.resourceSpecFromConfig(configuration);
//Parameter configuration for services such as the NettyShuffleEnvironment and checkpointing
TaskManagerServicesConfiguration taskManagerServicesConfiguration =
TaskManagerServicesConfiguration.fromConfiguration(
configuration,
resourceID,
externalAddress,
localCommunicationOnly,
taskExecutorResourceSpec);
//Mainly wraps the hostname, TM ID, etc.
Tuple2<TaskManagerMetricGroup, MetricGroup> taskManagerMetricGroup = MetricUtils.instantiateTaskManagerMetricGroup(
metricRegistry,
externalAddress,
resourceID,
taskManagerServicesConfiguration.getSystemResourceMetricsProbingInterval());
//Create a fixed-size thread pool; the thread count comes from cluster.io-pool.size, defaulting to 4 * number of CPU cores
final ExecutorService ioExecutor = Executors.newFixedThreadPool(
taskManagerServicesConfiguration.getNumIoThreads(),
new ExecutorThreadFactory("flink-taskexecutor-io"));
//Create the TaskManagerServices, which contain the NettyShuffleEnvironment, TaskSlotTable, and other services
TaskManagerServices taskManagerServices = TaskManagerServices.fromConfiguration(
taskManagerServicesConfiguration,
blobCacheService.getPermanentBlobService(),
taskManagerMetricGroup.f1,
ioExecutor,
fatalErrorHandler);
//Initialize the TM memory metrics
MetricUtils.instantiateFlinkMemoryMetricGroup(
taskManagerMetricGroup.f1,
taskManagerServices.getTaskSlotTable(),
taskManagerServices::getManagedMemorySize);
//Create the TM configuration holder: slot-related parameters, the resource spec built above, temp directory paths, log paths, etc.
TaskManagerConfiguration taskManagerConfiguration =
TaskManagerConfiguration.fromConfiguration(configuration, taskExecutorResourceSpec, externalAddress);
String metricQueryServiceAddress = metricRegistry.getMetricQueryServiceGatewayRpcAddress();
//Create the TaskExecutor (an RpcEndpoint)
return new TaskExecutor(
rpcService,
taskManagerConfiguration,
highAvailabilityServices,
taskManagerServices,
externalResourceInfoProvider,
heartbeatServices,
taskManagerMetricGroup.f0,
metricQueryServiceAddress,
blobCacheService,
fatalErrorHandler,
new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));
}
The TM creates quite a few services, such as the kvStateService, taskSlotTable, and broadcastVariableManager.
Here we focus on the core component responsible for data exchange between Executions: the NettyShuffleEnvironment.
Creating the NettyShuffleEnvironment involves three core components (the objects involved will be covered in later chapters):
- networkBufferPool, which manages the TM's on-heap/off-heap network memory
- resultPartitionFactory, which produces the partitions that send data to downstream Executions
- inputGateFactory, which produces the gates that pull data from upstream Executions
private static ShuffleEnvironment<?, ?> createShuffleEnvironment(
TaskManagerServicesConfiguration taskManagerServicesConfiguration,
TaskEventDispatcher taskEventDispatcher,
MetricGroup taskManagerMetricGroup,
Executor ioExecutor) throws FlinkException {
//Create a context carrying all the configuration
final ShuffleEnvironmentContext shuffleEnvironmentContext = new ShuffleEnvironmentContext(
taskManagerServicesConfiguration.getConfiguration(),
taskManagerServicesConfiguration.getResourceID(),
taskManagerServicesConfiguration.getNetworkMemorySize(),
taskManagerServicesConfiguration.isLocalCommunicationOnly(),
taskManagerServicesConfiguration.getBindAddress(),
taskEventDispatcher,
taskManagerMetricGroup,
ioExecutor);
return ShuffleServiceLoader
.loadShuffleServiceFactory(taskManagerServicesConfiguration.getConfiguration())
//Creates the networkBufferPool, resultPartitionFactory, and inputGateFactory
.createShuffleEnvironment(shuffleEnvironmentContext);
}
static NettyShuffleEnvironment createNettyShuffleEnvironment(
NettyShuffleEnvironmentConfiguration config,
ResourceID taskExecutorResourceId,
TaskEventPublisher taskEventPublisher,
ResultPartitionManager resultPartitionManager,
MetricGroup metricGroup,
Executor ioExecutor) {
checkNotNull(config);
checkNotNull(taskExecutorResourceId);
checkNotNull(taskEventPublisher);
checkNotNull(resultPartitionManager);
checkNotNull(metricGroup);
NettyConfig nettyConfig = config.nettyConfig();
FileChannelManager fileChannelManager = new FileChannelManagerImpl(config.getTempDirs(), DIR_NAME_PREFIX);
ConnectionManager connectionManager = nettyConfig != null ?
new NettyConnectionManager(resultPartitionManager, taskEventPublisher, nettyConfig) :
new LocalConnectionManager();
//Manages the current TM's memory partitioning; each task's LocalBufferPool requests segments from it
NetworkBufferPool networkBufferPool = new NetworkBufferPool(
config.numNetworkBuffers(),//how many segments can be allocated, computed earlier in fromConfiguration
config.networkBufferSize(),//segment size, 32 KB by default
config.getRequestSegmentsTimeout());//timeout for segment requests
registerShuffleMetrics(metricGroup, networkBufferPool);
//Mainly responsible for creating ResultPartitions (which contain ResultSubpartitions) and their LocalBufferPools
ResultPartitionFactory resultPartitionFactory = new ResultPartitionFactory(
resultPartitionManager,
fileChannelManager,
networkBufferPool,
config.getBlockingSubpartitionType(),
config.networkBuffersPerChannel(),
config.floatingNetworkBuffersPerGate(),
config.networkBufferSize(),
config.isBlockingShuffleCompressionEnabled(),
config.getCompressionCodec(),
config.getMaxBuffersPerChannel());
//Mainly responsible for creating InputGates (which contain InputChannels) and their LocalBufferPools
SingleInputGateFactory singleInputGateFactory = new SingleInputGateFactory(
taskExecutorResourceId,
config,
connectionManager,
resultPartitionManager,
taskEventPublisher,
networkBufferPool);
//Create the NettyShuffleEnvironment instance
return new NettyShuffleEnvironment(
taskExecutorResourceId,
config,
networkBufferPool,
connectionManager,
resultPartitionManager,
fileChannelManager,
resultPartitionFactory,
singleInputGateFactory,
ioExecutor);
}
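To put numbers on the NetworkBufferPool parameters above: the buffer count is simply the configured network memory divided by the segment size (taskmanager.memory.segment-size, 32 KB by default). A back-of-the-envelope check, where the 128 MB figure is only an example, not a Flink default:
//Illustrative arithmetic only; the real values are derived from the configuration in fromConfiguration.
long networkMemoryBytes = 128L * 1024 * 1024; //assume 128 MB of network memory (example value)
int segmentSize = 32 * 1024; //taskmanager.memory.segment-size, 32 KB default
long numNetworkBuffers = networkMemoryBytes / segmentSize;
System.out.println(numNetworkBuffers); //4096 segments shared by all LocalBufferPools on this TM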
Once the TM has created all of the services above, it starts (the AkkaRpcServer) and connects to the RM and other JM-side services.
public void start() throws Exception {
taskExecutorService.start();
}
//Start the TM
public void onStart() throws Exception {
try {
startTaskExecutorServices();
} catch (Throwable t) {
final TaskManagerException exception = new TaskManagerException(String.format("Could not start the TaskExecutor %s", getAddress()), t);
onFatalError(exception);
throw exception;
}
startRegistrationTimeout();
}
private void startTaskExecutorServices() throws Exception {
try {
// start by connecting to the ResourceManager
//Establish the connection to the RM
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
// tell the task slot table who's responsible for the task slot actions
//Start the taskSlotTable
taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());
// start the job leader service
//Start the job leader service with the rpc address, the AkkaRpcService, the ZooKeeper-based haServices, and a JobLeaderListener
jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());
fileCache = new FileCache(taskManagerConfiguration.getTmpDirectories(), blobCacheService.getPermanentBlobService());
} catch (Exception e) {
handleStartTaskExecutorServicesException(e);
}
}
At this point we have gone from JM startup through slot requests and container deployment to TM instantiation. This stage is mostly about acquiring resources, carving up configuration, and initializing services; the cluster is not yet fully in motion.
Later chapters will cover the final step of the physical execution graph: Flink submitting tasks to the TMs, starting the Task threads, and the exchange of user data. Thanks.