Flink Source Code Walkthrough <5>: How the Cluster Starts

Petrov_Dong

Last modified on 2022-03-19 11:41:42


First published on 2022-03-18 18:23:35

Article link: https://blog.csdn.net/Li_DeSheng/article/details/121567901


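I. ClusterEntrypoint

When a user starts a cluster with the Session CLI command, the Flink cluster startup script first invokes the main() method of a concrete ClusterEntrypoint subclass to start and run the corresponding type of cluster environment.

In other words, ClusterEntrypoint (CE for short) is the entry class of the whole cluster, and its concrete subclasses carry the main() method. At runtime, all services are triggered and started through the CE class, which in turn creates and initializes the core components.

Let us first look at the inheritance hierarchy of the abstract CE class.

ClusterEntrypoint has two kinds of subclasses:

SessionClusterEntrypoint
- Creates a single cluster that can run several jobs at the same time. Resource utilization is higher, but if the cluster goes down, many jobs are affected.
JobClusterEntrypoint
- Also called per-job mode. A dedicated cluster is created for each job, so a cluster failure affects only that one job.

Each kind is further specialized by deployment target: standalone corresponds to the local mode, while mesos and yarn correspond to cluster modes that use different schedulers.

Starting from the main() method of StandaloneSessionClusterEntrypoint, let us see how ClusterEntrypoint starts the cluster.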

public static void main(String[] args) {
		// startup checks and logging
		EnvironmentInformation.logEnvironmentInfo(LOG, StandaloneSessionClusterEntrypoint.class.getSimpleName(), args);
		SignalHandler.register(LOG);
		JvmShutdownSafeguard.installAsShutdownHook(LOG);

		EntrypointClusterConfiguration entrypointClusterConfiguration = null;
		final CommandLineParser<EntrypointClusterConfiguration> commandLineParser = new CommandLineParser<>(new EntrypointClusterConfigurationParserFactory());

		try {
			entrypointClusterConfiguration = commandLineParser.parse(args);
		} catch (FlinkParseException e) {
			LOG.error("Could not parse command line arguments {}.", args, e);
			commandLineParser.printHelp(StandaloneSessionClusterEntrypoint.class.getSimpleName());
			System.exit(1);
		}

		Configuration configuration = loadConfiguration(entrypointClusterConfiguration);

		StandaloneSessionClusterEntrypoint entrypoint = new StandaloneSessionClusterEntrypoint(configuration);

		// after the configuration steps above, start the cluster through runClusterEntrypoint of the abstract CE class
		ClusterEntrypoint.runClusterEntrypoint(entrypoint);
	}

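As the last line shows, after the configuration has been parsed and logging has been set up, main() calls the runClusterEntrypoint method of ClusterEntrypoint. Let us see what this method does.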

	public static void runClusterEntrypoint(ClusterEntrypoint clusterEntrypoint) {

		final String clusterEntrypointName = clusterEntrypoint.getClass().getSimpleName();
		try {
			clusterEntrypoint.startCluster(); // ⭐ this line starts the cluster
		} catch (ClusterEntrypointException e) {
			LOG.error(String.format("Could not start cluster entrypoint %s.", clusterEntrypointName), e);
			System.exit(STARTUP_FAILURE_RETURN_CODE);
		}

		clusterEntrypoint.getTerminationFuture().whenComplete((applicationStatus, throwable) -> {
			final int returnCode;

			if (throwable != null) {
				returnCode = RUNTIME_FAILURE_RETURN_CODE;
			} else {
				returnCode = applicationStatus.processExitCode();
			}

			LOG.info("Terminating cluster entrypoint process {} with exit code {}.", clusterEntrypointName, returnCode, throwable);
			System.exit(returnCode);
		});
	}

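The ⭐ line above calls CE.startCluster() to continue the startup. Once the cluster has finished running, the callback registered with clusterEntrypoint.getTerminationFuture().whenComplete() receives the final status and maps it to the process exit code.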
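The start-then-wait pattern described above can be sketched in isolation. This is only an illustration: the class and interface names are hypothetical stand-ins rather than Flink's classes, the two exit code constants are assumed values, and the exit code is returned instead of being passed to System.exit() so that the behaviour can be checked.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the entrypoint pattern; not Flink's classes. */
final class EntrypointExitSketch {

    // Assumed values, for illustration only.
    static final int STARTUP_FAILURE = 1;
    static final int RUNTIME_FAILURE = 2;

    /** Minimal stand-in for a cluster entrypoint. */
    interface Entrypoint {
        void start() throws Exception;

        CompletableFuture<Integer> terminationFuture();
    }

    /** Starts the entrypoint and returns a future holding the process exit code. */
    static CompletableFuture<Integer> run(Entrypoint entrypoint) {
        try {
            entrypoint.start();
        } catch (Exception e) {
            // Startup failed: report immediately, nothing to wait for.
            return CompletableFuture.completedFuture(STARTUP_FAILURE);
        }
        // handle() sees either the final status or the failure.
        return entrypoint.terminationFuture()
            .handle((exitCode, throwable) -> throwable != null ? RUNTIME_FAILURE : exitCode);
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> termination = new CompletableFuture<>();
        Entrypoint entrypoint = new Entrypoint() {
            @Override
            public void start() {
                // nothing to start in this sketch
            }

            @Override
            public CompletableFuture<Integer> terminationFuture() {
                return termination;
            }
        };
        CompletableFuture<Integer> exit = run(entrypoint);
        termination.complete(0); // the cluster shut down normally
        System.out.println("exit code: " + exit.join());
    }
}
```

The point of the design is that startup errors are reported synchronously, while everything that happens later arrives through the termination future.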

Now let us see what startCluster() does.

	public void startCluster() throws ClusterEntrypointException {
		LOG.info("Starting {}.", getClass().getSimpleName());

		try {
			configureFileSystems(configuration); // configure the file systems

			SecurityContext securityContext = installSecurityContext(configuration);

			securityContext.runSecured((Callable<Void>) () -> {
				runCluster(configuration); // ⭐ continue the startup inside the security context

				return null;
			});
		} catch (Throwable t) {
			final Throwable strippedThrowable = ExceptionUtils.stripException(t, UndeclaredThrowableException.class);

			try {
				// clean up any partial state
				shutDownAsync(
					ApplicationStatus.FAILED,
					ExceptionUtils.stringifyException(strippedThrowable),
					false).get(INITIALIZATION_SHUTDOWN_TIMEOUT.toMilliseconds(), TimeUnit.MILLISECONDS);
			} catch (InterruptedException | ExecutionException | TimeoutException e) {
				strippedThrowable.addSuppressed(e);
			}

			throw new ClusterEntrypointException(
				String.format("Failed to initialize the cluster entrypoint %s.", getClass().getSimpleName()),
				strippedThrowable);
		}
	}

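Note the ⭐ line: runCluster() is not called directly by ClusterEntrypoint. It is executed through securityContext.runSecured(), so the rest of the startup runs inside the installed security context. Let us continue with runCluster().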
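The wrapping can be pictured with a small sketch. It assumes a simplified SecurityContext interface with a single runSecured(Callable) method. LoggingContext and its trace field are hypothetical helpers that only make the scope visible and perform no real security setup.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of running startup logic inside a security scope. */
final class SecurityContextSketch {

    /** Simplified stand-in: runs a callable inside some security scope. */
    interface SecurityContext {
        <T> T runSecured(Callable<T> callable) throws Exception;
    }

    /** A context that only records when the scope is entered and left. */
    static final class LoggingContext implements SecurityContext {
        final StringBuilder trace = new StringBuilder();

        @Override
        public <T> T runSecured(Callable<T> callable) throws Exception {
            trace.append("enter;");
            try {
                return callable.call();
            } finally {
                trace.append("exit;");
            }
        }
    }

    /** Mirrors the shape of startCluster(): the real work happens inside runSecured(). */
    static String startCluster(LoggingContext context) {
        try {
            context.runSecured(() -> {
                context.trace.append("runCluster;"); // the entrypoint's own logic
                return null;
            });
        } catch (Exception e) {
            throw new IllegalStateException("Failed to initialize the cluster entrypoint.", e);
        }
        return context.trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(startCluster(new LoggingContext())); // prints enter;runCluster;exit;
    }
}
```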

	private void runCluster(Configuration configuration) throws Exception {
		synchronized (lock) {
			//⭐ initialize the base services the runtime cluster needs, such as HAServices and the common RpcService
			initializeServices(configuration);

			// write host information into configuration
			configuration.setString(JobManagerOptions.ADDRESS, commonRpcService.getAddress());
			configuration.setInteger(JobManagerOptions.PORT, commonRpcService.getPort());

			final DispatcherResourceManagerComponentFactory<?> dispatcherResourceManagerComponentFactory = createDispatcherResourceManagerComponentFactory(configuration);

			//⭐ create the cluster component clusterComponent
			//⭐ it contains the resourceManager, dispatcher and webMonitorEndpoint
			clusterComponent = dispatcherResourceManagerComponentFactory.create(
				configuration,
				commonRpcService,
				haServices,
				blobServer,
				heartbeatServices,
				metricRegistry,
				archivedExecutionGraphStore,
				new AkkaQueryServiceRetriever(
					metricQueryServiceActorSystem,
					Time.milliseconds(configuration.getLong(WebOptions.TIMEOUT))),
				this);

			clusterComponent.getShutDownFuture().whenComplete(
				(ApplicationStatus applicationStatus, Throwable throwable) -> {
					if (throwable != null) {
						shutDownAsync(
							ApplicationStatus.UNKNOWN,
							ExceptionUtils.stringifyException(throwable),
							false);
					} else {
						// This is the general shutdown path. If a separate more specific shutdown was
						// already triggered, this will do nothing
						shutDownAsync(
							applicationStatus,
							null,
							true);
					}
				});
		}
	}

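This step initializes a number of services and then delegates the creation and startup of the components to dispatcherResourceManagerComponentFactory.create(). Let us look at that method.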

	@Override
	public DispatcherResourceManagerComponent<T> create(
			Configuration configuration,
			RpcService rpcService,
			HighAvailabilityServices highAvailabilityServices,
			BlobServer blobServer,
			HeartbeatServices heartbeatServices,
			MetricRegistry metricRegistry,
			ArchivedExecutionGraphStore archivedExecutionGraphStore,
			MetricQueryServiceRetriever metricQueryServiceRetriever,
			FatalErrorHandler fatalErrorHandler) throws Exception {

		LeaderRetrievalService dispatcherLeaderRetrievalService = null;
		LeaderRetrievalService resourceManagerRetrievalService = null;
		WebMonitorEndpoint<U> webMonitorEndpoint = null;
		ResourceManager<?> resourceManager = null;
		JobManagerMetricGroup jobManagerMetricGroup = null;
		T dispatcher = null;

		try {
			dispatcherLeaderRetrievalService = highAvailabilityServices.getDispatcherLeaderRetriever();

			resourceManagerRetrievalService = highAvailabilityServices.getResourceManagerLeaderRetriever();

			final LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever = new RpcGatewayRetriever<>(
				rpcService,
				DispatcherGateway.class,
				DispatcherId::fromUuid,
				10,
				Time.milliseconds(50L));

			final LeaderGatewayRetriever<ResourceManagerGateway> resourceManagerGatewayRetriever = new RpcGatewayRetriever<>(
				rpcService,
				ResourceManagerGateway.class,
				ResourceManagerId::fromUuid,
				10,
				Time.milliseconds(50L));

			final ExecutorService executor = WebMonitorEndpoint.createExecutorService(
				configuration.getInteger(RestOptions.SERVER_NUM_THREADS),
				configuration.getInteger(RestOptions.SERVER_THREAD_PRIORITY),
				"DispatcherRestEndpoint");

			final long updateInterval = configuration.getLong(MetricOptions.METRIC_FETCHER_UPDATE_INTERVAL);
			final MetricFetcher metricFetcher = updateInterval == 0
				? VoidMetricFetcher.INSTANCE
				: MetricFetcherImpl.fromConfiguration(
					configuration,
					metricQueryServiceRetriever,
					dispatcherGatewayRetriever,
					executor);

			webMonitorEndpoint = restEndpointFactory.createRestEndpoint(
				configuration,
				dispatcherGatewayRetriever,
				resourceManagerGatewayRetriever,
				blobServer,
				executor,
				metricFetcher,
				highAvailabilityServices.getWebMonitorLeaderElectionService(),
				fatalErrorHandler); // ⭐ create the webMonitorEndpoint

			log.debug("Starting Dispatcher REST endpoint.");
			webMonitorEndpoint.start(); // ⭐ start the webMonitorEndpoint

			final String hostname = getHostname(rpcService);

			jobManagerMetricGroup = MetricUtils.instantiateJobManagerMetricGroup(
				metricRegistry,
				hostname,
				ConfigurationUtils.getSystemResourceMetricsProbingInterval(configuration));

			resourceManager = resourceManagerFactory.createResourceManager(
				configuration,
				ResourceID.generate(),
				rpcService,
				highAvailabilityServices,
				heartbeatServices,
				metricRegistry,
				fatalErrorHandler,
				new ClusterInformation(hostname, blobServer.getPort()),
				webMonitorEndpoint.getRestBaseUrl(),
				jobManagerMetricGroup); // ⭐ create the ResourceManager

			final HistoryServerArchivist historyServerArchivist = HistoryServerArchivist.createHistoryServerArchivist(configuration, webMonitorEndpoint);

			dispatcher = dispatcherFactory.createDispatcher(
				configuration,
				rpcService,
				highAvailabilityServices,
				resourceManagerGatewayRetriever,
				blobServer,
				heartbeatServices,
				jobManagerMetricGroup,
				metricRegistry.getMetricQueryServicePath(),
				archivedExecutionGraphStore,
				fatalErrorHandler,
				historyServerArchivist); // ⭐ create the dispatcher

			log.debug("Starting ResourceManager.");
			resourceManager.start(); // ⭐ start the ResourceManager
			resourceManagerRetrievalService.start(resourceManagerGatewayRetriever);

			log.debug("Starting Dispatcher.");
			dispatcher.start(); // ⭐ start the dispatcher
			dispatcherLeaderRetrievalService.start(dispatcherGatewayRetriever);

			return createDispatcherResourceManagerComponent(
				dispatcher,
				resourceManager,
				dispatcherLeaderRetrievalService,
				resourceManagerRetrievalService,
				webMonitorEndpoint,
				jobManagerMetricGroup);

		} catch (Exception exception) {
			// clean up all started components
			if (dispatcherLeaderRetrievalService != null) {
				try {
					dispatcherLeaderRetrievalService.stop();
				} catch (Exception e) {
					exception = ExceptionUtils.firstOrSuppressed(e, exception);
				}
			}

			if (resourceManagerRetrievalService != null) {
				try {
					resourceManagerRetrievalService.stop();
				} catch (Exception e) {
					exception = ExceptionUtils.firstOrSuppressed(e, exception);
				}
			}

			final Collection<CompletableFuture<Void>> terminationFutures = new ArrayList<>(3);

			if (webMonitorEndpoint != null) {
				terminationFutures.add(webMonitorEndpoint.closeAsync());
			}

			if (resourceManager != null) {
				terminationFutures.add(resourceManager.closeAsync());
			}

			if (dispatcher != null) {
				terminationFutures.add(dispatcher.closeAsync());
			}

			final FutureUtils.ConjunctFuture<Void> terminationFuture = FutureUtils.completeAll(terminationFutures);

			try {
				terminationFuture.get();
			} catch (Exception e) {
				exception = ExceptionUtils.firstOrSuppressed(e, exception);
			}

			if (jobManagerMetricGroup != null) {
				jobManagerMetricGroup.close();
			}

			throw new FlinkException("Could not create the DispatcherResourceManagerComponent.", exception);
		}
	}

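The places where components are created and started are marked with ⭐.

As we can see, ClusterEntrypoint ends up starting three components: WebMonitorEndpoint, Dispatcher and ResourceManager. Let us look at how each of them is started.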
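The overall shape of create(), which creates and starts the components in order and closes whatever already exists when something fails, can be sketched as follows. Component is a hypothetical class that only writes to a log. The real method closes its components asynchronously and collects suppressed exceptions, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "start in order, clean up on failure". */
final class ComponentStartupSketch {

    /** Stand-in component that logs its lifecycle and may fail on start. */
    static final class Component {
        final String name;
        final boolean failOnStart;
        final List<String> log;

        Component(String name, boolean failOnStart, List<String> log) {
            this.name = name;
            this.failOnStart = failOnStart;
            this.log = log;
        }

        void start() {
            if (failOnStart) {
                throw new IllegalStateException(name + " failed to start");
            }
            log.add("start " + name);
        }

        void close() {
            log.add("close " + name);
        }
    }

    /** Same order as in create(): REST endpoint first, then ResourceManager and Dispatcher. */
    static List<String> startAll(boolean dispatcherFails) {
        List<String> log = new ArrayList<>();
        Component webMonitorEndpoint = null;
        Component resourceManager = null;
        Component dispatcher = null;
        try {
            webMonitorEndpoint = new Component("webMonitorEndpoint", false, log);
            webMonitorEndpoint.start();
            resourceManager = new Component("resourceManager", false, log);
            dispatcher = new Component("dispatcher", dispatcherFails, log);
            resourceManager.start();
            dispatcher.start();
        } catch (RuntimeException e) {
            // Close every component that was created, whether it started or not.
            for (Component c : new Component[] {webMonitorEndpoint, resourceManager, dispatcher}) {
                if (c != null) {
                    c.close();
                }
            }
            log.add("failed: " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        startAll(true).forEach(System.out::println);
    }
}
```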

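II. WebMonitorEndpoint

WebMonitorEndpoint (WME) implements a RESTful service backend on top of the Netty networking framework. Its REST interface serves all REST requests, including those from the Flink Web UI, for example requests for cluster monitoring metrics.

If you have not worked with Netty or REST APIs before, it is worth reading an introduction to them first.

Let us look at the inheritance hierarchy of this class.

RestServerEndpoint, the parent class of WME, implements the REST backend on Netty and declares abstract methods for initializing and implementing custom handlers. Subclasses such as WebMonitorEndpoint and DispatcherRestEndpoint (DRE) extend it with the handler implementations for their own REST interfaces.

MiniDispatcherRestEndpoint (MiniDRE) is the DRE variant for local execution. The difference is that the mini version does not load the handlers used for JobGraph submission. When a job runs inside IDEA, the endpoint that is actually created is a MiniDRE.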
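A stripped-down sketch of this hierarchy is shown below. The nested classes borrow the names from the text but are simplified stand-ins, and the handlers are plain strings chosen for illustration. How the mini endpoint extends its parent is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: subclasses decide which REST handlers get registered. */
final class RestEndpointHierarchySketch {

    /** Base class: owns the startup sequence and leaves the handler list to subclasses. */
    abstract static class RestServerEndpoint {
        protected abstract List<String> initializeHandlers();

        final List<String> start() {
            return initializeHandlers();
        }
    }

    /** Monitoring handlers shared by all web endpoints. */
    static class WebMonitorEndpoint extends RestServerEndpoint {
        @Override
        protected List<String> initializeHandlers() {
            List<String> handlers = new ArrayList<>();
            handlers.add("GET /jobs");
            handlers.add("GET /taskmanagers");
            return handlers;
        }
    }

    /** Adds the job submission handler on top of the monitoring handlers. */
    static class DispatcherRestEndpoint extends WebMonitorEndpoint {
        @Override
        protected List<String> initializeHandlers() {
            List<String> handlers = super.initializeHandlers();
            handlers.add("POST /jobs");
            return handlers;
        }
    }

    /** The mini variant keeps only the monitoring handlers. */
    static class MiniDispatcherRestEndpoint extends WebMonitorEndpoint {
    }

    public static void main(String[] args) {
        System.out.println(new DispatcherRestEndpoint().start());
        System.out.println(new MiniDispatcherRestEndpoint().start());
    }
}
```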

First, let us revisit how AbstractDispatcherResourceManagerComponentFactory creates and starts the WME in the last code block of the previous section.

webMonitorEndpoint = restEndpointFactory.createRestEndpoint(
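				configuration,   // cluster configuration
				dispatcherGatewayRetriever,  // retrieves the address of the currently active DispatcherGateway; the gateway enables RPC communication with the dispatcher, and submitted JobGraphs are ultimately sent to the dispatcher through it
				resourceManagerGatewayRetriever, // same idea for the ResourceManagerGateway; for example, TaskManagersHandler calls it to obtain monitoring information about the TaskManagers in the cluster
				blobServer, // storage service for temporary binary objects
				executor, // thread pool that serves WebMonitorEndpoint requests
				metricFetcher, // fetches metrics from the JobManager and the TaskManagers
				highAvailabilityServices.getWebMonitorLeaderElectionService(), // leader election service used for high availability
				fatalErrorHandler // fatal error handler, invoked when the WME fails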
                );

			log.debug("Starting Dispatcher REST endpoint.");
			webMonitorEndpoint.start(); // ⭐ start the webMonitorEndpoint

All the parameters needed for startup are listed and commented above. In the end, webMonitorEndpoint.start() starts the service. Let us see what start() does.

	/**
	 * Starts this REST server endpoint.
	 *
	 * @throws Exception if we cannot start the RestServerEndpoint
	 */
	public final void start() throws Exception {
		synchronized (lock) {
			// check that RestServerEndpoint.state is CREATED
			Preconditions.checkState(state == State.CREATED, "The RestServerEndpoint cannot be restarted.");

			// start the REST endpoint
			log.info("Starting rest endpoint.");
			final Router router = new Router(); // the Router is used to look up handlers
			final CompletableFuture<String> restAddressFuture = new CompletableFuture<>();

			// initialize the handlers
			handlers = initializeHandlers(restAddressFuture);

			/* handler ordering
			 * sort the handlers such that they are ordered the following:
			 * /jobs
			 * /jobs/overview
			 * /jobs/:jobid
			 * /jobs/:jobid/config
			 * /:*
			 */
			Collections.sort(
				handlers,
				RestHandlerUrlComparator.INSTANCE);

			// register the handlers one by one
			handlers.forEach(handler -> {
				registerHandler(router, handler, log);
			});

			// create the ChannelInitializer that sets up each channel
			ChannelInitializer<SocketChannel> initializer = new ChannelInitializer<SocketChannel>() {

				@Override
				protected void initChannel(SocketChannel ch) {
					// create the RouterHandler, which intercepts and routes the requests
					RouterHandler handler = new RouterHandler(router, responseHeaders);

					// SSL should be the first handler in the pipeline
					if (isHttpsEnabled()) {
						ch.pipeline().addLast("ssl",
							new RedirectingSslHandler(restAddress, restAddressFuture, sslHandlerFactory));
					}

					ch.pipeline()
						.addLast(new HttpServerCodec())
						.addLast(new FileUploadHandler(uploadDir))
						.addLast(new FlinkHttpObjectAggregator(maxContentLength, responseHeaders))
						.addLast(new ChunkedWriteHandler())
						.addLast(handler.getName(), handler)
						.addLast(new PipelineErrorHandler(log, responseHeaders));
				}
			};

			// create the bossGroup and the workerGroup
			NioEventLoopGroup bossGroup = new NioEventLoopGroup(1, new ExecutorThreadFactory("flink-rest-server-netty-boss"));
			NioEventLoopGroup workerGroup = new NioEventLoopGroup(0, new ExecutorThreadFactory("flink-rest-server-netty-worker"));

			// create the ServerBootstrap
			bootstrap = new ServerBootstrap();
			// attach the bossGroup, the workerGroup and the initializer
			bootstrap
				.group(bossGroup, workerGroup)
				.channel(NioServerSocketChannel.class)
				.childHandler(initializer);

			// parse the candidate ports from restBindPortRange
			Iterator<Integer> portsIterator;
			try {
				portsIterator = NetUtils.getPortRangeFromString(restBindPortRange);
			} catch (IllegalConfigurationException e) {
				throw e;
			} catch (Exception e) {
				throw new IllegalArgumentException("Invalid port range definition: " + restBindPortRange);
			}

			// pick the first free port from portsIterator as the port the bootstrap binds to
			int chosenPort = 0;
			while (portsIterator.hasNext()) {
				try {
					chosenPort = portsIterator.next();
					final ChannelFuture channel;
					if (restBindAddress == null) {
						channel = bootstrap.bind(chosenPort);
					} else {
						channel = bootstrap.bind(restBindAddress, chosenPort);
					}
					serverChannel = channel.syncUninterruptibly().channel();
					break;
				} catch (final Exception e) {
					// continue if the exception is due to the port being in use, fail early otherwise
					if (!(e instanceof org.jboss.netty.channel.ChannelException || e instanceof java.net.BindException)) {
						throw e;
					}
				}
			}

			if (serverChannel == null) {
				throw new BindException("Could not start rest endpoint on any port in port range " + restBindPortRange);
			}

			// the ServerBootstrap started successfully
			log.debug("Binding rest endpoint to {}:{}.", restBindAddress, chosenPort);

			final InetSocketAddress bindAddress = (InetSocketAddress) serverChannel.localAddress();
			final String advertisedAddress;
			if (bindAddress.getAddress().isAnyLocalAddress()) {
				advertisedAddress = this.restAddress;
			} else {
				advertisedAddress = bindAddress.getAddress().getHostAddress();
			}
			final int port = bindAddress.getPort();

			log.info("Rest endpoint listening at {}:{}", advertisedAddress, port);

			restBaseUrl = new URL(determineProtocol(), advertisedAddress, port, "").toString();

			restAddressFuture.complete(restBaseUrl);

			// set the state to RUNNING
			state = State.RUNNING;

			// call the internal start hook to start the concrete RestEndpoint service
			startInternal();
		}
	}

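The purpose of each block is noted in the comments above. At the end, start() calls startInternal() to start the service. This is an abstract method, and each WME implementation provides its own version.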
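The port selection loop in start() is worth a closer look. The sketch below reproduces the idea with a plain java.net.ServerSocket instead of a Netty ServerBootstrap: it tries each port in a range, moves on when the port is already in use, and fails if no port in the range can be bound. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;
import java.util.Iterator;
import java.util.stream.IntStream;

/** Hypothetical sketch of "bind to the first free port in a range". */
final class PortRangeBindSketch {

    /** Tries each port in [from, to] and returns the first socket that binds. */
    static ServerSocket bindFirstFree(int from, int to) throws IOException {
        Iterator<Integer> ports = IntStream.rangeClosed(from, to).iterator();
        while (ports.hasNext()) {
            int port = ports.next();
            try {
                return new ServerSocket(port);
            } catch (BindException e) {
                // The port is taken: fall through and try the next one.
                // Any other IOException propagates, so unrelated errors fail early.
            }
        }
        throw new BindException("Could not bind to any port in range " + from + "-" + to);
    }

    public static void main(String[] args) throws IOException {
        // The range below is an arbitrary example.
        try (ServerSocket socket = bindFirstFree(8081, 8090)) {
            System.out.println("Listening on port " + socket.getLocalPort());
        }
    }
}
```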

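III. Dispatcher

Dispatcher involves quite a few components. The overview below gives a rough picture.

Dispatcher: receives jobs and dispatches them within the cluster. The client submits a job to the Dispatcher through the ClusterClient, and the Dispatcher starts a JobManager from the JobGraph.
DispatcherRunner: starts and manages the Dispatcher component and supports leader election.
DispatcherLeaderProcess: manages the Dispatcher lifecycle and handles the recovery of JobGraphs.
DispatcherGatewayService: used to obtain the DispatcherGateway.

I have not fully understood this part of the source code yet. I will come back to it once I have studied job submission.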

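IV. ResourceManager

Most readers already know what the RM does, so I will not go into detail.

First, the inheritance hierarchy of the RM.

The inheritance chain is RpcEndpoint→FencedRpcEndpoint→ResourceManager.

RpcEndpoint: the base class of RPC endpoints. Every component that offers remote calls must extend it, which is why Dispatcher and JobMaster also appear under it.

FencedRpcEndpoint: adds a layer of high-availability functionality on top of RpcEndpoint.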
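The "fenced" part of the name refers to fencing tokens: a call is only accepted when it carries the token of the current leader session, so messages addressed to a previous leader are dropped. A minimal sketch of the idea, with a hypothetical FencedEndpoint class that is far simpler than Flink's implementation:

```java
import java.util.UUID;

/** Hypothetical sketch of token-based fencing; not Flink's implementation. */
final class FencingSketch {

    static final class FencedEndpoint {
        private UUID fencingToken; // null until leadership has been granted

        void setFencingToken(UUID token) {
            this.fencingToken = token;
        }

        /** Accepts the call only if the caller presents the current token. */
        String handle(UUID callerToken, String message) {
            if (fencingToken == null || !fencingToken.equals(callerToken)) {
                return "rejected";
            }
            return "handled: " + message;
        }
    }

    public static void main(String[] args) {
        FencedEndpoint endpoint = new FencedEndpoint();
        UUID oldSession = UUID.randomUUID();
        UUID newSession = UUID.randomUUID();

        endpoint.setFencingToken(oldSession);
        System.out.println(endpoint.handle(oldSession, "requestSlot"));

        // After a new leader session starts, calls with the old token are rejected.
        endpoint.setFencingToken(newSession);
        System.out.println(endpoint.handle(oldSession, "requestSlot"));
    }
}
```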

Taking createResourceManager() of StandaloneResourceManagerFactory as an example, let us see how the RM is created.

	public ResourceManager<ResourceID> createResourceManager(
			Configuration configuration,
			ResourceID resourceId,
			RpcService rpcService,
			HighAvailabilityServices highAvailabilityServices,
			HeartbeatServices heartbeatServices,
			MetricRegistry metricRegistry,
			FatalErrorHandler fatalErrorHandler,
			ClusterInformation clusterInformation,
			@Nullable String webInterfaceUrl,
			JobManagerMetricGroup jobManagerMetricGroup) throws Exception {

		final ResourceManagerRuntimeServicesConfiguration resourceManagerRuntimeServicesConfiguration = ResourceManagerRuntimeServicesConfiguration.fromConfiguration(configuration);

		// create the runtime services first, then return the new RM
		final ResourceManagerRuntimeServices resourceManagerRuntimeServices = ResourceManagerRuntimeServices.fromConfiguration(
			resourceManagerRuntimeServicesConfiguration,
			highAvailabilityServices,
			rpcService.getScheduledExecutor());

		return new StandaloneResourceManager(
			rpcService,
			getEndpointId(),
			resourceId,
			highAvailabilityServices,
			heartbeatServices,
			resourceManagerRuntimeServices.getSlotManager(),
			metricRegistry,
			resourceManagerRuntimeServices.getJobLeaderIdService(),
			clusterInformation,
			fatalErrorHandler,
			jobManagerMetricGroup);
	}

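As the code shows, a ResourceManagerRuntimeServices instance is created before the RM is returned. Its fromConfiguration method creates two internal services, SlotManager and JobLeaderIdService.

Creating the RM requires services such as RpcService, HeartbeatServices and HighAvailabilityServices. These were created earlier and are passed in as parameters.

1. SlotManager

After the SlotManager has been created, ResourceManager.start() is called to start the RM component. Because the RM extends RpcEndpoint, it is essentially an RPC service, so starting the RM really means starting the RpcServer that belongs to it.

Once the RPC service is up, RpcEndpoint invokes RM.onStart(), which calls startResourceManagerServices() to start the other core services inside the RM and thereby completes the RM startup. The method looks like this: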

	private void startResourceManagerServices() throws Exception {
		try {
			// obtain the leader election service from the high-availability services
			leaderElectionService = highAvailabilityServices.getResourceManagerLeaderElectionService();

			// initialize
			initialize();

			// start the leader election service with this RM as the contender; the RM becomes leader once leadership is granted
			leaderElectionService.start(this);
			// start the JobLeaderIdService
			jobLeaderIdService.start(new JobLeaderIdActionsImpl());

			// register the slot and TaskExecutor metrics
			registerSlotAndTaskExecutorMetrics();
		} catch (Exception e) {
			handleStartResourceManagerServicesException(e);
		}
	}

The call leaderElectionService.start(this) above leads to the RM's grantLeadership() method being invoked, which makes the current node the RM leader. The method looks like this:

	@Override
	public void grantLeadership(final UUID newLeaderSessionID) {
		// chain the asynchronous steps
		final CompletableFuture<Boolean> acceptLeadershipFuture = clearStateFuture
			.thenComposeAsync((ignored) -> tryAcceptLeadership(newLeaderSessionID), getUnfencedMainThreadExecutor());

		final CompletableFuture<Void> confirmationFuture = acceptLeadershipFuture.thenAcceptAsync(
			(acceptLeadership) -> {
				if (acceptLeadership) {
					// confirming the leader session ID might be blocking
					leaderElectionService.confirmLeaderSessionID(newLeaderSessionID);
				}
			},
			getRpcService().getExecutor());

		confirmationFuture.whenComplete(
			(Void ignored, Throwable throwable) -> {
				if (throwable != null) {
					onFatalError(ExceptionUtils.stripCompletionException(throwable));
				}
			});
	}

Inside it, tryAcceptLeadership(newLeaderSessionID) starts the heartbeat services and the SlotManager:

	private CompletableFuture<Boolean> tryAcceptLeadership(final UUID newLeaderSessionID) {
		if (leaderElectionService.hasLeadership(newLeaderSessionID)) {
			final ResourceManagerId newResourceManagerId = ResourceManagerId.fromUuid(newLeaderSessionID);

			log.info("ResourceManager {} was granted leadership with fencing token {}", getAddress(), newResourceManagerId);

			// clear the state if we've been the leader before
			if (getFencingToken() != null) {
				clearStateInternal();
			}

			setFencingToken(newResourceManagerId);

			// start the heartbeat services
			startHeartbeatServices();

			// start the SlotManager
			slotManager.start(getFencingToken(), getMainThreadExecutor(), new ResourceActionsImpl());

			return prepareLeadershipAsync().thenApply(ignored -> true);
		} else {
			return CompletableFuture.completedFuture(false);
		}
	}

startHeartbeatServices() sets up heartbeating for two kinds of components, JobManagers and TaskManagers.
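The asynchronous chain in grantLeadership() uses two executors: the accept step runs on the endpoint's main thread, and the possibly blocking confirmation runs on a separate executor. A reduced sketch of that chaining with plain CompletableFuture calls follows. The method names, the stillLeader flag and the returned strings are hypothetical.

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of the two-stage leadership chain. */
final class LeadershipChainSketch {

    static CompletableFuture<String> grantLeadership(
            UUID sessionId, boolean stillLeader, Executor mainThread, Executor ioExecutor) {
        // Stands in for clearStateFuture: any previous cleanup has already finished.
        CompletableFuture<Void> clearState = CompletableFuture.completedFuture(null);

        // Stage 1 (main thread): decide whether leadership can be accepted.
        CompletableFuture<Boolean> accepted = clearState.thenComposeAsync(
            ignored -> CompletableFuture.completedFuture(stillLeader), mainThread);

        // Stage 2 (separate executor): confirm, which might block in a real system.
        return accepted.thenApplyAsync(
            ok -> ok ? "confirmed " + sessionId : "skipped " + sessionId, ioExecutor);
    }

    /** Runs the chain with two single-threaded executors and waits for the result. */
    static String demo(UUID sessionId, boolean stillLeader) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            return grantLeadership(sessionId, stillLeader, mainThread, ioExecutor).join();
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(UUID.randomUUID(), true));
    }
}
```

Keeping the confirmation off the main thread means a slow confirmation does not stall the other RPC calls handled by the endpoint.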

The SlotManager is started by slotManager.start(), whose code is as follows:

	public void start(ResourceManagerId newResourceManagerId, Executor newMainThreadExecutor, ResourceActions newResourceActions) {
		LOG.info("Starting the SlotManager.");

		// validate the arguments
		this.resourceManagerId = Preconditions.checkNotNull(newResourceManagerId);
		mainThreadExecutor = Preconditions.checkNotNull(newMainThreadExecutor);
		resourceActions = Preconditions.checkNotNull(newResourceActions);

		// mark the SlotManager as started
		started = true;

		// start the periodic TaskManager timeout check
		taskManagerTimeoutCheck = scheduledExecutor.scheduleWithFixedDelay(
			() -> mainThreadExecutor.execute(
				() -> checkTaskManagerTimeouts()),
			0L,
			taskManagerTimeout.toMilliseconds(),
			TimeUnit.MILLISECONDS);

		// start the periodic slot request timeout check
		slotRequestTimeoutCheck = scheduledExecutor.scheduleWithFixedDelay(
			() -> mainThreadExecutor.execute(
				() -> checkSlotRequestTimeouts()),
			0L,
			slotRequestTimeout.toMilliseconds(),
			TimeUnit.MILLISECONDS);
	}

With that, the RM is up and running and can start working with the TMs and JMs.
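One detail in slotManager.start() is easy to miss: the scheduled executor does not run the checks itself. Each tick only hands the check over to the main thread executor, so the SlotManager state is touched by one thread only. The sketch below shows this hand-off with standard java.util.concurrent classes. The thread name, the 10 ms delay and the latch are choices made for the illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a timer thread triggers checks that run on a "main thread" executor. */
final class PeriodicCheckSketch {

    /** Runs a periodic check three times and returns the name of the thread that ran it. */
    static String runChecks() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        ExecutorService mainThread =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "main-thread-executor"));
        CountDownLatch ticks = new CountDownLatch(3);
        String[] checkThread = new String[1];
        try {
            ScheduledFuture<?> check = timer.scheduleWithFixedDelay(
                // The timer only forwards the work, the check itself runs on mainThread.
                () -> mainThread.execute(() -> {
                    checkThread[0] = Thread.currentThread().getName();
                    ticks.countDown();
                }),
                0L,
                10L,
                TimeUnit.MILLISECONDS);
            boolean done = ticks.await(5, TimeUnit.SECONDS);
            check.cancel(false);
            return done ? checkThread[0] : "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            timer.shutdownNow();
            mainThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("checks ran on: " + runChecks());
    }
}
```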