Dispatcher Summary
I. Overview
1. The Dispatcher receives the JobGraph objects submitted by clients.
dispatcherGateway.submitJob(jobGraph, rpcTimeout);
2. For each JobGraph it receives, the Dispatcher creates a JobManagerRunner service for that job.
JobManagerRunner jobManagerRunner = createJobManagerRunner(jobGraph, initializationTimestamp);
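The two steps above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: ToyJobGraph, ToyJobManagerRunner and ToyDispatcher are hypothetical stand-ins, not Flink classes, and the duplicate-job check is an illustrative choice rather than a copy of Flink's logic.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the submit path; none of these types are Flink classes.
class DispatcherSubmitSketch {

    // Hypothetical stand-in for JobGraph: only the job ID matters here.
    static final class ToyJobGraph {
        final String jobId;

        ToyJobGraph(String jobId) {
            this.jobId = jobId;
        }
    }

    // Hypothetical stand-in for JobManagerRunner.
    static final class ToyJobManagerRunner {
        final ToyJobGraph jobGraph;
        boolean started;

        ToyJobManagerRunner(ToyJobGraph jobGraph) {
            this.jobGraph = jobGraph;
        }

        void start() {
            started = true;
        }
    }

    static final class ToyDispatcher {
        private final Map<String, ToyJobManagerRunner> runners = new ConcurrentHashMap<>();

        // Accept a job, reject duplicates (an assumption of this sketch),
        // and create exactly one runner per accepted job.
        CompletableFuture<String> submitJob(ToyJobGraph jobGraph) {
            ToyJobManagerRunner runner = new ToyJobManagerRunner(jobGraph);
            if (runners.putIfAbsent(jobGraph.jobId, runner) != null) {
                CompletableFuture<String> failed = new CompletableFuture<>();
                failed.completeExceptionally(
                        new IllegalStateException("Duplicate job " + jobGraph.jobId));
                return failed;
            }
            runner.start();
            return CompletableFuture.completedFuture("ACK " + jobGraph.jobId);
        }

        int runningJobs() {
            return runners.size();
        }
    }

    public static void main(String[] args) {
        ToyDispatcher dispatcher = new ToyDispatcher();
        System.out.println(dispatcher.submitJob(new ToyJobGraph("job-1")).join());
        System.out.println(
                dispatcher.submitJob(new ToyJobGraph("job-1")).isCompletedExceptionally());
    }
}
```

The result is returned as a future because, in the notes above, submission goes through an RPC gateway call with a timeout, so the caller does not block on the Dispatcher.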
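II. Core components
1. Dispatcher
Receives and dispatches the jobs in the cluster. A client establishes an RPC connection to the Dispatcher and submits jobs to the cluster's Dispatcher service through ClusterClient.
The Dispatcher starts a JobManagerRunner service from the JobGraph object.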
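2. DispatcherRunner
(1) Overview
Starts and manages the Dispatcher component, and supports leader election for it.
When the cluster's Dispatcher component fails and stops, the DispatcherRunner elects and starts a new Dispatcher service, which keeps the Dispatcher highly available.
(2) Classification
DispatcherRunner has two implementations: DefaultDispatcherRunner and DispatcherRunnerLeaderElectionLifecycleManager.
The former is the main implementation of the DispatcherRunner interface.
The latter manages the leader-election lifecycle of a DispatcherRunner: it uses the LeaderElectionService to start the wrapped DispatcherRunner's participation in the election and to stop it again.
DefaultDispatcherRunnerFactory -> [SessionDispatcherLeaderProcessFactoryFactory / JobDispatcherLeaderProcessFactoryFactory]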
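The recovery behaviour described above (a new Dispatcher service is started whenever leadership is granted again) can be sketched as follows. This is a minimal model, assuming a contender-style callback pair named grantLeadership and revokeLeadership; ToyDispatcherRunner and ToyLeaderProcess are hypothetical names, not Flink classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Simplified model of a runner that replaces its leader process on every election.
class DispatcherRunnerSketch {

    // Hypothetical stand-in for a DispatcherLeaderProcess bound to one leader session.
    static final class ToyLeaderProcess {
        final UUID leaderSessionId;
        boolean running;

        ToyLeaderProcess(UUID leaderSessionId) {
            this.leaderSessionId = leaderSessionId;
        }

        void start() {
            running = true;
        }

        void close() {
            running = false;
        }
    }

    static final class ToyDispatcherRunner {
        private ToyLeaderProcess current; // null while this runner is not the leader
        final List<UUID> sessionHistory = new ArrayList<>();

        // Called when this runner wins the election: stop any old process, start a fresh one.
        void grantLeadership(UUID leaderSessionId) {
            stopCurrent();
            current = new ToyLeaderProcess(leaderSessionId);
            current.start();
            sessionHistory.add(leaderSessionId);
        }

        // Called when leadership is lost: the running process must stop.
        void revokeLeadership() {
            stopCurrent();
        }

        boolean isLeading() {
            return current != null && current.running;
        }

        private void stopCurrent() {
            if (current != null) {
                current.close();
                current = null;
            }
        }
    }

    public static void main(String[] args) {
        ToyDispatcherRunner runner = new ToyDispatcherRunner();
        runner.grantLeadership(UUID.randomUUID());
        runner.revokeLeadership();
        runner.grantLeadership(UUID.randomUUID());
        System.out.println("leading=" + runner.isLeading()
                + ", sessions=" + runner.sessionHistory.size());
    }
}
```

Creating a new process per leader session, instead of restarting the old one, keeps state from a previous leadership term from leaking into the next.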
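3. DispatcherLeaderProcess
(1) Overview
Manages the Dispatcher lifecycle and provides job recovery for JobGraphs.
If cluster high availability is implemented on ZooKeeper, the DispatcherLeaderProcess stores submitted JobGraphs in ZooKeeper. When the cluster stops or fails, the DispatcherLeaderProcess recovers the cluster's JobGraphs, all of which are kept in an implementation of JobGraphStore.
(2) Classification
(3) Class structure
(4) Responsibilities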
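DispatcherLeaderProcess
The DispatcherLeaderProcess interface defines the start() method, which starts the DispatcherLeaderProcess service, and also provides methods for obtaining the DispatcherGateway and the ShutDownFuture.
AbstractDispatcherLeaderProcess
The base class AbstractDispatcherLeaderProcess implements the interface methods of DispatcherLeaderProcess and exposes two hook methods, onStart() and onClose(), through which subclasses supply their own behavior.
AbstractDispatcherLeaderProcess also declares, as nested types, the DispatcherGatewayService interface and the factory interface used to obtain a DispatcherGatewayService.
SessionDispatcherLeaderProcess
SessionDispatcherLeaderProcess implements the Dispatcher logic for Session clusters. Its main task is to recover the JobGraphs stored in the JobGraphStore.
In a cluster without high availability, the JobGraphStore implementation is StandaloneJobGraphStore, which does not store or manage JobGraphs at all.
In a high-availability cluster, the JobGraphStore stores the cluster's JobGraphs in ZooKeeper.
JobDispatcherLeaderProcess
JobDispatcherLeaderProcess contains the methods for creating and submitting a single JobGraph. It therefore covers only the submission logic for that one JobGraph and has no notion of a JobGraphStore.
The lifecycle of a JobDispatcherLeaderProcess ends together with its job.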
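The division of labour above follows the template-method pattern: the base class owns start and close, and subclasses fill in the hooks. The sketch below illustrates that pattern only. All class names are hypothetical, and JobGraphs are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Template-method sketch of the leader-process hierarchy; not Flink code.
class LeaderProcessSketch {

    abstract static class AbstractToyLeaderProcess {
        private boolean started;
        private boolean closed;
        final List<String> dispatchedJobs = new ArrayList<>();

        // The base class guards the lifecycle and delegates the real work to the hook.
        final void start() {
            if (started || closed) {
                return;
            }
            started = true;
            onStart();
        }

        final void close() {
            if (closed) {
                return;
            }
            closed = true;
            onClose();
        }

        boolean isClosed() {
            return closed;
        }

        protected abstract void onStart();

        // Default hook: nothing to clean up.
        protected void onClose() {}
    }

    // Session flavour: recovers every JobGraph found in a (toy) store.
    static final class SessionToyLeaderProcess extends AbstractToyLeaderProcess {
        private final List<String> storedJobGraphs;

        SessionToyLeaderProcess(List<String> storedJobGraphs) {
            this.storedJobGraphs = storedJobGraphs;
        }

        @Override
        protected void onStart() {
            dispatchedJobs.addAll(storedJobGraphs);
        }
    }

    // Per-job flavour: holds exactly one JobGraph and ends with it.
    static final class JobToyLeaderProcess extends AbstractToyLeaderProcess {
        private final String jobGraph;

        JobToyLeaderProcess(String jobGraph) {
            this.jobGraph = jobGraph;
        }

        @Override
        protected void onStart() {
            dispatchedJobs.add(jobGraph);
        }

        void jobFinished() {
            close();
        }
    }

    public static void main(String[] args) {
        SessionToyLeaderProcess session =
                new SessionToyLeaderProcess(Arrays.asList("job-a", "job-b"));
        session.start();
        JobToyLeaderProcess perJob = new JobToyLeaderProcess("job-c");
        perJob.start();
        perJob.jobFinished();
        System.out.println(session.dispatchedJobs + " " + perJob.isClosed());
    }
}
```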
4. DispatcherGatewayService
A GatewayService implemented on top of the Dispatcher; it is used to obtain the DispatcherGateway.
III. Startup flow (StandaloneSession cluster, DefaultDispatcherResourceManagerComponentFactory#create method)
1. Create PartialDispatcherServices
final PartialDispatcherServices partialDispatcherServices =
        new PartialDispatcherServices(
                configuration,
                highAvailabilityServices,
                resourceManagerGatewayRetriever,
                blobServer,
                heartbeatServices,
                () ->
                        MetricUtils.instantiateJobManagerMetricGroup(
                                metricRegistry, hostname),
                executionGraphInfoStore,
                fatalErrorHandler,
                historyServerArchivist,
                metricRegistry.getMetricQueryServiceGatewayRpcAddress(),
                ioExecutor);
2. Create the DispatcherRunner
dispatcherRunner =
        dispatcherRunnerFactory.createDispatcherRunner(
                highAvailabilityServices.getDispatcherLeaderElectionService(),
                fatalErrorHandler,
                new HaServicesJobGraphStoreFactory(highAvailabilityServices),
                ioExecutor,
                rpcService,
                partialDispatcherServices);
3. The dispatcherRunner is started in the constructor of DispatcherRunnerLeaderElectionLifecycleManager
private DispatcherRunnerLeaderElectionLifecycleManager(
        T dispatcherRunner, LeaderElectionService leaderElectionService) throws Exception {
    this.dispatcherRunner = dispatcherRunner;
    this.leaderElectionService = leaderElectionService;
    leaderElectionService.start(dispatcherRunner);
}
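The constructor above ties the runner to the election service at construction time. The sketch below shows that wrapper idea in isolation. Every type here is a hypothetical stand-in, the toy election service grants leadership immediately (as a single-node setup might), and the shutdown order in close() is an assumption of this sketch.

```java
// Wrapper that couples a runner's lifetime to a leader election service; not Flink code.
class LifecycleManagerSketch {

    interface ToyContender {
        void grantLeadership();

        void revokeLeadership();
    }

    // Toy election service: with a single contender, leadership is granted at once.
    static final class ToyElectionService {
        private ToyContender contender;

        void start(ToyContender contender) {
            this.contender = contender;
            contender.grantLeadership();
        }

        void stop() {
            if (contender != null) {
                contender.revokeLeadership();
                contender = null;
            }
        }
    }

    static final class ToyRunner implements ToyContender, AutoCloseable {
        boolean leading;
        boolean closed;

        @Override
        public void grantLeadership() {
            leading = true;
        }

        @Override
        public void revokeLeadership() {
            leading = false;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    static final class ToyLifecycleManager implements AutoCloseable {
        private final ToyRunner runner;
        private final ToyElectionService electionService;

        // Private constructor: registering the runner is part of construction.
        private ToyLifecycleManager(ToyRunner runner, ToyElectionService electionService) {
            this.runner = runner;
            this.electionService = electionService;
            electionService.start(runner);
        }

        static ToyLifecycleManager createFor(ToyRunner runner, ToyElectionService service) {
            return new ToyLifecycleManager(runner, service);
        }

        // Assumed order for this sketch: leave the election first, then close the runner.
        @Override
        public void close() {
            electionService.stop();
            runner.close();
        }
    }

    public static void main(String[] args) {
        ToyRunner runner = new ToyRunner();
        ToyLifecycleManager manager =
                ToyLifecycleManager.createFor(runner, new ToyElectionService());
        System.out.println("leading after create: " + runner.leading);
        manager.close();
        System.out.println("leading after close: " + runner.leading);
    }
}
```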
4. Create the JobGraphStore
@Override
public JobGraphStore create() {
    try {
        return highAvailabilityServices.getJobGraphStore();
    } catch (Exception e) {
        throw new FlinkRuntimeException(
                String.format(
                        "Could not create %s from %s.",
                        JobGraphStore.class.getSimpleName(),
                        highAvailabilityServices.getClass().getSimpleName()),
                e);
    }
}
5. The SessionDispatcherLeaderProcess#onStart method is called back
1) Start the jobGraphStore
private void startServices() {
    try {
        jobGraphStore.start(this);
    } catch (Exception e) {
        throw new FlinkRuntimeException(
                String.format(
                        "Could not start %s when trying to start the %s.",
                        jobGraphStore.getClass().getSimpleName(), getClass().getSimpleName()),
                e);
    }
}
2) Recover the JobGraphs from the JobGraphStore
return jobGraphStore.recoverJobGraph(jobId);
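The single recoverJobGraph(jobId) call above is applied once per stored job. A stand-alone sketch of that loop is shown below, assuming a toy store interface with getJobIds() and recoverJobGraph(). The two store classes are hypothetical and only mirror the standalone versus persistent distinction made earlier in these notes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Recovery loop over a toy job graph store; JobGraphs are reduced to strings.
class JobRecoverySketch {

    interface ToyJobGraphStore {
        Collection<String> getJobIds();

        // Returns null when the job is unknown to the store.
        String recoverJobGraph(String jobId);
    }

    // Non-HA flavour: stores nothing, so there is never anything to recover.
    static final class StandaloneToyStore implements ToyJobGraphStore {
        @Override
        public Collection<String> getJobIds() {
            return Collections.emptyList();
        }

        @Override
        public String recoverJobGraph(String jobId) {
            return null;
        }
    }

    // HA flavour: an in-memory map stands in for external storage such as ZooKeeper.
    static final class PersistentToyStore implements ToyJobGraphStore {
        private final Map<String, String> graphs = new LinkedHashMap<>();

        void put(String jobId, String jobGraph) {
            graphs.put(jobId, jobGraph);
        }

        @Override
        public Collection<String> getJobIds() {
            return new ArrayList<>(graphs.keySet());
        }

        @Override
        public String recoverJobGraph(String jobId) {
            return graphs.get(jobId);
        }
    }

    // Recover every job the store knows about, skipping IDs that no longer resolve.
    static List<String> recoverJobs(ToyJobGraphStore store) {
        List<String> recovered = new ArrayList<>();
        for (String jobId : store.getJobIds()) {
            String jobGraph = store.recoverJobGraph(jobId);
            if (jobGraph != null) {
                recovered.add(jobGraph);
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        PersistentToyStore store = new PersistentToyStore();
        store.put("id-1", "graph-1");
        store.put("id-2", "graph-2");
        System.out.println(recoverJobs(store));
        System.out.println(recoverJobs(new StandaloneToyStore()));
    }
}
```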
3) Create the Dispatcher
return new StandaloneDispatcher(
        rpcService,
        fencingToken,
        recoveredJobs,
        dispatcherBootstrapFactory,
        DispatcherServices.from(
                partialDispatcherServicesWithJobGraphStore,
                JobMasterServiceLeadershipRunnerFactory.INSTANCE));
(1) Create DispatcherServices
return new DispatcherServices(
        partialDispatcherServicesWithJobGraphStore.getConfiguration(),
        partialDispatcherServicesWithJobGraphStore.getHighAvailabilityServices(),
        partialDispatcherServicesWithJobGraphStore.getResourceManagerGatewayRetriever(),
        partialDispatcherServicesWithJobGraphStore.getBlobServer(),
        partialDispatcherServicesWithJobGraphStore.getHeartbeatServices(),
        partialDispatcherServicesWithJobGraphStore.getArchivedExecutionGraphStore(),
        partialDispatcherServicesWithJobGraphStore.getFatalErrorHandler(),
        partialDispatcherServicesWithJobGraphStore.getHistoryServerArchivist(),
        partialDispatcherServicesWithJobGraphStore.getMetricQueryServiceAddress(),
        partialDispatcherServicesWithJobGraphStore
                .getJobManagerMetricGroupFactory()
                .create(),
        partialDispatcherServicesWithJobGraphStore.getJobGraphWriter(),
        jobManagerRunnerFactory,
        partialDispatcherServicesWithJobGraphStore.getIoExecutor());
(2) Start the RpcServer
protected RpcEndpoint(final RpcService rpcService, final String endpointId) {
    this.rpcService = checkNotNull(rpcService, "rpcService");
    this.endpointId = checkNotNull(endpointId, "endpointId");
    this.rpcServer = rpcService.startServer(this);
    this.mainThreadExecutor = new MainThreadExecutor(rpcServer, this::validateRunsInMainThread);
}
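The constructor above pairs the endpoint with a main-thread executor and a validateRunsInMainThread check. The idea behind that pairing, that all state changes run on one dedicated thread and can assert it, can be sketched with plain java.util.concurrent. ToyEndpoint is a hypothetical class and does not reproduce how Flink's RPC layer schedules calls.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Single "main thread" confinement sketch; not Flink's RpcEndpoint.
class MainThreadSketch {

    static final class ToyEndpoint implements AutoCloseable {
        private final ExecutorService mainThreadExecutor;
        private volatile Thread mainThread; // set when the executor creates its worker
        private int counter; // only ever touched from the main thread

        ToyEndpoint() {
            this.mainThreadExecutor =
                    Executors.newSingleThreadExecutor(
                            runnable -> {
                                Thread thread = new Thread(runnable, "toy-endpoint-main");
                                thread.setDaemon(true);
                                mainThread = thread;
                                return thread;
                            });
        }

        // Fails fast when endpoint state is touched from any other thread.
        void validateRunsInMainThread() {
            if (Thread.currentThread() != mainThread) {
                throw new IllegalStateException("Not running in the endpoint's main thread");
            }
        }

        // Schedules the state change on the main thread and waits for the result.
        int incrementBlocking() {
            try {
                return mainThreadExecutor
                        .submit(
                                () -> {
                                    validateRunsInMainThread();
                                    return ++counter;
                                })
                        .get();
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void close() {
            mainThreadExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        ToyEndpoint endpoint = new ToyEndpoint();
        System.out.println(endpoint.incrementBlocking());
        try {
            endpoint.validateRunsInMainThread();
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
        endpoint.close();
    }
}
```

Confining state to one thread removes the need for locks inside the endpoint, at the cost of requiring every caller to go through the executor.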
6. Start the Dispatcher
dispatcher.start() -> Dispatcher#onStart
The Dispatcher#onStart method
@Override
public void onStart() throws Exception {
    try {
        // Register the Dispatcher metrics
        startDispatcherServices();
    } catch (Throwable t) {
        ... ...
    }
    // Run the jobs for the recovered JobGraphs
    startRecoveredJobs();
    // Create the DispatcherBootstrap, passing in this Dispatcher's own DispatcherGateway
    this.dispatcherBootstrap =
            this.dispatcherBootstrapFactory.create(
                    getSelfGateway(DispatcherGateway.class),
                    this.getRpcService().getScheduledExecutor(),
                    this::onFatalError);
}
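The ordering inside onStart (services first, then recovered jobs, then the bootstrap) can be made visible with a toy model that records each step. ToyDispatcher and its event names are hypothetical. Clearing the recovered-job list after starting the jobs is an assumption of this sketch, and the error handling elided in the snippet above is deliberately not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Records the order of the three startup steps; not Flink code.
class DispatcherStartSketch {

    static final class ToyDispatcher {
        private final List<String> recoveredJobs;
        final List<String> events = new ArrayList<>();

        ToyDispatcher(List<String> recoveredJobs) {
            this.recoveredJobs = new ArrayList<>(recoveredJobs);
        }

        void onStart() {
            startDispatcherServices();
            startRecoveredJobs();
            createBootstrap();
        }

        private void startDispatcherServices() {
            events.add("metrics-registered");
        }

        // Each recovered job is run once; the list is then cleared (sketch assumption).
        private void startRecoveredJobs() {
            for (String job : recoveredJobs) {
                events.add("run:" + job);
            }
            recoveredJobs.clear();
        }

        private void createBootstrap() {
            events.add("bootstrap-created");
        }
    }

    public static void main(String[] args) {
        ToyDispatcher dispatcher = new ToyDispatcher(Arrays.asList("job-a", "job-b"));
        dispatcher.onStart();
        System.out.println(dispatcher.events);
    }
}
```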