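1. Overview
The TaskManager is the worker process of a Flink cluster. It performs the actual computation of the dataflow and is therefore also called a "Worker". A Flink cluster must have at least one TaskManager, and each TaskManager contains a certain number of task slots. A slot is the smallest unit of resource scheduling, and the number of slots limits how many tasks a TaskManager can process in parallel.
After startup, the TaskManager registers its slots with the ResourceManager. On instruction from the ResourceManager, the TaskManager offers one or more slots to a JobMaster, which can then assign tasks to them for execution.
While a job is running, a TaskManager can buffer data and exchange data with other TaskManagers that run the same application.
TaskManager is a logical abstraction that represents one server. Starting that server necessarily starts a number of services. It also contains a TaskExecutor, which lives inside the TaskManager and does the real work for its core operations: submitting Tasks for execution, and requesting and releasing slots.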
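To make the statement "the number of slots limits parallelism" concrete, here is a minimal sketch of a worker that owns a fixed number of slots. `SlotPoolSketch` and its methods are hypothetical names invented for this illustration; they are not Flink classes and leave out everything Flink's real slot table does (resource profiles, timeouts, job bookkeeping).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not Flink code: a worker that owns a fixed number of slots.
class SlotPoolSketch {
    private final int numberOfSlots;
    // slot index -> id of the job that currently holds the slot
    private final Map<Integer, String> allocated = new HashMap<>();

    SlotPoolSketch(int numberOfSlots) {
        if (numberOfSlots <= 0) {
            throw new IllegalArgumentException("The number of slots has to be larger than 0.");
        }
        this.numberOfSlots = numberOfSlots;
    }

    // Hands the first free slot to the given job, or returns empty when all slots are taken.
    Optional<Integer> allocate(String jobId) {
        for (int i = 0; i < numberOfSlots; i++) {
            if (!allocated.containsKey(i)) {
                allocated.put(i, jobId);
                return Optional.of(i);
            }
        }
        return Optional.empty();
    }

    // Releases a slot; returns true if the slot was allocated before the call.
    boolean free(int slotIndex) {
        return allocated.remove(slotIndex) != null;
    }

    int freeSlots() {
        return numberOfSlots - allocated.size();
    }

    public static void main(String[] args) {
        SlotPoolSketch pool = new SlotPoolSketch(2);
        System.out.println(pool.allocate("job-a")); // Optional[0]
        System.out.println(pool.allocate("job-a")); // Optional[1]
        System.out.println(pool.allocate("job-b")); // Optional.empty
        pool.free(0);
        System.out.println(pool.allocate("job-b")); // Optional[0]
    }
}
```

With two slots, the third request is rejected until a slot is freed, which is the sense in which the slot count caps the number of tasks that run in parallel.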
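2. TaskManager startup
The TaskManager is mainly responsible for managing the slot resources of its own machine and for executing the actual tasks. According to the cluster startup scripts, the main class that starts the TaskManager is TaskManagerRunner (org.apache.flink.runtime.taskexecutor.TaskManagerRunner).
2.1 Entry point (main)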
public static void main(String[] args) throws Exception {
// startup checks and logging: information printed when the worker node starts
EnvironmentInformation.logEnvironmentInfo(LOG, "TaskManager", args);
SignalHandler.register(LOG);
JvmShutdownSafeguard.installAsShutdownHook(LOG);
long maxOpenFileHandles = EnvironmentInformation.getOpenFileHandlesLimit();
if(maxOpenFileHandles != -1L) {
LOG.info("Maximum number of open file descriptors is {}.", maxOpenFileHandles);
} else {
LOG.info("Cannot determine the maximum number of open file descriptors");
}
// Startup entry point
runTaskManagerSecurely(args, ResourceID.generate());
}
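Note on ResourceID: when a Flink cluster starts, the master node and every worker node each generate a globally unique ID.
2.2 runTaskManagerSecurely (entry point)
- 1. Load the parameters (arguments of the main method plus the configuration file)
- 2. Start the TaskManager
- 2.1 Initialize the plugin service and the file system service (base services)
- 2.2 Start the TaskManager inside the installed security context (runSecured)
- 2.2.1 Build the TaskManagerRunner instance
- ① Initialize a number of services
- ② Initialize the TaskExecutor
- 2.2.2 Send the start message to confirm that startup succeeded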
public static void runTaskManagerSecurely(String[] args, ResourceID resourceID) {
try {
// Load the configuration (arguments passed by the shell script plus flink-conf.yaml)
Configuration configuration = loadConfiguration(args);
// Start the TaskManager
runTaskManagerSecurely(configuration, resourceID);
} catch(Throwable t) {
final Throwable strippedThrowable = ExceptionUtils.stripException(t, UndeclaredThrowableException.class);
LOG.error("TaskManager initialization failed.", strippedThrowable);
System.exit(STARTUP_FAILURE_RETURN_CODE);
}
}
// --> runTaskManagerSecurely(configuration, resourceID);
// 1. Initialize the plugin manager and the file system
// 2. Start the TaskManager from a Callable that is executed inside the installed security context
public static void runTaskManagerSecurely(Configuration configuration, ResourceID resourceID) throws Exception {
replaceGracefulExitWithHaltIfConfigured(configuration);
/*************************************************
* Initialize the plugin manager
*/
final PluginManager pluginManager = PluginUtils.createPluginManagerFromRootFolder(configuration);
// Initialize the file system
FileSystem.initialize(configuration, pluginManager);
SecurityUtils.install(new SecurityConfiguration(configuration));
/*************************************************
* Wrap the startup in the security context
*/
SecurityUtils.getInstalledContext().runSecured(
// The TaskManager is started from this Callable, which the security context executes
() -> {
runTaskManager(configuration, resourceID, pluginManager);
return null;
});
}
// -->runTaskManager(configuration, resourceID, pluginManager);
public static void runTaskManager(Configuration configuration, ResourceID resourceId, PluginManager pluginManager) throws Exception {
/*************************************************
* Build the TaskManagerRunner instance.
* TaskManagerRunner is the executable entry point of the TaskManager in standalone mode.
* It constructs the related components (network, I/O manager, memory manager, RPC service, HA service) and starts them.
*/
final TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, resourceId, pluginManager);
/*************************************************
* Send the START message to confirm that startup succeeded
*/
taskManagerRunner.start();
}
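The pattern used in runTaskManagerSecurely, handing the startup logic to a security context as a Callable and turning any failure into an exit code, can be sketched in isolation. The interface and class names below (`SecurityContextSketch`, `NoOpContext`, `start`) are hypothetical and only mirror the shape of the calls shown above; the sketch returns an exit code instead of calling System.exit so that it stays easy to run.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch, not Flink code: run startup logic inside a "security context".
class SecuredStartupSketch {
    // Stand-in for a security context: it decides how the callable is executed.
    interface SecurityContextSketch {
        <T> T runSecured(Callable<T> callable) throws Exception;
    }

    // Simplest possible context: run the callable directly on the calling thread.
    static final class NoOpContext implements SecurityContextSketch {
        @Override
        public <T> T runSecured(Callable<T> callable) throws Exception {
            return callable.call();
        }
    }

    // Returns 0 on success and 1 on failure (the real code calls System.exit on failure).
    static int start(SecurityContextSketch context, Callable<Void> startup) {
        try {
            context.runSecured(startup);
            return 0;
        } catch (Throwable t) {
            System.err.println("Initialization failed: " + t.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        int ok = start(new NoOpContext(), () -> {
            System.out.println("started");
            return null;
        });
        int failed = start(new NoOpContext(), () -> {
            throw new IllegalStateException("bad config");
        });
        System.out.println(ok + " " + failed); // 0 1
    }
}
```

A context that needs credentials would replace `NoOpContext` and execute the same Callable under those credentials; the startup code itself does not change.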
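2.3 Instantiating the TaskManagerRunner object
1. Initialize the services:
- Thread pool: runs asynchronous callbacks (asynchronous programming: future.xxx(() -> xxxxx(), executor))
- HA service: ZooKeeperHaServices (when the HA mode in flink-conf.yaml is set to ZooKeeper)
- RPC service: the RpcServer is created as a proxy object
- Heartbeat service: two key parameters govern the heartbeat between ResourceManager and TaskManager, 10s (interval) and 50s (timeout)
- Blob service: internally two scheduled tasks that periodically check for and delete the resource files of expired jobs. Reference counting decides whether a file has expired. The implementations are PermanentBlobCache and TransientBlobCache
2. Start the TaskManager
- Creates the TaskExecutor, which is responsible for running multiple Tasks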
public TaskManagerRunner(Configuration configuration, ResourceID resourceId, PluginManager pluginManager) throws Exception {
this.configuration = checkNotNull(configuration);
this.resourceId = checkNotNull(resourceId);
timeout = AkkaUtils.getTimeoutAsTime(configuration);
// Initialize the thread pool that runs future callbacks
this.executor = java.util.concurrent.Executors
.newScheduledThreadPool(Hardware.getNumberCPUCores(), new ExecutorThreadFactory("taskmanager-future"));
/*************************************************
* HA service: ZooKeeperHaServices
* Gives access to all services needed for high availability: registration, distributed counters and leader election
*/
highAvailabilityServices = HighAvailabilityServicesUtils
.createHighAvailabilityServices(configuration, executor, HighAvailabilityServicesUtils.AddressResolution.NO_ADDRESS_RESOLUTION);
// Initialize the RpcService
rpcService = createRpcService(configuration, highAvailabilityServices);
// Initialize the HeartbeatServices
HeartbeatServices heartbeatServices = HeartbeatServices.fromConfiguration(configuration);
metricRegistry = new MetricRegistryImpl(MetricRegistryConfiguration.fromConfiguration(configuration),
ReporterSetup.fromConfiguration(configuration, pluginManager));
final RpcService metricQueryServiceRpcService = MetricUtils.startRemoteMetricsRpcService(configuration, rpcService.getAddress());
metricRegistry.startQueryService(metricQueryServiceRpcService, resourceId);
// Initialize the BlobCacheService
blobCacheService = new BlobCacheService(configuration, highAvailabilityServices.createBlobStore(), null);
// Provides information about external resources
final ExternalResourceInfoProvider externalResourceInfoProvider = ExternalResourceUtils
.createStaticExternalResourceInfoProvider(ExternalResourceUtils.getExternalResourceAmountMap(configuration),
ExternalResourceUtils.externalResourceDriversFromConfig(configuration, pluginManager));
/*************************************************
* Start the TaskManager
* Creates the TaskExecutor, which is responsible for running multiple Tasks
*/
taskManager = startTaskManager(this.configuration, this.resourceId, rpcService, highAvailabilityServices, heartbeatServices, metricRegistry,
blobCacheService, false, externalResourceInfoProvider, this);
this.terminationFuture = new CompletableFuture<>();
this.shutdown = false;
MemoryLogger.startIfConfigured(LOG, configuration, terminationFuture);
}
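The "taskmanager-future" thread pool exists so that future callbacks run on a dedicated executor rather than on whichever thread completes the future. The following sketch shows that idea with plain JDK classes; `FutureCallbackSketch`, `describe` and the thread name are illustrative assumptions, not code from Flink.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: run a future callback on a dedicated executor.
class FutureCallbackSketch {
    // The callback is executed on the supplied executor, not on the completing thread.
    static CompletableFuture<String> describe(CompletableFuture<Integer> slots,
                                              ScheduledExecutorService executor) {
        return slots.thenApplyAsync(
                n -> n + " slots handled on " + Thread.currentThread().getName(), executor);
    }

    public static void main(String[] args) throws Exception {
        // One thread per CPU core, similar in spirit to the pool built in the constructor above.
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(
                Runtime.getRuntime().availableProcessors(),
                r -> new Thread(r, "taskmanager-future-sketch"));
        try {
            CompletableFuture<Integer> slots = new CompletableFuture<>();
            CompletableFuture<String> result = describe(slots, executor);
            slots.complete(4); // completed by the main thread
            System.out.println(result.get(5, TimeUnit.SECONDS));
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Although the main thread completes the future, the printed thread name is the one from the pool, which is the `future.xxx(() -> ..., executor)` pattern mentioned in the list above.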
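2.4 startTaskManager (starting the TaskManager)
- 1. Obtain the resource specification: the resources of one physical node (CPU, memory, network)
- 2. taskExecutorResourceSpec -> TaskManagerServicesConfiguration (the configuration is wrapped in a TaskManagerServicesConfiguration object)
- 3. Initialize the ioExecutor (I/O thread pool)
- 4. Build the TaskManagerServices object, which bundles the service components the TaskManager needs to provide at runtime
- 4.1 Initialize TaskEventDispatcher (dispatches task events)
- 4.2 Initialize IOManagerAsync (moves data asynchronously)
- 4.3 shuffleEnvironment = NettyShuffleEnvironment (for shuffles between upstream and downstream Tasks)
- 4.4 Initialize KvStateService (state service)
- 4.5 Initialize BroadcastVariableManager (broadcast service)
- 4.6 Initialize TaskSlotTable [interface -> TaskSlotTableImpl] (maintains the relationship between all TaskSlots on the TaskManager and their Tasks and Jobs)
- 4.7 Initialize DefaultJobTable (job information)
- 4.8 Initialize JobLeaderService (tracks the JobMaster leader of each job)
- 5. Return the TaskExecutor object (startTaskManager -> TaskExecutor), which internally builds two important heartbeat managers
- JobManagerHeartbeatManager
- ResourceManagerHeartbeatManager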
public static TaskExecutor startTaskManager(Configuration configuration, ResourceID resourceID, RpcService rpcService,
HighAvailabilityServices highAvailabilityServices, HeartbeatServices heartbeatServices, MetricRegistry metricRegistry,
BlobCacheService blobCacheService, boolean localCommunicationOnly, ExternalResourceInfoProvider externalResourceInfoProvider,
FatalErrorHandler fatalErrorHandler) throws Exception {
checkNotNull(configuration);
checkNotNull(resourceID);
checkNotNull(rpcService);
checkNotNull(highAvailabilityServices);
LOG.info("Starting TaskManager with ResourceID: {}", resourceID);
String externalAddress = rpcService.getAddress();
final TaskExecutorResourceSpec taskExecutorResourceSpec = TaskExecutorResourceUtils.resourceSpecFromConfig(configuration);
// Build the TaskManagerServicesConfiguration
TaskManagerServicesConfiguration taskManagerServicesConfiguration = TaskManagerServicesConfiguration.fromConfiguration(configuration, resourceID, externalAddress, localCommunicationOnly, taskExecutorResourceSpec);
Tuple2<TaskManagerMetricGroup, MetricGroup> taskManagerMetricGroup = MetricUtils.instantiateTaskManagerMetricGroup(metricRegistry, externalAddress, resourceID, taskManagerServicesConfiguration.getSystemResourceMetricsProbingInterval());
// Initialize the ioExecutor
final ExecutorService ioExecutor =
newFixedThreadPool(taskManagerServicesConfiguration.getNumIoThreads(), new ExecutorThreadFactory("flink-taskexecutor-io"));
// Build the TaskManagerServices
TaskManagerServices taskManagerServices = TaskManagerServices
.fromConfiguration(taskManagerServicesConfiguration, blobCacheService.getPermanentBlobService(), taskManagerMetricGroup.f1, ioExecutor,
fatalErrorHandler);
// Build the TaskManagerConfiguration
TaskManagerConfiguration taskManagerConfiguration = TaskManagerConfiguration
.fromConfiguration(configuration, taskExecutorResourceSpec, externalAddress);
String metricQueryServiceAddress = metricRegistry.getMetricQueryServiceGatewayRpcAddress();
/*************************************************
* Create the TaskExecutor instance.
* Internally it creates two important heartbeat managers:
* 1. JobManagerHeartbeatManager
* 2. ResourceManagerHeartbeatManager
*/
return new TaskExecutor(rpcService, taskManagerConfiguration, highAvailabilityServices, taskManagerServices, externalResourceInfoProvider,
heartbeatServices, taskManagerMetricGroup.f0, metricQueryServiceAddress, blobCacheService, fatalErrorHandler,
new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));
}
//--> TaskManagerServices fromConfiguration
public static TaskManagerServices fromConfiguration(TaskManagerServicesConfiguration taskManagerServicesConfiguration,
PermanentBlobService permanentBlobService, MetricGroup taskManagerMetricGroup, ExecutorService ioExecutor,
FatalErrorHandler fatalErrorHandler) throws Exception {
// pre-start checks: verify the temporary directories
checkTempDirs(taskManagerServicesConfiguration.getTmpDirPaths());
// Initialize the TaskEventDispatcher
final TaskEventDispatcher taskEventDispatcher = new TaskEventDispatcher();
// Initialize the IOManagerAsync
// start the I/O manager, it will create some temp directories.
final IOManager ioManager = new IOManagerAsync(taskManagerServicesConfiguration.getTmpDirPaths());
// shuffleEnvironment = NettyShuffleEnvironment
final ShuffleEnvironment<?, ?> shuffleEnvironment = createShuffleEnvironment(taskManagerServicesConfiguration, taskEventDispatcher,
taskManagerMetricGroup, ioExecutor);
final int listeningDataPort = shuffleEnvironment.start();
// Initialize the KvStateService
final KvStateService kvStateService = KvStateService.fromConfiguration(taskManagerServicesConfiguration);
kvStateService.start();
final UnresolvedTaskManagerLocation unresolvedTaskManagerLocation = new UnresolvedTaskManagerLocation(
taskManagerServicesConfiguration.getResourceID(), taskManagerServicesConfiguration.getExternalAddress(),
// we expose the task manager location with the listening port
// iff the external data port is not explicitly defined
taskManagerServicesConfiguration.getExternalDataPort() > 0 ? taskManagerServicesConfiguration.getExternalDataPort() : listeningDataPort);
// Initialize the BroadcastVariableManager
final BroadcastVariableManager broadcastVariableManager = new BroadcastVariableManager();
// Initialize the TaskSlotTable
final TaskSlotTable<Task> taskSlotTable = createTaskSlotTable(taskManagerServicesConfiguration.getNumberOfSlots(),
taskManagerServicesConfiguration.getTaskExecutorResourceSpec(), taskManagerServicesConfiguration.getTimerServiceShutdownTimeout(),
taskManagerServicesConfiguration.getPageSize(), ioExecutor);
// Initialize the DefaultJobTable
final JobTable jobTable = DefaultJobTable.create();
// Initialize the JobLeaderService
final JobLeaderService jobLeaderService = new DefaultJobLeaderService(unresolvedTaskManagerLocation,
taskManagerServicesConfiguration.getRetryingRegistrationConfiguration());
final String[] stateRootDirectoryStrings = taskManagerServicesConfiguration.getLocalRecoveryStateRootDirectories();
final File[] stateRootDirectoryFiles = new File[stateRootDirectoryStrings.length];
for(int i = 0; i < stateRootDirectoryStrings.length; ++i) {
stateRootDirectoryFiles[i] = new File(stateRootDirectoryStrings[i], LOCAL_STATE_SUB_DIRECTORY_ROOT);
}
// Initialize the TaskExecutorLocalStateStoresManager
final TaskExecutorLocalStateStoresManager taskStateManager = new TaskExecutorLocalStateStoresManager(
taskManagerServicesConfiguration.isLocalRecoveryEnabled(), stateRootDirectoryFiles, ioExecutor);
final boolean failOnJvmMetaspaceOomError = taskManagerServicesConfiguration.getConfiguration()
.getBoolean(CoreOptions.FAIL_ON_USER_CLASS_LOADING_METASPACE_OOM);
// Initialize the LibraryCacheManager
final LibraryCacheManager libraryCacheManager = new BlobLibraryCacheManager(permanentBlobService, BlobLibraryCacheManager .defaultClassLoaderFactory(taskManagerServicesConfiguration.getClassLoaderResolveOrder(), taskManagerServicesConfiguration.getAlwaysParentFirstLoaderPatterns(), failOnJvmMetaspaceOomError ? fatalErrorHandler : null));
// Return the TaskManagerServices
return new TaskManagerServices(unresolvedTaskManagerLocation, taskManagerServicesConfiguration.getManagedMemorySize().getBytes(), ioManager,
shuffleEnvironment, kvStateService, broadcastVariableManager, taskSlotTable, jobTable, jobLeaderService, taskStateManager,
taskEventDispatcher, ioExecutor, libraryCacheManager);
}
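Main flow diagram of TaskManager instantiation:
BlobService initialization flow:
Internally the service simply starts two scheduled tasks that periodically check for and delete the resource files of expired jobs.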
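The combination of reference counting and periodic cleanup described for the Blob service can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the class `RefCountedCacheSketch`, its methods and the explicit time parameter are invented, and no real files are touched. In a real service, `cleanup` would be the body of a scheduled task.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch, not Flink code: reference-counted entries removed by a periodic cleanup.
class RefCountedCacheSketch {
    private static final class Entry {
        int references;
        long keepUntilMillis = Long.MAX_VALUE; // only meaningful once references == 0
    }

    private final Map<String, Entry> jobs = new HashMap<>();
    private final long ttlMillis;

    RefCountedCacheSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // A task of the job starts using the job's files.
    synchronized void register(String jobId) {
        Entry e = jobs.computeIfAbsent(jobId, k -> new Entry());
        e.references++;
        e.keepUntilMillis = Long.MAX_VALUE;
    }

    // A task stops using the files; when the last one is gone the expiry clock starts.
    synchronized void release(String jobId, long nowMillis) {
        Entry e = jobs.get(jobId);
        if (e != null && e.references > 0 && --e.references == 0) {
            e.keepUntilMillis = nowMillis + ttlMillis;
        }
    }

    // Body of the periodic cleanup: drop unreferenced entries whose time-to-live has passed.
    synchronized int cleanup(long nowMillis) {
        int removed = 0;
        for (Iterator<Entry> it = jobs.values().iterator(); it.hasNext();) {
            Entry e = it.next();
            if (e.references == 0 && nowMillis >= e.keepUntilMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    synchronized boolean contains(String jobId) {
        return jobs.containsKey(jobId);
    }

    public static void main(String[] args) {
        RefCountedCacheSketch cache = new RefCountedCacheSketch(1_000);
        cache.register("job-1");
        cache.release("job-1", 0);
        System.out.println(cache.cleanup(500));   // 0: unreferenced, but still within the TTL
        System.out.println(cache.cleanup(1_000)); // 1: TTL elapsed, entry removed
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic; a scheduled executor calling `cleanup(System.currentTimeMillis())` at a fixed interval would play the role of the timer task.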
rpcService initialization flow:
When the rpcService is initialized, it starts an ActorSystem internally, and an Actor is started inside that ActorSystem.
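What an actor buys an RPC endpoint is a single thread that processes messages one after another, plus lifecycle hooks such as onStart() that run on that same thread. The sketch below imitates this with a single-threaded executor. It is a deliberately simplified assumption: `EndpointSketch`, `start` and `tell` are hypothetical names, and no Akka or Flink API is used.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not Flink code: an endpoint whose work runs on one "main thread".
class EndpointSketch {
    // Single thread standing in for the actor's mailbox: messages are handled one at a time.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    // Only touched from mainThread, so no locking is needed.
    private int handled = 0;

    // Lifecycle hook, executed on the main thread once start() has been called.
    protected void onStart() {
        System.out.println("onStart on " + Thread.currentThread().getName());
    }

    CompletableFuture<Void> start() {
        return CompletableFuture.runAsync(this::onStart, mainThread);
    }

    // "Sends a message": the handler is queued behind everything that was sent before it.
    CompletableFuture<Integer> tell(String message) {
        return CompletableFuture.supplyAsync(() -> {
            System.out.println("handling " + message);
            return ++handled;
        }, mainThread);
    }

    void stop() {
        mainThread.shutdownNow();
    }

    public static void main(String[] args) {
        EndpointSketch endpoint = new EndpointSketch();
        endpoint.start().join();
        System.out.println(endpoint.tell("registerSlots").join()); // 1
        System.out.println(endpoint.tell("heartbeat").join());     // 2
        endpoint.stop();
    }
}
```

Because every handler runs on the same thread, endpoint state such as `handled` needs no synchronization, which is the main design benefit of routing all calls through one mailbox.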
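2.5 startTaskManager -> TaskExecutor (returning the TaskExecutor)
A worker node starts by instantiating a TaskManagerRunner object. The work that follows falls into three steps:
- 1. Initialize the base services (thread pool, HA, RPC, heartbeat service and Blob service).
- 2. Start the TaskManager: wrap the hardware configuration in a TaskManagerServicesConfiguration object and initialize the I/O thread pool.
- 3. Build the TaskManagerServices object (which bundles the service components the TaskManager provides at runtime) and finally return the TaskExecutor object (which holds two heartbeat managers: JobManagerHeartbeatManager and ResourceManagerHeartbeatManager).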
//--> return the TaskExecutor object
return new TaskExecutor(rpcService, taskManagerConfiguration, highAvailabilityServices, taskManagerServices, externalResourceInfoProvider, heartbeatServices, taskManagerMetricGroup.f0, metricQueryServiceAddress, blobCacheService, fatalErrorHandler,
new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));
// TaskExecutor is an RpcEndpoint, so after this constructor has finished and the endpoint is started, its onStart() method is executed
public TaskExecutor(RpcService rpcService, TaskManagerConfiguration taskManagerConfiguration, HighAvailabilityServices haServices,
TaskManagerServices taskExecutorServices, ExternalResourceInfoProvider externalResourceInfoProvider, HeartbeatServices heartbeatServices,
TaskManagerMetricGroup taskManagerMetricGroup, @Nullable String metricQueryServiceAddress, BlobCacheService blobCacheService,
FatalErrorHandler fatalErrorHandler, TaskExecutorPartitionTracker partitionTracker, BackPressureSampleService backPressureSampleService) {
// Creates a random name of the form prefix_X, where X is an increasing number
super(rpcService, AkkaRpcServiceUtils.createRandomName(TASK_MANAGER_NAME));
checkArgument(taskManagerConfiguration.getNumberSlots() > 0, "The number of slots has to be larger than 0.");
this.taskManagerConfiguration = checkNotNull(taskManagerConfiguration);
this.taskExecutorServices = checkNotNull(taskExecutorServices);
this.haServices = checkNotNull(haServices);
this.fatalErrorHandler = checkNotNull(fatalErrorHandler);
this.partitionTracker = partitionTracker;
this.taskManagerMetricGroup = checkNotNull(taskManagerMetricGroup);
this.blobCacheService = checkNotNull(blobCacheService);
this.metricQueryServiceAddress = metricQueryServiceAddress;
this.backPressureSampleService = checkNotNull(backPressureSampleService);
this.externalResourceInfoProvider = checkNotNull(externalResourceInfoProvider);
this.libraryCacheManager = taskExecutorServices.getLibraryCacheManager();
this.taskSlotTable = taskExecutorServices.getTaskSlotTable();
this.jobTable = taskExecutorServices.getJobTable();
this.jobLeaderService = taskExecutorServices.getJobLeaderService();
this.unresolvedTaskManagerLocation = taskExecutorServices.getUnresolvedTaskManagerLocation();
this.localStateStoresManager = taskExecutorServices.getTaskManagerStateStore();
this.shuffleEnvironment = taskExecutorServices.getShuffleEnvironment();
this.kvStateService = taskExecutorServices.getKvStateService();
this.ioExecutor = taskExecutorServices.getIOExecutor();
this.resourceManagerLeaderRetriever = haServices.getResourceManagerLeaderRetriever();
// Description of the hardware of this machine
this.hardwareDescription = HardwareDescription.extractFromSystem(taskExecutorServices.getManagedMemorySize());
this.resourceManagerAddress = null;
this.resourceManagerConnection = null;
this.currentRegistrationTimeoutId = null;
final ResourceID resourceId = taskExecutorServices.getUnresolvedTaskManagerLocation().getResourceID();
// HeartbeatManagerImpl jobManagerHeartbeatManager
this.jobManagerHeartbeatManager = createJobManagerHeartbeatManager(heartbeatServices, resourceId);
// HeartbeatManagerImpl resourceManagerHeartbeatManager
this.resourceManagerHeartbeatManager = createResourceManagerHeartbeatManager(heartbeatServices, resourceId);
}
//--> Execution then continues in the onStart() method of TaskExecutor
public void onStart() throws Exception {
try {
/*************************************************
* Start the services.
* The important ones:
* 1. Monitor the ResourceManager
* 2. Start the TaskSlotTable service
* 3. Monitor the JobMaster
* 4. Start the FileCache service
*/
startTaskExecutorServices();
} catch(Exception e) {
final TaskManagerException exception = new TaskManagerException(String.format("Could not start the TaskExecutor %s", getAddress()), e);
onFatalError(exception);
throw exception;
}
// Start the registration timeout
startRegistrationTimeout();
}
//-->startTaskExecutorServices();
private void startTaskExecutorServices() throws Exception {
try {
// Start the ResourceManagerLeaderListener. The TaskManager registers with the ResourceManager through this listener:
// it watches for changes of the ResourceManager leader, and when a new leader is elected, notifyLeaderAddress()
// is called to trigger a reconnect to the ResourceManager
// start by connecting to the ResourceManager
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
// How to read the code above:
/* 1. ResourceManagerLeaderListener implements LeaderRetrievalListener. Its notifyLeaderAddress() method connects to the ResourceManager:
*    - it builds a TaskExecutorRegistration object and a TaskExecutorToResourceManagerConnection object (the connection between TaskExecutor and ResourceManager) and starts the connection;
*    - the connection creates a registration object (newRegistration) and registers with the ResourceManager;
*    - the ResourceManager takes the globally unique ID of the TaskManager, turns the taskExecutorRegistration into a WorkerRegistration object and stores both in a map: taskExecutors.put(taskExecutorResourceId, registration);
*    - finally it returns a TaskExecutorRegistrationSuccess object;
*    - from then on the heartbeat between ResourceManager and TaskExecutor has to be maintained: taskExecutorGateway stands for the TaskExecutor that has just registered, the ResourceManager calls taskExecutorGateway.heartbeatFromResourceManager(resourceID), and when the TaskExecutor receives this heartbeat request it reports its heartbeat back to the ResourceManager.
* 2. In the standalone scenario, the implementation of resourceManagerLeaderRetriever is ZooKeeperLeaderRetrievalService, which implements NodeCacheListener. NodeCacheListener is a listener interface provided by Curator: when the data of the watched ZooKeeper znode changes, nodeChanged() is called back [the nodeChanged() in ZooKeeperLeaderRetrievalService], and nodeChanged() goes through notifyIfNewLeaderAddress() to notify the corresponding LeaderRetrievalListener.
* 3. The implementation of resourceManagerLeaderRetriever is ZooKeeperLeaderRetrievalService, an implementation of LeaderRetrievalService.
* 4. resourceManagerLeaderRetriever does the watching; when a change occurs, the notifyLeaderAddress() method of ResourceManagerLeaderListener is called.
*/
// Start the TaskSlotTable
// tell the task slot table who's responsible for the task slot actions
taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());
// start the job leader service
jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());
// Initialize the FileCache
fileCache = new FileCache(taskManagerConfiguration.getTmpDirectories(), blobCacheService.getPermanentBlobService());
} catch(Exception e) {
handleStartTaskExecutorServicesException(e);
}
}
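The listener mechanism described in the comments above, where a retrieval service watches for the current leader and the worker reconnects whenever the leader changes, can be reduced to a few lines. All names in this sketch (`LeaderListener`, `RetrievalService`, `Worker`, `publish`) are hypothetical; `publish` stands in for the ZooKeeper callback, and "connecting" is reduced to recording the address.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.UUID;

// Hypothetical sketch, not Flink code: leader retrieval with change notification.
class LeaderListenerSketch {
    // Listener shape: called whenever the observed leader changes.
    interface LeaderListener {
        void notifyLeaderAddress(String leaderAddress, UUID leaderSessionId);
    }

    // Stand-in for a retrieval service; publish() plays the role of the ZooKeeper callback.
    static final class RetrievalService {
        private LeaderListener listener;
        private String lastAddress;
        private UUID lastSessionId;

        void start(LeaderListener listener) {
            this.listener = Objects.requireNonNull(listener);
        }

        void publish(String address, UUID sessionId) {
            boolean changed = !Objects.equals(address, lastAddress)
                    || !Objects.equals(sessionId, lastSessionId);
            // Only notify when the leader information actually changed.
            if (listener != null && changed) {
                lastAddress = address;
                lastSessionId = sessionId;
                listener.notifyLeaderAddress(address, sessionId);
            }
        }
    }

    // Worker side: every notification triggers a (re)connect to the new leader.
    static final class Worker implements LeaderListener {
        final List<String> connections = new ArrayList<>();

        @Override
        public void notifyLeaderAddress(String leaderAddress, UUID leaderSessionId) {
            // A real worker would close the old connection and register with the new leader.
            connections.add(leaderAddress);
        }
    }

    public static void main(String[] args) {
        Worker worker = new Worker();
        RetrievalService service = new RetrievalService();
        service.start(worker);
        UUID session = UUID.randomUUID();
        service.publish("rm-1", session);
        service.publish("rm-1", session);           // unchanged, no notification
        service.publish("rm-2", UUID.randomUUID()); // new leader, reconnect
        System.out.println(worker.connections);     // [rm-1, rm-2]
    }
}
```

The same flow covers cluster startup: the first published leader is simply the first "change" the worker sees.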
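Main flow diagram of TaskManager startup:
TaskManagerServices instantiation flow diagram:
1. shuffleEnvironment initialization flow diagram:
2. Netty initialization flow diagram: once the shuffleEnvironment has been initialized it is started, and during startup the Netty server and client are initialized
3. State management initialization flow diagram:
TaskExecutor instantiation flow diagram: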
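After the TaskExecutor object has been instantiated, its onStart() method is executed:
- Start the services: startTaskExecutorServices()
- Monitor the ResourceManager (connect to the ResourceManager, register with a registration timeout, listen for RM changes)
Starting the ResourceManager listener:
Reconnect and registration flow when the ResourceManager changes (this also applies at cluster startup):
Detailed registration flow once the TaskExecutor has connected to the ResourceManager:
How the heartbeat between TaskExecutor and ResourceManager is maintained after registration:
- Heartbeat between Flink's master and worker nodes:
- 1. The ResourceManager starts and starts its HeartbeatManager, which every 10s iterates over the registered TaskExecutors and sends each of them a heartbeat request
- 2. The TaskExecutor starts and starts its registration timeout check (5 min). Once started it registers, and from then on receiving the heartbeat requests is what maintains the heartbeat between RM and TaskManager
- 3. Each time the TaskManager receives a heartbeat from the ResourceManager, it resets its timeout task.
- Start the TaskSlotTable service: it contains a timeout check service
- Start the JobLeaderService: starts a listener (when an already running JobMaster moves to another node, the JobLeaderService receives the notification and handles it)
- Start the FileCache service: a cache for resource files
- Start the registration timeout: startRegistrationTimeout()
- If the steps above take longer than 5 min, registration times out (the 5-minute registration timeout check)
Slot registration report: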
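The heartbeat rule in the list above (a request every 10s, and a timeout that is reset on every heartbeat received) boils down to a deadline that keeps being pushed forward. The sketch below models only that rule with an explicit clock so that it runs instantly. `HeartbeatSketch` and `Monitor` are hypothetical names, and the two constants simply restate the 10s and 50s values mentioned earlier in this article.

```java
// Hypothetical sketch, not Flink code: a heartbeat timeout that is reset on every heartbeat.
class HeartbeatSketch {
    static final long INTERVAL_MS = 10_000; // heartbeat request interval from the text
    static final long TIMEOUT_MS = 50_000;  // heartbeat timeout from the text

    static final class Monitor {
        private final long timeoutMs;
        private long deadlineMs;

        Monitor(long timeoutMs, long nowMs) {
            this.timeoutMs = timeoutMs;
            this.deadlineMs = nowMs + timeoutMs;
        }

        // Receiving a heartbeat pushes the deadline forward (the "reset the timeout task" step).
        void receiveHeartbeat(long nowMs) {
            deadlineMs = nowMs + timeoutMs;
        }

        // The peer counts as lost once no heartbeat arrived for a full timeout period.
        boolean isTimedOut(long nowMs) {
            return nowMs >= deadlineMs;
        }
    }

    public static void main(String[] args) {
        Monitor monitor = new Monitor(TIMEOUT_MS, 0);
        // Heartbeats arrive every 10s for one minute: the deadline keeps moving.
        for (long now = INTERVAL_MS; now <= 60_000; now += INTERVAL_MS) {
            monitor.receiveHeartbeat(now);
        }
        System.out.println(monitor.isTimedOut(60_000));  // false
        // Heartbeats stop: 50s after the last one the peer is considered lost.
        System.out.println(monitor.isTimedOut(109_999)); // false
        System.out.println(monitor.isTimedOut(110_000)); // true
    }
}
```

A production implementation would schedule the timeout on an executor and cancel and reschedule it on each heartbeat; the explicit `nowMs` parameter is only there to keep the example deterministic.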
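3. Summary
- As the worker node of a Flink cluster, the TaskManager is mainly responsible for managing slot resources and executing the actual tasks, while staying in communication with the JobManager.
- The startup class of the TaskManager is TaskManagerRunner.
- The TaskManager startup consists mainly of the following:
- Load the configuration (arguments passed to main plus flink-conf.yaml) and initialize the plugin service and the file system service
- Start the TaskManager inside the installed security context
- Instantiate the TaskManagerRunner object and, once that succeeds, send it the START message as confirmation.
- TaskManagerRunner holds many base services (HA, RPC, HeartbeatServices, and the Blob service for large files)
- Start the TaskManager, which finally returns the TaskExecutor that is responsible for running multiple Tasks.
- Initialize TaskManagerServices (which bundles many service components such as shuffleEnvironment and TaskSlotTable) as well as JobLeaderService and others
- Create two heartbeat managers: JobManagerHeartbeatManager and ResourceManagerHeartbeatManager
- Once the TaskExecutor has been instantiated, its onStart() method runs and starts four services: the ResourceManager monitoring (heartbeat) service, the TaskSlotTable that manages the mapping between Tasks and slots, the JobLeaderService for JobMasters, and the file cache service.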