Apache Flink is one of the most popular big data compute engines. It natively supports high throughput, low latency, exactly-once semantics, and stateful stream processing, and reading its source code helps deepen one's understanding of the framework.
Flink has four kinds of execution graphs; this chapter walks through the first half of how the physical execution graph is generated.
This chapter and the following source-code walkthroughs assume a production-style environment: run mode: on YARN; HA mode: ZooKeeper; execution mode: Streaming. Key logic is explained in comments I have added to the code, so do not skip them.
The components and frameworks involved depend on parts of the big data ecosystem, including YARN, Akka, and ZooKeeper; some background in these will make the walkthrough easier to follow.
Since the whole construction process is fairly complex, I will split it across two chapters. This chapter covers the build-up before Flink submits the TaskDescriptor to the TM, i.e., from slot deployment to the creation of the TM instance. For the external dependencies invoked along the way, such as Akka-based RPC, ZooKeeper-based HA leader election, and the YARN mechanics, see my previous chapter on the ExecutionGraph construction process: https://blog.csdn.net/ws0owws0ow/article/details/113991593?spm=1001.2014.3001.5501. Flink's memory management model and the Netty communication involved in creating the TM will be covered in later chapters.
Continuing from the previous chapter: after services such as the SlotPool, Scheduler, and ExecutionGraph have been created, the JobManager is fully initialized and leader election begins.
//Create the JM. The JM is one of Flink's core components: it maintains the SlotPool, which handles internal slot scheduling, and the Scheduler, which maintains the CheckpointCoordinator and builds the ExecutionGraph.
CompletableFuture<JobManagerRunner> createJobManagerRunner(JobGraph jobGraph, long initializationTimestamp) {
//Get the ClusterEntrypoint's actorSystem-based RpcService, used to create the JobManagerRunner's RpcEndpoint
final RpcService rpcService = getRpcService();
//Create JM -> create SchedulerNG -> build ExecutionGraph -> create ExecutionJobVertex
//-> create ExecutionVertex, IntermediateResult/partition -> create CheckpointCoordinator
//-> JM leader election -> start SchedulerNG to dispatch tasks and checkpoints
return CompletableFuture.supplyAsync(
() -> {
try {
JobManagerRunner runner = jobManagerRunnerFactory.createJobManagerRunner(
jobGraph,
configuration,
rpcService,
highAvailabilityServices,
heartbeatServices,
jobManagerSharedServices,
new DefaultJobManagerJobMetricGroupFactory(jobManagerMetricGroup),
fatalErrorHandler,
initializationTimestamp);
//Start JM leader election
runner.start();
return runner;
} catch (Exception e) {
throw new CompletionException(new JobInitializationException(jobGraph.getJobID(), "Could not instantiate JobManager.", e));
}
},
ioExecutor); // do not use main thread executor. Otherwise, Dispatcher is blocked on JobManager creation
}
public void grantLeadership(final UUID leaderSessionID) {
synchronized (lock) {
if (shutdown) {
log.debug("JobManagerRunner cannot be granted leadership because it is already shut down.");
return;
}
leadershipOperation = leadershipOperation.thenCompose(
(ignored) -> {
synchronized (lock) {//about to start the JM
return verifyJobSchedulingStatusAndStartJobManager(leaderSessionID);
}
});
handleException(leadershipOperation, "Could not start the job manager.");
}
}
After leader election succeeds, the JM starts the SlotPool, then transitions the job's JobStatus to RUNNING and triggers the Scheduler to schedule checkpoints, allocate slots, and deploy tasks.
Note: this is where the checkpoint machinery gets started; for Flink's checkpoint execution mechanism, see my earlier write-up on the detailed checkpoint execution process.
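As a quick reference, this is the user-facing DataStream API that configures what the CheckpointCoordinator will later schedule (the interval and timeout values are illustrative, not Flink defaults):
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE); // trigger a checkpoint every 60s
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000L); // at least 10s between two checkpoints
        env.getCheckpointConfig().setCheckpointTimeout(120_000L); // abort a checkpoint that runs longer than 120s
        // ... define the job topology here, then call env.execute(...)
    }
}
Now back to the JM-side source: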
private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {
validateRunsInMainThread();
checkNotNull(newJobMasterId, "The new JobMasterId must not be null.");
if (Objects.equals(getFencingToken(), newJobMasterId)) {
log.info("Already started the job execution with JobMasterId {}.", newJobMasterId);
return Acknowledge.get();
}
setNewFencingToken(newJobMasterId);
//Start the SlotPool and connect to the RM
startJobMasterServices();
log.info("Starting execution of job {} ({}) under job master id {}.", jobGraph.getName(), jobGraph.getJobID(), newJobMasterId);
//Verify that the JobStatus is CREATED (it was set to CREATED when the ExecutionGraph was initialized earlier)
//otherwise recreate the SchedulerNG, repeating the earlier JM setup steps
//finally start the CheckpointCoordinator through the SchedulerNG, request slot resources, and deploy tasks
resetAndStartScheduler();
return Acknowledge.get();
}
private void resetAndStartScheduler() throws Exception {
validateRunsInMainThread();
final CompletableFuture<Void> schedulerAssignedFuture;
//When the ExecutionGraph is created, JobStatus is set to JobStatus.CREATED;
//executionGraph.transitionToRunning later switches it to RUNNING
if (schedulerNG.requestJobStatus() == JobStatus.CREATED) {
schedulerAssignedFuture = CompletableFuture.completedFuture(null);
schedulerNG.setMainThreadExecutor(getMainThreadExecutor());
} else {
//Recreate the SchedulerNG, repeating the earlier JM setup steps
suspendAndClearSchedulerFields(new FlinkException("ExecutionGraph is being reset in order to be rescheduled."));
final JobManagerJobMetricGroup newJobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);
final SchedulerNG newScheduler = createScheduler(executionDeploymentTracker, newJobManagerJobMetricGroup);
schedulerAssignedFuture = schedulerNG.getTerminationFuture().handle(
(ignored, throwable) -> {
newScheduler.setMainThreadExecutor(getMainThreadExecutor());
assignScheduler(newScheduler, newJobManagerJobMetricGroup);
return null;
}
);
}
//Start the Scheduler
schedulerAssignedFuture.thenRun(this::startScheduling);
}
protected void startSchedulingInternal() {
log.info("Starting scheduling with scheduling strategy [{}]", schedulingStrategy.getClass().getName());
//Notify the ExecutionGraph's CheckpointCoordinator to switch to RUNNING and prepare to run checkpoints.
//The ScheduledTrigger Runnable produced here contains the periodic checkpoint logic (CheckpointCoordinator.startCheckpointScheduler).
//The JM-side checkpoint checks whether each Execution's assignedResource is set; if not, no checkpoint is submitted to the TM.
//assignedResource is only set once slot deployment on a TM has succeeded; only then does the JM's periodic checkpoint thread go on to trigger checkpoints on the TM's tasks.
prepareExecutionGraphForNgScheduling();
schedulingStrategy.startScheduling();
}
The JM's slot request flow is a two-phase process, roughly:
- The JM first asks its local SlotPool for a slot. If one is available, the SlotPool returns the info of a TM with a free slot (location, usable resources, etc.), and the request is then submitted to that TM's SlotTable.
- If not, the SlotPool submits the slot request to the SlotManager on Flink's RM. If the SlotManager finds a TM with a free slot, it returns that TM's info, and the request is submitted to that TM's SlotTable.
- If the SlotManager cannot match a TM with free resources either, the RM requests a container from the YARN RM and issues the command to launch it. The container then runs the entry main class specified at submission time, initializes a TaskManager, creates services such as the NettyShuffleEnvironment, and finally establishes connections with the JM, RM, and other services. (A toy sketch of this fallback chain follows the list.)
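To make the fallback order concrete, here is a toy, self-contained model of the three phases; the types and methods (Slot, tryLocalSlotPool, trySlotManager, requestYarnContainerAndWait) are stand-ins, not Flink classes:
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
public class SlotRequestFallbackSketch {
    interface Slot {}
    //Phase 1: a readily available slot in the JM-local SlotPool
    static Optional<Slot> tryLocalSlotPool() { return Optional.empty(); }
    //Phase 2: a free slot on a TM registered with the RM's SlotManager
    static Optional<Slot> trySlotManager() { return Optional.empty(); }
    //Phase 3: last resort, start a brand-new TM in a YARN container
    static CompletableFuture<Slot> requestYarnContainerAndWait() {
        return CompletableFuture.completedFuture(new Slot() {});
    }
    static CompletableFuture<Slot> requestSlot() {
        Optional<Slot> local = tryLocalSlotPool();
        if (local.isPresent()) {
            return CompletableFuture.completedFuture(local.get());
        }
        Optional<Slot> fromRm = trySlotManager();
        if (fromRm.isPresent()) {
            return CompletableFuture.completedFuture(fromRm.get());
        }
        return requestYarnContainerAndWait();
    }
    public static void main(String[] args) {
        requestSlot().thenAccept(slot -> System.out.println("slot granted: " + slot));
    }
}
Back to the actual scheduling entry point: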
public void startScheduling() {
allocateSlotsAndDeploy(SchedulingStrategyUtils.getAllVertexIdsFromTopology(schedulingTopology));
}
public void allocateSlotsAndDeploy(final List<ExecutionVertexDeploymentOption> executionVertexDeploymentOptions) {
....
//What is returned here are LogicalSlots; the actual task submission happens in the next step, waitForAllSlotsAndDeploy
final List<SlotExecutionVertexAssignment> slotExecutionVertexAssignments =
allocateSlots(executionVertexDeploymentOptions);
final List<DeploymentHandle> deploymentHandles = createDeploymentHandles(
requiredVersionByVertex,
deploymentOptionsByVertex,
slotExecutionVertexAssignments);
//Start submitting tasks to the TMs
waitForAllSlotsAndDeploy(deploymentHandles);
}
Slot allocation begins. Here we assume the default setup where operators share the same slot sharing group, so different ExecutionVertices may be assigned to the same slot; the snippet below shows the user-facing API that controls this.
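For reference, a minimal example of assigning an operator to a named slot sharing group (standard DataStream API; the group name "heavy" is arbitrary):
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class SlotSharingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c")
            .map(String::toUpperCase)
            //Downstream operators inherit this group unless they set their own;
            //subtasks of operators in the same group may share one slot.
            .slotSharingGroup("heavy")
            .print();
        env.execute("slot-sharing-example");
    }
}
Back to the source: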
private List<SlotExecutionVertexAssignment> allocateSlots(final List<ExecutionVertexDeploymentOption> executionVertexDeploymentOptions) {
//Map the executionVertexDeploymentOptions to scheduling requirements, then allocate slots for them
return executionSlotAllocator.allocateSlotsFor(executionVertexDeploymentOptions
.stream()
.map(ExecutionVertexDeploymentOption::getExecutionVertexId)
.map(this::getExecutionVertex)
.map(ExecutionVertexSchedulingRequirementsMapper::from)
.collect(Collectors.toList()));
}
The local SlotPool is queried first: the JM's SlotPool checks whether a ready-to-use slot already exists. If so, the matching multiTaskSlotLocality is returned; it wraps the info of the usable TM, such as its address and available ResourceProfile, and the resource request is then deployed directly on that TM. Otherwise the current SlotPool has no free resources, and the slot request is submitted to the SlotManager on Flink's RM.
private SlotSharingManager.MultiTaskSlotLocality allocateMultiTaskSlot(
AbstractID groupId,
SlotSharingManager slotSharingManager,
SlotProfile slotProfile,
@Nullable Time allocationTimeout) {
Collection<SlotSelectionStrategy.SlotInfoAndResources> resolvedRootSlotsInfo =
slotSharingManager.listResolvedRootSlotInfo(groupId);
SlotSelectionStrategy.SlotInfoAndLocality bestResolvedRootSlotWithLocality =
slotSelectionStrategy.selectBestSlotForProfile(resolvedRootSlotsInfo, slotProfile).orElse(null);
final SlotSharingManager.MultiTaskSlotLocality multiTaskSlotLocality = bestResolvedRootSlotWithLocality != null ?
new SlotSharingManager.MultiTaskSlotLocality(
slotSharingManager.getResolvedRootSlot(bestResolvedRootSlotWithLocality.getSlotInfo()),
bestResolvedRootSlotWithLocality.getLocality()) :
null;
if (multiTaskSlotLocality != null && multiTaskSlotLocality.getLocality() == Locality.LOCAL) {
return multiTaskSlotLocality;
}
final SlotRequestId allocatedSlotRequestId = new SlotRequestId();
final SlotRequestId multiTaskSlotRequestId = new SlotRequestId();
//First try to allocate from the JM's own SlotPool.
//This mainly checks whether the SlotPool's AvailableSlots instance (a map of available TMs and slots) is non-empty (i.e., resources exist).
//If so, some TM currently has a free slot, and a non-empty Optional<SlotAndLocality> (containing the TM address, available ResourceProfile, etc.) is returned.
Optional<SlotAndLocality> optionalPoolSlotAndLocality = tryAllocateFromAvailable(allocatedSlotRequestId, slotProfile);
//If the allocation succeeded, create the task slot and return directly
if (optionalPoolSlotAndLocality.isPresent()) {
SlotAndLocality poolSlotAndLocality = optionalPoolSlotAndLocality.get();
if (poolSlotAndLocality.getLocality() == Locality.LOCAL || bestResolvedRootSlotWithLocality == null) {
//The physical slot resource currently available in the SlotPool
final PhysicalSlot allocatedSlot = poolSlotAndLocality.getSlot();
//Create the root multiTaskSlot
final SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.createRootSlot(
multiTaskSlotRequestId,
CompletableFuture.completedFuture(poolSlotAndLocality.getSlot()),
allocatedSlotRequestId);
//This merely tags the available physical slot with this request's multiTaskSlot as its payload
if (allocatedSlot.tryAssignPayload(multiTaskSlot)) {
//Return it wrapped as a MultiTaskSlotLocality
return SlotSharingManager.MultiTaskSlotLocality.of(multiTaskSlot, poolSlotAndLocality.getLocality());
} else {
multiTaskSlot.release(new FlinkException("Could not assign payload to allocated slot " +
allocatedSlot.getAllocationId() + '.'));
}
}
}
//If non-null, a resolved root slot was found for the sharing group earlier; return and use that multiTaskSlotLocality directly
if (multiTaskSlotLocality != null) {
// prefer slot sharing group slots over unused slots
if (optionalPoolSlotAndLocality.isPresent()) {
slotPool.releaseSlot(
allocatedSlotRequestId,
new FlinkException("Locality constraint is not better fulfilled by allocated slot."));
}
return multiTaskSlotLocality;
}
// there is no slot immediately available --> check first for uncompleted slots at the slot sharing group
// Reaching here means no resolved slot was found; check whether the sharing group still has unresolved (pending) slots
SlotSharingManager.MultiTaskSlot multiTaskSlot = slotSharingManager.getUnresolvedRootSlot(groupId);
//If this is null, the steps above found no usable slot at all
if (multiTaskSlot == null) {
// it seems as if we have to request a new slot from the resource manager, this is always the last resort!!!
//Request a new physical slot from the RM
final CompletableFuture<PhysicalSlot> slotAllocationFuture = requestNewAllocatedSlot(
allocatedSlotRequestId,
slotProfile,
allocationTimeout);
....
}
If the local SlotPool has no matching resources, the slot request is submitted to the resourceManagerGateway; on receiving it, the RM forwards the request to its SlotManager.
private CompletableFuture<AllocatedSlot> requestNewAllocatedSlotInternal(PendingRequest pendingRequest) {
if (resourceManagerGateway == null) {
//If the RM is not yet connected, stash the request and submit it once the connection is up
stashRequestWaitingForResourceManager(pendingRequest);
} else {
//Submit the request via the resourceManagerGateway
requestSlotFromResourceManager(resourceManagerGateway, pendingRequest);
}
return pendingRequest.getAllocatedSlotFuture();
}
private void requestSlotFromResourceManager(
final ResourceManagerGateway resourceManagerGateway,
final PendingRequest pendingRequest) {
....
//Call the resourceManagerGateway to submit the request
CompletableFuture<Acknowledge> rmResponse = resourceManagerGateway.requestSlot(
jobMasterId,
new SlotRequest(jobId, allocationId, pendingRequest.getResourceProfile(), jobManagerAddress),
rpcTimeout);
...
}
public CompletableFuture<Acknowledge> requestSlot(
JobMasterId jobMasterId,
SlotRequest slotRequest,
final Time timeout) {
...
try {//The RM hands the slot request to its internal SlotManager
slotManager.registerSlotRequest(slotRequest);
} catch (ResourceManagerException e) {
return FutureUtils.completedExceptionally(e);
}
...
}
The SlotManager then checks whether any registered resources can satisfy the request:
private void internalRequestSlot(PendingSlotRequest pendingSlotRequest) throws ResourceManagerException {
final ResourceProfile resourceProfile = pendingSlotRequest.getResourceProfile();
//The SlotManager matches the requested resourceProfile (cores, memory, etc.) against TaskManagers with free slots
OptionalConsumer.of(findMatchingSlot(resourceProfile))
//If matched, a TM has a free slot: ask that TaskManager to allocate it
.ifPresent(taskManagerSlot -> allocateSlot(taskManagerSlot, pendingSlotRequest))
//If not matched: the SlotManager asks the YARN RM for a container to deploy a new TaskManager with fresh slots
.ifNotPresent(() -> fulfillPendingSlotRequestWithPendingTaskManagerSlot(pendingSlotRequest));
}
If the SlotManager finds no matching resources, the RM calls the resourceManagerDriver to request resources from the cluster.
Here we assume the YarnResourceManagerDriver is in use.
private void requestNewWorker(WorkerResourceSpec workerResourceSpec) {
final TaskExecutorProcessSpec taskExecutorProcessSpec =
TaskExecutorProcessUtils.processSpecFromWorkerResourceSpec(flinkConfig, workerResourceSpec);
final int pendingCount = pendingWorkerCounter.increaseAndGet(workerResourceSpec);
log.info("Requesting new worker with resource spec {}, current pending count: {}.",
workerResourceSpec,
pendingCount);
//The external resource management drivers available to Flink: KubernetesResourceManagerDriver, MesosResourceManagerDriver, YarnResourceManagerDriver
CompletableFuture<WorkerType> requestResourceFuture = resourceManagerDriver.requestResource(taskExecutorProcessSpec);
FutureUtils.assertNoException(
requestResourceFuture.handle((worker, exception) -> {
if (exception != null) {
final int count = pendingWorkerCounter.decreaseAndGet(workerResourceSpec);
log.warn("Failed requesting worker with resource spec {}, current pending count: {}, exception: {}",
workerResourceSpec,
count,
exception);
requestWorkerIfRequired();
} else {
final ResourceID resourceId = worker.getResourceID();
workerNodeMap.put(resourceId, worker);
currentAttemptUnregisteredWorkers.put(resourceId, workerResourceSpec);
log.info("Requested worker {} with resource spec {}.",
resourceId.getStringWithMetadata(),
workerResourceSpec);
}
return null;
}));
}
Flink does not wrap YARN's components heavily: it directly uses the AMRMClientAsync client from the org.apache.hadoop.yarn.client.api package to submit resource requests to the YARN cluster and receive allocated containers. A minimal standalone sketch of that raw API follows.
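Here is a bare-bones, hypothetical AM-side usage of AMRMClientAsync (Hadoop 2.x API; registration teardown and error handling omitted, and the mostly-empty callback handler stands in for Flink's YarnResourceManagerDriver):
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
void requestOneTaskManagerContainer() throws Exception {
    AMRMClientAsync.CallbackHandler handler = new AMRMClientAsync.CallbackHandler() {
        @Override public void onContainersAllocated(List<Container> containers) {
            //Flink would launch a TaskExecutor in each granted container here
        }
        @Override public void onContainersCompleted(List<ContainerStatus> statuses) {}
        @Override public void onShutdownRequest() {}
        @Override public void onNodesUpdated(List<NodeReport> nodes) {}
        @Override public void onError(Throwable e) {}
        @Override public float getProgress() { return 0f; }
    };
    AMRMClientAsync<AMRMClient.ContainerRequest> rmClient =
            AMRMClientAsync.createAMRMClientAsync(1000, handler); //1s AM<->RM heartbeat
    rmClient.init(new YarnConfiguration());
    rmClient.start();
    rmClient.registerApplicationMaster("am-host", 0, ""); //must register before requesting
    //4 GB / 2 vcores at priority 1 -- mirrors Flink's getContainerRequest below
    rmClient.addContainerRequest(new AMRMClient.ContainerRequest(
            Resource.newInstance(4096, 2), null, null, Priority.newInstance(1)));
}
Flink's actual request path: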
public CompletableFuture<YarnWorkerNode> requestResource(TaskExecutorProcessSpec taskExecutorProcessSpec) {
checkInitialized();
final CompletableFuture<YarnWorkerNode> requestResourceFuture = new CompletableFuture<>();
//The resource info Flink needs, to be submitted to the YARN RM
final Optional<TaskExecutorProcessSpecContainerResourcePriorityAdapter.PriorityAndResource> priorityAndResourceOpt =
taskExecutorProcessSpecContainerResourcePriorityAdapter.getPriorityAndResource(taskExecutorProcessSpec);
if (!priorityAndResourceOpt.isPresent()) {
requestResourceFuture.completeExceptionally(
new ResourceManagerException(
String.format("Could not compute the container Resource from the given TaskExecutorProcessSpec %s. " +
"This usually indicates the requested resource is larger than Yarn's max container resource limit.",
taskExecutorProcessSpec)));
} else {
final Priority priority = priorityAndResourceOpt.get().getPriority();
final Resource resource = priorityAndResourceOpt.get().getResource();
//Request a container from the YARN RM; if the RM finds free NM resources, it grants the request (the client's onContainersAllocated callback fires)
resourceManagerClient.addContainerRequest(getContainerRequest(resource, priority));
// make sure we transmit the request fast and receive fast news of granted allocations
resourceManagerClient.setHeartbeatInterval(containerRequestHeartbeatIntervalMillis);
requestResourceFutures.computeIfAbsent(taskExecutorProcessSpec, ignore -> new LinkedList<>()).add(requestResourceFuture);
log.info("Requesting new TaskExecutor container with resource {}, priority {}.", taskExecutorProcessSpec, priority);
}
return requestResourceFuture;
}
static AMRMClient.ContainerRequest getContainerRequest(Resource containerResource, Priority priority) {
//Submit the request to YARN; the container will be sized according to containerResource
return new AMRMClient.ContainerRequest(
containerResource,
null,
null,
priority);
}
After resourceManagerClient.addContainerRequest has been issued and the YARN RM has replied that free NM resources exist, the client-side onContainersAllocated callback fires and container initialization begins.
public void onContainersAllocated(List<Container> containers) {
runAsyncWithFatalHandler(() -> {
checkInitialized();
log.info("Received {} containers.", containers.size());
//Iterate over all containers YARN successfully allocated, grouped by priority
for (Map.Entry<Priority, List<Container>> entry : groupContainerByPriority(containers).entrySet()) {
//and prepare to launch them
onContainersOfPriorityAllocated(entry.getKey(), entry.getValue());
}
// if we are waiting for no further containers, we can go to the
// regular heartbeat interval
if (getNumRequestedNotAllocatedWorkers() <= 0) {
resourceManagerClient.setHeartbeatInterval(yarnHeartbeatIntervalMillis);
}
});
}
private void onContainersOfPriorityAllocated(Priority priority, List<Container> containers) {
....
//Iterate over the containers that need to be started
while (containerIterator.hasNext() && pendingContainerRequestIterator.hasNext()) {
final Container container = containerIterator.next();
final AMRMClient.ContainerRequest pendingRequest = pendingContainerRequestIterator.next();
final ResourceID resourceId = getContainerResourceId(container);
final CompletableFuture<YarnWorkerNode> requestResourceFuture = pendingRequestResourceFutures.poll();
Preconditions.checkState(requestResourceFuture != null);
if (pendingRequestResourceFutures.isEmpty()) {
requestResourceFutures.remove(taskExecutorProcessSpec);
}
//Start the container
startTaskExecutorInContainerAsync(container, taskExecutorProcessSpec, resourceId, requestResourceFuture);
removeContainerRequest(pendingRequest);
...
}
Flink then submits the container launch to the NodeManager. The submission resembles the JobGraph submission described earlier: the user jar, required resources, Kerberos credentials, and the entry main class YarnTaskExecutorRunner.class are packed into the LaunchContext, and the NMClientAsync client is invoked to ask the NM to start the container. A minimal standalone sketch of that raw API follows.
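For reference, a bare-bones, hypothetical launch via the raw YARN NM client (Hadoop 2.x API; localized jars, environment variables, and security tokens are omitted, and the launch command is simplified):
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
import org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
void launchTaskExecutor(Container container, NMClientAsync.CallbackHandler handler) {
    NMClientAsync nmClient = new NMClientAsyncImpl(handler);
    nmClient.init(new YarnConfiguration());
    nmClient.start();
    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
            Collections.emptyMap(),    //local resources: Flink ships its jars and config here
            Collections.emptyMap(),    //environment variables
            Collections.singletonList( //launch command: run the TM entry class
                    "java org.apache.flink.yarn.YarnTaskExecutorRunner 1> out.log 2> err.log"),
            null, null, null);         //service data, tokens, ACLs
    //The NM runs the command; YarnTaskExecutorRunner.main then starts the TaskExecutor
    nmClient.startContainerAsync(container, ctx);
}
Flink's actual launch path: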
private void startTaskExecutorInContainerAsync(
Container container,
TaskExecutorProcessSpec taskExecutorProcessSpec,
ResourceID resourceId,
CompletableFuture<YarnWorkerNode> requestResourceFuture) {
final CompletableFuture<ContainerLaunchContext> containerLaunchContextFuture =
//Package the resources Flink needs to submit into a ContainerLaunchContext
FutureUtils.supplyAsync(() -> createTaskExecutorLaunchContext(
resourceId, container.getNodeId().getHost(), taskExecutorProcessSpec), getIoExecutor());
FutureUtils.assertNoException(
containerLaunchContextFuture.handleAsync((context, exception) -> {
if (exception == null) {
//Call NMClientAsync to ask the NM to start the container.
//Once the NM has started it, the container runs the main method of YarnTaskExecutorRunner.class, which creates the TaskExecutor instance and waits for the JM to submit tasks
nodeManagerClient.startContainerAsync(container, context);
requestResourceFuture.complete(new YarnWorkerNode(container, resourceId));
} else {
requestResourceFuture.completeExceptionally(exception);
}
return null;
}, getMainThreadExecutor()));
}
After the NM starts the container, it executes the entry main class specified in the context Flink submitted, YarnTaskExecutorRunner.class.
Entering the TM-side main entry point:
private ContainerLaunchContext createTaskExecutorLaunchContext(
ResourceID containerId,
String host,
TaskExecutorProcessSpec taskExecutorProcessSpec) throws Exception {
.....
//Create the launch context that designates YarnTaskExecutorRunner.class as the entry main class
final ContainerLaunchContext taskExecutorLaunchContext = Utils.createTaskExecutorContext(
flinkConfig,
yarnConfig,
configuration,
taskManagerParameters,
taskManagerDynamicProperties,
currDir,
YarnTaskExecutorRunner.class,
log);
taskExecutorLaunchContext.getEnvironment()
.put(ENV_FLINK_NODE_ID, host);
return taskExecutorLaunchContext;
}
public class YarnTaskExecutorRunner {
....
//Main entry point on the executor side
public static void main(String[] args) {
EnvironmentInformation.logEnvironmentInfo(LOG, "YARN TaskExecutor runner", args);
SignalHandler.register(LOG);
JvmShutdownSafeguard.installAsShutdownHook(LOG);
//Start the TM
runTaskManagerSecurely(args);
}
TM initialization begins. This includes registering with the JM and creating the Akka actor; creating the NettyShuffleEnvironment and NetworkBufferPool responsible for data exchange, the kvStateService that maintains state, the TaskSlotTable, and more; establishing connections to the RM and other JM-side services; and then waiting for the JM to submit tasks that trigger execution.
public static void runTaskManager(Configuration configuration, PluginManager pluginManager) throws Exception {
final TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, pluginManager, TaskManagerRunner::createTaskExecutorService);
taskManagerRunner.start();
}
//The TaskExecutor is an Akka-based RpcEndpoint (Akka actor)
public class TaskExecutor extends RpcEndpoint implements TaskExecutorGateway {...}
public static TaskExecutor startTaskManager(
Configuration configuration,
ResourceID resourceID,
RpcService rpcService,
HighAvailabilityServices highAvailabilityServices,
HeartbeatServices heartbeatServices,
MetricRegistry metricRegistry,
BlobCacheService blobCacheService,
boolean localCommunicationOnly,
ExternalResourceInfoProvider externalResourceInfoProvider,
FatalErrorHandler fatalErrorHandler) throws Exception {
checkNotNull(configuration);
checkNotNull(resourceID);
checkNotNull(rpcService);
checkNotNull(highAvailabilityServices);
LOG.info("Starting TaskManager with ResourceID: {}", resourceID.getStringWithMetadata());
//External address of the rpcService (actorSystem) the TaskExecutor endpoint will run on
String externalAddress = rpcService.getAddress();
//The CPU, memory, and other resources configured for this TM
final TaskExecutorResourceSpec taskExecutorResourceSpec = TaskExecutorResourceUtils.resourceSpecFromConfig(configuration);
//Parameter configuration for services such as the NettyShuffleEnvironment and checkpointing
TaskManagerServicesConfiguration taskManagerServicesConfiguration =
TaskManagerServicesConfiguration.fromConfiguration(
configuration,
resourceID,
externalAddress,
localCommunicationOnly,
taskExecutorResourceSpec);
//Mainly wraps the hostname, TM ID, etc.
Tuple2<TaskManagerMetricGroup, MetricGroup> taskManagerMetricGroup = MetricUtils.instantiateTaskManagerMetricGroup(
metricRegistry,
externalAddress,
resourceID,
taskManagerServicesConfiguration.getSystemResourceMetricsProbingInterval());
//Create a fixed-size thread pool; the thread count comes from cluster.io-pool.size, defaulting to 4 * number of CPU cores
final ExecutorService ioExecutor = Executors.newFixedThreadPool(
taskManagerServicesConfiguration.getNumIoThreads(),
new ExecutorThreadFactory("flink-taskexecutor-io"));
//Create the TaskManagerServices, which contain the NettyShuffleEnvironment, TaskSlotTable, and other services
TaskManagerServices taskManagerServices = TaskManagerServices.fromConfiguration(
taskManagerServicesConfiguration,
blobCacheService.getPermanentBlobService(),
taskManagerMetricGroup.f1,
ioExecutor,
fatalErrorHandler);
//Initialize the TM memory metrics
MetricUtils.instantiateFlinkMemoryMetricGroup(
taskManagerMetricGroup.f1,
taskManagerServices.getTaskSlotTable(),
taskManagerServices::getManagedMemorySize);
//Create the TM configuration holder: slot-related parameters, the resource spec built above, temp directory paths, log paths, etc.
TaskManagerConfiguration taskManagerConfiguration =
TaskManagerConfiguration.fromConfiguration(configuration, taskExecutorResourceSpec, externalAddress);
String metricQueryServiceAddress = metricRegistry.getMetricQueryServiceGatewayRpcAddress();
//Create the TaskExecutor (an RpcEndpoint)
return new TaskExecutor(
rpcService,
taskManagerConfiguration,
highAvailabilityServices,
taskManagerServices,
externalResourceInfoProvider,
heartbeatServices,
taskManagerMetricGroup.f0,
metricQueryServiceAddress,
blobCacheService,
fatalErrorHandler,
new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));
}
The TM creates quite a few services, such as the kvStateService, taskSlotTable, and broadcastVariableManager.
Here we focus on the core component responsible for data exchange between Executions: the NettyShuffleEnvironment.
Creating the NettyShuffleEnvironment involves three core components (the objects involved will be covered in later chapters):
- networkBufferPool, which manages the TM's on-heap/off-heap network memory
- resultPartitionFactory, which produces the partitions that send data to downstream Executions
- inputGateFactory, which produces the gates that pull data from upstream Executions
private static ShuffleEnvironment<?, ?> createShuffleEnvironment(
TaskManagerServicesConfiguration taskManagerServicesConfiguration,
TaskEventDispatcher taskEventDispatcher,
MetricGroup taskManagerMetricGroup,
Executor ioExecutor) throws FlinkException {
//Create a context carrying all the configuration
final ShuffleEnvironmentContext shuffleEnvironmentContext = new ShuffleEnvironmentContext(
taskManagerServicesConfiguration.getConfiguration(),
taskManagerServicesConfiguration.getResourceID(),
taskManagerServicesConfiguration.getNetworkMemorySize(),
taskManagerServicesConfiguration.isLocalCommunicationOnly(),
taskManagerServicesConfiguration.getBindAddress(),
taskEventDispatcher,
taskManagerMetricGroup,
ioExecutor);
return ShuffleServiceLoader
.loadShuffleServiceFactory(taskManagerServicesConfiguration.getConfiguration())
//Creates the networkBufferPool, resultPartitionFactory, and inputGateFactory
.createShuffleEnvironment(shuffleEnvironmentContext);
}
static NettyShuffleEnvironment createNettyShuffleEnvironment(
NettyShuffleEnvironmentConfiguration config,
ResourceID taskExecutorResourceId,
TaskEventPublisher taskEventPublisher,
ResultPartitionManager resultPartitionManager,
MetricGroup metricGroup,
Executor ioExecutor) {
checkNotNull(config);
checkNotNull(taskExecutorResourceId);
checkNotNull(taskEventPublisher);
checkNotNull(resultPartitionManager);
checkNotNull(metricGroup);
NettyConfig nettyConfig = config.nettyConfig();
FileChannelManager fileChannelManager = new FileChannelManagerImpl(config.getTempDirs(), DIR_NAME_PREFIX);
ConnectionManager connectionManager = nettyConfig != null ?
new NettyConnectionManager(resultPartitionManager, taskEventPublisher, nettyConfig) :
new LocalConnectionManager();
//Manages the current TM's memory partitioning; each task's LocalBufferPool requests segments from it
NetworkBufferPool networkBufferPool = new NetworkBufferPool(
config.numNetworkBuffers(),//how many segments can be allocated, computed earlier in fromConfiguration
config.networkBufferSize(),//segment size, 32 KB by default
config.getRequestSegmentsTimeout());//timeout for segment requests
registerShuffleMetrics(metricGroup, networkBufferPool);
//Mainly responsible for creating ResultPartitions (which contain ResultSubpartitions) and their LocalBufferPools
ResultPartitionFactory resultPartitionFactory = new ResultPartitionFactory(
resultPartitionManager,
fileChannelManager,
networkBufferPool,
config.getBlockingSubpartitionType(),
config.networkBuffersPerChannel(),
config.floatingNetworkBuffersPerGate(),
config.networkBufferSize(),
config.isBlockingShuffleCompressionEnabled(),
config.getCompressionCodec(),
config.getMaxBuffersPerChannel());
//Mainly responsible for creating InputGates (which contain InputChannels) and their LocalBufferPools
SingleInputGateFactory singleInputGateFactory = new SingleInputGateFactory(
taskExecutorResourceId,
config,
connectionManager,
resultPartitionManager,
taskEventPublisher,
networkBufferPool);
//Create the NettyShuffleEnvironment instance
return new NettyShuffleEnvironment(
taskExecutorResourceId,
config,
networkBufferPool,
connectionManager,
resultPartitionManager,
fileChannelManager,
resultPartitionFactory,
singleInputGateFactory,
ioExecutor);
}
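To put numbers on the NetworkBufferPool parameters above: the buffer count is simply the configured network memory divided by the segment size (taskmanager.memory.segment-size, 32 KB by default). A back-of-the-envelope check, where the 128 MB figure is only an example, not a Flink default:
//Illustrative arithmetic only; the real values are derived from the configuration in fromConfiguration.
long networkMemoryBytes = 128L * 1024 * 1024; //assume 128 MB of network memory (example value)
int segmentSize = 32 * 1024; //taskmanager.memory.segment-size, 32 KB default
long numNetworkBuffers = networkMemoryBytes / segmentSize;
System.out.println(numNetworkBuffers); //4096 segments shared by all LocalBufferPools on this TM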
Once the TM has created all of the services above, it starts (the AkkaRpcServer) and connects to the RM and other JM-side services.
public void start() throws Exception {
taskExecutorService.start();
}
//Start the TM
public void onStart() throws Exception {
try {
startTaskExecutorServices();
} catch (Throwable t) {
final TaskManagerException exception = new TaskManagerException(String.format("Could not start the TaskExecutor %s", getAddress()), t);
onFatalError(exception);
throw exception;
}
startRegistrationTimeout();
}
private void startTaskExecutorServices() throws Exception {
try {
// start by connecting to the ResourceManager
//Establish the connection to the RM
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
// tell the task slot table who's responsible for the task slot actions
//Start the taskSlotTable
taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());
// start the job leader service
//Start the job leader service with the rpc address, the AkkaRpcService, the ZooKeeper-based haServices, and a JobLeaderListener
jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());
fileCache = new FileCache(taskManagerConfiguration.getTmpDirectories(), blobCacheService.getPermanentBlobService());
} catch (Exception e) {
handleStartTaskExecutorServicesException(e);
}
}
At this point we have gone from JM startup through slot requests and container deployment to TM instantiation. This stage is mostly about acquiring resources, carving up configuration, and initializing services; the cluster is not yet fully in motion.
Later chapters will cover the final step of the physical execution graph: Flink submitting tasks to the TMs, starting the Task threads, and the exchange of user data. Thanks.