[Flink] Flink 1.12.2 Source Code Walkthrough: yarn-per-job Mode, TaskManager Startup

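This article analyzes how a TaskManager starts in Flink 1.12.2 under yarn-per-job mode. Starting from the main method of YarnTaskExecutorRunner, it walks through TaskManager startup, registration with the ResourceManager, and slot registration and allocation, up to the point where the JobMaster obtains its slots.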

1. Overview

Reposted from: Flink 1.12.2 Source Code Walkthrough: yarn-per-job Mode [Part 4]

Previous post: [Flink] Flink 1.12.2 Source Code Walkthrough: yarn-per-job Mode, JobMaster Startup (YarnJobClusterEntrypoint)

Overall flow

[Figure: overall flow diagram]

Overall TaskManager startup flow
YarnTaskExecutorRunner: the entry class of the TaskManager in YARN mode

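  1. Start the TaskExecutor
  2. Register its slots with the ResourceManager
  3. The ResourceManager allocates slots
  4. The TaskExecutor receives the allocation request and offers the slots to the JobMaster (SlotPool)
  5. The JobMaster submits tasks to the TaskExecutor for execution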

2. Code analysis

2.1. Entry point

The entry point of the TaskManager is YarnTaskExecutorRunner#main

   /**
     * The entry point for the YARN task executor runner.
     *
     * @param args The command line arguments.
     */
    public static void main(String[] args) {
        EnvironmentInformation.logEnvironmentInfo(LOG, "YARN TaskExecutor runner", args);
        SignalHandler.register(LOG);
        JvmShutdownSafeguard.installAsShutdownHook(LOG);

        runTaskManagerSecurely(args);
    }

2.2. TaskManagerRunner#runTaskManagerProcessSecurely

TaskManagerRunner.runTaskManagerProcessSecurely is the method that starts the TaskManager.
It is reached from YarnTaskExecutorRunner#main through the following call chain:

YarnTaskExecutorRunner#main
			--> YarnTaskExecutorRunner#runTaskManagerSecurely
				--> TaskManagerRunner.runTaskManagerProcessSecurely(Preconditions.checkNotNull(configuration));


    public static void runTaskManagerProcessSecurely(Configuration configuration) {
        replaceGracefulExitWithHaltIfConfigured(configuration);

        // Load plugins
        final PluginManager pluginManager =
                PluginUtils.createPluginManagerFromRootFolder(configuration);
        // Initialize the file system
        FileSystem.initialize(configuration, pluginManager);

        int exitCode;
        Throwable throwable = null;

        try {
            // Install the security module
            SecurityUtils.install(new SecurityConfiguration(configuration));

            exitCode =
                    SecurityUtils.getInstalledContext()
                            // Start the TaskManager => runTaskManager
                            .runSecured(() -> runTaskManager(configuration, pluginManager));
        } catch (Throwable t) {
            throwable = ExceptionUtils.stripException(t, UndeclaredThrowableException.class);
            exitCode = FAILURE_EXIT_CODE;
        }

        if (throwable != null) {
            LOG.error("Terminating TaskManagerRunner with exit code {}.", exitCode, throwable);
        } else {
            LOG.info("Terminating TaskManagerRunner with exit code {}.", exitCode);
        }

        System.exit(exitCode);
    }
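
The control flow above boils down to: run a callable, and turn any exception into a failure exit code. The following is a minimal sketch of that pattern in plain Java with no Flink dependencies. The class name `ExitCodeSketch`, the method `runAndMapExitCode`, and the value chosen for `FAILURE_EXIT_CODE` are assumptions made for illustration, and the exit code is returned instead of being passed to `System.exit` so that the method can be called more than once.

```java
import java.util.concurrent.Callable;

final class ExitCodeSketch {
    // Assumed value, for illustration only.
    static final int FAILURE_EXIT_CODE = 1;

    /** Runs the action and converts any Throwable into the failure exit code. */
    public static int runAndMapExitCode(Callable<Integer> action) {
        try {
            return action.call();
        } catch (Throwable t) {
            System.err.println("Terminating with exit code " + FAILURE_EXIT_CODE + ": " + t);
            return FAILURE_EXIT_CODE;
        }
    }

    public static void main(String[] args) {
        int ok = runAndMapExitCode(() -> 0);
        int failed = runAndMapExitCode(() -> {
            throw new IllegalStateException("boom");
        });
        System.out.println(ok + " " + failed);
    }
}
```

In the real method the callable is additionally wrapped by the installed security context (runSecured), which this sketch leaves out.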

2.3. TaskManagerRunner#runTaskManager

This method builds and starts the TaskManagerRunner, then blocks until it terminates.


    public static int runTaskManager(Configuration configuration, PluginManager pluginManager)
            throws Exception {
        final TaskManagerRunner taskManagerRunner;

        try {
            // Build the TaskManagerRunner
            taskManagerRunner =
                    new TaskManagerRunner(
                            configuration,
                            pluginManager,
                            TaskManagerRunner::createTaskExecutorService);

            // Start the TaskManagerRunner
            taskManagerRunner.start();
        } catch (Exception exception) {
            throw new FlinkException("Failed to start the TaskManagerRunner.", exception);
        }

        try {
            return taskManagerRunner.getTerminationFuture().get().getExitCode();
        } catch (Throwable t) {
            throw new FlinkException(
                    "Unexpected failure during runtime of TaskManagerRunner.",
                    ExceptionUtils.stripExecutionException(t));
        }
    }
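
The second half of runTaskManager blocks on a termination future and unwraps the failure cause. Here is a minimal sketch of that idea using only `java.util.concurrent`. It is not the Flink code: `TerminationFutureSketch` and `awaitExitCode` are hypothetical names, `join()` is used instead of `get()` to avoid checked exceptions, and `IllegalStateException` stands in for FlinkException.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

final class TerminationFutureSketch {
    /** Blocks until the future completes and returns the exit code it carries. */
    public static int awaitExitCode(CompletableFuture<Integer> terminationFuture) {
        try {
            return terminationFuture.join();
        } catch (CompletionException e) {
            // Strip the wrapper so the caller sees the original failure as the cause.
            Throwable cause = e.getCause() != null ? e.getCause() : e;
            throw new IllegalStateException("Unexpected failure during runtime.", cause);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> terminationFuture = new CompletableFuture<>();
        // Some other thread would complete this when the process shuts down.
        terminationFuture.complete(0);
        System.out.println(awaitExitCode(terminationFuture));
    }
}
```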

2.4. Starting the TaskManagerRunner

taskManagerRunner.start(), called from TaskManagerRunner#runTaskManager, eventually triggers TaskExecutor#onStart.

   public void start() throws Exception {
        taskExecutorService.start();
    }

TaskExecutorToServiceAdapter

 @Override
    public void start() {
        taskExecutor.start();
    }

TaskExecutor#onStart

    @Override
    public void onStart() throws Exception {
        try {
            startTaskExecutorServices();
        } catch (Throwable t) {
            final TaskManagerException exception =
                    new TaskManagerException(
                            String.format("Could not start the TaskExecutor %s", getAddress()), t);
            onFatalError(exception);
            throw exception;
        }

        startRegistrationTimeout();
    }

TaskExecutor#onStart then calls TaskExecutor#startTaskExecutorServices.
This is where the TaskExecutor connects to the ResourceManager; slot registration follows from that connection.


    // Start the TaskExecutor services.
    private void startTaskExecutorServices() throws Exception {
        try {
            // start by connecting to the ResourceManager
            // entry point here: StandaloneLeaderRetrievalService#start
            resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());

            // tell the task slot table who's responsible for the task slot actions
            taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());

            // start the job leader service
            jobLeaderService.start(
                    getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());

            fileCache =
                    new FileCache(
                            taskManagerConfiguration.getTmpDirectories(),
                            blobCacheService.getPermanentBlobService());
        } catch (Exception e) {
            handleStartTaskExecutorServicesException(e);
        }
    }


2.5. Registering with the ResourceManager

The call to resourceManagerLeaderRetriever.start in TaskExecutor#startTaskExecutorServices
is what leads to the registration with the ResourceManager.

Call chain:

resourceManagerLeaderRetriever.start
	--> StandaloneLeaderRetrievalService#start
		--> TaskExecutor#notifyLeaderAddress
			--> TaskExecutor#notifyOfNewResourceManagerLeader
				--> TaskExecutor#reconnectToResourceManager
					--> TaskExecutor#tryConnectToResourceManager
						--> TaskExecutor#connectToResourceManager

TaskExecutor#connectToResourceManager


    private void connectToResourceManager() {
        // Sanity checks
        assert (resourceManagerAddress != null);
        assert (establishedResourceManagerConnection == null);
        assert (resourceManagerConnection == null);

        log.info("Connecting to ResourceManager {}.", resourceManagerAddress);

        // Build the TaskExecutorRegistration
        final TaskExecutorRegistration taskExecutorRegistration =
                new TaskExecutorRegistration(
                        getAddress(),
                        getResourceID(),
                        unresolvedTaskManagerLocation.getDataPort(),
                        JMXService.getPort().orElse(-1),
                        hardwareDescription,
                        memoryConfiguration,
                        taskManagerConfiguration.getDefaultSlotResourceProfile(),
                        taskManagerConfiguration.getTotalResourceProfile());

        // Connection to the ResourceManager
        resourceManagerConnection =
                new TaskExecutorToResourceManagerConnection(
                        log,
                        getRpcService(),
                        taskManagerConfiguration.getRetryingRegistrationConfiguration(),
                        resourceManagerAddress.getAddress(),
                        resourceManagerAddress.getResourceManagerId(),
                        getMainThreadExecutor(),
                        new ResourceManagerRegistrationListener(),
                        taskExecutorRegistration);


        // Start connecting and registering
        resourceManagerConnection.start();
    }

RegisteredRpcConnection#start is responsible for establishing the connection.

   // ------------------------------------------------------------------------
    //  Life cycle
    // ------------------------------------------------------------------------

    public void start() {
        checkState(!closed, "The RPC connection is already closed");
        checkState(
                !isConnected() && pendingRegistration == null,
                "The RPC connection is already started");

        // Create the registration; this ends up in generateRegistration
        final RetryingRegistration<F, G, S> newRegistration = createNewRegistration();


        if (REGISTRATION_UPDATER.compareAndSet(this, null, newRegistration)) {
            // Start the registration.
            // The success and failure callbacks were already attached in createNewRegistration.
            newRegistration.startRegistration();
        } else {
            // concurrent start operation
            newRegistration.cancel();
        }
    }
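
The compareAndSet call above makes sure that only one registration is ever started, even if start() races with itself. A minimal sketch of that guard follows, assuming a hypothetical `Registration` class as a stand-in for RetryingRegistration. The checkState preconditions of the real method are deliberately left out so that the losing branch of the compare-and-set can be observed.

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

final class StartOnceSketch {
    /** Hypothetical stand-in for RetryingRegistration. */
    static final class Registration {
        volatile boolean started;
        volatile boolean canceled;

        void startRegistration() {
            started = true;
        }

        void cancel() {
            canceled = true;
        }
    }

    private static final AtomicReferenceFieldUpdater<StartOnceSketch, Registration> UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(
                    StartOnceSketch.class, Registration.class, "pendingRegistration");

    // Must be volatile to be usable with AtomicReferenceFieldUpdater.
    private volatile Registration pendingRegistration;

    /** Returns the registration created by this call; only the CAS winner is started. */
    public Registration start() {
        Registration newRegistration = new Registration();
        if (UPDATER.compareAndSet(this, null, newRegistration)) {
            newRegistration.startRegistration();
        } else {
            // Another start() already installed a registration.
            newRegistration.cancel();
        }
        return newRegistration;
    }

    public static void main(String[] args) {
        StartOnceSketch connection = new StartOnceSketch();
        Registration first = connection.start();
        Registration second = connection.start();
        System.out.println(first.started + " " + second.canceled);
    }
}
```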

RegisteredRpcConnection#createNewRegistration


    // ------------------------------------------------------------------------
    //  Internal methods
    // ------------------------------------------------------------------------

    private RetryingRegistration<F, G, S> createNewRegistration() {

        // Create the registration; here: TaskExecutorToResourceManagerConnection#generateRegistration
        RetryingRegistration<F, G, S> newRegistration = checkNotNull(generateRegistration());

        CompletableFuture<Tuple2<G, S>> future = newRegistration.getFuture();

        future.whenCompleteAsync(
                (Tuple2<G, S> result, Throwable failure) -> {
                    if (failure != null) {
                        if (failure instanceof CancellationException) {
                            // we ignore cancellation exceptions because they originate from
                            // cancelling
                            // the RetryingRegistration
                            log.debug(
                                    "Retrying registration towards {} was cancelled.",
                                    targetAddress);
                        } else {
                            // this future should only ever fail if there is a bug, not if the
                            // registration is declined
                            onRegistrationFailure(failure);
                        }
                    } else {
                        // Registration succeeded
                        targetGateway = result.f0;
                        onRegistrationSuccess(result.f1);
                    }
                },
                executor);

        return newRegistration;
    }
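
The callback attached in createNewRegistration has three branches: cancellation, unexpected failure, and success. The sketch below reproduces only that branching on a plain CompletableFuture. `RegistrationCallbackSketch` and `classify` are hypothetical names, the future is assumed to be already completed, and the synchronous `whenComplete` replaces `whenCompleteAsync` so the result can be returned directly.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;

final class RegistrationCallbackSketch {
    /** Attaches a callback to an already completed future and reports which branch ran. */
    public static String classify(CompletableFuture<String> registrationFuture) {
        String[] outcome = new String[1];
        registrationFuture.whenComplete(
                (gateway, failure) -> {
                    if (failure != null) {
                        if (failure instanceof CancellationException) {
                            // Cancelling the registration is expected and only logged.
                            outcome[0] = "cancelled";
                        } else {
                            outcome[0] = "failed";
                        }
                    } else {
                        outcome[0] = "registered:" + gateway;
                    }
                });
        return outcome[0];
    }

    public static void main(String[] args) {
        CompletableFuture<String> success = CompletableFuture.completedFuture("rm-gateway");
        CompletableFuture<String> cancelled = new CompletableFuture<>();
        cancelled.cancel(false);
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("bug"));
        System.out.println(classify(success) + " " + classify(cancelled) + " " + classify(failed));
    }
}
```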

TaskExecutorToResourceManagerConnection#generateRegistration
Creates the registration

    @Override
    protected RetryingRegistration<
                    ResourceManagerId, ResourceManagerGateway, TaskExecutorRegistrationSuccess>
            generateRegistration() {
        // Build a TaskExecutorToResourceManagerConnection.ResourceManagerRegistration
        return new TaskExecutorToResourceManagerConnection.ResourceManagerRegistration(
                log,
                rpcService,
                getTargetAddress(),
                getTargetLeaderId(),
                retryingRegistrationConfiguration,
                taskExecutorRegistration);
    }

2.6. RetryingRegistration#startRegistration

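startRegistration is called by RegisteredRpcConnection#start to kick off the registration:

  • connect: resolve the target address and establish the connection
  • register: register with the ResourceManager
  • startRegistrationLater: retry after a delay if the connection fails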
   /**
     * This method resolves the target address to a callable gateway and starts the registration
     * after that.
     */
    @SuppressWarnings("unchecked")
    public void startRegistration() {
        if (canceled) {
            // we already got canceled
            return;
        }

        try {
            // trigger resolution of the target address to a callable gateway
            final CompletableFuture<G> rpcGatewayFuture;

            if (FencedRpcGateway.class.isAssignableFrom(targetType)) {
                rpcGatewayFuture =
                        (CompletableFuture<G>)

                                // Establish the connection
                                rpcService.connect(
                                        targetAddress,
                                        fencingToken,
                                        targetType.asSubclass(FencedRpcGateway.class));
            } else {
                rpcGatewayFuture = rpcService.connect(targetAddress, targetType);
            }

            // upon success, start the registration attempts
            CompletableFuture<Void> rpcGatewayAcceptFuture =
                    rpcGatewayFuture.thenAcceptAsync(
                            (G rpcGateway) -> {
                                log.info("Resolved {} address, beginning registration", targetName);

                                // Perform the registration
                                register(
                                        rpcGateway,
                                        1,
                                        retryingRegistrationConfiguration
                                                .getInitialRegistrationTimeoutMillis());
                            },
                            rpcService.getExecutor());

            // upon failure, retry, unless this is cancelled
            rpcGatewayAcceptFuture.whenCompleteAsync(
                    (Void v, Throwable failure) -> {
                        if (failure != null && !canceled) {
                            final Throwable strippedFailure =
                                    ExceptionUtils.stripCompletionException(failure);
                            if (log.isDebugEnabled()) {
                                log.debug(
                                        "Could not resolve {} address {}, retrying in {} ms.",
                                        targetName,
                                        targetAddress,
                                        retryingRegistrationConfiguration.getErrorDelayMillis(),
                                        strippedFailure);
                            } else {
                                log.info(
                                        "Could not resolve {} address {}, retrying in {} ms: {}",
                                        targetName,
                                        targetAddress,
                                        retryingRegistrationConfiguration.getErrorDelayMillis(),
                                        strippedFailure.getMessage());
                            }

                            // Retry the registration after the error delay
                            startRegistrationLater(
                                    retryingRegistrationConfiguration.getErrorDelayMillis());
                        }
                    },
                    rpcService.getExecutor());
        } catch (Throwable t) {
            completionFuture.completeExceptionally(t);
            cancel();
        }
    }
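
The connect-then-retry behaviour of startRegistration can be illustrated with a small, self-contained sketch. It is an approximation under stated assumptions, not the Flink implementation: `RetryingConnectSketch`, `connectWithRetry`, and the `connect` supplier are hypothetical, a ScheduledExecutorService takes the place of the RPC service executor, and the check on the result future plays the role of the `canceled` flag.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryingConnectSketch {
    /** Calls connect until it succeeds, waiting delayMillis between attempts. */
    public static <G> CompletableFuture<G> connectWithRetry(
            Supplier<CompletableFuture<G>> connect,
            long delayMillis,
            ScheduledExecutorService scheduler) {
        CompletableFuture<G> result = new CompletableFuture<>();
        attempt(connect, delayMillis, scheduler, result);
        return result;
    }

    private static <G> void attempt(
            Supplier<CompletableFuture<G>> connect,
            long delayMillis,
            ScheduledExecutorService scheduler,
            CompletableFuture<G> result) {
        if (result.isDone()) {
            return; // cancelled or already completed: stop retrying
        }
        connect.get().whenComplete((gateway, failure) -> {
            if (failure == null) {
                result.complete(gateway);
            } else {
                // upon failure, schedule the next attempt after the delay
                scheduler.schedule(
                        () -> attempt(connect, delayMillis, scheduler, result),
                        delayMillis,
                        TimeUnit.MILLISECONDS);
            }
        });
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        // Hypothetical connect function: fails twice, then resolves a gateway name.
        Supplier<CompletableFuture<String>> connect = () -> {
            CompletableFuture<String> future = new CompletableFuture<>();
            if (attempts.incrementAndGet() < 3) {
                future.completeExceptionally(new RuntimeException("not reachable yet"));
            } else {
                future.complete("resourcemanager-gateway");
            }
            return future;
        };
        String gateway = connectWithRetry(connect, 10L, scheduler).join();
        System.out.println(gateway + " after " + attempts.get() + " attempts");
        scheduler.shutdown();
    }
}
```

Cancelling the returned future stops further attempts, which mirrors how a cancelled RetryingRegistration no longer schedules retries.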

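RetryingRegistration#register calls invokeRegistration for each registration attempt. For the TaskExecutor, the implementation lives in TaskExecutorToResourceManagerConnection:

        // Called by RetryingRegistration#register for each registration attempt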
        @Override
        protected CompletableFuture<RegistrationResponse> invokeRegistration(
                ResourceManagerGateway resourceManager,
                ResourceManagerId fencingToken,
                long timeoutMillis)
                throws Exception {

            Time timeout = Time.milliseconds(timeoutMillis);

            // ResourceManager#registerTaskExecutor
            // Register the TaskExecutor
            return resourceManager.registerTaskExecutor(taskExecutorRegistration, timeout);
        }

The relevant snippet in RetryingRegistration#register that calls invokeRegistration:

            // [Key point] registration goes through invokeRegistration:
            // TaskExecutorToResourceManagerConnection#invokeRegistration
            CompletableFuture<RegistrationResponse> registrationFuture =
                    invokeRegistration(gateway, fencingToken, timeoutMillis);

ResourceManager#registerTaskExecutor
Registers the TaskExecutor with the ResourceManager:

  • connect to the TaskExecutor
  • add it to the cache
  • perform the registration ==> ResourceManager#registerTaskExecutorInternal

    /**
     * Registers a new TaskExecutor.
     *
     * @param taskExecutorRegistration task executor registration parameters
     * @return RegistrationResponse
     */
    private RegistrationResponse registerTaskExecutorInternal(
            TaskExecutorGateway taskExecutorGateway,
            TaskExecutorRegistration taskExecutorRegistration) {


        // Get the ResourceID of the TaskExecutor
        ResourceID taskExecutorResourceId = taskExecutorRegistration.getResourceId();


        // Remove any previous registration from the cache
        WorkerRegistration<WorkerType> oldRegistration =
                taskExecutors.remove(taskExecutorResourceId);



        if (oldRegistration != null) {
            // Clean up the old registration


            // TODO :: suggest old taskExecutor to stop itself
            log.debug(
                    "Replacing old registration of TaskExecutor {}.",
                    taskExecutorResourceId.getStringWithMetadata());

            // remove old task manager registration from slot manager
            slotManager.unregisterTaskManager(
                    oldRegistration.getInstanceID(),
                    new ResourceManagerException(
                            String.format(
                                    "TaskExecutor %s re-connected to the ResourceManager.",
                                    taskExecutorResourceId.getStringWithMetadata())));
        }

        // Resolve the WorkerType for this TaskExecutor
        final WorkerType newWorker = workerStarted(taskExecutorResourceId);

        // Get the address of the TaskExecutor
        String taskExecutorAddress = taskExecutorRegistration.getTaskExecutorAddress();
        if (newWorker == null) {
            // newWorker == null: the framework does not recognize this worker, so decline the registration
            log.warn(
                    "Discard registration from TaskExecutor {} at ({}) because the framework did "
                            + "not recognize it",
                    taskExecutorResourceId.getStringWithMetadata(),
                    taskExecutorAddress);
            return new RegistrationResponse.Decline("unrecognized TaskExecutor");
        } else {

            // Build the WorkerRegistration
            WorkerRegistration<WorkerType> registration =
                    new WorkerRegistration<>(
                            taskExecutorGateway,
                            newWorker,
                            taskExecutorRegistration.getDataPort(),
                            taskExecutorRegistration.getJmxPort(),
                            taskExecutorRegistration.getHardwareDescription(),
                            taskExecutorRegistration.getMemoryConfiguration());

            log.info(
                    "Registering TaskManager with ResourceID {} ({}) at ResourceManager",
                    taskExecutorResourceId.getStringWithMetadata(),
                    taskExecutorAddress);


            // Add to the cache
            taskExecutors.put(taskExecutorResourceId, registration);


            // Monitoring and heartbeats
            taskManagerHeartbeatManager.monitorTarget(
                    taskExecutorResourceId,
                    new HeartbeatTarget<Void>() {
                        @Override
                        public void receiveHeartbeat(ResourceID resourceID, Void payload) {
                            // the ResourceManager will always send heartbeat requests to the
                            // TaskManager
                        }

                        @Override
                        public void requestHeartbeat(ResourceID resourceID, Void payload) {
                            taskExecutorGateway.heartbeatFromResourceManager(resourceID);
                        }
                    });

            // Report the successful registration
            return new TaskExecutorRegistrationSuccess(
                    registration.getInstanceID(), resourceId, clusterInformation);
        }
    }

2.7. Registration succeeded: TaskExecutor#onRegistrationSuccess

ResourceManagerRegistrationListener#onRegistrationSuccess


        @Override
        public void onRegistrationSuccess(
                TaskExecutorToResourceManagerConnection connection,
                TaskExecutorRegistrationSuccess success) {
            final ResourceID resourceManagerId = success.getResourceManagerId();
            final InstanceID taskExecutorRegistrationId = success.getRegistrationId();
            final ClusterInformation clusterInformation = success.getClusterInformation();
            final ResourceManagerGateway resourceManagerGateway = connection.getTargetGateway();

            // Run asynchronously in the main thread
            runAsync(
                    () -> {
                        // filter out outdated connections
                        //noinspection ObjectEquality

                        if (resourceManagerConnection == connection) {
                            try {

                                // Establish the connection
                                establishResourceManagerConnection(
                                        resourceManagerGateway,
                                        resourceManagerId,
                                        taskExecutorRegistrationId,
                                        clusterInformation);



                            } catch (Throwable t) {
                                log.error(
                                        "Establishing Resource Manager connection in Task Executor failed",
                                        t);
                            }
                        }
                    });
        }

TaskExecutor#establishResourceManagerConnection


    private void establishResourceManagerConnection(
            ResourceManagerGateway resourceManagerGateway,
            ResourceID resourceManagerResourceId,
            InstanceID taskExecutorRegistrationId,
            ClusterInformation clusterInformation) {

        // ResourceManager#sendSlotReport
        // Send the slot report
        final CompletableFuture<Acknowledge> slotReportResponseFuture =
                resourceManagerGateway.sendSlotReport(
                        getResourceID(),
                        taskExecutorRegistrationId,
                        taskSlotTable.createSlotReport(getResourceID()),
                        taskManagerConfiguration.getTimeout());

        slotReportResponseFuture.whenCompleteAsync(
                (acknowledge, throwable) -> {
                    if (throwable != null) {

                        // Sending the slot report failed: reconnect to the ResourceManager
                        reconnectToResourceManager(
                                new TaskManagerException(
                                        "Failed to send initial slot report to ResourceManager.",
                                        throwable));
                    }
                },
                getMainThreadExecutor());

        // monitor the resource manager as heartbeat target
        resourceManagerHeartbeatManager.monitorTarget(
                resourceManagerResourceId,
                new HeartbeatTarget<TaskExecutorHeartbeatPayload>() {
                    @Override
                    public void receiveHeartbeat(
                            ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) {
                        resourceManagerGateway.heartbeatFromTaskManager(
                                resourceID, heartbeatPayload);
                    }

                    @Override
                    public void requestHeartbeat(
                            ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) {
                        // the TaskManager won't send heartbeat requests to the ResourceManager
                    }
                });

        // set the propagated blob server address
        final InetSocketAddress blobServerAddress =
                new InetSocketAddress(
                        clusterInformation.getBlobServerHostname(),
                        clusterInformation.getBlobServerPort());

        blobCacheService.setBlobServerAddress(blobServerAddress);


        // Record the established ResourceManager connection
        establishedResourceManagerConnection =
                new EstablishedResourceManagerConnection(
                        resourceManagerGateway,
                        resourceManagerResourceId,
                        taskExecutorRegistrationId);

        // Stop the registration timeout
        stopRegistrationTimeout();
    }

ResourceManager#sendSlotReport
Registers the slots with the SlotManager


    @Override
    public CompletableFuture<Acknowledge> sendSlotReport(
            ResourceID taskManagerResourceId,
            InstanceID taskManagerRegistrationId,
            SlotReport slotReport,
            Time timeout) {


        // Look up the worker's registration
        final WorkerRegistration<WorkerType> workerTypeWorkerRegistration = taskExecutors.get(taskManagerResourceId);

        if (workerTypeWorkerRegistration.getInstanceID().equals(taskManagerRegistrationId)) {

            // Register the slot information with the SlotManager
            // SlotManagerImpl#registerTaskManager
            if (slotManager.registerTaskManager(workerTypeWorkerRegistration, slotReport)) {

                // Follow-up once the worker is registered
                onWorkerRegistered(workerTypeWorkerRegistration.getWorker());
            }


            return CompletableFuture.completedFuture(Acknowledge.get());


        } else {
            return FutureUtils.completedExceptionally(
                    new ResourceManagerException(
                            String.format(
                                    "Unknown TaskManager registration id %s.",
                                    taskManagerRegistrationId)));
        }
    }
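
Taken together, registerTaskExecutorInternal and sendSlotReport implement a simple rule: every (re-)registration gets a fresh instance id, and a slot report is only accepted if it carries the id of the current registration. The sketch below models just that rule. All names (`WorkerRegistrySketch`, `recognize`, `register`, `acceptsSlotReport`) are hypothetical, integer ids stand in for InstanceID, and an unknown resource id is simply rejected here, which is a simplification compared with the code above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class WorkerRegistrySketch {
    // Workers the framework has started (hypothetical stand-in for workerStarted()).
    private final Set<String> recognizedWorkers = new HashSet<>();
    // resourceId -> instance id of the current registration
    private final Map<String, Integer> taskExecutors = new HashMap<>();
    private int nextInstanceId = 0;

    public void recognize(String resourceId) {
        recognizedWorkers.add(resourceId);
    }

    /** Returns the new instance id, or empty if the registration is declined. */
    public Optional<Integer> register(String resourceId) {
        // A re-registration always discards the previous entry first.
        taskExecutors.remove(resourceId);
        if (!recognizedWorkers.contains(resourceId)) {
            return Optional.empty(); // declined: unrecognized TaskExecutor
        }
        int instanceId = nextInstanceId++;
        taskExecutors.put(resourceId, instanceId);
        return Optional.of(instanceId);
    }

    /** A slot report is accepted only for the current registration. */
    public boolean acceptsSlotReport(String resourceId, int instanceId) {
        Integer current = taskExecutors.get(resourceId);
        return current != null && current == instanceId;
    }

    public static void main(String[] args) {
        WorkerRegistrySketch rm = new WorkerRegistrySketch();
        rm.recognize("container_1");
        int first = rm.register("container_1").get();
        int second = rm.register("container_1").get(); // the TaskExecutor re-connected
        System.out.println(rm.acceptsSlotReport("container_1", first));
        System.out.println(rm.acceptsSlotReport("container_1", second));
        System.out.println(rm.register("unknown").isPresent());
    }
}
```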

2.8. SlotManagerImpl#registerTaskManager

Registers the slot information with the SlotManager.


    /**
     * Registers a new task manager at the slot manager.
     * This will make the task managers slots known and, thus, available for allocation.
     *
     * @param taskExecutorConnection for the new task manager
     * @param initialSlotReport for the new task manager
     * @return True if the task manager has not been registered before and is registered
     *     successfully; otherwise false
     */
    @Override
    public boolean registerTaskManager(
            final TaskExecutorConnection taskExecutorConnection, SlotReport initialSlotReport) {

        // Make sure the slot manager has been started
        checkInit();

        LOG.debug(
                "Registering TaskManager {} under {} at the SlotManager.",
                taskExecutorConnection.getResourceID().getStringWithMetadata(),
                taskExecutorConnection.getInstanceID());

        // we identify task managers by their instance id

        if (taskManagerRegistrations.containsKey(taskExecutorConnection.getInstanceID())) {

            // Already registered before: just report the slot status
            reportSlotStatus(taskExecutorConnection.getInstanceID(), initialSlotReport);
            return false;
        } else {

            if (isMaxSlotNumExceededAfterRegistration(initialSlotReport)) {
                // The maximum number of slots would be exceeded
                LOG.info(
                        "The total number of slots exceeds the max limitation {}, release the excess resource.",
                        maxSlotNum);
                resourceActions.releaseResource(
                        taskExecutorConnection.getInstanceID(),
                        new FlinkException(
                                "The total number of slots exceeds the max limitation."));
                return false;
            }

            // first register the TaskManager
            ArrayList<SlotID> reportedSlots = new ArrayList<>();

            for (SlotStatus slotStatus : initialSlotReport) {
                reportedSlots.add(slotStatus.getSlotID());
            }

            TaskManagerRegistration taskManagerRegistration =
                    new TaskManagerRegistration(taskExecutorConnection, reportedSlots);

            taskManagerRegistrations.put(
                    taskExecutorConnection.getInstanceID(), taskManagerRegistration);


            // next register the new slots
            for (SlotStatus slotStatus : initialSlotReport) {

                // Register each slot
                registerSlot(
                        slotStatus.getSlotID(),
                        slotStatus.getAllocationID(),
                        slotStatus.getJobID(),
                        slotStatus.getResourceProfile(),
                        taskExecutorConnection);
            }

            return true;
        }
    }
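
The return value of registerTaskManager is worth spelling out: it is true only for a first-time registration that stays within the slot limit. A minimal sketch of that decision follows, using strings for instance and slot ids. `SlotRegistrationSketch` and its members are hypothetical, and the re-report of slot status and the release of the excess resource are reduced to comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SlotRegistrationSketch {
    private final int maxSlotNum;
    // instanceId -> slot ids reported by that TaskManager
    private final Map<String, List<String>> taskManagerRegistrations = new HashMap<>();
    private int registeredSlots = 0;

    public SlotRegistrationSketch(int maxSlotNum) {
        this.maxSlotNum = maxSlotNum;
    }

    /** Returns true only if the TaskManager was not known before and fits the limit. */
    public boolean registerTaskManager(String instanceId, List<String> reportedSlots) {
        if (taskManagerRegistrations.containsKey(instanceId)) {
            // Known instance: the real code only refreshes the slot status here.
            return false;
        }
        if (registeredSlots + reportedSlots.size() > maxSlotNum) {
            // Too many slots: the real code releases the excess resource here.
            return false;
        }
        taskManagerRegistrations.put(instanceId, new ArrayList<>(reportedSlots));
        registeredSlots += reportedSlots.size();
        return true;
    }

    public int getRegisteredSlots() {
        return registeredSlots;
    }

    public static void main(String[] args) {
        SlotRegistrationSketch slotManager = new SlotRegistrationSketch(3);
        System.out.println(slotManager.registerTaskManager("tm-1", Arrays.asList("s0", "s1")));
        System.out.println(slotManager.registerTaskManager("tm-1", Arrays.asList("s0", "s1")));
        System.out.println(slotManager.registerTaskManager("tm-2", Arrays.asList("s0", "s1")));
        System.out.println(slotManager.getRegisteredSlots());
    }
}
```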

SlotManagerImpl#registerSlot


    /**
     * Registers a slot for the given task manager at the slot manager.
     * The slot is identified by the given slot id.
     *
     * The given resource profile defines the available resources for the slot.
     *
     * The task manager connection can be used to communicate with the task manager.
     *
     * @param slotId identifying the slot on the task manager
     * @param allocationId which is currently deployed in the slot
     * @param resourceProfile of the slot
     * @param taskManagerConnection to communicate with the remote task manager
     */
    private void registerSlot(
            SlotID slotId,
            AllocationID allocationId,
            JobID jobId,
            ResourceProfile resourceProfile,
            TaskExecutorConnection taskManagerConnection) {

        // Remove the slot from the cache if it is already registered
        if (slots.containsKey(slotId)) {
            // remove the old slot first
            removeSlot(
                    slotId,
                    new SlotManagerException(
                            String.format(
                                    "Re-registration of slot %s. This indicates that the TaskExecutor has re-connected.",
                                    slotId)));
        }


        // Build the slot
        final TaskManagerSlot slot =
                createAndRegisterTaskManagerSlot(slotId, resourceProfile, taskManagerConnection);


        final PendingTaskManagerSlot pendingTaskManagerSlot;

        // Look for a pending TaskManager slot that matches this profile
        if (allocationId == null) {
            pendingTaskManagerSlot = findExactlyMatchingPendingTaskManagerSlot(resourceProfile);
        } else {
            pendingTaskManagerSlot = null;
        }

        // No matching pending slot: just update the slot state
        if (pendingTaskManagerSlot == null) {
            updateSlot(slotId, allocationId, jobId);
        } else {
            // Remove the matched pending slot
            pendingSlots.remove(pendingTaskManagerSlot.getTaskManagerSlotId());

            // Get the slot request assigned to the pending slot, if any
            final PendingSlotRequest assignedPendingSlotRequest =
                    pendingTaskManagerSlot.getAssignedPendingSlotRequest();

            if (assignedPendingSlotRequest == null) {
                // No request is waiting for it: treat the slot as free
                handleFreeSlot(slot);
            } else {
                // A request is waiting: detach it from the pending slot and allocate this slot to it
                assignedPendingSlotRequest.unassignPendingTaskManagerSlot();
                allocateSlot(slot, assignedPendingSlotRequest);
            }
        }
    }
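
The branching in registerSlot has three outcomes: the slot state is simply updated, the slot is handled as free, or the slot is allocated to a waiting request. The sketch below captures only that decision. It assumes a heavily simplified model in which at most one pending slot exists per resource profile and everything is identified by strings; `PendingSlotMatchSketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class PendingSlotMatchSketch {
    // resource profile -> request assigned to the pending slot of that profile;
    // a key mapped to null models a pending slot with no request assigned.
    private final Map<String, String> pendingSlots = new HashMap<>();

    public void addPendingSlot(String profile, String assignedRequestOrNull) {
        pendingSlots.put(profile, assignedRequestOrNull);
    }

    /** Returns a description of the branch taken for the newly registered slot. */
    public String registerSlot(String slotId, String allocationId, String profile) {
        // Only a slot that is not already allocated can fulfil a pending slot.
        boolean matched = allocationId == null && pendingSlots.containsKey(profile);
        if (!matched) {
            return "updated " + slotId;
        }
        String request = pendingSlots.remove(profile);
        if (request == null) {
            return "free " + slotId;
        }
        return "allocated " + slotId + " to " + request;
    }

    public static void main(String[] args) {
        PendingSlotMatchSketch slotManager = new PendingSlotMatchSketch();
        slotManager.addPendingSlot("1cpu-1gb", "request-42");
        System.out.println(slotManager.registerSlot("tm-1_0", null, "1cpu-1gb"));
        System.out.println(slotManager.registerSlot("tm-1_1", null, "1cpu-1gb"));
    }
}
```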

2.9. SlotManagerImpl#allocateSlot: allocating the slot

SlotManagerImpl#allocateSlot


    /**
     * Allocates the given slot for the given slot request. This entails sending a registration
     * message to the task manager and treating failures.
     *
     * @param taskManagerSlot to allocate for the given slot request
     * @param pendingSlotRequest to allocate the given slot for
     */
    private void allocateSlot(
            TaskManagerSlot taskManagerSlot, PendingSlotRequest pendingSlotRequest) {
        // Check that the TaskManager slot is free
        Preconditions.checkState(taskManagerSlot.getState() == SlotState.FREE);

        // Get the connection to the TaskManager
        TaskExecutorConnection taskExecutorConnection = taskManagerSlot.getTaskManagerConnection();

        // Get the TaskManager gateway
        TaskExecutorGateway gateway = taskExecutorConnection.getTaskExecutorGateway();


        // Future that is completed with the TaskManager's answer to the slot request
        final CompletableFuture<Acknowledge> completableFuture = new CompletableFuture<>();

        // Get the allocation ID
        final AllocationID allocationId = pendingSlotRequest.getAllocationId();

        // Get the slot ID
        final SlotID slotId = taskManagerSlot.getSlotId();

        // Get the instance ID of the TaskManager
        final InstanceID instanceID = taskManagerSlot.getInstanceId();


        // Assign the request to the slot, which moves the slot to the PENDING state
        taskManagerSlot.assignPendingSlotRequest(pendingSlotRequest);

        // Attach the future to the pending request
        pendingSlotRequest.setRequestFuture(completableFuture);

        // Return the pending TaskManager slot if one was assigned to this request
        returnPendingTaskManagerSlotIfAssigned(pendingSlotRequest);

        // Look up the TaskManagerRegistration of this instance
        TaskManagerRegistration taskManagerRegistration = taskManagerRegistrations.get(instanceID);

        if (taskManagerRegistration == null) {
            throw new IllegalStateException(
                    "Could not find a registered task manager for instance id " + instanceID + '.');
        }

        // Mark the TaskManager as in use
        taskManagerRegistration.markUsed();

        // RPC call asking the TaskManager to allocate the slot for the JobManager
        // (handled by TaskExecutor#requestSlot)
        CompletableFuture<Acknowledge> requestFuture =
                gateway.requestSlot(
                        slotId,
                        pendingSlotRequest.getJobId(),
                        allocationId,
                        pendingSlotRequest.getResourceProfile(),
                        pendingSlotRequest.getTargetAddress(),
                        resourceManagerId,
                        taskManagerRequestTimeout);


        requestFuture.whenComplete(
                (Acknowledge acknowledge, Throwable throwable) -> {
                    if (acknowledge != null) {
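                        // The TaskManager acknowledged the request
                        completableFuture.complete(acknowledge);
                    } else {
                        // The RPC failed: propagate the exception
                        completableFuture.completeExceptionally(throwable);
                    }
                });

        // Handle the result asynchronously in the main thread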
        completableFuture.whenCompleteAsync(
                (Acknowledge acknowledge, Throwable throwable) -> {
                    try {
                        if (acknowledge != null) {
                            // Update the slot
                            updateSlot(slotId, allocationId, pendingSlotRequest.getJobId());
                        } else {
                            // Handle the failure
                            if (throwable instanceof SlotOccupiedException) {
                                SlotOccupiedException exception = (SlotOccupiedException) throwable;
                                updateSlot(
                                        slotId, exception.getAllocationId(), exception.getJobId());
                            } else {
                                removeSlotRequestFromSlot(slotId, allocationId);
                            }

                            if (!(throwable instanceof CancellationException)) {
                                handleFailedSlotRequest(slotId, allocationId, throwable);
                            } else {
                                LOG.debug(
                                        "Slot allocation request {} has been cancelled.",
                                        allocationId,
                                        throwable);
                            }
                        }
                    } catch (Exception e) {
                        LOG.error("Error while completing the slot allocation.", e);
                    }
                },
                mainThreadExecutor);
    }
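Note that the RPC future is not handed to the pending request directly. It completes a second future that the request holds, and the result handler is attached to that second future. One plausible reason (an assumption here, not something the code above states) is that the request can then be cancelled without touching the RPC future. The sketch below reproduces the pattern with plain `java.util.concurrent` types; `AllocationCallbackSketch`, the `String` acknowledgement, and the recorded event names are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical sketch of the two-stage future pattern; not Flink code. */
final class AllocationCallbackSketch {

    /** Records the handler's decisions; stands in for updateSlot and handleFailedSlotRequest. */
    final List<String> events = new ArrayList<>();

    CompletableFuture<String> allocate(CompletableFuture<String> rpcFuture, Executor mainThread) {
        // Stage 1: a future owned by the request, fed by the RPC result.
        CompletableFuture<String> requestFuture = new CompletableFuture<>();
        rpcFuture.whenComplete(
                (ack, failure) -> {
                    if (ack != null) {
                        requestFuture.complete(ack);
                    } else {
                        requestFuture.completeExceptionally(failure);
                    }
                });

        // Stage 2: react to the outcome on the given executor.
        requestFuture.whenCompleteAsync(
                (ack, failure) -> {
                    if (ack != null) {
                        events.add("updateSlot");
                    } else if (failure instanceof CancellationException) {
                        events.add("cancelled");
                    } else {
                        events.add("failed: " + failure.getMessage());
                    }
                },
                mainThread);
        return requestFuture;
    }

    public static void main(String[] args) {
        AllocationCallbackSketch sketch = new AllocationCallbackSketch();
        CompletableFuture<String> rpc = new CompletableFuture<>();
        // Runnable::run executes the handler in the calling thread, which keeps the demo simple.
        sketch.allocate(rpc, Runnable::run);
        rpc.complete("ack");
        System.out.println(sketch.events);
    }
}
```

Passing the main thread executor to `whenCompleteAsync` is what keeps the slot bookkeeping single-threaded, even though the RPC answer arrives on another thread.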

2.10. TaskManager allocates the slot

The SlotManager sends an RPC request to the TaskManager, asking it to allocate a slot for the JobManager.

TaskExecutor#requestSlot

   // ----------------------------------------------------------------------
    // Slot allocation RPCs
    //
    // ----------------------------------------------------------------------

    @Override
    public CompletableFuture<Acknowledge> requestSlot(
            final SlotID slotId,
            final JobID jobId,
            final AllocationID allocationId,
            final ResourceProfile resourceProfile,
            final String targetAddress,
            final ResourceManagerId resourceManagerId,
            final Time timeout) {
        // TODO: Filter invalid requests from the resource manager by using the
        // instance/registration Id

        // Log the incoming request
        log.info(
                "Receive slot request {} for job {} from resource manager with leader id {}.",
                allocationId,
                jobId,
                resourceManagerId);


        // Check whether we are connected to this ResourceManager
        if (!isConnectedToResourceManager(resourceManagerId)) {
            final String message =
                    String.format(
                            "TaskManager is not connected to the resource manager %s.",
                            resourceManagerId);
            log.debug(message);
            return FutureUtils.completedExceptionally(new TaskManagerException(message));
        }

        try {
            // [Key step] Allocate the slot
            allocateSlot(slotId, jobId, allocationId, resourceProfile);
        } catch (SlotAllocationException sae) {
            return FutureUtils.completedExceptionally(sae);
        }

        final JobTable.Job job;

        try {
            // Get or create the JobTable.Job
            job =
                    jobTable.getOrCreateJob(
                            jobId, () -> registerNewJobAndCreateServices(jobId, targetAddress));
        } catch (Exception e) {
            // free the allocated slot
            try {
                taskSlotTable.freeSlot(allocationId);
            } catch (SlotNotFoundException slotNotFoundException) {
                // slot no longer existent, this should actually never happen, because we've
                // just allocated the slot. So let's fail hard in this case!
                onFatalError(slotNotFoundException);
            }

            // release local state under the allocation id.
            localStateStoresManager.releaseLocalStateForAllocationId(allocationId);

            // sanity check
            if (!taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
                onFatalError(new Exception("Could not free slot " + slotId));
            }

            return FutureUtils.completedExceptionally(
                    new SlotAllocationException("Could not create new job.", e));
        }

        if (job.isConnected()) {
            // [Important] Offer the slot to the JobManager
            offerSlotsToJobManager(jobId);
        }

        return CompletableFuture.completedFuture(Acknowledge.get());
    }

TaskExecutor#allocateSlot


    private void allocateSlot(
            SlotID slotId, JobID jobId, AllocationID allocationId, ResourceProfile resourceProfile)
            throws SlotAllocationException {


        if (taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
            // Perform the allocation (TaskSlotTableImpl#allocateSlot)
            if (taskSlotTable.allocateSlot(
                    slotId.getSlotNumber(),
                    jobId,
                    allocationId,
                    resourceProfile,
                    taskManagerConfiguration.getTimeout())) {
                log.info("Allocated slot for {}.", allocationId);
            } else {
                log.info("Could not allocate slot for {}.", allocationId);
                throw new SlotAllocationException("Could not allocate slot.");
            }
        } else if (!taskSlotTable.isAllocated(slotId.getSlotNumber(), jobId, allocationId)) {
            final String message =
                    "The slot " + slotId + " has already been allocated for a different job.";

            log.info(message);

            final AllocationID allocationID =
                    taskSlotTable.getCurrentAllocation(slotId.getSlotNumber());
            throw new SlotOccupiedException(
                    message, allocationID, taskSlotTable.getOwningJob(allocationID));
        }
    }
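TaskExecutor#allocateSlot distinguishes three cases: the slot is free, the slot is already held by the same job and allocation (a repeated request, which is silently accepted), or the slot belongs to someone else. A minimal sketch of that three-way check follows. `SlotTableSketch`, its nested exception, and the string IDs are hypothetical stand-ins, and the boolean return value is added only to make the repeated-request case visible.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the three cases in TaskExecutor#allocateSlot; not Flink code. */
final class SlotTableSketch {

    /** Stand-in for SlotOccupiedException: carries the allocation that owns the slot. */
    static final class SlotOccupiedException extends Exception {
        final String ownerAllocationId;

        SlotOccupiedException(String message, String ownerAllocationId) {
            super(message);
            this.ownerAllocationId = ownerAllocationId;
        }
    }

    /** Owner of a slot: the job and the allocation it was reserved under. */
    private static final class Owner {
        final String jobId;
        final String allocationId;

        Owner(String jobId, String allocationId) {
            this.jobId = jobId;
            this.allocationId = allocationId;
        }
    }

    private final Map<Integer, Owner> owners = new HashMap<>();

    /**
     * Returns true if the slot was newly allocated, false if the same job and
     * allocation already hold it (repeated request, nothing to do).
     */
    boolean allocate(int slotIndex, String jobId, String allocationId)
            throws SlotOccupiedException {
        Owner owner = owners.get(slotIndex);
        if (owner == null) {
            // Case 1: the slot is free.
            owners.put(slotIndex, new Owner(jobId, allocationId));
            return true;
        }
        if (owner.jobId.equals(jobId) && owner.allocationId.equals(allocationId)) {
            // Case 2: repeated request for the same allocation.
            return false;
        }
        // Case 3: the slot is held under a different job or allocation.
        throw new SlotOccupiedException(
                "Slot " + slotIndex + " is already allocated.", owner.allocationId);
    }

    public static void main(String[] args) {
        SlotTableSketch table = new SlotTableSketch();
        try {
            System.out.println(table.allocate(0, "job-A", "alloc-1"));
            System.out.println(table.allocate(0, "job-A", "alloc-1"));
            table.allocate(0, "job-B", "alloc-2");
        } catch (SlotOccupiedException e) {
            System.out.println("occupied by " + e.ownerAllocationId);
        }
    }
}
```

Reporting the current owner in the exception matters on the other side: the SlotManager callback in section 2.9 uses exactly that information to correct its own view of the slot.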

2.11. TaskExecutor offers slots to the JobManager

TaskExecutor#offerSlotsToJobManager

The slots are offered to the JobManager in TaskExecutor#internalOfferSlotsToJobManager:


    private void internalOfferSlotsToJobManager(JobTable.Connection jobManagerConnection) {
        // Get the job ID
        final JobID jobId = jobManagerConnection.getJobId();

        // Check whether any slots are allocated for this job
        if (taskSlotTable.hasAllocatedSlots(jobId)) {

            log.info("Offer reserved slots to the leader of job {}.", jobId);

            // Get the JobMaster gateway
            final JobMasterGateway jobMasterGateway = jobManagerConnection.getJobManagerGateway();

            // Get all TaskSlots allocated for this job
            final Iterator<TaskSlot<Task>> reservedSlotsIterator =
                    taskSlotTable.getAllocatedSlots(jobId);

            // Get the JobMasterId
            final JobMasterId jobMasterId = jobManagerConnection.getJobMasterId();

            // Slot offers for the reserved slots
            final Collection<SlotOffer> reservedSlots = new HashSet<>(2);

            while (reservedSlotsIterator.hasNext()) {
                SlotOffer offer = reservedSlotsIterator.next().generateSlotOffer();
                reservedSlots.add(offer);
            }

            // Offer the slots via RPC (handled by JobMaster#offerSlots)
            CompletableFuture<Collection<SlotOffer>> acceptedSlotsFuture =
                    jobMasterGateway.offerSlots(
                            getResourceID(), reservedSlots, taskManagerConfiguration.getTimeout());

            // Handle the accepted offers asynchronously in the main thread
            acceptedSlotsFuture.whenCompleteAsync(
                    handleAcceptedSlotOffers(jobId, jobMasterGateway, jobMasterId, reservedSlots),
                    getMainThreadExecutor());
        } else {
            log.debug("There are no unassigned slots for the job {}.", jobId);
        }
    }

2.12. JobMaster receives the offered slots

JobMaster#offerSlots passes the offers on to the SlotPool.


    @Override
    public CompletableFuture<Collection<SlotOffer>> offerSlots(
            final ResourceID taskManagerId, final Collection<SlotOffer> slots, final Time timeout) {

        Tuple2<TaskManagerLocation, TaskExecutorGateway> taskManager =
                registeredTaskManagers.get(taskManagerId);

        if (taskManager == null) {
            return FutureUtils.completedExceptionally(
                    new Exception("Unknown TaskManager " + taskManagerId));
        }

        final TaskManagerLocation taskManagerLocation = taskManager.f0;
        final TaskExecutorGateway taskExecutorGateway = taskManager.f1;

        final RpcTaskManagerGateway rpcTaskManagerGateway =
                new RpcTaskManagerGateway(taskExecutorGateway, getFencingToken());

        return CompletableFuture.completedFuture(
                // [Important] Hand the offered slots to the slot pool
                slotPool.offerSlots(taskManagerLocation, rpcTaskManagerGateway, slots));
    }


SlotPoolImpl#offerSlots

    @Override
    public Collection<SlotOffer> offerSlots(
            TaskManagerLocation taskManagerLocation,
            TaskManagerGateway taskManagerGateway,
            Collection<SlotOffer> offers) {

        ArrayList<SlotOffer> result = new ArrayList<>(offers.size());

        for (SlotOffer offer : offers) {
            // Offer a single slot; keep it only if the pool accepts it
            if (offerSlot(taskManagerLocation, taskManagerGateway, offer)) {
                result.add(offer);
            }
        }

        return result;
    }
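The important point in SlotPoolImpl#offerSlots is that the return value is the accepted subset of the offers, not a simple success flag, so the TaskExecutor can tell which slots the JobMaster actually took. The sketch below illustrates that contract. `SlotPoolSketch` and its string IDs are hypothetical, and the acceptance rule used here (an allocation ID is accepted unless a different TaskManager already offered it) is an assumption for illustration, since the body of `offerSlot` is not shown in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "offer many, return the accepted subset"; not Flink code. */
final class SlotPoolSketch {

    /** Allocation ID mapped to the TaskManager that first offered it. */
    private final Map<String, String> slotOwners = new HashMap<>();

    /** Assumed acceptance rule: reject an allocation already offered by another TaskManager. */
    private boolean offerSlot(String taskManagerId, String allocationId) {
        String existingOwner = slotOwners.putIfAbsent(allocationId, taskManagerId);
        return existingOwner == null || existingOwner.equals(taskManagerId);
    }

    /** Returns only the offers that were accepted, in the order they were offered. */
    List<String> offerSlots(String taskManagerId, Collection<String> allocationIds) {
        List<String> accepted = new ArrayList<>(allocationIds.size());
        for (String allocationId : allocationIds) {
            if (offerSlot(taskManagerId, allocationId)) {
                accepted.add(allocationId);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        SlotPoolSketch pool = new SlotPoolSketch();
        System.out.println(pool.offerSlots("tm-1", Arrays.asList("alloc-1", "alloc-2")));
        System.out.println(pool.offerSlots("tm-2", Arrays.asList("alloc-2", "alloc-3")));
    }
}
```

Under this rule a repeated offer from the same TaskManager is accepted again, which keeps the call safe to retry.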
