Adaptive Scheduler

不甚了然

Last modified: 2022-04-17 10:49:37

First published: 2022-04-11 17:37:52

Article link: https://blog.csdn.net/blackjjcat/article/details/124104404

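1. Introduction

The scheduler is configured at the service level, but in the source code it is bound to the JobMaster: the scheduler is created when the JobMaster is created, and the scheduler in turn creates the DeclarativeSlotPool, so each job has its own instance.

2. Tests

SA test results (with both parallelism and max parallelism set): 1. When resources are sufficient, the job runs at the configured parallelism, not the max parallelism; 2. When resources are insufficient, the parallelism is reduced; 3. The job's parallelism never exceeds the configured parallelism, so the max parallelism setting has no effect.
Yarn-session: 1. The parallelism can also be adjusted; 2. When a TaskManager fails, starting a new TaskManager takes priority; 3. When resources are insufficient, the parallelism is reduced; 4. Once resources are replenished, the parallelism is restored.
Yarn-session with stabilization-timeout set: when resources are insufficient, the INITIALIZING phase lasts longer, but the slot resources held by the JobManager are not released.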

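3. Enabling via configuration

jobmanager.scheduler: adaptive: replaces the default scheduler with the Adaptive scheduler
cluster.declarative-resource-management.enabled: declarative resource management must be enabled (it is enabled by default)
Reactive mode (elastic scaling, set through the scheduler-mode option) also turns on the Adaptive scheduler
The scheduler type is determined as follows; both adaptive and reactive select the Adaptive scheduler: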

public static JobManagerOptions.SchedulerType getSchedulerType(Configuration configuration) {
    if (isAdaptiveSchedulerEnabled(configuration) || isReactiveModeEnabled(configuration)) {
        return JobManagerOptions.SchedulerType.Adaptive;
    } else {
        return configuration.get(JobManagerOptions.SCHEDULER);
    }
}

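In addition, the Adaptive scheduler does not support batch mode, so for a batch job the scheduler is automatically switched back to the default scheduler: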

if (schedulerType == JobManagerOptions.SchedulerType.Adaptive && jobType == JobType.BATCH) {
    LOG.info(
            "Adaptive Scheduler configured, but Batch job detected. Changing scheduler type to NG / DefaultScheduler.");
    // overwrite
    schedulerType = JobManagerOptions.SchedulerType.Ng;
}
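
The two checks above can be read as one decision function. The following is a minimal, self-contained sketch of that decision; the class `SchedulerSelectionSketch`, its enums, and the `resolve` method are hypothetical stand-ins written for illustration and are not Flink's own types. It also assumes that a non-adaptive configuration resolves to the default (Ng) scheduler.

```java
// Hypothetical stand-ins for Flink's configuration types; illustration only.
final class SchedulerSelectionSketch {
    enum SchedulerType { Ng, Adaptive }
    enum JobType { STREAMING, BATCH }

    /**
     * Adaptive or reactive mode selects the Adaptive scheduler,
     * but a batch job falls back to the default (Ng) scheduler.
     */
    static SchedulerType resolve(boolean adaptiveConfigured, boolean reactiveMode, JobType jobType) {
        SchedulerType type =
                (adaptiveConfigured || reactiveMode) ? SchedulerType.Adaptive : SchedulerType.Ng;
        if (type == SchedulerType.Adaptive && jobType == JobType.BATCH) {
            type = SchedulerType.Ng; // batch is not supported by the Adaptive scheduler
        }
        return type;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, false, JobType.STREAMING));
        System.out.println(resolve(false, true, JobType.STREAMING));
        System.out.println(resolve(true, false, JobType.BATCH));
    }
}
```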

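4. Other configuration parameters

4.1. Main options

1. jobmanager.adaptive-scheduler.min-parallelism-increase
The minimum increase in parallelism required for a scale-up. Default: 1
2. jobmanager.adaptive-scheduler.resource-stabilization-timeout
Defines how long the JobManager waits when fewer resources than desired, but still enough to run the job, are available. The timeout starts once enough resources to run the job are available. When it expires, the job starts executing with the resources available at that point. If scheduler-mode is set to REACTIVE, this option defaults to 0 so that the job starts immediately with the available resources. Default: 10 s
This only matters when resources fall short of the desired amount; with sufficient resources the job acquires them and starts right away.
3. jobmanager.adaptive-scheduler.resource-wait-timeout
The maximum time the JobManager waits to acquire all required resources after a job is submitted or restarted. Once it elapses, the JobManager tries to run the job at a lower parallelism, or fails the job if the minimum amount of resources cannot be acquired. Increasing this value makes the cluster more resilient to temporary resource shortages (for example, it leaves more time to restart a failed TaskManager). A negative duration disables the timeout: the JobManager then waits indefinitely for resources to appear. If scheduler-mode is set to REACTIVE, this option defaults to a negative value, which disables the timeout. Default: 5 min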
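
How the two timeouts interact is easier to see as a decision function. The sketch below is a simplified model based only on the option descriptions above; `ResourceTimeoutSketch`, `Decision`, and all parameter names are hypothetical, and the real scheduler drives this logic through timers and state transitions rather than a single method.

```java
import java.time.Duration;

// Hypothetical, simplified model of the two timeouts; not Flink's actual implementation.
final class ResourceTimeoutSketch {
    enum Decision { START_WITH_DESIRED, START_WITH_AVAILABLE, KEEP_WAITING, FAIL }

    static Decision decide(
            int desiredSlots,
            int minimumSlots,
            int availableSlots,
            Duration sinceSubmission, // time since the job was submitted or restarted
            Duration sinceSufficient, // time since the minimum became available
            Duration resourceWaitTimeout,
            Duration stabilizationTimeout) {
        if (availableSlots >= desiredSlots) {
            return Decision.START_WITH_DESIRED; // enough resources: no waiting at all
        }
        boolean sufficient = availableSlots >= minimumSlots;
        if (sufficient && sinceSufficient.compareTo(stabilizationTimeout) >= 0) {
            return Decision.START_WITH_AVAILABLE; // stabilization timeout has expired
        }
        boolean waitTimeoutEnabled = !resourceWaitTimeout.isNegative();
        if (waitTimeoutEnabled && sinceSubmission.compareTo(resourceWaitTimeout) >= 0) {
            // wait timeout expired: run with what is there, or fail below the minimum
            return sufficient ? Decision.START_WITH_AVAILABLE : Decision.FAIL;
        }
        return Decision.KEEP_WAITING;
    }

    public static void main(String[] args) {
        // Reactive-style defaults: wait timeout disabled, stabilization timeout zero.
        System.out.println(decide(4, 1, 2, Duration.ZERO, Duration.ZERO,
                Duration.ofMillis(-1), Duration.ZERO));
    }
}
```

With the reactive-style defaults used in `main`, a job that has at least the minimum starts immediately with the available slots, and a job below the minimum keeps waiting indefinitely.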

4.2. Other possibly related options

1. slot.request.timeout
Timeout for a slot request. Default: 300000 ms
2. slot.idle.timeout
Timeout for an idle slot. Default: 50000 ms

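5. Call flow

The call flow is simply the full startup path the JobManager follows after it receives a job.
The call chain that reads the configuration is: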

JobSubmitHandler.handleRequest
->Dispatcher.submitJob
->Dispatcher.internalSubmitJob
->Dispatcher.runJob
->Dispatcher.createJobManagerRunner
->JobMasterServiceLeadershipRunnerFactory.createJobManagerRunner
->DefaultSlotPoolServiceSchedulerFactory.fromConfiguration

The call chain that creates the scheduler is:

JobMasterServiceLeadershipRunner.grantLeadership
->JobMasterServiceLeadershipRunner.verifyJobSchedulingStatusAndCreateJobMasterServiceProcess
->JobMasterServiceLeadershipRunner.createNewJobMasterServiceProcess
->DefaultJobMasterServiceProcessFactory.create
->DefaultJobMasterServiceFactory.createJobMasterService
->DefaultJobMasterServiceFactory.internalCreateJobMasterService
->JobMaster.createScheduler
->DefaultSlotPoolServiceSchedulerFactory.createScheduler
->AdaptiveSchedulerFactory.createInstance

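6. Configuring the Adaptive scheduler

The Adaptive scheduler's parameters are read and the corresponding factory is created in the getAdaptiveSchedulerFactoryFromConfiguration method of DefaultSlotPoolServiceSchedulerFactory. The relevant options are:
jobmanager.adaptive-scheduler.resource-wait-timeout: the resource wait time. After a job is submitted or restarted, the scheduler waits this long to acquire the full set of resources. If not all resources are acquired in time, the job runs at a lower parallelism; if even the minimum resource requirement is not met, the job fails.
jobmanager.adaptive-scheduler.resource-stabilization-timeout: the wait time after runnable resources have been acquired. When the job has enough resources to run but fewer than requested, the scheduler waits this long and then starts the job.
In reactive mode, these two options are -1 and 0 respectively (one caveat: judging from the code, only the default values are special-cased, so a value set explicitly by the user takes precedence).
Note that this step only creates the factory and passes in the timeout parameters; the scheduler itself is not created yet.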

private static AdaptiveSchedulerFactory getAdaptiveSchedulerFactoryFromConfiguration(
        Configuration configuration) {
    Duration allocationTimeoutDefault = JobManagerOptions.RESOURCE_WAIT_TIMEOUT.defaultValue();
    Duration stabilizationTimeoutDefault =
            JobManagerOptions.RESOURCE_STABILIZATION_TIMEOUT.defaultValue();

    if (configuration.get(JobManagerOptions.SCHEDULER_MODE)
            == SchedulerExecutionMode.REACTIVE) {
        allocationTimeoutDefault = Duration.ofMillis(-1);
        stabilizationTimeoutDefault = Duration.ZERO;
    }

    final Duration initialResourceAllocationTimeout =
            configuration
                    .getOptional(JobManagerOptions.RESOURCE_WAIT_TIMEOUT)
                    .orElse(allocationTimeoutDefault);

    final Duration resourceStabilizationTimeout =
            configuration
                    .getOptional(JobManagerOptions.RESOURCE_STABILIZATION_TIMEOUT)
                    .orElse(stabilizationTimeoutDefault);

    return new AdaptiveSchedulerFactory(
            initialResourceAllocationTimeout, resourceStabilizationTimeout);
}
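
The precedence between user-set values and the reactive defaults follows from the `getOptional(...).orElse(...)` pattern above. A minimal sketch of that pattern in isolation, using only `java.util.Optional`; the class `TimeoutDefaultsSketch` and the method `effectiveTimeout` are hypothetical names for illustration:

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper isolating the "explicit value wins over mode-specific default" pattern.
final class TimeoutDefaultsSketch {
    static Duration effectiveTimeout(
            Optional<Duration> userValue,
            Duration regularDefault,
            Duration reactiveDefault,
            boolean reactiveMode) {
        // Reactive mode only swaps the fallback; it never overrides an explicit setting.
        Duration fallback = reactiveMode ? reactiveDefault : regularDefault;
        return userValue.orElse(fallback);
    }

    public static void main(String[] args) {
        Duration regular = Duration.ofMinutes(5);
        Duration reactive = Duration.ofMillis(-1);
        System.out.println(effectiveTimeout(Optional.empty(), regular, reactive, true));
        System.out.println(effectiveTimeout(Optional.of(Duration.ofSeconds(30)), regular, reactive, true));
    }
}
```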

7.DefaultDeclarativeSlotPool

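The declarative resource management component is the foundation of the Adaptive scheduler. The biggest differences from the original design are: 1. The JobMaster no longer requests slots one by one; instead it declares the resources it needs; 2. The resource requirement is an elastic range rather than a fixed amount.
Declarative resource management decouples the job from resource acquisition by introducing an intermediate management layer. The job only submits its resource requirements, and the intermediate layer takes care of requesting and releasing resources.
DefaultDeclarativeSlotPool has three methods that accept declared resource requirements: increaseResourceRequirementsBy (increase), decreaseResourceRequirementsBy (decrease), and setResourceRequirements (set). Tracing the callers backwards, setResourceRequirements is called by the Adaptive scheduler and increaseResourceRequirementsBy by the NG scheduler. The slot pool appears to support both the traditional and the adaptive approach: the traditional one requests each slot separately, while the adaptive one declares everything at once. The totalResourceRequirements here belongs to a single job; as the code shows, the DeclarativeSlotPool is created by the JobMaster, so each job has its own.
As shown below, totalResourceRequirements holds the declared resource requirements; tracing the call chain backwards, setResourceRequirements is the entry point used by the Adaptive scheduler: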

@Override
public void increaseResourceRequirementsBy(ResourceCounter increment) {
    if (increment.isEmpty()) {
        return;
    }
    totalResourceRequirements = totalResourceRequirements.add(increment);

    declareResourceRequirements();
}

@Override
public void decreaseResourceRequirementsBy(ResourceCounter decrement) {
    if (decrement.isEmpty()) {
        return;
    }
    totalResourceRequirements = totalResourceRequirements.subtract(decrement);

    declareResourceRequirements();
}

@Override
public void setResourceRequirements(ResourceCounter resourceRequirements) {
    totalResourceRequirements = resourceRequirements;

    declareResourceRequirements();
}
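
The difference between the incremental style and the set-at-once style can be shown with a small stand-alone model. This is only a sketch: `RequirementsSketch` is a hypothetical class that uses a plain map keyed by a profile name in place of Flink's ResourceCounter, and it counts declarations instead of sending them to a ResourceManager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-job requirement counter; not Flink's ResourceCounter.
final class RequirementsSketch {
    private final Map<String, Integer> totalRequirements = new HashMap<>();
    private int declarations; // how often the total was (re)declared

    // Incremental style: one call per additional slot request.
    void increaseBy(String profile, int count) {
        if (count == 0) {
            return; // nothing changed, so nothing is declared
        }
        totalRequirements.merge(profile, count, Integer::sum);
        declare();
    }

    void decreaseBy(String profile, int count) {
        if (count == 0) {
            return;
        }
        int current = required(profile);
        if (count > current) {
            throw new IllegalArgumentException("cannot decrease below zero");
        }
        if (count == current) {
            totalRequirements.remove(profile);
        } else {
            totalRequirements.put(profile, current - count);
        }
        declare();
    }

    // Set-at-once style: the whole requirement is replaced in a single call.
    void set(Map<String, Integer> requirements) {
        totalRequirements.clear();
        totalRequirements.putAll(requirements);
        declare();
    }

    int required(String profile) {
        return totalRequirements.getOrDefault(profile, 0);
    }

    int declarations() {
        return declarations;
    }

    private void declare() {
        declarations++; // a real pool would notify the ResourceManager here
    }

    public static void main(String[] args) {
        RequirementsSketch pool = new RequirementsSketch();
        pool.increaseBy("default", 1);
        pool.increaseBy("default", 1);
        System.out.println(pool.required("default") + " slots after " + pool.declarations() + " declarations");
    }
}
```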

7.1.NewSlotsListener

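A resource listener that is notified when new slots become available. The actual listener is registered by the Adaptive scheduler through this interface, so that when slots are added the scheduler can be triggered to scale up.
In DefaultDeclarativeSlotPool, two methods trigger notifyNewSlotsAreAvailable, that is, the notification that new resources have arrived: offerSlots and freeReservedSlot.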

7.2.offerSlots

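As the name suggests, this offers slots. It is triggered by the TaskManager (TaskExecutor): once started, the TaskManager offers slots to the JobManager through the JobMasterGateway.
TaskExecutor's internalOfferSlotsToJobManager contains the following (this needs further study: jobMasterGateway ultimately points to the JobMaster, and a JobMaster is created per job):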

CompletableFuture<Collection<SlotOffer>> acceptedSlotsFuture =
        jobMasterGateway.offerSlots(
                getResourceID(),
                reservedSlots,
                taskManagerConfiguration.getRpcTimeout());

After receiving the offer, the JobMaster calls the SlotPoolService:

return CompletableFuture.completedFuture(
        slotPoolService.offerSlots(taskManagerLocation, rpcTaskManagerGateway, slots));

Finally, the newSlotsListener is notified:

if (!acceptedSlots.isEmpty()) {
    LOG.debug(
            "Acquired new resources; new total acquired resources: {}",
            fulfilledResourceRequirements);
    newSlotsListener.notifyNewSlotsAreAvailable(acceptedSlots);
}

Open question: the DeclarativeSlotPool is per job, so when a new slot comes online, how is it decided which job's DeclarativeSlotPool gets notified?

7.3.freeReservedSlot

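This method is called when resources are released. There are two release paths:
1. Timeout: when a resource request times out, releaseSlot is called in DeclarativeSlotPoolBridge; its caller is DeclarativeSlotPoolBridge's timeoutPendingSlotRequest
2. Cancellation: in PhysicalSlotProviderImpl's cancelSlotRequest, which is a callback set in SharedSlot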

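7.4. Scale-down trigger

Scale-down is triggered by task failure. On the scheduler side it should eventually reach goToRestarting, because scaling down necessarily requires a restart. Tracing upwards from that method, the main caller is handleAnyFailure in Executing.
As for how failures are reported, DefaultExecutionGraph has a member InternalFailuresListener that is responsible for listening for failures.
A job failure can be triggered through several methods; some cancel the job directly and some do not. Which scenario uses which still needs study, but the triggers relevant here must be the ones that do not cancel.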

8.AdaptiveScheduler

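The implementation class of the Adaptive scheduler.

8.1. Preconditions

When the scheduler is created, it first checks the JobGraph against its requirements. The main conditions are: 1. Streaming mode; 2. Every vertex has a parallelism greater than 0; 3. Data is exchanged in pipelined mode.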

private static void assertPreconditions(JobGraph jobGraph) throws RuntimeException {
    Preconditions.checkState(
            jobGraph.getJobType() == JobType.STREAMING,
            "The adaptive scheduler only supports streaming jobs.");

    for (JobVertex vertex : jobGraph.getVertices()) {
        Preconditions.checkState(
                vertex.getParallelism() > 0,
                "The adaptive scheduler expects the parallelism being set for each JobVertex (violated JobVertex: %s).",
                vertex.getID());
        for (JobEdge jobEdge : vertex.getInputs()) {
            Preconditions.checkState(
                    jobEdge.getSource().getResultType().isPipelined(),
                    "The adaptive scheduler supports pipelined data exchanges (violated by %s -> %s).",
                    jobEdge.getSource().getProducer(),
                    jobEdge.getTarget().getID());
        }
    }
}

8.2. Computing parallelism information

Next, the max parallelism information is computed for each vertex, with special handling for Reactive mode:

private static VertexParallelismStore computeVertexParallelismStore(
        JobGraph jobGraph, SchedulerExecutionMode executionMode) {
    if (executionMode == SchedulerExecutionMode.REACTIVE) {
        return computeReactiveModeVertexParallelismStore(
                jobGraph.getVertices(), SchedulerBase::getDefaultMaxParallelism, true);
    }
    return SchedulerBase.computeVertexParallelismStore(jobGraph);
}

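8.2.1. Default max parallelism

The default max parallelism (used when the user has not set one) is computed as follows. operatorParallelism is the vertex parallelism if the user set it, and 1 otherwise. roundUpToPowerOfTwo rounds a number up to the next power of two, DEFAULT_LOWER_BOUND_MAX_PARALLELISM is 1 << 7 (128), and UPPER_BOUND_MAX_PARALLELISM is 1 << 15 (32768).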

public static int computeDefaultMaxParallelism(int operatorParallelism) {

    checkParallelismPreconditions(operatorParallelism);

    return Math.min(
            Math.max(
                    MathUtils.roundUpToPowerOfTwo(
                            operatorParallelism + (operatorParallelism / 2)),
                    DEFAULT_LOWER_BOUND_MAX_PARALLELISM),
            UPPER_BOUND_MAX_PARALLELISM);
}
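
A few concrete values make the formula easier to follow. The sketch below re-implements the computation without any Flink dependency; `DefaultMaxParallelismSketch` is a hypothetical class, and its `roundUpToPowerOfTwo` is a stand-in written with `Integer.highestOneBit`, assumed to behave like the MathUtils method for positive inputs in this range.

```java
// Stand-alone re-implementation for illustration; not Flink's own class.
final class DefaultMaxParallelismSketch {
    static final int LOWER_BOUND = 1 << 7;  // 128
    static final int UPPER_BOUND = 1 << 15; // 32768

    // Rounds a positive int up to the next power of two (a power of two maps to itself).
    static int roundUpToPowerOfTwo(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    static int computeDefaultMaxParallelism(int operatorParallelism) {
        if (operatorParallelism <= 0 || operatorParallelism > UPPER_BOUND) {
            throw new IllegalArgumentException("parallelism out of range: " + operatorParallelism);
        }
        // 1.5 times the parallelism, rounded up to a power of two, clamped to [128, 32768]
        int rounded = roundUpToPowerOfTwo(operatorParallelism + operatorParallelism / 2);
        return Math.min(Math.max(rounded, LOWER_BOUND), UPPER_BOUND);
    }

    public static void main(String[] args) {
        for (int p : new int[] {1, 100, 200, 30000}) {
            System.out.println(p + " -> " + computeDefaultMaxParallelism(p));
        }
    }
}
```

For a parallelism of 100, the intermediate value is 150, which rounds up to 256. Any parallelism up to 85 yields an intermediate value of at most 127 and therefore ends up at the lower bound of 128.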

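8.2.2. Reactive special handling

In Reactive mode the parallelism computation sets parallelism to maxParallelism, that is, it uses as many resources as possible. (The method takes a parameter that controls whether the parallelism is raised to the max parallelism; this call chain does raise it, while another call chain leaves the parallelism unchanged.)

8.2.3. DefaultVertexParallelismInfo

This class stores a task's parallelism information in two main fields: parallelism and maxParallelism.
The code simply sets these two values: the parallelism defaults to 1 if unset, and the max parallelism, if unset, gets the default described above.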

int parallelism = normalizeParallelism(vertex.getParallelism());

int maxParallelism = vertex.getMaxParallelism();
final boolean autoConfigured;
// if no max parallelism was configured by the user, we calculate and set a default
if (maxParallelism == JobVertex.MAX_PARALLELISM_DEFAULT) {
    maxParallelism = defaultMaxParallelismFunc.apply(vertex);
    autoConfigured = true;
} else {
    autoConfigured = false;
}

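8.3. The upper bound on the job's running parallelism

As the tests in section 2 showed, the upper bound on the job's parallelism is determined by the configured parallelism, not by the max parallelism. The decision path is as follows:
AdaptiveScheduler's goToCreatingExecutionGraph triggers construction of the ExecutionGraph, and the ExecutionGraph determines the job's parallelism.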

@Override
public void goToCreatingExecutionGraph() {
    final CompletableFuture<CreatingExecutionGraph.ExecutionGraphWithVertexParallelism>
            executionGraphWithAvailableResourcesFuture =
                    createExecutionGraphWithAvailableResourcesAsync();

    transitionToState(
            new CreatingExecutionGraph.Factory(
                    this, executionGraphWithAvailableResourcesFuture, LOG));
}

Later on, the creation is triggered by this call:

return createExecutionGraphAndRestoreStateAsync(adjustedParallelismStore)

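adjustedParallelismStore is processed much like the parallelism computation in the previous section, which means that what ultimately gets set is parallelism (its actual value is changed by the scale-up and scale-down flows).
During ExecutionGraph creation, the ExecutionJobVertex class contains the following line. It uses the parallelism rather than the max parallelism, which is why the job never exceeds the configured parallelism.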

this.taskVertices = new ExecutionVertex[this.parallelismInfo.getParallelism()];

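8.4. Scale-up flow

The key to this flow is the NewSlotsListener. When the Adaptive scheduler is initialized, it registers a resource listener on the DeclarativeSlotPool, which triggers the follow-up actions when resources change (presumably only scale-up is triggered this way).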

declarativeSlotPool.registerNewSlotsListener(this::newResourcesAvailable);

private void newResourcesAvailable(Collection<? extends PhysicalSlot> physicalSlots) {
    state.tryRun(
            ResourceConsumer.class,
            ResourceConsumer::notifyNewResourcesAvailable,
            "newResourcesAvailable");
}

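ResourceConsumer is the interface for reacting to new resources. It currently has two implementations, Executing and WaitingForResources, which means that resource changes only affect jobs in these two states. notifyNewResourcesAvailable triggers the reaction.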
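
The effect of dispatching through `state.tryRun(ResourceConsumer.class, ...)` is that the notification is silently dropped unless the current state implements the interface. The sketch below models that idea with a simplified, hypothetical `tryRun` that returns whether the action ran; the real method has a different signature, and the state classes here are empty stand-ins.

```java
import java.util.function.Consumer;

// Hypothetical, simplified model of type-based dispatch on the current state.
final class StateDispatchSketch {
    interface State {
        // Runs the action only if this state implements the requested interface.
        default <T> boolean tryRun(Class<T> type, Consumer<T> action) {
            if (type.isInstance(this)) {
                action.accept(type.cast(this));
                return true;
            }
            return false;
        }
    }

    interface ResourceConsumer {
        void notifyNewResourcesAvailable();
    }

    // Stand-in for a state that reacts to new slots.
    static final class Executing implements State, ResourceConsumer {
        int notifications;

        @Override
        public void notifyNewResourcesAvailable() {
            notifications++;
        }
    }

    // Stand-in for a state that ignores new slots.
    static final class Restarting implements State {}

    public static void main(String[] args) {
        State executing = new Executing();
        State restarting = new Restarting();
        System.out.println(executing.tryRun(ResourceConsumer.class, ResourceConsumer::notifyNewResourcesAvailable));
        System.out.println(restarting.tryRun(ResourceConsumer.class, ResourceConsumer::notifyNewResourcesAvailable));
    }
}
```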

8.4.1. Deciding whether to scale up

Taking Executing as an example, it reacts to added resources as follows:

public void notifyNewResourcesAvailable() {
    if (context.canScaleUp(getExecutionGraph())) {
        getLogger().info("New resources are available. Restarting job to scale up.");
        context.goToRestarting(
                getExecutionGraph(),
                getExecutionGraphHandler(),
                getOperatorCoordinatorHandler(),
                Duration.ofMillis(0L));
    }
}

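canScaleUp decides whether a scale-up is possible; it calls the corresponding method of AdaptiveScheduler.
The check first looks at the SlotSharingGroups to make sure every group gets at least one slot. Here freeSlots is the sum of the already allocated slots and the free slots.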

// TODO: This can waste slots if the max parallelism for slot sharing groups is not equal
final int slotsPerSlotSharingGroup =
        freeSlots.size() / jobInformation.getSlotSharingGroups().size();

if (slotsPerSlotSharingGroup == 0) {
    // => less slots than slot-sharing groups
    return Optional.empty();
}

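The final parallelism is computed as follows. availableSlots is the per-group value derived from the freeSlots above, and jobVertex.getParallelism() should be the parallelism set in the parallelism computation of the previous section (this is probably where the test conclusion comes from: the parallelism is capped by the configured parallelism, not by the max parallelism).
One issue here: the per-group slot count is an average, so if the groups differ, a group with a high configured parallelism may get a value below its configured parallelism, while a group with a low configured parallelism cannot use its share, so the scale-up falls short of its goal.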

for (JobInformation.VertexInformation jobVertex : containedJobVertices) {
    final int parallelism = Math.min(jobVertex.getParallelism(), availableSlots);

    vertexParallelism.put(jobVertex.getJobVertexID(), parallelism);
}
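
The uneven-group problem can be reproduced with a small stand-alone model of the two snippets above. `SlotSplitSketch` and its map-based inputs (group name to vertex name to configured parallelism) are hypothetical simplifications; only the integer division and the `Math.min` cap are taken from the code shown.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the even split across slot sharing groups; illustration only.
final class SlotSplitSketch {
    static Optional<Map<String, Integer>> determineParallelism(
            int freeSlots, Map<String, Map<String, Integer>> groups) {
        if (groups.isEmpty()) {
            throw new IllegalArgumentException("at least one slot sharing group expected");
        }
        int slotsPerGroup = freeSlots / groups.size(); // every group gets the same share
        if (slotsPerGroup == 0) {
            return Optional.empty(); // fewer slots than groups
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map<String, Integer> vertices : groups.values()) {
            for (Map.Entry<String, Integer> vertex : vertices.entrySet()) {
                // capped by the configured parallelism, never by the max parallelism
                result.put(vertex.getKey(), Math.min(vertex.getValue(), slotsPerGroup));
            }
        }
        return Optional.of(result);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> groups = new LinkedHashMap<>();
        groups.put("groupA", Collections.singletonMap("source", 6));
        groups.put("groupB", Collections.singletonMap("sink", 2));
        System.out.println(determineParallelism(8, groups));
    }
}
```

With 8 slots and two groups, each group is given 4. The source vertex wants 6 but is capped at 4, the sink only uses 2, and the remaining 2 slots of the second group stay idle.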

Next comes the decision whether to act. It compares the cumulative parallelism, so it does not resolve the imbalance described above.

if (potentialNewParallelism.isPresent()) {
    int currentCumulativeParallelism = getCurrentCumulativeParallelism(executionGraph);
    int newCumulativeParallelism =
            getCumulativeParallelism(potentialNewParallelism.get());
    if (newCumulativeParallelism > currentCumulativeParallelism) {
        LOG.debug(
                "Offering scale up to scale up controller with currentCumulativeParallelism={}, newCumulativeParallelism={}",
                currentCumulativeParallelism,
                newCumulativeParallelism);
        return scaleUpController.canScaleUp(
                currentCumulativeParallelism, newCumulativeParallelism);
    }
}
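
The cumulative comparison above, combined with the jobmanager.adaptive-scheduler.min-parallelism-increase threshold from section 4.1, can be sketched as follows. `ScaleUpCheckSketch` is a hypothetical class, and the assumption that the controller simply compares the difference against the configured minimum is mine, not confirmed by the code shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the scale-up decision; the threshold rule is an assumption.
final class ScaleUpCheckSketch {
    // Sum of the parallelism of all vertices.
    static int cumulativeParallelism(Map<String, Integer> vertexParallelism) {
        int sum = 0;
        for (int parallelism : vertexParallelism.values()) {
            sum += parallelism;
        }
        return sum;
    }

    // Assumed rule: scale up only if the total grows by at least minIncrease.
    static boolean canScaleUp(int currentCumulative, int newCumulative, int minIncrease) {
        return newCumulative > currentCumulative
                && newCumulative - currentCumulative >= minIncrease;
    }

    public static void main(String[] args) {
        Map<String, Integer> current = new LinkedHashMap<>();
        current.put("source", 4);
        current.put("sink", 2);
        Map<String, Integer> candidate = new LinkedHashMap<>();
        candidate.put("source", 5);
        candidate.put("sink", 2);
        System.out.println(canScaleUp(
                cumulativeParallelism(current), cumulativeParallelism(candidate), 1));
    }
}
```

Because only the totals are compared, the check cannot tell which vertex gained parallelism, which matches the observation that it does not address imbalance between groups.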

Finally, scaleUpController.canScaleUp checks the increase against the configured minimum parallelism increase:

public static final ConfigOption<Integer> MIN_PARALLELISM_INCREASE =
        key("jobmanager.adaptive-scheduler.min-parallelism-increase")
                .intType()
                .defaultValue(1)
                .withDescription(
                        "Configure the minimum increase in parallelism for a job to scale up.");

8.4.2. Restart

As described above, once Executing decides that rescaling is possible, it triggers a restart. A restart is also triggered by handleAnyFailure:

private void handleAnyFailure(Throwable cause) {
    final FailureResult failureResult = context.howToHandleFailure(cause);

    if (failureResult.canRestart()) {
        getLogger().info("Restarting job.", failureResult.getFailureCause());
        context.goToRestarting(
                getExecutionGraph(),
                getExecutionGraphHandler(),
                getOperatorCoordinatorHandler(),
                failureResult.getBackoffTime());
    } else {
        getLogger().info("Failing job.", failureResult.getFailureCause());
        context.goToFailing(
                getExecutionGraph(),
                getExecutionGraphHandler(),
                getOperatorCoordinatorHandler(),
                failureResult.getFailureCause());
    }
}

A restart is also a state transition; in addition, the restart counter is incremented:

public void goToRestarting(
        ExecutionGraph executionGraph,
        ExecutionGraphHandler executionGraphHandler,
        OperatorCoordinatorHandler operatorCoordinatorHandler,
        Duration backoffTime) {

    for (ExecutionVertex executionVertex : executionGraph.getAllExecutionVertices()) {
        final int attemptNumber =
                executionVertex.getCurrentExecutionAttempt().getAttemptNumber();

        this.vertexAttemptNumberStore.setAttemptCount(
                executionVertex.getJobvertexId(),
                executionVertex.getParallelSubtaskIndex(),
                attemptNumber + 1);
    }

    transitionToState(
            new Restarting.Factory(
                    this,
                    executionGraph,
                    executionGraphHandler,
                    operatorCoordinatorHandler,
                    LOG,
                    backoffTime));
    numRestarts++;
}

Restarting's parent class is StateWithExecutionGraph, whose constructor contains the following call:

FutureUtils.assertNoException(
        executionGraph
                .getTerminationFuture()
                .thenAcceptAsync(
                        jobStatus -> {
                            if (jobStatus.isGloballyTerminalState()) {
                                context.runIfState(
                                        this, () -> onGloballyTerminalState(jobStatus));
                            }
                        },
                        context.getMainThreadExecutor()));

context.runIfState runs the callback, which eventually transitions to the WaitingForResources state:

@Override
void onGloballyTerminalState(JobStatus globallyTerminalState) {
    Preconditions.checkArgument(globallyTerminalState == JobStatus.CANCELED);
    goToWaitingForResourcesFuture =
            context.runIfState(this, context::goToWaitingForResources, backoffTime);
}
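
Put together, a rescale is one loop through the scheduler's states. The following sketch models only the transitions discussed in this section as a small transition function; the enum and event names in `StateCycleSketch` are hypothetical, and the real scheduler has more states and transitions than shown.

```java
// Hypothetical, reduced model of the state cycle described in this section.
final class StateCycleSketch {
    enum State { WAITING_FOR_RESOURCES, CREATING_EXECUTION_GRAPH, EXECUTING, RESTARTING, FAILING }

    enum Event {
        RESOURCES_READY, GRAPH_CREATED, SCALE_UP_POSSIBLE,
        RECOVERABLE_FAILURE, UNRECOVERABLE_FAILURE, GRAPH_CANCELED
    }

    // Returns the next state; events that do not apply leave the state unchanged.
    static State next(State state, Event event) {
        switch (state) {
            case WAITING_FOR_RESOURCES:
                return event == Event.RESOURCES_READY ? State.CREATING_EXECUTION_GRAPH : state;
            case CREATING_EXECUTION_GRAPH:
                return event == Event.GRAPH_CREATED ? State.EXECUTING : state;
            case EXECUTING:
                if (event == Event.SCALE_UP_POSSIBLE || event == Event.RECOVERABLE_FAILURE) {
                    return State.RESTARTING; // both paths end in goToRestarting
                }
                return event == Event.UNRECOVERABLE_FAILURE ? State.FAILING : state;
            case RESTARTING:
                // once the old graph is canceled, wait for resources again
                return event == Event.GRAPH_CANCELED ? State.WAITING_FOR_RESOURCES : state;
            default:
                return state;
        }
    }

    public static void main(String[] args) {
        State state = State.EXECUTING;
        Event[] events = {
            Event.SCALE_UP_POSSIBLE, Event.GRAPH_CANCELED, Event.RESOURCES_READY, Event.GRAPH_CREATED
        };
        for (Event event : events) {
            state = next(state, event);
            System.out.println(event + " -> " + state);
        }
    }
}
```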

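9. Which job is triggered by newly added resources

This section describes how newly added resources (a newly started TaskManager) trigger a job's scale-up, which answers the question raised in section 7.2.
One entry point is the checkResourceRequirements method of DeclarativeSlotManager on the JobManager side. Based on the resourceRequirements it maintains for each job, it works out which resources are missing and builds a list of them, then iterates over the list and issues allocation requests. In other words, the first entry in the list is served first (how an entry ends up first needs further study; the order is probably arbitrary).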

for (Map.Entry<JobID, Collection<ResourceRequirement>> resourceRequirements :
        missingResources.entrySet()) {
    final JobID jobId = resourceRequirements.getKey();

    final ResourceCounter unfulfilledJobRequirements =
            tryAllocateSlotsForJob(jobId, resourceRequirements.getValue());
    if (!unfulfilledJobRequirements.isEmpty()) {
        unfulfilledRequirements.put(jobId, unfulfilledJobRequirements);
    }
}

missingResources is computed as follows, in getMissingResources of JobScopedResourceTracker:

public Collection<ResourceRequirement> getMissingResources() {
    final Collection<ResourceRequirement> missingResources = new ArrayList<>();
    for (Map.Entry<ResourceProfile, Integer> requirement :
            resourceRequirements.getResourcesWithCount()) {
        ResourceProfile requirementProfile = requirement.getKey();

        int numRequiredResources = requirement.getValue();
        int numAcquiredResources =
                resourceToRequirementMapping.getNumFulfillingResources(requirementProfile);

        if (numAcquiredResources < numRequiredResources) {
            missingResources.add(
                    ResourceRequirement.create(
                            requirementProfile, numRequiredResources - numAcquiredResources));
        }
    }
    return missingResources;
}
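
The computation is a per-profile difference between what a job declared and what it already holds. A minimal stand-alone sketch of the same idea, with plain maps keyed by a profile name in place of ResourceProfile and the tracker's internal mapping; `MissingResourcesSketch` and its parameter names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical map-based model of the missing-resources computation for one job.
final class MissingResourcesSketch {
    static Map<String, Integer> missing(Map<String, Integer> required, Map<String, Integer> acquired) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> requirement : required.entrySet()) {
            int numAcquired = acquired.getOrDefault(requirement.getKey(), 0);
            if (numAcquired < requirement.getValue()) {
                // only the shortfall is reported, fulfilled profiles are left out
                result.put(requirement.getKey(), requirement.getValue() - numAcquired);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> required = new LinkedHashMap<>();
        required.put("small", 4);
        required.put("large", 2);
        Map<String, Integer> acquired = new LinkedHashMap<>();
        acquired.put("small", 1);
        acquired.put("large", 2);
        System.out.println(missing(required, acquired));
    }
}
```

A job whose parallelism was reduced still declares its full requirement, so it keeps showing up with a shortfall, and a newly registered slot can be assigned to it.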