Flink Adaptive Batch Scheduler

1. Overview

  The adaptive batch scheduler automatically derives and sets job parallelism, so users do not have to set it manually: Flink decides the parallelism of each job vertex based on the user-configured expectations and the job's runtime statistics.

2. Configuration

  To enable the feature, set:

jobmanager.scheduler: AdaptiveBatch

execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING (the default)

Leave the job parallelism unset, i.e. set it to -1 (e.g. parallelism.default: -1)

  Other options:

jobmanager.adaptive-batch-scheduler.min-parallelism: the lower bound of the parallelism the scheduler is allowed to decide. Currently this option should be configured as a power of 2, otherwise it is automatically rounded up to a power of 2.

jobmanager.adaptive-batch-scheduler.max-parallelism: the upper bound of the parallelism the scheduler is allowed to decide. Currently this option should be configured as a power of 2, otherwise it is automatically rounded down to a power of 2.

jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task: the average amount of data each task instance is expected to process. Note that since vertex parallelism is normalized to a power of 2, the actual average will be 0.75x to 1.5x of this value. Also note that under data skew, or when the decided parallelism hits max-parallelism (because there is too much data), some tasks may process far more data than this value. Default: 1024M.

jobmanager.adaptive-batch-scheduler.default-source-parallelism: the default parallelism of source vertices.
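
  Putting it together, a flink-conf.yaml enabling the scheduler might look like the sketch below (the values are illustrative, not recommendations):

jobmanager.scheduler: AdaptiveBatch
execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING
parallelism.default: -1
jobmanager.adaptive-batch-scheduler.min-parallelism: 1
jobmanager.adaptive-batch-scheduler.max-parallelism: 128
jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task: 1024m
jobmanager.adaptive-batch-scheduler.default-source-parallelism: 4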

3. AdaptiveBatchSchedulerFactory

  This is the factory class for the adaptive batch scheduler, responsible for creating the scheduler instance.

3.1 Slot Selection Strategy

  The factory picks the slot selection strategy:

final SlotSelectionStrategy slotSelectionStrategy =
        SlotSelectionStrategyUtils.selectSlotSelectionStrategy(
                JobType.BATCH, jobMasterConfiguration);

  There are three concrete implementations plus one abstract class: LocationPreferenceSlotSelectionStrategy (abstract, with subclasses DefaultLocationPreferenceSlotSelectionStrategy and EvenlySpreadOutLocationPreferenceSlotSelectionStrategy) and PreviousAllocationSlotSelectionStrategy. They correspond to, respectively, picking any matching slot, spreading slots out evenly, and preferring the previous allocation (i.e. the slot used by the task's previous execution).
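
  Which of the two location-preference implementations is chosen is controlled by the cluster.evenly-spread-out-slots option, for example:

# true selects EvenlySpreadOutLocationPreferenceSlotSelectionStrategy;
# the default (false) selects DefaultLocationPreferenceSlotSelectionStrategy
cluster.evenly-spread-out-slots: true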

  Local recovery is also involved here: batch jobs do not support local recovery, so PreviousAllocationSlotSelectionStrategy is not available to them:

final boolean isLocalRecoveryEnabled =
        configuration.getBoolean(CheckpointingOptions.LOCAL_RECOVERY);
if (isLocalRecoveryEnabled) {
    if (jobType == JobType.STREAMING) {
        return PreviousAllocationSlotSelectionStrategy.create(
                locationPreferenceSlotSelectionStrategy);
    } else {
        LOG.warn(
                "Batch job does not support local recovery. Falling back to use "
                        + locationPreferenceSlotSelectionStrategy.getClass());
        return locationPreferenceSlotSelectionStrategy;
    }
} else {
    return locationPreferenceSlotSelectionStrategy;
}

3.2 Restart Strategy

  The factory also creates the restart (backoff time) strategy:

final RestartBackoffTimeStrategy restartBackoffTimeStrategy =
        RestartBackoffTimeStrategyFactoryLoader.createRestartBackoffTimeStrategyFactory(
                        jobGraph.getSerializedExecutionConfig()
                                .deserializeValue(userCodeLoader)
                                .getRestartStrategy(),
                        jobMasterConfiguration,
                        jobGraph.isCheckpointingEnabled())
                .create();

  There are four implementations: NoRestartBackoffTimeStrategy, FixedDelayRestartBackoffTimeStrategy, FailureRateRestartBackoffTimeStrategy, and ExponentialDelayRestartBackoffTimeStrategy. They correspond to, respectively, no restart, a fixed number of restart attempts with a fixed delay, restarts bounded by a failure rate, and unlimited restarts with exponentially growing delay.
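
  For example, the fixed-delay strategy corresponds to the following user-facing configuration:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s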

4. AdaptiveBatchScheduler

4.1 initializeVerticesIfPossible

  This is the core method; it triggers the whole chain of subsequent operations and performs three steps.
  First, decide the parallelism:

for (ExecutionJobVertex jobVertex : getExecutionGraph().getVerticesTopologically()) {
    maybeSetParallelism(jobVertex);
}

  Then, initialize the vertices that can now be initialized:

for (ExecutionJobVertex jobVertex : getExecutionGraph().getVerticesTopologically()) {
    if (canInitialize(jobVertex)) {
        getExecutionGraph().initializeJobVertex(jobVertex, createTimestamp);
        newlyInitializedJobVertices.add(jobVertex);
    }
}

  Finally, update the topology with the newly initialized vertices:

if (newlyInitializedJobVertices.size() > 0) {
    updateTopology(newlyInitializedJobVertices);
}

4.2 maybeSetParallelism

  This method decides and sets the vertex parallelism; it is the most important part of this scheduler.
  First, if the jobVertex already has a decided parallelism (in practice this usually means it was set explicitly on the operator in user code), the method returns immediately:

if (jobVertex.isParallelismDecided()) {
    return;
}

public boolean isParallelismDecided() {
    return parallelismInfo.getParallelism() > 0;
}

  Next, fetch the consumed-result information; parallelism derivation needs the execution statistics of the upstream results:

Optional<List<BlockingResultInfo>> consumedResultsInfo =
        tryGetConsumedResultsInfo(jobVertex);
if (!consumedResultsInfo.isPresent()) {
    return;
}

  What follows is the actual parallelism decision. This involves the concept of a ForwardGroup: a group of vertices (connected by forward edges) that must all keep the same parallelism. If the group's parallelism has already been decided, that value is used:

if (forwardGroup != null && forwardGroup.isParallelismDecided()) {
    parallelism = forwardGroup.getParallelism();
    log.info(
            "Parallelism of JobVertex: {} ({}) is decided to be {} according to forward group's parallelism.",
            jobVertex.getName(),
            jobVertex.getJobVertexId(),
            parallelism);
}
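
  For intuition, a minimal (hypothetical) DataStream sketch: the map below is connected to its input by a forward edge, so the two vertices land in the same ForwardGroup and the scheduler will decide a single parallelism for both.

import org.apache.flink.streaming.api.datastream.DataStream;

// Hypothetical pipeline for illustration: forward() requires producer and
// consumer to run with equal parallelism, which is exactly the constraint
// a ForwardGroup captures.
static DataStream<String> toUpper(DataStream<String> source) {
    return source.forward().map(String::toUpperCase);
}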

  Otherwise, the parallelism is derived by the vertexParallelismDecider:

else {
    parallelism =
            vertexParallelismDecider.decideParallelismForVertex(consumedResultsInfo.get());
    if (forwardGroup != null) {
        forwardGroup.setParallelism(parallelism);
    }

    log.info(
            "Parallelism of JobVertex: {} ({}) is decided to be {}.",
            jobVertex.getName(),
            jobVertex.getJobVertexId(),
            parallelism);
}

  Finally, the parallelism is applied:

changeJobVertexParallelism(jobVertex, parallelism);

  Applying it is a plain setter call, plus regenerating the JSON plan so the REST API can report the latest parallelism:

private void changeJobVertexParallelism(ExecutionJobVertex jobVertex, int parallelism) {
    // update the JSON Plan, it's needed to enable REST APIs to return the latest parallelism of
    // job vertices
    jobVertex.getJobVertex().setParallelism(parallelism);
    try {
        getExecutionGraph().setJsonPlan(JsonPlanGenerator.generatePlan(getJobGraph()));
    } catch (Throwable t) {
        log.warn("Cannot create JSON plan for job", t);
        // give the graph an empty plan
        getExecutionGraph().setJsonPlan("{}");
    }

    jobVertex.setParallelism(parallelism);
}

4.3 tryGetConsumedResultsInfo

  This method collects the consumed-result information. The overall logic is to walk all producer (input) vertices and gather their result info. All producers must have finished; if any one of them has not, an empty Optional is returned:

for (DefaultLogicalResult consumedResult : consumedResults) {
    final ExecutionJobVertex producerVertex =
            getExecutionJobVertex(consumedResult.getProducer().getId());
    if (producerVertex.isFinished()) {
        IntermediateResult intermediateResult =
                getExecutionGraph().getAllIntermediateResults().get(consumedResult.getId());
        checkNotNull(intermediateResult);

        consumableResultInfo.add(
                BlockingResultInfo.createFromIntermediateResult(intermediateResult));
    } else {
        // not all inputs consumable, return Optional.empty()
        return Optional.empty();
    }
}

  The upstream execution statistics are obtained from IOMetrics, which records the task's I/O, i.e. how many bytes it produced per result partition:

IOMetrics ioMetrics =
        partition.getProducer().getCurrentExecutionAttempt().getIOMetrics();
checkNotNull(ioMetrics, "IOMetrics should not be null.");

blockingPartitionSizes.add(
        ioMetrics.getNumBytesProducedOfPartitions().get(partition.getPartitionId()));

4.4 vertexParallelismDecider.decideParallelismForVertex

  This is where the parallelism derivation happens; the implementation class is DefaultVertexParallelismDecider.
  If there are no upstream statistics, the vertex is a source, and the default source parallelism is used, i.e. the value of jobmanager.adaptive-batch-scheduler.default-source-parallelism:

if (consumedResults.isEmpty()) {
    // source job vertex
    return defaultSourceParallelism;
}

  To compute the parallelism, the broadcast and non-broadcast data volumes of the consumed results are first summed separately:

long broadcastBytes =
        consumedResults.stream()
                .filter(BlockingResultInfo::isBroadcast)
                .mapToLong(
                        consumedResult ->
                                consumedResult.getBlockingPartitionSizes().stream()
                                        .reduce(0L, Long::sum))
                .sum();

long nonBroadcastBytes =
        consumedResults.stream()
                .filter(consumedResult -> !consumedResult.isBroadcast())
                .mapToLong(
                        consumedResult ->
                                consumedResult.getBlockingPartitionSizes().stream()
                                        .reduce(0L, Long::sum))
                .sum();

  Next, the expected maximum broadcast volume is computed. dataVolumePerTask is the value of jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task, i.e. the user-configured expected data volume per task; CAP_RATIO_OF_BROADCAST is 0.5, since the per-task budget is shared between broadcast and non-broadcast data, so broadcast data may occupy at most half of it. If the broadcast volume computed above exceeds this cap, it is clamped down to the cap:

long expectedMaxBroadcastBytes =
        (long) Math.ceil((dataVolumePerTask * CAP_RATIO_OF_BROADCAST));

  Then the parallelism is computed and normalized to a power of 2:

int initialParallelism =
        (int) Math.ceil((double) nonBroadcastBytes / (dataVolumePerTask - broadcastBytes));
int parallelism = normalizeParallelism(initialParallelism);
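
  The 0.75x to 1.5x property mentioned in section 2 follows from rounding to the nearest power of 2. Below is a self-contained sketch of what the normalization does, written for illustration (the actual implementation relies on Flink's internal math utilities):

// Round to the nearest power of 2, ties rounding up. Worked example:
// nonBroadcastBytes = 30 GiB, broadcastBytes = 100 MiB,
// dataVolumePerTask = 1024 MiB
//   initialParallelism = ceil(30720 / (1024 - 100)) = 34
//   normalizeParallelism(34) == 32
static int normalizeParallelism(int parallelism) {
    // smallest power of 2 >= parallelism
    int up = 1;
    while (up < parallelism) {
        up <<= 1;
    }
    // largest power of 2 <= parallelism
    int down = (up == parallelism) ? up : up >> 1;
    // pick whichever of the two is closer; ties go up
    return parallelism < (down + up) / 2 ? down : up;
}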

  Finally, the result is compared against jobmanager.adaptive-batch-scheduler.max-parallelism and jobmanager.adaptive-batch-scheduler.min-parallelism and clamped into that range if it exceeds either bound.

5. Performance Tuning

  1. Use sort shuffle and set taskmanager.network.memory.buffers-per-channel to 0. This decouples the required network memory from the parallelism and prevents "Insufficient number of network buffers" errors (see the config sketch after this list).

  2. Do not set jobmanager.adaptive-batch-scheduler.max-parallelism too large, since a large value increases the number of upstream subpartitions and thereby the shuffle and network-transfer overhead.
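
  A possible configuration for the first point (sort shuffle kicks in once a result partition's parallelism reaches taskmanager.network.sort-shuffle.min-parallelism, so setting it to 1 applies sort shuffle to all exchanges):

taskmanager.network.sort-shuffle.min-parallelism: 1
taskmanager.network.memory.buffers-per-channel: 0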
