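1. Overview
The Adaptive Batch Scheduler derives job parallelism automatically, so users do not have to set it by hand. Flink decides the parallelism of each job vertex from the expectations the user configures and from how the job actually executes.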
2. Configuration
Settings required to enable the feature:
jobmanager.scheduler: AdaptiveBatch
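execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING (the default)
Set the job parallelism to -1
Other options:
jobmanager.adaptive-batch-scheduler.min-parallelism: the lower bound of the parallelism that may be set adaptively. Currently this option should be a power of 2; otherwise it is automatically rounded up to a power of 2.
jobmanager.adaptive-batch-scheduler.max-parallelism: the upper bound of the parallelism that may be set adaptively. Currently this option should be a power of 2; otherwise it is automatically rounded down to a power of 2.
jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task: the average amount of data each task instance is expected to process. Because vertex parallelism is adjusted to a power of 2, the actual average will be 0.75 to 1.5 times this value. Also note that with data skew, or when the decided parallelism hits max-parallelism (because there is too much data), some tasks may process far more than this value. Default: 1024 MB.
jobmanager.adaptive-batch-scheduler.default-source-parallelism: the default parallelism of data sources.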
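The rounding applied to min-parallelism and max-parallelism can be illustrated with a small sketch. The class and method names below are hypothetical and use only the JDK; they are not Flink's own helpers, and they only show the rounding rule described above.

```java
// Hypothetical helper, not part of Flink: shows how a configured bound
// that is not a power of 2 would be rounded.
final class ParallelismBounds {

    /** Smallest power of 2 that is >= value (valid for 1 <= value <= 2^30). */
    static int roundUpToPowerOfTwo(int value) {
        if (value <= 1) {
            return 1;
        }
        return Integer.highestOneBit(value - 1) << 1;
    }

    /** Largest power of 2 that is <= value (valid for value >= 1). */
    static int roundDownToPowerOfTwo(int value) {
        return Integer.highestOneBit(value);
    }

    public static void main(String[] args) {
        // A min-parallelism of 5 is rounded up, a max-parallelism of 100 is rounded down.
        System.out.println(roundUpToPowerOfTwo(5));     // prints 8
        System.out.println(roundDownToPowerOfTwo(100)); // prints 64
    }
}
```

With these example values the effective range would be 8 to 64 rather than the configured 5 to 100.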
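3. AdaptiveBatchSchedulerFactory
This is the factory class for the adaptive batch scheduler; it is responsible for creating the scheduler.
3.1. Slot selection strategy
The factory sets the slot selection strategy:
final SlotSelectionStrategy slotSelectionStrategy =
        SlotSelectionStrategyUtils.selectSlotSelectionStrategy(
                JobType.BATCH, jobMasterConfiguration);
There are three concrete implementations plus one abstract class: LocationPreferenceSlotSelectionStrategy (abstract, with subclasses DefaultLocationPreferenceSlotSelectionStrategy and EvenlySpreadOutLocationPreferenceSlotSelectionStrategy) and PreviousAllocationSlotSelectionStrategy. They correspond to random selection, evenly spread-out selection, and previous-allocation-first (which appears to mean preferring the slot from the previous allocation).
Local recovery also comes into play here: batch jobs do not support local recovery, so PreviousAllocationSlotSelectionStrategy is not used for them.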
final boolean isLocalRecoveryEnabled =
        configuration.getBoolean(CheckpointingOptions.LOCAL_RECOVERY);
if (isLocalRecoveryEnabled) {
    if (jobType == JobType.STREAMING) {
        return PreviousAllocationSlotSelectionStrategy.create(
                locationPreferenceSlotSelectionStrategy);
    } else {
        LOG.warn(
                "Batch job does not support local recovery. Falling back to use "
                        + locationPreferenceSlotSelectionStrategy.getClass());
        return locationPreferenceSlotSelectionStrategy;
    }
} else {
    // local recovery disabled: use the location-preference strategy directly
    return locationPreferenceSlotSelectionStrategy;
}
3.2. Restart strategy
The factory also sets the restart strategy:
final RestartBackoffTimeStrategy restartBackoffTimeStrategy =
        RestartBackoffTimeStrategyFactoryLoader.createRestartBackoffTimeStrategyFactory(
                        jobGraph.getSerializedExecutionConfig()
                                .deserializeValue(userCodeLoader)
                                .getRestartStrategy(),
                        jobMasterConfiguration,
                        jobGraph.isCheckpointingEnabled())
                .create();
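There are four implementations: NoRestartBackoffTimeStrategy, FixedDelayRestartBackoffTimeStrategy, FailureRateRestartBackoffTimeStrategy and ExponentialDelayRestartBackoffTimeStrategy. They correspond to no restart, a fixed number of restart attempts, a maximum failure rate, and unlimited restarts with an exponentially growing delay.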
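To make the difference between the fixed-delay and exponential-delay strategies concrete, here is a deliberately simplified sketch. The class, method and parameter names are hypothetical, and the real Flink strategies carry more state than this.

```java
// Simplified, hypothetical model of two backoff policies; not Flink's implementation.
final class BackoffSketch {

    /**
     * Fixed delay: the same wait before every restart, up to maxAttempts restarts.
     * Returns -1 when no further restart is allowed.
     */
    static long fixedDelayMillis(int failureCount, int maxAttempts, long delayMillis) {
        return failureCount <= maxAttempts ? delayMillis : -1L;
    }

    /**
     * Exponential delay: the wait grows by `multiplier` after each failure and is
     * capped at maxMillis; restarts are never refused.
     */
    static long exponentialDelayMillis(
            int failureCount, long initialMillis, double multiplier, long maxMillis) {
        double delay = initialMillis * Math.pow(multiplier, failureCount - 1);
        return (long) Math.min(delay, (double) maxMillis);
    }

    public static void main(String[] args) {
        System.out.println(fixedDelayMillis(4, 3, 1000L));                  // prints -1
        System.out.println(exponentialDelayMillis(3, 1000L, 2.0, 60_000L)); // prints 4000
    }
}
```

With an initial delay of 1 s, a multiplier of 2 and a cap of 60 s, the third failure waits 4 s and the tenth waits the capped 60 s.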
4. AdaptiveBatchScheduler
4.1. initializeVerticesIfPossible
This is the core method; the subsequent operations are all triggered from it. It performs three steps.
Set the parallelism:
for (ExecutionJobVertex jobVertex : getExecutionGraph().getVerticesTopologically()) {
    maybeSetParallelism(jobVertex);
}
Initialize the vertices:
for (ExecutionJobVertex jobVertex : getExecutionGraph().getVerticesTopologically()) {
    if (canInitialize(jobVertex)) {
        getExecutionGraph().initializeJobVertex(jobVertex, createTimestamp);
        newlyInitializedJobVertices.add(jobVertex);
    }
}
Update the topology:
if (newlyInitializedJobVertices.size() > 0) {
    updateTopology(newlyInitializedJobVertices);
}
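The three steps can be simulated with a toy model. Everything below is hypothetical: Vertex stands in for ExecutionJobVertex, a single fixed value replaces the real parallelism decision, and the initialization condition (parallelism decided, all producers initialized) is an assumption about canInitialize rather than something shown in the excerpt above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the decide / initialize / update loop; not Flink code.
final class InitLoopSketch {

    /** Hypothetical stand-in for ExecutionJobVertex. */
    static final class Vertex {
        final String name;
        final List<Vertex> inputs;
        int parallelism = -1; // -1 means "not decided yet"
        boolean initialized;
        boolean finished;

        Vertex(String name, Vertex... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    /** Runs one round and returns the names of the vertices initialized in it. */
    static List<String> initializeVerticesIfPossible(
            List<Vertex> topologicalOrder, int decidedParallelism) {
        // Step 1: decide parallelism once all inputs have finished.
        for (Vertex v : topologicalOrder) {
            if (v.parallelism <= 0 && v.inputs.stream().allMatch(in -> in.finished)) {
                v.parallelism = decidedParallelism;
            }
        }
        // Step 2: initialize vertices whose parallelism is known (assumed condition).
        List<String> newlyInitialized = new ArrayList<>();
        for (Vertex v : topologicalOrder) {
            boolean producersReady = v.inputs.stream().allMatch(in -> in.initialized);
            if (!v.initialized && v.parallelism > 0 && producersReady) {
                v.initialized = true;
                newlyInitialized.add(v.name);
            }
        }
        // Step 3: the scheduler would update its topology here if the list is non-empty.
        return newlyInitialized;
    }

    public static void main(String[] args) {
        Vertex source = new Vertex("source");
        Vertex map = new Vertex("map", source);
        List<Vertex> topo = Arrays.asList(source, map);
        System.out.println(initializeVerticesIfPossible(topo, 4)); // prints [source]
        source.finished = true;
        System.out.println(initializeVerticesIfPossible(topo, 4)); // prints [map]
    }
}
```

The point of the sketch is that the method is called repeatedly: a downstream vertex is skipped in the first round and picked up only after its producer has finished.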
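4.2. maybeSetParallelism
This method sets the parallelism and is the most important one in the scheduler.
First, if the jobVertex already has a parallelism (in practice, one set on the operator in user code), it returns immediately:
if (jobVertex.isParallelismDecided()) {
    return;
}
where isParallelismDecided is:
public boolean isParallelismDecided() {
    return parallelismInfo.getParallelism() > 0;
}
Next it fetches the consumed result information, because deriving the parallelism requires runtime information from the job:
Optional<List<BlockingResultInfo>> consumedResultsInfo =
        tryGetConsumedResultsInfo(jobVertex);
if (!consumedResultsInfo.isPresent()) {
    return;
}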
After that the parallelism is decided. This involves the ForwardGroup concept: a group of vertices whose upstream and downstream parallelism must stay the same. If the group already has a parallelism, the group's parallelism is used:
if (forwardGroup != null && forwardGroup.isParallelismDecided()) {
    parallelism = forwardGroup.getParallelism();
    log.info(
            "Parallelism of JobVertex: {} ({}) is decided to be {} according to forward group's parallelism.",
            jobVertex.getName(),
            jobVertex.getJobVertexId(),
            parallelism);
}
Otherwise the parallelism is derived by vertexParallelismDecider and, if the vertex belongs to a forward group, recorded on the group:
else {
    parallelism =
            vertexParallelismDecider.decideParallelismForVertex(consumedResultsInfo.get());
    if (forwardGroup != null) {
        forwardGroup.setParallelism(parallelism);
    }
    log.info(
            "Parallelism of JobVertex: {} ({}) is decided to be {}.",
            jobVertex.getName(),
            jobVertex.getJobVertexId(),
            parallelism);
}
Finally the parallelism is applied:
changeJobVertexParallelism(jobVertex, parallelism);
Applying it is just a matter of calling setters:
private void changeJobVertexParallelism(ExecutionJobVertex jobVertex, int parallelism) {
    // update the JSON Plan, it's needed to enable REST APIs to return the latest parallelism of
    // job vertices
    jobVertex.getJobVertex().setParallelism(parallelism);
    try {
        getExecutionGraph().setJsonPlan(JsonPlanGenerator.generatePlan(getJobGraph()));
    } catch (Throwable t) {
        log.warn("Cannot create JSON plan for job", t);
        // give the graph an empty plan
        getExecutionGraph().setJsonPlan("{}");
    }
    jobVertex.setParallelism(parallelism);
}
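The forward-group rule in this method can be isolated in a short sketch. The ForwardGroup class below is a hypothetical stand-in that mirrors only the three methods used in the excerpt, and an IntSupplier takes the place of vertexParallelismDecider.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of "reuse the group's parallelism, otherwise decide and record it".
final class ForwardGroupSketch {

    /** Minimal stand-in for a forward group: one shared parallelism value. */
    static final class ForwardGroup {
        private int parallelism = -1; // -1 means "not decided yet"

        boolean isParallelismDecided() {
            return parallelism > 0;
        }

        int getParallelism() {
            return parallelism;
        }

        void setParallelism(int parallelism) {
            this.parallelism = parallelism;
        }
    }

    /** `group` may be null when the vertex does not belong to any forward group. */
    static int resolveParallelism(ForwardGroup group, IntSupplier decider) {
        if (group != null && group.isParallelismDecided()) {
            // Another vertex of the group was decided first: reuse its value.
            return group.getParallelism();
        }
        int parallelism = decider.getAsInt();
        if (group != null) {
            // Record the decision so later vertices of the group follow it.
            group.setParallelism(parallelism);
        }
        return parallelism;
    }

    public static void main(String[] args) {
        ForwardGroup group = new ForwardGroup();
        System.out.println(resolveParallelism(group, () -> 8));  // prints 8
        System.out.println(resolveParallelism(group, () -> 32)); // prints 8
    }
}
```

The second call ignores its own decider result because the first vertex of the group has already fixed the value.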
4.3. tryGetConsumedResultsInfo
This method collects the result information. It walks over all consumed results, looks up the producer vertex of each, and gathers the information about that result. The information is returned only when every producer vertex has finished; if any one has not, the method returns Optional.empty():
for (DefaultLogicalResult consumedResult : consumedResults) {
    final ExecutionJobVertex producerVertex =
            getExecutionJobVertex(consumedResult.getProducer().getId());
    if (producerVertex.isFinished()) {
        IntermediateResult intermediateResult =
                getExecutionGraph().getAllIntermediateResults().get(consumedResult.getId());
        checkNotNull(intermediateResult);
        consumableResultInfo.add(
                BlockingResultInfo.createFromIntermediateResult(intermediateResult));
    } else {
        // not all inputs consumable, return Optional.empty()
        return Optional.empty();
    }
}
The execution information of upstream tasks comes from IOMetrics, which records a task's I/O statistics, i.e. how many bytes it produced for each partition:
IOMetrics ioMetrics =
        partition.getProducer().getCurrentExecutionAttempt().getIOMetrics();
checkNotNull(ioMetrics, "IOMetrics should not be null.");
blockingPartitionSizes.add(
        ioMetrics.getNumBytesProducedOfPartitions().get(partition.getPartitionId()));
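The all-or-nothing behaviour of this method is easy to get wrong, so here is a minimal sketch of it. Producer is a hypothetical record of "finished or not" plus the bytes it produced; it is not a Flink type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: sizes are returned only when every producer has finished.
final class ConsumedResultsSketch {

    /** Stand-in for an upstream vertex and the bytes it wrote. */
    static final class Producer {
        final boolean finished;
        final long producedBytes;

        Producer(boolean finished, long producedBytes) {
            this.finished = finished;
            this.producedBytes = producedBytes;
        }
    }

    static Optional<List<Long>> tryGetConsumedBytes(List<Producer> producers) {
        List<Long> sizes = new ArrayList<>();
        for (Producer producer : producers) {
            if (!producer.finished) {
                // One unfinished input is enough to postpone the decision.
                return Optional.empty();
            }
            sizes.add(producer.producedBytes);
        }
        // A vertex without inputs yields a present but empty list.
        return Optional.of(sizes);
    }

    public static void main(String[] args) {
        List<Producer> inputs =
                Arrays.asList(new Producer(true, 100L), new Producer(false, 0L));
        System.out.println(tryGetConsumedBytes(inputs).isPresent()); // prints false
    }
}
```

Note the difference between an empty Optional (inputs not ready, try again later) and a present but empty list (no inputs at all, which is the source-vertex case).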
4.4. vertexParallelismDecider.decideParallelismForVertex
This derives the parallelism; the implementation class is DefaultVertexParallelismDecider.
If there are no upstream statistics (a source vertex), the default parallelism is used, i.e. the value of jobmanager.adaptive-batch-scheduler.default-source-parallelism:
if (consumedResults.isEmpty()) {
    // source job vertex
    return defaultSourceParallelism;
}
To compute the parallelism, it first sums the upstream broadcast and non-broadcast data volumes separately:
long broadcastBytes =
        consumedResults.stream()
                .filter(BlockingResultInfo::isBroadcast)
                .mapToLong(
                        consumedResult ->
                                consumedResult.getBlockingPartitionSizes().stream()
                                        .reduce(0L, Long::sum))
                .sum();
long nonBroadcastBytes =
        consumedResults.stream()
                .filter(consumedResult -> !consumedResult.isBroadcast())
                .mapToLong(
                        consumedResult ->
                                consumedResult.getBlockingPartitionSizes().stream()
                                        .reduce(0L, Long::sum))
                .sum();
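Next it computes the maximum expected broadcast volume. dataVolumePerTask is the value of jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task, i.e. the amount of data the user expects each task to process, and CAP_RATIO_OF_BROADCAST is 0.5 (the per-task budget is shared between broadcast and non-broadcast data).
If the broadcast volume computed above exceeds this cap, it is reduced to the cap. The cap is computed as:
long expectedMaxBroadcastBytes =
        (long) Math.ceil((dataVolumePerTask * CAP_RATIO_OF_BROADCAST));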
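Then the parallelism is computed and normalized to a power of 2. Every task receives the full broadcast data, so only dataVolumePerTask - broadcastBytes of its budget is left for non-broadcast data:
int initialParallelism =
        (int) Math.ceil((double) nonBroadcastBytes / (dataVolumePerTask - broadcastBytes));
int parallelism = normalizeParallelism(initialParallelism);
Finally, the result is compared with jobmanager.adaptive-batch-scheduler.max-parallelism and jobmanager.adaptive-batch-scheduler.min-parallelism; if it falls outside that range, the corresponding bound is used.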
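The whole derivation can be put together in one runnable sketch. The class and method names are hypothetical. The normalize step is an assumption: it rounds to the nearer power of 2, which is one rule that reproduces the 0.75 to 1.5 range quoted in the configuration section, and the excerpt above does not show how normalizeParallelism is actually implemented.

```java
// Hypothetical end-to-end sketch of the parallelism derivation; not Flink's class.
final class ParallelismDeciderSketch {

    static final double CAP_RATIO_OF_BROADCAST = 0.5;

    static int decide(
            long broadcastBytes,
            long nonBroadcastBytes,
            long dataVolumePerTask,
            int minParallelism,
            int maxParallelism) {
        // Broadcast data may use at most half of the per-task budget.
        long expectedMaxBroadcastBytes =
                (long) Math.ceil(dataVolumePerTask * CAP_RATIO_OF_BROADCAST);
        long cappedBroadcast = Math.min(broadcastBytes, expectedMaxBroadcastBytes);

        // Each task reads all broadcast data plus its share of the rest.
        int initial =
                (int) Math.ceil(
                        (double) nonBroadcastBytes / (dataVolumePerTask - cappedBroadcast));

        // Clamp the normalized value to the configured bounds.
        return Math.max(minParallelism, Math.min(maxParallelism, normalize(initial)));
    }

    /** Assumed rule: round to the nearer power of 2, ties going up. */
    static int normalize(int parallelism) {
        if (parallelism <= 1) {
            return 1;
        }
        int down = Integer.highestOneBit(parallelism);
        int up = (down == parallelism) ? down : down << 1;
        return parallelism < (up + down) / 2 ? down : up;
    }

    public static void main(String[] args) {
        long mib = 1L << 20;
        // 10 GiB of input, 1 GiB per task, no broadcast: 10 is normalized to 8.
        System.out.println(decide(0L, 10240 * mib, 1024 * mib, 1, 128)); // prints 8
        // 800 MiB broadcast is capped at 512 MiB: 20 is normalized to 16.
        System.out.println(decide(800 * mib, 10240 * mib, 1024 * mib, 1, 128)); // prints 16
    }
}
```

In the first case each task ends up with 10240 / 8 = 1280 MiB, i.e. 1.25 times the configured average, which sits inside the 0.75 to 1.5 range.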
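5. Performance tuning
1. Use Sort Shuffle and set taskmanager.network.memory.buffers-per-channel to 0. This decouples network memory from parallelism and avoids the "Insufficient number of network buffers" error.
2. Do not set jobmanager.adaptive-batch-scheduler.max-parallelism too high. A large value increases the number of upstream subpartitions, which adds shuffle and network transfer overhead.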