This article continues from the previous one; I recommend reading "Flink Elastic Scaling - Reactive Mode (Part 1)" before reading this part.
IV. How the AdaptiveScheduler Works
1. The classes involved in the Reactive-mode scheduler
2. The AdaptiveScheduler is itself a state machine
3. Transitions between the AdaptiveScheduler's execution states
(WaitingForResources -> CreatingExecutionGraph)
org.apache.flink.runtime.scheduler.adaptive.WaitingForResources.java
private void createExecutionGraphWithAvailableResources() {
    context.goToCreatingExecutionGraph();
}
This code calls the AdaptiveScheduler's goToCreatingExecutionGraph method, moving the scheduler into the state that creates the execution graph.
org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.java
@Override
public void goToCreatingExecutionGraph() {
    final CompletableFuture<CreatingExecutionGraph.ExecutionGraphWithVertexParallelism>
            executionGraphWithAvailableResourcesFuture =
                    createExecutionGraphWithAvailableResourcesAsync();

    transitionToState(
            new CreatingExecutionGraph.Factory(
                    this, executionGraphWithAvailableResourcesFuture, LOG));
}
Here, createExecutionGraphWithAvailableResourcesAsync rebuilds the execution graph, and the scheduler then transitions to the CreatingExecutionGraph state. While rebuilding the execution graph, the parallelism of every job vertex in the JobGraph is recomputed from the resources currently available in the cluster. For example, if the cluster has 10 slots and the JobGraph contains 2 slot sharing groups, each group can be given 5 slots, so its parallelism is 5. Once computed, the parallelism of the job vertices in the JobGraph is updated, a new ExecutionGraph is generated, the state of the last Checkpoint is restored, and execution continues in the CreatingExecutionGraph state.
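As a rough illustration of the parallelism calculation described above (this is not the actual Flink implementation, which lives in the SlotAllocator machinery; the method name here is made up):

```java
// Hypothetical sketch: dividing the available slots evenly across the
// slot sharing groups yields the per-group parallelism. NOT real Flink code.
public class ParallelismSketch {

    /** Each slot sharing group gets an equal share of the available slots. */
    static int parallelismPerGroup(int availableSlots, int slotSharingGroups) {
        if (slotSharingGroups <= 0) {
            throw new IllegalArgumentException("need at least one slot sharing group");
        }
        return availableSlots / slotSharingGroups;
    }

    public static void main(String[] args) {
        // The example from the text: 10 slots, 2 slot sharing groups -> parallelism 5.
        System.out.println(parallelismPerGroup(10, 2)); // prints 5
    }
}
```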
org.apache.flink.runtime.scheduler.adaptive.CreatingExecutionGraph.java
private void handleExecutionGraphCreation(
        @Nullable ExecutionGraphWithVertexParallelism executionGraphWithVertexParallelism,
        @Nullable Throwable throwable) {
    if (throwable != null) {
        log.info(
                "Failed to go from {} to {} because the ExecutionGraph creation failed.",
                CreatingExecutionGraph.class.getSimpleName(),
                Executing.class.getSimpleName(),
                throwable);
        context.goToFinished(context.getArchivedExecutionGraph(JobStatus.FAILED, throwable));
    } else {
        final AssignmentResult result =
                context.tryToAssignSlots(executionGraphWithVertexParallelism);

        if (result.isSuccess()) {
            log.debug(
                    "Successfully reserved and assigned the required slots for the ExecutionGraph.");
            context.goToExecuting(result.getExecutionGraph());
        } else {
            log.debug(
                    "Failed to reserve and assign the required slots. Waiting for new resources.");
            context.goToWaitingForResources();
        }
    }
}
(CreatingExecutionGraph -> Executing)
During the CreatingExecutionGraph phase, the handleExecutionGraphCreation method is invoked. It first checks whether anything went wrong while the graph was being created; if so, the scheduler transitions to the Finished state. If everything is fine, slots are assigned to the execution graph and the scheduler transitions to the Executing state; if the required slots cannot be reserved, it goes back to the WaitingForResources state.
org.apache.flink.runtime.scheduler.adaptive.Executing.java
Executing(
        ExecutionGraph executionGraph,
        ExecutionGraphHandler executionGraphHandler,
        OperatorCoordinatorHandler operatorCoordinatorHandler,
        Logger logger,
        Context context,
        ClassLoader userCodeClassLoader) {
    super(context, executionGraph, executionGraphHandler, operatorCoordinatorHandler, logger);
    this.context = context;
    this.userCodeClassLoader = userCodeClassLoader;
    Preconditions.checkState(
            executionGraph.getState() == JobStatus.RUNNING, "Assuming running execution graph");
    deploy();
    ...
}
While the Executing state is being constructed, the deploy method is called to deploy the executionGraph.
org.apache.flink.runtime.scheduler.adaptive.Executing.java
private void deploy() {
    for (ExecutionJobVertex executionJobVertex :
            getExecutionGraph().getVerticesTopologically()) {
        for (ExecutionVertex executionVertex : executionJobVertex.getTaskVertices()) {
            if (executionVertex.getExecutionState() == ExecutionState.CREATED
                    || executionVertex.getExecutionState() == ExecutionState.SCHEDULED) {
                deploySafely(executionVertex);
            }
        }
    }
}

private void deploySafely(ExecutionVertex executionVertex) {
    try {
        executionVertex.deploy();
    } catch (JobException e) {
        handleDeploymentFailure(executionVertex, e);
    }
}
The deploy method contains two nested loops. The outer loop iterates over the execution job vertices of the execution graph (the nodes of the graph shown in the Flink web UI); the inner loop iterates over each job vertex's subtasks (for example, a vertex with parallelism 5 has 5 subtasks). For each subtask, the ExecutionVertex's deploy method is invoked.
org.apache.flink.runtime.executiongraph.ExecutionVertex.java / Execution.java
// ExecutionVertex#deploy delegates to the current execution attempt.
public void deploy() throws JobException {
    currentExecution.deploy();
}

// Execution#deploy builds the deployment descriptor and submits the task.
public void deploy() throws JobException {
    ...
    final TaskDeploymentDescriptor deployment =
            TaskDeploymentDescriptorFactory.fromExecutionVertex(vertex, attemptNumber)
                    .createDeploymentDescriptor(
                            slot.getAllocationId(),
                            taskRestore,
                            producedPartitions.values());
    ...
    CompletableFuture.supplyAsync(
                    () -> taskManagerGateway.submitTask(deployment, rpcTimeout), executor)
            .thenCompose(Function.identity())
    ...
    } catch (Throwable t) {
        markFailed(t);
    }
}
Here, the subtask to be deployed is wrapped in a TaskDeploymentDescriptor object and then submitted to a specific TaskManager for execution via taskManagerGateway.submitTask.
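One detail worth noting in the snippet above is the `thenCompose(Function.identity())` idiom: `supplyAsync` returns a `CompletableFuture<CompletableFuture<...>>` because the supplied call (submitTask) itself returns a future, and composing with the identity function flattens the nesting. A standalone sketch of that pattern (the `remoteCall` name is our own stand-in):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class FlattenFutureDemo {

    // Stand-in for an RPC that itself returns a future,
    // like taskManagerGateway.submitTask in the snippet above.
    static CompletableFuture<String> remoteCall() {
        return CompletableFuture.completedFuture("ack");
    }

    static String run() {
        // supplyAsync(FlattenFutureDemo::remoteCall) yields a
        // CompletableFuture<CompletableFuture<String>>;
        // thenCompose(Function.identity()) flattens it into a single future.
        CompletableFuture<String> flat =
                CompletableFuture.supplyAsync(FlattenFutureDemo::remoteCall)
                        .thenCompose(Function.identity());
        return flat.join();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints ack
    }
}
```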
4. The sections above walked through the AdaptiveScheduler's whole journey from creation, to the start of scheduling, to job execution. The part of the AdaptiveScheduler that manages the job lifecycle is its implementation of the SchedulerNG interface (the interface every other scheduler must implement as well; it is the core of a scheduler).
org.apache.flink.runtime.scheduler.SchedulerNG.java
public interface SchedulerNG extends AutoCloseableAsync {

    void startScheduling();

    void cancel();
}
startScheduling is the entry point through which the scheduler begins scheduling a job;
cancel is the entry point through which the scheduler cancels a running job.
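To make the lifecycle concrete, here is a minimal sketch of a class implementing those two entry points. This is not real Flink code: the interface is trimmed down to just the two methods quoted above, and the state strings are invented for illustration.

```java
// Simplified stand-in for org.apache.flink.runtime.scheduler.SchedulerNG,
// reduced to the two lifecycle methods discussed in the text.
interface SchedulerNG {
    void startScheduling();
    void cancel();
}

public class LoggingScheduler implements SchedulerNG {

    private String state = "CREATED";

    @Override
    public void startScheduling() {
        // A real scheduler would transition into its first scheduling state here
        // (for the AdaptiveScheduler, WaitingForResources).
        state = "SCHEDULING";
    }

    @Override
    public void cancel() {
        state = "CANCELED";
    }

    public String getState() {
        return state;
    }

    public static void main(String[] args) {
        LoggingScheduler scheduler = new LoggingScheduler();
        scheduler.startScheduling();
        System.out.println(scheduler.getState()); // prints SCHEDULING
    }
}
```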
5. The mechanism by which the AdaptiveScheduler achieves elastic scaling
org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.java
private void newResourcesAvailable(Collection<? extends PhysicalSlot> physicalSlots) {
    state.tryRun(
            ResourceConsumer.class,
            ResourceConsumer::notifyNewResourcesAvailable,
            "newResourcesAvailable");
}
When a new TaskManager joins the cluster, the newResourcesAvailable method is invoked. It calls the tryRun method on the scheduler's current state to decide what happens next; in essence this invokes the current state's notifyNewResourcesAvailable method. Only two states, Executing and WaitingForResources, implement ResourceConsumer::notifyNewResourcesAvailable, which means the scheduler reacts to resource-scaling events only while it is in one of these two states.
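The dispatch pattern behind tryRun can be sketched as follows: the callback is delivered only if the current state object implements the required capability interface, otherwise it is silently ignored. This is a simplified reconstruction, not the actual Flink classes:

```java
import java.util.function.Consumer;

// Hypothetical sketch of the tryRun dispatch pattern used by the
// AdaptiveScheduler's state machine. Names mirror the Flink code,
// but the classes here are heavily simplified for illustration.
public class StateDispatchDemo {

    interface State {}

    interface ResourceConsumer {
        void notifyNewResourcesAvailable();
    }

    // Executing reacts to new resources; Finished does not implement the interface.
    static class Executing implements State, ResourceConsumer {
        boolean notified = false;

        @Override
        public void notifyNewResourcesAvailable() {
            notified = true;
        }
    }

    static class Finished implements State {}

    /** Run the action only when the current state has the required capability. */
    static <T> void tryRun(State current, Class<T> clazz, Consumer<T> action) {
        if (clazz.isInstance(current)) {
            action.accept(clazz.cast(current));
        }
    }

    public static void main(String[] args) {
        Executing executing = new Executing();
        tryRun(executing, ResourceConsumer.class,
                ResourceConsumer::notifyNewResourcesAvailable);
        System.out.println(executing.notified); // prints true

        // A Finished state simply ignores the notification.
        tryRun(new Finished(), ResourceConsumer.class,
                ResourceConsumer::notifyNewResourcesAvailable);
    }
}
```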
V. Further Reading: An Introduction to Finite State Machines
A state machine can be reduced to four elements: the current state, the condition, the action, and the next state. The current state and the condition are the cause; the action and the next state are the effect. In detail:
① Current state: the state the machine is presently in.
② Condition: also called an "event". When a condition is satisfied, it triggers an action and/or a state transition.
③ Action: what is executed once the condition is met. After the action completes, the machine may move to a new state or remain in the current one. An action is optional; when the condition is met, the machine may also transition directly to the new state without performing any action.
④ Next state: the new state to move to once the condition is met. "Next" is relative to the current state; once the next state is activated, it becomes the new current state.
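The four elements above can be sketched as a tiny FSM. The classic turnstile is our own illustrative choice of domain, unrelated to Flink:

```java
// Tiny FSM illustrating the four elements: current state, event (condition),
// action, next state. A locked turnstile unlocks on COIN; an unlocked one
// relocks on PUSH; any other event leaves the state unchanged.
public class TurnstileFsm {

    enum State { LOCKED, UNLOCKED }
    enum Event { COIN, PUSH }

    private State state = State.LOCKED; // the current state

    /** When the condition (event) is met, perform the action and move to the next state. */
    State fire(Event event) {
        switch (state) {
            case LOCKED:
                if (event == Event.COIN) {
                    state = State.UNLOCKED; // action: unlock; next state becomes current
                }
                break;
            case UNLOCKED:
                if (event == Event.PUSH) {
                    state = State.LOCKED;   // action: relock; next state becomes current
                }
                break;
        }
        return state;
    }

    public static void main(String[] args) {
        TurnstileFsm fsm = new TurnstileFsm();
        System.out.println(fsm.fire(Event.COIN)); // prints UNLOCKED
        System.out.println(fsm.fire(Event.PUSH)); // prints LOCKED
    }
}
```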