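When business code calls env.enableCheckpointing(), Checkpoint is turned on. The resulting CheckpointConfig is carried into the StreamGraph and then into the JobGraph. When the JobGraph is submitted to the cluster, the ExecutionGraph is built according to that Checkpoint configuration.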
/**
 * Initializes the ExecutionGraph.
 */
public static ExecutionGraph buildGraph(
        @Nullable ExecutionGraph prior,
        JobGraph jobGraph,
        Configuration jobManagerConfig,
        ScheduledExecutorService futureExecutor,
        Executor ioExecutor,
        SlotProvider slotProvider,
        ClassLoader classLoader,
        CheckpointRecoveryFactory recoveryFactory,
        Time rpcTimeout,
        RestartStrategy restartStrategy,
        MetricGroup metrics,
        BlobWriter blobWriter,
        Time allocationTimeout,
        Logger log,
        ShuffleMaster<?> shuffleMaster,
        JobMasterPartitionTracker partitionTracker,
        FailoverStrategy.Factory failoverStrategyFactory) throws JobExecutionException, JobException {
    // Some code omitted...
    final ExecutionGraph executionGraph;
    try {
        // Create the ExecutionGraph, or reuse the prior one if it was passed in
        executionGraph = (prior != null) ? prior :
            new ExecutionGraph(
                jobInformation,
                futureExecutor,
                ioExecutor,
                rpcTimeout,
                restartStrategy,
                maxPriorAttemptsHistoryLength,
                failoverStrategyFactory,
                slotProvider,
                classLoader,
                blobWriter,
                allocationTimeout,
                partitionReleaseStrategyFactory,
                shuffleMaster,
                partitionTracker,
                jobGraph.getScheduleMode());
    } catch (IOException e) {
        throw new JobException("Could not create the ExecutionGraph.", e);
    }
    // Some code omitted...
    // Read the Checkpoint settings from the JobGraph
    // (converted from the CheckpointConfig held by the StreamGraph)
    JobCheckpointingSettings snapshotSettings = jobGraph.getCheckpointingSettings();
    // Checkpoint is enabled only if the settings are non-null
    if (snapshotSettings != null) {
        // The Source vertices: the CheckpointCoordinator actively triggers Checkpoints on these
        List<ExecutionJobVertex> triggerVertices =
            idToVertex(snapshotSettings.getVerticesToTrigger(), executionGraph);
        // All vertices of the JobGraph: every one of them must send back an ack
        List<ExecutionJobVertex> ackVertices =
            idToVertex(snapshotSettings.getVerticesToAcknowledge(), executionGraph);
        // All vertices of the JobGraph: every one of them is notified (confirmation)
        // once the Checkpoint has completed successfully
        List<ExecutionJobVertex> confirmVertices =
            idToVertex(snapshotSettings.getVerticesToConfirm(), executionGraph);
        // Some code omitted, including the part that creates the StateBackend...
        // Enable Checkpoint
        executionGraph.enableCheckpointing(
            chkConfig,
            triggerVertices,
            ackVertices,
            confirmVertices,
            hooks,
            checkpointIdCounter,
            completedCheckpoints,
            rootBackend,
            checkpointStatsTracker);
    }
    // Some code omitted...
    return executionGraph;
}
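To make the three vertex lists concrete, here is a minimal sketch in plain Java with no Flink dependency. The `Vertex` class, the method names, and the sample pipeline are all hypothetical stand-ins, not Flink types. It assumes, as the comments above describe, that the vertices to trigger are the Sources (vertices without inputs) while every vertex must acknowledge and later be notified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CheckpointVertexSets {

    /** Hypothetical stand-in for a job vertex: a name plus the names of its upstream vertices. */
    static final class Vertex {
        final String name;
        final List<String> inputs;

        Vertex(String name, String... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    /** Vertices without inputs are Sources; only these receive trigger messages. */
    static List<String> verticesToTrigger(List<Vertex> job) {
        List<String> result = new ArrayList<>();
        for (Vertex v : job) {
            if (v.inputs.isEmpty()) {
                result.add(v.name);
            }
        }
        return result;
    }

    /** Every vertex must acknowledge a Checkpoint and is later told that it completed. */
    static List<String> verticesToAckAndConfirm(List<Vertex> job) {
        List<String> result = new ArrayList<>();
        for (Vertex v : job) {
            result.add(v.name);
        }
        return result;
    }

    /** A three-vertex pipeline (assumed for illustration): source -> map -> sink. */
    static List<Vertex> sampleJob() {
        return Arrays.asList(
                new Vertex("source"),
                new Vertex("map", "source"),
                new Vertex("sink", "map"));
    }

    public static void main(String[] args) {
        List<Vertex> job = sampleJob();
        System.out.println("trigger: " + verticesToTrigger(job));
        System.out.println("ack/confirm: " + verticesToAckAndConfirm(job));
    }
}
```

For the sample pipeline, only `source` lands in the trigger list, while all three vertices land in the ack/confirm list. The downstream vertices are not triggered directly because they react to what flows in from upstream.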
The ExecutionGraph now holds the Checkpoint configuration. Next, look at how the ExecutionGraph actually enables Checkpoint.
/**
 * Enables Checkpoint for the job. Must be called while the job is still in the CREATED state.
 */
public void enableCheckpointing(
        CheckpointCoordinatorConfiguration chkConfig,
        List<ExecutionJobVertex> verticesToTrigger,
        List<ExecutionJobVertex> verticesToWaitFor,
        List<ExecutionJobVertex> verticesToCommitTo,
        List<MasterTriggerRestoreHook<?>> masterHooks,
        CheckpointIDCounter checkpointIDCounter,
        CompletedCheckpointStore checkpointStore,
        StateBackend checkpointStateBackend,
        CheckpointStatsTracker statsTracker) {
    checkState(state == JobStatus.CREATED, "Job must be in CREATED state");
    checkState(checkpointCoordinator == null, "checkpointing already enabled");
    // Convert the Source vertices, the vertices that must send an ACK, and the vertices
    // that must be notified of a completed Checkpoint into ExecutionVertex[] arrays.
    // Each ExecutionVertex represents one SubTask of an ExecutionJobVertex.
    ExecutionVertex[] tasksToTrigger = collectExecutionVertices(verticesToTrigger);
    ExecutionVertex[] tasksToWaitFor = collectExecutionVertices(verticesToWaitFor);
    ExecutionVertex[] tasksToCommitTo = collectExecutionVertices(verticesToCommitTo);
    checkpointStatsTracker = checkNotNull(statsTracker, "CheckpointStatsTracker");
    // Handles failure tolerance while Checkpoints are running
    CheckpointFailureManager failureManager = new CheckpointFailureManager(
        chkConfig.getTolerableCheckpointFailureNumber(),
        new CheckpointFailureManager.FailJobCallback() {
            @Override
            public void failJob(Throwable cause) {
                getJobMasterMainThreadExecutor().execute(() -> failGlobal(cause));
            }

            @Override
            public void failJobDueToTaskFailure(Throwable cause, ExecutionAttemptID failingTask) {
                getJobMasterMainThreadExecutor().execute(() -> failGlobalIfExecutionIsStillRunning(cause, failingTask));
            }
        }
    );
    checkState(checkpointCoordinatorTimer == null);
    // A single-threaded ScheduledExecutorService that schedules and runs
    // the asynchronous, timer-driven Checkpoint work
    checkpointCoordinatorTimer = Executors.newSingleThreadScheduledExecutor(
        new DispatcherThreadFactory(
            Thread.currentThread().getThreadGroup(), "Checkpoint Timer"));
    // Initialize the CheckpointCoordinator: it coordinates and manages the job's Checkpoints
    // and collects information such as how each Task's Checkpoint went
    checkpointCoordinator = new CheckpointCoordinator(
        jobInformation.getJobId(),
        chkConfig,
        tasksToTrigger,
        tasksToWaitFor,
        tasksToCommitTo,
        checkpointIDCounter,
        checkpointStore,
        checkpointStateBackend,
        ioExecutor,
        new ScheduledExecutorServiceAdapter(checkpointCoordinatorTimer),
        SharedStateRegistry.DEFAULT_FACTORY,
        failureManager);
    // Some code omitted...
    // Periodic Checkpoint is switched on only when the developer configured a valid
    // interval, i.e. the interval is not Long.MAX_VALUE. In that case a JobStatusListener,
    // which observes Job status changes, is registered on the JobManager side.
    // As soon as it sees the Job enter RUNNING, it starts the CheckpointCoordinator,
    // which then makes the Source vertices trigger Checkpoints.
    if (chkConfig.getCheckpointInterval() != Long.MAX_VALUE) {
        // Register the listener. From here on, the CheckpointCoordinator is started and
        // stopped according to the Job status; entering RUNNING starts it.
        registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());
    }
    this.stateBackendName = checkpointStateBackend.getClass().getSimpleName();
}
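The CheckpointFailureManager above is built from a tolerable failure number and a FailJobCallback. The sketch below is a simplified, Flink-free model of that idea, assuming the manager counts consecutive Checkpoint failures, resets the count on success, and invokes the callback once the count exceeds the tolerable number. `ToyFailureManager` and its method names are hypothetical, and the real class is understood to look at more than a bare counter (for example, the kind of failure).

```java
import java.util.concurrent.atomic.AtomicInteger;

class FailureToleranceSketch {

    /** Mirrors the shape of the callback in the excerpt; only failJob is modeled here. */
    interface FailJobCallback {
        void failJob(Throwable cause);
    }

    /** Hypothetical, simplified stand-in for a checkpoint failure manager. */
    static final class ToyFailureManager {
        private final int tolerableFailures;
        private final FailJobCallback callback;
        private final AtomicInteger continuousFailures = new AtomicInteger();

        ToyFailureManager(int tolerableFailures, FailJobCallback callback) {
            this.tolerableFailures = tolerableFailures;
            this.callback = callback;
        }

        /** Counts a failed Checkpoint; fails the job once the tolerance is exceeded. */
        void onCheckpointFailure(Throwable cause) {
            if (continuousFailures.incrementAndGet() > tolerableFailures) {
                continuousFailures.set(0);
                callback.failJob(cause);
            }
        }

        /** A successful Checkpoint breaks the streak of failures. */
        void onCheckpointSuccess() {
            continuousFailures.set(0);
        }
    }

    public static void main(String[] args) {
        AtomicInteger jobFailures = new AtomicInteger();
        ToyFailureManager manager =
                new ToyFailureManager(2, cause -> jobFailures.incrementAndGet());

        manager.onCheckpointFailure(new RuntimeException("declined"));
        manager.onCheckpointFailure(new RuntimeException("expired"));
        manager.onCheckpointSuccess(); // streak reset, the job keeps running
        manager.onCheckpointFailure(new RuntimeException("declined"));
        System.out.println("job failures so far: " + jobFailures.get());
    }
}
```

In the excerpt, the callback does not fail the job inline. It hands the work to the JobMaster main-thread executor, which this sketch leaves out for brevity.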
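As the code shows, enabling Checkpoint in the ExecutionGraph boils down to two things: preparing the CheckpointCoordinator, the component that makes the Source vertices trigger Checkpoints, and registering, on the JobManager side, a JobStatusListener that watches for Job status changes. Once the listener sees the Job enter the RUNNING state, it starts the CheckpointCoordinator, and from that point on the endless stream of Checkpoints officially begins.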
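The listener mechanism can be sketched without Flink as follows. This is a toy model under stated assumptions: `ToyCoordinator`, its `start`/`stop`/`isActive` methods, the trimmed-down `JobStatus` enum, and the 100 ms interval are hypothetical. Only the idea of a listener that starts the periodic trigger on RUNNING and stops it on any other status is taken from the walkthrough above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ActivatorSketch {

    /** Trimmed-down job states, assumed for illustration. */
    enum JobStatus { CREATED, RUNNING, FAILING, FINISHED }

    interface JobStatusListener {
        void jobStatusChanges(JobStatus newStatus);
    }

    /** Hypothetical coordinator that "triggers a Checkpoint" by bumping a counter. */
    static final class ToyCoordinator {
        private final ScheduledExecutorService timer;
        private final long intervalMillis;
        private ScheduledFuture<?> periodicTrigger;
        final AtomicInteger triggered = new AtomicInteger();

        ToyCoordinator(ScheduledExecutorService timer, long intervalMillis) {
            this.timer = timer;
            this.intervalMillis = intervalMillis;
        }

        /** Starts the periodic trigger if it is not already running. */
        synchronized void start() {
            if (periodicTrigger == null) {
                periodicTrigger = timer.scheduleAtFixedRate(
                        () -> triggered.incrementAndGet(),
                        intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
            }
        }

        /** Cancels the periodic trigger if it is running. */
        synchronized void stop() {
            if (periodicTrigger != null) {
                periodicTrigger.cancel(false);
                periodicTrigger = null;
            }
        }

        synchronized boolean isActive() {
            return periodicTrigger != null;
        }

        /** The listener: RUNNING activates the coordinator, anything else deactivates it. */
        JobStatusListener createActivatorDeactivator() {
            return status -> {
                if (status == JobStatus.RUNNING) {
                    start();
                } else {
                    stop();
                }
            };
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        ToyCoordinator coordinator = new ToyCoordinator(timer, 100L);
        JobStatusListener listener = coordinator.createActivatorDeactivator();
        try {
            listener.jobStatusChanges(JobStatus.RUNNING);
            System.out.println("active after RUNNING: " + coordinator.isActive());
            listener.jobStatusChanges(JobStatus.FINISHED);
            System.out.println("active after FINISHED: " + coordinator.isActive());
        } finally {
            timer.shutdownNow(); // release the timer thread so the JVM can exit
        }
    }
}
```

Tying the start and stop of the coordinator to status changes keeps the scheduling logic out of the job's state machine: the graph only announces its status, and the listener decides what that means for Checkpoints.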