Source Code Walkthrough: How Flink Enables Checkpointing

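This article walks through how, in Flink, the ExecutionGraph initializes and enables the CheckpointCoordinator, which coordinates the job's checkpoints, once it has received the checkpoint configuration carried by the JobGraph. When the job's status changes to RUNNING, the CheckpointCoordinator starts working and triggers checkpoints on the Source nodes, which provides fault tolerance and state persistence for the stream processing job.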

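Calling env.enableCheckpointing() in application code turns checkpointing on. The CheckpointConfig is then carried into the StreamGraph and, from there, into the JobGraph. When the JobGraph is submitted to the cluster, the ExecutionGraph is built according to that checkpoint configuration.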

/**
 * Builds (initializes) the ExecutionGraph.
 */
public static ExecutionGraph buildGraph(
    @Nullable ExecutionGraph prior,
    JobGraph jobGraph,
    Configuration jobManagerConfig,
    ScheduledExecutorService futureExecutor,
    Executor ioExecutor,
    SlotProvider slotProvider,
    ClassLoader classLoader,
    CheckpointRecoveryFactory recoveryFactory,
    Time rpcTimeout,
    RestartStrategy restartStrategy,
    MetricGroup metrics,
    BlobWriter blobWriter,
    Time allocationTimeout,
    Logger log,
    ShuffleMaster<?> shuffleMaster,
    JobMasterPartitionTracker partitionTracker,
    FailoverStrategy.Factory failoverStrategyFactory) throws JobExecutionException, JobException {

    // ... (some code omitted)

    final ExecutionGraph executionGraph;
    try {
        // Create the ExecutionGraph, or reuse the prior one if it was passed in
        executionGraph = (prior != null) ? prior :
        new ExecutionGraph(
            jobInformation,
            futureExecutor,
            ioExecutor,
            rpcTimeout,
            restartStrategy,
            maxPriorAttemptsHistoryLength,
            failoverStrategyFactory,
            slotProvider,
            classLoader,
            blobWriter,
            allocationTimeout,
            partitionReleaseStrategyFactory,
            shuffleMaster,
            partitionTracker,
            jobGraph.getScheduleMode());
    } catch (IOException e) {
        throw new JobException("Could not create the ExecutionGraph.", e);
    }

    // ... (some code omitted)

    // Get the checkpoint settings from the JobGraph (converted from the CheckpointConfig in the StreamGraph)
    JobCheckpointingSettings snapshotSettings = jobGraph.getCheckpointingSettings();
    // If the checkpointing settings are not null, enable checkpointing
    if (snapshotSettings != null) {
        // The Source vertices: the CheckpointCoordinator actively triggers checkpoints on these
        List<ExecutionJobVertex> triggerVertices =
            idToVertex(snapshotSettings.getVerticesToTrigger(), executionGraph);

        // All vertices of the job: every one of them must send back an ack
        List<ExecutionJobVertex> ackVertices =
            idToVertex(snapshotSettings.getVerticesToAcknowledge(), executionGraph);

        // All vertices of the job: every one of them must confirm that the checkpoint completed successfully
        List<ExecutionJobVertex> confirmVertices =
            idToVertex(snapshotSettings.getVerticesToConfirm(), executionGraph);

        // ... (some code omitted, including the creation of the StateBackend)

        // Enable checkpointing
        executionGraph.enableCheckpointing(
            chkConfig,
            triggerVertices,
            ackVertices,
            confirmVertices,
            hooks,
            checkpointIdCounter,
            completedCheckpoints,
            rootBackend,
            checkpointStatsTracker);
    }

    // ... (some code omitted)

    return executionGraph;
}
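
To make the split between trigger vertices and ack vertices concrete, here is a minimal sketch in plain Java. It does not use Flink's classes: SketchVertex, VertexLookupSketch, and triggerIds are hypothetical names, and the rule that only source vertices are triggered while every vertex acknowledges is taken from the comments in the code above. The real idToVertex resolves JobVertexIDs against the ExecutionGraph; the map lookup below only imitates that idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a job vertex; not a Flink class. */
final class SketchVertex {
    final String id;
    final boolean isSource; // true if the vertex has no upstream inputs

    SketchVertex(String id, boolean isSource) {
        this.id = id;
        this.isSource = isSource;
    }
}

final class VertexLookupSketch {
    /** Resolves vertex IDs against the graph and fails fast on an unknown ID. */
    static List<SketchVertex> idToVertex(List<String> ids, Map<String, SketchVertex> graph) {
        List<SketchVertex> result = new ArrayList<>(ids.size());
        for (String id : ids) {
            SketchVertex vertex = graph.get(id);
            if (vertex == null) {
                throw new IllegalArgumentException(
                    "Checkpoint settings refer to non-existent vertex " + id);
            }
            result.add(vertex);
        }
        return result;
    }

    /** IDs of the vertices the coordinator triggers: the sources only. */
    static List<String> triggerIds(Map<String, SketchVertex> graph) {
        List<String> ids = new ArrayList<>();
        for (SketchVertex vertex : graph.values()) {
            if (vertex.isSource) {
                ids.add(vertex.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, SketchVertex> graph = new LinkedHashMap<>();
        graph.put("source", new SketchVertex("source", true));
        graph.put("map", new SketchVertex("map", false));
        graph.put("sink", new SketchVertex("sink", false));

        // Only the source is triggered; every vertex has to acknowledge.
        List<SketchVertex> trigger = idToVertex(triggerIds(graph), graph);
        List<SketchVertex> ack = idToVertex(new ArrayList<>(graph.keySet()), graph);
        System.out.println("trigger=" + trigger.size() + ", ack=" + ack.size());
    }
}
```

Failing fast on an unknown ID is a design choice of this sketch: a checkpoint configuration that names a vertex missing from the graph is better rejected at build time than discovered when the first checkpoint is triggered.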

The ExecutionGraph has now been handed the checkpoint configuration. Next, let's look at how the ExecutionGraph actually enables checkpointing.

/**
 * Enables checkpointing for the execution and scheduling of the job.
 */
public void enableCheckpointing(
    CheckpointCoordinatorConfiguration chkConfig,
    List<ExecutionJobVertex> verticesToTrigger,
    List<ExecutionJobVertex> verticesToWaitFor,
    List<ExecutionJobVertex> verticesToCommitTo,
    List<MasterTriggerRestoreHook<?>> masterHooks,
    CheckpointIDCounter checkpointIDCounter,
    CompletedCheckpointStore checkpointStore,
    StateBackend checkpointStateBackend,
    CheckpointStatsTracker statsTracker) {

    checkState(state == JobStatus.CREATED, "Job must be in CREATED state");
    checkState(checkpointCoordinator == null, "checkpointing already enabled");

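    // Convert the Source vertices, the vertices that must send an ACK, and the vertices that must
    // confirm that the checkpoint completed successfully into ExecutionVertex[] arrays.
    // Each ExecutionVertex represents one subtask of an ExecutionJobVertex.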
    ExecutionVertex[] tasksToTrigger = collectExecutionVertices(verticesToTrigger);
    ExecutionVertex[] tasksToWaitFor = collectExecutionVertices(verticesToWaitFor);
    ExecutionVertex[] tasksToCommitTo = collectExecutionVertices(verticesToCommitTo);

    checkpointStatsTracker = checkNotNull(statsTracker, "CheckpointStatsTracker");

    // Handles failures that occur while checkpoints are being taken
    CheckpointFailureManager failureManager = new CheckpointFailureManager(
        chkConfig.getTolerableCheckpointFailureNumber(),
        new CheckpointFailureManager.FailJobCallback() {
            @Override
            public void failJob(Throwable cause) {
                getJobMasterMainThreadExecutor().execute(() -> failGlobal(cause));
            }

            @Override
            public void failJobDueToTaskFailure(Throwable cause, ExecutionAttemptID failingTask) {
                getJobMasterMainThreadExecutor().execute(() -> failGlobalIfExecutionIsStillRunning(cause, failingTask));
            }
        }
    );

    checkState(checkpointCoordinatorTimer == null);

    // Single-threaded ScheduledExecutorService that schedules and runs the checkpoint timer tasks asynchronously
    checkpointCoordinatorTimer = Executors.newSingleThreadScheduledExecutor(
        new DispatcherThreadFactory(
            Thread.currentThread().getThreadGroup(), "Checkpoint Timer"));

    // Initialize the CheckpointCoordinator: it coordinates and manages the job's checkpoints
    // and collects information such as how the checkpoint went on each task.
    checkpointCoordinator = new CheckpointCoordinator(
        jobInformation.getJobId(),
        chkConfig,
        tasksToTrigger,
        tasksToWaitFor,
        tasksToCommitTo,
        checkpointIDCounter,
        checkpointStore,
        checkpointStateBackend,
        ioExecutor,
        new ScheduledExecutorServiceAdapter(checkpointCoordinatorTimer),
        SharedStateRegistry.DEFAULT_FACTORY,
        failureManager);

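    // ... (some code omitted)

    // Periodic checkpointing is switched on only if the configured checkpoint interval is valid
    // (an interval of Long.MAX_VALUE means it is disabled).
    // A JobStatusListener, which is notified of job status changes, is registered on the
    // ExecutionGraph (on the JobManager side). Once it sees the job status become RUNNING,
    // it starts the CheckpointCoordinator, which makes the Source tasks trigger checkpoints.
    if (chkConfig.getCheckpointInterval() != Long.MAX_VALUE) {
        // Register the JobStatusListener. From here on, the CheckpointCoordinator is started
        // and stopped according to the job status; RUNNING starts it.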
        registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());
    }

    this.stateBackendName = checkpointStateBackend.getClass().getSimpleName();
}
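
The failure manager created above receives a tolerable failure number and a callback that fails the job. The sketch below models that idea in plain Java, under the assumption that the tolerance counts consecutive failures and that a successful checkpoint resets the count. FailureCounterSketch and its methods are hypothetical names, and the class is deliberately much simpler than Flink's CheckpointFailureManager.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of a tolerable-failure counter; not Flink's CheckpointFailureManager. */
final class FailureCounterSketch {
    private final int tolerableFailures;
    private final Runnable failJob; // stands in for the failJob callback
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    FailureCounterSketch(int tolerableFailures, Runnable failJob) {
        this.tolerableFailures = tolerableFailures;
        this.failJob = failJob;
    }

    /** Counts a failed checkpoint and fails the job once the tolerance is exceeded. */
    void onCheckpointFailure() {
        if (consecutiveFailures.incrementAndGet() > tolerableFailures) {
            failJob.run();
        }
    }

    /** A successful checkpoint resets the run of consecutive failures (assumption). */
    void onCheckpointSuccess() {
        consecutiveFailures.set(0);
    }

    public static void main(String[] args) {
        FailureCounterSketch counter =
            new FailureCounterSketch(1, () -> System.out.println("failing the job"));
        counter.onCheckpointFailure(); // first failure: within tolerance
        counter.onCheckpointSuccess(); // resets the counter
        counter.onCheckpointFailure(); // within tolerance again
        counter.onCheckpointFailure(); // second consecutive failure: job is failed
    }
}
```

In the code above the callback does not fail the job directly but hands the work to the JobMaster's main thread executor; the Runnable here leaves that threading detail out.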

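As we can see, enabling checkpointing in the ExecutionGraph essentially comes down to two things: preparing the CheckpointCoordinator, which makes the Source tasks trigger checkpoints, and registering, on the JobManager side, a JobStatusListener that is notified of job status changes. As soon as the listener sees the job enter the RUNNING state, it starts the CheckpointCoordinator, and from that point on the steady stream of checkpoints begins.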
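
The listener-driven start and stop described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Flink's implementation: SketchJobStatus, SketchJobStatusListener, and PeriodicCoordinatorSketch are hypothetical names, and the rule "start on RUNNING, stop on any other status" is an assumption that mirrors the description in this article.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

enum SketchJobStatus { CREATED, RUNNING, FAILING, FINISHED }

interface SketchJobStatusListener {
    void jobStatusChanges(SketchJobStatus newStatus);
}

/** Hypothetical coordinator that triggers a checkpoint action periodically. */
final class PeriodicCoordinatorSketch {
    private final ScheduledExecutorService timer;
    private final long intervalMillis;
    private final Runnable triggerCheckpoint;
    private ScheduledFuture<?> periodicTrigger; // null while the scheduler is stopped

    PeriodicCoordinatorSketch(ScheduledExecutorService timer, long intervalMillis,
                              Runnable triggerCheckpoint) {
        this.timer = timer;
        this.intervalMillis = intervalMillis;
        this.triggerCheckpoint = triggerCheckpoint;
    }

    synchronized void startScheduler() {
        stopScheduler(); // make sure at most one periodic trigger exists
        periodicTrigger = timer.scheduleAtFixedRate(
            triggerCheckpoint, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    synchronized void stopScheduler() {
        if (periodicTrigger != null) {
            periodicTrigger.cancel(false);
            periodicTrigger = null;
        }
    }

    synchronized boolean isScheduling() {
        return periodicTrigger != null;
    }

    /** Listener that starts the scheduler on RUNNING and stops it on any other status. */
    SketchJobStatusListener createActivatorDeactivator() {
        return newStatus -> {
            if (newStatus == SketchJobStatus.RUNNING) {
                startScheduler();
            } else {
                stopScheduler();
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger triggered = new AtomicInteger();
        PeriodicCoordinatorSketch coordinator =
            new PeriodicCoordinatorSketch(timer, 100, triggered::incrementAndGet);
        SketchJobStatusListener listener = coordinator.createActivatorDeactivator();

        listener.jobStatusChanges(SketchJobStatus.RUNNING);  // periodic triggering starts
        Thread.sleep(350);
        listener.jobStatusChanges(SketchJobStatus.FINISHED); // periodic triggering stops
        System.out.println("checkpoints triggered: " + triggered.get());
        timer.shutdownNow();
    }
}
```

Keeping the start/stop decision inside a listener means the coordinator never has to poll the job status: it only reacts to transitions, which matches the registration step at the end of enableCheckpointing().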
