A Brief Look at the Flink Checkpoint Source Code

1. Checkpoint scheduling on the JobManager side

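After the dispatcher hands off a job, it starts the corresponding JobMaster. While the JobMaster is being constructed, the JobGraph is converted into an ExecutionGraph. The relevant source: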

// JobMaster class
public JobMaster(
            RpcService rpcService,
            JobMasterConfiguration jobMasterConfiguration,
            ...)
            throws Exception {
    ...
    this.jobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);
    this.schedulerNG = createScheduler(executionDeploymentTracker, jobManagerJobMetricGroup);
    this.jobStatusListener = null;
    ...
}
// SchedulerBase class
public SchedulerBase(
            final Logger log,
            final JobGraph jobGraph,
            final BackPressureStatsTracker backPressureStatsTracker,
            ...)
            throws Exception {
    ...
        this.executionGraph =
                createAndRestoreExecutionGraph(
                        jobManagerJobMetricGroup,
                        checkNotNull(shuffleMaster),
                        checkNotNull(partitionTracker),
                        checkNotNull(executionDeploymentTracker),
                        initializationTimestamp);
    ...
}
private ExecutionGraph createExecutionGraph(JobManagerJobMetricGroup currentJobManagerJobMetricGroup,
            ShuffleMaster<?> shuffleMaster,
            final JobMasterPartitionTracker partitionTracker,
            ExecutionDeploymentTracker executionDeploymentTracker,
            long initializationTimestamp)
            throws JobExecutionException, JobException {
    ...
        return ExecutionGraphBuilder.buildGraph(null,
                jobGraph,
                jobMasterConfiguration,
                ...);
    ...
    
}

createAndRestoreExecutionGraph() calls createExecutionGraph(), which in turn uses ExecutionGraphBuilder to build the ExecutionGraph.

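While the ExecutionGraph is being built (in ExecutionGraphBuilder.buildGraph()), ExecutionGraph.enableCheckpointing() is called. This happens whether or not the job has checkpointing turned on, provided the JobGraph carries checkpointing settings (the snapshotSettings != null check in the listing). enableCheckpointing() creates the CheckpointCoordinator, the core class responsible for checkpointing, and registers a job status listener, CheckpointCoordinatorDeActivator, which starts and stops the periodic checkpoint scheduler. Unlike the coordinator, this listener is registered only when checkpointing is actually enabled. The source: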

// ExecutionGraphBuilder class
public static ExecutionGraph buildGraph(
            @Nullable ExecutionGraph prior,
            JobGraph jobGraph,
            Configuration jobManagerConfig,
            ...)
            throws JobExecutionException, JobException {
    ...
        // configure the state checkpointing
        JobCheckpointingSettings snapshotSettings = jobGraph.getCheckpointingSettings();
        if (snapshotSettings != null) {
            List<ExecutionJobVertex> triggerVertices =
                    idToVertex(snapshotSettings.getVerticesToTrigger(), executionGraph);

            List<ExecutionJobVertex> ackVertices =
                    idToVertex(snapshotSettings.getVerticesToAcknowledge(), executionGraph);

            List<ExecutionJobVertex> confirmVertices =
                    idToVertex(snapshotSettings.getVerticesToConfirm(), executionGraph);
            // a series of checkpoint settings, including the state backend, user-defined hooks, checkpointIdCounter, etc.
            ...
            executionGraph.enableCheckpointing(
                    chkConfig,
                    triggerVertices,
                    ackVertices,
                    confirmVertices,
                    hooks,
                    checkpointIdCounter,
                    completedCheckpoints,
                    rootBackend,
                    checkpointStatsTracker);
        }
    ...
}

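At graph-build time three sets of vertices are fixed: triggerVertices (the vertices that are told to trigger a checkpoint), ackVertices (the vertices that report back that their part of the checkpoint is done), and confirmVertices (the vertices that are notified once the checkpoint has completed).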
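To make the three roles more concrete, here is a minimal sketch of how such lists could be derived. It assumes the rule that, to my understanding, the streaming job graph generator of this Flink generation applies: only vertices without inputs (the sources) are trigger vertices, while every vertex must acknowledge and is notified of completion. `CheckpointVertexSketch` and its nested `Vertex` type are hypothetical stand-ins, not Flink classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Flink code: classifies vertices into the three checkpoint roles.
final class CheckpointVertexSketch {

    // Stand-in for a job vertex; only the properties needed here.
    static final class Vertex {
        final String name;
        final boolean hasInputs;

        Vertex(String name, boolean hasInputs) {
            this.name = name;
            this.hasInputs = hasInputs;
        }
    }

    final List<String> trigger = new ArrayList<>();
    final List<String> ack = new ArrayList<>();
    final List<String> confirm = new ArrayList<>();

    // Assumed rule: sources trigger, all vertices acknowledge and get confirmed.
    static CheckpointVertexSketch classify(List<Vertex> vertices) {
        CheckpointVertexSketch result = new CheckpointVertexSketch();
        for (Vertex v : vertices) {
            if (!v.hasInputs) {
                result.trigger.add(v.name);
            }
            result.ack.add(v.name);
            result.confirm.add(v.name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Vertex> job = Arrays.asList(
                new Vertex("source", false),
                new Vertex("map", true),
                new Vertex("sink", true));
        CheckpointVertexSketch roles = classify(job);
        System.out.println("trigger = " + roles.trigger);
        System.out.println("ack     = " + roles.ack);
        System.out.println("confirm = " + roles.confirm);
    }
}
```

Under this assumption the coordinator only needs to send trigger messages to the sources; downstream operators take their snapshots when the barriers reach them.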

executionGraph.enableCheckpointing() initializes the checkpoint-related classes and registers the checkpoint status listener. When the JobManager starts scheduling the job, the job status moves from CREATED to RUNNING, which is implemented in transitionState(). During this transition the CheckpointCoordinatorDeActivator registered earlier starts the periodic checkpoint task. The call chain is ExecutionGraph.transitionToRunning() -> transitionState() -> notifyJobStatusChange() -> CheckpointCoordinatorDeActivator.jobStatusChanges() -> CheckpointCoordinator.startCheckpointScheduler(). The source:

public void transitionToRunning() {
    if (!transitionState(JobStatus.CREATED, JobStatus.RUNNING)) {
        throw new IllegalStateException(
                "Job may only be scheduled from state " + JobStatus.CREATED);
    }
}

private boolean transitionState(JobStatus current, JobStatus newState, Throwable error) {
    ...
    if (state == current) {
        notifyJobStatusChange(newState, error);
        return true;
    }
    ...
}

private void notifyJobStatusChange(JobStatus newState, Throwable error) {
    if (jobStatusListeners.size() > 0) {
        final long timestamp = System.currentTimeMillis();
        final Throwable serializedError = error == null ? null : new SerializedThrowable(error);

        for (JobStatusListener listener : jobStatusListeners) {
            try {
                listener.jobStatusChanges(getJobID(), newState, timestamp, serializedError);
            } catch (Throwable t) {
                LOG.warn("Error while notifying JobStatusListener", t);
            }
        }
    }
}

// CheckpointCoordinatorDeActivator class
@Override
public void jobStatusChanges(
        JobID jobId, JobStatus newJobStatus, long timestamp, Throwable error) {
    if (newJobStatus == JobStatus.RUNNING) {
        // start the checkpoint scheduler
        coordinator.startCheckpointScheduler();
    } else {
        // anything else should stop the trigger for now
        coordinator.stopCheckpointScheduler();
    }
}
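The call chain above is essentially an observer pattern, and it can be reduced to a small runnable model. The sketch below is not Flink code: `JobStatusListenerSketch`, `Graph`, `Coordinator`, and `activatorFor` are hypothetical names, the listener signature is simplified to a single argument, and the coordinator only records whether its scheduler would be running.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the status-listener chain; names are illustrative only.
final class JobStatusListenerSketch {

    enum JobStatus { CREATED, RUNNING, FAILING, CANCELED }

    interface JobStatusListener {
        void jobStatusChanges(JobStatus newStatus);
    }

    // Stand-in for the coordinator: tracks whether periodic triggering is on.
    static final class Coordinator {
        boolean schedulerRunning = false;

        void startCheckpointScheduler() { schedulerRunning = true; }

        void stopCheckpointScheduler() { schedulerRunning = false; }
    }

    // Stand-in for the execution graph: holds the state and notifies listeners.
    static final class Graph {
        private final List<JobStatusListener> listeners = new ArrayList<>();
        private JobStatus state = JobStatus.CREATED;

        void registerJobStatusListener(JobStatusListener listener) {
            listeners.add(listener);
        }

        // Moves to newState only if the graph is currently in the expected state.
        boolean transitionState(JobStatus current, JobStatus newState) {
            if (state != current) {
                return false;
            }
            state = newState;
            for (JobStatusListener listener : listeners) {
                listener.jobStatusChanges(newState);
            }
            return true;
        }
    }

    // Mirrors the activator/deactivator: RUNNING starts the scheduler, anything else stops it.
    static JobStatusListener activatorFor(Coordinator coordinator) {
        return status -> {
            if (status == JobStatus.RUNNING) {
                coordinator.startCheckpointScheduler();
            } else {
                coordinator.stopCheckpointScheduler();
            }
        };
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        Graph graph = new Graph();
        graph.registerJobStatusListener(activatorFor(coordinator));

        graph.transitionState(JobStatus.CREATED, JobStatus.RUNNING);
        System.out.println("scheduler running after RUNNING: " + coordinator.schedulerRunning);

        graph.transitionState(JobStatus.RUNNING, JobStatus.FAILING);
        System.out.println("scheduler running after FAILING: " + coordinator.schedulerRunning);
    }
}
```

Because the listener stops the scheduler for every status other than RUNNING, no periodic checkpoint is started while a job is failing, restarting, or being cancelled.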

CheckpointCoordinator schedules a timer task that triggers checkpoints periodically; that task is the ScheduledTrigger class.

public void startCheckpointScheduler() {
    synchronized (lock) {
        if (shutdown) {
            throw new IllegalArgumentException("Checkpoint coordinator is shut down");
        }

        // make sure
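The idea of a periodically scheduled trigger can be sketched with the JDK alone. The following is a simplified model under stated assumptions, not the Flink implementation: `PeriodicTriggerSketch` and all of its methods are hypothetical, the "checkpoint" is just a counter increment, and cancelling any earlier timer before scheduling a new one is a design choice of this sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a start/stop-able periodic trigger; not Flink code.
final class PeriodicTriggerSketch {
    private final Object lock = new Object();
    private final long intervalMillis;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "checkpoint-timer-sketch");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

    private ScheduledFuture<?> currentTrigger; // guarded by lock
    private boolean shutdown;                  // guarded by lock
    private int triggered;                     // guarded by lock

    PeriodicTriggerSketch(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    void startScheduler() {
        synchronized (lock) {
            if (shutdown) {
                throw new IllegalArgumentException("Checkpoint coordinator is shut down");
            }
            cancelTrigger(); // sketch-level choice: never keep two timers alive
            currentTrigger = timer.scheduleAtFixedRate(
                    this::trigger, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        }
    }

    void stopScheduler() {
        synchronized (lock) {
            cancelTrigger();
        }
    }

    void shutdown() {
        synchronized (lock) {
            shutdown = true;
            cancelTrigger();
            timer.shutdownNow();
        }
    }

    int triggeredCount() {
        synchronized (lock) {
            return triggered;
        }
    }

    // Polls until at least `expected` triggers have fired or the timeout elapses.
    boolean awaitTriggers(int expected, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            if (triggeredCount() >= expected) {
                return true;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return triggeredCount() >= expected;
    }

    private void cancelTrigger() {
        if (currentTrigger != null) {
            currentTrigger.cancel(false);
            currentTrigger = null;
        }
    }

    // Plays the role of the periodic task; a real one would start a checkpoint.
    private void trigger() {
        synchronized (lock) {
            if (currentTrigger == null) {
                return; // stopped while this run was already queued
            }
            triggered++;
        }
    }

    public static void main(String[] args) {
        PeriodicTriggerSketch sketch = new PeriodicTriggerSketch(20);
        sketch.startScheduler();
        System.out.println("fired at least 3 times: " + sketch.awaitTriggers(3, 2000));
        sketch.stopScheduler();
        sketch.shutdown();
    }
}
```

Holding one lock around start, stop, and the trigger body keeps the sketch free of a race in which a queued run fires after the scheduler has been stopped.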