Flink Source Code Analysis: The Checkpoint Flow

Checkpoint Call Chain Analysis

JobMaster.triggerSavepoint

When JobMaster triggers a savepoint, it starts a checkpoint. schedulerNG is the interface responsible for scheduling Flink jobs.

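    // The method takes three parameters: the savepoint target directory, whether to
    // cancel the job afterwards, and a timeout (which is not passed on to the scheduler).
    @Override
    public CompletableFuture<String> triggerSavepoint(
            @Nullable final String targetDirectory, final boolean cancelJob, final Time timeout) {

        return schedulerNG.triggerSavepoint(targetDirectory, cancelJob);
    }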

Next, let's look at SchedulerBase, one implementation of SchedulerNG.

SchedulerBase.triggerSavepoint

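Main flow:

  1. Get the checkpoint coordinator, checkpointCoordinator, from executionGraph.
  2. Trigger one savepoint.
  3. If the savepoint failed and the job was supposed to be cancelled, restart the checkpoint scheduler (it is stopped before the savepoint is taken) and rethrow the exception.
  4. If the job is supposed to be cancelled and the savepoint succeeded, cancel the job.

    @Override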
    public CompletableFuture<String> triggerSavepoint(
            final String targetDirectory, final boolean cancelJob) {
   
        mainThreadExecutor.assertRunningInMainThread();
        // Get the checkpointCoordinator
        final CheckpointCoordinator checkpointCoordinator =
                executionGraph.getCheckpointCoordinator();
        if (checkpointCoordinator == null) {
   
            throw new IllegalStateException(
                    String.format("Job %s is not a streaming job.", jobGraph.getJobID()));
        } else if (targetDirectory == null
                && !checkpointCoordinator.getCheckpointStorage().hasDefaultSavepointLocation()) {
   
            log.info(
                    "Trying to cancel job {} with savepoint, but no savepoint directory configured.",
                    jobGraph.getJobID());

            throw new IllegalStateException(
                    "No savepoint directory configured. You can either specify a directory "
                            + "while cancelling via -s :targetDirectory or configure a cluster-wide "
                            + "default via key '"
                            + CheckpointingOptions.SAVEPOINT_DIRECTORY.key()
                            + "'.");
        }

        log.info(
                "Triggering {}savepoint for job {}.",
                cancelJob ? "cancel-with-" : "",
                jobGraph.getJobID());
        // If the job is to be cancelled, stop scheduling periodic checkpoints
        if (cancelJob) {
   
            stopCheckpointScheduler();
        }
        // First run the savepoint, which is essentially a checkpoint with aligned barriers,
        // then return the path where the savepoint was stored
        return checkpointCoordinator
                .triggerSavepoint(targetDirectory)
                .thenApply(CompletedCheckpoint::getExternalPointer)
                .handleAsync(
                        (path, throwable) -> {
   
                            if (throwable != null) {
   
                                if (cancelJob) {
   
                                    startCheckpointScheduler();
                                }
                                throw new CompletionException(throwable);
                            } else if (cancelJob) {
   
                                log.info(
                                        "Savepoint stored in {}. Now cancelling {}.",
                                        path,
                                        jobGraph.getJobID());
                                cancel();
                            }
                            return path;
                        },
                        mainThreadExecutor);
    }
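
The stop / savepoint / restart-or-cancel logic above can be sketched without any Flink classes. The following is a minimal illustration, not Flink's implementation: SavepointFlowSketch and its event names are hypothetical, the savepoint is represented by a plain CompletableFuture<String>, and handle is used in place of handleAsync(..., mainThreadExecutor) so that everything runs on the calling thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical stand-in for the scheduler; it only records which actions were taken.
class SavepointFlowSketch {
    final List<String> events = new ArrayList<>();

    CompletableFuture<String> triggerSavepoint(
            CompletableFuture<String> savepointPath, boolean cancelJob) {
        // Periodic checkpoints are paused while a cancel-with-savepoint is in flight.
        if (cancelJob) {
            events.add("stopScheduler");
        }
        return savepointPath.handle(
                (path, throwable) -> {
                    if (throwable != null) {
                        // The savepoint failed, so the job keeps running:
                        // resume periodic checkpoints and propagate the error.
                        if (cancelJob) {
                            events.add("startScheduler");
                        }
                        throw new CompletionException(throwable);
                    } else if (cancelJob) {
                        // The savepoint succeeded, so the job can be cancelled.
                        events.add("cancel");
                    }
                    return path;
                });
    }
}
```

The point of the pattern is that the scheduler is only restarted on the failure path; on success with cancelJob set, the job is cancelled and no further checkpoints are needed.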

CheckpointCoordinator

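CheckpointCoordinator coordinates the distributed snapshots of all operators and their state. It sends messages to the relevant
tasks to trigger a snapshot, then collects their acknowledgements (acks) that the snapshot succeeded.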
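
The "trigger, then collect acks" idea can be illustrated with a small bookkeeping class. This is a deliberately simplified sketch, not Flink's PendingCheckpoint: the class name PendingAckSketch, its methods, and the use of plain strings as task identifiers are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of a checkpoint that is waiting for task acks.
class PendingAckSketch {
    private final Set<String> notYetAcknowledged;

    PendingAckSketch(Set<String> tasksToAck) {
        // Copy the set so that the caller's collection is never mutated.
        this.notYetAcknowledged = new HashSet<>(tasksToAck);
    }

    // Records one task's ack; returns false for unknown or duplicate acks.
    boolean acknowledge(String taskId) {
        return notYetAcknowledged.remove(taskId);
    }

    // The checkpoint can be completed once every expected task has acked.
    boolean isFullyAcknowledged() {
        return notYetAcknowledged.isEmpty();
    }
}
```

In this model a checkpoint stays pending until the last expected task acknowledges it; duplicate or unexpected acks are simply rejected.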

CheckpointCoordinator.createActivatorDeactivator creates a job status listener that watches for changes in the job's status.

// Listens for job status changes in order to start or stop checkpoint scheduling for the job
public JobStatusListener createActivatorDeactivator() {
   
        synchronized (lock) {
   
            if (shutdown) {
   
                throw new IllegalArgumentException("Checkpoint coordinator is shut down");
            }

            if (jobStatusListener == null) {
   
                jobStatusListener = new CheckpointCoordinatorDeActivator(this);
            }

            return jobStatusListener;
        }
    }

JobStatusListener is an interface; its concrete implementation here is CheckpointCoordinatorDeActivator. The CheckpointCoordinatorDeActivator.jobStatusChanges method is as follows:

// When the job status becomes RUNNING, start periodic checkpoint scheduling; any other status stops it
@Override
    public void jobStatusChanges(
            JobID jobId, JobStatus newJobStatus, long timestamp, Throwable error) {
   
        if (newJobStatus == JobStatus.RUNNING) {
   
            // start the checkpoint scheduler
            coordinator.startCheckpointScheduler();
        } else {
   
            // anything else should stop the trigger for now
            coordinator.stopCheckpointScheduler();
        }
    }

Next, let's look at startCheckpointScheduler:

public void startCheckpointScheduler() {
   
        synchronized (lock) {
   
            if (shutdown) {
   
                throw new IllegalArgumentException("Checkpoint coordinator is shut down");
            }
            Preconditions.checkState(
                    isPeriodicCheckpointingConfigured(),
                    "Can not start checkpoint scheduler, if no periodic checkpointing is configured");

            // make sure all prior timers are cancelled
            stopCheckpointScheduler();

            // Create a new periodic trigger with a random initial delay drawn from the range
            // [minimum pause between checkpoints, checkpoint interval + 1), upper bound exclusive
            periodicScheduling = true;
            currentPeriodicTrigger = scheduleTriggerWithDelay(getRandomInitDelay());
        }
    }
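
The body of getRandomInitDelay is not shown in this article. Going by the comment above, it could look roughly like the sketch below; the class and parameter names are hypothetical, and the sketch assumes the minimum pause is never larger than the checkpoint interval (otherwise nextLong would throw an IllegalArgumentException).

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper illustrating the random initial delay described above.
class InitDelaySketch {
    // Returns a delay in [minPauseMillis, baseIntervalMillis].
    // nextLong excludes its upper bound, hence the "+ 1".
    static long randomInitDelay(long minPauseMillis, long baseIntervalMillis) {
        return ThreadLocalRandom.current().nextLong(minPauseMillis, baseIntervalMillis + 1L);
    }
}
```

Randomizing the first delay means that jobs started at the same moment do not all take their first checkpoint at the same instant.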

scheduleTriggerWithDelay starts a timer. The logic that runs on each tick lives in ScheduledTrigger, an inner class of CheckpointCoordinator.

    private ScheduledFuture<?> scheduleTriggerWithDelay(long initDelay) {
        return timer.scheduleAtFixedRate(
                new ScheduledTrigger(), initDelay, baseInterval, TimeUnit.MILLISECONDS);
    }

    private final class ScheduledTrigger implements Runnable {

        @Override
        public void run() {
            try {
                triggerCheckpoint(true);
            } catch (Exception e) {
                LOG.error("Exception while triggering checkpoint for job {}.", job, e);
            }
        }
    }
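
The try/catch inside run() matters when the timer behaves like a java.util.concurrent.ScheduledExecutorService: with scheduleAtFixedRate, an exception escaping one execution suppresses all subsequent executions. The sketch below demonstrates this with a plain JDK executor standing in for Flink's timer; the class name, the simulated failure, and the 10 ms period are assumptions for illustration only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: a periodic trigger that fails on every run but keeps being scheduled.
class PeriodicTriggerSketch {

    // Returns true if the failing trigger was still invoked at least `runs` times.
    static boolean survivesFailures(int runs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(runs);
        Runnable trigger = () -> {
            try {
                throw new IllegalStateException("simulated trigger failure");
            } catch (Exception e) {
                // Swallowing (and, in real code, logging) the exception keeps the
                // periodic task alive; letting it escape would cancel future runs.
            } finally {
                done.countDown();
            }
        };
        ScheduledFuture<?> handle =
                timer.scheduleAtFixedRate(trigger, 10, 10, TimeUnit.MILLISECONDS);
        try {
            return done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            handle.cancel(false);
            timer.shutdownNow();
        }
    }
}
```

Calling PeriodicTriggerSketch.survivesFailures(3) should return true, because each failure is caught inside the Runnable rather than propagated to the executor.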

Following the call chain down from triggerCheckpoint, the request eventually reaches startTriggeringCheckpoint:

private void startTriggeringCheckpoint(CheckpointTriggerRequest request) {
   
        try {
   
            synchronized (lock) {
   
                preCheckGlobalState(request.isPeriodic);
            }

            // we will actually trigger this checkpoint!
            Preconditions.checkState(!isTriggering);
            isTriggering = true;

            final long timestamp = System.currentTimeMillis();

            // Compute the plan for this checkpoint; the plan tells us which tasks must be
            // triggered and which tasks to wait for or commit to
            CompletableFuture<CheckpointPlan> checkpointPlanFuture =
                    checkpointPlanCalculator.calculateCheckpointPlan();

            final CompletableFuture<PendingCheckpoint> pendingCheckpointCompletableFuture =
                    checkpointPlanFuture
                            .thenApplyAsync(
                                    plan -> {
   
                                        try {
   
                                            CheckpointIdAndStorageLocation
                                                    checkpointIdAndStorageLocation =
                                                            initializeCheckpoint(
                                                                    request.props,
                                                                    request.externalSavepointLocation);
                                            return new Tuple2<>(
                                                    plan, checkpointIdAndStorageLocation);
                                        } catch (Throwable e) {
   
                                            throw new CompletionException(e);
                                        }
                                    },
                                    executor)
                            .thenApplyAsync(
                                    (checkpointInfo) ->
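                                            // A pending checkpoint has been started but not yet
                                            // acknowledged by all tasks that need to acknowledge
                                            // it. Once all tasks have acknowledged it, it becomes
                                            // a {@link CompletedCheckpoint}.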
                                            createPendingCheckpoint(
                                                    timestamp,
                                                    request.props,
                                                    checkpointInfo.f0,
                                                    request.isPeriodic,
                                                    checkpointInfo.f1.checkpointId,
                                                    checkpointInfo.f1.checkpointStorageLocation,
                                                    request.getOnCompletionFuture()),
                                    timer);

            final CompletableFuture<?> coordinatorCheckpointsComplete =
                    pendingCheckpointCompletableFuture.thenComposeAsync(
                            (pendingCheckpoint) ->
                                    OperatorCoordinatorCheckpoints
                                            // Trigger and acknowledge all OperatorCoordinator checkpoints
                                            .triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion(
                                                    coordinatorsToCheckpoint,
                                                    pendingCheckpoint,
                                                    timer),
                            timer);

            // Once the coordinator checkpoints have completed, the master hooks are invoked;
            // a MasterHook notifies external systems before a checkpoint is taken or restored.
            // We have to take the snapshot of the master hooks after the coordinator checkpoints
            // has completed.
            // This is to ensure the tasks are checkpointed after the OperatorCoordinators in case
            // ExternallyInducedSource is used.
            final CompletableFuture<?> masterStatesComplete =
                    coordinatorCheckpointsComplete.thenComposeAsync(
                            ignored -> {
   
                                // If the code reaches here, the pending checkpoint is guaranteed to
                                // be not null.
                                // We use FutureUtils.getWithoutException() to make compiler happy
                                // with checked exceptions in the signature.
                                PendingCheckpoint checkpoint =
                                        FutureUtils.getWithoutException(
                                                pendingCheckpointCompletableFuture);
                                return snapshotMasterState(checkpoint);
                            },
                            timer);

            FutureUtils.assertNoException(
                    CompletableFuture.allOf(masterStatesComplete, coordinatorCheckpointsComplete)
                            .handleAsync(
                                    (ignored, throwable) -> {
   
                                        final PendingCheckpoint checkpoint =
                                                FutureUtils.getWithoutException(
                                                        pendingCheckpointCompletableFuture);

                                        Preconditions.checkState(
                                                checkpoint != null || throwable != null,
                                                "Either the pending checkpoint needs to be created or an error must have occurred.");

                                        if (throwable != null) {
   
                                            // the initialization might not be finished yet
                                            if (checkpoint == null) {
   
                                                onTriggerFailure(request, throwable);
                                            } else {
   
                                                onTriggerFailure(checkpoint, throwable);
                                            }
                                        } else {
   
                                            // This is where the checkpoint trigger request is actually sent out
                                            triggerCheckpointRequest(
                                                    request, timestamp, checkpoint);
                                        }
                                        return null;
                                    },
                                    timer)
                            .exceptionally(
                                    error -> {
   
                                        if (!isShutdown()) {
   
                                            throw new CompletionException(error);
                                        } else if (findThrowable(
                                                        error, RejectedExecutionException.class)
                                                .isPresent()) {
   
                                            LOG.debug("Execution rejected during shutdown");
                                        } else {
   
                                            LOG.warn("Error encountered during shutdown", error);
                                        }
                                        return null;
                                    })
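
The staged structure of startTriggeringCheckpoint (compute plan, create the pending checkpoint, checkpoint the operator coordinators, snapshot master state, then either trigger the tasks or report a failure) can be reduced to a small CompletableFuture chain. The sketch below is a simplified illustration, not the Flink code: TriggerPipelineSketch and its event names are hypothetical, strings stand in for the plan and the pending checkpoint, and the synchronous thenApply / thenCompose / handle variants replace the *Async calls that run on Flink's executor and timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical model of the trigger pipeline; it only records the order of the stages.
class TriggerPipelineSketch {
    final List<String> events = new ArrayList<>();

    void trigger(boolean failCoordinatorStage) {
        // Stages 1 and 2: plan -> (plan, id + storage location) -> pending checkpoint.
        CompletableFuture<String> pending =
                CompletableFuture.completedFuture("plan")
                        .thenApply(plan -> {
                            events.add("initialize");
                            return plan + "+location";
                        })
                        .thenApply(info -> {
                            events.add("createPending");
                            return "pending(" + info + ")";
                        });

        // Stage 3: operator coordinator checkpoints, which may fail.
        CompletableFuture<Void> coordinators =
                pending.thenCompose(p -> {
                    events.add("coordinators");
                    CompletableFuture<Void> result = new CompletableFuture<>();
                    if (failCoordinatorStage) {
                        result.completeExceptionally(
                                new IllegalStateException("coordinator checkpoint failed"));
                    } else {
                        result.complete(null);
                    }
                    return result;
                });

        // Stage 4: master hooks run only after the coordinators have succeeded.
        CompletableFuture<Void> masterState =
                coordinators.thenCompose(ignored -> {
                    events.add("masterHooks");
                    return CompletableFuture.<Void>completedFuture(null);
                });

        // Stage 5: one handler decides between triggering the tasks and failing.
        CompletableFuture.allOf(masterState, coordinators)
                .handle((ignored, throwable) -> {
                    events.add(throwable != null ? "onTriggerFailure" : "triggerTasks");
                    return null;
                })
                .join();
    }
}
```

Because every later stage is composed on the earlier one, a failure in the coordinator stage skips the master hooks entirely and lands in the single failure branch, which mirrors the handleAsync block at the end of the excerpt above.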