Flink CheckPoint的触发过程

CheckpointCoordinator的转换及调度

1、转换过程

在Flink JobMaster中有用于协调和触发checkpoint机制的协调管理器CheckpointCoordinator,其是Flink分布式快照的核心管理控制组件,其主要维护的功能如下:

  • 发起checkpoint触发的消息,并接收不同task对checkpoint的响应信息(Ack)
  • 维护Ack中附带的状态句柄(state-handle)的全局视图

其针对checkpoint的配置及最终触发调度主要集中在两个转换过程中:

  1. StreamGraph-->JobGraph的转换流程:在客户端中的streamGraph向jobGraph转换过程中,其主要依赖于StreamingJobGraphGenerator#createJobGraph()方法,在生成JobGraph之后会调用StreamingJobGraphGenerator#configureCheckpointing()方法进行Checkpoint相关的配置。其主要的工作就是读取streamGraph中的checkpoint配置,并将其节点划分为trigger节点、ack节点以及commit节点。在该类内部标识为对应的三个列表:
    1. List<JobVertexID> triggerVertices
    2. List<JobVertexID> ackVertices
    3. List<JobVertexID> commitVertices
      class StreamingJobGraphGenerator {
          private void configureCheckpointing() {
             CheckpointConfig cfg = streamGraph.getCheckpointConfig();
             long interval = cfg.getCheckpointInterval();
             if (interval > 0) {
                ExecutionConfig executionConfig = streamGraph.getExecutionConfig();
                // propagate the expected behaviour for checkpoint errors to task.
                executionConfig.setFailTaskOnCheckpointError(cfg.isFailOnCheckpointingErrors());
             } else {
                // interval of max value means disable periodic checkpoint
                interval = Long.MAX_VALUE;
             }
             //  --- configure the participating vertices ---
          
             // collect the vertices that receive "trigger checkpoint" messages.
             // currently, these are all the sources
             List<JobVertexID> triggerVertices = new ArrayList<>();
          
             // collect the vertices that need to acknowledge the checkpoint
             // currently, these are all vertices
             List<JobVertexID> ackVertices = new ArrayList<>(jobVertices.size());
          
             // collect the vertices that receive "commit checkpoint" messages
             // currently, these are all vertices
             List<JobVertexID> commitVertices = new ArrayList<>(jobVertices.size());
          
             for (JobVertex vertex : jobVertices.values()) {
                if (vertex.isInputVertex()) {
                   triggerVertices.add(vertex.getID());
                }
                commitVertices.add(vertex.getID());
                ackVertices.add(vertex.getID());
             }
          
             //  --- configure options ---
             CheckpointRetentionPolicy retentionAfterTermination;
             if (cfg.isExternalizedCheckpointsEnabled()) {
                CheckpointConfig.ExternalizedCheckpointCleanup cleanup = cfg.getExternalizedCheckpointCleanup();
                // Sanity check
                if (cleanup == null) {
                   throw new IllegalStateException("Externalized checkpoints enabled, but no cleanup mode configured.");
                }
                retentionAfterTermination = cleanup.deleteOnCancellation() ?
                      CheckpointRetentionPolicy.RETAIN_ON_FAILURE :
                      CheckpointRetentionPolicy.RETAIN_ON_CANCELLATION;
             } else {
                retentionAfterTermination = CheckpointRetentionPolicy.NEVER_RETAIN_AFTER_TERMINATION;
             }
             CheckpointingMode mode = cfg.getCheckpointingMode();
             boolean isExactlyOnce;
             if (mode == CheckpointingMode.EXACTLY_ONCE) {
                isExactlyOnce = true;
             } else if (mode == CheckpointingMode.AT_LEAST_ONCE) {
                isExactlyOnce = false;
             } else {
                throw new IllegalStateException("Unexpected checkpointing mode. " +
                   "Did not expect there to be another checkpointing mode besides " +
                   "exactly-once or at-least-once.");
             }
          
             //  --- configure the master-side checkpoint hooks ---
             final ArrayList<MasterTriggerRestoreHook.Factory> hooks = new ArrayList<>();
             for (StreamNode node : streamGraph.getStreamNodes()) {
                StreamOperator<?> op = node.getOperator();
                if (op instanceof AbstractUdfStreamOperator) {
                   Function f = ((AbstractUdfStreamOperator<?, ?>) op).getUserFunction();
                   if (f instanceof WithMasterCheckpointHook) {
                      hooks.add(new FunctionMasterCheckpointHookFactory((WithMasterCheckpointHook<?>) f));
                   }
                }
             }
          
             // because the hooks can have user-defined code, they need to be stored as
             // eagerly serialized values
             final SerializedValue<MasterTriggerRestoreHook.Factory[]> serializedHooks;
             if (hooks.isEmpty()) {
                serializedHooks = null;
             } else {
                try {
                   MasterTriggerRestoreHook.Factory[] asArray = hooks.toArray(new MasterTriggerRestoreHook.Factory[hooks.size()]);
                   serializedHooks = new SerializedValue<>(asArray);
                } catch (IOException e) {
                   throw new FlinkRuntimeException("Trigger/restore hook is not serializable", e);
                }
             }
          
             // because the state backend can have user-defined code, it needs to be stored as
             // eagerly serialized value
             final SerializedValue<StateBackend> serializedStateBackend;
             if (streamGraph.getStateBackend() == null) {
                serializedStateBackend = null;
             } else {
                try {
                   serializedStateBackend = new SerializedValue<StateBackend>(streamGraph.getStateBackend());
                } catch (IOException e) {
                   throw new FlinkRuntimeException("State backend is not serializable", e);
                }
             }
          
             //  --- done, put it all together ---
             JobCheckpointingSettings settings = new JobCheckpointingSettings(
                triggerVertices,
                ackVertices,
                commitVertices,
                new CheckpointCoordinatorConfiguration(
                   interval,
                   cfg.getCheckpointTimeout(),
                   cfg.getMinPauseBetweenCheckpoints(),
                   cfg.getMaxConcurrentCheckpoints(),
                   retentionAfterTermination,
                   isExactlyOnce),
                serializedStateBackend,
                serializedHooks);
          
             jobGraph.setSnapshotSettings(settings);
          }
      }
  2. JobGraph-->ExecutionGraph的转换流程:在客户端将JobGraph提交到Dispatcher之后,其会生成对应的JobMaster来处理该JobGraph到ExecutionGraph的可执行图的转换,转换过程中主要依赖于ExecutionGraphBuilder#buildGraph()方法,在构建过程中,如果作业开启了checkpoint,则会调用ExecutionGraph.enableCheckpointing()方法,这里会创建CheckpointCoordinator对象,并注册一个作业状态的监听CheckpointCoordinatorDeActivator,CheckpointCoordinatorDeActivator会在作业状态发生改变时得到通知。
     
    class ExecutionGraph {
        public void enableCheckpointing(
              long interval,
              long checkpointTimeout,
              long minPauseBetweenCheckpoints,
              int maxConcurrentCheckpoints,
              CheckpointRetentionPolicy retentionPolicy,
              List<ExecutionJobVertex> verticesToTrigger,
              List<ExecutionJobVertex> verticesToWaitFor,
              List<ExecutionJobVertex> verticesToCommitTo,
              List<MasterTriggerRestoreHook<?>> masterHooks,
              CheckpointIDCounter checkpointIDCounter,
              CompletedCheckpointStore checkpointStore,
              StateBackend checkpointStateBackend,
              CheckpointStatsTracker statsTracker) {
        
           // simple sanity checks
           checkArgument(interval >= 10, "checkpoint interval must not be below 10ms");
           checkArgument(checkpointTimeout >= 10, "checkpoint timeout must not be below 10ms");
           checkState(state == JobStatus.CREATED, "Job must be in CREATED state");
           checkState(checkpointCoordinator == null, "checkpointing already enabled");
        
           ExecutionVertex[] tasksToTrigger = collectExecutionVertices(verticesToTrigger);
           ExecutionVertex[] tasksToWaitFor = collectExecutionVertices(verticesToWaitFor);
           ExecutionVertex[] tasksToCommitTo = collectExecutionVertices(verticesToCommitTo);
        
           checkpointStatsTracker = checkNotNull(statsTracker, "CheckpointStatsTracker");
        
           // create the coordinator that triggers and commits checkpoints and holds the state
           checkpointCoordinator = new CheckpointCoordinator(
              jobInformation.getJobId(),
              interval,
              checkpointTimeout,
              minPauseBetweenCheckpoints,
              maxConcurrentCheckpoints,
              retentionPolicy,
              tasksToTrigger,
              tasksToWaitFor,
              tasksToCommitTo,
              checkpointIDCounter,
              checkpointStore,
              checkpointStateBackend,
              ioExecutor,
              SharedStateRegistry.DEFAULT_FACTORY);
        
           // register the master hooks on the checkpoint coordinator
           for (MasterTriggerRestoreHook<?> hook : masterHooks) {
              if (!checkpointCoordinator.addMasterHook(hook)) {
                 LOG.warn("Trying to register multiple checkpoint hooks with the name: {}", hook.getIdentifier());
              }
           }
           checkpointCoordinator.setCheckpointStatsTracker(checkpointStatsTracker);
        
           // interval of max long value indicates disable periodic checkpoint,
           // the CheckpointActivatorDeactivator should be created only if the interval is not max value
           if (interval != Long.MAX_VALUE) {
              // the periodic checkpoint scheduler is activated and deactivated as a result of
              // job status changes (running -> on, all other states -> off)
              // 注册任务状态更改 监听器通知
              registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());
           }
        }
        
        private void notifyJobStatusChange(JobStatus newState, Throwable error) {
           if (jobStatusListeners.size() > 0) {
              final long timestamp = System.currentTimeMillis();
              final Throwable serializedError = error == null ? null : new SerializedThrowable(error);
        
              for (JobStatusListener listener : jobStatusListeners) {
                 try {
                    listener.jobStatusChanges(getJobID(), newState, timestamp, serializedError);
                 } catch (Throwable t) {
                    LOG.warn("Error while notifying JobStatusListener", t);
                 }
              }
           }
        }
    }

    当JobStatus状态变为RUNNING时,listener.jobStatusChanges()会通知其上注册的监听器;CheckpointCoordinatorDeActivator会得到通知,并且通过CheckpointCoordinator.startCheckpointScheduler启动checkpoint的定时器。

    class CheckpointCoordinatorDeActivator implements JobStatusListener {
        @Override
        public void jobStatusChanges(JobID jobId, JobStatus newJobStatus, long timestamp, Throwable error) {
           if (newJobStatus == JobStatus.RUNNING) {
              // start the checkpoint scheduler
              coordinator.startCheckpointScheduler();
           } else {
              // anything else should stop the trigger for now
              coordinator.stopCheckpointScheduler();
           }
        }
    }

2、调度触发

        在CheckpointCoordinator.startCheckpointScheduler启动checkpoint定时器之后,也就意味着该checkpoint已启动并会周期性的调度执行。在flink中,其对该checkpoint调度的定时任务被封装为ScheduledTrigger,运行时会调用CheckpointCoordinator.triggerCheckpoint()触发一次checkpoint。其triggerCheckpoint()方法主要执行功能如下:

  • 检查是否可以触发checkpoint,包括是否需要强制进行checkpoint,当前正在排队的并发checkpoint的数目是否超过阈值,距离上一次成功checkpoint的间隔时间是否过小等,如果这些条件不满足,则当前检查点的触发请求不会执行
  • 检查是否所有需要触发checkpoint的Execution都是RUNNING状态
  • 生成此次checkpoint的checkpointID(id是严格自增的),并初始化CheckpointStorageLocation,CheckpointStorageLocation是此次checkpoint存储位置的抽象,通过CheckpointStorage.initializeLocationForCheckpoint()创建(CheckpointStorage目前有两个具体实现,分别为FsCheckpointStorage和MemoryBackendCheckpointStorage),CheckpointStorage则是从StateBackend中创建
  • 生成PendingCheckpoint,这表示一个处于中间状态的checkpoint,并保存在checkpointId->PendingCheckpoint这样的映射关系中
  • 注册一个调度任务,在checkpoint超时后取消此次checkpoint,并重新触发一次新的checkpoint
  • 调用Execution.triggerCheckpoint()方法向所有需要trigger的task发起checkpoint请求

savepoint和checkpoint的处理逻辑基本一致,只是savepoint是强制触发的,需要调用Execution.triggerSynchronousSavepoint()进行触发。

 

在CheckpointCoordinator内部也有三个列表:

  • ExecutionVertex[] tasksToTrigger;
  • ExecutionVertex[] tasksToWaitFor;
  • ExecutionVertex[] tasksToCommitTo;

        这就对应了前面JobGraph中的三个列表,在触发checkpoint的时候,只有作为source的Execution会调用Execution.triggerCheckpoint()方法。会通过RPC调用通知对应的RpcTaskManagerGateway调用triggerCheckpoint。

public class CheckpointCoordinator {
    public CheckpointTriggerResult triggerCheckpoint(
          long timestamp,
          CheckpointProperties props,
          @Nullable String externalSavepointLocation,
          boolean isPeriodic) {
    
       // make some eager pre-checks
       synchronized (lock) {
          // 检查是否可以触发checkpoint(筛选条件)
          // abort if the coordinator has been shutdown in the meantime
          // Don't allow periodic checkpoint if scheduling has been disabled
          
          // validate whether the checkpoint can be triggered, with respect to the limit of
          // concurrent checkpoints, and the minimum time between checkpoints.
          // these checks are not relevant for savepoints
          
       // check if all tasks that we need to trigger are running.
       // if not, abort the checkpoint
       // 检查是否所有需要触发checkpoint的Execution都是RUNNING状态
       Execution[] executions = new Execution[tasksToTrigger.length];
       for (int i = 0; i < tasksToTrigger.length; i++) {
          Execution ee = tasksToTrigger[i].getCurrentExecutionAttempt();
          if (ee == null) {
             // ......
          } else if (ee.getState() == ExecutionState.RUNNING) {
             executions[i] = ee;
          } else {
             // ......
          }
       }
       // ......
       // we will actually trigger this checkpoint!
    
       // we lock with a special lock to make sure that trigger requests do not overtake each other.
       // this is not done with the coordinator-wide lock, because the 'checkpointIdCounter'
       // may issue blocking operations. Using a different lock than the coordinator-wide lock,
       // we avoid blocking the processing of 'acknowledge/decline' messages during that time.
       synchronized (triggerLock) {
          final CheckpointStorageLocation checkpointStorageLocation;
          final long checkpointID;
          try {
             // this must happen outside the coordinator-wide lock, because it communicates
             // with external services (in HA mode) and may block for a while.
             checkpointID = checkpointIdCounter.getAndIncrement();
             // 生成此次checkpoint的checkpointID(id是严格自增的),并初始化CheckpointStorageLocation
             checkpointStorageLocation = props.isSavepoint() ?
                   checkpointStorage.initializeLocationForSavepoint(checkpointID, externalSavepointLocation) :
                   checkpointStorage.initializeLocationForCheckpoint(checkpointID);
          } catch (Throwable t) {
              // ......
          }
    
          final PendingCheckpoint checkpoint = new PendingCheckpoint(
             job,
             checkpointID,
             timestamp,
             ackTasks,
             props,
             checkpointStorageLocation,
             executor);
          if (statsTracker != null) {
             PendingCheckpointStats callback = statsTracker.reportPendingCheckpoint(
                checkpointID,
                timestamp,
                props);
             checkpoint.setStatsCallback(callback);
          }
    
          // schedule the timer that will clean up the expired checkpoints
          // 注册一个调度任务,在checkpoint超时后取消此次checkpoint,并重新触发一次新的checkpoint
          final Runnable canceller = () -> {
             synchronized (lock) {
                // only do the work if the checkpoint is not discarded anyways
                // note that checkpoint completion discards the pending checkpoint object
                if (!checkpoint.isDiscarded()) {
                   LOG.info("Checkpoint {} of job {} expired before completing.", checkpointID, job);
                   checkpoint.abortExpired();
                   pendingCheckpoints.remove(checkpointID);
                   rememberRecentCheckpointId(checkpointID);
                   triggerQueuedRequests(); // 重新调度触发一次新的checkpoint
                }
             }
          };
    
          try {
             // re-acquire the coordinator-wide lock
             synchronized (lock) {
                // since we released the lock in the meantime, we need to re-check
                // that the conditions still hold.
                // ......
                LOG.info("Triggering checkpoint {} @ {} for job {}.", checkpointID, timestamp, job);
                pendingCheckpoints.put(checkpointID, checkpoint);
                ScheduledFuture<?> cancellerHandle = timer.schedule(
                      canceller,
                      checkpointTimeout, TimeUnit.MILLISECONDS);
                if (!checkpoint.setCancellerHandle(cancellerHandle)) {
                   // checkpoint is already disposed!
                   cancellerHandle.cancel(false);
                }
                
                // trigger the master hooks for the checkpoint
                final List<MasterState> masterStates = MasterHooks.triggerMasterHooks(masterHooks.values(),
                      checkpointID, timestamp, executor, Time.milliseconds(checkpointTimeout));
                for (MasterState s : masterStates) {
                   checkpoint.addMasterState(s);
                }
             }
             // end of lock scope
    
             final CheckpointOptions checkpointOptions = new CheckpointOptions(
                   props.getCheckpointType(),
                   checkpointStorageLocation.getLocationReference());
    
             // send the messages to the tasks that trigger their checkpoint
             // 调用Execution.triggerCheckpoint()方法向所有需要trigger的task发起checkpoint请求
             for (Execution execution: executions) {
                execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
             }
    
             numUnsuccessfulCheckpointsTriggers.set(0);
             return new CheckpointTriggerResult(checkpoint);
          } catch (Throwable t) {
             // guard the map against concurrent modifications
             // ......
          }
       } // end trigger lock
    }
}


public class Execution implements AccessExecution, Archiveable<ArchivedExecution>, LogicalSlot.Payload {
    /**
     * Trigger a new checkpoint on the task of this execution.
     *
     * @param checkpointId of th checkpoint to trigger
     * @param timestamp of the checkpoint to trigger
     * @param checkpointOptions of the checkpoint to trigger
     */
    public void triggerCheckpoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions) {
       final LogicalSlot slot = assignedResource;
       if (slot != null) {
          final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
          // 只有作为source的Execution会调用Execution.triggerCheckpoint()
          // 通过RPC调用通知对应的RpcTaskManagerGateway调用triggerCheckpoint
          taskManagerGateway.triggerCheckpoint(attemptId, getVertex().getJobId(), checkpointId, timestamp, checkpointOptions);
       } else {
          LOG.debug("The execution has no slot assigned. This indicates that the execution is " +
             "no longer running.");
       }
    }
}

Checkpoint的执行

1、barrier的流动

        CheckpointCoordinator发出触发checkpoint的消息,最终通过RPC调用TaskExecutorGateway.triggerCheckpoint,即请求执行TaskExecutor.triggerCheckpoin()。因为一个TaskExecutor中可能有多个Task正在运行,因而要根据触发checkpoint的ExecutionAttemptID找到对应的Task,然后调用Task.triggerCheckpointBarrier()方法。只有作为source的Task才会触发triggerCheckpointBarrier()方法的调用。

        在Task中,checkpoint的触发被封装为一个异步任务执行:

class Task {
    /**
     * Calls the invokable to trigger a checkpoint.
     *
     * @param checkpointID The ID identifying the checkpoint.
     * @param checkpointTimestamp The timestamp associated with the checkpoint.
     * @param checkpointOptions Options for performing this checkpoint.
     */
    public void triggerCheckpointBarrier(
          final long checkpointID,
          long checkpointTimestamp,
          final CheckpointOptions checkpointOptions) {
    
       final AbstractInvokable invokable = this.invokable;
       final CheckpointMetaData checkpointMetaData = new CheckpointMetaData(checkpointID, checkpointTimestamp);
    
       if (executionState == ExecutionState.RUNNING && invokable != null) {
          // build a local closure
          final String taskName = taskNameWithSubtask;
          final SafetyNetCloseableRegistry safetyNetCloseableRegistry =
             FileSystemSafetyNet.getSafetyNetCloseableRegistryForThread();
    
          Runnable runnable = new Runnable() {
             @Override
             public void run() {
                // set safety net from the task's context for checkpointing thread
                LOG.debug("Creating FileSystem stream leak safety net for {}", Thread.currentThread().getName());
                FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(safetyNetCloseableRegistry);
                try {
                   // 触发checkpoint真正的调用逻辑
                   boolean success = invokable.triggerCheckpoint(checkpointMetaData, checkpointOptions);
                   if (!success) {
                      checkpointResponder.declineCheckpoint(
                            getJobID(), getExecutionId(), checkpointID,
                            new CheckpointDeclineTaskNotReadyException(taskName));
                   }
                } catch (Throwable t) {
                    // ......
                } finally {
                   FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(null);
                }
             }
          };
          // 异步执行
          executeAsyncCallRunnable(runnable, String.format("Checkpoint Trigger for %s (%s).", taskNameWithSubtask, executionId));
       }
       else {
          LOG.debug("Declining checkpoint request for non-running task {} ({}).", taskNameWithSubtask, executionId);
          // send back a message that we did not do the checkpoint
          checkpointResponder.declineCheckpoint(jobId, executionId, checkpointID,
                new CheckpointDeclineTaskNotReadyException(taskNameWithSubtask));
       }
    }
}

Task执行checkpoint的真正逻辑被封装在AbstractInvokable.triggerCheckpoint(...)中,AbstractInvokable中有两个触发checkpoint的方法:

  • triggerCheckpoint
  • triggerCheckpointOnBarrier

       其中triggerCheckpoint是触发checkpoint的源头,会向下游注入CheckpointBarrier;而下游的其他任务在收到CheckpointBarrier后调用triggerCheckpointOnBarrier方法。这两个方法的具体实现有一些细微的差异,但主要的逻辑是一致的,在StreamTask.performCheckpoint()方法中:1)先向下游发送barrier,2)存储检查点快照。

一旦StreamTask.triggerCheckpoint()或StreamTask.triggerCheckpointOnBarrier()被调用,就会通过OperatorChain.broadcastCheckpointBarrier()向下游发送barrier:

public abstract class StreamTask<OUT, OP extends StreamOperator<OUT>>
      extends AbstractInvokable
      implements AsyncExceptionHandler {
    private boolean performCheckpoint(
          CheckpointMetaData checkpointMetaData,
          CheckpointOptions checkpointOptions,
          CheckpointMetrics checkpointMetrics) throws Exception {
       LOG.debug("Starting checkpoint ({}) {} on task {}",
          checkpointMetaData.getCheckpointId(), checkpointOptions.getCheckpointType(), getName());
    
       synchronized (lock) {
          if (isRunning) {
             // we can do a checkpoint
    
             // All of the following steps happen as an atomic step from the perspective of barriers and
             // records/watermarks/timers/callbacks.
             // We generally try to emit the checkpoint barrier as soon as possible to not affect downstream
             // checkpoint alignments
    
             // Step (1): Prepare the checkpoint, allow operators to do some pre-barrier work.
             //           The pre-barrier work should be nothing or minimal in the common case.
             operatorChain.prepareSnapshotPreBarrier(checkpointMetaData.getCheckpointId());
    
             // Step (2): Send the checkpoint barrier downstream   // 向下游发送barrier
             operatorChain.broadcastCheckpointBarrier(
                   checkpointMetaData.getCheckpointId(),
                   checkpointMetaData.getTimestamp(),
                   checkpointOptions);
    
             // Step (3): Take the state snapshot. This should be largely asynchronous, to not
             //           impact progress of the streaming topology
             checkpointState(checkpointMetaData, checkpointOptions, checkpointMetrics);   // 存储检查点快照
             return true;
          }
          else {
             // we cannot perform our checkpoint - let the downstream operators know that they
             // should not wait for any input from this operator
    
             // we cannot broadcast the cancellation markers on the 'operator chain', because it may not
             // yet be created
             final CancelCheckpointMarker message = new CancelCheckpointMarker(checkpointMetaData.getCheckpointId());
             Exception exception = null;
    
             for (StreamRecordWriter<SerializationDelegate<StreamRecord<OUT>>> streamRecordWriter : streamRecordWriters) {
                try {
                   streamRecordWriter.broadcastEvent(message);
                } catch (Exception e) {
                   // ......
                }
             }
             // ......
             return false;
          }
       }
    }
}
public class OperatorChain<OUT, OP extends StreamOperator<OUT>> implements StreamStatusMaintainer {
    public void broadcastCheckpointBarrier(long id, long timestamp, CheckpointOptions checkpointOptions) throws IOException {
       // 创建一个CheckpointBarrier
       CheckpointBarrier barrier = new CheckpointBarrier(id, timestamp, checkpointOptions);
       for (RecordWriterOutput<?> streamOutput : streamOutputs) {
          // 向所有的下游发送
          streamOutput.broadcastEvent(barrier);
       }
    }
}

在每一个Task中,其会通过InputGate消费上游Task产生的数据,在flink实际的任务执行运行中;其主要的checkpoint触发与流向如下:

  • 在执行图JobGraph转化到JobMaster中的ExecutionGraph中;其会通过配置开启enableCheckpointing();在这里其会初始化CheckpointCoordinator检查点协作组件(主要包括ckp的周期、输入源、操作算子等ExecutionVertex);并通过ExecutionGraph#registerJobStatusListener注册自己,当任务状态JobStatus变化到Running的过程中;其会接受回调并触发coordinator.startCheckpointScheduler();开启周期性ckp的触发;
  • CheckpointCoordinator.startCheckpointScheduler触发过程中主要通过找到只作为source源输入的Execution;并调用其Execution.triggerCheckpoint()方法。其会通过RPC调用通知对应的RpcTaskManagerGateway调用triggerCheckpoint。
  • 在对应的TaskExecutor接受到对应的调用后,其会通过taskSlotTable获取到对应的Task任务(SourceStreamTask);并触发其task.triggerCheckpointBarrier()进行检查点的触发;在SourceStreamTask触发检查点Barrier操作时,其会委托给抽象的父类StreamTask进行触发;
  • 在SourceStreamTask委托给抽象父类StreamTask进行ckp触发后;1、其会先向下游算子(OneInputStreamTask、TwoInputStreamTask等)广播当前的CheckpointBarrier;2、再对本算子进行当前状态的status存储;
  • 下游算子(以OneInputStreamTask为例)在其自己初始化的时候会获取其对应的InputGate;并将其封装委托给对应初始化的StreamInputProcessor;在StreamInputProcessor初始化的时候会根据ckp模式和InputGate进行对应CheckpointBarrierHandler的初始化;并将其作为接收到上游发出的CheckpointBarrier的处理核心;
  • 下游算子在周期性处理函数inputProcessor.processInput()中接受到的任务数据时;1、其会先调用其内部持有的操作算子streamOperator.processElement(record);来对数据进行自定义userFunction的处理;并通过output.collect(element)向下游传递;2、其会接收下一个数据块BufferOrEvent bufferOrEvent = barrierHandler.getNextNonBlocked();在接收的过程中,其会判断其是不是CheckpointBarrier事件,如果是,其会触发当前算子的ckp操作;

2、barrier的接收及处理

       在Task接收上游数据的实现中(OneInputStreamTask),其主要会在StreamInputProcessor和StreamTwoInputProcessor中创建CheckpointBarrierHandler;CheckpointBarrierHandler是对InputGate的一层封装,增加了对CheckpointBarrier等事件的处理。CheckpointBarrierHandler有两个具体的实现,即BarrierTracker和BarrierBuffer,分别对应AT_LEAST_ONCE和EXACTLY_ONCE这两种模式。

StreamInputProcessor和StreamTwoInputProcessor中循环调用CheckpointBarrierHandler.getNextNonBlocked()获取新数据,因而在CheckpointBarrierHandler获得CheckpointBarrier后可以及时地进行checkpoint相关的操作。

1、AT_LEAST_ONCE模式下对应的BarrierTracker,它仅仅追踪从每一个inputchannel接收到的barrier,当所有inputchannel的barrier都被接收时,就可以触发checkpoint了:

public class BarrierTracker implements CheckpointBarrierHandler {
    @Override
    public BufferOrEvent getNextNonBlocked() throws Exception {
       while (true) {
          Optional<BufferOrEvent> next = inputGate.getNextBufferOrEvent();
          if (!next.isPresent()) {
             // buffer or input exhausted
             return null;
          }
    
          BufferOrEvent bufferOrEvent = next.get();
          if (bufferOrEvent.isBuffer()) {
             return bufferOrEvent;
          }
          else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
             // 接收到 CheckpointBarrier
             processBarrier((CheckpointBarrier) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
          }
          else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
             // 接收到 CancelCheckpointMarker
             processCheckpointAbortBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
          }
          else {
             // some other event
             return bufferOrEvent;
          }
       }
    }
    
    private void processBarrier(CheckpointBarrier receivedBarrier, int channelIndex) throws Exception {
       final long barrierId = receivedBarrier.getId();
       // fast path for single channel trackers
       if (totalNumberOfInputChannels == 1) {
          notifyCheckpoint(barrierId, receivedBarrier.getTimestamp(), receivedBarrier.getCheckpointOptions());
          return;
       }
    
       // general path for multiple input channels
       if (LOG.isDebugEnabled()) {
          LOG.debug("Received barrier for checkpoint {} from channel {}", barrierId, channelIndex);
       }
       // find the checkpoint barrier in the queue of pending barriers
       CheckpointBarrierCount cbc = null;
       int pos = 0;
    
       for (CheckpointBarrierCount next : pendingCheckpoints) {
          if (next.checkpointId == barrierId) {
             cbc = next;
             break;
          }
          pos++;
       }
    
       if (cbc != null) {
          // add one to the count to that barrier and check for completion
          int numBarriersNew = cbc.incrementBarrierCount();
          if (numBarriersNew == totalNumberOfInputChannels) {
             // checkpoint can be triggered (or is aborted and all barriers have been seen)
             // first, remove this checkpoint and all all prior pending
             // checkpoints (which are now subsumed)
             // 在当前 barrierId 前面的所有未完成的 checkpoint 都可以丢弃了
             for (int i = 0; i <= pos; i++) {
                pendingCheckpoints.pollFirst();
             }
             // notify the listener
             if (!cbc.isAborted()) {
                if (LOG.isDebugEnabled()) {
                   LOG.debug("Received all barriers for checkpoint {}", barrierId);
                }
                // 通知进行 checkpoint
                notifyCheckpoint(receivedBarrier.getId(), receivedBarrier.getTimestamp(), receivedBarrier.getCheckpointOptions());
             }
          }
       }
       else {
          // first barrier for that checkpoint ID
          // add it only if it is newer than the latest checkpoint.
          // if it is not newer than the latest checkpoint ID, then there cannot be a
          // successful checkpoint for that ID anyways
          if (barrierId > latestPendingCheckpointID) {
             latestPendingCheckpointID = barrierId;
             pendingCheckpoints.addLast(new CheckpointBarrierCount(barrierId));
    
             // make sure we do not track too many checkpoints
             if (pendingCheckpoints.size() > MAX_CHECKPOINTS_TO_TRACK) {
                pendingCheckpoints.pollFirst();
             }
          }
       }
    }
}

2、EXACTLY_ONCE模式下对应的BarrierBuffer,它除了要追踪每一个inputchannel接收到的barrier之外,在接收到所有的barrier之前,先收到barrier的channel要进入阻塞状态。当然为了避免进入“反压”状态,BarrierBuffer会继续接收数据,但会对接收到的数据进行缓存,直到所有的barrier都到达。

public class BarrierBuffer implements CheckpointBarrierHandler {
	/** To utility to write blocked data to a file channel. */
	private final BufferBlocker bufferBlocker; // 用于缓存被阻塞的channel接收的数据

	/**
	 * The sequence of buffers/events that has been unblocked and must now be consumed before
	 * requesting further data from the input gate.
	 */
	private BufferOrEventSequence currentBuffered; // 当前缓存的数据
 
    // ------------------------------------------------------------------------
    //  Buffer and barrier handling
    // ------------------------------------------------------------------------
    @Override
    public BufferOrEvent getNextNonBlocked() throws Exception {
       while (true) {
          // process buffered BufferOrEvents before grabbing new ones
          // 先处理缓存的数据
          Optional<BufferOrEvent> next;
          if (currentBuffered == null) {
             next = inputGate.getNextBufferOrEvent();
          }
          else {
             next = Optional.ofNullable(currentBuffered.getNext());
             if (!next.isPresent()) {
                completeBufferedSequence();
                return getNextNonBlocked();
             }
          }
    
          if (!next.isPresent()) {
             if (!endOfStream) {
                // end of input stream. stream continues with the buffered data
                endOfStream = true;
                releaseBlocksAndResetBarriers();
                return getNextNonBlocked();
             }
             else {
                // final end of both input and buffered data
                return null;
             }
          }
    
          BufferOrEvent bufferOrEvent = next.get();
          if (isBlocked(bufferOrEvent.getChannelIndex())) {
             // if the channel is blocked we, we just store the BufferOrEvent
             // 如果当前 channel 是 block 状态,先写入缓存
             bufferBlocker.add(bufferOrEvent);
             checkSizeLimit();
          }
          else if (bufferOrEvent.isBuffer()) {
             return bufferOrEvent;
          }
          else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
             if (!endOfStream) {
                // process barriers only if there is a chance of the checkpoint completing
                processBarrier((CheckpointBarrier) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
             }
          }
          else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
             processCancellationBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent());
          }
          else {
             if (bufferOrEvent.getEvent().getClass() == EndOfPartitionEvent.class) {
                processEndOfPartition();
             }
             return bufferOrEvent;
          }
       }
    }
}

除了CheckpointBarrier消息以外,在checkpoint发生异常或取消checkpoint的时候,会向下游发送CancelCheckpointMarker消息。

JobMaster对Checkpoint的确认

      Task对checkpoint的响应是通过CheckpointResponder接口完成的:

public interface CheckpointResponder {
	/**
	 * Acknowledges the given checkpoint.
	 */
    void acknowledgeCheckpoint(
       JobID jobID,
       ExecutionAttemptID executionAttemptID,
       long checkpointId,
       CheckpointMetrics checkpointMetrics,
       TaskStateSnapshot subtaskState);
  
	/**
	 * Declines the given checkpoint.
	 */
    void declineCheckpoint(
       JobID jobID,
       ExecutionAttemptID executionAttemptID,
       long checkpointId,
       Throwable cause);
}

         RpcCheckpointResponder作为CheckpointResponder的具体实现,主要是通过RPC调用通知CheckpointCoordinatorGateway,即通知给JobMaster;JobMaster调用CheckpointCoordinator.receiveAcknowledgeMessage()和CheckpointCoordinator.receiveDeclineMessage()进行处理。

确认完成

         在一个Task完成checkpoint操作后,其会通过checkpointCoordinatorGateway的RPC接口checkpointResponder.acknowledgeCheckpoint(),发送ACK响应给CheckpointCoordinator;在JobMaster中的CheckpointCoordinator接收到Ack响应,对Ack响应的处理流程主要如下:

  • 根据Ack的checkpointID从Map<Long,PendingCheckpoint> pendingCheckpoints中查找对应的PendingCheckpoint
  • 若存在对应的PendingCheckpoint
    • 这个PendingCheckpoint没有被丢弃,调用PendingCheckpoint.acknowledgeTask方法处理Ack,根据处理结果的不同:
      • SUCCESS:判断是否已经接受了所有需要响应的Ack,如果是,则调用completePendingCheckpoint完成此次checkpoint
      • DISCARD:Checkpoint已经被discard,清理上报的Ack中携带的状态句柄
      • UNKNOWN:未知的Ack消息,清理上报的Ack中携带的状态句柄
      • DUPLICATE:Ack消息重复接收,直接忽略
    • 这个PendingCheckpoint已经被丢弃,抛出异常
  • 若不存在对应的PendingCheckpoint,则清理上报的Ack中携带的状态句柄

相应的代码如下:

public class CheckpointCoordinator {
    /**
     * Receives an AcknowledgeCheckpoint message and returns whether the
     * message was associated with a pending checkpoint.
     */
    public boolean receiveAcknowledgeMessage(AcknowledgeCheckpoint message) throws CheckpointException {
       if (shutdown || message == null) {
          return false;
       }
       if (!job.equals(message.getJob())) {
          LOG.error("Received wrong AcknowledgeCheckpoint message for job {}: {}", job, message);
          return false;
       }
       
       final long checkpointId = message.getCheckpointId();
       synchronized (lock) {
          // we need to check inside the lock for being shutdown as well, otherwise we
          // get races and invalid error log messages
          if (shutdown) {
             return false;
          }
          final PendingCheckpoint checkpoint = pendingCheckpoints.get(checkpointId); // 查找对应的PendingCheckpoint
          if (checkpoint != null && !checkpoint.isDiscarded()) {
             switch (checkpoint.acknowledgeTask(message.getTaskExecutionId(), message.getSubtaskState(), message.getCheckpointMetrics())) {
                case SUCCESS:
                   LOG.debug("Received acknowledge message for checkpoint {} from task {} of job {}.",
                      checkpointId, message.getTaskExecutionId(), message.getJob());
                   if (checkpoint.isFullyAcknowledged()) {
                      completePendingCheckpoint(checkpoint);
                   }
                   break;
                case DUPLICATE:
                   LOG.debug("Received a duplicate acknowledge message for checkpoint {}, task {}, job {}.",
                      message.getCheckpointId(), message.getTaskExecutionId(), message.getJob());
                   break;
                case UNKNOWN:
                   LOG.warn("Could not acknowledge the checkpoint {} for task {} of job {}, " +
                         "because the task's execution attempt id was unknown. Discarding " +
                         "the state handle to avoid lingering state.", message.getCheckpointId(),
                      message.getTaskExecutionId(), message.getJob());
                   discardSubtaskState(message.getJob(), message.getTaskExecutionId(), message.getCheckpointId(), message.getSubtaskState());
                   break;
                case DISCARDED:
                   LOG.warn("Could not acknowledge the checkpoint {} for task {} of job {}, " +
                      "because the pending checkpoint had been discarded. Discarding the " +
                         "state handle tp avoid lingering state.",
                      message.getCheckpointId(), message.getTaskExecutionId(), message.getJob());
                   discardSubtaskState(message.getJob(), message.getTaskExecutionId(), message.getCheckpointId(), message.getSubtaskState());
             }
             return true;
          }
          else if (checkpoint != null) {
             // this should not happen
             throw new IllegalStateException(
                   "Received message for discarded but non-removed checkpoint " + checkpointId);
          }
          else {
             boolean wasPendingCheckpoint;
             // message is for an unknown checkpoint, or comes too late (checkpoint disposed)
             if (recentPendingCheckpoints.contains(checkpointId)) {
                wasPendingCheckpoint = true;
                LOG.warn("Received late message for now expired checkpoint attempt {} from " +
                   "{} of job {}.", checkpointId, message.getTaskExecutionId(), message.getJob());
             }
             else {
                LOG.debug("Received message for an unknown checkpoint {} from {} of job {}.",
                   checkpointId, message.getTaskExecutionId(), message.getJob());
                wasPendingCheckpoint = false;
             }
             // try to discard the state so that we don't have lingering state lying around
             discardSubtaskState(message.getJob(), message.getTaskExecutionId(), message.getCheckpointId(), message.getSubtaskState());
             return wasPendingCheckpoint;
          }
       }
    }
}

对于一个已经触发但还没有完成的checkpoint,即PendingCheckpoint,它是如何处理Ack消息的呢?在PendingCheckpoint内部维护了两个Map,分别是:

  • Map<OperatorID, OperatorState> operatorStates;已经接收到Ack的算子的状态句柄
  • Map<ExecutionAttemptID, ExecutionVertex> notYetAcknowledgedTasks;需要Ack但还没有接收到的Task

每当接收到一个Ack消息时,PendingCheckpoint就从notYetAcknowledgedTasks中移除对应的Task,并保存Ack携带的状态句柄保存。当notYetAcknowledgedTasks为空时,表明所有的Ack消息都接收到了。

其中OperatorState是算子状态句柄的一层封装:

class OperatorState implements CompositeStateHandle {
	/** handles to non-partitioned states, subtaskindex -> subtaskstate */
	private final Map<Integer, OperatorSubtaskState> operatorSubtaskStates;
}

public class OperatorSubtaskState implements CompositeStateHandle {
	/** Snapshot from the {@link org.apache.flink.runtime.state.OperatorStateBackend}. */
	@Nonnull
	private final StateObjectCollection<OperatorStateHandle> managedOperatorState;

	/** Snapshot written using {@link org.apache.flink.runtime.state.OperatorStateCheckpointOutputStream}. */
	@Nonnull
	private final StateObjectCollection<OperatorStateHandle> rawOperatorState;

	/** Snapshot from {@link org.apache.flink.runtime.state.KeyedStateBackend}. */
	@Nonnull
	private final StateObjectCollection<KeyedStateHandle> managedKeyedState;

	/** Snapshot written using {@link org.apache.flink.runtime.state.KeyedStateCheckpointOutputStream}. */
	@Nonnull
	private final StateObjectCollection<KeyedStateHandle> rawKeyedState;
}

一旦PendingCheckpoint调用checkpoint.acknowledgeTask()确认了所有Ack消息都已经接收,那么就可以完成此次checkpoint了,具体包括:

  • 调用PendingCheckpoint.finalizeCheckpoint()将PendingCheckpoint转化为CompletedCheckpoint
    • 获取CheckpointMetadataOutputStream,将所有的状态句柄信息通过CheckpointMetadataOutputStream写入到存储系统中
    • 创建一个CompletedCheckpoint对象
  • 将CompletedCheckpoint保存到CompletedCheckpointStore中
    • CompletedCheckpointStore有两种实现,分别为StandaloneCompletedCheckpointStore和ZooKeeperCompletedCheckpointStore
    • StandaloneCompletedCheckpointStore简单地将CompletedCheckpointStore存放在一个数组中
    • ZooKeeperCompletedCheckpointStore提供高可用实现:先将CompletedCheckpointStore写入到RetrievableStateStorageHelper中(通常是文件系统),然后将文件句柄存在ZK中
    • 保存的CompletedCheckpointStore数量是有限的,会删除旧的快照
  • 移除被越过的PendingCheckpoint,因为CheckpointID是递增的,那么所有比当前完成的CheckpointID小的PendingCheckpoint都可以被丢弃了
  • 依次调用Execution.notifyCheckpointComplete()通知所有的Task当前Checkpoint已经完成
    • 通过RPC调用TaskExecutor.confirmCheckpoint()告知对应的Task

拒绝

         在Task进行checkpoint的过程,可能会发生异常导致checkpoint失败,在这种情况下会通过CheckpointResponder发出回绝的消息。当CheckpointCoordinator接收到DeclineCheckpoint消息后会移除PendingCheckpoint,并尝试丢弃已经接收到的Ack消息中已完成的状态句柄:

public class CheckpointCoordinator {
    /**
     * Receives a {@link DeclineCheckpoint} message for a pending checkpoint.
     */
    public void receiveDeclineMessage(DeclineCheckpoint message) {
       if (shutdown || message == null) {
          return;
       }
       if (!job.equals(message.getJob())) {
          throw new IllegalArgumentException("Received DeclineCheckpoint message for job " +
             message.getJob() + " while this coordinator handles job " + job);
       }
       final long checkpointId = message.getCheckpointId();
       final String reason = (message.getReason() != null ? message.getReason().getMessage() : "");
    
       PendingCheckpoint checkpoint;
       synchronized (lock) {
          // we need to check inside the lock for being shutdown as well, otherwise we
          // get races and invalid error log messages
          if (shutdown) {
             return;
          }
          checkpoint = pendingCheckpoints.remove(checkpointId);
          if (checkpoint != null && !checkpoint.isDiscarded()) {
             LOG.info("Decline checkpoint {} by task {} of job {}.", checkpointId, message.getTaskExecutionId(), job);
             discardCheckpoint(checkpoint, message.getReason());
          }
          else if (checkpoint != null) {
             // this should not happen
             throw new IllegalStateException("Received message for discarded but non-removed checkpoint " + checkpointId);
          }
          else if (LOG.isDebugEnabled()) {
             if (recentPendingCheckpoints.contains(checkpointId)) {
                // message is for an unknown checkpoint, or comes too late (checkpoint disposed)
                LOG.debug("Received another decline message for now expired checkpoint attempt {} of job {} : {}", checkpointId, job, reason);
             } else {
                // message is for an unknown checkpoint. might be so old that we don't even remember it any more
                LOG.debug("Received decline message for unknown (too old?) checkpoint attempt {} of job {} : {}", checkpointId, job, reason);
             }
          }
       }
    }
}
  • 0
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值