Understanding Flink State and Checkpoint Source Code in One Article

1. Flink State Source Code

1.1 ValueState Source Code

1.1.1 The update() Method

class HeapValueState<K, N, V>
        extends AbstractHeapState<K, N, V>
        implements InternalValueState<K, N, V> {
    /** The current namespace, which the access methods will refer to. */
    protected N currentNamespace;
    /** Map containing the actual key/value pairs. */
    protected final StateTable<K, N, V> stateTable;

    @Override
    public void update(V value) {
        if (value == null) {
                clear();
                return;
        }
        // Write the value into the state; the namespace comes from the current context
        stateTable.put(currentNamespace, value);
    }
}
public abstract class StateTable<K, N, S>
        implements StateSnapshotRestore, Iterable<StateEntry<K, N, S>> {
        
    // Maps the composite of active key and given namespace to the specified state.
    // Locate the state by key + namespace
    // The key comes from the key context; the state argument is the value the user wants to store
    public void put(N namespace, S state) {
            put(keyContext.getCurrentKey(), keyContext.getCurrentKeyGroupIndex(), namespace, state);
    }
    
    public void put(K key, int keyGroup, N namespace, S state) {
            checkKeyNamespacePreconditions(key, namespace);

            StateMap<K, N, S> stateMap = getMapForKeyGroup(keyGroup);
            // Put the state into the stateMap under the given key
            stateMap.put(key, namespace, state);
    }
}
public class CopyOnWriteStateMap<K, N, S> extends StateMap<K, N, S> {
    @Override
    public void put(K key, N namespace, S value) {
        // Get the StateMapEntry corresponding to the current key + namespace
        final StateMapEntry<K, N, S> e = putEntry(key, namespace);
        // Once the StateMapEntry for key + namespace is obtained, assign the value into it
        e.state = value;
        e.stateVersion = stateMapVersion;
    }
    
    private StateMapEntry<K, N, S> putEntry(K key, N namespace) {
        // Compute a hash from key + namespace, then use it to locate the exact index of the StateMapEntry within the tab array
        final int hash = computeHashForOperationAndDoIncrementalRehash(key, namespace);
        final StateMapEntry<K, N, S>[] tab = selectActiveTable(hash);
        // The bitwise AND here is exactly the addressing scheme HashMap uses; it guarantees the resulting index is smaller than the array length
        int index = hash & (tab.length - 1);

        for (StateMapEntry<K, N, S> e = tab[index]; e != null; e = e.next) {
            // StateMapEntry objects are chained into a linked list, so after locating the index we still traverse the chain and match the target entry by key + namespace, much like HashMap
                if (e.hash == hash && key.equals(e.key) && namespace.equals(e.namespace)) {

                        // copy-on-write check for entry
                        if (e.entryVersion < highestRequiredSnapshotVersion) {
                                e = handleChainedEntryCopyOnWrite(tab, index, e);
                        }

                        return e;
                }
        }

        ++modCount;
        if (size() > threshold) {
                doubleCapacity();
        }

        return addNewStateMapEntry(tab, key, namespace, hash);
    }
}
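A side note on the indexing above: since the table length is always a power of two, hash & (tab.length - 1) selects the same bucket as a modulo would, but is cheaper and always non-negative. A minimal standalone sketch (not Flink code) to illustrate:

public class MaskIndexDemo {
    public static void main(String[] args) {
        int length = 16; // table length, always a power of two as in CopyOnWriteStateMap
        for (int hash : new int[]{5, 21, 123_456_789}) {
            // For non-negative hashes the mask equals the modulo; for negative
            // hashes the mask still yields a valid non-negative index
            System.out.printf("hash=%d -> mask=%d, mod=%d%n",
                hash, hash & (length - 1), hash % length);
        }
    }
}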

The structure that ultimately stores key + namespace + value is StateMapEntry, defined as follows:

protected static class StateMapEntry<K, N, S> implements StateEntry<K, N, S> {
    /**
     * The key. Assumed to be immutable and not null.
     */
    @Nonnull
    final K key;

    /**
     * The namespace. Assumed to be immutable and not null.
     */
    @Nonnull
    final N namespace;

    /**
     * The state. This is not final to allow exchanging the object for copy-on-write. Can be null.
     */
    @Nullable
    // The user-defined state type
    S state;

    /**
     * Link to another {@link StateMapEntry}. This is used to resolve collisions in the
     * {@link CopyOnWriteStateMap} through chaining.
     */
    @Nullable
    // Chained linked list used for collision resolution
    StateMapEntry<K, N, S> next;

    /**
     * The version of this {@link StateMapEntry}. This is meta data for copy-on-write of the map structure.
     */
    int entryVersion;

    /**
     * The version of the state object in this entry. This is meta data for copy-on-write of the state object itself.
     */
    int stateVersion;

    /**
     * The computed secondary hash for the composite of key and namespace.
     */
    final int hash;
}

(Class hierarchy diagram)

1.1.2 The value() Method

class HeapValueState<K, N, V>
        extends AbstractHeapState<K, N, V>
        implements InternalValueState<K, N, V> {
        
        /** The current namespace, which the access methods will refer to. */
        protected N currentNamespace;

        @Override
        public V value() {
            // The namespace is obtained from the context
                final V result = stateTable.get(currentNamespace);

                if (result == null) {
                        return getDefaultValue();
                }

                return result;
        }
}
public abstract class StateTable<K, N, S>
        implements StateSnapshotRestore, Iterable<StateEntry<K, N, S>> {
    public S get(N namespace) {
        return get(keyContext.getCurrentKey(), keyContext.getCurrentKeyGroupIndex(), namespace);
    }

    private S get(K key, int keyGroupIndex, N namespace) {
        checkKeyNamespacePreconditions(key, namespace);

        StateMap<K, N, S> stateMap = getMapForKeyGroup(keyGroupIndex);

        if (stateMap == null) {
                return null;
        }

        return stateMap.get(key, namespace);
    }
}

public class CopyOnWriteStateMap<K, N, S> extends StateMap<K, N, S> {
    @Override
    public S get(K key, N namespace) {
        // Compute the hash from key + namespace
        final int hash = computeHashForOperationAndDoIncrementalRehash(key, namespace);
        final int requiredVersion = highestRequiredSnapshotVersion;
        // The array that ultimately stores the StateMapEntry objects
        final StateMapEntry<K, N, S>[] tab = selectActiveTable(hash);
        int index = hash & (tab.length - 1);
        // Traverse the chain and locate the StateMapEntry by key + namespace + hash, analogous to HashMap's equals()-based lookup
        for (StateMapEntry<K, N, S> e = tab[index]; e != null; e = e.next) {
                final K eKey = e.key;
                final N eNamespace = e.namespace;
                if ((e.hash == hash && key.equals(eKey) && namespace.equals(eNamespace))) {

                        // copy-on-write check for state
                        if (e.stateVersion < requiredVersion) {
                                // copy-on-write check for entry
                                if (e.entryVersion < requiredVersion) {
                                        e = handleChainedEntryCopyOnWrite(tab, hash & (tab.length - 1), e);
                                }
                                e.stateVersion = stateMapVersion;
                                e.state = getStateSerializer().copy(e.state);
                        }

                        return e.state;
                }
        }
        return null;
    }
}
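To tie this back to the user-facing API: with the heap state backend, every value()/update() call on a ValueState ends up in the HeapValueState and CopyOnWriteStateMap paths shown above. A minimal sketch of such user code (hypothetical job; the class and names are illustrative):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        // Registers the state; with the heap backend this is backed by a StateTable
        countState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        // value() -> StateTable.get(namespace); the current key comes from the key context
        Long current = countState.value();
        long next = (current == null) ? 1L : current + 1L;
        // update() -> StateTable.put(namespace, value) -> CopyOnWriteStateMap.put(...)
        countState.update(next);
        out.collect(Tuple2.of(in.f0, next));
    }
}

This function would be applied after a keyBy(), which is what populates the key context that StateTable.put()/get() read the current key from.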

2. Checkpoint Source Code Analysis

2.1 Checkpoint Implementation of SourceStreamTask

2.1.1 Checkpoint Scheduling on the JobManager Side

While the JobManager builds the ExecutionGraph (in the ExecutionGraphBuilder.buildGraph() method), it calls ExecutionGraph.enableCheckpointing(); this method is invoked whether or not the job has checkpointing configured. Inside enableCheckpointing(), a CheckpointCoordinator is created, which is the core class responsible for checkpointing, and a job status listener, CheckpointCoordinatorDeActivator, is registered with the job (only when checkpointing is actually configured). CheckpointCoordinatorDeActivator is responsible for starting and stopping the checkpoint scheduler.
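All of the parameters passed to enableCheckpointing() below originate from the user-side CheckpointConfig. A minimal sketch of the corresponding job-level configuration (the values are illustrative; the comments map each call to the chkConfig getter used below):

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);                                  // getCheckpointInterval()
        env.getCheckpointConfig().setCheckpointTimeout(600_000);          // getCheckpointTimeout()
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);  // getMinPauseBetweenCheckpoints()
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);         // getMaxConcurrentCheckpoints()
        env.getCheckpointConfig().enableExternalizedCheckpoints(          // getCheckpointRetentionPolicy()
            ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}

The JobManager-side source is as follows: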

//ExecutionGraphBuilder class
public static ExecutionGraph buildGraph(
      @Nullable ExecutionGraph prior,
      JobGraph jobGraph,
      Configuration jobManagerConfig,
      ScheduledExecutorService futureExecutor,
      Executor ioExecutor,
      SlotProvider slotProvider,
      ClassLoader classLoader,
      CheckpointRecoveryFactory recoveryFactory,
      Time rpcTimeout,
      RestartStrategy restartStrategy,
      MetricGroup metrics,
      int parallelismForAutoMax,
      BlobWriter blobWriter,
      Time allocationTimeout,
      Logger log)
   throws JobExecutionException, JobException {
      ...
   // configure the state checkpointing
   JobCheckpointingSettings snapshotSettings = jobGraph.getCheckpointingSettings();
   if (snapshotSettings != null) {
        ...

      final CheckpointCoordinatorConfiguration chkConfig = snapshotSettings.getCheckpointCoordinatorConfiguration();

      // Enable checkpointing
      executionGraph.enableCheckpointing(
         chkConfig.getCheckpointInterval(), // checkpoint interval
         chkConfig.getCheckpointTimeout(), // checkpoint timeout
         chkConfig.getMinPauseBetweenCheckpoints(), // minimum pause between two checkpoints
         chkConfig.getMaxConcurrentCheckpoints(), // max number of concurrent checkpoints
         chkConfig.getCheckpointRetentionPolicy(), // checkpoint retention policy
         triggerVertices, // tasks that trigger the checkpoint, i.e. all source tasks
         ackVertices, // tasks that must acknowledge the checkpoint, here all tasks
         confirmVertices, // tasks that commit after the checkpoint completes, also all tasks
         hooks,
         checkpointIdCounter, // checkpoint ID counter, incremented on each checkpoint
         completedCheckpoints, // CompletedCheckpointStore, holds completed checkpoints
         rootBackend,
         checkpointStatsTracker);
   }

    ...

   return executionGraph;
}
//ExecutionGraph class
public void enableCheckpointing(
      long interval,
      long checkpointTimeout,
      long minPauseBetweenCheckpoints,
      int maxConcurrentCheckpoints,
      CheckpointRetentionPolicy retentionPolicy,
      List<ExecutionJobVertex> verticesToTrigger,
      List<ExecutionJobVertex> verticesToWaitFor,
      List<ExecutionJobVertex> verticesToCommitTo,
      List<MasterTriggerRestoreHook<?>> masterHooks,
      CheckpointIDCounter checkpointIDCounter,
      CompletedCheckpointStore checkpointStore,
      StateBackend checkpointStateBackend,
      CheckpointStatsTracker statsTracker) {

   // simple sanity checks
   checkArgument(interval >= 10, "checkpoint interval must not be below 10ms");
   checkArgument(checkpointTimeout >= 10, "checkpoint timeout must not be below 10ms");

   checkState(state == JobStatus.CREATED, "Job must be in CREATED state");
   checkState(checkpointCoordinator == null, "checkpointing already enabled");

   ExecutionVertex[] tasksToTrigger = collectExecutionVertices(verticesToTrigger);
   ExecutionVertex[] tasksToWaitFor = collectExecutionVertices(verticesToWaitFor);
   ExecutionVertex[] tasksToCommitTo = collectExecutionVertices(verticesToCommitTo);

   checkpointStatsTracker = checkNotNull(statsTracker, "CheckpointStatsTracker");
   
   //Create the CheckpointCoordinator; it is created whether or not checkpointing is configured
   // create the coordinator that triggers and commits checkpoints and holds the state
   checkpointCoordinator = new CheckpointCoordinator(
      jobInformation.getJobId(),
      interval,
      checkpointTimeout,
      minPauseBetweenCheckpoints,
      maxConcurrentCheckpoints,
      retentionPolicy,
      tasksToTrigger,
      tasksToWaitFor,
      tasksToCommitTo,
      checkpointIDCounter,
      checkpointStore,
      checkpointStateBackend,
      ioExecutor,
      SharedStateRegistry.DEFAULT_FACTORY);

   // register the master hooks on the checkpoint coordinator
   for (MasterTriggerRestoreHook<?> hook : masterHooks) {
      if (!checkpointCoordinator.addMasterHook(hook)) {
         LOG.warn("Trying to register multiple checkpoint hooks with the name: {}", hook.getIdentifier());
      }
   }

   checkpointCoordinator.setCheckpointStatsTracker(checkpointStatsTracker);

   //Register a job status listener that reacts when the job status changes
   //Note: if checkpointing is not configured, this checkpoint listener is not registered
   // interval of max long value indicates disable periodic checkpoint,
   // the CheckpointActivatorDeactivator should be created only if the interval is not max value
   if (interval != Long.MAX_VALUE) {
      // the periodic checkpoint scheduler is activated and deactivated as a result of
      // job status changes (running -> on, all other states -> off)
      registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());
   }
}

//CheckpointCoordinator class
public JobStatusListener createActivatorDeactivator() {
   synchronized (lock) {
      if (shutdown) {
         throw new IllegalArgumentException("Checkpoint coordinator is shut down");
      }

      if (jobStatusListener == null) {
         jobStatusListener = new CheckpointCoordinatorDeActivator(this);
      }

      return jobStatusListener;
   }
}

When the JobManager starts scheduling the job, the job status is switched from CREATED to RUNNING, implemented in the transitionState() method. During this transition the previously registered job listener, CheckpointCoordinatorDeActivator, kicks off the periodic checkpoint task. The call chain is ExecutionGraph.scheduleForExecution() -> transitionState() -> notifyJobStatusChange() -> CheckpointCoordinatorDeActivator.jobStatusChanges() -> CheckpointCoordinator.startCheckpointScheduler(). The source is as follows:

//ExecutionGraph class
public void scheduleForExecution() throws JobException {

   assertRunningInJobMasterMainThread();

   final long currentGlobalModVersion = globalModVersion;

   if (transitionState(JobStatus.CREATED, JobStatus.RUNNING)) {

      ...
   }
   else {
      throw new IllegalStateException("Job may only be scheduled from state " + JobStatus.CREATED);
   }
}

private boolean transitionState(JobStatus current, JobStatus newState, Throwable error) {
   assertRunningInJobMasterMainThread();
    ...
   // now do the actual state transition
   if (STATE_UPDATER.compareAndSet(this, current, newState)) {
      LOG.info("Job {} ({}) switched from state {} to {}.", getJobName(), getJobID(), current, newState, error);

      stateTimestamps[newState.ordinal()] = System.currentTimeMillis();
      notifyJobStatusChange(newState, error);
      return true;
   }
   else {
      return false;
   }
}

private void notifyJobStatusChange(JobStatus newState, Throwable error) {
   if (jobStatusListeners.size() > 0) {
      final long timestamp = System.currentTimeMillis();
      final Throwable serializedError = error == null ? null : new SerializedThrowable(error);

      for (JobStatusListener listener : jobStatusListeners) {
         try {
            listener.jobStatusChanges(getJobID(), newState, timestamp, serializedError);
         } catch (Throwable t) {
            LOG.warn("Error while notifying JobStatusListener", t);
         }
      }
   }
}

//CheckpointCoordinatorDeActivator class
public void jobStatusChanges(JobID jobId, JobStatus newJobStatus, long timestamp, Throwable error) {
   if (newJobStatus == JobStatus.RUNNING) {
      // start the checkpoint scheduler
      // The checkpoint scheduling is started here
      coordinator.startCheckpointScheduler();
   } else {
      // anything else should stop the trigger for now
      coordinator.stopCheckpointScheduler();
   }
}

The CheckpointCoordinator deploys a timer task, the ScheduledTrigger class, that periodically triggers checkpoints:

public void startCheckpointScheduler() {
   synchronized (lock) {
      if (shutdown) {
         throw new IllegalArgumentException("Checkpoint coordinator is shut down");
      }

      // make sure all prior timers are cancelled
      stopCheckpointScheduler();

      periodicScheduling = true;
      long initialDelay = ThreadLocalRandom.current().nextLong(
         minPauseBetweenCheckpointsNanos / 1_000_000L, baseInterval + 1L);
      // Start the periodic scheduling
      currentPeriodicTrigger = timer.scheduleAtFixedRate(
            new ScheduledTrigger(), initialDelay, baseInterval, TimeUnit.MILLISECONDS);
   }
}

2.1.2 ScheduledTrigger: Periodically Triggering Checkpoints

Now let's look at the implementation of ScheduledTrigger; the core is in CheckpointCoordinator.triggerCheckpoint():

  1. Before triggering, run some pre-checks: verify that the number of checkpoints currently in progress does not exceed the configured maximum number of concurrent checkpoints, and that the minimum pause since the last checkpoint has elapsed; only if both hold can the checkpoint be triggered
  2. Check that all tasks that need to be triggered, i.e. all source tasks, are running
  3. Check that all tasks whose checkpoint acknowledgement the JobManager must wait for are running; here that is every task, i.e. every task must send the JobManager a message acknowledging its own checkpoint
  4. Actually trigger the checkpoint: create a PendingCheckpoint carrying the checkpointID and timestamp, then trigger the checkpoint on all source tasks.
//CheckpointCoordinator class
private final class ScheduledTrigger implements Runnable {

   @Override
   public void run() {
      try {
         // Trigger the checkpoint
         triggerCheckpoint(System.currentTimeMillis(), true);
      }
      catch (Exception e) {
         LOG.error("Exception while triggering checkpoint for job {}.", job, e);
      }
   }
}

public boolean triggerCheckpoint(long timestamp, boolean isPeriodic) {
   return triggerCheckpoint(timestamp, checkpointProperties, null, isPeriodic).isSuccess();
}

public CheckpointTriggerResult triggerCheckpoint(
      long timestamp,
      CheckpointProperties props,
      @Nullable String externalSavepointLocation,
      boolean isPeriodic) {
    
    // First run some pre-checks to decide whether the checkpoint can be triggered:
    // 1. mainly the max number of concurrent checkpoints and the minimum pause between checkpoints
   // make some eager pre-checks
   synchronized (lock) {
      // abort if the coordinator has been shutdown in the meantime
      if (shutdown) {
         return new CheckpointTriggerResult(CheckpointDeclineReason.COORDINATOR_SHUTDOWN);
      }

      // Don't allow periodic checkpoint if scheduling has been disabled
      if (isPeriodic && !periodicScheduling) {
         return new CheckpointTriggerResult(CheckpointDeclineReason.PERIODIC_SCHEDULER_SHUTDOWN);
      }

      // validate whether the checkpoint can be triggered, with respect to the limit of
      // concurrent checkpoints, and the minimum time between checkpoints.
      // these checks are not relevant for savepoints
      if (!props.forceCheckpoint()) {
         // sanity check: there should never be more than one trigger request queued
         if (triggerRequestQueued) {
            LOG.warn("Trying to trigger another checkpoint for job {} while one was queued already.", job);
            return new CheckpointTriggerResult(CheckpointDeclineReason.ALREADY_QUEUED);
         }

         // if too many checkpoints are currently in progress, we need to mark that a request is queued
         if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
            triggerRequestQueued = true;
            if (currentPeriodicTrigger != null) {
               currentPeriodicTrigger.cancel(false);
               currentPeriodicTrigger = null;
            }
            return new CheckpointTriggerResult(CheckpointDeclineReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
         }

         // make sure the minimum interval between checkpoints has passed
         final long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
         final long durationTillNextMillis = (earliestNext - System.nanoTime()) / 1_000_000;

         if (durationTillNextMillis > 0) {
            if (currentPeriodicTrigger != null) {
               currentPeriodicTrigger.cancel(false);
               currentPeriodicTrigger = null;
            }
            // Reassign the new trigger to the currentPeriodicTrigger
            currentPeriodicTrigger = timer.scheduleAtFixedRate(
                  new ScheduledTrigger(),
                  durationTillNextMillis, baseInterval, TimeUnit.MILLISECONDS);

            return new CheckpointTriggerResult(CheckpointDeclineReason.MINIMUM_TIME_BETWEEN_CHECKPOINTS);
         }
      }
   }

   //Check that the tasks to trigger, i.e. all source tasks, are running
   // check if all tasks that we need to trigger are running.
   // if not, abort the checkpoint
   Execution[] executions = new Execution[tasksToTrigger.length];
   for (int i = 0; i < tasksToTrigger.length; i++) {
      Execution ee = tasksToTrigger[i].getCurrentExecutionAttempt();
      if (ee == null) {
         LOG.info("Checkpoint triggering task {} of job {} is not being executed at the moment. Aborting checkpoint.",
               tasksToTrigger[i].getTaskNameWithSubtaskIndex(),
               job);
         return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
      } else if (ee.getState() == ExecutionState.RUNNING) {
         executions[i] = ee;
      } else {
         LOG.info("Checkpoint triggering task {} of job {} is not in state {} but {} instead. Aborting checkpoint.",
               tasksToTrigger[i].getTaskNameWithSubtaskIndex(),
               job,
               ExecutionState.RUNNING,
               ee.getState());
         return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
      }
   }

   // Check that the tasks that must acknowledge the checkpoint to the JobManager are running
   // next, check if all tasks that need to acknowledge the checkpoint are running.
   // if not, abort the checkpoint
   Map<ExecutionAttemptID, ExecutionVertex> ackTasks = new HashMap<>(tasksToWaitFor.length);

   for (ExecutionVertex ev : tasksToWaitFor) {
      Execution ee = ev.getCurrentExecutionAttempt();
      if (ee != null) {
         ackTasks.put(ee.getAttemptId(), ev);
      } else {
         LOG.info("Checkpoint acknowledging task {} of job {} is not being executed at the moment. Aborting checkpoint.",
               ev.getTaskNameWithSubtaskIndex(),
               job);
         return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
      }
   }

   // we will actually trigger this checkpoint!

   // we lock with a special lock to make sure that trigger requests do not overtake each other.
   // this is not done with the coordinator-wide lock, because the 'checkpointIdCounter'
   // may issue blocking operations. Using a different lock than the coordinator-wide lock,
   // we avoid blocking the processing of 'acknowledge/decline' messages during that time.
   synchronized (triggerLock) {

      final CheckpointStorageLocation checkpointStorageLocation;
      final long checkpointID;

      try {
         // this must happen outside the coordinator-wide lock, because it communicates
         // with external services (in HA mode) and may block for a while.
         checkpointID = checkpointIdCounter.getAndIncrement();

         checkpointStorageLocation = props.isSavepoint() ?
               checkpointStorage.initializeLocationForSavepoint(checkpointID, externalSavepointLocation) :
               checkpointStorage.initializeLocationForCheckpoint(checkpointID);
      }
      catch (Throwable t) {
         ...
      }

      final PendingCheckpoint checkpoint = new PendingCheckpoint(
         job,
         checkpointID,
         timestamp,
         ackTasks,
         props,
         checkpointStorageLocation,
         executor);

          ...

      try {
         // re-acquire the coordinator-wide lock
         synchronized (lock) {
            // since we released the lock in the meantime, we need to re-check
            // that the conditions still hold.
            //The pre-checks from above are performed once more here
            ...

            LOG.info("Triggering checkpoint {} @ {} for job {}.", checkpointID, timestamp, job);

            pendingCheckpoints.put(checkpointID, checkpoint);

            ...
         }
         // end of lock scope

         final CheckpointOptions checkpointOptions = new CheckpointOptions(
               props.getCheckpointType(),
               checkpointStorageLocation.getLocationReference());
               
         // Send messages to the source tasks to trigger their checkpoints
         // send the messages to the tasks that trigger their checkpoint
         for (Execution execution: executions) {
            execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
         }

         numUnsuccessfulCheckpointsTriggers.set(0);
         return new CheckpointTriggerResult(checkpoint);
      }
      catch (Throwable t) {
          ...
      }

   } // end trigger lock
}

Execution.triggerCheckpoint() is simply a remote call to the TaskManager's triggerCheckpoint() method:

//Execution class
public void triggerCheckpoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions) {
   final LogicalSlot slot = assignedResource;

   if (slot != null) {
      final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
      // Trigger TaskManager.triggerCheckpoint via the RPC gateway
      taskManagerGateway.triggerCheckpoint(attemptId, getVertex().getJobId(), checkpointId, timestamp, checkpointOptions);
   } else {
      LOG.debug("The execution has no slot assigned. This indicates that the execution is " +
         "no longer running.");
   }
}

//RpcTaskManagerGateway class
public void triggerCheckpoint(ExecutionAttemptID executionAttemptID, JobID jobId, long checkpointId, long timestamp, CheckpointOptions checkpointOptions) {
   taskExecutorGateway.triggerCheckpoint(
      executionAttemptID,
      checkpointId,
      timestamp,
      checkpointOptions);
}

2.1.3 Checkpoint Execution in SourceStreamTask

The TaskManager's triggerCheckpoint() method first obtains the source task (the SourceStreamTask) and calls Task.triggerCheckpointBarrier(), which asynchronously runs a dedicated thread that carries out the source task's checkpoint.

//TaskExecutor class
public CompletableFuture<Acknowledge> triggerCheckpoint(
      ExecutionAttemptID executionAttemptID,
      long checkpointId,
      long checkpointTimestamp,
      CheckpointOptions checkpointOptions) {
   log.debug("Trigger checkpoint {}@{} for {}.", checkpointId, checkpointTimestamp, executionAttemptID);

   // Look up the task
   final Task task = taskSlotTable.getTask(executionAttemptID);

   if (task != null) {
      // Invoke triggerCheckpointBarrier() on the task
      task.triggerCheckpointBarrier(checkpointId, checkpointTimestamp, checkpointOptions);
      // Acknowledge right away; the actual checkpoint runs asynchronously
      return CompletableFuture.completedFuture(Acknowledge.get());
   } else {
      final String message = "TaskManager received a checkpoint request for unknown task " + executionAttemptID + '.';

      log.debug(message);
      return FutureUtils.completedExceptionally(new CheckpointException(message));
   }
}

//Task class
public void triggerCheckpointBarrier(
      final long checkpointID,
      long checkpointTimestamp,
      final CheckpointOptions checkpointOptions) {

   final AbstractInvokable invokable = this.invokable;
   final CheckpointMetaData checkpointMetaData = new CheckpointMetaData(checkpointID, checkpointTimestamp);

   if (executionState == ExecutionState.RUNNING && invokable != null) {

      // build a local closure
      final String taskName = taskNameWithSubtask;
      final SafetyNetCloseableRegistry safetyNetCloseableRegistry =
         FileSystemSafetyNet.getSafetyNetCloseableRegistryForThread();

      Runnable runnable = new Runnable() {
         @Override
         public void run() {
            // set safety net from the task's context for checkpointing thread
            LOG.debug("Creating FileSystem stream leak safety net for {}", Thread.currentThread().getName());
            FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(safetyNetCloseableRegistry);

            try {
                //Invoked by a dedicated thread; here the invokable is the SourceStreamTask
               boolean success = invokable.triggerCheckpoint(checkpointMetaData, checkpointOptions);
               if (!success) {
                  checkpointResponder.declineCheckpoint(
                        getJobID(), getExecutionId(), checkpointID,
                        new CheckpointDeclineTaskNotReadyException(taskName));
               }
            }
            catch (Throwable t) {
               if (getExecutionState() == ExecutionState.RUNNING) {
                  failExternally(new Exception(
                     "Error while triggering checkpoint " + checkpointID + " for " +
                        taskNameWithSubtask, t));
               } else {
                  LOG.debug("Encountered error while triggering checkpoint {} for " +
                     "{} ({}) while being not in state running.", checkpointID,
                     taskNameWithSubtask, executionId, t);
               }
            } finally {
               FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(null);
            }
         }
      };
      //Execute this runnable asynchronously
      executeAsyncCallRunnable(runnable, String.format("Checkpoint Trigger for %s (%s).", taskNameWithSubtask, executionId));
   }
   else {
     ...
   }
}

The core of the checkpoint is implemented in StreamTask.performCheckpoint(), which has three main steps:
1. Do some preparatory work before the checkpoint; in the common case operators do nothing in this phase
2. Immediately broadcast the CheckpointBarrier downstream, so that downstream tasks receive it promptly and can start their own checkpoint work
3. Take the state snapshot, i.e. the checkpoint operation itself.
Note that all of the above runs inside a synchronized block. The lock acquired here is the checkpoint lock; the checkpoint thread and the task's processing thread share this same lock, so while performCheckpoint() is running the task thread cannot process records.

//SourceStreamTask class
public boolean triggerCheckpoint(CheckpointMetaData checkpointMetaData, CheckpointOptions checkpointOptions) throws Exception {
   if (!externallyInducedCheckpoints) {
      return super.triggerCheckpoint(checkpointMetaData, checkpointOptions);
   }
   else {
      // we do not trigger checkpoints here, we simply state whether we can trigger them
      synchronized (getCheckpointLock()) {
         return isRunning();
      }
   }
}

//StreamTask class
public boolean triggerCheckpoint(CheckpointMetaData checkpointMetaData, CheckpointOptions checkpointOptions) throws Exception {
   try {
      // No alignment if we inject a checkpoint
      CheckpointMetrics checkpointMetrics = new CheckpointMetrics()
            .setBytesBufferedInAlignment(0L)
            .setAlignmentDurationNanos(0L);

      return performCheckpoint(checkpointMetaData, checkpointOptions, checkpointMetrics);
   }
   catch (Exception e) {
      ...
   }
}

private boolean performCheckpoint(
      CheckpointMetaData checkpointMetaData,
      CheckpointOptions checkpointOptions,
      CheckpointMetrics checkpointMetrics) throws Exception {

   LOG.debug("Starting checkpoint ({}) {} on task {}",
      checkpointMetaData.getCheckpointId(), checkpointOptions.getCheckpointType(), getName());

   synchronized (lock) {
      if (isRunning) {
         // we can do a checkpoint

         // All of the following steps happen as an atomic step from the perspective of barriers and
         // records/watermarks/timers/callbacks.
         // We generally try to emit the checkpoint barrier as soon as possible to not affect downstream
         // checkpoint alignments

         // Step (1): Prepare the checkpoint, allow operators to do some pre-barrier work.
         //           The pre-barrier work should be nothing or minimal in the common case.
         // Prepare the checkpoint
         operatorChain.prepareSnapshotPreBarrier(checkpointMetaData.getCheckpointId());

         // Step (2): Send the checkpoint barrier downstream
         // Broadcast the checkpoint barrier downstream
         operatorChain.broadcastCheckpointBarrier(
               checkpointMetaData.getCheckpointId(),
               checkpointMetaData.getTimestamp(),
               checkpointOptions);

         // Step (3): Take the state snapshot. This should be largely asynchronous, to not
         //           impact progress of the streaming topology
         checkpointState(checkpointMetaData, checkpointOptions, checkpointMetrics);
         return true;
      }
      else {
         ...
         return false;
      }
   }
}

First, step (1), operatorChain.prepareSnapshotPreBarrier(); the default implementation does nothing:

//OperatorChain class
public void prepareSnapshotPreBarrier(long checkpointId) throws Exception {
   // go forward through the operator chain and tell each operator
   // to prepare the checkpoint
   final StreamOperator<?>[] operators = this.allOperators;
   for (int i = operators.length - 1; i >= 0; --i) {
      final StreamOperator<?> op = operators[i];
      if (op != null) {
         op.prepareSnapshotPreBarrier(checkpointId);
      }
   }
}

//AbstractStreamOperator class
public void prepareSnapshotPreBarrier(long checkpointId) throws Exception {
   // the default implementation does nothing and accepts the checkpoint
   // this is purely for subclasses to override
}

Broadcasting the CheckpointBarrier
Next, operatorChain.broadcastCheckpointBarrier(), i.e. how the CheckpointBarrier is broadcast.
The core happens in the task's output ResultPartition: a CheckpointBarrier is sent to every downstream channel. There is one channel per downstream task, so every downstream task receives a barrier; this is what "broadcast" means here.
This broadcast is also called CheckpointBarrier injection, because the barrier is written not by the thread running the source task but by the checkpoint thread, under synchronization; while the barrier is being written, the source task thread is blocked.
The source is as follows:

//OperatorChain class
public void broadcastCheckpointBarrier(long id, long timestamp, CheckpointOptions checkpointOptions) throws IOException {
   CheckpointBarrier barrier = new CheckpointBarrier(id, timestamp, checkpointOptions);
   for (RecordWriterOutput<?> streamOutput : streamOutputs) {
      streamOutput.broadcastEvent(barrier);
   }
}

//RecordWriterOutput class
public void broadcastEvent(AbstractEvent event) throws IOException {
   recordWriter.broadcastEvent(event);
}

//RecordWriter class
public void broadcastEvent(AbstractEvent event) throws IOException {
   try (BufferConsumer eventBufferConsumer = EventSerializer.toBufferConsumer(event)) {
      for (int targetChannel = 0; targetChannel < numberOfChannels; targetChannel++) {
         tryFinishCurrentBufferBuilder(targetChannel);

         //targetPartition here is the ResultPartition; a CheckpointBarrier is sent to every downstream channel
         // Retain the buffer so that it can be recycled by each channel of targetPartition
         targetPartition.addBufferConsumer(eventBufferConsumer.copy(), targetChannel);
      }

      if (flushAlways) {
         flushAll();
      }
   }
}

Executing the checkpoint asynchronously
Next comes the key part, the checkpointState() method, i.e. the execution of the checkpoint itself. This too is asynchronous, so that checkpointing does not disturb normal stream processing.
For each operator in the StreamTask an OperatorSnapshotFutures is created; it contains the FutureTasks that perform the operator's state checkpoint, and a separate thread then executes these operators' actual checkpoint work asynchronously.

//StreamTask class
private void checkpointState(
      CheckpointMetaData checkpointMetaData,
      CheckpointOptions checkpointOptions,
      CheckpointMetrics checkpointMetrics) throws Exception {

   CheckpointStreamFactory storage = checkpointStorage.resolveCheckpointStorageLocation(
         checkpointMetaData.getCheckpointId(),
         checkpointOptions.getTargetLocation());

   CheckpointingOperation checkpointingOperation = new CheckpointingOperation(
      this,
      checkpointMetaData,
      checkpointOptions,
      storage,
      checkpointMetrics);

   checkpointingOperation.executeCheckpointing();
}
//StreamTask#CheckpointingOperation class
public void executeCheckpointing() throws Exception {
   startSyncPartNano = System.nanoTime();

   try {
      //Create an OperatorSnapshotFutures for each operator and add it to operatorSnapshotsInProgress
      for (StreamOperator<?> op : allOperators) {
         checkpointStreamOperator(op);
      }

      startAsyncPartNano = System.nanoTime();

      checkpointMetrics.setSyncDurationMillis((startAsyncPartNano - startSyncPartNano) / 1_000_000);
        
      //An asynchronous thread performs the actual checkpoint work, so stream processing is not affected
      // we are transferring ownership over snapshotInProgressList for cleanup to the thread, active on submit
      AsyncCheckpointRunnable asyncCheckpointRunnable = new AsyncCheckpointRunnable(
         owner,
         operatorSnapshotsInProgress,
         checkpointMetaData,
         checkpointMetrics,
         startAsyncPartNano);

      owner.cancelables.registerCloseable(asyncCheckpointRunnable);
      owner.asyncOperationsThreadPool.submit(asyncCheckpointRunnable);

      ...
   } catch (Exception ex) {
      ...
      owner.synchronousCheckpointExceptionHandler.tryHandleCheckpointException(checkpointMetaData, ex);
   }
}

private void checkpointStreamOperator(StreamOperator<?> op) throws Exception {
   if (null != op) {
       //snapshotInProgress contains the FutureTasks of this operator's checkpoint
      OperatorSnapshotFutures snapshotInProgress = op.snapshotState(
            checkpointMetaData.getCheckpointId(),
            checkpointMetaData.getTimestamp(),
            checkpointOptions,
            storageLocation);
      operatorSnapshotsInProgress.put(op.getOperatorID(), snapshotInProgress);
   }
}

OperatorSnapshotFutures
Let's first look at the OperatorSnapshotFutures created per operator. It holds the RunnableFutures of the operator's state snapshot. What does that mean? A RunnableFuture is first of all a Runnable: the snapshot work is written into its run() method but not executed yet; run() is invoked later, when the real snapshot should happen. It is also a Future: once run() has executed, get() returns the result, which here is a SnapshotResult.

public class OperatorSnapshotFutures {

   @Nonnull
   private RunnableFuture<SnapshotResult<KeyedStateHandle>> keyedStateManagedFuture;

   @Nonnull
   private RunnableFuture<SnapshotResult<KeyedStateHandle>> keyedStateRawFuture;

   @Nonnull
   private RunnableFuture<SnapshotResult<OperatorStateHandle>> operatorStateManagedFuture;

   @Nonnull
   private RunnableFuture<SnapshotResult<OperatorStateHandle>> operatorStateRawFuture;
   ...
}

From the snapshotState() method below you can see that keyedStateRawFuture and operatorStateRawFuture in OperatorSnapshotFutures are DoneFutures, i.e. already-completed RunnableFutures, so that part is synchronous.
The snapshot work for operatorState and keyedState is wrapped into a RunnableFuture (that process is fairly involved and not covered here) and is not executed immediately; the snapshot really runs only when RunnableFuture.run() is called later. The RunnableFuture here is a FutureTask, and this part is asynchronous, executed by a dedicated async snapshot thread.
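The RunnableFuture semantics just described can be illustrated with a plain java.util.concurrent.FutureTask; a standalone sketch (not Flink code):

import java.util.concurrent.FutureTask;
import java.util.concurrent.RunnableFuture;

public class RunnableFutureDemo {
    public static void main(String[] args) throws Exception {
        // The "snapshot work" is captured inside the task but not executed yet
        RunnableFuture<String> snapshotFuture = new FutureTask<>(() -> "snapshot-result");

        // Later, possibly on another thread, run() actually executes the work...
        snapshotFuture.run();

        // ...and get() then returns the result, analogous to SnapshotResult
        System.out.println(snapshotFuture.get()); // prints "snapshot-result"
    }
}

The snapshotState() method itself: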

//AbstractStreamOperator class
public final OperatorSnapshotFutures snapshotState(long checkpointId, long timestamp, CheckpointOptions checkpointOptions,
      CheckpointStreamFactory factory) throws Exception {

   KeyGroupRange keyGroupRange = null != keyedStateBackend ?
         keyedStateBackend.getKeyGroupRange() : KeyGroupRange.EMPTY_KEY_GROUP_RANGE;

   OperatorSnapshotFutures snapshotInProgress = new OperatorSnapshotFutures();

   try (StateSnapshotContextSynchronousImpl snapshotContext = new StateSnapshotContextSynchronousImpl(
         checkpointId,
         timestamp,
         factory,
         keyGroupRange,
         getContainingTask().getCancelables())) {
      //Take the snapshot through the snapshotContext; this also invokes CheckpointedFunction.snapshotState()
      snapshotState(snapshotContext);
    
      //keyedStateRawFuture and operatorStateRawFuture are DoneFutures, i.e. already-completed RunnableFutures
      snapshotInProgress.setKeyedStateRawFuture(snapshotContext.getKeyedStateStreamFuture());
      snapshotInProgress.setOperatorStateRawFuture(snapshotContext.getOperatorStateStreamFuture());
    
      //The operatorState and keyedState snapshots are wrapped into RunnableFutures and not executed immediately;
      //the snapshot really runs only when RunnableFuture.run() is called later
      if (null != operatorStateBackend) {
         snapshotInProgress.setOperatorStateManagedFuture(
            operatorStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
      }

      if (null != keyedStateBackend) {
         snapshotInProgress.setKeyedStateManagedFuture(
            keyedStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
      }
   } catch (Exception snapshotException) {
      ...
   }

   return snapshotInProgress;
}

AsyncCheckpointRunnable
Now let's look at the async snapshot thread. jobManagerTaskOperatorSubtaskStates is the operator snapshot metadata that must be acknowledged to the JobManager; localTaskOperatorSubtaskStates is the operator snapshot metadata reported to the TaskManager, which keeps a local copy of the state so the TaskManager can perform fast local recovery. What we mainly care about is the TaskStateSnapshot acknowledged to the JobManager.
1. For each operator an OperatorSnapshotFinalizer is created; it is the real executor of the state snapshot, i.e. it runs those FutureTasks
2. Once the snapshots complete, the checkpoint information is reported to the JobManager

//AsyncCheckpointRunnable class
public void run() {
   FileSystemSafetyNet.initializeSafetyNetForThread();
   try {

      TaskStateSnapshot jobManagerTaskOperatorSubtaskStates =
         new TaskStateSnapshot(operatorSnapshotsInProgress.size());

      TaskStateSnapshot localTaskOperatorSubtaskStates =
         new TaskStateSnapshot(operatorSnapshotsInProgress.size());

      for (Map.Entry<OperatorID, OperatorSnapshotFutures> entry : operatorSnapshotsInProgress.entrySet()) {

         OperatorID operatorID = entry.getKey();
         OperatorSnapshotFutures snapshotInProgress = entry.getValue();
        
        //This is where the operator snapshots really execute: those FutureTasks are run
         // finalize the async part of all by executing all snapshot runnables
         OperatorSnapshotFinalizer finalizedSnapshots =
            new OperatorSnapshotFinalizer(snapshotInProgress);

         jobManagerTaskOperatorSubtaskStates.putSubtaskStateByOperatorID(
            operatorID,
            finalizedSnapshots.getJobManagerOwnedState());

         localTaskOperatorSubtaskStates.putSubtaskStateByOperatorID(
            operatorID,
            finalizedSnapshots.getTaskLocalState());
      }

      final long asyncEndNanos = System.nanoTime();
      final long asyncDurationMillis = (asyncEndNanos - asyncStartNanos) / 1_000_000L;

      checkpointMetrics.setAsyncDurationMillis(asyncDurationMillis);

      if (asyncCheckpointState.compareAndSet(CheckpointingOperation.AsyncCheckpointState.RUNNING,
         CheckpointingOperation.AsyncCheckpointState.COMPLETED)) {
        //Report the checkpoint information to the JobManager
         reportCompletedSnapshotStates(
            jobManagerTaskOperatorSubtaskStates,
            localTaskOperatorSubtaskStates,
            asyncDurationMillis);

      } else {
         LOG.debug("{} - asynchronous part of checkpoint {} could not be completed because it was closed before.",
            owner.getName(),
            checkpointMetaData.getCheckpointId());
      }
   } catch (Exception e) {
      handleExecutionException(e);
   } finally {
      owner.cancelables.unregisterCloseable(this);
      FileSystemSafetyNet.closeSafetyNetAndGuardedResourcesForThread();
   }
}

As you can see, constructing an OperatorSnapshotFinalizer runs those FutureTasks and performs the snapshot, e.g. writing the state data to a file system such as HDFS.

//OperatorSnapshotFinalizer class
public OperatorSnapshotFinalizer(
   @Nonnull OperatorSnapshotFutures snapshotFutures) throws ExecutionException, InterruptedException {

   SnapshotResult<KeyedStateHandle> keyedManaged =
      FutureUtils.runIfNotDoneAndGet(snapshotFutures.getKeyedStateManagedFuture());

   SnapshotResult<KeyedStateHandle> keyedRaw =
      FutureUtils.runIfNotDoneAndGet(snapshotFutures.getKeyedStateRawFuture());

   SnapshotResult<OperatorStateHandle> operatorManaged =
      FutureUtils.runIfNotDoneAndGet(snapshotFutures.getOperatorStateManagedFuture());

   SnapshotResult<OperatorStateHandle> operatorRaw =
      FutureUtils.runIfNotDoneAndGet(snapshotFutures.getOperatorStateRawFuture());

   jobManagerOwnedState = new OperatorSubtaskState(
      operatorManaged.getJobManagerOwnedSnapshot(),
      operatorRaw.getJobManagerOwnedSnapshot(),
      keyedManaged.getJobManagerOwnedSnapshot(),
      keyedRaw.getJobManagerOwnedSnapshot()
   );

   taskLocalState = new OperatorSubtaskState(
      operatorManaged.getTaskLocalSnapshot(),
      operatorRaw.getTaskLocalSnapshot(),
      keyedManaged.getTaskLocalSnapshot(),
      keyedRaw.getTaskLocalSnapshot()
   );
}

2.1.4 Tasks Reporting Checkpoint Information

After completing its checkpoint, a Task reports the checkpoint metadata to the JobManager, eventually reaching CheckpointCoordinator.receiveAcknowledgeMessage():

//TaskStateManagerImpl class
public void reportTaskStateSnapshots(
   @Nonnull CheckpointMetaData checkpointMetaData,
   @Nonnull CheckpointMetrics checkpointMetrics,
   @Nullable TaskStateSnapshot acknowledgedState,
   @Nullable TaskStateSnapshot localState) {

   long checkpointId = checkpointMetaData.getCheckpointId();

   localStateStore.storeLocalState(checkpointId, localState);

   checkpointResponder.acknowledgeCheckpoint(
      jobId,
      executionAttemptID,
      checkpointId,
      checkpointMetrics,
      acknowledgedState);
}

//JobMaster class
public void acknowledgeCheckpoint(
      final JobID jobID,
      final ExecutionAttemptID executionAttemptID,
      final long checkpointId,
      final CheckpointMetrics checkpointMetrics,
      final TaskStateSnapshot checkpointState) {

   final CheckpointCoordinator checkpointCoordinator = executionGraph.getCheckpointCoordinator();
   final AcknowledgeCheckpoint ackMessage = new AcknowledgeCheckpoint(
      jobID,
      executionAttemptID,
      checkpointId,
      checkpointMetrics,
      checkpointState);

   if (checkpointCoordinator != null) {
      getRpcService().execute(() -> {
         try {
            // Invoke receiveAcknowledgeMessage via RPC
            checkpointCoordinator.receiveAcknowledgeMessage(ackMessage);
         } catch (Throwable t) {
            log.warn("Error while processing checkpoint acknowledgement message", t);
         }
      });
   } else {
      ...
   }
}

The CheckpointCoordinator finally calls PendingCheckpoint.acknowledgeTask(), which adds the metadata reported by the task (checkpoint path, state size, and so on) to the PendingCheckpoint. Once the Ack messages of all tasks have been received, completePendingCheckpoint() runs:

//CheckpointCoordinator
public boolean receiveAcknowledgeMessage(AcknowledgeCheckpoint message) throws CheckpointException {
   ...
   
   final long checkpointId = message.getCheckpointId();

   synchronized (lock) {
      // we need to check inside the lock for being shutdown as well, otherwise we
      // get races and invalid error log messages
      if (shutdown) {
         return false;
      }

      final PendingCheckpoint checkpoint = pendingCheckpoints.get(checkpointId);

      if (checkpoint != null && !checkpoint.isDiscarded()) {

         switch (checkpoint.acknowledgeTask(message.getTaskExecutionId(), message.getSubtaskState(), message.getCheckpointMetrics())) {
            case SUCCESS:
               if (checkpoint.isFullyAcknowledged()) {
                  // Once every subtask has acknowledged, this checkpoint can be completed
                  completePendingCheckpoint(checkpoint);
               }
               break;
            ...
          }

         return true;
      }
      ...
   }
}

//PendingCheckpoint class
public TaskAcknowledgeResult acknowledgeTask(
      ExecutionAttemptID executionAttemptId,
      TaskStateSnapshot operatorSubtaskStates,
      CheckpointMetrics metrics) {

   synchronized (lock) {
      if (discarded) {
         return TaskAcknowledgeResult.DISCARDED;
      }

      final ExecutionVertex vertex = notYetAcknowledgedTasks.remove(executionAttemptId);

      if (vertex == null) {
         if (acknowledgedTasks.contains(executionAttemptId)) {
            return TaskAcknowledgeResult.DUPLICATE;
         } else {
            return TaskAcknowledgeResult.UNKNOWN;
         }
      } else {
         acknowledgedTasks.add(executionAttemptId);
      }

      List<OperatorID> operatorIDs = vertex.getJobVertex().getOperatorIDs();
      int subtaskIndex = vertex.getParallelSubtaskIndex();
      long ackTimestamp = System.currentTimeMillis();

      long stateSize = 0L;
      //Add the reported operatorSubtaskStates, i.e. each task's snapshot metadata, to the PendingCheckpoint's operatorStates
      if (operatorSubtaskStates != null) {
         for (OperatorID operatorID : operatorIDs) {

            OperatorSubtaskState operatorSubtaskState =
               operatorSubtaskStates.getSubtaskStateByOperatorID(operatorID);

            // if no real operatorSubtaskState was reported, we insert an empty state
            if (operatorSubtaskState == null) {
               operatorSubtaskState = new OperatorSubtaskState();
            }

            OperatorState operatorState = operatorStates.get(operatorID);

            if (operatorState == null) {
               operatorState = new OperatorState(
                  operatorID,
                  vertex.getTotalNumberOfParallelSubtasks(),
                  vertex.getMaxParallelism());
               operatorStates.put(operatorID, operatorState);
            }

            operatorState.putState(subtaskIndex, operatorSubtaskState);
            stateSize += operatorSubtaskState.getStateSize();
         }
      }

      ++numAcknowledgedTasks;

    ...

      return TaskAcknowledgeResult.SUCCESS;
   }
}

After the JobManager has received the Acks of all tasks, it does the following:

  1. Converts the PendingCheckpoint into a CompletedCheckpoint, which marks the checkpoint as complete. The CompletedCheckpoint contains the checkpoint metadata, including the checkpoint path and the state size; the metadata is also persisted, under the directory ${checkpointDir}/${uuid}/chk-***/_metadata, and expired checkpoint data is deleted
  2. Notifies all tasks to perform their commit. Usually a task's commit does nothing, but a TwoPhaseCommitSinkFunction such as FlinkKafkaProducer will commit its transaction and do similar work.
private void completePendingCheckpoint(PendingCheckpoint pendingCheckpoint) throws CheckpointException {
   final long checkpointId = pendingCheckpoint.getCheckpointId();
   final CompletedCheckpoint completedCheckpoint;

   // As a first step to complete the checkpoint, we register its state with the registry
   Map<OperatorID, OperatorState> operatorStates = pendingCheckpoint.getOperatorStates();
   sharedStateRegistry.registerAll(operatorStates.values());

   try {
      try {
         // Finalize the checkpoint
         completedCheckpoint = pendingCheckpoint.finalizeCheckpoint();
      }
      catch (Exception e1) {
          ...
       }

      // the pending checkpoint must be discarded after the finalization
      Preconditions.checkState(pendingCheckpoint.isDiscarded() && completedCheckpoint != null);

      try {
         completedCheckpointStore.addCheckpoint(completedCheckpoint);
      } catch (Exception exception) {
      ...
     }
   } finally {
      pendingCheckpoints.remove(checkpointId);

      triggerQueuedRequests();
   }

   rememberRecentCheckpointId(checkpointId);

   // drop those pending checkpoints that are at prior to the completed one
   dropSubsumedCheckpoints(checkpointId);

   // record the time when this was completed, to calculate
   // the 'min delay between checkpoints'
   lastCheckpointCompletionNanos = System.nanoTime();

  ...

   // send the "notify complete" call to all vertices
   final long timestamp = completedCheckpoint.getTimestamp();

   for (ExecutionVertex ev : tasksToCommitTo) {
      Execution ee = ev.getCurrentExecutionAttempt();
      if (ee != null) {
          // Upon receiving this, the task calls Task.notifyCheckpointComplete()
         ee.notifyCheckpointComplete(checkpointId, timestamp);
      }
   }
}

2.1.5 JobManager Notifies Tasks to Commit

After receiving the notification, a Task calls Task.notifyCheckpointComplete(), which finally invokes StreamOperator.notifyCheckpointComplete(). In general this does nothing, but operators like AbstractUdfStreamOperator may do some extra work:

public void notifyCheckpointComplete(final long checkpointID) {
   final AbstractInvokable invokable = this.invokable;

   if (executionState == ExecutionState.RUNNING && invokable != null) {

      Runnable runnable = new Runnable() {
         @Override
         public void run() {
            try {
               invokable.notifyCheckpointComplete(checkpointID);
               taskStateManager.notifyCheckpointComplete(checkpointID);
            } catch (Throwable t) {
               ...
            }
         }
      };
      executeAsyncCallRunnable(runnable, "Checkpoint Confirmation for " +
         taskNameWithSubtask);
   }
   else {
      LOG.debug("Ignoring checkpoint commit notification for non-running task {}.", taskNameWithSubtask);
   }
}
//StreamTask class
public void notifyCheckpointComplete(long checkpointId) throws Exception {
   synchronized (lock) {
      if (isRunning) {
         LOG.debug("Notification of complete checkpoint for task {}", getName());

         for (StreamOperator<?> operator : operatorChain.getAllOperators()) {
            if (operator != null) {
               operator.notifyCheckpointComplete(checkpointId);
            }
         }
      }
      else {
         LOG.debug("Ignoring notification of complete checkpoint for not-running task {}", getName());
      }
   }
}

//AbstractStreamOperator class
public void notifyCheckpointComplete(long checkpointId) throws Exception {
   if (keyedStateBackend != null) {
      keyedStateBackend.notifyCheckpointComplete(checkpointId);
   }
}

//HeapKeyedStateBackend class
public void notifyCheckpointComplete(long checkpointId) {
   //Nothing to do
}

AbstractUdfStreamOperator is used for operators that wrap user-defined functions, such as StreamMap and StreamSource. If the user-defined Function implements the CheckpointListener interface, additional work is done here: for example, FlinkKafkaConsumerBase commits the consumed offsets to Kafka, and TwoPhaseCommitSinkFunction subclasses such as FlinkKafkaProducer commit their transactions. Notably, TwoPhaseCommitSinkFunction is what provides Flink's end-to-end exactly-once guarantee. A user-side sketch follows the Flink source below.

//AbstractUdfStreamOperator
public void notifyCheckpointComplete(long checkpointId) throws Exception {
   super.notifyCheckpointComplete(checkpointId);

   if (userFunction instanceof CheckpointListener) {
      ((CheckpointListener) userFunction).notifyCheckpointComplete(checkpointId);
   }
}
//FlinkKafkaConsumerBase class
public final void notifyCheckpointComplete(long checkpointId) throws Exception {
   if (!running) {
      LOG.debug("notifyCheckpointComplete() called on closed source");
      return;
   }

   final AbstractFetcher<?, ?> fetcher = this.kafkaFetcher;
   if (fetcher == null) {
      LOG.debug("notifyCheckpointComplete() called on uninitialized source");
      return;
   }

   if (offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS) {
      // only one commit operation must be in progress
      if (LOG.isDebugEnabled()) {
         LOG.debug("Committing offsets to Kafka/ZooKeeper for checkpoint " + checkpointId);
      }

      try {
         ...
         @SuppressWarnings("unchecked")
         Map<KafkaTopicPartition, Long> offsets =
            (Map<KafkaTopicPartition, Long>) pendingOffsetsToCommit.remove(posInMap);

         ...
         //Commit the offsets to Kafka
         fetcher.commitInternalOffsetsToKafka(offsets, offsetCommitCallback);
      } catch (Exception e) {
         ...
      }
   }
}
//TwoPhaseCommitSinkFunction class
public final void notifyCheckpointComplete(long checkpointId) throws Exception {
   
   Iterator<Map.Entry<Long, TransactionHolder<TXN>>> pendingTransactionIterator = pendingCommitTransactions.entrySet().iterator();
   checkState(pendingTransactionIterator.hasNext(), "checkpoint completed, but no transaction pending");
   Throwable firstError = null;

   while (pendingTransactionIterator.hasNext()) {
      ...
      try {
          //Commit the transaction
         commit(pendingTransaction.handle);
      } catch (Throwable t) {
        ...
      }
    ...
   }
    ...
}

//FlinkKafkaProducer class
protected void commit(FlinkKafkaProducer.KafkaTransactionState transaction) {
   if (transaction.isTransactional()) {
      try {
         transaction.producer.commitTransaction();
      } finally {
         recycleTransactionalProducer(transaction.producer);
      }
   }
}
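As referenced above, a user-defined function only needs to implement CheckpointListener to hook into this notification. A minimal sketch (hypothetical sink, not from the Flink codebase; CheckpointListener lives in org.apache.flink.runtime.state in the Flink version discussed here):

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class AuditSink extends RichSinkFunction<String> implements CheckpointListener {

    @Override
    public void invoke(String value) {
        // buffer or write the record here
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Called through AbstractUdfStreamOperator.notifyCheckpointComplete() above,
        // e.g. flush buffered output or confirm external side effects here
        System.out.println("checkpoint " + checkpointId + " completed");
    }
}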

Once every task has executed notifyCheckpointComplete(), a complete checkpoint cycle is finished.

2.2 Checkpoint Implementation for Non-SourceStreamTasks

The above covered only the source task's checkpoint, which is triggered by the JobManager. So how does the checkpoint of a non-source task proceed?
As described above, the source task broadcasts the CheckpointBarrier downstream, so downstream tasks receive the CheckpointBarrier sent by the source task; their checkpoint starts when the CheckpointBarrier is received. The entry point is CheckpointBarrierHandler.getNextNonBlocked(), used in the StreamTask's input processing:

//StreamInputProcessor class
public boolean processInput() throws Exception {
   ...

   while (true) {
     ...
      final BufferOrEvent bufferOrEvent = barrierHandler.getNextNonBlocked();
      ...
   }
}

CheckpointBarrierHandler has different implementations depending on the CheckpointingMode. With EXACTLY_ONCE a BarrierBuffer is created, which performs barrier alignment; its CachedBufferBlocker caches the data received from blocked channels while barriers are being aligned. With AT_LEAST_ONCE a BarrierTracker is created, which does not align barriers. A user-side sketch of choosing the mode is shown below, followed by the handler factory.
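On the user side, the mode is chosen when enabling checkpointing; a minimal sketch (the interval value is illustrative):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // EXACTLY_ONCE -> BarrierBuffer (with barrier alignment);
        // AT_LEAST_ONCE -> BarrierTracker (no alignment)
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
    }
}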

public static CheckpointBarrierHandler createCheckpointBarrierHandler(
      StreamTask<?, ?> checkpointedTask,
      CheckpointingMode checkpointMode,
      IOManager ioManager,
      InputGate inputGate,
      Configuration taskManagerConfig) throws IOException {

   CheckpointBarrierHandler barrierHandler;
   if (checkpointMode == CheckpointingMode.EXACTLY_ONCE) {
      long maxAlign = taskManagerConfig.getLong(TaskManagerOptions.TASK_CHECKPOINT_ALIGNMENT_BYTES_LIMIT);
      if (!(maxAlign == -1 || maxAlign > 0)) {
         throw new IllegalConfigurationException(
            TaskManagerOptions.TASK_CHECKPOINT_ALIGNMENT_BYTES_LIMIT.key()
            + " must be positive or -1 (infinite)");
      }

      if (taskManagerConfig.getBoolean(TaskManagerOptions.NETWORK_CREDIT_MODEL)) {
         barrierHandler = new BarrierBuffer(inputGate, new CachedBufferBlocker(inputGate.getPageSize()), maxAlign);
      } else {
         barrierHandler = new BarrierBuffer(inputGate, new BufferSpiller(ioManager, inputGate.getPageSize()), maxAlign);
      }
   } else if (checkpointMode == CheckpointingMode.AT_LEAST_ONCE) {
      barrierHandler = new BarrierTracker(inputGate);
   } else {
      throw new IllegalArgumentException("Unrecognized Checkpointing Mode: " + checkpointMode);
   }

   if (checkpointedTask != null) {
      barrierHandler.registerCheckpointEventHandler(checkpointedTask);
   }

   return barrierHandler;
}

2.2.1 Checkpoint Implementation under EXACTLY_ONCE

Now let's look at the CheckpointBarrierHandler processing logic, starting with BarrierBuffer:
1. Get the next BufferOrEvent, either from the current buffered sequence or, once that is exhausted, from the inputGate
2. If the event comes from a channel that has been marked blocked, store it in the buffer; a channel is marked blocked once its CheckpointBarrier has arrived
3. If the received event is a CheckpointBarrier, process it; if it is a normal data buffer, return it

//BarrierBuffer class
public BufferOrEvent getNextNonBlocked() throws Exception {
   while (true) {
      // process buffered BufferOrEvents before grabbing new ones
      Optional<BufferOrEvent> next;
      if (currentBuffered == null) {
         next = inputGate.getNextBufferOrEvent();
      }
      else {
         next = Optional.ofNullable(currentBuffered.getNext());
         if (!next.isPresent()) {
            completeBufferedSequence();
            return getNextNonBlocked();
         }
      }

      ...

      BufferOrEvent bufferOrEvent = next.get();
      if (isBlocked(bufferOrEvent.getChannelIndex())) {
          //If the data comes from a blocked channel, add it to the buffer; a channel is marked blocked once its CheckpointBarrier has arrived
         // if the channel is blocked, we just store the BufferOrEvent
         bufferBlocker.add(bufferOrEvent);
         checkSizeLimit();
      }
      else if (bufferOrEvent.isBuffer()) {
         return bufferOrEvent;
      }
      else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
         if (!endOfStream) {
             //Process the received CheckpointBarrier
            // process barriers only if there is a chance of the checkpoint completing
            processBarrier((CheckpointBarrier) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
         }
      }
      else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
         processCancellationBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent());
      }
      else {
         if (bufferOrEvent.getEvent().getClass() == EndOfPartitionEvent.class) {
            processEndOfPartition();
         }
         return bufferOrEvent;
      }
   }
}

//CachedBufferBlocker
public void add(BufferOrEvent boe) {
   bytesBlocked += pageSize;
    // currentBuffers is an ArrayDeque<BufferOrEvent>
   currentBuffers.add(boe);
}

The core handling happens in processBarrier():
1. If there is only one input channel, i.e. a single upstream task, trigger the checkpoint immediately.
2. With multiple upstream tasks, i.e. multiple channels, there are several cases:
1) On the first barrier of a checkpoint, start barrier alignment and mark that channel as blocked.
2) On a barrier that is neither the first nor the last, just mark the channel as blocked.
3) If a barrier of the next checkpoint arrives before the current one has completed, abort the current checkpoint and start the new one; this case is uncommon.
4) Once barriers have arrived from all channels, i.e. from every upstream task, trigger the checkpoint, unblock all channels, and start processing the events that were added to the cache.

private void processBarrier(CheckpointBarrier receivedBarrier, int channelIndex) throws Exception {
   final long barrierId = receivedBarrier.getId();

   // fast path for single channel cases
   if (totalNumberOfInputChannels == 1) {
      if (barrierId > currentCheckpointId) {
         // new checkpoint
         currentCheckpointId = barrierId;
         notifyCheckpoint(receivedBarrier);
      }
      return;
   }

   // -- general code path for multiple input channels --

   if (numBarriersReceived > 0) {
      // this is only true if some alignment is already progress and was not canceled

      if (barrierId == currentCheckpointId) {
         // regular case
         onBarrier(channelIndex);
      }
      else if (barrierId > currentCheckpointId) {
          // the previous checkpoint has not finished, yet a barrier of the next one arrived: abort the current checkpoint
         // we did not complete the current checkpoint, another started before
         ...
         // let the task know we are not completing this
         notifyAbort(currentCheckpointId, new CheckpointDeclineSubsumedException(barrierId));

         // abort the current checkpoint
         releaseBlocksAndResetBarriers();

         // begin the new checkpoint
         beginNewAlignment(barrierId, channelIndex);
      }
      else {
         // ignore trailing barrier from an earlier checkpoint (obsolete now)
         return;
      }
   }
   else if (barrierId > currentCheckpointId) {
      // first barrier of a new checkpoint
      beginNewAlignment(barrierId, channelIndex);
   }
   else {
      // either the current checkpoint was canceled (numBarriers == 0) or
      // this barrier is from an old subsumed checkpoint
      return;
   }
    // barriers from all channels have arrived: trigger the checkpoint
   // check if we have all barriers - since canceled checkpoints always have zero barriers
   // this can only happen on a non canceled checkpoint
   if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
      // actually trigger checkpoint
      ...
      // release the blocked state of all channels
      releaseBlocksAndResetBarriers();
      // notify the task to actually perform the checkpoint
      notifyCheckpoint(receivedBarrier);
   }
}

//start barrier alignment
private void beginNewAlignment(long checkpointId, int channelIndex) throws IOException {
   currentCheckpointId = checkpointId;
   onBarrier(channelIndex);

   startOfAlignmentTimestamp = System.nanoTime();

   ...
}

//onBarrier() marks the given channel as blocked
private void onBarrier(int channelIndex) throws IOException {
   if (!blockedChannels[channelIndex]) {
      blockedChannels[channelIndex] = true;

      numBarriersReceived++;

      ...
   }
   else {
      throw new IOException("Stream corrupt: Repeated barrier for same checkpoint on input " + channelIndex);
   }
}

The trigger logic lives in notifyCheckpoint(). toNotifyOnCheckpoint is the StreamTask (registered via registerCheckpointEventHandler above), so in the end StreamTask.triggerCheckpointOnBarrier() is invoked, and from there the flow is the same as the source task checkpoint logic described earlier.

private void notifyCheckpoint(CheckpointBarrier checkpointBarrier) throws Exception {
   if (toNotifyOnCheckpoint != null) {
      CheckpointMetaData checkpointMetaData =
            new CheckpointMetaData(checkpointBarrier.getId(), checkpointBarrier.getTimestamp());

      long bytesBuffered = currentBuffered != null ? currentBuffered.size() : 0L;

      CheckpointMetrics checkpointMetrics = new CheckpointMetrics()
            .setBytesBufferedInAlignment(bytesBuffered)
            .setAlignmentDurationNanos(latestAlignmentDurationNanos);

      toNotifyOnCheckpoint.triggerCheckpointOnBarrier(
         checkpointMetaData,
         checkpointBarrier.getCheckpointOptions(),
         checkpointMetrics);
   }
}

//StreamTask
public void triggerCheckpointOnBarrier(
      CheckpointMetaData checkpointMetaData,
      CheckpointOptions checkpointOptions,
      CheckpointMetrics checkpointMetrics) throws Exception {

   try {
      performCheckpoint(checkpointMetaData, checkpointOptions, checkpointMetrics);
   }
   catch (CancelTaskException e) {
      ...
   }
}

2.2.2 Checkpoint handling under AT_LEAST_ONCE

//BarrierTracker
public BufferOrEvent getNextNonBlocked() throws Exception {
   while (true) {
      Optional<BufferOrEvent> next = inputGate.getNextBufferOrEvent();
      if (!next.isPresent()) {
         // buffer or input exhausted
         return null;
      }

      BufferOrEvent bufferOrEvent = next.get();
      if (bufferOrEvent.isBuffer()) {
         return bufferOrEvent;
      }
      else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
         processBarrier((CheckpointBarrier) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
      }
      else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
         processCheckpointAbortBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent(), bufferOrEvent.getChannelIndex());
      }
      else {
         // some other event
         return bufferOrEvent;
      }
   }
}

The core logic is again in processBarrier():
1. If there is only one input channel, i.e. a single upstream task, trigger the checkpoint immediately.
2. With multiple channels, i.e. multiple upstream tasks:
1) On the first barrier of a checkpoint, just record the new checkpoint ID.
2) On a barrier that is neither the first nor the last, do essentially nothing beyond incrementing a counter.
3) Once barriers from all channels have arrived, trigger the checkpoint.
Because BarrierTracker keeps processing data from channels whose barrier has already arrived while it waits for the remaining barriers, the snapshot includes the state produced by those extra records, so it is not a consistent cut. On recovery from such a checkpoint those records are replayed and processed a second time, which is exactly why the guarantee is only AT_LEAST_ONCE.
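To make the duplication concrete, here is a minimal sketch (plain Java, not Flink code) of a two-channel counter where one post-barrier record is processed before the snapshot and then replayed after a restore; the record names are hypothetical:

public class AtLeastOnceDemo {
   public static void main(String[] args) {
      int count = 0;
      count++; // record x1 (pre-barrier, channel 1)
      count++; // record x2 (pre-barrier, channel 2)
      // channel 1's barrier has arrived, but BarrierTracker keeps processing its data
      count++; // record x3 (post-barrier on channel 1, processed while waiting)

      // channel 2's barrier arrives -> snapshot is taken, x3's effect is included
      int snapshot = count; // 3

      // failure + restore: state = snapshot, the source rewinds to the barrier position
      int restored = snapshot;
      restored++; // record x3 is replayed and counted a second time
      System.out.println(restored); // 4, although only 3 distinct records existed
   }
}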

//BarrierTracker
private void processBarrier(CheckpointBarrier receivedBarrier, int channelIndex) throws Exception {
   final long barrierId = receivedBarrier.getId();

   // fast path for single channel trackers
   if (totalNumberOfInputChannels == 1) {
      notifyCheckpoint(barrierId, receivedBarrier.getTimestamp(), receivedBarrier.getCheckpointOptions());
      return;
   }

   // general path for multiple input channels
   if (LOG.isDebugEnabled()) {
      LOG.debug("Received barrier for checkpoint {} from channel {}", barrierId, channelIndex);
   }

   // find the checkpoint barrier in the queue of pending barriers
   CheckpointBarrierCount cbc = null;
   int pos = 0;

   for (CheckpointBarrierCount next : pendingCheckpoints) {
      if (next.checkpointId == barrierId) {
         cbc = next;
         break;
      }
      pos++;
   }

   if (cbc != null) {
      // add one to the count to that barrier and check for completion
      int numBarriersNew = cbc.incrementBarrierCount();
      // once barriers from all channels have arrived, trigger the checkpoint
      if (numBarriersNew == totalNumberOfInputChannels) {
         // checkpoint can be triggered (or is aborted and all barriers have been seen)
         // first, remove this checkpoint and all prior pending
         // checkpoints (which are now subsumed)
         for (int i = 0; i <= pos; i++) {
            pendingCheckpoints.pollFirst();
         }

         // notify the listener
         if (!cbc.isAborted()) {
            if (LOG.isDebugEnabled()) {
               LOG.debug("Received all barriers for checkpoint {}", barrierId);
            }

            notifyCheckpoint(receivedBarrier.getId(), receivedBarrier.getTimestamp(), receivedBarrier.getCheckpointOptions());
         }
      }
   }
   else {
      // first barrier for that checkpoint ID
      // add it only if it is newer than the latest checkpoint.
      // if it is not newer than the latest checkpoint ID, then there cannot be a
      // successful checkpoint for that ID anyways
      if (barrierId > latestPendingCheckpointID) {
         latestPendingCheckpointID = barrierId;
         pendingCheckpoints.addLast(new CheckpointBarrierCount(barrierId));

         // make sure we do not track too many checkpoints
         if (pendingCheckpoints.size() > MAX_CHECKPOINTS_TO_TRACK) {
            pendingCheckpoints.pollFirst();
         }
      }
   }
}

notifyCheckpoint() here also ends up calling StreamTask.triggerCheckpointOnBarrier(), just like BarrierBuffer.

//BarrierTracker
private void notifyCheckpoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions) throws Exception {
   if (toNotifyOnCheckpoint != null) {
      CheckpointMetaData checkpointMetaData = new CheckpointMetaData(checkpointId, timestamp);
      CheckpointMetrics checkpointMetrics = new CheckpointMetrics()
         .setBytesBufferedInAlignment(0L)
         .setAlignmentDurationNanos(0L);

      toNotifyOnCheckpoint.triggerCheckpointOnBarrier(checkpointMetaData, checkpointOptions, checkpointMetrics);
   }
}

This flow repeats all the way down to the sink tasks; once the sink tasks have completed their checkpoint, the whole checkpoint is done.

2.4 Checkpoint state data storage

The snapshot of both operatorState and keyedState is wrapped in a RunnableFuture: it does not execute immediately. An asynchronous thread later calls RunnableFuture.run(), and only then is the snapshot actually taken.
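A minimal sketch (plain Java, not Flink code) of this deferred-execution pattern; in Flink the role of the extra thread is played by the task's asynchronous checkpoint runner:

import java.util.concurrent.FutureTask;
import java.util.concurrent.RunnableFuture;

public class DeferredSnapshotDemo {
   public static void main(String[] args) throws Exception {
      // constructing the future does NOT take the snapshot yet
      RunnableFuture<String> snapshotFuture = new FutureTask<>(() -> {
         // the expensive persistence work would happen here
         return "snapshot-handle";
      });

      // ... later, an asynchronous thread executes it
      new Thread(snapshotFuture).start();

      // blocks until the snapshot has actually been written
      System.out.println(snapshotFuture.get());
   }
}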

//AbstractStreamOperator
public final OperatorSnapshotFutures snapshotState(long checkpointId, long timestamp, CheckpointOptions checkpointOptions,
      CheckpointStreamFactory factory) throws Exception {

   KeyGroupRange keyGroupRange = null != keyedStateBackend ?
         keyedStateBackend.getKeyGroupRange() : KeyGroupRange.EMPTY_KEY_GROUP_RANGE;

   OperatorSnapshotFutures snapshotInProgress = new OperatorSnapshotFutures();

   try (StateSnapshotContextSynchronousImpl snapshotContext = new StateSnapshotContextSynchronousImpl(
         checkpointId,
         timestamp,
         factory,
         keyGroupRange,
         getContainingTask().getCancelables())) {

      snapshotState(snapshotContext);
    
      // keyedStateRawFuture and operatorStateRawFuture are DoneFutures, i.e. already-completed RunnableFutures
      snapshotInProgress.setKeyedStateRawFuture(snapshotContext.getKeyedStateStreamFuture());
      snapshotInProgress.setOperatorStateRawFuture(snapshotContext.getOperatorStateStreamFuture());
    
      // the operatorState/keyedState snapshot is wrapped in a RunnableFuture and does not run immediately;
      // it only executes when RunnableFuture.run() is called later
      if (null != operatorStateBackend) {
         snapshotInProgress.setOperatorStateManagedFuture(
            operatorStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
      }

      if (null != keyedStateBackend) {
         snapshotInProgress.setKeyedStateManagedFuture(
            keyedStateBackend.snapshot(checkpointId, timestamp, factory, checkpointOptions));
      }
   } catch (Exception snapshotException) {
      ...
   }

   return snapshotInProgress;
}
//HeapKeyedStateBackend
public RunnableFuture<SnapshotResult<KeyedStateHandle>> snapshot(
   final long checkpointId,
   final long timestamp,
   @Nonnull final CheckpointStreamFactory streamFactory,
   @Nonnull CheckpointOptions checkpointOptions) throws IOException {

   long startTime = System.currentTimeMillis();

   final RunnableFuture<SnapshotResult<KeyedStateHandle>> snapshotRunner =
      snapshotStrategy.snapshot(checkpointId, timestamp, streamFactory, checkpointOptions);

   snapshotStrategy.logSyncCompleted(streamFactory, startTime);
   return snapshotRunner;
}

2.4.1 KeyedState snapshot process

Let's start with the KeyedState snapshot. The entry point is snapshotStrategy.snapshot(), which contains the core checkpoint logic; here snapshotStrategy is HeapSnapshotStrategy. The main steps are:
1. Take a snapshot copy of the state data. Only the references to the state entries are copied, not the object instances themselves.
2. Create the checkpoint output stream, CheckpointStateOutputStream. Its target is the checkpoint data directory, though not yet the final data file: the directory has the form ${checkpoint_dir}/${uid}/chk-${n}.
3. The two steps above run synchronously. Then an AsyncSnapshotCallable is created whose callInternal() method performs the actual persistence and can run asynchronously. It breaks down into:
1) Obtain the CheckpointStateOutputStream and write the state metadata: state names, serializer
information, and so on.
2) Group the state data by keyGroupId and write the data of each keyGroupId in turn.
3) Wrap the checkpoint metadata: the StreamStateHandle holds the storage path and size of the
state data, and KeyGroupRangeOffsets holds the offset at which each keyGroupId is stored in the file.

//HeapSnapshotStrategy
public RunnableFuture<SnapshotResult<KeyedStateHandle>> snapshot(
   long checkpointId,
   long timestamp,
   @Nonnull CheckpointStreamFactory primaryStreamFactory,
   @Nonnull CheckpointOptions checkpointOptions) throws IOException {

   if (!hasRegisteredState()) {
      return DoneFuture.of(SnapshotResult.empty());
   }

   int numStates = registeredKVStates.size() + registeredPQStates.size();

   Preconditions.checkState(numStates <= Short.MAX_VALUE,
      "Too many states: " + numStates +
         ". Currently at most " + Short.MAX_VALUE + " states are supported");

   final List<StateMetaInfoSnapshot> metaInfoSnapshots = new ArrayList<>(numStates);
   final Map<StateUID, Integer> stateNamesToId =
      new HashMap<>(numStates);
   final Map<StateUID, StateSnapshot> cowStateStableSnapshots =
      new HashMap<>(numStates);
    // snapshot-copy the state data into cowStateStableSnapshots
   processSnapshotMetaInfoForAllStates(
      metaInfoSnapshots,
      cowStateStableSnapshots,
      stateNamesToId,
      registeredKVStates,
      StateMetaInfoSnapshot.BackendStateType.KEY_VALUE);

   processSnapshotMetaInfoForAllStates(
      metaInfoSnapshots,
      cowStateStableSnapshots,
      stateNamesToId,
      registeredPQStates, // registeredPQStates does not affect our keyed state data
      StateMetaInfoSnapshot.BackendStateType.PRIORITY_QUEUE);

   final KeyedBackendSerializationProxy<K> serializationProxy =
      new KeyedBackendSerializationProxy<>(
         // TODO: this code assumes that writing a serializer is threadsafe, we should support to
         // get a serialized form already at state registration time in the future
         getKeySerializer(),
         metaInfoSnapshots,
         !Objects.equals(UncompressedStreamCompressionDecorator.INSTANCE, keyGroupCompressionDecorator));
    // create the checkpoint output stream, CheckpointStateOutputStream
   final SupplierWithException<CheckpointStreamWithResultProvider, Exception> checkpointStreamSupplier =

      localRecoveryConfig.isLocalRecoveryEnabled() ?

         () -> CheckpointStreamWithResultProvider.createDuplicatingStream(
            checkpointId,
            CheckpointedStateScope.EXCLUSIVE,
            primaryStreamFactory,
            localRecoveryConfig.getLocalStateDirectoryProvider()) :

         () -> CheckpointStreamWithResultProvider.createSimpleStream(
            CheckpointedStateScope.EXCLUSIVE,
            primaryStreamFactory);

   //--------------------------------------------------- this becomes the end of sync part

   final AsyncSnapshotCallable<SnapshotResult<KeyedStateHandle>> asyncSnapshotCallable =
      new AsyncSnapshotCallable<SnapshotResult<KeyedStateHandle>>() {
         @Override
         protected SnapshotResult<KeyedStateHandle> callInternal() throws Exception {

            final CheckpointStreamWithResultProvider streamWithResultProvider =
               checkpointStreamSupplier.get();

            snapshotCloseableRegistry.registerCloseable(streamWithResultProvider);

            final CheckpointStreamFactory.CheckpointStateOutputStream localStream =
               streamWithResultProvider.getCheckpointOutputStream();
            
            final DataOutputViewStreamWrapper outView = new DataOutputViewStreamWrapper(localStream);
            // write the state metadata: state names, serializer information, etc.
            serializationProxy.write(outView);

            final long[] keyGroupRangeOffsets = new long[keyGroupRange.getNumberOfKeyGroups()];
            // group the state data by keyGroupId and write each key group's data in turn
            for (int keyGroupPos = 0; keyGroupPos < keyGroupRange.getNumberOfKeyGroups(); ++keyGroupPos) {
               int keyGroupId = keyGroupRange.getKeyGroupId(keyGroupPos);
               // record the storage offset of each keyGroupId's state data
               keyGroupRangeOffsets[keyGroupPos] = localStream.getPos();
               outView.writeInt(keyGroupId);

               for (Map.Entry<StateUID, StateSnapshot> stateSnapshot :
                  cowStateStableSnapshots.entrySet()) {
                  StateSnapshot.StateKeyGroupWriter partitionedSnapshot =
                     // the grouping happens in getKeyGroupWriter()
                     stateSnapshot.getValue().getKeyGroupWriter();
                  try (
                     OutputStream kgCompressionOut =
                        keyGroupCompressionDecorator.decorateWithCompression(localStream)) {
                     DataOutputViewStreamWrapper kgCompressionView =
                        new DataOutputViewStreamWrapper(kgCompressionOut);
                     kgCompressionView.writeShort(stateNamesToId.get(stateSnapshot.getKey()));
                     // write the state data of this keyGroupId
                     partitionedSnapshot.writeStateInKeyGroup(kgCompressionView, keyGroupId);
                  } // this will just close the outer compression stream
               }
            }
            // wrap the checkpoint metadata of the state data
            if (snapshotCloseableRegistry.unregisterCloseable(streamWithResultProvider)) {
               KeyGroupRangeOffsets kgOffs = new KeyGroupRangeOffsets(keyGroupRange, keyGroupRangeOffsets);
               SnapshotResult<StreamStateHandle> result =
                  streamWithResultProvider.closeAndFinalizeCheckpointStreamResult();
               return CheckpointStreamWithResultProvider.toKeyedStateHandleSnapshotResult(result, kgOffs);
            } else {
               throw new IOException("Stream already unregistered.");
            }
         }

         @Override
         protected void cleanupProvidedResources() {
            for (StateSnapshot tableSnapshot : cowStateStableSnapshots.values()) {
               tableSnapshot.release();
            }
         }

         @Override
         protected void logAsyncSnapshotComplete(long startTime) {
            if (snapshotStrategySynchronicityTrait.isAsynchronous()) {
               logAsyncCompleted(primaryStreamFactory, startTime);
            }
         }
      };

   final FutureTask<SnapshotResult<KeyedStateHandle>> task =
      asyncSnapshotCallable.toAsyncSnapshotFutureTask(cancelStreamRegistry);
   finalizeSnapshotBeforeReturnHook(task);

   return task;
}

Snapshot copy of the state data
The copy starts in processSnapshotMetaInfoForAllStates(). registeredStates is the map of state tables we registered: the key is the state name (for instance a value state registered as "sumValueState"), and the value is the state table itself, by default a CopyOnWriteStateTable.
For each state table a snapshot is created and put into cowStateStableSnapshots. The core of the snapshot is copying the state data from the CopyOnWriteStateTable into an array, snapshotData, inside a CopyOnWriteStateTableSnapshot. Note that only references are copied, not object instances, so this step is fast and does not use much extra memory. CopyOnWriteStateTable uses copy-on-write: a state object's value is never mutated in place; instead a new object replaces the old one. So even if the snapshot and record processing run concurrently, the snapshot keeps pointing at the old objects, whose contents never change.
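Before diving into the source below, here is a minimal sketch (plain Java, not Flink code) of the copy-on-write idea, with a hypothetical Entry type standing in for StateMapEntry:

import java.util.Arrays;

public class CowDemo {
   static final class Entry {
      final String key;
      final long value;
      Entry(String key, long value) { this.key = key; this.value = value; }
   }

   public static void main(String[] args) {
      Entry[] table = { new Entry("a", 1L) };

      // "snapshot": copy only the references, no deep copy of the entries
      Entry[] snapshot = Arrays.copyOf(table, table.length);

      // concurrent update: never mutate the shared entry, install a new one instead
      table[0] = new Entry("a", 2L);

      System.out.println(snapshot[0].value); // 1 -- the snapshot is unaffected
      System.out.println(table[0].value);    // 2 -- processing sees the new value
   }
}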

private void processSnapshotMetaInfoForAllStates(
   List<StateMetaInfoSnapshot> metaInfoSnapshots,
   Map<StateUID, StateSnapshot> cowStateStableSnapshots,
   Map<StateUID, Integer> stateNamesToId,
   Map<String, ? extends StateSnapshotRestore> registeredStates,
   StateMetaInfoSnapshot.BackendStateType stateType) {

   for (Map.Entry<String, ? extends StateSnapshotRestore> kvState : registeredStates.entrySet()) {
      final StateUID stateUid = StateUID.of(kvState.getKey(), stateType);
      stateNamesToId.put(stateUid, stateNamesToId.size());
      StateSnapshotRestore state = kvState.getValue();
      if (null != state) {
         final StateSnapshot stateSnapshot = state.stateSnapshot();
         metaInfoSnapshots.add(stateSnapshot.getMetaInfoSnapshot());
         cowStateStableSnapshots.put(stateUid, stateSnapshot);
      }
   }
}
//CopyOnWriteStateTable
public CopyOnWriteStateTableSnapshot<K, N, S> stateSnapshot() {
   return new CopyOnWriteStateTableSnapshot<>(this);
}

//CopyOnWriteStateTableSnapshot
CopyOnWriteStateTableSnapshot(CopyOnWriteStateTable<K, N, S> owningStateTable) {

   super(owningStateTable);
   // copy the state data held in the CopyOnWriteStateTable
   this.snapshotData = owningStateTable.snapshotTableArrays();
   this.snapshotVersion = owningStateTable.getStateTableVersion();
   this.numberOfEntriesInSnapshotData = owningStateTable.size();

   // We create duplicates of the serializers for the async snapshot, because TypeSerializer
   // might be stateful and shared with the event processing thread.
   this.localKeySerializer = owningStateTable.keyContext.getKeySerializer().duplicate();
   this.localNamespaceSerializer = owningStateTable.metaInfo.getNamespaceSerializer().duplicate();
   this.localStateSerializer = owningStateTable.metaInfo.getStateSerializer().duplicate();

   this.partitionedStateTableSnapshot = null;

   this.stateSnapshotTransformer = owningStateTable.metaInfo.
      getStateSnapshotTransformFactory().createForDeserializedState().orElse(null);
}

//CopyOnWriteStateTable
StateTableEntry<K, N, S>[] snapshotTableArrays() {

   synchronized (snapshotVersions) {

      // increase the table version for copy-on-write and register the snapshot
      if (++stateTableVersion < 0) {
         // this is just a safety net against overflows, but should never happen in practice (i.e., only after 2^31 snapshots)
         throw new IllegalStateException("Version count overflow in CopyOnWriteStateTable. Enforcing restart.");
      }

      highestRequiredSnapshotVersion = stateTableVersion;
      snapshotVersions.add(highestRequiredSnapshotVersion);
   }

   StateTableEntry<K, N, S>[] table = primaryTable;

   final int totalTableIndexSize = rehashIndex + table.length;
   final int copiedArraySize = Math.max(totalTableIndexSize, size());
   final StateTableEntry<K, N, S>[] copy = new StateTableEntry[copiedArraySize];

   if (isRehashing()) {
      // consider both tables for the snapshot, the rehash index tells us which part of the two tables we need
      final int localRehashIndex = rehashIndex;
      final int localCopyLength = table.length - localRehashIndex;
      // for the primary table, take every index >= rhIdx.
      System.arraycopy(table, localRehashIndex, copy, 0, localCopyLength);

      // for the new table, we are sure that two regions contain all the entries:
      // [0, rhIdx[ AND [table.length / 2, table.length / 2 + rhIdx[
      table = incrementalRehashTable;
      System.arraycopy(table, 0, copy, localCopyLength, localRehashIndex);
      System.arraycopy(table, table.length >>> 1, copy, localCopyLength + localRehashIndex, localRehashIndex);
   } else {
      // we only need to copy the primary table
      System.arraycopy(table, 0, copy, 0, table.length);
   }

   return copy;
}

2.4.2 Creating the CheckpointStateOutputStream

In production we usually configure file-system-based checkpoint storage, i.e. FsStateBackend, rather than the in-memory MemoryStateBackend. With FsStateBackend, the CheckpointStateOutputStream that gets created is an FsCheckpointStateOutputStream.
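Before looking at how the stream is created, a minimal sketch of the corresponding user-side configuration, assuming the legacy (pre-Flink-1.13) state backend API this article is based on; the checkpoint path is a placeholder:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FsStateBackendDemo {
   public static void main(String[] args) {
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      // checkpoints are written to a file system (HDFS, local, ...) instead of the JobManager heap
      env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
   }
}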

static CheckpointStreamWithResultProvider createSimpleStream(
   @Nonnull CheckpointedStateScope checkpointedStateScope,
   @Nonnull CheckpointStreamFactory primaryStreamFactory) throws IOException {

   CheckpointStreamFactory.CheckpointStateOutputStream primaryOut =
      primaryStreamFactory.createCheckpointStateOutputStream(checkpointedStateScope);

   return new CheckpointStreamWithResultProvider.PrimaryStreamOnly(primaryOut);
}


public FsCheckpointStateOutputStream createCheckpointStateOutputStream(CheckpointedStateScope scope) throws IOException {
   Path target = scope == CheckpointedStateScope.EXCLUSIVE ? checkpointDirectory : sharedStateDirectory;
   int bufferSize = Math.max(DEFAULT_WRITE_BUFFER_SIZE, fileStateThreshold);

   // create the FsCheckpointStateOutputStream
   return new FsCheckpointStateOutputStream(target, filesystem, bufferSize, fileStateThreshold);
}

FsCheckpointStateOutputStream takes several constructor arguments:
basePath: the checkpoint base path, of the form ${checkpoint_dir}/${uid}/chk-${checkpoint_id}
fs: Flink's own FileSystem abstraction; HadoopFileSystem for HDFS, LocalFileSystem for local files
bufferSize: data is first written into a buffer before hitting the OutputStream; bufferSize controls the size of that buffer
fileStateThreshold: the state-size threshold. State larger than this is written to a file; smaller state is embedded directly into the StreamStateHandle, which avoids producing lots of small files. The default is 1024 bytes, i.e. 1 KB.
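A minimal sketch of raising that threshold, assuming the FsStateBackend constructor that takes the file-state-size threshold in bytes; path and value are placeholders:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;

public class FileStateThresholdDemo {
   public static void main(String[] args) {
      // state smaller than 4096 bytes is returned inline as a ByteStreamStateHandle
      // instead of being written to a separate (tiny) checkpoint file
      FsStateBackend backend = new FsStateBackend("hdfs:///flink/checkpoints", 4096);
      System.out.println(backend);
   }
}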

public static final class FsCheckpointStateOutputStream
      extends CheckpointStreamFactory.CheckpointStateOutputStream {

   private final byte[] writeBuffer;

   private int pos;

   private FSDataOutputStream outStream;

   private final int localStateThreshold;

   private final Path basePath;

   private final FileSystem fs;

   private Path statePath;

   private volatile boolean closed;

   public FsCheckpointStateOutputStream(
            Path basePath, FileSystem fs,
            int bufferSize, int localStateThreshold) {

      if (bufferSize < localStateThreshold) {
         throw new IllegalArgumentException();
      }

      this.basePath = basePath;
      this.fs = fs;
      this.writeBuffer = new byte[bufferSize];
      this.localStateThreshold = localStateThreshold;
   }
   
   @Override
   public void write(int b) throws IOException {
     if (pos >= writeBuffer.length) {
        flush();
     }
    writeBuffer[pos++] = (byte) b;
   }
  ...

FsCheckpointStateOutputStream also has a few other members:
writeBuffer: a byte[] buffer; data is written into it first to reduce disk I/O
pos: the current write position within writeBuffer
outStream: the FSDataOutputStream wrapping the final checkpoint file, through which the state data is written to that file
statePath: the path of the state data file, of the form ${checkpoint_dir}/${uid}/chk-${checkpoint_id}/${uuid}

2.4.3 Writing the state metadata

serializationProxy.write(outView) writes the metadata of the state data: state names, serializer information, and so on.

//KeyedBackendSerializationProxy
public void write(DataOutputView out) throws IOException {
   super.write(out);

   // write the compression format used to write each key-group
   out.writeBoolean(usingKeyGroupCompression);

   TypeSerializerSnapshotSerializationUtil.writeSerializerSnapshot(out, keySerializerSnapshot, keySerializer);

   // write individual registered keyed state metainfos
   out.writeShort(stateMetaInfoSnapshots.size());
   for (StateMetaInfoSnapshot metaInfoSnapshot : stateMetaInfoSnapshots) {
      StateMetaInfoSnapshotReadersWriters.getWriter().writeStateMetaInfoSnapshot(metaInfoSnapshot, out);
   }
}

2.4.4 Partitioning the state data by key group

The grouping is triggered from CopyOnWriteStateTableSnapshot.getKeyGroupWriter(), which partitions the state data by keyGroupId; the algorithm is a little involved.

//CopyOnWriteStateTableSnapshot
public StateKeyGroupWriter getKeyGroupWriter() {
   if (partitionedStateTableSnapshot == null) {
      final InternalKeyContext<K> keyContext = owningStateTable.keyContext;
      final int numberOfKeyGroups = keyContext.getNumberOfKeyGroups();
      final KeyGroupRange keyGroupRange = keyContext.getKeyGroupRange();
      // the function that persists a single state entry
      ElementWriterFunction<CopyOnWriteStateTable.StateTableEntry<K, N, S>> elementWriterFunction =
         (element, dov) -> {
            localNamespaceSerializer.serialize(element.namespace, dov);
            localKeySerializer.serialize(element.key, dov);
            localStateSerializer.serialize(element.state, dov);
         };
      // StateTableKeyGroupPartitioner is used by default
      StateTableKeyGroupPartitioner<K, N, S> stateTableKeyGroupPartitioner = stateSnapshotTransformer != null ?
         new TransformingStateTableKeyGroupPartitioner<>(
            snapshotData,
            numberOfEntriesInSnapshotData,
            keyGroupRange,
            numberOfKeyGroups,
            elementWriterFunction,
            stateSnapshotTransformer) :
         new StateTableKeyGroupPartitioner<>(
            snapshotData,
            numberOfEntriesInSnapshotData,
            keyGroupRange,
            numberOfKeyGroups,
            elementWriterFunction);
      partitionedStateTableSnapshot = stateTableKeyGroupPartitioner.partitionByKeyGroup();
   }
   return partitionedStateTableSnapshot;
}

First, the structure of StateTableKeyGroupPartitioner: it extends KeyGroupPartitioner and adds no new members:

public class KeyGroupPartitioner<T> {

   /**
    * The input data for the partitioning. All elements to consider must be densely in the index interval
    * [0, {@link #numberOfElements}[, without null values.
    */
   @Nonnull
   // the raw snapshot state data
   protected final T[] partitioningSource; 

   /**
    * The output array for the partitioning. The size must be {@link #numberOfElements} (or bigger).
    */
   @Nonnull
   // the state data after partitioning
   protected final T[] partitioningDestination;

   /** Total number of input elements. */
   @Nonnegative
   // total number of state entries
   protected final int numberOfElements;

   /** The total number of key-groups in the job. */
   @Nonnegative
   // the total number of key groups in the job (i.e. the max parallelism)
   protected final int totalKeyGroups;

   /** The key-group range for the input data, covered in this partitioning. */
   @Nonnull
   // this operator's KeyGroupRange, with a first and a last keyGroupId
   protected final KeyGroupRange keyGroupRange;

   /**
    * This bookkeeping array is used to count the elements in each key-group. In a second step, it is transformed into
    * a histogram by accumulation.
    */
   @Nonnull
   // histogram marking where each keyGroupId's data goes in the final state array
   protected final int[] counterHistogram;

   /**
    * This is a helper array that caches the key-group for each element, so we do not have to compute them twice.
    */
   @Nonnull
   // caches, for each element of partitioningSource, the index of its keyGroupId within counterHistogram
   protected final int[] elementKeyGroups;

   /** Cached value of keyGroupRange#firstKeyGroup. */
   @Nonnegative
   // this operator's first keyGroupId, i.e. keyGroupRange#firstKeyGroup
   protected final int firstKeyGroup;

   /** Function to extract the key from a given element. */
   @Nonnull
   protected final KeyExtractorFunction<T> keyExtractorFunction;

   /** Function to write an element to a {@link DataOutputView}. */
   @Nonnull
   protected final ElementWriterFunction<T> elementWriterFunction;

   /** Cached result. */
   @Nullable
   protected StateSnapshot.StateKeyGroupWriter computedResult;

The partitioning algorithm is built on these data structures.

Now let's look at KeyGroupPartitioner.partitionByKeyGroup(), which runs in three steps:
1. Record the keyGroupId of every state entry.
2. Build a histogram over the keyGroupIds, which drives the partitioning.
3. Execute the partitioning.

//KeyGroupPartitioner
public StateSnapshot.StateKeyGroupWriter partitionByKeyGroup() {
   if (computedResult == null) {
      reportAllElementKeyGroups();
      int outputNumberOfElements = buildHistogramByAccumulatingCounts();
      executePartitioning(outputNumberOfElements);
   }
   return computedResult;
}

Recording each entry's keyGroupId
This step flattens the raw snapshot state data into partitioningSource[]. elementKeyGroups[] records, for each element of partitioningSource[], the position of its keyGroupId within counterHistogram[], and counterHistogram[] counts how many elements each keyGroupId has.

//StateTableKeyGroupPartitioner
protected void reportAllElementKeyGroups() {
   // In this step we i) 'flatten' the linked list of entries to a second array and ii) report key-groups.
   int flattenIndex = 0;
   for (CopyOnWriteStateTable.StateTableEntry<K, N, S> entry : partitioningDestination) {
      while (null != entry) {
         flattenIndex = tryAddToSource(flattenIndex, entry);
         entry = entry.next;
      }
   }
}

/** Tries to append next entry to {@code partitioningSource} array snapshot and returns next index.*/
int tryAddToSource(int currentIndex, CopyOnWriteStateTable.StateTableEntry<K, N, S> entry) {
   final int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(entry.key, totalKeyGroups);
   reportKeyGroupOfElementAtIndex(currentIndex, keyGroup);
   partitioningSource[currentIndex] = entry;
   return currentIndex + 1;
}

//KeyGroupPartitioner
protected void reportKeyGroupOfElementAtIndex(int index, int keyGroup) {
   final int keyGroupIndex = keyGroup - firstKeyGroup;
   elementKeyGroups[index] = keyGroupIndex;
   ++counterHistogram[keyGroupIndex];
}

Building the histogram
counterHistogram[] is then turned into a prefix-sum histogram, e.g. [0,0,1,1,1,3,3,6,8]. The index corresponds to the keyGroupId (relative to the first one), and the value is the total number of elements in all key groups before it, i.e. the write start offset for that key group.

//KeyGroupPartitioner
private int buildHistogramByAccumulatingCounts() {
   int sum = 0;
   for (int i = 0; i < counterHistogram.length; ++i) {
      int currentSlotValue = counterHistogram[i];
      counterHistogram[i] = sum;
      sum += currentSlotValue;
   }
   return sum;
}

Executing the partitioning
Using counterHistogram[] and elementKeyGroups[] built above, each element is scattered into partitioningDestination[], which ends up sorted by keyGroupId. After the scatter, counterHistogram[] holds each keyGroupId's end offset within partitioningDestination[], serving as an index (a toy end-to-end run of the three steps is sketched after the code below).

//KeyGroupPartitioner
private void executePartitioning(int outputNumberOfElements) {

   // We repartition the entries by their pre-computed key-groups, using the histogram values as write indexes
   for (int inIdx = 0; inIdx < outputNumberOfElements; ++inIdx) {
      int effectiveKgIdx = elementKeyGroups[inIdx];
      int outIdx = counterHistogram[effectiveKgIdx]++;
      partitioningDestination[outIdx] = partitioningSource[inIdx];
   }

   this.computedResult = new PartitioningResult<>(
      elementWriterFunction,
      firstKeyGroup,
      counterHistogram,
      partitioningDestination);
}
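A minimal, self-contained sketch (plain Java, not Flink code) of the three steps on toy data, so the histogram trick is easy to follow:

import java.util.Arrays;

public class KeyGroupPartitionDemo {
   public static void main(String[] args) {
      // step 1: five flattened entries and the key group of each (3 key groups in total)
      String[] source = {"a", "b", "c", "d", "e"};
      int[] elementKeyGroups = {2, 0, 1, 0, 2};
      int totalKeyGroups = 3;

      // step 2: per-group counts, then exclusive prefix sums = write start offsets
      int[] histogram = new int[totalKeyGroups];
      for (int kg : elementKeyGroups) {
         histogram[kg]++;
      }
      int sum = 0;
      for (int i = 0; i < histogram.length; i++) {
         int count = histogram[i];
         histogram[i] = sum;
         sum += count;
      }

      // step 3: scatter each element into its key group's next free slot
      String[] destination = new String[source.length];
      for (int i = 0; i < source.length; i++) {
         int outIdx = histogram[elementKeyGroups[i]]++;
         destination[outIdx] = source[i];
      }

      System.out.println(Arrays.toString(destination)); // [b, d, c, a, e]
      System.out.println(Arrays.toString(histogram));   // [2, 3, 5] -- exclusive end offsets
   }
}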

Let's look at the PartitioningResult structure:
partitionedElements: the state data grouped by keyGroupId, i.e. the partitioningDestination array above
keyGroupOffsets: the index positions of all keyGroupIds within partitionedElements[]
firstKeyGroup: the first keyGroupId of this operator. Operators do not all start at key group 0: with two operator instances and 128 key groups in total, the first operator starts at keyGroupId 0 and the second at keyGroupId 64, so for the second operator keyGroupId 65 maps to index 1 of keyGroupOffsets[].

private static class PartitioningResult<T> implements StateSnapshot.StateKeyGroupWriter {

   /**
    * Function to write one element to a {@link DataOutputView}.
    */
   @Nonnull
   private final ElementWriterFunction<T> elementWriterFunction;

   /**
    * The exclusive-end-offsets for all key-groups of the covered range for the partitioning. Exclusive-end-offset
    * for key-group n is under keyGroupOffsets[n - firstKeyGroup].
    */
   @Nonnull
   private final int[] keyGroupOffsets;

   /**
    * Array with elements that are partitioned w.r.t. the covered key-group range. The start offset for each
    * key-group is in {@link #keyGroupOffsets}.
    */
   @Nonnull
   private final T[] partitionedElements;

   /**
    * The first key-group of the range covered in the partitioning.
    */
   @Nonnegative
   private final int firstKeyGroup;

   PartitioningResult(
      @Nonnull ElementWriterFunction<T> elementWriterFunction,
      @Nonnegative int firstKeyGroup,
      @Nonnull int[] keyGroupEndOffsets,
      @Nonnull T[] partitionedElements) {
      this.elementWriterFunction = elementWriterFunction;
      this.firstKeyGroup = firstKeyGroup;
      this.keyGroupOffsets = keyGroupEndOffsets;
      this.partitionedElements = partitionedElements;
   }
  

2.4.5 Writing the state data

The state data is written by partitionedSnapshot.writeStateInKeyGroup(kgCompressionView, keyGroupId), once per keyGroupId.
As you can see, it uses keyGroupOffsets to find the range of the given keyGroupId within partitionedElements[] and writes those elements to the OutputStream.

//PartitioningResult
public void writeStateInKeyGroup(@Nonnull DataOutputView dov, int keyGroupId) throws IOException {

   int startOffset = getKeyGroupStartOffsetInclusive(keyGroupId);
   int endOffset = getKeyGroupEndOffsetExclusive(keyGroupId);

   // write number of mappings in key-group
   dov.writeInt(endOffset - startOffset);

   // write mappings
   for (int i = startOffset; i < endOffset; ++i) {
      elementWriterFunction.writeElement(partitionedElements[i], dov);
   }
}

private int getKeyGroupStartOffsetInclusive(int keyGroup) {
   int idx = keyGroup - firstKeyGroup - 1;
   return idx < 0 ? 0 : keyGroupOffsets[idx];
}

@Nonnegative
private int getKeyGroupEndOffsetExclusive(int keyGroup) {
   return keyGroupOffsets[keyGroup - firstKeyGroup];
}

The elementWriterFunction here is the lambda below: it persists each entry's namespace, key, and state.

(element, dov) -> {
            localNamespaceSerializer.serialize(element.namespace, dov);
            localKeySerializer.serialize(element.key, dov);
            localStateSerializer.serialize(element.state, dov);
         };

2.4.6 Wrapping the StreamStateHandle

Once the state data of all keyGroupIds has been written, the next step is to wrap up the checkpoint metadata.
keyGroupRangeOffsets records where each keyGroupId's state data is stored on the storage medium.

//new AsyncSnapshotCallable#callInternal()
if (snapshotCloseableRegistry.unregisterCloseable(streamWithResultProvider)) {
   KeyGroupRangeOffsets kgOffs = new KeyGroupRangeOffsets(keyGroupRange, keyGroupRangeOffsets);
   SnapshotResult<StreamStateHandle> result =
      streamWithResultProvider.closeAndFinalizeCheckpointStreamResult();
   return CheckpointStreamWithResultProvider.toKeyedStateHandleSnapshotResult(result, kgOffs);
} else {
   throw new IOException("Stream already unregistered.");
}
//PrimaryStreamOnly (implements CheckpointStreamWithResultProvider)
public SnapshotResult<StreamStateHandle> closeAndFinalizeCheckpointStreamResult() throws IOException {
   return SnapshotResult.of(outputStream.closeAndGetHandle());
}

The core implementation is FsCheckpointStateOutputStream.closeAndGetHandle():
1. If the total amount of state written is below the threshold (1 KB by default), it is wrapped directly into a ByteStreamStateHandle: the state data lives inside the StreamStateHandle itself as a byte[] array.
2. If the state exceeds the threshold, it is written to a file via flush(). This is not necessarily the only flush: while writing the state earlier, flush() is also called whenever the data outgrows the buffer.
3. The first call to flush() creates the full checkpoint file path, of the form ${checkpoint_dir}/${uid}/chk-${checkpoint_id}/${uuid}, and writes the buffered data into that file.
4. The state data checkpointed to disk is described by a FileStateHandle, containing the state file and its size.

//FsCheckpointStateOutputStream
public StreamStateHandle closeAndGetHandle() throws IOException {
   // check if there was nothing ever written
   if (outStream == null && pos == 0) {
      return null;
   }

   synchronized (this) {
      if (!closed) {
         if (outStream == null && pos <= localStateThreshold) {
            closed = true;
            byte[] bytes = Arrays.copyOf(writeBuffer, pos);
            pos = writeBuffer.length;
            return new ByteStreamStateHandle(createStatePath().toString(), bytes);
         }
         else {
            try {
               flush();

               pos = writeBuffer.length;

               long size = -1L;

               // make a best effort attempt to figure out the size
               try {
                  size = outStream.getPos();
               } catch (Exception ignored) {}

               outStream.close();

               return new FileStateHandle(statePath, size);
            } catch (Exception exception) {
               ...
            } finally {
               closed = true;
            }
         }
      }
      else {
         throw new IOException("Stream has already been closed and discarded.");
      }
   }
}

public void flush() throws IOException {
   if (!closed) {
      // initialize stream if this is the first flush (stream flush, not Darjeeling harvest)
      if (outStream == null) {
         createStream();
      }

      // now flush
      if (pos > 0) {
         outStream.write(writeBuffer, 0, pos);
         pos = 0;
      }
   }
   else {
      throw new IOException("closed");
   }
}

private void createStream() throws IOException {
   Exception latestException = null;
   for (int attempt = 0; attempt < 10; attempt++) {
      try {
         OutputStreamAndPath streamAndPath = EntropyInjector.createEntropyAware(
               fs, createStatePath(), WriteMode.NO_OVERWRITE);
         this.outStream = streamAndPath.stream();
         this.statePath = streamAndPath.path();
         return;
      }
      catch (Exception e) {
         latestException = e;
      }
   }

   throw new IOException("Could not open output stream for state backend", latestException);
}

private Path createStatePath() {
   return new Path(basePath, UUID.randomUUID().toString());
}

The returned StreamStateHandle is wrapped further into a KeyedStateHandle: the StreamStateHandle only records the file path (and size) of the state data, while keyGroupRangeOffsets is still needed to locate each keyGroupId's data within the file. The KeyedStateHandle (here a KeyGroupsStateHandle) therefore contains both the StreamStateHandle and the keyGroupRangeOffsets.

static SnapshotResult<KeyedStateHandle> toKeyedStateHandleSnapshotResult(
   @Nonnull SnapshotResult<StreamStateHandle> snapshotResult,
   @Nonnull KeyGroupRangeOffsets keyGroupRangeOffsets) {

   StreamStateHandle jobManagerOwnedSnapshot = snapshotResult.getJobManagerOwnedSnapshot();

   if (jobManagerOwnedSnapshot != null) {

      KeyedStateHandle jmKeyedState = new KeyGroupsStateHandle(keyGroupRangeOffsets, jobManagerOwnedSnapshot);
      StreamStateHandle taskLocalSnapshot = snapshotResult.getTaskLocalSnapshot();

      if (taskLocalSnapshot != null) {

         KeyedStateHandle localKeyedState = new KeyGroupsStateHandle(keyGroupRangeOffsets, taskLocalSnapshot);
         return SnapshotResult.withLocalState(jmKeyedState, localKeyedState);
      } else {

         return SnapshotResult.of(jmKeyedState);
      }
   } else {

      return SnapshotResult.empty();
   }
}

With that, the KeyedState checkpoint process is complete.

2.4.7 OperatorState snapshot process

The OperatorState snapshot broadly mirrors the KeyedState one, but is simpler: there is no key-group partitioning, and OperatorState has only one type, PartitionableListState. The rough flow:
1. Copy the OperatorState and BroadcastState data; this part is synchronous. We skip BroadcastState here, since its snapshot process is essentially identical to OperatorState's.
2. Create an AsyncSnapshotCallable whose callInternal() performs the checkpoint:
1) Create the CheckpointStateOutputStream for writing the state data.
2) Write the state metadata (state names, serializer information, etc.), just as for KeyedState.
3) Write the state data into the CheckpointStateOutputStream.
4) Wrap the written metadata into an OperatorStateHandle.
Step 2 runs asynchronously by default but can be configured to run synchronously, at the cost of blocking regular record processing; the same switch exists for KeyedState snapshots (see the sketch below).
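A minimal sketch, assuming the FsStateBackend constructor with an asynchronousSnapshots flag; path and flag value are placeholders:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;

public class SynchronousSnapshotDemo {
   public static void main(String[] args) {
      // false forces snapshots to run synchronously, blocking record
      // processing until the state has been fully written
      FsStateBackend backend = new FsStateBackend("hdfs:///flink/checkpoints", false);
      System.out.println(backend);
   }
}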

//DefaultOperatorStateBackendSnapshotStrategy
@Override
public RunnableFuture<SnapshotResult<OperatorStateHandle>> snapshot(
   final long checkpointId,
   final long timestamp,
   @Nonnull final CheckpointStreamFactory streamFactory,
   @Nonnull final CheckpointOptions checkpointOptions) throws IOException {

   if (registeredOperatorStates.isEmpty() && registeredBroadcastStates.isEmpty()) {
      return DoneFuture.of(SnapshotResult.empty());
   }

   final Map<String, PartitionableListState<?>> registeredOperatorStatesDeepCopies =
      new HashMap<>(registeredOperatorStates.size());
   final Map<String, BackendWritableBroadcastState<?, ?>> registeredBroadcastStatesDeepCopies =
      new HashMap<>(registeredBroadcastStates.size());

   ClassLoader snapshotClassLoader = Thread.currentThread().getContextClassLoader();
   Thread.currentThread().setContextClassLoader(userClassLoader);
   try {
      // eagerly create deep copies of the list and the broadcast states (if any)
      // in the synchronous phase, so that we can use them in the async writing.
      // copy the state data
      if (!registeredOperatorStates.isEmpty()) {
         for (Map.Entry<String, PartitionableListState<?>> entry : registeredOperatorStates.entrySet()) {
            PartitionableListState<?> listState = entry.getValue();
            if (null != listState) {
               listState = listState.deepCopy();
            }
            registeredOperatorStatesDeepCopies.put(entry.getKey(), listState);
         }
      }

      if (!registeredBroadcastStates.isEmpty()) {
         for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry : registeredBroadcastStates.entrySet()) {
            BackendWritableBroadcastState<?, ?> broadcastState = entry.getValue();
            if (null != broadcastState) {
               broadcastState = broadcastState.deepCopy();
            }
            registeredBroadcastStatesDeepCopies.put(entry.getKey(), broadcastState);
         }
      }
   } finally {
      Thread.currentThread().setContextClassLoader(snapshotClassLoader);
   }

   AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>> snapshotCallable =
      new AsyncSnapshotCallable<SnapshotResult<OperatorStateHandle>>() {

         @Override
         protected SnapshotResult<OperatorStateHandle> callInternal() throws Exception {
             // create the CheckpointStateOutputStream
            CheckpointStreamFactory.CheckpointStateOutputStream localOut =
               streamFactory.createCheckpointStateOutputStream(CheckpointedStateScope.EXCLUSIVE);
            snapshotCloseableRegistry.registerCloseable(localOut);

            // get the registered operator state infos ...
            List<StateMetaInfoSnapshot> operatorMetaInfoSnapshots =
               new ArrayList<>(registeredOperatorStatesDeepCopies.size());

            for (Map.Entry<String, PartitionableListState<?>> entry :
               registeredOperatorStatesDeepCopies.entrySet()) {
               operatorMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
            }

            // ... get the registered broadcast operator state infos ...
            List<StateMetaInfoSnapshot> broadcastMetaInfoSnapshots =
               new ArrayList<>(registeredBroadcastStatesDeepCopies.size());

            for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
               registeredBroadcastStatesDeepCopies.entrySet()) {
               broadcastMetaInfoSnapshots.add(entry.getValue().getStateMetaInfo().snapshot());
            }

            // ... write them all in the checkpoint stream ...
            DataOutputView dov = new DataOutputViewStreamWrapper(localOut);

            OperatorBackendSerializationProxy backendSerializationProxy =
               new OperatorBackendSerializationProxy(operatorMetaInfoSnapshots, broadcastMetaInfoSnapshots);
             // write the state metadata into the checkpoint
            backendSerializationProxy.write(dov);

            // ... and then go for the states ...

            // we put BOTH normal and broadcast state metadata here
            int initialMapCapacity =
               registeredOperatorStatesDeepCopies.size() + registeredBroadcastStatesDeepCopies.size();
            final Map<String, OperatorStateHandle.StateMetaInfo> writtenStatesMetaData =
               new HashMap<>(initialMapCapacity);
             // write the state data
            for (Map.Entry<String, PartitionableListState<?>> entry :
               registeredOperatorStatesDeepCopies.entrySet()) {

               PartitionableListState<?> value = entry.getValue();
               long[] partitionOffsets = value.write(localOut);
               OperatorStateHandle.Mode mode = value.getStateMetaInfo().getAssignmentMode();
               writtenStatesMetaData.put(
                  entry.getKey(),
                  new OperatorStateHandle.StateMetaInfo(partitionOffsets, mode));
            }

            // ... and the broadcast states themselves ...
            for (Map.Entry<String, BackendWritableBroadcastState<?, ?>> entry :
               registeredBroadcastStatesDeepCopies.entrySet()) {

               BackendWritableBroadcastState<?, ?> value = entry.getValue();
               long[] partitionOffsets = {value.write(localOut)};
               OperatorStateHandle.Mode mode = value.getStateMetaInfo().getAssignmentMode();
               writtenStatesMetaData.put(
                  entry.getKey(),
                  new OperatorStateHandle.StateMetaInfo(partitionOffsets, mode));
            }

            // ... and, finally, create the state handle.
            OperatorStateHandle retValue = null;
             // wrap the checkpoint metadata
            if (snapshotCloseableRegistry.unregisterCloseable(localOut)) {

               StreamStateHandle stateHandle = localOut.closeAndGetHandle();

               if (stateHandle != null) {
                  retValue = new OperatorStreamStateHandle(writtenStatesMetaData, stateHandle);
               }

               return SnapshotResult.of(retValue);
            } else {
               throw new IOException("Stream was already unregistered.");
            }
         }

         @Override
         protected void cleanupProvidedResources() {
            // nothing to do
         }

         @Override
         protected void logAsyncSnapshotComplete(long startTime) {
            if (asynchronousSnapshots) {
               logAsyncCompleted(streamFactory, startTime);
            }
         }
      };

   final FutureTask<SnapshotResult<OperatorStateHandle>> task =
      snapshotCallable.toAsyncSnapshotFutureTask(closeStreamOnCancelRegistry);
   // for synchronous checkpoints, run the snapshot immediately
   if (!asynchronousSnapshots) {
      task.run();
   }

   return task;
}

State data copy
The copy happens in PartitionableListState.deepCopy(), which creates a new PartitionableListState and copies over the elements of the old one's internal list. Unlike KeyedState, what gets copied is not necessarily just references: if the list elements are immutable, their references are copied directly; if they are mutable, each element is copied into a new object, which costs more memory and runs more slowly.

//PartitionableListState
public PartitionableListState<S> deepCopy() {
   return new PartitionableListState<>(this);
}

private PartitionableListState(PartitionableListState<S> toCopy) {

   this(toCopy.stateMetaInfo.deepCopy(), toCopy.internalListCopySerializer.copy(toCopy.internalList));
}

//ArrayListSerializer
public ArrayList<T> copy(ArrayList<T> from) {
   if (elementSerializer.isImmutableType()) {
      // fast track using memcopy for immutable types
      return new ArrayList<>(from);
   } else {
      // element-wise deep copy for mutable types
      ArrayList<T> newList = new ArrayList<>(from.size());
      for (int i = 0; i < from.size(); i++) {
         newList.add(elementSerializer.copy(from.get(i)));
      }
      return newList;
   }
}

Creating the CheckpointStateOutputStream
Same as for KeyedState: an FsCheckpointStateOutputStream is created.

//FsCheckpointStreamFactory
public FsCheckpointStateOutputStream createCheckpointStateOutputStream(CheckpointedStateScope scope) throws IOException {
   Path target = scope == CheckpointedStateScope.EXCLUSIVE ? checkpointDirectory : sharedStateDirectory;
   int bufferSize = Math.max(DEFAULT_WRITE_BUFFER_SIZE, fileStateThreshold);

   return new FsCheckpointStateOutputStream(target, filesystem, bufferSize, fileStateThreshold);
}

Writing the state metadata
This too mirrors KeyedState: state names, serializer information, and so on.

//OperatorBackendSerializationProxy
public void write(DataOutputView out) throws IOException {
   super.write(out);
   writeStateMetaInfoSnapshots(operatorStateMetaInfoSnapshots, out);
   writeStateMetaInfoSnapshots(broadcastStateMetaInfoSnapshots, out);
}

private void writeStateMetaInfoSnapshots(
   List<StateMetaInfoSnapshot> snapshots,
   DataOutputView out) throws IOException {
   out.writeShort(snapshots.size());
   for (StateMetaInfoSnapshot state : snapshots) {
      StateMetaInfoSnapshotReadersWriters.getWriter().writeStateMetaInfoSnapshot(state, out);
   }
}

Writing the state data
The state is written in PartitionableListState.write(), which serializes the elements of internalList into the OutputStream one by one, choosing the serializer according to the element type. As before, data first goes into the buffer and is then flushed. partitionOffsets[] records the stream position at which each element was written, and the method returns it.

public long[] write(FSDataOutputStream out) throws IOException {

   long[] partitionOffsets = new long[internalList.size()];

   DataOutputView dov = new DataOutputViewStreamWrapper(out);

   for (int i = 0; i < internalList.size(); ++i) {
      S element = internalList.get(i);
      partitionOffsets[i] = out.getPos();
      getStateMetaInfo().getPartitionStateSerializer().serialize(element, dov);
   }

   return partitionOffsets;
}
Finally, a note on how parallelism relates to checkpoints. In Flink, a checkpoint is the mechanism behind fault tolerance and recovery for stream processing: it persists the job's state so that after a failure the job can be restored and continue processing. The connection to parallelism is that every parallel subtask generates and manages its own snapshot independently, as the source walk-through above showed. A simple example of configuring both:

// create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// set the parallelism to 4
env.setParallelism(4);
// enable checkpointing with a 10-second interval
env.enableCheckpointing(10000);
// use exactly-once mode
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// time out a checkpoint after 1 minute
env.getCheckpointConfig().setCheckpointTimeout(60000);
// allow at most 2 checkpoints in flight at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
// where checkpoints are stored
env.setStateBackend(new FsStateBackend("file:///tmp/checkpoints"));

// define a source and a computation
DataStream<String> input = env.socketTextStream("localhost", 9999);
input.flatMap(new FlatMapFunction<String, String>() {
   @Override
   public void flatMap(String s, Collector<String> collector) throws Exception {
      collector.collect(s.toUpperCase());
   }
}).print();

// start the job
env.execute("Flink Checkpoint Demo");

Here we create a StreamExecutionEnvironment with parallelism 4, enable checkpointing every 10 seconds in exactly-once mode with a 1-minute timeout, allow at most 2 concurrent checkpoints, store them under the local file system /tmp/checkpoints, and run a simple computation. Each of the 4 parallel subtasks takes part in every checkpoint independently, which is what gives the job its fault-tolerance and recovery guarantees.