Flink state restore: a source code analysis

yuchuanchen

Category: flink-1.10 checkpoint. Tags: flink

Article link: https://blog.csdn.net/yuchuanchen/article/details/107006015

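Summary (auto-generated by CSDN): This article analyzes the state restore flow that runs when a Flink job recovers, starting from CheckpointCoordinator.restoreSavepoint(). It covers the state assignment logic, in particular the restore strategy for managed operator/keyed state, including state rescaling when the parallelism changes and the Key Group assignment mechanism. It also examines the role that StreamTask's initializeState() and open() methods play in restoring state at the UDF level.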

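Earlier posts covered the Flink checkpoint flow and the Flink checkpoint storage strategy. When a Flink job recovers from a failure, or when a user manually restores from a particular savepoint/checkpoint, the state restore flow is triggered. This post analyzes that flow in detail.
The code discussed here is based on flink-1.10.1.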

As with the checkpoint flow, restoring checkpoint state also involves the CheckpointCoordinator.

CheckpointCoordinator

When the JobMaster instance is created, the call chain
JobMaster.createScheduler() -> DefaultSchedulerFactory.createInstance() -> new DefaultScheduler() -> SchedulerBase.createAndRestoreExecutionGraph() -> SchedulerBase.tryRestoreExecutionGraphFromSavepoint() -> CheckpointCoordinator.restoreSavepoint()
reaches the restoreSavepoint() method of CheckpointCoordinator, which starts the checkpoint state restore flow.

Check that the savepoint/checkpoint path is valid
Usually we point either to the directory of one particular checkpoint/savepoint or directly to its _metadata file, for example:
hdfs:///flink/checkpoints/e691d85c5f3e4996c8fa1e27689xxxxx/chk-117854 or hdfs:///flink/checkpoints/e691d85c5f3e4996c8fa1e27689xxxxx/chk-117854/_metadata

// CheckpointCoordinator.java
final CompletedCheckpointStorageLocation checkpointLocation = checkpointStorage.resolveCheckpoint(savepointPointer);
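The "directory or _metadata file" rule can be sketched as follows. This is only an illustration on the local file system using java.nio; the class MetadataPointerSketch and its method are hypothetical names and are not part of Flink, whose resolveCheckpoint() works against its own file system abstraction.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not Flink code): resolves a restore pointer to its _metadata file. */
final class MetadataPointerSketch {

    static final String METADATA_FILE_NAME = "_metadata";

    /** Accepts a checkpoint directory or the _metadata file itself and returns the metadata file. */
    static Path resolveMetadataFile(Path pointer) throws IOException {
        if (!Files.exists(pointer)) {
            throw new IOException("Cannot find checkpoint or savepoint file/directory: " + pointer);
        }
        if (Files.isDirectory(pointer)) {
            // A directory must contain the metadata file.
            Path metadata = pointer.resolve(METADATA_FILE_NAME);
            if (!Files.exists(metadata)) {
                throw new IOException("No " + METADATA_FILE_NAME + " file found in " + pointer);
            }
            return metadata;
        }
        // Assumption: a regular file is taken to be the metadata file itself.
        return pointer;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chk-demo");
        Path metadata = Files.createFile(dir.resolve(METADATA_FILE_NAME));
        System.out.println(resolveMetadataFile(dir).equals(metadata));
        System.out.println(resolveMetadataFile(metadata).equals(metadata));
    }
}
```

Both pointer forms end up at the same metadata file, which is why either can be passed when restoring.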

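Load the _metadata file and verify that the checkpoint matches the program being started:

The max parallelism stored in the checkpoint equals the max parallelism of the current program (or the program does not configure that value explicitly)
The state of every operator in the checkpoint can be mapped to an operator of the current program (unless the allowNonRestoredState option is enabled)

If validation fails, an IllegalStateException is thrown.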

// CheckpointCoordinator.java
// Load the savepoint as a checkpoint into the system
CompletedCheckpoint savepoint = Checkpoints.loadAndValidateCheckpoint(
        job, tasks, checkpointLocation, userClassLoader, allowNonRestored);
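The two validation rules can be condensed into a small sketch. The classes SavedOperator and JobOperator and the validate() method are hypothetical stand-ins introduced for illustration; they are not the types used by Checkpoints.loadAndValidateCheckpoint(), and operators are identified by plain strings here.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the two validation rules; all names are hypothetical. */
final class RestoreValidationSketch {

    /** Stand-in for the per-operator information stored in the checkpoint. */
    static final class SavedOperator {
        final int maxParallelism;
        SavedOperator(int maxParallelism) { this.maxParallelism = maxParallelism; }
    }

    /** Stand-in for an operator of the program being started. */
    static final class JobOperator {
        final int maxParallelism;
        final boolean maxParallelismConfigured;
        JobOperator(int maxParallelism, boolean maxParallelismConfigured) {
            this.maxParallelism = maxParallelism;
            this.maxParallelismConfigured = maxParallelismConfigured;
        }
    }

    static void validate(Map<String, SavedOperator> saved,
                         Map<String, JobOperator> job,
                         boolean allowNonRestoredState) {
        for (Map.Entry<String, SavedOperator> entry : saved.entrySet()) {
            JobOperator operator = job.get(entry.getKey());
            if (operator == null) {
                // Rule 2: state without a matching operator is only tolerated on request.
                if (allowNonRestoredState) {
                    continue;
                }
                throw new IllegalStateException("Cannot map state of operator " + entry.getKey());
            }
            // Rule 1: max parallelism must match unless the program left it unconfigured.
            boolean compatible = operator.maxParallelism == entry.getValue().maxParallelism
                    || !operator.maxParallelismConfigured;
            if (!compatible) {
                throw new IllegalStateException("Max parallelism mismatch for operator " + entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, SavedOperator> saved = new HashMap<>();
        saved.put("op-1", new SavedOperator(128));
        Map<String, JobOperator> job = new HashMap<>();
        job.put("op-1", new JobOperator(128, true));
        validate(saved, job, false);
        System.out.println("validation passed");
    }
}
```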

A checkpoint/savepoint that passes validation is added to completedCheckpointStore, and the flow continues with restoreLatestCheckpointedState

// CheckpointCoordinator.java
completedCheckpointStore.addCheckpoint(savepoint);

// Reset the checkpoint ID counter
long nextCheckpointId = savepoint.getCheckpointID() + 1;
checkpointIdCounter.setCount(nextCheckpointId);

LOG.info("Reset the checkpoint ID of job {} to {}.", job, nextCheckpointId);

return restoreLatestCheckpointedState(new HashSet<>(tasks.values()), true, allowNonRestored);

Register the shared state with sharedStateRegistry (which tracks shared state and adds or removes it depending on whether it is still referenced)

// CheckpointCoordinator.java
// Now, we re-register all (shared) states from the checkpoint store with the new registry
for (CompletedCheckpoint completedCheckpoint : completedCheckpointStore.getAllCheckpoints()) {
    completedCheckpoint.registerSharedStatesAfterRestored(sharedStateRegistry);
}
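The idea of tracking shared state by references can be illustrated with a plain reference counter. This is a simplified sketch under the assumption that each piece of shared state is identified by a string key; RefCountingRegistrySketch is a hypothetical class and does not reproduce the actual SharedStateRegistry API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reference-counting registry, loosely modelled on the idea behind SharedStateRegistry. */
final class RefCountingRegistrySketch {

    private final Map<String, Integer> referenceCounts = new HashMap<>();

    /** Adds one reference to the shared state and returns the new reference count. */
    int register(String stateKey) {
        return referenceCounts.merge(stateKey, 1, Integer::sum);
    }

    /** Drops one reference; returns true when nothing references the state any more. */
    boolean unregister(String stateKey) {
        Integer count = referenceCounts.get(stateKey);
        if (count == null) {
            throw new IllegalStateException("Unknown shared state: " + stateKey);
        }
        if (count == 1) {
            referenceCounts.remove(stateKey);
            return true; // the caller may now delete the underlying file
        }
        referenceCounts.put(stateKey, count - 1);
        return false;
    }

    public static void main(String[] args) {
        RefCountingRegistrySketch registry = new RefCountingRegistrySketch();
        registry.register("sst-000017"); // referenced by one retained checkpoint
        registry.register("sst-000017"); // and by a second one
        System.out.println(registry.unregister("sst-000017"));
        System.out.println(registry.unregister("sst-000017"));
    }
}
```

After a restore the registry starts out empty, so every retained checkpoint has to register its shared state again before any of it can be safely discarded.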

Assign the operator state (managedOperatorState/managedKeyedState/…) to the individual Tasks; this is where the actual state assignment logic lives.
One OperatorState holds the raw/managed operator state handles and keyed state handles of all sub tasks of one logical operator

// CheckpointCoordinator.java
StateAssignmentOperation stateAssignmentOperation =
        new StateAssignmentOperation(latest.getCheckpointID(), tasks, operatorStates, allowNonRestoredState);

stateAssignmentOperation.assignStates();

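assignStates() first checks once more that every OperatorID in the state can be found in the current program, again honoring the allowNonRestoredState setting. It then skips logical operators (ExecutionJobVertex) that have no state and collects the operatorStates that belong to each remaining logical operator. Finally it calls assignAttemptState(), which distributes those operatorStates to the sub operators of the logical operator according to the redistribution rules.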

// StateAssignmentOperation.java
public void assignStates() {
    Map<OperatorID, OperatorState> localOperators = new HashMap<>(operatorStates);

    checkStateMappingCompleteness(allowNonRestoredState, operatorStates, tasks);

    for (ExecutionJobVertex executionJobVertex : this.tasks) {

        // find the states of all operators belonging to this task
        List<OperatorID> operatorIDs = executionJobVertex.getOperatorIDs();
        List<OperatorID> altOperatorIDs = executionJobVertex.getUserDefinedOperatorIDs();
        List<OperatorState> operatorStates = new ArrayList<>(operatorIDs.size());
        boolean statelessTask = true;
        for (int x = 0; x < operatorIDs.size(); x++) {
            OperatorID operatorID = altOperatorIDs.get(x) == null
                ? operatorIDs.get(x)
                : altOperatorIDs.get(x);

            OperatorState operatorState = localOperators.remove(operatorID);
            if (operatorState == null) {
                operatorState = new OperatorState(
                    operatorID,
                    executionJobVertex.getParallelism(),
                    executionJobVertex.getMaxParallelism());
            } else {
                statelessTask = false;
            }
            operatorStates.add(operatorState);
        }
        if (statelessTask) { // skip tasks where no operator has any state
            continue;
        }

        assignAttemptState(executionJobVertex, operatorStates);
    }

}

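Because the parallelism of an operator may differ from the parallelism it had when the checkpoint was taken, Flink has to support state rescaling: the state data must be redistributed to the matching sub operators under the new parallelism.

For managed/raw operator state, Union state and Broadcast state are handed to every sub operator, while all other managed operator state is distributed across the sub operators in round-robin fashion; see RoundRobinOperatorStateRepartitioner.repartitionState() for the details. Three redistribution modes are involved: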

// OperatorStateHandle.java
enum Mode {
    // Corresponds to ListState: on restore, the state is spread evenly across all sub tasks
    SPLIT_DISTRIBUTE,   // The operator state partitions in the state handle are split and distributed to one task each.
    // Corresponds to UnionListState: on restore, the state is merged and then sent to all sub tasks
    UNION,              // The operator state partitions are UNION-ed upon restoring and sent to all tasks.
    // Corresponds to BroadcastState: every sub task holds the same BroadcastState
    BROADCAST           // The operator states are identical, as the state is produced from a broadcast stream.
}

The three redistribution modes for operator state are illustrated in the following figure:
[Figure: operator_redistribution]
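In place of the figure, the three modes can also be sketched in code. This is a deliberately simplified model: state partitions are plain list elements, SPLIT_DISTRIBUTE is done as a naive element-by-element round robin, and BROADCAST picks the copy of old sub task i % oldParallelism. The class and method names are hypothetical, and the real RoundRobinOperatorStateRepartitioner works on state handles and offsets rather than on lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of the three operator state redistribution modes. */
final class OperatorStateRedistributionSketch {

    enum Mode { SPLIT_DISTRIBUTE, UNION, BROADCAST }

    /** Takes the partitions held by each old sub task and returns the partitions for each new sub task. */
    static <T> List<List<T>> redistribute(List<List<T>> oldSubtaskStates, int newParallelism, Mode mode) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>());
        }
        if (oldSubtaskStates.isEmpty()) {
            return result;
        }
        switch (mode) {
            case SPLIT_DISTRIBUTE: {
                // Deal the partitions out one by one, like cards.
                int next = 0;
                for (List<T> old : oldSubtaskStates) {
                    for (T partition : old) {
                        result.get(next % newParallelism).add(partition);
                        next++;
                    }
                }
                break;
            }
            case UNION:
                // Every new sub task receives the union of all partitions.
                for (List<T> target : result) {
                    for (List<T> old : oldSubtaskStates) {
                        target.addAll(old);
                    }
                }
                break;
            case BROADCAST:
                // All old copies are identical, so each new sub task takes one complete copy.
                for (int i = 0; i < newParallelism; i++) {
                    result.get(i).addAll(oldSubtaskStates.get(i % oldSubtaskStates.size()));
                }
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> old = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.println(redistribute(old, 2, Mode.SPLIT_DISTRIBUTE));
        System.out.println(redistribute(old, 2, Mode.UNION));
    }
}
```

Note that UNION multiplies the restored data by the new parallelism, so each sub task is expected to pick out the part it actually needs after the restore.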

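The assignment of managed/raw keyed state is closely tied to the concept of the Key Group. A Key Group is the atomic unit of keyed state assignment, and the number of Key Groups in a Flink job equals the max parallelism, so Key Group indexes lie in the range [0, maxParallelism - 1]. Each sub task handles one or more Key Groups, which the source code represents with the KeyGroupRange data structure. (The Key Group material here draws on another blog post.)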
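As a rough illustration of how Key Groups map onto sub tasks, the sketch below splits [0, maxParallelism - 1] into contiguous ranges, one per sub task. The formulas follow my understanding of Flink's KeyGroupRangeAssignment and should be treated as an assumption rather than as a copy of that class; KeyGroupRangeSketch and its method names are hypothetical.

```java
/** Hypothetical sketch: contiguous Key Group ranges per sub task (formulas are an assumption). */
final class KeyGroupRangeSketch {

    /** Returns {start, end} of the Key Group range of one sub task, both inclusive. */
    static int[] rangeForOperatorIndex(int maxParallelism, int parallelism, int operatorIndex) {
        if (parallelism <= 0 || parallelism > maxParallelism) {
            throw new IllegalArgumentException("parallelism must be in [1, maxParallelism]");
        }
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    /** Returns the index of the sub task that owns the given Key Group. */
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        for (int parallelism : new int[] {2, 3}) {
            for (int i = 0; i < parallelism; i++) {
                int[] range = rangeForOperatorIndex(maxParallelism, parallelism, i);
                System.out.println("parallelism " + parallelism + ", sub task " + i
                        + " -> [" + range[0] + ", " + range[1] + "]");
            }
        }
    }
}
```

With maxParallelism 128, a parallelism of 2 gives the ranges [0, 63] and [64, 127], while a parallelism of 3 gives [0, 42], [43, 85] and [86, 127]. Rescaling therefore only moves whole Key Groups between sub tasks and never splits one.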

// KeyGroupRange.java
public class KeyGroupRange implements KeyGroupsList, Serializable {
	private final int startKeyGroup;
	private final int endKeyGroup;

	/**
	 * Defines the range [startKeyGroup, endKeyGroup]
	 *
	 * @param startKeyGroup start of the range (inclusive)
	 * @param endKeyGroup end of the range (inclusive)
	 */
	public KeyGroupRange(int startKeyGroup, int endKeyGroup) {
		Preconditions.checkArgument(startKeyGroup >= 0);
		Preconditions.checkArgument(startKeyGroup <= endKeyGroup);
		this.startKeyGroup = startKeyGroup;
		this.endKeyGroup = endKeyGroup