Flink state restore: a source code walkthrough

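Earlier posts covered the Flink checkpoint flow and Flink checkpoint storage strategies. The state restore flow is triggered when a Flink job recovers from a failure, or when a user manually restores from a specific savepoint/checkpoint. This post walks through that flow in detail.
The code shown here is from flink-1.10.1.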

Like the checkpoint flow, restoring checkpoint state also involves the CheckpointCoordinator.

CheckpointCoordinator

When the JobMaster instance is created, the call chain
JobMaster.createScheduler() -> DefaultSchedulerFactory.createInstance() -> new DefaultScheduler() -> SchedulerBase.createAndRestoreExecutionGraph() -> SchedulerBase.tryRestoreExecutionGraphFromSavepoint() -> CheckpointCoordinator.restoreSavepoint() reaches the restoreSavepoint() method of CheckpointCoordinator, which starts the checkpoint state restore flow.

  1. Check that the savepoint/checkpoint path is valid.
    The path usually points either at the directory of one particular checkpoint/savepoint or directly at its _metadata file, for example:
    hdfs:///flink/checkpoints/e691d85c5f3e4996c8fa1e27689xxxxx/chk-117854
    hdfs:///flink/checkpoints/e691d85c5f3e4996c8fa1e27689xxxxx/chk-117854/_metadata
// CheckpointCoordinator.java
final CompletedCheckpointStorageLocation checkpointLocation = checkpointStorage.resolveCheckpoint(savepointPointer);
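
To make the two accepted pointer forms concrete, here is a minimal sketch of how a pointer could be normalized into a checkpoint directory plus a metadata file. The class CheckpointPointerSketch and its resolve() method are hypothetical names invented for illustration, and the sketch decides purely by file name; as far as I know, the real resolution code asks the file system whether the path is a directory, so treat this only as a model of the idea.

```java
final class CheckpointPointerSketch {

    // Name of the metadata file, as mentioned in the text above.
    static final String METADATA_FILE_NAME = "_metadata";

    /**
     * Returns {checkpointDirectory, metadataFile} for a pointer that names
     * either the checkpoint directory or the _metadata file inside it.
     * Assumption: the decision is made by name only, without touching storage.
     */
    static String[] resolve(String pointer) {
        String path = pointer;
        // Drop trailing slashes so "chk-1/" and "chk-1" behave the same.
        while (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        String suffix = "/" + METADATA_FILE_NAME;
        if (path.endsWith(suffix)) {
            // Pointer names the metadata file: the directory is its parent.
            String directory = path.substring(0, path.length() - suffix.length());
            return new String[] {directory, path};
        }
        // Pointer names the directory: the metadata file lives inside it.
        return new String[] {path, path + suffix};
    }

    public static void main(String[] args) {
        // Hypothetical example path, not taken from a real job.
        String[] resolved = resolve("hdfs:///flink/checkpoints/some-job/chk-42/_metadata");
        System.out.println("directory: " + resolved[0]);
        System.out.println("metadata:  " + resolved[1]);
    }
}
```

Both pointer forms resolve to the same pair, which is why either one works when restoring.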
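  2. Load the _metadata file and verify that the checkpoint matches the program being started:
  • The maximum parallelism recorded in the checkpoint equals the program's maximum parallelism (or the program does not configure that value explicitly).
  • Every operator's state in the checkpoint maps to an operator in the running program (unless the allowNonRestoredState option is enabled).

If validation fails, an IllegalStateException is thrown.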

// CheckpointCoordinator.java
// Load the savepoint as a checkpoint into the system
CompletedCheckpoint savepoint = Checkpoints.loadAndValidateCheckpoint(
        job, tasks, checkpointLocation, userClassLoader, allowNonRestored);
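
The two validation rules can be modeled in a few lines. The sketch below is a simplification under stated assumptions: operators are identified by plain strings, RestoreValidationSketch and JobOperator are hypothetical stand-ins rather than Flink classes, and only the two checks described above are reproduced.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class RestoreValidationSketch {

    /** Hypothetical stand-in for what the running job knows about one operator. */
    static final class JobOperator {
        final int maxParallelism;
        final boolean maxParallelismConfigured;

        JobOperator(int maxParallelism, boolean maxParallelismConfigured) {
            this.maxParallelism = maxParallelism;
            this.maxParallelismConfigured = maxParallelismConfigured;
        }
    }

    /**
     * Returns the operator ids whose state can be restored.
     *
     * @param checkpointMaxParallelism operator id -> max parallelism stored in the checkpoint
     * @param jobOperators operator id -> operator of the program being started
     */
    static Set<String> validate(
            Map<String, Integer> checkpointMaxParallelism,
            Map<String, JobOperator> jobOperators,
            boolean allowNonRestoredState) {
        Set<String> restorable = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : checkpointMaxParallelism.entrySet()) {
            String operatorId = entry.getKey();
            JobOperator jobOperator = jobOperators.get(operatorId);
            if (jobOperator == null) {
                // Rule 2: state without a matching operator is only tolerated on request.
                if (!allowNonRestoredState) {
                    throw new IllegalStateException(
                        "No operator in the job matches checkpointed operator " + operatorId);
                }
                continue;
            }
            // Rule 1: max parallelism must match unless the job left it unconfigured.
            boolean sameMaxParallelism = jobOperator.maxParallelism == entry.getValue();
            if (!sameMaxParallelism && jobOperator.maxParallelismConfigured) {
                throw new IllegalStateException(
                    "Max parallelism mismatch for operator " + operatorId);
            }
            restorable.add(operatorId);
        }
        return restorable;
    }
}
```

With allowNonRestoredState set to true, unmatched state is silently skipped instead of failing the restore, which is the trade-off that option makes.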
  3. Add the validated checkpoint/savepoint to completedCheckpointStore, then enter the restoreLatestCheckpointedState flow.
// CheckpointCoordinator.java
completedCheckpointStore.addCheckpoint(savepoint);

// Reset the checkpoint ID counter
long nextCheckpointId = savepoint.getCheckpointID() + 1;
checkpointIdCounter.setCount(nextCheckpointId);

LOG.info("Reset the checkpoint ID of job {} to {}.", job, nextCheckpointId);

return restoreLatestCheckpointedState(new HashSet<>(tasks.values()), true, allowNonRestored);
  4. Register shared state with the sharedStateRegistry, which tracks shared state and adds or deletes it depending on whether it is still referenced.
// CheckpointCoordinator.java
// Now, we re-register all (shared) states from the checkpoint store with the new registry
for (CompletedCheckpoint completedCheckpoint : completedCheckpointStore.getAllCheckpoints()) {
    completedCheckpoint.registerSharedStatesAfterRestored(sharedStateRegistry);
}
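
The "add or delete depending on references" behavior is essentially reference counting. The following sketch models that idea only; SharedStateRegistrySketch and its methods are hypothetical names, shared state is identified by a plain string key, and "deleting" is reduced to dropping the entry, so it should not be read as the actual registry implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedStateRegistrySketch {

    // Shared state key -> number of checkpoints that currently reference it.
    private final Map<String, Integer> referenceCounts = new HashMap<>();

    /** Records one more reference and returns the new reference count. */
    int register(String key) {
        return referenceCounts.merge(key, 1, Integer::sum);
    }

    /**
     * Drops one reference. Returns true when that was the last reference,
     * i.e. the point at which the underlying state could be deleted.
     */
    boolean unregister(String key) {
        Integer count = referenceCounts.get(key);
        if (count == null) {
            throw new IllegalStateException("Unknown shared state: " + key);
        }
        if (count == 1) {
            referenceCounts.remove(key);
            return true;
        }
        referenceCounts.put(key, count - 1);
        return false;
    }

    boolean isTracked(String key) {
        return referenceCounts.containsKey(key);
    }
}
```

Re-registering every retained checkpoint after a restore, as the loop above does, rebuilds these counts in the fresh registry so that shared files stay alive for as long as any checkpoint still uses them.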
  5. Assign operator state (managedOperatorState/managedKeyedState/…) to the individual tasks; this is where the actual state assignment logic lives.
    One OperatorState holds the raw/managed operator state and keyed state handles of all sub tasks of one logical operator.
// CheckpointCoordinator.java
            
StateAssignmentOperation stateAssignmentOperation =
        new StateAssignmentOperation(latest.getCheckpointID(), tasks, operatorStates, allowNonRestoredState);

stateAssignmentOperation.assignStates();

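assignStates() first checks again that every OperatorID in the state can be found in the current program, honoring the allowNonRestoredState setting. It then skips logical operators (ExecutionJobVertex) that have no state and collects the operatorStates belonging to each remaining one. Finally it calls assignAttemptState(), which distributes the operatorStates among the sub operators of the logical operator according to the redistribution rules.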

// StateAssignmentOperation.java
public void assignStates() {
    Map<OperatorID, OperatorState> localOperators = new HashMap<>(operatorStates);

    checkStateMappingCompleteness(allowNonRestoredState, operatorStates, tasks);

    for (ExecutionJobVertex executionJobVertex : this.tasks) {

        // find the states of all operators belonging to this task
        List<OperatorID> operatorIDs = executionJobVertex.getOperatorIDs();
        List<OperatorID> altOperatorIDs = executionJobVertex.getUserDefinedOperatorIDs();
        List<OperatorState> operatorStates = new ArrayList<>(operatorIDs.size());
        boolean statelessTask = true;
        for (int x = 0; x < operatorIDs.size(); x++) {
            OperatorID operatorID = altOperatorIDs.get(x) == null
                ? operatorIDs.get(x)
                : altOperatorIDs.get(x);

            OperatorState operatorState = localOperators.remove(operatorID);
            if (operatorState == null) {
                operatorState = new OperatorState(
                    operatorID,
                    executionJobVertex.getParallelism(),
                    executionJobVertex.getMaxParallelism());
            } else {
                statelessTask = false;
            }
            operatorStates.add(operatorState);
        }
        if (statelessTask) { // skip tasks where no operator has any state
            continue;
        }

        assignAttemptState(executionJobVertex, operatorStates);
    }
}

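Because an operator's parallelism may differ from the parallelism at the time the checkpoint was taken, Flink has to support rescaling state: the state data must be redistributed to the right sub operators for the new parallelism.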

  • For managed/raw operator state, Union state and Broadcast state are handed to every sub operator, while all other operator state is distributed across the sub operators round-robin; see RoundRobinOperatorStateRepartitioner.repartitionState() for the details. Three operator state distribution modes are involved:
// OperatorStateHandle.java
enum Mode {
    // Used by ListState: on restore, the state is split evenly across all sub tasks
    SPLIT_DISTRIBUTE,   // The operator state partitions in the state handle are split and distributed to one task each.
    // Used by UnionListState: on restore, the state is merged and then sent to every sub task
    UNION,              // The operator state partitions are UNION-ed upon restoring and sent to all tasks.
    // Used by BroadcastState: every sub task holds the same BroadcastState
    BROADCAST           // The operator states are identical, as the state is produced from a broadcast stream.
}

The three operator state distribution modes are illustrated in the following figure:
[Figure: operator_redistribution]
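
As a rough model of the three modes, the sketch below redistributes state partitions (represented as plain strings) from the old sub tasks to a new parallelism. OperatorStateRedistributionSketch is a hypothetical class written for this post; it deliberately ignores state handles, offsets and the balancing details of the real repartitioner, and for BROADCAST it simply assumes new sub task i copies the state of old sub task i modulo the old parallelism.

```java
import java.util.ArrayList;
import java.util.List;

final class OperatorStateRedistributionSketch {

    enum Mode { SPLIT_DISTRIBUTE, UNION, BROADCAST }

    /**
     * @param oldState one list of state partitions per old sub task
     * @return one list of state partitions per new sub task
     */
    static List<List<String>> redistribute(
            List<List<String>> oldState, int newParallelism, Mode mode) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>());
        }
        if (oldState.isEmpty()) {
            return result;
        }
        switch (mode) {
            case SPLIT_DISTRIBUTE: {
                // Deal the individual partitions out one by one, round-robin.
                int next = 0;
                for (List<String> subtaskState : oldState) {
                    for (String partition : subtaskState) {
                        result.get(next % newParallelism).add(partition);
                        next++;
                    }
                }
                break;
            }
            case UNION:
                // Every new sub task receives the union of all partitions.
                for (List<String> target : result) {
                    for (List<String> subtaskState : oldState) {
                        target.addAll(subtaskState);
                    }
                }
                break;
            case BROADCAST:
                // All old copies are identical, so any one of them will do.
                for (int i = 0; i < newParallelism; i++) {
                    result.get(i).addAll(oldState.get(i % oldState.size()));
                }
                break;
        }
        return result;
    }
}
```

For example, redistributing [[a, b], [c]] to parallelism 2 gives [[a, c], [b]] under SPLIT_DISTRIBUTE and [[a, b, c], [a, b, c]] under UNION.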

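  • The assignment of managed/raw keyed state is closely tied to the concept of the Key Group. A Key Group is the atomic unit in which keyed state is assigned, and the number of Key Groups in a Flink job equals the maximum parallelism, so Key Group indexes lie in the interval [0, maxParallelism - 1]. Each sub task handles one or more Key Groups, which the source code represents with the KeyGroupRange data structure. (The Key Group material here draws on a blog post.)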
// KeyGroupRange.java
public class KeyGroupRange implements KeyGroupsList, Serializable {

    private final int startKeyGroup;
    private final int endKeyGroup;

    /**
     * Defines the range [startKeyGroup, endKeyGroup]
     *
     * @param startKeyGroup start of the range (inclusive)
     * @param endKeyGroup end of the range (inclusive)
     */
    public KeyGroupRange(int startKeyGroup, int endKeyGroup) {
        Preconditions.checkArgument(startKeyGroup >= 0);
        Preconditions.checkArgument(startKeyGroup <= endKeyGroup);
        this.startKeyGroup = startKeyGroup;
        this.endKeyGroup = endKeyGroup;
        // ...
    }

    // ...
}
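
To see how key groups end up as one contiguous range per sub task, here is a small self-contained sketch. The arithmetic follows, to my understanding, what Flink's KeyGroupRangeAssignment does, but KeyGroupRangeSketch and its method names are hypothetical and ranges are returned as plain int arrays instead of KeyGroupRange objects.

```java
final class KeyGroupRangeSketch {

    /** Returns {startKeyGroup, endKeyGroup}, both inclusive, for one sub task. */
    static int[] rangeFor(int maxParallelism, int parallelism, int operatorIndex) {
        // Ceiling division for the start, floor division for the end, so that
        // neighbouring ranges touch without overlapping.
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    /** Returns the index of the sub task that owns the given key group. */
    static int operatorIndexFor(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int parallelism = 3;
        for (int i = 0; i < parallelism; i++) {
            int[] range = rangeFor(maxParallelism, parallelism, i);
            System.out.println("subtask " + i + " -> [" + range[0] + ", " + range[1] + "]");
        }
    }
}
```

With maxParallelism 128 and parallelism 3, the sub tasks own [0, 42], [43, 85] and [86, 127]. Changing the parallelism only moves the range boundaries; the key groups themselves stay intact, which is what makes them a convenient unit for rescaling keyed state.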