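Previously we looked at the Flink checkpoint flow and the Flink checkpoint storage strategies. When a Flink job recovers from a failure, or when a user manually restores from a particular savepoint/checkpoint, the state restore flow is triggered. This article analyzes that flow in detail.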
The code in this article is based on flink-1.10.1.
As with the checkpoint flow, checkpoint state restore also requires the CheckpointCoordinator.
CheckpointCoordinator
When a JobMaster instance is created, the call chain
JobMaster.createScheduler() -> DefaultSchedulerFactory.createInstance() -> new DefaultScheduler() -> SchedulerBase.createAndRestoreExecutionGraph() -> SchedulerBase.tryRestoreExecutionGraphFromSavepoint() -> CheckpointCoordinator.restoreSavepoint()
reaches the restoreSavepoint() method of CheckpointCoordinator and enters the checkpoint state restore flow.
- Check that the savepoint/checkpoint path is valid
The path usually points either to the directory of a specific checkpoint/savepoint or directly to its _metadata file, for example:
hdfs:///flink/checkpoints/e691d85c5f3e4996c8fa1e27689xxxxx/chk-117854
or hdfs:///flink/checkpoints/e691d85c5f3e4996c8fa1e27689xxxxx/chk-117854/_metadata
// CheckpointCoordinator.java
final CompletedCheckpointStorageLocation checkpointLocation = checkpointStorage.resolveCheckpoint(savepointPointer);
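The two accepted path forms can be illustrated with a minimal sketch. It is not Flink's resolveCheckpoint() implementation: the class CheckpointPointerSketch and the method resolveMetadataFile are hypothetical names, and the sketch assumes a local file system (java.nio.file) instead of HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; not part of Flink. */
class CheckpointPointerSketch {

    static final String METADATA_FILE_NAME = "_metadata";

    /** Returns the metadata file that a checkpoint/savepoint pointer refers to. */
    static Path resolveMetadataFile(Path pointer) throws IOException {
        if (!Files.exists(pointer)) {
            throw new IOException("Cannot find checkpoint or savepoint at " + pointer);
        }
        if (Files.isDirectory(pointer)) {
            // Pointer is a checkpoint directory: the metadata file must be inside it.
            Path metadata = pointer.resolve(METADATA_FILE_NAME);
            if (!Files.exists(metadata)) {
                throw new IOException("No " + METADATA_FILE_NAME + " file in " + pointer);
            }
            return metadata;
        }
        // Pointer already names a file: treat it as the metadata file itself.
        return pointer;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chk-demo");
        Path metadata = Files.createFile(dir.resolve(METADATA_FILE_NAME));
        System.out.println(resolveMetadataFile(dir));      // directory form
        System.out.println(resolveMetadataFile(metadata)); // _metadata file form
    }
}
```

Both calls in main resolve to the same _metadata file, which is the point of accepting either form.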
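- Load the _metadata file and verify that the checkpoint matches the currently running program:
- the maximum parallelism in the checkpoint is the same as the program's maximum parallelism (or the program has not configured this value manually)
- the state of every operator in the checkpoint can be mapped to an operator in the running program (unless the allowNonRestoredState option is enabled)
If validation fails, an IllegalStateException is thrown.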
// CheckpointCoordinator.java
// Load the savepoint as a checkpoint into the system
CompletedCheckpoint savepoint = Checkpoints.loadAndValidateCheckpoint(
job, tasks, checkpointLocation, userClassLoader, allowNonRestored);
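The two validation rules above can be sketched as follows. This is a simplified model, not the body of Checkpoints.loadAndValidateCheckpoint(): operators are identified by plain strings, the class RestoreValidationSketch and its parameters are hypothetical, and adopting the checkpoint's maximum parallelism when the user has not configured one is an assumption of the sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the compatibility checks; not Flink code. */
class RestoreValidationSketch {

    static void validate(
            Map<String, Integer> checkpointMaxParallelism, // operator id -> max parallelism in the checkpoint
            Map<String, Integer> jobMaxParallelism,        // operator id -> max parallelism in the running job
            Set<String> explicitlyConfigured,              // operators whose max parallelism was set by the user
            boolean allowNonRestoredState) {

        for (Map.Entry<String, Integer> entry : checkpointMaxParallelism.entrySet()) {
            String operatorId = entry.getKey();
            Integer current = jobMaxParallelism.get(operatorId);

            if (current == null) {
                // State in the checkpoint has no matching operator in the job.
                if (allowNonRestoredState) {
                    continue;
                }
                throw new IllegalStateException("Cannot map state of operator " + operatorId);
            }

            if (!current.equals(entry.getValue())) {
                if (explicitlyConfigured.contains(operatorId)) {
                    throw new IllegalStateException("Max parallelism mismatch for operator " + operatorId);
                }
                // Assumption: a value that was not set by the user is overridden by the checkpoint's value.
                jobMaxParallelism.put(operatorId, entry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> checkpoint = new HashMap<>();
        checkpoint.put("op-a", 128);
        Map<String, Integer> job = new HashMap<>();
        job.put("op-a", 64);
        validate(checkpoint, job, Collections.<String>emptySet(), false);
        System.out.println(job);
    }
}
```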
- Add the validated checkpoint/savepoint to completedCheckpointStore, then enter the restoreLatestCheckpointedState flow
// CheckpointCoordinator.java
completedCheckpointStore.addCheckpoint(savepoint);
// Reset the checkpoint ID counter
long nextCheckpointId = savepoint.getCheckpointID() + 1;
checkpointIdCounter.setCount(nextCheckpointId);
LOG.info("Reset the checkpoint ID of job {} to {}.", job, nextCheckpointId);
return restoreLatestCheckpointedState(new HashSet<>(tasks.values()), true, allowNonRestored);
- Register the shared state with sharedStateRegistry (which tracks shared state and adds or deletes it depending on whether it is still referenced)
// CheckpointCoordinator.java
// Now, we re-register all (shared) states from the checkpoint store with the new registry
for (CompletedCheckpoint completedCheckpoint : completedCheckpointStore.getAllCheckpoints()) {
    completedCheckpoint.registerSharedStatesAfterRestored(sharedStateRegistry);
}
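The idea of tracking shared state by reference can be sketched with a small reference-counting registry. This is only a conceptual model under simplifying assumptions: the class SharedStateRegistrySketch and its methods are hypothetical, state is identified by a plain string key, and nothing here mirrors the actual SharedStateRegistry API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reference-counting registry for illustration; not Flink's SharedStateRegistry. */
class SharedStateRegistrySketch {

    private final Map<String, Integer> referenceCounts = new HashMap<>();

    /** Called once for every checkpoint that references the given shared state. */
    void register(String stateKey) {
        referenceCounts.merge(stateKey, 1, Integer::sum);
    }

    /** Returns true when the last reference is gone and the state may be deleted. */
    boolean unregister(String stateKey) {
        Integer count = referenceCounts.get(stateKey);
        if (count == null) {
            throw new IllegalStateException("Unknown shared state: " + stateKey);
        }
        if (count == 1) {
            referenceCounts.remove(stateKey);
            return true;
        }
        referenceCounts.put(stateKey, count - 1);
        return false;
    }

    int referenceCount(String stateKey) {
        return referenceCounts.getOrDefault(stateKey, 0);
    }

    public static void main(String[] args) {
        SharedStateRegistrySketch registry = new SharedStateRegistrySketch();
        registry.register("sst-1"); // referenced by one checkpoint
        registry.register("sst-1"); // referenced by a second checkpoint
        System.out.println(registry.unregister("sst-1")); // still referenced
        System.out.println(registry.unregister("sst-1")); // last reference released
    }
}
```

After a restore the registry starts out empty, which is why every checkpoint in the store has to re-register its shared state before any of it can be safely discarded.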
- Assign the operator state (managedOperatorState/managedKeyedState/…) to the individual tasks; this is where the actual state assignment logic comes in.
An OperatorState holds the raw/managed operator state and keyed state handles of all sub tasks of one logical operator.
// CheckpointCoordinator.java
StateAssignmentOperation stateAssignmentOperation =
new StateAssignmentOperation(latest.getCheckpointID(), tasks, operatorStates, allowNonRestoredState);
stateAssignmentOperation.assignStates();
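assignStates() first checks again that every OperatorID in the state can be found in the current program, honoring the allowNonRestoredState setting. For each logical operator (ExecutionJobVertex) it then collects the corresponding operatorStates and skips the vertex if none of its operators has any state. Finally it calls assignAttemptState(), which distributes the operatorStates to the sub operators of the logical operator according to the rules described below.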
// StateAssignmentOperation.java
public void assignStates() {
    Map<OperatorID, OperatorState> localOperators = new HashMap<>(operatorStates);
    checkStateMappingCompleteness(allowNonRestoredState, operatorStates, tasks);
    for (ExecutionJobVertex executionJobVertex : this.tasks) {
        // find the states of all operators belonging to this task
        List<OperatorID> operatorIDs = executionJobVertex.getOperatorIDs();
        List<OperatorID> altOperatorIDs = executionJobVertex.getUserDefinedOperatorIDs();
        List<OperatorState> operatorStates = new ArrayList<>(operatorIDs.size());
        boolean statelessTask = true;
        for (int x = 0; x < operatorIDs.size(); x++) {
            OperatorID operatorID = altOperatorIDs.get(x) == null
                ? operatorIDs.get(x)
                : altOperatorIDs.get(x);
            OperatorState operatorState = localOperators.remove(operatorID);
            if (operatorState == null) {
                operatorState = new OperatorState(
                    operatorID,
                    executionJobVertex.getParallelism(),
                    executionJobVertex.getMaxParallelism());
            } else {
                statelessTask = false;
            }
            operatorStates.add(operatorState);
        }
        if (statelessTask) {
            // skip tasks where no operator has any state
            continue;
        }
        assignAttemptState(executionJobVertex, operatorStates);
    }
}
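Because an operator's parallelism may differ from the parallelism at the time the checkpoint was taken, Flink has to support state rescaling, so that state data can be redistributed to the appropriate sub operators under a different parallelism.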
- For managed/raw operator state, Union state and Broadcast state are handed to every sub operator, while all other managed operator state is distributed to the sub operators in round-robin fashion; see RoundRobinOperatorStateRepartitioner.repartitionState() for the detailed code. This involves the three distribution modes of operator state:
// OperatorStateHandle.java
enum Mode {
    // Corresponds to ListState: on restore, the state data is distributed evenly across all sub tasks
    SPLIT_DISTRIBUTE, // The operator state partitions in the state handle are split and distributed to one task each.
    // Corresponds to UnionListState: on restore, the state data is merged and then sent to all sub tasks
    UNION, // The operator state partitions are UNION-ed upon restoring and sent to all tasks.
    // Corresponds to BroadcastState: the BroadcastState of every sub task is identical
    BROADCAST // The operator states are identical, as the state is produced from a broadcast stream.
}
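To make the three modes concrete, here is a minimal sketch of how state could be redistributed when the parallelism changes. It is a deliberately simplified model and not RoundRobinOperatorStateRepartitioner: state partitions are plain strings rather than state handles with offsets, and the class OperatorStateRepartitionSketch and its repartition method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of the three distribution modes; not Flink code. */
class OperatorStateRepartitionSketch {

    enum Mode { SPLIT_DISTRIBUTE, UNION, BROADCAST }

    /** oldStates.get(i) holds the state partitions written by old sub task i. */
    static List<List<String>> repartition(List<List<String>> oldStates, int newParallelism, Mode mode) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>());
        }
        if (oldStates.isEmpty()) {
            return result;
        }
        switch (mode) {
            case SPLIT_DISTRIBUTE: {
                // Hand out the partitions one by one, cycling over the new sub tasks.
                int next = 0;
                for (List<String> subtaskState : oldStates) {
                    for (String partition : subtaskState) {
                        result.get(next % newParallelism).add(partition);
                        next++;
                    }
                }
                break;
            }
            case UNION: {
                // Merge everything, then give the full union to every new sub task.
                List<String> all = new ArrayList<>();
                for (List<String> subtaskState : oldStates) {
                    all.addAll(subtaskState);
                }
                for (List<String> target : result) {
                    target.addAll(all);
                }
                break;
            }
            case BROADCAST: {
                // All old copies are identical, so any one of them serves a new sub task.
                for (int i = 0; i < newParallelism; i++) {
                    result.get(i).addAll(oldStates.get(i % oldStates.size()));
                }
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> old = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.println(repartition(old, 2, Mode.SPLIT_DISTRIBUTE));
        System.out.println(repartition(old, 2, Mode.UNION));
    }
}
```

With old state [[a, b], [c]] and a new parallelism of 2, SPLIT_DISTRIBUTE in this sketch yields [[a, c], [b]], while UNION gives both sub tasks [a, b, c].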
The three distribution modes of operator state can be illustrated by the figure below:
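- The assignment of managed/raw keyed state is closely tied to the concept of the Key Group. A Key Group is the atomic unit of keyed state assignment, and the number of Key Groups in a Flink job equals the maximum parallelism; in other words, Key Group indices lie in the interval [0, maxParallelism - 1]. Each sub task handles one or more Key Groups, which the source code represents with the KeyGroupRange data structure. (The Key Group material draws on a blog post.)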
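As a rough illustration of how contiguous Key Group ranges can be derived per sub task, the sketch below splits [0, maxParallelism - 1] across a given parallelism. The class KeyGroupAssignmentSketch and its method names are hypothetical; the arithmetic follows what Flink's KeyGroupRangeAssignment is understood to do, which should be treated as an assumption and checked against the 1.10.1 source.

```java
/** Hypothetical sketch of Key Group range arithmetic; verify against Flink's own code. */
class KeyGroupAssignmentSketch {

    /** Returns {start, end} (both inclusive) of the Key Groups owned by the given sub task. */
    static int[] rangeForSubtask(int maxParallelism, int parallelism, int subtaskIndex) {
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    /** Inverse direction: which sub task owns a given Key Group. */
    static int subtaskForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int parallelism = 3;
        for (int i = 0; i < parallelism; i++) {
            int[] range = rangeForSubtask(maxParallelism, parallelism, i);
            System.out.println("sub task " + i + " -> [" + range[0] + ", " + range[1] + "]");
        }
    }
}
```

With maxParallelism = 128 and parallelism = 3 the ranges come out as [0, 42], [43, 85] and [86, 127]. Because the ranges are contiguous and cover every Key Group exactly once, rescaling only changes where the range boundaries fall, and a Key Group is never split.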
// KeyGroupRange.java
public class KeyGroupRange implements KeyGroupsList, Serializable {
    private final int startKeyGroup;
    private final int endKeyGroup;
    /**
     * Defines the range [startKeyGroup, endKeyGroup]
     *
     * @param startKeyGroup start of the range (inclusive)
     * @param endKeyGroup end of the range (inclusive)
     */
    public KeyGroupRange(int startKeyGroup, int endKeyGroup) {
        Preconditions.checkArgument(startKeyGroup >= 0);
        Preconditions.checkArgument(startKeyGroup <= endKeyGroup);
        this.startKeyGroup = startKeyGroup;
        this.endKeyGroup = endKeyGroup