Using ZooKeeper in Flink Fault Tolerance
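Two recovery modes
The JobManager's high-availability setting is modeled by HighAvailabilityMode, an enum with two values:
- NONE: no HA (high availability)
- ZOOKEEPER: JobManager high availability implemented on top of ZooKeeper
With NONE, a failed JobManager is not recovered; with ZOOKEEPER, the JobManager achieves HA through ZooKeeper.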
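As a rough illustration of how such a mode might be selected from a configuration string, the sketch below defines a stand-alone enum with the same two values. The enum name `HaModeSketch`, the `fromConfigValue` helper and its lenient parsing are assumptions made for this example; they are not Flink's actual code.

```java
import java.util.Locale;

// Illustrative stand-in for Flink's HighAvailabilityMode; not the real class.
enum HaModeSketch {
    NONE,       // a failed JobManager is not recovered
    ZOOKEEPER;  // JobManager HA backed by ZooKeeper

    // Hypothetical helper: maps a configuration string to a mode.
    // A missing or blank value falls back to NONE; unknown values
    // make valueOf throw IllegalArgumentException.
    static HaModeSketch fromConfigValue(String value) {
        if (value == null || value.trim().isEmpty()) {
            return NONE;
        }
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

Falling back to NONE for a missing value mirrors the idea that HA is opt-in; whether a real deployment should fail fast instead is a design choice.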
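Two types of checkpoints
As mentioned in earlier articles, Flink distinguishes two kinds of checkpoints:
- PendingCheckpoint (a checkpoint that is still in progress)
- CompletedCheckpoint (a checkpoint that has completed)
A PendingCheckpoint is a checkpoint that has been created but has not yet been acknowledged by every task that must acknowledge it. Once all of those tasks have acknowledged, it is converted into a CompletedCheckpoint.
Code path:
Receive actor message ==> JobManager[handleCheckpointMessage]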
==> CheckpointCoordinator[receiveAcknowledgeMessage]
==> CheckpointCoordinator[completePendingCheckpoint]
==> PendingCheckpoint[finalizeCheckpoint]
==> new CompletedCheckpoint()
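After receiving an acknowledge message, CheckpointCoordinator[receiveAcknowledgeMessage] checks whether the checkpoint has been acknowledged by all tasks. Only when it has is a CompletedCheckpoint built from the fields of the pending checkpoint, and the newly created instance is returned.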
if (checkpoint.isFullyAcknowledged()) {
    completePendingCheckpoint(checkpoint);
}
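The acknowledgement bookkeeping behind that check can be sketched as follows. This is a simplified model, not Flink's PendingCheckpoint: the class name, the use of plain strings as task IDs and the string returned in place of a real CompletedCheckpoint are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of a pending checkpoint; all names are illustrative only.
final class PendingCheckpointSketch {
    private final long checkpointId;
    private final Set<String> notYetAcknowledged;

    PendingCheckpointSketch(long checkpointId, Set<String> tasksToAcknowledge) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(tasksToAcknowledge);
    }

    // Records one acknowledgement; returns false for unknown or duplicate acks.
    boolean acknowledge(String taskId) {
        return notYetAcknowledged.remove(taskId);
    }

    // True once every expected task has answered.
    boolean isFullyAcknowledged() {
        return notYetAcknowledged.isEmpty();
    }

    // Conversion is only legal after all tasks have acknowledged.
    String finalizeCheckpoint() {
        if (!isFullyAcknowledged()) {
            throw new IllegalStateException(
                "Pending checkpoint " + checkpointId + " is not fully acknowledged");
        }
        return "CompletedCheckpoint#" + checkpointId;
    }
}
```

A coordinator would call `acknowledge` for each incoming message and only call `finalizeCheckpoint` when `isFullyAcknowledged` returns true.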
PendingCheckpoint converts itself into a completed checkpoint through its finalizeCheckpoint instance method.
Its core implementation is as follows:
public CompletedCheckpoint finalizeCheckpoint() throws IOException {
    synchronized (lock) {
        // make sure we fulfill the promise with an exception if something fails
        try {
            // write out the checkpoint metadata
            final Savepoint savepoint = new SavepointV2(checkpointId, operatorStates.values(), masterState);
            final CompletedCheckpointStorageLocation finalizedLocation;
            try (CheckpointMetadataOutputStream out = targetLocation.createMetadataOutputStream()) {
                Checkpoints.storeCheckpointMetadata(savepoint, out);
                finalizedLocation = out.closeAndFinalizeCheckpoint();
            }
            CompletedCheckpoint completed = new CompletedCheckpoint(
                jobId,
                checkpointId,
                checkpointTimestamp,
                System.currentTimeMillis(),
                operatorStates,
                masterState,
                props,
                finalizedLocation);
            onCompletionPromise.complete(completed);
            // ...
            // mark this PendingCheckpoint as disposed without releasing the state
            // (the CompletedCheckpoint now holds its own copy of operatorStates)
            dispose(false);
            return completed;
        }
        catch (Throwable t) {
            // ...
        }
    }
}
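Note: before the CompletedCheckpoint is returned, dispose is called to release resources. The argument false means the task state is not released (the state referenced by operatorStates is kept),
because operatorStates is copied into the CompletedCheckpoint instance when that instance is constructed.
The final release of this task state is left to the discard logic of CompletedCheckpoint. By contrast, the dispose logic of PendingCheckpoint only discards state when releaseState is true: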
if (!discarded && releaseState) {
    executor.execute(new Runnable() {
        @Override
        public void run() {
            try {
                StateUtil.bestEffortDiscardAllStateObjects(operatorStates.values());
                targetLocation.disposeOnFailure();
            } catch (Throwable t) {
                // …
            } finally {
                operatorStates.clear();
            }
        }
    });
}
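Storing completed checkpoints
To store completed checkpoints, Flink provides the CompletedCheckpointStore interface and two implementations:
- StandaloneCompletedCheckpointStore: keeps checkpoints in an ArrayDeque on the JVM heap.
- ZooKeeperCompletedCheckpointStore: more involved; combines distributed storage through ZooKeeperStateHandleStore with an in-memory ArrayDeque.
The CompletedCheckpointStore interface defines four key methods:
- recover: recovers the CompletedCheckpoint instances that are still accessible
- addCheckpoint: adds a completed checkpoint to the set of checkpoints
- getLatestCheckpoint: returns the most recent checkpoint
- shutdown(JobStatus jobStatus): shuts down the store; the job status is passed along and decides whether state is actually discarded or retained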
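To make the shape of such a store concrete, here is a minimal in-memory sketch in the spirit of the standalone variant. It is an assumption-laden simplification: checkpoints are reduced to their IDs, the class and method signatures are invented for this example, and the job-status-dependent behavior of shutdown is left out.

```java
import java.util.ArrayDeque;

// In-memory sketch of a completed-checkpoint store; checkpoints are just IDs here.
final class InMemoryCheckpointStoreSketch {
    private final int maxRetained;
    private final ArrayDeque<Long> checkpoints = new ArrayDeque<>();

    InMemoryCheckpointStoreSketch(int maxRetained) {
        if (maxRetained < 1) {
            throw new IllegalArgumentException("Must retain at least one checkpoint");
        }
        this.maxRetained = maxRetained;
    }

    // Appends the newest checkpoint and drops the oldest ones beyond the limit.
    void addCheckpoint(long checkpointId) {
        checkpoints.addLast(checkpointId);
        while (checkpoints.size() > maxRetained) {
            checkpoints.removeFirst();
        }
    }

    // Returns the newest checkpoint ID, or null if the store is empty.
    Long getLatestCheckpoint() {
        return checkpoints.peekLast();
    }

    int size() {
        return checkpoints.size();
    }

    // Drops everything held in memory.
    void shutdown() {
        checkpoints.clear();
    }
}
```

The deque keeps insertion order, so the oldest checkpoint is always at the head and the latest at the tail.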
ZooKeeperCompletedCheckpointStore
Code flow:
Submit job ==> JobManager[submitJob]
==> ExecutionGraphBuilder[buildGraph]
==> ZooKeeperCheckpointRecoveryFactory[createCheckpointStore]
==> ZooKeeperUtils[createCompletedCheckpoints]
==> new ZooKeeperCompletedCheckpointStore()
This store backs JobManager HA and relies on two storage mechanisms.
The first is distributed storage in ZooKeeper:
private final ZooKeeperStateHandleStore<CompletedCheckpoint> checkpointsInZooKeeper;
The second is storage in local JVM memory:
private final ArrayDeque<Tuple2<StateHandle<CompletedCheckpoint>, String>> checkpointStateHandles;
A note on ZooKeeperStateHandleStore: it writes the checkpoint state to a storage backend (such as a file system) and stores only the corresponding state handle in ZooKeeper. The reason is that ZooKeeper is built for data in the kilobyte range, whereas state can grow to megabytes.
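The indirection described here (large payload in a storage backend, small pointer in ZooKeeper) can be illustrated with two in-memory maps. This is only a sketch: both maps are stand-ins, and the class name, the `blob-N` naming scheme and the method names are assumptions made for the example rather than Flink's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of "state in a backend, handle in ZooKeeper" using two in-memory maps.
final class StateHandleStoreSketch {
    // Stand-in for a file system / DFS that holds the large serialized state.
    private final Map<String, byte[]> blobStorage = new HashMap<>();
    // Stand-in for ZooKeeper: znode path -> small pointer to the blob.
    private final Map<String, String> znodes = new HashMap<>();

    void add(String znodePath, byte[] serializedState) {
        String blobLocation = "blob-" + blobStorage.size(); // hypothetical naming scheme
        blobStorage.put(blobLocation, serializedState);     // 1. write the payload first
        znodes.put(znodePath, blobLocation);                 // 2. then publish the pointer
    }

    // Follows the pointer stored under the znode to the actual state.
    byte[] get(String znodePath) {
        String blobLocation = znodes.get(znodePath);
        return blobLocation == null ? null : blobStorage.get(blobLocation);
    }

    // Size of what the "ZooKeeper" side has to hold for this entry.
    int znodePayloadSize(String znodePath) {
        return znodes.get(znodePath).getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Writing the payload before publishing the pointer means a reader never sees a pointer to state that does not exist yet.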
The recover method
@Override
public void recover() throws Exception {
    // Fetch all CompletedCheckpoint handles together with their ZooKeeper node paths
    List<Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String>> initialCheckpoints;
    // Retry in a loop to get past failures caused by concurrent modification
    while (true) {
        try {
            initialCheckpoints = checkpointsInZooKeeper.getAllSortedByNameAndLock();
            break;
        }
        catch (ConcurrentModificationException e) {
            LOG.warn("Concurrent modification while reading from ZooKeeper. Retrying.");
        }
    }
    int numberOfInitialCheckpoints = initialCheckpoints.size();
    // Try to read the completed checkpoints from storage until either
    // 1. all of them have been read, or
    // 2. a stable result is reached (two consecutive attempts retrieve the same checkpoints).
    // This guards against restoring from an unreliable read. If checkpoint data is temporarily
    // unavailable or incomplete while we read, one attempt may return only part of the checkpoints;
    // retrieving the same set twice in a row indicates that the result is stable.
    //
    // The same strategy matters for incremental checkpoints, because a transient
    // storage outage can make a checkpoint's shared state unreadable
    List<CompletedCheckpoint> lastTryRetrievedCheckpoints = new ArrayList<>(numberOfInitialCheckpoints);
    List<CompletedCheckpoint> retrievedCheckpoints = new ArrayList<>(numberOfInitialCheckpoints);
    do {
        lastTryRetrievedCheckpoints.clear();
        lastTryRetrievedCheckpoints.addAll(retrievedCheckpoints);
        retrievedCheckpoints.clear();
        for (Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String> checkpointStateHandle : initialCheckpoints) {
            CompletedCheckpoint completedCheckpoint = null;
            try {
                completedCheckpoint = retrieveCompletedCheckpoint(checkpointStateHandle);
                if (completedCheckpoint != null) {
                    retrievedCheckpoints.add(completedCheckpoint);
                }
            } catch (Exception e) {
                // Log...
            }
        }
    } while (retrievedCheckpoints.size() != numberOfInitialCheckpoints &&
        !CompletedCheckpoint.checkpointsMatch(lastTryRetrievedCheckpoints, retrievedCheckpoints));
    // Clear the local (in-memory) checkpoints first to avoid duplicates
    completedCheckpoints.clear();
    // completedCheckpoints now mirrors the checkpoints that could be retrieved through ZooKeeper
    completedCheckpoints.addAll(retrievedCheckpoints);
    // Sanity checks and warnings
    if (completedCheckpoints.isEmpty() && numberOfInitialCheckpoints > 0) {
        throw new FlinkException(
            "Could not read any of the " + numberOfInitialCheckpoints + " checkpoints from storage.");
    } else if (completedCheckpoints.size() != numberOfInitialCheckpoints) {
        LOG.warn(
            "Could only fetch {} of {} checkpoints from storage.",
            completedCheckpoints.size(),
            numberOfInitialCheckpoints);
    }
}
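The "read until complete or stable" loop in recover can be isolated into a small generic helper. The sketch below is a simplified reconstruction under stated assumptions: the class and method names are invented, handles and checkpoints are arbitrary types, a plain `List.equals` stands in for `checkpointsMatch`, and failed reads are silently skipped instead of logged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the retry strategy: stop when everything loaded or two passes agree.
final class StableReadSketch {
    static <H, C> List<C> retrieveStable(List<H> handles, Function<H, C> loader) {
        List<C> lastTry = new ArrayList<>();
        List<C> current = new ArrayList<>();
        do {
            // Remember the previous pass, then start a fresh one.
            lastTry.clear();
            lastTry.addAll(current);
            current.clear();
            for (H handle : handles) {
                try {
                    C loaded = loader.apply(handle);
                    if (loaded != null) {
                        current.add(loaded);
                    }
                } catch (RuntimeException e) {
                    // A failed read is skipped; the next pass may succeed.
                }
            }
        } while (current.size() != handles.size() && !current.equals(lastTry));
        return current;
    }
}
```

With a loader that fails once for a single handle, the first pass is incomplete and differs from the (empty) previous pass, so a second pass runs and returns everything. With a loader that always fails for that handle, the second pass equals the first and the loop stops with the partial result.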
The addCheckpoint method
Synchronously adds the new checkpoint to ZooKeeper and, based on the configured maxNumberOfCheckpointsToRetain, asynchronously removes the oldest checkpoints.
@Override
public void addCheckpoint(final CompletedCheckpoint checkpoint) throws Exception {
    // Derive the ZooKeeper node path from the checkpoint ID
    final String path = checkpointIdToPath(checkpoint.getCheckpointID());
    // Add the new checkpoint to ZooKeeper; if this fails, existing entries are left untouched
    // Note: the actual state lives in a backend such as a file system; ZooKeeper only holds the handle
    checkpointsInZooKeeper.addAndLock(path, checkpoint);
    // Mirror the addition in local JVM heap memory to keep both views consistent
    completedCheckpoints.addLast(checkpoint);
    // Both writes succeeded, so the oldest entries can be removed if the limit is exceeded
    while (completedCheckpoints.size() > maxNumberOfCheckpointsToRetain) {
        try {
            removeSubsumed(completedCheckpoints.removeFirst());
        } catch (Exception e) {
            LOG.warn("Failed to subsume the old checkpoint", e);
        }
    }
}
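Since recover reads the nodes sorted by name, the mapping from checkpoint ID to node path has to preserve numeric order under string comparison. One way to get that property is zero-padding, sketched below. Treat the exact format (19 digits, leading slash) as an assumption of this example; the class and the reverse helper are hypothetical.

```java
import java.util.Locale;

// Sketch of an order-preserving mapping between checkpoint IDs and node names.
final class CheckpointPathSketch {
    // Pads to 19 digits (the width of Long.MAX_VALUE) so that
    // lexicographic order of the paths equals numeric order of the IDs.
    static String checkpointIdToPath(long checkpointId) {
        return String.format(Locale.ROOT, "/%019d", checkpointId);
    }

    // Reverse mapping: strips the leading slash and parses the digits.
    static long pathToCheckpointId(String path) {
        String digits = path.startsWith("/") ? path.substring(1) : path;
        return Long.parseLong(digits);
    }
}
```

Without padding, "/10" would sort before "/9", and the oldest-first ordering that retention relies on would break.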
The getLatestCheckpoint method
Returns the latest checkpoint straight from local memory, since the local ArrayDeque is kept consistent with the contents of ZooKeeperStateHandleStore.
public CompletedCheckpoint getLatestCheckpoint() {
    if (completedCheckpoints.isEmpty()) {
        return null;
    } else {
        return completedCheckpoints.peekLast();
    }
}
The shutdown method
The JobStatus argument decides whether the checkpoint state is retained.
public void shutdown(JobStatus jobStatus) throws Exception {
    if (jobStatus.isGloballyTerminalState()) {
        // The job has reached a globally terminal state: it is done and cannot fail again,
        // and no standby master will restart or recover it.
        // All data kept for HA (handles and state) can therefore be dropped
        for (CompletedCheckpoint checkpoint : completedCheckpoints) {
            try {
                removeShutdown(checkpoint, jobStatus);
            } catch (Exception e) {
                LOG.error("Failed to discard checkpoint.", e);
            }
        }
        completedCheckpoints.clear();
        String path = "/" + client.getNamespace();
        ZKPaths.deleteChildren(client.getZookeeperClient().getZooKeeper(), path, true);
    } else {
        // The job was suspended (SUSPENDED), so only the handles are released
        // Clear the local copy of the checkpoints
        completedCheckpoints.clear();
        // Release the checkpoint handles held in ZooKeeper (so they can be deleted there later if needed)
        checkpointsInZooKeeper.releaseAll();
    }
}
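The decision made in shutdown can be reduced to a small model: the local view is always dropped, while the durable data is only dropped when the job can never be recovered. The sketch below assumes a cut-down status enum and uses a set as a stand-in for everything stored in ZooKeeper and the storage backend; all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.TreeSet;

// Sketch of the shutdown decision: what is cleared depends on the job status.
final class ShutdownSketch {
    // Cut-down status enum; only the terminal/non-terminal distinction matters here.
    enum Status {
        RUNNING(false), SUSPENDED(false), FINISHED(true), CANCELED(true), FAILED(true);

        private final boolean globallyTerminal;

        Status(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    private final ArrayDeque<Long> localCheckpoints = new ArrayDeque<>();
    // Stand-in for handles in ZooKeeper plus state in the storage backend.
    private final TreeSet<Long> durableCheckpoints = new TreeSet<>();

    void addCheckpoint(long checkpointId) {
        localCheckpoints.addLast(checkpointId);
        durableCheckpoints.add(checkpointId);
    }

    void shutdown(Status status) {
        if (status.isGloballyTerminalState()) {
            // The job will never be recovered, so the durable data can go too.
            durableCheckpoints.clear();
        }
        // The in-memory view is dropped in both cases.
        localCheckpoints.clear();
    }

    int localSize() {
        return localCheckpoints.size();
    }

    int durableSize() {
        return durableCheckpoints.size();
    }
}
```

After a shutdown with SUSPENDED, another JobManager can still find the durable checkpoints and resume from them.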
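Checkpoint ID counter
Every checkpoint has its own ID, of type long.
Flink currently provides the CheckpointIDCounter interface and two counter implementations:
- StandaloneCheckpointIDCounter
- ZooKeeperCheckpointIDCounter
The counter does nothing more than provide a counting service. CheckpointIDCounter is defined as follows: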
public interface CheckpointIDCounter {
    void start() throws Exception;
    void shutdown(JobStatus jobStatus) throws Exception;
    long getAndIncrement() throws Exception;
    void setCount(long newId) throws Exception;
}
The rest of this section focuses on ZooKeeperCheckpointIDCounter, which implements a distributed atomic counter.
ZooKeeperCheckpointIDCounter
Code flow:
Submit job ==> JobManager[submitJob]
==> ExecutionGraphBuilder[buildGraph]
==> ZooKeeperCheckpointRecoveryFactory[createCheckpointIDCounter]
==> ZooKeeperUtils[createCheckpointIDCounter]
==> new ZooKeeperCheckpointIDCounter()
Like ZooKeeperCompletedCheckpointStore, ZooKeeperCheckpointIDCounter is created when the JobManager handles a job submission and builds the execution graph:
completedCheckpoints = recoveryFactory.createCheckpointStore(jobId, maxNumberOfCheckpointsToRetain, classLoader);
checkpointIdCounter = recoveryFactory.createCheckpointIDCounter(jobId);
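ZooKeeperCheckpointIDCounter is a distributed atomic counter built on ZooKeeper.
Concretely, each counter creates a ZNode in ZooKeeper of the form:
/flink/checkpoint-counter/<job-id> 1 [persistent]
....
/flink/checkpoint-counter/<job-id> N [persistent]
Checkpoint IDs in ZooKeeper are required to be ascending. This gives us a counter that is shared across JobManager instances and survives a JobManager failure.
The ZooKeeper client used here is CuratorFramework, and the distributed shared counter is built on Curator's SharedCount recipe.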
getAndIncrement
Let's look at how ZooKeeperCheckpointIDCounter implements the increment. getAndIncrement uses a retry loop:
public long getAndIncrement() throws Exception {
    while (true) {
        // a non-null state means the listener has recorded a connection problem
        ConnectionState connState = connStateListener.getLastState();
        if (connState != null) {
            throw new IllegalStateException("Connection state: " + connState);
        }
        VersionedValue<Integer> current = sharedCount.getVersionedValue();
        int newCount = current.getValue() + 1;
        if (newCount < 0) {
            // Integer overflow: this shows that the largest checkpoint ID supported here is Integer.MAX_VALUE
            throw new Exception("Checkpoint counter overflow. ZooKeeper checkpoint counter only supports " +
                "checkpoints Ids up to " + Integer.MAX_VALUE);
        }
        // Keep retrying inside the while loop until the versioned update succeeds
        if (sharedCount.trySetCount(current, newCount)) {
            return current.getValue();
        }
    }
}
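The retry loop above is an optimistic compare-and-set pattern. The same idea can be tried out locally with `java.util.concurrent.atomic.AtomicInteger`, which here stands in for the shared ZooKeeper counter. This is an analogy, not the Curator API: the class name and the starting value of 1 are assumptions for this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Local analogy of the optimistic retry loop; AtomicInteger replaces the shared counter.
final class CasCounterSketch {
    private final AtomicInteger shared = new AtomicInteger(1); // starting value is an assumption

    long getAndIncrement() {
        while (true) {
            int current = shared.get();
            int next = current + 1;
            if (next < 0) {
                // int overflow wrapped around to a negative value
                throw new IllegalStateException("Checkpoint counter overflow");
            }
            // Succeeds only if nobody changed the value since we read it;
            // otherwise loop and retry with the fresh value.
            if (shared.compareAndSet(current, next)) {
                return current;
            }
        }
    }

    void setCount(int newCount) {
        shared.set(newCount);
    }
}
```

Because a failed compare-and-set simply retries with the latest value, concurrent callers never receive the same ID twice, which is the property the checkpoint ID counter needs.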
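Other methods
The start, shutdown and setCount methods are not covered here. They mainly operate on Curator's SharedCount recipe and create or delete the ZooKeeper node accordingly.
The JobManager's checkpoint recovery service
Flink provides a factory interface for creating the checkpoint recovery service, CheckpointRecoveryFactory, along with two implementations that correspond to the two recovery modes: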
- StandaloneCheckpointRecoveryFactory
- ZooKeeperCheckpointRecoveryFactory
CheckpointRecoveryFactory is defined as follows:
public interface CheckpointRecoveryFactory {
    /**
     * Creates the store for completed checkpoints.
     */
    CompletedCheckpointStore createCheckpointStore(JobID jobId, int maxNumberOfCheckpointsToRetain, ClassLoader userClassLoader)
        throws Exception;
    /**
     * Creates the checkpoint ID counter.
     */
    CheckpointIDCounter createCheckpointIDCounter(JobID jobId) throws Exception;
}
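As the interface shows, the checkpoint recovery service is simply the combination of the two features described above: the completed checkpoint store and the checkpoint ID counter.
As the code flows for ZooKeeperCompletedCheckpointStore and ZooKeeperCheckpointIDCounter showed, the checkpoint recovery service is created when the JobManager handles a job submission and builds the execution graph.
Summary
This article analyzed the role ZooKeeper plays in Flink's fault tolerance mechanism.
Because ZooKeeper's main purpose in Flink is to provide JobManager high availability, parts of the discussion inevitably touch on that topic as well.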