Apache Flink Fault Tolerance Source Code Analysis, Part 3

How Flink Fault Tolerance Uses ZooKeeper

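Two Recovery Modes

The JobManager's HighAvailabilityMode (high availability mode) is an enum with two values:

  • NONE: no HA (high availability)
  • ZOOKEEPER: JobManager high availability implemented with ZooKeeper

NONE means a failed JobManager is not recovered, while ZOOKEEPER means the JobManager achieves HA through ZooKeeper.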

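Two Types of Checkpoints

As mentioned in the earlier articles of this series, Flink has two kinds of checkpoints:

  • PendingCheckpoint (a checkpoint in progress)
  • CompletedCheckpoint (a finished checkpoint)

A PendingCheckpoint is a checkpoint that has been created but has not yet been acknowledged by every task that must acknowledge it. Once all of those tasks have acknowledged it, it is converted into a CompletedCheckpoint.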

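Code path:

Actor message received  ==>  JobManager[handleCheckpointMessage]
                        ==>  CheckpointCoordinator[receiveAcknowledgeMessage]
                        ==>  CheckpointCoordinator[completePendingCheckpoint]
                        ==>  PendingCheckpoint[finalizeCheckpoint]
                        ==>  new CompletedCheckpoint()

After receiving the message, CheckpointCoordinator[receiveAcknowledgeMessage] checks whether the checkpoint has been acknowledged by all tasks. Only when it has does the coordinator call completePendingCheckpoint, which builds a CompletedCheckpoint instance from the fields of the pending checkpoint and returns the new instance.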

if (checkpoint.isFullyAcknowledged()) {
	completePendingCheckpoint(checkpoint);
}
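
The acknowledgement bookkeeping behind this check can be illustrated with a small standalone sketch. It is not Flink's code: the class names AckSketch, Pending and Completed, the use of plain strings for task IDs and state, and the boolean return value of acknowledge are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical stand-ins; these are not Flink's actual classes.
final class AckSketch {

    static final class Completed {
        final long checkpointId;
        final Map<String, String> taskStates;

        Completed(long checkpointId, Map<String, String> taskStates) {
            this.checkpointId = checkpointId;
            // The completed checkpoint keeps its own map of the collected state.
            this.taskStates = new HashMap<>(taskStates);
        }
    }

    static final class Pending {
        private final long checkpointId;
        private final Set<String> notYetAcknowledged;
        private final Map<String, String> taskStates = new HashMap<>();

        Pending(long checkpointId, Set<String> tasksToAcknowledge) {
            this.checkpointId = checkpointId;
            this.notYetAcknowledged = new HashSet<>(tasksToAcknowledge);
        }

        // Records one acknowledgement; returns false for unknown or duplicate tasks.
        boolean acknowledge(String taskId, String state) {
            if (!notYetAcknowledged.remove(taskId)) {
                return false;
            }
            taskStates.put(taskId, state);
            return true;
        }

        boolean isFullyAcknowledged() {
            return notYetAcknowledged.isEmpty();
        }

        // Converts the pending checkpoint into a completed one.
        Completed finalizeCheckpoint() {
            if (!isFullyAcknowledged()) {
                throw new IllegalStateException("Pending checkpoint has not been fully acknowledged yet");
            }
            return new Completed(checkpointId, taskStates);
        }
    }

    public static void main(String[] args) {
        Pending pending = new Pending(1L, Set.of("task-a", "task-b"));
        pending.acknowledge("task-a", "state-a");
        System.out.println("fully acknowledged: " + pending.isFullyAcknowledged());
        pending.acknowledge("task-b", "state-b");
        if (pending.isFullyAcknowledged()) {
            Completed completed = pending.finalizeCheckpoint();
            System.out.println("completed checkpoint " + completed.checkpointId);
        }
    }
}
```

Tracking the set of tasks that have not yet answered keeps the "fully acknowledged" test a simple emptiness check, and duplicate acknowledgements are rejected as a side effect.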

PendingCheckpoint converts itself into a completed checkpoint through its finalizeCheckpoint instance method.

Its core implementation is as follows:

public CompletedCheckpoint finalizeCheckpoint() throws IOException {

	synchronized (lock) {

		// make sure we fulfill the promise with an exception if something fails
		try {
			// write out the checkpoint metadata
			final Savepoint savepoint = new SavepointV2(checkpointId, operatorStates.values(), masterState);
			final CompletedCheckpointStorageLocation finalizedLocation;

			try (CheckpointMetadataOutputStream out = targetLocation.createMetadataOutputStream()) {
				Checkpoints.storeCheckpointMetadata(savepoint, out);
				finalizedLocation = out.closeAndFinalizeCheckpoint();
			}

			CompletedCheckpoint completed = new CompletedCheckpoint(
					jobId,
					checkpointId,
					checkpointTimestamp,
					System.currentTimeMillis(),
					operatorStates,
					masterState,
					props,
					finalizedLocation);

			onCompletionPromise.complete(completed);

			// ...

			// mark this PendingCheckpoint as disposed, but do not release the state (the CompletedCheckpoint now holds it)
			dispose(false);

			return completed;
		}
		catch (Throwable t) {
			// ...
		}
	}
}

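Note: before returning the CompletedCheckpoint instance, the method calls dispose to release resources. The argument is false, meaning the task state is not released (the state objects referenced by operatorStates are kept).

That is because the contents of operatorStates are copied into the CompletedCheckpoint instance when it is constructed, so the completed checkpoint still refers to that state.

The final release of this task state is done by CompletedCheckpoint's discard method. For comparison, dispose releases the state only when releaseState is true: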

if (!discarded && releaseState) {
	executor.execute(new Runnable() {
		@Override
		public void run() {
			try {
				StateUtil.bestEffortDiscardAllStateObjects(operatorStates.values());
				targetLocation.disposeOnFailure();
			} catch (Throwable t) {
				// …
			} finally {
				operatorStates.clear();
			}
		}
	});
}

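Storing Completed Checkpoints

To store completed checkpoints, Flink provides the CompletedCheckpointStore interface and two implementations:

  • StandaloneCompletedCheckpointStore: keeps checkpoints in an ArrayDeque on the JVM heap.
  • ZooKeeperCompletedCheckpointStore: more involved; combines distributed storage through the ZooKeeper-based ZooKeeperStateHandleStore with an ArrayDeque in JVM memory.

The CompletedCheckpointStore interface defines four key methods:

  • recover: recovers the accessible CompletedCheckpoint instances
  • addCheckpoint: adds a completed checkpoint to the set of checkpoints
  • getLatestCheckpoint: returns the most recent checkpoint
  • shutdown(JobStatus jobStatus): shuts down the store; the job status is passed in and used to decide whether the state should actually be discarded or kept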
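
To make the contract concrete, here is a minimal in-memory sketch in the spirit of the standalone store. It is an illustration under assumptions, not Flink's implementation: the class name InMemoryCheckpointStoreSketch is made up, checkpoints are reduced to their IDs, and recover and shutdown are left out.

```java
import java.util.ArrayDeque;

// Hypothetical, simplified store: checkpoints are represented by their IDs only.
final class InMemoryCheckpointStoreSketch {
    private final int maxNumberOfCheckpointsToRetain;
    private final ArrayDeque<Long> checkpoints = new ArrayDeque<>();

    InMemoryCheckpointStoreSketch(int maxNumberOfCheckpointsToRetain) {
        if (maxNumberOfCheckpointsToRetain < 1) {
            throw new IllegalArgumentException("Must retain at least one checkpoint");
        }
        this.maxNumberOfCheckpointsToRetain = maxNumberOfCheckpointsToRetain;
    }

    // Appends the newest checkpoint and drops the oldest ones beyond the limit.
    void addCheckpoint(long checkpointId) {
        checkpoints.addLast(checkpointId);
        while (checkpoints.size() > maxNumberOfCheckpointsToRetain) {
            checkpoints.removeFirst();
        }
    }

    // Returns the newest checkpoint ID, or null if the store is empty.
    Long getLatestCheckpoint() {
        return checkpoints.peekLast();
    }

    int getNumberOfRetainedCheckpoints() {
        return checkpoints.size();
    }

    public static void main(String[] args) {
        InMemoryCheckpointStoreSketch store = new InMemoryCheckpointStoreSketch(2);
        store.addCheckpoint(1L);
        store.addCheckpoint(2L);
        store.addCheckpoint(3L);
        System.out.println("latest: " + store.getLatestCheckpoint());
        System.out.println("retained: " + store.getNumberOfRetainedCheckpoints());
    }
}
```

A deque fits this access pattern because new checkpoints are appended at one end while expired ones are dropped from the other.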

ZooKeeperCompletedCheckpointStore

Code path:

Job submitted  ==>  JobManager[submitJob]
               ==>  ExecutionGraphBuilder[buildGraph]
               ==>  ZooKeeperCheckpointRecoveryFactory[createCheckpointStore]
               ==>  ZooKeeperUtils[createCompletedCheckpoints]
               ==>  new ZooKeeperCompletedCheckpointStore()

This class is the checkpoint store used for JobManager HA. Its implementation relies on two storage mechanisms.

The first is distributed storage in ZooKeeper:

private final ZooKeeperStateHandleStore<CompletedCheckpoint> checkpointsInZooKeeper;

The second is storage in local JVM memory:

private final ArrayDeque<Tuple2<StateHandle<CompletedCheckpoint>, String>> checkpointStateHandles;

ZooKeeperStateHandleStore deserves a closer look. It writes the checkpoint state to a backend (such as a file system) and stores only the handle to that state in ZooKeeper, because ZooKeeper is built for data in the KB range while state can grow to the MB range.
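
The indirection can be sketched with two maps, one standing in for the file system and one for ZooKeeper. Everything in this sketch is an assumption made for illustration: the class name StateHandleStoreSketch, the /checkpoints/state-N file naming, and the use of a plain path string as the handle.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: large state goes to "file storage", only a small pointer goes to "ZooKeeper".
final class StateHandleStoreSketch {
    private final Map<String, byte[]> fileSystem = new HashMap<>(); // stands in for a file system
    private final Map<String, String> zooKeeper = new HashMap<>();  // stands in for ZK nodes
    private int nextFileId = 0;

    // Writes the state to file storage and records only its path under the given node.
    void add(String znodePath, byte[] state) {
        String filePath = "/checkpoints/state-" + nextFileId++; // made-up naming scheme
        fileSystem.put(filePath, state.clone());
        zooKeeper.put(znodePath, filePath);
    }

    // Follows the pointer stored under the node and loads the actual state.
    byte[] get(String znodePath) {
        String filePath = zooKeeper.get(znodePath);
        if (filePath == null) {
            throw new IllegalArgumentException("No handle stored under " + znodePath);
        }
        return fileSystem.get(filePath).clone();
    }

    // Size of what the coordination service has to hold for this node.
    int bytesStoredInZooKeeper(String znodePath) {
        return zooKeeper.get(znodePath).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        StateHandleStoreSketch store = new StateHandleStoreSketch();
        store.add("/checkpoints/job-1/0000000001", new byte[1024 * 1024]);
        System.out.println("state bytes: " + store.get("/checkpoints/job-1/0000000001").length);
        System.out.println("bytes in ZK: " + store.bytesStoredInZooKeeper("/checkpoints/job-1/0000000001"));
    }
}
```

However large the state grows, the entry kept in the coordination service stays a few dozen bytes.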

The recover method

@Override
public void recover() throws Exception {
	// Get all CompletedCheckpoint handles together with the paths of their ZK nodes
	List<Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String>> initialCheckpoints;
	// Retry in a loop so that a concurrent modification does not fail the recovery
	while (true) {
		try {
			initialCheckpoints = checkpointsInZooKeeper.getAllSortedByNameAndLock();
			break;
		}
		catch (ConcurrentModificationException e) {
			LOG.warn("Concurrent modification while reading from ZooKeeper. Retrying.");
		}
	}

	int numberOfInitialCheckpoints = initialCheckpoints.size();

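	// Try to read the completed checkpoints from storage until either of two conditions holds:
	// 1. all state handles were read successfully, or
	// 2. a stable result is reached, i.e. two consecutive attempts returned the same set of checkpoints.
	// Retrying this way tolerates transient storage failures: one failed read does not immediately
	// drop a checkpoint, and two identical results in a row indicate that retrying will not help.
	//
	// This matters in particular for incremental checkpoints, where shared state
	// may be temporarily unreadable because of a brief storage outage.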
	List<CompletedCheckpoint> lastTryRetrievedCheckpoints = new ArrayList<>(numberOfInitialCheckpoints);
	List<CompletedCheckpoint> retrievedCheckpoints = new ArrayList<>(numberOfInitialCheckpoints);
	do {
		lastTryRetrievedCheckpoints.clear();
		lastTryRetrievedCheckpoints.addAll(retrievedCheckpoints);

		retrievedCheckpoints.clear();

		for (Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String> checkpointStateHandle : initialCheckpoints) {

			CompletedCheckpoint completedCheckpoint = null;

			try {
				completedCheckpoint = retrieveCompletedCheckpoint(checkpointStateHandle);
				if (completedCheckpoint != null) {
					retrievedCheckpoints.add(completedCheckpoint);
				}
			} catch (Exception e) {
				// Log...
			}
		}

	} while (retrievedCheckpoints.size() != numberOfInitialCheckpoints &&
		!CompletedCheckpoint.checkpointsMatch(lastTryRetrievedCheckpoints, retrievedCheckpoints));

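	// Clear the checkpoints held locally (in JVM memory) first, mainly to avoid duplicates
	completedCheckpoints.clear();
	// From here on, completedCheckpoints mirrors the checkpoints that were read back via ZooKeeper
	completedCheckpoints.addAll(retrievedCheckpoints);

	// Sanity checks and warnings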
	if (completedCheckpoints.isEmpty() && numberOfInitialCheckpoints > 0) {
		throw new FlinkException(
			"Could not read any of the " + numberOfInitialCheckpoints + " checkpoints from storage.");
	} else if (completedCheckpoints.size() != numberOfInitialCheckpoints) {
		LOG.warn(
			"Could only fetch {} of {} checkpoints from storage.",
			completedCheckpoints.size(),
			numberOfInitialCheckpoints);
	}
}
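
The retry loop in recover can be isolated into a small generic helper. The sketch below mirrors the loop condition shown above but is otherwise an assumption: the name readUntilCompleteOrStable is made up, a reader function that returns null stands in for a failed retrieval, and list equality stands in for CompletedCheckpoint.checkpointsMatch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper illustrating "retry until everything was read or the result is stable".
final class StableReadSketch {

    // reader returns null when a handle cannot be read in the current attempt.
    static <H, C> List<C> readUntilCompleteOrStable(List<H> handles, Function<H, C> reader) {
        List<C> lastTry = new ArrayList<>();
        List<C> retrieved = new ArrayList<>();
        do {
            // Remember the previous attempt so the two results can be compared.
            lastTry.clear();
            lastTry.addAll(retrieved);
            retrieved.clear();
            for (H handle : handles) {
                C value = reader.apply(handle);
                if (value != null) {
                    retrieved.add(value);
                }
            }
            // Stop when all handles were read, or when two attempts in a row agree.
        } while (retrieved.size() != handles.size() && !retrieved.equals(lastTry));
        return retrieved;
    }

    public static void main(String[] args) {
        Map<String, Integer> attempts = new HashMap<>();
        // Handle "b" fails on its first read only, simulating a transient outage.
        Function<String, String> flakyReader = handle -> {
            int attempt = attempts.merge(handle, 1, Integer::sum);
            return (handle.equals("b") && attempt == 1) ? null : "cp-" + handle;
        };
        System.out.println(readUntilCompleteOrStable(List.of("a", "b", "c"), flakyReader));
    }
}
```

With a handle that never becomes readable, the second pass returns the same partial list as the first and the loop ends with that partial result, which corresponds to the warning branch in recover.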

The addCheckpoint method

The new checkpoint is added to ZooKeeper synchronously. Then, based on the configured maxNumberOfCheckpointsToRetain, the oldest checkpoints are removed asynchronously.

@Override
public void addCheckpoint(final CompletedCheckpoint checkpoint) throws Exception {
	// Derive the checkpoint's ZK node path from its checkpoint ID
	final String path = checkpointIdToPath(checkpoint.getCheckpointID());

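	// Add the new checkpoint to ZK; if this fails, the existing entries are left untouched.
	// Note: the actual state lives on a backend such as a file system; ZK only stores the handles to it.
	checkpointsInZooKeeper.addAndLock(path, checkpoint);
	// Add it to the local copy on the JVM heap as well, to keep both in sync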
	completedCheckpoints.addLast(checkpoint);

	// Both writes succeeded; remove the oldest checkpoints if the retention limit is exceeded
	while (completedCheckpoints.size() > maxNumberOfCheckpointsToRetain) {
		try {
			removeSubsumed(completedCheckpoints.removeFirst());
		} catch (Exception e) {
			LOG.warn("Failed to subsume the old checkpoint", e);
		}
	}
}

The getLatestCheckpoint method

The latest checkpoint is read directly from local memory, because the contents of the local ArrayDeque are kept consistent with those in ZooKeeperStateHandleStore.

public CompletedCheckpoint getLatestCheckpoint() {
	if (completedCheckpoints.isEmpty()) {
		return null;
	} else {
		return completedCheckpoints.peekLast();
	}
}

The shutdown method

The JobStatus argument decides whether the checkpoint state is kept.

public void shutdown(JobStatus jobStatus) throws Exception {
	if (jobStatus.isGloballyTerminalState()) {
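		// The job has reached a globally terminal state: it is done and cannot fail any more,
		// and no standby master will restart or recover it.
		// All data kept for HA (handles and state) can therefore be discarded.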
		for (CompletedCheckpoint checkpoint : completedCheckpoints) {
			try {
				removeShutdown(checkpoint, jobStatus);
			} catch (Exception e) {
				LOG.error("Failed to discard checkpoint.", e);
			}
		}

		completedCheckpoints.clear();

		String path = "/" + client.getNamespace();

		ZKPaths.deleteChildren(client.getZookeeperClient().getZooKeeper(), path, true);
	} else {
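		// The job is only suspended, so just release the handles
		// Clear the local copy of the checkpoint information
		completedCheckpoints.clear();
		// Release the locks on the checkpoint handles in ZK (so the nodes can be deleted there when needed)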
		checkpointsInZooKeeper.releaseAll();
	}
}
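
The decision made in shutdown can be reduced to a small sketch. The JobStatus enum below is a simplified stand-in that lists only a few of Flink's job states, and the Action enum and the onShutdown helper are hypothetical names introduced for illustration.

```java
// Hypothetical sketch of the shutdown decision; not Flink's actual classes.
final class ShutdownSketch {

    // Simplified stand-in for Flink's JobStatus, limited to a few states.
    enum JobStatus {
        RUNNING(false),
        SUSPENDED(false), // terminal only for this JobManager; another one may recover the job
        FINISHED(true),
        CANCELED(true),
        FAILED(true);

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    enum Action { DISCARD_STATE_AND_HANDLES, RELEASE_HANDLES_ONLY }

    // A job that can never run again needs no HA data; otherwise the data must survive.
    static Action onShutdown(JobStatus status) {
        return status.isGloballyTerminalState()
                ? Action.DISCARD_STATE_AND_HANDLES
                : Action.RELEASE_HANDLES_ONLY;
    }

    public static void main(String[] args) {
        for (JobStatus status : JobStatus.values()) {
            System.out.println(status + " -> " + onShutdown(status));
        }
    }
}
```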

Checkpoint ID Counter

Every checkpoint has its own ID, of type long.

Flink currently provides the CheckpointIDCounter interface and two counter implementations:

  • StandaloneCheckpointIDCounter
  • ZooKeeperCheckpointIDCounter

The counter only provides a counting service. CheckpointIDCounter is defined as follows:

public interface CheckpointIDCounter {

	void start() throws Exception;
	void shutdown(JobStatus jobStatus) throws Exception;
	long getAndIncrement() throws Exception;
	void setCount(long newId) throws Exception;
}

The rest of this section focuses on ZooKeeperCheckpointIDCounter, which implements a distributed atomic counter.

ZooKeeperCheckpointIDCounter

Code path:

Job submitted  ==>  JobManager[submitJob]
               ==>  ExecutionGraphBuilder[buildGraph]
               ==>  ZooKeeperCheckpointRecoveryFactory[createCheckpointIDCounter]
               ==>  ZooKeeperUtils[createCheckpointIDCounter]
               ==>  new ZooKeeperCheckpointIDCounter()

Like ZooKeeperCompletedCheckpointStore, ZooKeeperCheckpointIDCounter is created when a job is submitted to the JobManager and the execution graph is built:

completedCheckpoints = recoveryFactory.createCheckpointStore(jobId, maxNumberOfCheckpointsToRetain, classLoader);
checkpointIdCounter = recoveryFactory.createCheckpointIDCounter(jobId);

ZooKeeperCheckpointIDCounter is a distributed atomic counter built on ZooKeeper.

Concretely, each counter creates a ZNode in ZooKeeper, of the form:

/flink/checkpoint-counter/<job-id> 1 [persistent]
....
/flink/checkpoint-counter/<job-id> N [persistent]

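Checkpoint IDs in ZooKeeper are required to be ascending. Keeping the counter in ZooKeeper gives us a counter that is shared across JobManager instances and remains usable when a JobManager fails.

The ZooKeeper client used here is CuratorFramework, together with its SharedCount recipe, which serves as the distributed shared counter.

getAndIncrement

The interesting part is how ZooKeeperCheckpointIDCounter increments the counter. getAndIncrement uses a retry loop: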

public long getAndIncrement() throws Exception {
	while (true) {
		ConnectionState connState = connStateListener.getLastState();

		if (connState != null) {
			throw new IllegalStateException("Connection state: " + connState);
		}

		VersionedValue<Integer> current = sharedCount.getVersionedValue();
		int newCount = current.getValue() + 1;

		if (newCount < 0) {
			// Integer overflow. This shows that the largest checkpoint ID is Integer.MAX_VALUE
			throw new Exception("Checkpoint counter overflow. ZooKeeper checkpoint counter only supports " +
					"checkpoints Ids up to " + Integer.MAX_VALUE);
		}

		// trySetCount fails if the value was changed concurrently; the while loop then retries
		if (sharedCount.trySetCount(current, newCount)) {
			return current.getValue();
		}
	}
}
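
The same optimistic read, compute, compare-and-set pattern can be tried out without a ZooKeeper cluster. In the sketch below an AtomicInteger stands in for the shared counter and compareAndSet plays the role of trySetCount. This substitution is an assumption made for illustration, as are the class name CasCounterSketch and the countDistinctIds helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical local analogue of the retry loop; an AtomicInteger replaces the ZooKeeper-backed counter.
final class CasCounterSketch {
    private final AtomicInteger shared = new AtomicInteger();

    long getAndIncrement() {
        while (true) {
            int current = shared.get();
            int next = current + 1;
            if (next < 0) {
                throw new IllegalStateException("Checkpoint counter overflow");
            }
            // Succeeds only if nobody changed the value since it was read; otherwise retry.
            if (shared.compareAndSet(current, next)) {
                return current;
            }
        }
    }

    void setCount(int newCount) {
        shared.set(newCount);
    }

    // Runs several threads against one counter and counts the distinct IDs they obtained.
    static int countDistinctIds(int threads, int idsPerThread) {
        CasCounterSketch counter = new CasCounterSketch();
        Set<Long> ids = ConcurrentHashMap.newKeySet();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(() -> {
                for (int j = 0; j < idsPerThread; j++) {
                    ids.add(counter.getAndIncrement());
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return ids.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct ids: " + countDistinctIds(4, 1000));
    }
}
```

Because a failed compare-and-set simply causes another iteration, no two callers can ever receive the same ID, which is the property the checkpoint coordinator relies on.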

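Other methods

start, shutdown and setCount are not covered here. They mainly operate on the SharedCount recipe that ships with CuratorFramework and create or delete the corresponding ZK nodes.

The JobManager's Checkpoint Recovery Service

Flink provides a factory interface for creating the checkpoint recovery service, CheckpointRecoveryFactory, with two implementations that correspond to the two recovery modes: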

  • StandaloneCheckpointRecoveryFactory
  • ZooKeeperCheckpointRecoveryFactory

CheckpointRecoveryFactory is defined as follows:

public interface CheckpointRecoveryFactory {

	/**
	 * Creates the store for completed checkpoints
	 */
	CompletedCheckpointStore createCheckpointStore(JobID jobId, int maxNumberOfCheckpointsToRetain, ClassLoader userClassLoader)
			throws Exception;

	/**
	 * Creates the checkpoint ID counter
	 */
	CheckpointIDCounter createCheckpointIDCounter(JobID jobId) throws Exception;

}

As the interface shows, the checkpoint recovery service simply combines the two features described above: the completed checkpoint store and the checkpoint ID counter.
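
How the two pieces fit together can be shown with a simplified, non-HA factory. The sketch is illustrative only: the interfaces are cut-down stand-ins for Flink's, job IDs are plain strings, checkpoints are reduced to their IDs, and starting the counter at 1 is an assumption of this sketch.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, cut-down versions of the factory and the two services it creates.
final class RecoveryFactorySketch {

    interface IdCounter {
        long getAndIncrement();
    }

    interface CheckpointStore {
        void add(long checkpointId);
        Long latest(); // null if nothing has been stored yet
    }

    interface RecoveryFactory {
        CheckpointStore createCheckpointStore(String jobId, int maxNumberOfCheckpointsToRetain);
        IdCounter createCheckpointIDCounter(String jobId);
    }

    // In-memory variant, comparable in spirit to the standalone (non-HA) mode.
    static final class StandaloneFactory implements RecoveryFactory {
        @Override
        public CheckpointStore createCheckpointStore(String jobId, int maxNumberOfCheckpointsToRetain) {
            ArrayDeque<Long> checkpoints = new ArrayDeque<>();
            return new CheckpointStore() {
                @Override
                public void add(long checkpointId) {
                    checkpoints.addLast(checkpointId);
                    while (checkpoints.size() > maxNumberOfCheckpointsToRetain) {
                        checkpoints.removeFirst();
                    }
                }

                @Override
                public Long latest() {
                    return checkpoints.peekLast();
                }
            };
        }

        @Override
        public IdCounter createCheckpointIDCounter(String jobId) {
            AtomicLong next = new AtomicLong(1);
            return next::getAndIncrement;
        }
    }

    public static void main(String[] args) {
        RecoveryFactory factory = new StandaloneFactory();
        CheckpointStore store = factory.createCheckpointStore("job-1", 2);
        IdCounter counter = factory.createCheckpointIDCounter("job-1");
        for (int i = 0; i < 3; i++) {
            store.add(counter.getAndIncrement());
        }
        System.out.println("latest checkpoint: " + store.latest());
    }
}
```

Code that only talks to the factory interface does not need to know whether the store and the counter live in memory or in ZooKeeper.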

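As the code paths for ZooKeeperCompletedCheckpointStore and ZooKeeperCheckpointIDCounter showed, the checkpoint recovery service is created when a job is submitted to the JobManager and the execution graph is built.

Summary

This article analyzed the role ZooKeeper plays in Flink's fault tolerance mechanism.

Because ZooKeeper's main use in Flink is to implement JobManager high availability, parts of this article are inevitably tied to that topic as well.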
