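During the "state initialization" phase of beforeInvoke() on the reflectively instantiated StreamTask, the StateInitializationContext (which holds the xxxStateStore objects that can create xxxState) has already been handed to the StreamOperator's user-defined RichFunction, and AbstractStreamOperator "holds" the KeyedStateStore in a member field. The method that does this is shown here: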
/**
 * Performs "state initialization" for the StreamOperator: builds a context and determines which StateBackend to use.
 */
@Override
public final void initializeState() throws Exception {
    // Get the key serializer from the StreamConfig
    final TypeSerializer<?> keySerializer = config.getStateKeySerializer(getUserCodeClassloader());
    // The StreamTask that this StreamOperator belongs to
    final StreamTask<?, ?> containingTask =
        Preconditions.checkNotNull(getContainingTask());
    final CloseableRegistry streamTaskCloseableRegistry =
        Preconditions.checkNotNull(containingTask.getCancelables());
    /**
     * StreamTaskStateInitializer exists solely to create the StreamOperatorStateContext.
     * A StreamOperator must go through the StreamOperatorStateContext to initialize its StateBackend, InternalTimeServiceManager, and so on.
     */
    final StreamTaskStateInitializer streamTaskStateManager =
        Preconditions.checkNotNull(containingTask.createStreamTaskStateInitializer());
    /**
     * StreamOperatorStateContext stores the state backends for keyed state and operator state, plus the InternalTimeServiceManager that manages timer services.
     */
    final StreamOperatorStateContext context =
        // Create the AbstractKeyedStateBackend, the OperatorStateBackend and the InternalTimeServiceManager, and keep them in the context
        streamTaskStateManager.streamOperatorStateContext(
            getOperatorID(),
            getClass().getSimpleName(),
            getProcessingTimeService(),
            this,
            keySerializer,
            streamTaskCloseableRegistry,
            metrics);
    // Get the state backend for operator state
    this.operatorStateBackend = context.operatorStateBackend();
    // Get the state backend for keyed state
    this.keyedStateBackend = context.keyedStateBackend();
    if (keyedStateBackend != null) {
        // Wrap the KeyedStateBackend in a proxy: the KeyedStateStore
        this.keyedStateStore = new DefaultKeyedStateStore(keyedStateBackend, getExecutionConfig());
    }
    // Get the InternalTimeServiceManager (it manages the internal InternalTimerService instances) so that this StreamOperator can register and manage timers
    timeServiceManager = context.internalTimerServiceManager();
    // Input streams for reading back raw keyed state, one per key group
    CloseableIterable<KeyGroupStatePartitionStreamProvider> keyedStateInputs = context.rawKeyedStateInputs();
    // Input streams for reading back raw operator state
    CloseableIterable<StatePartitionStreamProvider> operatorStateInputs = context.rawOperatorStateInputs();
    try {
        /**
         * Hand the state stores (the proxies in front of the state backends) and the raw state inputs to the StateInitializationContext, which can then manage operator state, keyed state and raw state.
         * The StateInitializationContext is used by the StreamOperator and by user-defined functions to initialize state (getState, getMapState, ...) through the xxxStateStore it holds.
         */
        StateInitializationContext initializationContext = new StateInitializationContextImpl(
            context.isRestored(), // information whether we restore or start for the first time
            operatorStateBackend, // access to operator state backend
            keyedStateStore, // access to keyed state backend
            keyedStateInputs, // access to keyed state stream
            operatorStateInputs); // access to operator state stream
        // When AbstractUdfStreamOperator wraps a user-defined CheckpointedFunction, state is initialized through this method.
        // In essence: use the StateInitializationContext (which owns the xxxStateStore) to initialize the xxxState inside the StreamOperator.
        initializeState(initializationContext);
    } finally {
        closeFromRegistry(operatorStateInputs, streamTaskCloseableRegistry);
        closeFromRegistry(keyedStateInputs, streamTaskCloseableRegistry);
    }
}
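That is why a user-defined RichFunction can obtain an xxxState through getRuntimeContext().getState(). RuntimeContext declares the interface method getState(), and the abstract class AbstractRuntimeUDFContext implements the RuntimeContext interface. StreamingRuntimeContext, a concrete subclass of AbstractRuntimeUDFContext, provides the actual implementation of getState():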
/**
 * Obtains a ValueState through the RuntimeContext.
 */
@Override
public <T> ValueState<T> getState(ValueStateDescriptor<T> stateProperties) {
    // Validate the StateDescriptor, then get the wrapped KeyedStateStore
    KeyedStateStore keyedStateStore = checkPreconditionsAndGetKeyedStateStore(stateProperties);
    // Initialize the serializer inside the StateDescriptor
    stateProperties.initializeSerializerUnlessSet(getExecutionConfig());
    // Create or fetch the ValueState through the KeyedStateStore interface (default implementation: DefaultKeyedStateStore)
    return keyedStateStore.getState(stateProperties);
}
In other words, it fetches the KeyedStateStore that AbstractStreamOperator keeps in a member field:
/**
 * Validates the StateDescriptor and returns the wrapped KeyedStateStore (which can be used to initialize KeyedState).
 */
private KeyedStateStore checkPreconditionsAndGetKeyedStateStore(StateDescriptor<?, ?> stateDescriptor) {
    // Sanity-check the StateDescriptor
    Preconditions.checkNotNull(stateDescriptor, "The state properties must not be null");
    // Get the KeyedStateStore from the StreamOperator (AbstractStreamOperator has already created and stored it)
    KeyedStateStore keyedStateStore = operator.getKeyedStateStore();
    Preconditions.checkNotNull(keyedStateStore, "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation.");
    // KeyedStateStore is a proxy for the KeyedStateBackend; the RuntimeContext uses it to create xxxState
    return keyedStateStore;
}
The KeyedStateStore then makes sure the StateDescriptor's serializer is initialized and creates the xxxState:
/**
 * Initializes an xxxState through the KeyedStateStore.
 */
@Override
public <T> ValueState<T> getState(ValueStateDescriptor<T> stateProperties) {
    requireNonNull(stateProperties, "The state properties must not be null");
    try {
        // Make sure the serializer used by the State has been initialized
        stateProperties.initializeSerializerUnlessSet(executionConfig);
        // Create the KeyedState
        return getPartitionedState(stateProperties);
    } catch (Exception e) {
        throw new RuntimeException("Error while getting state", e);
    }
}
/**
 * Gets the KeyedState: this actually calls the KeyedStateBackend to initialize the xxxState (KeyedStateStore is only a proxy for the KeyedStateBackend).
 */
protected <S extends State> S getPartitionedState(StateDescriptor<S, ?> stateDescriptor) throws Exception {
    // The KeyedStateBackend interface (base implementation: AbstractKeyedStateBackend) fetches the KeyedState
    return keyedStateBackend.getPartitionedState(
        VoidNamespace.INSTANCE,
        VoidNamespaceSerializer.INSTANCE,
        stateDescriptor);
}
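The delegation just shown can be modelled in a few lines of plain Java. This is only a rough sketch under stated assumptions: `StoreProxySketch`, `Backend`, `Store`, and the string-based namespace are hypothetical stand-ins, not Flink classes. The point is that the store adds a null check and a fixed "void" namespace, then forwards to the backend.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical model of "store as proxy for backend"; not Flink code. */
class StoreProxySketch {

    /** Stand-in for the backend: owns one state object per (namespace, state name). */
    static class Backend {
        private final Map<String, StringBuilder> states = new HashMap<>();

        StringBuilder getPartitionedState(String namespace, String stateName) {
            return states.computeIfAbsent(namespace + "/" + stateName, k -> new StringBuilder());
        }
    }

    /** Stand-in for the store: validates the argument, fixes the namespace, delegates. */
    static class Store {
        static final String VOID_NAMESPACE = "void"; // assumed placeholder namespace
        private final Backend backend;

        Store(Backend backend) {
            this.backend = Objects.requireNonNull(backend);
        }

        StringBuilder getState(String stateName) {
            Objects.requireNonNull(stateName, "The state name must not be null");
            return backend.getPartitionedState(VOID_NAMESPACE, stateName);
        }
    }

    public static void main(String[] args) {
        Backend backend = new Backend();
        Store store = new Store(backend);
        // Both paths reach the same state object, because the store only delegates.
        System.out.println(store.getState("count") == backend.getPartitionedState("void", "count"));
    }
}
```

User code only ever sees the store, so it never has to deal with namespaces; operators that do need namespaces (windows, for example) talk to the backend directly.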
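Although the KeyedStateStore is what initializes xxxState at the upper layer, underneath it still calls the KeyedStateBackend to do the work; KeyedStateStore is only a proxy for the KeyedStateBackend.
The KeyedStateBackend interface declares the method for obtaining KeyedState, and the abstract class AbstractKeyedStateBackend implements it. The implementation is cache-based: every InternalKvState that has been initialized is cached, a later access that hits the cache returns it directly, and a miss creates a new InternalKvState.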
/**
 * Initializes a KeyedState through the KeyedStateBackend.
 * The overall design is cache-based.
 */
@SuppressWarnings("unchecked")
@Override
public <N, S extends State> S getPartitionedState(
        final N namespace,
        final TypeSerializer<N> namespaceSerializer,
        final StateDescriptor<S, ?> stateDescriptor) throws Exception {
    checkNotNull(namespace, "Namespace");
    // lastName caches the name of the most recently accessed StateDescriptor; on a hit, return the cached InternalKvState directly
    if (lastName != null && lastName.equals(stateDescriptor.getName())) {
        lastState.setCurrentNamespace(namespace);
        return (S) lastState;
    }
    // Look up the InternalKvState in the HashMap that maps "StateDescriptor name -> InternalKvState"
    InternalKvState<K, ?, ?> previous = keyValueStatesByName.get(stateDescriptor.getName());
    // On a hit, also remember the name and the state in the member fields as the "last accessed" cache
    if (previous != null) {
        lastState = previous;
        lastState.setCurrentNamespace(namespace);
        lastName = stateDescriptor.getName();
        return (S) previous;
    }
    // Core step: the HashMap has no entry, so create a new KeyedState and put it into the HashMap
    final S state = getOrCreateKeyedState(namespaceSerializer, stateDescriptor);
    // Cast the KeyedState to InternalKvState
    final InternalKvState<K, N, ?> kvState = (InternalKvState<K, N, ?>) state;
    // Update the last accessed StateDescriptor name and InternalKvState so the next access can be served from the cache
    lastName = stateDescriptor.getName();
    lastState = kvState;
    kvState.setCurrentNamespace(namespace);
    return state;
}
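The lookup order above (last-accessed fields first, then the name-keyed map, then creation) can be sketched in isolation. This is a simplified illustration, not Flink's class: `StateCacheSketch`, its `creations` counter, and the `Function`-based factory are assumptions made for the example, and namespaces are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical, simplified model of the two-level lookup; not Flink code. */
class StateCacheSketch {
    private final Map<String, Object> statesByName = new HashMap<>();
    private String lastName;   // name of the most recently accessed state
    private Object lastState;  // the state object that belongs to lastName
    int creations = 0;         // counts how often the factory had to run

    Object getState(String name, Function<String, Object> factory) {
        // Level 1: same name as the previous call, skip the map lookup entirely.
        if (name.equals(lastName)) {
            return lastState;
        }
        // Level 2: look the state up by name; create and register it on a miss.
        Object state = statesByName.get(name);
        if (state == null) {
            state = factory.apply(name);
            statesByName.put(name, state);
            creations++;
        }
        // Remember this access so the next call with the same name hits level 1.
        lastName = name;
        lastState = state;
        return state;
    }

    public static void main(String[] args) {
        StateCacheSketch cache = new StateCacheSketch();
        Object first = cache.getState("count", n -> new Object());
        Object second = cache.getState("count", n -> new Object()); // level-1 hit
        cache.getState("sum", n -> new Object());                   // miss, created
        Object third = cache.getState("count", n -> new Object());  // level-2 hit
        System.out.println(first == second && first == third);     // prints true
        System.out.println(cache.creations);                        // prints 2
    }
}
```

The single-entry fast path pays off when a function touches the same state for every record, which is the common case for an operator with one state descriptor.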
The core question is how the InternalKvState gets created: if a TTL is configured, TtlStateFactory creates an InternalKvState with expiration support; otherwise the KeyedStateBackend creates the InternalKvState itself.
/**
 * Returns the InternalKvState registered under the descriptor's name, creating it on first access
 * (wrapped with TTL support by TtlStateFactory if TTL is enabled).
 */
@Override
@SuppressWarnings("unchecked")
public <N, S extends State, V> S getOrCreateKeyedState(
        final TypeSerializer<N> namespaceSerializer,
        StateDescriptor<S, V> stateDescriptor) throws Exception {
    checkNotNull(namespaceSerializer, "Namespace serializer");
    checkNotNull(keySerializer, "State key serializer has not been configured in the config. " +
        "This operation cannot use partitioned state.");
    // Look the state up in the HashMap<String, InternalKvState<K, ?, ?>> once more.
    // Coming from getPartitionedState() this is always a miss, but this public method can also be called directly.
    InternalKvState<K, ?, ?> kvState = keyValueStatesByName.get(stateDescriptor.getName());
    if (kvState == null) {
        if (!stateDescriptor.isSerializerInitialized()) {
            stateDescriptor.initializeSerializerUnlessSet(executionConfig);
        }
        // Core step: let TtlStateFactory create the InternalKvState, wrapped with TTL only if TTL is enabled
        kvState = TtlStateFactory.createStateAndWrapWithTtlIfEnabled(
            namespaceSerializer, stateDescriptor, this, ttlTimeProvider);
        // Cache the InternalKvState in the HashMap so the next access is served from it
        keyValueStatesByName.put(stateDescriptor.getName(), kvState);
        publishQueryableStateIfEnabled(stateDescriptor, kvState);
    }
    // Return the InternalKvState
    return (S) kvState;
}
/**
 * Creates the InternalKvState and, if TTL is enabled, wraps it with expiration support.
 */
public static <K, N, SV, TTLSV, S extends State, IS extends S> IS createStateAndWrapWithTtlIfEnabled(
        TypeSerializer<N> namespaceSerializer,
        StateDescriptor<S, SV> stateDesc,
        KeyedStateBackend<K> stateBackend,
        TtlTimeProvider timeProvider) throws Exception {
    Preconditions.checkNotNull(namespaceSerializer);
    Preconditions.checkNotNull(stateDesc);
    Preconditions.checkNotNull(stateBackend);
    Preconditions.checkNotNull(timeProvider);
    // First check whether the StateDescriptor carries a TTL configuration
    return stateDesc.getTtlConfig().isEnabled() ?
        // TTL configured: create the InternalKvState through TtlStateFactory#createState()
        new TtlStateFactory<K, N, SV, TTLSV, S, IS>(
            namespaceSerializer, stateDesc, stateBackend, timeProvider)
            .createState() :
        // No TTL: create the KeyedState through the KeyedStateBackend interface (a capability inherited from its parent interface, KeyedStateFactory)
        stateBackend.createInternalState(namespaceSerializer, stateDesc);
}
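The "wrap only if TTL is enabled" decision can be illustrated with a small decorator. This is a minimal sketch under stated assumptions: `TtlWrapSketch`, `SimpleValueState`, `PlainState`, and `TtlState` are hypothetical types invented for the example, the clock is injected as a `LongSupplier`, and the expiry rule (value hidden once the TTL has elapsed since the last update) is a simplification rather than Flink's exact TTL semantics.

```java
import java.util.function.LongSupplier;

/** Hypothetical model of "create state, wrap with TTL if enabled"; not Flink code. */
class TtlWrapSketch {

    interface SimpleValueState<T> {
        T value();
        void update(T newValue);
    }

    /** State without expiration. */
    static class PlainState<T> implements SimpleValueState<T> {
        private T current;
        @Override public T value() { return current; }
        @Override public void update(T newValue) { current = newValue; }
    }

    /** Decorator that remembers the last update time and clears the value once it is too old. */
    static class TtlState<T> implements SimpleValueState<T> {
        private final SimpleValueState<T> delegate = new PlainState<>();
        private final long ttlMillis;
        private final LongSupplier clock;
        private long lastUpdate;

        TtlState(long ttlMillis, LongSupplier clock) {
            this.ttlMillis = ttlMillis;
            this.clock = clock;
        }

        @Override public T value() {
            if (clock.getAsLong() - lastUpdate >= ttlMillis) {
                delegate.update(null); // expired: drop the stored value
            }
            return delegate.value();
        }

        @Override public void update(T newValue) {
            lastUpdate = clock.getAsLong();
            delegate.update(newValue);
        }
    }

    /** Same shape as the ternary above: one branch wraps, the other creates the plain state. */
    static <T> SimpleValueState<T> createStateAndWrapWithTtlIfEnabled(
            boolean ttlEnabled, long ttlMillis, LongSupplier clock) {
        return ttlEnabled
            ? new TtlState<T>(ttlMillis, clock)
            : new PlainState<T>();
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock so the example is deterministic
        SimpleValueState<String> state = createStateAndWrapWithTtlIfEnabled(true, 10L, () -> now[0]);
        state.update("hello");
        now[0] = 5L;
        System.out.println(state.value()); // prints hello (still within the TTL)
        now[0] = 10L;
        System.out.println(state.value()); // prints null (TTL elapsed)
    }
}
```

Callers receive the same interface either way, so code that reads and updates the state does not need to know whether expiration is active.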
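Assume no TTL is configured, so the KeyedStateBackend is responsible for creating the KeyedState. Because MemoryStateBackend is the default in the Flink version discussed here, the KeyedState is created by HeapKeyedStateBackend. The KeyedStateBackend interface gets its ability to create KeyedState from its parent interface, KeyedStateFactory: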
/**
 * The KeyedStateFactory interface declares the methods for creating KeyedState.
 */
@Nonnull
default <N, SV, S extends State, IS extends S> IS createInternalState(
        @Nonnull TypeSerializer<N> namespaceSerializer,
        @Nonnull StateDescriptor<S, SV> stateDesc) throws Exception {
    // Delegate to the three-argument overload with a no-op snapshot transform
    return createInternalState(namespaceSerializer, stateDesc, StateSnapshotTransformFactory.noTransform());
}
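As an implementing class, HeapKeyedStateBackend provides the concrete logic for creating an InternalKvState. Note that the StateFactory for each type of State is registered in a static map when the HeapKeyedStateBackend class is initialized, before any instance exists; a StateFactory exists solely to create State.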
// When the HeapKeyedStateBackend class is initialized, the StateFactory for each kind of KeyedState is registered up front.
// Note: each method reference (the code that does the actual work) serves as an implementation of the StateFactory interface.
private static final Map<Class<? extends StateDescriptor>, StateFactory> STATE_FACTORIES =
    Stream.of(
        Tuple2.of(ValueStateDescriptor.class, (StateFactory) HeapValueState::create),
        Tuple2.of(ListStateDescriptor.class, (StateFactory) HeapListState::create),
        Tuple2.of(MapStateDescriptor.class, (StateFactory) HeapMapState::create),
        Tuple2.of(AggregatingStateDescriptor.class, (StateFactory) HeapAggregatingState::create),
        Tuple2.of(ReducingStateDescriptor.class, (StateFactory) HeapReducingState::create),
        Tuple2.of(FoldingStateDescriptor.class, (StateFactory) HeapFoldingState::create)
    ).collect(Collectors.toMap(t -> t.f0, t -> t.f1));
/**
 * KeyedStateFactory, the parent interface of KeyedStateBackend, declares the state-creation method; HeapKeyedStateBackend implements it.
 * Tip: a StateFactory exists solely to create State. Whichever kind of State is needed, the matching StateFactory is looked up.
 * Because each StateFactory is backed by a method reference, calling the interface method runs that concrete creation logic.
 */
@Override
@Nonnull
public <N, SV, SEV, S extends State, IS extends S> IS createInternalState(
        @Nonnull TypeSerializer<N> namespaceSerializer,
        @Nonnull StateDescriptor<S, SV> stateDesc,
        @Nonnull StateSnapshotTransformFactory<SEV> snapshotTransformFactory) throws Exception {
    // Look up the StateFactory in the map from "StateDescriptor class -> StateFactory", keyed by the descriptor's class
    StateFactory stateFactory = STATE_FACTORIES.get(stateDesc.getClass());
    if (stateFactory == null) {
        String message = String.format("State %s is not supported by %s",
            stateDesc.getClass(), this.getClass());
        throw new FlinkRuntimeException(message);
    }
    // Try to register a StateTable: the StateTable stores the state data (backed by StateMap collections)
    StateTable<K, N, SV> stateTable = tryRegisterStateTable(
        namespaceSerializer, stateDesc, getStateSnapshotTransformFactory(stateDesc, snapshotTransformFactory));
    // Lowest level: create the KeyedState through the StateFactory
    return stateFactory.createState(stateDesc, stateTable, getKeySerializer());
}
StateFactory is an inner interface of HeapKeyedStateBackend, dedicated to creating the various types of State:
/**
 * Dedicated to creating the different types of State.
 */
private interface StateFactory {
    <K, N, SV, S extends State, IS extends S> IS createState(
        StateDescriptor<S, SV> stateDesc,
        StateTable<K, N, SV> stateTable,
        TypeSerializer<K> keySerializer) throws Exception;
}
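Each type of State supplies its own logic for StateFactory#createState(): its static create() method is passed as a method reference and thereby becomes the StateFactory implementation. Calling the interface method StateFactory#createState() therefore runs the create() logic of the corresponding HeapXXXState.
Suppose we want a ValueState: the system ends up calling HeapValueState#create().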
/**
 * Creates a ValueState.
 */
@SuppressWarnings("unchecked")
static <K, N, SV, S extends State, IS extends S> IS create(
        StateDescriptor<S, SV> stateDesc,
        StateTable<K, N, SV> stateTable,
        TypeSerializer<K> keySerializer) {
    // Create the HeapValueState
    return (IS) new HeapValueState<>(
        // StateTable is the data structure that holds the state data; the default implementation is CopyOnWriteStateTable.
        // Mapping: "KeyGroup -> CopyOnWriteStateMap", where each CopyOnWriteStateMap is a hash table built from an array plus linked lists and is resized by incremental rehashing
        stateTable,
        keySerializer,
        stateTable.getStateSerializer(),
        stateTable.getNamespaceSerializer(),
        stateDesc.getDefaultValue());
}
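The registry pattern used here, a static map from descriptor class to a factory implemented by a static method reference, can be reproduced without Flink. This is only an illustrative sketch: every type in it (`FactoryRegistrySketch`, the `*Descriptor` and `*StateSketch` classes) is a hypothetical stand-in, and the real StateFactory also receives a StateTable and a key serializer, which are omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the descriptor-class-to-factory registry; not Flink code. */
class FactoryRegistrySketch {

    static class Descriptor {
        final String name;
        Descriptor(String name) { this.name = name; }
    }
    static class ValueDescriptor extends Descriptor { ValueDescriptor(String n) { super(n); } }
    static class ListDescriptor extends Descriptor { ListDescriptor(String n) { super(n); } }
    static class MapDescriptor extends Descriptor { MapDescriptor(String n) { super(n); } } // deliberately unregistered

    static class ValueStateSketch {
        final String name;
        private ValueStateSketch(String name) { this.name = name; }
        static Object create(Descriptor d) { return new ValueStateSketch(d.name); }
    }
    static class ListStateSketch {
        final String name;
        private ListStateSketch(String name) { this.name = name; }
        static Object create(Descriptor d) { return new ListStateSketch(d.name); }
    }

    /** Single abstract method, so a static method reference can implement it. */
    interface StateFactory {
        Object createState(Descriptor descriptor);
    }

    // Filled once, when the class is initialized; keyed by descriptor class, not by state name.
    private static final Map<Class<? extends Descriptor>, StateFactory> FACTORIES = new HashMap<>();
    static {
        FACTORIES.put(ValueDescriptor.class, ValueStateSketch::create);
        FACTORIES.put(ListDescriptor.class, ListStateSketch::create);
    }

    static Object createState(Descriptor descriptor) {
        StateFactory factory = FACTORIES.get(descriptor.getClass());
        if (factory == null) {
            throw new IllegalArgumentException(
                "State " + descriptor.getClass().getSimpleName() + " is not supported");
        }
        // Calling the interface method runs the create() method that was registered.
        return factory.createState(descriptor);
    }

    public static void main(String[] args) {
        Object state = createState(new ValueDescriptor("count"));
        System.out.println(state instanceof ValueStateSketch); // prints true
    }
}
```

Supporting a new state type in this design means adding one map entry and one create() method; the lookup code stays unchanged.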
So creating a KeyedState through the KeyedStateBackend comes down to looking up the StateFactory for that State type and running its creation logic (new HeapValueState).
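This article walks through how Flink's StreamTask, which is instantiated via reflection, performs state initialization in the beforeInvoke phase, including how the StateInitializationContext is created and passed on and how the StreamOperator holds and uses the KeyedStateStore. It also explains how a user-defined RichFunction obtains and initializes state through the RuntimeContext, with emphasis on the role of the KeyedStateBackend, its internal caching, and how an InternalKvState is created. Together these steps show the core mechanism of Flink state management.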