How KeyedState Is Created: A Source-Code Walkthrough

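This article walks through how Flink's StreamTask, invoked reflectively, initializes state during the beforeInvoke phase: how the StateInitializationContext is created and passed on, and how a StreamOperator holds and uses its KeyedStateStore. It also covers how a user-defined RichFunction obtains and initializes state through the RuntimeContext, the role the KeyedStateBackend plays on that path, its internal caching, and how an InternalKvState is created. Together these steps make up the core of Flink's state management.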
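During the "state initialization" step of beforeInvoke (which StreamTask reaches via reflection), the StateInitializationContext, which holds the xxxStateStore objects able to create each kind of xxxState, has already been handed to the user-defined RichFunction inside the StreamOperator, and AbstractStreamOperator already holds the KeyedStateStore in a member field: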

/**
 * Performs "state initialization" for the StreamOperator: supplies a context and determines which StateBackend is used
 */
@Override
public final void initializeState() throws Exception {

    // Get the key serializer from the StreamConfig (the operator's `config` field)
    final TypeSerializer<?> keySerializer = config.getStateKeySerializer(getUserCodeClassloader());

    // The StreamTask that contains this StreamOperator
    final StreamTask<?, ?> containingTask =
        Preconditions.checkNotNull(getContainingTask());
    final CloseableRegistry streamTaskCloseableRegistry =
        Preconditions.checkNotNull(containingTask.getCancelables());
    /**
     * StreamTaskStateInitializer exists to create the StreamOperatorStateContext.
     * A StreamOperator must go through the StreamOperatorStateContext to initialize its StateBackend, InternalTimeServiceManager, and so on.
     */
    final StreamTaskStateInitializer streamTaskStateManager =
        Preconditions.checkNotNull(containingTask.createStreamTaskStateInitializer());

    /**
     * StreamOperatorStateContext holds the keyed and operator state backends, plus the InternalTimeServiceManager that manages timer services
     */
    final StreamOperatorStateContext context =
        // Creates the AbstractKeyedStateBackend, the OperatorStateBackend, and the InternalTimeServiceManager, and keeps references to them
        streamTaskStateManager.streamOperatorStateContext(
        getOperatorID(),
        getClass().getSimpleName(),
        getProcessingTimeService(),
        this,
        keySerializer,
        streamTaskCloseableRegistry,
        metrics);

    // Get the operator state backend
    this.operatorStateBackend = context.operatorStateBackend();
    // Get the keyed state backend
    this.keyedStateBackend = context.keyedStateBackend();

    if (keyedStateBackend != null) {
        // Wrap the KeyedStateBackend in its proxy, the KeyedStateStore
        this.keyedStateStore = new DefaultKeyedStateStore(keyedStateBackend, getExecutionConfig());
    }

    // Get the InternalTimeServiceManager (which manages the internal InternalTimerService instances) so that timers can be registered and managed in this StreamOperator
    timeServiceManager = context.internalTimerServiceManager();

    // Input streams for reading back previously stored raw keyed state (one per key group) on restore
    CloseableIterable<KeyGroupStatePartitionStreamProvider> keyedStateInputs = context.rawKeyedStateInputs();
    // Input streams for reading back previously stored raw operator state on restore
    CloseableIterable<StatePartitionStreamProvider> operatorStateInputs = context.rawOperatorStateInputs();

    try {
        /**
         * Wrap the state backends in their xxxStateStore proxies and store them in the StateInitializationContext, which can then manage operator state, keyed state, and raw state.
         * StreamOperators and user-defined functions use the StateInitializationContext (through the xxxStateStore it holds) to initialize state (getState, getMapState, ...)
         */
        StateInitializationContext initializationContext = new StateInitializationContextImpl(
            context.isRestored(), // information whether we restore or start for the first time
            operatorStateBackend, // access to operator state backend
            keyedStateStore, // access to keyed state backend
            keyedStateInputs, // access to keyed state stream
            operatorStateInputs); // access to operator state stream

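        // When AbstractUdfStreamOperator wraps a user-defined CheckpointedFunction, state is initialized through this method
        // In essence: use the StateInitializationContext (which holds the xxxStateStore objects) to initialize the xxxState of the StreamOperator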
        initializeState(initializationContext);
    } finally {
        closeFromRegistry(operatorStateInputs, streamTaskCloseableRegistry);
        closeFromRegistry(keyedStateInputs, streamTaskCloseableRegistry);
    }
}

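As a result, a user-defined RichFunction can initialize an xxxState by calling getRuntimeContext().getState(). The RuntimeContext interface declares getState(), and the abstract class AbstractRuntimeUDFContext implements RuntimeContext. The concrete logic for getState() comes from StreamingRuntimeContext, a concrete subclass of AbstractRuntimeUDFContext: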

/**
 * Obtains a ValueState through the RuntimeContext
 */
@Override
public <T> ValueState<T> getState(ValueStateDescriptor<T> stateProperties) {
    // Validate the StateDescriptor, then get the wrapped KeyedStateStore
    KeyedStateStore keyedStateStore = checkPreconditionsAndGetKeyedStateStore(stateProperties);
    // Initialize the serializer in the StateDescriptor
    stateProperties.initializeSerializerUnlessSet(getExecutionConfig());
    // Create and return the ValueState through the KeyedStateStore interface (default implementation: DefaultKeyedStateStore)
    return keyedStateStore.getState(stateProperties);
}

That is, it fetches the KeyedStateStore that AbstractStreamOperator keeps in a member field:

/**
 * Validates the StateDescriptor and returns the wrapped KeyedStateStore (which can be used to initialize KeyedState)
 */
private KeyedStateStore checkPreconditionsAndGetKeyedStateStore(StateDescriptor<?, ?> stateDescriptor) {
    // Sanity-check the StateDescriptor
    Preconditions.checkNotNull(stateDescriptor, "The state properties must not be null");
    // Get the KeyedStateStore from the StreamOperator (AbstractStreamOperator stored it once it was created)
    KeyedStateStore keyedStateStore = operator.getKeyedStateStore();
    Preconditions.checkNotNull(keyedStateStore, "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation.");
    // KeyedStateStore is a proxy for the KeyedStateBackend; the RuntimeContext uses it to create xxxState
    return keyedStateStore;
}

Once the serializer of the xxxStateDescriptor has been initialized, the KeyedStateStore initializes the xxxState:

/**
 * Uses the KeyedStateStore to initialize xxxState
 */
@Override
public <T> ValueState<T> getState(ValueStateDescriptor<T> stateProperties) {
    requireNonNull(stateProperties, "The state properties must not be null");
    try {
        // Make sure the serializer used by the state has been initialized
        stateProperties.initializeSerializerUnlessSet(executionConfig);
        // Create the KeyedState
        return getPartitionedState(stateProperties);
    } catch (Exception e) {
        throw new RuntimeException("Error while getting state", e);
    }
}

/**
 * Gets the KeyedState: this actually calls the KeyedStateBackend to initialize the xxxState (KeyedStateStore is only a proxy for the KeyedStateBackend)
 */
protected  <S extends State> S getPartitionedState(StateDescriptor<S, ?> stateDescriptor) throws Exception {
    // Get the KeyedState through the KeyedStateBackend interface (base implementation: AbstractKeyedStateBackend)
    return keyedStateBackend.getPartitionedState(
        VoidNamespace.INSTANCE,
        VoidNamespaceSerializer.INSTANCE,
        stateDescriptor);
}

Although the KeyedStateStore is the entry point for initializing xxxState, the work is delegated to the KeyedStateBackend underneath; the KeyedStateStore is only a proxy for it.
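The delegation described above can be modeled in a few lines of plain Java. This is only a sketch: `Backend`, `Store`, and `ValueHolder` are hypothetical stand-ins rather than Flink classes, and the string `"VoidNamespace"` merely imitates the fixed namespace that the store passes to the backend.

```java
import java.util.HashMap;
import java.util.Map;

class StoreProxySketch {

    /** Hypothetical state object: just a mutable holder. */
    static class ValueHolder {
        Object value;
    }

    /** Stand-in for the backend: it owns and hands out the state objects. */
    static class Backend {
        private final Map<String, ValueHolder> statesByName = new HashMap<>();
        String lastNamespace; // namespace used by the most recent request

        ValueHolder getPartitionedState(String namespace, String stateName) {
            lastNamespace = namespace;
            return statesByName.computeIfAbsent(stateName, n -> new ValueHolder());
        }
    }

    /** Stand-in for the store: holds no state itself and only forwards. */
    static class Store {
        static final String VOID_NAMESPACE = "VoidNamespace";
        private final Backend backend;

        Store(Backend backend) {
            this.backend = backend;
        }

        ValueHolder getState(String stateName) {
            if (stateName == null) {
                throw new NullPointerException("The state name must not be null");
            }
            // States requested through the store always use the one fixed namespace.
            return backend.getPartitionedState(VOID_NAMESPACE, stateName);
        }
    }

    public static void main(String[] args) {
        Backend backend = new Backend();
        Store store = new Store(backend);
        ValueHolder viaStore = store.getState("count");
        ValueHolder viaBackend = backend.getPartitionedState(Store.VOID_NAMESPACE, "count");
        // Both paths reach the same object because the store never creates state itself.
        System.out.println(viaStore == viaBackend);
    }
}
```

The store adds validation and a fixed namespace, nothing more; every state object lives in the backend.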

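The KeyedStateBackend interface declares the method for obtaining KeyedState, and the abstract class AbstractKeyedStateBackend implements it. The implementation is built around caching: each initialized InternalKvState is cached, a later request that hits the cache returns it directly, and a miss creates a new InternalKvState.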

/**
 * Initializes KeyedState on top of the KeyedStateBackend.
 * The implementation is built around caching
 */
@SuppressWarnings("unchecked")
@Override
public <N, S extends State> S getPartitionedState(
    final N namespace,
    final TypeSerializer<N> namespaceSerializer,
    final StateDescriptor<S, ?> stateDescriptor) throws Exception {

    checkNotNull(namespace, "Namespace");

    // lastName caches the name of the most recently accessed StateDescriptor; on a hit, the cached InternalKvState is returned directly
    if (lastName != null && lastName.equals(stateDescriptor.getName())) {
        lastState.setCurrentNamespace(namespace);
        return (S) lastState;
    }

    // Look up the InternalKvState in the HashMap that maps "StateDescriptor name -> InternalKvState"
    InternalKvState<K, ?, ?> previous = keyValueStatesByName.get(stateDescriptor.getName());
    // On a hit, also remember the name and the state in the lastName/lastState fields as the most-recent cache
    if (previous != null) {
        lastState = previous;
        lastState.setCurrentNamespace(namespace);
        lastName = stateDescriptor.getName();
        return (S) previous;
    }

    // Core step: no entry in the HashMap, so create a new KeyedState and put it into the HashMap
    final S state = getOrCreateKeyedState(namespaceSerializer, stateDescriptor);
    // Cast the KeyedState to InternalKvState
    final InternalKvState<K, N, ?> kvState = (InternalKvState<K, N, ?>) state;

    // Record the most recently accessed StateDescriptor name and InternalKvState so the next access can be served from the cache
    lastName = stateDescriptor.getName();
    lastState = kvState;
    kvState.setCurrentNamespace(namespace);

    return state;
}
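
To make the two cache levels easier to see in isolation, here is a simplified model in plain Java. It is a sketch under assumptions: `StateCacheSketch` and `KvState` are hypothetical names, namespaces are plain strings, and the `created` counter exists only so the effect of the cache can be observed.

```java
import java.util.HashMap;
import java.util.Map;

class StateCacheSketch {

    /** Hypothetical state object that remembers the namespace it is scoped to. */
    static class KvState {
        final String name;
        String currentNamespace;

        KvState(String name) {
            this.name = name;
        }
    }

    private final Map<String, KvState> statesByName = new HashMap<>();
    private String lastName;   // level 1: name of the most recently used state
    private KvState lastState; // level 1: the most recently used state
    int created;               // how many states were actually constructed

    KvState getPartitionedState(String namespace, String name) {
        // Level 1: same state as last time, so no map lookup is needed.
        if (lastName != null && lastName.equals(name)) {
            lastState.currentNamespace = namespace;
            return lastState;
        }
        // Level 2: look up by name; on a miss, create and register the state.
        KvState state = statesByName.get(name);
        if (state == null) {
            state = new KvState(name);
            statesByName.put(name, state);
            created++;
        }
        // Either way, this state becomes the level-1 entry.
        lastName = name;
        lastState = state;
        state.currentNamespace = namespace;
        return state;
    }

    public static void main(String[] args) {
        StateCacheSketch cache = new StateCacheSketch();
        KvState first = cache.getPartitionedState("ns1", "a");
        cache.getPartitionedState("ns1", "b");
        KvState again = cache.getPartitionedState("ns2", "a");
        System.out.println(cache.created + " states created, same object: " + (first == again));
    }
}
```

Alternating between two state names defeats the level-1 fields but still hits the map, so no state is ever constructed twice.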

The core question is how the InternalKvState gets created: if TTL is configured, TtlStateFactory creates an InternalKvState that supports expiration; otherwise the KeyedStateBackend creates the InternalKvState directly.

/** 
 * Returns the InternalKvState cached under the descriptor's name, or creates it (through TtlStateFactory) and caches it
 */
@Override
@SuppressWarnings("unchecked")
public <N, S extends State, V> S getOrCreateKeyedState(
    final TypeSerializer<N> namespaceSerializer,
    StateDescriptor<S, V> stateDescriptor) throws Exception {
    checkNotNull(namespaceSerializer, "Namespace serializer");
    checkNotNull(keySerializer, "State key serializer has not been configured in the config. " +
                 "This operation cannot use partitioned state.");

    // Look up the InternalKvState in HashMap<String, InternalKvState<K, ?, ?>> again as a safeguard; when called from getPartitionedState() above, this lookup always misses
    InternalKvState<K, ?, ?> kvState = keyValueStatesByName.get(stateDescriptor.getName());
    if (kvState == null) {
        if (!stateDescriptor.isSerializerInitialized()) {
            stateDescriptor.initializeSerializerUnlessSet(executionConfig);
        }
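        // Core step: create the InternalKvState through TtlStateFactory, which adds TTL (expiration) wrapping only if TTL is enabled
        kvState = TtlStateFactory.createStateAndWrapWithTtlIfEnabled(
            namespaceSerializer, stateDescriptor, this, ttlTimeProvider);
        // Cache the InternalKvState in the HashMap so later requests are served from it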
        keyValueStatesByName.put(stateDescriptor.getName(), kvState);
        publishQueryableStateIfEnabled(stateDescriptor, kvState);
    }
    // Return the InternalKvState
    return (S) kvState;
}

/**
 * Creates the InternalKvState, wrapping it with TTL (expiration) support if TTL is enabled
 */
public static <K, N, SV, TTLSV, S extends State, IS extends S> IS createStateAndWrapWithTtlIfEnabled(
    TypeSerializer<N> namespaceSerializer,
    StateDescriptor<S, SV> stateDesc,
    KeyedStateBackend<K> stateBackend,
    TtlTimeProvider timeProvider) throws Exception {
    Preconditions.checkNotNull(namespaceSerializer);
    Preconditions.checkNotNull(stateDesc);
    Preconditions.checkNotNull(stateBackend);
    Preconditions.checkNotNull(timeProvider);
    // First check whether the StateDescriptor carries an enabled TTL configuration
    return  stateDesc.getTtlConfig().isEnabled() ?
        // TTL is configured: create the InternalKvState through TtlStateFactory#createState()
        new TtlStateFactory<K, N, SV, TTLSV, S, IS>(
        namespaceSerializer, stateDesc, stateBackend, timeProvider)
        .createState() :
    // TTL is not configured: create the KeyedState through the KeyedStateBackend interface (a capability inherited from its parent interface, KeyedStateFactory)
    stateBackend.createInternalState(namespaceSerializer, stateDesc);
}
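
The "wrap only if TTL is enabled" branch is a decorator decision, which the following plain-Java sketch tries to illustrate. Everything in it is an assumption made for illustration: the `SimpleValueState` interface, the `Timestamped` pair, the convention that `ttlMillis > 0` means "enabled", and the injected clock are all hypothetical and do not reproduce Flink's TTL classes.

```java
import java.util.function.LongSupplier;

class TtlWrapSketch {

    interface SimpleValueState<T> {
        T value();
        void update(T newValue);
    }

    /** Plain state: stores whatever it is given. */
    static class PlainState<T> implements SimpleValueState<T> {
        private T stored;
        public T value() { return stored; }
        public void update(T newValue) { stored = newValue; }
    }

    /** A user value paired with the time it was written. */
    static class Timestamped<T> {
        final T userValue;
        final long writtenAt;
        Timestamped(T userValue, long writtenAt) {
            this.userValue = userValue;
            this.writtenAt = writtenAt;
        }
    }

    /** Decorator: keeps timestamped values in a plain state and hides expired ones. */
    static class TtlState<T> implements SimpleValueState<T> {
        private final SimpleValueState<Timestamped<T>> original;
        private final long ttlMillis;
        private final LongSupplier clock;

        TtlState(SimpleValueState<Timestamped<T>> original, long ttlMillis, LongSupplier clock) {
            this.original = original;
            this.ttlMillis = ttlMillis;
            this.clock = clock;
        }

        public T value() {
            Timestamped<T> entry = original.value();
            if (entry == null) {
                return null;
            }
            if (clock.getAsLong() - entry.writtenAt >= ttlMillis) {
                original.update(null); // expired: drop the entry and report nothing
                return null;
            }
            return entry.userValue;
        }

        public void update(T newValue) {
            original.update(newValue == null ? null : new Timestamped<T>(newValue, clock.getAsLong()));
        }
    }

    /** The branch from the text: decorate only when TTL is enabled. */
    static <T> SimpleValueState<T> createAndWrapIfEnabled(long ttlMillis, LongSupplier clock) {
        if (ttlMillis > 0) {
            return new TtlState<T>(new PlainState<Timestamped<T>>(), ttlMillis, clock);
        }
        return new PlainState<T>();
    }

    public static void main(String[] args) {
        long[] now = {0L}; // manually advanced clock
        SimpleValueState<String> state = createAndWrapIfEnabled(100L, () -> now[0]);
        state.update("x");
        now[0] = 50L;
        System.out.println(state.value()); // still within the TTL
        now[0] = 150L;
        System.out.println(state.value()); // past the TTL
    }
}
```

The caller gets the same interface on both branches, which is why the rest of the creation path does not need to know whether TTL is in play.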

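Assume TTL is not configured, so the KeyedStateBackend creates the KeyedState. Because the Flink version examined here defaults to MemoryStateBackend, the KeyedState is created by HeapKeyedStateBackend. The ability of the KeyedStateBackend interface to create KeyedState comes from its parent interface, KeyedStateFactory: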

/**
 * The KeyedStateFactory interface declares the method for creating KeyedState
 */
@Nonnull
default <N, SV, S extends State, IS extends S> IS createInternalState(
    @Nonnull TypeSerializer<N> namespaceSerializer,
    @Nonnull StateDescriptor<S, SV> stateDesc) throws Exception {
    // Create the KeyedState through the backend's three-argument overload, with no snapshot transform
    return createInternalState(namespaceSerializer, stateDesc, StateSnapshotTransformFactory.noTransform());
}
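
The pattern at work here is a default interface method that forwards to an abstract overload, inherited by a child interface. A minimal sketch, assuming hypothetical names (`StateFactoryLike`, `BackendLike`, `HeapLikeBackend`) and strings in place of real state objects:

```java
import java.util.function.UnaryOperator;

class DefaultMethodSketch {

    /** Parent interface: declares creation, plus a convenience default overload. */
    interface StateFactoryLike {
        // Convenience overload: applies no transform and forwards to the full version.
        default String createInternalState(String descriptorName) {
            return createInternalState(descriptorName, UnaryOperator.identity());
        }

        // Full version that every concrete backend must implement.
        String createInternalState(String descriptorName, UnaryOperator<String> transform);
    }

    /** Child interface: inherits both methods without redeclaring them. */
    interface BackendLike extends StateFactoryLike {
        String name();
    }

    static class HeapLikeBackend implements BackendLike {
        public String name() {
            return "heap";
        }

        public String createInternalState(String descriptorName, UnaryOperator<String> transform) {
            return transform.apply(name() + ":" + descriptorName);
        }
    }

    public static void main(String[] args) {
        BackendLike backend = new HeapLikeBackend();
        System.out.println(backend.createInternalState("count"));
        System.out.println(backend.createInternalState("count", String::toUpperCase));
    }
}
```

A caller holding only the child interface can use the short overload, while each concrete backend implements just the full one.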

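HeapKeyedStateBackend, as an implementing class, provides the concrete logic for creating an InternalKvState. Note that the StateFactory for each state type is registered up front in a static final map, so it is in place before any HeapKeyedStateBackend instance is used; a StateFactory exists solely to create State.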

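// The StateFactory for each kind of KeyedState is registered up front in this static map (populated when the HeapKeyedStateBackend class is initialized)
// Note: each method reference (e.g. HeapValueState::create) serves as the implementation of the StateFactory interface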
private static final Map<Class<? extends StateDescriptor>, StateFactory> STATE_FACTORIES =
    Stream.of(
    Tuple2.of(ValueStateDescriptor.class, (StateFactory) HeapValueState::create),
    Tuple2.of(ListStateDescriptor.class, (StateFactory) HeapListState::create),
    Tuple2.of(MapStateDescriptor.class, (StateFactory) HeapMapState::create),
    Tuple2.of(AggregatingStateDescriptor.class, (StateFactory) HeapAggregatingState::create),
    Tuple2.of(ReducingStateDescriptor.class, (StateFactory) HeapReducingState::create),
    Tuple2.of(FoldingStateDescriptor.class, (StateFactory) HeapFoldingState::create)
).collect(Collectors.toMap(t -> t.f0, t -> t.f1));



/**
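 * KeyedStateFactory, the parent interface of KeyedStateBackend, declares the state-creation method; HeapKeyedStateBackend provides the implementation.
 * Tip: a StateFactory exists solely to create State; whichever kind of State is needed, the matching StateFactory is looked up.
 * Because each factory is a method reference bound to the interface method, calling the interface method runs that concrete creation logic.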
 */
@Override
@Nonnull
public <N, SV, SEV, S extends State, IS extends S> IS createInternalState(
    @Nonnull TypeSerializer<N> namespaceSerializer,
    @Nonnull StateDescriptor<S, SV> stateDesc,
    @Nonnull StateSnapshotTransformFactory<SEV> snapshotTransformFactory) throws Exception {
    // Look up the StateFactory in the map "StateDescriptor class -> StateFactory", keyed by the descriptor's class.
    StateFactory stateFactory = STATE_FACTORIES.get(stateDesc.getClass());
    if (stateFactory == null) {
        String message = String.format("State %s is not supported by %s",
                                       stateDesc.getClass(), this.getClass());
        throw new FlinkRuntimeException(message);
    }
    // Try to register the StateTable: the StateTable stores the state data (using StateMap collections)
    StateTable<K, N, SV> stateTable = tryRegisterStateTable(
        namespaceSerializer, stateDesc, getStateSnapshotTransformFactory(stateDesc, snapshotTransformFactory));
    // Lowest level: create the KeyedState through the StateFactory
    return stateFactory.createState(stateDesc, stateTable, getKeySerializer());
}

StateFactory is a private interface nested in HeapKeyedStateBackend, dedicated to creating the various types of State:

/**
 * Dedicated to creating the different types of State
 */
private interface StateFactory {
    <K, N, SV, S extends State, IS extends S> IS createState(
        StateDescriptor<S, SV> stateDesc,
        StateTable<K, N, SV> stateTable,
        TypeSerializer<K> keySerializer) throws Exception;
}

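Each state type supplies its own logic for StateFactory#createState(): its static create() method is registered as a method reference, and that method reference serves as the StateFactory implementation. Calling StateFactory#createState() therefore runs the create() logic of the corresponding HeapXXXState.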
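
The registry-of-method-references idea can be reproduced in a small standalone example. This is a sketch, not Flink code: the descriptor and state classes (`ValueDesc`, `ListDesc`, `UnknownDesc`, `ValueSketchState`, `ListSketchState`) are hypothetical, and the factory takes only a name instead of a descriptor, state table, and serializer.

```java
import java.util.HashMap;
import java.util.Map;

class FactoryRegistrySketch {

    // Hypothetical descriptor types, one per kind of state.
    static class ValueDesc {}
    static class ListDesc {}
    static class UnknownDesc {}

    // Hypothetical state types, each with its own static create().
    static class ValueSketchState {
        static Object create(String name) { return new ValueSketchState(); }
    }
    static class ListSketchState {
        static Object create(String name) { return new ListSketchState(); }
    }

    /** Single-method interface, so a method reference can implement it. */
    interface StateFactory {
        Object createState(String name);
    }

    // Filled once, when the class is initialized; keyed by descriptor class.
    private static final Map<Class<?>, StateFactory> FACTORIES = new HashMap<>();
    static {
        FACTORIES.put(ValueDesc.class, ValueSketchState::create);
        FACTORIES.put(ListDesc.class, ListSketchState::create);
    }

    static Object createInternalState(Object descriptor, String name) {
        StateFactory factory = FACTORIES.get(descriptor.getClass());
        if (factory == null) {
            throw new IllegalArgumentException(
                "State " + descriptor.getClass().getSimpleName() + " is not supported");
        }
        // Invoking the interface method runs the registered create() method.
        return factory.createState(name);
    }

    public static void main(String[] args) {
        Object state = createInternalState(new ValueDesc(), "count");
        System.out.println(state.getClass().getSimpleName());
    }
}
```

Adding a new kind of state then only requires one more entry in the map; the lookup code stays unchanged.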

Suppose a ValueState is requested; the call then lands in HeapValueState#create():

/**
 * Creates a ValueState
 */
@SuppressWarnings("unchecked")
static <K, N, SV, S extends State, IS extends S> IS create(
    StateDescriptor<S, SV> stateDesc,
    StateTable<K, N, SV> stateTable,
    TypeSerializer<K> keySerializer) {
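    // Create the HeapValueState
    return (IS) new HeapValueState<>(
        // StateTable is the data structure that holds the state data; the default implementation is CopyOnWriteStateTable.
        // It maps "key group -> CopyOnWriteStateMap", where each CopyOnWriteStateMap is a hash table built from an array plus linked lists and grows by incremental rehashing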
        stateTable,
        keySerializer,
        stateTable.getStateSerializer(),
        stateTable.getNamespaceSerializer(),
        stateDesc.getDefaultValue());
}

In short, creating KeyedState through the KeyedStateBackend means picking the StateFactory that matches the state type and running its creation logic (new HeapValueState).
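
Since the state object is created once per descriptor and receives the StateTable in its constructor, the per-key data has to live in the table rather than in the state object. The sketch below models that split in plain Java. It rests on assumptions: `Table` and `ValueStateView` are hypothetical, namespaces are omitted, an ordinary HashMap replaces CopyOnWriteStateMap, and the key group is computed with a plain `hashCode` modulo, which is not how Flink assigns key groups.

```java
import java.util.HashMap;
import java.util.Map;

class StateTableSketch {

    /** Hypothetical table: key group -> (key -> value). */
    static class Table<K, V> {
        private final int numKeyGroups;
        private final Map<Integer, Map<K, V>> mapsByKeyGroup = new HashMap<>();
        K currentKey; // set by the caller before each access

        Table(int numKeyGroups) {
            this.numKeyGroups = numKeyGroups;
        }

        // Assumption for illustration: plain hashCode modulo the number of key groups.
        int keyGroupOf(K key) {
            return Math.floorMod(key.hashCode(), numKeyGroups);
        }

        V get() {
            Map<K, V> map = mapsByKeyGroup.get(keyGroupOf(currentKey));
            return map == null ? null : map.get(currentKey);
        }

        void put(V value) {
            mapsByKeyGroup
                .computeIfAbsent(keyGroupOf(currentKey), group -> new HashMap<>())
                .put(currentKey, value);
        }
    }

    /** One state object serves every key: it only reads and writes the table. */
    static class ValueStateView<K, V> {
        private final Table<K, V> table;
        private final V defaultValue;

        ValueStateView(Table<K, V> table, V defaultValue) {
            this.table = table;
            this.defaultValue = defaultValue;
        }

        V value() {
            V stored = table.get();
            return stored == null ? defaultValue : stored;
        }

        void update(V newValue) {
            table.put(newValue);
        }
    }

    public static void main(String[] args) {
        Table<String, Integer> table = new Table<>(4);
        ValueStateView<String, Integer> state = new ValueStateView<>(table, 0);
        table.currentKey = "a";
        state.update(5);
        table.currentKey = "b";
        System.out.println(state.value()); // default, nothing stored for "b"
        table.currentKey = "a";
        System.out.println(state.value()); // the value stored for "a"
    }
}
```

Switching the current key changes what the same state object returns, which is why caching a single state instance per descriptor name is safe.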
