Flink Job Development Checklist (4): State

TRX1024

Article link: https://blog.csdn.net/weixin_43161811/article/details/123741743

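Recommended reading: Working with State; State Management and Fault Tolerance.

If the current computation depends on results produced from earlier data, it needs state. Word Count is an example: state holds the running counts of the words seen so far.

State Types

Flink Managed State falls into two categories: Keyed State and Operator State.

Keyed State is the one users work with most often.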

Keyed State

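The Keyed State types differ as follows:

ValueState stores a single value. In Word Count, for example, the word is the key and the state is its count. The value can be a number, a string, or any other type. Because it is a single value, the access interface amounts to a getter and a setter, which on the state are T value() and update(T).
MapState holds a Map and exposes put, remove, and similar methods. Note that the key inside a MapState is not the same as the key of the Keyed State: each stream key owns its own map.
ListState holds a List and exposes add, update, and similar methods.
ReducingState and AggregatingState share a parent interface with ListState, but their state is a single value: add does not append the element to a list, it folds the element directly into the reduced result.
AggregatingState differs from ReducingState in its access interface: with ReducingState, add(T) and T get() take and return the same type, whereas AggregatingState takes an IN and returns an OUT.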
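
To make the type difference between ReducingState and AggregatingState concrete, here is a minimal plain-Java sketch. It does not use Flink: `ReducingModel` and `AverageModel` are hypothetical stand-ins that only mimic the add/get contract described above. In a real job the folding logic would be supplied as a `ReduceFunction` or `AggregateFunction` on the state descriptor.

```java
import java.util.function.BinaryOperator;

/** Hypothetical stand-in for ReducingState: add(T) folds into a single T, get() returns T. */
class ReducingModel<T> {
    private final BinaryOperator<T> reducer;
    private T current; // null until the first add, like empty state

    ReducingModel(BinaryOperator<T> reducer) {
        this.reducer = reducer;
    }

    void add(T value) {
        // The element is merged into the result instead of being appended to a list.
        current = (current == null) ? value : reducer.apply(current, value);
    }

    T get() {
        return current;
    }
}

/** Hypothetical stand-in for AggregatingState with IN = Long and OUT = Double. */
class AverageModel {
    private long sum;   // accumulator part 1
    private long count; // accumulator part 2

    void add(Long value) {
        sum += value;
        count++;
    }

    Double get() {
        // The output type differs from the input type.
        return count == 0 ? null : (double) sum / count;
    }
}

class KeyedStateModelDemo {
    public static void main(String[] args) {
        ReducingModel<Long> total = new ReducingModel<>(Long::sum);
        AverageModel average = new AverageModel();
        for (long v : new long[] {3L, 5L, 7L}) {
            total.add(v);
            average.add(v);
        }
        System.out.println(total.get());   // 15
        System.out.println(average.get()); // 5.0
    }
}
```

Both models keep one value no matter how many elements are added, which is the property that separates them from ListState.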

Example Code

Using state in a FlatMap:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountWindowAverage extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {

    /**
     * The ValueState handle. The first field is the count, the second field a running sum.
     */
    private transient ValueState<Tuple2<Long, Long>> sum;

    @Override
    public void flatMap(Tuple2<Long, Long> input, Collector<Tuple2<Long, Long>> out) throws Exception {

        // access the state value; it is null if nothing was stored yet for this key
        Tuple2<Long, Long> currentSum = sum.value();
        if (currentSum == null) {
            currentSum = Tuple2.of(0L, 0L);
        }

        // update the count
        currentSum.f0 += 1;

        // add the second field of the input value
        currentSum.f1 += input.f1;

        // update the state
        sum.update(currentSum);

        // if the count reaches 2, emit the average and clear the state
        if (currentSum.f0 >= 2) {
            out.collect(new Tuple2<>(input.f0, currentSum.f1 / currentSum.f0));
            sum.clear();
        }
    }

    @Override
    public void open(Configuration config) {
        // no default value is passed here: that constructor is deprecated,
        // so flatMap handles the empty (null) case itself
        ValueStateDescriptor<Tuple2<Long, Long>> descriptor =
                new ValueStateDescriptor<>(
                        "average", // the state name
                        TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {})); // type information
        sum = getRuntimeContext().getState(descriptor);
    }
}

// this can be used in a streaming program like this (assuming we have a StreamExecutionEnvironment env)
env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(1L, 4L), Tuple2.of(1L, 2L))
        .keyBy(value -> value.f0) // key by the first field; keyBy(0) is deprecated
        .flatMap(new CountWindowAverage())
        .print();

// the printed output will be (1,4) and (1,5); the average uses integer division

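State Cleanup

If state keeps accumulating, it will eventually cause memory and performance problems, so cleaning it up correctly matters. You can clear state periodically with a Timer in a rich function that has timer access, such as KeyedProcessFunction, or use the TTL state that Flink provides.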

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

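StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1)) // time-to-live of each state entry
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite) // only creation and writes refresh the TTL
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) // expired values are never returned
    .build();

ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("text state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig); // must be called before the state is obtained from the runtime context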

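Background cleanup is already enabled by default in the production environment, so the code above is normally all you need for state cleanup. For finer control over background cleanup, see the official documentation.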
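
The two settings in the TTL configuration above are easy to confuse, so here is a small plain-Java model of what they mean. It is not Flink code: `TtlValueModel` and its injected clock are hypothetical, and the sketch only mirrors the documented behavior of `OnCreateAndWrite` (writes refresh the timestamp, reads do not) and `NeverReturnExpired` (an expired value is never handed back).

```java
import java.util.function.LongSupplier;

/** Hypothetical model of a single TTL-protected value; not Flink code. */
class TtlValueModel<T> {
    private final long ttlMillis;
    private final LongSupplier clock; // injected so the example is deterministic
    private T value;
    private long lastWriteMillis;

    TtlValueModel(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Mirrors UpdateType.OnCreateAndWrite: only writes refresh the timestamp. */
    void update(T newValue) {
        value = newValue;
        lastWriteMillis = clock.getAsLong();
    }

    /** Mirrors StateVisibility.NeverReturnExpired: expired values are hidden from readers. */
    T value() {
        boolean expired = value != null && clock.getAsLong() - lastWriteMillis >= ttlMillis;
        return expired ? null : value; // reading does not touch lastWriteMillis
    }
}

class TtlModelDemo {
    public static void main(String[] args) {
        long[] now = {0L}; // fake clock, advanced by hand
        TtlValueModel<String> state = new TtlValueModel<>(1000L, () -> now[0]);
        state.update("hello");
        now[0] = 999L;
        System.out.println(state.value()); // hello
        now[0] = 1500L;
        System.out.println(state.value()); // null
    }
}
```

Under this model, a key that is read often but never rewritten still expires, which is usually what you want when the goal is to bound state size.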

If your State Backend is filesystem, we recommend adding the following configuration:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .cleanupFullSnapshot() // drop expired entries whenever a full snapshot is taken
    .build();
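
The timer-based alternative mentioned at the start of this section can be sketched as well. The plain-Java model below is not Flink code, and `IdleCleanupModel` with its methods is hypothetical. In a real job this logic would typically live in a `KeyedProcessFunction`: `processElement` would update the state and register a timer through `ctx.timerService()`, and `onTimer` would call `clear()` on the state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of per-key state cleared by an idle timer; not Flink code. */
class IdleCleanupModel {
    private final long idleTimeoutMillis;
    private final Map<String, Long> counts = new HashMap<>();   // stands in for per-key ValueState<Long>
    private final Map<String, Long> lastSeen = new HashMap<>(); // last update time per key

    IdleCleanupModel(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Stands in for processElement: update the state and remember when it happened. */
    void onElement(String key, long nowMillis) {
        counts.merge(key, 1L, Long::sum);
        lastSeen.put(key, nowMillis);
        // In Flink, a timer for nowMillis + idleTimeoutMillis would be registered here.
    }

    /** Stands in for onTimer: called when a timer registered for this key fires. */
    void onTimer(String key, long timerMillis) {
        Long seen = lastSeen.get(key);
        // Ignore stale timers: the key was updated after this timer was registered.
        if (seen != null && timerMillis >= seen + idleTimeoutMillis) {
            counts.remove(key); // corresponds to clearing the state for this key
            lastSeen.remove(key);
        }
    }

    Long count(String key) {
        return counts.get(key);
    }
}

class IdleCleanupDemo {
    public static void main(String[] args) {
        IdleCleanupModel model = new IdleCleanupModel(60_000L);
        model.onElement("a", 0L);
        model.onElement("a", 30_000L);
        model.onTimer("a", 60_000L);          // stale: "a" was seen again at 30s
        System.out.println(model.count("a")); // 2
        model.onTimer("a", 90_000L);          // 60s of inactivity have passed
        System.out.println(model.count("a")); // null
    }
}
```

The stale-timer check matters because every element registers a new timer; without it, the first timer would wipe state that is still in active use.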
