[FLINK] Reading Dimension Tables from a Real-Time Stream (Part 1): Broadcast State Pattern

Zsigner

Category: Flink. Tags: flink, big data, broadcast, cast

Article link: https://blog.csdn.net/Zsigner/article/details/122454254

1. What is Broadcast State?

The Broadcast State can be used to combine and jointly process two streams of events in a specific way. The events of the first stream are broadcasted to all parallel instances of an operator, which maintains them as state. The events of the other stream are not broadcasted but sent to individual instances of the same operator and processed together with the events of the broadcasted stream.

Source: A Practical Guide to Broadcast State in Apache Flink
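The routing described above can be modelled in a few lines of plain Java. This is only a sketch and does not use Flink: the class BroadcastModel, the modulo routing of regular events, and the sample values are assumptions made for illustration. It shows that every parallel instance ends up with the same copy of the broadcast data, while each regular event is handled by a single instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the pattern; it only mimics the routing and is not Flink's API.
class BroadcastModel {

    // One parallel instance of the operator, holding its own copy of the broadcast data.
    static class Instance {
        final Map<Integer, String> state = new HashMap<>();

        // Broadcast side: store the config entry in the local copy.
        void onBroadcast(int mediaId, String mediaName) {
            state.put(mediaId, mediaName);
        }

        // Non-broadcast side: read-only lookup against the local copy.
        String onEvent(int mediaId) {
            return mediaId + "->" + state.getOrDefault(mediaId, "unknown");
        }
    }

    final List<Instance> instances = new ArrayList<>();

    BroadcastModel(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            instances.add(new Instance());
        }
    }

    // A broadcast element is delivered to every instance.
    void broadcast(int mediaId, String mediaName) {
        for (Instance instance : instances) {
            instance.onBroadcast(mediaId, mediaName);
        }
    }

    // A regular element goes to exactly one instance (modulo routing is an assumption).
    String send(int mediaId) {
        Instance target = instances.get(Math.floorMod(mediaId, instances.size()));
        return target.onEvent(mediaId);
    }

    public static void main(String[] args) {
        BroadcastModel model = new BroadcastModel(3);
        model.broadcast(1, "news");
        model.broadcast(2, "video");
        System.out.println(model.send(1)); // prints 1->news
        System.out.println(model.send(7)); // prints 7->unknown
    }
}
```

Because every instance holds the full table, it does not matter which instance a regular event lands on: the lookup gives the same answer everywhere.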

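1.1 How does broadcast state differ from ordinary state?

Looking at the source code, broadcast state turns out to be map-shaped: it is declared with a MapStateDescriptor and behaves like a MapState.

Broadcast state is a kind of operator state in Flink. Every parallel task holds a full copy of it and matches it against the records of the real-time stream.

Note: three differences between broadcast state and other operator state:

· Broadcast state always has a map format (on non-keyed streams too). Other operator state is list-based (ListState, union ListState); ValueState and MapState are keyed state.

· Broadcast state is only available to an operator whose inputs are a broadcast stream, typically a low-throughput one, and a non-broadcast stream.

· Such an operator can hold several broadcast states under different names (see the source code for details).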

2. Hands-on practice

2.1 Define a MapStateDescriptor

 // The whole config table (mediaId -> (mediaName, mediaType)) is stored as one value under the null key.
 MapStateDescriptor<Void, Map<Integer,Tuple2<String,Integer>>> configBroadcastDescriptor
                = new MapStateDescriptor<>("mysql-config-table", Types.VOID,
                Types.MAP(Types.INT, Types.TUPLE(Types.STRING, Types.INT)));
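The descriptor stores the entire config table as a single value under the null key. The plain-Java sketch below compares that layout with a one-entry-per-row layout. It does not use Flink; the class ConfigLayouts, its method names, and the sample rows are hypothetical and only serve to illustrate the trade-off.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java comparison of two ways to lay out a broadcast config table (not Flink code).
class ConfigLayouts {

    // Layout used in this post: each broadcast element is a full snapshot of the table,
    // and put(null, snapshot) replaces the previous snapshot in one step.
    static void applySnapshot(Map<Void, Map<Integer, String>> state, Map<Integer, String> snapshot) {
        state.put(null, new HashMap<>(snapshot)); // HashMap accepts a null key
    }

    // Alternative layout: one entry per row, keyed by mediaId.
    // Rows are upserted; a row deleted at the source stays until it is removed explicitly.
    static void applyRows(Map<Integer, String> state, Map<Integer, String> rows) {
        state.putAll(rows);
    }

    public static void main(String[] args) {
        Map<Integer, String> first = new HashMap<>();
        first.put(1, "news");
        first.put(2, "video");
        Map<Integer, String> second = new HashMap<>();
        second.put(1, "news"); // row 2 was deleted at the source

        Map<Void, Map<Integer, String>> whole = new HashMap<>();
        applySnapshot(whole, first);
        applySnapshot(whole, second);
        System.out.println(whole.get(null).containsKey(2)); // prints false

        Map<Integer, String> perRow = new HashMap<>();
        applyRows(perRow, first);
        applyRows(perRow, second);
        System.out.println(perRow.containsKey(2)); // prints true
    }
}
```

The snapshot layout keeps deletions in sync for free but resends the whole table on every refresh; the per-row layout is cheaper per update but needs explicit delete handling.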

2.2 Create a broadcast stream

// Create the broadcast stream; several descriptors of different types can be passed here
        BroadcastStream<Map<Integer, Tuple2<String, Integer>>> broadcastStream =
                mysqlStream.broadcast(configBroadcastDescriptor);

2.3 Call dataStream.connect

SingleOutputStreamOperator<MediaEntity> resultStream = mediaSource
                .connect(broadcastStream)
                .process(new CoBroadcastProcessFunction()).name("kafka-co-mysql-broadcast");

2.4 Override the ProcessFunction

Note that a non-keyed DataStream requires extending BroadcastProcessFunction,

while a keyed DataStream requires extending KeyedBroadcastProcessFunction.

The source code shows that both extend BaseBroadcastProcessFunction. KeyedBroadcastProcessFunction additionally has an onTimer method, which can be used to emit data on a timer.

3. Override the processElement and processBroadcastElement methods

@Override
    public void processElement(MediaEntity value, ReadOnlyContext ctx, Collector<MediaEntity> out) throws Exception {

        // Read-only view of the broadcast state; the whole config table sits under the null key.
        ReadOnlyBroadcastState<Void, Map<Integer, Tuple2<String, Integer>>> broadcastState =
                ctx.getBroadcastState(configDescriptor);
        Map<Integer, Tuple2<String, Integer>> configMap = broadcastState.get(null);

        try {
            // configMap is null until the first broadcast element has arrived.
            if(configMap != null){
                System.out.println("processElement：" + value.toString());
                for(Integer key : configMap.keySet()){
                    System.out.println("key: "+ key +",value: " + configMap.get(key));
                }
                Tuple2<String, Integer> tuple2Tmp = configMap.get(value.mediaId);
                // The table may have no row for this mediaId; emit the record unenriched in that case.
                if(tuple2Tmp != null){
                    value.mediaName = tuple2Tmp.f0;
                    value.mediaType = tuple2Tmp.f1;
                }
            }
            out.collect(value);
        } catch (Exception e) {
            // Pass the exception itself so the stack trace goes to the log.
            logger.error("run process error", e);
        }
    }

@Override
    public void processBroadcastElement(Map<Integer, Tuple2<String, Integer>> value, Context ctx, Collector<MediaEntity> out) throws Exception {

        System.out.println("processBroadcastElement："+value.toString());
        // Replace the whole config table stored under the null key with the new snapshot.
        ctx.getBroadcastState(configDescriptor).put(null,value);

    }
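The lookup in processElement can miss in two ways: no broadcast element has arrived yet, or the table has no row for the record's mediaId. In either case the record should still be emitted. The enrichment step can be sketched in plain Java as follows. This does not use Flink; Media and NameAndType are hypothetical stand-ins for MediaEntity and Tuple2<String, Integer>, reduced to the fields used in this post.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the null-safe enrichment step (not Flink code).
class EnrichSketch {

    // Hypothetical stand-in for MediaEntity.
    static class Media {
        Integer mediaId;
        String mediaName;
        Integer mediaType;

        Media(Integer mediaId) {
            this.mediaId = mediaId;
        }
    }

    // Hypothetical stand-in for Tuple2<String, Integer>.
    static class NameAndType {
        final String name;
        final Integer type;

        NameAndType(String name, Integer type) {
            this.name = name;
            this.type = type;
        }
    }

    // Fills in name and type when a row exists; otherwise returns the record unchanged.
    static Media enrich(Media value, Map<Integer, NameAndType> configMap) {
        if (configMap == null) {
            return value; // no broadcast element has arrived yet
        }
        NameAndType row = configMap.get(value.mediaId);
        if (row != null) {
            value.mediaName = row.name;
            value.mediaType = row.type;
        }
        return value;
    }

    public static void main(String[] args) {
        Map<Integer, NameAndType> config = new HashMap<>();
        config.put(1, new NameAndType("news", 10));
        System.out.println(enrich(new Media(1), config).mediaName); // prints news
        System.out.println(enrich(new Media(2), config).mediaName); // prints null
        System.out.println(enrich(new Media(1), null).mediaName);   // prints null
    }
}
```

Whether unmatched records should pass through unenriched, be buffered, or go to a side output depends on the job; passing them through is simply the least surprising default for this sketch.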

4. Things to note when using it

Important Considerations

After describing the offered APIs, this section focuses on the important things to keep in mind when using broadcast state. These are:

There is no cross-task communication: As stated earlier, this is the reason why only the broadcast side of a (Keyed)-BroadcastProcessFunction can modify the contents of the broadcast state. In addition, the user has to make sure that all tasks modify the contents of the broadcast state in the same way for each incoming element. Otherwise, different tasks might have different contents, leading to inconsistent results.

Order of events in Broadcast State may differ across tasks: Although broadcasting the elements of a stream guarantees that all elements will (eventually) go to all downstream tasks, elements may arrive in a different order to each task. So the state updates for each incoming element MUST NOT depend on the ordering of the incoming events.

All tasks checkpoint their broadcast state: Although all tasks have the same elements in their broadcast state when a checkpoint takes place (checkpoint barriers do not overpass elements), all tasks checkpoint their broadcast state, and not just one of them. This is a design decision to avoid having all tasks read from the same file during a restore (thus avoiding hotspots), although it comes at the expense of increasing the size of the checkpointed state by a factor of p (= parallelism). Flink guarantees that upon restoring/rescaling there will be no duplicates and no missing data. In case of recovery with the same or smaller parallelism, each task reads its checkpointed state. Upon scaling up, each task reads its own state, and the remaining tasks (p_new-p_old) read checkpoints of previous tasks in a round-robin manner.

No RocksDB state backend: Broadcast state is kept in-memory at runtime and memory provisioning should be done accordingly. This holds for all operator states.
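The ordering point can be made concrete with a small plain-Java sketch. It does not use Flink, and the Update class with its version field is an assumption: the data in this post carries no version. Two tasks receive the same updates in different order. An update rule where the last arrival wins lets their states diverge, while a rule where the highest version wins gives the same state on both.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch: order-dependent vs order-independent state updates (not Flink code).
class OrderSketch {

    // Hypothetical update record; versions are assumed to be unique per key.
    static class Update {
        final int key;
        final String value;
        final long version;

        Update(int key, String value, long version) {
            this.key = key;
            this.value = value;
            this.version = version;
        }
    }

    // Order-dependent: whichever update arrives last overwrites the entry.
    static Map<Integer, String> lastArrivalWins(List<Update> updates) {
        Map<Integer, String> state = new HashMap<>();
        for (Update u : updates) {
            state.put(u.key, u.value);
        }
        return state;
    }

    // Order-independent: an update only replaces an entry with a lower version.
    static Map<Integer, String> highestVersionWins(List<Update> updates) {
        Map<Integer, Update> latest = new HashMap<>();
        for (Update u : updates) {
            Update current = latest.get(u.key);
            if (current == null || u.version > current.version) {
                latest.put(u.key, u);
            }
        }
        Map<Integer, String> state = new HashMap<>();
        for (Update u : latest.values()) {
            state.put(u.key, u.value);
        }
        return state;
    }

    public static void main(String[] args) {
        Update older = new Update(1, "old-name", 1L);
        Update newer = new Update(1, "new-name", 2L);
        List<Update> taskA = Arrays.asList(older, newer);
        List<Update> taskB = Arrays.asList(newer, older);

        System.out.println(lastArrivalWins(taskA).equals(lastArrivalWins(taskB)));       // prints false
        System.out.println(highestVersionWins(taskA).equals(highestVersionWins(taskB))); // prints true
    }
}
```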

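The checkpoint-size and restore rules can be illustrated with simple arithmetic in plain Java. This is a sketch, not Flink's implementation: the class CheckpointSketch and both methods are hypothetical, and the exact round-robin mapping used for the extra tasks on scale-up is an assumption about what "round-robin" means here.

```java
import java.util.Arrays;

// Plain-Java sketch of checkpoint size and restore assignment for broadcast state (not Flink code).
class CheckpointSketch {

    // Every task checkpoints its own copy, so the total grows by a factor of p (= parallelism).
    static long checkpointedBytes(long broadcastStateBytes, int parallelism) {
        return broadcastStateBytes * parallelism;
    }

    // For each new task index, returns the old task whose checkpointed state it reads.
    // Tasks that existed before read their own state; the additional tasks cycle over
    // the old ones (assumed mapping, chosen for illustration).
    static int[] restoreSources(int oldParallelism, int newParallelism) {
        int[] source = new int[newParallelism];
        for (int i = 0; i < newParallelism; i++) {
            source[i] = i < oldParallelism ? i : (i - oldParallelism) % oldParallelism;
        }
        return source;
    }

    public static void main(String[] args) {
        // A 10 MB table at parallelism 4 takes about 40 MB in the checkpoint.
        System.out.println(checkpointedBytes(10L * 1024 * 1024, 4)); // prints 41943040
        // Scaling up from 2 to 5 tasks.
        System.out.println(Arrays.toString(restoreSources(2, 5))); // prints [0, 1, 0, 1, 0]
        // Scaling down from 4 to 2 tasks: each task reads its own state.
        System.out.println(Arrays.toString(restoreSources(4, 2))); // prints [0, 1]
    }
}
```

The same factor of p applies to runtime memory, since each task keeps its copy of the broadcast state on the heap.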

5. Data and results

5.1 Data preparation

MySQL: [image not preserved]

Kafka: [image not preserved]

print: [image not preserved]

For the full data and code, see GitHub (if you can, please give it a star ⭐️, thank you for the support):

non keyed broadcast state: https://github.com/BiGsuw/flink-learning/blob/main/src/main/java/com/flink/demo/broadcast/BroadcastMain.java

keyed broadcast state:

https://github.com/BiGsuw/flink-learning/blob/main/src/main/java/com/flink/demo/broadcast/KeyBroadcastMain.java

For the exception "The requested state does not exist. Check for typos in your state descriptor, or specify the", see my other blog post:

[FLINK] The requested state does not exist. Check for typos in your state descriptor, or specify the (Zsigner's blog, CSDN)

References:

A Practical Guide to Broadcast State in Apache Flink
The Broadcast State Pattern

Flink Tips (6): 4 Things to Note When Using Broadcast State
