Flink Interval Join: How It Works, and What to Do When Records Fail to Match

青云直上入广寒

Last modified 2023-08-09 14:52:31

Tags: flink, big data

First published 2023-08-09 10:23:20

Article link: https://blog.csdn.net/weixin_47950169/article/details/132182395

How IntervalJoin works:

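We specify two time offsets, the lower bound (lowerBound) and the upper bound (upperBound). For any element a of one stream (call it A), they define a time interval [a.timestamp + lowerBound, a.timestamp + upperBound]: a closed interval anchored at a's timestamp that runs from the lower-bound point to the upper-bound point. This interval is the "window" within which a can match data from the other stream. For an element b of the other stream (call it B), if b's timestamp falls inside the interval, a and b pair up and are passed to the computation that produces the output. The matching condition is therefore: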

a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound

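Note that the two streams A and B being interval-joined must be keyed on the same key; lowerBound must be less than or equal to upperBound, and either may be positive or negative; interval join currently supports only event-time semantics.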
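To make the condition concrete, here is a minimal plain-Java sketch of the predicate. The class name `IntervalMatch` and the method `matches` are made up for illustration and are not part of Flink; timestamps and bounds are assumed to be in milliseconds.

```java
/** Illustration only: the interval-join matching condition as a plain predicate. */
final class IntervalMatch {

    /** True when bTs lies inside the closed interval [aTs + lowerBound, aTs + upperBound]. */
    static boolean matches(long aTs, long bTs, long lowerBound, long upperBound) {
        if (lowerBound > upperBound) {
            throw new IllegalArgumentException("lowerBound must be <= upperBound");
        }
        return aTs + lowerBound <= bTs && bTs <= aTs + upperBound;
    }

    public static void main(String[] args) {
        // a at t = 10_000 ms with bounds -2 s / +3 s gives the window [8_000, 13_000].
        System.out.println(matches(10_000L, 8_000L, -2_000L, 3_000L));  // true: lower edge is inclusive
        System.out.println(matches(10_000L, 13_001L, -2_000L, 3_000L)); // false: 1 ms past the upper edge
    }
}
```

Because both bounds may be negative, the interval does not have to contain a's own timestamp; for example, bounds of -5 s and -1 s match only B records that are 1 to 5 seconds older than a.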

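Summary of how the Interval-join source code executes:

0: Connect the two streams (connect), then keyBy()

① Maintain two MapState instances that hold the left-stream and right-stream data respectively:

② Check whether the record's timestamp makes it late data.

If it is not late:

A record arriving on stream A is stored with its timestamp as the state key, and the record itself is appended to the List that is the state value --> leftBuffer

A record arriving on stream B is stored with its timestamp as the state key, and the record itself is appended to the List that is the state value --> rightBuffer

③ Join the record against the buffered data of the other stream
④ Apply the join logic and emit the joined result downstream

⑤ Register an event-time cleanup timer derived from the upper bound; when the timer fires, the buffered data is removed ---> Interval-join cleans up its own state, so no TTL needs to be configured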
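The steps above can be sketched as a small plain-Java simulation for a single key. This is not Flink code: `IntervalJoinSketch` and all of its members are hypothetical, `TreeMap` stands in for `MapState`, the watermark is advanced by hand, and eviction is done directly in `advanceWatermark` instead of through timers. Late records are collected in a list here, whereas the operator routes them to a side output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of steps 1 to 5 for ONE key; an illustration, not Flink code. */
final class IntervalJoinSketch {
    private final long lowerBound;
    private final long upperBound;
    // timestamp -> records carrying that timestamp (stand-ins for leftBuffer / rightBuffer)
    private final TreeMap<Long, List<String>> leftBuffer = new TreeMap<>();
    private final TreeMap<Long, List<String>> rightBuffer = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;
    final List<String> output = new ArrayList<>();
    final List<String> late = new ArrayList<>();

    IntervalJoinSketch(long lowerBound, long upperBound) {
        if (lowerBound > upperBound) {
            throw new IllegalArgumentException("lowerBound must be <= upperBound");
        }
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    void onLeft(String value, long ts) {
        process(value, ts, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }

    void onRight(String value, long ts) {
        // Seen from the right stream, the bounds are mirrored.
        process(value, ts, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }

    private void process(String value, long ts,
                         TreeMap<Long, List<String>> ours, TreeMap<Long, List<String>> other,
                         long relLower, long relUpper, boolean isLeft) {
        // Step 2: a record whose timestamp is behind the watermark is late and is not joined.
        if (watermark != Long.MIN_VALUE && ts < watermark) {
            late.add(value);
            return;
        }
        // Buffer the record under its timestamp; several records may share one timestamp.
        ours.computeIfAbsent(ts, k -> new ArrayList<>()).add(value);
        // Steps 3 and 4: join against every buffered record of the other stream in the interval.
        for (Map.Entry<Long, List<String>> bucket : other.entrySet()) {
            long otherTs = bucket.getKey();
            if (otherTs < ts + relLower || otherTs > ts + relUpper) {
                continue;
            }
            for (String otherValue : bucket.getValue()) {
                output.add(isLeft ? value + "|" + otherValue : otherValue + "|" + value);
            }
        }
    }

    /** Step 5: evict buffered entries whose cleanup time is at or before the new watermark. */
    void advanceWatermark(long newWatermark) {
        watermark = newWatermark;
        leftBuffer.headMap(newWatermark - Math.max(upperBound, 0L), true).clear();
        rightBuffer.headMap(newWatermark - Math.max(-lowerBound, 0L), true).clear();
    }

    int bufferedLeft() {
        return leftBuffer.size();
    }

    public static void main(String[] args) {
        IntervalJoinSketch join = new IntervalJoinSketch(-2_000L, 3_000L);
        join.onLeft("a1", 10_000L);
        join.onRight("b1", 12_000L);    // inside [8_000, 13_000], so a1|b1 is emitted
        join.onRight("b2", 14_000L);    // outside the interval, nothing emitted
        join.advanceWatermark(13_000L); // a1 is evicted: 10_000 + 3_000 <= 13_000
        join.onRight("b3", 12_500L);    // behind the watermark, treated as late
        System.out.println(join.output + " late=" + join.late);
    }
}
```

The last call shows one common reason for records failing to match: once the watermark has moved past a record's timestamp, the record is classified as late and never reaches the join loop.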

On to the source code:

The two states maintained by the source, and how they are initialized

private transient MapState<Long, List<BufferEntry<T1>>> leftBuffer;
private transient MapState<Long, List<BufferEntry<T2>>> rightBuffer;

public void initializeState(StateInitializationContext context) throws Exception {
        super.initializeState(context);
        // Build the left-stream buffer: a keyed MapState whose key is the timestamp; the value is a List because several records can arrive with the same timestamp
        this.leftBuffer =
                context.getKeyedStateStore()
                        .getMapState(
                                new MapStateDescriptor<>(
                                        LEFT_BUFFER,
                                        LongSerializer.INSTANCE,
                                        new ListSerializer<>(
                                                new BufferEntrySerializer<>(leftTypeSerializer))));
        // Build the right-stream buffer: a keyed MapState whose key is the timestamp; the value is a List because several records can arrive with the same timestamp
        this.rightBuffer =
                context.getKeyedStateStore()
                        .getMapState(
                                new MapStateDescriptor<>(
                                        RIGHT_BUFFER,
                                        LongSerializer.INSTANCE,
                                        new ListSerializer<>(
                                                new BufferEntrySerializer<>(rightTypeSerializer))));
    }

Processing of the two streams:

public void processElement1(StreamRecord<T1> record) throws Exception {
        processElement(record, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }
public void processElement2(StreamRecord<T2> record) throws Exception {
        processElement(record, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }
private <THIS, OTHER> void processElement(
            final StreamRecord<THIS> record,
            final MapState<Long, List<IntervalJoinOperator.BufferEntry<THIS>>> ourBuffer,
            final MapState<Long, List<IntervalJoinOperator.BufferEntry<OTHER>>> otherBuffer,
            final long relativeLowerBound,
            final long relativeUpperBound,
            final boolean isLeft)
            throws Exception {

        final THIS ourValue = record.getValue();
        final long ourTimestamp = record.getTimestamp();

        if (ourTimestamp == Long.MIN_VALUE) {
            throw new FlinkException(
                    "Long.MIN_VALUE timestamp: Elements used in "
                            + "interval stream joins need to have timestamps meaningful timestamps.");
        }

        if (isLate(ourTimestamp)) {
            sideOutput(ourValue, ourTimestamp, isLeft);
            return;
        }

        addToBuffer(ourBuffer, ourValue, ourTimestamp);

        for (Map.Entry<Long, List<BufferEntry<OTHER>>> bucket : otherBuffer.entries()) {
            final long timestamp = bucket.getKey();

            if (timestamp < ourTimestamp + relativeLowerBound
                    || timestamp > ourTimestamp + relativeUpperBound) {
                continue;
            }

            for (BufferEntry<OTHER> entry : bucket.getValue()) {
                if (isLeft) {
                    collect((T1) ourValue, (T2) entry.element, ourTimestamp, timestamp);
                } else {
                    collect((T1) entry.element, (T2) ourValue, timestamp, ourTimestamp);
                }
            }
        }
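        // Compute the time at which this record can no longer match anything and register an event-time timer for it; the buffered data is cleaned up when the timer fires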
        long cleanupTime =
                (relativeUpperBound > 0L) ? ourTimestamp + relativeUpperBound : ourTimestamp;
        if (isLeft) {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_LEFT, cleanupTime);
        } else {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_RIGHT, cleanupTime);
        }
    }
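Two details of the listing above are easy to miss: `processElement2` passes `-upperBound` and `-lowerBound`, and the cleanup time depends on the sign of the relative upper bound. The following plain-Java sketch checks both by brute force. The class `IntervalJoinBounds` and its methods are hypothetical names chosen for illustration; only the `cleanupTime` expression is copied from the listing.

```java
/** Illustration only: mirrored bounds and cleanup time, checked on sample timestamps. */
final class IntervalJoinBounds {

    /** The matching condition as seen from a left record a. */
    static boolean fromLeft(long aTs, long bTs, long lower, long upper) {
        return bTs >= aTs + lower && bTs <= aTs + upper;
    }

    /** The same condition as seen from a right record b, using the mirrored bounds. */
    static boolean fromRight(long aTs, long bTs, long lower, long upper) {
        long relativeLower = -upper;
        long relativeUpper = -lower;
        return aTs >= bTs + relativeLower && aTs <= bTs + relativeUpper;
    }

    /** Same expression as cleanupTime in processElement. */
    static long cleanupTime(long ts, long relativeUpperBound) {
        return (relativeUpperBound > 0L) ? ts + relativeUpperBound : ts;
    }

    public static void main(String[] args) {
        long lower = -2_000L;
        long upper = 3_000L;
        boolean agree = true;
        for (long a = 0; a <= 10_000; a += 500) {
            for (long b = 0; b <= 10_000; b += 500) {
                agree &= fromLeft(a, b, lower, upper) == fromRight(a, b, lower, upper);
            }
        }
        System.out.println("both views agree: " + agree);
        // A left record at 10_000 can still match right records up to 13_000.
        System.out.println("left cleanup:  " + cleanupTime(10_000L, upper));
        // A right record at 12_000 can still match left records up to 12_000 - lower = 14_000.
        System.out.println("right cleanup: " + cleanupTime(12_000L, -lower));
    }
}
```

When the relative upper bound is zero or negative, every matching record of the other stream has a timestamp no greater than the current record's own, so the record can be dropped as soon as the watermark reaches its own timestamp.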

Screenshot of the code as used in practice: