Flink TopN Source Code

orange大数据技术探索者

Last modified: 2024-09-12 21:21:14

First published: 2024-03-17 17:47:26

Article link: https://blog.csdn.net/weixin_43283487/article/details/136785521

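First, go to the system functions used in the Flink source (org.apache.calcite.sql.fun.SqlStdOperatorTable) and look for the ROW_NUMBER keyword.

You will then find that the FlinkLogicalRankRuleBase class references it.

FlinkLogicalRankRuleBase is a Calcite rule, and sure enough it determines the rank type from the type of the function passed in.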
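The idea of choosing a rank type from the function type can be pictured with a small standalone sketch. This is not Flink's code: the enums `FunctionKind` and `RankType` and the method `toRankType` are hypothetical names, assumed here only to illustrate the mapping for ROW_NUMBER, RANK and DENSE_RANK.

```java
class RankTypeSketch {
    // Hypothetical stand-in for the kind of function found in the OVER clause.
    enum FunctionKind { ROW_NUMBER, RANK, DENSE_RANK, OTHER }

    // Hypothetical stand-in for the rank type carried by the logical rank node.
    enum RankType { ROW_NUMBER, RANK, DENSE_RANK }

    // Maps the function kind to a rank type; anything else is not a ranking function.
    static RankType toRankType(FunctionKind kind) {
        switch (kind) {
            case ROW_NUMBER:
                return RankType.ROW_NUMBER;
            case RANK:
                return RankType.RANK;
            case DENSE_RANK:
                return RankType.DENSE_RANK;
            default:
                throw new IllegalArgumentException("Not a ranking function: " + kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(toRankType(FunctionKind.ROW_NUMBER));
    }
}
```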

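Since the result is a RelNode, there must be further Calcite rules that handle it. Look in the packages below (they exist in both the java and scala source trees, so make sure you are in the right one):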

org.apache.flink.table.planner.plan.rules.physical.stream
org.apache.flink.table.planner.plan.nodes.physical.stream

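StreamPhysicalRankRule and StreamPhysicalRank are clearly the matching classes; the node is converted into an ExecNode through translateToExecNode().

Next, look in the org.apache.flink.table.planner.plan.nodes.exec.stream package, which contains StreamExecRank. This is where the core code lives.

Find translateToPlanInternal() in the StreamExecRank class; this method is the main thing to study.

It turns out that three kinds of input stream map to different logic: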

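AppendFastStrategy (the input contains only inserts)
RetractStrategy (the input contains updates and deletes)
UpdateFastStrategy (the input should not contain deletes, has the given primaryKeys, and is sorted by the field)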
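The three conditions above can be read as a simple decision. The sketch below is a deliberate simplification and not the planner's actual logic: the `Strategy` enum, the `choose` method, and its boolean flags are hypothetical names that only restate the conditions listed above.

```java
class RankStrategySketch {
    // Hypothetical labels for the three strategies named in the text.
    enum Strategy { APPEND_FAST, UPDATE_FAST, RETRACT }

    // Restates the listed conditions as a decision over assumed input properties.
    static Strategy choose(boolean insertOnly, boolean hasDeletes,
                           boolean hasPrimaryKeys, boolean sortedByField) {
        if (insertOnly) {
            return Strategy.APPEND_FAST;
        }
        if (!hasDeletes && hasPrimaryKeys && sortedByField) {
            return Strategy.UPDATE_FAST;
        }
        // Everything else falls back to the general retract handling.
        return Strategy.RETRACT;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, true, true, true));
    }
}
```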

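We will mainly look at the retract stream, RetractStrategy.

First, a comparator for ordering RowData, ComparableRecordComparator, is obtained from the sort fields; a RetractableTopNFunction is then created from that comparator.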
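To make "a comparator built from the sort fields" concrete, here is a minimal sketch using plain `java.util.Comparator`. It is not how ComparableRecordComparator is implemented (that class works on RowData); the `build` method and the use of `Object[]` rows with `Comparable` fields are assumptions made for illustration.

```java
import java.util.Comparator;

class SortKeyComparatorSketch {
    // Builds a comparator over rows from sort field indexes and sort directions.
    // Rows are modeled as Object[] whose sort fields implement Comparable.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Comparator<Object[]> build(int[] fields, boolean[] ascending) {
        Comparator<Object[]> result = (a, b) -> 0;
        for (int i = 0; i < fields.length; i++) {
            final int idx = fields[i];
            Comparator<Object[]> byField =
                    (a, b) -> ((Comparable) a[idx]).compareTo(b[idx]);
            // Later fields only break ties left by earlier ones.
            result = result.thenComparing(ascending[i] ? byField : byField.reversed());
        }
        return result;
    }

    public static void main(String[] args) {
        // Order by field 1 descending, as in ORDER BY price DESC.
        Comparator<Object[]> cmp = build(new int[] {1}, new boolean[] {false});
        System.out.println(cmp.compare(new Object[] {"a", 10}, new Object[] {"b", 5}));
    }
}
```

Such a comparator can then be handed to a `TreeMap`, which is what keeps the sort keys in rank order.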

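The RetractableTopNFunction class keeps its state in a map and a TreeMap: dataState maps each sort key to the list of records with that key, and treeMap maps each sort key to the number of such records.

Next, let's see what processElement() does.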

    public void processElement(RowData input, Context ctx, Collector<RowData> out)
            throws Exception {
        initRankEnd(input);
        SortedMap<RowData, Long> sortedMap = treeMap.value();
        if (sortedMap == null) {
            sortedMap = new TreeMap<>(sortKeyComparator);
        }
        RowData sortKey = sortKeySelector.getKey(input);
        boolean isAccumulate = RowDataUtil.isAccumulateMsg(input);
        input.setRowKind(RowKind.INSERT); // erase row kind for further state accessing
        if (isAccumulate) {
            // update sortedMap
            if (sortedMap.containsKey(sortKey)) {
                sortedMap.put(sortKey, sortedMap.get(sortKey) + 1);
            } else {
                sortedMap.put(sortKey, 1L);
            }

            // emit
            if (outputRankNumber || hasOffset()) {
                // the without-number-algorithm can't handle topN with offset,
                // so use the with-number-algorithm to handle offset
                emitRecordsWithRowNumber(sortedMap, sortKey, input, out);
            } else {
                emitRecordsWithoutRowNumber(sortedMap, sortKey, input, out);
            }
            // update data state
            List<RowData> inputs = dataState.get(sortKey);
            if (inputs == null) {
                // the sort key is never seen
                inputs = new ArrayList<>();
            }
            inputs.add(input);
            dataState.put(sortKey, inputs);
        } else {
            final boolean stateRemoved;
            // emit updates first
            if (outputRankNumber || hasOffset()) {
                // the without-number-algorithm can't handle topN with offset,
                // so use the with-number-algorithm to handle offset
                stateRemoved = retractRecordWithRowNumber(sortedMap, sortKey, input, out);
            } else {
                stateRemoved = retractRecordWithoutRowNumber(sortedMap, sortKey, input, out);
            }

            // and then update sortedMap
            if (sortedMap.containsKey(sortKey)) {
                long count = sortedMap.get(sortKey) - 1;
                if (count == 0) {
                    sortedMap.remove(sortKey);
                } else {
                    sortedMap.put(sortKey, count);
                }
            } else {
                stateStaledErrorHandle();
            }

            if (!stateRemoved) {
                // the input record has not been removed from state
                // should update the data state
                List<RowData> inputs = dataState.get(sortKey);
                if (inputs != null) {
                    // comparing record by equaliser
                    Iterator<RowData> inputsIter = inputs.iterator();
                    while (inputsIter.hasNext()) {
                        if (equaliser.equals(inputsIter.next(), input)) {
                            inputsIter.remove();
                            break;
                        }
                    }
                    if (inputs.isEmpty()) {
                        dataState.remove(sortKey);
                    } else {
                        dataState.put(sortKey, inputs);
                    }
                }
            }
        }
        treeMap.update(sortedMap);
    }

In essence, all of this is just a set of operations on the TreeMap.
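Stripped of Flink's state API and of the emitting logic, the bookkeeping in processElement() can be sketched with plain collections. This is a simplified model, not Flink's code: the class `TopNStateSketch`, the Integer sort keys and the String records are assumptions, and throwing on a missing key merely stands in for stateStaledErrorHandle().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopNStateSketch {
    // sort key -> number of records with that key (plays the role of treeMap)
    final TreeMap<Integer, Long> counts = new TreeMap<>();
    // sort key -> records with that key (plays the role of dataState)
    final Map<Integer, List<String>> data = new HashMap<>();

    // Accumulate message: bump the count and remember the record.
    void accumulate(int sortKey, String record) {
        counts.merge(sortKey, 1L, Long::sum);
        data.computeIfAbsent(sortKey, k -> new ArrayList<>()).add(record);
    }

    // Retract message: decrease the count and drop one matching record.
    void retract(int sortKey, String record) {
        Long count = counts.get(sortKey);
        if (count == null) {
            throw new IllegalStateException("No state for sort key " + sortKey);
        }
        if (count == 1L) {
            counts.remove(sortKey);
        } else {
            counts.put(sortKey, count - 1);
        }
        List<String> records = data.get(sortKey);
        if (records != null) {
            records.remove(record); // removes only the first equal record
            if (records.isEmpty()) {
                data.remove(sortKey);
            }
        }
    }
}
```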

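When the incoming record is an INSERT, it is first put into treeMap.

treeMap is then traversed in order. When the traversal reaches a key equal to the current record's key, all records with that key (the list in dataState) are retracted and their rowNumber is updated to rowNumber + 1.

The traversal then continues: every later record is retracted with an UpdateBefore, and an UpdateAfter with rowNumber + 1 is sent downstream. The loop ends once the N-th record of the TopN has been reached.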
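The fan-out caused by a single insert can be simulated with a small model. This sketch is not RetractableTopNFunction itself: it assumes distinct Integer sort keys in ascending order, no offset, and row numbers in the output, and the class name, method name and message strings are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class InsertEmitSketch {
    // sorted: current rows in rank order (distinct keys, ascending); modified in place.
    // Returns the changelog messages sent downstream for one inserted key.
    static List<String> insert(List<Integer> sorted, int newKey, int topN) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < sorted.size() && sorted.get(pos) < newKey) {
            pos++;
        }
        int current = newKey; // row that now belongs at the rank being visited
        int rank = pos + 1;
        for (int i = pos; i < sorted.size() && rank <= topN; i++, rank++) {
            int prev = sorted.get(i);
            out.add("-U(" + prev + ", rank=" + rank + ")");    // retract old holder of this rank
            out.add("+U(" + current + ", rank=" + rank + ")"); // new holder of this rank
            current = prev; // the old holder moves down by one
        }
        if (rank <= topN) {
            // The TopN was not full: the last shifted row takes a new rank.
            out.add("+I(" + current + ", rank=" + rank + ")");
        }
        sorted.add(pos, newKey);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>(List.of(10, 20, 30));
        insert(rows, 5, 3).forEach(System.out::println);
    }
}
```

With rows 10, 20, 30 and N = 3, inserting 5 at the top produces six update messages from one input record, and 30 falls out of the TopN.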

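When the incoming record is a DELETE, the process is the reverse of INSERT: all records after the current key are retracted, and their rowNumber is decreased by 1.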
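The delete path can be simulated in the same spirit. Again this is a simplified model rather than Flink's implementation: it assumes distinct Integer sort keys in ascending order, no offset, and row numbers in the output, and the names and message strings are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

class DeleteEmitSketch {
    // sorted: current rows in rank order (distinct keys, ascending); modified in place.
    // Returns the changelog messages sent downstream for one deleted key.
    static List<String> delete(List<Integer> sorted, int key, int topN) {
        List<String> out = new ArrayList<>();
        int pos = sorted.indexOf(key);
        if (pos < 0) {
            throw new IllegalStateException("Key not found in state: " + key);
        }
        int prev = key; // row that held the rank being visited
        int rank = pos + 1;
        for (int i = pos + 1; i < sorted.size() && rank <= topN; i++, rank++) {
            int next = sorted.get(i);
            out.add("-U(" + prev + ", rank=" + rank + ")"); // retract old holder of this rank
            out.add("+U(" + next + ", rank=" + rank + ")"); // next row moves up by one
            prev = next;
        }
        if (rank <= topN) {
            // No row is left to fill this rank, so it disappears.
            out.add("-D(" + prev + ", rank=" + rank + ")");
        }
        sorted.remove(pos); // removes by index
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>(List.of(5, 10, 20, 30));
        delete(rows, 5, 3).forEach(System.out::println);
    }
}
```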

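That is roughly the whole processing flow. As you can see, when rowNumber is emitted, N is large, and the ordering changes frequently, the performance cost is very high: a single change near the top can trigger updates for every row ranked below it, so in extreme cases the volume of data sent downstream is multiplied many times over.

Note: I had previously read the source of Spark's batch window functions, but I did not expect Flink's TopN code to be this different.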
