Storm Source Code Analysis (Part 7)

王储公子

Article link: https://blog.csdn.net/qq_45849855/article/details/121238122

2021SC@SDUSC

Bolt Source Code Analysis (Part 2)

Overview of the JoinBolt source

JoinBolt extends BaseWindowedBolt. It defines fields such as Selector selectorType, LinkedHashMap<String, JoinInfo> joinCriteria, and FieldSelector[] outputFields, which record the selector type and the join relationships between streams.

The join and leftJoin methods declare the join relationships. Both end up calling joinCommon, which describes each relationship as a JoinInfo object and stores it in joinCriteria.
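The way join and leftJoin funnel into one common method can be illustrated with a small standalone sketch. It does not use the Storm classes: JoinCriteriaSketch and JoinSpec are hypothetical stand-ins for JoinBolt and JoinInfo, and the two validation checks are assumptions of the sketch rather than a copy of the original code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for JoinBolt's fluent join configuration (not the Storm class).
class JoinCriteriaSketch {
    enum JoinType { INNER, LEFT }

    // Simplified stand-in for JoinInfo: the join field, the stream joined against, and the join type.
    static final class JoinSpec {
        final String field;
        final String priorStream; // null for the first stream
        final JoinType type;      // null for the first stream

        JoinSpec(String field, String priorStream, JoinType type) {
            this.field = field;
            this.priorStream = priorStream;
            this.type = type;
        }
    }

    // A LinkedHashMap keeps the streams in declaration order, which is also the join order.
    private final Map<String, JoinSpec> joinCriteria = new LinkedHashMap<>();

    JoinCriteriaSketch(String firstStream, String field) {
        joinCriteria.put(firstStream, new JoinSpec(field, null, null));
    }

    JoinCriteriaSketch join(String newStream, String field, String priorStream) {
        return joinCommon(newStream, field, priorStream, JoinType.INNER);
    }

    JoinCriteriaSketch leftJoin(String newStream, String field, String priorStream) {
        return joinCommon(newStream, field, priorStream, JoinType.LEFT);
    }

    // Both public methods delegate here; only the join type differs.
    private JoinCriteriaSketch joinCommon(String newStream, String field, String priorStream, JoinType type) {
        if (joinCriteria.containsKey(newStream)) {
            throw new IllegalArgumentException("'" + newStream + "' is already part of the join");
        }
        if (!joinCriteria.containsKey(priorStream)) {
            throw new IllegalArgumentException("'" + priorStream + "' was not declared earlier");
        }
        joinCriteria.put(newStream, new JoinSpec(field, priorStream, type));
        return this;
    }

    List<String> streamOrder() {
        return new ArrayList<>(joinCriteria.keySet());
    }

    JoinType typeOf(String stream) {
        return joinCriteria.get(stream).type;
    }
}
```

Chaining `new JoinCriteriaSketch("orders", "userId").join("users", "id", "orders").leftJoin("coupons", "userId", "users")` records three streams in that order; the stream names and fields here are made up for illustration.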

The select method chooses the columns of the result set. The selection is stored in outputFields, which declareOutputFields later uses.

execute holds the core join logic; it calls the hashJoin method.

JoinBolt.java

The hashJoin method

    protected JoinAccumulator hashJoin(List<Tuple> tuples) {
        clearHashedInputs();

        JoinAccumulator probe = new JoinAccumulator();

        // 1) Build phase - Segregate tuples in the Window into streams.
        //    First stream's tuples go into probe, rest into HashMaps in hashedInputs
        String firstStream = joinCriteria.keySet().iterator().next();
        for (Tuple tuple : tuples) {
            String streamId = getStreamSelector(tuple);
            if (!streamId.equals(firstStream)) {
                Object field = getJoinField(streamId, tuple);
                ArrayList<Tuple> recs = hashedInputs.get(streamId).get(field);
                if (recs == null) {
                    recs = new ArrayList<Tuple>();
                    hashedInputs.get(streamId).put(field, recs);
                }
                recs.add(tuple);

            } else {
                ResultRecord probeRecord = new ResultRecord(tuple, joinCriteria.size() == 1);
                probe.insert(probeRecord);  // first stream's data goes into the probe
            }
        }

        // 2) Join the streams in order of streamJoinOrder
        int i = 0;
        for (String streamName : joinCriteria.keySet()) {
            boolean finalJoin = (i == joinCriteria.size() - 1);
            if (i > 0) {
                probe = doJoin(probe, hashedInputs.get(streamName), joinCriteria.get(streamName), finalJoin);
            }
            ++i;
        }
        return probe;
    }

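hashJoin first walks the tuples once and splits them into two groups: tuples from firstStream go into the JoinAccumulator probe, and all others go into HashMap<String, HashMap<Object, ArrayList<Tuple>>> hashedInputs, keyed by stream and then by join field value. It then iterates over the remaining streams in declaration order, calling doJoin for each one and replacing probe with the result.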
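The build phase can be reproduced in isolation with plain collections. The following is a minimal sketch, not the Storm code: Rec is a hypothetical stand-in for Tuple, and the per-stream maps are created lazily with computeIfAbsent, whereas the listing above expects hashedInputs to already contain an entry for every declared stream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BuildPhaseSketch {
    // Hypothetical stand-in for a Storm Tuple: source stream, join key, and a payload.
    static final class Rec {
        final String stream;
        final Object key;
        final String payload;

        Rec(String stream, Object key, String payload) {
            this.stream = stream;
            this.key = key;
            this.payload = payload;
        }
    }

    // Records of the first stream, scanned later as the probe side.
    final List<Rec> probe = new ArrayList<>();
    // stream id -> (join key -> records with that key), the build side.
    final Map<String, Map<Object, List<Rec>>> hashedInputs = new HashMap<>();

    void build(String firstStream, List<Rec> window) {
        probe.clear();
        hashedInputs.clear();
        for (Rec rec : window) {
            if (rec.stream.equals(firstStream)) {
                probe.add(rec);
            } else {
                hashedInputs
                        .computeIfAbsent(rec.stream, s -> new HashMap<>())
                        .computeIfAbsent(rec.key, k -> new ArrayList<>())
                        .add(rec);
            }
        }
    }
}
```

Grouping the non-probe streams by key up front is what lets each later join step look up matches with a single hash lookup instead of a nested scan.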

The doJoin method

    protected JoinAccumulator doJoin(JoinAccumulator probe, HashMap<Object, ArrayList<Tuple>> buildInput, JoinInfo joinInfo,
                                     boolean finalJoin) {
        final JoinType joinType = joinInfo.getJoinType();
        switch (joinType) {
            case INNER:
                return doInnerJoin(probe, buildInput, joinInfo, finalJoin);
            case LEFT:
                return doLeftJoin(probe, buildInput, joinInfo, finalJoin);
            case RIGHT:
            case OUTER:
            default:
                throw new RuntimeException("Unsupported join type : " + joinType.name());
        }
    }

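doJoin dispatches on the join type. Only INNER and LEFT are implemented at present, handled by doInnerJoin and doLeftJoin respectively; RIGHT and OUTER fall through to the default branch and throw a RuntimeException.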

doInnerJoin

    protected JoinAccumulator doInnerJoin(JoinAccumulator probe, Map<Object, ArrayList<Tuple>> buildInput, JoinInfo joinInfo,
                                          boolean finalJoin) {
        String[] probeKeyName = joinInfo.getOtherField();
        JoinAccumulator result = new JoinAccumulator();
        FieldSelector fieldSelector = new FieldSelector(joinInfo.other.getStreamName(), probeKeyName);
        for (ResultRecord rec : probe.getRecords()) {
            Object probeKey = rec.getField(fieldSelector);
            if (probeKey != null) {
                ArrayList<Tuple> matchingBuildRecs = buildInput.get(probeKey);
                if (matchingBuildRecs != null) {
                    for (Tuple matchingRec : matchingBuildRecs) {
                        ResultRecord mergedRecord = new ResultRecord(rec, matchingRec, finalJoin);
                        result.insert(mergedRecord);
                    }
                }
            }
        }
        return result;
    }

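The method iterates over the records in the JoinAccumulator probe, uses each record's probeKey to look up matching records in buildInput, and merges every match it finds into a new ResultRecord. Probe records with a null key or with no match are dropped.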
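The probe loop is an ordinary hash join, which the following standalone sketch shows with plain collections. The row layout (a two-element array of key and value) and the "+" merge format are assumptions made for the sketch; the real code merges Tuples into ResultRecord objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class InnerJoinSketch {
    // probe rows are {key, value}; build maps a key to the values of the other stream.
    static List<String> innerJoin(List<String[]> probe, Map<String, List<String>> build) {
        List<String> result = new ArrayList<>();
        for (String[] row : probe) {
            String key = row[0];
            if (key == null) {
                continue; // rows without a join key cannot match anything
            }
            List<String> matches = build.get(key);
            if (matches == null) {
                continue; // inner join: unmatched probe rows are dropped
            }
            for (String match : matches) {
                result.add(row[1] + "+" + match); // one output row per matching pair
            }
        }
        return result;
    }
}
```

A probe row that matches several build records yields several output rows, so the result can be larger than either input.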

doLeftJoin

    protected JoinAccumulator doLeftJoin(JoinAccumulator probe, Map<Object, ArrayList<Tuple>> buildInput, JoinInfo joinInfo,
                                         boolean finalJoin) {
        String[] probeKeyName = joinInfo.getOtherField();
        JoinAccumulator result = new JoinAccumulator();
        FieldSelector fieldSelector = new FieldSelector(joinInfo.other.getStreamName(), probeKeyName);
        for (ResultRecord rec : probe.getRecords()) {
            Object probeKey = rec.getField(fieldSelector);
            if (probeKey != null) {
                ArrayList<Tuple> matchingBuildRecs = buildInput.get(probeKey); // ok if its return null
                if (matchingBuildRecs != null && !matchingBuildRecs.isEmpty()) {
                    for (Tuple matchingRec : matchingBuildRecs) {
                        ResultRecord mergedRecord = new ResultRecord(rec, matchingRec, finalJoin);
                        result.insert(mergedRecord);
                    }
                } else {
                    ResultRecord mergedRecord = new ResultRecord(rec, null, finalJoin);
                    result.insert(mergedRecord);
                }

            }
        }
        return result;
    }

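The only difference between the left join and the inner join is that when no matching record is found, the left-side record is still kept, merged with null in place of the missing right side.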
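The difference is easy to see when both variants run on the same input. This is a minimal sketch with plain collections, where a hypothetical keepUnmatched flag switches between the two behaviours. As in the listing above, rows whose join key is null are skipped in both variants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LeftJoinSketch {
    // probe rows are {key, value}; each result row is {probeValue, buildValue or null}.
    static List<String[]> join(List<String[]> probe, Map<String, List<String>> build, boolean keepUnmatched) {
        List<String[]> result = new ArrayList<>();
        for (String[] row : probe) {
            String key = row[0];
            if (key == null) {
                continue; // skipped even for the left join, mirroring the probeKey != null check
            }
            List<String> matches = build.get(key);
            if (matches != null && !matches.isEmpty()) {
                for (String match : matches) {
                    result.add(new String[] {row[1], match});
                }
            } else if (keepUnmatched) {
                result.add(new String[] {row[1], null}); // left join keeps the left row
            }
        }
        return result;
    }
}
```

Calling `join(probe, build, false)` behaves like the inner join, while `join(probe, build, true)` additionally emits one row with a null right side for every unmatched left row.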

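Summary
JoinBolt extends BaseWindowedBolt. It currently supports only inner join and left join, and it requires the join field to be the same field that is used for fieldsGrouping.

To merge data from several streams, JoinBolt breaks the multi-stream join into a sequence of two-way joins: it loops over the streams, calls doJoin for each one, and keeps accumulating the intermediate result in a JoinAccumulator.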
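The accumulate-and-rejoin loop can be sketched without Storm as a fold over a list of build tables. Everything here is an assumption made for illustration: rows are modelled as field-name-to-value maps, every step is an inner join, and the helper names index, joinStep, joinAll, and row are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ChainedJoinSketch {
    // Groups the rows of one stream by the value of the given field (the build side).
    static Map<Object, List<Map<String, Object>>> index(List<Map<String, Object>> rows, String field) {
        Map<Object, List<Map<String, Object>>> table = new HashMap<>();
        for (Map<String, Object> row : rows) {
            table.computeIfAbsent(row.get(field), k -> new ArrayList<>()).add(row);
        }
        return table;
    }

    // One two-way inner join between the accumulated rows and one build table.
    static List<Map<String, Object>> joinStep(List<Map<String, Object>> acc, String probeField,
                                              Map<Object, List<Map<String, Object>>> build) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : acc) {
            List<Map<String, Object>> matches = build.get(row.get(probeField));
            if (matches == null) {
                continue;
            }
            for (Map<String, Object> match : matches) {
                Map<String, Object> merged = new LinkedHashMap<>(row);
                merged.putAll(match);
                out.add(merged);
            }
        }
        return out;
    }

    // Folds the build tables into the accumulator one at a time, in order.
    static List<Map<String, Object>> joinAll(List<Map<String, Object>> first, List<String> probeFields,
                                             List<Map<Object, List<Map<String, Object>>>> builds) {
        List<Map<String, Object>> acc = first;
        for (int i = 0; i < builds.size(); i++) {
            acc = joinStep(acc, probeFields.get(i), builds.get(i));
        }
        return acc;
    }

    // Convenience constructor for a row from alternating field names and values.
    static Map<String, Object> row(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put((String) kv[i], kv[i + 1]);
        }
        return m;
    }
}
```

Because each step consumes the previous step's output, the intermediate result only ever has to be joined against one more stream, at the cost of materialising every intermediate result in memory.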

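Because JoinBolt works entirely in memory and has to match records against each other, it consumes both CPU and memory. Points to watch:
1. Do not make the time window too large; otherwise too much data piles up in memory and the worker can run out of memory (OOM). Depending on the situation, shrink the window or raise the worker's memory through Config.TOPOLOGY_WORKER_MAX_HEAP_SIZE_MB.
2. A sliding window causes the same data to be joined more than once, so use withTumblingWindow instead. If tuple processing timeouts are enabled, Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS must be greater than windowLength + slidingInterval + processing time, so that tuples are not wrongly judged as timed out and replayed before processing has finished.
3. A windowed bolt automatically anchors the tuples in the tupleWindow. With large data volumes this anchoring puts pressure on the whole topology, so ack can be turned off if it is not needed (set Config.TOPOLOGY_ACKER_EXECUTORS to 0).
4. Set Config.TOPOLOGY_MAX_SPOUT_PENDING fairly large to give the window join and the operations after it enough time; the limit also goes some way toward keeping the spout from emitting tuples faster than the downstream bolts can consume them.
5. In production, set Config.TOPOLOGY_DEBUG to false to turn off debug logging, and set Config.TOPOLOGY_EVENTLOGGER_EXECUTORS to 0 to turn off the event logger.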
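The timeout rule in point 2 is simple arithmetic, sketched below with a plain map instead of Storm's Config so that it runs without the Storm jars. The string keys are assumed to be the values behind the Config constants named above, and the sketch also assumes that a tumbling window's sliding interval equals its window length; check both against the Storm version in use. The helper names and the sample numbers are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class WindowTuningSketch {
    // Smallest whole-second timeout strictly greater than window + slide + processing budget.
    static int minMessageTimeoutSecs(int windowLengthSecs, int slidingIntervalSecs, int processingBudgetSecs) {
        return windowLengthSecs + slidingIntervalSecs + processingBudgetSecs + 1;
    }

    // Settings from the list above for a tumbling window, as a plain map (stand-in for Config).
    static Map<String, Object> productionConf(int windowLengthSecs, int processingBudgetSecs) {
        Map<String, Object> conf = new HashMap<>();
        // Assumption: for a tumbling window the sliding interval equals the window length.
        conf.put("topology.message.timeout.secs",
                minMessageTimeoutSecs(windowLengthSecs, windowLengthSecs, processingBudgetSecs));
        conf.put("topology.debug", false);             // no debug logging in production
        conf.put("topology.eventlogger.executors", 0); // event logger off
        return conf;
    }
}
```

For a 10-second tumbling window with a 5-second processing budget, the sketch yields a timeout of 26 seconds, one second above the 25-second lower bound.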

Reference:
https://blog.csdn.net/weixin_34405332/article/details/91665504