2021SC@SDUSC
Bolt Source Code Analysis (Part 2)
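JoinBolt source code overview
JoinBolt extends BaseWindowedBolt and defines fields such as Selector selectorType, LinkedHashMap<String, JoinInfo> joinCriteria, and FieldSelector[] outputFields, which record the selector type, the join relationships, and the output columns.
The join and leftJoin methods set up the join relationships. Both end up calling joinCommon, which wraps each relationship in a JoinInfo object and stores it in joinCriteria.
The select method chooses the columns of the result set and stores them in outputFields, which is later used by declareOutputFields.
execute holds the core join logic and calls the hashJoin method.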
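As a rough illustration of how join, leftJoin and select feed joinCriteria and outputFields, here is a minimal self-contained sketch. JoinSpecSketch and its simplified JoinInfo are hypothetical stand-ins rather than the Storm classes, and the stream and field names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for JoinBolt's configuration methods; not the Storm class.
public class JoinSpecSketch {
    enum JoinType { INNER, LEFT }

    // Simplified JoinInfo: the join field, the stream it joins against, and the join type.
    static final class JoinInfo {
        final String field;
        final String priorStream;
        final JoinType type;

        JoinInfo(String field, String priorStream, JoinType type) {
            this.field = field;
            this.priorStream = priorStream;
            this.type = type;
        }
    }

    // Insertion order matters: the first key is the stream that seeds the probe.
    final Map<String, JoinInfo> joinCriteria = new LinkedHashMap<>();
    String[] outputFields = new String[0];

    JoinSpecSketch(String firstStream, String field) {
        joinCriteria.put(firstStream, new JoinInfo(field, null, null));
    }

    JoinSpecSketch join(String newStream, String field, String priorStream) {
        return joinCommon(newStream, field, priorStream, JoinType.INNER);
    }

    JoinSpecSketch leftJoin(String newStream, String field, String priorStream) {
        return joinCommon(newStream, field, priorStream, JoinType.LEFT);
    }

    // Both public methods funnel into this one, differing only in the join type.
    private JoinSpecSketch joinCommon(String newStream, String field, String priorStream, JoinType type) {
        if (joinCriteria.containsKey(newStream)) {
            throw new IllegalArgumentException("Stream already joined: " + newStream);
        }
        if (!joinCriteria.containsKey(priorStream)) {
            throw new IllegalArgumentException("Stream not declared yet: " + priorStream);
        }
        joinCriteria.put(newStream, new JoinInfo(field, priorStream, type));
        return this;
    }

    // Splits a comma-separated column list and remembers it as the output fields.
    JoinSpecSketch select(String commaSeparatedFields) {
        String[] parts = commaSeparatedFields.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        outputFields = parts;
        return this;
    }

    String firstStream() {
        return joinCriteria.keySet().iterator().next();
    }

    public static void main(String[] args) {
        JoinSpecSketch spec = new JoinSpecSketch("orders", "userId")
                .join("users", "id", "orders")
                .leftJoin("coupons", "userId", "users")
                .select("userId, name, code");
        System.out.println(spec.firstStream() + " " + spec.joinCriteria.keySet());
    }
}
```

Using a LinkedHashMap keeps the streams in declaration order, which is why the first key can later be treated as the first stream.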
JoinBolt.java
The hashJoin method
protected JoinAccumulator hashJoin(List<Tuple> tuples) {
    clearHashedInputs();
    JoinAccumulator probe = new JoinAccumulator();
    // 1) Build phase - Segregate tuples in the Window into streams.
    //    First stream's tuples go into probe, rest into HashMaps in hashedInputs
    String firstStream = joinCriteria.keySet().iterator().next();
    for (Tuple tuple : tuples) {
        String streamId = getStreamSelector(tuple);
        if (!streamId.equals(firstStream)) {
            Object field = getJoinField(streamId, tuple);
            ArrayList<Tuple> recs = hashedInputs.get(streamId).get(field);
            if (recs == null) {
                recs = new ArrayList<Tuple>();
                hashedInputs.get(streamId).put(field, recs);
            }
            recs.add(tuple);
        } else {
            ResultRecord probeRecord = new ResultRecord(tuple, joinCriteria.size() == 1);
            probe.insert(probeRecord); // first stream's data goes into the probe
        }
    }
    // 2) Join the streams in order of streamJoinOrder
    int i = 0;
    for (String streamName : joinCriteria.keySet()) {
        boolean finalJoin = (i == joinCriteria.size() - 1);
        if (i > 0) {
            probe = doJoin(probe, hashedInputs.get(streamName), joinCriteria.get(streamName), finalJoin);
        }
        ++i;
    }
    return probe;
}
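hashJoin first walks the tuples and splits them into two groups: tuples from firstStream go into the JoinAccumulator probe, and the rest go into HashMap<String, HashMap<Object, ArrayList<Tuple>>> hashedInputs. It then iterates over the remaining stream ids in order, calling doJoin for each one and folding the result back into the JoinAccumulator probe.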
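The build and join phases can be sketched with plain collections. This is a simplified model rather than Storm code: Row is a hypothetical stand-in for Tuple, every stream is assumed to join on the same key, and only inner-join semantics are modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {
    // Hypothetical stand-in for a Storm Tuple: stream id, join key, payload.
    static final class Row {
        final String stream;
        final Object key;
        final String payload;

        Row(String stream, Object key, String payload) {
            this.stream = stream;
            this.key = key;
            this.payload = payload;
        }
    }

    static List<List<Row>> hashJoin(List<String> streamOrder, List<Row> window) {
        String first = streamOrder.get(0);
        List<List<Row>> probe = new ArrayList<>();
        Map<String, Map<Object, List<Row>>> hashed = new HashMap<>();

        // 1) Build phase: the first stream seeds the probe, every other
        //    stream is hashed by join key into its own map.
        for (Row r : window) {
            if (r.stream.equals(first)) {
                List<Row> rec = new ArrayList<>();
                rec.add(r);
                probe.add(rec);
            } else {
                hashed.computeIfAbsent(r.stream, s -> new HashMap<>())
                      .computeIfAbsent(r.key, k -> new ArrayList<>())
                      .add(r);
            }
        }

        // 2) Join phase: fold each remaining stream into the accumulated result.
        for (int i = 1; i < streamOrder.size(); i++) {
            Map<Object, List<Row>> build =
                    hashed.getOrDefault(streamOrder.get(i), Collections.emptyMap());
            List<List<Row>> next = new ArrayList<>();
            for (List<Row> rec : probe) {
                Object key = rec.get(0).key; // assumption: one shared join key
                for (Row match : build.getOrDefault(key, Collections.emptyList())) {
                    List<Row> merged = new ArrayList<>(rec);
                    merged.add(match);
                    next.add(merged);
                }
            }
            probe = next;
        }
        return probe;
    }

    public static void main(String[] args) {
        List<Row> window = Arrays.asList(
                new Row("orders", 1, "o1"),
                new Row("orders", 2, "o2"),
                new Row("users", 1, "alice"),
                new Row("payments", 1, "p1"),
                new Row("payments", 1, "p2"));
        List<List<Row>> joined = hashJoin(Arrays.asList("orders", "users", "payments"), window);
        System.out.println(joined.size()); // prints 2
    }
}
```

Order o2 drops out because no user row shares its key, while o1 fans out into two records, one per matching payment.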
The doJoin method
protected JoinAccumulator doJoin(JoinAccumulator probe, HashMap<Object, ArrayList<Tuple>> buildInput, JoinInfo joinInfo,
                                 boolean finalJoin) {
    final JoinType joinType = joinInfo.getJoinType();
    switch (joinType) {
        case INNER:
            return doInnerJoin(probe, buildInput, joinInfo, finalJoin);
        case LEFT:
            return doLeftJoin(probe, buildInput, joinInfo, finalJoin);
        case RIGHT:
        case OUTER:
        default:
            throw new RuntimeException("Unsupported join type : " + joinType.name());
    }
}
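doJoin dispatches on the join type. Only INNER and LEFT are implemented so far, handled by doInnerJoin and doLeftJoin respectively; RIGHT and OUTER throw a RuntimeException.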
doInnerJoin
protected JoinAccumulator doInnerJoin(JoinAccumulator probe, Map<Object, ArrayList<Tuple>> buildInput, JoinInfo joinInfo,
                                      boolean finalJoin) {
    String[] probeKeyName = joinInfo.getOtherField();
    JoinAccumulator result = new JoinAccumulator();
    FieldSelector fieldSelector = new FieldSelector(joinInfo.other.getStreamName(), probeKeyName);
    for (ResultRecord rec : probe.getRecords()) {
        Object probeKey = rec.getField(fieldSelector);
        if (probeKey != null) {
            ArrayList<Tuple> matchingBuildRecs = buildInput.get(probeKey);
            if (matchingBuildRecs != null) {
                for (Tuple matchingRec : matchingBuildRecs) {
                    ResultRecord mergedRecord = new ResultRecord(rec, matchingRec, finalJoin);
                    result.insert(mergedRecord);
                }
            }
        }
    }
    return result;
}
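This iterates over the records in the JoinAccumulator probe one by one, looks up the matching records in buildInput by probeKey, and merges each match it finds into a new ResultRecord.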
doLeftJoin
protected JoinAccumulator doLeftJoin(JoinAccumulator probe, Map<Object, ArrayList<Tuple>> buildInput, JoinInfo joinInfo,
                                     boolean finalJoin) {
    String[] probeKeyName = joinInfo.getOtherField();
    JoinAccumulator result = new JoinAccumulator();
    FieldSelector fieldSelector = new FieldSelector(joinInfo.other.getStreamName(), probeKeyName);
    for (ResultRecord rec : probe.getRecords()) {
        Object probeKey = rec.getField(fieldSelector);
        if (probeKey != null) {
            ArrayList<Tuple> matchingBuildRecs = buildInput.get(probeKey); // ok if its return null
            if (matchingBuildRecs != null && !matchingBuildRecs.isEmpty()) {
                for (Tuple matchingRec : matchingBuildRecs) {
                    ResultRecord mergedRecord = new ResultRecord(rec, matchingRec, finalJoin);
                    result.insert(mergedRecord);
                }
            } else {
                ResultRecord mergedRecord = new ResultRecord(rec, null, finalJoin);
                result.insert(mergedRecord);
            }
        }
    }
    return result;
}
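The only difference between left join and inner join is that when no matching record is found, the left-side record is still kept (merged with null). Note that a record whose probeKey is null is skipped in both cases.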
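To make the difference concrete, here is a small self-contained sketch in which a single flag switches between the two behaviors. The class, the method, and the use of plain strings in place of ResultRecord and Tuple are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeftVsInnerSketch {
    // Joins probe keys against a build map. A key with no match is dropped
    // for an inner join and kept with a null right side for a left join.
    static List<String[]> join(List<String> probeKeys,
                               Map<String, List<String>> build,
                               boolean keepUnmatched) {
        List<String[]> result = new ArrayList<>();
        for (String key : probeKeys) {
            if (key == null) {
                continue; // both variants skip records whose probe key is null
            }
            List<String> matches = build.get(key);
            if (matches != null && !matches.isEmpty()) {
                for (String match : matches) {
                    result.add(new String[] {key, match});
                }
            } else if (keepUnmatched) {
                result.add(new String[] {key, null});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> build = new HashMap<>();
        build.put("a", Arrays.asList("x", "y"));
        List<String> probeKeys = Arrays.asList("a", "b", null);

        System.out.println(join(probeKeys, build, false).size()); // inner: prints 2
        System.out.println(join(probeKeys, build, true).size());  // left: prints 3
    }
}
```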
Summary
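JoinBolt extends BaseWindowedBolt. It currently supports only inner join and left join, and it requires the join field to be the same as the fieldsGrouping field.
To merge data from multiple streams, JoinBolt takes a divide-and-conquer approach: it accumulates the result set in a JoinAccumulator and calls doJoin in a loop, one stream at a time.
Because JoinBolt works in memory and has to match records, it consumes both CPU and memory, so keep the following in mind: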
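1. The time window should not be too large, otherwise too much data piles up in memory and an OOM becomes likely. Adjust the window as needed, or set the worker's memory size through Config.TOPOLOGY_WORKER_MAX_HEAP_SIZE_MB.
2. A sliding window causes the same data to be joined repeatedly, so use withTumblingWindow instead.
If tuple processing timeouts are enabled, Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS must be greater than windowLength + slidingInterval + processing time, so that tuples are not wrongly judged as timed out and replayed before processing has finished.
3. Because a windowed bolt automatically anchors the data in the tupleWindow, anchoring a large volume of data puts pressure on the whole topology. If acking is not needed, turn it off (set Config.TOPOLOGY_ACKER_EXECUTORS to 0).
4. Set Config.TOPOLOGY_MAX_SPOUT_PENDING somewhat higher to give the windowed join and the downstream operations enough time; to some extent this also prevents the spout from emitting tuples faster than the downstream bolts can consume them.
5. In production, set Config.TOPOLOGY_DEBUG to false to turn off debug logging, and set Config.TOPOLOGY_EVENTLOGGER_EXECUTORS to 0 to turn off the event logger.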
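The timeout rule in the list above (Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS greater than windowLength + slidingInterval + processing time) is simple arithmetic. The following sketch computes the smallest whole-second value that satisfies it; the helper name and the example durations are hypothetical, and the processing time is something you would have to measure yourself.

```java
import java.time.Duration;

public class TimeoutSketch {
    // Smallest whole number of seconds strictly greater than
    // windowLength + slidingInterval + processing time.
    // Assumes all three durations are non-negative.
    static long minMessageTimeoutSecs(Duration windowLength,
                                      Duration slidingInterval,
                                      Duration processing) {
        Duration total = windowLength.plus(slidingInterval).plus(processing);
        return total.getSeconds() + 1;
    }

    public static void main(String[] args) {
        long secs = minMessageTimeoutSecs(
                Duration.ofSeconds(30),  // example window length
                Duration.ofSeconds(10),  // example sliding interval
                Duration.ofSeconds(5));  // example measured processing time
        System.out.println(secs); // prints 46
    }
}
```

In practice it is reasonable to add further headroom on top of this lower bound, since processing time varies with load.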
Reference:
https://blog.csdn.net/weixin_34405332/article/details/91665504