Flink Source Code Walkthrough 2: How the JobGraph Is Formed

Latest recommended article published on 2024-03-15 16:14:11

Author: 代码届彭于晏

Categories: Big Data, Frameworks

Article link: https://blog.csdn.net/m0_37139189/article/details/102893330

Continuing from the previous post:

LocalStreamEnvironment#execute()

Tracing the call JobGraph jobGraph = streamGraph.getJobGraph(); leads to

StreamingJobGraphGenerator#createJobGraph

private JobGraph createJobGraph() {

   // make sure that all vertices start immediately
   jobGraph.setScheduleMode(ScheduleMode.EAGER);

   // Generate deterministic hashes for the nodes in order to identify them across
   // submission iff they didn't change.
   Map<Integer, byte[]> hashes = defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);

   // Generate legacy version hashes for backwards compatibility
   List<Map<Integer, byte[]>> legacyHashes = new ArrayList<>(legacyStreamGraphHashers.size());
   for (StreamGraphHasher hasher : legacyStreamGraphHashers) {
      legacyHashes.add(hasher.traverseStreamGraphAndGenerateHashes(streamGraph));
   }

   Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes = new HashMap<>();

   setChaining(hashes, legacyHashes, chainedOperatorHashes);

   setPhysicalEdges();

   setSlotSharingAndCoLocation();

   configureCheckpointing();

   JobGraphGenerator.addUserArtifactEntries(streamGraph.getEnvironment().getCachedFiles(), jobGraph);

   // set the ExecutionConfig last when it has been finalized
   try {
      jobGraph.setExecutionConfig(streamGraph.getExecutionConfig());
   }
   catch (IOException e) {
      throw new IllegalConfigurationException("Could not serialize the ExecutionConfig." +
            "This indicates that non-serializable types (like custom serializers) were registered");
   }

   return jobGraph;
}

This code performs four main steps:

1. Traverse the StreamGraph

2. Build the operator chains

3. Set the physical edges

4. Set the slot sharing groups

Start with the first step:

Map<Integer, byte[]> hashes = defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);

The traversal starts from the sources and computes a hash for every StreamNode. The hash of a StreamNode is only computed after the hashes of all of its input nodes have been computed.
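The "inputs first" ordering can be sketched on a toy graph. This is a minimal illustration, not Flink's hasher: the class name, the integer node ids and the adjacency maps are all assumptions made for the example, and no real hash is computed, only the order in which nodes would be hashed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: node ids and adjacency maps stand in for StreamNode and StreamEdge.
class InputFirstTraversalSketch {

    /** Returns the node ids in an order where every node comes after all of its inputs. */
    static List<Integer> visitOrder(List<Integer> sources,
                                    Map<Integer, List<Integer>> inputs,
                                    Map<Integer, List<Integer>> outputs) {
        List<Integer> order = new ArrayList<>();      // order in which "hashes" get computed
        Set<Integer> hashed = new HashSet<>();        // nodes whose hash is already known
        Set<Integer> visited = new HashSet<>(sources);
        Deque<Integer> remaining = new ArrayDeque<>(sources);

        Integer node;
        while ((node = remaining.poll()) != null) {
            if (hashed.containsAll(inputs.getOrDefault(node, List.of()))) {
                hashed.add(node);
                order.add(node);
                for (Integer child : outputs.getOrDefault(node, List.of())) {
                    if (visited.add(child)) {
                        remaining.add(child);
                    }
                }
            } else {
                // Not all inputs are hashed yet: forget the node so a later input re-queues it.
                visited.remove(node);
            }
        }
        return order;
    }

    /** Hypothetical graph: 1 -> 3, 1 -> 2, 2 -> 3, 3 -> 4. */
    static List<Integer> demoOrder() {
        Map<Integer, List<Integer>> inputs =
                Map.of(2, List.of(1), 3, List.of(1, 2), 4, List.of(3));
        Map<Integer, List<Integer>> outputs =
                Map.of(1, List.of(3, 2), 2, List.of(3), 3, List.of(4));
        return visitOrder(List.of(1), inputs, outputs);
    }

    public static void main(String[] args) {
        System.out.println(demoOrder()); // prints [1, 2, 3, 4]
    }
}
```

Node 3 is reached before node 2 in this graph, so it is put aside and only processed once node 2 has been handled.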

The second step:

StreamingJobGraphGenerator#setChaining()

for (Integer sourceNodeId : streamGraph.getSourceIDs()) {
   createChain(sourceNodeId, sourceNodeId, hashes, legacyHashes, 0, chainedOperatorHashes);
}

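This is a for loop that calls createChain once per source in the StreamGraph. Our example has only one source.

A short digression first.

Operator chains and their purpose

In the StreamGraph, each operator corresponds to one StreamNode. Consider a very common case, DataStream.map().filter(): the map and filter operators become separate StreamNodes and would end up as separate Tasks. If those two Tasks do not run in the same slot, or run on different TaskManagers, records have to travel over the network from map to filter, which hurts performance badly. To avoid this, Flink introduces the operator chain: a sequence of operators that can be executed together in the same slot, as a single task.

When can operators be chained together

According to the source code, an upstream operator A can be chained with its downstream operator when all of the following hold:

The downstream operator has exactly one input, namely the upstream operator
Both belong to the same SlotSharingGroup
Chaining is enabled
The partitioner is a ForwardPartitioner
Both have the same parallelism
The ChainingStrategy of both operators allows chaining

A chain can of course contain more than two operators, as long as every pair of consecutive operators passes the following check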

public static boolean isChainable(StreamEdge edge, StreamGraph streamGraph) {
   StreamNode upStreamVertex = edge.getSourceVertex();
   StreamNode downStreamVertex = edge.getTargetVertex();

   StreamOperator<?> headOperator = upStreamVertex.getOperator();
   StreamOperator<?> outOperator = downStreamVertex.getOperator();

   return downStreamVertex.getInEdges().size() == 1
         && outOperator != null
         && headOperator != null
         && upStreamVertex.isSameSlotSharingGroup(downStreamVertex)
         && outOperator.getChainingStrategy() == ChainingStrategy.ALWAYS
         && (headOperator.getChainingStrategy() == ChainingStrategy.HEAD ||
            headOperator.getChainingStrategy() == ChainingStrategy.ALWAYS)
         && (edge.getPartitioner() instanceof ForwardPartitioner)
         && upStreamVertex.getParallelism() == downStreamVertex.getParallelism()
         && streamGraph.isChainingEnabled();
}
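To see how the individual conditions interact, the check can be replayed on a toy model. This is only a sketch: the Node class and the Strategy and Partitioning enums are hypothetical stand-ins for StreamNode, ChainingStrategy and the partitioner type, and the null checks on the operators are left out.

```java
// Toy model only: these types are simplified stand-ins, not Flink classes.
class ChainableCheckSketch {
    enum Strategy { ALWAYS, HEAD, NEVER }
    enum Partitioning { FORWARD, REBALANCE, HASH }

    static final class Node {
        final String slotSharingGroup;
        final int parallelism;
        final Strategy strategy;
        final int inEdgeCount;

        Node(String slotSharingGroup, int parallelism, Strategy strategy, int inEdgeCount) {
            this.slotSharingGroup = slotSharingGroup;
            this.parallelism = parallelism;
            this.strategy = strategy;
            this.inEdgeCount = inEdgeCount;
        }
    }

    /** Mirrors the conditions of the check above, one condition per line. */
    static boolean isChainable(Node up, Node down, Partitioning partitioning, boolean chainingEnabled) {
        return down.inEdgeCount == 1
                && up.slotSharingGroup.equals(down.slotSharingGroup)
                && down.strategy == Strategy.ALWAYS
                && (up.strategy == Strategy.HEAD || up.strategy == Strategy.ALWAYS)
                && partitioning == Partitioning.FORWARD
                && up.parallelism == down.parallelism
                && chainingEnabled;
    }

    public static void main(String[] args) {
        Node map = new Node("default", 4, Strategy.ALWAYS, 1);
        Node filter = new Node("default", 4, Strategy.ALWAYS, 1);
        Node sink = new Node("default", 1, Strategy.ALWAYS, 1);

        System.out.println(isChainable(map, filter, Partitioning.FORWARD, true));    // true
        System.out.println(isChainable(filter, sink, Partitioning.REBALANCE, true)); // false
        System.out.println(isChainable(map, filter, Partitioning.FORWARD, false));   // false
    }
}
```

A single failing condition is enough to break the chain: the second call fails on both the partitioner and the parallelism, the third only because chaining is switched off.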

Continuing with the code above, step into the createChain method

for (StreamEdge outEdge : streamGraph.getStreamNode(currentNodeId).getOutEdges()) {
   if (isChainable(outEdge, streamGraph)) {
      chainableOutputs.add(outEdge);
   } else {
      nonChainableOutputs.add(outEdge);
   }
}

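This iterates over the output edges of the StreamNode, uses the check described above to decide whether each edge can be chained, and puts the edge into either chainableOutputs or nonChainableOutputs.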

for (StreamEdge chainable : chainableOutputs) {
   transitiveOutEdges.addAll(
         createChain(startNodeId, chainable.getTargetId(), hashes, legacyHashes, chainIndex + 1, chainedOperatorHashes));
}

for (StreamEdge nonChainable : nonChainableOutputs) {
   transitiveOutEdges.add(nonChainable);
   createChain(nonChainable.getTargetId(), nonChainable.getTargetId(), hashes, legacyHashes, 0, chainedOperatorHashes);
}

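createChain is then called recursively for the edges in both collections: a chainable target continues the current chain (same startNodeId, chainIndex + 1), while each non-chainable target starts a new chain of its own. The hashes of the chained operators are recorded in chainedOperatorHashes. Back to the createJobGraph method.

Here you can see that the graph has been merged into three nodes.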
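The recursion can be replayed on a toy graph to see how chains and the edges between them come out. This is a simplified sketch under assumptions: the topology (1 -> 2 -> 3 -> 4, with only the last edge chainable) is hypothetical and chosen so that three vertices result, the chainable flag on each edge stands in for the isChainable check, and the chains and jobEdges collections are illustration-only bookkeeping, not Flink fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: integer ids stand in for StreamNodes, chains for JobVertices.
class CreateChainSketch {
    static final class Edge {
        final int from;
        final int to;
        final boolean chainable; // stand-in for the result of the chainability check

        Edge(int from, int to, boolean chainable) {
            this.from = from;
            this.to = to;
            this.chainable = chainable;
        }
    }

    final Map<Integer, List<Edge>> outEdges = new HashMap<>();
    final Map<Integer, List<Integer>> chains = new TreeMap<>(); // chain head -> chained node ids
    final List<String> jobEdges = new ArrayList<>();            // "headOfChain->headOfNextChain"
    final Set<Integer> built = new HashSet<>();                 // chain heads already finished

    void addEdge(int from, int to, boolean chainable) {
        outEdges.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(from, to, chainable));
    }

    List<Edge> createChain(int startNodeId, int currentNodeId) {
        List<Edge> transitiveOutEdges = new ArrayList<>();
        if (built.contains(startNodeId)) {
            return transitiveOutEdges; // this chain was already created
        }
        List<Edge> chainable = new ArrayList<>();
        List<Edge> nonChainable = new ArrayList<>();
        for (Edge e : outEdges.getOrDefault(currentNodeId, List.of())) {
            (e.chainable ? chainable : nonChainable).add(e);
        }
        for (Edge e : chainable) {
            // Same start node: the target joins the current chain.
            transitiveOutEdges.addAll(createChain(startNodeId, e.to));
        }
        for (Edge e : nonChainable) {
            // The edge leaves the chain, and its target becomes the head of a new chain.
            transitiveOutEdges.add(e);
            createChain(e.to, e.to);
        }
        // Prepend, because the recursion finishes the deepest node first.
        chains.computeIfAbsent(startNodeId, k -> new ArrayList<>()).add(0, currentNodeId);
        if (currentNodeId == startNodeId) {
            built.add(startNodeId);
            for (Edge e : transitiveOutEdges) {
                jobEdges.add(startNodeId + "->" + e.to);
            }
        }
        return transitiveOutEdges;
    }

    static CreateChainSketch demo() {
        CreateChainSketch g = new CreateChainSketch();
        g.addEdge(1, 2, false);
        g.addEdge(2, 3, false);
        g.addEdge(3, 4, true);
        g.createChain(1, 1);
        return g;
    }

    public static void main(String[] args) {
        CreateChainSketch g = demo();
        System.out.println(g.chains);   // prints {1=[1], 2=[2], 3=[3, 4]}
        System.out.println(g.jobEdges); // prints [2->3, 1->2]
    }
}
```

Four nodes collapse into three chains, and only the two non-chainable edges survive as connections between chains, which is the shape the later steps turn into JobVertices and JobEdges.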

setPhysicalEdges();

This line builds the physical edges.

setSlotSharingAndCoLocation();

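This sets the slot sharing groups.

To summarize:

Starting from the current StreamNode, the traversal continues until it reaches a node that cannot be chained with it (by the code's logic, a StreamNode can always be chained with itself), recording the nodes that can be chained along the way. After the outputs of the current node have been translated recursively, the recorded StreamNodes are turned into one JobVertex, and the outputs of that JobVertex are set to the output JobVertices that have already been translated.

The main difference between the JobGraph and the StreamGraph is that some StreamNodes are merged into a single JobVertex, and JobVertices are connected by JobEdges (physical edges). This cuts down the number of tasks and the data handed over between them.

Take a look at the resulting JobGraph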