Flink Source Code Notes 03: From StreamGraph to JobGraph

董嘻嘻

Article link: https://blog.csdn.net/yiyezhiqiu167/article/details/120744432

Introduction

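A JobGraph can be seen as an optimized StreamGraph. It merges operators that satisfy certain conditions into a single operator chain, which cuts the resource cost of serializing and deserializing data between nodes and of sending it over the network.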

Entry point

As with StreamGraph generation, calling StreamGraph.getJobGraph() returns the corresponding JobGraph. Under the hood it creates a StreamingJobGraphGenerator to build the JobGraph: new StreamingJobGraphGenerator(streamGraph, jobID).createJobGraph().

private JobGraph createJobGraph() {

    // ...

    // Generate deterministic hashes for the nodes in order to identify them across
    // submission iff they didn't change.
    Map<Integer, byte[]> hashes =
            defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);

    // Generate legacy version hashes for backwards compatibility
    List<Map<Integer, byte[]>> legacyHashes = new ArrayList<>(legacyStreamGraphHashers.size());
    for (StreamGraphHasher hasher : legacyStreamGraphHashers) {
        legacyHashes.add(hasher.traverseStreamGraphAndGenerateHashes(streamGraph));
    }

    setChaining(hashes, legacyHashes);
    // ...
    return jobGraph;
}

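The core consists of these two steps:

Call traverseStreamGraphAndGenerateHashes to generate a hash (a unique identifier) for every node.
Call setChaining to optimize the operator pipeline by chaining some operators together, which reduces serialization/deserialization and network communication overhead.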

traverseStreamGraphAndGenerateHashes

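As mentioned in the StreamGraph notes, the StreamNode ID assigned while building a StreamGraph is derived from the Transformation ID, and the Transformation ID comes from a static counter that only ever increases. This leads to the following situation: if we use the DataStream API to build two jobs A and B with exactly the same operator pipeline, one after the other in the same process, their underlying Transformation IDs are completely different. From the point of view of the job and its graph structure, however, the two jobs are identical, so a second ID scheme is needed to identify them. This is the Operator ID.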
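The effect of a process-wide counter can be reproduced with a toy example. The class and method names below (CounterIdDemo, nextId, buildJob) are hypothetical stand-ins, not Flink classes; the sketch only assumes what the paragraph above states, namely that IDs come from a static, ever-increasing variable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for id assignment from a static counter (not Flink code).
class CounterIdDemo {
    // Process-wide counter shared by every job built in this JVM.
    private static int idCounter = 0;

    static int nextId() {
        return ++idCounter;
    }

    // "Builds" a job with the given number of operators and returns the ids they received.
    static List<Integer> buildJob(int operatorCount) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < operatorCount; i++) {
            ids.add(nextId());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Integer> jobA = buildJob(3);
        List<Integer> jobB = buildJob(3);
        System.out.println("job A ids: " + jobA);
        System.out.println("job B ids: " + jobB);
        System.out.println("same ids? " + jobA.equals(jobB));
    }
}
```

Both calls describe the same three-operator pipeline, yet the second job receives IDs shifted by three, so the raw IDs cannot be used to recognize that the two jobs are structurally the same.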

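traverseStreamGraphAndGenerateHashes generates a hash for each node based on the node's position in the StreamGraph and uses that hash as the node's identifier. By default, Flink uses StreamGraphHasherV2 to generate the node hashes.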

// The hash function used to generate the hash
final HashFunction hashFunction = Hashing.murmur3_128(0);
final Map<Integer, byte[]> hashes = new HashMap<>();
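To see why a position-based hash gives two structurally identical jobs the same identifiers, here is a minimal sketch. It is not Flink's hasher: SHA-256 from the JDK stands in for Guava's murmur3_128 so the example has no dependencies, the graph is reduced to a linear pipeline, and the choice of hash inputs (traversal position plus upstream hash) is an assumption based on the description above. PositionHashDemo and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: SHA-256 replaces murmur3_128, and the hash inputs are assumed.
class PositionHashDemo {
    // Hashes a linear pipeline. The node ids are deliberately ignored:
    // each hash depends only on the traversal position and the upstream hash.
    static List<byte[]> hashChain(int[] nodeIds) {
        List<byte[]> hashes = new ArrayList<>();
        byte[] upstream = new byte[0];
        for (int ignoredNodeId : nodeIds) {
            MessageDigest digest = newDigest();
            digest.update(ByteBuffer.allocate(4).putInt(hashes.size()).array());
            digest.update(upstream);
            upstream = digest.digest();
            hashes.add(upstream);
        }
        return hashes;
    }

    static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Element-wise comparison, since byte[] does not implement value equality.
    static boolean sameHashes(List<byte[]> a, List<byte[]> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            if (!Arrays.equals(a.get(i), b.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<byte[]> jobA = hashChain(new int[] {1, 2, 3});
        List<byte[]> jobB = hashChain(new int[] {4, 5, 6});
        System.out.println("identical hashes: " + sameHashes(jobA, jobB));
    }
}
```

Jobs A and B use different node IDs but have the same shape, so the sketch assigns them the same sequence of hashes.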

First, the method collects all sources of the StreamGraph. To make sure the same StreamGraph always produces the same hashes, the source IDs are sorted once they have been collected.

We need to make the source order deterministic. The source IDs are not returned in the same order, which means that submitting the same program twice might result in different traversal, which breaks the deterministic hash assignment.

List<Integer> sources = new ArrayList<>();
for (Integer sourceNodeId : streamGraph.getSourceIDs()) {
    sources.add(sourceNodeId);
}
Collections.sort(sources);
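As a small illustration of why the sort matters, the sketch below feeds the same set of source IDs in two different orders and shows that sorting yields one canonical starting order. SourceOrderDemo and deterministicOrder are hypothetical names, and the two input lists are made-up data standing in for two submissions of the same program.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Illustrative only: normalizes an arbitrarily ordered collection of source ids.
class SourceOrderDemo {
    static List<Integer> deterministicOrder(Collection<Integer> sourceIds) {
        List<Integer> sorted = new ArrayList<>(sourceIds);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // The same three sources, returned in a different order on each "submission".
        List<Integer> firstRun = List.of(7, 1, 4);
        List<Integer> secondRun = List.of(4, 7, 1);
        System.out.println(deterministicOrder(firstRun));
        System.out.println(deterministicOrder(secondRun));
    }
}
```

Because the traversal starts from the sorted list, both runs visit the sources in the same order.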

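Next, StreamGraphHasherV2 traverses the nodes breadth-first, using a queue:

For each node taken from the queue it tries to generate a hash. If that succeeds, all downstream nodes of the node are added to the queue as well.
If it fails, the node is not yet ready to be hashed (for example, some of its upstream nodes have not been traversed yet). The node is dropped from the queue for now and is added back once another of its upstream nodes has been traversed.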

1	2
|	|
|	|
|	3
\	/
 \ /
  4

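In the example above, 1 and 2 are both source nodes, so they are added to the queue first and the queue is [1, 2]. After the first pass, the hashes of nodes 1 and 2 have been computed and their downstream nodes are enqueued in order, so the queue becomes [4, 3]. Node 4 is taken from the queue first. As described above, its hash computation fails at this point, so the else branch is taken and node 4 is dropped from the queue. Node 3 is taken next, and once its hash has been computed its downstream node 4 is put back into the queue: [4]. On the next pass the hash of node 4 is computed again. By now all upstream nodes of node 4 have been traversed, so the computation succeeds.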
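The walk-through above can be reproduced with a toy simulation of the same four-node graph. This is a sketch, not Flink code: BfsDeferralDemo and its fields are hypothetical, "hashing" a node is reduced to checking that all of its upstream nodes have already been hashed, and un-marking a deferred node in the else branch is an assumption about how the node becomes eligible to be enqueued again.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy model of the breadth-first traversal with deferral (not Flink code).
class BfsDeferralDemo {
    // Example graph: 1 -> 4, 2 -> 3, 3 -> 4.
    static final Map<Integer, List<Integer>> OUT =
            Map.of(1, List.of(4), 2, List.of(3), 3, List.of(4), 4, List.of());
    static final Map<Integer, List<Integer>> IN =
            Map.of(1, List.of(), 2, List.of(), 3, List.of(2), 4, List.of(1, 3));

    static List<String> traverse(List<Integer> sortedSources) {
        List<String> log = new ArrayList<>();
        Set<Integer> hashed = new HashSet<>();
        Set<Integer> visited = new HashSet<>(sortedSources);
        Queue<Integer> remaining = new ArrayDeque<>(sortedSources);

        Integer current;
        while ((current = remaining.poll()) != null) {
            if (hashed.containsAll(IN.get(current))) {
                // All inputs are available, so the node can be "hashed".
                hashed.add(current);
                log.add("hash " + current);
                for (int child : OUT.get(current)) {
                    if (visited.add(child)) {
                        remaining.add(child);
                    }
                }
            } else {
                // Assumption: forget the node so a later upstream can enqueue it again.
                log.add("defer " + current);
                visited.remove(current);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // Prints [hash 1, hash 2, defer 4, hash 3, hash 4]
        System.out.println(traverse(List.of(1, 2)));
    }
}
```

Node 4 is deferred exactly once, when it is polled before node 3, and it is hashed after node 3 re-enqueues it.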

//
// Traverse the graph in a breadth-first manner. Keep in mind that
// the graph is not a tree and multiple paths to nodes can exist.
//

Set<Integer> visited = new HashSet<>();
Queue<StreamNode> remaining = new ArrayDeque<>();

// Start with source nodes
for (Integer sourceNodeId : sources) {
    remaining.add(streamGraph.getStreamNode(sourceNodeId));
    visited.add(sourceNodeId);
}

StreamNode currentNode;
while ((currentNode = remaining.poll()) != null) {
    // Generate the hash code. Because multiple path exist to each
    // node, we might not have all required inputs available to
    // generate the hash code.
    if (generateNodeHash(
            currentNode,
            hashFunction,
            hashes,
            streamGraph.isChainingEnabled(),
            streamGraph)) {
        // Add the child nodes
        for (StreamEdge outEdge : currentNode.getOutEdges()) {
            StreamNode child = streamGraph.getTargetVertex(outEdge
