I previously wrote a piece on the Spark source code that only covered Spark's batch-processing side; this write-up on Flink focuses mainly on Flink's stream processing.
----------
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountStreamingDemo {

    public static void main(String[] args) throws Exception {
        // create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        final DataStreamSource<String> dataStreamSource = env.addSource(new MySource());
        final DataStream<Tuple2<String, String>> tuple2SingleOutputStreamOperator = dataStreamSource
                .flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, String>> out) throws Exception {
                        String[] splits = value.toLowerCase().split("\\W+");
                        for (String split : splits) {
                            if (split.length() > 0) {
                                out.collect(new Tuple2<>(split, "1"));
                            }
                        }
                    }
                });
        final SingleOutputStreamOperator<Tuple2<String, Integer>> map = tuple2SingleOutputStreamOperator.map(
                new MapFunction<Tuple2<String, String>, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(Tuple2<String, String> value) throws Exception {
                        return new Tuple2<>(value.f0, Integer.parseInt(value.f1));
                    }
                });
        final KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream = map
                .keyBy(0);
        final DataStream<Tuple2<String, Integer>> reduce = tuple2TupleKeyedStream
                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2)
                            throws Exception {
                        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
                    }
                });
        reduce.print();
        // a streaming program only starts when execute() is called; without it nothing runs
        env.execute("zhisheng —— word count streaming demo");
    }
}
Taking the code above as the running example, start with the creation of the stream execution environment.
After the first line has executed, it returns a StreamExecutionEnvironment instance. Take a look at this instance's member variables:
The first two are configuration components. The first stores its settings internally in a HashMap, keyed by name; the second, config, is an ExecutionConfig that maintains quite a few settings. To pick a few:
maxParallelism: the maximum parallelism
autoWatermarkInterval: the interval at which watermarks are emitted automatically
checkpointCfg: the checkpoint-related configuration
transformations is a very important component: an ArrayList that keeps every operator registered with the environment. We will dig into it further below.
isChainingEnabled is a switch that controls operator chaining; think of it as roughly analogous to Spark fusing operators into a stage.
timeCharacteristic records the time semantics in use (processing, ingestion, or event time).
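To make the list above concrete, here is a condensed, paraphrased view of the relevant fields of the Flink 1.x StreamExecutionEnvironment (modifiers and initializers simplified; the HashMap-backed configuration object mentioned first is omitted):

public abstract class StreamExecutionEnvironment {
    // execution settings: parallelism, maxParallelism, autoWatermarkInterval, ...
    private final ExecutionConfig config = new ExecutionConfig();
    // checkpoint settings: interval, mode, timeout, ...
    private final CheckpointConfig checkpointCfg = new CheckpointConfig();
    // every transformation (operator) registered with this environment lands here
    protected final List<StreamTransformation<?>> transformations = new ArrayList<>();
    // whether operators may be chained together into a single task
    protected boolean isChainingEnabled = true;
    // processing time / ingestion time / event time
    private TimeCharacteristic timeCharacteristic = TimeCharacteristic.ProcessingTime;
}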
----------
Step into final DataStreamSource<String> dataStreamSource = env.addSource(new MySource()); and trace StreamExecutionEnvironment#addSource.
It first resolves the source's return type, then creates a StreamSource object that wraps the user-defined SourceFunction. That StreamSource is then packaged, together with the StreamExecutionEnvironment and the TypeInformation, into a DataStreamSource. DataStreamSource extends DataStream and internally holds a transformation object, which is the concrete representation of an operator; that is the shape of the DataStreamSource that finally gets returned.
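Abridged from the Flink 1.x source, the tail of addSource looks roughly like this:

public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {
    // sources that implement ParallelSourceFunction may run with parallelism > 1
    boolean isParallel = function instanceof ParallelSourceFunction;
    clean(function);
    // wrap the user-defined SourceFunction in a StreamSource operator
    StreamSource<OUT, ?> sourceOperator = new StreamSource<>(function);
    // bundle the environment, the type info and the operator into the returned DataStreamSource
    return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);
}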
Next, trace the flatMap operation, which bottoms out in DataStream#transform.
Inside it, getExecutionEnvironment().addOperator(resultTransform); delegates to StreamExecutionEnvironment#addOperator, which simply appends the DataStream's new transformation to the transformations collection mentioned earlier.
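Lightly abridged from the Flink 1.x source, DataStream#transform shows both steps: the new OneInputTransformation wraps this.transformation as its input, and is then registered with the environment:

public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
    // read the output type of the input transform early to surface MissingTypeInfo errors
    transformation.getOutputType();
    // the new transformation points at its input, forming the chain
    OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation, operatorName, operator, outTypeInfo, environment.getParallelism());
    @SuppressWarnings({"unchecked", "rawtypes"})
    SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
    // append the new transformation to the environment's transformations list
    getExecutionEnvironment().addOperator(resultTransform);
    return returnStream;
}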
Jump straight to the reduce object just before the final print: at that point the transformations list contains four objects, one for each of the four operators above.
Next, analyze env.execute. Running locally, this resolves to LocalStreamEnvironment#execute.
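Abridged from the Flink 1.x source (the MiniCluster configuration details are elided), the whole method looks roughly like this:

public JobExecutionResult execute(String jobName) throws Exception {
    // turn the collected transformations into a StreamGraph, then into a JobGraph
    StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    JobGraph jobGraph = streamGraph.getJobGraph();
    // ... build a MiniClusterConfiguration cfg from the job configuration ...
    MiniCluster miniCluster = new MiniCluster(cfg);
    try {
        miniCluster.start();
        // submit the JobGraph to the in-JVM cluster and block until the job finishes
        return miniCluster.executeJobBlocking(jobGraph);
    } finally {
        transformations.clear();
        miniCluster.close();
    }
}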
Its first step, StreamGraph streamGraph = getStreamGraph();, traces into StreamGraphGenerator#generateInternal:
for (StreamTransformation<?> transformation: transformations) {
    transform(transformation);
}
You can see it iterates over the transformations collected earlier in the env and calls transform on each element. Here comes the important part:
Collection<Integer> transformedIds;
if (transform instanceof OneInputTransformation<?, ?>) {
    transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
} else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
    transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
} else if (transform instanceof SourceTransformation<?>) {
    transformedIds = transformSource((SourceTransformation<?>) transform);
} else if (transform instanceof SinkTransformation<?>) {
    transformedIds = transformSink((SinkTransformation<?>) transform);
} else if (transform instanceof UnionTransformation<?>) {
    transformedIds = transformUnion((UnionTransformation<?>) transform);
} else if (transform instanceof SplitTransformation<?>) {
    transformedIds = transformSplit((SplitTransformation<?>) transform);
} else if (transform instanceof SelectTransformation<?>) {
    transformedIds = transformSelect((SelectTransformation<?>) transform);
} else if (transform instanceof FeedbackTransformation<?>) {
    transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
} else if (transform instanceof CoFeedbackTransformation<?>) {
    transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
} else if (transform instanceof PartitionTransformation<?>) {
    transformedIds = transformPartition((PartitionTransformation<?>) transform);
} else if (transform instanceof SideOutputTransformation<?>) {
    transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
} else {
    throw new IllegalStateException("Unknown transformation: " + transform);
}
A different transform* method is invoked here depending on the concrete Transformation type.
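Note that this dispatch sits inside StreamGraphGenerator#transform, which memoizes results so that a shared upstream transformation is only translated once (hedged sketch from the Flink 1.x source; logging and parallelism checks elided):

private Collection<Integer> transform(StreamTransformation<?> transform) {
    if (alreadyTransformed.containsKey(transform)) {
        // this transformation was already translated while recursing from another path
        return alreadyTransformed.get(transform);
    }
    Collection<Integer> transformedIds;
    // ... the instanceof dispatch shown above assigns transformedIds ...
    if (!alreadyTransformed.containsKey(transform)) {
        alreadyTransformed.put(transform, transformedIds);
    }
    return transformedIds;
}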
Let's look at transformOneInputTransform:
private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {
    // recursively transform this transform's direct upstream, collecting the upstream ids
    Collection<Integer> inputIds = transform(transform.getInput());
    // the recursive call may already have transformed this transform
    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }
    String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);
    // add the StreamNode
    streamGraph.addOperator(transform.getId(),
            slotSharingGroup,
            transform.getOperator(),
            transform.getInputType(),
            transform.getOutputType(),
            transform.getName());
    if (transform.getStateKeySelector() != null) {
        TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());
        streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
    }
    streamGraph.setParallelism(transform.getId(), transform.getParallelism());
    // add a StreamEdge from each upstream id
    for (Integer inputId: inputIds) {
        streamGraph.addEdge(inputId, transform.getId(), 0);
    }
    return Collections.singleton(transform.getId());
}
This is where the StreamNodes and StreamEdges get created.
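For completeness, streamGraph.addOperator boils down to StreamGraph#addNode, which creates the StreamNode and registers it in the graph's node map (hedged sketch based on the Flink 1.x source):

protected StreamNode addNode(Integer vertexID, String slotSharingGroup,
        Class<? extends AbstractInvokable> vertexClass,
        StreamOperator<?> operatorObject, String operatorName) {
    if (streamNodes.containsKey(vertexID)) {
        throw new RuntimeException("Duplicate vertexID " + vertexID);
    }
    // the StreamNode remembers the operator object and which task class will run it
    StreamNode vertex = new StreamNode(environment, vertexID, slotSharingGroup,
            operatorObject, operatorName, new ArrayList<OutputSelector<?>>(), vertexClass);
    streamNodes.put(vertexID, vertex);
    return vertex;
}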
When the walk reaches the transformation produced by the keyBy operator, it goes down the transformPartition branch:
private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
    StreamTransformation<T> input = partition.getInput();
    List<Integer> resultIds = new ArrayList<>();
    Collection<Integer> transformedIds = transform(input);
    for (Integer transformedId: transformedIds) {
        int virtualId = StreamTransformation.getNewNodeId();
        streamGraph.addVirtualPartitionNode(transformedId, virtualId, partition.getPartitioner());
        resultIds.add(virtualId);
    }
    return resultIds;
}
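addVirtualPartitionNode itself (hedged sketch from the Flink 1.x source) only records the mapping from the new virtual id to the pair of upstream node id and partitioner; nothing is materialized in the graph yet:

public void addVirtualPartitionNode(Integer originalId, Integer virtualId, StreamPartitioner<?> partitioner) {
    if (virtuaPartitionNodes.containsKey(virtualId)) {
        throw new IllegalStateException("Already has virtual partition node with id " + virtualId);
    }
    // map the virtual node id to its upstream node id plus the partitioner to apply
    virtuaPartitionNodes.put(virtualId, new Tuple2<Integer, StreamPartitioner<?>>(originalId, partitioner));
}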
So transforming a partition does not create a concrete StreamNode or StreamEdge; it only adds a virtual node. When the partition's downstream transform (map, for example) adds its edge by calling StreamGraph.addEdge, the partitioner information gets written into that edge, as StreamGraph.addEdgeInternal shows:
public void addEdge(Integer upStreamVertexID, Integer downStreamVertexID, int typeNumber) {
    addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, null, new ArrayList<String>());
}

private void addEdgeInternal(Integer upStreamVertexID,
        Integer downStreamVertexID,
        int typeNumber,
        StreamPartitioner<?> partitioner,
        List<String> outputNames) {
    // if the upstream is a select virtual node, recurse, carrying the select info along
    if (virtualSelectNodes.containsKey(upStreamVertexID)) {
        int virtualId = upStreamVertexID;
        // the id of the node upstream of the select
        upStreamVertexID = virtualSelectNodes.get(virtualId).f0;
        if (outputNames.isEmpty()) {
            // selections that happen downstream override earlier selections
            outputNames = virtualSelectNodes.get(virtualId).f1;
        }
        addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames);
    }
    // if the upstream is a partition virtual node, recurse, carrying the partitioner along
    else if (virtuaPartitionNodes.containsKey(upStreamVertexID)) {
        int virtualId = upStreamVertexID;
        // the id of the node upstream of the partition
        upStreamVertexID = virtuaPartitionNodes.get(virtualId).f0;
        if (partitioner == null) {
            partitioner = virtuaPartitionNodes.get(virtualId).f1;
        }
        addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames);
    } else {
        // actually build the StreamEdge
        StreamNode upstreamNode = getStreamNode(upStreamVertexID);
        StreamNode downstreamNode = getStreamNode(downStreamVertexID);
        // if no partitioner was specified, pick forward or rebalance partitioning
        if (partitioner == null && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {
            partitioner = new ForwardPartitioner<Object>();
        } else if (partitioner == null) {
            partitioner = new RebalancePartitioner<Object>();
        }
        // sanity check: forward partitioning requires equal upstream and downstream parallelism
        if (partitioner instanceof ForwardPartitioner) {
            if (upstreamNode.getParallelism() != downstreamNode.getParallelism()) {
                throw new UnsupportedOperationException("Forward partitioning does not allow " +
                        "change of parallelism. Upstream operation: " + upstreamNode + " parallelism: " + upstreamNode.getParallelism() +
                        ", downstream operation: " + downstreamNode + " parallelism: " + downstreamNode.getParallelism() +
                        " You must use another partitioning strategy, such as broadcast, rebalance, shuffle or global.");
            }
        }
        // create the StreamEdge
        StreamEdge edge = new StreamEdge(upstreamNode, downstreamNode, typeNumber, outputNames, partitioner);
        // register the edge as an out-edge of the upstream node and an in-edge of the downstream node
        getStreamNode(edge.getSourceId()).addOutEdge(edge);
        getStreamNode(edge.getTargetId()).addInEdge(edge);
    }
}
Finally, the generated StreamGraph for this job: Source -> Flat Map -> Map -> Reduce -> Sink, where keyBy shows up not as a StreamNode of its own but as the hash partitioner recorded on the StreamEdge between Map and Reduce.