Flink has four layers of graphs: StreamGraph -> JobGraph -> ExecutionGraph -> physical execution graph
As shown in the figure, a StreamGraph consists mainly of StreamNode and StreamEdge objects.
Example code:
// get input data by connecting to the socket
DataStream<String> source = env.socketTextStream(hostname, port, "\n").name("source");
// parse the data, group it, window it, and aggregate the counts
source.flatMap(
        new FlatMapFunction<String, WordWithCount>() {
            @Override
            public void flatMap(
                    String value, Collector<WordWithCount> out) {
                for (String word : value.split("\\s")) {
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        }).setParallelism(3).name("FlatMapSplit")
    .keyBy(value -> value.word)
    .flatMap(new CountMapFunction()).setParallelism(3).name("FlatMapCount")
    .addSink(new SinkFunction<Tuple2<String, Long>>() {
        @Override
        public void invoke(Tuple2<String, Long> value, Context context) throws Exception {
            SinkFunction.super.invoke(value, context);
            System.out.println(value);
        }
    }).setParallelism(1).name("SinkFunction");
env.execute("Socket WordCount");
source --> FlatMapSplit (3) --> keyBy() --> FlatMapCount (3) --> SinkFunction(1)
This code counts the words received on a given socket port.
How the StreamGraph is created
Starting from env.execute("Socket WordCount") and stepping down through the calls, we arrive at StreamGraphGenerator.generate():
public StreamGraph generate() {
    this.streamGraph = new StreamGraph(this.executionConfig, this.checkpointConfig, this.savepointRestoreSettings);
    this.shouldExecuteInBatchMode = this.shouldExecuteInBatchMode(this.runtimeExecutionMode);
    this.configureStreamGraph(this.streamGraph);
    this.alreadyTransformed = new HashMap();
    Iterator var1 = this.transformations.iterator();
    while(var1.hasNext()) {
        Transformation<?> transformation = (Transformation)var1.next();
        this.transform(transformation);
    }
    StreamGraph builtStreamGraph = this.streamGraph;
    this.alreadyTransformed.clear();
    this.alreadyTransformed = null;
    this.streamGraph = null;
    return builtStreamGraph;
}
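The loop in generate() can be read as a memoized traversal: each transformation is translated at most once, with alreadyTransformed acting as the cache. The sketch below is a simplified model in plain Java, not Flink's code. ToyGenerator, ToyTransformation, and translationOrder are hypothetical names, and translating a transformation's inputs before the transformation itself is an assumption about what transform() does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyGenerator {
    /** Hypothetical stand-in for a Transformation: a name plus its upstream inputs. */
    static final class ToyTransformation {
        final String name;
        final List<ToyTransformation> inputs;

        ToyTransformation(String name, ToyTransformation... inputs) {
            this.name = name;
            this.inputs = List.of(inputs);
        }
    }

    // Cache of transformations that have already been translated.
    private final Map<ToyTransformation, Integer> alreadyTransformed = new HashMap<>();
    // Records the order in which transformations were translated.
    final List<String> translationOrder = new ArrayList<>();

    /** Mirrors the while-loop in generate(): translate every registered transformation. */
    List<String> generate(List<ToyTransformation> transformations) {
        for (ToyTransformation t : transformations) {
            transform(t);
        }
        return translationOrder;
    }

    private int transform(ToyTransformation t) {
        Integer cached = alreadyTransformed.get(t);
        if (cached != null) {
            return cached; // already translated, do not add it twice
        }
        for (ToyTransformation input : t.inputs) {
            transform(input); // assumption: inputs are translated first
        }
        int id = translationOrder.size() + 1;
        translationOrder.add(t.name);
        alreadyTransformed.put(t, id);
        return id;
    }

    public static void main(String[] args) {
        ToyTransformation source = new ToyTransformation("source");
        ToyTransformation split = new ToyTransformation("FlatMapSplit", source);
        ToyTransformation count = new ToyTransformation("FlatMapCount", split);
        ToyTransformation sink = new ToyTransformation("SinkFunction", count);
        // The source is only reached through the inputs of FlatMapSplit.
        System.out.println(new ToyGenerator().generate(List.of(split, count, sink)));
        // prints [source, FlatMapSplit, FlatMapCount, SinkFunction]
    }
}
```

In this model, passing the same transformation twice has no extra effect, which is the role the cache plays in the loop.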
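Contents of the StreamGraph
Having located the StreamGraph class, let's run in debug mode and inspect what a StreamGraph holds:
The StreamGraph contains the streamNodes we were looking for, along with various familiar settings: task configuration, checkpoint configuration, savepoint configuration, and so on.
streamNodes is a HashMap whose values are of type StreamNode, as shown in the figure:
Question: why does this HashMap have no key 3? Could it be that keyBy() is not a StreamNode but still takes up a key?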
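One likely explanation, sketched as a toy model: every step of the pipeline, keyBy() included, draws an id from a shared counter when it is created, but a partitioning step is not materialised as a StreamNode. ToyIdAllocation and its methods are hypothetical names. The counter starting at 1 and keyBy() consuming an id without producing a node are assumptions about Flink's behaviour, chosen because they reproduce the keys seen in the debugger.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class ToyIdAllocation {
    // Assumption: one shared counter hands out ids, starting at 1.
    private int idCounter = 0;
    // Only real operators end up here, keyed by their id.
    private final Map<Integer, String> streamNodes = new LinkedHashMap<>();

    /** An operator (source, flatMap, sink) takes an id and becomes a node. */
    int addOperator(String name) {
        int id = ++idCounter;
        streamNodes.put(id, name);
        return id;
    }

    /** A partitioning step such as keyBy() takes an id but adds no node (assumption). */
    int addPartitioning() {
        return ++idCounter;
    }

    Set<Integer> streamNodeIds() {
        return streamNodes.keySet();
    }

    /** Replays the word-count pipeline from the example in declaration order. */
    static ToyIdAllocation wordCount() {
        ToyIdAllocation graph = new ToyIdAllocation();
        graph.addOperator("source");        // id 1
        graph.addOperator("FlatMapSplit");  // id 2
        graph.addPartitioning();            // id 3, keyBy(): no node
        graph.addOperator("FlatMapCount");  // id 4
        graph.addOperator("SinkFunction");  // id 5
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(wordCount().streamNodeIds()); // prints [1, 2, 4, 5]
    }
}
```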
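Contents of a StreamNode
Next, let's look at the structure of a StreamNode, starting with the first one, the source:
It holds quite a lot of information:
- The node's parallelism. (Open question: what is maxParallelism for?)
- slotSharingGroup: presumably, different StreamNodes with the same group name can share a slot?
- inEdges: an ArrayList of StreamEdge that defines the node's incoming edges (obviously, the source has no inEdges)
- outEdges: an ArrayList of StreamEdge that defines the node's outgoing edges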
Contents of a StreamEdge
- edgeId: a string that identifies the edge's source and target
- The data exchange mode used on output: REBALANCE
- The operator names of the source and target
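To make the relationship between the two structures concrete, here is a minimal model of the word-count topology in plain Java. It is a sketch, not Flink's StreamNode or StreamEdge: ToyStreamGraph, ToyNode, ToyEdge, and the edgeId format are hypothetical. The partitioner rule (HASH after keyBy(), FORWARD when parallelism matches, REBALANCE otherwise) is an assumption that is consistent with the REBALANCE seen on the source's outgoing edge.

```java
import java.util.ArrayList;
import java.util.List;

class ToyStreamGraph {
    /** Hypothetical node: id, name, parallelism, and its incoming and outgoing edges. */
    static final class ToyNode {
        final int id;
        final String name;
        final int parallelism;
        final List<ToyEdge> inEdges = new ArrayList<>();
        final List<ToyEdge> outEdges = new ArrayList<>();

        ToyNode(int id, String name, int parallelism) {
            this.id = id;
            this.name = name;
            this.parallelism = parallelism;
        }
    }

    /** Hypothetical edge: links two nodes and records how data is distributed. */
    static final class ToyEdge {
        final String edgeId;
        final ToyNode source;
        final ToyNode target;
        final String partitioner;

        ToyEdge(ToyNode source, ToyNode target, String partitioner) {
            this.source = source;
            this.target = target;
            this.partitioner = partitioner;
            // Illustrative id format only; Flink's real format may differ.
            this.edgeId = source.id + "_" + target.id + "_" + partitioner;
        }
    }

    /** Creates an edge and registers it on both endpoints. */
    static ToyEdge connect(ToyNode source, ToyNode target, boolean keyed) {
        String partitioner;
        if (keyed) {
            partitioner = "HASH";
        } else if (source.parallelism == target.parallelism) {
            partitioner = "FORWARD";
        } else {
            partitioner = "REBALANCE";
        }
        ToyEdge edge = new ToyEdge(source, target, partitioner);
        source.outEdges.add(edge);
        target.inEdges.add(edge);
        return edge;
    }

    public static void main(String[] args) {
        ToyNode source = new ToyNode(1, "source", 1);
        ToyNode split = new ToyNode(2, "FlatMapSplit", 3);
        ToyNode count = new ToyNode(4, "FlatMapCount", 3);
        ToyNode sink = new ToyNode(5, "SinkFunction", 1);
        connect(source, split, false);
        connect(split, count, true); // keyBy() sits between these two operators
        connect(count, sink, false);
        for (ToyNode node : List.of(source, split, count, sink)) {
            for (ToyEdge edge : node.outEdges) {
                System.out.println(edge.source.name + " -> " + edge.target.name + " : " + edge.edgeId);
            }
        }
    }
}
```

In this model the source ends up with an empty inEdges list and a single REBALANCE edge in outEdges, matching what the debugger shows for the first StreamNode.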