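Prerequisite 1: DAG
To understand StreamGraph, you first need a working knowledge of directed acyclic graphs (DAGs): what vertices and edges are, and how a DAG is built from them.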
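As a refresher on vertices and edges, here is a minimal sketch of a DAG in plain Java. The class name SimpleDag and its methods are made up for illustration and have nothing to do with Flink's own classes. It stores edges as adjacency lists and uses Kahn's algorithm to produce a topological order, which only exists when the graph has no cycle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny DAG keyed by vertex name. Not a Flink class.
class SimpleDag {
    // vertex -> downstream vertices (outgoing edges)
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    void addVertex(String v) {
        edges.putIfAbsent(v, new ArrayList<>());
    }

    void addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        edges.get(from).add(to);
    }

    // Kahn's algorithm: returns the vertices in topological order,
    // or throws if the graph contains a cycle (so it is not a DAG).
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : edges.keySet()) {
            inDegree.put(v, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.poll();
            order.add(v);
            for (String t : edges.get(v)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("Graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        SimpleDag dag = new SimpleDag();
        dag.addEdge("source", "map");
        dag.addEdge("map", "sink");
        System.out.println(dag.topologicalOrder()); // prints [source, map, sink]
    }
}
```

A streaming job maps onto this picture naturally: operators are the vertices, and the data flowing from one operator to the next forms the edges.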
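Prerequisite 2: Transformation
Each operator applied in a Flink streaming program is converted into a Transformation object and stored. The StreamGraph is then built from this collection of Transformation objects.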
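To make the idea concrete, the following is a rough sketch of how a list of transformation objects can be turned into graph nodes and edges. SketchTransformation and SketchGraphGenerator are hypothetical stand-ins written for this article, not Flink classes, and the real generator handles many more cases. The sketch assumes that each transformation knows its upstream inputs, so walking the inputs recursively is enough to discover the whole topology.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Transformation: a name plus its upstream inputs.
class SketchTransformation {
    final String name;
    final List<SketchTransformation> inputs;

    SketchTransformation(String name, SketchTransformation... inputs) {
        this.name = name;
        this.inputs = Arrays.asList(inputs);
    }
}

// Hypothetical generator: turns a list of transformations into nodes and edges.
class SketchGraphGenerator {
    final Set<String> nodes = new LinkedHashSet<>();
    final List<String> edges = new ArrayList<>();
    // Remembers what has been visited so shared inputs are processed once.
    private final Set<SketchTransformation> alreadyTransformed = new LinkedHashSet<>();

    void generate(List<SketchTransformation> transformations) {
        for (SketchTransformation t : transformations) {
            transform(t);
        }
    }

    private void transform(SketchTransformation t) {
        if (!alreadyTransformed.add(t)) {
            return; // visited earlier through another path
        }
        // Upstream transformations become nodes before the current one.
        for (SketchTransformation input : t.inputs) {
            transform(input);
        }
        nodes.add(t.name);
        for (SketchTransformation input : t.inputs) {
            edges.add(input.name + "->" + t.name);
        }
    }

    public static void main(String[] args) {
        SketchTransformation source = new SketchTransformation("source");
        SketchTransformation map = new SketchTransformation("map", source);
        SketchTransformation sink = new SketchTransformation("sink", map);

        SketchGraphGenerator generator = new SketchGraphGenerator();
        // Only map and sink are passed in; source is reached through map's inputs.
        generator.generate(Arrays.asList(map, sink));
        System.out.println(generator.nodes);
        System.out.println(generator.edges);
    }
}
```

The point of the sketch is the shape of the traversal: inputs first, then the node itself, then one edge per input.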
Transformation types
- OneInputTransformation
A transformation with exactly one input, used by operators such as map, filter, and process.
- TwoInputTransformation
A transformation with two inputs. Operators on ConnectedStreams work on two streams and are converted into a TwoInputTransformation.
- SourceTransformation
Calling env.addSource() creates a DataStreamSource, whose constructor creates a SourceTransformation.
- SinkTransformation
Similar to SourceTransformation: calling addSink on a DataStream produces a DataStreamSink object, which constructs a SinkTransformation when it is created.
- UnionTransformation
Calling union on a DataStream creates a UnionTransformation:
public final DataStream<T> union(DataStream<T>... streams) {
    List<Transformation<T>> unionedTransforms = new ArrayList<>();
    unionedTransforms.add(this.transformation);
    for (DataStream<T> newStream : streams) {
        if (!getType().equals(newStream.getType())) {
            throw new IllegalArgumentException(
                    "Cannot union streams of different types: "
                            + getType()
                            + " and "
                            + newStream.getType());
        }
        unionedTransforms.add(newStream.getTransformation());
    }
    return new DataStream<>(this.environment, new UnionTransformation<>(unionedTransforms));
}
- FeedbackTransformation
When iterate() is called on a DataStream, the IterativeStream it creates also constructs a FeedbackTransformation:
protected IterativeStream(DataStream<T> dataStream, long maxWaitTime) {
    super(
            dataStream.getExecutionEnvironment(),
            new FeedbackTransformation<>(dataStream.getTransformation(), maxWaitTime));
    this.originalInput = dataStream;
    this.maxWaitTime = maxWaitTime;
    setBufferTimeout(dataStream.environment.getBufferTimeout());
}
- CoFeedbackTransformation
A CoFeedbackTransformation is created when a ConnectedIterativeStreams is constructed:
public static class ConnectedIterativeStreams<I, F> extends ConnectedStreams<I, F> {
    private CoFeedbackTransformation<F> coFeedbackTransformation;

    public ConnectedIterativeStreams(
            DataStream<I> input, TypeInformation<F> feedbackType, long waitTime) {
        super(
                input.getExecutionEnvironment(),
                input,
                new DataStream<>(
                        input.getExecutionEnvironment(),
                        new CoFeedbackTransformation<>(
                                input.getParallelism(), feedbackType, waitTime)));
        this.coFeedbackTransformation =
                (CoFeedbackTransformation<F>) getSecondInput().getTransformation();
    }
    // remaining members omitted
}
- PartitionTransformation
Operators that control how records are routed downstream, such as shuffle, forward, rebalance, and keyBy, all produce a PartitionTransformation:
@PublicEvolving
public DataStream<T> shuffle() {
    return setConnectionType(new ShufflePartitioner<T>());
}

protected DataStream<T> setConnectionType(StreamPartitioner<T> partitioner) {
    return new DataStream<>(
            this.getExecutionEnvironment(),
            new PartitionTransformation<>(this.getTransformation(), partitioner));
}
- SideOutputTransformation
A SideOutputTransformation is created when a side output is used.
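For the partitioning operators in the list above (shuffle, forward, rebalance, keyBy), it may help to see what "controlling the routing" means in code. The sketch below is a simplified model, not Flink's implementation: SketchPartitioner and SketchPartitioners are invented names, and the key-based variant uses a plain hashCode modulo, whereas Flink's own key partitioning is more involved. Each strategy simply answers one question: given a record and the number of downstream channels, which channel receives it?

```java
import java.util.Random;
import java.util.function.Function;

// Hypothetical interface: pick a downstream channel for a record.
interface SketchPartitioner<T> {
    int selectChannel(T record, int numChannels);
}

// Hypothetical factory methods mirroring the operator names in the text.
class SketchPartitioners {
    // forward: always hand the record to the same (first) channel.
    static <T> SketchPartitioner<T> forward() {
        return (record, n) -> 0;
    }

    // shuffle: pick a channel at random.
    static <T> SketchPartitioner<T> shuffle(long seed) {
        Random random = new Random(seed);
        return (record, n) -> random.nextInt(n);
    }

    // rebalance: cycle through the channels round-robin.
    static <T> SketchPartitioner<T> rebalance() {
        int[] next = {-1};
        return (record, n) -> {
            next[0] = (next[0] + 1) % n;
            return next[0];
        };
    }

    // keyBy: records with equal keys always land on the same channel.
    // Assumption: a plain hashCode modulo stands in for the real key distribution.
    static <T, K> SketchPartitioner<T> keyBy(Function<T, K> keySelector) {
        return (record, n) -> Math.floorMod(keySelector.apply(record).hashCode(), n);
    }

    public static void main(String[] args) {
        SketchPartitioner<String> rebalance = SketchPartitioners.rebalance();
        for (String word : new String[] {"a", "b", "c", "d"}) {
            System.out.println(word + " -> channel " + rebalance.selectChannel(word, 3));
        }
    }
}
```

Seen this way, a partitioning step does no computation on the records themselves. It only describes the connection between two operators.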
Building the StreamGraph
We start the analysis from the execute method of StreamExecutionEnvironment:
public JobExecutionResult execute(String jobName) throws Exception {
    Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");
    final StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    return execute(streamGraph);
}
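execute calls getStreamGraph(), which produces a StreamGraph instance. It then sets the job name on the graph and submits it for execution. The logic that builds the StreamGraph therefore lives in getStreamGraph(), so that is where we look next.
The getStreamGraph method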
public StreamGraph getStreamGraph() {
    return getStreamGraph(true);
}

public StreamGraph getStreamGraph(boolean clearTransformations) {
    final StreamGraph streamGraph = getStreamGraphGenerator(transformations).generate();
    if (clearTransformations) {
        transformations.clear();
    }
    return streamGraph;
}
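The effect of the clearTransformations flag in the code above is easy to miss, so here is a small model of it. SketchEnvironment is a hypothetical class written for this article, with strings standing in for Transformation objects and a list copy standing in for generate(). One simplification to be aware of: when the list is empty the sketch just returns an empty result, and real Flink may behave differently in that situation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the environment's bookkeeping; not Flink code.
class SketchEnvironment {
    private final List<String> transformations = new ArrayList<>();

    // Records one transformation (a plain string in this sketch).
    void add(String transformation) {
        transformations.add(transformation);
    }

    // Same shape as getStreamGraph(boolean): build first, then optionally clear.
    List<String> getGraph(boolean clearTransformations) {
        // A copy of the list stands in for the generated graph.
        List<String> graph = new ArrayList<>(transformations);
        if (clearTransformations) {
            transformations.clear();
        }
        return graph;
    }

    // Number of transformations still recorded in the environment.
    int pending() {
        return transformations.size();
    }

    public static void main(String[] args) {
        SketchEnvironment env = new SketchEnvironment();
        env.add("map");
        env.add("sink");

        System.out.println(env.getGraph(false) + ", pending = " + env.pending());
        System.out.println(env.getGraph(true) + ", pending = " + env.pending());
        System.out.println(env.getGraph(true) + ", pending = " + env.pending());
    }
}
```

In this model, passing false leaves the recorded transformations in place so the graph can be generated again, while passing true (what the no-argument overload does) empties the list once the graph has been built.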
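The line getStreamGraphGenerator(transformations).generate(); builds a StreamGraphGenerator from the transformations list and then calls generate to produce the StreamGraph. That raises a question: when is the transformations collection populated? Tracing the code shows that it is filled in the addOperator method of StreamExecutionEnvironment: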
public void addOperator(Transformation<?>