I previously wrote a piece on the Spark source code that only covered Spark's batch-processing side; this write-up on Flink focuses mainly on Flink's stream processing.
----------
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountStreamingDemo {

    public static void main(String[] args) throws Exception {
        // create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        final DataStreamSource<String> dataStreamSource = env.addSource(new MySource());
        final DataStream<Tuple2<String, String>> tuple2SingleOutputStreamOperator = dataStreamSource
                .flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, String>> out) throws Exception {
                        String[] splits = value.toLowerCase().split("\\W+");
                        for (String split : splits) {
                            if (split.length() > 0) {
                                out.collect(new Tuple2<>(split, "1"));
                            }
                        }
                    }
                });
        final SingleOutputStreamOperator<Tuple2<String, Integer>> map = tuple2SingleOutputStreamOperator.map(
                new MapFunction<Tuple2<String, String>, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(Tuple2<String, String> value) throws Exception {
                        return new Tuple2<>(value.f0, Integer.parseInt(value.f1));
                    }
                });
        final KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream = map
                .keyBy(0);
        final DataStream<Tuple2<String, Integer>> reduce = tuple2TupleKeyedStream
                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2)
                            throws Exception {
                        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
                    }
                });
        reduce.print();
        // a streaming program only starts when execute() is called; without it nothing runs
        env.execute("zhisheng —— word count streaming demo");
    }
}
Taking the code above as the running example, start with the creation of the stream execution environment.
After the first line has executed, it returns a StreamExecutionEnvironment instance. Take a look at this instance's member variables:
The first two are configuration components. The first stores its settings internally in a HashMap, keyed by name; the second, config, is an ExecutionConfig that maintains quite a few settings. To pick a few:
maxParallelism: the maximum parallelism
autoWatermarkInterval: the interval at which watermarks are emitted automatically
checkpointCfg: the checkpoint-related configuration
transformations is a very important component: an ArrayList that keeps every operator registered with the environment. We will dig into it further below.
isChainingEnabled is a switch that controls operator chaining; think of it as roughly analogous to Spark fusing operators into a stage.
timeCharacteristic records the time semantics in use (processing, ingestion, or event time).
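To make the list above concrete, here is a condensed, paraphrased view of the relevant fields of the Flink 1.x StreamExecutionEnvironment (modifiers and initializers simplified; the HashMap-backed configuration object mentioned first is omitted):

public abstract class StreamExecutionEnvironment {
    // execution settings: parallelism, maxParallelism, autoWatermarkInterval, ...
    private final ExecutionConfig config = new ExecutionConfig();
    // checkpoint settings: interval, mode, timeout, ...
    private final CheckpointConfig checkpointCfg = new CheckpointConfig();
    // every transformation (operator) registered with this environment lands here
    protected final List<StreamTransformation<?>> transformations = new ArrayList<>();
    // whether operators may be chained together into a single task
    protected boolean isChainingEnabled = true;
    // processing time / ingestion time / event time
    private TimeCharacteristic timeCharacteristic = TimeCharacteristic.ProcessingTime;
}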
----------
Step into final DataStreamSource<String> dataStreamSource = env.addSource(new MySource()); and trace StreamExecutionEnvironment#addSource.
It first resolves the source's return type, then creates a StreamSource object that wraps the user-defined SourceFunction. That StreamSource is then packaged, together with the StreamExecutionEnvironment and the TypeInformation, into a DataStreamSource. DataStreamSource extends DataStream and internally holds a transformation object, which is the concrete representation of an operator; that is the shape of the DataStreamSource that finally gets returned.
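Abridged from the Flink 1.x source, the tail of addSource looks roughly like this:

public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {
    // sources that implement ParallelSourceFunction may run with parallelism > 1
    boolean isParallel = function instanceof ParallelSourceFunction;
    clean(function);
    // wrap the user-defined SourceFunction in a StreamSource operator
    StreamSource<OUT, ?> sourceOperator = new StreamSource<>(function);
    // bundle the environment, the type info and the operator into the returned DataStreamSource
    return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);
}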
Next, trace the flatMap operation, which bottoms out in DataStream#transform.
Inside it, getExecutionEnvironment().addOperator(resultTransform); delegates to StreamExecutionEnvironment#addOperator, which simply appends the DataStream's new transformation to the transformations collection mentioned earlier.
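Lightly abridged from the Flink 1.x source, DataStream#transform shows both steps: the new OneInputTransformation wraps this.transformation as its input, and is then registered with the environment:

public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
    // read the output type of the input transform early to surface MissingTypeInfo errors
    transformation.getOutputType();
    // the new transformation points at its input, forming the chain
    OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation, operatorName, operator, outTypeInfo, environment.getParallelism());
    @SuppressWarnings({"unchecked", "rawtypes"})
    SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
    // append the new transformation to the environment's transformations list
    getExecutionEnvironment().addOperator(resultTransform);
    return returnStream;
}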
Jump straight to the reduce object just before the final print: at that point the transformations list contains four objects, one for each of the four operators above.
Next, analyze env.execute. Running locally, this resolves to LocalStreamEnvironment#execute.
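Abridged from the Flink 1.x source (the MiniCluster configuration details are elided), the whole method looks roughly like this:

public JobExecutionResult execute(String jobName) throws Exception {
    // turn the collected transformations into a StreamGraph, then into a JobGraph
    StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    JobGraph jobGraph = streamGraph.getJobGraph();
    // ... build a MiniClusterConfiguration cfg from the job configuration ...
    MiniCluster miniCluster = new MiniCluster(cfg);
    try {
        miniCluster.start();
        // submit the JobGraph to the in-JVM cluster and block until the job finishes
        return miniCluster.executeJobBlocking(jobGraph);
    } finally {
        transformations.clear();
        miniCluster.close();
    }
}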
Its first step, StreamGraph streamGraph = getStreamGraph();, traces into StreamGraphGenerator#generateInternal:
for (StreamTransformation<?> transformation: transformations) {
    transform(transformation);
}
You can see it iterates over the transformations collected earlier in the env and calls transform on each element. Here comes the important part:
Collection<Integer> transformedIds;
if (transform instanceof OneInputTransformation<?, ?>) {
    transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
} else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
    transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
} else if (transform instanceof SourceTransformation<?>) {
    transformedIds = transformSource((SourceTransformation<?>) transform);
} else if (transform instanceof SinkTransformation<?>) {
    transformedIds = transformSink((SinkTransformation<?>) transform);
} else if (transform instanceof UnionTransformation<?>) {
    transformedIds = transformUnion((UnionTransformation<?>) transform);
} else if (transform instanceof SplitTransformation<?>) {
    transformedIds = transformSplit((SplitTransformation<?>) transform);
} else if (transform instanceof SelectTransformation<?>) {
    transformedIds = transformSelect((SelectTransformation<?>) transform);
} else if (transform instanceof FeedbackTransformation<?>) {
    transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
} else if (transform instanceof CoFeedbackTransformation<?>) {
    transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
} else if (transform instanceof PartitionTransformation<?>) {
    transformedIds = transformPartition((PartitionTransformation<?>) transform);
} else if (transform instanceof SideOutputTransformation<?>) {
    transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
} else {
    throw new IllegalStateException("Unknown transformation: " + transform);
}
A different transform* method is invoked here depending on the concrete Transformation type.
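Note that this dispatch sits inside StreamGraphGenerator#transform, which memoizes results so that a shared upstream transformation is only translated once (hedged sketch from the Flink 1.x source; logging and parallelism checks elided):

private Collection<Integer> transform(StreamTransformation<?> transform) {
    if (alreadyTransformed.containsKey(transform)) {
        // this transformation was already translated while recursing from another path
        return alreadyTransformed.get(transform);
    }
    Collection<Integer> transformedIds;
    // ... the instanceof dispatch shown above assigns transformedIds ...
    if (!alreadyTransformed.containsKey(transform)) {
        alreadyTransformed.put(transform, transformedIds);
    }
    return transformedIds;
}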
Let's look at transformOneInputTransform:
private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {
    // recursively transform this transform's direct upstream, collecting the upstream ids
    Collection<Integer> inputIds = transform(transform.getInput());
    // the recursive call may already have transformed this transform
    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }
    String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);
    // add the StreamNode
    streamGraph.addOperator(transform.getId(),
            slotSharingGroup,
            transform.getOperator(),
            transform.getInputType(),
            transform.getOutputType(),
            transform.getName());
    if (transform.getStateKeySelector() != null) {
        TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());
        streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
    }
    streamGraph.setParallelism(transform.getId(), transform.getParallelism());
    // add a StreamEdge from each upstream id
    for (Integer inputId: inputIds) {
        streamGraph.addEdge(inputId, transform.getId(), 0);
    }
    return Collections.singleton(transform.getId());
}
This is where the StreamNodes and StreamEdges get created.
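For completeness, streamGraph.addOperator boils down to StreamGraph#addNode, which creates the StreamNode and registers it in the graph's node map (hedged sketch based on the Flink 1.x source):

protected StreamNode addNode(Integer vertexID, String slotSharingGroup,
        Class<? extends AbstractInvokable> vertexClass,
        StreamOperator<?> operatorObject, String operatorName) {
    if (streamNodes.containsKey(vertexID)) {
        throw new RuntimeException("Duplicate vertexID " + vertexID);
    }
    // the StreamNode remembers the operator object and which task class will run it
    StreamNode vertex = new StreamNode(environment, vertexID, slotSharingGroup,
            operatorObject, operatorName, new ArrayList<OutputSelector<?>>(), vertexClass);
    streamNodes.put(vertexID, vertex);
    return vertex;
}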
When the walk reaches the transformation produced by the keyBy operator, it goes down the transformPartition branch:
private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
    StreamTransformation<T> input = partition.getInput();
    List<Integer> resultIds = new ArrayList<>();
    Collection<Integer> transformedIds = transform(input);
    for (Integer transformedId: transformedIds) {
        int virtualId = StreamTransformation.getNewNodeId();
        streamGraph.addVirtualPartitionNode(transformedId, virtualId, partition.getPartitioner());
        resultIds.add(virtualId);
    }
    return resultIds;
}
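addVirtualPartitionNode itself (hedged sketch from the Flink 1.x source) only records the mapping from the new virtual id to the pair of upstream node id and partitioner; nothing is materialized in the graph yet:

public void addVirtualPartitionNode(Integer originalId, Integer virtualId, StreamPartitioner<?> partitioner) {
    if (virtuaPartitionNodes.containsKey(virtualId)) {
        throw new IllegalStateException("Already has virtual partition node with id " + virtualId);
    }
    // map the virtual node id to its upstream node id plus the partitioner to apply
    virtuaPartitionNodes.put(virtualId, new Tuple2<Integer, StreamPartitioner<?>>(originalId, partitioner));
}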
So transforming a partition does not create a concrete StreamNode or StreamEdge; it only adds a virtual node. When the partition's downstream transform (map, for example) adds its edge by calling StreamGraph.addEdge, the partitioner information gets written into that edge, as StreamGraph.addEdgeInternal shows:
public void addEdge(Integer upStreamVertexID, Integer downStreamVertexID, int typeNumber) {
    addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, null, new ArrayList<String>());
}

private void addEdgeInternal(Integer upStreamVertexID,
        Integer downStreamVertexID,
        int typeNumber,
        StreamPartitioner<?> partitioner,
        List<String> outputNames) {
    // if the upstream is a select virtual node, recurse, carrying the select info along
    if (virtualSelectNodes.containsKey(upStreamVertexID)) {
        int virtualId = upStreamVertexID;
        // the id of the node upstream of the select
        upStreamVertexID = virtualSelectNodes.get(virtualId).f0;
        if (outputNames.isEmpty()) {
            // selections that happen downstream override earlier selections
            outputNames = virtualSelectNodes.get(virtualId).f1;
        }
        addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames);
    }
    // if the upstream is a partition virtual node, recurse, carrying the partitioner along
    else if (virtuaPartitionNodes.containsKey(upStreamVertexID)) {
        int virtualId = upStreamVertexID;
        // the id of the node upstream of the partition
        upStreamVertexID = virtuaPartitionNodes.get(virtualId).f0;
        if (partitioner == null) {
            partitioner = virtuaPartitionNodes.get(virtualId).f1;
        }
        addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames);
    } else {
        // actually build the StreamEdge
        StreamNode upstreamNode = getStreamNode(upStreamVertexID);
        StreamNode downstreamNode = getStreamNode(downStreamVertexID);
        // if no partitioner was specified, pick forward or rebalance partitioning
        if (partitioner == null && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {
            partitioner = new ForwardPartitioner<Object>();
        } else if (partitioner == null) {
            partitioner = new RebalancePartitioner<Object>();
        }
        // sanity check: forward partitioning requires equal upstream and downstream parallelism
        if (partitioner instanceof ForwardPartitioner) {
            if (upstreamNode.getParallelism() != downstreamNode.getParallelism()) {
                throw new UnsupportedOperationException("Forward partitioning does not allow " +
                        "change of parallelism. Upstream operation: " + upstreamNode + " parallelism: " + upstreamNode.getParallelism() +
                        ", downstream operation: " + downstreamNode + " parallelism: " + downstreamNode.getParallelism() +
                        " You must use another partitioning strategy, such as broadcast, rebalance, shuffle or global.");
            }
        }
        // create the StreamEdge
        StreamEdge edge = new StreamEdge(upstreamNode, downstreamNode, typeNumber, outputNames, partitioner);
        // register the edge as an out-edge of the upstream node and an in-edge of the downstream node
        getStreamNode(edge.getSourceId()).addOutEdge(edge);
        getStreamNode(edge.getTargetId()).addInEdge(edge);
    }
}
Finally, the generated StreamGraph for this job: Source -> Flat Map -> Map -> Reduce -> Sink, where keyBy shows up not as a StreamNode of its own but as the hash partitioner recorded on the StreamEdge between Map and Reduce.