Flink Source Code Gleanings (3): DataStream

Petrov_Dong

Last modified: 2022-03-02 12:53:09


First published: 2021-11-08 16:09:06

Article link: https://blog.csdn.net/Li_DeSheng/article/details/121195140


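I. Overview

In the previous article we used a WordCount example to start thinking about StreamExecutionEnvironment (SEE). This article starts from the same example; this time, let's look at the part of WordCount where operators transform the stream.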

DataStream<WordWithCount> windowCount = text.flatMap(new FlatMapFunction<String, WordWithCount>() {
            public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
                String[] splits = value.split("\\s");
                for (String word:splits) {
                    out.collect(new WordWithCount(word,1L));
                }
            }
        }).keyBy("word").timeWindow(Time.seconds(2),Time.seconds(1)).sum("count");

Looking at its skeleton, we can roughly read the transformation like this:

newDataStream = oldDataStream.flatMap(new FlatMapFunction).keyBy("word").timeWindow(xxx);

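The SEE (StreamExecutionEnvironment) adds a source, which produces the first DataStream, text. text then passes through a series of operators and in the end is still a DataStream. Each operator plays a different role, but each step of the chain leads to a new DataStream, and flatMap() additionally takes a user-defined function.

From the previous article we know that everything before the SEE calls execute() can be understood as wrapping the program layer by layer. So what exactly is a DataStream, and how are these operators wrapped?

This article starts from DataStream and works step by step through its family of operators.

First, as usual, let's see how the official Javadoc describes DataStream:

A DataStream represents a stream of elements of the same type.
A DataStream can be transformed into another DataStream by applying an operator (for example map or filter).

In short, DataStream is an abstraction of a data stream, and its built-in operators transform it into new data streams.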

II. DataStream's fields and constructor

	// ⭐ The execution environment this stream belongs to
	protected final StreamExecutionEnvironment environment;

	// ⭐ The transformation that produces this DataStream
	protected final StreamTransformation<T> transformation;

	/**
	 * Create a new {@link DataStream} in the given execution environment with
	 * partitioning set to forward by default.
	 *
	 * @param environment The StreamExecutionEnvironment
	 */
	// ⭐ The constructor only null-checks and stores its two arguments
	public DataStream(StreamExecutionEnvironment environment, StreamTransformation<T> transformation) {
		this.environment = Preconditions.checkNotNull(environment, "Execution Environment must not be null.");
		this.transformation = Preconditions.checkNotNull(transformation, "Stream Transformation must not be null.");
	}

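As you can see, DataStream's fields and constructor are very simple. The only fields are the SEE and the StreamTransformation that produces this DataStream, and the constructor just takes those two arguments. Because the environment is shared, we can guess that operators such as keyBy, map, and filter all end up being wrapped into a StreamTransformation, which is then used to construct a new DataStream.

Put simply, a StreamTransformation wraps the operation that produces a new DataStream together with its configuration. We will look at StreamTransformation in detail later; first let's see how DataStream's operator family works.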
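The "environment plus transformation" shape can be sketched in plain Java without any Flink dependency. Everything here (MiniEnv, MiniTransformation, MiniStream, apply, lineage) is a hypothetical stand-in invented for illustration, not Flink's API; the sketch only assumes what the constructor above shows, namely that a stream is a pair of a shared environment and the transformation that produced it.

```java
import java.util.Objects;

// Hypothetical stand-ins, not Flink classes: they only mimic the shape described above.
final class MiniEnv {}

final class MiniTransformation {
    final String name;
    final MiniTransformation input; // what produced the upstream stream; null for a source

    MiniTransformation(String name, MiniTransformation input) {
        this.name = name;
        this.input = input;
    }
}

final class MiniStream {
    final MiniEnv environment;               // shared by every stream in the program
    final MiniTransformation transformation; // the operation that produces this stream

    MiniStream(MiniEnv environment, MiniTransformation transformation) {
        this.environment = Objects.requireNonNull(environment);
        this.transformation = Objects.requireNonNull(transformation);
    }

    // An "operator": wrap the work in a new transformation and return a NEW stream.
    MiniStream apply(String operatorName) {
        return new MiniStream(environment, new MiniTransformation(operatorName, transformation));
    }

    // Walk the inputs backwards to show the chain that leads to this stream.
    String lineage() {
        StringBuilder chain = new StringBuilder();
        for (MiniTransformation t = transformation; t != null; t = t.input) {
            chain.insert(0, chain.length() == 0 ? t.name : t.name + " -> ");
        }
        return chain.toString();
    }
}

final class MiniStreamDemo {
    public static void main(String[] args) {
        MiniStream text = new MiniStream(new MiniEnv(), new MiniTransformation("source", null));
        MiniStream result = text.apply("flatMap").apply("filter");
        System.out.println(result.lineage()); // source -> flatMap -> filter
    }
}
```

Nothing is computed while the chain is built; each call only records one more link, which matches the "layer by layer wrapping" idea.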

The code below is explained in its comments; look for the places marked with ⭐.

III. DataStream's operators

1. The union() operator

First, the union() operator. Besides the current SEE, building its transformation needs a List holding the transformations of all the DataStreams being unioned.

	/**
	 * Creates a new {@link DataStream} by merging {@link DataStream} outputs of
	 * the same type with each other. The DataStreams merged using this operator
	 * will be transformed simultaneously.
	 *
	 * @param streams
	 *            The DataStreams to union output with.
	 * @return The {@link DataStream}.
	 */
	@SafeVarargs
	public final DataStream<T> union(DataStream<T>... streams) {
		// ⭐ Several DataStreams may be passed in, so a list is needed to hold them
		List<StreamTransformation<T>> unionedTransforms = new ArrayList<>();
		unionedTransforms.add(this.transformation); // ⭐ add the current stream's transformation

		// ⭐ Add the transformation of every DataStream passed in, rejecting mismatched types
		for (DataStream<T> newStream : streams) {
			if (!getType().equals(newStream.getType())) {
				throw new IllegalArgumentException("Cannot union streams of different types: "
						+ getType() + " and " + newStream.getType());
			}

			unionedTransforms.add(newStream.getTransformation());
		}
		// ⭐ Create the new DataStream from the list of transformations and the current environment
		return new DataStream<>(this.environment, new UnionTransformation<>(unionedTransforms));
	}
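The collect-and-check logic of union() can be tried out in isolation. The following is a minimal sketch in plain Java, assuming a hypothetical Node class as a stand-in for a transformation (a name, an element type, and its inputs); it is not Flink code and only mirrors the list building and the type check.

```java
import java.util.ArrayList;
import java.util.List;

final class UnionSketch {
    // Hypothetical stand-in for a transformation: a name, an element type, and its inputs.
    static final class Node {
        final String name;
        final Class<?> type;
        final List<Node> inputs;

        Node(String name, Class<?> type, List<Node> inputs) {
            this.name = name;
            this.type = type;
            this.inputs = inputs;
        }
    }

    static Node source(String name, Class<?> type) {
        return new Node(name, type, new ArrayList<>());
    }

    // Collect self plus all others into one list, rejecting element-type mismatches.
    static Node union(Node self, Node... others) {
        List<Node> unioned = new ArrayList<>();
        unioned.add(self);
        for (Node other : others) {
            if (!self.type.equals(other.type)) {
                throw new IllegalArgumentException(
                        "Cannot union streams of different types: " + self.type + " and " + other.type);
            }
            unioned.add(other);
        }
        return new Node("union", self.type, unioned);
    }

    public static void main(String[] args) {
        Node a = source("a", String.class);
        Node b = source("b", String.class);
        Node c = source("c", String.class);
        System.out.println(union(a, b, c).inputs.size()); // 3

        try {
            union(a, source("numbers", Long.class));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```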

2. The connect() operator

Next, connect(). Note that the ConnectedStreams it returns is not a subclass of DataStream; it simply wraps the two DataStreams taking part in the connect.

	/**
	 * Creates a new {@link ConnectedStreams} by connecting
	 * {@link DataStream} outputs of (possible) different types with each other.
	 * The DataStreams connected using this operator can be used with
	 * CoFunctions to apply joint transformations.
	 * ⭐ Creates a new ConnectedStreams from two DataStreams (whose types may differ)
	 * ⭐ The streams connected by this method can then be used together with CoFunctions
	 *
	 * @param dataStream
	 *            The DataStream with which this stream will be connected.
	 * @return The {@link ConnectedStreams}.
	 */
	// ⭐ The returned ConnectedStreams is not a subclass of DataStream, but both connected DataStreams are wrapped inside it
	public <R> ConnectedStreams<T, R> connect(DataStream<R> dataStream) {
		return new ConnectedStreams<>(environment, this, dataStream);
	}
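Why a separate wrapper type is useful becomes clearer with a small model. This is a sketch in plain Java, assuming a hypothetical Connected class in place of ConnectedStreams and a pair of java.util.function.Function arguments in place of a CoFunction; real streams interleave records by arrival order, whereas this sketch simply processes the first input and then the second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class ConnectSketch {
    // Hypothetical stand-in for ConnectedStreams: not a stream itself, it only holds both inputs.
    static final class Connected<A, B> {
        final List<A> first;
        final List<B> second;

        Connected(List<A> first, List<B> second) {
            this.first = first;
            this.second = second;
        }

        // Stand-in for a co-map: one function per input type, one shared output type.
        <R> List<R> map(Function<A, R> map1, Function<B, R> map2) {
            List<R> out = new ArrayList<>();
            for (A a : first) {
                out.add(map1.apply(a));
            }
            for (B b : second) {
                out.add(map2.apply(b));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Connected<Integer, String> connected =
                new Connected<>(Arrays.asList(1, 2), Arrays.asList("x"));
        List<String> out = connected.map(i -> "int:" + i, s -> "str:" + s);
        System.out.println(out); // [int:1, int:2, str:x]
    }
}
```

Because the two inputs have different element types, the wrapper cannot offer the ordinary single-input operators; it has to ask for one function per input, which is why it is its own type rather than a DataStream.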

3. The keyBy() operator

Now keyBy(). The principle is simple: it wraps a KeySelector together with the current DataStream and returns the result. Below is the KeyedStream constructor where this wrapping happens.

	/**
	 * Creates a new {@link KeyedStream} using the given {@link KeySelector} and {@link TypeInformation}
	 * to partition operator state by key, where the partitioning is defined by a {@link PartitionTransformation}.
	 * ⭐ Creates a new KeyedStream from a KeySelector and a TypeInformation; this is the constructor every KeyedStream is finally built with
	 * ⭐ The stream is partitioned by the given key; the partitioning is wrapped in the PartitionTransformation
	 *
	 * @param stream
	 *            Base stream of data
	 * @param partitionTransformation
	 *            Function that determines how the keys are distributed to downstream operator(s)
	 * @param keySelector
	 *            Function to extract keys from the base stream
	 * @param keyType
	 *            Defines the type of the extracted keys
	 */
	@Internal
	KeyedStream(
		DataStream<T> stream,
		PartitionTransformation<T> partitionTransformation,
		KeySelector<T, KEY> keySelector,
		TypeInformation<KEY> keyType) {

		// ⭐ Call the parent constructor first, passing the environment and the transformation, to build the DataStream part
		// ⭐ Java note: a subclass constructor always invokes a superclass constructor, and an explicit
		// ⭐ super(...) call must be the first statement. It can only be left out when the superclass has a
		// ⭐ no-arg constructor, in which case the compiler inserts super() implicitly.
		super(stream.getExecutionEnvironment(), partitionTransformation);
		// Initialize KeyedStream's own fields
		this.keySelector = clean(keySelector);
		this.keyType = validateKeyType(keyType);
	}

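4. The window operator family

Next is the window operator family. There are two routes to a windowed stream:

Route 1: if no key is needed, call timeWindowAll() or countWindowAll() directly on the DataStream. Both end up in windowAll(), which gives the DataStream a dummy NullByteKeySelector that maps every element to the same key, the Byte 0. No real partitioning takes place, so the resulting AllWindowedStream runs with a parallelism of 1.
Route 2: if a key is needed, first call keyBy() on the DataStream to get a KeyedStream, then call timeWindow() or countWindow() on the KeyedStream. Both end up in window(), which returns a WindowedStream.

Note: neither AllWindowedStream nor WindowedStream extends DataStream, but each wraps a KeyedStream. The difference is that AllWindowedStream's keySelector has no effect, so there is effectively no key.

Let's take the first route first. Here is DataStream's timeWindowAll() method: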

	/**
	 * Windows this {@code DataStream} into sliding time windows.
	 *
	 * <p>This is a shortcut for either {@code .window(SlidingEventTimeWindows.of(size, slide))} or
	 * {@code .window(SlidingProcessingTimeWindows.of(size, slide))} depending on the time characteristic
	 * set using
	 * {@link org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#setStreamTimeCharacteristic(org.apache.flink.streaming.api.TimeCharacteristic)}
	 * ⭐ Whether the new windowed stream uses event time or processing time depends on the configured time characteristic
	 *
	 * <p>Note: This operation is inherently non-parallel since all elements have to pass through
	 * the same operator instance.
	 * ⭐ Note: the resulting AllWindowedStream has a parallelism of 1; all elements go through a single operator instance
	 *
	 * @param size The size of the window. ⭐ The second parameter, slide, is the slide interval
	 */
	public AllWindowedStream<T, TimeWindow> timeWindowAll(Time size, Time slide) {
		if (environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime) {
			return windowAll(SlidingProcessingTimeWindows.of(size, slide));
		} else {
			// ⭐ The two branches differ in time semantics, but both call windowAll()
			return windowAll(SlidingEventTimeWindows.of(size, slide));
		}
	}
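What "sliding" means for a single element can be worked out with a few lines of arithmetic. The sketch below is plain Java and shows the usual sliding-window calculation, assuming windows are aligned to multiples of the slide (no offset) and that timestamps are in milliseconds; assignWindowStarts is a hypothetical helper written for this article, not a Flink method.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSketch {
    // Returns the start of every window [start, start + size) that contains the timestamp.
    // Assumption: windows start at multiples of `slide`; size and slide are positive.
    static List<Long> assignWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        // Step back one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // size = 2s, slide = 1s, as in the WordCount example: each element lands in two windows.
        System.out.println(assignWindowStarts(2500L, 2000L, 1000L)); // [2000, 1000]
        // slide == size: exactly one window per element, i.e. a tumbling window.
        System.out.println(assignWindowStarts(2500L, 1000L, 1000L)); // [2000]
    }
}
```

With size = 2s and slide = 1s an element at 2.5s belongs to the windows starting at 2s and at 1s, so every element is counted in size / slide windows.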

Here is DataStream's countWindowAll() method:

	/**
	 * Windows this {@code DataStream} into sliding count windows.
	 *
	 * <p>Note: This operation is inherently non-parallel since all elements have to pass through
	 * the same operator instance.
	 * ⭐ Note: the resulting AllWindowedStream has a parallelism of 1; all elements go through a single operator instance
	 *
	 * @param size The size of the windows in number of elements.
	 * @param slide The slide interval in number of elements.
	 */
	public AllWindowedStream<T, GlobalWindow> countWindowAll(long size, long slide) {
		// ⭐ This still ends in windowAll(), which takes a single argument;
		// ⭐ the chained evictor and trigger then define how the count window behaves
		return windowAll(GlobalWindows.create())
				.evictor(CountEvictor.of(size))
				.trigger(CountTrigger.of(slide));
	}

As you can see, both methods call windowAll(), so let's look at how windowAll() works:

	@PublicEvolving
	public <W extends Window> AllWindowedStream<T, W> windowAll(WindowAssigner<? super T, W> assigner) {
		return new AllWindowedStream<>(this, assigner);
	}

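windowAll() is simple as well: it creates an AllWindowedStream from the current DataStream and the assigner and returns it. We said earlier that AllWindowedStream contains a keySelector that has no effect. Where is it? Let's step into AllWindowedStream's constructor: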

	@PublicEvolving
	public AllWindowedStream(DataStream<T> input,
			WindowAssigner<? super T, W> windowAssigner) {
		this.input = input.keyBy(new NullByteKeySelector<T>());
		this.windowAssigner = windowAssigner;
		this.trigger = windowAssigner.getDefaultTrigger(input.getExecutionEnvironment());
	}

The fourth line contains something we have not seen before, NullByteKeySelector. Let's step into that too:

@Internal
public class NullByteKeySelector<T> implements KeySelector<T, Byte> {

	private static final long serialVersionUID = 614256539098549020L;

	@Override
	public Byte getKey(T value) throws Exception {
		return 0;
	}
}

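Important: NullByteKeySelector has a single getKey() method that always returns 0. When it is used as the key selector, every element gets the same key, the Byte 0, so all elements end up in the same operator instance, which amounts to no partitioning at all. That is the truth behind windowAll() having no key.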
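The effect of a constant key is easy to demonstrate. The sketch below is plain Java and uses a deliberately simplified routing rule, hash of the key modulo the parallelism; that rule is an assumption made for illustration (Flink's real key-to-subtask assignment is more involved), and targetSubtask, NULL_BYTE_KEY and WORD_KEY are hypothetical names.

```java
import java.util.function.Function;

final class NullKeySketch {
    // Same behaviour as NullByteKeySelector: every element maps to the Byte 0.
    static final Function<String, Byte> NULL_BYTE_KEY = value -> (byte) 0;

    // An ordinary key selector for comparison: the element is its own key.
    static final Function<String, String> WORD_KEY = value -> value;

    // Simplified routing rule, assumed for illustration only: key hash modulo parallelism.
    static <T> int targetSubtask(T element, Function<T, ?> keySelector, int parallelism) {
        return Math.floorMod(keySelector.apply(element).hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        for (String word : new String[] {"a", "b", "c"}) {
            System.out.println(word
                    + " -> keyed: subtask " + targetSubtask(word, WORD_KEY, parallelism)
                    + ", null-byte key: subtask " + targetSubtask(word, NULL_BYTE_KEY, parallelism));
        }
    }
}
```

Whatever routing rule is used, identical keys must go to the same place, so a selector that returns one constant key cannot spread work across instances.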

Now the second route. Continuing from KeyedStream, let's look at its timeWindow() and countWindow() methods. Both end up calling window(), and the idea is much the same as in the first route; the translated comments should be enough to follow along.

	/**
	 * Windows this {@code KeyedStream} into sliding time windows.
	 *
	 * <p>This is a shortcut for either {@code .window(SlidingEventTimeWindows.of(size, slide))} or
	 * {@code .window(SlidingProcessingTimeWindows.of(size, slide))} depending on the time
	 * characteristic set using
	 * {@link org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#setStreamTimeCharacteristic(org.apache.flink.streaming.api.TimeCharacteristic)}
	 * ⭐ Whether the new windowed stream uses event time or processing time depends on the configured time characteristic
	 *
	 * @param size The size of the window.
	 *
	 * @param slide The slide interval. ⭐ If slide equals size, the result is effectively a tumbling window
	 */
	public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size, Time slide) {
		if (environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime) {
			// ⭐ What is really called is window(); we look at it next
			return window(SlidingProcessingTimeWindows.of(size, slide));
		} else {
			return window(SlidingEventTimeWindows.of(size, slide));
		}
	}

	/**
	 * Windows this {@code KeyedStream} into sliding count windows.
	 *
	 * @param size The size of the windows in number of elements.
	 * @param slide The slide interval in number of elements.
	 */
	// ⭐ Like ConnectedStreams, WindowedStream is not a subclass of DataStream, but it wraps one
	public WindowedStream<T, KEY, GlobalWindow> countWindow(long size, long slide) {
		// ⭐ This still ends in window(), which takes a single argument;
		// ⭐ the chained evictor and trigger then define how the count window behaves
		return window(GlobalWindows.create())
				.evictor(CountEvictor.of(size))
				.trigger(CountTrigger.of(slide));
	}
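How a single global window plus a count trigger and a count evictor adds up to a sliding count window can be simulated in a few lines. This is a sketch in plain Java that only imitates the three roles (one ever-growing buffer, a trigger that fires every slide elements, an evictor that keeps the last size elements); countWindow here is a hypothetical helper and none of this is Flink's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class CountWindowSketch {
    // Returns the contents of every window that fires for the given input.
    static List<List<Integer>> countWindow(List<Integer> input, int size, int slide) {
        Deque<Integer> buffer = new ArrayDeque<>();   // role of the single global window
        List<List<Integer>> fired = new ArrayList<>();
        int sinceLastFire = 0;
        for (int element : input) {
            buffer.addLast(element);
            sinceLastFire++;
            if (sinceLastFire == slide) {             // role of the count trigger
                sinceLastFire = 0;
                while (buffer.size() > size) {        // role of the count evictor
                    buffer.removeFirst();
                }
                fired.add(new ArrayList<>(buffer));
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Fire every 2 elements, keep at most the last 4.
        System.out.println(countWindow(input, 4, 2)); // [[1, 2], [1, 2, 3, 4], [3, 4, 5, 6]]
    }
}
```

The assigner contributes nothing here beyond "everything goes into one window"; all of the count-window behaviour comes from the trigger and the evictor, which is why window() itself only needs the assigner as an argument.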

Now let's see what window() does:

	/**
	 * Windows this data stream to a {@code WindowedStream}, which evaluates windows
	 * over a key grouped stream. Elements are put into windows by a {@link WindowAssigner}. The
	 * grouping of elements is done both by key and by window.
	 * ⭐ Turns this stream into a keyed WindowedStream
	 * ⭐ A WindowAssigner decides which windows each element goes into; elements are grouped both by key and by window
	 *
	 * <p>A {@link org.apache.flink.streaming.api.windowing.triggers.Trigger} can be defined to
	 * specify when windows are evaluated. However, {@code WindowAssigners} have a default
	 * {@code Trigger} that is used if a {@code Trigger} is not specified.
	 * ⭐ A Trigger decides when the contents of a window are evaluated and emitted
	 * ⭐ Every WindowAssigner comes with a default Trigger
	 *
	 * @param assigner The {@code WindowAssigner} that assigns elements to windows.
	 * @return The trigger windows data stream.
	 */
	@PublicEvolving
	public <W extends Window> WindowedStream<T, KEY, W> window(WindowAssigner<? super T, W> assigner) {
		// ⭐ Create the new WindowedStream from this stream and the assigner
		return new WindowedStream<>(this, assigner);
	}

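Note: the WindowedStream and AllWindowedStream returned by the window operators are not subclasses of DataStream, although each wraps a KeyedStream (which is a subclass of DataStream). We know that when chaining operators in a Flink program, each DataStream keeps producing a new DataStream through its operators until a sink (a DataStreamSink) is finally added.

You may already have noticed a problem: if the window operators do not return a DataStream subclass, doesn't that break the DataStream chain? What then?

Answer: in practice, turning an endless stream into a windowed stream is never the goal in itself; computing over the data in each window is. So after windowing you will always, always, always define an aggregation (aggregate, min, sum, max, reduce, and so on); windowing without one is pointless. The aggregation functions of WindowedStream and AllWindowedStream return a new DataStream, so the DataStream chain is not broken after all.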
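The "leave the stream type, then come back through an aggregation" pattern can be modelled with two small classes. This is a sketch in plain Java, assuming hypothetical Stream and Windowed classes and a simple tumbling count window over a finite list; it only illustrates the type relationship, not how Flink evaluates windows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WindowLoopSketch {
    // Hypothetical stand-in for DataStream, backed by a finite list.
    static final class Stream {
        final List<Long> elements;

        Stream(List<Long> elements) {
            this.elements = elements;
        }

        Windowed countWindow(int size) {
            return new Windowed(this, size);
        }
    }

    // Deliberately NOT a subclass of Stream: it only wraps one.
    static final class Windowed {
        private final Stream input;
        private final int size;

        Windowed(Stream input, int size) {
            this.input = input;
            this.size = size;
        }

        // The aggregation is the way back: it returns an ordinary Stream again.
        Stream sum() {
            List<Long> sums = new ArrayList<>();
            // Only full windows fire; a trailing partial window is never emitted.
            for (int i = 0; i + size <= input.elements.size(); i += size) {
                long total = 0;
                for (int j = i; j < i + size; j++) {
                    total += input.elements.get(j);
                }
                sums.add(total);
            }
            return new Stream(sums);
        }
    }

    public static void main(String[] args) {
        Stream numbers = new Stream(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        Stream sums = numbers.countWindow(2).sum();
        System.out.println(sums.elements);                      // [3, 7]
        System.out.println(sums.countWindow(2).sum().elements); // [10]
    }
}
```

Since sum() hands back a Stream, further operators (including another window) can be chained as if the detour through Windowed had never happened.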

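To keep this from getting tiring, we won't expand the source of each aggregation function. Instead, a diagram shows how a DataStream is turned into a windowed stream and then back into a DataStream:

[Figure: DataStream to windowed stream to DataStream]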

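5. broadcast(), shuffle(), forward(), rebalance(), rescale(), global()

These six siblings do similar things: each sets the partitioner of the current stream, which determines how records are sent to the downstream operator.

Concretely, each of them calls the following, where Partitioner stands for that method's own partitioner class:

setConnectionType(new Partitioner<T>());

broadcast: sends every record to all parallel instances of the downstream operator
shuffle: sends each record to one randomly chosen instance of the downstream operator
forward: sends each record straight to the next operator within the same subtask
rebalance: round-robin (every parallel instance of the downstream operator gets a turn)
rescale: round-robin over a subset (with an upstream parallelism of 2 and a downstream parallelism of 4, each upstream instance only cycles through 2 of the downstream instances)
global: sends everything to the first instance of the downstream operator (which can cause severe data skew)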
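The six strategies above differ only in which downstream channel(s) they pick for a record, which can be written down as plain channel-selection functions. This is a simplified sketch in plain Java with hypothetical names; it assumes channels are numbered from 0, that forward is used with equal upstream and downstream parallelism, and that for rescale the downstream parallelism is a multiple of the upstream one. Flink's own partitioner classes differ in detail.

```java
import java.util.Arrays;
import java.util.Random;

final class PartitionSketch {
    // broadcast: every downstream channel receives the record.
    static int[] broadcast(int numChannels) {
        int[] all = new int[numChannels];
        for (int i = 0; i < numChannels; i++) {
            all[i] = i;
        }
        return all;
    }

    // shuffle: one uniformly random channel.
    static int shuffle(Random random, int numChannels) {
        return random.nextInt(numChannels);
    }

    // forward: stay on the same subtask index (assumes equal parallelism on both sides).
    static int forward(int upstreamSubtask) {
        return upstreamSubtask;
    }

    // global: always the first channel.
    static int global() {
        return 0;
    }

    // rebalance: round-robin over all channels; it needs state, so it is an object.
    static final class RoundRobin {
        private int next = -1;

        int select(int numChannels) {
            next = (next + 1) % numChannels;
            return next;
        }
    }

    // rescale: round-robin restricted to this upstream subtask's slice of the downstream channels.
    static int rescale(int upstreamSubtask, int upstreamParallelism,
                       int downstreamParallelism, int recordIndex) {
        int slice = downstreamParallelism / upstreamParallelism; // assumes an even split
        return upstreamSubtask * slice + recordIndex % slice;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(broadcast(4))); // [0, 1, 2, 3]
        RoundRobin rebalance = new RoundRobin();
        System.out.println(rebalance.select(3) + " " + rebalance.select(3) + " "
                + rebalance.select(3) + " " + rebalance.select(3)); // 0 1 2 0
        // Upstream parallelism 2, downstream 4: subtask 1 only ever uses channels 2 and 3.
        System.out.println(rescale(1, 2, 4, 0) + " " + rescale(1, 2, 4, 1) + " " + rescale(1, 2, 4, 2)); // 2 3 2
    }
}
```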

6. Operators that take a user-defined function: map(), flatMap(), filter()

	public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {

		// ⭐ Determine the output type of the map function
		TypeInformation<R> outType = TypeExtractor.getMapReturnTypes(clean(mapper), getType(),
				Utils.getCallLocationName(), true);

		// ⭐ Wrap the function in a StreamMap operator and hand it to transform()
		return transform("Map", outType, new StreamMap<>(clean(mapper)));
	}


	public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
		// ⭐ Determine the output type of the flatMap function
		TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),
				getType(), Utils.getCallLocationName(), true);

		// ⭐ Wrap the function in a StreamFlatMap operator and hand it to transform()
		return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));
	}

	public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
		// ⭐ The output type equals the input type, so just wrap the function and hand it to transform()
		return transform("Filter", getType(), new StreamFilter<>(clean(filter)));
	}

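As you can see, the three operators are implemented almost identically. Each takes a user-defined function; the first two must determine the output type first; and all three end by calling transform(). We already met transform() in the window() section: WindowedStream and AllWindowedStream call transform() on the stream they wrap to create a new DataStream. So it is easy to guess that the same happens here: transform() wraps the user-defined function into the newly created DataStream.

Let's see what transform() does: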

	/**
	 * Method for passing user defined operators along with the type
	 * information that will transform the DataStream.
	 *
	 * ⭐ Takes the user-defined function (already wrapped in an operator) and the type information, and produces a new DataStream
	 */
	@PublicEvolving
	public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {

		// read the output type of the input Transform to coax out errors about MissingTypeInfo
		transformation.getOutputType();

		// ⭐ Create a OneInputTransformation (a subclass of StreamTransformation),
		// ⭐ which is used to build the new DataStream (remember DataStream's constructor?)
		OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
				this.transformation,
				operatorName,
				operator,
				outTypeInfo,
				environment.getParallelism());

		// ⭐ SingleOutputStreamOperator is itself a DataStream (it extends DataStream), so it is constructed the same way
		SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);

		// ⭐ Register the new transformation in the StreamExecutionEnvironment's list of transformations, which is later used to generate the StreamGraph
		getExecutionEnvironment().addOperator(resultTransform);

		return returnStream;
	}
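The three steps of transform() (build a transformation that points at its input, wrap it in a new stream, register it with the environment) can be reproduced in miniature. This is a sketch in plain Java with hypothetical Env, Transformation and Stream classes; it only models the bookkeeping and carries no operators, types or parallelism.

```java
import java.util.ArrayList;
import java.util.List;

final class TransformSketch {
    // Hypothetical stand-in for a transformation: a name plus a link to its input.
    static final class Transformation {
        final String name;
        final Transformation input;

        Transformation(String name, Transformation input) {
            this.name = name;
            this.input = input;
        }
    }

    // Hypothetical stand-in for the environment: it collects registered transformations.
    static final class Env {
        final List<Transformation> transformations = new ArrayList<>();

        void addOperator(Transformation transformation) {
            transformations.add(transformation);
        }
    }

    static final class Stream {
        final Env env;
        final Transformation transformation;

        Stream(Env env, Transformation transformation) {
            this.env = env;
            this.transformation = transformation;
        }

        Stream transform(String operatorName) {
            // 1. New transformation whose input is the one behind the current stream.
            Transformation result = new Transformation(operatorName, this.transformation);
            // 2. New stream around it, sharing the same environment.
            Stream returnStream = new Stream(env, result);
            // 3. Register the transformation so the environment can find it later.
            env.addOperator(result);
            return returnStream;
        }
    }

    public static void main(String[] args) {
        Env env = new Env();
        Stream source = new Stream(env, new Transformation("Source", null));
        source.transform("Map").transform("Filter");
        for (Transformation t : env.transformations) {
            System.out.println(t.name + " <- " + t.input.name);
        }
        // Map <- Source
        // Filter <- Map
    }
}
```

In this sketch the source is never registered itself, yet it is still reachable by following the input links of the registered transformations, so the full chain can be recovered from the environment's list alone.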