Link to this article: https://blog.csdn.net/PanicJaw/article/details/107307640

Common Flink APIs (Part 2): Using the Built-in Windows

This continues the previous post, Common Flink APIs.

I. Basic structure of a Flink program

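Obtain the execution environment in the main function: StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); The environment controls how the Flink job runs: it is where you set the parallelism, configure checkpointing, and specify how data is read (through a connector such as Kafka, from a socket port, from a file, or from manually supplied elements).
Read data through env to obtain a DataStream, then chain the operators from the previous post onto it. Most of Flink's user functions are single-abstract-method (SAM) interfaces that also extend Serializable, because function objects have to be serialized and shipped to the workers; being SAM types, they can be written compactly as Java lambdas. Each operator can be given its own parallelism.
Define the windows and aggregation functions the job needs.
Finally, call env.execute() to submit the job.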
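
To make the SAM-plus-Serializable point concrete, here is a minimal plain-Java sketch that does not depend on Flink. SketchMapFunction and SamLambdaSketch are hypothetical names invented for this illustration; the interface only imitates the shape of a Flink function interface. It shows that a lambda targeting such an interface can be serialized and read back, which is the property that lets a function object be shipped to another JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a Flink-style function interface: one abstract
// method (so a lambda fits) plus Serializable (so the object can be shipped).
interface SketchMapFunction<T, R> extends Serializable {
    R map(T value) throws Exception;
}

class SamLambdaSketch {
    // Writes the object to a byte array and reads it back again.
    static Object roundTrip(Object fn) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(fn);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    // Defines the function as a lambda, serializes it, and applies the copy.
    static int lengthAfterRoundTrip(String input) {
        SketchMapFunction<String, Integer> length = s -> s.length();
        try {
            @SuppressWarnings("unchecked")
            SketchMapFunction<String, Integer> copy =
                    (SketchMapFunction<String, Integer>) roundTrip(length);
            return copy.map(input);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling SamLambdaSketch.lengthAfterRoundTrip("flink") returns 5: the deserialized copy of the lambda behaves like the original.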

II. Window APIs

Windows can be applied to a KeyedStream through window(), or to a non-keyed DataStream through windowAll().

window() needs two components to do its job: a WindowAssigner, and a window function that operates on each window.

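Window assigner: distributes each element of the input stream to the appropriate window(s). Several of Flink's built-in assigners are listed below. Each comes with a default trigger (for the event-time assigners, it fires once the watermark passes the window's end timestamp), and each assigns elements by either event time or processing time, that is, purely by time:
1. Tumbling windows (windows do not overlap): the most common is TumblingEventTimeWindows, which is measured in event time. It is passed as the only argument to window() and is created with: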
```
// Static factory methods that return the TumblingEventTimeWindows to pass as the argument
public static TumblingEventTimeWindows of(Time size); // window length only
public static TumblingEventTimeWindows of(Time size, Time offset); // window length and start offset
```
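  Alternatively, use the shorthand KeyedStream.timeWindow(size) (deprecated since Flink 1.12). Whether size is interpreted as processing time or event time depends on how the environment was configured: after env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime), it is an event-time size.
2. Sliding windows (windows may overlap): both the window size and the slide interval must be given. The most common is SlidingEventTimeWindows, which can be created in two ways: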
```
public static SlidingEventTimeWindows of(Time size, Time slide);
public static SlidingEventTimeWindows of(Time size, Time slide, Time offset);
```
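  There is also a two-argument shorthand: timeWindow(Time size, Time slide) on a KeyedStream, or timeWindowAll(Time size, Time slide) on a non-keyed DataStream.
3. Session windows (variable length, non-overlapping): the argument to window() is an EventTimeSessionWindows, created with withGap(Time size) to set the gap; if no element arrives within the gap, the window closes.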
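
The three assignment strategies above can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not Flink's implementation: WindowAssignSketch and its methods are hypothetical names, all times are epoch milliseconds, windows are half-open intervals [start, start + size), and the rule that a new session starts only when the distance to the previous element is strictly greater than the gap is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class WindowAssignSketch {
    // Start of the single tumbling window that contains ts.
    static long tumblingStart(long ts, long size, long offset) {
        return ts - Math.floorMod(ts - offset, size);
    }

    // Starts of every sliding window [start, start + size) that contains ts,
    // from the latest window back to the earliest.
    static List<Long> slidingStarts(long ts, long size, long slide, long offset) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts - offset, slide);
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    // Groups ascending timestamps into sessions, each returned as {first, last}.
    // A new session begins when the distance to the previous element exceeds gap.
    static List<long[]> sessions(long[] sortedTs, long gap) {
        List<long[]> result = new ArrayList<>();
        for (long ts : sortedTs) {
            if (result.isEmpty() || ts - result.get(result.size() - 1)[1] > gap) {
                result.add(new long[] {ts, ts});
            } else {
                result.get(result.size() - 1)[1] = ts;
            }
        }
        return result;
    }
}
```

For example, tumblingStart(12345, 5000, 0) gives 10000, so the element belongs to the window [10000, 15000). slidingStarts(12345, 10000, 5000, 0) gives [10000, 5000], so the same element belongs to two overlapping windows. With a gap of 3000, the timestamps 1000, 2000, 9000, 9500 form two sessions, 1000 to 2000 and 9000 to 9500.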

Window function: the computation performed once the window has been triggered (the built-in assigners bring their own triggers).

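As with a KeyedStream, a windowed stream can be followed directly by the aggregation functions reduce and aggregate: the result is updated every time an element arrives, and the window stores only a single value. In addition, there are full-window functions, which wait until all of the window's elements have arrived and then process them together.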

The first category consists of ReduceFunction and AggregateFunction, which emit the aggregated value when the window ends.

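ReduceFunction works the same way as it does on a KeyedStream and serves to roll elements up into a summary. The window keeps a single value that is updated on every incoming element, and the input and output types are the same.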

```
public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> function)

public interface ReduceFunction<T> extends Function, Serializable {
	T reduce(T value1, T value2) throws Exception;
}
```
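
The incremental behaviour can be imitated in plain Java. This is only a sketch of the idea: SketchReduceFunction is a hypothetical local copy of the interface shape shown above, and reduceWindow is an invented helper that plays the role of the window operator by folding elements one at a time while holding a single running value.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical local mirror of the ReduceFunction shape (not the Flink type).
interface SketchReduceFunction<T> {
    T reduce(T value1, T value2) throws Exception;
}

class ReduceSketch {
    // Folds the elements in arrival order, keeping only the running value,
    // and returns what would be emitted when the window ends.
    static <T> T reduceWindow(List<T> elements, SketchReduceFunction<T> fn) {
        Iterator<T> it = elements.iterator();
        if (!it.hasNext()) {
            return null; // empty input: nothing to emit
        }
        T current = it.next();
        try {
            while (it.hasNext()) {
                current = fn.reduce(current, it.next());
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return current;
    }
}
```

With the elements 3, 9, 4 and the lambda (a, b) -> Math.max(a, b), reduceWindow returns 9, and at no point is more than one value stored.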

AggregateFunction is much the same and also keeps a single state value; its advantage is that the input and output types may differ.

```
public <ACC, R> SingleOutputStreamOperator<R> aggregate(AggregateFunction<T, ACC, R> function)

public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {
	/**
	 * @return A new accumulator, corresponding to an empty aggregate.
	 */
	ACC createAccumulator();

	/**
	 * @param value The value to add
	 * @param accumulator The accumulator to add the value to
	 */
	ACC add(IN value, ACC accumulator);

	/**
	 * Gets the result of the aggregation from the accumulator.
	 *
	 * @param accumulator The accumulator of the aggregation
	 * @return The final aggregation result.
	 */
	OUT getResult(ACC accumulator);

	/**
	 * Merges two accumulators, returning an accumulator with the merged state.
	 *
	 * <p>This function may reuse any of the given accumulators as the target for the merge
	 * and return that. The assumption is that the given accumulators will not be used any
	 * more after having been passed to this function.
	 *
	 * @param a An accumulator to merge
	 * @param b Another accumulator to merge
	 *
	 * @return The accumulator with the merged state
	 */
	ACC merge(ACC a, ACC b);
}
```
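
A typical use of the separate accumulator type is an average, where the input is a number, the accumulator is a (sum, count) pair, and the output is a double. The sketch below is plain Java with no Flink dependency: SketchAggregateFunction is a hypothetical local mirror of the four methods above, and AverageSketch is an invented example implementation.

```java
// Hypothetical local mirror of the AggregateFunction shape (not the Flink type).
interface SketchAggregateFunction<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC accumulator);
    OUT getResult(ACC accumulator);
    ACC merge(ACC a, ACC b);
}

// Averages Long inputs; the accumulator is {sum, count}, the result a Double.
class AverageSketch implements SketchAggregateFunction<Long, long[], Double> {
    public long[] createAccumulator() {
        return new long[] {0L, 0L};
    }

    public long[] add(Long value, long[] acc) {
        acc[0] += value; // running sum
        acc[1]++;        // running count
        return acc;
    }

    public Double getResult(long[] acc) {
        return acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1];
    }

    public long[] merge(long[] a, long[] b) {
        a[0] += b[0];
        a[1] += b[1];
        return a; // reuses a as the merge target
    }

    // Drives the function the way a window would: one add() per element,
    // one getResult() at the end.
    static double average(long... values) {
        AverageSketch fn = new AverageSketch();
        long[] acc = fn.createAccumulator();
        for (long v : values) {
            acc = fn.add(v, acc);
        }
        return fn.getResult(acc);
    }
}
```

AverageSketch.average(2, 4, 9) returns 5.0. The merge method matters when two partial accumulators have to be combined, for instance when windows themselves are merged as with session windows.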

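The second category is ProcessWindowFunction, which is passed as the argument to process(). Its declaration is: public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> extends AbstractRichFunction{}

Its most important feature is that the Context parameter gives access to the window's metadata, as well as to keyed state stores for per-window and global state, as shown below: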

```
public abstract class Context implements java.io.Serializable {
	/**
	 * Returns the window that is being evaluated.
	 */
	public abstract W window();

	/** Returns the current processing time. */
	public abstract long currentProcessingTime();

	/** Returns the current event-time watermark. */
	public abstract long currentWatermark();

	/**
	 * State accessor for per-key and per-window state.
	 *
	 * <p><b>NOTE:</b>If you use per-window state you have to ensure that you clean it up
	 * by implementing {@link ProcessWindowFunction#clear(Context)}.
	 */
	public abstract KeyedStateStore windowState();

	/**
	 * State accessor for per-key global state.
	 */
	public abstract KeyedStateStore globalState();

	/**
	 * Emits a record to the side output identified by the {@link OutputTag}.
	 *
	 * @param outputTag the {@code OutputTag} that identifies the side output to emit to.
	 * @param value The record to emit.
	 */
	public abstract <X> void output(OutputTag<X> outputTag, X value);
}
```
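
To contrast this with the incremental functions, the following plain-Java sketch models a full-window computation: it is called once, with every buffered element plus the window's metadata. FullWindowSketch and WindowMeta are hypothetical names for this illustration; WindowMeta merely stands in for what Context.window() would expose. A median is used because it is a natural case where all elements are needed at once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FullWindowSketch {
    // Hypothetical stand-in for the window metadata reachable through Context.
    static final class WindowMeta {
        final long start;
        final long end;

        WindowMeta(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Called once per key and window with all buffered elements.
    // Assumes the list is non-empty.
    static String process(String key, WindowMeta window, List<Integer> elements) {
        List<Integer> sorted = new ArrayList<>(elements);
        Collections.sort(sorted);
        int median = sorted.get(sorted.size() / 2); // upper median for even sizes
        return key + " [" + window.start + ", " + window.end + ") median=" + median;
    }
}
```

For the key "sensor-1", the window [0, 5000) and the elements 7, 1, 5, the result is "sensor-1 [0, 5000) median=5". The cost is that every element has to be kept until the window fires.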

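Combining the two kinds:

A ProcessWindowFunction can be passed as the second argument to reduce() or aggregate(): the first argument performs the incremental aggregation, while the final output is determined by the logic in the ProcessWindowFunction, which receives only the aggregated result.

The advantage is that this reduces the amount of state held for the ProcessWindowFunction: the window keeps a single aggregated value instead of every element, while the function can still read the window metadata through Context.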
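
The combination can be modelled in plain Java like this. It is a sketch under stated assumptions rather than Flink code: IncrementalThenProcessSketch and run are hypothetical names, the reducer plays the part of the first argument, and finish plays the part of the ProcessWindowFunction, seeing only the single aggregated value together with the window end.

```java
import java.util.Iterator;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

class IncrementalThenProcessSketch {
    // Aggregates incrementally, then hands the single result and the window
    // end to the finishing step. Assumes a non-empty window.
    static <T, R> R run(long windowEnd, Iterable<T> elements,
                        BinaryOperator<T> reducer, BiFunction<Long, T, R> finish) {
        Iterator<T> it = elements.iterator();
        T running = it.next(); // the only value kept for the window
        while (it.hasNext()) {
            running = reducer.apply(running, it.next());
        }
        return finish.apply(windowEnd, running);
    }
}
```

Running it with a window end of 5000, the elements 3, 9, 4, Integer::max as the reducer and (end, max) -> "end=" + end + " max=" + max as the finishing step produces "end=5000 max=9".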

III. Using custom windows

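You can implement your own WindowAssigner, Trigger, and Evictor instead of relying on the built-in ones: a custom assigner is passed to window(), and a trigger and an evictor are attached with trigger() and evictor(). The window function still has to be written separately in order to get a fully customized window.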

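WindowAssigner: decides how elements are assigned to windows.
- If the window function is of the first, incremental kind, the window stores only a single value.
- If it is of the second, full-window kind, the window's elements are kept in a ListState.
Trigger: defines when the window is evaluated; the condition can be based on time, on state, and so on.
Evictor: optional; removes elements before or after the window function is applied.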
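
How a trigger, an evictor, and a window function interact can be illustrated with a small plain-Java model. Everything here is hypothetical and simplified: CustomWindowSketch is an invented class with one ever-growing buffer, a count-based trigger condition (fire after every fireEvery elements), an evictor that runs before the window function and keeps only the newest keepLast elements, and a window function that sums what remains. Flink's own Trigger and Evictor interfaces are richer than this.

```java
import java.util.ArrayList;
import java.util.List;

class CustomWindowSketch {
    private final int fireEvery; // trigger: fire after this many new elements
    private final int keepLast;  // evictor: keep at most this many elements
    private final List<Integer> buffer = new ArrayList<>();
    private int sinceLastFire = 0;
    final List<Integer> emitted = new ArrayList<>();

    CustomWindowSketch(int fireEvery, int keepLast) {
        this.fireEvery = fireEvery;
        this.keepLast = keepLast;
    }

    void onElement(int value) {
        buffer.add(value);
        sinceLastFire++;
        if (sinceLastFire < fireEvery) {
            return; // trigger says: keep waiting
        }
        sinceLastFire = 0;
        // Evictor, applied before the window function: drop the oldest elements.
        while (buffer.size() > keepLast) {
            buffer.remove(0);
        }
        // Window function: sum of the elements that survived eviction.
        int sum = 0;
        for (int v : buffer) {
            sum += v;
        }
        emitted.add(sum);
    }
}
```

With fireEvery = 2 and keepLast = 3, feeding 1, 2, 3, 4, 5 emits 3 after the second element (1 + 2) and 9 after the fourth (2 + 3 + 4, since 1 has been evicted); the fifth element is buffered but does not fire.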