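1. Introduction
This post draws on the Session Windows section of the Flink documentation; "Flink 原理与实现:Session Window" is also worth reading. The main text follows:
The session window assigner groups elements by sessions of activity. Unlike tumbling and sliding windows, session windows do not overlap and have no fixed start or end time. A session window closes when it has received no elements for a certain period, that is, when a gap of inactivity occurs. The assigner can be configured with either a static session gap or a dynamic session gap.
Session windows come in four forms: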
EventTimeSessionWindows.withGap()
EventTimeSessionWindows.withDynamicGap()
ProcessingTimeSessionWindows.withGap()
ProcessingTimeSessionWindows.withDynamicGap()
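withDynamicGap() is for dynamic session windows and is not covered here; this post focuses on static session windows.
2. How Session Windows Work
Because session windows have no fixed start or end, elements are assigned to them differently than to tumbling and sliding windows. Internally, the session window operator creates a new window for every arriving record and merges windows whose records are closer together than the defined gap. To be mergeable, a session window operator needs a merging trigger and a merging window function, such as ReduceFunction, AggregateFunction, or ProcessWindowFunction (FoldFunction cannot merge).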
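The create-then-merge behavior described above can be sketched in plain Java without any Flink dependency. This is a simplified illustration, not Flink's implementation: the class SessionMergeSketch and the method mergeSessions are hypothetical names, the input is sorted first for simplicity, and the sketch assumes that windows which overlap or exactly touch are merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of session-window merging for a single key.
// Hypothetical helper for illustration; not Flink's actual implementation.
class SessionMergeSketch {

    // Returns the merged sessions as {start, end} pairs, end exclusive.
    static List<long[]> mergeSessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> sessions = new ArrayList<>();
        for (long ts : sorted) {
            // Every element starts out in its own window [ts, ts + gap).
            long start = ts;
            long end = ts + gap;
            if (!sessions.isEmpty()) {
                long[] last = sessions.get(sessions.size() - 1);
                // Assumption: windows that overlap or touch form one session.
                if (start <= last[1]) {
                    last[1] = Math.max(last[1], end);
                    continue;
                }
            }
            sessions.add(new long[] {start, end});
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Event timestamps of key "a" in the example from section 3.
        long[] keyA = {1000000050000L, 1000000054000L, 1000000079900L, 1000000115000L};
        for (long[] s : mergeSessions(keyA, 10000L)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

With a 10-second gap, the four timestamps of key "a" collapse into three sessions: the first two records share one window, and the last two each get their own.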
3. Example
The data source used is as follows:
package util.source;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Emits six fixed (key, value, event time in ms) records, one per second.
public class StreamDataSource extends RichParallelSourceFunction<Tuple3<String, String, Long>> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Tuple3<String, String, Long>> ctx) throws InterruptedException {
        Tuple3[] elements = new Tuple3[]{
            Tuple3.of("a", "1", 1000000050000L),
            Tuple3.of("a", "2", 1000000054000L),
            Tuple3.of("a", "3", 1000000079900L),
            Tuple3.of("a", "4", 1000000115000L),
            Tuple3.of("b", "5", 1000000100000L),
            Tuple3.of("b", "6", 1000000108000L)
        };
        int count = 0;
        // Stop after the last element or when the job is cancelled.
        while (running && count < elements.length) {
            ctx.collect(new Tuple3<>((String) elements[count].f0, (String) elements[count].f1, (Long) elements[count].f2));
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
The test code follows; the delay and windowGap parameters are both configurable.
import util.source.StreamDataSource;
import java.text.SimpleDateFormat;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * Created by yidxue on 2018/9/11
 */
public class FlinkStaticSessionWindowsDemo {
    public static void main(String[] args) throws Exception {
        long delay = 5000L;
        long windowGap = 10000L;
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);
        // Set up the data source
        DataStream<Tuple3<String, String, Long>> source = env.addSource(new StreamDataSource()).name("Demo Source");
        // Assign timestamps and watermarks
        DataStream<Tuple3<String, String, Long>> stream = source.assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, String, Long>>(Time.milliseconds(delay)) {
                @Override
                public long extractTimestamp(Tuple3<String, String, Long> element) {
                    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                    System.out.println(element.f0 + "\t" + element.f1 + " watermark -> " + format.format(getCurrentWatermark().getTimestamp()) + " timestamp -> " + format.format(element.f2));
                    return element.f2;
                }
            }
        );
        // Window aggregation: concatenate the values of all records in a session
        stream.keyBy(0).window(EventTimeSessionWindows.withGap(Time.milliseconds(windowGap))).reduce(
            new ReduceFunction<Tuple3<String, String, Long>>() {
                @Override
                public Tuple3<String, String, Long> reduce(Tuple3<String, String, Long> value1, Tuple3<String, String, Long> value2) throws Exception {
                    return Tuple3.of(value1.f0, value1.f1 + "" + value2.f1, 1L);
                }
            }
        ).print();
        env.execute("TimeWindowDemo");
    }
}
3.1 Case 1: event_time + windowGap <= watermark for the current element
3.1.1 Parameters
long delay = 5000L;
long windowGap = 10000L;
3.1.2 Output:
a 1 watermark -> 292269055-12-03 00:47:04.192 timestamp -> 2001-09-09 09:47:30.000
a 2 watermark -> 2001-09-09 09:47:25.000 timestamp -> 2001-09-09 09:47:34.000
a 3 watermark -> 2001-09-09 09:47:29.000 timestamp -> 2001-09-09 09:47:59.900
(a,12,1)
a 4 watermark -> 2001-09-09 09:47:54.900 timestamp -> 2001-09-09 09:48:35.000
(a,3,1000000079900)
b 5 watermark -> 2001-09-09 09:48:30.000 timestamp -> 2001-09-09 09:48:20.000
b 6 watermark -> 2001-09-09 09:48:30.000 timestamp -> 2001-09-09 09:48:28.000
(b,6,1000000108000)
(a,4,1000000115000)
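3.1.3 Explanation:
The output shows that records 1 and 2 were computed in the same window. Their event times are only 4 seconds apart, which is less than the 10-second gap, so they fall into the same session window. When record 3 arrives, the watermark advances to 2001-09-09 09:47:54.900, past the end of that session (09:47:44.000), and the computation is triggered.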
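The watermark values printed in the output, and the moment the first session fires, can be reproduced with a small plain-Java sketch. It assumes the watermark is the highest event time seen so far minus delay, and that an event-time window [start, end) fires once the watermark reaches end - 1. The class WatermarkSketch and its methods are hypothetical names used only for illustration; this is not the Flink class itself.

```java
// Plain-Java sketch of bounded-out-of-orderness watermarks.
// Hypothetical helper for illustration; not Flink's actual implementation.
class WatermarkSketch {
    private final long delay;
    private long maxTimestamp;

    WatermarkSketch(long delay) {
        this.delay = delay;
        // Start value chosen so the initial watermark is Long.MIN_VALUE without underflow.
        this.maxTimestamp = Long.MIN_VALUE + delay;
    }

    // Records the event time of an arriving element.
    void observe(long timestamp) {
        maxTimestamp = Math.max(maxTimestamp, timestamp);
    }

    // Watermark = highest event time seen so far minus the allowed delay.
    long currentWatermark() {
        return maxTimestamp - delay;
    }

    // Assumption: a window [start, end) fires once the watermark reaches end - 1.
    static boolean fires(long windowEnd, long watermark) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(5000L);
        // End of the session that holds records 1 and 2 (windowGap = 10000).
        long sessionEnd = 1000000054000L + 10000L;
        long[] keyA = {1000000050000L, 1000000054000L, 1000000079900L};
        for (long ts : keyA) {
            wm.observe(ts);
            System.out.println(ts + " -> watermark " + wm.currentWatermark()
                    + ", session fired: " + fires(sessionEnd, wm.currentWatermark()));
        }
    }
}
```

Only the third record pushes the watermark (1000000074900) past the session end, which matches the point where (a,12,1) appears in the output.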
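Note that record 5 is dropped while record 6 still appears in the output. After record 4 arrives, the watermark has already risen to 2001-09-09 09:48:30.000. Record 5 has an event time of 2001-09-09 09:48:20.000, so its session window would end at 09:48:30.000, which is not later than the watermark: event_time + windowGap <= watermark. The window is therefore considered late and the record is dropped. Record 6, by contrast, has event_time + windowGap = 09:48:38.000, which is later than the watermark, so it is kept and computed.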
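The drop decision for records 5 and 6 can be checked with a few lines of plain Java. This sketch makes several assumptions: allowed lateness is 0, the record opens a fresh window [event_time, event_time + windowGap) that merges with no existing session, and a window counts as late when its max timestamp (end - 1) is at or below the watermark. LatenessSketch and isLate are hypothetical names, not Flink APIs.

```java
// Plain-Java sketch of the lateness check for a freshly created session window.
// Hypothetical helper for illustration; not Flink's actual implementation.
class LatenessSketch {

    // Assumption: the window [eventTime, eventTime + gap) is late when its
    // max timestamp (end - 1) is at or below the watermark; allowed lateness is 0.
    static boolean isLate(long eventTime, long gap, long watermark) {
        long maxTimestamp = eventTime + gap - 1;
        return maxTimestamp <= watermark;
    }

    public static void main(String[] args) {
        // Watermark after record 4: its event time minus delay (5000 ms).
        long watermark = 1000000115000L - 5000L;
        long gap = 10000L;
        System.out.println("record 5 late: " + isLate(1000000100000L, gap, watermark));
        System.out.println("record 6 late: " + isLate(1000000108000L, gap, watermark));
    }
}
```

Because the gap is a parameter, the same check can be rerun with other windowGap values to see where the boundary lies.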
3.2 Case 2: event_time + windowGap > watermark for the current element
3.2.1 Parameters
long delay = 5000L;
long windowGap = 11000L;
3.2.2 Output:
a 1 watermark -> 292269055-12-03 00:47:04.192 timestamp -> 2001-09-09 09:47:30.000
a 2 watermark -> 2001-09-09 09:47:25.000 timestamp -> 2001-09-09 09:47:34.000
a 3 watermark -> 2001-09-09 09:47:29.000 timestamp -> 2001-09-09 09:47:59.900
(a,12,1)
a 4 watermark -> 2001-09-09 09:47:54.900 timestamp -> 2001-09-09 09:48:35.000
(a,3,1000000079900)
b 5 watermark -> 2001-09-09 09:48:30.000 timestamp -> 2001-09-09 09:48:20.000
b 6 watermark -> 2001-09-09 09:48:30.000 timestamp -> 2001-09-09 09:48:28.000
(b,56,1)
(a,4,1000000115000)
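3.2.3 Explanation:
This further verifies the previous conclusion. With windowGap changed to 11 seconds, record 5 has event_time + windowGap = 09:48:31.000, which is later than the watermark 09:48:30.000, so it is kept and computed together with record 6. This confirms the conclusion.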