Flink DataStream Window
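Windows are the core of processing unbounded streams. A window splits the stream into finite-sized "buckets" over which aggregations can be computed.
Window lifecycle: in general, a window is created when the first element belonging to it arrives. When the time (processing time or event time) reaches or passes the window's end time, the window fires and its computation runs; after that, the window is removed completely (if an allowed lateness is configured, removal is postponed until the end time plus the allowed lateness).
1. Window Categories
Windows fall into two categories: Keyed Windows (windows applied after the keyBy operator) and Non-Keyed Windows (windows applied with windowAll to a stream that has not been keyed).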
1.1 Keyed Windows
stream
.keyBy(...)
.window(...)
[.trigger(...)]
.apply()
.print()
1.2 Non-Keyed Windows
stream
.windowAll(...)
[.trigger(...)]
.apply()
.print()
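The difference between the two categories can be sketched without Flink. The class below is a hypothetical illustration (the names KeyedVsNonKeyedSketch, countKeyed and countAll are not Flink APIs): it counts elements per (key, window) pair in the keyed case and per window only in the non-keyed case, assuming 5-second tumbling windows.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: mimics how keyed and non-keyed tumbling windows group elements. */
class KeyedVsNonKeyedSketch {

    /** Start of the tumbling window of the given size that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Keyed: one counter per (key, window) pair, similar in spirit to keyBy(...).window(...). */
    static Map<String, Integer> countKeyed(String[] keys, long[] timestamps, long sizeMs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            String bucket = keys[i] + "@" + windowStart(timestamps[i], sizeMs);
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    /** Non-keyed: one counter per window, similar in spirit to windowAll(...). */
    static Map<Long, Integer> countAll(long[] timestamps, long sizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            counts.merge(windowStart(ts, sizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "b"};
        long[] timestamps = {1_000L, 2_000L, 3_000L, 6_000L};
        System.out.println(countKeyed(keys, timestamps, 5_000L)); // {a@0=2, b@0=1, b@5000=1}
        System.out.println(countAll(timestamps, 5_000L));         // {0=3, 5000=1}
    }
}
```

In the keyed case each key has its own windows, so the work can be spread across parallel tasks; in the non-keyed case all elements of a window are processed together.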
2. Window Assigners
There are four kinds of window assigners:
- Tumbling Window
- Sliding Window
- Session Window
- Global Window
2.1 Tumbling Window
Tumbling windows have a fixed size; they are contiguous and do not overlap. For example:
// tumbling event-time windows
input
.keyBy(<key selector>)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.<windowed transformation>(<window function>);
// tumbling processing-time windows
input
.keyBy(<key selector>)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.<windowed transformation>(<window function>);
// daily tumbling event-time windows offset by -8 hours.
input
.keyBy(<key selector>)
.window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
.<windowed transformation>(<window function>);
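To make the effect of the size and offset parameters concrete, here is a minimal sketch of the arithmetic that maps a timestamp to its tumbling window. It is a plain-Java illustration, not Flink's implementation; the class and method names are hypothetical. The second call shows why an offset of -8 hours is useful: daily windows then start at midnight in a UTC+8 time zone instead of midnight UTC.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: computes the tumbling window that contains a timestamp. */
class TumblingWindowSketch {

    /** Window start for a timestamp, given the window size and an offset (all in milliseconds). */
    static long windowStart(long timestampMs, long sizeMs, long offsetMs) {
        // floorMod keeps the remainder non-negative, also for negative offsets or timestamps.
        return timestampMs - Math.floorMod(timestampMs - offsetMs, sizeMs);
    }

    public static void main(String[] args) {
        long fiveSeconds = Duration.ofSeconds(5).toMillis();
        long start = windowStart(12_345L, fiveSeconds, 0L);
        System.out.println("[" + start + ", " + (start + fiveSeconds) + ")"); // [10000, 15000)

        long day = Duration.ofDays(1).toMillis();
        long offset = Duration.ofHours(-8).toMillis();
        long ts = Instant.parse("2020-01-01T02:00:00Z").toEpochMilli();
        // 2019-12-31T16:00:00Z is 2020-01-01 00:00 in UTC+8.
        System.out.println(Instant.ofEpochMilli(windowStart(ts, day, offset))); // 2019-12-31T16:00:00Z
    }
}
```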
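2.2 Sliding Window
A sliding window has a fixed window length plus a slide step that determines how often a new window starts. The slide is typically no larger than the window length; when it is smaller, consecutive windows overlap and an element can belong to more than one window. For example: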
DataStream<T> input = ...;
// sliding event-time windows
input
.keyBy(<key selector>)
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.<windowed transformation>(<window function>);
// sliding processing-time windows
input
.keyBy(<key selector>)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.<windowed transformation>(<window function>);
// sliding processing-time windows offset by -8 hours
input
.keyBy(<key selector>)
.window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
.<windowed transformation>(<window function>);
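Because sliding windows overlap, one element usually lands in several windows. The following sketch (plain Java, hypothetical names, offset assumed to be 0) lists the start of every window that contains a given timestamp; with a 10-second window and a 5-second slide, each element belongs to two windows.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists the sliding windows that contain a timestamp. */
class SlidingWindowSketch {

    /** Start timestamps of all windows [start, start + sizeMs) that contain timestampMs. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // The latest window containing the element starts at the last slide boundary.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Windows [10000, 20000) and [5000, 15000) both contain 12000.
        System.out.println(windowStarts(12_000L, 10_000L, 5_000L)); // [10000, 5000]
    }
}
```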
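2.3 Session Window
The session window assigner groups elements by sessions of activity. Unlike tumbling and sliding windows, session windows do not overlap and have no fixed start or end time. Instead, a session window closes when it has received no elements for a certain period, that is, when a gap of inactivity occurs.
A session window assigner can be configured with either a static session gap or a session gap extractor function that defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.
For example: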
DataStream<T> input = ...;
// event-time session windows with static gap
input
.keyBy(<key selector>)
.window(EventTimeSessionWindows.withGap(Time.minutes(10)))
.<windowed transformation>(<window function>);
// event-time session windows with dynamic gap
input
.keyBy(<key selector>)
.window(EventTimeSessionWindows.withDynamicGap((element) -> {
// determine and return session gap
}))
.<windowed transformation>(<window function>);
// processing-time session windows with static gap
input
.keyBy(<key selector>)
.window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
.<windowed transformation>(<window function>);
// processing-time session windows with dynamic gap
input
.keyBy(<key selector>)
.window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
// determine and return session gap
}))
.<windowed transformation>(<window function>);
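The idea of a static session gap can be sketched as follows. This is a simplified model with hypothetical names: it assumes the timestamps of one key are already sorted and starts a new session whenever the distance to the previous element exceeds the gap. Flink itself reaches the result by merging per-element windows, and the exact handling of the boundary case (distance equal to the gap) is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits sorted timestamps into sessions using a static gap. */
class SessionWindowSketch {

    /** A new session starts after more than gapMs of inactivity. */
    static List<List<Long>> sessions(long[] sortedTimestamps, long gapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            boolean gapExceeded = !current.isEmpty()
                    && ts - current.get(current.size() - 1) > gapMs;
            if (gapExceeded) {
                result.add(current);          // close the current session
                current = new ArrayList<>();  // and open a new one
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {0L, 3_000L, 5_000L, 20_000L, 22_000L};
        System.out.println(sessions(timestamps, 10_000L)); // [[0, 3000, 5000], [20000, 22000]]
    }
}
```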
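2.4 Global Window
A global window is only useful together with a custom trigger. Otherwise no computation is ever performed, because a global window has no natural end at which the aggregated elements could be processed.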
DataStream<T> input = ...;
input
.keyBy(<key selector>)
.window(GlobalWindows.create())
.<windowed transformation>(<window function>);
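Why a trigger is required can be shown with a small model. The class below is a hypothetical plain-Java sketch, not a Flink API: a single buffer collects every element, and only the supplied trigger condition decides when the buffered elements are emitted and cleared. With a trigger that never fires, nothing is ever emitted, which mirrors a global window without a custom trigger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: one never-ending window whose output depends entirely on its trigger. */
class GlobalWindowSketch<T> {
    private final List<T> buffer = new ArrayList<>();
    private final List<List<T>> fired = new ArrayList<>();
    private final Predicate<List<T>> trigger;

    GlobalWindowSketch(Predicate<List<T>> trigger) {
        this.trigger = trigger;
    }

    /** Adds an element; emits and clears the buffer when the trigger condition holds. */
    void add(T element) {
        buffer.add(element);
        if (trigger.test(buffer)) {
            fired.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    List<List<T>> fired() {
        return fired;
    }

    public static void main(String[] args) {
        GlobalWindowSketch<Integer> never = new GlobalWindowSketch<>(b -> false);
        GlobalWindowSketch<Integer> everyThree = new GlobalWindowSketch<>(b -> b.size() >= 3);
        for (int i = 1; i <= 7; i++) {
            never.add(i);
            everyThree.add(i);
        }
        System.out.println(never.fired());      // []
        System.out.println(everyThree.fired()); // [[1, 2, 3], [4, 5, 6]]
    }
}
```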
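3. timeWindow and countWindow
timeWindow and countWindow differ in how they measure the stream: timeWindow computes windows over a range of time, while countWindow computes windows over a number of elements.
Both timeWindow and countWindow support Tumbling Window and Sliding Window.
timeWindow uses Processing Time by default. To compute with Event Time, set Event Time on the env object, for example: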
...
sEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
...
// time: window size
stream.timeWindow(time)
// time: window size, slide: slide step
stream.timeWindow(time,slide)
countWindow has nothing to do with time; just set the element count, as in this code:
stream.countWindow(size).apply()
stream.countWindow(size,slide).apply()
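The behaviour of a sliding count window can be approximated with the sketch below. It is a simplified single-key model with hypothetical names, written under the assumption that the window fires after every slide elements and then sees at most the last size elements; early firings may therefore contain fewer than size elements. With size equal to slide it behaves like a tumbling count window.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative only: a count-based window over a single key. */
class CountWindowSketch {

    /** Fires every `slide` elements with (at most) the last `size` elements. */
    static List<List<Integer>> slidingCount(int[] elements, int size, int slide) {
        List<List<Integer>> fired = new ArrayList<>();
        Deque<Integer> buffer = new ArrayDeque<>();
        int sinceLastFire = 0;
        for (int element : elements) {
            buffer.addLast(element);
            if (buffer.size() > size) {
                buffer.removeFirst(); // keep only the newest `size` elements
            }
            sinceLastFire++;
            if (sinceLastFire == slide) {
                fired.add(new ArrayList<>(buffer));
                sinceLastFire = 0;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        int[] elements = {1, 2, 3, 4, 5, 6};
        System.out.println(slidingCount(elements, 4, 2)); // [[1, 2], [1, 2, 3, 4], [3, 4, 5, 6]]
        System.out.println(slidingCount(elements, 3, 3)); // [[1, 2, 3], [4, 5, 6]]
    }
}
```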
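4. Allowed Lateness (allowedLateness)
During window computation, some data always arrives late. Such data is either dropped or, if an allowed lateness is configured, still processed. Windows support this through allowedLateness(time), which sets how long late data is still accepted. With event time, the window still fires when the watermark passes its end time; it is then kept until window end time + allowed lateness, and each late element that arrives within that period triggers the computation again.
Sample code: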
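import java.text.ParseException;
import java.util.Date;
import java.util.Properties;

import com.google.gson.Gson;
import org.apache.commons.lang3.time.FastDateFormat;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.EventTimeTrigger;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Value is a user-defined message type whose definition is not shown here.
public class WatermarkStream {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");
        p.setProperty("group.id", "test12345");
        FlinkKafkaConsumer010<String> consumer010 =
                new FlinkKafkaConsumer010<>("app_scene_roominfo", new SimpleStringSchema(), p);
        DataStreamSource<String> ds = env.addSource(consumer010);

        // Tag for the side output that receives elements too late for their window.
        final OutputTag<Value> out = new OutputTag<Value>("out") {
        };

        SingleOutputStreamOperator<Value> operator = ds
                // Parse each JSON record into a Value.
                .map(new RichMapFunction<String, Value>() {
                    @Override
                    public Value map(String value) throws Exception {
                        return new Gson().fromJson(value, Value.class);
                    }
                })
                // Keep only records that carry a member id.
                .filter(new FilterFunction<Value>() {
                    @Override
                    public boolean filter(Value value) throws Exception {
                        String memberId = value.getMsg().getMemberId();
                        return memberId != null;
                    }
                })
                // Use the log time as the event timestamp; falls back to 0 if it cannot be parsed.
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Value>() {
                    @Override
                    public long extractAscendingTimestamp(Value element) {
                        long time = 0;
                        try {
                            Date date = FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss")
                                    .parse(element.getMsg().getLogTime());
                            time = date.getTime();
                        } catch (ParseException e) {
                            e.printStackTrace();
                        }
                        return time;
                    }
                })
                .windowAll(TumblingEventTimeWindows.of(Time.minutes(1)))
                // Accept late elements for 10 more seconds after the window end.
                .allowedLateness(Time.seconds(10))
                // Elements that are later than that go to the side output.
                .sideOutputLateData(out)
                .trigger(EventTimeTrigger.create())
                .apply(new AllWindowFunction<Value, Value, TimeWindow>() {
                    @Override
                    public void apply(TimeWindow window, Iterable<Value> values, Collector<Value> out) throws Exception {
                        // Print the wall-clock time of each firing, then forward the window contents.
                        String format = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss").format(new Date());
                        System.out.println("Computation time: " + format);
                        for (Value value : values) {
                            out.collect(value);
                        }
                    }
                });

        operator.print();
        DataStream<Value> sideOutput = operator.getSideOutput(out);
        sideOutput.print("sideOutput: ");
        env.execute("WatermarkStream");
    }
}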
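5. Collecting Late Data (sideOutputLateData)
Data that is too late to enter its window, i.e. data arriving after the window end time plus the allowed lateness, goes to the stream registered with sideOutputLateData. See the example above for details. To observe this, it helps to comment out allowedLateness(); otherwise elements rarely end up in the side output.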
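How the watermark, the window end time and the allowed lateness interact can be summarised in a small decision function. The sketch below is a plain-Java model with hypothetical names, written under the assumption that an event-time window fires once the watermark reaches its last timestamp (end time minus 1 ms) and is discarded once the watermark reaches that timestamp plus the allowed lateness.

```java
/** Illustrative only: decides what happens to an element of a given event-time window. */
class LatenessSketch {

    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    static Outcome classify(long windowEndMs, long allowedLatenessMs, long watermarkMs) {
        long maxTimestamp = windowEndMs - 1; // last timestamp that belongs to the window
        if (watermarkMs < maxTimestamp) {
            return Outcome.ON_TIME;           // the window has not fired yet
        }
        if (watermarkMs < maxTimestamp + allowedLatenessMs) {
            return Outcome.LATE_BUT_ACCEPTED; // the window has fired but is still kept; it fires again
        }
        return Outcome.SIDE_OUTPUT;           // the window is gone; the element is late data
    }

    public static void main(String[] args) {
        long windowEnd = 60_000L; // window [0, 60000), as with 1-minute tumbling windows
        long lateness = 10_000L;  // as with allowedLateness(Time.seconds(10))
        System.out.println(classify(windowEnd, lateness, 30_000L)); // ON_TIME
        System.out.println(classify(windowEnd, lateness, 65_000L)); // LATE_BUT_ACCEPTED
        System.out.println(classify(windowEnd, lateness, 70_000L)); // SIDE_OUTPUT
        // Without allowed lateness, the same element goes straight to the side output.
        System.out.println(classify(windowEnd, 0L, 65_000L));       // SIDE_OUTPUT
    }
}
```

The last call illustrates why removing allowedLateness() makes the side output easier to observe: with a lateness of 0, every element that arrives after its window has fired is treated as late data.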