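Flink jobs commonly use sliding and tumbling windows based on either processing time or event time. The size of these windows is fixed when the job is built, though: if a requirement calls for a different window size, you have to change the program, redeploy it, and restart the job. That is acceptable when such changes are rare, but it becomes time-consuming when they are frequent. This post describes a way to change the window size dynamically while the job is running.
Flink's window API has the following general structure: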
Keyed Windows
stream
.keyBy(...) <- keyed versus non-keyed windows
.window(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
Non-Keyed Windows
stream
.windowAll(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
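When you use a sliding or tumbling window, Flink already provides the corresponding classes and you can use them directly. The dynamic window described here builds on those existing classes; specifically, the change is made in the window assigner.
A window assigner defines how elements are assigned to windows. You specify it by passing a WindowAssigner to the window(...) call (for keyed streams) or the windowAll(...) call (for non-keyed streams).
A WindowAssigner is responsible for assigning each incoming element to one or more windows. Flink ships with predefined window assigners for the most common use cases: tumbling windows, sliding windows, session windows, and global windows. You can also implement a custom window assigner by extending the WindowAssigner class. All built-in window assigners (except global windows) assign elements to windows based on time, which can be either processing time or event time.
Time-based windows have a start timestamp (inclusive) and an end timestamp (exclusive), which together describe the size of the window. In code, Flink represents time-based windows with TimeWindow, which has methods for querying the start and end timestamps, plus an additional method, maxTimestamp(), that returns the largest timestamp allowed for a given window.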
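To make the boundary rule concrete, here is a minimal plain-Java sketch of a window whose start is inclusive and whose end is exclusive. SimpleTimeWindow and its contains() helper are hypothetical stand-ins written for this illustration, not Flink classes; only the maxTimestamp() rule (end minus one millisecond) is meant to mirror TimeWindow.

```java
/** Plain-Java stand-in for a time window: start is inclusive, end is exclusive. */
final class SimpleTimeWindow {
    private final long start;
    private final long end;

    SimpleTimeWindow(long start, long end) {
        this.start = start;
        this.end = end;
    }

    long getStart() {
        return start;
    }

    long getEnd() {
        return end;
    }

    /** Largest timestamp that still belongs to this window. */
    long maxTimestamp() {
        return end - 1;
    }

    /** Hypothetical helper: true if the timestamp lies in [start, end). */
    boolean contains(long timestamp) {
        return timestamp >= start && timestamp < end;
    }

    public static void main(String[] args) {
        SimpleTimeWindow window = new SimpleTimeWindow(1000L, 3000L);
        System.out.println(window.contains(1000L)); // the start timestamp is inside
        System.out.println(window.contains(3000L)); // the end timestamp is outside
        System.out.println(window.maxTimestamp());
    }
}
```

Because the end is exclusive, two adjacent windows such as [1000, 3000) and [3000, 5000) never both contain the same timestamp.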
First, look at the relevant part of Flink's built-in sliding window assigner:
public class SlidingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
private static final long serialVersionUID = 1L;
// Window size, in milliseconds
private final long size;
// Slide interval, in milliseconds
private final long slide;
private final long offset;
protected SlidingEventTimeWindows(long size, long slide, long offset) {
if (Math.abs(offset) >= slide || size <= 0) {
throw new IllegalArgumentException(
"SlidingEventTimeWindows parameters must satisfy "
+ "abs(offset) < slide and size > 0");
}
this.size = size;
this.slide = slide;
this.offset = offset;
}
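// Windows are assigned from size and slide, so adjusting these two values here makes the windows dynamic.
// Every call receives the original element, so the adjusted values can be extracted from the element itself.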
@Override
public Collection<TimeWindow> assignWindows(
Object element, long timestamp, WindowAssignerContext context) {
if (timestamp > Long.MIN_VALUE) {
List<TimeWindow> windows = new ArrayList<>((int) (size / slide));
long lastStart = TimeWindow.getWindowStartWithOffset(timestamp, offset, slide);
for (long start = lastStart; start > timestamp - size; start -= slide) {
windows.add(new TimeWindow(start, start + size));
}
return windows;
} else {
throw new RuntimeException(
"Record has Long.MIN_VALUE timestamp (= no timestamp marker). "
+ "Is the time characteristic set to 'ProcessingTime', or did you forget to call "
+ "'DataStream.assignTimestampsAndWatermarks(...)'?");
}
}
}
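The assignment loop above is easier to follow with concrete numbers. The sketch below repeats the same arithmetic in plain Java, without Flink on the classpath. SlidingAssignmentSketch and its windowStart() method are names made up for this example, and windowStart() is my own re-implementation of what TimeWindow.getWindowStartWithOffset is used for here (aligning the timestamp to the slide), so treat it as an assumption rather than Flink's exact code.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingAssignmentSketch {

    /** Start of the latest slide-aligned window that contains the timestamp. */
    static long windowStart(long timestamp, long offset, long slide) {
        // floorMod keeps the result aligned for negative timestamps as well
        return timestamp - Math.floorMod(timestamp - offset, slide);
    }

    /** Returns every [start, end) window that contains the timestamp, newest first. */
    static List<long[]> assign(long timestamp, long size, long slide, long offset) {
        List<long[]> windows = new ArrayList<>((int) (size / slide));
        long lastStart = windowStart(timestamp, offset, slide);
        // Walk backwards one slide at a time while the window still covers the timestamp
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        // size = 2000 ms, slide = 1000 ms, offset = 0, element timestamp = 2500 ms
        for (long[] window : assign(2500L, 2000L, 1000L, 0L)) {
            System.out.println("[" + window[0] + ", " + window[1] + ")");
        }
    }
}
```

With a size of 2000 ms and a slide of 1000 ms, an element with timestamp 2500 is assigned to two windows, [2000, 4000) and [1000, 3000). Changing either value changes both the number of windows and their boundaries, which is exactly the lever the dynamic assigner uses.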
The modified assigner looks like this:
public class DynSlidingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
private static final long serialVersionUID = 1L;
private final long size;
private final long slide;
private final long offset;
// Extracts the window size from the incoming element
private final TimeAdjustExtractor sizeTimeAdjustExtractor;
// Extracts the slide interval from the incoming element
private final TimeAdjustExtractor slideTimeAdjustExtractor;
protected DynSlidingEventTimeWindows(long size, long slide, long offset) {
if (Math.abs(offset) >= slide || size <= 0) {
throw new IllegalArgumentException(
"DynSlidingEventTimeWindows parameters must satisfy "
+ "abs(offset) < slide and size > 0");
}
this.size = size;
this.slide = slide;
this.offset = offset;
this.sizeTimeAdjustExtractor = (elem) -> 0;
this.slideTimeAdjustExtractor = (elem) -> 0;
}
protected DynSlidingEventTimeWindows(long size, long slide, long offset,TimeAdjustExtractor sizeTimeAdjustExtractor,
TimeAdjustExtractor slideTimeAdjustExtractor) {
if (Math.abs(offset) >= slide || size <= 0) {
throw new IllegalArgumentException(
"DynSlidingEventTimeWindows parameters must satisfy "
+ "abs(offset) < slide and size > 0");
}
this.size = size;
this.slide = slide;
this.offset = offset;
this.sizeTimeAdjustExtractor = sizeTimeAdjustExtractor;
this.slideTimeAdjustExtractor = slideTimeAdjustExtractor;
}
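// On every assignment, extract the window size and slide from the element. A non-zero value overrides the configured default, which is what makes the window dynamic.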
@Override
public Collection<TimeWindow> assignWindows(
Object element, long timestamp, WindowAssignerContext context) {
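// An extractor returning 0 means "no override": fall back to the configured default.
long adjustedSize = this.sizeTimeAdjustExtractor.extract(element);
long adjustedSlide = this.slideTimeAdjustExtractor.extract(element);
long realSize = adjustedSize == 0 ? size : adjustedSize;
long realSlide = adjustedSlide == 0 ? slide : adjustedSlide;
// A negative size or slide would make the loop below run forever, so reject it early.
if (realSize <= 0 || realSlide <= 0) {
throw new IllegalArgumentException(
"Extracted window size and slide must be positive, got size=" + realSize + ", slide=" + realSlide);
}
if (timestamp > Long.MIN_VALUE) {
List<TimeWindow> windows = new ArrayList<>((int) (realSize / realSlide));
long lastStart = TimeWindow.getWindowStartWithOffset(timestamp, offset, realSlide);
for (long start = lastStart; start > timestamp - realSize; start -= realSlide) {
windows.add(new TimeWindow(start, start + realSize));
}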
return windows;
} else {
throw new RuntimeException(
"Record has Long.MIN_VALUE timestamp (= no timestamp marker). "
+ "Is the time characteristic set to 'ProcessingTime', or did you forget to call "
+ "'DataStream.assignTimestampsAndWatermarks(...)'?");
}
}
public long getSize() {
return size;
}
public long getSlide() {
return slide;
}
@Override
public Trigger<Object, TimeWindow> getDefaultTrigger(StreamExecutionEnvironment env) {
return EventTimeTrigger.create();
}
@Override
public String toString() {
return "DynSlidingEventTimeWindows(" + size + ", " + slide + ")";
}
/**
* Creates a new {@code DynSlidingEventTimeWindows} {@link WindowAssigner} that assigns elements to
* sliding time windows based on the element timestamp.
*
* @param size The size of the generated windows.
* @param slide The slide interval of the generated windows.
* @return The time policy.
*/
public static DynSlidingEventTimeWindows of(Time size, Time slide) {
return new DynSlidingEventTimeWindows(size.toMilliseconds(), slide.toMilliseconds(), 0);
}
/**
* Creates a new {@code DynSlidingEventTimeWindows} {@link WindowAssigner} that assigns elements to
* time windows based on the element timestamp and offset.
*
* <p>For example, if you want window a stream by hour,but window begins at the 15th minutes of
* each hour, you can use {@code of(Time.hours(1),Time.minutes(15))},then you will get time
* windows start at 0:15:00,1:15:00,2:15:00,etc.
*
* <p>Rather than that,if you are living in somewhere which is not using UTC±00:00 time, such as
* China which is using UTC+08:00,and you want a time window with size of one day, and window
* begins at every 00:00:00 of local time,you may use {@code of(Time.days(1),Time.hours(-8))}.
* The parameter of offset is {@code Time.hours(-8))} since UTC+08:00 is 8 hours earlier than
* UTC time.
*
* @param size The size of the generated windows.
* @param slide The slide interval of the generated windows.
* @param offset The offset which window start would be shifted by.
* @return The time policy.
*/
public static DynSlidingEventTimeWindows of(Time size, Time slide, Time offset) {
return new DynSlidingEventTimeWindows(
size.toMilliseconds(), slide.toMilliseconds(), offset.toMilliseconds());
}
public static DynSlidingEventTimeWindows of(Time size, Time slide, Time offset,TimeAdjustExtractor sizeTimeAdjustExtractor,
TimeAdjustExtractor slideTimeAdjustExtractor) {
return new DynSlidingEventTimeWindows(
size.toMilliseconds(), slide.toMilliseconds(), offset.toMilliseconds(),
sizeTimeAdjustExtractor,slideTimeAdjustExtractor);
}
@Override
public TypeSerializer<TimeWindow> getWindowSerializer(ExecutionConfig executionConfig) {
return new TimeWindow.Serializer();
}
@Override
public boolean isEventTime() {
return true;
}
}
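The class above relies on a TimeAdjustExtractor type that the post does not show. The following plain-Java sketch gives one possible shape for it and exercises the override logic without Flink on the classpath. Everything here is an assumption made for illustration: the interface definition (inferred from how it is called, and made Serializable because a WindowAssigner is itself Serializable), the Event class, and the DynamicAssignmentSketch helper, which fixes the offset at 0 for brevity.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Assumed shape of the extractor: returns an override in milliseconds, or 0 for "use the default". */
@FunctionalInterface
interface TimeAdjustExtractor extends Serializable {
    long extract(Object element);
}

final class DynamicAssignmentSketch {

    /** Hypothetical record type that carries its own window parameters. */
    static final class Event {
        final long adjustSize;
        final long adjustSlide;

        Event(long adjustSize, long adjustSlide) {
            this.adjustSize = adjustSize;
            this.adjustSlide = adjustSlide;
        }
    }

    /** Same loop as the assigner, with per-element overrides and the offset fixed at 0. */
    static List<long[]> assign(Object element, long timestamp, long defaultSize, long defaultSlide,
            TimeAdjustExtractor sizeExtractor, TimeAdjustExtractor slideExtractor) {
        long adjustedSize = sizeExtractor.extract(element);
        long adjustedSlide = slideExtractor.extract(element);
        long size = adjustedSize == 0 ? defaultSize : adjustedSize;
        long slide = adjustedSlide == 0 ? defaultSlide : adjustedSlide;
        if (size <= 0 || slide <= 0) {
            throw new IllegalArgumentException("size and slide must be positive");
        }
        List<long[]> windows = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        TimeAdjustExtractor sizeOf = e -> ((Event) e).adjustSize;
        TimeAdjustExtractor slideOf = e -> ((Event) e).adjustSlide;
        // First element keeps the defaults, second one overrides both values
        for (Event event : new Event[] {new Event(0L, 0L), new Event(4000L, 2000L)}) {
            for (long[] w : assign(event, 2500L, 2000L, 1000L, sizeOf, slideOf)) {
                System.out.println("[" + w[0] + ", " + w[1] + ")");
            }
        }
    }
}
```

In this sketch, an element without overrides at timestamp 2500 lands in [2000, 4000) and [1000, 3000), while an element that carries a size of 4000 ms and a slide of 2000 ms lands in [2000, 6000) and [0, 4000). Elements stamped with different values are therefore assigned to differently shaped windows, even for the same key.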
Example usage:
StreamExecutionEnvironment env = FlinkEnvironment.getEnv(true,1);
JobConfig config = new JobConfig();
env.getConfig().setGlobalJobParameters(config.getParameterTool());
SingleOutputStreamOperator<String> source = env.addSource(new FakeRecordSource(100))
.assignTimestampsAndWatermarks(watermarkStrategy).setParallelism(2);
// Read the configuration and write the window size and slide that should apply into every record
SingleOutputStreamOperator<FakeRecordSource.TrafficRecord> resultWithAdjustMap = source.map(new AddAdjustTimeFunction());
SingleOutputStreamOperator<FakeRecordSource.TrafficRecord> result = resultWithAdjustMap.keyBy(new KeySelector<FakeRecordSource.TrafficRecord, Integer>() {
@Override
public Integer getKey(FakeRecordSource.TrafficRecord s) throws Exception {
return s.getCityId();
}
}).window(DynSlidingEventTimeWindows.of(Time.seconds(2), Time.seconds(1), Time.seconds(0), new TimeAdjustExtractor() {
@Override
public long extract(Object element) {
return ((FakeRecordSource.TrafficRecord)element).getAdjustSize();
}
}, new TimeAdjustExtractor() {
@Override
public long extract(Object element) {
return ((FakeRecordSource.TrafficRecord)element).getAdjustSlide();
}
}))
.process(new ProcessWindowFunction<FakeRecordSource.TrafficRecord, FakeRecordSource.TrafficRecord, Integer, TimeWindow>() {
@Override
public void process(Integer integer, Context context, Iterable<FakeRecordSource.TrafficRecord> iterable, Collector<FakeRecordSource.TrafficRecord> collector) throws Exception {
}
}).setParallelism(2);
result.addSink(new SinkFunction<FakeRecordSource.TrafficRecord>() {
@Override
public void invoke(FakeRecordSource.TrafficRecord value) throws Exception {
}
}).setParallelism(2);
env.execute("flink-dynamic");
}
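How it works:
1. Before the data reaches the window operator, every record is stamped with the window size and slide that should currently apply.
2. Once a record reaches the window operator, assignWindows is called for it. Inside that method, the window parameters can be read from the record itself and used when the windows are created.
This is how the window size is changed dynamically.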
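Step 1 is handled by AddAdjustTimeFunction in the usage code, which the post does not show. As a rough idea of what such a step does, here is a plain-Java sketch with no Flink dependency. All names (AdjustStampSketch, TrafficEvent, WindowSettings, stamper) are hypothetical, and how the settings get refreshed at runtime is deliberately left open, since the post does not say where the configuration comes from.

```java
import java.util.function.UnaryOperator;

final class AdjustStampSketch {

    /** Hypothetical record with the two adjustment fields that the extractors read. */
    static final class TrafficEvent {
        final int cityId;
        long adjustSize;  // 0 means "keep the assigner's default size"
        long adjustSlide; // 0 means "keep the assigner's default slide"

        TrafficEvent(int cityId) {
            this.cityId = cityId;
        }
    }

    /** Hypothetical holder for the current settings; something else is assumed to update it. */
    static final class WindowSettings {
        volatile long sizeMs;
        volatile long slideMs;
    }

    /** Builds the per-record step that copies the current settings onto each record. */
    static UnaryOperator<TrafficEvent> stamper(WindowSettings settings) {
        return event -> {
            event.adjustSize = settings.sizeMs;
            event.adjustSlide = settings.slideMs;
            return event;
        };
    }

    public static void main(String[] args) {
        WindowSettings settings = new WindowSettings();
        UnaryOperator<TrafficEvent> stamp = stamper(settings);

        TrafficEvent before = stamp.apply(new TrafficEvent(1));
        // Simulate a configuration change while the stream keeps flowing
        settings.sizeMs = 4000L;
        settings.slideMs = 2000L;
        TrafficEvent after = stamp.apply(new TrafficEvent(1));

        System.out.println(before.adjustSize + " / " + before.adjustSlide);
        System.out.println(after.adjustSize + " / " + after.adjustSlide);
    }
}
```

Records stamped before the change carry 0 and so keep the assigner's defaults, while records stamped afterwards carry the new size and slide, so the switch takes effect per record without restarting anything.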