1. Periodic watermark generation strategy
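Watermarks are generated at a fixed time interval, and a maximum tolerated out-of-orderness (OutOfOrderness) can be specified, so watermark = largest element timestamp seen so far - maximum out-of-orderness (Flink's built-in forBoundedOutOfOrderness additionally subtracts 1 ms). When the watermark reaches the end of a window (strictly, the window's maximum timestamp, which is its end time minus 1 ms), the window fires and is computed. This check is performed by the default trigger.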
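The rule above can be traced by hand with a few lines of plain Java. This is a minimal sketch that does not use Flink itself: the class name `WatermarkArithmetic` and its methods are hypothetical, and it recomputes the watermark after every element, whereas Flink only emits it at the configured interval.

```java
// Illustrative only: plain-Java arithmetic mirroring the bounded-out-of-orderness rule.
// WatermarkArithmetic and its methods are hypothetical names, not Flink APIs.
class WatermarkArithmetic {

    /** watermark = max timestamp seen - out-of-orderness - 1 ms. */
    static long watermarkFor(long maxTimestamp, long outOfOrdernessMillis) {
        return maxTimestamp - outOfOrdernessMillis - 1;
    }

    /** A window [start, end) fires once the watermark reaches end - 1. */
    static boolean windowFires(long watermark, long windowEnd) {
        return watermark >= windowEnd - 1;
    }

    /** Index of the first element whose arrival lets the window fire, or -1 if none does. */
    static int firstFiringIndex(long[] timestamps, long outOfOrdernessMillis, long windowEnd) {
        // Start high enough that the subtraction in watermarkFor cannot overflow.
        long maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;
        for (int i = 0; i < timestamps.length; i++) {
            maxTimestamp = Math.max(maxTimestamp, timestamps[i]);
            if (windowFires(watermarkFor(maxTimestamp, outOfOrdernessMillis), windowEnd)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] timestamps = {3_000L, 9_500L, 8_000L, 10_500L, 11_000L};
        int index = firstFiringIndex(timestamps, 1_000L, 10_000L);
        System.out.println("window [0, 10000) fires at element index " + index);
    }
}
```

With 1 second of out-of-orderness, the window [0, 10000) does not fire when the element with timestamp 10500 arrives (watermark 9499). It fires at the element with timestamp 11000, when the watermark reaches 9999.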
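A watermark delays the moment at which a window fires, whereas allowedLateness is a property of the window that controls how data arriving after the window has already fired is handled. By default late data is dropped. It can also be sent to a side output with sideOutputLateData, or, when allowedLateness is set, it can trigger the window computation again.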
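The three outcomes for an arriving element can be sketched in plain Java. This is an illustration under assumptions, not Flink code: `LatenessDemo`, `Outcome` and `classify` are hypothetical names, and the boundaries follow my reading of Flink's behavior, where a window is kept until the watermark reaches its maximum timestamp plus the allowed lateness.

```java
// Illustrative only: how an element is treated relative to its window,
// given the current watermark. All names here are hypothetical.
class LatenessDemo {

    enum Outcome {
        ON_TIME,      // window has not fired yet; element is simply added
        LATE_REFIRE,  // window already fired but is still kept; element triggers another firing
        TOO_LATE      // window state is gone; element is dropped or sent to the side output
    }

    static Outcome classify(long windowEnd, long allowedLatenessMillis, long currentWatermark) {
        long windowMaxTimestamp = windowEnd - 1;
        if (currentWatermark < windowMaxTimestamp) {
            return Outcome.ON_TIME;
        }
        if (currentWatermark < windowMaxTimestamp + allowedLatenessMillis) {
            return Outcome.LATE_REFIRE;
        }
        return Outcome.TOO_LATE;
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L;       // window [0, 10000)
        long allowedLateness = 2_000L;  // assumed value for the example
        for (long watermark : new long[]{9_000L, 10_500L, 12_000L}) {
            System.out.println("watermark=" + watermark + " -> "
                    + classify(windowEnd, allowedLateness, watermark));
        }
    }
}
```

With an allowed lateness of 0 (the default), the middle case never occurs: once the window has fired, any further element for it is treated as too late.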
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.time.*;
import java.util.Random;
public class WaterMarkBoundedOutOfOrderness {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
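// Deprecated since Flink 1.12, where event time is the default; only needed on older versions
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// Set the watermark emission interval in milliseconds; emitting periodically instead of per element keeps the overhead low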
env.getConfig().setAutoWatermarkInterval(100L);
DataStreamSource<Tuple2<String, Long>> stringDataStreamSource = env.addSource(new SourceFunction<Tuple2<String,Long>>() {
volatile boolean flag = true;
@Override
public void run(SourceContext<Tuple2<String,Long>> sourceContext) throws Exception {
String[] s = {"张三","王五","李四","秋英"};
while(flag) {
Thread.sleep(1000);
int i = new Random().nextInt(4);
sourceContext.collect(new Tuple2<String,Long>(s[i],System.currentTimeMillis()));
}
}
@Override
public void cancel() {
flag = false;
}
});
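// Set the watermark strategy on the stream: periodic generation with bounded out-of-orderness, i.e. a maximum tolerated delay (1 second here)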
SingleOutputStreamOperator<Tuple2<String, Long>> tuple2SingleOutputStreamOperator = stringDataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(1))
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
// Extract the event timestamp (field f1 of the tuple)
@Override
public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
return element.f1;
}
}));
tuple2SingleOutputStreamOperator.map(new MapFunction<Tuple2<String, Long>, Tuple3<String,Long,Integer>>() {
@Override
public Tuple3<String,Long,Integer> map(Tuple2<String, Long> stringLongTuple2) throws Exception {
System.out.println(stringLongTuple2.f0 + stringLongTuple2.f1+" "+System.currentTimeMillis());
return new Tuple3<String,Long,Integer>(stringLongTuple2.f0,stringLongTuple2.f1,1);
}
}).keyBy(new KeySelector<Tuple3<String,Long,Integer>, String>() {
@Override
public String getKey(Tuple3<String,Long,Integer> s) throws Exception {
return s.f0;
}
}).window(TumblingEventTimeWindows.of(Time.seconds(10)))
.sum(2)
.print();
env.execute("watermark test");
}
}
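===================== Divider =========================
The following code behaves much like the forBoundedOutOfOrderness version above, but this approach (AssignerWithPeriodicWatermarks) is deprecated. It requires org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks, org.apache.flink.streaming.api.watermark.Watermark and javax.annotation.Nullable: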
SingleOutputStreamOperator<Tuple2<String, Long>> tuple2SingleOutputStreamOperator = streamSource.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {
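Long maxOutOfOrderness = 1000L;
Long currentWatermark = 0L;
// Start at MIN_VALUE + maxOutOfOrderness so the subtraction in getCurrentWatermark() cannot overflow
Long mxTimestamp = Long.MIN_VALUE + maxOutOfOrderness;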
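// Extract the event timestamp
@Override
public long extractTimestamp(Tuple2<String, Long> s, long l) {
System.out.println("(" + s.f0 + "," + s.f1 + ")");
System.out.println("current watermark: " + currentWatermark);
// Track the largest timestamp seen so far, but return this element's own timestamp
mxTimestamp = Math.max(s.f1, mxTimestamp);
return s.f1;
}
// Generate the watermark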
@Nullable
@Override
public Watermark getCurrentWatermark() {
currentWatermark = mxTimestamp - maxOutOfOrderness;
System.out.println("newly generated watermark: " + currentWatermark);
return new Watermark(currentWatermark);
}
});
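2. What if the watermark stops advancing?
What happens if a source receives no events for a long time?
With no events flowing in, that source emits no new watermarks, so its event-time clock stands still. A source in this state is called an idle source. Because a downstream operator takes the minimum watermark across all of its inputs, a single task whose clock does not advance holds everything back, and windows cannot fire.
In Flink, withIdleness can be used to declare when a source should be considered idle.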
SingleOutputStreamOperator<Tuple2<String, Long>> wordWithTsDS =
wordSource.assignTimestampsAndWatermarks(WatermarkStrategy
.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // tolerate up to 5 seconds of out-of-orderness
.withIdleness(Duration.ofSeconds(15)) // mark the source as idle after 15 seconds without events
.withTimestampAssigner((event, timestamp) -> event.f1)); // extract the event time
This way, if a source has received no events for more than 15 seconds, it is marked as an idle source, and the downstream window operator ignores it when computing its watermark.
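The effect of ignoring idle inputs can be illustrated with a small plain-Java sketch. It does not use Flink: `IdleAwareWatermark`, `Input` and `combined` are hypothetical names, and the code only models the idea that the combined watermark is the minimum over the inputs that are not idle.

```java
import java.util.OptionalLong;

// Illustrative only: combining per-input watermarks while skipping idle inputs.
// All names here are hypothetical; this is not Flink's internal implementation.
class IdleAwareWatermark {

    static final class Input {
        final long watermark;
        final boolean idle;

        Input(long watermark, boolean idle) {
            this.watermark = watermark;
            this.idle = idle;
        }
    }

    /** Minimum watermark over active inputs; empty if every input is idle. */
    static OptionalLong combined(Input... inputs) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (Input input : inputs) {
            if (!input.idle) {
                anyActive = true;
                min = Math.min(min, input.watermark);
            }
        }
        return anyActive ? OptionalLong.of(min) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        Input busy = new Input(20_000L, false);
        Input stalled = new Input(5_000L, false);
        Input stalledButIdle = new Input(5_000L, true);

        // Without idleness detection the stalled input holds the watermark back.
        System.out.println(combined(busy, stalled));
        // Once it is marked idle, the busy input alone drives the watermark.
        System.out.println(combined(busy, stalledButIdle));
    }
}
```

In this model, a window ending at 10000 would never fire while the stalled input is still counted, because the combined watermark stays at 5000. Once that input is marked idle, the combined watermark becomes 20000 and the window can fire.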