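Watermark generation strategies
Flink's DataStream API provides a dedicated method for generating watermarks, assignTimestampsAndWatermarks. It assigns timestamps to the records of a stream and generates watermarks that indicate the progress of event time.
WatermarkStrategy is an interface that bundles a timestamp assigner (TimestampAssigner) and a watermark generator (WatermarkGenerator).
The following examples show how to use Flink's built-in watermark strategies.
1. Built-in watermarks for ordered streams
In an ordered stream the timestamps increase monotonically, so there is no out-of-order or late data. Calling WatermarkStrategy.forMonotonousTimestamps() is enough.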
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<WaterSensor> source = env
.socketTextStream("node1", 7777)
.map(new WaterSensorMapFunction());
// 1. Define the watermark strategy
WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
// 1.1 Watermark generation: ascending timestamps, no wait time
.<WaterSensor>forMonotonousTimestamps()
// 1.2 Timestamp assigner: extract the timestamp from each record
.withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>(){
@Override
public long extractTimestamp(WaterSensor waterSensor, long l) {
System.out.println("record=" + waterSensor + ", recordTs=" + l);
// The returned timestamp must be in milliseconds
return waterSensor.getTs() * 1000L;
}
});
// 2. Apply the watermark strategy
SingleOutputStreamOperator<WaterSensor> watermark = source.assignTimestampsAndWatermarks(watermarkStrategy);
KeyedStream<WaterSensor, String> keyBy = watermark.keyBy(
new KeySelector<WaterSensor, String>() {
@Override
public String getKey(WaterSensor waterSensor) throws Exception {
return waterSensor.getId();
}
}
);
// 3. Use an event-time window
WindowedStream<WaterSensor, String, TimeWindow> sensorWS = keyBy.window(TumblingEventTimeWindows.of(Time.seconds(10)));
SingleOutputStreamOperator<String> process = sensorWS.process(
// Type parameters: IN, OUT, KEY, W
new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
@Override
public void process(String s, Context context, Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
// Get the window start and end time
long startTS = context.window().getStart();
long endTS = context.window().getEnd();
String start_time = DateFormatUtils.format(startTS, "yyyy-MM-dd HH:mm:ss.SSS");
String end_time = DateFormatUtils.format(endTS, "yyyy-MM-dd HH:mm:ss.SSS");
// Estimated number of records in the window
long count = iterable.spliterator().estimateSize();
collector.collect("key=" + s + " window [" + start_time + " -> " +
end_time + ") contains " + count + " records ----> " + iterable.toString());
}
}
);
process.print();
env.execute();
}
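This strategy assumes that timestamps only ever increase. If you feed in a few out-of-order records while testing, you will see that the [0, 10) second window has already closed once a record with timestamp 10s arrives, and out-of-order records that arrive afterwards for that window are dropped rather than placed in any window. Out-of-order streams therefore have their own API; choose the one that fits your data.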
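The dropping behaviour described above can be reproduced without a Flink cluster. The following is only a plain-Java sketch, not Flink code: it assumes the ascending strategy emits a watermark equal to the largest timestamp seen so far minus 1 ms, and that a tumbling window has fired once the watermark reaches the window's last millisecond. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of ascending-timestamp watermarks; this is not Flink code. */
final class MonotonousWatermarkSketch {

    /**
     * Returns the timestamps (ms) that arrive after their tumbling window
     * has already fired, in arrival order.
     */
    static List<Long> droppedTimestamps(long[] timestampsMs, long windowSizeMs) {
        List<Long> dropped = new ArrayList<>();
        long watermark = Long.MIN_VALUE; // no watermark emitted yet
        for (long ts : timestampsMs) {
            long windowStart = ts - Math.floorMod(ts, windowSizeMs);
            long windowMaxTimestamp = windowStart + windowSizeMs - 1;
            // The window has fired once the watermark reached its last millisecond.
            if (watermark >= windowMaxTimestamp) {
                dropped.add(ts);
            }
            // Assumed ascending strategy: largest timestamp seen so far minus 1 ms.
            watermark = Math.max(watermark, ts - 1);
        }
        return dropped;
    }

    public static void main(String[] args) {
        long[] input = {1_000L, 5_000L, 10_000L, 3_000L, 12_000L};
        // The 3s record arrives after the 10s record and is dropped.
        System.out.println(droppedTimestamps(input, 10_000L));
    }
}
```

With this input, the 3s record is the only one reported as dropped, which matches what the test with out-of-order input shows.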
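2. Watermarks for out-of-order streams
An out-of-order stream has to wait for late records, so a delay must be configured. For example, with a 10s window and a 3s delay, the [0, 10) window is only triggered and emits its result once a record with event time 13s arrives. This is done by calling WatermarkStrategy.forBoundedOutOfOrderness(). The method takes a maxOutOfOrderness parameter, the maximum out-of-orderness: the largest expected timestamp difference between out-of-order records in the stream, which is also how long the watermark waits.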
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<WaterSensor> source = env
.socketTextStream("node1", 7777)
.map(new WaterSensorMapFunction());
// 1. Define the watermark strategy
WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
// 1.1 Watermark generation: out-of-order data, wait 3s
.<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
// 1.2 Timestamp assigner: extract the timestamp from each record
.withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>(){
@Override
public long extractTimestamp(WaterSensor waterSensor, long l) {
System.out.println("record=" + waterSensor + ", recordTs=" + l);
// The returned timestamp must be in milliseconds
return waterSensor.getTs() * 1000L;
}
});
// 2. Apply the watermark strategy
SingleOutputStreamOperator<WaterSensor> watermark = source.assignTimestampsAndWatermarks(watermarkStrategy);
KeyedStream<WaterSensor, String> keyBy = watermark.keyBy(
new KeySelector<WaterSensor, String>() {
@Override
public String getKey(WaterSensor waterSensor) throws Exception {
return waterSensor.getId();
}
}
);
// 3. Use an event-time window
WindowedStream<WaterSensor, String, TimeWindow> sensorWS = keyBy.window(TumblingEventTimeWindows.of(Time.seconds(10)));
SingleOutputStreamOperator<String> process = sensorWS.process(
// Type parameters: IN, OUT, KEY, W
new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
@Override
public void process(String s, Context context, Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
// Get the window start and end time
long startTS = context.window().getStart();
long endTS = context.window().getEnd();
String start_time = DateFormatUtils.format(startTS, "yyyy-MM-dd HH:mm:ss.SSS");
String end_time = DateFormatUtils.format(endTS, "yyyy-MM-dd HH:mm:ss.SSS");
// Estimated number of records in the window
long count = iterable.spliterator().estimateSize();
collector.collect("key=" + s + " window [" + start_time + " -> " +
end_time + ") contains " + count + " records ----> " + iterable.toString());
}
}
);
process.print();
env.execute();
}
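Now test with some input, with the wait time set to 3s. When a record with event time 10s arrives, the process method is not called. A record with event time 7s that arrives after the 12s record still lands correctly in the [0, 10) window because of the wait time. When the 13s record arrives, the window fires and emits its result.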
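The arithmetic behind this test run can be sketched in plain Java. This is a simplified model rather than Flink's implementation: it assumes the watermark is the largest timestamp seen so far, minus the configured out-of-orderness, minus 1 ms, and that a window fires once the watermark reaches the window end minus 1 ms. The class and method names are hypothetical.

```java
/** Plain-Java sketch of a bounded-out-of-orderness watermark; this is not Flink code. */
final class BoundedOutOfOrdernessSketch {

    /**
     * Returns the index of the first record after which the window ending at
     * windowEndMs (exclusive) fires, or -1 if it never fires.
     */
    static int firingIndex(long[] timestampsMs, long maxOutOfOrdernessMs, long windowEndMs) {
        long maxTimestamp = Long.MIN_VALUE;
        for (int i = 0; i < timestampsMs.length; i++) {
            maxTimestamp = Math.max(maxTimestamp, timestampsMs[i]);
            // Assumed watermark formula: max timestamp - out-of-orderness - 1 ms.
            long watermark = maxTimestamp - maxOutOfOrdernessMs - 1;
            if (watermark >= windowEndMs - 1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Event times in ms: 1s, 10s, 12s, 7s (out of order), 13s.
        long[] input = {1_000L, 10_000L, 12_000L, 7_000L, 13_000L};
        // The [0, 10s) window fires at index 4, the 13s record.
        System.out.println(firingIndex(input, 3_000L, 10_000L));
    }
}
```

The 7s record sits at index 3, before the firing index, so it is still counted in the [0, 10) window.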
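Watermark propagation
In stream processing, once an upstream task has processed a watermark and advanced its clock, it emits the current watermark again and broadcasts it to all downstream subtasks. When a task receives watermarks from several parallel upstream tasks, it uses the smallest one as its own event-time clock, much like the shortest stave of a barrel limits how much the barrel holds.
It follows that if the upstream task has two partitions and one of them never sends data, the watermark seen by the downstream task never advances, because the task takes the minimum across its inputs. The result is that windows are never triggered even though data keeps arriving.
To solve this, configure an idleness timeout: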
.withIdleness(Duration.ofSeconds(5))
The idleness timeout is set as part of the watermark strategy. Example code follows.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
SingleOutputStreamOperator<Integer> source = env
.socketTextStream("node1", 7777)
.partitionCustom(new MyPartitioner(),r->r)
.map(r -> Integer.parseInt(r))
.assignTimestampsAndWatermarks(WatermarkStrategy
.<Integer>forMonotonousTimestamps()
.withTimestampAssigner((r,ts)->r*1000L)
.withIdleness(Duration.ofSeconds(5))
);
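The effect of idleness on the downstream clock can be sketched as a minimum over the inputs that are not marked idle. This is an illustrative plain-Java model only. The names are hypothetical, and returning Long.MIN_VALUE when every input is idle is a simplification chosen for this sketch, not a statement about what Flink does in that case.

```java
/** Plain-Java sketch of combining upstream watermarks; this is not Flink code. */
final class WatermarkPropagationSketch {

    /**
     * Returns the smallest watermark among the inputs that are not idle.
     * Both arrays must have the same length. If every input is idle, the
     * sketch returns Long.MIN_VALUE to mean "nothing to advance to".
     */
    static long combinedWatermark(long[] inputWatermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < inputWatermarks.length; i++) {
            if (!idle[i]) {
                anyActive = true;
                min = Math.min(min, inputWatermarks[i]);
            }
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] watermarks = {7_000L, Long.MIN_VALUE}; // second input never sent data
        // Without idleness the silent input holds the clock back.
        System.out.println(combinedWatermark(watermarks, new boolean[]{false, false}));
        // Once the silent input is marked idle, the clock follows the active input.
        System.out.println(combinedWatermark(watermarks, new boolean[]{false, true}));
    }
}
```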
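Handling late data
1. Set an out-of-orderness tolerance
That is, use the out-of-order watermark API to set a tolerance. This holds back the progress of the event-time clock (the watermark), so that window computation is delayed.
2. Delay window closing
Call allowedLateness after window(). By default a window is triggered and closed at the same time, because the allowed lateness defaults to 0; this value can be changed.
For example, with a 10s window and a 3s out-of-orderness tolerance, if no allowed lateness is set, then after a record with timestamp 13s arrives, later records that belong to [0, 10) are no longer included in the window. With allowed lateness set, each late record for [0, 10) that arrives before the window closes triggers the computation again.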
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<WaterSensor> source = env
.socketTextStream("node1", 7777)
.map(new WaterSensorMapFunction());
WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
.<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
.withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>(){
@Override
public long extractTimestamp(WaterSensor waterSensor, long l) {
System.out.println("record=" + waterSensor + ", recordTs=" + l);
// The returned timestamp must be in milliseconds
return waterSensor.getTs() * 1000L;
}
});
SingleOutputStreamOperator<WaterSensor> watermark = source.assignTimestampsAndWatermarks(watermarkStrategy);
KeyedStream<WaterSensor, String> keyBy = watermark.keyBy(
new KeySelector<WaterSensor, String>() {
@Override
public String getKey(WaterSensor waterSensor) throws Exception {
return waterSensor.getId();
}
}
);
WindowedStream<WaterSensor, String, TimeWindow> sensorWS = keyBy
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
// Delay closing the window by 2s
.allowedLateness(Time.seconds(2));
SingleOutputStreamOperator<String> process = sensorWS.process(
// Type parameters: IN, OUT, KEY, W
new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
@Override
public void process(String s, Context context, Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
// Get the window start and end time
long startTS = context.window().getStart();
long endTS = context.window().getEnd();
String start_time = DateFormatUtils.format(startTS, "yyyy-MM-dd HH:mm:ss.SSS");
String end_time = DateFormatUtils.format(endTS, "yyyy-MM-dd HH:mm:ss.SSS");
// Estimated number of records in the window
long count = iterable.spliterator().estimateSize();
collector.collect("key=" + s + " window [" + start_time + " -> " +
end_time + ") contains " + count + " records ----> " + iterable.toString());
}
}
);
process.print();
env.execute();
}
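To make the timing in this example concrete, the following plain-Java sketch computes which event time first triggers the window and which event time finally closes it. It rests on two assumptions that should be checked against the Flink version in use: the watermark equals the largest timestamp minus the out-of-orderness minus 1 ms, and a window is discarded once the watermark reaches window end minus 1 ms plus the allowed lateness. All names are hypothetical.

```java
/** Plain-Java sketch of window firing and closing times; this is not Flink code. */
final class AllowedLatenessSketch {

    /** Smallest event timestamp (ms) whose arrival makes the window fire for the first time. */
    static long firstFireAt(long windowEndMs, long outOfOrdernessMs) {
        long targetWatermark = windowEndMs - 1;       // last millisecond of the window
        return targetWatermark + outOfOrdernessMs + 1; // invert watermark = ts - delay - 1
    }

    /** Smallest event timestamp (ms) whose arrival makes the window close for good. */
    static long closeAt(long windowEndMs, long outOfOrdernessMs, long allowedLatenessMs) {
        long targetWatermark = windowEndMs - 1 + allowedLatenessMs;
        return targetWatermark + outOfOrdernessMs + 1;
    }

    public static void main(String[] args) {
        // 10s window, 3s out-of-orderness, 2s allowed lateness, as in the example above.
        System.out.println(firstFireAt(10_000L, 3_000L));      // 13000
        System.out.println(closeAt(10_000L, 3_000L, 2_000L));  // 15000
    }
}
```

Under these assumptions, records for [0, 10) that arrive after the 13s record and before the 15s record each trigger another firing of the window.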
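3. Use a side output to receive late data
There is also a third case: records that are still late even after the two settings above. It is not worth waiting for them any longer; instead, route them to a dedicated side output and merge them with the corresponding main-stream results later.
First define the side output tag, then call sideOutputLateData after window(), and finally read the side output and print it to the console.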
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<WaterSensor> source = env
.socketTextStream("node1", 7777)
.map(new WaterSensorMapFunction());
WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
.<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
.withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>(){
@Override
public long extractTimestamp(WaterSensor waterSensor, long l) {
System.out.println("record=" + waterSensor + ", recordTs=" + l);
// The returned timestamp must be in milliseconds
return waterSensor.getTs() * 1000L;
}
});
SingleOutputStreamOperator<WaterSensor> watermark = source.assignTimestampsAndWatermarks(watermarkStrategy);
KeyedStream<WaterSensor, String> keyBy = watermark.keyBy(
new KeySelector<WaterSensor, String>() {
@Override
public String getKey(WaterSensor waterSensor) throws Exception {
return waterSensor.getId();
}
}
);
// Set up the side output:
// define the tag first, attach it after window(), then read the side output at the end
OutputTag<WaterSensor> waterSensorOutputTag = new OutputTag<>("late-data", Types.POJO(WaterSensor.class));
WindowedStream<WaterSensor, String, TimeWindow> sensorWS = keyBy
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
// Delay closing the window by 2s
.allowedLateness(Time.seconds(2))
// Send records that are still late to the side output
.sideOutputLateData(waterSensorOutputTag);
SingleOutputStreamOperator<String> process = sensorWS.process(
// Type parameters: IN, OUT, KEY, W
new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
@Override
public void process(String s, Context context, Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
// Get the window start and end time
long startTS = context.window().getStart();
long endTS = context.window().getEnd();
String start_time = DateFormatUtils.format(startTS, "yyyy-MM-dd HH:mm:ss.SSS");
String end_time = DateFormatUtils.format(endTS, "yyyy-MM-dd HH:mm:ss.SSS");
// Estimated number of records in the window
long count = iterable.spliterator().estimateSize();
collector.collect("key=" + s + " window [" + start_time + " -> " +
end_time + ") contains " + count + " records ----> " + iterable.toString());
}
}
);
// Read the late records from the side output
process.getSideOutput(waterSensorOutputTag).printToErr("late data");
process.print();
env.execute();
}
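The three layers of late-data handling can be summarised as a routing decision per record. The sketch below is a plain-Java model under stated assumptions, not Flink's window operator: a record is on time while the watermark is below the last millisecond of its window, causes a late firing while the watermark is below that value plus the allowed lateness, and goes to the side output afterwards. The enum and method names are invented for illustration.

```java
/** Plain-Java sketch of routing a record by lateness; this is not Flink code. */
final class LateDataRoutingSketch {

    enum Route { ON_TIME, LATE_FIRING, SIDE_OUTPUT }

    /** Decides where a record goes, given the watermark at the moment it arrives. */
    static Route route(long timestampMs, long watermarkMs, long windowSizeMs, long allowedLatenessMs) {
        long windowStart = timestampMs - Math.floorMod(timestampMs, windowSizeMs);
        long windowMaxTimestamp = windowStart + windowSizeMs - 1;
        if (watermarkMs < windowMaxTimestamp) {
            return Route.ON_TIME;      // window has not fired yet
        }
        if (watermarkMs < windowMaxTimestamp + allowedLatenessMs) {
            return Route.LATE_FIRING;  // window fired but is still open
        }
        return Route.SIDE_OUTPUT;      // window is closed
    }

    public static void main(String[] args) {
        // A 7s record, 10s windows, 2s allowed lateness, at three different watermarks.
        System.out.println(route(7_000L, 8_999L, 10_000L, 2_000L));  // ON_TIME
        System.out.println(route(7_000L, 9_999L, 10_000L, 2_000L));  // LATE_FIRING
        System.out.println(route(7_000L, 11_999L, 10_000L, 2_000L)); // SIDE_OUTPUT
    }
}
```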
Time-based stream joins
1. Window join
stream1.join(stream2)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(<WindowAssigner>)
.apply(<JoinFunction>)
Specify a KeySelector for each of the two streams, then specify the window type.
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Tuple2<String, Integer>> ds1 = env.fromElements(
Tuple2.of("a", 1),
Tuple2.of("a", 2),
Tuple2.of("b", 3),
Tuple2.of("c", 4)
)
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple2<String, Integer>>forMonotonousTimestamps()
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Integer>>() {
@Override
public long extractTimestamp(Tuple2<String, Integer> stringIntegerTuple2, long l) {
return stringIntegerTuple2.f1 * 1000L;
}
})
);
SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> ds2 = env
.fromElements(
Tuple3.of("a", 1, 1),
Tuple3.of("a", 11, 1),
Tuple3.of("b", 2, 1),
Tuple3.of("b", 12, 1),
Tuple3.of("c", 14, 1),
Tuple3.of("d", 15, 1)
)
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple3<String, Integer, Integer>>forMonotonousTimestamps()
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, Integer, Integer>>() {
@Override
public long extractTimestamp(Tuple3<String, Integer, Integer> stringIntegerIntegerTuple3, long l) {
return stringIntegerIntegerTuple3.f1 * 1000L;
}
})
);
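// Window join
// 1. Records match only if they fall into the same time window
// 2. Records are matched on the join key
// 3. Only matched records are emitted, similar to an inner join over a fixed time range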
ds1
.join(ds2)
.where(r1 -> r1.f0)
.equalTo(r2 -> r2.f0)
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
.apply(new JoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>() {
@Override
public String join(Tuple2<String, Integer> value1, Tuple3<String, Integer, Integer> value2) throws Exception {
return value1 + "<---->" + value2;
}
})
.print();
env.execute();
}
The window join output shows that the two streams are joined by key, and only records that fall into the same time window, [0, 10) here, are matched.
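The matching rule can be checked by hand with a small plain-Java model of the demo data. This is only a sketch of the rule "same key and same tumbling window", with records represented as parallel arrays of keys and timestamps in milliseconds. It does not use Flink, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the window join matching rule; this is not Flink code. */
final class WindowJoinSketch {

    /** Start of the tumbling window that contains the timestamp. */
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    /** Pairs every left record with every right record of the same key and window. */
    static List<String> join(String[] leftKeys, long[] leftTs,
                             String[] rightKeys, long[] rightTs, long windowSizeMs) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < leftKeys.length; i++) {
            for (int j = 0; j < rightKeys.length; j++) {
                boolean sameKey = leftKeys[i].equals(rightKeys[j]);
                boolean sameWindow = windowStart(leftTs[i], windowSizeMs)
                        == windowStart(rightTs[j], windowSizeMs);
                if (sameKey && sameWindow) {
                    out.add(leftKeys[i] + "@" + leftTs[i] + "<---->" + rightKeys[j] + "@" + rightTs[j]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] leftKeys = {"a", "a", "b", "c"};
        long[] leftTs = {1_000L, 2_000L, 3_000L, 4_000L};
        String[] rightKeys = {"a", "a", "b", "b", "c", "d"};
        long[] rightTs = {1_000L, 11_000L, 2_000L, 12_000L, 14_000L, 15_000L};
        join(leftKeys, leftTs, rightKeys, rightTs, 10_000L).forEach(System.out::println);
    }
}
```

For the demo data this yields three pairs; (c, 4) finds no partner because the only right-hand record with key c falls into the [10, 20) window.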
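2. Interval join
In some scenarios tumbling windows are awkward: two records that should be joined fall into different windows and are never matched. An interval join handles this by defining a lower and an upper bound relative to each record's timestamp. For example, for a record at 3s with both bounds set to 2s, the joinable interval is [1, 5]. This approach fits certain special scenarios; note that it only supports event time.
The usual call pattern is: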
stream1
.keyBy(<KeySelector>)
.intervalJoin(stream2.keyBy(<KeySelector>))
.between(Time.milliseconds(-2), Time.milliseconds(1))
.process(new ProcessJoinFunction<Integer, Integer, String>() {
@Override
public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
out.collect(left + "," + right);
}
});
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Tuple2<String, Integer>> ds1 = env.fromElements(
Tuple2.of("a", 1),
Tuple2.of("a", 2),
Tuple2.of("b", 3),
Tuple2.of("c", 4)
)
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple2<String, Integer>>forMonotonousTimestamps()
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Integer>>() {
@Override
public long extractTimestamp(Tuple2<String, Integer> stringIntegerTuple2, long l) {
return stringIntegerTuple2.f1 * 1000L;
}
})
);
SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> ds2 = env
.fromElements(
Tuple3.of("a", 1, 1),
Tuple3.of("a", 5, 1)
// ,
// Tuple3.of("b", 2, 1),
// Tuple3.of("b", 12, 1),
// Tuple3.of("c", 14, 1),
// Tuple3.of("d", 15, 1)
)
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple3<String, Integer, Integer>>forMonotonousTimestamps()
.withTimestampAssigner(
(value,l)->value.f1*1000L
)
);
// Interval join
// 1. keyBy both streams; the key is effectively the join condition
KeyedStream<Tuple2<String, Integer>, String> ks1 = ds1.keyBy(r1 -> r1.f0);
KeyedStream<Tuple3<String, Integer, Integer>, String> ks2 = ds2.keyBy(r2 -> r2.f0);
// 2. Call intervalJoin
ks1
.intervalJoin(ks2)
.between(Time.seconds(-2),Time.seconds(2))
.process(new ProcessJoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>() {
@Override
public void processElement(Tuple2<String, Integer> value1, Tuple3<String, Integer, Integer> value2, Context context, Collector<String> collector) throws Exception {
collector.collect(value1 + "<---->" + value2);
}
})
.print();
env.execute();
}
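The demo uses a lower bound of -2s and an upper bound of 2s.
The interval of the first record in stream1, (a, 1), is therefore [-1s, 3s], which does not include 5s, so it is not joined with (a, 5, 1).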
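The interval rule can be verified with a few lines of plain Java. The sketch assumes both bounds are inclusive, so a right-hand record matches when its timestamp lies in [left + lowerBound, left + upperBound] and the keys are equal. It models only the matching condition, not Flink's operator, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the interval join matching rule; this is not Flink code. */
final class IntervalJoinSketch {

    /** True if rightTs lies in [leftTs + lowerBoundMs, leftTs + upperBoundMs]. */
    static boolean inInterval(long leftTs, long rightTs, long lowerBoundMs, long upperBoundMs) {
        return rightTs >= leftTs + lowerBoundMs && rightTs <= leftTs + upperBoundMs;
    }

    /** Pairs left and right records with equal keys whose timestamps satisfy the interval. */
    static List<String> join(String[] leftKeys, long[] leftTs,
                             String[] rightKeys, long[] rightTs,
                             long lowerBoundMs, long upperBoundMs) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < leftKeys.length; i++) {
            for (int j = 0; j < rightKeys.length; j++) {
                if (leftKeys[i].equals(rightKeys[j])
                        && inInterval(leftTs[i], rightTs[j], lowerBoundMs, upperBoundMs)) {
                    out.add(leftKeys[i] + "@" + leftTs[i] + "<---->" + rightKeys[j] + "@" + rightTs[j]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] leftKeys = {"a", "a", "b", "c"};
        long[] leftTs = {1_000L, 2_000L, 3_000L, 4_000L};
        String[] rightKeys = {"a", "a"};
        long[] rightTs = {1_000L, 5_000L};
        // Bounds of -2s and +2s, as in the demo.
        join(leftKeys, leftTs, rightKeys, rightTs, -2_000L, 2_000L).forEach(System.out::println);
    }
}
```

With the demo data only the right-hand record at 1s is matched, once with (a, 1) and once with (a, 2); the record at 5s lies outside both intervals.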
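Handling late data: suppose the watermark of the joined streams has already reached 7s when a record with event time 3s arrives. That record is late. Late records can be retrieved through a side output.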
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Tuple2<String, Integer>> ds1 = env.socketTextStream("node1", 7777)
.map(new MapFunction<String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(String s) throws Exception {
String[] datas = s.split(",");
return Tuple2.of(datas[0], Integer.valueOf(datas[1]));
}
})
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple2<String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
.withTimestampAssigner((value, ts) -> value.f1 * 1000L)
);
SingleOutputStreamOperator<Tuple3<String, Integer,Integer>> ds2 = env.socketTextStream("node1", 8888)
.map(new MapFunction<String, Tuple3<String, Integer,Integer>>() {
@Override
public Tuple3<String, Integer,Integer> map(String s) throws Exception {
String[] datas = s.split(",");
return Tuple3.of(datas[0], Integer.valueOf(datas[1]), Integer.valueOf(datas[2]));
}
})
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple3<String, Integer,Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
.withTimestampAssigner((value, ts) -> value.f1 * 1000L)
);
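/**
* Interval join
* 1. Only event time is supported
* 2. Specify the lower and upper bound offsets; negative means earlier in time, positive means later
* 3. process() only sees pairs that were joined
* 4. The watermark after the join is the smaller of the two streams' watermarks
* 5. A record whose event time is less than the current watermark is late and is not handled by the main process()
* => after between(), late records of the left or right stream can be sent to a side output
*/
// 1. keyBy both streams; the key is effectively the join condition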
KeyedStream<Tuple2<String, Integer>, String> ks1 = ds1.keyBy(r1 -> r1.f0);
KeyedStream<Tuple3<String, Integer, Integer>, String> ks2 = ds2.keyBy(r2 -> r2.f0);
OutputTag<Tuple2<String, Integer>> leftTag = new OutputTag<>("left-late", Types.TUPLE(Types.STRING, Types.INT));
OutputTag<Tuple3<String, Integer, Integer>> rightTag = new OutputTag<>("right-late", Types.TUPLE(Types.STRING, Types.INT, Types.INT));
SingleOutputStreamOperator<String> process = ks1.intervalJoin(ks2)
.between(Time.seconds(-2), Time.seconds(2))
.sideOutputLeftLateData(leftTag)
.sideOutputRightLateData(rightTag)
.process(
new ProcessJoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>() {
@Override
public void processElement(Tuple2<String, Integer> value1, Tuple3<String, Integer, Integer> value2, Context context, Collector<String> collector) throws Exception {
collector.collect(value1 + "<---->" + value2);
}
}
);
process.print();
process.getSideOutput(leftTag).printToErr("left-late");
process.getSideOutput(rightTag).printToErr("right-late");
env.execute();
}
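First, stream 1 and stream 2 each receive records; stream 1's watermark is 7s and stream 2's is 8s, so the joined watermark is the smaller of the two, 7s. Then a late record with event time 5s arrives on stream 2. In principle it could match (a, 4), but because its event time is already below the joined watermark it is no longer processed and can only be retrieved through the side output. A record arriving at this point with event time 7s would not count as late.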
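The lateness check described here comes down to one comparison. The following plain-Java sketch assumes the joined watermark is the minimum of the two input watermarks and that a record is late only when its event time is strictly less than that watermark. The names are hypothetical and the code does not depend on Flink.

```java
/** Plain-Java sketch of the interval join lateness check; this is not Flink code. */
final class IntervalJoinLatenessSketch {

    /** The joined watermark is assumed to be the smaller of the two input watermarks. */
    static long joinedWatermark(long leftWatermarkMs, long rightWatermarkMs) {
        return Math.min(leftWatermarkMs, rightWatermarkMs);
    }

    /** A record is treated as late when its event time is strictly below the joined watermark. */
    static boolean isLate(long eventTimeMs, long leftWatermarkMs, long rightWatermarkMs) {
        return eventTimeMs < joinedWatermark(leftWatermarkMs, rightWatermarkMs);
    }

    public static void main(String[] args) {
        // Stream 1 is at 7s and stream 2 at 8s, as in the walkthrough above.
        System.out.println(isLate(5_000L, 7_000L, 8_000L)); // true: goes to the side output
        System.out.println(isLate(7_000L, 7_000L, 8_000L)); // false: still processed
    }
}
```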