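A simple way to understand late data
Take a 0-10s window with a watermark delay of 2s. When a record with timestamp 12s arrives, the watermark reaches the end of the window, so the window fires and then closes. If a record with timestamp 6s arrives right after that, it belongs to the 0-10s window, but the delay has already elapsed and the window is closed. Such a record is called late data.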
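The arithmetic behind this example can be sketched in plain Java. This is a simplified model, not Flink's own class: the name `WatermarkSketch` and its methods are made up for illustration. It assumes the watermark is the largest timestamp seen so far, minus the delay, minus 1 ms, and that a window [start, end) fires once the watermark reaches end - 1 ms. Real Flink emits watermarks periodically, so the exact moment can lag slightly behind this model.

```java
import java.time.Duration;

/** Simplified model of a bounded-out-of-orderness watermark; not Flink's own class. */
final class WatermarkSketch {
    private final long delayMs;
    private long maxTimestampMs;

    WatermarkSketch(Duration delay) {
        this.delayMs = delay.toMillis();
        // Start high enough that the subtraction in currentWatermark() cannot underflow.
        this.maxTimestampMs = Long.MIN_VALUE + delayMs + 1;
    }

    /** Records an event timestamp and returns the resulting watermark. */
    long onEvent(long eventTimestampMs) {
        maxTimestampMs = Math.max(maxTimestampMs, eventTimestampMs);
        return currentWatermark();
    }

    long currentWatermark() {
        return maxTimestampMs - delayMs - 1;
    }

    /** A window [start, end) has fired once the watermark reaches end - 1 ms. */
    static boolean windowHasFired(long windowEndMs, long watermarkMs) {
        return watermarkMs >= windowEndMs - 1;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(Duration.ofSeconds(2));
        long afterTwelve = wm.onEvent(12_000L); // 12_000 - 2_000 - 1 = 9_999
        System.out.println("watermark=" + afterTwelve
                + " windowFired=" + windowHasFired(10_000L, afterTwelve));
        long afterSix = wm.onEvent(6_000L);     // an older record never lowers the watermark
        System.out.println("watermark=" + afterSix
                + " sixSecondRecordIsLate=" + windowHasFired(10_000L, afterSix));
    }
}
```

The 6s record is late because the window it belongs to had already fired by the time it arrived.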
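Handling late data
1. Allowed lateness on the window
Flink windows can also accept late data. When the watermark first reaches the end of the window, Flink computes and emits the current result, but it does not close the window yet.
After that, every late record triggers another computation of the window it belongs to. Once the watermark passes the window end time + the allowed lateness, the window is actually closed.
Example: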
package window;

import flink_Partition.WaterSensorMapFunction;
import flink_transfrom.WaterSensor;
import org.apache.commons.lang.time.DateFormatUtils;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class WatermarkAllowLateness {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<WaterSensor> sensorsDS = env
                .socketTextStream("hadoop102", 7777)
                .map(new WaterSensorMapFunction());

        // Define the watermark strategy
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                // Watermark generation: data is out of order, wait 3s
                .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // Timestamp assigner: extract the event time from the record (seconds to milliseconds)
                .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                    @Override
                    public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                        return element.getTs() * 1000L;
                    }
                });

        SingleOutputStreamOperator<WaterSensor> sensorsDSWithWatermark = sensorsDS.assignTimestampsAndWatermarks(watermarkStrategy);

        SingleOutputStreamOperator<String> watermark = sensorsDSWithWatermark.keyBy(r -> r.getId())
                // Window with event-time semantics
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Keep the window open for 2 more seconds before closing it
                .allowedLateness(Time.seconds(2))
                .process(
                        new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                            @Override
                            public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                                long startTs = context.window().getStart();
                                long endTs = context.window().getEnd();
                                String windowStart = DateFormatUtils.format(startTs, "yyyy-MM-dd HH:mm:ss.SSS");
                                String windowEnd = DateFormatUtils.format(endTs, "yyyy-MM-dd HH:mm:ss.SSS");
                                long count = elements.spliterator().estimateSize();
                                out.collect("key=" + s + ", window [" + windowStart + "," + windowEnd + ") contains " + count + " records ===> " + elements.toString());
                            }
                        }
                );

        watermark.print();
        env.execute();
    }
}



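With allowed lateness set, the window behaves like this:

With the 3s out-of-orderness delay, the 0-10s window first fires when the 13s record arrives, and a 5s record that arrives after that still enters the window and triggers another computation. Once the 15s record arrives, the watermark reaches the window end time + the allowed lateness (10s + 2s) and the window closes, so a later 5s record no longer enters the 0-10s window (bucket) and produces no output.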
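This timeline can be replayed with a small plain-Java model. It is only a sketch under stated assumptions, not Flink code: `AllowedLatenessSketch` and `replay` are hypothetical names, the model tracks a single key and the single window [0s, 10s), and it assumes the watermark advances immediately after each record (Flink emits it periodically). The constants mirror the 3s out-of-orderness and 2s allowed lateness used in the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one tumbling window [0 s, 10 s) with allowed lateness; not Flink code. */
final class AllowedLatenessSketch {
    static final long WINDOW_END_MS = 10_000L;
    static final long OUT_OF_ORDERNESS_MS = 3_000L; // mirrors forBoundedOutOfOrderness(3 s)
    static final long ALLOWED_LATENESS_MS = 2_000L; // mirrors allowedLateness(2 s)

    /** Replays event timestamps (in seconds) and logs what the window does. */
    static List<String> replay(long... eventSeconds) {
        List<String> log = new ArrayList<>();
        long maxTs = Long.MIN_VALUE + OUT_OF_ORDERNESS_MS + 1;
        int count = 0;         // records currently held by the window
        boolean fired = false; // true once the first result has been emitted
        for (long sec : eventSeconds) {
            long ts = sec * 1000L;
            // A record is judged against the watermark as it was before the record arrived.
            long watermarkBefore = maxTs - OUT_OF_ORDERNESS_MS - 1;
            boolean belongs = ts >= 0 && ts < WINDOW_END_MS;
            boolean closed = watermarkBefore >= WINDOW_END_MS - 1 + ALLOWED_LATENESS_MS;
            if (belongs && closed) {
                log.add(sec + "s: dropped");
            } else if (belongs) {
                count++;
                if (fired) {
                    log.add(sec + "s: late firing, count=" + count);
                }
            }
            // Only now does the record advance the watermark.
            maxTs = Math.max(maxTs, ts);
            long watermark = maxTs - OUT_OF_ORDERNESS_MS - 1;
            if (!fired && count > 0 && watermark >= WINDOW_END_MS - 1) {
                fired = true;
                log.add(sec + "s: first firing, count=" + count);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // 1s fills the window, 13s fires it, 5s re-fires it, 15s closes it, 5s is dropped.
        for (String line : replay(1, 13, 5, 15, 5)) {
            System.out.println(line);
        }
    }
}
```

In this model the second 5s record is dropped only because it arrives after the 15s record has pushed the watermark to the window end plus the allowed lateness.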
2. Receiving late data with a side output
Late data that arrives after the window has closed can be sent to a side output stream.
package window;

import flink_Partition.WaterSensorMapFunction;
import flink_transfrom.WaterSensor;
import org.apache.commons.lang.time.DateFormatUtils;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

public class WatermarkAllowLateness {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<WaterSensor> sensorsDS = env
                .socketTextStream("hadoop102", 7777)
                .map(new WaterSensorMapFunction());

        // Define the watermark strategy
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                // Watermark generation: data is out of order, wait 3s
                .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // Timestamp assigner: extract the event time from the record (seconds to milliseconds)
                .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                    @Override
                    public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                        return element.getTs() * 1000L;
                    }
                });

        SingleOutputStreamOperator<WaterSensor> sensorsDSWithWatermark = sensorsDS.assignTimestampsAndWatermarks(watermarkStrategy);

        // Tag that identifies the side output for late records
        OutputTag<WaterSensor> lateTag = new OutputTag<>("late-data", Types.POJO(WaterSensor.class));

        SingleOutputStreamOperator<String> watermark = sensorsDSWithWatermark.keyBy(r -> r.getId())
                // Window with event-time semantics
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Keep the window open for 2 more seconds before closing it
                .allowedLateness(Time.seconds(2))
                // Late records that arrive after the window has closed go to the side output
                .sideOutputLateData(lateTag)
                .process(
                        new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                            @Override
                            public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                                long startTs = context.window().getStart();
                                long endTs = context.window().getEnd();
                                String windowStart = DateFormatUtils.format(startTs, "yyyy-MM-dd HH:mm:ss.SSS");
                                String windowEnd = DateFormatUtils.format(endTs, "yyyy-MM-dd HH:mm:ss.SSS");
                                long count = elements.spliterator().estimateSize();
                                out.collect("key=" + s + ", window [" + windowStart + "," + windowEnd + ") contains " + count + " records ===> " + elements.toString());
                            }
                        }
                );

        // Get the side output from the main stream and print it
        watermark.getSideOutput(lateTag).printToErr("late data after window close");
        // Print the main stream
        watermark.print();
        env.execute();
    }
}
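The routing decision behind the side output can be modeled in a few lines of plain Java. This is a sketch, not Flink code: `SideOutputSketch` and `route` are hypothetical names, the two lists stand in for window state and for the stream behind `lateTag`, and every record passed in is assumed to belong to the one window being modeled. The assumed rule is that a record goes to the side output when the last timestamp of its window plus the allowed lateness is at or below the current watermark.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of late-record routing for a single window; not Flink code. */
final class SideOutputSketch {
    final List<Long> windowContents = new ArrayList<>(); // stands in for window state
    final List<Long> sideOutput = new ArrayList<>();     // stands in for the lateTag stream

    private final long windowEndMs;
    private final long allowedLatenessMs;

    SideOutputSketch(long windowEndMs, long allowedLatenessMs) {
        this.windowEndMs = windowEndMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    /** Routes one record, given the watermark at the moment the record arrives. */
    void route(long timestampMs, long watermarkMs) {
        // Last timestamp of the window is end - 1 ms; the window expires allowedLateness later.
        boolean windowExpired = (windowEndMs - 1) + allowedLatenessMs <= watermarkMs;
        if (windowExpired) {
            sideOutput.add(timestampMs);     // too late: kept, but outside the window result
        } else {
            windowContents.add(timestampMs); // on time, or late within the allowed lateness
        }
    }

    public static void main(String[] args) {
        SideOutputSketch sketch = new SideOutputSketch(10_000L, 2_000L);
        sketch.route(5_000L, 9_999L);  // window has fired but is still open
        sketch.route(5_000L, 11_999L); // window has closed
        System.out.println("window=" + sketch.windowContents + " side=" + sketch.sideOutput);
    }
}
```

Unlike the first example, nothing is silently discarded here: records that miss the window remain available downstream for separate handling.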





