1. Watermark mechanism
- Watermark: generated by an operator and flows through the whole job along with the event stream
  Watermark = max event time seen in the window so far - allowed out-of-order delay
- When a window fires, either formulation is equivalent:
  1. Watermark >= window end time
  2. (max event time seen so far - allowed out-of-order delay) >= window end time
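The firing rule above can be sketched in a few lines of plain Java. This is a simulation of the rule only, not the Flink API; `WatermarkRule`, `watermark`, and `shouldFire` are made-up names for illustration:

```java
// Simulation of the watermark firing rule (not the Flink API).
public class WatermarkRule {
    // Allowed out-of-order delay, in ms (3 s, as in the example below)
    static final long MAX_OUT_OF_ORDERNESS = 3_000L;

    // watermark = max event time seen so far - allowed delay
    static long watermark(long maxEventTime) {
        return maxEventTime - MAX_OUT_OF_ORDERNESS;
    }

    // A window fires once the watermark reaches its end time
    static boolean shouldFire(long maxEventTime, long windowEnd) {
        return watermark(maxEventTime) >= windowEnd;
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L;                           // window [0 s, 10 s)
        System.out.println(shouldFire(7_000L, windowEnd));  // false: watermark is only 4 s
        System.out.println(shouldFire(13_000L, windowEnd)); // true: watermark 10 s >= 10 s
    }
}
```

Note that the window fires because of a *later* event pushing the watermark past the window end, even though that event itself belongs to the next window.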
(1) Requirement
Group by video and sum the transaction amount per video; the data may arrive out of order, with up to 3 seconds of delay allowed.
(2) Hands-on code: watermark + window
package com.lihaiwei.text1.app;

import com.lihaiwei.text1.util.TimeUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class flink08watermark {
    public static void main(String[] args) throws Exception {
        // 1. Build the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 2. Read data from a socket
        DataStream<String> DS = env.socketTextStream("192.168.6.104", 8088);
        // 3. Convert input lines like "java,2022-11-11 09:10:10,12" into Tuple3
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatmapDS = DS.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                // 3.1 Split the line into fields
                String[] arr = value.split(",");
                // 3.2 Emit (name, eventTimeString, amount) -- the time string is arr[1]
                out.collect(Tuple3.of(arr[0], arr[1], Integer.parseInt(arr[2])));
            }
        });
        // 4. Pick the event-time column and generate watermarks
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarkDS = flatmapDS.assignTimestampsAndWatermarks(WatermarkStrategy
                // 4.1 Allow up to 3 seconds of out-of-order delay
                .<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // 4.2 Extract the event-time timestamp from field f1
                .withTimestampAssigner((event, timestamp) -> TimeUtil.strToDate(event.f1).getTime()));
        // 5. Key by name, open a window, and aggregate
        SingleOutputStreamOperator<String> sumDS = watermarkDS.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple3<String, String, Integer> value) throws Exception {
                        return value.f0;
                    }
                })
                // 10-second tumbling event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Full-window aggregation
                .apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
                        // Collect the event times seen in this window
                        List<String> timeList = new ArrayList<>();
                        int total = 0;
                        // Iterate over the window contents
                        for (Tuple3<String, String, Integer> order : input) {
                            timeList.add(order.f1);
                            total = total + order.f2;
                        }
                        String outStr = String.format("key: %s, total: %s, window: [%s~%s), event times in window: %s",
                                key, total, TimeUtil.format(window.getStart()), TimeUtil.format(window.getEnd()), timeList);
                        out.collect(outStr);
                    }
                });
        // 6. Print the result
        sumDS.print();
        // 7. Name the job and submit it
        env.execute("watermark job");
    }
}
(3) Debugging walkthrough
- Start a socket server (same port as the program)
nc -lk 8088
- Window one
// window one [12:00,12:10)
java,2022-11-11 23:12:07,10 // NO trigger
java,2022-11-11 23:12:11,10 // NO trigger; belongs to the next window
java,2022-11-11 23:12:08,10 // NO trigger
mysql,2022-11-11 23:12:13,10 // YES: belongs to the next window, but pushes the watermark to 23:12:10 and fires window one
- Window two
// window two [12:10,12:20)
java,2022-11-11 23:12:13,10 // NO trigger
java,2022-11-11 23:12:17,10 // NO trigger
java,2022-11-11 23:12:09,10 // NO trigger; later than the watermark and window one is closed, so the event is dropped
java,2022-11-11 23:12:20,10 // NO trigger; belongs to the third window
java,2022-11-11 23:12:22,10 // NO trigger; belongs to the third window
java,2022-11-11 23:12:23,10 // YES: belongs to the third window, but pushes the watermark to 23:12:20 and fires window two
- Run result
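The window-one walkthrough above can be replayed with a small plain-Java simulation (not Flink; event times are reduced to seconds within the minute, the window is [0 s, 10 s), 3 s of delay is allowed, and `WindowOneTrace` is a made-up name):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowOneTrace {
    // Replays events through the watermark rule and records, per event,
    // whether the window [0, windowEnd) would fire at that point.
    static List<Boolean> trace(long[] eventSeconds, long delay, long windowEnd) {
        long maxEventTime = Long.MIN_VALUE;
        List<Boolean> fired = new ArrayList<>();
        for (long t : eventSeconds) {
            maxEventTime = Math.max(maxEventTime, t);
            long watermark = maxEventTime - delay; // watermark = max event time - delay
            fired.add(watermark >= windowEnd);
        }
        return fired;
    }

    public static void main(String[] args) {
        // 23:12:07, 23:12:11, 23:12:08, 23:12:13 -> only the last event fires window one
        System.out.println(trace(new long[]{7, 11, 8, 13}, 3, 10));
        // -> [false, false, false, true]
    }
}
```

The watermark never retreats: the late 23:12:08 event leaves it at 23:12:08, and only the 23:12:13 event pushes it to the window end.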
2. allowedLateness mechanism - second safety net
- Use case: an event arrives after the watermark wait has already passed; allowedLateness extends the window's lifetime, so the late event re-fires the window and updates its earlier output.
  DataStream.window(<window size>)
            .allowedLateness(<extra lateness>)
            .apply(<full-window function>)
(1) Hands-on code
package com.lihaiwei.text1.app;

import com.lihaiwei.text1.util.TimeUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class flink08watermark {
    public static void main(String[] args) throws Exception {
        // 1. Build the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 2. Read data from a socket
        DataStream<String> DS = env.socketTextStream("192.168.6.104", 9999);
        // 3. Convert input lines like "java,2022-11-11 09:10:10,12" into Tuple3
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatmapDS = DS.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                // 3.1 Split the line into fields
                String[] arr = value.split(",");
                // 3.2 Emit (name, eventTimeString, amount)
                out.collect(Tuple3.of(arr[0], arr[1], Integer.parseInt(arr[2])));
            }
        });
        // 4. Pick the event-time column and generate watermarks
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarkDS = flatmapDS.assignTimestampsAndWatermarks(WatermarkStrategy
                // 4.1 Allow up to 3 seconds of out-of-order delay
                .<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // 4.2 Extract the event-time timestamp from field f1
                .withTimestampAssigner((event, timestamp) -> TimeUtil.strToDate(event.f1).getTime()));
        // 5. Key by name, open a window, and aggregate
        SingleOutputStreamOperator<String> sumDS = watermarkDS.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple3<String, String, Integer> value) throws Exception {
                        return value.f0;
                    }
                })
                // 10-second tumbling event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Allow one extra minute of lateness after the window first fires
                .allowedLateness(Time.minutes(1))
                // Full-window aggregation
                .apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
                        // Collect the event times seen in this window
                        List<String> timeList = new ArrayList<>();
                        int total = 0;
                        // Iterate over the window contents
                        for (Tuple3<String, String, Integer> order : input) {
                            timeList.add(order.f1);
                            total = total + order.f2;
                        }
                        String outStr = String.format("key: %s, total: %s, window: [%s~%s), event times in window: %s",
                                key, total, TimeUtil.format(window.getStart()), TimeUtil.format(window.getEnd()), timeList);
                        out.collect(outStr);
                    }
                });
        // 6. Print the result
        sumDS.print();
        // 7. Name the job and submit it
        env.execute("watermark job");
    }
}
(2) Debug results
- Window one
// window one [12:00,12:10)
java,2022-11-11 23:12:07,10 // NO trigger
java,2022-11-11 23:12:11,10 // NO trigger; belongs to the next window
java,2022-11-11 23:12:08,10 // NO trigger
mysql,2022-11-11 23:12:13,10 // YES: belongs to the next window, but pushes the watermark to 23:12:10 and fires window one
- Window two
// window two [12:10,12:20)
java,2022-11-11 23:12:13,10 // NO trigger
java,2022-11-11 23:12:17,10 // NO trigger
java,2022-11-11 23:12:09,10 // YES: late but within allowedLateness, so window one re-fires with an updated result
java,2022-11-11 23:12:20,10 // NO trigger; belongs to the third window
java,2022-11-11 23:12:22,10 // NO trigger; belongs to the third window
java,2022-11-11 23:12:23,10 // YES: belongs to the third window, but pushes the watermark to 23:12:20 and fires window two
- Run result
3. sideOutput mechanism - last safety net
- Use case: an event arrives even after watermark + allowedLateness, when the window has fully closed; capture it with a side output stream (SideOutput)
// 1. create an OutputTag for the late data
DataStream.window(<window size>)
          .allowedLateness(<extra lateness>)
          .sideOutputLateData(<OutputTag>)
          .apply(<full-window function>)
(1) Hands-on code
package com.lihaiwei.text1.app;

import com.lihaiwei.text1.util.TimeUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class flink08watermark {
    public static void main(String[] args) throws Exception {
        // 1. Build the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 2. Read data from a socket
        DataStream<String> DS = env.socketTextStream("192.168.6.104", 8888);
        // 3. Convert input lines like "java,2022-11-11 09:10:10,12" into Tuple3
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatmapDS = DS.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                // 3.1 Split the line into fields
                String[] arr = value.split(",");
                // 3.2 Emit (name, eventTimeString, amount)
                out.collect(Tuple3.of(arr[0], arr[1], Integer.parseInt(arr[2])));
            }
        });
        // 4.1 Pick the event-time column and generate watermarks
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarkDS = flatmapDS.assignTimestampsAndWatermarks(WatermarkStrategy
                // Allow up to 3 seconds of out-of-order delay
                .<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // Extract the event-time timestamp from field f1
                .withTimestampAssigner((event, timestamp) -> TimeUtil.strToDate(event.f1).getTime()));
        // 4.2 Create the OutputTag for late data (anonymous subclass keeps the generic type)
        OutputTag<Tuple3<String, String, Integer>> latedata = new OutputTag<Tuple3<String, String, Integer>>("latedata") {};
        // 5. Key by name, open a window, and aggregate
        SingleOutputStreamOperator<String> sumDS = watermarkDS.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple3<String, String, Integer> value) throws Exception {
                        return value.f0;
                    }
                })
                // 10-second tumbling event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Allow one extra minute of lateness after the window first fires
                .allowedLateness(Time.minutes(1))
                // Route anything later than that to the side output
                .sideOutputLateData(latedata)
                // Full-window aggregation
                .apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
                        // Collect the event times seen in this window
                        List<String> timeList = new ArrayList<>();
                        int total = 0;
                        // Iterate over the window contents
                        for (Tuple3<String, String, Integer> order : input) {
                            timeList.add(order.f1);
                            total = total + order.f2;
                        }
                        String outStr = String.format("key: %s, total: %s, window: [%s~%s), event times in window: %s",
                                key, total, TimeUtil.format(window.getStart()), TimeUtil.format(window.getEnd()), timeList);
                        out.collect(outStr);
                    }
                });
        // 6.1 Print the main result
        sumDS.print();
        // 6.2 Print the side output stream
        sumDS.getSideOutput(latedata).print();
        // 7. Name the job and submit it
        env.execute("watermark job");
    }
}
(2) Debug results
- Window one
// window one [12:00,12:10)
java,2022-11-11 23:12:07,10 // NO trigger
java,2022-11-11 23:12:11,10 // NO trigger; belongs to the next window
java,2022-11-11 23:12:08,10 // NO trigger
mysql,2022-11-11 23:12:13,10 // YES: belongs to the next window, but pushes the watermark to 23:12:10 and fires window one
- Window two
// window two [12:10,12:20)
java,2022-11-11 23:12:17,10 // NO trigger
java,2022-11-11 23:12:09,10 // YES: late but within allowedLateness, so window one re-fires with an updated result
java,2022-11-11 23:12:20,10 // NO trigger; belongs to the third window
java,2022-11-11 23:12:22,10 // NO trigger; belongs to the third window
java,2022-11-11 23:12:23,10 // YES: belongs to the third window, but pushes the watermark to 23:12:20 and fires window two
- Window three
// window three [12:20,12:30)
java,2022-11-11 23:13:20,10 // YES: belongs to a later window, and its watermark closes window three
java,2022-11-11 23:12:01,10 // NO trigger; beyond allowedLateness, routed to the side output
- Run result
4. Summary of the layered guarantees
- Flink combines the watermark, allowedLateness(), and sideOutputLateData() mechanisms so that no data is silently lost
4.1 How each layer works
① watermark: absorbs out-of-order delay by waiting a short while before firing the window and emitting its result;
② allowedLateness: keeps the window around for an extra period after it fires; late events arriving in that period re-fire the window and patch the previously emitted result;
③ sideOutput: once allowedLateness is exceeded the window is fully closed, and remaining late events are routed to the side output.
4.2 Use cases
- watermark: emit results promptly despite small delays
- allowedLateness: patch results for moderately late data
- sideOutput: final fallback that preserves the data for correctness
4.3 The layers in Flink
(1) Layer 1 - DataStream
- take the relevant slice of data from the DataStream;
(2) Layer 2 - Watermark
- absorbs out-of-order delay: the window waits for delayed events before firing;
(3) Layer 3 - allowedLateness
- extends the window's close time; if more data arrives during that period, the window result is patched and re-emitted;
(4) Layer 4 - sideOutput
- after the window has fully closed, all remaining late events go to the side output stream, where they can be consumed separately, stored somewhere, and used to batch-update the earlier aggregates.
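The four layers amount to a per-event routing decision. A minimal plain-Java sketch of that decision (a simulation of the behaviour, not the Flink API; `LateDataRouter` and `route` are made-up names):

```java
// Simulation of how Flink routes an event for a given window (not the Flink API).
public class LateDataRouter {
    enum Route { WINDOW, LATE_UPDATE, SIDE_OUTPUT }

    // Decide what happens to an event of a window ending at windowEnd,
    // given the current watermark and the configured allowedLateness (ms).
    static Route route(long watermark, long windowEnd, long allowedLateness) {
        if (watermark < windowEnd) {
            return Route.WINDOW;        // window not fired yet: event joins it normally
        } else if (watermark < windowEnd + allowedLateness) {
            return Route.LATE_UPDATE;   // window fired but still open: re-fire and patch the output
        } else {
            return Route.SIDE_OUTPUT;   // window fully closed: event goes to the side output
        }
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L, lateness = 60_000L;
        System.out.println(route(9_000L, windowEnd, lateness));  // WINDOW
        System.out.println(route(20_000L, windowEnd, lateness)); // LATE_UPDATE
        System.out.println(route(80_000L, windowEnd, lateness)); // SIDE_OUTPUT
    }
}
```

Without `allowedLateness` the middle branch disappears and without `sideOutputLateData` the last branch silently drops the event, which is exactly the behaviour seen in the three debugging sessions above.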