The Flink Window Mechanism
Windows are at the heart of processing unbounded streams: a window splits an infinite stream into finite-size "buckets" on which we can apply computations. This article focuses on how windowing works in Flink and how programmers can get the most out of what windows offer.
The general structure of a windowed Flink program is shown below; the first snippet operates on a keyed stream, the second on a non-keyed stream. As you can see, the only difference is that the keyed stream calls keyBy(...) followed by window(...), while the non-keyed stream replaces window(...) with windowAll(...). This distinction runs through the rest of this page.
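Before diving into the Flink API, the core idea of "splitting an infinite stream into finite buckets and computing per bucket" can be illustrated in plain Java, with no Flink dependency. This is a conceptual sketch with hypothetical names, not Flink code:

```java
import java.util.ArrayList;
import java.util.List;

public class BucketSketch {
    // Split an (in principle unbounded) sequence into fixed-size buckets
    // and compute one aggregate (here: a sum) per full bucket.
    public static List<Integer> sumPerBucket(List<Integer> stream, int bucketSize) {
        List<Integer> results = new ArrayList<>();
        int sum = 0, count = 0;
        for (int value : stream) {
            sum += value;
            count++;
            if (count == bucketSize) { // bucket is full: emit its result and reset
                results.add(sum);
                sum = 0;
                count = 0;
            }
        }
        return results;                // an incomplete trailing bucket is never emitted
    }

    public static void main(String[] args) {
        // 1+2+3 = 6 and 4+5+6 = 15; the trailing 7 never fills a bucket
        System.out.println(sumPerBucket(List.of(1, 2, 3, 4, 5, 6, 7), 3)); // [6, 15]
    }
}
```

In Flink the bucketing policy (count, time, session gap) is supplied by a window assigner, and the per-bucket computation by a window function, as the demos below show.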
Demo 1
Using countWindow to emit a sum every two incoming records:
The SensorReading POJO from the previous post is reused here.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

public class Window1 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // String filePath = "D:\\Project\\FlinkStu\\resources\\sensor.txt";
        // DataStreamSource<String> inputStream = env.readTextFile(filePath);
        Properties prop = new Properties();
        prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.146.222:9092");
        prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "sensor_group1");
        prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        DataStreamSource<String> inputStream = env.addSource(
                new FlinkKafkaConsumer011<String>("sensor", new SimpleStringSchema(), prop));
        SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(new MapFunction<String, SensorReading>() {
            @Override
            public SensorReading map(String s) throws Exception {
                String[] split = s.split(",");
                return new SensorReading(split[0], Long.parseLong(split[1]), Double.parseDouble(split[2]));
            }
        });
        SingleOutputStreamOperator<SensorReading> resultMaxStream = mapStream.keyBy("id")
                // Alternative window assigners:
                // .timeWindow(Time.seconds(15))
                // .timeWindow(Time.seconds(15), Time.seconds(15))
                // .countWindow(6)
                // .window(EventTimeSessionWindows.withGap(Time.seconds(15)))
                // .window(TumblingEventTimeWindows.of(Time.seconds(5)))      // tumbling window on event time
                // .window(TumblingProcessingTimeWindows.of(Time.seconds(5))) // tumbling window on processing time
                // .timeWindow(Time.seconds(15)).max("temperature")
                .countWindow(6, 2)
                .reduce(new ReduceFunction<SensorReading>() {
                    @Override
                    public SensorReading reduce(SensorReading sensorReading, SensorReading t1) throws Exception {
                        // keep the id and timestamp of the first record, sum the temperatures
                        return new SensorReading(sensorReading.getId(),
                                sensorReading.getTimestamp(),
                                sensorReading.getTemperature() + t1.getTemperature());
                    }
                });
        resultMaxStream.print("max");
        env.execute("flinkwindow");
    }
}
A screenshot of the run output is shown below:
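The countWindow(6, 2) used above fires every 2 records and aggregates over (at most) the 6 most recent records per key. That semantics can be checked in plain Java, without a Flink cluster; the helper below is a hypothetical sketch, not Flink API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CountWindowSketch {
    // Emulate countWindow(size, slide): every `slide` records, emit the sum
    // of (at most) the `size` most recent records.
    public static List<Double> slidingCountSums(List<Double> temps, int size, int slide) {
        List<Double> fired = new ArrayList<>();
        Deque<Double> window = new ArrayDeque<>();
        int sinceLastFire = 0;
        for (double t : temps) {
            window.addLast(t);
            if (window.size() > size) {
                window.removeFirst();          // evict the oldest record
            }
            if (++sinceLastFire == slide) {    // the trigger fires every `slide` records
                fired.add(window.stream().mapToDouble(Double::doubleValue).sum());
                sinceLastFire = 0;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // with size=6, slide=2: fires after the 2nd and 4th records
        System.out.println(slidingCountSums(List.of(1.0, 2.0, 3.0, 4.0), 6, 2)); // [3.0, 10.0]
    }
}
```

This also explains why the demo prints a result every second record rather than waiting for six.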
Demo 2
Computing an average over the window:
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

public class window2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties prop = new Properties();
        prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.146.222:9092");
        prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "group_id_2");
        prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        DataStreamSource<String> inputStream = env.addSource(
                new FlinkKafkaConsumer011<String>("sensor", new SimpleStringSchema(), prop));
        SingleOutputStreamOperator<SensorReading> mapStream = inputStream.map(new MapFunction<String, SensorReading>() {
            @Override
            public SensorReading map(String s) throws Exception {
                String[] split = s.split(",");
                return new SensorReading(split[0], Long.parseLong(split[1]), Double.parseDouble(split[2]));
            }
        });
        SingleOutputStreamOperator<Double> resultAvgStream = mapStream.keyBy("id")
                .countWindow(6, 2)
                .aggregate(new AvgFunction());
        resultAvgStream.print("avg");
        env.execute("avgwindow");
    }

    private static class AvgFunction implements AggregateFunction<SensorReading, Tuple2<Double, Integer>, Double> {
        // initialize the accumulator
        @Override
        public Tuple2<Double, Integer> createAccumulator() {
            return new Tuple2<>(0.0, 0);
        }

        // accumulate the running sum and count of incoming records
        @Override
        public Tuple2<Double, Integer> add(SensorReading sensorReading, Tuple2<Double, Integer> acc) {
            double temp = sensorReading.getTemperature() + acc.f0;
            int count = acc.f1 + 1;
            return new Tuple2<>(temp, count);
        }

        // compute the average from the accumulated sum and count
        @Override
        public Double getResult(Tuple2<Double, Integer> acc) {
            return acc.f0 / acc.f1;
        }

        // merge two partial accumulators
        @Override
        public Tuple2<Double, Integer> merge(Tuple2<Double, Integer> acc, Tuple2<Double, Integer> acc1) {
            return new Tuple2<>(acc.f0 + acc1.f0, acc.f1 + acc1.f1);
        }
    }
}
A sample run is shown below:
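The accumulator logic in AvgFunction can be exercised without a Flink cluster. The following plain-Java sketch (hypothetical names; a double[] of {sum, count} stands in for the Tuple2 accumulator) mirrors its four methods:

```java
public class AvgSketch {
    // Mirror of AvgFunction's accumulator: a {sum, count} pair that can be
    // added to and merged, with the average extracted only at the end.
    static double[] createAccumulator() { return new double[]{0.0, 0}; }

    static double[] add(double temperature, double[] acc) {
        return new double[]{acc[0] + temperature, acc[1] + 1};
    }

    static double[] merge(double[] a, double[] b) {
        return new double[]{a[0] + b[0], a[1] + b[1]};
    }

    static double getResult(double[] acc) { return acc[0] / acc[1]; }

    public static void main(String[] args) {
        double[] acc = createAccumulator();
        acc = add(35.0, acc);
        acc = add(37.0, acc);
        System.out.println(getResult(acc)); // 36.0
    }
}
```

Keeping the sum and count separate (instead of a running average) is what makes the accumulator incrementally updatable and mergeable.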
Demo 3
This demo uses event time with watermarks, allowed lateness, and a side output for late data. The code:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.util.OutputTag;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

public class window3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set the time characteristic to event time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        Properties prop = new Properties();
        prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.146.222:9092");
        prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "sensor_group1");
        prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        DataStreamSource<String> inputStream = env.addSource(
                new FlinkKafkaConsumer011<String>("sensor", new SimpleStringSchema(), prop));
        SingleOutputStreamOperator<SensorReading> mapStream = inputStream
                .map(new MapFunction<String, SensorReading>() {
                    @Override
                    public SensorReading map(String s) throws Exception {
                        String[] split = s.split(",");
                        return new SensorReading(split[0], Long.parseLong(split[1]), Double.parseDouble(split[2]));
                    }
                })
                // extract event-time timestamps and emit watermarks for out-of-order events
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<SensorReading>(Time.seconds(0)) {
                            @Override
                            public long extractTimestamp(SensorReading sensorReading) {
                                return sensorReading.getTimestamp() * 1000L; // seconds -> milliseconds
                            }
                        });
        OutputTag<SensorReading> outputTag = new OutputTag<SensorReading>("late") {};
        SingleOutputStreamOperator<SensorReading> maxResultStream = mapStream.keyBy("id")
                // .timeWindow(Time.seconds(15)) //, Time.seconds(2) ,Time.seconds(5)
                // 15 s tumbling event-time windows with a 1 s offset: [1, 16), [16, 31), ...
                .window(TumblingEventTimeWindows.of(Time.seconds(15), Time.seconds(1)))
                .allowedLateness(Time.seconds(30))
                .sideOutputLateData(outputTag)
                .max("temperature");
        maxResultStream.print("max");
        DataStream<SensorReading> sideOutput = maxResultStream.getSideOutput(outputTag);
        sideOutput.print("late");
        env.execute("flinkwindow");
    }
}
The result of a run is shown below:
The window spans 15 seconds of event time with 30 seconds of allowed lateness, so as long as event time has not reached 45 seconds, the maximum of the 1-15 second window can still be updated. For example:
At second 1 a reading arrives: 35.5
At second 10: 36.6 (the window maximum is now 36.6)
At second 20: 37.7 (this reading belongs to the second 15-second window)
Then a reading for second 8 arrives: 39.9
The first window's maximum is updated to 39.9 accordingly. Until event time reaches 45 seconds (exclusive), readings in the 1-15 second range keep updating that window's maximum. Once a reading with event time of 45 seconds or more arrives, the first window closes; any further readings for the 1-15 second range are collected by the side output as late data, ready for further processing (so they are not lost).
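The timeline above boils down to one rule: a window's state is kept until the watermark passes window-end + allowedLateness. The following plain-Java sketch classifies where a record goes; it is a simplified model for illustration, not Flink's actual internals:

```java
public class LatenessSketch {
    // Classify a record targeting an event-time window [start, end),
    // given the current watermark and the allowed lateness (all in ms).
    public static String classify(long windowEnd, long allowedLateness, long watermark) {
        if (watermark < windowEnd) {
            return "on-time";                 // window has not fired yet
        } else if (watermark < windowEnd + allowedLateness) {
            return "late-but-updates-window"; // window state is still kept; result is re-emitted
        } else {
            return "side-output";             // window closed; record goes to the side output
        }
    }

    public static void main(String[] args) {
        long end = 15_000L, lateness = 30_000L; // the demo's 15 s window, 30 s lateness
        System.out.println(classify(end, lateness, 10_000L)); // on-time
        System.out.println(classify(end, lateness, 20_000L)); // late-but-updates-window
        System.out.println(classify(end, lateness, 46_000L)); // side-output
    }
}
```

With the demo's zero out-of-orderness bound, the watermark simply tracks the largest event time seen so far, which is why a reading at second 45 is what finally closes the first window.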