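For the base environment used in this article, see "Flink 1.10.1 Java WordCount demo (nc + socket)".
1. Basic window concepts
A window is a way of splitting an unbounded data stream into bounded chunks.
By what drives them, windows are divided into time windows and count windows (driven by the number of events).
By how elements are assigned to them, windows are divided into tumbling windows, sliding windows, and session windows.
Window intervals are left-closed and right-open. For example, a window that covers the first second of data spans [0, 1000) milliseconds: it includes 0 ms and 999 ms but not 1000 ms.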
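The left-closed, right-open rule can be checked with a little arithmetic. The following is a minimal sketch in plain Java, not Flink code: the class `TumblingWindowSketch` and its methods are hypothetical names introduced for illustration, and it assumes windows are aligned to timestamp 0.

```java
// Hypothetical helper, not part of Flink's API: computes the tumbling window
// that a timestamp falls into, using a left-closed, right-open interval.
final class TumblingWindowSketch {

    // Inclusive start of the window containing timestampMs (windows aligned to 0).
    static long windowStart(long timestampMs, long sizeMs) {
        // floorMod keeps the result correct for negative timestamps as well
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Exclusive end of the window containing timestampMs.
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        long size = 1000L; // 1-second window
        for (long ts : new long[] {0L, 999L, 1000L}) {
            System.out.println(ts + " ms -> ["
                    + windowStart(ts, size) + ", " + windowEnd(ts, size) + ")");
        }
    }
}
```

With a 1-second window, 0 ms and 999 ms both map to [0, 1000), while 1000 ms already belongs to the next window, [1000, 2000).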
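2. Window API
window() is only available on a KeyedStream, so it must be called after keyBy(). A stream that has not been keyed has to use windowAll() instead.
Time window: timeWindow()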
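DataStream<Integer> resultStream = dataStream.keyBy("id")
        // shorthand for a sliding window: size 15 s, slide 10 s
        .timeWindow(Time.seconds(15), Time.seconds(10))
        // shorthand for a tumbling window: size 15 s
        // .timeWindow(Time.seconds(15))
        // an aggregation such as sum(), reduce() or apply() must follow to get a DataStream back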
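With a 15-second size and a 10-second slide, consecutive windows overlap by 5 seconds, so an element can belong to one or two windows. The sketch below illustrates that assignment in plain Java. It is not Flink code: `SlidingWindowSketch` and `windowStarts` are hypothetical names, and windows are assumed to be aligned to timestamp 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink's API: lists the sliding windows
// [start, start + sizeMs) that contain a given timestamp.
final class SlidingWindowSketch {

    // Returns the start of every window containing timestampMs, newest first.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Start of the most recent window that began at or before the timestamp
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk back one slide at a time while the window still covers the timestamp
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long size = 15_000L;  // 15-second window
        long slide = 10_000L; // a new window starts every 10 seconds
        System.out.println("7000 ms  -> window starts " + windowStarts(7_000L, size, slide));
        System.out.println("12000 ms -> window starts " + windowStarts(12_000L, size, slide));
    }
}
```

An element at 7000 ms falls only into [0, 15000), while an element at 12000 ms falls into both [10000, 25000) and [0, 15000).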
Count window: countWindow()
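The article gives no example for count windows, so here is a rough sketch of the idea in plain Java. It is not Flink code, and `CountWindowSketch` is a hypothetical class: it imitates what a tumbling count window of size 2 on a keyed stream (in Flink, `.keyBy(...).countWindow(2)`) does conceptually, firing for a key once that key has received 2 elements.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of a tumbling count window per key (not Flink code).
final class CountWindowSketch {
    private final int size;                                     // elements per window
    private final Map<String, Integer> sums = new HashMap<>();   // running sum per key
    private final Map<String, Integer> counts = new HashMap<>(); // elements seen per key

    CountWindowSketch(int size) {
        this.size = size;
    }

    // Adds one element; returns the window sum when the key's window is full, else null.
    Integer add(String key, int value) {
        int sum = sums.merge(key, value, Integer::sum);
        int count = counts.merge(key, 1, Integer::sum);
        if (count < size) {
            return null; // window for this key is not full yet
        }
        // Window is full: clear the key's state so the next window starts empty
        sums.remove(key);
        counts.remove(key);
        return sum;
    }

    public static void main(String[] args) {
        CountWindowSketch window = new CountWindowSketch(2);
        for (String word : new String[] {"hello", "hello", "world", "hello", "hello"}) {
            Integer result = window.add(word, 1);
            if (result != null) {
                System.out.println(word + " -> " + result);
            }
        }
    }
}
```

In this run "hello" fires twice, each time with a sum of 2, while "world" never fires because it only receives one element.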
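3. Window functions
Incremental aggregation functions: ReduceFunction and AggregateFunction. Each element is folded into a running aggregate as it arrives, so only that aggregate is kept as state; nothing is emitted until the window fires.
Full-window functions: ProcessWindowFunction (and WindowFunction, which the example in section 4 uses). All elements of the window are buffered; when the window closes, the function computes over the whole window and emits the result.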
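The difference between the two styles is mostly about how much state a window holds. The sketch below contrasts them in plain Java. It is not Flink code, and the names `AggregationStyleSketch`, `IncrementalSum` and `BufferedSum` are hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two aggregation styles (not Flink code).
final class AggregationStyleSketch {

    // Incremental style: fold each element into a single accumulator on arrival.
    static final class IncrementalSum {
        private int acc = 0;

        void add(int value) { acc += value; }

        int stateSize() { return 1; } // only the accumulator is stored

        int result() { return acc; }  // read out when the window fires
    }

    // Full-window style: buffer every element and compute when the window fires.
    static final class BufferedSum {
        private final List<Integer> buffer = new ArrayList<>();

        void add(int value) { buffer.add(value); }

        int stateSize() { return buffer.size(); } // grows with the window

        int result() {
            int sum = 0;
            for (int v : buffer) {
                sum += v;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        IncrementalSum incremental = new IncrementalSum();
        BufferedSum buffered = new BufferedSum();
        for (int i = 0; i < 3; i++) {
            incremental.add(1);
            buffered.add(1);
        }
        System.out.println("incremental: result=" + incremental.result()
                + ", state=" + incremental.stateSize());
        System.out.println("buffered:    result=" + buffered.result()
                + ", state=" + buffered.stateSize());
    }
}
```

Both styles produce the same sum. The buffered style costs more memory, but it has access to every element of the window at once, which is what a full-window function needs.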
Minimum
DataStream<Tuple2<String, Integer>> windowCounts = text
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
                for (String word : value.split("\\s")) {
                    out.collect(Tuple2.of(word, 1));
                }
            }
        })
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .minBy(1);
Sum
DataStream<Tuple2<String, Integer>> windowCounts = text
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
                for (String word : value.split("\\s")) {
                    out.collect(Tuple2.of(word, 1));
                }
            }
        })
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .sum(1);
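.trigger(): the trigger defines when a window fires, that is, when the window function is evaluated and its result is emitted.
4. Window computation example code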
package com.demo.window;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.Iterator;

public class FlinkWindowDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        // The data source is a Linux host that sends lines of text over a socket
        DataStreamSource<String> socketStream = env.socketTextStream("192.168.0.181", 9000, "\n");

        // Split each line into words
        DataStream<String> flatMap = socketStream.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                String[] strings = value.split(" ");
                for (String s : strings) {
                    out.collect(s);
                }
            }
        });

        // Turn each word into a (word, 1) pair
        DataStream<Tuple2<String, Integer>> map = flatMap.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                return Tuple2.of(value, 1);
            }
        });

        // DataStream<Tuple2<String, Integer>> sum = map
        //         .keyBy("f0")
        //         .timeWindow(Time.seconds(10))
        //         .sum(1);
        //
        // sum.print();

        // Full-window function: counts each word in the window and also outputs the window end time
        DataStream<Tuple3<String, Integer, Long>> windowsum = map
                .keyBy("f0")
                .timeWindow(Time.seconds(60))
                .apply(new WindowFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Long>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow timeWindow, Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple3<String, Integer, Long>> collector) throws Exception {
                        Iterator<Tuple2<String, Integer>> iterator = iterable.iterator();
                        Integer sum = 0;
                        // End timestamp of the window, in milliseconds
                        Long windowEnd = timeWindow.getEnd();
                        while (iterator.hasNext()) {
                            Tuple2<String, Integer> next = iterator.next();
                            sum = sum + next.f1;
                        }
                        // Emit (word, count, window end time)
                        collector.collect(new Tuple3<>(tuple.getField(0), sum, windowEnd));
                    }
                });

        windowsum.print();
        env.execute();
    }
}
5. Running the test
Type the following test data into nc:
[root@bogon ~]# nc -l 9000
hello world
hello flink
hello window
timewindow
The console then shows output similar to the following.
3> (hello,3,1644545760000)
7> (flink,1,1644545760000)
1> (timewindow,1,1644545760000)
5> (world,1,1644545760000)
3> (window,1,1644545760000)
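Each output record carries the word and its count together with the window end time, in milliseconds.
The computation runs and the result is emitted only after the window has closed. A window is created only when an element arrives for it, so a period with no input opens no window and produces no output.
For this test, a 60-second tumbling window is used.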