Window Programming Model
Keyed Windows
stream
.keyBy(...) <- keyed versus non-keyed windows
.window(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
Non-Keyed Windows
stream
.windowAll(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
1. Window Lifecycle
**Creation:** a window is created as soon as the first element that belongs to it arrives;
**Removal:** a window is removed once the time passes its end timestamp plus the user-specified allowed lateness.
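The removal condition above can be sketched as a simple check. This is a minimal stand-alone illustration of the rule, not Flink's actual cleanup code; the class and method names are hypothetical:

```java
public class WindowExpiry {
    // A window becomes eligible for removal once the current time
    // (the watermark, for event time) passes the window end plus
    // the allowed lateness.
    public static boolean isExpired(long watermark, long windowEnd, long allowedLateness) {
        return watermark > windowEnd + allowedLateness;
    }

    public static void main(String[] args) {
        // window [0, 5000) with 2000 ms allowed lateness
        System.out.println(isExpired(6_999L, 5_000L, 2_000L)); // still open
        System.out.println(isExpired(7_001L, 5_000L, 2_000L)); // expired
    }
}
```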
2. Window Assigners
2.1 Tumbling Windows
A tumbling window assigner puts each element into exactly one window of a fixed size.
Example:
DataStream<T> input = ...;
// tumbling event-time windows
input
.keyBy(<key selector>)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.<windowed transformation>(<window function>);
// tumbling processing-time windows
input
.keyBy(<key selector>)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.<windowed transformation>(<window function>);
// daily tumbling event-time windows offset by -8 hours.
input
.keyBy(<key selector>)
.window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8))) // the offset can be used, e.g., to adjust for a timezone
.<windowed transformation>(<window function>);
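Tumbling window boundaries are aligned to the epoch plus the offset. Which window an element falls into follows from a small piece of modular arithmetic, sketched below; this is a stand-alone illustration with a hypothetical class name, not Flink's internal class:

```java
public class TumblingAssign {
    // Start of the tumbling window containing `timestamp`,
    // aligned to the epoch plus `offset`.
    public static long windowStart(long timestamp, long windowSize, long offset) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        // 5-second windows: 12_345 ms falls into [10_000, 15_000)
        System.out.println(windowStart(12_345L, 5_000L, 0L));
        // daily windows with a -8 h offset start at -8 h relative to midnight UTC
        System.out.println(windowStart(0L, 86_400_000L, -28_800_000L));
    }
}
```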
2.2 Sliding Windows
With sliding windows, one element may belong to multiple windows (whenever the slide is smaller than the window size).
Example:
DataStream<T> input = ...;
// sliding event-time windows
input
.keyBy(<key selector>)
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.<windowed transformation>(<window function>);
// sliding processing-time windows
input
.keyBy(<key selector>)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.<windowed transformation>(<window function>);
// sliding processing-time windows offset by -8 hours
input
.keyBy(<key selector>)
.window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
.<windowed transformation>(<window function>);
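Which windows an element belongs to follows from the same alignment arithmetic: find the latest window start at or before the timestamp, then step backwards by the slide while the window still contains the element. A minimal sketch, assuming zero offset; the class name is hypothetical, not Flink's assigner:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingAssign {
    // Start timestamps of all sliding windows that contain `timestamp`.
    public static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // latest window start <= timestamp, aligned to the slide
        long lastStart = timestamp - (timestamp + slide) % slide;
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // size 10 s, slide 5 s: 12_345 ms is in the windows starting at 10_000 and 5_000
        System.out.println(windowStarts(12_345L, 10_000L, 5_000L));
    }
}
```

With size 10 s and slide 5 s every element lands in two windows, matching the overlap described above.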
2.3 Session Windows
Session windows have no fixed size and do not overlap. A window closes when no data arrives within a given gap of time; data that arrives afterwards is assigned to a new window. Internally, the assigner creates a window for every incoming element, and then merges any two windows whose distance is smaller than the defined gap.
DataStream<T> input = ...;
// event-time session windows with static gap
input
.keyBy(<key selector>)
.window(EventTimeSessionWindows.withGap(Time.minutes(10)))
.<windowed transformation>(<window function>);
// event-time session windows with dynamic gap
input
.keyBy(<key selector>)
.window(EventTimeSessionWindows.withDynamicGap((element) -> {
// determine and return session gap
}))
.<windowed transformation>(<window function>);
// processing-time session windows with static gap
input
.keyBy(<key selector>)
.window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
.<windowed transformation>(<window function>);
// processing-time session windows with dynamic gap
input
.keyBy(<key selector>)
.window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
// determine and return session gap
}))
.<windowed transformation>(<window function>);
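The per-element-window-then-merge behavior described above amounts to interval merging: each element first gets its own window of length `gap`, and overlapping windows collapse into one session. A stand-alone sketch assuming a static gap and sorted input; the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class SessionMerge {
    // Each element t first gets its own window [t, t + gap);
    // overlapping windows are then merged into sessions.
    // Returns {start, end} pairs; `timestamps` must be sorted ascending.
    public static List<long[]> sessions(List<Long> timestamps, long gap) {
        List<long[]> merged = new ArrayList<>();
        for (long t : timestamps) {
            long end = t + gap;
            if (!merged.isEmpty() && t < merged.get(merged.size() - 1)[1]) {
                // overlaps the last session: extend it
                long[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], end);
            } else {
                // gap exceeded: start a new session
                merged.add(new long[]{t, end});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // gap 10: elements at 1, 5, 30 form sessions [1, 15) and [30, 40)
        for (long[] s : sessions(List.of(1L, 5L, 30L), 10L)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```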
2.4 Global Windows
A global window is only useful with a custom trigger; otherwise no computation is ever performed, because a global window has no natural end time at which to run the aggregation.
DataStream<T> input = ...;
input
.keyBy(<key selector>)
.window(GlobalWindows.create())
.<windowed transformation>(<window function>);
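A typical custom trigger for a global window fires on element count (Flink ships one as `CountTrigger.of(n)`). The core bookkeeping can be sketched without Flink as follows; this is a minimal illustration of the firing logic, not the actual `Trigger` API, and the class name is hypothetical:

```java
public class CountFire {
    private final long maxCount;
    private long count = 0;

    public CountFire(long maxCount) {
        this.maxCount = maxCount;
    }

    // Returns true ("FIRE") when the element count reaches the threshold,
    // then resets the counter, mirroring a count-based trigger.
    public boolean onElement() {
        count++;
        if (count >= maxCount) {
            count = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CountFire trigger = new CountFire(3);
        for (int i = 1; i <= 6; i++) {
            System.out.println(i + ": " + trigger.onElement());
        }
    }
}
```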
3. Window Functions
Window functions include ReduceFunction, AggregateFunction, and ProcessWindowFunction. The first two are more efficient, because they can aggregate incrementally as each element arrives. A ProcessWindowFunction instead receives an Iterable over all elements in the window, together with metadata about the window those elements belong to.
A ProcessWindowFunction is less efficient because Flink must internally buffer all elements of the window before invoking it. It can be combined with an AggregateFunction or a ReduceFunction to get incremental aggregation while still accessing the window metadata.
3.1 ReduceFunction
A ReduceFunction specifies how two input elements are combined into a single output element of the same type.
Example:
import bean.SensorReading
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object ReduceFunctionTest {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val source = env.socketTextStream("localhost", 8888)
val dataStream = source.map(line => {
val infos = line.split(",")
SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
})
val window = dataStream.keyBy(sensorReading => sensorReading.id)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
// .minBy("timestamp")
.reduce(new MyReduceFunction) // returns the minimum temperature together with the latest timestamp seen so far
window.print()
env.execute()
}
}
class MyReduceFunction extends ReduceFunction[SensorReading] {
override def reduce(curSensor: SensorReading, newSensor: SensorReading): SensorReading = {
SensorReading(curSensor.id, newSensor.timestamp, newSensor.temperature.min(curSensor.temperature))
}
}
3.2 AggregateFunction
An AggregateFunction is a generalization of ReduceFunction. Unlike a ReduceFunction, it involves three types (input, accumulator, and output) and defines four methods: create an accumulator (createAccumulator), update it (add), merge accumulators (merge), and produce the output (getResult).
Example: compute the average of the temperature field
import bean.SensorReading
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object AggregateFunctionTest {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val source = env.socketTextStream("localhost", 8888)
val dataStream = source.map(line => {
val infos = line.split(",")
SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
})
val window = dataStream.keyBy(sensorReading => sensorReading.id)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.aggregate(new MyAggregateFunction) // compute the average temperature
window.print()
env.execute()
}
}
class MyAggregateFunction extends AggregateFunction[SensorReading, (Double, Int), Double] {
override def createAccumulator(): (Double, Int) = (0.0, 0)
override def add(value: SensorReading, accumulator: (Double, Int)): (Double, Int) = {
(value.temperature + accumulator._1, accumulator._2 + 1)
}
override def getResult(accumulator: (Double, Int)): Double = {
accumulator._1 / accumulator._2
}
override def merge(a: (Double, Int), b: (Double, Int)): (Double, Int) = {
(a._1 + b._1, a._2 + b._2)
}
}
3.3 ProcessWindowFunction
3.3.1 Usage
A ProcessWindowFunction is a full window function: it gets an Iterable containing all elements of the window, plus a Context object with access to time and state information (such as the watermark, window metadata, and state). This makes it more flexible than the other window functions, at the cost of performance.
Typical use cases: 1. when the window's context object is needed;
2. sorting the elements of a window (which could also be done with aggregate)
import bean.SensorReading
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object ProcessFunctionTest {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val source = env.socketTextStream("localhost", 8888)
val dataStream = source.map(line => {
val infos = line.split(",")
SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
})
val window = dataStream.keyBy(sensorReading => sensorReading.id)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.process(new MyProcessFunction) // emit the window metadata and the median temperature of the current window
window.print()
env.execute()
}
}
class MyProcessFunction extends ProcessWindowFunction[SensorReading, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[SensorReading], out: Collector[String]): Unit = {
val list = elements.toList
val sortList = list.sortBy(sensor => sensor.temperature)
out.collect(s"window of ${context.window}: " + (key, sortList(sortList.size / 2).temperature)) // emit the window metadata and the median temperature of the current window
}
}
3.3.2 Combining with an incremental aggregation function
When we need the window metadata and also want incremental aggregation, the two can be used together.
3.3.2.1 Combining with reduce
import bean.SensorReading
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object ProcessFunctionWithReduce {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val source = env.socketTextStream("localhost", 8888)
val dataStream = source.map(line => {
val infos = line.split(",")
SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
})
val window = dataStream.keyBy(sensorReading => sensorReading.id)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce(new MyReduceFunction2, new MyProcessFunction2) // emit the window metadata and the minimum temperature of the current window
window.print()
env.execute()
}
}
class MyReduceFunction2 extends ReduceFunction[SensorReading] {
override def reduce(curSensor: SensorReading, newSensor: SensorReading): SensorReading = {
SensorReading(curSensor.id, newSensor.timestamp, newSensor.temperature.min(curSensor.temperature))
}
}
class MyProcessFunction2 extends ProcessWindowFunction[SensorReading, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[SensorReading], out: Collector[String]): Unit = {
val min = elements.iterator.next()
out.collect(s"${context.window}: $min")
}
}
3.3.2.2 Combining with aggregate
The same idea as with reduce: pass the ProcessWindowFunction as the second argument to aggregate.
4. Handling Late Data
4.1 Allowed lateness
You can control how long a window stays open after its end time by setting an allowed lateness.
4.2 Side outputs
Keeping windows open longer hurts the system's latency, so the allowed lateness cannot be set too large. On top of the allowed lateness, a side output can be added: it routes late data to a destination of our choice, so it can be analyzed later.
4.3 Code
// define a side output to catch late data
// (fragment: assumes waterMarkStream is a DataStream[(String, Long)] with
// timestamps/watermarks assigned, and the java.util / SimpleDateFormat /
// Flink windowing imports are in scope)
val outputTag = new OutputTag[(String, Long)]("late-data")
val window = waterMarkStream.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.allowedLateness(Time.seconds(2)) // the window waits at most 2 s for late data
.sideOutputLateData(outputTag)
.apply(new WindowFunction[(String, Long), String, Tuple, TimeWindow] {
override def apply(tuple: Tuple, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
val key = tuple.toString
val list: util.ArrayList[java.lang.Long] = new util.ArrayList[java.lang.Long]()
val it = input.iterator
while(it.hasNext) {
val next = it.next()
list.add(next._2)
println(next)
}
Collections.sort(list)
val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
val result = s"key: $key, window size: [${list.size}], first arrive at: [${sdf.format(list.get(0))}], last arrive at: [${sdf.format(list.get(list.size() - 1))}], " +
s"window start: [${sdf.format(window.getStart)}], window end: [${sdf.format(window.getEnd)}]"
out.collect(result)
}
})
// print the side output stream to the console
val sideOutput = window.getSideOutput(outputTag)
sideOutput.print("side-output")
window.print("window")