8. Window API

Window Programming Model

Keyed Windows

stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

Non-Keyed Windows

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

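1. Window Lifecycle

**Creation:** a window is created as soon as the first element that belongs to it arrives.

**Removal:** a window is removed once the time (event time or processing time) passes its end timestamp plus the user-specified allowed lateness.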

2. Window Assigners

2.1 Tumbling Windows

With tumbling windows, each element is assigned to exactly one window: the windows have a fixed size and do not overlap.

Example:

DataStream<T> input = ...;

// tumbling event-time windows
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// tumbling processing-time windows
input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// daily tumbling event-time windows offset by -8 hours.
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8))) // the offset can be used to adjust for a time zone
    .<windowed transformation>(<window function>);
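
The way a tumbling window assigner maps a timestamp to its window, including the effect of the offset, can be sketched in plain Java. This is a minimal sketch under stated assumptions, not Flink's implementation: the class `TumblingWindowSketch` and the method `windowStart` are hypothetical names, and all values are in milliseconds.

```java
// Plain-Java sketch (no Flink dependency); names are hypothetical.
class TumblingWindowSketch {

    // Start of the window that contains `timestamp`, for windows of length
    // `size` whose boundaries are shifted by `offset` (all in milliseconds).
    // floorMod keeps the result correct for negative timestamps and offsets.
    static long windowStart(long timestamp, long offset, long size) {
        return timestamp - Math.floorMod(timestamp - offset, size);
    }

    public static void main(String[] args) {
        long fiveSeconds = 5_000L;
        long oneDay = 24L * 60 * 60 * 1000;
        long minusEightHours = -8L * 60 * 60 * 1000;

        // An element at t = 7200 ms falls into the window [5000, 10000).
        System.out.println(windowStart(7_200L, 0L, fiveSeconds));

        // With a -8 hour offset, daily windows start at 16:00 UTC,
        // which is midnight in a UTC+8 time zone.
        System.out.println(windowStart(0L, minusEightHours, oneDay));
    }
}
```

Because every timestamp maps to exactly one start value, each element lands in exactly one window, which is the defining property of tumbling windows.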

2.2 Sliding Windows

With sliding windows, an element may belong to several windows (whenever the slide is smaller than the window size).

Example:

DataStream<T> input = ...;

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);
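
To make the "one element, several windows" behaviour concrete, here is a small plain-Java sketch of sliding window assignment. It is an illustration under assumptions rather than Flink's code: `SlidingWindowSketch` and `assign` are hypothetical names, windows are represented as `{start, end}` pairs with an exclusive end, and all values are in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Flink dependency); names are hypothetical.
class SlidingWindowSketch {

    // Returns every window {start, end} (end exclusive) that contains `timestamp`.
    static List<long[]> assign(long timestamp, long size, long slide, long offset) {
        List<long[]> windows = new ArrayList<>();
        // Start of the most recent window that begins at or before the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp - offset, slide);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        // 10 s windows sliding every 5 s: an element at 12 s belongs to two windows.
        for (long[] w : assign(12_000L, 10_000L, 5_000L, 0L)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
    }
}
```

With a slide equal to the size, the loop runs once and the behaviour degenerates to a tumbling window.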

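2.3 Session Windows

Session windows have no fixed size and do not overlap. A session window closes when it receives no data for a given gap; data that arrives afterwards is assigned to a new window. Internally, a separate window is created for every element, and two windows are merged whenever the interval between them is smaller than the configured gap.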

DataStream<T> input = ...;

// event-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// event-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);

// processing-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// processing-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);
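
The create-then-merge idea behind session windows can be sketched in plain Java as follows. This is only an illustration under assumptions: `SessionWindowSketch` and `sessions` are hypothetical names, each element gets a window `{timestamp, timestamp + gap}`, and the sketch treats windows that merely touch as mergeable, which is a boundary choice made here for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (no Flink dependency); names are hypothetical.
class SessionWindowSketch {

    // Builds one window per element, then merges windows that overlap or touch.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> merged = new ArrayList<>();
        for (long ts : sorted) {
            long[] candidate = {ts, ts + gap};
            if (!merged.isEmpty() && candidate[0] <= merged.get(merged.size() - 1)[1]) {
                // The new element arrived within the gap: extend the current session.
                long[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], candidate[1]);
            } else {
                // The gap elapsed without data: start a new session.
                merged.add(candidate);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // With a 10 s gap, the elements at 1 s and 4 s share a session; 20 s starts a new one.
        for (long[] s : sessions(new long[] {1_000L, 4_000L, 20_000L}, 10_000L)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

Note that the session's end keeps moving as long as elements keep arriving within the gap, which is why a session window has no fixed size.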

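2.4 Global Windows

A global window is only useful together with a custom trigger. Otherwise no computation is ever performed, because a global window has no end time at which the aggregation could run.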

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .<windowed transformation>(<window function>);
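
Why a global window needs a trigger can be illustrated with a plain-Java sketch of a count-based trigger: the window itself never ends, so something else has to decide when to emit a result. The class `GlobalWindowCountTriggerSketch`, its method `onElement`, and the choice to sum integers and clear the buffer after firing are all assumptions made for this illustration, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Flink dependency); names are hypothetical.
class GlobalWindowCountTriggerSketch {
    private final int maxCount;
    private final List<Integer> buffer = new ArrayList<>();

    GlobalWindowCountTriggerSketch(int maxCount) {
        this.maxCount = maxCount;
    }

    // Adds an element to the never-ending window. Returns the sum of the
    // buffered elements when the count trigger fires, or null otherwise.
    Integer onElement(int value) {
        buffer.add(value);
        if (buffer.size() < maxCount) {
            return null; // trigger has not fired yet: nothing is computed
        }
        int sum = 0;
        for (int v : buffer) {
            sum += v;
        }
        buffer.clear(); // purge the window contents after firing
        return sum;
    }

    public static void main(String[] args) {
        GlobalWindowCountTriggerSketch window = new GlobalWindowCountTriggerSketch(3);
        for (int v : new int[] {1, 2, 3, 4}) {
            System.out.println(v + " -> " + window.onElement(v));
        }
    }
}
```

Without the count check, `onElement` would never return a result, which mirrors a global window that has no custom trigger.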

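3. Window Functions

Window functions include ReduceFunction, AggregateFunction, and ProcessWindowFunction. The first two are more efficient because they aggregate incrementally, one element at a time as it arrives. ProcessWindowFunction instead receives an Iterable over all elements in the window, together with metadata about the window those elements belong to.

ProcessWindowFunction is less efficient because Flink must buffer all elements of the window internally before calling the function. It can be combined with an AggregateFunction or a ReduceFunction to get incremental aggregation while still having access to the window metadata.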

3.1 ReduceFunction

A ReduceFunction specifies how two input elements are combined into one output element of the same type.

Example:

import bean.SensorReading
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object ReduceFunctionTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = env.socketTextStream("localhost", 8888)

    val dataStream = source.map(line => {
      val infos = line.split(",")
      SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
    })

    val window = dataStream.keyBy(sensorReading => sensorReading.id)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
//      .minBy("timestamp")
      .reduce(new MyReduceFunction) // emits the minimum temperature together with the most recent timestamp

    window.print()

    env.execute()
  }
}

class MyReduceFunction extends ReduceFunction[SensorReading] {
  override def reduce(curSensor: SensorReading, newSensor: SensorReading): SensorReading = {
    SensorReading(curSensor.id, newSensor.timestamp, newSensor.temperature.min(curSensor.temperature))
  }
}

3.2 AggregateFunction

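AggregateFunction is a generalization of ReduceFunction. Unlike ReduceFunction, it has three types: an input type, an accumulator type, and an output type. It has four methods: create an accumulator (createAccumulator), update the accumulator (add), merge two accumulators (merge), and produce the output (getResult).

Example: compute the average temperature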

import bean.SensorReading
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object AggregateFunctionTest {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = env.socketTextStream("localhost", 8888)

    val dataStream = source.map(line => {
      val infos = line.split(",")
      SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
    })

    val window = dataStream.keyBy(sensorReading => sensorReading.id)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .aggregate(new MyAggregateFunction) // compute the average temperature

    window.print()

    env.execute()
  }
}

class MyAggregateFunction extends AggregateFunction[SensorReading, (Double, Int), Double] {
  override def createAccumulator(): (Double, Int) = (0.0, 0)

  override def add(value: SensorReading, accumulator: (Double, Int)): (Double, Int) = {
    (value.temperature + accumulator._1, accumulator._2 + 1)
  }

  override def getResult(accumulator: (Double, Int)): Double = {
    accumulator._1 / accumulator._2
  }

  override def merge(a: (Double, Int), b: (Double, Int)): (Double, Int) = {
    (a._1 + b._1, a._2 + b._2)
  }
}
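
The four-method contract described above can also be exercised outside Flink. The plain-Java sketch below mirrors the (sum, count) accumulator and checks that merging two partial accumulators gives the same average as adding all values to one accumulator, which is the property a merge step relies on (for example when windows are merged). `AverageAccumulatorSketch` and its static methods are hypothetical names, and a `double[]` of length two stands in for the tuple.

```java
// Plain-Java sketch (no Flink dependency); names are hypothetical.
class AverageAccumulatorSketch {

    // Accumulator layout: {running sum, element count}.
    static double[] createAccumulator() {
        return new double[] {0.0, 0.0};
    }

    // Folds one value into the accumulator; only constant-size state is kept.
    static double[] add(double value, double[] acc) {
        return new double[] {acc[0] + value, acc[1] + 1};
    }

    // Combines two partial accumulators.
    static double[] merge(double[] a, double[] b) {
        return new double[] {a[0] + b[0], a[1] + b[1]};
    }

    // Produces the final average from the accumulator.
    static double getResult(double[] acc) {
        return acc[0] / acc[1];
    }

    public static void main(String[] args) {
        double[] all = add(24.0, add(22.0, add(20.0, createAccumulator())));
        double[] left = add(22.0, add(20.0, createAccumulator()));
        double[] right = add(24.0, createAccumulator());
        System.out.println(getResult(all));
        System.out.println(getResult(merge(left, right)));
    }
}
```

Like the Scala example, this sketch divides by the count without guarding against an empty accumulator; getResult is only meaningful once at least one element has been added.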

3.3 ProcessWindowFunction

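3.3.1 Usage

ProcessWindowFunction is a full-window function: it receives an Iterable containing all elements of the window, plus a context object that gives access to time and state information (such as the watermark, window metadata, and state). This makes it more flexible than the other window functions, at the cost of performance.

Use cases:

1. The window's context object is needed;
2. The window's data must be sorted (this can also be done with aggregate).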

import bean.SensorReading
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector


object ProcessFunctionTest {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = env.socketTextStream("localhost", 8888)

    val dataStream = source.map(line => {
      val infos = line.split(",")
      SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
    })

    val window = dataStream.keyBy(sensorReading => sensorReading.id)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .process(new MyProcessFunction) // emit the window metadata and the median temperature of the current window

    window.print()

    env.execute()
  }

}

class MyProcessFunction extends ProcessWindowFunction[SensorReading, String, String, TimeWindow] {
  override def process(key: String, context: Context, elements: Iterable[SensorReading], out: Collector[String]): Unit = {
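    val list = elements.toList
    val sortList = list.sortBy(sensor => sensor.temperature)

    // emit the window metadata and the median temperature of the current window
    // (for an even number of elements this picks the upper of the two middle values)
    out.collect(s"window of ${context.window}: " + (key, sortList(sortList.size / 2).temperature))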
  }
}
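
3.3.2 Combining with an incremental aggregation function

When we need the window metadata but also want incremental aggregation, the two can be used together.

3.3.2.1 Combining with reduce
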
import bean.SensorReading
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object ProcessFunctionWithReduce {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = env.socketTextStream("localhost", 8888)

    val dataStream = source.map(line => {
      val infos = line.split(",")
      SensorReading(infos(0), infos(1).toLong, infos(2).toDouble)
    })

    val window = dataStream.keyBy(sensorReading => sensorReading.id)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .reduce(new MyReduceFunction2, new MyProcessFunction2) // emit the window metadata and the minimum temperature of the current window

    window.print()

    env.execute()
  }

}


class MyReduceFunction2 extends ReduceFunction[SensorReading] {
  override def reduce(curSensor: SensorReading, newSensor: SensorReading): SensorReading = {
    SensorReading(curSensor.id, newSensor.timestamp, newSensor.temperature.min(curSensor.temperature))
  }
}

class MyProcessFunction2 extends ProcessWindowFunction[SensorReading, String, String, TimeWindow] {
  override def process(key: String, context: Context, elements: Iterable[SensorReading], out: Collector[String]): Unit = {
    val min = elements.iterator.next()
    out.collect(s"${context.window}: $min")
  }
}
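
3.3.2.2 Combining with aggregate

This works the same way as above: pass the AggregateFunction and the ProcessWindowFunction together to aggregate(...).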

4. Handling Late Data

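4.1 Setting the allowed lateness

Setting an allowed lateness specifies how long a window stays open after its end time before it is finally closed.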

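4.2 Side output

Delaying window closure hurts the system's latency, so the allowed lateness should not be set too large. On top of the allowed lateness, a side output can be added: it sends late data to wherever we want to store it, which makes later analysis easier.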

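4.3 Code

import java.text.SimpleDateFormat
import java.util
import java.util.Collections
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// waterMarkStream is not defined in this fragment; it is assumed to be a
// DataStream[(String, Long)] with timestamps and watermarks already assigned.

// define a side output to handle late data
val outputTag = new OutputTag[(String, Long)]("late-data")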
val window = waterMarkStream.keyBy(0)
  .window(TumblingEventTimeWindows.of(Time.seconds(3)))
  .allowedLateness(Time.seconds(2)) // keep the window open for at most 2 more seconds
  .sideOutputLateData(outputTag = outputTag)
  .apply(new WindowFunction[(String, Long), String, Tuple, TimeWindow] {
    override def apply(tuple: Tuple, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
      val key = tuple.toString
      val list: util.ArrayList[java.lang.Long] = new util.ArrayList[java.lang.Long]()

      val it = input.iterator
      while(it.hasNext) {
        val next = it.next()
        list.add(next._2)
        println(next)
      }

      Collections.sort(list)

      val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")

      val result = s"key: $key, window size: [${list.size}], first arrive at: [${sdf.format(list.get(0))}], last arrive at: [${sdf.format(list.get(list.size() - 1))}], " +
        s"window start: [${sdf.format(window.getStart)}], window end: [${sdf.format(window.getEnd)}]"

      out.collect(result)
    }
  })

// print the side output to the console
val sideOutput = window.getSideOutput(outputTag)
sideOutput.print("side-output")

window.print("window")
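
How the allowed lateness and the side output interact can be sketched in plain Java. The sketch classifies an element by comparing the current watermark with the end of the element's window: before the window ends the element is on time, within the allowed lateness it is still added to the window, and after that it goes to the side output. `LatenessSketch`, `Fate`, and `classify` are hypothetical names, and the exact boundary convention (a window covering `[start, end)` has last timestamp `end - 1`) is an assumption of this illustration.

```java
// Plain-Java sketch (no Flink dependency); names are hypothetical.
class LatenessSketch {

    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    // Decides what happens to an element of the window ending at `windowEnd`
    // (exclusive) when the current watermark is `watermark` (all in milliseconds).
    static Fate classify(long windowEnd, long allowedLateness, long watermark) {
        long maxTimestamp = windowEnd - 1; // last timestamp that belongs to the window
        if (watermark < maxTimestamp) {
            return Fate.ON_TIME; // the window has not fired yet
        }
        if (watermark < maxTimestamp + allowedLateness) {
            return Fate.LATE_BUT_ACCEPTED; // the window is still kept around
        }
        return Fate.SIDE_OUTPUT; // the window was removed; the element is late data
    }

    public static void main(String[] args) {
        // Same settings as the example above: 3 s windows, 2 s allowed lateness.
        long windowEnd = 3_000L;
        long lateness = 2_000L;
        System.out.println(classify(windowEnd, lateness, 2_000L));
        System.out.println(classify(windowEnd, lateness, 3_500L));
        System.out.println(classify(windowEnd, lateness, 5_000L));
    }
}
```

This also restates the removal rule from the lifecycle section: the window's state is dropped once the watermark passes the window end plus the allowed lateness.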