【Flink流式计算框架】：Flink中的Window API

最新推荐文章于 2024-07-24 10:08:29 发布

Yuan_CSDF

最新推荐文章于 2024-07-24 10:08:29 发布

阅读量246

点赞数

分类专栏： # Flink基础文章标签： flink

本文链接：https://blog.csdn.net/Yuan_CSDF/article/details/117376970

版权

Flink基础专栏收录该内容

15 篇文章 2 订阅

订阅专栏

1.Window概述

streaming 流式计算是一种被设计用于处理无限数据集的数据处理引擎，而无限数据集是指一种不断增长的本质上无限的数据集，而window 是一种切割无限数据为有限块进行处理的手段。

Window 是无限数据流处理的核心，Window 将一个无限的stream 拆分成有限大小的”buckets”桶，我们可以在这些桶上做计算操作。

2.Window类型

Window 可以分成两类：

CountWindow：按照指定的数据条数生成一个Window，与时间无关。
TimeWindow：按照时间生成Window。

对于TimeWindow，可以根据窗口实现原理的不同分成三类：滚动窗口（TumblingWindow）、滑动窗口（Sliding Window）和会话窗口（Session Window）。

2.1. 滚动窗口（Tumbling Windows）

将数据依据固定的窗口长度对数据进行切片。

特点：时间对齐，窗口长度固定，没有重叠。

滚动窗口分配器将每个元素分配到一个指定窗口大小的窗口中，滚动窗口有一个固定的大小，并且不会出现重叠。例如：如果你指定了一个5 分钟大小的滚动窗口，窗口的创建如下图所示：

适用场景：适合做BI 统计等（做每个时间段的聚合计算）。

2.2. 滑动窗口（Sliding Windows）

滑动窗口是固定窗口的更广义的一种形式，滑动窗口由固定的窗口长度和滑动间隔组成。

特点：时间对齐，窗口长度固定，可以有重叠。

滑动窗口分配器将元素分配到固定长度的窗口中，与滚动窗口类似，窗口的大小由窗口大小参数来配置，另一个窗口滑动参数控制滑动窗口开始的频率。因此，滑动窗口如果滑动参数小于窗口大小的话，窗口是可以重叠的，在这种情况下元素会被分配到多个窗口中。

例如，你有10 分钟的窗口和5 分钟的滑动，那么每个窗口中5 分钟的窗口里包含着上个10 分钟产生的数据，如下图所示：

适用场景：对最近一个时间段内的统计（求某接口最近5min 的失败率来决定是否要报警）。

2.3. 会话窗口（Session Windows）

由一系列事件组合一个指定时间长度的timeout 间隙组成，类似于web 应用的session，也就是一段时间没有接收到新数据就会生成新的窗口。

特点：时间无对齐。

session 窗口分配器通过session 活动来对元素进行分组，session 窗口跟滚动窗口和滑动窗口相比，不会有重叠和固定的开始时间和结束时间的情况，相反，当它在一个固定的时间周期内不再收到元素，即非活动间隔产生，那个这个窗口就会关闭。一个session 窗口通过一个session 间隔来配置，这个session 间隔定义了非活跃周期的长度，当这个非活跃周期产生，那么当前的session 将关闭并且后续的元素将被分配到新的session 窗口中去。

3.Window API

窗口代码使用如下：

 val resultStream = dataStream
      .map(data => (data.id, data.temperature, data.timestamp))
      .keyBy(_._1) // 按照二元组的第一个元素（id）分组
      //      .window( TumblingEventTimeWindows.of(Time.seconds(15)))    // 滚动时间窗口
      //      .window( SlidingProcessingTimeWindows.of(Time.seconds(15), Time.seconds(3)) )    // 滑动时间窗口
      //      .window( EventTimeSessionWindows.withGap(Time.seconds(10)) )    // 会话窗口
      //      .countWindow(10)    // 滚动计数窗口
      // 查看起始时间 TimeWindows 269行 getWindowStartWithOffset
      .timeWindow(Time.seconds(15))
      .allowedLateness(Time.minutes(1))
      .sideOutputLateData(latetag)
      //      .minBy(1)
      .reduce((curRes, newData) => (curRes._1, curRes._2.min(newData._2), newData._3))

3.1其它可选API

trigger() —— 触发器定义window 什么时候关闭，触发计算并输出结果
1. TriggerResult.CONTINUE ：表示对 window 不做任何处理
2. TriggerResult.FIRE ：表示触发 window 的计算
3. TriggerResult.PURGE ：表示清除 window 中的所有数据
4. TriggerResult.FIRE_AND_PURGE ：表示先触发 window 计算，然后删除
evitor() —— 移除器
- 定义移除某些数据的逻辑
allowedLateness() —— 允许处理迟到的数据
sideOutputLateData() —— 将迟到的数据放入侧输出流
getSideOutput() —— 获取侧输出流

3.2使用迟到元素更新窗口计算结果

val readings: DataStream[SensorReading] = ...

val countPer10Secs: DataStream[(String, Long, Int, String)] = readings
  .keyBy(_.id)
  .timeWindow(Time.seconds(10))
  // process late readings for 5 additional seconds
  .allowedLateness(Time.seconds(5))
  // count readings and update results if late readings arrive
  .process(new UpdatingWindowCountFunction)

  /** A counting WindowProcessFunction that distinguishes between
  * first results and updates. */
class UpdatingWindowCountFunction
    extends ProcessWindowFunction[SensorReading,
      (String, Long, Int, String), String, TimeWindow] {

  override def process(
      id: String,
      ctx: Context,
      elements: Iterable[SensorReading],
      out: Collector[(String, Long, Int, String)]): Unit = {

    // count the number of readings
    val cnt = elements.count(_ => true)

    // state to check if this is
    // the first evaluation of the window or not
    val isUpdate = ctx.windowState.getState(
      new ValueStateDescriptor[Boolean](
        "isUpdate",
        Types.of[Boolean]))

    if (!isUpdate.value()) {
      // first evaluation, emit first result
      out.collect((id, ctx.window.getEnd, cnt, "first"))
      isUpdate.update(true)
    } else {
      // not the first evaluation, emit an update
      out.collect((id, ctx.window.getEnd, cnt, "update"))
    }
  }
}

4.窗口计算函数的调用

窗口计算函数大致可以分为两类：

增量聚合函数(Incremental aggregation functions)：当一个事件被添加到窗口时，触发函数计算，并且更新window的状态(单个值)。最终聚合的结果将作为输出。ReduceFunction和AggregateFunction是增量聚合函数。
全窗口函数(Full window functions)：这个函数将会收集窗口中所有的元素，可以做一些复杂计算。ProcessWindowFunction是window function。

4.1增量聚合函数

// ReduceFunction与AggregateFunction

val minTempPerWindow = sensorData
  .map(r => (r.id, r.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(15))
  .reduce((r1, r2) => (r1._1, r1._2.min(r2._2)))

// IN是输入元素的类型，ACC是累加器的类型，OUT是输出元素的类型。
val avgTempPerWindow: DataStream[(String, Double)] = sensorData
  .map(r => (r.id, r.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(15))
  .aggregate(new AvgTempFunction)

// An AggregateFunction to compute the average temperature per sensor.
// The accumulator holds the sum of temperatures and an event count.
class AvgTempFunction
  extends AggregateFunction[(String, Double),
    (String, Double, Int), (String, Double)] {

  override def createAccumulator() = {
    ("", 0.0, 0)
  }

  override def add(in: (String, Double), acc: (String, Double, Int)) = {
    (in._1, in._2 + acc._2, 1 + acc._3)
  }

  override def getResult(acc: (String, Double, Int)) = {
    (acc._1, acc._2 / acc._3)
  }

  override def merge(acc1: (String, Double, Int),
    acc2: (String, Double, Int)) = {
    (acc1._1, acc1._2 + acc2._2, acc1._3 + acc2._3)
  }
}

4.2全窗口函数

一些业务场景，我们需要收集窗口内所有的数据进行计算，例如计算窗口数据的中位数，或者计算窗口数据中出现频率最高的值。这样的需求，使用ReduceFunction和AggregateFunction就无法实现了。这个时候就需要ProcessWindowFunction了。

// 计算5s滚动窗口中的最低和最高的温度。输出的元素包含了(流的Key, 最低温度, 最高温度, 窗口结束时间)。

val minMaxTempPerWindow: DataStream[MinMaxTemp] = sensorData
  .keyBy(_.id)
  .timeWindow(Time.seconds(5))
  .process(new HighAndLowTempProcessFunction)

case class MinMaxTemp(id: String, min: Double, max: Double, endTs: Long)

class HighAndLowTempProcessFunction
  extends ProcessWindowFunction[SensorReading,
    MinMaxTemp, String, TimeWindow] {
  override def process(key: String,  //window的key
                       ctx: Context,  // 　Context参数和别的process方法一样。而ProcessWindowFunction的Context对象还可以访问window的元数据(窗口开始和结束时间)，当前处理时间和水位线，per-window state和per-key global state，side outputs。
                       vals: Iterable[SensorReading], // 迭代器包含窗口的所有元素
                       out: Collector[MinMaxTemp]
                      ): Unit = {
    val temps = vals.map(_.temperature)
    val windowEnd = ctx.window.getEnd

    out.collect(MinMaxTemp(key, temps.min, temps.max, windowEnd))
  }
}

4.3两种窗口函数的混合调用

case class MinMaxTemp(id: String, min: Double, max: Double, endTs: Long)

val minMaxTempPerWindow2: DataStream[MinMaxTemp] = sensorData
  .map(r => (r.id, r.temperature, r.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(5))
  .reduce(
    (r1: (String, Double, Double), r2: (String, Double, Double)) => {
      (r1._1, r1._2.min(r2._2), r1._3.max(r2._3))
    },
    new AssignWindowEndProcessFunction
  )

class AssignWindowEndProcessFunction
  extends ProcessWindowFunction[(String, Double, Double),
    MinMaxTemp, String, TimeWindow] {
    override def process(key: String,
                       ctx: Context,
                       minMaxIt: Iterable[(String, Double, Double)],
                       out: Collector[MinMaxTemp]): Unit = {
    val minMax = minMaxIt.head
    val windowEnd = ctx.window.getEnd
    out.collect(MinMaxTemp(key, minMax._2, minMax._3, windowEnd))
  }
}

5.window join

5.1Tumbling Window Join

两个window之间可以进行join，join操作只支持三种类型的window：滚动窗口，滑动窗口，会话窗口

使用方式：

stream.join(otherStream) //两个流进行关联
.where(<KeySelector>) //选择第一个流的key作为关联字段
.equalTo(<KeySelector>)//选择第二个流的key作为关联字段
.window(<WindowAssigner>)//设置窗口的类型
.apply(<JoinFunction>) //对结果做操作 process apply = foreachWindow

示例代码如下：

 def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    import org.apache.flink.api.scala.{ExecutionEnvironment, _}
    val stream1 = env
      .fromElements(
        ("a", 1000L),
        ("a", 2000L)
      )
      .assignAscendingTimestamps(_._2)

    val stream2 = env
      .fromElements(
        ("a", 3000L),
        ("a", 4000L)
      )
      .assignAscendingTimestamps(_._2)

    stream1
      .join(stream2)
      // on A.id = B.id
      .where(_._1)
      .equalTo(_._1)
      .window(TumblingEventTimeWindows.of(Time.seconds(5)))
      .apply(new JoinFunction[(String, Long), (String, Long), String] {
        override def join(in1: (String, Long), in2: (String, Long)): String = {
          in1 + " => " + in2
        }
      })
      .print()

    env.execute()
  }

5.2Interval Join

object TxMatchWithJoin {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // 1. 读取订单事件数据
    val resource1 = getClass.getResource("/OrderLog.csv")
    val orderEventStream = env.readTextFile(resource1.getPath)
      //    val orderEventStream = env.socketTextStream("localhost", 7777)
      .map( data => {
      val arr = data.split(",")
      OrderEvent(arr(0).toLong, arr(1), arr(2), arr(3).toLong)
    } )
      .assignAscendingTimestamps(_.timestamp * 1000L)
      .filter(_.eventType == "pay")
      .keyBy(_.txId)

    // 2. 读取到账事件数据
    val resource2 = getClass.getResource("/ReceiptLog.csv")
    val receiptEventStream = env.readTextFile(resource2.getPath)
      //    val orderEventStream = env.socketTextStream("localhost", 7777)
      .map( data => {
      val arr = data.split(",")
      ReceiptEvent(arr(0), arr(1), arr(2).toLong)
    } )
      .assignAscendingTimestamps(_.timestamp * 1000L)
      .keyBy(_.txId)

    val resultStream = orderEventStream.intervalJoin(receiptEventStream)
      .between(Time.seconds(0), Time.seconds(5))
      .process( new TxMatchWithJoinResult() )

    resultStream.print()
    env.execute("tx match with join job")
  }
}

class TxMatchWithJoinResult() extends ProcessJoinFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]{
  override def processElement(left: OrderEvent, right: ReceiptEvent, ctx: ProcessJoinFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]#Context, out: Collector[(OrderEvent, ReceiptEvent)]): Unit = {
    out.collect((left, right))
  }
}

注意：先后顺序

// stream1比stream2晚到10秒 --> 同时到
    stream1
      .intervalJoin(stream2)
      .between(Time.minutes(-10), Time.minutes(0))
      .process(new ProcessJoinFunction[(String, Long, String), (String, Long, String), String] {
        override def processElement(in1: (String, Long, String), in2: (String, Long, String), context: ProcessJoinFunction[(String, Long, String), (String, Long, String), String]#Context, collector: Collector[String]): Unit = {
          collector.collect(in1 + " => " + in2)
        }
      })
      .print()

    stream2
      .intervalJoin(stream1)
      //  stream1比stream2晚到10秒 --> 同时到
      .between(Time.minutes(0), Time.minutes(10))
      .process(new ProcessJoinFunction[(String, Long, String), (String, Long, String), String] {
        override def processElement(in1: (String, Long, String), in2: (String, Long, String), context: ProcessJoinFunction[(String, Long, String), (String, Long, String), String]#Context, collector: Collector[String]): Unit = {
          collector.collect(in1 + " => " + in2)
        }
      })
      .print()