Using Time and Windows in Flink, and Setting Watermarks

木鱼-

Original link: https://blog.csdn.net/sghuu/article/details/103697533

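Event Time: the time at which an event was created. It is usually described by a timestamp inside the event; in collected log data, for example, every log record carries the time it was generated. Flink reads the event timestamp through a timestamp assigner.
Ingestion Time: the time at which the data enters Flink.
Processing Time: the local system time of each operator that performs a time-based operation, so it depends on the machine. In the Flink versions this article uses (before 1.12), the default time characteristic is Processing Time.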

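Windows fall into two broad categories:
CountWindow: creates a window from a specified number of elements, independent of time.

TimeWindow: creates windows based on time. Depending on how the window is implemented, a TimeWindow is one of three kinds: tumbling window, sliding window, or session window.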
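To make the count-based category concrete, here is a minimal plain-Scala sketch. It does not use the Flink API; `CountWindowSketch` and `countWindows` are hypothetical names chosen for illustration. It cuts a sequence into windows of exactly `size` elements and holds back an incomplete trailing group, on the assumption that a count window only fires once it is full. In Flink the count is kept per key; the sketch ignores keys.

```scala
object CountWindowSketch {
  // Split the elements into consecutive groups of exactly `size`;
  // a trailing group with fewer elements is not emitted.
  def countWindows[A](elements: Seq[A], size: Int): List[List[A]] =
    elements.grouped(size).filter(_.size == size).map(_.toList).toList
}
```

For five elements and a window size of 2, only two complete windows come out; the fifth element waits for a partner.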

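Tumbling Windows
Data is sliced according to a fixed window length.
Characteristics: aligned in time, fixed window length, no overlap. The tumbling window assigner puts each element into exactly one window of the specified size.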
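The alignment can be expressed as simple arithmetic. The following plain-Scala sketch (not the Flink API; `TumblingSketch` and its methods are hypothetical names) assumes windows are aligned to timestamp 0 and computes the half-open interval `[start, end)` that a timestamp falls into.

```scala
object TumblingSketch {
  // Start of the tumbling window (aligned to timestamp 0) that contains `ts`.
  def windowStart(ts: Long, size: Long): Long = ts - Math.floorMod(ts, size)

  // Half-open interval [start, end) of that window.
  def window(ts: Long, size: Long): (Long, Long) = {
    val start = windowStart(ts, size)
    (start, start + size)
  }
}
```

With a 15-second window, a timestamp of 17000 ms lands in `[15000, 30000)`, and every timestamp belongs to exactly one such interval.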

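Sliding Windows
A sliding window is a generalization of the fixed (tumbling) window; it is defined by a fixed window length and a slide interval.
Characteristics: aligned in time, fixed window length, possibly overlapping. The sliding window assigner assigns elements to windows of fixed length. As with tumbling windows, the window size is set by the size parameter, and a second slide parameter controls how often a new window starts. If the slide is smaller than the window size, the windows overlap, and an element is assigned to several windows.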
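The overlap is easy to see by listing every window that contains a given timestamp. This is a plain-Scala sketch under the assumption that window starts are multiples of the slide; `SlidingSketch` and `windows` are hypothetical names, not part of Flink.

```scala
object SlidingSketch {
  // All windows [start, start + size) that contain `ts`, latest start first.
  def windows(ts: Long, size: Long, slide: Long): List[(Long, Long)] = {
    // Latest window start that is not after `ts`.
    val lastStart = ts - Math.floorMod(ts, slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size) // window must still cover `ts`
      .map(start => (start, start + size))
      .toList
  }
}
```

With a size of 10 s and a slide of 5 s, a timestamp of 7000 ms belongs to two windows, `[5000, 15000)` and `[0, 10000)`. When the slide equals the size, the result collapses to a single tumbling window.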

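Session Windows
A session window groups a series of events and is closed by a timeout gap of a specified length, much like a session in a web application: when no new data arrives for that long, the current window ends and the next element starts a new one.
Characteristics: not aligned in time.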
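The gap rule can be sketched in plain Scala as follows. This is an illustration only, not Flink's implementation; `SessionSketch` and `sessions` are hypothetical names, and the sketch chooses to treat a distance of exactly `gap` as a break, which may differ from the boundary handling of a real engine.

```scala
object SessionSketch {
  // Group event timestamps into sessions: a new session starts whenever
  // the distance to the previous event is at least `gap`.
  def sessions(timestamps: Seq[Long], gap: Long): List[List[Long]] =
    timestamps.sorted.foldLeft(List.empty[List[Long]]) {
      case (current :: done, ts) if ts - current.head < gap =>
        (ts :: current) :: done // still inside the current session
      case (acc, ts) =>
        List(ts) :: acc         // gap reached: open a new session
    }.map(_.reverse).reverse
}
```

Because session boundaries depend only on the data, two keys with different activity patterns end up with windows that start and end at unrelated times.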

// Sample case class for sensor readings: source id, creation timestamp, temperature

case class SensorReading(id: String, timestamp: Long, temperature: Double)

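By default, Flink's time windows are divided by Processing Time: incoming data is assigned to windows according to the system time at which the window operator processes it.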

// Minimum temperature per sensor in each 15-second tumbling window
val minTempPerWindow: DataStream[(String, Double)] = sensorData
  .map(r => (r.id, r.temperature))
  // Partition the stream by sensor id
  .keyBy(_._1)
  .timeWindow(Time.seconds(15))
  .reduce((r1, r2) => (r1._1, r1._2.min(r2._2)))

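With this setup, windows are computed according to the time at which data is processed by the system: when the system time reaches a window's end time, the window computation is triggered.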

Using event time
In Flink stream processing, the vast majority of applications use Event Time; Processing Time or Ingestion Time is generally a fallback for when Event Time is unavailable. To use Event Time, set the time characteristic as follows:

val env = StreamExecutionEnvironment.getExecutionEnvironment

// From this call on, every stream created by env uses event time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

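An event travels from where it is produced, through the source, to the operators, and this takes time. In most cases data reaches an operator in the order in which the events were produced, but the network, distribution and similar factors can cause disorder. Out-of-order means that the order in which Flink receives events does not strictly follow their Event Time.
A Watermark is a mechanism for measuring the progress of Event Time. It travels inside the data stream itself: watermarks are inserted into the stream as special elements that carry a timestamp.
Watermarks exist to handle out-of-order events, and handling them correctly usually means combining the Watermark mechanism with windows.
A Watermark in the stream asserts that all data with a timestamp smaller than the Watermark has already arrived, which is why window execution is triggered by the Watermark.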

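A Watermark can be understood as a delayed trigger mechanism. We configure a delay t; the system keeps track of the largest event time seen so far, maxEventTime, and assumes that all data with an Event Time smaller than maxEventTime - t has arrived. Once maxEventTime - t reaches a window's end time, that window is triggered.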
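The rule above can be traced with a small plain-Scala simulation. It is a sketch, not Flink code: `WatermarkSketch` and `firings` are hypothetical names, the watermark is recomputed after every element (Flink's periodic assigners emit it on a timer instead), and a window fires as soon as `maxEventTime - delay` reaches its end time, so exact boundary behavior may differ slightly from Flink's.

```scala
object WatermarkSketch {
  // For events in arrival order, return (windowEnd, timestampOfTriggeringEvent):
  // a tumbling window fires once watermark = maxEventTime - delay reaches its end.
  def firings(arrivals: Seq[Long], size: Long, delay: Long): List[(Long, Long)] = {
    var maxTs = Long.MinValue + delay // keeps maxTs - delay from underflowing
    var open = Set.empty[Long]        // end times of windows that have not fired
    val fired = List.newBuilder[(Long, Long)]
    for (ts <- arrivals) {
      val end = ts - Math.floorMod(ts, size) + size
      // An element behind the watermark does not reopen a fired window.
      if (end > maxTs - delay) open += end
      maxTs = math.max(maxTs, ts)
      val watermark = maxTs - delay
      val ready = open.filter(_ <= watermark).toList.sorted
      ready.foreach(e => fired += ((e, ts)))
      open --= ready
    }
    fired.result()
  }
}
```

With 5-second windows and a 5-second delay, the arrivals 1000, 4000, 6000, 3000, 10000, 12000, 15000 fire the window ending at 5000 only when the event with timestamp 10000 arrives, so the out-of-order element 3000 is still counted.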

Flink offers two interfaces for assigning timestamps and generating watermarks:

AssignerWithPeriodicWatermarks

AssignerWithPunctuatedWatermarks

Both interfaces extend TimestampAssigner.

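AssignerWithPeriodicWatermarks generates watermarks periodically. The default interval is 200 ms, and it can be changed through a setting:
// Emit a watermark every 5 seconds

env.getConfig.setAutoWatermarkInterval(5000)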

Example: a periodic timestamp and watermark assigner

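import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class PeriodicAssigner extends AssignerWithPeriodicWatermarks[SensorReading] {
  val bound: Long = 60 * 1000 // delay of 1 minute
  // Largest timestamp observed so far; starting at MinValue + bound keeps
  // maxTs - bound from underflowing before the first element arrives
  var maxTs: Long = Long.MinValue + bound

  override def getCurrentWatermark: Watermark = {
    new Watermark(maxTs - bound)
  }

  override def extractTimestamp(r: SensorReading, previousTS: Long): Long = {
    maxTs = maxTs.max(r.timestamp)
    r.timestamp
  }
}
// The watermark produced this way trails the largest observed timestamp by the delay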

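If we know in advance that the timestamps in the stream increase monotonically, that is, there is no disorder, we can use assignAscendingTimestamps, which generates watermarks directly from the timestamps of the data.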

val stream: DataStream[SensorReading] = ...
val withTimestampsAndWatermarks = stream
  .assignAscendingTimestamps(e => e.timestamp)

If the maximum lateness of the event timestamps can be roughly estimated, use BoundedOutOfOrdernessTimestampExtractor:

val stream: DataStream[SensorReading] = ...
val withTimestampsAndWatermarks = stream.assignTimestampsAndWatermarks(
  new SensorTimeAssigner
)

class SensorTimeAssigner
  extends BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(5)) {

  // Extract the timestamp
  override def extractTimestamp(r: SensorReading): Long = r.timestamp
}

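AssignerWithPunctuatedWatermarks generates watermarks intermittently, driven by the data; for example, we can generate watermarks only for the data of one particular key.
The code below inserts watermarks only for readings from sensor_1: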

class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[SensorReading] {
  // Delay in milliseconds (1 minute)
  val bound: Long = 60 * 1000

  // Decide, per element, whether a watermark is emitted
  override def checkAndGetNextWatermark(r: SensorReading, extractedTS: Long): Watermark = {
    if (r.id == "sensor_1") {
      new Watermark(extractedTS - bound)
    } else {
      null // no watermark for other sensors
    }
  }

  // How the event time is obtained
  override def extractTimestamp(r: SensorReading, previousTS: Long): Long = {
    r.timestamp
  }
}

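Setting the watermark involves a trade-off:
1) If the results must be strictly complete, every element before the watermark has to be present, so the delay has to be increased. The cost is that more data backs up in memory, which creates memory pressure.
2) A somewhat smaller delay shortens the wait before a window fires and eases memory pressure, but late data may be lost. Late-data handling can then be used to update the window results.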
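One way to get a feel for this trade-off is to count how many elements of a recorded sample arrive behind the watermark for different delays. The plain-Scala sketch below does that; `DelayTradeoffSketch` and `behindWatermark` are hypothetical names, and the watermark is assumed to be `maxEventTime - delay`, updated after every element. An element behind the watermark is only a candidate for being dropped: whether it is actually lost depends on whether its window has already fired.

```scala
object DelayTradeoffSketch {
  // Number of elements whose timestamp is already below the watermark
  // (maxEventTime - delay) at the moment they arrive.
  def behindWatermark(arrivals: Seq[Long], delay: Long): Int = {
    var maxTs = Long.MinValue + delay // keeps maxTs - delay from underflowing
    var count = 0
    for (ts <- arrivals) {
      if (ts < maxTs - delay) count += 1
      maxTs = math.max(maxTs, ts)
    }
    count
  }
}
```

For the arrivals 1000, 9000, 3000, 8000, 2000, a delay of 0 leaves three elements behind the watermark, a delay of 5 s leaves two, and a delay of 10 s leaves none, at the price of every window firing 10 s later in event time.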

The following article covers how Flink handles late data (by default it is simply discarded):

https://blog.csdn.net/sghuu/article/details/103704415

Three ways Flink handles late data

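Watermarks balance result completeness against latency. Unless we pick a very conservative watermark strategy (a maximum delay so large that it covers every element, at the price of very high latency), we always have to deal with late elements.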

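A late element is one whose window has already been computed by the time it arrives, that is, the watermark has already passed the window's end time. It follows that lateness only exists with event time.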

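The DataStream API offers three strategies for handling late elements:

Drop the late elements

Send the late elements to a separate stream

Update the already computed window result and emit the updated result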

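Dropping late elements
Dropping late elements is the default behavior of the event-time window operator; a late element does not create a new window.
A process function can easily filter out late elements by comparing their timestamp with the current watermark.

Redirecting late elements
Late elements can also be redirected to another stream with the side output feature. The side output stream of late elements can be processed further or written to persistent storage through a sink.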

Example: using the window operator directly to route late data to a side output stream

package com.late
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Redirecting late data
object LateTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism
    env.setParallelism(1)
    // Use event time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val stream = env.socketTextStream("linux102", 9999, '\n')

    // Input lines look like "a 100 12" or "a 99 14": key, value, timestamp in seconds
    val value = stream.map(x => {
      val str = x.split(" ")
      (str(0), str(1).toInt, str(2).toLong * 1000)
      // Watermark without any extra delay
    }).assignAscendingTimestamps(x => x._3).keyBy(_._1).timeWindow(Time.seconds(5))
      // Redirect late data to the "late_date" side output stream
      .sideOutputLateData(new OutputTag[(String, Int, Long)]("late_date"))
      //.process(new MaxFunction)
      .aggregate(new MaxFunction)

    // Read the "late_date" side output stream
    value.getSideOutput(new OutputTag[(String,Int,Long)]("late_date")).print()
    env.execute()
  }

  // Full-window function version
  // Type parameters: IN, OUT, KEY, WINDOW
  /*class MaxFunction extends ProcessWindowFunction[(String,Int,Long),(String,Int),String,TimeWindow]{
    override def process(key: String, context: Context, elements: Iterable[(String, Int, Long)], out: Collector[(String, Int)]): Unit = {

      out.collect((key,elements.map(_._2).toIterator.max))
    }

  }*/

  // Incremental aggregation function version
  // Type parameters: IN, ACC, OUT
  class MaxFunction extends AggregateFunction[(String,Int,Long),(String,Int),(String,Int)]{
    // Accumulation logic: keep the key and the larger value
    override def add(in: (String, Int, Long), acc: (String, Int)): (String, Int) = {
      (in._1,in._2.max(acc._2))}
    // Initialize the accumulator; Int.MinValue so that negative inputs also work
    override def createAccumulator(): (String, Int) =
      ("",Int.MinValue)
    // Return the result
    override def getResult(acc: (String, Int)): (String, Int) =
      acc
    // Merge two accumulators
    override def merge(acc: (String, Int), acc1: (String, Int)): (String, Int) =
      (acc._1,acc._2.max(acc1._2))
  }
}

Handling late data with a custom ProcessFunction that writes it to a side output stream

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object LateElement {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val stream = env.socketTextStream("linux1", 9999, '\n')
    val s = stream
      .map(line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toLong * 1000)
      })
      // Define how the timestamp is obtained; the watermark trails by 5 seconds
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[(String, Long)](Time.seconds(5)) {
        override def extractTimestamp(element: (String, Long)): Long = element._2
      })
      .process(new MyLateProcess)
    s.getSideOutput(new OutputTag[String]("late")).print()
    env.execute()
  }
  class MyLateProcess extends ProcessFunction[(String, Long), (String, Long)] {
    val late = new OutputTag[String]("late")
    override def processElement(value: (String, Long),
                                ctx: ProcessFunction[(String, Long), (String, Long)]#Context,
                                out: Collector[(String, Long)]): Unit = {
      if (value._2 < ctx.timerService().currentWatermark()) {
        // Send late data (timestamp below the watermark) to the side output stream
        ctx.output(late, "This element is late!")
      } else {
        out.collect(value)
      }
    }
  }
}

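Updating Results by Including Late Events
Because late elements exist, a window result that has already been computed is inaccurate and incomplete. We can use late elements to update it.

If an operator is to support recomputing and updating results it has already emitted, it must keep all of its state after the first emission. Obviously it cannot keep that state forever; at some point the state is cleared, and once it is cleared the result can no longer be recomputed or updated. From then on, late elements can only be dropped or sent to a side output stream.

The window operator API provides a way to declare explicitly that we want to wait for late elements. With event-time windows we can specify a period called allowed lateness. A window operator configured with allowed lateness does not delete the window and its state when the watermark passes the window's end time; it keeps all elements for the allowed lateness period.

When a late element arrives within the allowed lateness, it is processed immediately and handed to the trigger. Once the watermark passes the window's end time plus the allowed lateness, the window is deleted and all later late elements are dropped.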
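The three phases a window goes through can be summarized as a small decision function. This plain-Scala sketch is an illustration, not Flink's implementation; `LatenessSketch`, `classify` and the outcome names are hypothetical, and the exact comparison at the boundaries is an assumption.

```scala
object LatenessSketch {
  sealed trait Outcome
  case object OnTime extends Outcome     // window has not fired yet
  case object LateUpdate extends Outcome // window fired, state kept: result is updated
  case object Dropped extends Outcome    // window state already removed

  // Classify an element that belongs to the window ending at `windowEnd`,
  // given the current watermark and the allowed lateness.
  def classify(windowEnd: Long, watermark: Long, allowedLateness: Long): Outcome =
    if (watermark < windowEnd) OnTime
    else if (watermark < windowEnd + allowedLateness) LateUpdate
    else Dropped
}
```

For a window ending at 5000 ms with 5 s of allowed lateness, an element arriving while the watermark is at 7000 updates the result, while one arriving at watermark 10000 is dropped (or goes to a side output if one is configured).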

package com.atguigu

import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object AllowedLateTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream = env.socketTextStream("Linux1", 9999, '\n')
    val s = stream
      .map(line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toLong * 1000)
      })
//      .assignAscendingTimestamps(_._2)
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[(String, Long)](Time.seconds(5)) {
        override def extractTimestamp(element: (String, Long)): Long = element._2
      })
      .keyBy(_._1)
      // [0,5),...
      .timeWindow(Time.seconds(5))
      // Watermark passes the window end: the window fires but is not destroyed
      // Watermark passes the window end + allowed lateness: the window is destroyed
      .allowedLateness(Time.seconds(5))
      .process(new MyAllowedLateProcess)
    s.print()
    env.execute()
  }
  class MyAllowedLateProcess extends ProcessWindowFunction[(String, Long),
    String, String,TimeWindow] {
    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Long)],
                         out: Collector[String]): Unit = {
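      // Per-window state, so that every window tracks its own first firing
      val isUpdate = context.windowState.getState(
        new ValueStateDescriptor[Boolean]("update", Types.of[Boolean])
      )
      if (!isUpdate.value()) {
        out.collect("First firing: the watermark passed the window end time")
        isUpdate.update(true)
      } else {
        out.collect("A late element arrived: the window result is updated")
      }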
    }
  }
}