Flink Time Windows


陈同学


Article link: https://blog.csdn.net/weixin_43866666/article/details/120567137


1. Time

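Event time: the time at which the data was actually generated
e.g. the time field of an access record
The timestamp is part of the data itself
Watermark: always tied to event time; it defines how long data is allowed to arrive late
Advantages:
Results are deterministic
Handles out-of-order and delayed data
Disadvantages:
Latency: results are held back while waiting for late data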
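
To make the watermark idea concrete, here is a minimal plain-Scala model with no Flink dependency. It assumes a bounded-lateness scheme in which the watermark trails the largest event time seen so far by a fixed delay, and a window [start, end) may fire once the watermark reaches end - 1. The names `BoundedLatenessWatermark` and `windowCanFire` are hypothetical helpers for this sketch, not Flink classes.

```scala
object WatermarkSketch {
  // Hypothetical helper: tracks the largest event time seen and derives a watermark from it.
  final class BoundedLatenessWatermark(maxDelayMs: Long) {
    // Start low enough that the first watermark is Long.MinValue (nothing seen yet).
    private var maxTimestampSeen: Long = Long.MinValue + maxDelayMs + 1

    def onEvent(eventTimeMs: Long): Unit = {
      maxTimestampSeen = math.max(maxTimestampSeen, eventTimeMs)
    }

    // The watermark asserts: no event with a timestamp <= this value is still expected.
    def current: Long = maxTimestampSeen - maxDelayMs - 1
  }

  // Assumption of this sketch: a window [start, end) fires once the watermark reaches end - 1.
  def windowCanFire(windowEndMs: Long, watermarkMs: Long): Boolean =
    watermarkMs >= windowEndMs - 1

  def main(args: Array[String]): Unit = {
    val wm = new BoundedLatenessWatermark(maxDelayMs = 2000L)
    // 3000 arrives after 4000: out of order, but still within the allowed delay.
    Seq(1000L, 4000L, 3000L, 7000L).foreach { ts =>
      wm.onEvent(ts)
      println(s"event=$ts watermark=${wm.current} fire[0,5000)=${windowCanFire(5000L, wm.current)}")
    }
  }
}
```

In this model the window [0, 5000) only fires after the event at 7000 arrives, which is the latency cost listed under disadvantages.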

Ingestion time: the time at which the data enters Flink
Depends on the machine clock
Assigned at the Flink source operator

Processing time: the time at which the business logic executes
The clock of the machine running the operator


2. Window

Window: cuts an unbounded stream into finite chunks that can be computed on
Categories: time windows and count windows

TUMBLING WINDOW
Time-aligned, fixed window length, no overlap
Typical use: BI statistics
SLIDING WINDOW

SESSION WINDOW

GLOBAL WINDOW
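
The "time-aligned, fixed length, no overlap" property of a tumbling window can be sketched in plain Scala. This is a model under the assumption that windows are aligned to the epoch with no offset; `windowStart` and `windowOf` are hypothetical names for this sketch.

```scala
object TumblingWindowSketch {
  // Start of the tumbling window containing timestampMs, assuming alignment to the epoch.
  // floorMod keeps the result correct for negative timestamps too.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - Math.floorMod(timestampMs, sizeMs)

  // The half-open interval [start, end) that the timestamp falls into.
  def windowOf(timestampMs: Long, sizeMs: Long): (Long, Long) = {
    val start = windowStart(timestampMs, sizeMs)
    (start, start + sizeMs)
  }

  def main(args: Array[String]): Unit = {
    // Every timestamp maps to exactly one window, so windows never overlap.
    Seq(12345L, 14999L, 15000L).foreach { ts =>
      println(s"ts=$ts -> window=${windowOf(ts, 5000L)}")
    }
  }
}
```

Because the start depends only on the timestamp and the size, two jobs started at different moments still produce the same window boundaries.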

3. Tumbling window

Count window

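    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.socketTextStream("hadoop001", 9527)
    stream.map(_.trim.toInt)
      .countWindowAll(5) // non-keyed: fires once every 5 records across the whole stream
      .sum(0).print()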

Count window after keyBy

    stream.map(x => {
      (x.trim.toInt , 1 )
    }).keyBy(x => x._1)
      .countWindow(5) // keyed count window: fires once a single key has accumulated 5 records
      .sum(1).print()
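
The difference between a count window over the whole stream and a count window per key can be modeled in plain Scala. This is a sketch of the semantics only, assuming a window emits nothing until it has collected the full number of records; the function names are hypothetical.

```scala
object CountWindowSketch {
  // Non-keyed model: one counter for the whole stream, one sum per full group of `size`.
  def countWindowAll(values: Seq[Int], size: Int): List[Int] =
    values.grouped(size).filter(_.size == size).map(_.sum).toList

  // Keyed model: each key has its own independent counter.
  def keyedCountWindow(records: Seq[(Int, Int)], size: Int): Map[Int, List[Int]] =
    records.groupBy(_._1).map { case (key, group) =>
      key -> countWindowAll(group.map(_._2), size)
    }

  def main(args: Array[String]): Unit = {
    // 12 records: two full windows of 5, the last 2 records are still pending.
    println(countWindowAll(1 to 12, 5))
    // Key 1 appears 5 times and fires; key 2 appears twice and does not.
    val records = Seq((1, 1), (2, 1), (1, 1), (1, 1), (1, 1), (1, 1), (2, 1))
    println(keyedCountWindow(records, 5))
  }
}
```

In the keyed case seven records have arrived in total, yet only key 1 produces output, because the count is tracked per key rather than per stream.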

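Shorthand for a tumbling time window; it defaults to processing time (timeWindow and timeWindowAll are deprecated since Flink 1.12 in favor of window(...) and windowAll(...) with an explicit window assigner)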

    stream.map(_.trim.toInt)
      .timeWindowAll(Time.seconds(5))
      .sum(0).print()

Time window after keyBy

    stream.map(x => {
      (x.trim.toInt , 1 )
    }).keyBy(x => x._1)
      .timeWindow(Time.seconds(5)) // keyed TumblingProcessingTimeWindows
      .sum(1).print()

4. Sliding window

A 10-second window that slides forward every 5 seconds

    stream.map(_.trim.toInt)
      .timeWindowAll(Time.seconds(10),Time.seconds(5))
      .sum(0).print()

The first window at startup is only half filled
0 - 5
0 - 10
5 - 15
10 - 20
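
The half-filled first window follows from how a record is assigned to every sliding window that covers it. A plain-Scala model, assuming windows are aligned to the epoch with no offset (`windowsFor` is a hypothetical helper for this sketch):

```scala
object SlidingWindowSketch {
  // All [start, end) windows of length sizeMs, sliding by slideMs, that contain timestampMs.
  def windowsFor(timestampMs: Long, sizeMs: Long, slideMs: Long): List[(Long, Long)] = {
    // Latest window start that is <= the timestamp.
    val lastStart = timestampMs - Math.floorMod(timestampMs, slideMs)
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > timestampMs - sizeMs) // window must still cover the timestamp
      .map(start => (start, start + sizeMs))
      .toList
      .reverse // earliest window first
  }

  def main(args: Array[String]): Unit = {
    // A record at second 3 falls into [-5, 5) and [0, 10): the first of these
    // only ever holds data from seconds 0 to 5, hence the "half window".
    println(windowsFor(3000L, 10000L, 5000L))
    println(windowsFor(12000L, 10000L, 5000L))
  }
}
```

With a size of 10 seconds and a slide of 5 seconds, every record belongs to exactly two windows.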
With a key

        stream.map(x => {
          (x.trim.toInt , 1 )
        }).keyBy(x => x._1)
          .timeWindow(Time.seconds(10),Time.seconds(5)) // keyed sliding window: size 10s, slide 5s
          .sum(1).print()

5. Window functions

5.1 Incremental

Incremental aggregation processes each record as it arrives
Sum

    stream.map(x => (1, x.trim.toInt))
      .keyBy(x=> x._1)
      .timeWindow(Time.seconds(5))
      .reduce((x,y) => {
        (x._1,(x._2 + y._2))
      }).print()

Average

 /**
     * average = sum / count
     *
     * interface AggregateFunction<IN, ACC, OUT>
     *   IN : <a,100 >  --> (String, Long)
     *   ACC: intermediate type (Long, Long); the accumulator holds two Long values, the sum and the count
     *   OUT: Double
     */
        stream
          .map(x => ("a", x.trim.toLong))
          .keyBy(x=> x._1)
          .timeWindow(Time.seconds(5))
          .aggregate(new myAverageAggFunction)
          .print()



import org.apache.flink.api.common.functions.AggregateFunction

class myAverageAggFunction extends AggregateFunction[(String,Long),(Long,Long) , Double] {
  // Create the accumulator with its initial value: (sum, count) = (0, 0)
  override def createAccumulator(): (Long, Long) = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)): (Long, Long) = {
    println("add ....... invoke ....." + value._1 + "      " + value._2)
    // First field of the accumulator holds the sum, second field holds the count
    (accumulator._1 + value._2, accumulator._2 + 1L)
  }

  override def getResult(accumulator: (Long, Long)): Double = {
    // sum / count
    accumulator._1 / accumulator._2.toDouble
  }
  // Merge two accumulators
  override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) = {
    (a._1 + b._1, a._2 + b._2)
  }
}
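
The (sum, count) accumulator logic can be checked without a Flink cluster. The following plain-Scala sketch mirrors the add, merge and result steps as ordinary functions (the object and function names are hypothetical) and shows that merging two partial accumulators gives the same average as folding all values into one.

```scala
object AverageAccumulatorSketch {
  type Acc = (Long, Long) // (sum, count)

  val empty: Acc = (0L, 0L)

  // One record arrives: update the running sum and count.
  def add(value: Long, acc: Acc): Acc = (acc._1 + value, acc._2 + 1L)

  // Combine two partial accumulators field by field.
  def merge(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // Only meaningful for a non-empty accumulator; (0, 0) would yield NaN.
  def result(acc: Acc): Double = acc._1 / acc._2.toDouble

  def main(args: Array[String]): Unit = {
    val values = Seq(10L, 20L, 60L)
    val whole = values.foldLeft(empty)((acc, v) => add(v, acc))
    val (left, right) = values.splitAt(1)
    val merged = merge(
      left.foldLeft(empty)((acc, v) => add(v, acc)),
      right.foldLeft(empty)((acc, v) => add(v, acc))
    )
    println(s"whole=$whole merged=$merged average=${result(whole)}")
  }
}
```

Only two Long values are kept per window, no matter how many records arrive, which is the point of incremental aggregation.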

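5.2 Full window

Full-window aggregation buffers every record in the window and processes them as one batch when the window fires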
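
The trade-off between the two styles can be illustrated with a small plain-Scala model. It is only a sketch of the state each approach has to keep, not Flink's implementation; `IncrementalMax` and `BufferedMax` are hypothetical classes.

```scala
import scala.collection.mutable.ArrayBuffer

object WindowStateSketch {
  // Incremental style: state is a single running value, updated on every record.
  final class IncrementalMax {
    private var current: Int = Int.MinValue
    def add(v: Int): Unit = { current = math.max(current, v) }
    def stateSize: Int = 1
    def fire(): Int = current
  }

  // Full-window style: every record is buffered; the work happens when the window fires.
  final class BufferedMax {
    private val buffer = ArrayBuffer.empty[Int]
    def add(v: Int): Unit = { buffer += v }
    def stateSize: Int = buffer.size
    def fire(): Int = buffer.max // assumes at least one record was added
  }

  def main(args: Array[String]): Unit = {
    val incremental = new IncrementalMax
    val buffered = new BufferedMax
    Seq(3, 9, 4).foreach { v => incremental.add(v); buffered.add(v) }
    println(s"incremental: max=${incremental.fire()} state=${incremental.stateSize}")
    println(s"buffered:    max=${buffered.fire()} state=${buffered.stateSize}")
  }
}
```

Both produce the same maximum; the buffered version pays with state that grows with the window, in exchange for having access to all elements at once.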

    /**
     * Find the maximum value in the window
     * ProcessWindowFunction[IN, OUT, KEY, W <: Window]
     * IN : (Int, Int)
     * OUT: String
     * KEY : Tuple
     * W :  TimeWindow
     */
        stream
          .map(x => (1, x.trim.toInt))
          .keyBy(0) // key by tuple position, so the key type is the Java Tuple
          .timeWindow(Time.seconds(5))
          .process(new myProcessFunction)
          .print()

    env.execute(getClass.getCanonicalName)

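import java.sql.Timestamp
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class myProcessFunction extends ProcessWindowFunction[(Int, Int), String, Tuple, TimeWindow]{
  // Called once per key when the window fires, with every element buffered for that window
  override def process(key: Tuple, context: Context, elements: Iterable[(Int, Int)], out: Collector[String]): Unit = {
    println(" ---------------process.invoked---------------")
    var maxValue = Int.MinValue
    for(ele <- elements){
      maxValue = ele._2.max(maxValue)
    }

    val start = new Timestamp(context.window.getStart)
    val end = new Timestamp(context.window.getEnd)
    out.collect(" max value ====" + maxValue  +  " time : "+start +  "     -     "+end)
  }
}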

5.3 Side output

    /**
     * Readings above the threshold (39.4) go to the side output; sample input line: pk,1000,37.6
     */
    val streamT = stream.map(x => {
      val splits = x.split(",")
      Temperature(splits(0).trim, splits(1).trim.toLong, splits(2).trim.toFloat)
    }).process(new TemperatureProcess(39.4F))

    streamT.print("normal: ")
    // The tag id and element type must match the OutputTag used inside TemperatureProcess
    streamT.getSideOutput(new OutputTag[(String, Long, Float)]("high")).print("high: ")

    env.execute(getClass.getCanonicalName)

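import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

class TemperatureProcess(threshold: Float) extends ProcessFunction[Temperature, Temperature]{
  override def processElement(value: Temperature, ctx: ProcessFunction[Temperature, Temperature]#Context, out: Collector[Temperature]): Unit = {
    if(value.temperature <= threshold){
      // At or below the threshold: emit to the main output
      out.collect(value)
    } else{
      // Above the threshold: emit a tuple to the side output tagged "high"
      ctx.output(new OutputTag[(String, Long, Float)]("high"), (value.name, value.time, value.temperature))
    }
  }
}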

case class Temperature(name: String, time:Long, temperature: Float)
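
The routing rule of the side output can be tried without Flink. The sketch below models it in plain Scala: each input line is parsed, readings at or below the threshold stay in the main output, and the rest go to a second collection as tuples. The object redefines `Temperature` locally so it is self-contained, and the input line `zs,2000,39.8` is made-up sample data.

```scala
object SideOutputSketch {
  // Local copy of the article's case class so this sketch compiles on its own.
  final case class Temperature(name: String, time: Long, temperature: Float)

  // Parse one "name,time,temperature" line.
  def parse(line: String): Temperature = {
    val parts = line.split(",")
    Temperature(parts(0).trim, parts(1).trim.toLong, parts(2).trim.toFloat)
  }

  // Returns (main output, side output).
  def split(lines: Seq[String], threshold: Float): (Seq[Temperature], Seq[(String, Long, Float)]) = {
    val (normal, high) = lines.map(parse).partition(_.temperature <= threshold)
    (normal, high.map(t => (t.name, t.time, t.temperature)))
  }

  def main(args: Array[String]): Unit = {
    val (mainOut, sideOut) = split(Seq("pk,1000,37.6", "zs,2000,39.8"), 39.4F)
    println(s"normal: $mainOut")
    println(s"high:   $sideOut")
  }
}
```

As in the Flink version, the side output carries a different element type (a tuple) from the main output (a `Temperature`).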
