Flink: processing streams by eventTime & watermark

xuehuagongzi000

Article link: https://blog.csdn.net/xuehuagongzi000/article/details/101760962

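1. Log

A site for looking up the current Unix timestamp: http://coolaf.com/tool/unix
Append 000 to the timestamp to turn seconds into the milliseconds that the job below expects.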

000002 1595903822000
000002 1595903832000
000002 1595903842000
000002 1595903852000
000002 1595903862000
000002 1595903872000
000002 1595903882000
000002 1595903892000
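
As a minimal sketch of the "append 000" step, the helper below builds and parses log lines in the format shown above. The object and method names (`LogLine`, `toMillis`, `format`, `parse`) are hypothetical and are not part of the original job.

```scala
object LogLine {
  // Appending "000" to a timestamp in seconds is the same as multiplying by 1000.
  def toMillis(epochSeconds: Long): Long = epochSeconds * 1000L

  // Build one log line: "<key> <timestamp in milliseconds>".
  def format(key: String, epochSeconds: Long): String =
    s"$key ${toMillis(epochSeconds)}"

  // Split a log line back into (key, timestampMillis), as the job's map step does.
  def parse(line: String): (String, Long) = {
    val arr = line.split(" ")
    (arr(0), arr(1).trim.toLong)
  }
}
```

For example, `LogLine.format("000002", 1595903822L)` yields the first log line above.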

2. Code

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, StreamExecutionEnvironment, WindowedStream}
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.api.scala._

import scala.collection.mutable

// Uses the older Flink Scala DataStream API (setStreamTimeCharacteristic,
// BoundedOutOfOrdernessTimestampExtractor, keyBy(0), fold); later Flink releases deprecate these.
object StreamingApiWaterMark {

  def main(args: Array[String]): Unit = {
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Windows are driven by the timestamp inside each record, not by arrival time.
    environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    environment.setParallelism(1)

    // Each line read from the socket looks like "000002 1595903822000" (key, epoch milliseconds).
    val dstream = environment.socketTextStream("13.11.32.192", 1111)
    val textWithTsDstream: DataStream[(String, Long, Int)] = dstream.map { text =>
      val arr: Array[String] = text.split(" ")
      (arr(0), arr(1).trim.toLong, 1)
    }

    // Watermark = largest timestamp seen so far minus 1000 ms of tolerated out-of-orderness.
    val textWithEventTimeDstream: DataStream[(String, Long, Int)] = textWithTsDstream
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Int)](Time.milliseconds(1000)) {
          override def extractTimestamp(element: (String, Long, Int)): Long = element._2
        })

    // Key by the first tuple field (the "000002" id).
    val textKeyStream: KeyedStream[(String, Long, Int), Tuple] = textWithEventTimeDstream.keyBy(0)
    textKeyStream.print("textkey:")

    // 60-second tumbling event-time windows per key.
    val windowStream: WindowedStream[(String, Long, Int), Tuple, TimeWindow] = textKeyStream
      .window(TumblingEventTimeWindows.of(Time.seconds(60)))

    // Collect the distinct timestamps that fall into each window.
    val groupDstream: DataStream[mutable.HashSet[Long]] = windowStream
      .fold(new mutable.HashSet[Long]()) { case (set, (_, ts, _)) =>
        set += ts
      }

    groupDstream.print("window::::").setParallelism(1)

    environment.execute()
  }
}

3. Results

textkey:> (000002,1595903822000,1)
textkey:> (000002,1595903832000,1)
textkey:> (000002,1595903842000,1)
textkey:> (000002,1595903852000,1)
textkey:> (000002,1595903862000,1)
window::::> Set(1595903822000)
window::::> Set(1595903832000)
window::::> Set(1595903842000)
window::::> Set(1595903852000)
textkey:> (000002,1595903872000,1)
window::::> Set(1595903862000)
textkey:> (000002,1595903882000,1)
window::::> Set(1595903872000)
textkey:> (000002,1595903822000,1)
textkey:> (000002,1595903832000,1)
textkey:> (000002,1595903842000,1)
textkey:> (000002,1595903892000,1)
window::::> Set(1595903882000)

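4. Summary

The code configures a 60-second tumbling window. Note that the output in section 3 shows one timestamp per fired window, which is what a 10-second window size produces: with 60-second windows, 1595903822000 through 1595903872000 would all fall into [1595903820000, 1595903880000) and be emitted as a single set. A window fires when the first two conditions below hold:

1. watermark >= window_end_time (more precisely, Flink's event-time trigger fires once the watermark reaches window_end_time - 1 ms)
2. The range [window_start_time, window_end_time) contains data
3. A window holds the records whose eventTime lies in [start_time, end_time). A record in that range that has still not arrived when the watermark passes end_time (that is, when the largest eventTime seen reaches end_time plus the delay) is dropped, since no allowed lateness is configured here.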
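
The conditions above can be sketched in plain Scala, without Flink on the classpath. This is an illustration under assumptions, not Flink's own code: the object `WindowFiring` and its method names are hypothetical, the window offset is taken to be zero, and timestamps are assumed non-negative.

```scala
object WindowFiring {
  // Start of the tumbling window that contains `ts` (zero offset, ts >= 0).
  def windowStart(ts: Long, sizeMs: Long): Long = ts - (ts % sizeMs)

  // Exclusive end of that window.
  def windowEnd(ts: Long, sizeMs: Long): Long = windowStart(ts, sizeMs) + sizeMs

  // Watermark of a bounded-out-of-orderness extractor: max timestamp seen minus the delay.
  def watermark(maxTsSeen: Long, delayMs: Long): Long = maxTsSeen - delayMs

  // A window fires when it holds data and the watermark has reached
  // the last millisecond that belongs to it (end - 1).
  def shouldFire(windowEndMs: Long, watermarkMs: Long, hasData: Boolean): Boolean =
    hasData && watermarkMs >= windowEndMs - 1
}
```

With a 60-second window and a 1000 ms delay, the window that contains 1595903822000 ends at 1595903880000. It does not fire while the largest timestamp seen is 1595903872000, and it fires once 1595903882000 arrives.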

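Records that arrive after their window has already fired are dropped, for example the following three records from the output above: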

textkey:> (000002,1595903822000,1)
textkey:> (000002,1595903832000,1)
textkey:> (000002,1595903842000,1)
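
To see why these three records are dropped, the sketch below replays the arrival order from section 3 against a simplified lateness check. It assumes the watermark is updated right after every record and that allowed lateness is zero. `LateRecords` and its members are hypothetical names, and the logic is a simplification of what the window operator does.

```scala
object LateRecords {
  val SizeMs = 60000L   // tumbling window size used in the job
  val DelayMs = 1000L   // out-of-orderness bound used in the job

  // Returns the timestamps that arrive after their window has already been closed.
  def dropped(arrivals: Seq[Long]): Seq[Long] = {
    val (_, lateSoFar) = arrivals.foldLeft((Long.MinValue, Vector.empty[Long])) {
      case ((wm, acc), ts) =>
        val end = ts - (ts % SizeMs) + SizeMs
        // Late when the watermark has already reached the window's last millisecond.
        val isLate = end - 1 <= wm
        // The watermark never moves backwards.
        val nextWm = math.max(wm, ts - DelayMs)
        (nextWm, if (isLate) acc :+ ts else acc)
    }
    lateSoFar
  }
}
```

Feeding it the eleven timestamps in the order they appear in the `textkey:` lines returns exactly the three repeated timestamps listed above.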

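1. The watermark concept

The watermark is the maximum timestamp among all records that have arrived so far, minus the delay, so it increases monotonically.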
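
The "maximum timestamp minus delay" rule can be illustrated with a small stateful class. This is a sketch, not Flink's implementation; `BoundedDelayWatermark` and `onEvent` are hypothetical names.

```scala
final class BoundedDelayWatermark(delayMs: Long) {
  // Start high enough that subtracting the delay cannot underflow before the first record.
  private var maxTs: Long = Long.MinValue + delayMs

  // Record one event timestamp and return the current watermark.
  def onEvent(ts: Long): Long = {
    maxTs = math.max(maxTs, ts) // an out-of-order record never lowers the maximum
    maxTs - delayMs
  }
}
```

Because only the running maximum is stored, an out-of-order record (a smaller timestamp arriving later) leaves the watermark unchanged instead of moving it backwards.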

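Records are assigned to time windows (buckets) according to their eventTime.

A watermark is a timestamp computed as each record arrives. It means that all data before the watermark has arrived, and that no record with a timestamp less than or equal to the watermark is expected afterwards.

Suppose eventTime is 1501750584000 (2017-08-03 08:56:24.000 UTC) and the watermark strategy is an offset of 4 seconds. The watermark for this record is then 1501750584000 - 4000 = 1501750580000 (2017-08-03 08:56:20.000 UTC).

What does this watermark mean? It means that all data with a timestamp earlier than 2017-08-03 08:56:20.000 has already arrived.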
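
The arithmetic in this example can be checked with `java.time.Instant`, which renders epoch milliseconds in UTC. The object `WatermarkExample` and its field names are hypothetical and only serve the illustration.

```scala
import java.time.Instant

object WatermarkExample {
  val eventTime: Long = 1501750584000L // epoch milliseconds
  val offsetMs: Long = 4000L           // watermark offset of 4 seconds

  // Watermark for this record: event time minus the offset.
  val watermark: Long = eventTime - offsetMs

  // Render epoch milliseconds as an ISO-8601 instant in UTC.
  def render(ms: Long): String = Instant.ofEpochMilli(ms).toString
}
```

Here `WatermarkExample.watermark` is 1501750580000, and `render` shows it as 08:56:20 UTC on 2017-08-03, four seconds before the event time.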

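2. Watermark example:

3. Passing watermarks between upstream and downstream

When upstream and downstream operators have multiple parallel subtasks, an upstream subtask broadcasts its watermark to all downstream subtasks. A downstream subtask keeps the latest watermark received from each upstream partition and takes the minimum across all partitions as its own watermark.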
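
The per-partition bookkeeping described here can be sketched as follows. This is an illustration of the minimum rule only, not Flink's internal code; `DownstreamWatermark`, `onWatermark`, and the channel indexing are hypothetical.

```scala
final class DownstreamWatermark(numChannels: Int) {
  // Latest watermark seen on each upstream partition (input channel).
  private val perChannel: Array[Long] = Array.fill(numChannels)(Long.MinValue)
  // Watermark this subtask has emitted so far.
  private var current: Long = Long.MinValue

  // Record a watermark from one channel; return Some(w) if this subtask's own watermark advances to w.
  def onWatermark(channel: Int, wm: Long): Option[Long] = {
    perChannel(channel) = math.max(perChannel(channel), wm)
    val min = perChannel.min // the slowest upstream partition decides
    if (min > current) { current = min; Some(min) } else None
  }
}
```

With two channels, a watermark of 5000 on channel 0 alone does not advance the subtask. Only after channel 1 reports 3000 does the subtask's watermark move, and it moves to 3000, the smaller of the two.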

4. Determining the window time
