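1. Logs
Site for looking up the current timestamp: http://coolaf.com/tool/unix
The site returns the timestamp in seconds; append 000 to get the milliseconds that Flink event time expects.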
000002 1595903822000
000002 1595903832000
000002 1595903842000
000002 1595903852000
000002 1595903862000
000002 1595903872000
000002 1595903882000
000002 1595903892000
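Appending 000 is the same as multiplying the second-level timestamp by 1000. A minimal sketch that produces input lines in the format above; the `LogLines` object and its `toMillis` and `toLogLine` helpers are hypothetical names used only for illustration and are not part of the job in the next section.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical helper for illustration only; not part of the original job.
object LogLines {
  // Convert a Unix timestamp in seconds to milliseconds (same effect as appending "000").
  def toMillis(epochSeconds: Long): Long = TimeUnit.SECONDS.toMillis(epochSeconds)

  // Build one input line in the "<key> <epoch millis>" format.
  def toLogLine(key: String, epochSeconds: Long): String =
    s"$key ${toMillis(epochSeconds)}"

  def main(args: Array[String]): Unit = {
    // Eight records for key 000002, 10 seconds apart, starting at 1595903822.
    (0 until 8)
      .map(i => toLogLine("000002", 1595903822L + i * 10L))
      .foreach(println)
  }
}
```

The printed lines can be pasted into the socket that the job reads from.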
2. Code
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, StreamExecutionEnvironment, WindowedStream}
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import scala.collection.mutable

object StreamingApiWaterMark {
  def main(args: Array[String]): Unit = {
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Use event time (the timestamp carried in each record) instead of processing time.
    environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    environment.setParallelism(1)

    // Each input line has the form "<key> <epoch millis>", e.g. "000002 1595903822000".
    val dstream = environment.socketTextStream("13.11.32.192", 1111)
    val textWithTsDstream: DataStream[(String, Long, Int)] = dstream.map { text =>
      val arr: Array[String] = text.split(" ")
      (arr(0), arr(1).trim.toLong, 1)
    }

    // Watermark = largest timestamp seen so far minus 1000 ms of allowed out-of-orderness.
    val textWithEventTimeDstream: DataStream[(String, Long, Int)] = textWithTsDstream
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Int)](Time.milliseconds(1000)) {
          override def extractTimestamp(element: (String, Long, Int)): Long = element._2
        })

    // Key by the first tuple field and print every record as it arrives.
    val textKeyStream: KeyedStream[(String, Long, Int), Tuple] = textWithEventTimeDstream.keyBy(0)
    textKeyStream.print("textkey:")

    // 60-second tumbling event-time window per key.
    val windowStream: WindowedStream[(String, Long, Int), Tuple, TimeWindow] = textKeyStream
      .window(TumblingEventTimeWindows.of(Time.seconds(60)))

    // Collect the timestamps that fall into each window.
    val groupDstream: DataStream[mutable.HashSet[Long]] = windowStream
      .fold(new mutable.HashSet[Long]()) { case (set, (key, ts, count)) =>
        set += ts
      }
    groupDstream.print("window::::").setParallelism(1)

    environment.execute()
  }
}
3. Results
textkey:> (000002,1595903822000,1)
textkey:> (000002,1595903832000,1)
textkey:> (000002,1595903842000,1)
textkey:> (000002,1595903852000,1)
textkey:> (000002,1595903862000,1)
window::::> Set(1595903822000)
window::::> Set(1595903832000)
window::::> Set(1595903842000)
window::::> Set(1595903852000)
textkey:> (000002,1595903872000,1)
window::::> Set(1595903862000)
textkey:> (000002,1595903882000,1)
window::::> Set(1595903872000)
textkey:> (000002,1595903822000,1)
textkey:> (000002,1595903832000,1)
textkey:> (000002,1595903842000,1)
textkey:> (000002,1595903892000,1)
window::::> Set(1595903882000)
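4. Summary
The code defines one tumbling window every 60 s. Note that the output in section 3, where every fired window holds a single timestamp, is what a 10 s window (Time.seconds(10)) produces; with a 60 s window the records 1595903822000 through 1595903872000 would fire together as one set.
1. A window fires once watermark >= window_end_time - 1 ms, that is, once the watermark reaches the last timestamp that belongs to the window.
2. It fires only if there is data in [window_start_time, window_end_time).
3. A window contains the records whose event time falls in [start_time, end_time). A record in that range that has not arrived by the time the watermark passes the end of the window (roughly, once an event time of end_time + delay has been seen) is dropped.
Such late records are not added to any later window either; they are ignored, like the following three records: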
textkey:> (000002,1595903822000,1)
textkey:> (000002,1595903832000,1)
textkey:> (000002,1595903842000,1)
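The firing and dropping rules above can be checked without a cluster. The sketch below is a plain Scala simulation, not Flink code: `WindowSketch`, `windowStart` and `simulate` are hypothetical names, and it assumes the watermark advances after every record, whereas Flink emits watermarks periodically. The window start uses the same arithmetic as Flink's tumbling windows with offset 0. A record counts as late when the watermark has already reached its window's last millisecond.

```scala
import scala.collection.mutable

// Hypothetical simulation of tumbling event-time windows; not Flink code.
object WindowSketch {
  // Start of the tumbling window that contains `ts` (offset 0, non-negative ts).
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Returns (fired windows as (start, contents) in firing order, dropped late records).
  def simulate(timestamps: Seq[Long], size: Long, delay: Long): (List[(Long, Set[Long])], List[Long]) = {
    val open = mutable.TreeMap.empty[Long, Set[Long]] // window start -> buffered timestamps
    val fired = mutable.ListBuffer.empty[(Long, Set[Long])]
    val dropped = mutable.ListBuffer.empty[Long]
    var watermark = Long.MinValue

    for (ts <- timestamps) {
      val start = windowStart(ts, size)
      val maxTimestamp = start + size - 1 // last millisecond that belongs to the window
      if (maxTimestamp <= watermark) {
        dropped += ts // the window has already fired: the record is late and ignored
      } else {
        open(start) = open.getOrElse(start, Set.empty[Long]) + ts
      }
      // Watermark = largest timestamp seen so far minus the delay; it never moves back.
      watermark = math.max(watermark, ts - delay)
      // Fire every buffered window whose last millisecond the watermark has reached.
      val ready = open.keys.filter(s => s + size - 1 <= watermark).toList
      for (s <- ready) {
        fired += ((s, open(s)))
        open -= s
      }
    }
    (fired.toList, dropped.toList)
  }

  def main(args: Array[String]): Unit = {
    val base = 1595903822000L
    val first = (0 to 6).map(i => base + i * 10000L) // 822 ... 882
    val replayed = (0 to 2).map(i => base + i * 10000L) // 822, 832, 842 sent again
    val input = (first ++ replayed) :+ (base + 70000L) // finally 892
    val (fired, dropped) = simulate(input, size = 10000L, delay = 1000L)
    fired.foreach { case (start, content) => println(s"window [$start, ${start + 10000L}): $content") }
    println(s"dropped: $dropped")
  }
}
```

With `size = 10000L` the simulation fires seven windows holding one timestamp each (1595903822000 through 1595903882000) and drops the three replayed records. With `size = 60000L` it fires a single window starting at 1595903820000 that holds the six timestamps from 1595903822000 to 1595903872000.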
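1. Watermark concept
The watermark is the maximum timestamp among all records that have arrived so far minus the delay, so it only ever increases.
The data in a window is time-window data bucketed by event time.
A watermark is a timestamp that travels with the data stream. It means: all data before the watermark has arrived, and no record with a timestamp less than or equal to it should arrive afterwards.
For example, with an eventTime of 1501750584000 (2017-08-03 08:56:24.000 UTC) and a watermark strategy offset of 4 seconds, the watermark for this record is 1501750584000 - 4000 = 1501750580000 (2017-08-03 08:56:20.000 UTC).
What does this watermark mean? That all data with a timestamp at or before 2017-08-03 08:56:20.000 has already arrived.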
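The arithmetic above can be sketched as a small generator that tracks the largest event time seen and subtracts the delay. `WatermarkSketch` and `BoundedOutOfOrderness` are hypothetical names for illustration; this is not Flink's own class, only the same "max timestamp minus delay" rule.

```scala
import java.time.Instant

// Hypothetical watermark generator for illustration; not a Flink class.
object WatermarkSketch {
  final class BoundedOutOfOrderness(delayMillis: Long) {
    // Start high enough that subtracting the delay cannot underflow.
    private var maxTimestamp: Long = Long.MinValue + delayMillis

    // Remember the largest event time seen so far.
    def onEvent(eventTime: Long): Unit = {
      maxTimestamp = math.max(maxTimestamp, eventTime)
    }

    // Watermark = largest event time seen so far minus the delay.
    def currentWatermark: Long = maxTimestamp - delayMillis
  }

  def main(args: Array[String]): Unit = {
    val generator = new BoundedOutOfOrderness(4000L)
    generator.onEvent(1501750584000L)
    println(s"${generator.currentWatermark} = ${Instant.ofEpochMilli(generator.currentWatermark)}")

    // An out-of-order record with a smaller event time does not move the watermark back.
    generator.onEvent(1501750582000L)
    println(generator.currentWatermark)
  }
}
```

Both prints show a watermark of 1501750580000, which is why the watermark stays monotonic even when records arrive out of order.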
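2. Watermark example:
3. Watermark propagation between upstream and downstream
When the upstream and downstream operators run as multiple parallel subtasks, each upstream subtask broadcasts its watermark to all downstream subtasks. A downstream subtask keeps the latest watermark of every input partition and uses the minimum across all partitions as its own watermark.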
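A minimal sketch of that minimum rule, assuming a downstream subtask with a fixed number of input partitions. `WatermarkPropagationSketch`, `DownstreamSubtask` and `onWatermark` are hypothetical names for illustration, not Flink's API.

```scala
// Hypothetical model of one downstream subtask; not Flink's implementation.
object WatermarkPropagationSketch {
  final class DownstreamSubtask(numPartitions: Int) {
    // Latest watermark received from each input partition.
    private val partitionWatermarks: Array[Long] = Array.fill(numPartitions)(Long.MinValue)
    private var current: Long = Long.MinValue

    // Record a watermark from one partition. Returns Some(newWatermark) when the
    // subtask's own watermark advances, None otherwise.
    def onWatermark(partition: Int, watermark: Long): Option[Long] = {
      partitionWatermarks(partition) = math.max(partitionWatermarks(partition), watermark)
      val minimum = partitionWatermarks.min // the slowest partition decides
      if (minimum > current) {
        current = minimum
        Some(minimum)
      } else {
        None
      }
    }

    def currentWatermark: Long = current
  }

  def main(args: Array[String]): Unit = {
    val subtask = new DownstreamSubtask(3)
    println(subtask.onWatermark(0, 5L)) // partitions 1 and 2 have not reported yet
    println(subtask.onWatermark(1, 3L))
    println(subtask.onWatermark(2, 7L)) // all partitions reported: minimum is 3
    println(subtask.onWatermark(1, 6L)) // partition 1 catches up: minimum is now 5
  }
}
```

Until every partition has reported, the subtask's watermark stays at its initial value, so one idle or slow partition holds back every window downstream.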
4. Determining the window time