1. Scenario
Take WaterMark and EventTime in an ordered data stream as an example:
Set the stream's time characteristic to EventTime, so windows are driven by the timestamps carried in the data. In an ordered stream the watermark simply follows the event time (strictly, Flink emits event time - 1, as the output below shows). Set the window size to 5s.
The first event time is 1461756859000 (2016-04-27 19:34:19).
The second event time is 1461756860000 (2016-04-27 19:34:20).
The second event advances the watermark past the window end, the window fires, and it returns:
Number of records in the window: 1
First record in the window: (000001,1461756859000)
Last record in the window: (000001,1461756859000)
Window start time: 1461756855000
Window end time: 1461756860000
Current watermark: 1461756859999
(000001,1)
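The watermark value 1461756859999 also explains why the window fires on the second event: with ascending timestamps, Flink emits watermark = latest event time - 1, and an event-time window fires once the watermark reaches the window's max timestamp (end - 1). A minimal sketch of that arithmetic in plain Scala (illustrative object name, not the actual Flink trigger classes):

object WhyTheWindowFires {
  def main(args: Array[String]): Unit = {
    val secondEvent = 1461756860000L   // the event that advances the watermark

    // The ascending-timestamp extractor emits (latest event time - 1).
    val watermark = secondEvent - 1    // 1461756859999

    // The event-time trigger fires once the watermark reaches window end - 1.
    val windowEnd    = 1461756860000L
    val maxTimestamp = windowEnd - 1   // 1461756859999

    println(watermark >= maxTimestamp) // true -> window [19:34:15, 19:34:20) fires
  }
}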
Question: the first event time is 1461756859000, so why is the window start time 1461756855000 (2016-04-27 19:34:15)?
2. Source Code Analysis
Take tumbling time windows as an example.
From the Flink source (TimeWindow.getWindowStartWithOffset), the window start time formula is:
window start = timestamp - (timestamp - offset + windowSize) % windowSize
Plugging in this example's data:
timestamp = 1461756859000
offset = 0 (tumbling windows default to an offset of 0)
windowSize = 5000
window start = 1461756859000 - (1461756859000 - 0 + 5000) % 5000 = 1461756859000 - 4000 = 1461756855000 (2016-04-27 19:34:15)
Therefore this window is [1461756855000, 1461756860000).
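The formula is easy to verify outside Flink. A minimal Scala sketch (WindowStartDemo is an illustrative name; the formula itself is the one quoted above from the TimeWindow source):

object WindowStartDemo {
  // Mirrors the window-start formula used by Flink's tumbling time windows.
  def getWindowStartWithOffset(timestamp: Long, offset: Long, windowSize: Long): Long =
    timestamp - (timestamp - offset + windowSize) % windowSize

  def main(args: Array[String]): Unit = {
    val start = getWindowStartWithOffset(1461756859000L, 0L, 5000L)
    println(start)         // 1461756855000 (2016-04-27 19:34:15)
    println(start + 5000L) // exclusive window end: 1461756860000
  }
}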
3. Reference Code
package com.kkb.time

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object OrderedStreamWaterMark {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    // Use event time: windows are driven by the timestamps carried in the data.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val sourceDataStream: DataStream[String] = env.socketTextStream("node01", 9999)
    // Each input line is "key,eventTime", e.g. "000001,1461756859000".
    val mapStream: DataStream[(String, Long)] =
      sourceDataStream.map(x => (x.split(",")(0), x.split(",")(1).toLong))

    /**
     * assignAscendingTimestamps is a shortcut for data streams where the element
     * timestamps are known to be monotonously ascending within each parallel stream.
     *
     * In an ordered data stream, the watermark tracks the event time.
     */
    val watermarkStream: DataStream[(String, Long)] = mapStream.assignAscendingTimestamps(x => x._2)

    watermarkStream
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .process(new ProcessWindowFunction[(String, Long), (String, Long), Tuple, TimeWindow] {
        override def process(key: Tuple, context: Context, elements: Iterable[(String, Long)],
                             out: Collector[(String, Long)]): Unit = {
          // The grouping key.
          val value: String = key.getField[String](0)
          // Window start time.
          val windowStartTime: Long = context.window.getStart
          // Window end time (exclusive).
          val windowEndTime: Long = context.window.getEnd
          // The current watermark.
          val watermark: Long = context.currentWatermark

          // Count the elements that landed in this window.
          val toList: List[(String, Long)] = elements.toList
          val sum: Long = toList.size.toLong

          println("Number of records in the window: " + sum +
            "|First record in the window: " + toList.head +
            "|Last record in the window: " + toList.last +
            "|Window start time: " + windowStartTime +
            "|Window end time: " + windowEndTime +
            "|Current watermark: " + watermark)
          out.collect((value, sum))
        }
      })
      .print()

    env.execute()
  }
}
Send the test data:
[hadoop@node01 ~]$ nc -lk 9999
000001,1461756859000
000001,1461756860000
The second record advances the watermark to 1461756859999, which closes the window [1461756855000, 1461756860000) and produces the output shown in section 1.