1. Understanding Watermarks
A watermark is a special record inserted into the data stream. It carries a timestamp whose meaning is: after this watermark, no records with a timestamp less than or equal to it should arrive. Suppose the out-of-orderness of the data is about 1 minute, i.e. after waiting 1 minute the vast majority of late records have arrived; then we can define the watermark as the maximum event time seen so far, offset back by 1 minute.
Watermark property: monotonically non-decreasing
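To make the definition concrete, here is a minimal sketch (plain Scala, no Flink dependency; the class and method names are illustrative, not Flink API) of how a periodic watermark is derived from the maximum event time seen so far, and why it never moves backwards:

```scala
// Sketch: periodic watermark = max event time seen so far - allowed out-of-orderness.
// All values are epoch milliseconds; names are illustrative, not Flink API.
class WatermarkTracker(maxOutOfOrdernessMs: Long) {
  private var currentMaxTimestamp: Long = Long.MinValue + maxOutOfOrdernessMs

  def onEvent(eventTimeMs: Long): Unit = {
    // a late event never pulls the max back, so the watermark is monotonic
    currentMaxTimestamp = math.max(currentMaxTimestamp, eventTimeMs)
  }

  def currentWatermark: Long = currentMaxTimestamp - maxOutOfOrdernessMs
}

val tracker = new WatermarkTracker(60 * 1000L) // 1-minute offset, as in the text above
tracker.onEvent(100000L)
println(tracker.currentWatermark) // 100000 - 60000 = 40000
tracker.onEvent(50000L)           // an out-of-order (late) event arrives
println(tracker.currentWatermark) // still 40000: the watermark never regresses
```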
2. Example 1
import java.sql.Timestamp
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, OutputTag, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala.function.{ProcessWindowFunction, RichWindowFunction}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, ProcessingTimeSessionWindows, SlidingEventTimeWindows, TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
/**
*
1.Time: understanding the three notions of time
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/event_time.html
(1) Event Time: the time at which the data/event was actually produced; production jobs use event time in the vast majority of cases
a. Understanding:
access: time
The event time should be a part (field) of the data/event itself
Watermark: necessarily tied to event time. A watermark can be loosely understood as: data within a certain time bound is processed in the window; data arriving past that bound is not, and its result is routed elsewhere (the allowed lateness)
b. Advantages:
Deterministic results
Handles out-of-order and late data
c. Disadvantages:
Latency, because the watermark waits for data within a time bound before triggering; with processing time each record is processed as it arrives, which is relatively faster
d. Example:
Consider a user's follow-action trail on an e-commerce site:
1: follow stockings
2: unfollow stockings
3: follow dress
4: unfollow dress
5: follow dress
In the correct order 1 -> 2 -> 3 -> 4 -> 5, the end state is "following the dress".
Due to network jitter, machine hardware, etc., the arrival order at Flink becomes 1 -> 2 -> 3 -> 5 -> 4, and the end state becomes "not following the dress" -- the exact opposite conclusion.
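The follow/unfollow trail above can be replayed with a tiny fold to show how arrival order flips the result (plain Scala; the `replay` helper and the op encoding are illustrative):

```scala
// Sketch: replaying the follow/unfollow trail; the final state depends on order.
// Each op is ("follow" | "unfollow", item), applied to a set of followed items.
def replay(ops: Seq[(String, String)]): Set[String] =
  ops.foldLeft(Set.empty[String]) {
    case (followed, ("follow", item))   => followed + item
    case (followed, ("unfollow", item)) => followed - item
    case (followed, _)                  => followed
  }

val eventOrder = Seq(
  "follow" -> "stockings", "unfollow" -> "stockings",
  "follow" -> "dress", "unfollow" -> "dress", "follow" -> "dress")
// ops 4 and 5 swapped, as in the network-jitter example above
val arrivalOrder = Seq(eventOrder(0), eventOrder(1), eventOrder(2), eventOrder(4), eventOrder(3))

println(replay(eventOrder))   // Set(dress): the correct end state
println(replay(arrivalOrder)) // Set(): arrival order gives the opposite conclusion
```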
(2) Ingestion Time: the time the record enters Flink
Tied to the machine clock, e.g. at the Flink source operator
(3) Processing Time: the time at which the operator actually processes the record
2.Window
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html
Splits an unbounded data stream into bounded data sets
Window classification
(1) Keyed or not (keyBy)
a.Keyed Windows
stream
.keyBy(...) <- keyed versus non-keyed windows
.window(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
b.Non-Keyed Windows
stream
.windowAll(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
(2) Time vs. count
a. Time-based window
Measured in time, e.g. one window every 5 s
b. Count-based window
Measured in number of events, e.g. 10 records per window
(3) Window Assigners
a. Tumbling Windows: the most common in production
Aligned to time, fixed length, no overlap
b. Sliding Windows
Aligned to time, fixed length, overlapping data
Note the partial windows at the start:
Suppose the window is 10 s with a 5 s slide:
first window: 0-5
second window: 0-10
third window: 5-15
This is observable when using event time
c. Session Windows
e.g. find the data of any user session longer than 30 minutes
d. Global Windows
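The leading "partial windows" of a sliding window follow directly from the assignment rule. A sketch of that rule in plain Scala (mirroring the logic of Flink's SlidingEventTimeWindows with zero offset; `assignSlidingWindows` is an illustrative helper, not Flink API):

```scala
// Sketch of sliding-window assignment (plain Scala, no Flink dependency).
// Returns the [start, end) pairs whose window contains the given timestamp.
def assignSlidingWindows(ts: Long, sizeMs: Long, slideMs: Long): Seq[(Long, Long)] = {
  // the latest window start at or before ts (works for negative ts too)
  val lastStart = ts - ((ts % slideMs + slideMs) % slideMs)
  // walk backwards by the slide until windows no longer cover ts
  (lastStart to (ts - sizeMs + 1) by -slideMs).map(s => (s, s + sizeMs)).reverse
}

// Window 10 s, slide 5 s, as in the example above: a record at t = 1 s
// falls into [-5s, 5s) and [0s, 10s) -- the leading partial window.
println(assignSlidingWindows(1000L, 10000L, 5000L))
```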
(4) Window Functions
The core of windowing is cutting an unbounded stream into bounded data sets to compute over; without a window, computation is record-at-a-time.
Spark Streaming is batch-based, i.e. it always computes over bounded data sets
a. Incremental
ReduceFunction
AggregateFunction
Computes once per arriving record, keeping state
b. Full-window
ProcessWindowFunction
AllWindowFunction
Computes once, after all the window's data has arrived
e.g. sorting inside a window can only be done with a full-window function
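The incremental vs. full-window distinction above can be sketched without Flink (plain Scala; the helper names are illustrative): incremental evaluation folds each element into a running accumulator, while full-window evaluation must buffer everything, which is why sorting requires it.

```scala
// Incremental (ReduceFunction-style): fold each element into a running state
// as it arrives; only the accumulator is kept, not the elements.
def incrementalSum(buffer: Long, next: Long): Long = buffer + next

// Full-window (ProcessWindowFunction-style): buffer everything, evaluate once
// when the window fires -- required for operations like sorting.
def fullWindowSorted(elements: Seq[Long]): Seq[Long] = elements.sorted

val window = Seq(3L, 1L, 2L)
println(window.foldLeft(0L)(incrementalSum)) // 6, computed one element at a time
println(fullWindowSorted(window))            // needs all elements before it can run
```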
3. Watermark (WM)
a. Overview
Watermark handling only applies together with windows; it allows data to be late
EventTime: carried in a time field of the record
Window: 5 s, [0,5), waiting up to 10 s
A mechanism for measuring event-time progress, used together with event time; the watermark concept comes from the Google Dataflow model
b. Purpose: decides how long to wait before triggering the whole window's computation
c. Mechanism: the watermark (WM)
d. Computation:
WM = maximum event time observed so far - allowed out-of-orderness
The WM only ever increases (it is non-decreasing)
e. Window trigger condition: WM >= the window's end boundary (strictly, Flink fires once WM >= end - 1 ms, the window's maximum timestamp), i.e.
maximum event time observed - allowed out-of-orderness >= the window's end boundary
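The trigger rule above, as a runnable sketch (plain Scala, no Flink; `watermark` and `windowFires` are illustrative helpers). A window [start, end) fires once the watermark reaches end - 1 ms, the window's maximum timestamp:

```scala
// WM = max event time observed so far - allowed out-of-orderness (both in ms)
def watermark(maxEventTime: Long, allowedOutOfOrdernessMs: Long): Long =
  maxEventTime - allowedOutOfOrdernessMs

// a window [start, end) fires once WM >= end - 1 (the window's max timestamp)
def windowFires(windowEndMs: Long, wm: Long): Boolean = wm >= windowEndMs - 1

// Tumbling window [0, 5000) with 2000 ms allowed out-of-orderness
// (as in "scenario 4" further below): the record at 6999 is what fires it.
println(windowFires(5000L, watermark(4999L, 2000L))) // false: WM = 2999
println(windowFires(5000L, watermark(6999L, 2000L))) // true:  WM = 4999 >= 4999
```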
4. Watermark timestamp-assigner interfaces
AssignerWithPeriodicWatermarks
(1) abstract class BoundedOutOfOrdernessTimestampExtractor<T> implements AssignerWithPeriodicWatermarks<T> // bounded out-of-order timestamp extractor
(2) abstract class AscendingTimestampExtractor<T> implements AssignerWithPeriodicWatermarks<T> // ascending timestamp extractor
(3) class RuozedataAssignerWithPeriodicWatermarks2(maxAllowedUnOrderedTime: Long) extends AssignerWithPeriodicWatermarks[XX2] // custom timestamp/watermark assigner
*/
object WMApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// 1. Tell the engine to use event time for processing; the default is processing time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//2.1 session window, extracting event time in code
//sessionWindow(env)
//2.2 tumbling window, extracting event time in code
//tumblingWindow(env)
//same data, implemented with a custom watermark assigner
slidingWindowCustom(env)
// 2.3 sliding window
//slidingWindow(env)
// 2.4 tumbling window, for inspecting the concrete WM timestamps
// wm(env)
env.execute(getClass.getCanonicalName)
}
/**
Watermark
Tumbling window of 3 s, allowed out-of-orderness 10 s
Concrete window boundaries:
Window1 [00,03)
Window2 [03,06)
Window3 [06,09)
Window4 [09,12)
...
Window20 [57,60)
Input:
1,36.8,a,1582641128,xx
event time: 2020-02-25 22:32:08.0, max event time in window: 2020-02-25 22:32:08.0, WM: 2020-02-25 22:31:58.0
2,37.8,b,1582641129,yy
event time: 2020-02-25 22:32:09.0, max event time in window: 2020-02-25 22:32:09.0, WM: 2020-02-25 22:31:59.0
3,38.8,c,1582641129,zz
event time: 2020-02-25 22:32:09.0, max event time in window: 2020-02-25 22:32:09.0, WM: 2020-02-25 22:31:59.0
4,38.8,d,1582641139,zz WM 2020-02-25 22:32:09.0 >= boundary of Window3 [06,09)
event time: 2020-02-25 22:32:19.0, max event time in window: 2020-02-25 22:32:19.0, WM: 2020-02-25 22:32:09.0
sensor: 1, avg temp: 36.8, window start: 2020-02-25 22:32:06.0, window end: 2020-02-25 22:32:09.0
5,38.8,e,1582641142,zz WM 2020-02-25 22:32:12.0 >= boundary of Window4 [09,12)
event time: 2020-02-25 22:32:22.0, max event time in window: 2020-02-25 22:32:22.0, WM: 2020-02-25 22:32:12.0
sensor: 2, avg temp: 37.8, window start: 2020-02-25 22:32:09.0, window end: 2020-02-25 22:32:12.0
sensor: 3, avg temp: 38.8, window start: 2020-02-25 22:32:09.0, window end: 2020-02-25 22:32:12.0
* @param environment
*/
def wm(environment: StreamExecutionEnvironment): Unit = {
val maxOutOfOrderness = 10 * 1000
val lateData = new OutputTag[XX]("late-data")
val ds: DataStream[String] = environment.socketTextStream("hadoop", 9527)
.map(x => {
val splits = x.split(",")
XX(splits(0), splits(1).trim.toDouble, splits(2), splits(3).trim.toLong, splits(4))
})
.assignTimestampsAndWatermarks(new RuozedataAssignerWithPeriodicWatermarks(maxOutOfOrderness))
.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.sideOutputLateData(lateData)
.apply(new RichWindowFunction[XX, String, Tuple, TimeWindow] {
override def apply(key: Tuple, window: TimeWindow, input: Iterable[XX], out: Collector[String]): Unit = {
// get the monitoring point id from the key
val id = key.getField[String](0)
// average temperature = total temperature / number of readings
val totalCnt = input.size
var totalTemp = 0.0
input.foreach(x => totalTemp = totalTemp + x.temperature)
val avgTemp = totalTemp / totalCnt
val start = new Timestamp(window.getStart)
val end = new Timestamp(window.getEnd)
out.collect(s"sensor: $id, avg temp: $avgTemp, window start: $start, window end: $end") // id
}
})
ds.print()
ds.getSideOutput(lateData).print("late-data")
}
/**
Window trigger rules:
1) the event-time gap between two consecutive records exceeds the session gap
2) the window contains data
Scenario 1: timestamps strictly increasing, env parallelism 1
time field, id, name ...
time field, word, count
1000,a,1
2000,a,2
3000,a,3
4000,a,4
4999,a,6 session window [1000,9999) (last timestamp 4999 + 5 s gap)
10000,a,7 --> the gap between 4999 and 10000 exceeds 5 s, triggering the [1000,9999) window
output -> 1970-01-01 08:00:01.0==>1970-01-01 08:00:09.999 , a --> 16
*
* This reduce overload is the one being called:
* def reduce[R: TypeInformation](
* preAggregator: ReduceFunction[T], incremental pre-aggregation
* function: ProcessWindowFunction[T, R, K, W]) full-window processing
* : DataStream[R] =
*/
def sessionWindow(environment: StreamExecutionEnvironment): Unit = {
environment.socketTextStream("hadoop", 9527)
.assignTimestampsAndWatermarks(
// (use this for non-ascending timestamps; the common choice in production) handles out-of-order data, assigning timestamps and watermarks
// allowed out-of-orderness, e.g. 10 s Time.seconds(10), or milliseconds Time.milliseconds(10); records later than this bound may be dropped
// use the event time as the basis for watermarks
new BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(0)) {
override def extractTimestamp(element: String): Long = {
element.split(",")(0).toLong // extract the event time from the record
}
}
).map(x => {
val splits = x.split(",")
(splits(1).trim, splits(2).trim.toInt)
}).keyBy(0)
//trigger by event time, not processing time
.window(EventTimeSessionWindows.withGap(Time.seconds(5)))
.reduce(new ReduceFunction[(String, Int)] {
override def reduce(value1: (String, Int), value2: (String, Int)): (String, Int) = {
(value1._1, value1._2 + value2._2)
}
}, new ProcessWindowFunction[(String,Int), String, Tuple, TimeWindow] {
override def process(key: Tuple, context: Context, elements: Iterable[(String, Int)], out: Collector[String]): Unit = {
for(ele <- elements) {
out.collect(new Timestamp(context.window.getStart) +
"==>" + new Timestamp(context.window.getEnd) +
" , " + ele._1 + " --> " + ele._2)
}
}
}).print()
}
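The session-gap rule above can be sketched without Flink: a new session starts whenever the gap between consecutive event times reaches the session gap (plain Scala; `sessionize` is an illustrative helper and assumes already-sorted timestamps).

```scala
// Sketch of session-window grouping (plain Scala, no Flink dependency).
def sessionize(sortedTimes: Seq[Long], gapMs: Long): Seq[Seq[Long]] =
  sortedTimes.foldLeft(Vector.empty[Vector[Long]]) { (sessions, t) =>
    sessions.lastOption match {
      case Some(last) if t - last.last < gapMs =>
        sessions.init :+ (last :+ t) // within the gap: extend the current session
      case _ =>
        sessions :+ Vector(t)        // gap reached (or first event): new session
    }
  }

// The input from "scenario 1" above, gap 5 s: two sessions, 1000..4999
// (session window [1000, 9999)) and 10000 on its own.
println(sessionize(Seq(1000L, 2000L, 3000L, 4000L, 4999L, 10000L), 5000L))
```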
/**
Window trigger rules:
1) record event time >= the previous window's end boundary
2) the window contains data
Scenario 1: timestamps strictly increasing (1000 means 1 s), env parallelism 1
1000,a,1
2000,a,1
4998,a,1
4999,a,1 Window1 [0,5000) --> 1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , a --> 4; the last timestamp 4999 triggers Window1
6999,a,1
8888,a,1
9998,a,1 Window2 [5000,10000)
12000,a,1 --> 1970-01-01 08:00:05.0==>1970-01-01 08:00:10.0 , a --> 3; timestamp 12000 is past Window2's range, triggering Window2
Scenario 2: timestamps out of order, allowed out-of-orderness 0 s, env parallelism 1; out-of-order records are simply dropped
1000,a,1
2000,a,1
3000,b,1
4999,c,1 Window1 [0,5000) --> the last timestamp 4999 = Window1's boundary, triggering Window1, output:
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , a --> 2
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , c --> 1
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , b --> 1
4000,a,1 --> ----out-of-order record dropped-----
4444,a,1 --> ----out-of-order record dropped-----
8888,a,1
9999,b,1 Window2 [5000,10000) --> the last timestamp 9999 = Window2's boundary, triggering Window2, output:
1970-01-01 08:00:05.0==>1970-01-01 08:00:10.0 , a --> 1
1970-01-01 08:00:05.0==>1970-01-01 08:00:10.0 , b --> 1
Scenario 3: timestamps out of order, allowed out-of-orderness 0 s, env parallelism 1; out-of-order records recovered via the side output for reprocessing
1000,a,1
2000,a,1
3000,b,1
4999,c,1 Window1 [0,5000) --> the last timestamp 4999 = Window1's boundary, triggering Window1, output:
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , a --> 2
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , c --> 1
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , b --> 1
4000,a,1 --> ----late data, captured via side output-----> (a,1)
4444,a,1 --> ----late data, captured via side output-----> (a,1)
8888,a,1
9999,b,1 Window2 [5000,10000) --> the last timestamp 9999 = Window2's boundary, triggering Window2, output:
1970-01-01 08:00:05.0==>1970-01-01 08:00:10.0 , a --> 1
1970-01-01 08:00:05.0==>1970-01-01 08:00:10.0 , b --> 1
Scenario 4: timestamps out of order, allowed out-of-orderness 2 s, env parallelism 1; out-of-order records recovered via the side output for reprocessing
1000,a,1
2000,a,1
3000,b,1
4999,c,1 Window1 [0,5000)
4000,a,1 -- out-of-order record
4444,a,1 -- out-of-order record
6999,a,1 with 2 s of allowed out-of-orderness, the record at 6999 triggers Window1: max event time in window 6999 - allowed out-of-orderness 2000 = 4999 reaches Window1's boundary; this value (max event time - allowed out-of-orderness) is the watermark
# Note: a --> 4, not a --> 5 -- why?
Window1 itself is unchanged; only its trigger time changed (from >= 4999 to >= 6999), and the computation still covers only [0,5000)
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , a --> 4
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , b --> 1
1970-01-01 08:00:00.0==>1970-01-01 08:00:05.0 , c --> 1
8888,a,1
9999,b,1 Window2 [5000,10000)
13000,d,1 with 2 s of allowed out-of-orderness, 13000 - 2000 exceeds Window2's boundary, triggering Window2
1970-01-01 08:00:05.0==>1970-01-01 08:00:10.0 , a --> 2
1970-01-01 08:00:05.0==>1970-01-01 08:00:10.0 , b --> 1
*
* @param environment
*/
def tumblingWindow(environment: StreamExecutionEnvironment): Unit = {
val outputTag = new OutputTag[(String, Int)]("late-data")
val window = environment.socketTextStream("hadoop", 9527)
.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(2)) {
override def extractTimestamp(element: String): Long = {
element.split(",")(0).toLong // 获取到数据中的时间
}
}
).map(x => {
val splits = x.split(",")
(splits(1).trim, splits(2).trim.toInt)
}).keyBy(0)
//5-second tumbling window
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sideOutputLateData(outputTag) // out-of-order (late) records
.reduce(new ReduceFunction[(String, Int)] {
override def reduce(value1: (String, Int), value2: (String, Int)): (String, Int) = {
(value1._1, value1._2 + value2._2)
}
}, new ProcessWindowFunction[(String,Int), String, Tuple, TimeWindow] {
override def process(key: Tuple, context: Context, elements: Iterable[(String, Int)], out: Collector[String]): Unit = {
for(ele <- elements) {
out.collect(new Timestamp(context.window.getStart) +
"==>" + new Timestamp(context.window.getEnd) +
" , " + ele._1 + " --> " + ele._2)
}
}
}) //.print()
window.print()
window.getSideOutput(outputTag).print("----late data, captured via side output-----")
}
/**
Scenario 1: sliding window, size 6 s, slide 2 s, allowed out-of-orderness 0, env parallelism 1
# Note the peculiarity of sliding windows: partial windows at the start
Theoretical windows:
[-6,0)
[-4,2)
[-2,4)
[0,6)
[2,8)
[4,10)
[6,12)
Actual windows; whichever window is triggered, its contents are computed (records are assigned by their event time):
Window1 [0,2)
Window2 [0,4)
Window3 [0,6)
Window4 [2,8)
Window5 [4,10)
Window6 [6,12)
Window7 [8,14)
Window8 [10,16)
......
Input 1:
1000,a,1 Window1 [0,2)
4000,a,1 Window2 [0,4) --> timestamp 4000 > 3999 and 4000 > 1999, firing Window1 and Window2, output:
1970-01-01 07:59:56.0==>1970-01-01 08:00:02.0 , a --> 1
1970-01-01 07:59:58.0==>1970-01-01 08:00:04.0 , a --> 1
5555,a,1
7777,a,1 Window3 [0,6) --> timestamp 7777 > 5999, firing Window3, output:
1970-01-01 08:00:00.0==>1970-01-01 08:00:06.0 , a --> 3
9998,a,1 Window4 [2,8) --> timestamp 9998 > 7999, firing Window4, output:
1970-01-01 08:00:02.0==>1970-01-01 08:00:08.0 , a --> 3
12000,a,1 Window5 [4,10)
Window6 [6,12) --> timestamp 12000 > 11999, firing Window5 and Window6, output:
1970-01-01 08:00:04.0==>1970-01-01 08:00:10.0 , a --> 4
1970-01-01 08:00:06.0==>1970-01-01 08:00:12.0 , a --> 2
14999,a,1 Window7 [8,14) --> timestamp 14999 > 13999, firing Window7, output:
1970-01-01 08:00:08.0==>1970-01-01 08:00:14.0 , a --> 2
15000,a,1 Window8 [10,16) --> timestamp 15000 < 15999, Window8 not fired, no output
Input 2:
1999,a,1 Window1 [0,2) --> timestamp 1999 = 1999, firing Window1, output:
1970-01-01 07:59:56.0==>1970-01-01 08:00:02.0 , a --> 1
4000,a,1 Window2 [0,4) --> timestamp 4000 > 3999, firing Window2, output:
1970-01-01 07:59:58.0==>1970-01-01 08:00:04.0 , a --> 1
5555,a,1
7777,a,1 Window3 [0,6) --> timestamp 7777 > 5999, firing Window3, output:
1970-01-01 08:00:00.0==>1970-01-01 08:00:06.0 , a --> 3
9998,a,1 Window4 [2,8) --> timestamp 9998 > 7999, firing Window4, output:
1970-01-01 08:00:02.0==>1970-01-01 08:00:08.0 , a --> 3
12000,a,1 Window5 [4,10)
Window6 [6,12) --> timestamp 12000 > 11999, firing Window5 and Window6, output:
1970-01-01 08:00:04.0==>1970-01-01 08:00:10.0 , a --> 4
1970-01-01 08:00:06.0==>1970-01-01 08:00:12.0 , a --> 2
Input 3:
3000,a,1
3999,a,1 Window2 [0,4) --> timestamp 3999 = 3999, firing Window2, output:
1970-01-01 07:59:58.0==>1970-01-01 08:00:04.0 , a --> 2
4000,a,1
5555,a,1
7777,a,1 Window3 [0,6) --> timestamp 7777 > 5999, firing Window3, output:
1970-01-01 08:00:00.0==>1970-01-01 08:00:06.0 , a --> 4
9998,a,1 Window4 [2,8) --> timestamp 9998 > 7999, firing Window4, output:
1970-01-01 08:00:02.0==>1970-01-01 08:00:08.0 , a --> 5
12000,a,1 Window5 [4,10)
Window6 [6,12) --> 1970-01-01 08:00:04.0==>1970-01-01 08:00:10.0 , a --> 4
1970-01-01 08:00:06.0==>1970-01-01 08:00:12.0 , a --> 2
* @param environment
*/
def slidingWindow(environment: StreamExecutionEnvironment): Unit = {
val window = environment.socketTextStream("hadoop", 9527)
.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(0)) {
override def extractTimestamp(element: String): Long = {
element.split(",")(0).toLong // extract the event time from the record
}
}
).map(x => {
val splits = x.split(",")
(splits(1).trim, splits(2).trim.toInt)
}).keyBy(0)
.window(SlidingEventTimeWindows.of(Time.seconds(6),Time.seconds(2)))
.reduce(new ReduceFunction[(String, Int)] {
override def reduce(value1: (String, Int), value2: (String, Int)): (String, Int) = {
(value1._1, value1._2 + value2._2)
}
}, new ProcessWindowFunction[(String,Int), String, Tuple, TimeWindow] {
override def process(key: Tuple, context: Context, elements: Iterable[(String, Int)], out: Collector[String]): Unit = {
for(ele <- elements) {
out.collect(new Timestamp(context.window.getStart) +
"==>" + new Timestamp(context.window.getEnd) +
" , " + ele._1 + " --> " + ele._2)
}
}
}) //.print()
window.print()
}
/**
Implemented with the built-in BoundedOutOfOrdernessTimestampExtractor():
Input 1:
1000,a,1 Window1 [0,2)
4000,a,1 Window2 [0,4) --> timestamp 4000 > 3999 and 4000 > 1999, firing Window1 and Window2, output:
1970-01-01 07:59:56.0==>1970-01-01 08:00:02.0 , a --> 1
1970-01-01 07:59:58.0==>1970-01-01 08:00:04.0 , a --> 1
5555,a,1
7777,a,1 Window3 [0,6) --> timestamp 7777 > 5999, firing Window3, output:
1970-01-01 08:00:00.0==>1970-01-01 08:00:06.0 , a --> 3
9998,a,1 Window4 [2,8) --> timestamp 9998 > 7999, firing Window4, output:
1970-01-01 08:00:02.0==>1970-01-01 08:00:08.0 , a --> 3
12000,a,1 Window5 [4,10)
Window6 [6,12) --> timestamp 12000 > 11999, firing Window5 and Window6, output:
1970-01-01 08:00:04.0==>1970-01-01 08:00:10.0 , a --> 4
1970-01-01 08:00:06.0==>1970-01-01 08:00:12.0 , a --> 2
14999,a,1 Window7 [8,14) --> timestamp 14999 > 13999, firing Window7, output:
1970-01-01 08:00:08.0==>1970-01-01 08:00:14.0 , a --> 2
15000,a,1 Window8 [10,16) --> timestamp 15000 < 15999, Window8 not fired, no output
Custom watermark assigner implementation:
1000,a,1
event time: 1970-01-01 08:16:40.0, max event time in window: 1970-01-01 08:16:40.0, WM: 1970-01-01 08:16:39.998
4000,a,1
event time: 1970-01-01 09:06:40.0, max event time in window: 1970-01-01 09:06:40.0, WM: 1970-01-01 09:06:39.998
word: a, count: 1, window start: 1970-01-01 08:16:36.0, window end: 1970-01-01 08:16:42.0
word: a, count: 1, window start: 1970-01-01 08:16:38.0, window end: 1970-01-01 08:16:44.0
word: a, count: 1, window start: 1970-01-01 08:16:40.0, window end: 1970-01-01 08:16:46.0
5555,a,1
7777,a,1
9998,a,1
12000,a,1
14999,a,1
15000,a,1
* @param environment
*/
def slidingWindowCustom(environment: StreamExecutionEnvironment): Unit = {
val maxOutOfOrderness = 2 // caution: the custom assigner subtracts this value directly, so the unit is milliseconds (2 ms), not seconds
val window = environment.socketTextStream("hadoop", 9527)
.map(x => {
val splits = x.split(",")
XX2(splits(0).toLong,splits(1).trim, splits(2).trim.toInt)
})
.assignTimestampsAndWatermarks(new RuozedataAssignerWithPeriodicWatermarks2(maxOutOfOrderness))
.keyBy(1)
.window(SlidingEventTimeWindows.of(Time.seconds(6),Time.seconds(2)))
.apply(new RichWindowFunction[XX2,String,Tuple,TimeWindow] {
override def apply(key: Tuple, window: TimeWindow, input: Iterable[XX2], out: Collector[String]): Unit = {
val start = new Timestamp(window.getStart)
val end = new Timestamp(window.getEnd)
val word=key.getField[String](0)
var count:Int=0
input.foreach(x=>count =count+x.count)
out.collect(s"word: $word, count: $count, window start: $start, window end: $end")
}
})
window.print()
}
}
case class XX(id:String, temperature:Double, name:String, time:Long, location:String)
class RuozedataAssignerWithPeriodicWatermarks(maxAllowedUnOrderedTime: Long) extends AssignerWithPeriodicWatermarks[XX] {
var currentMaxTimestamp: Long = _
override def getCurrentWatermark: Watermark = new Watermark(currentMaxTimestamp - maxAllowedUnOrderedTime)
override def extractTimestamp(element: XX, recordTimestamp: Long): Long = {
val nowTime = element.time * 1000
currentMaxTimestamp = currentMaxTimestamp.max(nowTime)
println("event time: " + new Timestamp(nowTime)
+ ", max event time in window: " + new Timestamp(currentMaxTimestamp)
+ ", WM: " + new Timestamp(getCurrentWatermark.getTimestamp))
nowTime
}
}
case class XX2(time:Long,word:String,count:Int)
class RuozedataAssignerWithPeriodicWatermarks2(maxAllowedUnOrderedTime: Long) extends AssignerWithPeriodicWatermarks[XX2] {
var currentMaxTimestamp: Long = _
override def getCurrentWatermark: Watermark = new Watermark(currentMaxTimestamp - maxAllowedUnOrderedTime)
override def extractTimestamp(element: XX2, recordTimestamp: Long): Long = {
val nowTime = element.time * 1000
currentMaxTimestamp = currentMaxTimestamp.max(nowTime)
println("event time: " + new Timestamp(nowTime)
+ ", max event time in window: " + new Timestamp(currentMaxTimestamp)
+ ", WM: " + new Timestamp(getCurrentWatermark.getTimestamp))
nowTime
}
}
/*
How BoundedOutOfOrdernessTimestampExtractor() produces the watermark:
public final Watermark getCurrentWatermark() {
// this guarantees that the watermark never goes backwards.
long potentialWM = currentMaxTimestamp - maxOutOfOrderness;
if (potentialWM >= lastEmittedWatermark) {
lastEmittedWatermark = potentialWM;
}
return new Watermark(lastEmittedWatermark);
}
The custom RuozedataAssignerWithPeriodicWatermarks(maxAllowedUnOrderedTime: Long) extends AssignerWithPeriodicWatermarks[XX]
does the same thing; the custom version simply makes the watermark generation visible.
*/
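The "never goes backwards" guarantee quoted above can be reproduced in plain Scala (a sketch mirroring BoundedOutOfOrdernessTimestampExtractor's logic; `BoundedWatermark` is an illustrative name, not Flink API):

```scala
// Sketch of the bounded-out-of-orderness watermark, with the monotonicity guard.
class BoundedWatermark(maxOutOfOrdernessMs: Long) {
  private var currentMaxTimestamp = Long.MinValue + maxOutOfOrdernessMs
  private var lastEmittedWatermark = Long.MinValue

  def extractTimestamp(ts: Long): Long = {
    currentMaxTimestamp = math.max(currentMaxTimestamp, ts)
    ts
  }

  def getCurrentWatermark: Long = {
    val potentialWM = currentMaxTimestamp - maxOutOfOrdernessMs
    // only move forward: never emit a smaller watermark than before
    if (potentialWM >= lastEmittedWatermark) lastEmittedWatermark = potentialWM
    lastEmittedWatermark
  }
}

val wm = new BoundedWatermark(2000L)
wm.extractTimestamp(6999L)
println(wm.getCurrentWatermark) // 4999
wm.extractTimestamp(3000L)      // an out-of-order record arrives
println(wm.getCurrentWatermark) // still 4999: the watermark did not go backwards
```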
rzG9_My/WMApp.scala at 69c529f73c5a9a9c1fa8a3a73ff2984b0882c703 · sixPulseExcalibur/rzG9_My