Window (窗口)
- Window is at the core of how Flink processes unbounded streams: windows split the stream into finite-sized "buckets" over which we can apply computations.
- Flink treats batch as a special case of streaming, so the underlying Flink engine is a streaming engine, with stream processing and batch processing both implemented on top of it.
- The window is the bridge from Streaming to Batch.
- Flink provides a very complete windowing mechanism.
- In a streaming application, data arrives continuously, so we cannot wait for all the data to arrive before we start processing.
- We could of course process every message as it arrives, but sometimes we need aggregate computations, for example: how many users clicked our web page in the past minute.
- In that case we must define a window that collects the data of the most recent minute and run the computation over the data in that window.
- Windows can be time-driven (Time Window, e.g. every 30 seconds)
- or data-driven (Count Window, e.g. every 100 elements).
- Windows driven by different kinds of events can further be classified as follows:
  - Tumbling Window (no overlap)
  - Sliding Window (with overlap)
  - Session Window (separated by gaps of activity)
  - Global Window (not covered here)
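To make the tumbling/sliding distinction concrete, here is a plain-Scala sketch of how an event timestamp maps to window start times. This is not the Flink API: the names `tumblingStart` and `slidingStarts` are ours, offsets are assumed to be 0, and window starts are clamped at 0 for simplicity.

```scala
// Plain-Scala sketch (not the Flink API) of window assignment; all values in ms.
object WindowAssignment {
  // Tumbling: each event falls into exactly one window of length `size`
  def tumblingStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Sliding: each event falls into up to size/slide overlapping windows,
  // whose starts step backwards from the most recent slide boundary
  def slidingStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
    val lastStart = ts - (ts % slide)
    (lastStart to math.max(lastStart - size + slide, 0L) by -slide)
      .filter(s => s + size > ts)
  }
}
```

For example, with a 10 s window sliding every 5 s, an event at t = 7 s falls into the windows starting at 0 s and 5 s, so it is counted twice.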
To operate on windows, Flink first needs to convert the StreamSource into a WindowedStream.
Window operation | Effect |
---|---|
Window (KeyedStream → WindowedStream) | Defines windows on an already-partitioned KeyedStream, i.e. K,V-format data. |
WindowAll (DataStream → AllWindowedStream) | Defines windows on a regular DataStream, i.e. non-K,V-format data. |
Window Apply (WindowedStream → DataStream, AllWindowedStream → DataStream) | Applies a function to the data in each window. |
Window Reduce (WindowedStream → DataStream) | Incrementally reduces the data in each window into one aggregated value. |
Aggregations on Windows (WindowedStream → DataStream) | Aggregation operations over window data: sum(), max(), min() |
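As a rough illustration of the difference between Window Reduce and Window Apply, here is a plain-Scala sketch over an in-memory collection (not the Flink API; `WindowEval` and its methods are our own names): reduce combines elements pairwise and incrementally, while apply sees the whole buffered window at once.

```scala
// Plain-Scala sketch of the two ways a window's contents can be evaluated.
object WindowEval {
  // Incremental: only one accumulated value needs to be kept per window
  def reduceWindow(events: Seq[Double]): Double = events.reduce(_ min _)

  // Full-window: the function sees every buffered element at once,
  // so it can compute things a pairwise reduce cannot (e.g. the count)
  def applyWindow(events: Seq[Double]): (Int, Double) = (events.size, events.min)
}
```

This is why reduce-style aggregation is cheaper in state: Flink only has to store the running value, not the whole window buffer.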
Tumbling Window (翻滚窗口)
- A tumbling window slices the data stream into non-overlapping windows; every event belongs to exactly one window.
- Tumbling windows have a fixed size and do not overlap.
- Example figure
Code example:
```scala
package window

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

/*
 Flink window: tumbling window, window size 10 seconds.
 Find the minimum air-gap height (vc) per sensor.
 Receives strings in the format "ws_001,1577844001,45.0" from a socket port,
 then converts the string stream into a WaterSensor stream (dataStream).
*/
case class WaterSensor(id: String, ts: Long, vc: Double)

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism
    env.setParallelism(1)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.100", 7777)
    val dataStream: DataStream[WaterSensor] = stream.map(x => {
      val strings: Array[String] = x.split(",")
      WaterSensor(strings(0).trim, strings(1).trim.toLong, strings(2).trim.toDouble)
    })
    val dataStream2: DataStream[(String, Double)] = dataStream.map(data => (data.id, data.vc))
    val dataStream3: KeyedStream[(String, Double), String] = dataStream2.keyBy(_._1)
    // Tumbling window
    val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(10))
    val minDataStream: DataStream[(String, Double)] = dataStream4.reduce((x, y) => (x._1, x._2.min(y._2)))
    dataStream.print("orig")
    minDataStream.print("min")
    env.execute("windowdemo")
  }
}
```
- Open the port
[root@hadoop100 ~]# nc -lk 7777
- Start the Scala program
- Type data into the port
ws_001,1609314670,45.0
ws_002,1609314671,33.0
ws_003,1609314672,32.0
ws_002,1609314673,23.0
ws_003,1609314674,31.0
ws_002,1609314675,45.0
ws_003,1609314676,18.0
ws_002,1609314677,34.0
ws_003,1609314678,47.0
ws_001,1609314679,55.0
ws_001,1609314680,25.0
ws_001,1609314681,25.0
ws_001,1609314682,25.0
ws_001,1609314683,26.0
ws_001,1609314684,21.0
ws_001,1609314685,24.0
ws_001,1609314686,15.0
- The console prints the corresponding results: the minimum value for each key.
Sliding Window (滑动窗口)
- A sliding window is similar to a tumbling window; the difference is that sliding windows can overlap.
- With sliding windows, one element can belong to multiple windows.
- Example figure:
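Before the Flink version, here is a small plain-Scala simulation of what the sliding job computes (names like `SlidingSim` and `minPerWindow` are ours, not Flink's): each record is assigned to every overlapping window it falls in, then the per-key minimum is taken per window.

```scala
// Plain-Scala simulation (not the Flink API) of a sliding-window per-key minimum.
object SlidingSim {
  // Assign each (id, ts, vc) record to every sliding window containing it
  // (size and slide in the same unit as ts, window starts clamped at >= 0),
  // then take the minimum vc for each (id, windowStart) pair.
  def minPerWindow(events: Seq[(String, Long, Double)],
                   size: Long, slide: Long): Map[(String, Long), Double] =
    events.flatMap { case (id, ts, vc) =>
      val lastStart = ts - (ts % slide)
      (lastStart to (lastStart - size + slide) by -slide)
        .filter(s => s + size > ts && s >= 0)
        .map(start => ((id, start), vc))
    }.groupBy(_._1)
      .map { case (key, hits) => key -> hits.map(_._2).min }
}
```

Note how a record at ts = 7 with a size-10, slide-5 window contributes to both the window starting at 0 and the one starting at 5 — the overlap the bullet above describes.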
Code example:
```scala
package window

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

/*
 Flink window: sliding window, window size 10 seconds, sliding by 5 seconds.
 Find the minimum air-gap height (vc) per sensor.
 Receives strings in the format "ws_001,1577844001,45.0" from a socket port,
 then converts the string stream into a WaterSensor stream (dataStream).
 Reuses the WaterSensor case class defined in the tumbling-window demo above.
*/
object SlidingWindowDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism
    env.setParallelism(1)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.100", 7777)
    val dataStream: DataStream[WaterSensor] = stream.map(x => {
      val strings: Array[String] = x.split(",")
      WaterSensor(strings(0).trim, strings(1).trim.toLong, strings(2).trim.toDouble)
    })
    val dataStream2: DataStream[(String, Double)] = dataStream.map(data => (data.id, data.vc))
    val dataStream3: KeyedStream[(String, Double), String] = dataStream2.keyBy(_._1)
    // Sliding window: size 10 s, slide 5 s
    val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(10), Time.seconds(5))
    val minDataStream: DataStream[(String, Double)] = dataStream4.reduce((x, y) => (x._1, x._2.min(y._2)))
    dataStream.print("orig")
    minDataStream.print("min")
    env.execute("windowdemo")
  }
}
```
- Open the port
[root@hadoop100 ~]# nc -lk 7777
- Start the Scala program
- Type data into the port
ws_001,1609314670,45.0
ws_002,1609314671,33.0
ws_003,1609314672,32.0
ws_002,1609314673,23.0
ws_003,1609314674,31.0
ws_002,1609314675,45.0
ws_003,1609314676,18.0
ws_002,1609314677,34.0
ws_003,1609314678,47.0
ws_001,1609314679,55.0
ws_001,1609314680,25.0
ws_001,1609314681,25.0
ws_001,1609314682,25.0
ws_001,1609314683,26.0
ws_001,1609314684,21.0
ws_001,1609314685,24.0
ws_001,1609314686,15.0
- The console prints the corresponding results: the minimum value for each key.
Sliding Window (滑动窗口) with a Watermark
```scala
package window

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

/*
 Flink window: sliding window, size 8 seconds, sliding by 2 seconds,
 using event time with watermarks.
 Find the minimum air-gap height (vc) per sensor.
 Receives strings in the format "ws_001,1577844001,45.0" from a socket port,
 then converts the string stream into a WaterSensor stream (dataStream).
*/
object WindowDemo2 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism
    env.setParallelism(1)
    /*
     Flink has three time semantics:
     EventTime      - when the event occurred
     IngestionTime  - when the event entered Flink
     ProcessingTime - when the event is processed
    */
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.100", 7777)
    val dataStream: DataStream[WaterSensor] = stream.map(x => {
      val strings: Array[String] = x.split(",")
      WaterSensor(strings(0).trim, strings(1).trim.toLong, strings(2).trim.toDouble)
    })
      //.assignAscendingTimestamps(x => x.ts * 1000) // simplest option: timestamps must always ascend, no lateness allowed
      .assignTimestampsAndWatermarks(new MyAssigner) // custom assigner class
      // Flink's built-in assigner:
      /*.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(t: WaterSensor): Long = {
          t.ts * 1000
        }
      })*/
    val dataStream2: DataStream[(String, Double)] = dataStream.map(data => (data.id, data.vc))
    val dataStream3: KeyedStream[(String, Double), String] = dataStream2.keyBy(_._1)
    // Tumbling window:
    //val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(10))
    // Sliding window: size 8 s, slide 2 s
    val dataStream4: WindowedStream[(String, Double), String, TimeWindow] = dataStream3.timeWindow(Time.seconds(8), Time.seconds(2))
    val minDataStream: DataStream[(String, Double)] = dataStream4.reduce((x, y) => (x._1, x._2.min(y._2)))
    dataStream.print("orig")
    minDataStream.print("min")
    env.execute("windowdemo")
  }
}

class MyAssigner extends AssignerWithPeriodicWatermarks[WaterSensor] {
  val bound = 3000 // watermark lateness in ms: how long to delay closing a window
  var maxTs = 0L   // tracks the largest event timestamp seen so far

  // The watermark trails the largest timestamp seen by `bound` ms
  override def getCurrentWatermark: Watermark = {
    new Watermark(maxTs - bound)
  }

  // Extract the event timestamp (seconds -> ms) and update the running maximum
  override def extractTimestamp(t: WaterSensor, l: Long): Long = {
    maxTs = Math.max(t.ts * 1000, maxTs)
    t.ts * 1000
  }
}
```
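The watermark logic in MyAssigner can be checked in isolation. The following plain-Scala sketch (our own `WatermarkTracker` helper, not a Flink class) shows the two properties the assigner relies on: the watermark trails the largest timestamp seen by `bound` ms, and it never moves backwards when an out-of-order event arrives.

```scala
// Plain-Scala sketch of the periodic-watermark rule used by MyAssigner:
// watermark = (max timestamp seen) - bound.
class WatermarkTracker(bound: Long) {
  private var maxTs = 0L

  // Record an event timestamp (ms); only a larger value advances maxTs
  def observe(tsMillis: Long): Unit = maxTs = math.max(maxTs, tsMillis)

  // The watermark lags maxTs by `bound` ms, tolerating that much lateness
  def currentWatermark: Long = maxTs - bound
}
```

A window is only closed once the watermark passes its end, so events up to `bound` ms late are still assigned to their event-time window.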
- Open the port
[root@hadoop100 ~]# nc -lk 7777
- Start the Scala program
- Type data into the port
ws_002, 1609314671, 33.0
ws_002, 1609314673, 23.0
ws_002, 1609314673, 43.0
ws_002, 1609314673, 13.0
ws_002, 1609314673, 43.0
ws_001, 1609314683, 26.0
- The console prints the corresponding results: the minimum value for each key.
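To see why the repeated ws_002 records at the same second are still counted, the watermark progression for a stream of event timestamps can be traced by hand. This is a plain-Scala simulation of the rule in MyAssigner (the `WatermarkTrace` name is ours), using the same 3000 ms bound.

```scala
// Trace the watermark after each event: convert second-granularity timestamps
// to ms, track the running maximum, and subtract the lateness bound.
object WatermarkTrace {
  def trace(tsSeconds: Seq[Long], boundMs: Long): Seq[Long] =
    tsSeconds.map(_ * 1000)
      .scanLeft(0L)((a, b) => math.max(a, b)).tail // running max timestamp
      .map(_ - boundMs)                            // watermark lags by boundMs
}
```

An event whose timestamp is below the current watermark counts as late; with the bound of 3000 ms, an event arriving up to 3 seconds out of order is still processed in its window.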