Watermarks and Windows
Watermark
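Under event-time semantics, Flink does not rely on the system clock. Instead, it defines a logical clock from the timestamps carried by the data itself and uses that clock to represent how far time has progressed. Each parallel subtask therefore has its own logical clock, which moves forward only as the timestamps of arriving records drive it. A watermark is the marker that carries the current value of this clock through the stream.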
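The idea of a clock driven by timestamps can be sketched without any Flink dependency. The sketch below follows the rule used by Flink's bounded-out-of-orderness strategy, where the watermark is the largest timestamp seen so far minus the allowed delay minus 1 millisecond. `BoundedClock` and `WatermarkSketch` are hypothetical names introduced for illustration, and the sketch recomputes the watermark after every event, whereas Flink emits watermarks periodically.

```scala
// Minimal sketch, not Flink code: a logical clock driven only by event timestamps.
// BoundedClock and WatermarkSketch are hypothetical names used for illustration.
final class BoundedClock(maxOutOfOrdernessMs: Long) {
  // Largest event timestamp seen so far. The initial value is chosen so that
  // the subtraction in currentWatermark cannot overflow.
  private var maxTimestamp: Long = Long.MinValue + maxOutOfOrdernessMs + 1

  // Watermark = largest timestamp seen - allowed delay - 1 ms.
  def currentWatermark: Long = maxTimestamp - maxOutOfOrdernessMs - 1

  // Feed one event timestamp and return the watermark that results.
  def onEvent(timestampMs: Long): Long = {
    maxTimestamp = math.max(maxTimestamp, timestampMs)
    currentWatermark
  }
}

object WatermarkSketch {
  // Watermark after each event, in arrival order.
  def watermarksFor(timestamps: Seq[Long], delayMs: Long): Seq[Long] = {
    val clock = new BoundedClock(delayMs)
    timestamps.toList.map(ts => clock.onEvent(ts))
  }

  def main(args: Array[String]): Unit = {
    // The third event arrives out of order and does not move the clock.
    val arrivals = Seq(1000L, 7000L, 4000L, 12000L)
    println(watermarksFor(arrivals, delayMs = 5000L))
    // prints List(-4001, 1999, 1999, 6999)
  }
}
```

Note that the out-of-order event with timestamp 4000 leaves the watermark unchanged: the clock never moves backward.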
Window
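Flink is a stream processing engine designed mainly for unbounded data streams, where data keeps arriving with no end. To process an unbounded stream conveniently and efficiently, one approach is to cut the infinite data into finite "chunks" and process each chunk; these chunks are called windows. In Flink, windows are the core mechanism for handling unbounded streams. The following example assigns event-time timestamps and watermarks to records read from a socket, then sums the money per user over 10-second tumbling windows.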
import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// One input record: user name, amount, and event time (used as the event timestamp in milliseconds)
case class User(name: String, money: Double, time: Long)

object StreamWindow {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Tolerate up to 5 seconds of out-of-order data and take the event time from User.time
    val strategy: WatermarkStrategy[User] = WatermarkStrategy
      .forBoundedOutOfOrderness[User](Duration.ofSeconds(5))
      .withTimestampAssigner(new SerializableTimestampAssigner[User] {
        override def extractTimestamp(element: User, recordTimestamp: Long): Long = element.time
      })

    // Each input line is parsed as: name,money,time
    val dataStream: DataStream[User] = env.socketTextStream("localhost", 7777)
      .map((data: String) => {
        val arr: Array[String] = data.split(",")
        User(arr(0), arr(1).toDouble, arr(2).toLong)
      })
      .assignTimestampsAndWatermarks(strategy)

    // Key by user name and sum the money within each 10-second event-time tumbling window
    val resultStream: DataStream[(String, Double)] = dataStream
      .map((data: User) => (data.name, data.money))
      .keyBy((_: (String, Double))._1)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .reduce((x: (String, Double), y: (String, Double)) => (x._1, y._2 + x._2))

    resultStream.print()
    env.execute("stream window")
  }
}
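To make the "cutting into chunks" concrete, the following sketch shows how a timestamp can be mapped to a fixed-size tumbling window such as the 10-second one configured with `TumblingEventTimeWindows.of(Time.seconds(10))`. It is a plain-Scala illustration under the assumption that windows are half-open intervals [start, end) aligned to multiples of the window size plus an optional offset; `TumblingWindowSketch`, `Window` and `assign` are hypothetical names, not Flink APIs.

```scala
// Minimal sketch, not Flink code: assigning a timestamp to a tumbling window.
// TumblingWindowSketch, Window and assign are hypothetical names used for illustration.
object TumblingWindowSketch {
  // Half-open window [start, end), both in milliseconds.
  final case class Window(start: Long, end: Long)

  def assign(timestampMs: Long, sizeMs: Long, offsetMs: Long = 0L): Window = {
    // floorMod keeps the remainder non-negative, so negative timestamps align correctly too.
    val start = timestampMs - java.lang.Math.floorMod(timestampMs - offsetMs, sizeMs)
    Window(start, start + sizeMs)
  }

  def main(args: Array[String]): Unit = {
    val size = 10000L // 10 seconds
    Seq(0L, 9999L, 10000L, 23456L).foreach { ts =>
      println(s"$ts -> ${assign(ts, size)}")
    }
  }
}
```

With a 10-second size, the timestamps 0 and 9999 fall into the window starting at 0, while 10000 already belongs to the next window, because the end of a window is exclusive.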
Basic Process Function (ProcessFunction)
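A process function defines a transformation on a data stream, so it can also be classed as a transformation operator. Almost every transformation operator in Flink has a corresponding function class interface, and process functions are no exception: the corresponding function class is called ProcessFunction. ProcessWindowFunction is both a process function and a full-window function. As its name suggests, it leans more toward being a window function, and its usage does differ considerably from that of the other process functions. The example below uses a ProcessWindowFunction to count each user's events per 10-second tumbling window and to report where the watermark is when the window fires.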
package com.atguigu.chapter02

import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Event is defined here, so it must not also be imported from another package
// (having both makes the reference ambiguous and the file fails to compile)
case class Event(user: String, url: String, timestamp: Long)

object StreamWordCount {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Read the socket address from the program arguments: --host <hostname> --port <port>
    val parameterTool: ParameterTool = ParameterTool.fromArgs(args)
    val hostname: String = parameterTool.get("host")
    val port: Int = parameterTool.getInt("port")

    val lineDataStream: DataStream[String] = env.socketTextStream(hostname, port)

    // Each input line is parsed as: user,url,timestamp
    val stream: DataStream[Event] = lineDataStream.map((data: String) => {
      val fields: Array[String] = data.split(",")
      Event(fields(0).trim, fields(1).trim, fields(2).trim.toLong)
    })

    // Tolerate up to 5 seconds of out-of-order data and take the event time from Event.timestamp
    stream.assignTimestampsAndWatermarks(
      WatermarkStrategy.forBoundedOutOfOrderness[Event](Duration.ofSeconds(5))
        .withTimestampAssigner(
          new SerializableTimestampAssigner[Event] {
            override def extractTimestamp(t: Event, l: Long): Long = t.timestamp
          }
        ))
      .keyBy((_: Event).user)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .process(new WatermarkWindowResult)
      .print()

    env.execute()
  }

  // Type parameters: input Event, output String, key String (the user), window TimeWindow
  class WatermarkWindowResult extends ProcessWindowFunction[Event, String, String, TimeWindow] {
    // Called once per key and window when the window fires; elements holds all events of that window
    override def process(user: String, context: Context, elements: Iterable[Event], out: Collector[String]): Unit = {
      val start: Long = context.window.getStart
      val end: Long = context.window.getEnd
      val count: Int = elements.size
      val currentWatermark: Long = context.currentWatermark
      out.collect(s"Window $start ~ $end, user $user activity count: $count, current watermark: $currentWatermark")
    }
  }
}
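To see when a window like the one above would hand its elements to the window function, the following plain-Scala sketch replays a short list of events. It rests on several simplifying assumptions: the watermark is updated after every event (Flink emits it periodically), a window [start, end) fires once the watermark reaches end - 1, and events for a window that has already fired are dropped. `WindowFiringSketch`, `Click` and `run` are hypothetical names, not Flink APIs.

```scala
import scala.collection.mutable

// Minimal sketch, not Flink code: when does an event-time tumbling window fire?
// WindowFiringSketch, Click and run are hypothetical names used for illustration.
object WindowFiringSketch {
  final case class Click(user: String, timestampMs: Long)

  // Returns one line per fired window, in firing order.
  def run(events: Seq[Click], sizeMs: Long, delayMs: Long): Seq[String] = {
    val counts = mutable.Map.empty[(Long, String), Int] // (windowStart, user) -> element count
    val fired = mutable.ArrayBuffer.empty[String]
    var maxTs = Long.MinValue + delayMs + 1
    def watermark: Long = maxTs - delayMs - 1

    for (e <- events) {
      val start = e.timestampMs - java.lang.Math.floorMod(e.timestampMs, sizeMs)
      // Keep the event only if its window has not been passed by the watermark yet.
      if (start + sizeMs - 1 > watermark) {
        val key = (start, e.user)
        counts(key) = counts.getOrElse(key, 0) + 1
      }
      maxTs = math.max(maxTs, e.timestampMs)
      val wm = watermark
      // Fire every window whose last millisecond is covered by the new watermark.
      val ready = counts.keys.filter { case (s, _) => s + sizeMs - 1 <= wm }.toList.sorted
      ready.foreach { case key @ (s, user) =>
        fired += s"[$s, ${s + sizeMs}) user=$user count=${counts(key)} watermark=$wm"
        counts.remove(key)
      }
    }
    fired.toList
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Click("Alice", 1000L), Click("Alice", 9000L), Click("Alice", 12000L),
      Click("Alice", 3000L),  // out of order, but window [0, 10000) is still open
      Click("Alice", 15000L), // pushes the watermark to 9999 and fires [0, 10000)
      Click("Alice", 4000L)   // too late: its window has already fired
    )
    run(events, sizeMs = 10000L, delayMs = 5000L).foreach(println)
    // prints: [0, 10000) user=Alice count=3 watermark=9999
  }
}
```

In this replay the window [0, 10000) does not fire when the event with timestamp 12000 arrives, because the 5-second delay keeps the watermark at 6999. It fires only once an event with timestamp 15000 pushes the watermark to 9999.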