1. Concepts
Process functions are used to build event-driven applications and to implement custom business logic.
Flink provides 8 process functions:
• ProcessFunction
• KeyedProcessFunction
• CoProcessFunction
• ProcessJoinFunction
• BroadcastProcessFunction
• KeyedBroadcastProcessFunction
• ProcessWindowFunction
• ProcessAllWindowFunction
2. KeyedProcessFunction
KeyedProcessFunction operates on a KeyedStream. It is invoked for every element of the stream and emits zero, one, or more output elements.
Typical use case: processing a keyed stream without windowing.
2.1 Example: registering a timer and printing the current watermark
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/11 14:56
 */
object EventTimeOnTimer {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream = env
      .socketTextStream("hadoop103", 9999, '\n')
      .map(line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toLong * 1000)
      })
      // the watermark is emitted as the timestamp - 1 ms
      .assignAscendingTimestamps(_._2)
      .keyBy(_._1)
      .process(new MyKeyedProcess)
      .print()

    env.execute()
  }

  class MyKeyedProcess extends KeyedProcessFunction[String, (String, Long), String] {

    // called once per element
    override def processElement(i: (String, Long), context: KeyedProcessFunction[String, (String, Long), String]#Context, collector: Collector[String]): Unit = {
      // register a timer 10 s after the current element's timestamp;
      // the timer's logic is implemented in `onTimer`
      context.timerService().registerEventTimeTimer(i._2 + 10 * 1000)
      collector.collect("current watermark: " + context.timerService().currentWatermark())
    }

    override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[String, (String, Long), String]#OnTimerContext, out: Collector[String]): Unit = {
      out.collect("timer at timestamp " + timestamp + " fired!")
    }
  }
}
- processElement(v: IN, ctx: Context, out: Collector[OUT]) is called for every element in the stream; results are emitted through the Collector. The Context gives access to the element's timestamp and key as well as the TimerService, and can also emit records to other streams (side outputs).
- onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[OUT]) is a callback invoked when a previously registered timer fires. The timestamp parameter is the time for which the timer was set.
2.2 TimerService and Timers
Timers can only be used on a KeyedStream.
Each key may register multiple timers, but only one timer per timestamp.
The TimerService held by Context and OnTimerContext offers the following methods:
- currentProcessingTime(): Long returns the current processing time.
- currentWatermark(): Long returns the timestamp of the current watermark.
- registerProcessingTimeTimer(timestamp: Long): Unit registers a processing-time timer for the current key; it fires when the processing time reaches the given timestamp.
- registerEventTimeTimer(timestamp: Long): Unit registers an event-time timer for the current key; it fires (invoking the onTimer callback) when the watermark becomes greater than or equal to the timer's timestamp.
- deleteProcessingTimeTimer(timestamp: Long): Unit deletes a previously registered processing-time timer; if no timer exists for that timestamp, nothing happens.
- deleteEventTimeTimer(timestamp: Long): Unit deletes a previously registered event-time timer; if no timer exists for that timestamp, nothing happens.
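The "one timer per timestamp per key" rule means registering the same timestamp twice is a no-op, and deleting an unknown timestamp does nothing. A simplified model of this deduplication outside Flink (this is a sketch for illustration, not Flink's implementation; all names below are made up):

```scala
import scala.collection.mutable

// Simplified model of a per-key timer registry: timers for one key are kept
// in a sorted set, so registering the same timestamp twice has no effect.
object TimerModel {
  private val timersPerKey = mutable.Map.empty[String, mutable.SortedSet[Long]]

  def registerTimer(key: String, ts: Long): Unit =
    timersPerKey.getOrElseUpdate(key, mutable.SortedSet.empty[Long]) += ts

  def deleteTimer(key: String, ts: Long): Unit =
    timersPerKey.get(key).foreach(_ -= ts) // no-op if the timestamp was never registered

  def pendingTimers(key: String): List[Long] =
    timersPerKey.getOrElse(key, mutable.SortedSet.empty[Long]).toList

  def main(args: Array[String]): Unit = {
    registerTimer("sensor_1", 1000L)
    registerTimer("sensor_1", 1000L) // duplicate timestamp: deduplicated
    registerTimer("sensor_1", 2000L)
    deleteTimer("sensor_1", 9999L)   // unknown timestamp: nothing happens
    println(pendingTimers("sensor_1")) // List(1000, 2000)
  }
}
```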
2.3 Example: alert when a sensor's temperature rises continuously for one second
import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/11 15:50
 */
object TempIncreaseAlert {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env
      .addSource(new SensorSource)
      .keyBy(_.id)
      .process(new TempIncreaseAlertFunction)

    stream.print()
    env.execute()
  }

  class TempIncreaseAlertFunction extends KeyedProcessFunction[String, SensorReading, String] {

    // stores the most recent temperature for the current key.
    // when a checkpoint is taken, state variables are saved to the state backend
    // (in-memory by default; HDFS and others can be configured).
    // `lazy` means the handle is bound when `processElement` first runs, and the
    // state is initialized only once: the name `last-temp` is looked up in the
    // state backend, and a new value (default `0.0`) is created only if absent
    lazy val lastTemp = getRuntimeContext
      .getState(new ValueStateDescriptor[Double](
        "last-temp",
        Types.of[Double]
      ))

    // stores the timestamp of the registered timer; the default value is `0L`
    lazy val currentTime = getRuntimeContext
      .getState(
        new ValueStateDescriptor[Long](
          "timer",
          Types.of[Long]
        )
      )

    override def processElement(value: SensorReading, context: KeyedProcessFunction[String, SensorReading, String]#Context, collector: Collector[String]): Unit = {
      // read the most recent temperature with `.value()`
      val prevTemp = lastTemp.value()
      // save the current temperature with `.update()`
      lastTemp.update(value.temperature)

      // timestamp of the currently registered timer, if any
      val curTimerTimestamp = currentTime.value()

      if (prevTemp == 0.0 || value.temperature < prevTemp) {
        // first reading for this key, or the temperature dropped:
        // delete the timer registered at the saved timestamp
        context.timerService().deleteProcessingTimeTimer(curTimerTimestamp)
        // and clear the state variable
        currentTime.clear()
      } else if (value.temperature > prevTemp && curTimerTimestamp == 0L) {
        // the temperature rose and no timer is registered yet:
        // register a timer for one second from now
        val timerTs = context.timerService().currentProcessingTime() + 1000L
        context.timerService().registerProcessingTimeTimer(timerTs)
        // and remember its timestamp in state
        currentTime.update(timerTs)
      }
    }

    override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[String, SensorReading, String]#OnTimerContext, out: Collector[String]): Unit = {
      out.collect("sensor " + ctx.getCurrentKey + ": temperature rose continuously for 1 second!")
      currentTime.clear()
    }
  }
}
3. ProcessFunction example: route temperature readings below 32 °F to a side output
import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/11 16:30
 */
object SideOutputExample {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env
      .addSource(new SensorSource)
      .process(new FreezingMonitor)

    stream
      .getSideOutput(new OutputTag[String]("freezing-alarms"))
      .print()
    stream.print()

    env.execute()
  }

  // a plain `ProcessFunction` is enough here because the stream is not keyed
  class FreezingMonitor extends ProcessFunction[SensorReading, SensorReading] {

    // the side-output tag
    lazy val freezingAlarmOutput = new OutputTag[String]("freezing-alarms")

    // called once per element
    override def processElement(value: SensorReading, context: ProcessFunction[SensorReading, SensorReading]#Context, out: Collector[SensorReading]): Unit = {
      if (value.temperature < 32.0) {
        // emit the alarm to the side output
        context.output(freezingAlarmOutput, s"sensor ${value.id}: freezing alarm, reading below 32 °F!")
      }
      // forward the element downstream on the main stream
      out.collect(value)
    }
  }
}
4. CoProcessFunction example: connecting two streams
import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/15 09:27
 */
object CoProcessFunctionExample {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // unbounded stream
    val readings = env
      .addSource(new SensorSource)
      .keyBy(_.id)

    // bounded stream
    val filterSwitches = env
      .fromElements(
        ("sensor_1", 10 * 1000L),
        ("sensor_7", 30 * 1000L)
      )
      .keyBy(_._1)

    readings.connect(filterSwitches)
      .process(new ReadingFilter)
      .print()

    env.execute()
  }

  class ReadingFilter extends CoProcessFunction[SensorReading, (String, Long), SensorReading] {

    // forwarding switch; the default value is false
    // the state variable is visible only for the current key
    lazy val forwardingEnabled = getRuntimeContext.getState(
      new ValueStateDescriptor[Boolean]("filter-switch", Types.of[Boolean])
    )

    override def processElement1(in1: SensorReading, context: CoProcessFunction[SensorReading, (String, Long), SensorReading]#Context, collector: Collector[SensorReading]): Unit = {
      // handles the first (unbounded) stream:
      // forward the reading downstream only while the switch is open
      if (forwardingEnabled.value()) {
        collector.collect(in1)
      }
    }

    override def processElement2(in2: (String, Long), context: CoProcessFunction[SensorReading, (String, Long), SensorReading]#Context, collector: Collector[SensorReading]): Unit = {
      // handles the second (bounded) stream; called only twice in this example
      forwardingEnabled.update(true) // open the switch
      // `in2._2` is how long the switch stays open
      val timerTs = context.timerService().currentProcessingTime() + in2._2
      context.timerService().registerProcessingTimeTimer(timerTs)
    }

    override def onTimer(timestamp: Long, ctx: CoProcessFunction[SensorReading, (String, Long), SensorReading]#OnTimerContext, out: Collector[SensorReading]): Unit = {
      forwardingEnabled.update(false) // close the switch
    }
  }
}
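The switch-with-expiry behavior of `ReadingFilter` can be modeled without Flink's runtime. The sketch below uses assumed names (`ForwardingSwitch`, `enable`, `forward` are illustrative, not Flink APIs) and passes the current time in explicitly instead of receiving it from a timer service:

```scala
// Simplified model of ReadingFilter's switch: enable() opens the gate and
// schedules an expiry; forward() only passes readings while the gate is open.
class ForwardingSwitch {
  private var enabled = false
  private var expiryTs = Long.MaxValue

  // corresponds to processElement2: open the switch for `durationMs`
  def enable(nowMs: Long, durationMs: Long): Unit = {
    enabled = true
    expiryTs = nowMs + durationMs
  }

  // corresponds to the onTimer callback closing the switch
  private def maybeExpire(nowMs: Long): Unit =
    if (nowMs >= expiryTs) enabled = false

  // corresponds to processElement1: forward only while the switch is open
  def forward(reading: String, nowMs: Long): Option[String] = {
    maybeExpire(nowMs)
    if (enabled) Some(reading) else None
  }
}

object ForwardingSwitchDemo {
  def main(args: Array[String]): Unit = {
    val s = new ForwardingSwitch
    println(s.forward("r1", 0L))    // None: switch still closed
    s.enable(nowMs = 0L, durationMs = 10000L)
    println(s.forward("r2", 5000L)) // Some(r2): within the open interval
    println(s.forward("r3", 15000L)) // None: switch expired
  }
}
```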
5. Triggers
5.1 Processing-time trigger example
import java.sql.Timestamp

import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/15 11:32
 */
object TriggerExample {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env
      .addSource(new SensorSource)
      .keyBy(_.id)
      .timeWindow(Time.seconds(10))
      .trigger(new OneSecondIntervalTrigger)
      .process(new WindowResult)

    stream.print()
    env.execute()
  }

  class WindowResult extends ProcessWindowFunction[SensorReading, String, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[SensorReading], out: Collector[String]): Unit = {
      out.collect("sensor " + key + ": the window contains " + elements.size + " elements")
    }
  }

  class OneSecondIntervalTrigger extends Trigger[SensorReading, TimeWindow] {

    // called once per element
    override def onElement(t: SensorReading, l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
      val firstSeen = triggerContext.getPartitionedState(
        new ValueStateDescriptor[Boolean]("first-seen", Types.of[Boolean])
      )
      // firstSeen is false only for the first element of the window
      if (!firstSeen.value()) {
        // round the current processing time up to the next full second,
        // e.g. a first event at machine time 1234 ms yields t = 2000 ms
        val t = triggerContext.getCurrentProcessingTime +
          (1000 - (triggerContext.getCurrentProcessingTime % 1000))
        triggerContext.registerProcessingTimeTimer(t) // timer at the next full second
        triggerContext.registerProcessingTimeTimer(w.getEnd) // timer at the window end
        firstSeen.update(true)
      }
      TriggerResult.CONTINUE
    }

    // callback for the registered processing-time timers
    override def onProcessingTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
      println("timer fired at: " + new Timestamp(l))
      if (l == w.getEnd) {
        TriggerResult.FIRE_AND_PURGE
      } else {
        val t = triggerContext.getCurrentProcessingTime + (1000 - (triggerContext.getCurrentProcessingTime % 1000))
        if (t < w.getEnd) {
          triggerContext.registerProcessingTimeTimer(t)
        }
        TriggerResult.FIRE
      }
    }

    override def onEventTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
      TriggerResult.CONTINUE
    }

    override def clear(w: TimeWindow, triggerContext: Trigger.TriggerContext): Unit = {
      // the descriptor name resolves to the same state handle, which is cleared here
      val firstSeen = triggerContext.getPartitionedState(
        new ValueStateDescriptor[Boolean]("first-seen", Types.of[Boolean]))
      firstSeen.clear()
    }
  }
}
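The timer-alignment expression in `onElement` rounds the current processing time up to the next full second (so a first event at 1234 ms registers a timer for 2000 ms). The arithmetic can be checked in isolation; `NextSecond` and `nextFullSecond` are illustrative names, not Flink APIs:

```scala
object NextSecond {
  // round a millisecond timestamp up to the next full second,
  // as done when registering the per-second trigger timer
  def nextFullSecond(ts: Long): Long = ts + (1000 - ts % 1000)

  def main(args: Array[String]): Unit = {
    println(nextFullSecond(1234L)) // 2000
    println(nextFullSecond(1999L)) // 2000
    println(nextFullSecond(2000L)) // 3000 (an exact second moves to the next one)
  }
}
```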
5.2 Event-time trigger example
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/15 15:18
 */
object TriggerEventTimeExample {

  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream = env
      .socketTextStream("hadoop103", 9999, '\n')
      .map(line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toLong)
      })
      .assignAscendingTimestamps(_._2)
      .keyBy(_._1)
      .timeWindow(Time.seconds(10))
      .trigger(new OneSecondIntervalTrigger)
      .process(new WindowResult)

    stream.print()
    env.execute()
  }

  class WindowResult extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
      out.collect("key " + key + ": the window contains " + elements.size + " elements")
    }
  }

  class OneSecondIntervalTrigger extends Trigger[(String, Long), TimeWindow] {

    // called once per element
    override def onElement(t: (String, Long), l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
      val firstSeen = triggerContext.getPartitionedState(
        new ValueStateDescriptor[Boolean]("first-seen", Types.of[Boolean])
      )
      // firstSeen is false only for the first element of the window
      if (!firstSeen.value()) {
        // round the element's timestamp up to the next full second,
        // e.g. a first event at 1234 ms yields tm = 2000 ms
        val tm = t._2 + (1000 - (t._2 % 1000))
        triggerContext.registerEventTimeTimer(tm) // timer at the next full second
        triggerContext.registerEventTimeTimer(w.getEnd) // timer at the window end
        firstSeen.update(true)
      }
      TriggerResult.CONTINUE
    }

    override def onProcessingTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
      TriggerResult.CONTINUE
    }

    // callback for the registered event-time timers
    override def onEventTime(l: Long, w: TimeWindow, triggerContext: Trigger.TriggerContext): TriggerResult = {
      println("timer fired at: " + l)
      if (l == w.getEnd) {
        TriggerResult.FIRE_AND_PURGE
      } else {
        val t = triggerContext.getCurrentWatermark + (1000 - (triggerContext.getCurrentWatermark % 1000))
        if (t < w.getEnd) {
          triggerContext.registerEventTimeTimer(t)
        }
        TriggerResult.FIRE
      }
    }

    override def clear(w: TimeWindow, triggerContext: Trigger.TriggerContext): Unit = {
      // the descriptor name resolves to the same state handle, which is cleared here
      val firstSeen = triggerContext.getPartitionedState(
        new ValueStateDescriptor[Boolean]("first-seen", Types.of[Boolean])
      )
      firstSeen.clear()
    }
  }
}
6. Handling Late Data
Three strategies:
- Drop late elements.
- Redirect late elements into another stream.
- Update an already-computed window result and emit the updated result.
6.1 Dropping late elements
- Dropping late elements is the default behavior of event-time window operators: a late element does not open a new window.
- A process function can easily filter out late elements by comparing each element's timestamp with the current watermark.
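The watermark comparison described above is a one-line predicate. A minimal sketch (the object and method names are illustrative, not a Flink API), using the same strict less-than test as the non-windowed example in 6.2:

```scala
object LateFilter {
  // an element is late if its timestamp is strictly below the current watermark
  def isLate(elementTs: Long, currentWatermark: Long): Boolean =
    elementTs < currentWatermark

  def main(args: Array[String]): Unit = {
    println(isLate(999L, 1000L))  // true: the element is behind the watermark
    println(isLate(1000L, 1000L)) // false: exactly at the watermark is not late
  }
}
```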
6.2 Redirecting late elements
Late elements can also be redirected into another stream using the side-output feature. The resulting stream of late elements can then be processed further or written to a sink for persistence.
Example 1: sending late elements to a side output
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/15 18:41
 */
object LateElementToSideOutput {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val readings = env
      .socketTextStream("hadoop103", 9999, '\n')
      .map(line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toLong * 1000L)
      })
      .assignAscendingTimestamps(_._2)
      // alternatively, assign bounded-out-of-orderness watermarks:
      // .assignTimestampsAndWatermarks(
      //   new BoundedOutOfOrdernessTimestampExtractor[(String, Long)](Time.milliseconds(1)) {
      //     override def extractTimestamp(element: (String, Long)): Long = element._2
      //   }
      // )
      .keyBy(_._1)
      .timeWindow(Time.seconds(10))
      .sideOutputLateData(
        new OutputTag[(String, Long)]("late")
      )
      .process(new CountFunction)

    readings.print()
    readings.getSideOutput(new OutputTag[(String, Long)]("late")).print()

    env.execute()
  }

  class CountFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
      out.collect("the window from " + context.window.getStart + " to " + context.window.getEnd + " closed!")
    }
  }
}
Example 2: sending late elements to a side output (without windows)
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/15 18:41
 */
object LateElementToSideOutputNonWindow {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val readings = env
      .socketTextStream("hadoop103", 9999, '\n')
      .map(line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toLong * 1000L)
      })
      .assignAscendingTimestamps(_._2)
      .process(new LateToSideOutput)

    readings.print()
    readings.getSideOutput(new OutputTag[String]("late")).print()

    env.execute()
  }

  class LateToSideOutput extends ProcessFunction[(String, Long), String] {

    val lateReadingOutput = new OutputTag[String]("late")

    override def processElement(i: (String, Long), context: ProcessFunction[(String, Long), String]#Context, collector: Collector[String]): Unit = {
      // an element whose timestamp is behind the current watermark is late
      if (i._2 < context.timerService().currentWatermark()) {
        context.output(lateReadingOutput, "a late event arrived!")
      } else {
        collector.collect("an on-time event arrived!")
      }
    }
  }
}
6.3 Updating window results with late elements (updating the element count)
Because elements can arrive late, an already-emitted window result may be inaccurate or incomplete. Late elements can be used to update and re-emit a result the window has already computed.
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/16 09:11
 */
object UpdateWindowResultWithLateElement {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val stream = env
      .socketTextStream("hadoop103", 9999, '\n')
      .map(line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toLong * 1000L)
      })
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[(String, Long)](Time.seconds(5)) {
          override def extractTimestamp(t: (String, Long)): Long = t._2
        }
      )
      .keyBy(_._1)
      .timeWindow(Time.seconds(5))
      .allowedLateness(Time.seconds(5))
      .process(new UpdatingWindowCountFunction)

    stream.print()
    env.execute()
  }

  class UpdatingWindowCountFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
      val count = elements.size

      // per-window state, visible only to the current window; default value is false
      val isUpdate = context.windowState.getState(
        new ValueStateDescriptor[Boolean]("is-update", Types.of[Boolean])
      )

      if (!isUpdate.value()) {
        // first firing, when the watermark passes the window end
        out.collect("first firing of the window! element count: " + count)
        isUpdate.update(true)
      } else {
        // a late element arrived: update the window's result
        out.collect("late element arrived! updated element count: " + count)
      }
    }
  }
}
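The `is-update` flag above implements a simple rule: the first firing of a window reports an initial result, and every later firing (caused by a late element within the allowed lateness) reports an update. A small model of that decision logic (the names `LateUpdateModel` and `describeFirings` are illustrative, not Flink APIs):

```scala
object LateUpdateModel {
  // models the per-window `is-update` flag: the first firing reports a fresh
  // result, every later firing (triggered by a late element) reports an update
  def describeFirings(firingCounts: List[Int]): List[String] = {
    var firstFired = false
    firingCounts.map { count =>
      if (!firstFired) { firstFired = true; s"initial result: $count elements" }
      else s"late update: $count elements"
    }
  }

  def main(args: Array[String]): Unit = {
    // the window fires with 3 elements when the watermark passes its end,
    // then twice more as late elements arrive within the allowed lateness
    println(describeFirings(List(3, 4, 5)))
    // List(initial result: 3 elements, late update: 4 elements, late update: 5 elements)
  }
}
```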