Flink ProcessFunction API

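1. Why use ProcessFunction

  • The transformation operators covered so far cannot access an event's timestamp or the current watermark, which some applications depend on. A map operator such as MapFunction, for example, cannot read the timestamp or the event time of the element it is processing.
  • For this reason the DataStream API provides a family of low-level transformation operators. They can access timestamps and watermarks, register timers, and emit special events such as timeout events.
  • Process functions are used to build event-driven applications and to implement custom business logic that the window functions and transformation operators seen earlier cannot express. Flink SQL, for example, is implemented with process functions.
  • Flink provides 8 process functions: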
    • ProcessFunction
    • KeyedProcessFunction
    • CoProcessFunction
    • ProcessJoinFunction
    • BroadcastProcessFunction
    • KeyedBroadcastProcessFunction
    • ProcessWindowFunction
    • ProcessAllWindowFunction

2. KeyedProcessFunction

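  • KeyedProcessFunction operates on a KeyedStream.
  • It is called for every element of the stream and emits zero, one, or more elements.
  • All process functions implement the RichFunction interface, so they all have open(), close(), getRuntimeContext(), and related methods.
  • KeyedProcessFunction[KEY, IN, OUT] additionally provides two methods:
    • processElement(v: IN, ctx: Context, out: Collector[OUT]) is called for every element of the stream; results are emitted through the Collector. The Context gives access to the element's timestamp, the element's key, and the TimerService. The Context can also emit results to other streams (side outputs).
    • onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[OUT]) is a callback invoked when a previously registered timer fires. The timestamp parameter is the time the timer was set to fire at, and the Collector receives the output. Like the Context of processElement, OnTimerContext provides context information, for example the time domain of the firing timer (event time or processing time).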
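
The two methods can be sketched together in a minimal skeleton. The class name ReadingLogger, the choice of key and element types, and the one-second timer offset are assumptions made for illustration only; the Flink calls themselves (ctx.timestamp(), ctx.getCurrentKey, timerService(), timeDomain()) are the ones described above.

```scala
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical example: key = sensor id (String), input = reading (Double), output = String.
class ReadingLogger extends KeyedProcessFunction[String, Double, String] {

  override def processElement(
      value: Double,
      ctx: KeyedProcessFunction[String, Double, String]#Context,
      out: Collector[String]): Unit = {
    // ctx.timestamp() is a boxed java.lang.Long; it is null when the record carries no timestamp.
    val ts: java.lang.Long = ctx.timestamp()
    if (ts != null) {
      // Emit one element here; emitting none or several is equally valid.
      out.collect(s"key=${ctx.getCurrentKey} value=$value ts=$ts")
      // Ask for a callback one second (event time) after this element. The offset is arbitrary.
      ctx.timerService().registerEventTimeTimer(ts + 1000L)
    }
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, Double, String]#OnTimerContext,
      out: Collector[String]): Unit = {
    // timeDomain() reports whether an event-time or a processing-time timer fired.
    out.collect(s"timer for key=${ctx.getCurrentKey} fired at $timestamp (${ctx.timeDomain()})")
  }
}
```

Such a function would be attached with something like stream.keyBy(...).process(new ReadingLogger), in the same way as the full examples further down.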

3. TimerService and timers

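  • The TimerService held by Context and OnTimerContext offers the following methods:

    • currentProcessingTime(): Long returns the current processing time.
    • currentWatermark(): Long returns the timestamp of the current watermark.
    • registerProcessingTimeTimer(timestamp: Long): Unit registers a processing-time timer for the current key. The timer fires when processing time reaches the given timestamp.
    • registerEventTimeTimer(timestamp: Long): Unit registers an event-time timer for the current key. The timer fires, and the callback runs, when the watermark is greater than or equal to the timer's timestamp.
    • deleteProcessingTimeTimer(timestamp: Long): Unit deletes a previously registered processing-time timer. If no timer with that timestamp exists, the call has no effect.
    • deleteEventTimeTimer(timestamp: Long): Unit deletes a previously registered event-time timer. If no timer with that timestamp exists, the call has no effect.
  • When a timer fires, the onTimer() callback is executed.

  • Note that timers can only be used on keyed streams.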
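
The firing and deletion rules above can be illustrated with a small plain-Scala model. This is not the Flink TimerService: TimerModel and its methods are hypothetical names, and the model only mimics event-time timers. It also assumes that registering the same (key, timestamp) pair twice yields a single timer, which is how Flink's timers are documented to behave.

```scala
import scala.collection.mutable

// Plain-Scala model of event-time timers; NOT the Flink TimerService itself.
final class TimerModel[K] {
  // At most one timer per (key, timestamp): a Set drops repeated registrations.
  private val timers = mutable.Set.empty[(K, Long)]

  def register(key: K, timestamp: Long): Unit = timers += ((key, timestamp))

  // Deleting a timer that was never registered has no effect.
  def delete(key: K, timestamp: Long): Unit = timers -= ((key, timestamp))

  // Advance the watermark and return the timers that fire, ordered by timestamp.
  // A timer fires once the watermark is greater than or equal to its timestamp.
  def advanceWatermark(watermark: Long): List[(K, Long)] = {
    val due = timers.filter { case (_, ts) => ts <= watermark }.toList.sortBy(_._2)
    timers --= due
    due
  }
}
```

For example, after registering ("s1", 1000L) twice and ("s2", 3000L) once, a watermark of 999 fires nothing, a watermark of 1000 fires the single "s1" timer, and a later watermark of 5000 fires the "s2" timer.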

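Requirement: monitor the water level reported by the water-level sensors and raise an alarm if the level rises continuously for 10 seconds (processing time).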

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object ProcessFunctionDemo1 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)  // use event time
    env.setParallelism(1)  // set the parallelism
    // read input lines of the form "id,ts,vc" from a socket
    val stream: DataStream[String] = env.socketTextStream("192.168.136.20", 7777)
    val dataStream: DataStream[WaterSensor] = stream.map(data => {
      val arrays: Array[String] = data.split(",")
      WaterSensor(arrays(0).trim, arrays(1).trim.toLong, arrays(2).trim.toDouble)
    })
    // assign event timestamps and generate watermarks, allowing 1 second of out-of-orderness
    .assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(element: WaterSensor): Long = {
          element.ts * 1000  // ts is assumed to be in seconds; Flink expects milliseconds
        }
      }
    )
    // process the stream after keying it by sensor id
    val processStream: DataStream[String] = dataStream.keyBy(_.id)
      .process(new WaterLevelAlarm)

    dataStream.print("data")
    processStream.print("alarm:")

    env.execute("keyedProcessFunction")
  }
}

// Defined at the top level so that WaterLevelAlarm below can refer to it.
case class WaterSensor(id: String, ts: Long, vc: Double)

// Custom function implementing the alarm logic
class WaterLevelAlarm extends KeyedProcessFunction[String, WaterSensor, String] {

  // water level of the previous reading
  private var waterHeightState: ValueState[Double] = _
  // timestamp of the currently registered timer (0L when none is registered)
  private var currentTSState: ValueState[Long] = _

  // initialization: create the state handles
  override def open(parameters: Configuration): Unit = {
    waterHeightState = getRuntimeContext.getState(
      new ValueStateDescriptor[Double]("waterHeight", classOf[Double])
    )

    currentTSState = getRuntimeContext.getState(
      new ValueStateDescriptor[Long]("currentTS", classOf[Long])
    )
  }

  // called for every sensor reading
  override def processElement(
      value: WaterSensor,
      ctx: KeyedProcessFunction[String, WaterSensor, String]#Context,
      out: Collector[String]): Unit = {
    // water level of the previous reading, compared with the incoming value
    val lastWaterHeight: Double = waterHeightState.value()
    // timestamp of the registered timer; 0L if no timer is registered
    val currentTS: Long = currentTSState.value()
    if (value.vc > lastWaterHeight && currentTS == 0L) {
      // the level rose and no timer is pending:
      // onTimer is called 10 seconds from now unless the timer is cancelled first
      val timeTS: Long = ctx.timerService().currentProcessingTime() + 10000L
      ctx.timerService().registerProcessingTimeTimer(timeTS)
      currentTSState.update(timeTS)
    } else if (value.vc <= lastWaterHeight) {
      // the level did not rise: cancel the pending timer (no effect if there is none)
      ctx.timerService().deleteProcessingTimeTimer(currentTS)
      currentTSState.clear()
    }
    // remember the current level for the next comparison
    waterHeightState.update(value.vc)
  }

  // called when the level has been rising for 10 seconds: emit the alarm
  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, WaterSensor, String]#OnTimerContext,
      out: Collector[String]): Unit = {
    out.collect(ctx.getCurrentKey + ": water level keeps rising, warning")
    currentTSState.clear()
  }
}
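
To trace the register/cancel decisions of processElement without a running job, the same branching can be written as a pure function. This is only a sketch: AlarmModel, AlarmState, and the Action types are hypothetical names, the clock value is passed in explicitly, and the empty state is modelled as 0.0 and 0L, which is what the Scala code above sees on the first element for a key.

```scala
object AlarmModel {
  // timer == 0L means "no timer pending"; lastHeight == 0.0 models empty state.
  final case class AlarmState(lastHeight: Double = 0.0, timer: Long = 0L)

  sealed trait Action
  final case class Register(at: Long) extends Action
  final case class Delete(at: Long) extends Action
  case object NoAction extends Action

  // One call corresponds to one processElement invocation for a single key.
  def step(s: AlarmState, vc: Double, now: Long): (AlarmState, Action) =
    if (vc > s.lastHeight && s.timer == 0L)
      (AlarmState(vc, now + 10000L), Register(now + 10000L)) // start the 10 s countdown
    else if (vc <= s.lastHeight)
      (AlarmState(vc, 0L), Delete(s.timer))                  // level did not rise: cancel
    else
      (s.copy(lastHeight = vc), NoAction)                    // still rising: keep the timer
}
```

Feeding the readings 3.0, 4.0, 2.0 at times 0, 2000, 4000 gives Register(10000), NoAction, Delete(10000): the countdown starts on the first rise, survives the second rise, and is cancelled by the drop.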

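Requirement: monitor the water level reported by the water-level sensors and raise an alarm if the level changes by more than 10 meters.

Method 1: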

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object ProcessFunctionDemo2 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.20",7777)
    val dataStream: DataStream[WaterSensor] = stream.map(data => {
      val arrays: Array[String] = data.split(",")
      WaterSensor(arrays(0).trim, arrays(1).trim.toLong, arrays(2).trim.toDouble)
    }).assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(element: WaterSensor): Long = {
          element.ts * 1000
        }
      }
    )
    val processStream: DataStream[(String, Double, Double)] = dataStream.keyBy(_.id)
      .flatMap(new HeightChangeAlarm(10.0))

    dataStream.print("data")
    processStream.print("alarm")

    env.execute("keyedprocessfunction")
  }
}

// Defined at the top level so that HeightChangeAlarm below can refer to it.
case class WaterSensor(id: String, ts: Long, vc: Double)

class HeightChangeAlarm(alarmValue:Double) extends RichFlatMapFunction[WaterSensor,(String,Double,Double)] {
  var waterLevelState: ValueState[Double] = _

  override def open(parameters: Configuration): Unit = {
    waterLevelState = getRuntimeContext
      .getState(new ValueStateDescriptor[Double](
        "waterLevel",
        classOf[Double]
      ))
  }

  override def flatMap(
                        in: WaterSensor,
                        out: Collector[(String, Double, Double)]): Unit = {
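    // previous level for this key; the empty state reads as 0.0 on the first element
    val lastValue: Double = waterLevelState.value()
    val abs: Double = (lastValue - in.vc).abs
    if (abs > alarmValue) {
      // emit (sensor id, previous level, current level) as an explicit tuple
      out.collect((in.id, lastValue, in.vc))
    }
    // remember the current level for the next comparison
    waterLevelState.update(in.vc)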
  }
}
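
The rule itself, comparing each reading with the one immediately before it, can be checked on a plain sequence. The sketch below is not part of the Flink job: HeightChange and alarms are hypothetical names, and unlike the stateful function above, the first reading has no predecessor here and therefore never raises an alarm.

```scala
object HeightChange {
  // Returns the (previous, current) pairs whose absolute difference exceeds the threshold.
  def alarms(readings: Seq[Double], threshold: Double): Seq[(Double, Double)] =
    readings.zip(readings.drop(1)).filter { case (prev, cur) => (cur - prev).abs > threshold }
}
```

With the readings 1.0, 5.0, 20.0, 18.0, 3.0 and a threshold of 10.0, only the jumps from 5.0 to 20.0 and from 18.0 to 3.0 qualify.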

Method 2:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object ProcessFunctionDemo2 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.20",7777)
    val dataStream: DataStream[WaterSensor] = stream.map(data => {
      val arrays: Array[String] = data.split(",")
      WaterSensor(arrays(0).trim, arrays(1).trim.toLong, arrays(2).trim.toDouble)
    }).assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(element: WaterSensor): Long = {
          element.ts * 1000
        }
      }
    )
    val processStream: DataStream[(String, Double, Double)] = dataStream.keyBy(_.id)
       .process(new HeightChangeAlarm2(10.0))

    dataStream.print("data")
    processStream.print("alarm")

    env.execute("keyedprocessfunction")
  }
}

// Defined at the top level so that HeightChangeAlarm2 below can refer to it.
case class WaterSensor(id: String, ts: Long, vc: Double)


class HeightChangeAlarm2(alarmValue:Double) extends
  KeyedProcessFunction[String,WaterSensor,(String,Double,Double)]{
  var waterLevelState:ValueState[Double]=_
  override def open(parameters: Configuration): Unit = {
    waterLevelState= getRuntimeContext
      .getState(
        new ValueStateDescriptor[Double]("lastTempState", classOf[Double]))
  }

  override def processElement(
                               in: WaterSensor,
                               context: KeyedProcessFunction[String, WaterSensor, (String, Double, Double)]#Context,
                               collector: Collector[(String, Double, Double)]): Unit ={
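    // previous level for this key; the empty state reads as 0.0 on the first element
    val lastValue: Double = waterLevelState.value()
    val abs: Double = (lastValue - in.vc).abs
    if (abs > alarmValue) {
      collector.collect((in.id, lastValue, in.vc))
    }
    // update on every element, not only when an alarm fires, so that the next
    // reading is always compared with the one immediately before it
    waterLevelState.update(in.vc)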
  }
}

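4. Side outputs

  • Most DataStream API operators produce a single output, that is, one stream of one data type.
  • The split operator is the exception: it can divide one stream into several, but all of those streams have the same data type.
  • The side-output feature of process functions can produce multiple streams, and these streams may have different data types.
  • A side output is identified by an OutputTag[X] object, where X is the data type of that output stream.
  • A process function emits an event to one or more side outputs through the Context object.

In some cases a data stream has to be split into two or more streams according to some property of the data, with each stream tagged so that it can be retrieved later. One approach covered earlier is split + select (split tags the elements of the stream, select retrieves a stream by its tag).

Side outputs achieve the same thing and are the recommended approach.

Requirement: split the water-sensor data into three streams by water level (with 5 m and 40 m as the boundaries) and retrieve the streams.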

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object ProcessFunctionDemo3 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val stream: DataStream[String] = env.socketTextStream("192.168.136.20", 7777)
    val dataStream: DataStream[WaterSensor] = stream.map(data => {
      val arrays: Array[String] = data.split(",")
      WaterSensor(arrays(0).trim, arrays(1).trim.toLong, arrays(2).trim.toDouble)
    }).assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(element: WaterSensor): Long = {
          element.ts * 1000
        }
      }
    )

    val processStream: DataStream[WaterSensor] = dataStream.process(new WaterLevelAlarm2())
    // main stream: readings between the two thresholds
    processStream.print("data")
    // side outputs are looked up by a tag with the same id and type as the one used in the function
    val sideStream: DataStream[String] = processStream.getSideOutput(new OutputTag[String]("water level lower"))
    val sideStream2: DataStream[String] = processStream.getSideOutput(new OutputTag[String]("water level higher"))
    sideStream.print("waterlower")
    sideStream2.print("waterhigher")

    env.execute("keyedprocessfunction2")
  }
}

// Defined at the top level so that WaterLevelAlarm2 below can refer to it.
case class WaterSensor(id: String, ts: Long, vc: Double)


class WaterLevelAlarm2 extends ProcessFunction[WaterSensor,WaterSensor]{
  private val waterLevelAlarm = new OutputTag[String]("water level lower")  // side output for low water levels
  private val waterLevelAlarmhigh = new OutputTag[String]("water level higher")  // side output for high water levels

  override def processElement(
             value: WaterSensor,
             ctx: ProcessFunction[WaterSensor, WaterSensor]#Context,
             out: Collector[WaterSensor]): Unit = {
      if(value.vc<5){  // below 5 m: send an alarm to the low-level side output
        ctx.output(waterLevelAlarm,"water level below the warning level: "+value.id)
      }else if(value.vc>40){  // above 40 m: send an alarm to the high-level side output
        ctx.output(waterLevelAlarmhigh,"water level above the warning level: "+value.id)
      } else{
        out.collect(value)  // normal readings stay in the main stream
      }
    }
}
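
The boundary behaviour of the three-way split is easy to get wrong, so the branching can be isolated in a pure function and checked on its own. Routing, Route, and route are hypothetical names used only for this sketch; the thresholds are the 5 m and 40 m of the requirement.

```scala
object Routing {
  sealed trait Route
  case object Low extends Route   // would go to the "water level lower" side output
  case object High extends Route  // would go to the "water level higher" side output
  case object Main extends Route  // would stay in the main stream

  // Same comparisons as processElement above: both boundary values stay in the main stream.
  def route(vc: Double): Route =
    if (vc < 5) Low else if (vc > 40) High else Main
}
```

A reading of exactly 5.0 or 40.0 is routed to the main stream, while 4.9 and 40.1 go to the low and high side outputs respectively.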