Learn Flink Complex Event Processing (CEP) Through Examples

1. What is Flink CEP

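Flink CEP detects the data in a DataStream that matches a given rule. In the figure below, the elements of the DataStream are various shapes. We want to detect the rule "a rectangle followed by an ellipse", which yields two matches.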

[Figure: What is CEP]

2. Preparing to use CEP

pom.xml

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-cep-scala_2.11</artifactId>
    <version>1.13.2</version>
    <scope>provided</scope>
</dependency>

MyWatermarkStrategy.scala, which assigns timestamps and watermarks

package cepTest

import org.apache.commons.lang3.time.FastDateFormat
import org.apache.flink.api.common.eventtime.{TimestampAssigner, TimestampAssignerSupplier, Watermark, WatermarkGenerator, WatermarkGeneratorSupplier, WatermarkOutput, WatermarkStrategy}

class RecordTimestampAssigner extends TimestampAssigner[(String, String, Int)] {
  val fdf = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")

  override def extractTimestamp(element: (String, String, Int), recordTimestamp: Long): Long = {

    fdf.parse(element._2).getTime

  }

}

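// Emits periodic watermarks based on the largest event time seen so far
class PeriodWatermarkGenerator extends WatermarkGenerator[(String, String, Int)] {
  val fdf = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
  var maxTimestamp: Long = _ // largest event time seen so far, initially 0
  val maxOutofOrderness = 0  // tolerated out-of-orderness in milliseconds

  override def onEvent(event: (String, String, Int), eventTimestamp: Long, output: WatermarkOutput): Unit = {
    // Only track the largest event time here; no watermark is emitted per event
    maxTimestamp = math.max(fdf.parse(event._2).getTime, maxTimestamp)
  }

  override def onPeriodicEmit(output: WatermarkOutput): Unit = {
    // Called periodically; the watermark trails the largest event time by 1 ms
    output.emitWatermark(new Watermark(maxTimestamp - maxOutofOrderness - 1))
  }
}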


class MyWatermarkStrategy extends WatermarkStrategy[(String, String, Int)] {

  override def createTimestampAssigner(context: TimestampAssignerSupplier.Context): TimestampAssigner[(String, String, Int)] = {

    new RecordTimestampAssigner()
  }

  override def createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context): WatermarkGenerator[(String, String, Int)] = {
    new PeriodWatermarkGenerator()

  }

}
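
To make the watermark arithmetic in PeriodWatermarkGenerator concrete, here is a minimal plain-Scala sketch that does not depend on Flink. It is only an illustration: it uses java.time instead of FastDateFormat, it assumes UTC as the time zone, and the names WatermarkSketch, toMillis and watermarkAfter are made up for this example.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object WatermarkSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Parse the event-time string into epoch milliseconds (UTC is an assumption of this sketch)
  def toMillis(s: String): Long =
    LocalDateTime.parse(s, fmt).toInstant(ZoneOffset.UTC).toEpochMilli

  // Same formula as onPeriodicEmit: largest timestamp seen - out-of-orderness - 1.
  // eventTimes must be non-empty.
  def watermarkAfter(eventTimes: Seq[String], maxOutOfOrdernessMs: Long = 0L): Long =
    eventTimes.map(toMillis).max - maxOutOfOrdernessMs - 1
}
```

With an out-of-orderness of 0, the watermark after an event stamped 18:00:03 sits 1 ms before 18:00:03, so another event carrying exactly that timestamp is still ahead of the watermark.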

CEPDemo.scala, the CEP program template

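  • Whenever the full CEPDemo is not shown in later sections, this template is the one being used
  • The template has two places to fill in:
    1. The contents of input
    2. The pattern to match
  • When two events have the same timestamp, the EventComparator decides which one is processed first
  • input can be either a non-keyed DataStream or a keyed DataStream
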
package cepTest

import org.apache.flink.cep.EventComparator
import org.apache.flink.cep.functions.PatternProcessFunction
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.{CEP, PatternStream}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector
import java.util

object CEPDemo {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val input = senv.fromElements(

      // Record format: ("a", "2021-09-29 18:00:01", 0)

    ).assignTimestampsAndWatermarks(new MyWatermarkStrategy())

    val pattern = // specify the pattern to match

    val event_comparator = new EventComparator[(String,String,Int)] {
      override def compare(o1: (String, String, Int), o2: (String, String, Int)): Int = {
        if(o1._3 > o2._3) 1 else if(o1._3 == o2._3) 0 else -1
      }
    }

    val pattern_stream: PatternStream[(String, String, Int)] = CEP.pattern(input, pattern, event_comparator)

    val result_stream: DataStream[(String, String, Int)] = pattern_stream.process(
      new PatternProcessFunction[(String, String, Int), (String, String, Int)] {
        override def processMatch(map: util.Map[String, util.List[(String, String, Int)]], context: PatternProcessFunction.Context, collector: Collector[(String, String, Int)]): Unit = {

          println(map)

        }
      }
    )

    result_stream.print("result_stream")

    senv.execute("CEPDemo")
  }
}

3. Individual Patterns

  • Each pattern name must be unique, such as my_start in this example

Example 1: detect records whose first field is the character a

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0)

Pattern

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")

Program output

{my_start=[(a,2021-09-29 18:00:01,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)]}

3.1 times, oneOrMore, timesOrMore (relaxed contiguity)

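  • They apply only to the immediately preceding pattern, my_start in this example
  • times can be specified in two ways: times(times: Int) and times(from: Int, to: Int)
  • oneOrMore means one or more times; timesOrMore means at least n times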

Example 2: the character a occurs 2 or 3 times, not necessarily consecutively

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("c", "2021-09-29 18:00:05", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .times(2, 3)

Program output:

{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}

3.2 consecutive (strict contiguity)

  • times, oneOrMore, and timesOrMore use relaxed contiguity by default; consecutive makes them require strict contiguity

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("c", "2021-09-29 18:00:05", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .times(2, 3)
  .consecutive()

Program output:

{my_start=[(a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}

3.3 allowCombinations (non-deterministic relaxed contiguity)

  • Works on the same principle as Non-Deterministic Relaxed Contiguity (followedByAny)

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("c", "2021-09-29 18:00:05", 0)

Pattern:

val pattern =
  Pattern.begin[(String,String,Int)]("my_start")
    .where(_._1 == "a")
    .times(2, 3)
    .allowCombinations()

Program output:

{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
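
The difference between the three looping modes shown so far (plain times, consecutive, and allowCombinations) can be reproduced with a small plain-Scala sketch. It is a deliberately simplified model for this five-event input, not how Flink's matching engine is implemented; the object LoopingContiguitySketch and its methods are hypothetical names, and events are reduced to (character, second).

```scala
object LoopingContiguitySketch {
  type Event = (String, Int) // (character, second of the timestamp)

  val events: Seq[Event] = Seq(("a", 1), ("b", 2), ("a", 3), ("a", 4), ("c", 5))

  private def isA(e: Event): Boolean = e._1 == "a"

  // Indices at which a match may start
  private def starts: Seq[Int] = events.indices.filter(i => isA(events(i)))

  // times(2, 3): non-matching events in between are skipped
  def relaxed: Seq[Seq[Event]] =
    for {
      i <- starts
      candidates = events.drop(i).filter(isA)
      n <- 2 to 3 if candidates.length >= n
    } yield candidates.take(n)

  // times(2, 3).consecutive(): a non-matching event ends the run
  def consecutive: Seq[Seq[Event]] =
    for {
      i <- starts
      run = events.drop(i).takeWhile(isA)
      n <- 2 to 3 if run.length >= n
    } yield run.take(n)

  // times(2, 3).allowCombinations(): matching events may be skipped as well
  def combinations: Seq[Seq[Event]] =
    for {
      i <- starts
      later = events.drop(i + 1).filter(isA)
      n <- 2 to 3
      rest <- later.combinations(n - 1).toSeq
    } yield events(i) +: rest
}
```

For this input, relaxed yields three matches, consecutive yields one, and combinations yields four, the same sets of events as in the program outputs of sections 3.1 to 3.3 (the order of the combinations may differ).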

3.3 Methods for specifying conditions

3.3.1 where

  • Multiple where calls are combined with AND

Input data

("a", "2021-09-29 18:00:01", 200),
("b", "2021-09-29 18:00:02", 200),
("a", "2021-09-29 18:00:03", 300),
("a", "2021-09-29 18:00:04", 50)

Pattern

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .where(_._3 > 100)

Program output

{my_start=[(a,2021-09-29 18:00:01,200)]}
{my_start=[(a,2021-09-29 18:00:03,300)]}

3.3.2 or

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0)

Pattern

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a").or(_._1 == "b")

Program output

{my_start=[(a,2021-09-29 18:00:01,0)]}
{my_start=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)]}
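
The where and or examples above amount to combining predicates with AND and OR. The following plain-Scala sketch shows that reading without any Flink dependency; ConditionSketch, whereWhere and whereOr are hypothetical helper names introduced only for this illustration.

```scala
object ConditionSketch {
  type Event = (String, String, Int)

  // .where(p1).where(p2) accepts an event only if both predicates hold
  def whereWhere(p1: Event => Boolean, p2: Event => Boolean): Event => Boolean =
    e => p1(e) && p2(e)

  // .where(p1).or(p2) accepts an event if either predicate holds
  def whereOr(p1: Event => Boolean, p2: Event => Boolean): Event => Boolean =
    e => p1(e) || p2(e)
}
```

Filtering the input data of 3.3.1 with whereWhere(_._1 == "a", _._3 > 100) keeps the records at 18:00:01 and 18:00:03, and filtering the input data of 3.3.2 with whereOr(_._1 == "a", _._1 == "b") keeps the a, b, and a records.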

3.3.3 until

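  • Can only be used after oneOrMore
  • Events that satisfy neither the where condition nor the until condition are simply ignored
  • The final match still succeeds even if no until event is encountered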

Input data

("a", "2021-09-29 18:00:01", 200),
("a", "2021-09-29 18:00:02", 100),
("a", "2021-09-29 18:00:03", 300),
("a", "2021-09-29 18:00:04", 50),
("a", "2021-09-29 18:00:05", 400),
("a", "2021-09-29 18:00:06", 500)

Pattern

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._3 > 100).oneOrMore
  .until(_._3 < 100)

Program output

{my_start=[(a,2021-09-29 18:00:01,200)]}
{my_start=[(a,2021-09-29 18:00:01,200), (a,2021-09-29 18:00:03,300)]}
{my_start=[(a,2021-09-29 18:00:03,300)]}
{my_start=[(a,2021-09-29 18:00:05,400)]}
{my_start=[(a,2021-09-29 18:00:05,400), (a,2021-09-29 18:00:06,500)]}
{my_start=[(a,2021-09-29 18:00:06,500)]}

3.3.4 subtype

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0)

Pattern

class My_tuple3(x1:String, x2:String, x3:Int) extends scala.Tuple3[String,String,Int](x1, x2, x3)


Pattern.begin[(String,String,Int)]("my_start")
  .subtype(classOf[My_tuple3])

  • The program produces no output
  • The events in the stream are of type (String,String,Int), whereas the pattern requires the subclass My_tuple3, so nothing matches

3.4 Conditions with IterativeCondition

  • Specify times, oneOrMore, or timesOrMore first, then the IterativeCondition

Input data

("a", "2021-09-29 18:00:01", 10),
("b", "2021-09-29 18:00:02", 30),
("a", "2021-09-29 18:00:03", 20),
("a", "2021-09-29 18:00:04", 10),
("a", "2021-09-29 18:00:05", 30),
("a", "2021-09-29 18:00:06", 20)

Pattern:

import org.apache.flink.cep.pattern.conditions.IterativeCondition

Pattern.begin[(String,String,Int)]("my_start")
  .times(3)
  .where(new IterativeCondition[(String, String, Int)] {
    override def filter(t: (String, String, Int), context: IterativeCondition.Context[(String, String, Int)]): Boolean = {

      import scala.collection.JavaConversions.iterableAsScalaIterable
      lazy val previous_acc = context.getEventsForPattern("my_start").map(_._3).aggregate((0,0))(
        (acc, value) => (acc._1 + 1, acc._2 + value),
        (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
      )
      lazy val avg = if(previous_acc._1 == 0) 0 else (previous_acc._2.toDouble / previous_acc._1)

      t._1 == "a" && t._3 >= avg
    }
  })

Program output

{my_start=[(a,2021-09-29 18:00:01,10), (a,2021-09-29 18:00:03,20), (a,2021-09-29 18:00:05,30)]}
{my_start=[(a,2021-09-29 18:00:04,10), (a,2021-09-29 18:00:05,30), (a,2021-09-29 18:00:06,20)]}

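Step-by-step explanation of the output:

  • First result:

    1. Record ("a", "2021-09-29 18:00:01", 10) arrives. context.getEventsForPattern("my_start") returns nothing, so avg = 0. The character is a and 10 >= 0, so this is the first matching record and it is added to the context
    2. Record ("b", "2021-09-29 18:00:02", 30) arrives. The character is not a, so it does not match
    3. Record ("a", "2021-09-29 18:00:03", 20) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:01", 10)), so avg = 10. The character is a and 20 >= 10, so this is the second matching record and it is added to the context
    4. Record ("a", "2021-09-29 18:00:04", 10) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:01", 10), ("a", "2021-09-29 18:00:03", 20)), so avg = 15. The character is a but 10 < 15, so it does not match
    5. Record ("a", "2021-09-29 18:00:05", 30) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:01", 10), ("a", "2021-09-29 18:00:03", 20)), so avg = 15. The character is a and 30 >= 15, so this is the third matching record. Because we set times(3), the first result is emitted and the records added to the context are cleared
  • Second result:

    1. Record ("a", "2021-09-29 18:00:04", 10) arrives. context.getEventsForPattern("my_start") returns nothing, so avg = 0. The character is a and 10 >= 0, so this is the first matching record and it is added to the context
    2. Record ("a", "2021-09-29 18:00:05", 30) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:04", 10)), so avg = 10. The character is a and 30 >= 10, so this is the second matching record and it is added to the context
    3. Record ("a", "2021-09-29 18:00:06", 20) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:04", 10), ("a", "2021-09-29 18:00:05", 30)), so avg = 20. The character is a and 20 >= 20, so this is the third matching record. Because we set times(3), the second result is emitted and the records added to the context are cleared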
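
The acceptance rule that the steps above walk through can be isolated as a small pure function. This is a sketch only: RunningAverageSketch and accepts are hypothetical names, and the previous parameter stands in for the third fields of the events that context.getEventsForPattern("my_start") would return.

```scala
object RunningAverageSketch {
  // previous: third fields of the events already accepted into my_start
  def accepts(previous: Seq[Int], name: String, value: Int): Boolean = {
    // The average of an empty partial match is treated as 0, as in the pattern code
    val avg = if (previous.isEmpty) 0.0 else previous.sum.toDouble / previous.size
    name == "a" && value >= avg
  }
}
```

For example, accepts(Seq(10, 20), "a", 10) is false because 10 is below the average 15, while accepts(Seq(10, 30), "a", 20) is true because 20 equals the average 20.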

4. Combining Patterns

4.1 Three forms of contiguity

4.1.1 Strict Contiguity(next)

  • Pattern 1 and pattern 2 must be directly adjacent. For example, for the input a, c, b1, b2 and the pattern (a, b), there is no match

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("d", "2021-09-29 18:00:05", 0),
("b", "2021-09-29 18:00:06", 0),
("b", "2021-09-29 18:00:07", 0),
("a", "2021-09-29 18:00:08", 0),
("b", "2021-09-29 18:00:09", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .next("my_next").where(_._1 == "b")

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:08,0)], my_next=[(b,2021-09-29 18:00:09,0)]}

4.1.2 Relaxed Contiguity(followedBy)

  • Pattern 1 and pattern 2 do not have to be adjacent. For example, for the input a, c, b1, b2 and the pattern (a, b), the match is (a, b1)

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("d", "2021-09-29 18:00:05", 0),
("b", "2021-09-29 18:00:06", 0),
("b", "2021-09-29 18:00:07", 0),
("a", "2021-09-29 18:00:08", 0),
("b", "2021-09-29 18:00:09", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .followedBy("my_next").where(_._1 == "b")

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:08,0)], my_next=[(b,2021-09-29 18:00:09,0)]}

4.1.3 Non-Deterministic Relaxed Contiguity(followedByAny)

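  • Pattern 1 and pattern 2 do not have to be adjacent, and an already matched pattern 2 event can be skipped to match later pattern 2 events. For example, for the input a, c, b1, b2 and the pattern (a, b), the matches are (a, b1) and (a, b2)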

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("d", "2021-09-29 18:00:05", 0),
("b", "2021-09-29 18:00:06", 0),
("b", "2021-09-29 18:00:07", 0),
("a", "2021-09-29 18:00:08", 0),
("b", "2021-09-29 18:00:09", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .followedByAny("my_next").where(_._1 == "b")

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:07,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:07,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
{my_start=[(a,2021-09-29 18:00:08,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
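
The three outputs above can be reproduced with a plain-Scala model of the two-step pattern "a then b". This is a simplified sketch for this specific input, not Flink's matching engine; ContiguitySketch and its methods are hypothetical names, and events are reduced to (character, second).

```scala
object ContiguitySketch {
  type Event = (String, Int) // (character, second of the timestamp)

  val events: Seq[Event] = Seq(
    ("a", 1), ("b", 2), ("c", 3), ("a", 4), ("d", 5),
    ("b", 6), ("b", 7), ("a", 8), ("b", 9))

  private def isA(e: Event): Boolean = e._1 == "a"
  private def isB(e: Event): Boolean = e._1 == "b"

  // next: b must be the event immediately after a
  def next: Seq[(Event, Event)] =
    events.zip(events.drop(1)).filter { case (x, y) => isA(x) && isB(y) }

  // followedBy: the first b after a, skipping non-matching events
  def followedBy: Seq[(Event, Event)] =
    for {
      (x, i) <- events.zipWithIndex if isA(x)
      y <- events.drop(i + 1).find(isB)
    } yield (x, y)

  // followedByAny: every b after a
  def followedByAny: Seq[(Event, Event)] =
    for {
      (x, i) <- events.zipWithIndex if isA(x)
      y <- events.drop(i + 1).filter(isB)
    } yield (x, y)
}
```

On this input, next finds 2 pairs, followedBy finds 3, and followedByAny finds 8, matching the counts in the three program outputs (Flink emits them in event-time order rather than grouped by start event).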

4.1.4 notNext

  • One pattern must not be immediately followed by another pattern

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .notNext("my_not_next").where(_._1 == "b")

Program output:

{my_start=[(a,2021-09-29 18:00:03,0)]}

4.1.5 notFollowedBy

  • Specifies a pattern that must not occur between two other patterns

Input data

("a", "2021-09-29 18:00:01", 0),
("d", "2021-09-29 18:00:02", 0),
("b", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0),
("a", "2021-09-29 18:00:05", 0),
("e", "2021-09-29 18:00:06", 0),
("c", "2021-09-29 18:00:07", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .notFollowedBy("my_not_next").where(_._1 == "b")
  .followedBy("my_follow_by").where(_._1 == "c")

Program output:

{my_start=[(a,2021-09-29 18:00:05,0)], my_follow_by=[(c,2021-09-29 18:00:07,0)]}

4.1.6 within

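  • A match is only valid if all of its patterns match within the specified time
  • The time window starts at the first event of the match; the end of the window is exclusive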

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:10", 0),
("a", "2021-09-29 18:00:11", 0),
("b", "2021-09-29 18:00:21", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .next("my_next").where(_._1 == "b")
  .within(Time.seconds(10L))

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:10,0)]}
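
The exclusive end of the window explains why only the first pair matches. A one-line plain-Scala check, written under the rule stated above (WithinSketch and inWindow are hypothetical names, and times are reduced to seconds):

```scala
object WithinSketch {
  // The window opens at the first event of the match; its end is exclusive
  def inWindow(firstSecond: Int, lastSecond: Int, windowSeconds: Int): Boolean =
    lastSecond - firstSecond < windowSeconds
}
```

The pair at seconds 01 and 10 is 9 seconds apart and falls inside the 10-second window, while the pair at seconds 11 and 21 is exactly 10 seconds apart and falls outside it.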

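TimedOutPartialMatchHandler can be used to output partial matches that failed because of the within window. Only the events matched so far are output (here, those of the first pattern). For example: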

The complete CEPDemo:

package cepTest

import org.apache.flink.cep.EventComparator
import org.apache.flink.cep.functions.{PatternProcessFunction, TimedOutPartialMatchHandler}
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.{CEP, PatternStream}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

import java.util

class My_patternProcessFunction extends PatternProcessFunction[(String, String, Int), (String, String, Int)] with TimedOutPartialMatchHandler[(String, String, Int)]{

  override def processMatch(map: util.Map[String, util.List[(String, String, Int)]], context: PatternProcessFunction.Context, collector: Collector[(String, String, Int)]): Unit = {

    println(map)

  }

  override def processTimedOutMatch(map: util.Map[String, util.List[(String, String, Int)]], context: PatternProcessFunction.Context): Unit = {

    // context.output()
    println("processTimedOutMatch: " + map.toString)
  }

}

object CEPDemo {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val input = senv.fromElements(
      ("a", "2021-09-29 18:00:01", 0),
      ("b", "2021-09-29 18:00:10", 0),
      ("a", "2021-09-29 18:00:11", 0),
      ("b", "2021-09-29 18:00:21", 0),
      ("a", "2021-09-29 18:00:22", 0),
      ("b", "2021-09-29 18:00:32", 0)
    ).assignTimestampsAndWatermarks(new MyWatermarkStrategy())

    val pattern =
      Pattern.begin[(String,String,Int)]("my_start")
        .where(_._1 == "a")
        .next("my_next").where(_._1 == "b")
        .within(Time.seconds(10L))

    val event_comparator = new EventComparator[(String,String,Int)] {
      override def compare(o1: (String, String, Int), o2: (String, String, Int)): Int = {
        if(o1._3 > o2._3) 1 else if(o1._3 == o2._3) 0 else -1
      }
    }

    val pattern_stream: PatternStream[(String, String, Int)] = CEP.pattern(input, pattern, event_comparator)

    val result_stream: DataStream[(String, String, Int)] = pattern_stream.process(new My_patternProcessFunction())

    result_stream.print("result_stream")

    senv.execute("CEPDemo")
  }
}

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:10,0)]}
processTimedOutMatch: {my_start=[(a,2021-09-29 18:00:11,0)]}
processTimedOutMatch: {my_start=[(a,2021-09-29 18:00:22,0)]}

4.2 Looping patterns

4.2.1 times, oneOrMore, timesOrMore (relaxed contiguity)

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("d", "2021-09-29 18:00:03", 0),
("b", "2021-09-29 18:00:04", 0),
("b", "2021-09-29 18:00:05", 0),
("c", "2021-09-29 18:00:06", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .next("my_nextb").where(_._1 == "b").times(2, 3)
  .followedBy("my_nextc").where(_._1 == "c")

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)], my_nextb=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0), (b,2021-09-29 18:00:05,0)], my_nextc=[(c,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_nextb=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0)], my_nextc=[(c,2021-09-29 18:00:06,0)]}

Notes on the output:

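  • next means that a and the first b must be directly adjacent
  • times means that the b events themselves do not have to be adjacent
  • followedBy means that the last b and c do not have to be adjacent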

4.2.2 optional

  • Marks a pattern as optional: it may or may not occur
  • optional is used after times, oneOrMore, or timesOrMore

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("b", "2021-09-29 18:00:04", 0),
("b", "2021-09-29 18:00:05", 0),
("a", "2021-09-29 18:00:06", 0)

Pattern:

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1 == "a")
  .next("my_next").where(_._1 == "b").times(2, 3).optional

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0), (b,2021-09-29 18:00:05,0)]}
{my_start=[(a,2021-09-29 18:00:06,0)]}

4.2.3 greedy

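  • When a record satisfies both pattern 1 and pattern 2, applying greedy to pattern 1 makes the record match pattern 1 only
  • greedy is used after times(from: Int, to: Int), oneOrMore, or timesOrMore, and can be combined with optional
  • It cannot be applied to Groups of patterns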

Input data

("a", "2021-09-29 18:00:01", 0),
("aa", "2021-09-29 18:00:02", 0),
("aaa", "2021-09-29 18:00:03", 0),
("bbb", "2021-09-29 18:00:04", 0)

Pattern without greedy

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1.startsWith("a")).times(2,3)
  .next("my_next").where(_._1.length == 3)

Program output without greedy

{my_start=[(a,2021-09-29 18:00:01,0), (aa,2021-09-29 18:00:02,0)], my_next=[(aaa,2021-09-29 18:00:03,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}
{my_start=[(aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}

Pattern with greedy

Pattern.begin[(String,String,Int)]("my_start")
  .where(_._1.startsWith("a")).times(2,3).greedy
  .next("my_next").where(_._1.length == 3)

Program output with greedy

{my_start=[(a,2021-09-29 18:00:01,0), (aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}
{my_start=[(aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}

5. Groups of patterns

  • A combined Pattern can be passed as an argument to begin, next, followedBy, or followedByAny, producing a GroupPattern (a subclass of Pattern)

Input data

("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0),
("b", "2021-09-29 18:00:05", 0)

Pattern

Pattern.begin(
  Pattern.begin[(String,String,Int)]("my_start")
    .where(_._1 == "a")
    .next("my_next").where(_._1 == "b")
)

Program output:

{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}

6. After Match Skip Strategy

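Input data (inferred from the program outputs below)

("a", "2021-09-29 18:00:01", 0),
("a", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("b", "2021-09-29 18:00:04", 0)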

  1. noSkip pattern:
val skip_strategy = AfterMatchSkipStrategy.noSkip()

Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
  .where(_._1 == "a").oneOrMore.consecutive()
  .next("my_next").where(_._1 == "b")

noSkip program output:

{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
  2. skipToNext pattern:
val skip_strategy = AfterMatchSkipStrategy.skipToNext()

Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
  .where(_._1 == "a").oneOrMore.consecutive()
  .next("my_next").where(_._1 == "b")

skipToNext program output:

{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
  3. skipPastLastEvent pattern:
val skip_strategy = AfterMatchSkipStrategy.skipPastLastEvent()

Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
  .where(_._1 == "a").oneOrMore.consecutive()
  .next("my_next").where(_._1 == "b")

skipPastLastEvent program output:

{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
  4. skipToFirst pattern:
val skip_strategy = AfterMatchSkipStrategy.skipToFirst("my_start")

Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
  .where(_._1 == "a").oneOrMore.consecutive()
  .next("my_next").where(_._1 == "b")

skipToFirst program output:

{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
  5. skipToLast pattern:
val skip_strategy = AfterMatchSkipStrategy.skipToLast("my_start")

Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
  .where(_._1 == "a").oneOrMore.consecutive()
  .next("my_next").where(_._1 == "b")

skipToLast program output:

{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}

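Notes on the output:

  • In this example, {my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]} is the first match found. Every skip strategy keeps it and uses it as the reference match. Within pattern my_start of this match, the a at second 01 is the first event and the a at second 03 is the last
  • noSkip: the default; every match is emitted
  • skipToNext: discards the other partial matches that start with the a at second 01
  • skipPastLastEvent: discards every partial match that started before the reference match ended, so only the reference match remains
  • skipToFirst: discards partial matches that started before the a at second 01
  • skipToLast: discards partial matches that started before the a at second 03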
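
The rules above can be checked against the five outputs with a small plain-Scala model. This is a one-pass simplification that filters the three noSkip matches relative to the reference match only; it is not how Flink applies skip strategies internally. SkipStrategySketch, Match, and the method names are hypothetical, and events are reduced to their seconds.

```scala
object SkipStrategySketch {
  // A match is reduced to the seconds of its my_start events plus the second of my_next
  case class Match(myStart: Seq[Int], myNext: Int) {
    def first: Int = myStart.head // second of the event that started the match
    def last: Int = myNext        // second of the event that completed the match
  }

  // The three matches produced with noSkip, in emission order
  val all: Seq[Match] = Seq(Match(Seq(1, 2, 3), 4), Match(Seq(2, 3), 4), Match(Seq(3), 4))
  val base: Match = all.head // the reference match

  // Keep the reference match plus every other match that passes `keep`
  private def after(keep: Match => Boolean): Seq[Match] = base +: all.tail.filter(keep)

  def noSkip: Seq[Match] = all
  def skipToNext: Seq[Match] = after(m => m.first != base.first)
  def skipPastLastEvent: Seq[Match] = after(m => m.first > base.last)
  def skipToFirst: Seq[Match] = after(m => m.first >= base.myStart.head)
  def skipToLast: Seq[Match] = after(m => m.first >= base.myStart.last)
}
```

In this model noSkip, skipToNext and skipToFirst keep all three matches, skipPastLastEvent keeps only the reference match, and skipToLast keeps the reference match plus the one starting at second 03, in line with the outputs listed above.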

7. Handling Lateness in Event Time

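CEP processes events in timestamp order and uses the watermark to mark the timestamp (call it timestampA) up to which all data has already been processed. When a newly received event has a timestamp smaller than timestampA, it is considered a late event and can be emitted through sideOutputLateData, as follows: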

pattern_stream.sideOutputLateData(lateDataOutputTag: OutputTag[T]).process(......)