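1. What is Flink CEP
Flink CEP detects the data in a DataStream that matches a given rule. Suppose the elements of a DataStream are various shapes and the rule is "a rectangle followed by an ellipse"; in the original illustration, the stream yields two matches.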
2. Preparing to use CEP
pom.xml
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-cep-scala_2.11</artifactId>
<version>1.13.2</version>
<scope>provided</scope>
</dependency>
MyWatermarkStrategy.scala, which assigns the timestamp and watermark
package cepTest
import org.apache.commons.lang3.time.FastDateFormat
import org.apache.flink.api.common.eventtime.{TimestampAssigner, TimestampAssignerSupplier, Watermark, WatermarkGenerator, WatermarkGeneratorSupplier, WatermarkOutput, WatermarkStrategy}
class RecordTimestampAssigner extends TimestampAssigner[(String, String, Int)] {
val fdf = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
override def extractTimestamp(element: (String, String, Int), recordTimestamp: Long): Long = {
fdf.parse(element._2).getTime
}
}
class PeriodWatermarkGenerator extends WatermarkGenerator[(String, String, Int)] {
val fdf = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
var maxTimestamp: Long = _
val maxOutofOrderness = 0
override def onEvent(event: (String, String, Int), eventTimestamp: Long, output: WatermarkOutput): Unit = {
maxTimestamp = math.max(fdf.parse(event._2).getTime, maxTimestamp)
}
override def onPeriodicEmit(output: WatermarkOutput): Unit = {
output.emitWatermark(new Watermark(maxTimestamp - maxOutofOrderness - 1))
}
}
class MyWatermarkStrategy extends WatermarkStrategy[(String, String, Int)] {
override def createTimestampAssigner(context: TimestampAssignerSupplier.Context): TimestampAssigner[(String, String, Int)] = {
new RecordTimestampAssigner()
}
override def createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context): WatermarkGenerator[(String, String, Int)] = {
new PeriodWatermarkGenerator()
}
}
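CEPDemo.scala, the CEP program template
- Unless a later section shows a complete CEPDemo, it uses this template
- The template has two places to fill in:
  - the contents of input
  - the pattern to match
- When two events have the same timestamp, the EventComparator decides which one is processed first
- input can be a non-keyed DataStream or a keyed DataStream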
package cepTest
import org.apache.flink.cep.EventComparator
import org.apache.flink.cep.functions.PatternProcessFunction
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.{CEP, PatternStream}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector
import java.util
object CEPDemo {
def main(args: Array[String]): Unit = {
val senv = StreamExecutionEnvironment.getExecutionEnvironment
val input = senv.fromElements(
// Data format: ("a", "2021-09-29 18:00:01", 0)
).assignTimestampsAndWatermarks(new MyWatermarkStrategy())
val pattern = // specify the pattern to match
val event_comparator = new EventComparator[(String,String,Int)] {
override def compare(o1: (String, String, Int), o2: (String, String, Int)): Int = {
if(o1._3 > o2._3) 1 else if(o1._3 == o2._3) 0 else -1
}
}
val pattern_stream: PatternStream[(String, String, Int)] = CEP.pattern(input, pattern, event_comparator)
val result_stream: DataStream[(String, String, Int)] = pattern_stream.process(
new PatternProcessFunction[(String, String, Int), (String, String, Int)] {
override def processMatch(map: util.Map[String, util.List[(String, String, Int)]], context: PatternProcessFunction.Context, collector: Collector[(String, String, Int)]): Unit = {
println(map)
}
}
)
result_stream.print("result_stream")
senv.execute("CEPDemo")
}
}
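The tie-breaking role of the EventComparator in the template above can be modelled in plain Scala without Flink. This is only a sketch under assumptions: the object name TieBreakSketch and the helper processingOrder are hypothetical, and the sort merely imitates "order by timestamp, then by comparator"; it is not how Flink implements it internally.

```scala
import java.util.Comparator

object TieBreakSketch {
  type Event = (String, String, Int)

  // Same rule as the template's comparator: order by the third field.
  val byThirdField: Comparator[Event] = new Comparator[Event] {
    override def compare(o1: Event, o2: Event): Int = Integer.compare(o1._3, o2._3)
  }

  // Hypothetical helper: order by timestamp first, then break ties with the comparator.
  // The "yyyy-MM-dd HH:mm:ss" format sorts chronologically as plain text.
  def processingOrder(events: List[Event]): List[Event] =
    events.sortWith { (x, y) =>
      if (x._2 != y._2) x._2 < y._2 else byThirdField.compare(x, y) < 0
    }

  def main(args: Array[String]): Unit = {
    val events = List(
      ("b", "2021-09-29 18:00:01", 2),
      ("a", "2021-09-29 18:00:01", 1),
      ("c", "2021-09-29 18:00:00", 9)
    )
    processingOrder(events).foreach(println)
  }
}
```

Here the two events that share the timestamp 18:00:01 are ordered by their third field, while the earlier event still comes first regardless of its third field.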
3. Individual Patterns
- Every pattern name must be unique, such as my_start in this example
Example 1: detect records whose first field is the character a
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0)
Pattern
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
Output
{my_start=[(a,2021-09-29 18:00:01,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)]}
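3.1 times, oneOrMore, timesOrMore (relaxed contiguity)
- A quantifier applies only to the pattern immediately before it, my_start in this example
- times has two forms: times(times: Int) and times(from: Int, to: Int)
- oneOrMore means one or more occurrences; timesOrMore(n) means at least n occurrences
Example 2: the character a occurs 2 or 3 times, not necessarily consecutively
Input data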
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("c", "2021-09-29 18:00:05", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.times(2, 3)
Output:
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
3.2 consecutive (strict contiguity)
- times, oneOrMore and timesOrMore use relaxed contiguity by default; consecutive restricts them to strictly consecutive matches
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("c", "2021-09-29 18:00:05", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.times(2, 3)
.consecutive()
Output:
{my_start=[(a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
3.3 allowCombinations (non-deterministic relaxed contiguity)
- Works on the same principle as Non-Deterministic Relaxed Contiguity (followedByAny)
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("c", "2021-09-29 18:00:05", 0)
Pattern:
val pattern =
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.times(2, 3)
.allowCombinations()
Output:
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0), (a,2021-09-29 18:00:04,0)]}
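The four results above are the order-preserving selections of 2 or 3 events among the three a events. A minimal plain-Scala sketch of that enumeration follows, assuming distinct events; the names AllowCombinationsSketch and combos are hypothetical, and the code models the outcome only, not Flink's NFA.

```scala
object AllowCombinationsSketch {
  // Enumerate every order-preserving selection of `min` to `max` matching events.
  // Note: Seq.combinations treats equal elements as identical, so events must be distinct.
  def combos[A](matching: List[A], min: Int, max: Int): List[List[A]] =
    (min to max).toList.flatMap(n => matching.combinations(n).toList)

  def main(args: Array[String]): Unit = {
    // Seconds of the three "a" events in the example;
    // the "b" and "c" events are already removed by the where() filter.
    val as = List("01", "03", "04")
    combos(as, 2, 3).foreach(println)
  }
}
```

The sketch yields the same four groups as the Flink output, although not necessarily in the same order.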
3.3 Ways to specify conditions
3.3.1 where
- Chaining several where calls has the effect of a logical AND
Input data
("a", "2021-09-29 18:00:01", 200),
("b", "2021-09-29 18:00:02", 200),
("a", "2021-09-29 18:00:03", 300),
("a", "2021-09-29 18:00:04", 50)
Pattern
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.where(_._3 > 100)
Output
{my_start=[(a,2021-09-29 18:00:01,200)]}
{my_start=[(a,2021-09-29 18:00:03,300)]}
3.3.2 or
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0)
Pattern
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a").or(_._1 == "b")
Output
{my_start=[(a,2021-09-29 18:00:01,0)]}
{my_start=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)]}
3.3.3 until
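- Can only be used after oneOrMore
- An event that satisfies neither the where condition nor the until condition is simply ignored
- The final run of matches still counts even if no until event is encountered
Input data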
("a", "2021-09-29 18:00:01", 200),
("a", "2021-09-29 18:00:02", 100),
("a", "2021-09-29 18:00:03", 300),
("a", "2021-09-29 18:00:04", 50),
("a", "2021-09-29 18:00:05", 400),
("a", "2021-09-29 18:00:06", 500)
Pattern
Pattern.begin[(String,String,Int)]("my_start")
.where(_._3 > 100).oneOrMore
.until(_._3 < 100)
Output
{my_start=[(a,2021-09-29 18:00:01,200)]}
{my_start=[(a,2021-09-29 18:00:01,200), (a,2021-09-29 18:00:03,300)]}
{my_start=[(a,2021-09-29 18:00:03,300)]}
{my_start=[(a,2021-09-29 18:00:05,400)]}
{my_start=[(a,2021-09-29 18:00:05,400), (a,2021-09-29 18:00:06,500)]}
{my_start=[(a,2021-09-29 18:00:06,500)]}
3.3.4 subtype
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0)
Pattern
class My_tuple3(x1:String, x2:String, x3:Int) extends scala.Tuple3[String,String,Int](x1, x2, x3)
Pattern.begin[(String,String,Int)]("my_start")
.subtype(classOf[My_tuple3])
- The program produces no output
- The events matched by the Pattern have type (String,String,Int), whereas the subtype we specified is its subclass My_tuple3, so nothing matches
3.4 Condition expressions with IterativeCondition
- Specify times, oneOrMore or timesOrMore first, then the IterativeCondition
Input data
("a", "2021-09-29 18:00:01", 10),
("b", "2021-09-29 18:00:02", 30),
("a", "2021-09-29 18:00:03", 20),
("a", "2021-09-29 18:00:04", 10),
("a", "2021-09-29 18:00:05", 30),
("a", "2021-09-29 18:00:06", 20)
Pattern:
import org.apache.flink.cep.pattern.conditions.IterativeCondition
Pattern.begin[(String,String,Int)]("my_start")
.times(3)
.where(new IterativeCondition[(String, String, Int)] {
override def filter(t: (String, String, Int), context: IterativeCondition.Context[(String, String, Int)]): Boolean = {
import scala.collection.JavaConversions.iterableAsScalaIterable
lazy val previous_acc = context.getEventsForPattern("my_start").map(_._3).aggregate((0,0))(
(acc, value) => (acc._1 + 1, acc._2 + value),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)
lazy val avg = if(previous_acc._1 == 0) 0 else (previous_acc._2.toDouble / previous_acc._1)
t._1 == "a" && t._3 >= avg
}
})
Output
{my_start=[(a,2021-09-29 18:00:01,10), (a,2021-09-29 18:00:03,20), (a,2021-09-29 18:00:05,30)]}
{my_start=[(a,2021-09-29 18:00:04,10), (a,2021-09-29 18:00:05,30), (a,2021-09-29 18:00:06,20)]}
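How the results are produced:
- First result:
  - The event ("a", "2021-09-29 18:00:01", 10) arrives. context.getEventsForPattern("my_start") returns nothing, so avg = 0. The character equals a and 10 >= 0, so this is the first accepted event and it is added to the context
  - The event ("b", "2021-09-29 18:00:02", 30) arrives. The character is not a, so it is rejected
  - The event ("a", "2021-09-29 18:00:03", 20) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:01", 10)), so avg = 10. The character equals a and 20 >= 10, so this is the second accepted event and it is added to the context
  - The event ("a", "2021-09-29 18:00:04", 10) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:01", 10), ("a", "2021-09-29 18:00:03", 20)), so avg = 15. The character equals a but 10 < 15, so it is rejected
  - The event ("a", "2021-09-29 18:00:05", 30) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:01", 10), ("a", "2021-09-29 18:00:03", 20)), so avg = 15. The character equals a and 30 >= 15, so this is the third accepted event. Because the pattern uses times(3), the first result is emitted and the events added to the context are cleared
- Second result:
  - The event ("a", "2021-09-29 18:00:04", 10) arrives. context.getEventsForPattern("my_start") returns nothing, so avg = 0. The character equals a and 10 >= 0, so this is the first accepted event and it is added to the context
  - The event ("a", "2021-09-29 18:00:05", 30) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:04", 10)), so avg = 10. The character equals a and 30 >= 10, so this is the second accepted event and it is added to the context
  - The event ("a", "2021-09-29 18:00:06", 20) arrives. context.getEventsForPattern("my_start") returns Seq(("a", "2021-09-29 18:00:04", 10), ("a", "2021-09-29 18:00:05", 30)), so avg = 20. The character equals a and 20 >= 20, so this is the third accepted event. Because the pattern uses times(3), the second result is emitted and the events added to the context are cleared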
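The walk-through above can be replayed in plain Scala. This is a simplified sketch, not Flink's matching engine: the object RunningAverageSketch and its helpers avg and collect are hypothetical names, and the scan models a single partial match that starts at a chosen index.

```scala
object RunningAverageSketch {
  type Event = (String, String, Int)

  // Average of the third field over the events accepted so far; 0 when there are none.
  def avg(accepted: List[Event]): Double =
    if (accepted.isEmpty) 0.0 else accepted.map(_._3).sum.toDouble / accepted.size

  // Scan from index `start` and accept events whose first field is "a" and whose
  // value is >= the running average, until `n` events have been accepted.
  def collect(events: List[Event], start: Int, n: Int): List[Event] =
    events.drop(start).foldLeft(List.empty[Event]) { (accepted, e) =>
      if (accepted.size < n && e._1 == "a" && e._3 >= avg(accepted)) accepted :+ e
      else accepted
    }

  val events: List[Event] = List(
    ("a", "2021-09-29 18:00:01", 10),
    ("b", "2021-09-29 18:00:02", 30),
    ("a", "2021-09-29 18:00:03", 20),
    ("a", "2021-09-29 18:00:04", 10),
    ("a", "2021-09-29 18:00:05", 30),
    ("a", "2021-09-29 18:00:06", 20)
  )

  def main(args: Array[String]): Unit = {
    println(collect(events, 0, 3)) // scan starting at the 18:00:01 event
    println(collect(events, 3, 3)) // scan starting at the 18:00:04 event
  }
}
```

Starting the scan at index 0 or 3 collects three events, matching the two results; starting at index 2 collects only two events, which is consistent with no result beginning at 18:00:03.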
4. Combining Patterns
4.1 Forms of contiguity
4.1.1 Strict Contiguity (next)
- Pattern 1 and pattern 2 must be directly adjacent. For example, for the sequence a, c, b1, b2, the pattern (a, b) produces no match
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("d", "2021-09-29 18:00:05", 0),
("b", "2021-09-29 18:00:06", 0),
("b", "2021-09-29 18:00:07", 0),
("a", "2021-09-29 18:00:08", 0),
("b", "2021-09-29 18:00:09", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.next("my_next").where(_._1 == "b")
Output:
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:08,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
4.1.2 Relaxed Contiguity (followedBy)
- Pattern 1 and pattern 2 need not be adjacent. For example, for the sequence a, c, b1, b2, the pattern (a, b) matches (a, b1)
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("d", "2021-09-29 18:00:05", 0),
("b", "2021-09-29 18:00:06", 0),
("b", "2021-09-29 18:00:07", 0),
("a", "2021-09-29 18:00:08", 0),
("b", "2021-09-29 18:00:09", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.followedBy("my_next").where(_._1 == "b")
Output:
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:08,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
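4.1.3 Non-Deterministic Relaxed Contiguity (followedByAny)
- Pattern 1 and pattern 2 need not be adjacent, and an already matched pattern 2 event can be skipped to match further pattern 2 events. For example, for the sequence a, c, b1, b2, the pattern (a, b) matches (a, b1) and (a, b2)
Input data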
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("a", "2021-09-29 18:00:04", 0),
("d", "2021-09-29 18:00:05", 0),
("b", "2021-09-29 18:00:06", 0),
("b", "2021-09-29 18:00:07", 0),
("a", "2021-09-29 18:00:08", 0),
("b", "2021-09-29 18:00:09", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.followedByAny("my_next").where(_._1 == "b")
Output:
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:07,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:07,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
{my_start=[(a,2021-09-29 18:00:04,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
{my_start=[(a,2021-09-29 18:00:08,0)], my_next=[(b,2021-09-29 18:00:09,0)]}
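The difference between next, followedBy and followedByAny on the sequence a, c, b1, b2 can be modelled in plain Scala for a two-step pattern (a, b). This is an illustrative sketch only: the object ContiguitySketch and its functions are hypothetical names, events are reduced to string labels, and time windows and skip strategies are left out.

```scala
object ContiguitySketch {
  // An event matches step 1 if its label starts with "a", step 2 if it starts with "b".
  private def isA(e: String): Boolean = e.startsWith("a")
  private def isB(e: String): Boolean = e.startsWith("b")

  // next: the b must be the event immediately after the a.
  def strict(events: List[String]): List[(String, String)] =
    events.zip(events.drop(1)).filter { case (x, y) => isA(x) && isB(y) }

  // followedBy: pair each a with the first b that comes later.
  def relaxed(events: List[String]): List[(String, String)] =
    events.zipWithIndex.flatMap { case (x, i) =>
      if (isA(x)) events.drop(i + 1).find(isB).map(y => (x, y)).toList else Nil
    }

  // followedByAny: pair each a with every b that comes later.
  def relaxedAny(events: List[String]): List[(String, String)] =
    for {
      (x, i) <- events.zipWithIndex
      if isA(x)
      y <- events.drop(i + 1)
      if isB(y)
    } yield (x, y)

  def main(args: Array[String]): Unit = {
    val events = List("a", "c", "b1", "b2")
    println(strict(events))
    println(relaxed(events))
    println(relaxedAny(events))
  }
}
```

Applied to the nine-event input of this section (a, b, c, a, d, b, b, a, b), relaxed produces three pairs and relaxedAny eight, the same counts as the Flink outputs shown for followedBy and followedByAny.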
4.1.4 notNext
- A pattern that must not be immediately followed by another pattern
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.notNext("my_not_next").where(_._1 == "b")
Output:
{my_start=[(a,2021-09-29 18:00:03,0)]}
4.1.5 notFollowedBy
- Specifies a pattern whose events must not occur between two other patterns
Input data
("a", "2021-09-29 18:00:01", 0),
("d", "2021-09-29 18:00:02", 0),
("b", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0),
("a", "2021-09-29 18:00:05", 0),
("e", "2021-09-29 18:00:06", 0),
("c", "2021-09-29 18:00:07", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.notFollowedBy("my_not_next").where(_._1 == "b")
.followedBy("my_follow_by").where(_._1 == "c")
Output:
{my_start=[(a,2021-09-29 18:00:05,0)], my_follow_by=[(c,2021-09-29 18:00:07,0)]}
4.1.6 within
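- A match across several patterns is valid only if it completes within the given time
- The time window starts at the first event that matches; the end of the window is exclusive
Input data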
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:10", 0),
("a", "2021-09-29 18:00:11", 0),
("b", "2021-09-29 18:00:21", 0)
Pattern:
import org.apache.flink.streaming.api.windowing.time.Time
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.next("my_next").where(_._1 == "b")
.within(Time.seconds(10L))
Output:
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:10,0)]}
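The exclusive upper bound explains why only the first pair is reported: 18:00:01 to 18:00:10 is 9 seconds, while 18:00:11 to 18:00:21 is exactly 10 seconds. A small plain-Scala sketch of that check follows; WithinSketch and inWindow are hypothetical names, and the comparison models the rule as described in this section rather than Flink's internal code.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object WithinSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // True when `end` falls inside the window opened at `start`;
  // the end of the window is exclusive, so elapsed must be strictly smaller.
  def inWindow(start: String, end: String, window: Duration): Boolean = {
    val elapsed = Duration.between(LocalDateTime.parse(start, fmt), LocalDateTime.parse(end, fmt))
    elapsed.compareTo(window) < 0
  }

  def main(args: Array[String]): Unit = {
    val tenSeconds = Duration.ofSeconds(10)
    println(inWindow("2021-09-29 18:00:01", "2021-09-29 18:00:10", tenSeconds)) // 9 s elapsed
    println(inWindow("2021-09-29 18:00:11", "2021-09-29 18:00:21", tenSeconds)) // 10 s elapsed
  }
}
```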
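A TimedOutPartialMatchHandler can be used to output the elements of partial matches that failed because of the within window; note that it only outputs the elements of the first pattern. For example:
The complete CEPDemo: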
package cepTest
import org.apache.flink.cep.EventComparator
import org.apache.flink.cep.functions.{PatternProcessFunction, TimedOutPartialMatchHandler}
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.{CEP, PatternStream}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
import java.util
class My_patternProcessFunction extends PatternProcessFunction[(String, String, Int), (String, String, Int)] with TimedOutPartialMatchHandler[(String, String, Int)]{
override def processMatch(map: util.Map[String, util.List[(String, String, Int)]], context: PatternProcessFunction.Context, collector: Collector[(String, String, Int)]): Unit = {
println(map)
}
override def processTimedOutMatch(map: util.Map[String, util.List[(String, String, Int)]], context: PatternProcessFunction.Context): Unit = {
// context.output()
println("processTimedOutMatch: " + map.toString)
}
}
object CEPDemo {
def main(args: Array[String]): Unit = {
val senv = StreamExecutionEnvironment.getExecutionEnvironment
val input = senv.fromElements(
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:10", 0),
("a", "2021-09-29 18:00:11", 0),
("b", "2021-09-29 18:00:21", 0),
("a", "2021-09-29 18:00:22", 0),
("b", "2021-09-29 18:00:32", 0)
).assignTimestampsAndWatermarks(new MyWatermarkStrategy())
val pattern =
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.next("my_next").where(_._1 == "b")
.within(Time.seconds(10L))
val event_comparator = new EventComparator[(String,String,Int)] {
override def compare(o1: (String, String, Int), o2: (String, String, Int)): Int = {
if(o1._3 > o2._3) 1 else if(o1._3 == o2._3) 0 else -1
}
}
val pattern_stream: PatternStream[(String, String, Int)] = CEP.pattern(input, pattern, event_comparator)
val result_stream: DataStream[(String, String, Int)] = pattern_stream.process(new My_patternProcessFunction())
result_stream.print("result_stream")
senv.execute("CEPDemo")
}
}
Output:
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:10,0)]}
processTimedOutMatch: {my_start=[(a,2021-09-29 18:00:11,0)]}
processTimedOutMatch: {my_start=[(a,2021-09-29 18:00:22,0)]}
4.2 Looping patterns
4.2.1 times, oneOrMore, timesOrMore (relaxed contiguity)
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("d", "2021-09-29 18:00:03", 0),
("b", "2021-09-29 18:00:04", 0),
("b", "2021-09-29 18:00:05", 0),
("c", "2021-09-29 18:00:06", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.next("my_nextb").where(_._1 == "b").times(2, 3)
.followedBy("my_nextc").where(_._1 == "c")
Output:
{my_start=[(a,2021-09-29 18:00:01,0)], my_nextb=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0), (b,2021-09-29 18:00:05,0)], my_nextc=[(c,2021-09-29 18:00:06,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_nextb=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0)], my_nextc=[(c,2021-09-29 18:00:06,0)]}
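Explanation of the results:
- next means that a and the first b must be directly adjacent
- times means that the b events need not be adjacent to each other
- followedBy means that the last b and c need not be adjacent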
4.2.2 optional
- Defines a pattern that may or may not occur
- optional is used after times, oneOrMore or timesOrMore
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("c", "2021-09-29 18:00:03", 0),
("b", "2021-09-29 18:00:04", 0),
("b", "2021-09-29 18:00:05", 0),
("a", "2021-09-29 18:00:06", 0)
Pattern:
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.next("my_next").where(_._1 == "b").times(2, 3).optional
Output:
{my_start=[(a,2021-09-29 18:00:01,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0), (b,2021-09-29 18:00:04,0), (b,2021-09-29 18:00:05,0)]}
{my_start=[(a,2021-09-29 18:00:06,0)]}
4.2.3 greedy
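- When an event satisfies both pattern 1 and pattern 2 and pattern 1 is greedy, the event is matched by pattern 1 only
- greedy is used after times(from: Int, to: Int), oneOrMore or timesOrMore, and can be combined with optional
- It cannot be applied to Groups of patterns
Input data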
("a", "2021-09-29 18:00:01", 0),
("aa", "2021-09-29 18:00:02", 0),
("aaa", "2021-09-29 18:00:03", 0),
("bbb", "2021-09-29 18:00:04", 0)
Pattern without greedy
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1.startsWith("a")).times(2,3)
.next("my_next").where(_._1.length == 3)
Output without greedy
{my_start=[(a,2021-09-29 18:00:01,0), (aa,2021-09-29 18:00:02,0)], my_next=[(aaa,2021-09-29 18:00:03,0)]}
{my_start=[(a,2021-09-29 18:00:01,0), (aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}
{my_start=[(aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}
Pattern with greedy
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1.startsWith("a")).times(2,3).greedy
.next("my_next").where(_._1.length == 3)
Output with greedy
{my_start=[(a,2021-09-29 18:00:01,0), (aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}
{my_start=[(aa,2021-09-29 18:00:02,0), (aaa,2021-09-29 18:00:03,0)], my_next=[(bbb,2021-09-29 18:00:04,0)]}
5. Groups of patterns
- A combined Pattern can be passed as an argument to begin, next, followedBy or followedByAny, producing a GroupPattern (a subclass of Pattern)
Input data
("a", "2021-09-29 18:00:01", 0),
("b", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("c", "2021-09-29 18:00:04", 0),
("b", "2021-09-29 18:00:05", 0)
Pattern
Pattern.begin(
Pattern.begin[(String,String,Int)]("my_start")
.where(_._1 == "a")
.next("my_next").where(_._1 == "b")
)
Output:
{my_start=[(a,2021-09-29 18:00:01,0)], my_next=[(b,2021-09-29 18:00:02,0)]}
6. After Match Skip Strategy
Input data
(inferred from the results in this section, which contain exactly these four events)
("a", "2021-09-29 18:00:01", 0),
("a", "2021-09-29 18:00:02", 0),
("a", "2021-09-29 18:00:03", 0),
("b", "2021-09-29 18:00:04", 0)
Pattern with noSkip
import org.apache.flink.cep.nfa.aftermatch.AfterMatchSkipStrategy
val skip_strategy = AfterMatchSkipStrategy.noSkip()
Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
.where(_._1 == "a").oneOrMore.consecutive()
.next("my_next").where(_._1 == "b")
Output with noSkip:
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
Pattern with skipToNext
val skip_strategy = AfterMatchSkipStrategy.skipToNext()
Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
.where(_._1 == "a").oneOrMore.consecutive()
.next("my_next").where(_._1 == "b")
Output with skipToNext:
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
Pattern with skipPastLastEvent
val skip_strategy = AfterMatchSkipStrategy.skipPastLastEvent()
Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
.where(_._1 == "a").oneOrMore.consecutive()
.next("my_next").where(_._1 == "b")
Output with skipPastLastEvent:
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
Pattern with skipToFirst
val skip_strategy = AfterMatchSkipStrategy.skipToFirst("my_start")
Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
.where(_._1 == "a").oneOrMore.consecutive()
.next("my_next").where(_._1 == "b")
Output with skipToFirst:
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
Pattern with skipToLast
val skip_strategy = AfterMatchSkipStrategy.skipToLast("my_start")
Pattern.begin[(String,String,Int)]("my_start", skip_strategy)
.where(_._1 == "a").oneOrMore.consecutive()
.next("my_next").where(_._1 == "b")
Output with skipToLast:
{my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
{my_start=[(a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]}
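Explanation of the results:
- In this example, {my_start=[(a,2021-09-29 18:00:01,0), (a,2021-09-29 18:00:02,0), (a,2021-09-29 18:00:03,0)], my_next=[(b,2021-09-29 18:00:04,0)]} is the first match found. Every skip strategy keeps this match and uses it as the reference. Within the pattern my_start, the a at second 01 is the first event and the a at second 03 is the last
- noSkip: the default
- skipToNext: discards partial matches that start with the a at second 01
- skipPastLastEvent: discards everything except the reference match
- skipToFirst: discards partial matches that start before the a at second 01
- skipToLast: discards partial matches that start before the a at second 03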
7. Handling Lateness in Event Time
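CEP processes events in timestamp order and uses the watermark to mark the timestamp, call it timestampA, up to which events have already been processed. When a newly received event has a timestamp smaller than timestampA, it is considered a late event and can be emitted through sideOutputLateData as follows: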
pattern_stream.sideOutputLateData(lateDataOutputTag: OutputTag[T]).process(......)