Flink Series 01: Learning FlinkCEP from the Source Code (Definitions and Basic Concepts)

Background: I recently picked up the book 《Flink实战派》 to learn Flink, but the book is essentially a reorganized translation of the official Flink documentation and offers little as an introduction. The documentation itself is not particularly well written either, and most of what you find online is second-hand copy-and-paste, so I decided to work through it myself.

Following convention, here are the dependencies and the source location first:

<!-- pom.xml -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-cep_2.11</artifactId>
    <version>1.13.6</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-cep-scala_2.11</artifactId>
    <version>1.13.6</version>
</dependency>

The source lives under the flink-libraries/ directory of the Flink repository on GitHub, split into two modules: flink-cep and flink-cep-scala.

Since removing Scala from Flink is the long-term direction of the project, this article works with the Java version.

Github: Apache/Flink/FlinkCEP

The best way to understand something is to start from its definition.

Official definition:

FlinkCEP is the Complex Event Processing (CEP) library implemented on top of Flink. It allows you to detect event patterns in an endless stream of events, giving you the opportunity to get hold of what’s important in your data.

The events in the DataStream to which you want to apply pattern matching must implement proper equals() and hashCode() methods because FlinkCEP uses them for comparing and matching events.

Analysis:

  • FlinkCEP is not part of flink-dist; it is implemented on top of Flink, so it has to be added as a separate dependency.
  • FlinkCEP's purpose is to perform pattern matching over an unbounded stream of events (hence it applies only to a DataStream).
  • Because pattern matching compares events, the event class must implement proper equals() and hashCode() methods (a minimal sketch follows below).
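For instance, a minimal sketch of such an event class. The name Event and its fields are my own assumption, chosen to line up with what the official example below seems to expect:

import java.util.Objects;

// Hypothetical event POJO; FlinkCEP relies on equals()/hashCode() when comparing events.
public class Event {
    private final long id;
    private final String name;

    public Event(long id, String name) {
        this.id = id;
        this.name = name;
    }

    public long getId() { return id; }
    public String getName() { return name; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Event)) return false;
        Event other = (Event) o;
        return id == other.id && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, name);
    }
}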

First, the somewhat cryptic example code from the official docs:

DataStream<Event> input = ...

Pattern<Event, ?> pattern = Pattern.<Event>begin("start").where(
        new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event event) {
                return event.getId() == 42;
            }
        }
    ).next("middle").subtype(SubEvent.class).where(
        new SimpleCondition<SubEvent>() {
            @Override
            public boolean filter(SubEvent subEvent) {
                return subEvent.getVolume() >= 10.0;
            }
        }
    ).followedBy("end").where(
         new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event event) {
                return event.getName().equals("end");
            }
         }
    );

PatternStream<Event> patternStream = CEP.pattern(input, pattern);

DataStream<Alert> result = patternStream.process(
    new PatternProcessFunction<Event, Alert>() {
        @Override
        public void processMatch(
                Map<String, List<Event>> pattern,
                Context ctx,
                Collector<Alert> out) throws Exception {
            out.collect(createAlertFrom(pattern));
        }
    });
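The snippet also assumes a few user-defined types that the docs never show (SubEvent, Alert, createAlertFrom). A possible sketch, purely my guess at their shape:

import java.util.List;
import java.util.Map;

// Hypothetical subtype used by the "middle" pattern.
class SubEvent extends Event {
    private final double volume;

    public SubEvent(long id, String name, double volume) {
        super(id, name);
        this.volume = volume;
    }

    public double getVolume() { return volume; }
}

// Hypothetical output type produced for each match.
class Alert {
    private final String message;

    public Alert(String message) { this.message = message; }

    @Override
    public String toString() { return "Alert(" + message + ")"; }
}

// Hypothetical helper referenced in processMatch() above.
static Alert createAlertFrom(Map<String, List<Event>> pattern) {
    return new Alert("matched patterns: " + pattern.keySet());
}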

After reading it, I still have a number of questions:

  • Does the first pattern have to be named "start"?
  • Do next and followedBy imply an ordering between the patterns they connect?
  • This fluent-call style looks rather ugly; can I build each pattern first and wire up the relationships afterwards?

By reading the source code, we will answer each of these questions.

Pattern, the pattern, is the basic unit of pattern matching and comes in two kinds: the Individual Pattern and the Combining Pattern. Individual patterns are chained into a sequence by calling rather opaque methods such as next and followedBy, which establish their ordering. I personally dislike documentation of this kind, because in a Chinese-language context it is not obvious whether "next" or "followedBy" implies the earlier position, so let's go straight to the source code.

How a Pattern is defined

Open the source of the org.apache.flink.cep.pattern.Pattern class and find the constructor:

protected Pattern(
        final String name,
        final Pattern<T, ? extends T> previous,
        final ConsumingStrategy consumingStrategy,
        final AfterMatchSkipStrategy afterMatchSkipStrategy) {
    this.name = name;
    this.previous = previous;
    this.quantifier = Quantifier.one(consumingStrategy);
    this.afterMatchSkipStrategy = afterMatchSkipStrategy;
}

The constructor is protected, so FlinkCEP does not let you create a Pattern directly with new. This is the static factory method pattern: objects must be created through the designated static methods.

Pattern exposes a single creation method, begin, with four overloads: two create an individual Pattern and two create a GroupPattern (a combining pattern).

/**
     * Starts a new pattern sequence. The provided name is the one of the initial pattern of the new
     * sequence. Furthermore, the base type of the event sequence is set.
     *
     * @param name The name of starting pattern of the new pattern sequence
     * @param <X> Base type of the event pattern
     * @return The first pattern of a pattern sequence
     */
    public static <X> Pattern<X, X> begin(final String name) {
        return new Pattern<>(name, null, ConsumingStrategy.STRICT, AfterMatchSkipStrategy.noSkip());
    }

    /**
     * Starts a new pattern sequence. The provided name is the one of the initial pattern of the new
     * sequence. Furthermore, the base type of the event sequence is set.
     *
     * @param name The name of starting pattern of the new pattern sequence
     * @param afterMatchSkipStrategy the {@link AfterMatchSkipStrategy.SkipStrategy} to use after
     *     each match.
     * @param <X> Base type of the event pattern
     * @return The first pattern of a pattern sequence
     */
    public static <X> Pattern<X, X> begin(
            final String name, final AfterMatchSkipStrategy afterMatchSkipStrategy) {
        return new Pattern<X, X>(name, null, ConsumingStrategy.STRICT, afterMatchSkipStrategy);
    }

...

/**
     * Starts a new pattern sequence. The provided pattern is the initial pattern of the new
     * sequence.
     *
     * @param group the pattern to begin with
     * @param afterMatchSkipStrategy the {@link AfterMatchSkipStrategy.SkipStrategy} to use after
     *     each match.
     * @return The first pattern of a pattern sequence
     */
    public static <T, F extends T> GroupPattern<T, F> begin(
            final Pattern<T, F> group, final AfterMatchSkipStrategy afterMatchSkipStrategy) {
        return new GroupPattern<>(null, group, ConsumingStrategy.STRICT, afterMatchSkipStrategy);
    }

    /**
     * Starts a new pattern sequence. The provided pattern is the initial pattern of the new
     * sequence.
     *
     * @param group the pattern to begin with
     * @return the first pattern of a pattern sequence
     */
    public static <T, F extends T> GroupPattern<T, F> begin(Pattern<T, F> group) {
        return new GroupPattern<>(
                null, group, ConsumingStrategy.STRICT, AfterMatchSkipStrategy.noSkip());
    }

Having seen the constructor, we can already answer one question:

Can I build all of the individual patterns first and then wire them together by setting the previous relationship?

The answer is no. The previous field set by the Pattern constructor is final: once it has been initialized (to null by begin), it can only be read via getPrevious(); there is no setPrevious().

We can also draw two obvious conclusions:

  • The first pattern must be created through Pattern.begin.
  • The name does not have to be "start"; any String will do (see the sketch below).
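In other words, something like the following compiles just as well. The names are arbitrary, and this is only a sketch that assumes the hypothetical Event class from above:

import org.apache.flink.cep.nfa.aftermatch.AfterMatchSkipStrategy;
import org.apache.flink.cep.pattern.Pattern;

// Any name works for the starting pattern; "start" is just a convention used in the docs.
Pattern<Event, Event> first = Pattern.<Event>begin("whatever-you-like");

// The second overload lets you pass an AfterMatchSkipStrategy explicitly.
Pattern<Event, Event> firstWithSkip =
        Pattern.<Event>begin("whatever-you-like", AfterMatchSkipStrategy.skipPastLastEvent());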

In the two begin overloads for individual patterns, the shared argument is ConsumingStrategy.STRICT; what differs is the AfterMatchSkipStrategy. So what are these two things?

AfterMatchSkipStrategy

You can go straight to the official documentation for this one: it configures whether (and how much) to skip after a match has been emitted, mainly to deal with partial matches. To understand what a partial match is, the worked example in the official docs for this section is a good reference:

AfterMatchSkipStrategy

The fact that the SKIP_TO_FIRST and SKIP_TO_LAST strategies require you to name the "first"/"last" pattern to skip to is also visible in the code:

// org.apache.flink.cep.nfa.aftermatch.AfterMatchSkipStrategy
/**
     * Discards every partial match that started before the first event of emitted match mapped to
     * *PatternName*.
     *
     * @param patternName the pattern name to skip to
     * @return the created AfterMatchSkipStrategy
     */
    public static SkipToFirstStrategy skipToFirst(String patternName) {
        return new SkipToFirstStrategy(patternName, false);
    }

    /**
     * Discards every partial match that started before the last event of emitted match mapped to
     * *PatternName*.
     *
     * @param patternName the pattern name to skip to
     * @return the created AfterMatchSkipStrategy
     */
    public static SkipToLastStrategy skipToLast(String patternName) {
        return new SkipToLastStrategy(patternName, false);
    }

So what exactly is a partial match?

Understanding the example below requires a little regular-expression knowledge.

For example, take the pattern (regex-like): (a | b | c) (b | c) c+.greedy d

and the sequence: a b c1 c2 c3 d

On the concepts of partial match and full match in regular expressions (quoted from: https://blog.csdn.net/leen0304/article/details/78551191):

Java regular expressions have a pair of concepts that are easy to confuse: partial match and full match.
The Matcher class has three matching methods, matches, lookingAt and find, which are easily mixed up. Their differences are:

  • matches: full match. It returns true only if the entire character sequence matches, otherwise false. If a prefix matches successfully, the position of the next match is advanced.
  • lookingAt: partial match. It always matches from the first character; whether it succeeds or fails, it does not continue matching.
  • find: partial match. It matches from the current position; once a matching substring is found, the position of the next match is advanced.

import java.util.regex.Matcher;
import java.util.regex.Pattern;   // java.util.regex.Pattern, not the FlinkCEP Pattern

String word = "-12";
String regex = "-1|1|0";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(word);

// Full match: the whole string "-12" does not match the regex, so matches() returns false.
System.out.println(matcher.matches()); // false
// Partial match: "-1" is found inside "-12", so find() returns true.
System.out.println(matcher.find());    // true

With that concept in mind, the full match here refers to the match result:

a b c1 c2 c3 d is the full match,

while b c1 c2 c3 d is called a partial match.

Now let's unpack the official example:

pattern (regex-like): b+ c

event stream: b1 b2 b3 c

In my view, the official names would be easier to understand if "to" were read as "until".

SKIP_TO_NEXT, read as skip_until_next, means: once the match that started at b1 is emitted, the AfterMatchSkipStrategy check fires, and every partial match up to the next start event is discarded. In other words:

After matched, skip until the next event: after a match, everything up to the next event is skipped.

Reading the other strategies with the same mindset:

SKIP_TO_FIRST[b] = after matched, skip until the first b event.

SKIP_TO_LAST[b] = after matched, skip until the last b event.

For the full match b1 b2 b3 c, the partial matches b2 b3 c and b3 c exist; but because the strategy is "until the last b event", i.e. b3, the partial match b2 b3 c is skipped.

As for SKIP_PAST_LAST_EVENT, after checking the documentation, the meaning is literally "past the last event": not even the last event of the match can be reused.

To summarize:

Strategy              | Meaning
----------------------|------------------------------------------------------------------------
NO_SKIP               | keep every partial match
SKIP_PAST_LAST_EVENT  | discard every partial match, no event reuse (the opposite of NO_SKIP)
SKIP_TO_NEXT          | discard partial matches that start at the same event; matching may start again from the next event
SKIP_TO_FIRST[b]      | discard every partial match that starts before the first b event of the emitted match
SKIP_TO_LAST[b]       | discard every partial match that starts before the last b event of the emitted match
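As a usage sketch (again assuming the hypothetical Event class from above), the b+ c pattern from the docs with SKIP_TO_LAST[b] attached would look roughly like this:

import org.apache.flink.cep.nfa.aftermatch.AfterMatchSkipStrategy;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;

AfterMatchSkipStrategy skipToLastB = AfterMatchSkipStrategy.skipToLast("b");

Pattern<Event, ?> bPlusC =
        Pattern.<Event>begin("b", skipToLastB)
                .where(new SimpleCondition<Event>() {
                    @Override
                    public boolean filter(Event event) {
                        return event.getName().startsWith("b");
                    }
                })
                .oneOrMore()                       // b+
                .next("c")
                .where(new SimpleCondition<Event>() {
                    @Override
                    public boolean filter(Event event) {
                        return event.getName().equals("c");
                    }
                });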

To confirm that the table above is not something I made up, let's check the source:

public final class SkipToNextStrategy extends SkipRelativeToWholeMatchStrategy {

    public static final SkipToNextStrategy INSTANCE = new SkipToNextStrategy();

    private static final long serialVersionUID = -6490314998588752621L;

    private SkipToNextStrategy() {}

    @Override
    protected EventId getPruningId(final Collection<Map<String, List<EventId>>> match) {
        EventId pruningId = null;
        for (Map<String, List<EventId>> resultMap : match) {
            for (List<EventId> eventList : resultMap.values()) {
                pruningId = min(pruningId, eventList.get(0));
            }
        }

        return pruningId;
    }

    @Override
    public String toString() {
        return "SkipToNextStrategy{}";
    }
}

The name SkipRelativeToWholeMatchStrategy already confirms that our reading of full vs. partial matches is in line with the official one. As for getPruningId: "prune" means to trim, a term you often see in pruning algorithms.

SkipToNextStrategy => returns the smallest first event ID (index 0) across all emitted matches.

SkipPastLastStrategy => returns the largest last event ID across all emitted matches.
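For comparison, SkipPastLastStrategy.getPruningId is the mirror image of the method above; the following is a paraphrase rather than a verbatim copy, so check the actual class if you need the details:

// SkipPastLastStrategy (paraphrased): the pruning id is the *last* event of the emitted match.
@Override
protected EventId getPruningId(final Collection<Map<String, List<EventId>>> match) {
    EventId pruningId = null;
    for (Map<String, List<EventId>> resultMap : match) {
        for (List<EventId> eventList : resultMap.values()) {
            // max(..., last element) instead of min(..., first element)
            pruningId = max(pruningId, eventList.get(eventList.size() - 1));
        }
    }
    return pruningId;
}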

The code for SkipToFirstStrategy and SkipToLastStrategy is slightly more involved. Both extend SkipToElementStrategy, and the only difference between them is the getIndex method: skipToFirst returns 0, skipToLast returns size - 1.

// SkipToElementStrategy
@Override
    protected EventId getPruningId(Collection<Map<String, List<EventId>>> match) {
        EventId pruningId = null;
        for (Map<String, List<EventId>> resultMap : match) {
            List<EventId> pruningPattern = resultMap.get(patternName);
            if (pruningPattern == null || pruningPattern.isEmpty()) {
                if (shouldThrowException) {
                    throw new FlinkRuntimeException(
                            String.format(
                                    "Could not skip to %s. No such element in the found match %s",
                                    patternName, resultMap));
                }
            } else {
// note: this is where getIndex() is called
                pruningId = max(pruningId, pruningPattern.get(getIndex(pruningPattern.size())));
            }

            if (shouldThrowException) {
                EventId startEvent =
                        resultMap.values().stream()
                                .flatMap(Collection::stream)
                                .min(EventId::compareTo)
                                .orElseThrow(
                                        () ->
                                                new IllegalStateException(
                                                        "Cannot prune based on empty match"));

                if (pruningId != null && pruningId.equals(startEvent)) {
                    throw new FlinkRuntimeException("Could not skip to first element of a match.");
                }
            }
        }

        return pruningId;
    }

public final class SkipToFirstStrategy extends SkipToElementStrategy {
    private static final long serialVersionUID = 7127107527654629026L;

    SkipToFirstStrategy(String patternName, boolean shouldThrowException) {
        super(patternName, shouldThrowException);
    }

    @Override
    public SkipToElementStrategy throwExceptionOnMiss() {
        return new SkipToFirstStrategy(getPatternName().get(), true);
    }

    @Override
    int getIndex(int size) {
        return 0;
    }

    @Override
    public String toString() {
        return "SkipToFirstStrategy{" + "patternName='" + getPatternName().get() + '\'' + '}';
    }
}

public final class SkipToLastStrategy extends SkipToElementStrategy {
    private static final long serialVersionUID = 7585116990619594531L;

    SkipToLastStrategy(String patternName, boolean shouldThrowException) {
        super(patternName, shouldThrowException);
    }

    @Override
    public SkipToElementStrategy throwExceptionOnMiss() {
        return new SkipToLastStrategy(getPatternName().get(), true);
    }

    @Override
    int getIndex(int size) {
        return size - 1;
    }

    @Override
    public String toString() {
        return "SkipToLastStrategy{" + "patternName='" + getPatternName().get() + '\'' + '}';
    }
}

So when is pruning triggered? The code here differs slightly depending on whether the strategy is a SkipToElementStrategy or not:

abstract class SkipRelativeToWholeMatchStrategy extends AfterMatchSkipStrategy {
    private static final long serialVersionUID = -3214720554878479037L;

    @Override
    public final boolean isSkipStrategy() {
        return true;
    }

    @Override
    protected final boolean shouldPrune(EventId startEventID, EventId pruningId) {
        return startEventID != null && startEventID.compareTo(pruningId) <= 0;
    }
}

abstract class SkipToElementStrategy extends AfterMatchSkipStrategy {
    private static final long serialVersionUID = 7127107527654629026L;
    private final String patternName;
    private final boolean shouldThrowException;

    SkipToElementStrategy(String patternName, boolean shouldThrowException) {
        this.patternName = checkNotNull(patternName);
        this.shouldThrowException = shouldThrowException;
    }

    @Override
    public boolean isSkipStrategy() {
        return true;
    }

    @Override
    protected boolean shouldPrune(EventId startEventID, EventId pruningId) {
        return startEventID != null && startEventID.compareTo(pruningId) < 0;
    }

Note the difference between < 0 and <= 0 in the two shouldPrune() implementations here.

The code confirms our reading: for skipToFirst/skipToLast the comparison is strict (<), so the pruning point itself is not covered, which is exactly why "until" is the natural way to read them.

skipToNext and skipPastLast, on the other hand, are subclasses of SkipRelativeToWholeMatchStrategy, so the pruning point is inclusive (a closed interval, <=). That is why skipToNext returns the match's own start event ID: everything up to and including that ID is discarded.
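To make the difference concrete, here is a worked trace (my own, derived from the code above) for the emitted match of b+ c over b1 b2 b3 c:

// Emitted match: {"b": [b1, b2, b3], "c": [c]}
//
// skipToLast("b")  (SkipToElementStrategy, comparison "< 0"):
//   getPruningId: pruningPattern = [b1, b2, b3], getIndex(3) = 2  ->  pruningId = b3
//   partial match starting at b1: b1 < b3           -> pruned
//   partial match starting at b2: b2 < b3           -> pruned
//   partial match starting at b3: b3 < b3 is false  -> kept, so "b3 c" can still be emitted
//
// skipToNext()  (SkipRelativeToWholeMatchStrategy, comparison "<= 0"):
//   getPruningId: min of the first events = b1      -> pruningId = b1
//   only partial matches that start at b1 itself are discarded; those starting at b2 or b3 survive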

Finally, let's see how pruning is actually carried out; the code lives in the prune() method of the base class AfterMatchSkipStrategy:

public void prune(
            Collection<ComputationState> matchesToPrune,
            Collection<Map<String, List<EventId>>> matchedResult,
            SharedBufferAccessor<?> sharedBufferAccessor)
            throws Exception {

        EventId pruningId = getPruningId(matchedResult);
        if (pruningId != null) {
            List<ComputationState> discardStates = new ArrayList<>();
            for (ComputationState computationState : matchesToPrune) {
                if (computationState.getStartEventID() != null
                        && shouldPrune(computationState.getStartEventID(), pruningId)) {
                    sharedBufferAccessor.releaseNode(
                            computationState.getPreviousBufferEntry(),
                            computationState.getVersion());
                    discardStates.add(computationState);
                }
            }
            matchesToPrune.removeAll(discardStates);
        }
    }

prune() in turn calls releaseNode() on the SharedBufferAccessor. Looking at the source, it decrements the reference counter of the given entry so that, once the counter reaches 0, the entry can be removed from the shared buffer and eventually reclaimed.

/**
     * Decreases the reference counter for the given entry so that it can be removed once the
     * reference counter reaches 0.
     *
     * @param node id of the entry
     * @param version dewey number of the (potential) edge that locked the given node
     * @throws Exception Thrown if the system cannot access the state.
     */
    public void releaseNode(final NodeId node, final DeweyNumber version) throws Exception {
        // the stack used to detect all nodes that needs to be released.
        Stack<NodeId> nodesToExamine = new Stack<>();
        Stack<DeweyNumber> versionsToExamine = new Stack<>();
        nodesToExamine.push(node);
        versionsToExamine.push(version);

        while (!nodesToExamine.isEmpty()) {
            NodeId curNode = nodesToExamine.pop();
            Lockable<SharedBufferNode> curBufferNode = sharedBuffer.getEntry(curNode);

            if (curBufferNode == null) {
                break;
            }

            DeweyNumber currentVersion = versionsToExamine.pop();
            List<Lockable<SharedBufferEdge>> edges = curBufferNode.getElement().getEdges();
            Iterator<Lockable<SharedBufferEdge>> edgesIterator = edges.iterator();
            while (edgesIterator.hasNext()) {
                Lockable<SharedBufferEdge> sharedBufferEdge = edgesIterator.next();
                SharedBufferEdge edge = sharedBufferEdge.getElement();
                if (currentVersion.isCompatibleWith(edge.getDeweyNumber())) {
                    if (sharedBufferEdge.release()) {
                        edgesIterator.remove();
                        NodeId targetId = edge.getTarget();
                        if (targetId != null) {
                            nodesToExamine.push(targetId);
                            versionsToExamine.push(edge.getDeweyNumber());
                        }
                    }
                }
            }

            if (curBufferNode.release()) {
                // first release the current node
                sharedBuffer.removeEntry(curNode);
                releaseEvent(curNode.getEventId());
            } else {
                sharedBuffer.upsertEntry(curNode, curBufferNode);
            }
        }
    }

About ConsumingStrategy

Judging from the source code, this is a concept tied to the quantifier:

public enum ConsumingStrategy {
        STRICT,
        SKIP_TILL_NEXT,
        SKIP_TILL_ANY,

        NOT_FOLLOW,
        NOT_NEXT
    }
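For orientation, each of the chaining methods on Pattern picks one of these strategies when it creates the next pattern. The following is paraphrased from the same Pattern class (the not* variants carry extra checks that are omitted here), so treat it as a reading aid rather than a verbatim quote:

public Pattern<T, T> next(final String name) {
    return new Pattern<>(name, this, ConsumingStrategy.STRICT, afterMatchSkipStrategy);
}

public Pattern<T, T> followedBy(final String name) {
    return new Pattern<>(name, this, ConsumingStrategy.SKIP_TILL_NEXT, afterMatchSkipStrategy);
}

public Pattern<T, T> followedByAny(final String name) {
    return new Pattern<>(name, this, ConsumingStrategy.SKIP_TILL_ANY, afterMatchSkipStrategy);
}

// notNext(...) and notFollowedBy(...) map to NOT_NEXT and NOT_FOLLOW in the same way.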

Quantifier is easy to understand too: it maps directly onto regular-expression quantifiers. For example:

a b c — with no quantifier, each pattern matches exactly once.

a{0,1} b c — the quantifier on a means 0 to 1 occurrences, i.e. at most once.

optional() roughly corresponds to ? in a regex: the pattern either does not occur at all, or occurs as specified by its other quantifiers.

Also note greedy(): as far as I can confirm, it means the pattern extends as far as possible from the start of the match. For example:

a{2,4} => aa, aaa, aaaa; if the input allows it, all three matches are returned.

a{2,4}.greedy() => aaaa; it returns as many repetitions as possible.
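Expressed in FlinkCEP (again assuming the hypothetical Event class from earlier), a{2,4} with greedy() would look roughly like this:

import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;

// Roughly "a{2,4}": runs of 2 to 4 consecutive "a" events.
// With .greedy() the NFA prefers the longest possible run.
Pattern<Event, ?> aTwoToFour =
        Pattern.<Event>begin("a")
                .where(new SimpleCondition<Event>() {
                    @Override
                    public boolean filter(Event event) {
                        return event.getName().equals("a");
                    }
                })
                .times(2, 4)
                .greedy();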

With these basics in place, we can move on to individual patterns and pattern groups.
