A pitfall when using within in Flink CEP (why several input records produce no output)

墨者大数据

Last modified: 2022-09-06 18:11:58



First published: 2022-09-06 18:09:51

Article link: https://blog.csdn.net/weixin_42965737/article/details/126730520


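1. A CEP (complex event processing) example

Goal: compute the user bounce rate. When two consecutive page views from the same device (mid) both have a null last_page_id, the first one is treated as a bounce. A page view with a null last_page_id that is not followed by another page view within 2 seconds is also emitted, through the timeout side output.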

package com.atguigu.gmall.realtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.atguigu.gmall.realtime.utils.MyKafkaUtil;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.PatternTimeoutFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

import java.time.Duration;
import java.util.List;
import java.util.Map;

public class UserJumpDetailApp {

    public static void main(String[] args) throws Exception {

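        // TODO 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);   // in production this should match the number of Kafka partitions

        // Enable checkpointing and set the state backend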
        //        env.enableCheckpointing(5 * 60000L);
        //        env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        //        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2000L);
        //        env.setRestartStrategy();
        //
        //        env.setStateBackend(new FsStateBackend(""));

        // TODO 2. Read the Kafka page-log topic and create the stream
        String sourceTopic = "dwd_page_log";
        String groupId = "user_jump_detail_app_210526";
        String sinkTopic = "dwm_user_jump_detail";

        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getKafkaSource(sourceTopic, groupId));

        // TODO 3. Convert each line into a JSON object
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.map(JSON::parseObject);
        // TODO 4. Extract the event time and generate watermarks (2 s of allowed out-of-orderness)
        SingleOutputStreamOperator<JSONObject> jsonObjWithWMDS = jsonObjDS.assignTimestampsAndWatermarks(
                WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(
                        new SerializableTimestampAssigner<JSONObject>() {
                            @Override
                            public long extractTimestamp(JSONObject element, long recordTimestamp) {

                                return element.getLong("ts");
                            }
                        }
                ));
        // TODO 5. Key the stream by mid
        KeyedStream<JSONObject, String> keyedStream = jsonObjWithWMDS.keyBy(json -> json.getJSONObject("common").getString("mid"));

        // TODO 6. Define the pattern
        Pattern<JSONObject, JSONObject> pattern = Pattern.<JSONObject>begin("start").where(new SimpleCondition<JSONObject>() {
            @Override
            public boolean filter(JSONObject value) throws Exception {
                return value.getJSONObject("page").getString("last_page_id") == null;
            }
        }).next("next").where(new SimpleCondition<JSONObject>() {
            @Override
            public boolean filter(JSONObject value) throws Exception {
                return value.getJSONObject("page").getString("last_page_id") == null;
            }
        }).within(Time.seconds(2));
        // TODO 7. Apply the pattern to the stream
        PatternStream<JSONObject> patternStream = CEP.pattern(keyedStream, pattern);
        // TODO 8. Extract events (matched events and timed-out events)
        OutputTag<JSONObject> outputTag = new OutputTag<JSONObject>("timeOut") {
        };
        SingleOutputStreamOperator<JSONObject> selectDS = patternStream.select(outputTag, new PatternTimeoutFunction<JSONObject, JSONObject>() {
            @Override
            public JSONObject timeout(Map<String, List<JSONObject>> map, long timeoutTimestamp) throws Exception {
                return map.get("start").get(0);
            }
        }, new PatternSelectFunction<JSONObject, JSONObject>() {
            @Override
            public JSONObject select(Map<String, List<JSONObject>> map) throws Exception {
                return map.get("start").get(0);
            }
        });

        // TODO 9. Union the matched stream with the timeout side output
        DataStream<JSONObject> timeOutDS = selectDS.getSideOutput(outputTag);
        DataStream<JSONObject> unionDS = selectDS.union(timeOutDS);

        // TODO 10. Write the data to Kafka
        unionDS.print();
        unionDS.map(obj -> JSON.toJSONString(obj))
                .addSink(MyKafkaUtil.getKafkaSink(sinkTopic));

        // TODO 11. Start the job
        env.execute("UserJumpDetailApp");
    }
}

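2. within(Time.seconds(2)) is evaluated in event time, driven by the watermark

within(Time.seconds(2)) sets the maximum event-time span in which the whole pattern must complete. Because the job uses event time, CEP first buffers incoming records and only processes them, in timestamp order, once the watermark has passed their timestamps. Whether a pattern has matched or timed out is therefore decided by the watermark, not by the moment a record arrives.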
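The interaction between the within limit and the watermark can be sketched in plain Java, without Flink. This is a simplified model, not Flink's implementation: the class WithinCheckDemo, the method evaluate and the Outcome enum are hypothetical names introduced for illustration, and treating a gap equal to the limit as expired is an assumption of this sketch.

```java
final class WithinCheckDemo {

    /** Possible states of a "start" event that is waiting for its "next" event. */
    enum Outcome { PENDING, MATCH, TIMEOUT }

    /**
     * Simplified model of a two-step pattern with a within limit.
     *
     * @param startTs   event time (ms) of the event that matched "start"
     * @param nextTs    event time (ms) of the following event, or null if none has arrived
     * @param watermark current watermark (ms)
     * @param withinMs  the within limit (ms)
     */
    static Outcome evaluate(long startTs, Long nextTs, long watermark, long withinMs) {
        long deadline = startTs + withinMs;
        // The following event only counts once the watermark has passed it,
        // and only if it falls inside the within limit (assumption: limit is exclusive).
        if (nextTs != null && nextTs <= watermark && nextTs < deadline) {
            return Outcome.MATCH;
        }
        // No usable following event and the watermark has reached the deadline.
        if (watermark >= deadline) {
            return Outcome.TIMEOUT;
        }
        // The watermark has not advanced far enough to decide either way.
        return Outcome.PENDING;
    }

    public static void main(String[] args) {
        // The next event has arrived, but the watermark is still behind it.
        System.out.println(evaluate(756_000L, 757_000L, 755_999L, 2_000L));
        // The watermark has passed the next event, which lies inside the limit.
        System.out.println(evaluate(756_000L, 757_000L, 759_999L, 2_000L));
        // No next event, and the watermark has passed the deadline.
        System.out.println(evaluate(756_000L, null, 759_999L, 2_000L));
    }
}
```

The first call stays PENDING even though both events are present, which is the situation this article is about: nothing can be emitted until the watermark moves forward.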

3. Worked example

An excerpt of the code above:

// TODO 6. Define the pattern
Pattern<JSONObject, JSONObject> pattern = Pattern.<JSONObject>begin("start").where(new SimpleCondition<JSONObject>() {
    @Override
    public boolean filter(JSONObject value) throws Exception {
        return value.getJSONObject("page").getString("last_page_id") == null;
    }
}).next("next").where(new SimpleCondition<JSONObject>() {
    @Override
    public boolean filter(JSONObject value) throws Exception {
        return value.getJSONObject("page").getString("last_page_id") == null;
    }
}).within(Time.seconds(2));

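After the job starts, two records are sent one after the other, yet nothing is output (precondition: both records have the same mid and both satisfy the pattern conditions). The cause is the watermark, as explained below.

The watermark strategy is forBoundedOutOfOrderness(Duration.ofSeconds(2)), so the watermark trails the largest event time seen so far by 2 s to allow for late or out-of-order records. The table shows the watermark after each record arrives: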
Record		Event time (s)		Watermark (s)
1st			756					754
2nd			758					756
-------------------------------------------------------------------------------
3rd			762					760
-------------------------------------------------------------------------------
(Strictly, Flink emits the watermark as the largest event time minus 2 s minus 1 ms.)

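# Explanation:
Two records that satisfy the conditions have arrived, yet nothing is output. The watermark strategy allows 2 s of out-of-orderness, so after the second record the watermark is only 756. A record with an event time between 756 and 758 (for example 757) could still arrive, so CEP keeps the second record buffered and cannot yet conclude that it directly follows the first.
When the third record arrives (event time 762, watermark 760), the watermark states that all records with an event time up to 760 have arrived. CEP then processes the buffered first and second records in event-time order, and the output appears.
In other words, at this point the first and second records form the next relationship and the pattern sequence is evaluated. The output for the first two records only appears once a later record pushes the watermark past the second record's event time.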
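The buffering behaviour described above can be reproduced with a small plain-Java simulation that does not depend on Flink. It is a sketch under stated assumptions: the class WatermarkBufferDemo and the method simulate are hypothetical names, the watermark is recomputed after every record (Flink emits it periodically), and timestamps are in milliseconds so that the "minus 1 ms" of the bounded-out-of-orderness watermark can be modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class WatermarkBufferDemo {

    // Assumed to mirror forBoundedOutOfOrderness(Duration.ofSeconds(2)).
    static final long MAX_OUT_OF_ORDERNESS_MS = 2_000L;

    /**
     * For each arriving event time, returns the buffered event times that are
     * released for pattern evaluation right after that arrival.
     */
    static List<List<Long>> simulate(long[] eventTimes) {
        PriorityQueue<Long> buffer = new PriorityQueue<>(); // sorted by event time
        List<List<Long>> releasedPerArrival = new ArrayList<>();
        long maxTs = Long.MIN_VALUE;

        for (long ts : eventTimes) {
            buffer.add(ts);
            maxTs = Math.max(maxTs, ts);
            // Watermark = largest event time seen - allowed out-of-orderness - 1 ms.
            long watermark = maxTs - MAX_OUT_OF_ORDERNESS_MS - 1;

            // Release every buffered event whose timestamp the watermark has reached.
            List<Long> released = new ArrayList<>();
            while (!buffer.isEmpty() && buffer.peek() <= watermark) {
                released.add(buffer.poll());
            }
            releasedPerArrival.add(released);
        }
        return releasedPerArrival;
    }

    public static void main(String[] args) {
        // The three records from the table, converted to milliseconds.
        System.out.println(simulate(new long[] {756_000L, 758_000L, 762_000L}));
        // prints [[], [], [756000, 758000]]
    }
}
```

Nothing is released after the first or second record. Only the third record moves the watermark far enough for the first two to be evaluated, which matches the observation that two inputs alone produce no output.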