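Chapter 9  Advanced Flink: Handling Late and Out-of-Order Data in Code

1. The Watermark Mechanism

  • Watermark mechanism: a watermark is generated by an operator and then flows through the program together with the event stream.
Watermark = maximum event time seen so far - allowed out-of-orderness
  • When a window computation is triggered (two equivalent forms):
1. Watermark >= window end time
2. maximum event time seen so far - allowed out-of-orderness >= window end time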
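The two trigger conditions above are plain arithmetic, so they can be checked without a Flink cluster. The sketch below is not Flink code: `WatermarkSketch`, `watermark` and `shouldFire` are hypothetical names, and it uses the simplified formula from this section. (Flink's built-in bounded-out-of-orderness generator subtracts one extra millisecond and compares against the window end minus one millisecond, which gives the same firing point.)

```java
import java.time.Duration;

/** Hypothetical helper that illustrates the simplified watermark rule of this section. */
final class WatermarkSketch {

    /** watermark = maximum event time seen so far - allowed out-of-orderness */
    static long watermark(long maxEventTimeMs, Duration outOfOrderness) {
        return maxEventTimeMs - outOfOrderness.toMillis();
    }

    /** A window fires once the watermark has reached its end time. */
    static boolean shouldFire(long watermarkMs, long windowEndMs) {
        return watermarkMs >= windowEndMs;
    }

    public static void main(String[] args) {
        Duration delay = Duration.ofSeconds(3);
        long windowEnd = 10_000L; // window [0 s, 10 s)

        // Max event time 12 s: watermark is 9 s, the window keeps waiting.
        System.out.println(shouldFire(watermark(12_000L, delay), windowEnd));
        // Max event time 13 s: watermark is 10 s, the window fires.
        System.out.println(shouldFire(watermark(13_000L, delay), windowEnd));
    }
}
```

With a 3-second tolerance, a window ending at second 10 therefore fires only when an event stamped second 13 or later has been seen.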

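(1) Requirement

Requirement: compute the total transaction amount per video, grouped by video. The data arrives late and out of order; tolerate up to 3 seconds of delay.

(2) Code walkthrough: watermark + window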

package com.lihaiwei.text1.app;

import com.lihaiwei.text1.util.TimeUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;


import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class flink08watermark {
    public static void main(String[] args) throws Exception {
        // 1. Build the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 2. Read data from a socket (same port as `nc -lk 9999` in the debugging steps)
        DataStream<String> DS = env.socketTextStream("192.168.6.104",9999);

        // 3. Convert each input line (e.g. java,2022-11-11 23:12:07,10) into a Tuple3
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatmapDS = DS.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                // 3.1 Split the line into an array of fields
                String[] arr = value.split(",");
                // 3.2 Emit (key, event-time string, price); the event time is arr[1]
                out.collect(Tuple3.of(arr[0],arr[1],Integer.parseInt(arr[2])));
            }
        });
        // 4. Use the event-time field to assign timestamps and generate watermarks
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarkDS = flatmapDS.assignTimestampsAndWatermarks(WatermarkStrategy
                // 4.1 Maximum allowed out-of-orderness
                .<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // 4.2 Extract the event timestamp (epoch milliseconds) from field f1
                .withTimestampAssigner((event, timestamp) -> TimeUtil.strToDate(event.f1).getTime()));
        // 5. Key by the first field, then window and aggregate
        SingleOutputStreamOperator<String> sumDS = watermarkDS.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
            @Override
            public String getKey(Tuple3<String, String, Integer> value) throws Exception {
                return value.f0;
            }
        })
                // Tumbling 10-second event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Aggregate over the whole window
                .apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
                        // Collect the event times seen in this window
                        List<String> timeList = new ArrayList<>();
                        int total = 0;
                        // Iterate over the window's records and sum the prices
                        for(Tuple3<String,String,Integer> order:input){
                            timeList.add(order.f1);
                            total = total + order.f2;
                        }
                        String outStr = String.format("key:%s, sum:%s, window:[%s~%s), event times in window:%s", key,total,TimeUtil.format(window.getStart()),TimeUtil.format(window.getEnd()), timeList);
                        out.collect(outStr);
                    }
                });
        // 6. Print the result
        sumDS.print();

        // 7. Name the job and submit it
        env.execute("watermark job");
    }
}
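The listing calls `TimeUtil.strToDate(...)` and `TimeUtil.format(...)`, but the chapter does not show that class. The following is only a guess at what such a helper might look like, inferred from how it is called. Assumptions: the timestamp pattern is `yyyy-MM-dd HH:mm:ss` (matching the sample input), the system default time zone is used, and the class is left package-private here so it stands alone. To use it with the listing it would need to be `public` in package `com.lihaiwei.text1.util`.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

/** Hypothetical stand-in for the TimeUtil class that the listing imports. */
final class TimeUtil {
    // Assumed pattern, matching sample input such as "2022-11-11 23:12:07".
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    // Assumed zone: whatever the JVM runs in.
    private static final ZoneId ZONE = ZoneId.systemDefault();

    /** Parses an event-time string; the listing then calls getTime() for epoch millis. */
    static Date strToDate(String text) {
        LocalDateTime local = LocalDateTime.parse(text, FMT);
        return Date.from(local.atZone(ZONE).toInstant());
    }

    /** Formats epoch milliseconds (e.g. window.getStart()) back into the same pattern. */
    static String format(long epochMillis) {
        return FMT.format(Instant.ofEpochMilli(epochMillis).atZone(ZONE));
    }
}
```

Parsing and formatting use the same pattern and zone, so the window boundaries printed by the job line up with the event-time strings typed into the socket.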

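(3) Debugging walkthrough

  • Open a socket
nc -lk 9999
  • Window one
// Window one [23:12:00, 23:12:10); YES/NO = whether this record triggers a window computation
java,2022-11-11 23:12:07,10  // NO
java,2022-11-11 23:12:11,10  // NO, belongs to the next window
java,2022-11-11 23:12:08,10  // NO
mysql,2022-11-11 23:12:13,10 // YES, fires window one (watermark reaches 23:12:10); the record itself belongs to the next window
  • Window two
// Window two [23:12:10, 23:12:20)
java,2022-11-11 23:12:13,10  // NO
java,2022-11-11 23:12:17,10  // NO
java,2022-11-11 23:12:09,10  // NO, dropped: window one has already fired
java,2022-11-11 23:12:20,10  // NO, belongs to window three
java,2022-11-11 23:12:22,10  // NO, belongs to window three
java,2022-11-11 23:12:23,10  // YES, fires window two; the record itself belongs to window three
  • Run result

[Figure: console output screenshot]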
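The walkthrough above can be replayed offline. The sketch below is a simplified model, not Flink itself: `TumblingWindowTrace` and its methods are hypothetical, event times are given as milliseconds after 23:12:00 (an instant that is aligned to the 10-second window grid), and the watermark is assumed to advance after every record, which is effectively what happens when lines are typed by hand into `nc`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical replay of 10 s tumbling windows with a 3 s out-of-orderness bound. */
final class TumblingWindowTrace {
    static final long SIZE = 10_000L;  // window size
    static final long DELAY = 3_000L;  // allowed out-of-orderness

    /** Start of the tumbling window that contains ts (non-negative ts, zero offset). */
    static long windowStart(long ts) {
        return ts - (ts % SIZE);
    }

    /** Returns one log line per fired (window, key) pair and per dropped record. */
    static List<String> replay(String[] keys, long[] times) {
        List<String> log = new ArrayList<>();
        // window start -> key -> buffered event times
        TreeMap<Long, TreeMap<String, List<Long>>> open = new TreeMap<>();
        long watermark = Long.MIN_VALUE;
        for (int i = 0; i < keys.length; i++) {
            long start = windowStart(times[i]);
            if (start + SIZE <= watermark) {
                // The window has already fired and there is no allowed lateness.
                log.add("drop " + keys[i] + " " + times[i]);
            } else {
                open.computeIfAbsent(start, s -> new TreeMap<>())
                    .computeIfAbsent(keys[i], k -> new ArrayList<>())
                    .add(times[i]);
            }
            // Watermarks never move backwards.
            watermark = Math.max(watermark, times[i] - DELAY);
            // Fire every window whose end the watermark has reached.
            while (!open.isEmpty() && open.firstKey() + SIZE <= watermark) {
                Map.Entry<Long, TreeMap<String, List<Long>>> w = open.pollFirstEntry();
                for (Map.Entry<String, List<Long>> e : w.getValue().entrySet()) {
                    log.add("fire " + e.getKey() + " [" + w.getKey() + ","
                            + (w.getKey() + SIZE) + ") " + e.getValue());
                }
            }
        }
        return log;
    }

    public static void main(String[] args) {
        String[] keys = {"java", "java", "java", "mysql", "java",
                         "java", "java", "java", "java", "java"};
        long[] times = {7000, 11000, 8000, 13000, 13000,
                        17000, 9000, 20000, 22000, 23000};
        replay(keys, times).forEach(System.out::println);
    }
}
```

Replaying the ten input lines yields four log entries: window one fires for `java` with the 07 and 08 records once the `mysql` record at second 13 arrives, the 09 record is dropped, and window two fires for both `java` and `mysql` when the record at second 23 arrives.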

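2. The allowedLateness Mechanism - Second Line of Defense

  • Use case: a record arrives after the watermark has already passed the end of its window. Configuring allowedLateness keeps the window's state for an extra period, and each late record that arrives within that period updates the window's earlier output.
DataStream.window("window size")
    	  .allowedLateness("allowed lateness")
          .apply("full-window aggregation")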
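To see how long a window stays updatable, it helps to work out two thresholds: the event time that first fires the window, and the event time after which its state is discarded. The sketch below uses hypothetical names (`WindowLifecycleSketch`, `firstFireAt`, `purgedAt`) and the simplified watermark formula from section 1; it is arithmetic only, not Flink API.

```java
import java.time.Duration;

/** Hypothetical arithmetic for a window's lifetime under allowedLateness. */
final class WindowLifecycleSketch {

    /** Smallest "max event time seen" at which the window fires for the first time. */
    static long firstFireAt(long windowEndMs, Duration outOfOrderness) {
        return windowEndMs + outOfOrderness.toMillis();
    }

    /** Smallest "max event time seen" at which the window state is discarded. */
    static long purgedAt(long windowEndMs, Duration outOfOrderness, Duration allowedLateness) {
        return firstFireAt(windowEndMs, outOfOrderness) + allowedLateness.toMillis();
    }

    public static void main(String[] args) {
        Duration delay = Duration.ofSeconds(3);
        Duration lateness = Duration.ofMinutes(1);
        long windowEnd = 10_000L; // window [0 s, 10 s)
        System.out.println(firstFireAt(windowEnd, delay));        // 13000
        System.out.println(purgedAt(windowEnd, delay, lateness)); // 73000
    }
}
```

For a window ending at 23:12:10 with a 3-second bound and one minute of allowed lateness, the first result appears once an event stamped 23:12:13 is seen, and late records keep updating that result until an event stamped 23:13:13 or later is seen.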

(1) Code walkthrough

package com.lihaiwei.text1.app;

import com.lihaiwei.text1.util.TimeUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;


import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class flink08watermark {
    public static void main(String[] args) throws Exception {
        // 1. Build the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 2. Read data from a socket
        DataStream<String> DS = env.socketTextStream("192.168.6.104",9999);

        // 3. Convert each input line (e.g. java,2022-11-11 23:12:07,10) into a Tuple3
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatmapDS = DS.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                // 3.1 Split the line into an array of fields
                String[] arr = value.split(",");
                // 3.2 Emit (key, event-time string, price)
                out.collect(Tuple3.of(arr[0],arr[1],Integer.parseInt(arr[2])));
            }
        });
        // 4. Use the event-time field to assign timestamps and generate watermarks
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarkDS = flatmapDS.assignTimestampsAndWatermarks(WatermarkStrategy
                // 4.1 Maximum allowed out-of-orderness
                .<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // 4.2 Extract the event timestamp (epoch milliseconds) from field f1
                .withTimestampAssigner((event, timestamp) -> TimeUtil.strToDate(event.f1).getTime()));
        // 5. Key by the first field, then window and aggregate
        SingleOutputStreamOperator<String> sumDS = watermarkDS.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
            @Override
            public String getKey(Tuple3<String, String, Integer> value) throws Exception {
                return value.f0;
            }
        })
                // Tumbling 10-second event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Keep the window updatable for one more minute after it first fires
                .allowedLateness(Time.minutes(1))
                // Aggregate over the whole window
                .apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
                        // Collect the event times seen in this window
                        List<String> timeList = new ArrayList<>();
                        int total = 0;
                        // Iterate over the window's records and sum the prices
                        for(Tuple3<String,String,Integer> order:input){
                            timeList.add(order.f1);
                            total = total + order.f2;
                        }
                        String outStr = String.format("key:%s, sum:%s, window:[%s~%s), event times in window:%s", key,total,TimeUtil.format(window.getStart()),TimeUtil.format(window.getEnd()), timeList);
                        out.collect(outStr);
                    }
                });
        // 6. Print the result
        sumDS.print();

        // 7. Name the job and submit it
        env.execute("watermark job");
    }
}

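(2) Debugging results

  • Window one
// Window one [23:12:00, 23:12:10); YES/NO = whether this record triggers a window computation
java,2022-11-11 23:12:07,10  // NO
java,2022-11-11 23:12:11,10  // NO, belongs to the next window
java,2022-11-11 23:12:08,10  // NO
mysql,2022-11-11 23:12:13,10 // YES, fires window one; the record itself belongs to the next window
  • Window two
// Window two [23:12:10, 23:12:20)
java,2022-11-11 23:12:13,10  // NO
java,2022-11-11 23:12:17,10  // NO
java,2022-11-11 23:12:09,10  // YES, late but within allowedLateness: window one is recomputed and re-emitted
java,2022-11-11 23:12:20,10  // NO, belongs to window three
java,2022-11-11 23:12:22,10  // NO, belongs to window three
java,2022-11-11 23:12:23,10  // YES, fires window two; the record itself belongs to window three
  • Run result

[Figure: console output screenshot]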

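3. The sideOutput Mechanism - Last Line of Defense

  • Use case: a record arrives after both the watermark wait and the allowedLateness period have expired, so it can no longer update its window. Such records are routed to a side output stream (SideOutput) instead.
// 1. Create the side-output tag
OutputTag<Tuple3<String,String,Integer>> latedata = new OutputTag<Tuple3<String, String, Integer>>("latedate"){};

DataStream.window("window size")
    	  .allowedLateness("allowed lateness")
    	  .sideOutputLateData("side-output tag")
          .apply("full-window aggregation")

(1) Code walkthrough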

package com.lihaiwei.text1.app;

import com.lihaiwei.text1.util.TimeUtil;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;


import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class flink08watermark {
    public static void main(String[] args) throws Exception {
        // 1. Build the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 2. Read data from a socket
        DataStream<String> DS = env.socketTextStream("192.168.6.104",8888);

        // 3. Convert each input line (e.g. java,2022-11-11 23:12:07,10) into a Tuple3
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatmapDS = DS.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                // 3.1 Split the line into an array of fields
                String[] arr = value.split(",");
                // 3.2 Emit (key, event-time string, price)
                out.collect(Tuple3.of(arr[0],arr[1],Integer.parseInt(arr[2])));
            }
        });
        // 4. Use the event-time field to assign timestamps and generate watermarks
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarkDS = flatmapDS.assignTimestampsAndWatermarks(WatermarkStrategy
                // 4.1 Maximum allowed out-of-orderness
                .<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // 4.2 Extract the event timestamp (epoch milliseconds) from field f1
                .withTimestampAssigner((event, timestamp) -> TimeUtil.strToDate(event.f1).getTime()));
        // 4.3 Side-output tag; the trailing {} creates an anonymous subclass so Flink can capture the element type
        OutputTag<Tuple3<String,String,Integer>> latedata = new OutputTag<Tuple3<String, String, Integer>>("latedate"){};

        // 5. Key by the first field, then window and aggregate
        SingleOutputStreamOperator<String> sumDS = watermarkDS.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
            @Override
            public String getKey(Tuple3<String, String, Integer> value) throws Exception {
                return value.f0;
            }
        })
                // Tumbling 10-second event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Keep the window updatable for one more minute after it first fires
                .allowedLateness(Time.minutes(1))
                // Send records that are later than that to the side output
                .sideOutputLateData(latedata)
                // Aggregate over the whole window
                .apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
                        // Collect the event times seen in this window
                        List<String> timeList = new ArrayList<>();
                        int total = 0;
                        // Iterate over the window's records and sum the prices
                        for(Tuple3<String,String,Integer> order:input){
                            timeList.add(order.f1);
                            total = total + order.f2;
                        }
                        String outStr = String.format("key:%s, sum:%s, window:[%s~%s), event times in window:%s", key,total,TimeUtil.format(window.getStart()),TimeUtil.format(window.getEnd()), timeList);
                        out.collect(outStr);
                    }
                });
        // 6.1 Print the main result
        sumDS.print();
        // 6.2 Print the late records from the side output
        sumDS.getSideOutput(latedata).print();

        // 7. Name the job and submit it
        env.execute("watermark job");
    }
}

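(2) Debugging results

  • Window one
// Window one [23:12:00, 23:12:10); YES/NO = whether this record triggers a window computation
java,2022-11-11 23:12:07,10  // NO
java,2022-11-11 23:12:11,10  // NO, belongs to the next window
java,2022-11-11 23:12:08,10  // NO
mysql,2022-11-11 23:12:13,10 // YES, fires window one; the record itself belongs to the next window
  • Window two
// Window two [23:12:10, 23:12:20)
java,2022-11-11 23:12:17,10  // NO
java,2022-11-11 23:12:09,10  // YES, late but within allowedLateness: window one is recomputed and re-emitted
java,2022-11-11 23:12:20,10  // NO, belongs to window three
java,2022-11-11 23:12:22,10  // NO, belongs to window three
java,2022-11-11 23:12:23,10  // YES, fires window two; the record itself belongs to window three
  • Window three
// Window three [23:12:20, 23:12:30)
java,2022-11-11 23:13:20,10  // YES, fires window three; the record itself belongs to a later window
java,2022-11-11 23:12:01,10  // NO, past allowedLateness for window one: emitted to the side output
  • Run result

[Figure: console output screenshot]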

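4. Summary of the Layered Guarantees

  • Flink uses three mechanisms, watermark, allowedLateness() and sideOutputLateData(), to make sure late data is still captured.
4.1 How they work

watermark: tolerates delayed, out-of-order data by waiting a little before triggering the window computation and emitting the result.

allowedLateness: keeps the window open for an additional period after it first fires. Late records arriving in this period re-trigger the window, which corrects the partial result emitted by the watermark-triggered firing.

sideOutput: once allowedLateness has expired the window is closed for good, and any further late records are sent to the side output.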
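The three mechanisms amount to a three-way decision per record, based on where the watermark stands relative to the record's window. A minimal sketch of that decision follows, assuming tumbling windows with zero offset and the simplified watermark formula from section 1; `LateDataRouting`, `Route` and `route` are hypothetical names, not Flink API.

```java
/** Hypothetical three-way routing of a record, mirroring the three layers above. */
final class LateDataRouting {

    enum Route { WINDOW, LATE_UPDATE, SIDE_OUTPUT }

    static Route route(long eventTimeMs, long watermarkMs,
                       long windowSizeMs, long allowedLatenessMs) {
        // End of the tumbling window that contains this event (non-negative timestamps).
        long windowEnd = eventTimeMs - (eventTimeMs % windowSizeMs) + windowSizeMs;
        if (watermarkMs < windowEnd) {
            return Route.WINDOW;        // window has not fired yet: normal path
        }
        if (watermarkMs < windowEnd + allowedLatenessMs) {
            return Route.LATE_UPDATE;   // window fired but state is kept: re-fire with update
        }
        return Route.SIDE_OUTPUT;       // window state is gone: side output
    }

    public static void main(String[] args) {
        long size = 10_000L, lateness = 60_000L;
        // Times are milliseconds after 23:12:00, as in the debugging sections.
        System.out.println(route(17_000L, 14_000L, size, lateness)); // WINDOW
        System.out.println(route(9_000L, 14_000L, size, lateness));  // LATE_UPDATE
        System.out.println(route(1_000L, 77_000L, size, lateness));  // SIDE_OUTPUT
    }
}
```

The three calls in `main` correspond to records from the debugging sections: the 23:12:17 record lands in an open window, the 23:12:09 record updates window one, and the 23:12:01 record, arriving after the watermark has reached 23:13:17, goes to the side output.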

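4.2 Use cases
  • watermark: emit results promptly

  • allowedLateness: short-term updates from late data

  • sideOutput: last-resort corrections that keep the results accurate

4.3 The layers in Flink

(1) Layer 1 - DataStream

  • Selects a bounded range of data from the DataStream (the window).

(2) Layer 2 - Watermark

  • Tolerates out-of-order, delayed data: the window waits for delayed records to arrive before the computation is triggered.

(3) Layer 3 - allowedLateness

  • Delays the closing of the window for a further period. If more data arrives in that time, the window's output is corrected by emitting an updated result.

(4) Layer 4 - sideOutput

  • After the window has closed for good, all expired late data goes to the side output stream, where it can be consumed separately, for example stored somewhere and later used to batch-update the previously emitted aggregates.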