[Flink] Watermark Configuration and Window-Based Joins

Flink Watermarks

Time Semantics

  • Event time: the time at which the data was generated
  • Processing time: the time at which the data is actually processed

In most cases, business log data carries the timestamp at which it was generated, which serves as the basis for determining event time. Flink uses event time as the default time semantic.

Event Time and Windows

Logical clock: the progress of event time is driven by the timestamps carried in the data records, so the computation does not depend on processing time (system time) at all.

Watermarks

A watermark is a marker used to measure the progress of event time.

Watermarks in an Ordered Stream

Ideally, data enters the stream in the order it was generated, and a watermark is emitted for every record. In practice, because data volumes are very large, a watermark is generated periodically to improve efficiency.

Watermarks in an Out-of-Order Stream

As data is transmitted between nodes, its order can change, producing out-of-order data.

  • When the data volume is small, extract a timestamp from each record and insert a watermark. (To handle out-of-orderness, check whether the timestamp is larger than the previous ones; if it is smaller, do not generate a new watermark.)
  • When the data volume is large, generate watermarks periodically as well; simply keep track of the maximum timestamp seen so far, and when it is time to insert a watermark, use that maximum timestamp as the new watermark.
  • For late data, the window can wait for a while (say, 2 seconds) so that late records are still collected correctly: the watermark timestamp to insert is the maximum timestamp seen so far minus 2 seconds.
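The logic in the last two bullets can be sketched in plain Java, with no Flink dependency. The class and method names below are illustrative, not Flink's actual API; Flink's built-in generator additionally subtracts 1 ms so the watermark sits strictly below any timestamp that may still arrive:

```java
// Illustrative sketch (not Flink's real class): keep the maximum event
// timestamp seen so far, and emit "max - delay - 1 ms" as the periodic
// watermark so late events within the delay are still captured.
public class BoundedOutOfOrdernessSketch {
    private final long delayMs;              // how long to wait for stragglers
    private long maxTs = Long.MIN_VALUE;     // largest timestamp seen so far

    public BoundedOutOfOrdernessSketch(long delayMs) {
        this.delayMs = delayMs;
    }

    // Called per event: smaller (out-of-order) timestamps never move the clock back.
    public void onEvent(long eventTs) {
        maxTs = Math.max(maxTs, eventTs);
    }

    // Called periodically: the watermark claims that no event at or below
    // this timestamp will arrive anymore.
    public long currentWatermark() {
        return maxTs - delayMs - 1;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch wm = new BoundedOutOfOrdernessSketch(2000); // 2 s delay
        wm.onEvent(5000);
        wm.onEvent(3000); // out-of-order event: the watermark does not regress
        System.out.println(wm.currentWatermark()); // prints 2999
    }
}
```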

Watermark Characteristics

  • A watermark is a marker inserted into the data stream and can be regarded as a special record.
  • A watermark is essentially a timestamp that represents the current progress of event time.
  • Watermarks are generated from the timestamps carried in the data.
  • Watermark timestamps must increase monotonically, so the clock only ever moves forward.
  • A watermark can include a delay to ensure out-of-order data is handled correctly.
  • A watermark asserts that all data up to its timestamp has arrived; no record with a smaller timestamp will appear in the stream afterward.

Generating Watermarks

Basic principle: once a watermark is emitted, it declares that all data before that time has arrived and no earlier data will appear afterward.

Generation Strategies

Built-in Watermarks for an Ordered Stream

package ac.sict.reid.leo.Watermark;

import ac.sict.reid.leo.Computing.Aggregation.WaterSensorMapFunction;
import ac.sict.reid.leo.POJO.WaterSensor;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WaterlineCreateDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);

        SingleOutputStreamOperator<WaterSensor> sensor = environment
                .socketTextStream("localhost", 7777)
                .map(new WaterSensorMapFunction());

        // TODO Define the watermark strategy
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                // Watermark generation for ascending timestamps: no out-of-orderness, no waiting
                .<WaterSensor>forMonotonousTimestamps()
                // Timestamp assigner: extract the event timestamp from the data
                .withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
                    @Override
                    public long extractTimestamp(WaterSensor element, long recordTimestamp) {
                        System.out.println("data = " + element + ", recordTs = " + recordTimestamp);
                        return element.getTs() * 1000L;
                    }
                });

        // TODO Apply the watermark strategy
        SingleOutputStreamOperator<WaterSensor> sensorWithWatermark =
                sensor.assignTimestampsAndWatermarks(watermarkStrategy);

        sensorWithWatermark
                .keyBy(value -> value.getId())
                // Event-time windows, so the watermark drives window firing
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                    @Override
                    public void process(String s, ProcessWindowFunction<WaterSensor, String, String, TimeWindow>.Context context, Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
                        long start = context.window().getStart();
                        long end = context.window().getEnd();
                        long count = iterable.spliterator().estimateSize();
                        collector.collect("key = " + s + " window [" + start + " , " + end + ") contains " + count + " records ===> " + iterable.toString());
                    }
                })
                .print();

        environment.execute();
    }
}

Built-in Watermarks for an Out-of-Order Stream

package ac.sict.reid.leo.Watermark;

import ac.sict.reid.leo.Computing.Aggregation.WaterSensorMapFunction;
import ac.sict.reid.leo.POJO.WaterSensor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class WatermarkOutOfOrdernessDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);

        SingleOutputStreamOperator<WaterSensor> sensor = environment
                .socketTextStream("localhost", 7777)
                .map(new WaterSensorMapFunction());

        // TODO Define the watermark strategy
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                // Watermark generation for out-of-order streams: wait up to 3 seconds
                .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // Timestamp assigner: extract the event timestamp from the data
                .withTimestampAssigner((element, recordTimestamp) -> {
                    System.out.println("data = " + element + ", recordTs = " + recordTimestamp);
                    return element.getTs() * 1000L;
                });

        // TODO Apply the watermark strategy
        SingleOutputStreamOperator<WaterSensor> sensorWithWatermark =
                sensor.assignTimestampsAndWatermarks(watermarkStrategy);

        sensorWithWatermark
                .keyBy(value -> value.getId())
                // Event-time windows, so the watermark drives window firing
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                    @Override
                    public void process(String s, ProcessWindowFunction<WaterSensor, String, String, TimeWindow>.Context context, Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
                        long start = context.window().getStart();
                        long end = context.window().getEnd();
                        long count = iterable.spliterator().estimateSize();
                        collector.collect("key = " + s + " window [" + start + " , " + end + ") contains " + count + " records ===> " + iterable.toString());
                    }
                })
                .print();

        environment.execute();
    }
}

Watermark Propagation

After an upstream task has processed a watermark and advanced its clock, it re-emits its current watermark, broadcasting it to all downstream subtasks.

When a task receives watermarks from multiple upstream parallel tasks, it should take the minimum of them as its current event-time clock. (Every task determines its clock by the standard of "all earlier data has been processed.")
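This merge rule can be sketched in plain Java, with no Flink dependency (the class and method names are illustrative): each input channel remembers the last watermark it received, and the task's clock is the minimum across channels.

```java
import java.util.Arrays;

// Sketch of how a task derives its event-time clock from multiple
// upstream partitions: per-channel last watermark, clock = minimum.
public class WatermarkMergeSketch {
    private final long[] upstream; // last watermark received per upstream channel

    public WatermarkMergeSketch(int channels) {
        upstream = new long[channels];
        Arrays.fill(upstream, Long.MIN_VALUE); // nothing received yet
    }

    // Record a watermark from one channel and return the task's clock,
    // i.e. the value it would broadcast downstream.
    public long onWatermark(int channel, long watermark) {
        upstream[channel] = Math.max(upstream[channel], watermark); // never move backwards
        return Arrays.stream(upstream).min().getAsLong();
    }

    public static void main(String[] args) {
        WatermarkMergeSketch task = new WatermarkMergeSketch(2);
        task.onWatermark(0, 5000);                     // channel 1 has sent nothing yet
        System.out.println(task.onWatermark(1, 3000)); // prints 3000: min(5000, 3000)
        System.out.println(task.onWatermark(1, 7000)); // prints 5000: min(5000, 7000)
    }
}
```

This is also why one idle partition can stall the whole pipeline, which motivates `withIdleness` in the demo below.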

Simulating Propagation

package ac.sict.reid.leo.Watermark;

import ac.sict.reid.leo.Computing.PhysicalPartition.Computing.MyPartitioner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class WatermarkIdlenessDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(2);

        SingleOutputStreamOperator<Integer> socketDS = environment.socketTextStream("localhost", 7777)
                .partitionCustom(new MyPartitioner(), r -> r)
                .map(r -> Integer.parseInt(r))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Integer>forMonotonousTimestamps()
                                .withTimestampAssigner((r, ts) -> r * 1000L)
                                // If a partition is idle for 5 s, exclude it from the
                                // downstream minimum-watermark calculation
                                .withIdleness(Duration.ofSeconds(5))
                );

        socketDS.keyBy(r -> r % 2)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<Integer, String, Integer, TimeWindow>() {
                    @Override
                    public void process(Integer integer, ProcessWindowFunction<Integer, String, Integer, TimeWindow>.Context context, Iterable<Integer> iterable, Collector<String> collector) throws Exception {
                        long start = context.window().getStart();
                        long end = context.window().getEnd();
                        long count = iterable.spliterator().estimateSize();
                        collector.collect("key = " + integer + " window [" + start + " , " + end + ") contains " + count + " records ===> " + iterable.toString());
                    }
                }).print();

        environment.execute();

    }
}

Handling Late Data

package ac.sict.reid.leo.Watermark;

import ac.sict.reid.leo.Computing.Aggregation.WaterSensorMapFunction;
import ac.sict.reid.leo.POJO.WaterSensor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

public class WatermarkLateDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);

        SingleOutputStreamOperator<WaterSensor> sensor = environment.socketTextStream("localhost", 7777).
                map(new WaterSensorMapFunction());

      
        // Watermark: allow 3 seconds of out-of-orderness
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withTimestampAssigner((element, recordTimestamp) -> element.getTs() * 1000L);

        SingleOutputStreamOperator<WaterSensor> sensorWithWatermark =
                sensor.assignTimestampsAndWatermarks(watermarkStrategy);

        OutputTag<WaterSensor> lateTag = new OutputTag<>("late-data", Types.POJO(WaterSensor.class));

        SingleOutputStreamOperator<String> process = sensorWithWatermark
                .keyBy(r -> r.getId())
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // Delay closing the window by 2 seconds after the watermark
                // passes the window end
                .allowedLateness(Time.seconds(2))
                // Data arriving after the window has closed goes to a side output
                .sideOutputLateData(lateTag)
                .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                    @Override
                    public void process(String s,
                                        ProcessWindowFunction<WaterSensor, String, String, TimeWindow>.Context context,
                                        Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
                        long start = context.window().getStart();
                        long end = context.window().getEnd();
                        long count = iterable.spliterator().estimateSize();
                        collector.collect("key = " + s + " window [" + start + " , " + end + ") contains " + count + " records ===> " + iterable.toString());
                    }
                });

        process.print();
        // Retrieve the side output from the main stream and print it
        process.getSideOutput(lateTag).printToErr("late data after the window closed");
        environment.execute();

    }
}

Time-Based Stream Joins

Window Join

package ac.sict.reid.leo.Join;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowJoinDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);

        SingleOutputStreamOperator<Tuple2<String, Integer>> ds1 = environment.fromElements(
                Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3), Tuple2.of("c", 4)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Integer>>forMonotonousTimestamps().
                withTimestampAssigner((value, ts) -> value.f1 * 1000L));

        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> ds2 = environment.fromElements(
                Tuple3.of("a", 1, 1), Tuple3.of("a", 11, 1), Tuple3.of("b", 2, 1),
                Tuple3.of("b", 12, 1), Tuple3.of("c", 14, 1), Tuple3.of("d", 15, 1)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, Integer, Integer>>forMonotonousTimestamps().
                withTimestampAssigner((value, ts) -> value.f1 * 1000L));

        /*
        * TODO window join
        *   1. Records match only if they fall into the same time window
        *   2. Matching is based on the keyBy key
        *   3. Only matched pairs are emitted: like an inner join over a fixed time range
        * */

        DataStream<String> join = ds1.join(ds2).where(r1 -> r1.f0).equalTo(r2 -> r2.f0).
                window(TumblingEventTimeWindows.of(Time.seconds(10))).apply(new JoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>() {

                    /**
                     * Called for each matched pair of records.
                     *
                     * @param stringIntegerTuple2               record from ds1
                     * @param stringIntegerIntegerTuple3        record from ds2
                     * @return
                     * @throws Exception
                     */
                    @Override
                    public String join(Tuple2<String, Integer> stringIntegerTuple2, Tuple3<String, Integer, Integer> stringIntegerIntegerTuple3) throws Exception {
                        return stringIntegerTuple2 + "<---------------->" + stringIntegerIntegerTuple3;
                    }
                });

        join.print();
        environment.execute();
    }
}

Interval Join

An interval join handles cases where the matching time range is not fixed. For each record in one stream, it opens a time interval around that record's timestamp and checks whether any record from the other stream falls into it.

Principle

Given two time offsets as the "lower bound" and "upper bound" of the interval, for any record a in stream A, open the interval [a.timestamp + lowerBound, a.timestamp + upperBound], a closed interval centered on a's timestamp, extending down to the lower bound and up to the upper bound. This range is the window within which records from the other stream can be matched.
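The matching rule reduces to a simple closed-interval test; here is a minimal standalone sketch (a hypothetical helper, not part of the Flink API) before the full demo:

```java
// A record b from the other stream matches record a exactly when b's
// timestamp falls in the closed interval
// [a.timestamp + lowerBound, a.timestamp + upperBound].
public class IntervalMatchSketch {
    static boolean matches(long aTs, long bTs, long lowerBound, long upperBound) {
        return bTs >= aTs + lowerBound && bTs <= aTs + upperBound;
    }

    public static void main(String[] args) {
        // between(Time.seconds(-2), Time.seconds(2)) on a record with ts = 5000 ms
        System.out.println(matches(5000, 3000, -2000, 2000)); // prints true  (on the lower edge)
        System.out.println(matches(5000, 7001, -2000, 2000)); // prints false (past the upper edge)
    }
}
```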

package ac.sict.reid.leo.Join;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);

        SingleOutputStreamOperator<Tuple2<String, Integer>> ds1 = environment.fromElements(
                Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3), Tuple2.of("c", 4)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Integer>>forMonotonousTimestamps().
                withTimestampAssigner((value, ts) -> value.f1 * 1000L));

        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> ds2 = environment.fromElements(
                Tuple3.of("a", 1, 1), Tuple3.of("a", 11, 1), Tuple3.of("b", 2, 1),
                Tuple3.of("b", 12, 1), Tuple3.of("c", 14, 1), Tuple3.of("d", 15, 1)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, Integer, Integer>>forMonotonousTimestamps().
                withTimestampAssigner((value, ts) -> value.f1 * 1000L));

        /*
        * TODO interval join
        *   1. keyBy each stream; the key is the join condition
        *   2. Call intervalJoin
        * */
        KeyedStream<Tuple2<String, Integer>, String> ks1 = ds1.keyBy(r1 -> r1.f0);
        KeyedStream<Tuple3<String, Integer, Integer>, String> ks2 = ds2.keyBy(r2 -> r2.f0);

        ks1.intervalJoin(ks2).between(Time.seconds(-2),Time.seconds(2)).process(
                new ProcessJoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>() {
                    /**
                     * Called when a pair of records from the two streams matches.
                     *
                     * @param stringIntegerTuple2                   ks1 data
                     * @param stringIntegerIntegerTuple3            ks2 data
                     * @param context                               context
                     * @param collector                             output collector
                     * @throws Exception
                     */
                    @Override
                    public void processElement(Tuple2<String, Integer> stringIntegerTuple2, Tuple3<String, Integer, Integer> stringIntegerIntegerTuple3, ProcessJoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>.Context context, Collector<String> collector) throws Exception{
                        collector.collect(stringIntegerTuple2 + "<------------>" + stringIntegerIntegerTuple3);
                    }
                }
        ).print();

        environment.execute();

    }
}

Handling Late Data in an Interval Join

package ac.sict.reid.leo.Join;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

/*
* Interval join when some data may arrive late
*
* */
public class IntervalJoinWithLateDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);

        SingleOutputStreamOperator<Tuple2<String, Integer>> ds1 = environment.socketTextStream("localhost", 7777).map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                String[] split = s.split(",");
                return Tuple2.of(split[0], Integer.valueOf(split[1]));
            }
        }).assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3)).
                        withTimestampAssigner((value, ts) -> value.f1 * 1000L)
        );

        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> ds2 = environment.socketTextStream("localhost", 8888).map(new MapFunction<String, Tuple3<String, Integer, Integer>>() {
            @Override
            public Tuple3<String, Integer, Integer> map(String s) throws Exception {
                String[] split = s.split(",");
                return Tuple3.of(split[0], Integer.valueOf(split[1]), Integer.valueOf(split[2]));
            }
        }).assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Integer, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(3)).
                        withTimestampAssigner(
                                (value, ts) -> value.f1 * 1000L
                        )
        );

        /*
        * TODO Interval join
        *  1. Only supported in event time
        *  2. Specify the lower/upper bound offsets: negative means earlier, positive means later
        *  3. process only sees records that actually joined
        *  4. The watermark of the joined stream is the minimum of the two input watermarks
        *  5. A record whose event time is below the current watermark is late and is
        *     not handled by the main-stream process function
        *                ----> route late records to side outputs
        * */

        KeyedStream<Tuple2<String, Integer>, String> ks1 = ds1.keyBy(r -> r.f0);
        KeyedStream<Tuple3<String, Integer, Integer>, String> ks2 = ds2.keyBy(r -> r.f0);

        OutputTag<Tuple2<String,Integer>> ks1LateTag = new OutputTag<>("ks1-late", Types.TUPLE(Types.STRING, Types.INT));
        OutputTag<Tuple3<String,Integer,Integer>> ks2LateTag = new OutputTag<>("ks2-late", Types.TUPLE(Types.STRING, Types.INT, Types.INT));

        SingleOutputStreamOperator<String> process = ks1.intervalJoin(ks2).between(Time.seconds(-2), Time.seconds(2)).
                sideOutputLeftLateData(ks1LateTag).sideOutputRightLateData(ks2LateTag).process(
                        new ProcessJoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>() {
                            @Override
                            public void processElement(Tuple2<String, Integer> stringIntegerTuple2, Tuple3<String, Integer, Integer> stringIntegerIntegerTuple3, ProcessJoinFunction<Tuple2<String, Integer>, Tuple3<String, Integer, Integer>, String>.Context context, Collector<String> collector) throws Exception {
                                collector.collect(stringIntegerTuple2 + "<---------->" + stringIntegerIntegerTuple3);
                            }
                        }
                );

        process.print("main stream");
        process.getSideOutput(ks1LateTag).printToErr("ks1 late data");
        process.getSideOutput(ks2LateTag).printToErr("ks2 late data");

        environment.execute();
    }
}