flink中watermaker的理解

目录

一,watermaker的概念

1,什么是watermaker

2,watermaker的例子

二,代码演习

1,简易版本

2,自定义watermaker类


一,watermaker的概念

1,什么是watermaker

watermaker又称水位线,它是一个带时间戳的DataStream,是为了解决实时计算中的数据乱序问题,以窗口延迟来兼容迟到的数据,当水位线大于窗口时间时触发窗口执行(watermaker是跟窗口一起配合使用的,比如五分钟的滑动或者滚动窗口等)。
watermaker的计算方式是:

watermaker = 当前窗口中最大的eventTime - 延迟时间

flink中的时间分成三类:

事件时间(EventTime): 事件真真正正发生产生的时间
摄入时间(IngestionTime): 事件到达Flink的时间
处理时间(ProcessingTime): 事件真正被处理/计算的时间
EventTime事件时间是最为重要的,因为只要事件时间一产生就不会变化,事件时间更能反映事件的本质

2,watermaker的例子

在时间窗口10:00:00-10:10:00中有6条数据依次到达。

a  10:01:00
b  10:11:00
c  10:08:00
d  10:14:00
e  10:15:00
f  10:09:00


abcdef依次到达。最大延迟时间是5分钟。对于10:00:00-10:10:00的数据统计。
没有Watemaker时
a到达 会正常统计
b到达 触发窗口,认为已经到达统计时间了就结束统计了,所以只有a,b会被统计,后面的cdef就丢弃不会被统计。

而有Watemaker,且最大延迟时间是5分钟时
a到达 Watermaker=max(10:01:00)-5 =09:56:00 < 窗口结束时间10:10:00 – 不触发条件
b到达 Watermaker=max(10:11:00)-5 =10:06:00 < 窗口结束时间10:10:00 – 不触发条件
c到达 Watermaker=max(10:08:00)-5 =10:03:00 < 窗口结束时间10:10:00 – 不触发条件
d到达 Watermaker=max(10:14:00)-5 =10:09:00 < 窗口结束时间10:10:00 – 不触发条件
e到达 Watermaker=max(10:15:00)-5 =10:10:00 = 窗口结束时间10:10:00 – 满足触发条件,窗口开始计算那么a,b,c,d,e会被统计进窗口[10:00:00  10:10:00].
Watermaker机制可以在一定程度上解决数据乱序后延迟到达问题,但是更严重的还是无法解决。比如f数据到达时窗口已经计算完毕,所以f数据还是会丢失。

二,代码演习

Event Time的使用通常要指定数据源中的时间戳:

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

需要的依赖:

    <dependency>
      <groupId>org.projectlombok</groupId>
      <artifactId>lombok</artifactId>
      <version>1.18.10</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients_2.11</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-java</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>${kafka.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-java_2.11</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_2.11</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
      <exclusions>
        <exclusion>
          <groupId>com.google.code.findbugs</groupId>
          <artifactId>jsr305</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

1,简易版本

构造一个订单数据来模拟实时数据,然后统计相同用户在5秒内的消费情况。

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class check_waterMaker1 {

    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(100);
        env.getCheckpointConfig().setCheckpointTimeout(60 * 60 * 1000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        //2.Source
        //模拟实时订单数据(数据有延迟和乱序)
        DataStream<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            private boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    //模拟数据延迟和乱序!
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    System.out.println("orderId:"+orderId+"\tuserId:"+userId+"\tmoney:"+money+"\teventTime:"+eventTime);
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        }).setParallelism(1);

        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                ).setParallelism(1);
        //代码走到这里,就已经被添加上Watermaker了!接下来就可以进行窗口计算了
        //要求每隔5s,计算5秒内(基于时间的滚动窗口),每个用户的订单总金额
        SingleOutputStreamOperator<Order> money = watermakerDS
                .keyBy(Order_t -> Order_t.userId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum("money")
                .setParallelism(1);

        money.print();

        //5.execute
        env.execute();
    }


    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {

        public String orderId;
        public Integer userId;
        public Integer money;
        public Long eventTime;

        public Order(String orderId, int userId, int money, long eventTime) {
            this.orderId = orderId;
            this.userId = userId;
            this.money = money;
            this.eventTime = eventTime;

        }

        public long getEventTime() {
            return this.eventTime;
        }

        public static <Integer> Integer getUserId(Order order) {
            return (Integer) order.userId;
        }

    }

}

构造实时数据时需要引用lombok包。

2,自定义watermaker类

如下:


import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.commons.lang.time.FastDateFormat;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.*;
import java.util.concurrent.TimeUnit;

public class check_waterMaker2 {

    public static void main(String[] args) throws Exception {

        FastDateFormat df = FastDateFormat.getInstance("HH:mm:ss");

        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // 提交水位线的周期,将之设置成1秒下面orderDS.assignTimestampsAndWatermarks中的输出就是一条一次,设置成100就是一条十次。
        env.getConfig().setAutoWatermarkInterval(1000);
        env.setParallelism(1);
//        env.enableCheckpointing(checkpointInteval);   //把这个注释掉就不会有offset上传了。
        env.getCheckpointConfig().setCheckpointTimeout(60 * 60 * 1000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        //2.Source
        //模拟实时订单数据(数据有延迟和乱序)
        DataStreamSource<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            private boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    //模拟数据延迟和乱序!
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    System.out.println("发送的数据为: " + orderId + " \t userId:"+userId+"\t"+money+"\t"+ df.format(eventTime));
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });

//        orderDS.print();
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        new WatermarkStrategy<Order>() {
                            @Override
                            public WatermarkGenerator<Order> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                                return new WatermarkGenerator<Order>() {
                                    private int userId = 0;
                                    private long eventTime = 0L;
                                    private final long outOfOrdernessMillis = 3000;
                                    private long maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;

                                    @Override
                                    public void onEvent(Order event, long eventTimestamp, WatermarkOutput output) {
                                        userId = event.userId;
                                        eventTime = event.eventTime;
                                        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
                                    }

                                    @Override
                                    public void onPeriodicEmit(WatermarkOutput output) {
                                        //Watermaker = 当前最大事件时间 - 最大允许的延迟时间或乱序时间
                                        Watermark watermark = new Watermark(maxTimestamp - outOfOrdernessMillis - 1);
                                        System.out.println("key:" + userId + ",系统时间:" + df.format(System.currentTimeMillis()) + ",事件时间:" + df.format(eventTime) + ",水位线时间:" + df.format(watermark.getTimestamp()));
                                        output.emitWatermark(watermark);
                                    }
                                };
                            }
                        }.withTimestampAssigner((event, timestamp) -> event.getEventTime())
                ).setParallelism(1);
        //学习测试时可以使用下面的代码对数据进行更详细的输出,如输出窗口触发时各个窗口中的数据的事件时间,Watermaker时间
        SingleOutputStreamOperator<Object> result = watermakerDS
                .keyBy(Order_t -> Order_t.userId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                //把apply中的函数应用在窗口中的数据上
                //WindowFunction<IN, OUT, KEY, W extends Window>
                .apply(new WindowFunction<Order, Object, Integer, TimeWindow>() {
                    @Override
                    public void apply(Integer key, TimeWindow window, Iterable<Order> input, Collector<Object> out) throws Exception {
                        //准备一个集合用来存放属于该窗口的数据的事件时间
                        List<String> eventTimeList = new ArrayList<>();
                        Iterator<Order> iterator = input.iterator();
                        int sum_n = 0;
                        while (iterator.hasNext()) {
                            Order next = iterator.next();
                            Long eventTime = next.eventTime;
                            eventTimeList.add(df.format(eventTime));
                            sum_n += next.money;
                        }
                        String outStr = String.format("key:%s,窗口开始结束:[%s~%s),属于该窗口的事件时间:%s,userid:%s,total_money:%d",
                                key.toString(), df.format(window.getStart()), df.format(window.getEnd()), eventTimeList, key, sum_n);
                        out.collect(outStr);
                    }
                })
                .setParallelism(1)
                ;

        //4.Sink
        result.print();

        //5.execute
        env.execute();
    }


    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {

        public String orderId;
        public Integer userId;
        public Integer money;
        public Long eventTime;

        public Order(String orderId, int userId, int money, long eventTime) {
            this.orderId = orderId;
            this.userId = userId;
            this.money = money;
            this.eventTime = eventTime;

        }

        public long getEventTime() {
            return this.eventTime;
        }

        public static <Integer> Integer getUserId(Order order) {
            return (Integer) order.userId;
        }

    }

}

参考:

Flink教程(12)- Flink高级API(Time与Watermaker)_杨林伟的博客-CSDN博客

https://www.cnblogs.com/dalianpai/p/15268363.html

Flink最佳实践 - Watermark原理及实践问题解析 - 知乎

https://zhuanlan.zhihu.com/p/364013202

2021年大数据Flink(二十三):Watermaker案例演示-云社区-华为云

Flink水位线Watermaker生产应用避坑分享 - 墨天轮

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

一杯拿铁go

你的打赏是我更新最大的动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值