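Handling Late Data in Flink (Trigger, Watermark Delay, Allowed Lateness for Windows, Sending Late Data to a Side Output, Code Example, Handling Duplicate Results from Late Firings)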


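Preface

  Late data is data that arrives after the watermark has already passed it, that is, its event time is earlier than the current watermark. Handling late data is therefore only meaningful under event-time semantics. For out-of-order streams, you can set a watermark delay; for window computations, you can set an allowed lateness on the window; in addition, late data can be sent to Side Outputs.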


1. Trigger

  A Trigger determines when a window invokes its window function. The abstract class Trigger declares the following methods:

onElement() called for each element that is added to a window.

onEventTime() called when a registered event-time timer fires.

onProcessingTime() called when a registered processing-time timer fires.

onMerge() relevant for stateful triggers; merges the states of two triggers when their corresponding windows merge, e.g. when using session windows.

clear() performs any action needed upon removal of the corresponding window.

The first three methods return a TriggerResult, one of the enum values below, which determines what happens to the window:

    /** No action is taken on the window. */
    CONTINUE(false, false),

    /** {@code FIRE_AND_PURGE} evaluates the window function and emits the window result. */
    FIRE_AND_PURGE(true, true),

    /**
     * On {@code FIRE}, the window is evaluated and results are emitted. The window is not purged,
     * though, all elements are retained.
     */
    FIRE(true, false),

    /**
     * All elements in the window are cleared and the window is discarded, without evaluating the
     * window function or emitting any elements.
     */
    PURGE(false, true);
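
To make these results concrete, here is a minimal plain-Java model of how an event-time trigger could choose a TriggerResult. It is a sketch, not Flink's API: the class TriggerSketch, its local TriggerResult enum, and the method signatures are assumptions made for illustration. The decision rule (fire at once if the watermark has already reached the window's maximum timestamp, otherwise wait for the event-time timer) follows the behavior of Flink's EventTimeTrigger.

```java
final class TriggerSketch {

    /** Local stand-in for Flink's TriggerResult, with the same (fire, purge) flags. */
    enum TriggerResult {
        CONTINUE(false, false),
        FIRE_AND_PURGE(true, true),
        FIRE(true, false),
        PURGE(false, true);

        final boolean fire;
        final boolean purge;

        TriggerResult(boolean fire, boolean purge) {
            this.fire = fire;
            this.purge = purge;
        }
    }

    /** Decision taken when an element is added to the window [start, windowEnd). */
    static TriggerResult onElement(long windowEnd, long currentWatermark) {
        long maxTimestamp = windowEnd - 1; // last timestamp that still belongs to the window
        // The watermark already passed the window: this is a late element, fire right away.
        // Otherwise keep collecting and wait for the timer registered at maxTimestamp.
        return maxTimestamp <= currentWatermark ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    /** Decision taken when an event-time timer fires at timerTime. */
    static TriggerResult onEventTime(long timerTime, long windowEnd) {
        return timerTime == windowEnd - 1 ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    public static void main(String[] args) {
        System.out.println(onElement(5000L, 2999L));   // window not complete yet
        System.out.println(onElement(5000L, 4999L));   // late element
        System.out.println(onEventTime(4999L, 5000L)); // timer at the window's max timestamp
    }
}
```

Because the result is FIRE rather than FIRE_AND_PURGE, the window contents are kept, which is what makes repeated firings for late elements possible.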

2. Handling Late Data

2.1 Setting the Watermark Delay

In order to work with event time, Flink needs to know the events' timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element by using a TimestampAssigner. Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time. You can configure this by specifying a WatermarkGenerator.

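  A watermark marks the progress of event time and acts as the global logical clock of the whole application. Watermarks flow between tasks together with the data and tell each task what the current event time is. Generating watermarks requires a TimestampAssigner (which assigns the event-time timestamp) and a WatermarkGenerator (which defines how watermarks are generated: on events or periodically).
  Once a watermark delay is set, all event-time timers fire according to the delayed watermark. The delay should generally not be set too large, otherwise the latency of the stream processing increases considerably; depending on requirements it is usually in the range of milliseconds to seconds.

For example, a strategy commonly used for out-of-order streams: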

WatermarkStrategy
        .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
        .withTimestampAssigner((event, timestamp) -> event.f0);

Source code of BoundedOutOfOrdernessWatermarks:

public class BoundedOutOfOrdernessWatermarks<T> implements WatermarkGenerator<T> {

    /** The maximum timestamp encountered so far. */
    private long maxTimestamp;

    /** The maximum out-of-orderness that this watermark generator assumes. */
    private final long outOfOrdernessMillis;

    /**
     * Creates a new watermark generator with the given out-of-orderness bound.
     *
     * @param maxOutOfOrderness The bound for the out-of-orderness of the event timestamps.
     */
    public BoundedOutOfOrdernessWatermarks(Duration maxOutOfOrderness) {
        checkNotNull(maxOutOfOrderness, "maxOutOfOrderness");
        checkArgument(!maxOutOfOrderness.isNegative(), "maxOutOfOrderness cannot be negative");

        this.outOfOrdernessMillis = maxOutOfOrderness.toMillis();

        // start so that our lowest watermark would be Long.MIN_VALUE.
        this.maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;
    }

    // ------------------------------------------------------------------------

    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(maxTimestamp - outOfOrdernessMillis - 1));
    }
}
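
As a worked example of the arithmetic in the source above, the following plain-Java sketch replays the same two steps (track the maximum timestamp, then emit maxTimestamp - outOfOrderness - 1) without any Flink dependency. The class name WatermarkSketch and the sample timestamps are assumptions for illustration; the 2-second bound matches the example later in this post.

```java
final class WatermarkSketch {
    private long maxTimestamp;
    private final long outOfOrdernessMillis;

    WatermarkSketch(long outOfOrdernessMillis) {
        this.outOfOrdernessMillis = outOfOrdernessMillis;
        // Same initialisation as the Flink class: the first watermark is Long.MIN_VALUE.
        this.maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;
    }

    /** Mirrors onEvent(): only the largest timestamp seen so far matters. */
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    /** Mirrors the value emitted by onPeriodicEmit(). */
    long currentWatermark() {
        return maxTimestamp - outOfOrdernessMillis - 1;
    }

    public static void main(String[] args) {
        WatermarkSketch generator = new WatermarkSketch(2000L);
        long[] timestamps = {1000L, 7000L, 3000L}; // 3000 arrives out of order
        for (long ts : timestamps) {
            generator.onEvent(ts);
            System.out.println("event " + ts + " -> watermark " + generator.currentWatermark());
        }
    }
}
```

After the event with timestamp 7000 the watermark is 7000 - 2000 - 1 = 4999, and the out-of-order event 3000 does not move it backwards.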

2.2 Allowing Windows to Process Late Data

When working with event-time windowing, it can happen that elements arrive late, i.e. the watermark that Flink uses to keep track of the progress of event-time is already past the end timestamp of a window to which an element belongs. See event time and especially late elements for a more thorough discussion of how Flink deals with event time.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state, as also described in the Window Lifecycle section.
By default, the allowed lateness is set to 0. That is, elements that arrive behind the watermark will be dropped.

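  For window computations, once the watermark reaches the end of the window the window is closed by default, and any late data is dropped. You can therefore set an allowed lateness so that late data continues to be processed. The default allowed lateness is 0. With an allowed lateness configured, as long as the watermark has passed the window's end timestamp but not yet the end timestamp plus the allowed lateness, late data can still be added to the window and triggers a computation.
  The process can be viewed as follows: when the watermark reaches the end of the window, an approximately correct result is emitted quickly; the window is then kept open waiting for late data, and every late element triggers the computation again and emits an updated result. The result is corrected step by step until the accurate value is reached.

Note: with GlobalWindows there is no late data, because the end timestamp of the global window is Long.MAX_VALUE.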

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .<windowed transformation>(<window function>);
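
The three possible fates of an element can be sketched in plain Java. This is a simplified model under stated assumptions, not Flink code: the class LatenessSketch, the Outcome enum, and the method names are hypothetical, the window is a tumbling window with offset 0, and the drop rule (window max timestamp plus allowed lateness is less than or equal to the watermark) follows the window cleanup time used by Flink's window operator. The sample values use the 5-second window and 1-minute lateness from the example later in this post.

```java
final class LatenessSketch {

    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    /** Start of the tumbling window (offset 0) that contains the timestamp. */
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - Math.floorMod(timestamp, windowSize);
    }

    static Outcome classify(long eventTimestamp, long windowSize,
                            long allowedLateness, long currentWatermark) {
        long maxTimestamp = windowStart(eventTimestamp, windowSize) + windowSize - 1;
        long cleanupTime = maxTimestamp + allowedLateness; // window state is removed at this point
        if (cleanupTime <= currentWatermark) {
            return Outcome.DROPPED;            // or sent to the side output, if one is configured
        }
        if (maxTimestamp <= currentWatermark) {
            return Outcome.LATE_BUT_ACCEPTED;  // added to the window, may cause a late firing
        }
        return Outcome.ON_TIME;
    }

    public static void main(String[] args) {
        long size = 5_000L;
        long lateness = 60_000L;
        // An event with timestamp 3000 belongs to the window [0, 5000).
        System.out.println(classify(3_000L, size, lateness, 2_999L));
        System.out.println(classify(3_000L, size, lateness, 4_999L));
        System.out.println(classify(3_000L, size, lateness, 64_999L));
    }
}
```

With a 2-second watermark delay, the watermark reaches 64999 when an event with timestamp 67000 arrives (67000 - 2000 - 1), which is the point at which the window [0, 5000) stops accepting data.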

2.3 Sending Late Data to a Side Output

Using Flink’s side output feature you can get a stream of the data that was discarded as late.
You first need to specify that you want to get late data using sideOutputLateData(OutputTag) on the windowed stream. Then, you can get the side-output stream by calling getSideOutput(OutputTag) on the result of the windowed operation; the generic pattern appears at the end of section 3.1.

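  If late data still arrives after the window has closed, a side output can collect it so that no data is lost. Because the window is truly closed at that point, the only option is to keep the previously emitted window results, read the late data from the side output, determine which window each element belongs to, and merge the update into the stored result manually.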
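
The manual merge described above could look roughly like the following plain-Java sketch. Everything in it is an assumption for illustration: the class LateMergeSketch, the in-memory map standing in for wherever the earlier window results were stored, the "url@windowStart" key format, and the 5-second tumbling window with offset 0 (chosen to match the example in section 3).

```java
import java.util.HashMap;
import java.util.Map;

final class LateMergeSketch {
    static final long WINDOW_SIZE = 5_000L;

    // Previously emitted window results, keyed by "url@windowStart".
    private final Map<String, Long> counts = new HashMap<>();

    /** Start of the tumbling window (offset 0) that the timestamp belongs to. */
    static long windowStart(long timestamp) {
        return timestamp - Math.floorMod(timestamp, WINDOW_SIZE);
    }

    /** Remember a result emitted by the window operator. */
    void storeWindowResult(String url, long windowStart, long count) {
        counts.put(url + "@" + windowStart, count);
    }

    /** Merge one late event from the side output; returns the corrected count. */
    long mergeLateEvent(String url, long eventTimestamp) {
        String key = url + "@" + windowStart(eventTimestamp);
        return counts.merge(key, 1L, Long::sum);
    }

    public static void main(String[] args) {
        LateMergeSketch store = new LateMergeSketch();
        store.storeWindowResult("./home", 0L, 3L);
        // A late click on ./home with timestamp 3000 belongs to the window [0, 5000).
        System.out.println(store.mergeLateEvent("./home", 3_000L));
    }
}
```

In a real job the stored results would typically live in an external store that supports updates by key, so that the correction can be applied after the Flink window state is gone.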

3. Hands-On

3.1 Code Example

POJO

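import java.sql.Timestamp;

// A click event: which user visited which URL, and when (event time in epoch milliseconds).
public class Event {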
    public String user;
    public String url;
    public long timestamp;

    public Event() {
    }

    public Event(String user, String url, Long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    @Override
    public int hashCode() {
        return super.hashCode();
    }

    public String getUser() {
        return user;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public Long getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(Long timestamp) {
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "Event{" +
                "user='" + user + '\'' +
                ", url='" + url + '\'' +
                ", timestamp=" + new Timestamp(timestamp) +
                '}';
    }
}
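
The job in this section also emits UrlViewCount objects, a class the post does not show. A minimal sketch that fits the constructor call used in the job (a String, a count, and the window start and end) might look like this. The field names and the toString format are assumptions; for Flink to treat it as a POJO it would additionally need to be declared public in its own file.

```java
// Hypothetical result type: the number of views of one URL within one window.
class UrlViewCount {
    public String url;
    public Long count;
    public Long windowStart;
    public Long windowEnd;

    // A no-argument constructor is part of what Flink expects from a POJO.
    public UrlViewCount() {
    }

    public UrlViewCount(String url, Long count, Long windowStart, Long windowEnd) {
        this.url = url;
        this.count = count;
        this.windowStart = windowStart;
        this.windowEnd = windowEnd;
    }

    @Override
    public String toString() {
        return "UrlViewCount{" +
                "url='" + url + '\'' +
                ", count=" + count +
                ", windowStart=" + windowStart +
                ", windowEnd=" + windowEnd +
                '}';
    }
}
```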

Core code
(1) Read data from Kafka and convert it into a stream of Event
(2) Assign watermarks
(3) Apply the window operation
(4) Print to the console

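import java.time.Duration;
import java.util.Properties;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class LaterDataTest {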
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1 so that a single watermark drives the window (see 3.2)
        env.setParallelism(1);

        // Read from Kafka; each record is a line of the form user,url,timestamp
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.42.102:9092,192.168.42.103:9092,192.168.42.104:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");
        DataStream<Event> stream = env
                .addSource(new
                        FlinkKafkaConsumer<String>(
                        "clicks",
                        new SimpleStringSchema(),
                        properties
                ))
                .flatMap(new FlatMapFunction<String, Event>() {
                    @Override
                    public void flatMap(String s, Collector<Event> collector) throws Exception {
                        String[] split = s.split(",");
                        collector.collect(new Event(split[0], split[1], Long.parseLong(split[2])));
                    }
                });

        // Watermarks for an out-of-order stream: requires a Duration (maximum out-of-orderness) and a TimestampAssigner
        SingleOutputStreamOperator<Event> watermarks = stream.assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(
                Duration.ofSeconds(2)).withTimestampAssigner(
                new SerializableTimestampAssigner<Event>() {
                    @Override
                    public long extractTimestamp(Event element, long recordTimestamp) {
                        return element.timestamp;
                    }
                }));
        watermarks.print("input");

        // Define an output tag; the anonymous subclass ({}) preserves the generic type (see 3.2)
        OutputTag<Event> late = new OutputTag<Event>("late") {
        };

        // Window with an allowed lateness; data later than that goes to the side output
        SingleOutputStreamOperator<UrlViewCount> aggregate = watermarks
                .keyBy(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event value) throws Exception {
                        return value.url;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .allowedLateness(Time.minutes(1))
                .sideOutputLateData(late)
                .aggregate(new AggregateFunction<Event, Long, Long>() {
                               @Override
                               public Long createAccumulator() {
                                   return 0L;
                               }

                               @Override
                               public Long add(Event value, Long accumulator) {
                                   return accumulator + 1;
                               }

                               @Override
                               public Long getResult(Long accumulator) {
                                   return accumulator;
                               }

                               @Override
                               public Long merge(Long a, Long b) {
                                   // Only called for merging windows (e.g. session windows); sum the partial counts
                                   return a + b;
                               }
                           }
                        ,
                        new ProcessWindowFunction<Long, UrlViewCount, String, TimeWindow>() {
                            @Override
                            public void process(String s, Context context, Iterable<Long> elements, Collector<UrlViewCount> out) throws Exception {
                                long start = context.window().getStart();
                                long end = context.window().getEnd();
                                Long next = elements.iterator().next();
                                long currentWatermark = context.currentWatermark();
                                out.collect(new UrlViewCount(s + " watermark=" + currentWatermark, next, start, end));
                            }
                        }
                );

        // Print the window results and the late data from the side output
        aggregate.print("result");
        aggregate.getSideOutput(late).print("late");
        
        env.execute();

    }
}

The generic pattern from the Flink documentation for retrieving late data through a side output:

final OutputTag<T> lateOutputTag = new OutputTag<T>("late-data"){};

DataStream<T> input = ...;

SingleOutputStreamOperator<T> result = input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .sideOutputLateData(lateOutputTag)
    .<windowed transformation>(<window function>);

DataStream<T> lateStream = result.getSideOutput(lateOutputTag);

3.2 Exceptions Encountered Along the Way

(1) Generic type erasure

Caused by: org.apache.flink.api.common.functions.InvalidTypesException: The types of the interface org.apache.flink.util.OutputTag could not be inferred. Support for synthetic interfaces, lambdas, and generic or raw types is limited at this point

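Fix:
Create the OutputTag as an anonymous subclass by appending {}, as in the code in 3.1 (new OutputTag<Event>("late") {}), so that the generic type Event is still available despite type erasure.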

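(2) Parallelism not set to 1, so the window did not behave as expected
Fix:
Set the parallelism to 1 with env.setParallelism(1), as in the code in 3.1.

Cause:
  When watermarks are propagated, the upstream parallel tasks follow the bucket principle: the shortest stave determines the water level of the bucket. In other words, the watermark of a downstream task is the minimum of the watermarks it has received from all of its upstream parallel tasks.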
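
The bucket principle can be written down in a few lines of plain Java. The class MinWatermarkSketch and its method are hypothetical names for illustration; the rule itself (take the minimum over all upstream watermarks) is the one described above. One plausible reading of the problem in (2) is that, with a parallelism greater than 1, an upstream subtask that receives no data keeps its watermark at Long.MIN_VALUE and therefore holds back the window.

```java
final class MinWatermarkSketch {

    /**
     * Watermark of a downstream task: the minimum of the latest watermarks
     * received from each upstream parallel task.
     * With no inputs this sketch returns Long.MAX_VALUE.
     */
    static long combinedWatermark(long... upstreamWatermarks) {
        long min = Long.MAX_VALUE;
        for (long watermark : upstreamWatermarks) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    public static void main(String[] args) {
        // Parallelism 1: the single upstream watermark is used as is.
        System.out.println(combinedWatermark(4_999L));
        // Parallelism 2 with one subtask that has seen no data yet:
        // the combined watermark stays at Long.MIN_VALUE and no window fires.
        System.out.println(combinedWatermark(4_999L, Long.MIN_VALUE));
    }
}
```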

3.3 Results

Produce data to Kafka:

./kafka-console-producer.sh  --broker-list 192.168.42.102:9092,192.168.42.103:9092,192.168.42.104:9092 --topic clicks

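The watermark allows a 2-second delay and the window allows 1 minute of lateness; data that arrives later than that is sent to the side output tagged late.
(1) After the record with timestamp 7000 is consumed, the watermark is 4999 (7000 - 2000 - 1), which fires the window [0, 5000) for the first time.
(2) Late data that arrives after the record with timestamp 7000 still triggers the window computation.
(3) After the record with timestamp 67000 is received, the window is closed; late data is then written to the side output and has to be merged manually.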

4. Handling Duplicate Results from Late Firings

When specifying an allowed lateness greater than 0, the window along with its content is kept after the watermark passes the end of the window. In these cases, when a late but not dropped element arrives, it could trigger another firing for the window. These firings are called late firings, as they are triggered by late events and in contrast to the main firing which is the first firing of the window. In case of session windows, late firings can further lead to merging of windows, as they may “bridge” the gap between two pre-existing, unmerged windows.
The elements emitted by a late firing should be treated as updated results of a previous computation, i.e., your data stream will contain multiple results for the same computation. Depending on your application, you need to take these duplicated results into account or deduplicate them.

  After late firings update the result, the data stream contains several results for the same window computation, so the duplicates need to be removed.
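
One simple way to deal with these duplicates is to treat every emitted result as an upsert keyed by (key, window), so that a late firing overwrites the earlier value. The following plain-Java sketch shows the idea; the class LateFiringDedupSketch, the "url@windowStart" key format, and the in-memory map are assumptions for illustration, and a real job would more likely write to a sink that supports updates by key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LateFiringDedupSketch {

    // Latest result per "url@windowStart"; later firings replace earlier ones.
    private final Map<String, Long> latest = new LinkedHashMap<>();

    /** Called for every result the window emits, including late firings. */
    void onResult(String url, long windowStart, long count) {
        latest.put(url + "@" + windowStart, count);
    }

    /** Copy of the deduplicated view: one entry per (url, window). */
    Map<String, Long> snapshot() {
        return new LinkedHashMap<>(latest);
    }

    public static void main(String[] args) {
        LateFiringDedupSketch sink = new LateFiringDedupSketch();
        sink.onResult("./home", 0L, 3L); // main firing
        sink.onResult("./home", 0L, 4L); // late firing for the same window
        sink.onResult("./cart", 0L, 1L);
        System.out.println(sink.snapshot());
    }
}
```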
