Flink Windows

！@123

Last modified: 2023-09-21 14:25:18

First published: 2023-09-21 14:24:15

Article link: https://blog.csdn.net/a123op2346/article/details/133133582

These notes organize the material on windows from the official Flink documentation.

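1. Flink windows

Windows are at the heart of processing unbounded streams. A window splits the stream into buckets of finite size, and computations are then applied to the data in each bucket.

Depending on whether the stream is partitioned by key, windows are either keyed windows (on a keyed stream) or non-keyed windows (on a non-keyed stream).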

Keyed Windows

stream
       .keyBy(...)               <-  required for keyed windows only
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

Non-Keyed Windows

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

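2. Window lifecycle

A window is created as soon as the first element belonging to it arrives, and it is completely removed once time (event time or processing time) passes its end timestamp plus the user-defined allowed lateness.

For example, take a 5-minute window covering 12:00 to 12:05 with an allowed lateness of 1 minute. Flink creates the window when the first element with a timestamp in this interval arrives and removes it once the watermark passes 12:06.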
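
The arithmetic behind this example can be sketched in plain Java with no Flink dependency. The class and method names (WindowLifecycle, cleanupTime, isExpired) are invented for illustration. The sketch assumes epoch-millisecond timestamps and uses the rule exactly as stated above (end timestamp plus allowed lateness), so it glosses over how Flink itself treats the boundary millisecond.

```java
/** Illustrative model of when a window's state may be dropped; not Flink code. */
final class WindowLifecycle {

    /** End timestamp plus allowed lateness, both in milliseconds. */
    static long cleanupTime(long windowEndMillis, long allowedLatenessMillis) {
        return windowEndMillis + allowedLatenessMillis;
    }

    /** True once the watermark has moved past the cleanup time. */
    static boolean isExpired(long windowEndMillis, long allowedLatenessMillis, long watermarkMillis) {
        return watermarkMillis > cleanupTime(windowEndMillis, allowedLatenessMillis);
    }
}
```

For the 12:00 to 12:05 window with one minute of lateness, a watermark at 12:05:30 leaves the window alive, while a watermark just after 12:06 makes it eligible for removal.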

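In addition, each window has a trigger and a function (ProcessWindowFunction, ReduceFunction, or AggregateFunction) attached to it. The function defines the computation applied to the window contents, while the trigger decides when the window is handed to the function. You can also specify an Evictor, which can remove elements from the window after the trigger fires, either before or after the window function is applied.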

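3. Time attributes in windows

Window computations are generally based on time, with the stream divided by time intervals, so a notion of time is required.

Flink has three notions of time: processing time, event time, and ingestion time.

Processing time: the wall-clock time of the machine executing the operation (the familiar absolute time, for example Java's System.currentTimeMillis()). It requires neither extracting timestamps from the data nor generating watermarks.

Event time: the time carried by the data itself. It yields consistent results even when data arrives out of order or late, and it guarantees replayable results when data is re-read from external storage.

Ingestion time: the time at which data enters Flink. Internally it is treated like event time. I have not yet found a practical use for it.

4. Window Assigners

A window assigner defines how the elements of a stream are assigned to windows. Flink ships with predefined window assigners for the most common cases: tumbling windows, sliding windows, session windows, and global windows. To implement a custom window assigner, extend the WindowAssigner class.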

Tumbling windows:

DataStream<T> input = ...;

// tumbling event-time windows, based on the timestamp carried by each record
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// tumbling processing-time windows, based on the machine's clock
input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// daily tumbling event-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);
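
To make the effect of the offset concrete, here is a minimal plain-Java sketch of how a timestamp maps to the start of its tumbling window. TumblingWindowMath and windowStart are hypothetical names, not Flink API, and timestamps are assumed to be epoch milliseconds.

```java
/** Illustrative only: computes which tumbling window a timestamp falls into. */
final class TumblingWindowMath {

    /** Start of the window [start, start + size) that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis, long offsetMillis) {
        // floorMod keeps the remainder non-negative, so negative offsets and
        // timestamps before the epoch are handled the same way as positive ones.
        return timestampMillis - Math.floorMod(timestampMillis - offsetMillis, sizeMillis);
    }
}
```

With a size of one day and an offset of -8 hours, window boundaries fall on 16:00 UTC, which is midnight in a UTC+8 time zone.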

Session windows:

DataStream<T> input = ...;

// event-time session windows with a static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// event-time session windows with a dynamic gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return the session gap
    }))
    .<windowed transformation>(<window function>);

// processing-time session windows with a static gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// processing-time session windows with a dynamic gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return the session gap
    }))
    .<windowed transformation>(<window function>);
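
How a fixed gap groups elements into sessions can be sketched in plain Java. This is a simplified model and not Flink code: SessionWindowSketch and sessions are invented names, each element is assumed to open a window [ts, ts + gap), and overlapping or touching windows are merged. The treatment of the exact boundary (an element arriving precisely gap milliseconds after the previous one) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: groups timestamps into sessions separated by a fixed gap. */
final class SessionWindowSketch {

    /** Returns merged sessions as {start, end} pairs, with end exclusive. */
    static List<long[]> sessions(long[] timestamps, long gapMillis) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        long[] current = null;
        for (long ts : sorted) {
            if (current != null && ts <= current[1]) {
                // The element arrives before the session times out: extend the session.
                current[1] = Math.max(current[1], ts + gapMillis);
            } else {
                // The gap has elapsed: start a new session.
                current = new long[] {ts, ts + gapMillis};
                result.add(current);
            }
        }
        return result;
    }
}
```

With a 10-second gap, elements at 0 s and 3 s share one session covering [0 s, 13 s), while an element at 20 s starts a new session covering [20 s, 30 s).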

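Global windows:

The global windows assigner assigns all elements with the same key to a single global window. This scheme is only useful if you also specify a custom trigger. Otherwise no computation will ever run, because a global window has no natural end at which the accumulated elements could be processed.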

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .<windowed transformation>(<window function>);

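5. Window functions

There are three kinds of window functions: ReduceFunction, AggregateFunction, and ProcessWindowFunction. The first two execute more efficiently because Flink can aggregate incrementally as each element arrives in the window. A ProcessWindowFunction, by contrast, has to buffer all elements of the window until the trigger fires; in return it receives an Iterable over all elements in the window together with meta information about the window.

ReduceFunction: specifies how two input elements are combined into one output element. The input and output types must be identical.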

DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce(new ReduceFunction<Tuple2<String, Long>>() {
      public Tuple2<String, Long> reduce(Tuple2<String, Long> v1, Tuple2<String, Long> v2) {
        return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
      }
    });

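AggregateFunction: a generalization of ReduceFunction (ReduceFunction is the special case). It has three type parameters: the input type (IN), the accumulator type (ACC), and the output type (OUT). The input type is the element type of the input stream. The interface has methods for creating an initial accumulator, adding an element to an accumulator, merging two accumulators, and extracting the output (of type OUT) from an accumulator. As with ReduceFunction, Flink aggregates the input elements incrementally as they arrive in the window.

The following example computes the average of the second field of all elements in the window: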

private static class AverageAggregate
    implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {
    
  @Override
  public Tuple2<Long, Long> createAccumulator() {
    return new Tuple2<>(0L, 0L);
  }

  @Override
  public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
    return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
  }

  @Override
  public Double getResult(Tuple2<Long, Long> accumulator) {
    return ((double) accumulator.f0) / accumulator.f1;
  }

  @Override
  public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
    return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
  }
}

DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate());
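
The accumulator contract can be exercised without a Flink runtime. The sketch below mirrors the four operations of the AverageAggregate above in plain Java, using a long[] {sum, count} in place of Tuple2<Long, Long>. AverageSketch is a hypothetical stand-in rather than a Flink class. It illustrates that only the small accumulator needs to be stored, and that merging two partial accumulators gives the same result as adding every element to one.

```java
/** Illustrative stand-in for the accumulator logic of AverageAggregate; not Flink code. */
final class AverageSketch {

    /** Accumulator layout: index 0 holds the running sum, index 1 the element count. */
    static long[] createAccumulator() {
        return new long[] {0L, 0L};
    }

    /** Folds one value into the accumulator and returns the updated accumulator. */
    static long[] add(long value, long[] acc) {
        return new long[] {acc[0] + value, acc[1] + 1L};
    }

    /** Combines two partial accumulators, as needed when windows are merged. */
    static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    /** Extracts the average from the accumulator. */
    static double getResult(long[] acc) {
        return ((double) acc[0]) / acc[1];
    }
}
```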

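ProcessWindowFunction:

A ProcessWindowFunction gets an Iterable containing all elements of the window and a Context object with access to time and state information, which makes it more flexible than the other window functions. Examples follow of using a ProcessWindowFunction on its own and in combination with a ReduceFunction or an AggregateFunction.

A ProcessWindowFunction can be combined with a ReduceFunction or an AggregateFunction so that elements are aggregated incrementally as they arrive in the window. When the window closes, the ProcessWindowFunction receives the aggregated result. This combines incremental aggregation with access to the window metadata that only the ProcessWindowFunction provides.

When used on its own, the flexibility of ProcessWindowFunction comes at the cost of performance and resource consumption, because the elements cannot be aggregated incrementally and must all be buffered until the window fires.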

public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> implements Function {

 
    public abstract void process(
            KEY key,
            Context context,
            Iterable<IN> elements,
            Collector<OUT> out) throws Exception;

    public void clear(Context context) throws Exception {}


    public abstract class Context implements java.io.Serializable {

        public abstract W window();
        public abstract long currentProcessingTime();
        public abstract long currentWatermark();
        public abstract KeyedStateStore windowState();
        public abstract KeyedStateStore globalState();
    }

}

Example: using a ProcessWindowFunction on its own:

DataStream<Tuple2<String, Long>> input = ...;

input
  .keyBy(t -> t.f0)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction());

public class MyProcessWindowFunction 
    extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {

  @Override
  public void process(String key, Context context, Iterable<Tuple2<String, Long>> input, Collector<String> out) {
    long count = 0;
    for (Tuple2<String, Long> in: input) {
      count++;
    }
    out.collect("Window: " + context.window() + " count: " + count);
  }
}

Combining a ReduceFunction with a ProcessWindowFunction to return the smallest element in a window together with the window's start time:

DataStream<SensorReading> input = ...;

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce(new MyReduceFunction(), new MyProcessWindowFunction());


private static class MyReduceFunction implements ReduceFunction<SensorReading> {

  public SensorReading reduce(SensorReading r1, SensorReading r2) {
      return r1.value() > r2.value() ? r2 : r1;
  }
}

private static class MyProcessWindowFunction
    extends ProcessWindowFunction<SensorReading, Tuple2<Long, SensorReading>, String, TimeWindow> {

  public void process(String key,
                    Context context,
                    Iterable<SensorReading> minReadings,
                    Collector<Tuple2<Long, SensorReading>> out) {
      SensorReading min = minReadings.iterator().next();
      out.collect(new Tuple2<Long, SensorReading>(context.window().getStart(), min));
  }
}

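Combining an AggregateFunction with a ProcessWindowFunction for incremental aggregation, computing the average and emitting it together with the window's key: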

DataStream<Tuple2<String, Long>> input = ...;

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction());

// Function definitions

/**
 * The accumulator is used to keep a running sum and a count. The {@code getResult} method
 * computes the average.
 */
private static class AverageAggregate
    implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {
  @Override
  public Tuple2<Long, Long> createAccumulator() {
    return new Tuple2<>(0L, 0L);
  }

  @Override
  public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
    return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
  }

  @Override
  public Double getResult(Tuple2<Long, Long> accumulator) {
    return ((double) accumulator.f0) / accumulator.f1;
  }

  @Override
  public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
    return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
  }
}

private static class MyProcessWindowFunction
    extends ProcessWindowFunction<Double, Tuple2<String, Double>, String, TimeWindow> {

  public void process(String key,
                    Context context,
                    Iterable<Double> averages,
                    Collector<Tuple2<String, Double>> out) {
      Double average = averages.iterator().next();
      out.collect(new Tuple2<>(key, average));
  }
}

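6. Triggers

A trigger determines when a window is ready to be processed by the window function. Every WindowAssigner comes with a default Trigger. If the default trigger does not meet your needs, you can write a custom one and pass it to .trigger().

In most cases the default is sufficient, so no trigger needs to be specified; implement a custom trigger only when you have a specific requirement.

The Trigger interface has five methods that allow it to react to different events:

onElement() is called for each element that is added to a window.
onEventTime() is called when a registered event-time timer fires.
onProcessingTime() is called when a registered processing-time timer fires.
onMerge() is relevant for stateful triggers. It merges the trigger state of two windows when those windows merge, for example when using session windows.
Finally, clear() performs any action needed when the corresponding window is removed.

Two things to note:

The first three methods decide how to act on their invocation event by returning a TriggerResult. The possible actions are:

CONTINUE: do nothing
FIRE: trigger the computation
PURGE: clear the elements in the window
FIRE_AND_PURGE: trigger the computation and clear the elements in the window afterwards

Any of these methods can be used to register processing-time or event-time timers.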
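
The difference between the four results is easier to see in a small simulation. The plain-Java sketch below uses a local enum as a stand-in for Flink's TriggerResult and a hypothetical count-based rule that returns FIRE_AND_PURGE on every n-th buffered element. All names are invented for illustration, and the loop is a deliberately simplified picture of what a window operator does with the result.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: shows how fire and purge decisions affect a window buffer. */
final class TriggerResultSketch {

    /** Local stand-in for the four trigger results; not the Flink class. */
    enum Result {
        CONTINUE(false, false),
        FIRE(true, false),
        PURGE(false, true),
        FIRE_AND_PURGE(true, true);

        final boolean fire;
        final boolean purge;

        Result(boolean fire, boolean purge) {
            this.fire = fire;
            this.purge = purge;
        }
    }

    /** Hypothetical rule: fire and purge once n elements are buffered. */
    static Result onElement(int bufferedCount, int n) {
        return bufferedCount >= n ? Result.FIRE_AND_PURGE : Result.CONTINUE;
    }

    /** Feeds values into one window and returns the sum emitted at each firing. */
    static List<Long> run(long[] values, int n) {
        List<Long> buffer = new ArrayList<>();
        List<Long> emitted = new ArrayList<>();
        for (long value : values) {
            buffer.add(value);
            Result result = onElement(buffer.size(), n);
            if (result.fire) {
                long sum = 0L;
                for (long buffered : buffer) {
                    sum += buffered;
                }
                emitted.add(sum);
            }
            if (result.purge) {
                buffer.clear();
            }
        }
        return emitted;
    }
}
```

Returning FIRE instead of FIRE_AND_PURGE in onElement would leave the buffer intact, so later firings would include the earlier elements again.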

A sample Trigger implementation:

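import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * When the first element of a window arrives, this trigger registers an event-time timer at the
 * element's timestamp rounded down to a multiple of 10000L (10 seconds) plus 1000L (1 second).
 * When a timer fires, onEventTime() decides whether the window is evaluated and registers the
 * next timer 1000L (1 second) later, as long as that timer still falls before the window end.
 */
public class KafkaTrigger extends Trigger<Tuple4<String, Integer, Long, String>, TimeWindow> {

    // Per-window flag recording whether the first element has already been seen.
    private final ValueStateDescriptor<Boolean> isFirstDescriptor =
            new ValueStateDescriptor<>("isFirstState", Boolean.class);

    @Override
    public TriggerResult onElement(Tuple4<String, Integer, Long, String> element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        ValueState<Boolean> isFirstState = ctx.getPartitionedState(isFirstDescriptor);
        Boolean isFirst = isFirstState.value();

        if (isFirst == null) {
            // First element of this window: update the state.
            isFirstState.update(true);
            // Register a timer at the event timestamp rounded down to 10 s, plus 1 s.
            ctx.registerEventTimeTimer(timestamp - timestamp % 10000L + 1000L);
        } else if (isFirst) {
            isFirstState.update(false);
        }

        return TriggerResult.CONTINUE;
    }

    // Processing-time timers are not used by this trigger.
    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    // time is the timestamp of the event-time timer that fired.
    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        long end = window.getEnd();
        if (time < end) {
            if (time + 1000L < end) {
                // Schedule the next periodic firing.
                ctx.registerEventTimeTimer(time + 1000L);
            }
            return TriggerResult.FIRE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
        // Release the per-window flag so it does not outlive the window.
        ctx.getPartitionedState(isFirstDescriptor).clear();
    }
}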

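7. Evictors

Flink's windowing model allows an optional Evictor to be specified in addition to the WindowAssigner and the Trigger. An evictor can remove elements from a window after the trigger fires, before and/or after the window function is applied.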

/**
 * Optionally evicts elements. Called before windowing function.
 *
 * @param elements The elements currently in the pane.
 * @param size The current number of elements in the pane.
 * @param window The {@link Window}
 * @param evictorContext The context for the Evictor
 */
void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

/**
 * Optionally evicts elements. Called after windowing function.
 *
 * @param elements The elements currently in the pane.
 * @param size The current number of elements in the pane.
 * @param window The {@link Window}
 * @param evictorContext The context for the Evictor
 */
void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

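evictBefore() contains the eviction logic applied before the window function, while evictAfter() contains the logic applied after it. Elements evicted before the window function is applied are not processed by it.

Flink comes with three built-in evictors:

CountEvictor: keeps up to a user-specified number of elements in the window and discards the excess from the beginning of the window buffer.

DeltaEvictor: takes a DeltaFunction and a threshold, computes the delta between the last element in the window buffer and each of the other elements, and removes those whose delta is greater than or equal to the threshold.

TimeEvictor: takes an interval in milliseconds. For a given window it finds the maximum timestamp max_ts among its elements and removes all elements with timestamps smaller than max_ts - interval.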
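
As a rough illustration of the count-based policy described above, the following plain-Java sketch drops elements from the front of a buffer until at most maxCount remain. CountEvictionSketch and evict are made-up names, and a java.util.List stands in for the window buffer, so this is a model of the idea and not Flink's implementation.

```java
import java.util.Iterator;
import java.util.List;

/** Illustrative only: count-based eviction from the front of a buffer. */
final class CountEvictionSketch {

    /** Removes elements from the beginning of the buffer until at most maxCount remain. */
    static <T> void evict(List<T> buffer, int maxCount) {
        int toRemove = buffer.size() - maxCount;
        Iterator<T> it = buffer.iterator();
        while (toRemove > 0 && it.hasNext()) {
            it.next();
            it.remove();
            toRemove--;
        }
    }
}
```

A buffer that already holds maxCount elements or fewer is left unchanged.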

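By default, all built-in evictors apply their logic before the window function.

Specifying an evictor prevents any pre-aggregation, because all elements of a window have to pass through the evictor before the computation is applied.

As of Flink 1.17, the current version at the time of writing, evictors are not yet supported in Python.

Flink gives no guarantees about the order of the elements within a window. Although an evictor may remove elements from the beginning of the window buffer, these are not necessarily the ones that arrived first or last.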

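8. Allowed Lateness

When working with event-time windows, elements can arrive late, that is, the watermark that Flink uses to track event-time progress has already passed the end timestamp of the window the element belongs to.

By default, late elements are dropped once the watermark has passed the end of the window. Flink does, however, allow a maximum allowed lateness to be specified for window operators. Allowed lateness defines how late an element may be before it is dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window but before it passes the end of the window plus the allowed lateness are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again.

To make this work, Flink keeps the window state until the allowed lateness has expired, and only then removes the window and its state.

By default, the allowed lateness is 0, so elements that arrive after the watermark has passed the end of their window are dropped.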

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .allowedLateness(<time>)
    .<windowed transformation>(<window function>);

When using the GlobalWindows assigner, no data is ever considered late, because the end timestamp of the global window is Long.MAX_VALUE.

9. Side output

Flink's side output feature gives you a stream of the data that was discarded as late.

final OutputTag<T> lateOutputTag = new OutputTag<T>("late-data"){};

DataStream<T> input = ...;

SingleOutputStreamOperator<T> result = input
    .keyBy(<key selector>)
    .window(<window assigner>)
    // how long after the end of the window late data may still re-trigger the computation
    .allowedLateness(Time.seconds(10))
    .sideOutputLateData(lateOutputTag)
    .<windowed transformation>(<window function>);

DataStream<T> lateStream = result.getSideOutput(lateOutputTag);

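Declare an OutputTag, call sideOutputLateData(OutputTag) on the windowed stream to indicate that you want the late data, and then call result.getSideOutput(lateOutputTag) on the result of the windowed operation to obtain the side-output stream.

Some considerations about late data:

When the allowed lateness is greater than 0, the window and its contents are kept after the watermark passes the end of the window. If a late but not dropped element then arrives, it may cause the window to fire again. Such firings are called late firings, as opposed to the main firing, which is the first firing of the window. With session windows, a late firing can additionally lead to windows being merged, because the late element may bridge the gap between two existing, unmerged windows.

Note that the elements emitted by a late firing should be treated as updated results of a previous computation, so your data stream will contain multiple results for the same computation. Your application has to take these duplicates into account or deduplicate them.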
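
One possible way to deal with such duplicates downstream is to keep only the most recent result per key and window, so that a late firing overwrites the main firing. The plain-Java sketch below shows the idea with an in-memory map. LatestResultPerWindow, accept, and snapshot are hypothetical names, and a real job would typically do this in an idempotent sink or in keyed state rather than in a local map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: keeps the latest result for each (key, window start) pair. */
final class LatestResultPerWindow {

    private final Map<String, Double> latest = new LinkedHashMap<>();

    /** A later firing for the same key and window replaces the earlier result. */
    void accept(String key, long windowStart, double result) {
        latest.put(key + "@" + windowStart, result);
    }

    /** Returns a copy of the current results, indexed by "key@windowStart". */
    Map<String, Double> snapshot() {
        return new LinkedHashMap<>(latest);
    }
}
```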

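10. Working with the results of multiple windows

The results of one windowed operation can be kept per window and then processed together in a subsequent windowed operation.

This gives you a convenient way to chain two consecutive windowed operations that use different keys but still have the elements from one upstream window end up in the same window downstream.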

DataStream<Integer> input = ...;

DataStream<Integer> resultsPerKey = input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .reduce(new Summer());

DataStream<Integer> globalResults = resultsPerKey
    .windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new TopKWindowFunction());

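In this example, the results for the time window [0, 5) from the first operation end up in the time window [0, 5) of the subsequent windowed operation. This makes it possible to compute a sum per key and then, in the second operation, find the top-k elements within the same window.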
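
Why the results land in the same downstream window can be checked with a little arithmetic. The sketch below rests on one assumption that is not stated in these notes: a window result carries the last timestamp that still belongs to its window, that is, the end timestamp minus 1. The names are invented, and timestamps are in milliseconds.

```java
/** Illustrative only: shows that a window result maps back to the same tumbling window. */
final class ConsecutiveWindowsSketch {

    /** Start of the tumbling window (no offset) that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Assumption: a result is stamped with the window's last timestamp, end - 1. */
    static long resultTimestamp(long windowStartMillis, long sizeMillis) {
        return windowStartMillis + sizeMillis - 1;
    }
}
```

For the 5-second window starting at 0, the result would carry timestamp 4999, which a downstream 5-second tumbling window again assigns to the window starting at 0.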

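11. Useful state size considerations

Windows can be defined over long periods of time (days, weeks, or months) and can therefore accumulate very large state. A few rules are worth keeping in mind when estimating the storage requirements of a windowed computation:

Flink creates one copy of each element per window to which it belongs. With tumbling windows, therefore, only one copy of each element is kept (an element belongs to exactly one window unless it is dropped as late). Sliding windows, by contrast, create several copies of each element, as described in the Window Assigners section. A sliding window of size 1 day with a slide of 1 second is therefore probably not a good idea.
ReduceFunction and AggregateFunction can significantly reduce the storage requirements, because they aggregate elements eagerly and store only one value per window. Using a ProcessWindowFunction alone, by contrast, requires accumulating all elements.
Using an Evictor prevents any pre-aggregation, because all elements of a window have to pass through the evictor before the computation is applied (see Evictors).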
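
The cost of the sliding-window case in the first rule can be estimated by counting how many windows a single element falls into. The plain-Java sketch below enumerates the start timestamps of all sliding windows [start, start + size) that contain a given timestamp. SlidingWindowMath and windowStarts are hypothetical names, and no window offset is assumed.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: enumerates the sliding windows that contain a timestamp. */
final class SlidingWindowMath {

    /** Start timestamps of all windows [start, start + size) that contain the timestamp. */
    static List<Long> windowStarts(long timestampMillis, long sizeMillis, long slideMillis) {
        List<Long> starts = new ArrayList<>();
        // The latest window that can contain the timestamp starts at the last slide boundary.
        long lastStart = timestampMillis - Math.floorMod(timestampMillis, slideMillis);
        for (long start = lastStart; start > timestampMillis - sizeMillis; start -= slideMillis) {
            starts.add(start);
        }
        return starts;
    }
}
```

For a size of 1 day and a slide of 1 second, every element belongs to 86,400 windows, which is why that configuration multiplies the stored state so heavily.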