Flink Series Part 4: Flink DataStream Windows, Introduction and Usage

beyond的架构之旅

Last modified 2022-03-23 15:40:14

Category: flink. Tags: flink, java, datastream

First published 2022-03-23 15:37:40

Article link: https://blog.csdn.net/qq_41969358/article/details/123683537

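I have recently been learning Flink for work.
These notes record an introduction to Flink and my hands-on experience with it.
This is the fourth article in the Flink series.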

Flink DataStream Windows: Introduction and Usage

Introduction to Windows
Time Windows
Window Functions

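Introduction to Windows

Flink treats batch as a special case of streaming, so its underlying engine is a streaming engine on which both stream processing and batch processing are implemented. The window is the bridge from streaming to batch, and Flink provides a very complete windowing mechanism.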

Official documentation (Chinese):
https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/dev/datastream/operators/windows/

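So far I have only worked with streaming data that is processed one record at a time as it arrives, so the concept of a window and its use cases had always been fuzzy to me. The explanation below is what finally made it click.

In a stream processing application the data is continuous and unbounded, so we cannot wait for all of it to arrive before we start processing. We could of course process every message as it arrives, but sometimes we need an aggregation, for example: how many users clicked on our web page in the past minute? In that case we have to define a window that collects the data from the last minute and run the computation over the data in that window.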

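If we cut a fixed-size slice out of the data stream, that slice can be aggregated. There are two main ways to cut it:

By time (time-driven window), for example computing a result every 1 minute or every 10 minutes.
By number of messages (data-driven window), for example computing a result every 5 elements or every 50 elements.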

Time Windows

Tumbling windows (data is split at fixed time boundaries, with no overlap)

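Windows are divided by time, and each window advances by a distance equal to its own length, so no element is counted twice. See the example above.

Tumbling window usage example in Java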

DataStream<T> input = ...;

// tumbling event-time windows
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// tumbling processing-time windows
input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// daily tumbling event-time windows offset by -8 hours.
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);

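The time interval can be specified with any of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.

As the last example shows, the tumbling window assigner also takes an optional offset parameter that can be used to change the alignment of windows. For example, without an offset, hourly tumbling windows are aligned with the epoch, that is, you get windows such as 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999, and so on. If you want to change that, you can give an offset. With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999, 2:15:00.000 - 3:14:59.999, and so on. An important use case for offsets is adjusting windows to time zones other than UTC-0. In China, for example, you would have to specify an offset of Time.hours(-8).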
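The alignment arithmetic described above can be sketched in plain Java without any Flink dependency. This is only an illustration of the idea, not Flink's own source: the class name TumblingWindowAlignment and the method windowStart are hypothetical, and timestamps are assumed to be milliseconds since the epoch.

```java
import java.time.Duration;

/** Sketch of tumbling-window alignment; hypothetical helper, not Flink code. */
final class TumblingWindowAlignment {

    /**
     * Returns the inclusive start (epoch millis) of the tumbling window that
     * contains timestampMs, for windows of length sizeMs shifted by offsetMs.
     */
    static long windowStart(long timestampMs, long offsetMs, long sizeMs) {
        // floorMod keeps the remainder non-negative even for negative inputs.
        return timestampMs - Math.floorMod(timestampMs - offsetMs, sizeMs);
    }

    public static void main(String[] args) {
        long hour = Duration.ofHours(1).toMillis();
        long day = Duration.ofDays(1).toMillis();
        long t = Duration.ofMinutes(90).toMillis(); // 01:30:00.000 after the epoch

        // Hourly windows, no offset: 01:30 falls in the window starting at 01:00.
        System.out.println(windowStart(t, 0, hour));
        // Hourly windows, 15-minute offset: the same element now falls in the window starting at 01:15.
        System.out.println(windowStart(t, Duration.ofMinutes(15).toMillis(), hour));
        // Daily windows with a -8 hour offset start at 16:00 UTC, which is midnight in UTC+8.
        System.out.println(windowStart(0, -8 * hour, day));
    }
}
```

The last call shows why a daily window needs the -8 hour offset in China: without it, each "day" would run from 08:00 to 08:00 local time.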

Sliding Windows

Windows overlap in time, so the same element can be counted in more than one window.
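Windows are divided by time, and each window advances by a distance smaller than its length, so part of the data is counted more than once.

Sliding window usage example in Java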

DataStream<T> input = ...;

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);

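The time interval can be specified with any of Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.

As the last example shows, the sliding window assigner also takes an optional offset parameter that can be used to change the alignment of windows. For example, without an offset, hourly windows sliding by 30 minutes are aligned with the epoch, that is, you get windows such as 1:00:00.000 - 1:59:59.999, 1:30:00.000 - 2:29:59.999, and so on. If you want to change that, you can give an offset. With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999, 1:45:00.000 - 2:44:59.999, and so on. An important use case for offsets is adjusting windows to time zones other than UTC-0. In China, for example, you would have to specify an offset of Time.hours(-8).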

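If the slide interval > the window size, elements that fall into the gaps between windows are lost.
If the slide interval < the window size, windows overlap and some elements are counted more than once.
If the slide interval = the window size, no element is counted twice (this is a tumbling window).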
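How the relationship between slide and size decides whether an element lands in several windows, exactly one, or none can be sketched in plain Java. This is a simplified illustration with no offset; the class SlidingWindowAssignment and its method windowStarts are hypothetical names, not part of Flink's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of sliding-window assignment without offset; hypothetical helper, not Flink code. */
final class SlidingWindowAssignment {

    /** Returns the start of every window [start, start + sizeMs) that contains timestampMs. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest slide boundary at or before the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long t = 7_000L; // an element 7 seconds after the epoch
        // slide < size: the element belongs to two overlapping windows.
        System.out.println(windowStarts(t, 10_000L, 5_000L));
        // slide = size: exactly one window, i.e. a tumbling window.
        System.out.println(windowStarts(t, 5_000L, 5_000L));
        // slide > size: the element falls into a gap and belongs to no window.
        System.out.println(windowStarts(t, 5_000L, 10_000L));
    }
}
```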

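Session Windows

The session window assigner groups elements by sessions of activity. In contrast to tumbling and sliding windows, session windows do not overlap and have no fixed start and end time. Instead, a session window closes when it has not received elements for a certain period of time, that is, when a gap of inactivity occurs. A session window assigner can be configured with either a static session gap or a session gap extractor function that defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.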

DataStream<T> input = ...;

// event-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// event-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);

// processing-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);
    
// processing-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);
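To make the static-gap behaviour concrete, here is a plain-Java sketch that splits the timestamps of a single key into sessions. It assumes the timestamps are already sorted and that a new session starts only when the distance to the previous element exceeds the gap; the class SessionGrouping and the method sessions are hypothetical names, and the boundary treatment is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of static-gap session grouping for one key; hypothetical helper, not Flink code. */
final class SessionGrouping {

    /** Splits sorted timestamps into sessions separated by more than gapMs of inactivity. */
    static List<List<Long>> sessions(List<Long> sortedTimestampsMs, long gapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestampsMs) {
            // An inactivity gap closes the current session and opens a new one.
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMs) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // With a 10-second gap, the 15-second pause between 5 s and 20 s splits the stream in two.
        List<Long> timestamps = Arrays.asList(0L, 3_000L, 5_000L, 20_000L, 22_000L);
        System.out.println(sessions(timestamps, 10_000L));
    }
}
```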

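Global Windows

The global window assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger. Otherwise no computation is performed, because the global window has no natural end at which the aggregated elements could be processed.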

DataStream<T> input = ...;

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .<windowed transformation>(<window function>);
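Why a global window needs a trigger can be illustrated with a small plain-Java simulation: elements are buffered per key forever, and a result is produced only when a count-based rule fires. In Flink itself a trigger is attached with .trigger(...) on the windowed stream; the class CountTriggeredGlobalWindow below is a hypothetical stand-in that only mimics the idea, under the assumption that the buffer is purged each time the rule fires.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

/** Simulates a per-key global window with a count-based trigger; not Flink code. */
final class CountTriggeredGlobalWindow {
    private final int triggerCount;
    private final Map<String, List<Long>> buffers = new HashMap<>();

    CountTriggeredGlobalWindow(int triggerCount) {
        this.triggerCount = triggerCount;
    }

    /** Buffers the value for its key; returns the sum of the buffer when the trigger fires. */
    OptionalLong add(String key, long value) {
        List<Long> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() < triggerCount) {
            // Without a firing trigger the window just keeps collecting.
            return OptionalLong.empty();
        }
        long sum = 0;
        for (long v : buffer) {
            sum += v;
        }
        buffer.clear(); // purge so the next batch starts empty
        return OptionalLong.of(sum);
    }

    public static void main(String[] args) {
        CountTriggeredGlobalWindow window = new CountTriggeredGlobalWindow(3);
        System.out.println(window.add("a", 1)); // buffered, nothing emitted
        System.out.println(window.add("a", 2)); // buffered, nothing emitted
        System.out.println(window.add("a", 3)); // third element for key "a": the sum is emitted
    }
}
```

This is also the data-driven style of window mentioned in the introduction: the slice is closed by element count rather than by time.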

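Window Functions

After defining the window assigner, we need to specify the computation we want to perform on each window. This is the job of the window function, which processes the elements of each (possibly keyed) window once the system determines that the window is ready for processing (see the triggers section of the official documentation for how Flink decides when a window is ready).

The window function can be a ReduceFunction, an AggregateFunction, or a ProcessWindowFunction. The first two can be executed more efficiently (see the state size section of the official documentation) because Flink can aggregate the elements of each window incrementally as they arrive. A ProcessWindowFunction receives an Iterable of all the elements contained in a window, along with additional meta information about the window to which the elements belong.

ReduceFunction

A ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.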

DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce(new ReduceFunction<Tuple2<String, Long>>() {
      public Tuple2<String, Long> reduce(Tuple2<String, Long> v1, Tuple2<String, Long> v2) {
        return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
      }
    });

The example above sums the second field of the tuples for all elements in a window.
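What "incremental" means here can be sketched in plain Java: each arriving element is folded into a single running value, so the window never has to store its elements. The class IncrementalReduce is a hypothetical illustration of that idea and says nothing about how Flink stores window state internally.

```java
import java.util.function.BinaryOperator;

/** Folds elements into one running value as they arrive; hypothetical helper, not Flink code. */
final class IncrementalReduce<T> {
    private final BinaryOperator<T> reducer;
    private T state; // the only value kept for the window

    IncrementalReduce(BinaryOperator<T> reducer) {
        this.reducer = reducer;
    }

    /** Combines the new element with the running value. */
    void onElement(T element) {
        state = (state == null) ? element : reducer.apply(state, element);
    }

    /** Returns the reduced value, or null if no element has arrived. */
    T result() {
        return state;
    }

    public static void main(String[] args) {
        IncrementalReduce<Long> sum = new IncrementalReduce<>(Long::sum);
        sum.onElement(1L);
        sum.onElement(2L);
        sum.onElement(3L);
        System.out.println(sum.result()); // the running sum of the three elements
    }
}
```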

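AggregateFunction

An AggregateFunction is a generalized version of a ReduceFunction with three types: an input type (IN), an accumulator type (ACC), and an output type (OUT). The input type is the type of the elements in the input stream, and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one, and for extracting an output (of type OUT) from an accumulator. The example below shows how this works.

As with ReduceFunction, Flink incrementally aggregates the input elements of a window as they arrive.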

/**
 * The accumulator is used to keep a running sum and a count. The {@code getResult} method
 * computes the average.
 */
private static class AverageAggregate
    implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {
  @Override
  public Tuple2<Long, Long> createAccumulator() {
    return new Tuple2<>(0L, 0L);
  }

  @Override
  public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
    return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
  }

  @Override
  public Double getResult(Tuple2<Long, Long> accumulator) {
    return ((double) accumulator.f0) / accumulator.f1;
  }

  @Override
  public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
    return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
  }
}

DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate());

The example above computes the average of the second field of the elements in the window.
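To see how the four methods cooperate, the following plain-Java sketch walks a (sum, count) accumulator through the same steps by hand. The class RunningAverage and its static methods are hypothetical and only mirror the method names of the example; the scenario in which two partial accumulators are merged is an assumption chosen for illustration.

```java
/** Walks a (sum, count) accumulator through create, add, merge and getResult; not Flink code. */
final class RunningAverage {

    static long[] createAccumulator() {
        return new long[] {0L, 0L}; // {sum, count}
    }

    static long[] add(long value, long[] acc) {
        return new long[] {acc[0] + value, acc[1] + 1L};
    }

    static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    static double getResult(long[] acc) {
        return ((double) acc[0]) / acc[1];
    }

    public static void main(String[] args) {
        // Two partial accumulators, as if two groups of elements were aggregated separately.
        long[] left = add(4L, add(2L, createAccumulator()));
        long[] right = add(6L, createAccumulator());
        // Merging sums and counts gives the same average as aggregating 2, 4 and 6 together.
        System.out.println(getResult(merge(left, right)));
    }
}
```

Keeping the sum and the count instead of the average itself is what makes merge possible: two averages cannot be combined correctly without knowing how many elements each one covers.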

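ProcessWindowFunction

A ProcessWindowFunction receives an Iterable containing all the elements of the window, and a Context object with access to time and state information, which gives it more flexibility than the other window functions. This comes at the cost of performance and resource consumption, because elements cannot be aggregated incrementally and must instead be buffered internally until the window is considered ready for processing.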

public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> implements Function {

    /**
     * Evaluates the window and outputs none or several elements.
     *
     * @param key The key for which this window is evaluated.
     * @param context The context in which the window is being evaluated.
     * @param elements The elements in the window being evaluated.
     * @param out A collector for emitting elements.
     *
     * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
     */
    public abstract void process(
            KEY key,
            Context context,
            Iterable<IN> elements,
            Collector<OUT> out) throws Exception;

    /**
     * The context holding window metadata.
     */
    public abstract class Context implements java.io.Serializable {
        /**
         * Returns the window that is being evaluated.
         */
        public abstract W window();

        /** Returns the current processing time. */
        public abstract long currentProcessingTime();

        /** Returns the current event-time watermark. */
        public abstract long currentWatermark();

        /**
         * State accessor for per-key and per-window state.
         *
         * <p><b>NOTE:</b>If you use per-window state you have to ensure that you clean it up
         * by implementing {@link ProcessWindowFunction#clear(Context)}.
         */
        public abstract KeyedStateStore windowState();

        /**
         * State accessor for per-key global state.
         */
        public abstract KeyedStateStore globalState();
    }

}
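A computation that genuinely needs every element of the window shows why this buffering is sometimes unavoidable. The plain-Java sketch below computes a median from an Iterable, the same shape of input that process receives; the class BufferedWindowMedian is a hypothetical example and does not use the Flink API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Computes a median over all elements of a window; hypothetical helper, not Flink code. */
final class BufferedWindowMedian {

    /** Copies and sorts every element, which is why the whole window must be available. */
    static double median(Iterable<Long> elements) {
        List<Long> buffered = new ArrayList<>();
        for (Long element : elements) {
            buffered.add(element);
        }
        if (buffered.isEmpty()) {
            throw new IllegalArgumentException("window contains no elements");
        }
        Collections.sort(buffered);
        int n = buffered.size();
        if (n % 2 == 1) {
            return buffered.get(n / 2);
        }
        return (buffered.get(n / 2 - 1) + buffered.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        // The median cannot be folded pairwise the way a sum can, so all values are needed at once.
        System.out.println(median(Arrays.asList(5L, 1L, 3L)));
        System.out.println(median(Arrays.asList(4L, 1L, 3L, 2L)));
    }
}
```

A sum or an average fits the incremental style of ReduceFunction and AggregateFunction, while a median or any other order-dependent statistic is the kind of case where the extra cost of full buffering pays for itself.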
