Window Functions in Flink: Comparing and Applying the Three Types

怀羽默

Last modified: 2022-11-20 14:15:22

Category: Flink. Tags: flink, java, big data

First published: 2022-11-20 13:56:34

Article link: https://blog.csdn.net/weixin_44972197/article/details/127947105

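Window functions in Flink fall into three types: incremental aggregation functions, full window functions, and the combination of an incremental aggregation function with a full window function. This article covers the characteristics, use cases, differences, and examples of each type.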

Incremental aggregation functions

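Characteristics: good performance and a small storage footprint. The computation works on an intermediate state, so the window maintains only that intermediate result and does not need to buffer the raw data. In one sentence: cheap on space and fast, but not very flexible.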

[Note] "Flexible" here means being able to access context information, metadata, and the like. The same applies below.

Use cases: computations that can be expressed over an intermediate state, such as sum, maximum, and minimum.

ReduceFunction

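Aggregates two input elements of the same type using the specified logic and outputs one result of that same type. In other words, the data type is the same before and after aggregation.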
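To make the "same type in, same type out" constraint concrete, here is a minimal plain-Java sketch that does not depend on Flink. The `Reducer` interface is a hypothetical stand-in with the same shape as Flink's `ReduceFunction<T>` (a single `T reduce(T a, T b)` method), and `RunningWindow` is an illustrative helper, not how Flink stores window state internally. It only shows that one running value is kept per window instead of the raw elements.

```java
import java.util.List;

// Hypothetical stand-in with the same shape as Flink's ReduceFunction<T>.
interface Reducer<T> {
    T reduce(T a, T b);
}

// Illustrative model of a window that keeps one running value instead of buffering elements.
class RunningWindow<T> {
    private final Reducer<T> reducer;
    private T current; // the only state held for the window

    RunningWindow(Reducer<T> reducer) {
        this.reducer = reducer;
    }

    // Called once per arriving element: combine it with the running value immediately.
    void add(T element) {
        current = (current == null) ? element : reducer.reduce(current, element);
    }

    // Called when the window fires; returns null if no element arrived.
    T result() {
        return current;
    }
}

class ReduceSketch {
    // Maximum of a non-empty list, computed one element at a time.
    static long maxOf(List<Long> elements) {
        RunningWindow<Long> window = new RunningWindow<Long>(Long::max);
        for (Long e : elements) {
            window.add(e);
        }
        return window.result();
    }

    public static void main(String[] args) {
        System.out.println(maxOf(List.of(3L, 9L, 4L))); // prints 9
    }
}
```

Because the reducer returns the same type it consumes, the running value can be fed straight back in as the next left operand, which is why no separate accumulator type is needed.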

AggregateFunction

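Implementing the AggregateFunction interface requires the four methods below, so it is somewhat more complex to implement. In return, it is more flexible than ReduceFunction and is used more often. An example: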

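// Custom incremental aggregation function (AggregateFunction): counts the elements of a window.
// Needs: import org.apache.flink.api.common.functions.AggregateFunction;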
public static class ItemCountAgg implements AggregateFunction<UserBehavior, Long, Long> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }
    @Override
    public Long add(UserBehavior value, Long accumulator) {
        return accumulator + 1;
    }
    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }
    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}
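The counter above uses `Long` for both the accumulator and the result, which hides the main reason AggregateFunction is more flexible than ReduceFunction: the input, accumulator, and output types may all differ. The following is a plain-Java sketch under the assumption of a hypothetical `Aggregator` interface that mirrors the four methods shown above (it is not the Flink interface itself). `SumCount`, `AverageAgg`, and `AverageSketch` are illustrative names; the sketch computes an average with a (sum, count) accumulator.

```java
import java.util.List;

// Hypothetical interface mirroring the four AggregateFunction methods.
interface Aggregator<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC accumulator);
    OUT getResult(ACC accumulator);
    ACC merge(ACC a, ACC b);
}

// Accumulator type: running sum and element count.
final class SumCount {
    final double sum;
    final long count;

    SumCount(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }
}

// IN = Double, ACC = SumCount, OUT = Double: the accumulator type differs from input and output.
class AverageAgg implements Aggregator<Double, SumCount, Double> {
    @Override
    public SumCount createAccumulator() {
        return new SumCount(0.0, 0L);
    }

    @Override
    public SumCount add(Double value, SumCount acc) {
        return new SumCount(acc.sum + value, acc.count + 1);
    }

    @Override
    public Double getResult(SumCount acc) {
        // An empty accumulator yields 0.0 here; that is a choice made for this sketch.
        return acc.count == 0 ? 0.0 : acc.sum / acc.count;
    }

    @Override
    public SumCount merge(SumCount a, SumCount b) {
        return new SumCount(a.sum + b.sum, a.count + b.count);
    }
}

class AverageSketch {
    // Drives the aggregator the way a window would: one add() per element, getResult() at the end.
    static double average(List<Double> values) {
        AverageAgg agg = new AverageAgg();
        SumCount acc = agg.createAccumulator();
        for (Double v : values) {
            acc = agg.add(v, acc);
        }
        return agg.getResult(acc);
    }
}
```

An average cannot be written as a reduce over `Double` values alone, because the count has to be carried along; the separate accumulator type is what makes this possible.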

FoldFunction

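FoldFunction merged each input element of the window with an externally supplied initial value. It had been deprecated and is no longer present in Flink 1.16.0, the latest release at the time of writing, so AggregateFunction is the recommended replacement.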

Full window functions

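Characteristics: relatively expensive and slower. The operator must buffer every element that belongs to the window and then, when the window fires, compute over all of the raw data. With a high input volume or a long window, performance can degrade. In one sentence: flexible, but costly in space and slower.

Use cases: scenarios that need window state, window metadata, or the runtime context, and scenarios that depend on every element in the window, such as computing a median or a mode.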

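[Note] Using a full window function for a simple aggregation clearly wastes resources, so simple aggregations should be implemented with incremental aggregation functions.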
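The median mentioned above is a good example of a computation that needs every element. The plain-Java sketch below does not use Flink; `MedianSketch.median` is an illustrative name, and its `Iterable<Long>` parameter only models what a full window function receives when the window fires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MedianSketch {
    // Models the work done at trigger time: all buffered elements are available at once.
    static double median(Iterable<Long> elements) {
        List<Long> sorted = new ArrayList<>();
        for (Long e : elements) {
            sorted.add(e);
        }
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("window contains no elements");
        }
        Collections.sort(sorted);
        int n = sorted.size();
        int mid = n / 2;
        // Odd size: the middle element. Even size: the mean of the two middle elements.
        return (n % 2 == 1)
                ? sorted.get(mid)
                : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(median(List.of(5L, 1L, 3L)));     // prints 3.0
        System.out.println(median(List.of(4L, 1L, 3L, 2L))); // prints 2.5
    }
}
```

The exact median depends on the full sorted order, so a single running value is not enough for it. That requirement is what forces the window to buffer its elements.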

WindowFunction

When implementing the WindowFunction interface, if you do not work with state, you only need to implement the apply() method, as follows:

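// Custom full window function (WindowFunction).
// Needs: import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
//        import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
//        import org.apache.flink.api.java.tuple.Tuple;
//        import org.apache.flink.util.Collector;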
public static class WindowItemCount implements WindowFunction<Long, ItemCount, Tuple, TimeWindow> {
    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Long> input, Collector<ItemCount> out) throws Exception {
        Long itemId = tuple.getField(0);
        Long windowEnd = window.getEnd();
        Long count = input.iterator().next();   // the input holds a single element, so one next() call is enough
        out.collect(new ItemCount(itemId, windowEnd, count));
    }
}

ProcessWindowFunction

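When extending the ProcessWindowFunction abstract class, if you do not work with state, you only need to implement the process() method (if you do use per-window state, clear() should also be overridden to clean it up), as follows: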

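// Custom full window function (ProcessWindowFunction).
// Needs: import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
//        import java.sql.Timestamp;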
public static class UsersCount extends ProcessWindowFunction<Long, PromotionCount, Tuple, TimeWindow>{
    @Override
    public void process(Tuple tuple, Context context, Iterable<Long> elements, Collector<PromotionCount> out) throws Exception {
        String channel = tuple.getField(0);
        String behavior = tuple.getField(1);
        String windowEnd = new Timestamp(context.window().getEnd()).toString();
        Long count = elements.iterator().next();

        out.collect(new PromotionCount(channel, behavior, windowEnd, count));
    }
}

Differences between WindowFunction and ProcessWindowFunction

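Given that there are two kinds of full window function, how do they differ, and when should each be used?
The answer lies in how they are defined. As shown earlier, WindowFunction requires an apply() method and ProcessWindowFunction requires a process() method. Their parameter lists are: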

public void apply(Tuple tuple, TimeWindow window, Iterable<Long> input, Collector<ItemCount> out)

public void process(Tuple tuple, Context context, Iterable<Long> elements, Collector<PromotionCount> out)

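Comparing the two signatures, the key difference is that apply() receives the Window object, while process() receives a Context object. The Context exposes the window and more: the current watermark, the current processing time, and per-window and global keyed state. In one sentence: WindowFunction gives access to the current window, and ProcessWindowFunction gives access to the whole context, of which the window is only one part.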

Incremental aggregation function + full window function

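Characteristics: incremental aggregation functions perform well, while full window functions are flexible. In practice, a ReduceFunction/AggregateFunction can be combined with a ProcessWindowFunction/WindowFunction to get the benefits of both. Elements assigned to a window are aggregated incrementally as they arrive. When the window's trigger fires, that is, when the window has collected its data and closes, the incremental result is passed as input to the ProcessWindowFunction/WindowFunction, which completes the full-window part. At that point the Iterable parameter contains exactly one value: the result handed over by the incremental aggregation. In one sentence: fast and flexible.

Use cases: for example, Top N over a period of time, where part of the window information must be returned with the result.

An example of combining an incremental aggregation function with a full window function: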

DataStream<ItemCount> windowAggStream = dataStream
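        .filter(…..)    // keep only pv (page view) events
        .keyBy(…..)    // key by item ID
        // sliding window: 1-hour size, 10-minute slide (timeWindow() is deprecated since Flink 1.12 in favor of window(...))
        .timeWindow(Time.hours(1), Time.minutes(10))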
        .aggregate(new ItemCountAgg(), new WindowItemCount());
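The flow described in this section (aggregate per element, then hand a single value to the window function when the trigger fires) can be modeled without Flink. The sketch below is plain Java under stated assumptions: `CountResult` and `CombinedSketch` are hypothetical names, `onElement` and `onFire` stand in for what the runtime does, and only the `apply`-style method resembles user code.

```java
import java.util.Collections;
import java.util.Iterator;

// Illustrative result type: a count tagged with its key and window end.
final class CountResult {
    final long itemId;
    final long windowEnd;
    final long count;

    CountResult(long itemId, long windowEnd, long count) {
        this.itemId = itemId;
        this.windowEnd = windowEnd;
        this.count = count;
    }
}

// Models one window for one key when an incremental function and a window function are combined.
class CombinedSketch {
    private long accumulator = 0L; // the only per-window state: a single number

    // Incremental part: runs for every arriving element, nothing is buffered.
    void onElement() {
        accumulator += 1;
    }

    // Trigger fires: the pre-aggregated value is wrapped in a one-element Iterable.
    CountResult onFire(long itemId, long windowEnd) {
        Iterable<Long> input = Collections.singletonList(accumulator);
        return apply(itemId, windowEnd, input);
    }

    // Full-window part: sees exactly one element and adds the window metadata.
    static CountResult apply(long itemId, long windowEnd, Iterable<Long> input) {
        Iterator<Long> it = input.iterator();
        long count = it.next();
        if (it.hasNext()) {
            throw new IllegalStateException("expected a single pre-aggregated value");
        }
        return new CountResult(itemId, windowEnd, count);
    }
}
```

Driving the model with three elements, for example three calls to `onElement()` followed by `onFire(42L, 3600000L)`, yields a result with `count == 3` and the given window end. The state held while the window is open stays a single `long`, regardless of how many elements arrive.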