Flink countWindow Windows

vincent_hahaha

Category: Apache Flink

Article link: https://blog.csdn.net/vincent_duan/article/details/102619887


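Before a window processes data, the stream is first partitioned. Depending on how the stream is partitioned, there are two kinds of windows: Keyed and Non-Keyed Windows.
Keyed Windows: windows applied to a stream that has been grouped by some field. The original stream is partitioned by a key, all records with the same key go into the same window, and the windows of different keys form logical streams that are processed in parallel.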

stream
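       .keyBy(...)               //  makes this a keyed stream
       .window(...)              //  specify the window assigner
      [.trigger(...)]            //  specify the trigger (optional)
      [.evictor(...)]            //  specify an evictor, or none (optional)
      [.allowedLateness(...)]    //  specify the allowed lateness for late data (optional)
      [.sideOutputLateData(...)] //  specify an output tag for late data (optional, else no side output)
       .reduce/aggregate/fold/apply()   //  specify the window function
      [.getSideOutput(...)]      //  read the data emitted under the output tag (optional)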

Non-Keyed Windows: windows applied to a stream that has not been grouped by any field.

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"
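
To make the non-keyed case concrete, here is a minimal plain-Java sketch (no Flink dependency) of what a count-based window over an ungrouped stream does conceptually: one shared buffer for the whole stream, fired every `size` elements. The class name GlobalCountWindowSketch and its methods are hypothetical and only illustrate the idea; this is not Flink's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Flink code: a non-keyed count window keeps
// ONE buffer for the entire stream and fires each time it holds `size` elements.
class GlobalCountWindowSketch {
    private final int size;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> fired = new ArrayList<>();

    GlobalCountWindowSketch(int size) {
        this.size = size;
    }

    // Add one element; once the buffer reaches `size`, emit it as a window.
    void add(String element) {
        buffer.add(element);
        if (buffer.size() == size) {
            fired.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    // Windows that have fired so far, in firing order.
    List<List<String>> firedWindows() {
        return fired;
    }

    // Elements still waiting for the current window to fill up.
    List<String> pending() {
        return buffer;
    }

    public static void main(String[] args) {
        GlobalCountWindowSketch w = new GlobalCountWindowSketch(2);
        for (String s : new String[] {"a", "b", "c", "d", "e"}) {
            w.add(s);
        }
        System.out.println("fired   = " + w.firedWindows()); // [[a, b], [c, d]]
        System.out.println("pending = " + w.pending());      // [e]
    }
}
```

Because there is only one buffer, every element passes through the same place, which is why the non-keyed variant cannot be spread across parallel tasks.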

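Differences between Keyed and Non-Keyed Windows
Before defining a window, the first thing to decide is whether the stream should be keyed. keyBy(...) splits the unbounded stream into logical keyed streams. If keyBy(...) is not called, the stream is not a keyed stream.

For keyed streams: any attribute of the incoming events can be used as the key. A keyed stream allows the window computation to be executed in parallel by multiple tasks, because each logical keyed stream can be processed independently of the others. All elements with the same key are sent to the same task.
For non-keyed streams: the original stream is not split into multiple logical streams, and all window logic is executed by a single task, i.e. with a parallelism of 1.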
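
The routing rule above (same key, same task) can be sketched in plain Java. This is a deliberately simplified illustration: it assigns a key to a subtask with a plain hash modulo, which is an assumption made here for brevity. Flink's actual assignment goes through key groups and a different hash, so the concrete subtask numbers will not match a real job. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of key-based routing. The hash-modulo rule is a
// simplifying assumption; Flink itself maps keys to key groups first.
class KeyRoutingSketch {

    // Deterministic: the same key always yields the same subtask index.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Group the incoming keys by the subtask they would be routed to.
    static Map<Integer, List<String>> route(List<String> keys, int parallelism) {
        Map<Integer, List<String>> bySubtask = new TreeMap<>();
        for (String key : keys) {
            bySubtask
                .computeIfAbsent(subtaskFor(key, parallelism), k -> new ArrayList<>())
                .add(key);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("1", "2", "1", "3", "2", "1");
        // Keyed: records are spread over several subtasks, but each key stays together.
        System.out.println("parallelism 4: " + route(keys, 4));
        // Non-keyed behaves like parallelism 1: everything lands on one subtask.
        System.out.println("parallelism 1: " + route(keys, 1));
    }
}
```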

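Hands-on example

The input file contains 13 tab-separated records, each consisting of an id followed by a value.

We use keyBy(0) to group by the first field, then countWindow(2) to trigger the computation. The code is as follows: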

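import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

public class Test3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStreamSource = env.readTextFile("E:/test/haha.txt");

        // Parse each tab-separated line into an (id, value) tuple.
        dataStreamSource.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String s) throws Exception {
                String[] split = s.split("\t");
                return Tuple2.of(split[0], split[1]);
            }
        })
        // keyBy(0) keys by tuple field 0 (positional keys are deprecated in newer Flink releases).
        // countWindow(2) fires once a key has collected 2 records.
        .keyBy(0).countWindow(2).apply(new WindowFunction<Tuple2<String, String>, Object, Tuple, GlobalWindow>() {

            @Override
            public void apply(Tuple tuple, GlobalWindow window, Iterable<Tuple2<String, String>> input, Collector<Object> out) throws Exception {
                // Called once per fired window with the records of that window.
                for (Tuple2<String, String> next : input) {
                    System.out.println("Processing: " + next.f0 + ", " + next.f1);
                    out.collect(next.f1 + "======");
                }
            }
        }).print();

        env.execute("Test3");
    }
}

The output is as follows (only the lines printed inside apply are shown; the print() sink additionally writes each collected value with the ====== suffix):

Processing: 2, 200
Processing: 2, 201
Processing: 1, 100
Processing: 1, 101
Processing: 3, 300
Processing: 3, 302
Processing: 6, 602
Processing: 6, 601
Processing: 4, 401
Processing: 4, 402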

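The input has 13 records, but only 10 appear in the output. Why?
The reason is countWindow(2): after grouping with keyBy(0), a window fires only when the number of records for a key reaches 2.
In our data, id 1 has 3 records, so after the first two fire, the third sits in a window that never fills up. The same happens to the third record of id 6. id 5 has only 1 record, so its window never reaches 2 either. The output is therefore missing 3 records.
In other words, countWindow decides whether to run the downstream computation based on the number of records per key after grouping.
If countWindow(2) is changed to countWindow(1), every record is processed and output.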
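
The arithmetic above can be checked with a small plain-Java simulation (no Flink dependency) that keeps one buffer per key and fires a buffer when it reaches the window size. The class KeyedCountWindowSketch is hypothetical. The key distribution follows the text (3 records for id 1, 2 each for ids 2, 3 and 4, 1 for id 5, 3 for id 6); the values 102, 500 and 603 are made up here, since the never-emitted records are not shown in the article. The emission order is sequential and will not match the parallel job's output order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Flink code: per-key buffers that fire
// whenever a key has accumulated `size` records.
class KeyedCountWindowSketch {
    private final int size;
    private final Map<String, List<String>> buffers = new LinkedHashMap<>();
    private final List<String> emitted = new ArrayList<>();

    KeyedCountWindowSketch(int size) {
        this.size = size;
    }

    // Buffer the value under its key; fire that key's window when it is full.
    void add(String key, String value) {
        List<String> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() == size) {
            for (String v : buffer) {
                emitted.add(key + ", " + v);
            }
            buffer.clear();
        }
    }

    List<String> emitted() {
        return emitted;
    }

    // Records stuck in windows that have not reached `size`.
    int pendingCount() {
        int n = 0;
        for (List<String> b : buffers.values()) {
            n += b.size();
        }
        return n;
    }

    // Feed 13 records; 102, 500 and 603 are made-up placeholder values.
    static KeyedCountWindowSketch run(int size) {
        String[][] records = {
            {"1", "100"}, {"1", "101"}, {"1", "102"},
            {"2", "200"}, {"2", "201"},
            {"3", "300"}, {"3", "302"},
            {"4", "401"}, {"4", "402"},
            {"5", "500"},
            {"6", "602"}, {"6", "601"}, {"6", "603"}
        };
        KeyedCountWindowSketch w = new KeyedCountWindowSketch(size);
        for (String[] r : records) {
            w.add(r[0], r[1]);
        }
        return w;
    }

    public static void main(String[] args) {
        KeyedCountWindowSketch two = run(2);
        System.out.println("size 2: emitted=" + two.emitted().size()
            + ", pending=" + two.pendingCount()); // emitted=10, pending=3
        KeyedCountWindowSketch one = run(1);
        System.out.println("size 1: emitted=" + one.emitted().size()
            + ", pending=" + one.pendingCount()); // emitted=13, pending=0
    }
}
```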
