【Flink】窗口的机制及相关实验

Flink 窗口

Flink作为流计算引擎,主要用来处理无界数据流。数据源源不断、无穷无尽。通过将无限数据切割成有限的“数据块”进行处理,就有“窗口”的概念。
在Flink中,窗口可以把流切割成有限大小的多个“存储桶”,每个数据都会分发的对应的桶中。当到达窗口结束时间时,就会对每个桶中收集数据进行计算处理。
窗口不是静态准备好的,是动态创建的——有数据到达时才会创建对应窗口。窗口结束时间时,窗口会触发计算并关闭。

窗口的分类

按照驱动类型分:

  • 时间窗口:以时间点来定义窗口的开始和结束,到达结束时间时,窗口不再收集数据,触发计算输出结果。
  • 计数窗口:基于元素个数截取数据,到达固定个数触发计算关闭窗口。

窗口分配规则分类:

  • 滚动窗口:窗口之间没有重叠,也没有间隔。每个数据被分配到一个窗口,而且只属于一个窗口。
  • 滑动窗口:除窗口大小外,还有滑动步长概念(代表计算频率)。数据可能被分配到多个窗口中。
  • 会话窗口:长 度不固定,起始时间和结束时间也不固定,各分区之间没有关联,窗口之间一定不重叠。
  • 全局窗口:相同key的所有数据分配到同一窗口,窗口没有结束,默认不会触发计算。

窗口API

  • 按键分区窗口(Keyed Windows)
stream.keyBy(...).window(...)
  • 非按键分区
stream.windowAll(...)

没有进行 keyBy 则原始的 DataStream 不会分多条逻辑流。窗口只有一个task执行,并行度为1。

代码中窗口API调用
stream.keyBy(<key selector>).window(<window assigner>).aggregate(<window function>)
窗口分配器

定义窗口分配器是构建窗口算子的第一步,作用是定义数据应该被分配到哪个窗口。(指定窗口的类型)

定义方式

stream.keyBy(...).window(< WindowAssigner >)  //返回WindowsStream
  
stream.windowAll(< WindowAssigner >)  //返回AllWindowedStream

时间窗口

// 滚动处理时间窗口
stream.keyBy(...).window(TumblingProcessingTimeWindows.of(Time.second(5))).aggregate(...)

// 滑动处理时间窗口
stream.keyBy(...).window(SlidingProcessingTimeWindows.of(Time.seconds(10))).aggregate(...)
  
// 处理时间会话窗口
stream.keyBy(...).window(ProcessingTimeSessionWindows.withGap(Time.seconds(10))).aggregate(...)
  
// 滚动事件时间窗口
stream.keyBy(...).window(TumblingEventTimeWindows.of(Time.second(5))).aggregate(...)
  
// 滑动事件时间窗口
stream.keyBy(...).window(SlidingEventTimeWindows.of(Time.second(10),Time.second(5))).aggregate(...)
  
// 事件时间会话窗口
stream.keyBy(...).window(EventTimeSessionWindows.withGap(Time.second(10))).aggregate(...)

计数窗口

// 滚动计数窗口
stream.keyBy(...).countWindow(10)									// 当窗口元素达到10的时候,触发计算执行
  
// 滑动计数窗口
stream.keyBy(...).countWindow(10, 3)					// 每个窗口统计10个数据,每隔3个数据就统计输出一次结果

全局窗口

stream.keyBy(...).window(GlobalWindows.create());

窗口函数

作用:定义窗口如何进行计算的操作

增量聚合函数

将数据收集出来,进行聚合。

  • ReduceFunction
package ac.sict.reid.leo.Window;

import ac.sict.reid.leo.Computing.Aggregation.WaterSensorMapFunction;
import ac.sict.reid.leo.POJO.WaterSensor;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ReduceFunction {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.
                getExecutionEnvironment();
        environment.setParallelism(1);
        environment.socketTextStream("localhost", 7777).
                map(new WaterSensorMapFunction()).keyBy(r -> r.getId()).
                window(TumblingProcessingTimeWindows.of(Time.seconds(10))).
                reduce(new ReduceFunction<WaterSensor>() {
                    @Override
                    public WaterSensor reduce(WaterSensor t0, WaterSensor t1) throws Exception {
                        System.out.println("调用reduce方法,之前结果:" + t0 + ",现在来的数据:" + t1);
                        return new WaterSensor(t0.getId(), System.currentTimeMillis(),
                                t0.getVc() + t1.getVc());
                    }
                }).print();

        environment.execute();

    }

}
  • AggregateFunction
package ac.sict.reid.leo.Window;

import ac.sict.reid.leo.Computing.Aggregation.WaterSensorMapFunction;
import ac.sict.reid.leo.POJO.WaterSensor;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class AggregateFunction {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);
        SingleOutputStreamOperator<WaterSensor> stream = environment.socketTextStream("localhost", 7777).map(new WaterSensorMapFunction());
        KeyedStream<WaterSensor, String> keyedStream = stream.keyBy(sensor -> sensor.getId());
        WindowedStream<WaterSensor, String, TimeWindow> sensorKS = keyedStream.window(TumblingProcessingTimeWindows.of(Time.seconds(10)));
        SingleOutputStreamOperator<String> aggregate = sensorKS.aggregate(new org.apache.flink.api.common.functions.AggregateFunction<WaterSensor, Integer, String>() {
            @Override
            public Integer createAccumulator() {
                System.out.println("创建累加器\n");
                return 0;
            }

            @Override
            public Integer add(WaterSensor waterSensor, Integer integer) {
                System.out.println("调用add方法,value = " + waterSensor + "\n");
                return integer + waterSensor.getVc();
            }

            @Override
            public String getResult(Integer integer) {
                System.out.println("调用getResult方法\n");
                return integer.toString();
            }

            @Override
            public Integer merge(Integer integer, Integer acc1) {
                System.out.println("调用merge方法\n");
                return null;
            }
        });

        aggregate.print();

        environment.execute();


    }

}

输出结果

创建累加器

调用add方法,value = WaterSensor{id='s1', ts=2, vc=3}

调用add方法,value = WaterSensor{id='s1', ts=3, vc=5}

调用add方法,value = WaterSensor{id='s1', ts=2, vc=5}

调用add方法,value = WaterSensor{id='s1', ts=3, vc=6}

调用getResult方法

19
全窗口函数

在增量聚合函数中,每次窗口进行计算后窗口的数据不会被保存。然而,在计算中如果需要基于全部数据处理,需要窗口上下文的信息时,可以通过全窗口函数:通过收集窗口数据,在内部缓存起来,等窗口要输出结果时候在取出运算。

与增量聚合函数的区别:数据不是即来即处理,而是在窗口结束后统一处理。

方法

  • ProcessWindowFunction

处理窗口函数。除了可以拿到窗口中所有数据以外,还可以拿到(上下文对象 => 包含窗口信息、当前时间和状态信息)

package ac.sict.reid.leo.Window;

import ac.sict.reid.leo.Computing.Aggregation.WaterSensorMapFunction;
import ac.sict.reid.leo.POJO.WaterSensor;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WindowProcessDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);
        SingleOutputStreamOperator<WaterSensor> sensor = environment.socketTextStream("localhost", 7777).map(new WaterSensorMapFunction());
        KeyedStream<WaterSensor, String> keyedStream = sensor.keyBy(value -> value.id);

        // 窗口
        WindowedStream<WaterSensor, String, TimeWindow> windowed = keyedStream.window(TumblingProcessingTimeWindows.of(Time.seconds(10)));

        SingleOutputStreamOperator<String> process = windowed.process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
            @Override
            // 分析参数:s => 窗口的key , context => 上下文信息 , iterable => 窗口包含的数据 , collector => 输出收集器
            public void process(String s, ProcessWindowFunction<WaterSensor, String, String, TimeWindow>.Context context, Iterable<WaterSensor> iterable, Collector<String> collector) throws Exception {
                long count = iterable.spliterator().estimateSize();
                long start = context.window().getStart();
                long end = context.window().getEnd();
                collector.collect("key = " + s + " 的窗口[" + start + "," + end + "]包含 " + count + " 条数据 ====> " + iterable.toString());
            }
        });

        process.print();

        environment.execute();

    }
}

输出结果

key = s1 的窗口[1693101120000,1693101130000]包含 5 条数据 ====> [WaterSensor{id='s1', ts=4, vc=5}, WaterSensor{id='s1', ts=4, vc=5}, WaterSensor{id='s1', ts=4, vc=5}, WaterSensor{id='s1', ts=4, vc=5}, WaterSensor{id='s1', ts=4, vc=5}]
key = s2 的窗口[1693101120000,1693101130000]包含 4 条数据 ====> [WaterSensor{id='s2', ts=3, vc=4}, WaterSensor{id='s2', ts=3, vc=4}, WaterSensor{id='s2', ts=3, vc=4}, WaterSensor{id='s2', ts=3, vc=4}]

综合操作

增量聚合和全窗口函数的结合使用

package ac.sict.reid.leo.Window;

import ac.sict.reid.leo.Computing.Aggregation.WaterSensorMapFunction;
import ac.sict.reid.leo.POJO.WaterSensor;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WindowAggregateAndProcessDemo {

    /**
     * @param args
     *
     * 思路:基于增量聚合函数处理窗口数据,每来一个数据就做一次聚合;等窗口需要触发计算,调用全窗口函数处理逻辑输出结果
     *
     * 此时,全窗口函数不再缓存所有数据,直接将增量聚合函数结果作为Iterable类型输入
     */
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        environment.setParallelism(1);
        SingleOutputStreamOperator<WaterSensor> sensor = environment.socketTextStream("localhost", 7777).map(new WaterSensorMapFunction());
        KeyedStream<WaterSensor, String> keyedStream = sensor.keyBy(value -> value.id);

        WindowedStream<WaterSensor, String, TimeWindow> windowed = keyedStream.window(TumblingProcessingTimeWindows.of(Time.seconds(10)));

        /*
        * 窗口函数
        *
        * 增量聚合 Aggregate + 全窗口 process
        *    1、增量聚合函数处理数据:来一条计算一条
        *    2、窗口触发时,增量聚合的结果(只有一条)传递给全窗口函数
        *    3、经过全窗口函数的处理包装后,输出
        *
        * 结合两者优点:
        *   1、增量聚合:即时计算,存储中间结果,占用空间少
        *   2、全窗口函数:通过上下文实现灵活功能
        * */

        SingleOutputStreamOperator<String> aggregate = windowed.aggregate(new MyAgg(), new MyProcess());

        aggregate.print();

        environment.execute();


    }

    public static class MyAgg implements AggregateFunction<WaterSensor,Integer,String> {

        @Override
        public Integer createAccumulator() {
            System.out.println("创建累加器..");
            return 0;
        }

        @Override
        public Integer add(WaterSensor waterSensor, Integer integer) {
            System.out.println("调用add方法,value=" + waterSensor);
            return integer + waterSensor.getVc();
        }

        @Override
        public String getResult(Integer integer) {
            System.out.println("调用getResult方法:");
            return integer.toString();
        }

        @Override
        public Integer merge(Integer integer, Integer acc1) {
            System.out.println("调用merge方法..");
            return null;
        }
    }

    public static class MyProcess extends ProcessWindowFunction<String, String, String, TimeWindow> {

        @Override
        public void process(String s, ProcessWindowFunction<String, String, String, TimeWindow>.Context context, Iterable<String> iterable, Collector<String> collector) throws Exception {
            long start = context.window().getStart();
            long end = context.window().getEnd();
            long count = iterable.spliterator().estimateSize();

            collector.collect("key = "+ s + "的窗口[" + start + "," + end + ") 包含 " + count + "条数据 ====>" + iterable.toString());
        }
    }
}

输出结果

创建累加器..
调用add方法,value=WaterSensor{id='s2', ts=3, vc=4}
创建累加器..
调用add方法,value=WaterSensor{id='s1', ts=4, vc=5}
调用add方法,value=WaterSensor{id='s2', ts=3, vc=4}
调用add方法,value=WaterSensor{id='s1', ts=4, vc=5}
调用add方法,value=WaterSensor{id='s2', ts=3, vc=4}
调用add方法,value=WaterSensor{id='s1', ts=4, vc=5}
调用add方法,value=WaterSensor{id='s2', ts=3, vc=4}
调用add方法,value=WaterSensor{id='s1', ts=4, vc=5}
调用add方法,value=WaterSensor{id='s2', ts=3, vc=4}
调用add方法,value=WaterSensor{id='s1', ts=4, vc=5}
调用add方法,value=WaterSensor{id='s2', ts=3, vc=4}
调用add方法,value=WaterSensor{id='s1', ts=4, vc=5}
调用getResult方法:
key = s2的窗口[1693101010000,1693101020000) 包含 1条数据 ====>[24]
调用getResult方法:
key = s1的窗口[1693101010000,1693101020000) 包含 1条数据 ====>[30]
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值