Using Windows in Flink (Java)

Preface

We have actually been using windows all along in the earlier posts; this post demonstrates each window type in turn.

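1.1 Window overview

Aggregations over events (such as count or sum) work differently on streams than in batch processing. For example, it is impossible to count all the elements of a stream, because a stream is generally infinite (unbounded). Aggregations on streams therefore have to be scoped by a window, such as "count over the last 5 minutes" or "sum of the last 100 elements". A window is a way of cutting an infinite stream into finite chunks.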

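1.2 Window types

  • tumbling window: fixed-size windows that do not overlap

  • sliding window: fixed-size windows that advance by a slide interval and may overlap

  • session window: windows that are closed by a gap of inactivity

  • global window: effectively no windowing; all elements with the same key go into one window, which only produces output when a trigger is specified

Windows are also divided into Keyed Windows and Non-Keyed Windows, depending on whether the stream has gone through the keyBy operator. With a Keyed Window the stream is partitioned by key, windows are assigned per key, and the window computation runs for each key separately. With a Non-Keyed Window (windowAll) the stream is not partitioned, so all elements stay together and a single task assigns and computes the windows for the whole stream.

See also: Keyed Window and Non-Keyed Window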

1.3 Window examples

1.3.1 tumbling windows

dataStream.keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    ...;
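To make the idea of "non-overlapping fixed-size windows" concrete, here is a minimal plain-Java sketch with no Flink dependency. The class and method names are hypothetical and exist only for this illustration; it assumes windows aligned to the epoch with no offset, which is a simplification rather than a copy of Flink's assigner code.

```java
// Plain-Java illustration (not Flink code): which tumbling window does a timestamp fall into?
class TumblingWindowSketch {

    // Start of the window [start, start + size) that contains the timestamp,
    // assuming windows are aligned to 0 with no offset.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    public static void main(String[] args) {
        long size = 5_000L; // 5-second windows
        long[] timestamps = {0L, 4_999L, 5_000L, 12_345L};
        for (long ts : timestamps) {
            long start = windowStart(ts, size);
            System.out.println(ts + " -> [" + start + ", " + (start + size) + ")");
        }
    }
}
```

Every timestamp maps to exactly one window, which is what distinguishes a tumbling window from a sliding one.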

1.3.2 sliding windows

dataStream.keyBy(0)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5))) // size 10 s, slide 5 s; size == slide would behave like a tumbling window
    ...;
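Unlike a tumbling window, a sliding window can place one element into several windows. The following plain-Java sketch (hypothetical names, no Flink dependency) lists the windows a timestamp belongs to, assuming a 10-second window that slides every 5 seconds and windows aligned to 0.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration (not Flink code): all sliding windows that contain a timestamp.
class SlidingWindowSketch {

    // Returns the start of every window [start, start + size) containing the timestamp,
    // newest window first. Assumes windows are aligned to 0 with no offset.
    static List<Long> windowStarts(long timestampMillis, long sizeMillis, long slideMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMillis - Math.floorMod(timestampMillis, slideMillis);
        for (long start = lastStart; start > timestampMillis - sizeMillis; start -= slideMillis) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // An element at 12 s with size 10 s and slide 5 s lands in two windows.
        System.out.println(windowStarts(12_000L, 10_000L, 5_000L));
        // With slide == size there is exactly one window, i.e. tumbling behaviour.
        System.out.println(windowStarts(12_000L, 10_000L, 10_000L));
    }
}
```

With size 10 s and slide 5 s each element is counted in two windows, so the same word contributes to two overlapping results.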

1.3.3 session windows

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * If a word has not appeared again within 5 seconds, output how many times it appeared in that session.
 */
public class SessionWindowDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                for (String str : line.split(",")) {
                    out.collect(Tuple2.of(str, 1));
                }
            }
        }).keyBy(0).window(ProcessingTimeSessionWindows.withGap(Time.seconds(5))).sum(1);

        sum.print().setParallelism(1);

        see.execute("SessionWindowDemo");
    }
}
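The grouping that a session window performs for a single key can be pictured with the plain-Java sketch below. It is only an illustration with hypothetical names: it takes timestamps that are already sorted and starts a new session whenever two neighbours are more than the gap apart. How elements exactly one gap apart are treated is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration (not Flink code): splitting one key's timestamps into sessions.
class SessionWindowSketch {

    // sortedTimestamps must be in ascending order.
    static List<List<Long>> sessions(long[] sortedTimestamps, long gapMillis) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            // A pause longer than the gap closes the current session.
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMillis) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] arrivals = {0L, 2_000L, 4_000L, 12_000L, 13_000L};
        // Two sessions with a 5-second gap: three elements, then two elements.
        System.out.println(sessions(arrivals, 5_000L));
    }
}
```

In the word-count demo this would correspond to one output of 3 and a later output of 2 for the same word.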

1.3.4 global windows

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.util.Collector;

/**
 * Emit the running count each time a word has appeared 2 more times.
 */
public class GlobalWindowDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                for (String str : line.split(",")) {
                    out.collect(Tuple2.of(str, 1));
                }
            }
            // GlobalWindows must be combined with a trigger: a window that is assigned but never fired produces no output,
            // much like a transformation that needs an action to trigger it.
        }).keyBy(0).window(GlobalWindows.create()).trigger(CountTrigger.of(2)).sum(1);
        
        sum.print().setParallelism(1);

        see.execute("GlobalWindowDemo");

    }
}


Result (the sum keeps growing because CountTrigger fires without purging the window):
(demo,2)
(demo,4)
(demo,6)
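The growing sums follow from the difference between firing and purging. The plain-Java sketch below (hypothetical names, no Flink dependency) simulates a count-based trigger on a single key and contrasts the two behaviours; it models the window contents as a running sum and is not Flink's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration (not Flink code): a count trigger that fires with or without purging.
class CountTriggerSketch {

    static List<Integer> emittedSums(int elements, int maxCount, boolean purgeOnFire) {
        List<Integer> emitted = new ArrayList<>();
        int windowSum = 0;      // stands in for the window contents aggregated by sum(1)
        int sinceLastFire = 0;  // the trigger's own counter
        for (int i = 0; i < elements; i++) {
            windowSum += 1;
            sinceLastFire++;
            if (sinceLastFire == maxCount) {
                emitted.add(windowSum);
                sinceLastFire = 0;          // the trigger counter is always reset
                if (purgeOnFire) {
                    windowSum = 0;          // only a purge clears the window contents
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(emittedSums(6, 2, false)); // fire only: [2, 4, 6]
        System.out.println(emittedSums(6, 2, true));  // fire and purge: [2, 2, 2]
    }
}
```

If the goal were "count per batch of 2" instead of a running total, the purge variant is the behaviour to look for.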

1.3.5 Custom trigger

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.util.Collector;

/**
 * Emit the running count each time a word has appeared 2 more times.
 */
public class TriggerWindowDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                for (String str : line.split(",")) {
                    out.collect(Tuple2.of(str, 1));
                }
            }
            // Equivalent to .window(GlobalWindows.create()).trigger(CountTrigger.of(2)).sum(1); TriggerImpl follows the same logic as the CountTrigger source
        }).keyBy(0).window(GlobalWindows.create()).trigger(new TriggerImpl(2L)).sum(1);

        sum.print().setParallelism(1);

        see.execute("TriggerWindowDemo");

    }

    /**
     * A count-based trigger. {@code Trigger<T, W>} takes two type parameters:
     * T is the type of elements the trigger works on (here {@code Tuple2<String, Integer>}),
     * W is the type of {@link Window} the trigger operates on (here GlobalWindow).
     */
    private static class TriggerImpl extends Trigger<Tuple2<String, Integer>, GlobalWindow> {

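        // Number of occurrences after which the trigger fires
        private Long maxCount;

        // State that counts how often the current key has occurred since the last firing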
        private ReducingStateDescriptor<Long> descriptor = new ReducingStateDescriptor<Long>("count", new ReduceFunction<Long>() {
            @Override
            public Long reduce(Long aLong, Long t1) throws Exception {
                return aLong + t1;
            }
        }, Long.class);

        public TriggerImpl(Long maxCount) {
            this.maxCount = maxCount;
        }


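        /**
         * Called for every element that is added to a window.
         *
         * @param element   the element
         * @param timestamp the timestamp of the element
         * @param window    the window the element belongs to
         * @param ctx       the trigger context
         *                  Possible return values:
         *                  1. TriggerResult.CONTINUE: do nothing with the window
         *                  2. TriggerResult.FIRE: trigger the window computation
         *                  3. TriggerResult.PURGE: discard all elements in the window
         *                  4. TriggerResult.FIRE_AND_PURGE: trigger the window computation, then discard the window's elements
         */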
        @Override
        public TriggerResult onElement(Tuple2<String, Integer> element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
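            // Fetch the counter state for the current key
            ReducingState<Long> count = ctx.getPartitionedState(descriptor);
            // Increment the counter by 1
            count.add(1L);
            // If the counter for the current key has reached maxCount
            if (count.get().equals(maxCount)) {
                count.clear();
                // Fire the window computation; FIRE keeps the window contents (FIRE_AND_PURGE would also clear them)
                return TriggerResult.FIRE;
            }
            // Otherwise leave the window untouched
            return TriggerResult.CONTINUE;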
        }


        // Called when a processing-time timer fires; this trigger does not use timers
        @Override
        public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
            return TriggerResult.CONTINUE;
        }

        // Called when an event-time timer fires; this trigger does not use timers
        @Override
        public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
            return TriggerResult.CONTINUE;
        }

        @Override
        public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
            ctx.getPartitionedState(descriptor).clear();
        }
    }
}

Result (the same as with CountTrigger.of(2)):
(demo,2)
(demo,4)
(demo,6)

1.3.6 evictor

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.evictors.Evictor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue;
import org.apache.flink.util.Collector;

import java.util.Iterator;

/**
 * Every time a word has appeared 2 more times, sum over its 3 most recent occurrences.
 */
public class EvictorWindowDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                for (String str : line.split(",")) {
                    out.collect(Tuple2.of(str, 1));
                }
            }
            // TriggerImpl follows the same logic as CountTrigger.of(2); the evictor then limits the window to its last 3 elements
        }).keyBy(0).window(GlobalWindows.create()).trigger(new TriggerImpl(2L))
                .evictor(new EvictorImpl(3)).sum(1);

        sum.print().setParallelism(1);

        see.execute("EvictorWindowDemo");

    }

    /**
     * A count-based trigger. {@code Trigger<T, W>} takes two type parameters:
     * T is the type of elements the trigger works on (here {@code Tuple2<String, Integer>}),
     * W is the type of {@link Window} the trigger operates on (here GlobalWindow).
     */
    private static class TriggerImpl extends Trigger<Tuple2<String, Integer>, GlobalWindow> {

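        // Number of occurrences after which the trigger fires
        private Long maxCount;

        // State that counts how often the current key has occurred since the last firing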
        private ReducingStateDescriptor<Long> descriptor = new ReducingStateDescriptor<Long>("count", new ReduceFunction<Long>() {
            @Override
            public Long reduce(Long aLong, Long t1) throws Exception {
                return aLong + t1;
            }
        }, Long.class);

        public TriggerImpl(Long maxCount) {
            this.maxCount = maxCount;
        }


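        /**
         * Called for every element that is added to a window.
         *
         * @param element   the element
         * @param timestamp the timestamp of the element
         * @param window    the window the element belongs to
         * @param ctx       the trigger context
         *                  Possible return values:
         *                  1. TriggerResult.CONTINUE: do nothing with the window
         *                  2. TriggerResult.FIRE: trigger the window computation
         *                  3. TriggerResult.PURGE: discard all elements in the window
         *                  4. TriggerResult.FIRE_AND_PURGE: trigger the window computation, then discard the window's elements
         */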
        @Override
        public TriggerResult onElement(Tuple2<String, Integer> element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
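            // Fetch the counter state for the current key
            ReducingState<Long> count = ctx.getPartitionedState(descriptor);
            // Increment the counter by 1
            count.add(1L);
            // If the counter for the current key has reached maxCount
            if (count.get().equals(maxCount)) {
                count.clear();
                // Fire the window computation; FIRE keeps the window contents (FIRE_AND_PURGE would also clear them)
                return TriggerResult.FIRE;
            }
            // Otherwise leave the window untouched
            return TriggerResult.CONTINUE;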
        }


        // Called when a processing-time timer fires; this trigger does not use timers
        @Override
        public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
            return TriggerResult.CONTINUE;
        }

        // Called when an event-time timer fires; this trigger does not use timers
        @Override
        public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
            return TriggerResult.CONTINUE;
        }

        @Override
        public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
            ctx.getPartitionedState(descriptor).clear();
        }
    }
    private static class EvictorImpl implements Evictor<Tuple2<String, Integer>, GlobalWindow> {
        // Maximum number of elements to keep in the window
        private long windowCount;

        public EvictorImpl(long windowCount) {
            this.windowCount = windowCount;
        }

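        /**
         * Removes elements before the window function is applied.
         *
         * @param elements       all elements currently in the window
         * @param size           the number of elements currently in the window
         * @param window         the window
         * @param evictorContext the evictor context
         */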
        @Override
        public void evictBefore(Iterable<TimestampedValue<Tuple2<String, Integer>>> elements,
                                int size, GlobalWindow window, EvictorContext evictorContext) {
            if (size <= windowCount) {
                return;
            } else {
                int evictorCount = 0;
                Iterator<TimestampedValue<Tuple2<String, Integer>>> iterator = elements.iterator();
                while (iterator.hasNext()) {
                    iterator.next();
                    evictorCount++;
                    // Remove elements from the front until (size - windowCount) of them are gone, so only the newest windowCount remain
                    if (evictorCount > size - windowCount) {
                        break;
                    } else {
                        iterator.remove();
                    }
                }
            }
        }

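        /**
         * Removes elements after the window function has been applied (not used here).
         *
         * @param elements       all elements currently in the window
         * @param size           the number of elements currently in the window
         * @param window         the window
         * @param evictorContext the evictor context
         */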
        @Override
        public void evictAfter(Iterable<TimestampedValue<Tuple2<String, Integer>>> elements,
                               int size, GlobalWindow window, Evictor.EvictorContext evictorContext) {

        }
    }
}

Result (the sum never exceeds 3 because the evictor keeps only the 3 most recent elements):
(a,2)
(a,3)
(a,3)
(a,3)
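The interplay of the trigger and the evictor can be replayed without Flink. The sketch below is a plain-Java simulation with hypothetical names: it buffers one key's elements, "fires" every fireEvery elements, and before each firing drops the oldest elements so that at most keepLast remain, mirroring what evictBefore does in the demo.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Plain-Java illustration (not Flink code): count trigger plus "keep the last N" eviction.
class EvictorSketch {

    static List<Integer> emittedSums(int elements, int fireEvery, int keepLast) {
        Deque<Integer> buffer = new ArrayDeque<>(); // window contents, oldest first
        List<Integer> emitted = new ArrayList<>();
        int sinceLastFire = 0;
        for (int i = 0; i < elements; i++) {
            buffer.addLast(1);                      // each word occurrence carries the value 1
            sinceLastFire++;
            if (sinceLastFire == fireEvery) {
                sinceLastFire = 0;
                while (buffer.size() > keepLast) {  // eviction before the computation
                    buffer.removeFirst();
                }
                int sum = 0;
                for (int value : buffer) {
                    sum += value;
                }
                emitted.add(sum);                   // the buffer is kept, not purged
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Eight occurrences, fire every 2, keep the last 3: [2, 3, 3, 3]
        System.out.println(emittedSums(8, 2, 3));
    }
}
```

The first firing sees only 2 elements; every later firing sees more than 3 and is trimmed to 3, which matches the output listed above.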

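1.4 Incremental and full window aggregation

1.4.1 Incremental aggregation

Each element is folded into the running result as soon as it enters the window; when the window fires, the final result is emitted.

Common incremental aggregation operators:

reduce(reduceFunction), aggregate(aggregateFunction), sum(), min(), max()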

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class ReduceDemo {
    public static void main(String[] args) throws  Exception{
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);

        localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                for (String str : line.split(",")) {
                    out.collect(Tuple2.of(str, 1));
                }
            }
        }).keyBy(0).timeWindow(Time.seconds(3)).reduce(new ReduceFunction<Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> reduceTuple, Tuple2<String, Integer> value) throws Exception {
                return Tuple2.of(reduceTuple.f0,reduceTuple.f1 + value.f1);
            }
        }).print().setParallelism(1);
        see.execute("ReduceDemo");
    }
}

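The point of incremental aggregation is that the window only has to remember a small accumulator, not every element. The plain-Java sketch below (hypothetical class, no Flink dependency) keeps a running sum and count so that an average can be produced at any time; it is meant to show the accumulator idea, not the AggregateFunction interface itself.

```java
// Plain-Java illustration (not Flink code): a constant-size accumulator updated per element.
class IncrementalAverageSketch {

    private long sum;
    private long count;

    // Called once per incoming element; the element itself is not stored.
    void add(int value) {
        sum += value;
        count++;
    }

    // Called when the window fires.
    double result() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        IncrementalAverageSketch acc = new IncrementalAverageSketch();
        for (int value : new int[]{3, 5, 10}) {
            acc.add(value);
        }
        System.out.println(acc.result()); // 6.0
    }
}
```

However many elements arrive, the state stays at two numbers, which is why incremental operators are usually the cheaper choice when the computation allows it.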

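1.4.2 Full aggregation

The aggregation starts only after all the data belonging to the window has arrived, which makes it possible to, for example, sort the data within a window.

apply(windowFunction)
process(processWindowFunction)
processWindowFunction exposes more context information than windowFunction, similar to the relationship between a plain map function and a RichMapFunction.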

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.Iterator;

public class ProcessDemo {
    public static void main(String[] args) throws  Exception{
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);

        localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                for (String str : line.split(",")) {
                    out.collect(Tuple2.of(str, 1));
                }
            }
        }).keyBy(0).timeWindow(Time.seconds(5)).process(new ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
            @Override
            public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> elements, Collector<Tuple2<String, Integer>> out) throws Exception {
                int sum = 0;
                Iterator<Tuple2<String, Integer>> iterator = elements.iterator();
                while (iterator.hasNext()){
                    sum += iterator.next().f1;
                }
                out.collect(Tuple2.of(tuple.getField(0),sum));
            }
        }).print().setParallelism(1);
        see.execute("ProcessDemo");
    }
}
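Some results cannot be maintained element by element and need the whole window at once. A median is a simple case: the plain-Java sketch below (hypothetical names, no Flink dependency) sorts the complete window contents before picking the middle value, which is the kind of work a full-window function is suited for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration (not Flink code): a computation that needs every element of the window.
class FullWindowMedianSketch {

    static double median(List<Integer> windowElements) {
        if (windowElements.isEmpty()) {
            throw new IllegalArgumentException("window must not be empty");
        }
        List<Integer> sorted = new ArrayList<>(windowElements); // copy, then sort the whole window
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);
        }
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(median(Arrays.asList(7, 1, 5)));    // 5.0
        System.out.println(median(Arrays.asList(4, 1, 3, 2))); // 2.5
    }
}
```

The price is that all elements of the window have to be buffered until it fires.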

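1.5 Window joins

Two streams can be joined over a shared window. Window joins support only three window types: tumbling, sliding, and session windows.

Usage:

stream.join(otherStream)      // join the two streams
    .where(<KeySelector>)     // key of the first stream used for the join
    .equalTo(<KeySelector>)   // key of the second stream used for the join
    .window(<WindowAssigner>) // window type
    .apply(<JoinFunction>)    // process each joined pair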

Tumbling Window Join

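import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;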
 
...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply (new JoinFunction<Integer, Integer, String> (){
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
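What a tumbling window join produces can be sketched in plain Java: every pair of elements that share a key and fall into the same window is handed to the join function. The code below uses hypothetical names and a naive nested loop over two small lists purely for illustration; it assumes windows aligned to 0 and is not how Flink executes the join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration (not Flink code): pairing elements by key and tumbling window.
class WindowJoinSketch {

    static final class Event {
        final String key;
        final long timestamp;
        final int value;

        Event(String key, long timestamp, int value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    static List<String> join(List<Event> left, List<Event> right, long windowSize) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow =
                        Math.floorDiv(l.timestamp, windowSize) == Math.floorDiv(r.timestamp, windowSize);
                if (sameKey && sameWindow) {
                    out.add(l.value + "," + r.value); // same output format as the JoinFunction above
                }
            }
        }
        return out;
    }

    // Small fixed example with 2 ms windows: [0,2), [2,4), [4,6)
    static List<String> demo() {
        List<Event> orange = Arrays.asList(new Event("k", 0, 0), new Event("k", 1, 1), new Event("k", 3, 3));
        List<Event> green = Arrays.asList(new Event("k", 0, 0), new Event("k", 3, 3), new Event("k", 4, 4));
        return join(orange, green, 2L);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [0,0, 1,0, 3,3]
    }
}
```

The green element at timestamp 4 has no orange partner in its window, so it produces no output; a window join behaves like an inner join.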

Sliding Window Join

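import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;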

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply (new JoinFunction<Integer, Integer, String> (){
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });

Session Window Join

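import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;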
 
...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
    .apply (new JoinFunction<Integer, Integer, String> (){
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });

Interval Join

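import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;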

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {

        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            // Emit each matching pair as "left,right"
            out.collect(left + "," + right);
        }
    });
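An interval join does not use a window assigner; instead, each element of the first stream defines a time range relative to its own timestamp. The plain-Java sketch below (hypothetical names, no Flink dependency) spells out the matching condition for two elements that already share a key, using the bounds -2 ms and +1 ms from the snippet above and treating both bounds as inclusive.

```java
// Plain-Java illustration (not Flink code): the timestamp condition of an interval join.
class IntervalJoinSketch {

    // True when leftTs + lowerBound <= rightTs <= leftTs + upperBound (both bounds inclusive).
    static boolean joins(long leftTs, long rightTs, long lowerBound, long upperBound) {
        return leftTs + lowerBound <= rightTs && rightTs <= leftTs + upperBound;
    }

    public static void main(String[] args) {
        long lower = -2L; // between(Time.milliseconds(-2), ...)
        long upper = 1L;  // between(..., Time.milliseconds(1))
        long orangeTs = 2L;
        for (long greenTs = 0L; greenTs <= 4L; greenTs++) {
            System.out.println("orange@" + orangeTs + " with green@" + greenTs + ": "
                    + joins(orangeTs, greenTs, lower, upper));
        }
    }
}
```

For an orange element at timestamp 2, green elements at timestamps 0 through 3 qualify and the one at 4 does not.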