Flink Windows

Official docs: Windows | Apache Flink

Windows are at the heart of processing infinite streams. Windows split the stream into “buckets” of finite size, over which we can apply computations.

In short, windows are what split the stream into buckets.

Typical flow: an assigner creates windows according to its rules and assigns incoming elements to them; elements accumulate in the window; a trigger decides, based on the elements and on timers, whether a given condition has been met. If not, the window keeps waiting and accepting elements; once it has, the window stops collecting and hands its contents to a function for the actual computation.

1. Basic concepts

Trigger -- decides when the computation fires

Function -- performs the actual computation

Assigner -- assigns elements to windows

data.keyBy(TestObj::getKey)
    //both window() and windowAll() take an assigner as their argument
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    //set a trigger; every assigner has a default trigger, so this is only needed for special cases
    .trigger(new Trigger<TestObj, TimeWindow>() {
        @Override
        public TriggerResult onElement(TestObj testObj, long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
            return TriggerResult.CONTINUE;
        }
        
        @Override
        public TriggerResult onProcessingTime(long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
            return TriggerResult.CONTINUE;
        }
        
        @Override
        public TriggerResult onEventTime(long l, TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
            return TriggerResult.CONTINUE;
        }
        
        @Override
        public void clear(TimeWindow timeWindow, TriggerContext triggerContext) throws Exception {
        
        }
    })
    //the function that performs the actual computation
    .reduce(new ReduceFunction<TestObj>() {
        @Override
        public TestObj reduce(TestObj testObj, TestObj t1) throws Exception {
            return new TestObj(testObj.getKey(), testObj.getValue()+","+t1.getValue());
        }
    })
    .print();

processing time -- the system clock of the machine executing the operation

event time -- the timestamp attached to each element, with watermarks (assigned after the data is read) tracking event-time progress

keyed windows -- windows on a stream that has been partitioned by key (a KeyedStream)

data.keyBy(TestObj::getKey).window(TumblingEventTimeWindows.of(Time.minutes(5)));

non-keyed windows -- windows on a DataStream that has not been partitioned by key

data.windowAll(TumblingEventTimeWindows.of(Time.minutes(5)));
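
For event-time windows to actually fire, the stream needs timestamps and watermarks. Here is a minimal sketch using the built-in bounded-out-of-orderness strategy (TestObj#getTimestamp() is assumed here for illustration; the examples later in this post simply use System.currentTimeMillis()):

//sketch: assign event timestamps and watermarks before using event-time windows,
//allowing events to arrive up to 5 seconds out of order
DataStream<TestObj> withTimestamps = data.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<TestObj>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((obj, recordTs) -> obj.getTimestamp()));

withTimestamps.keyBy(TestObj::getKey)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)));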

2. Assigners

1) Tumbling windows

Time-based, fixed size, non-overlapping: the previous window ends exactly when the next one begins.

Usage

//processing time: a new window every minute, closing after 1 min
TumblingProcessingTimeWindows.of(Time.minutes(1))

//event time: a new window every minute, closing after 1 min
TumblingEventTimeWindows.of(Time.minutes(1))

//event time: a new window every minute, closing after 1 min, with a global offset of -8h
TumblingEventTimeWindows.of(Time.minutes(1),Time.hours(-8))

//event time: a new window every minute, closing after 1 min, global offset of -8h, stagger strategy ALIGNED (the default)
TumblingEventTimeWindows.of(Time.minutes(1),Time.hours(-8), WindowStagger.ALIGNED)

WindowStagger

ALIGNED -- all windows in all partitions are aligned and fire at the same time

NATURAL -- when the window operator receives its first event, the difference between that moment and the window's start time is used as the offset

RANDOM -- when the window operator ingests its first event, a random value sampled from (0, window size) is used as the offset

2) Sliding Windows

Time-based, fixed size, and windows may overlap.

//processing time: a new window starts every minute, each covering 5 minutes
SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1))

//event time: a new window starts every minute, each covering 5 minutes
SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1))

//event time: a new window starts every minute, each covering 5 minutes, global offset of -8h
SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1), Time.hours(-8))

3) Session Windows

No fixed start or end time: a window closes automatically after a period with no incoming data, and the next element opens a new window.

//the window closes automatically after 10 minutes without data
EventTimeSessionWindows.withGap(Time.minutes(10))

//the gap can also be computed dynamically per element
EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor<TestObj>() {
    @Override
    public long extract(TestObj testObj) {
        return testObj.getKey()*1000;
    }
})

4) Global Windows

Each key gets a single global window. Since it never ends on its own, no computation ever runs unless a custom trigger is set (see the sketch below).

GlobalWindows.create()
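
For example, pairing a global window with the built-in CountTrigger makes it emit a result every N elements. A minimal sketch (not from the original example):

//a global window never fires on its own; a count-based trigger makes it
//emit a result for every 100 elements per key
data.keyBy(TestObj::getKey)
    .window(GlobalWindows.create())
    .trigger(CountTrigger.of(100))
    .reduce((a, b) -> new TestObj(a.getKey(), a.getValue() + "," + b.getValue()))
    .print();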

3. Triggers

fire -- emitted when the trigger decides the current window is ready to be evaluated

purge -- clears the window's contents

1) Custom trigger

A trigger has three methods that return a TriggerResult; they decide whether the computation should run now (i.e. whether the trigger fires):

onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx)

called for every element added to the window

onEventTime(long time, TimeWindow window, TriggerContext ctx)

called when a registered event-time timer fires

onProcessingTime(long time, TimeWindow window, TriggerContext ctx)

called when a registered processing-time timer fires

Return values

TriggerResult.CONTINUE -- do nothing, keep waiting

TriggerResult.FIRE -- run the window computation

TriggerResult.FIRE_AND_PURGE -- run the computation, then clear the window's contents

TriggerResult.PURGE -- clear the window's contents

Here is an example modeled on CountTrigger:

public class Test {
    public static void main(String[] args) {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.registerJobListener(new JobListener() {
            @Override
            public void onJobSubmitted(@Nullable JobClient jobClient, @Nullable Throwable throwable) {
                Logger.getLogger("test").info("onJobSubmitted");
            }

            @Override
            public void onJobExecuted(@Nullable JobExecutionResult jobExecutionResult, @Nullable Throwable throwable) {
                Logger.getLogger("test").info("onJobExecuted");
            }
        });
        List<TestObj> testObjs=new ArrayList<>();
        testObjs.add(new TestObj(1,"苹果,梨"));
        testObjs.add(new TestObj(1,"柚子,橘子"));
        testObjs.add(new TestObj(3,"猫,虎"));
        testObjs.add(new TestObj(3,"狗,狼"));
        testObjs.add(new TestObj(3,"羊,牛"));
        testObjs.add(new TestObj(1,"葡萄"));
        testObjs.add(new TestObj(3,"树袋熊"));
        DataStream<TestObj> data=env.fromCollection(testObjs);
        data.keyBy(TestObj::getKey).window(GlobalWindows.create()).trigger(new TestTrigger(3L)).reduce(new ReduceFunction<TestObj>() {
            @Override
            public TestObj reduce(TestObj testObj, TestObj t1) throws Exception {
                return new TestObj(testObj.getKey(), testObj.getValue()+","+t1.getValue());
            }
        }).print();
        try {
            env.execute();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static class TestTrigger extends Trigger<TestObj, Window>{
        private final long maxCount;
        private final ReducingStateDescriptor<Long> stateDesc;

        private TestTrigger(long maxCount) {
            this.maxCount = maxCount;
            this.stateDesc = new ReducingStateDescriptor<>("count", new Test.Sum(), LongSerializer.INSTANCE);
        }

        @Override
        public TriggerResult onElement(TestObj testObj, long l, Window window, TriggerContext triggerContext) throws Exception {
            ReducingState<Long> count = (ReducingState)triggerContext.getPartitionedState(this.stateDesc);
            Logger.getLogger("test").info("onElement: "+count.get());
            count.add(1L);
            if ((Long)count.get() >= this.maxCount) {
                count.clear();
                return TriggerResult.FIRE;
            } else {
                return TriggerResult.CONTINUE;
            }
        }

        @Override
        public TriggerResult onProcessingTime(long l, Window window, TriggerContext triggerContext) throws Exception {
            Logger.getLogger("test").info("onProcessingTime");
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onEventTime(long l, Window window, TriggerContext triggerContext) throws Exception {
            Logger.getLogger("test").info("onEventTime");
            return TriggerResult.CONTINUE;
        }

        @Override
        public void clear(Window window, TriggerContext triggerContext) throws Exception {
            Logger.getLogger("test").info("clear");
            ((ReducingState)triggerContext.getPartitionedState(this.stateDesc)).clear();
        }
    }

    private static class Sum implements ReduceFunction<Long> {
        private static final long serialVersionUID = 1L;

        private Sum() {
        }

        @Override
        public Long reduce(Long value1, Long value2) throws Exception {
            return value1 + value2;
        }
    }

}

Output

十月 11, 2021 1:36:42 下午 com.test.flink.Test$1 onJobSubmitted
信息: onJobSubmitted
十月 11, 2021 1:36:43 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: null
十月 11, 2021 1:36:43 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: null
十月 11, 2021 1:36:43 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 1
十月 11, 2021 1:36:43 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 2
8> com.test.flink.entity.TestObj@5ee540e9
十月 11, 2021 1:36:43 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: null
十月 11, 2021 1:36:43 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 1
十月 11, 2021 1:36:43 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 2
6> com.test.flink.entity.TestObj@2414aeb1
十月 11, 2021 1:36:43 下午 com.test.flink.Test$1 onJobExecuted
信息: onJobExecuted

A few observations:

1) The ReducingState<Long> count starts out as null and becomes 1 after the first add().

2) Three windows were started in total, but one of them produced no output. The trigger fires once 3 elements have arrived, and keyBy splits the data above into two partitions: one with 3 elements (partition a) and one with 4 (partition b). Partition a is fully handled by a single window. Partition b fires one window and has one element left over, so a second window is started; its element count never reaches 3, so its trigger never returns FIRE and its function never runs.

3) Without any registered timers, onProcessingTime and onEventTime are never called.

4) clear() was never called. The Javadoc for this method says:

Clears any state that the trigger might still hold for the given window. This is called when a window is purged. 

That is, it is called when the window is purged. And the official docs on the window lifecycle say:

Flink guarantees removal only for time-based windows and not for other types, e.g. global windows.

So only time-based windows are removed automatically; global windows are not. Let's modify the example:

data.assignTimestampsAndWatermarks(new WatermarkStrategy<TestObj>() {
    @Override
    public WatermarkGenerator<TestObj> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<TestObj>() {
            @Override
            public void onEvent(TestObj testObj, long l, WatermarkOutput watermarkOutput) {

            }

            @Override
            public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
                watermarkOutput.emitWatermark(new Watermark(System.currentTimeMillis()));
            }
        };
    }

    @Override
    public TimestampAssigner<TestObj> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        return new TimestampAssigner<TestObj>() {
            @Override
            public long extractTimestamp(TestObj testObj, long l) {
                return System.currentTimeMillis();
            }
        };
    }
}).keyBy(TestObj::getKey).window(TumblingEventTimeWindows.of(Time.minutes(1))).trigger(new TestTrigger(3L)).reduce(new ReduceFunction<TestObj>() {
    @Override
    public TestObj reduce(TestObj testObj, TestObj t1) throws Exception {
        return new TestObj(testObj.getKey(), testObj.getValue()+","+t1.getValue());
    }
}).print();

Output

十月 11, 2021 4:27:38 下午 com.test.flink.Test$1 onJobSubmitted
信息: onJobSubmitted
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: null
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: null
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 1
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 1
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 2
8> com.test.flink.entity.TestObj@2b14de2f
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: 2
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onElement
信息: onElement: null
6> com.test.flink.entity.TestObj@606f9086
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onEventTime
信息: onEventTime
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger onEventTime
信息: onEventTime
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger clear
信息: clear
十月 11, 2021 4:27:38 下午 com.test.flink.Test$TestTrigger clear
信息: clear
十月 11, 2021 4:27:38 下午 com.test.flink.Test$1 onJobExecuted
信息: onJobExecuted

Now clear() is called.

How clear() can be triggered for a global window I haven't found out yet; leaving that open for now.

Trigger also has the methods canMerge() and onMerge(W window, Trigger.OnMergeContext ctx).

canMerge() indicates whether the trigger supports window merging; onMerge() is called when windows are merged, as sketched below.
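
A sketch of what merge support could look like for the TestTrigger above, mirroring what the built-in CountTrigger does (its source is shown in the next section):

//sketch: merge support for the TestTrigger above;
//mergePartitionedState() combines the per-window "count" state of the merged windows
@Override
public boolean canMerge() {
    return true;
}

@Override
public void onMerge(Window window, OnMergeContext ctx) throws Exception {
    ctx.mergePartitionedState(this.stateDesc);
}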

2) CountTrigger

It looks much like the trigger in the example above, because that example was written by following it. This trigger fires once a set number of elements has accumulated; it needs neither event time nor processing time, only a maximum element count.

When windows are merged, the counters are added together.

public class CountTrigger<W extends Window> extends Trigger<Object, W> {
    private static final long serialVersionUID = 1L;
    private final long maxCount;
    private final ReducingStateDescriptor<Long> stateDesc;

    private CountTrigger(long maxCount) {
        this.stateDesc = new ReducingStateDescriptor("count", new CountTrigger.Sum(), LongSerializer.INSTANCE);
        this.maxCount = maxCount;
    }

    public TriggerResult onElement(Object element, long timestamp, W window, TriggerContext ctx) throws Exception {
        ReducingState<Long> count = (ReducingState)ctx.getPartitionedState(this.stateDesc);
        count.add(1L);
        if ((Long)count.get() >= this.maxCount) {
            count.clear();
            return TriggerResult.FIRE;
        } else {
            return TriggerResult.CONTINUE;
        }
    }

    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    public void clear(W window, TriggerContext ctx) throws Exception {
        ((ReducingState)ctx.getPartitionedState(this.stateDesc)).clear();
    }

    public boolean canMerge() {
        return true;
    }

    public void onMerge(W window, OnMergeContext ctx) throws Exception {
        ctx.mergePartitionedState(this.stateDesc);
    }

    public String toString() {
        return "CountTrigger(" + this.maxCount + ")";
    }

    public static <W extends Window> CountTrigger<W> of(long maxCount) {
        return new CountTrigger(maxCount);
    }

    private static class Sum implements ReduceFunction<Long> {
        private static final long serialVersionUID = 1L;

        private Sum() {
        }

        public Long reduce(Long value1, Long value2) throws Exception {
            return value1 + value2;
        }
    }
}

3) EventTimeTrigger

At first glance this trigger looks like it should decide whether to fire only in onEventTime, but in fact both onElement and onEventTime contain the check.

onEventTime checks whether the timer's timestamp equals the window's max timestamp; if so, it fires.

onElement checks whether the current watermark has already passed the window's max timestamp; if so it fires, otherwise it registers an event-time timer whose expiry will invoke onEventTime.

When windows are merged, an event-time timer is registered for the merged window's max timestamp (as long as it is still ahead of the current watermark).

public class EventTimeTrigger extends Trigger<Object, TimeWindow> {
    private static final long serialVersionUID = 1L;

    private EventTimeTrigger() {
    }

    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        if (window.maxTimestamp() <= ctx.getCurrentWatermark()) {
            return TriggerResult.FIRE;
        } else {
            ctx.registerEventTimeTimer(window.maxTimestamp());
            return TriggerResult.CONTINUE;
        }
    }

    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }

    public boolean canMerge() {
        return true;
    }

    public void onMerge(TimeWindow window, OnMergeContext ctx) {
        long windowMaxTimestamp = window.maxTimestamp();
        if (windowMaxTimestamp > ctx.getCurrentWatermark()) {
            ctx.registerEventTimeTimer(windowMaxTimestamp);
        }

    }

    public String toString() {
        return "EventTimeTrigger()";
    }

    public static EventTimeTrigger create() {
        return new EventTimeTrigger();
    }
}

4) ProcessingTimeTrigger

onElement registers a processing-time timer for the window's max timestamp; when it expires, onProcessingTime is invoked.

onProcessingTime simply fires.

Merging behaves the same way as above.

public class ProcessingTimeTrigger extends Trigger<Object, TimeWindow> {
    private static final long serialVersionUID = 1L;

    private ProcessingTimeTrigger() {
    }

    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        return TriggerResult.CONTINUE;
    }

    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE;
    }

    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
    }

    public boolean canMerge() {
        return true;
    }

    public void onMerge(TimeWindow window, OnMergeContext ctx) {
        long windowMaxTimestamp = window.maxTimestamp();
        if (windowMaxTimestamp > ctx.getCurrentProcessingTime()) {
            ctx.registerProcessingTimeTimer(windowMaxTimestamp);
        }

    }

    public String toString() {
        return "ProcessingTimeTrigger()";
    }

    public static ProcessingTimeTrigger create() {
        return new ProcessingTimeTrigger();
    }
}

5) PurgingTrigger

PurgingTrigger takes another trigger as its argument; it is essentially that trigger plus a purge step. In other words, any TriggerResult.FIRE returned by the wrapped trigger is turned into TriggerResult.FIRE_AND_PURGE.

public class PurgingTrigger<T, W extends Window> extends Trigger<T, W> {
    private static final long serialVersionUID = 1L;
    private Trigger<T, W> nestedTrigger;

    private PurgingTrigger(Trigger<T, W> nestedTrigger) {
        this.nestedTrigger = nestedTrigger;
    }

    public TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception {
        TriggerResult triggerResult = this.nestedTrigger.onElement(element, timestamp, window, ctx);
        return triggerResult.isFire() ? TriggerResult.FIRE_AND_PURGE : triggerResult;
    }

    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception {
        TriggerResult triggerResult = this.nestedTrigger.onEventTime(time, window, ctx);
        return triggerResult.isFire() ? TriggerResult.FIRE_AND_PURGE : triggerResult;
    }

    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        TriggerResult triggerResult = this.nestedTrigger.onProcessingTime(time, window, ctx);
        return triggerResult.isFire() ? TriggerResult.FIRE_AND_PURGE : triggerResult;
    }

    public void clear(W window, TriggerContext ctx) throws Exception {
        this.nestedTrigger.clear(window, ctx);
    }

    public boolean canMerge() {
        return this.nestedTrigger.canMerge();
    }

    public void onMerge(W window, OnMergeContext ctx) throws Exception {
        this.nestedTrigger.onMerge(window, ctx);
    }

    public String toString() {
        return "PurgingTrigger(" + this.nestedTrigger.toString() + ")";
    }

    public static <T, W extends Window> PurgingTrigger<T, W> of(Trigger<T, W> nestedTrigger) {
        return new PurgingTrigger(nestedTrigger);
    }

    @VisibleForTesting
    public Trigger<T, W> getNestedTrigger() {
        return this.nestedTrigger;
    }
}
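
In practice PurgingTrigger is used as a wrapper around another trigger, for example (a sketch):

//wrap the built-in CountTrigger so the window contents are purged after every fire
.window(GlobalWindows.create())
.trigger(PurgingTrigger.of(CountTrigger.of(3)))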

4. Functions

1) ReduceFunction

Combines multiple elements into one; the return type must be the same as the input type.

List<TestObj> testObjs=new ArrayList<>();
testObjs.add(new TestObj(1,"苹果,梨"));
testObjs.add(new TestObj(1,"柚子,橘子"));
testObjs.add(new TestObj(3,"猫,虎"));
testObjs.add(new TestObj(3,"狗,狼"));
testObjs.add(new TestObj(3,"羊,牛"));
testObjs.add(new TestObj(1,"葡萄"));
testObjs.add(new TestObj(3,"树袋熊"));
DataStream<TestObj> data=env.fromCollection(testObjs);
data.assignTimestampsAndWatermarks(new WatermarkStrategy<TestObj>() {
    @Override
    public WatermarkGenerator<TestObj> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<TestObj>() {
            @Override
            public void onEvent(TestObj testObj, long l, WatermarkOutput watermarkOutput) {

            }

            @Override
            public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
                watermarkOutput.emitWatermark(new Watermark(System.currentTimeMillis()));
            }
        };
    }

    @Override
    public TimestampAssigner<TestObj> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
         return new TimestampAssigner<TestObj>() {
             @Override
             public long extractTimestamp(TestObj testObj, long l) {
                 return System.currentTimeMillis();
             }
         };
    }
}).keyBy(TestObj::getKey).window(TumblingEventTimeWindows.of(Time.minutes(1)))
.reduce(new ReduceFunction<TestObj>() {
    @Override
    public TestObj reduce(TestObj testObj, TestObj t1) throws Exception {
        return new TestObj(testObj.getKey(), testObj.getValue()+","+t1.getValue());
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getKey()+": "+testObj.getValue();
    }
}).print();

Output

8> 3: 猫,虎,狗,狼,羊,牛,树袋熊
6> 1: 苹果,梨,柚子,橘子,葡萄

Since the reduce has no conditional logic, all elements in the same partition are merged into a single element. If we modify the reduce:

@Override
public TestObj reduce(TestObj testObj, TestObj t1) throws Exception {
    int size1=testObj.getValue().indexOf(",");
    int size2=t1.getValue().indexOf(",");
    if (size1>0&&size2>0){
        return new TestObj(testObj.getKey(), testObj.getValue()+","+t1.getValue());
    }else if (size1>0){
        return testObj;
    }else {
        return t1;
    }
}

Output

6> 1: 苹果,梨,柚子,橘子
8> 3: 猫,虎,狗,狼,羊,牛

The elements without a comma, 葡萄 and 树袋熊, are discarded.

2) AggregateFunction

Also combines elements. It is more involved to implement than a ReduceFunction but more broadly applicable; consider it when a ReduceFunction cannot express what you need.

createAccumulator()  :  initializes the accumulator

add()  :  adds an element to the accumulator

merge()  :  merges two accumulators

getResult()  :  extracts the result from the accumulator

AggregateFunction<IN, ACC, OUT>

IN - input type

ACC - accumulator type

OUT - result type

.aggregate(new AggregateFunction<TestObj, List<String>, String>() {

    @Override
    public List<String> createAccumulator() {
        return new ArrayList<>();
    }

    @Override
    public List<String> add(TestObj testObj, List<String> testObjs) {
        testObjs.add(testObj.getValue());
        return testObjs;
    }

    @Override
    public String getResult(List<String> testObjs) {
        String[] ss=new String[testObjs.size()];
        testObjs.toArray(ss);
        return StringUtils.join(ss,",")+" count:"+testObjs.size();
    }

    @Override
    public List<String> merge(List<String> testObjs1, List<String> testObjs2) {
        testObjs1.addAll(testObjs2);
        return testObjs1;
    }
})

Output

8> 猫,虎,狗,狼,羊,牛,树袋熊 count:4
6> 苹果,梨,柚子,橘子,葡萄 count:3

3) ProcessWindowFunction

This is the most flexible kind of function: while processing you have access to the key, the context, and the full collection of elements in the window, and results are emitted through a collector, so the output can also be a collection.

ProcessWindowFunction<IN, OUT, KEY, W extends Window>

IN -- input type

OUT -- output type

KEY -- type of the key used in keyBy

W -- window type

.process(new ProcessWindowFunction<TestObj, String, Integer, TimeWindow>() {
    @Override
    public void process(Integer key, Context context, Iterable<TestObj> iterable, Collector<String> collector) throws Exception {
        for (TestObj testObj : iterable) {
            collector.collect(testObj.getValue());
        }
    }
})

Output

8> 猫,虎
8> 狗,狼
6> 苹果,梨
6> 柚子,橘子
6> 葡萄
8> 羊,牛
8> 树袋熊

With windowAll there is no key, so ProcessAllWindowFunction is used instead:

.windowAll(TumblingEventTimeWindows.of(Time.minutes(1)))
.process(new ProcessAllWindowFunction<TestObj, String, TimeWindow>() {
    @Override
    public void process(Context context, Iterable<TestObj> iterable, Collector<String> collector) throws Exception {
                
    }
})

A process function can be combined with a reduce function or an aggregate function:

reduce(ReduceFunction<T> reduceFunction, ProcessWindowFunction<T, R, K, W> function)
aggregate(AggregateFunction<T, ACC, V> aggFunction, ProcessWindowFunction<V, R, K, W> windowFunction)

Note that the process function's input type is the output type of the preceding function; a sketch of the combined form follows.
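
A minimal sketch of the combined form: the ReduceFunction pre-aggregates incrementally, and the ProcessWindowFunction then receives the single reduced element plus window metadata.

//sketch: incremental reduce + ProcessWindowFunction; the process function sees
//only the single pre-reduced element and can add window metadata to the output
.reduce(
    (a, b) -> new TestObj(a.getKey(), a.getValue() + "," + b.getValue()),
    new ProcessWindowFunction<TestObj, String, Integer, TimeWindow>() {
        @Override
        public void process(Integer key, Context context, Iterable<TestObj> reduced,
                            Collector<String> out) {
            TestObj merged = reduced.iterator().next();
            out.collect("window ending at " + context.window().getEnd() + ": " + merged.getValue());
        }
    })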

5. Evictors

evictor() lets you remove elements before and/or after the window function runs.

evictBefore() runs before the window function

evictAfter() runs after the window function

Example

data.assignTimestampsAndWatermarks(new WatermarkStrategy<TestObj>() {
    @Override
    public WatermarkGenerator<TestObj> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new WatermarkGenerator<TestObj>() {
            @Override
            public void onEvent(TestObj testObj, long l, WatermarkOutput watermarkOutput) {

            }

            @Override
            public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
                watermarkOutput.emitWatermark(new Watermark(System.currentTimeMillis()));
            }
       };
    }

    @Override
    public TimestampAssigner<TestObj> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        return new TimestampAssigner<TestObj>() {
            @Override
            public long extractTimestamp(TestObj testObj, long l) {
                return System.currentTimeMillis();
            }
        };
    }
}).keyBy(TestObj::getKey).window(TumblingEventTimeWindows.of(Time.minutes(1)))
.evictor(new Evictor<TestObj, TimeWindow>() {
    @Override
    public void evictBefore(Iterable<TimestampedValue<TestObj>> iterable, int i, TimeWindow timeWindow, EvictorContext evictorContext) {
        Iterator<TimestampedValue<TestObj>> iterator=iterable.iterator();
        while (iterator.hasNext()){
            if (iterator.next().getValue().getValue().indexOf(",")>0){
                iterator.remove();
            }
        }
    }

    @Override
    public void evictAfter(Iterable<TimestampedValue<TestObj>> iterable, int i, TimeWindow timeWindow, EvictorContext evictorContext) {

    }
})
.reduce(new ReduceFunction<TestObj>() {
    @Override
    public TestObj reduce(TestObj testObj, TestObj t1) throws Exception {
        return new TestObj(testObj.getKey(), testObj.getValue()+","+t1.getValue());
    }
}).map(new MapFunction<TestObj, String>() {
    @Override
    public String map(TestObj testObj) throws Exception {
        return testObj.getValue();
    }
}).print();

Output

8> 树袋熊
6> 葡萄

The elements containing a comma have been evicted.

6. Allowed Lateness

By default, an element that arrives after the window its timestamp belongs to has already closed is simply dropped. allowedLateness(Time lateness) sets a grace period: a late element that arrives within it triggers the window to fire again and is recomputed together with the data already seen. Note that, depending on your downstream logic, this re-computation can produce duplicate results, so deduplicate where necessary. A sketch follows.
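
A minimal sketch: allow one extra minute for late elements, and route anything arriving even later to a side output instead of silently dropping it (the tag name "late-data" is just for illustration):

//accept late elements for 1 more minute; capture even-later ones via a side output
final OutputTag<TestObj> lateTag = new OutputTag<TestObj>("late-data") {};

SingleOutputStreamOperator<TestObj> result = data
        .keyBy(TestObj::getKey)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.minutes(1))
        .sideOutputLateData(lateTag)
        .reduce((a, b) -> new TestObj(a.getKey(), a.getValue() + "," + b.getValue()));

//elements that missed even the allowed lateness end up here
DataStream<TestObj> lateStream = result.getSideOutput(lateTag);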
