Flink Study Notes: Window Functions (Part 1)
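Once the window type has been chosen, the next step is to define the computation applied to the window's data, that is, the Window Function. Flink provides four kinds of Window Function: ReduceFunction, AggregateFunction, FoldFunction, and ProcessWindowFunction. By how they compute, ReduceFunction, AggregateFunction, and FoldFunction are incremental aggregation functions, while ProcessWindowFunction is a full-window aggregation function. An incremental aggregation function updates an intermediate state as each element arrives, so the window keeps only that intermediate result and does not need to buffer the raw elements. A full-window function instead processes all of the raw elements when the window fires, so every element has to be buffered until then, and performance is comparatively worse. This part covers the incremental aggregation functions.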
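The difference between the two styles can be imitated in plain Java. The following is a minimal sketch, not Flink code: the class and method names are made up for illustration, and a simple sum stands in for the window computation. The incremental version keeps a single running value, while the full-window version buffers every element and only computes when the window "fires".

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalVsFullSketch {

    // Incremental style: the only state is a running sum, updated per element.
    static long incrementalSum(long[] arrivals) {
        long state = 0L;
        for (long value : arrivals) {
            state += value;          // update the intermediate state, then forget the element
        }
        return state;                // emitted when the window fires
    }

    // Full-window style: every element is buffered, and the sum is computed at fire time.
    static long bufferedSum(long[] arrivals) {
        List<Long> buffer = new ArrayList<>();
        for (long value : arrivals) {
            buffer.add(value);       // state grows with the number of elements
        }
        long sum = 0L;
        for (long value : buffer) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] window = {88L, 99L, 100L};
        System.out.println(incrementalSum(window)); // prints 287
        System.out.println(bufferedSum(window));    // prints 287
    }
}
```

Both styles produce the same answer here; what differs is how much state has to be held while the window is open.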
ReduceFunction
ReduceFunction: combines two input elements of the same type using the specified logic and outputs one result element of that same type. Example:
// Sample input: (key, value) pairs
List<Tuple2<String, Long>> source = Lists.newArrayList();
source.add(new Tuple2<>("qh1", 100L));
source.add(new Tuple2<>("qh2", 101L));
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Tuple2<String, Long>> dataStreamSource = env.fromCollection(source);
// Key by field 0; the window fires once a key has collected five elements
DataStream<Tuple2<String, Long>> result = dataStreamSource
        .keyBy(0)
        .countWindow(5)
        .reduce(new ReduceFunction<Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> reduce(Tuple2<String, Long> param1, Tuple2<String, Long> param2) throws Exception {
                // Keep the key and add the two values together
                return new Tuple2<>(param1.f0, param1.f1 + param2.f1);
            }
        });
result.print();
env.execute("q Demo");
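The example above groups the stream by key and fires the computation once five elements have arrived in a key's window; the computation adds the f1 values together step by step and emits the accumulated sum. Note that the sample source holds only two elements, each with a different key, so as written no key's window reaches five elements and nothing should be printed; a key needs five elements before a result appears.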
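The firing behavior of a count window combined with a reduce can be imitated in plain Java. The following is a minimal sketch, not Flink code: `run` and `pair` are hypothetical helpers, and the sketch assumes that a key's window emits its reduced value exactly when that key has collected `size` elements and then starts over, while incomplete windows emit nothing.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class CountWindowReduceSketch {

    // Hypothetical helper that builds a (key, value) pair.
    static Map.Entry<String, Long> pair(String key, long value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Imitates keyBy + countWindow(size) + reduce and returns the emitted values in order.
    static List<Long> run(List<Map.Entry<String, Long>> input, int size, BinaryOperator<Long> reducer) {
        Map<String, Long> reduced = new HashMap<>();    // running reduced value per key
        Map<String, Integer> counts = new HashMap<>();  // elements seen in the open window per key
        List<Long> emitted = new ArrayList<>();
        for (Map.Entry<String, Long> element : input) {
            String key = element.getKey();
            reduced.merge(key, element.getValue(), reducer); // reduce(previous, incoming)
            int seen = counts.merge(key, 1, Integer::sum);
            if (seen == size) {                              // window is full: emit and reset
                emitted.add(reduced.remove(key));
                counts.remove(key);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Two elements with different keys: no window reaches five elements.
        List<Map.Entry<String, Long>> twoElements = Arrays.asList(pair("qh1", 100L), pair("qh2", 101L));
        System.out.println(run(twoElements, 5, Long::sum)); // prints []

        // Five elements for one key: the window fires once with their sum.
        List<Map.Entry<String, Long>> fiveElements = Arrays.asList(
                pair("qh1", 1L), pair("qh1", 2L), pair("qh1", 3L), pair("qh1", 4L), pair("qh1", 5L));
        System.out.println(run(fiveElements, 5, Long::sum)); // prints [15]
    }
}
```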

AggregateFunction
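AggregateFunction: more general, and also more complex. It is supplied through the aggregate method of WindowedStream. An implementation takes three type parameters: the first is the input element type, the second is the type of the accumulator (acc), and the third is the result type. It must implement four methods: createAccumulator initializes the acc, so that the first element has something to be added to; add folds one element into the acc in some way; getResult derives the final result from the acc; and merge combines two accs. In other words, add takes one element plus the current intermediate result, the acc passed to the first add call is the one returned by createAccumulator, and add returns the updated intermediate acc. Typically, once the window's elements have all been added, getResult is called to compute the result the business logic needs. A simple AggregateFunction that computes an average looks like this: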
private static class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    // Accumulator layout: f0 = running sum, f1 = element count
    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return new Tuple2<>(0L, 0L);
    }

    // Add one element's value to the sum and increment the count
    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
        return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
    }

    // Average = sum / count
    @Override
    public Double getResult(Tuple2<Long, Long> accumulator) {
        return ((double) accumulator.f0) / accumulator.f1;
    }

    // Combine two partial accumulators field by field
    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
    }
}
A complete demo that computes the average:
List<Tuple2<String, Long>> source = Lists.newArrayList();
source.add(new Tuple2<>("qh1", 88L));
source.add(new Tuple2<>("qh1", 99L));
source.add(new Tuple2<>("qh1", 100L));
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Tuple2<String, Long>> dataStreamSource = env.fromCollection(source);
DataStream<Double> output = dataStreamSource
        .keyBy(0)
        .countWindow(3)
        .aggregate(new AverageAggregate());
output.print();
env.execute("q Demo");
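The output is the average of the three values, (88 + 99 + 100) / 3 ≈ 95.67. Note that merge is never actually called in this example.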

Question to consider: when does merge get called?
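One situation in which merge matters is when windows themselves are merged, as happens with session windows: two windows that each hold a partial accumulator become one window, and the two accumulators have to be combined. The following plain-Java sketch is not Flink code; it mirrors the sum-and-count accumulator of the average example with a `long[]` (index 0 = sum, index 1 = count) and checks that merging two partial accumulators gives the same result as adding every element to a single accumulator.

```java
public class MergeSketch {

    // Accumulator layout (an assumption of this sketch): [0] = running sum, [1] = count.
    static long[] createAccumulator() {
        return new long[] {0L, 0L};
    }

    static long[] add(long value, long[] acc) {
        return new long[] {acc[0] + value, acc[1] + 1L};
    }

    static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    static double getResult(long[] acc) {
        return ((double) acc[0]) / acc[1];
    }

    public static void main(String[] args) {
        // Two windows that were aggregated separately...
        long[] left = add(99L, add(88L, createAccumulator()));
        long[] right = add(100L, createAccumulator());
        // ...and are then merged into one.
        long[] merged = merge(left, right);

        // The same elements added to a single accumulator.
        long[] all = add(100L, add(99L, add(88L, createAccumulator())));

        System.out.println(getResult(merged) == getResult(all)); // prints true
    }
}
```

A count window never merges with another window, which is consistent with merge going unused in the demo above.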
FoldFunction
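FoldFunction: folds the window's elements into an externally supplied initial value according to a user-defined rule. It is marked as Deprecated in Flink, and AggregateFunction is the recommended replacement.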

A simple example:
List<Tuple2<String, Long>> source = Lists.newArrayList();
source.add(new Tuple2<>("qh1", 88L));
source.add(new Tuple2<>("qh1", 99L));
source.add(new Tuple2<>("qh1", 100L));
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Tuple2<String, Long>> dataStreamSource = env.fromCollection(source);
// "qh" is the initial value that each element is folded into
SingleOutputStreamOperator<String> fold = dataStreamSource
        .keyBy(0)
        .countWindow(3)
        .fold("qh", new FoldFunction<Tuple2<String, Long>, String>() {
            @Override
            public String fold(String accumulator, Tuple2<String, Long> value) throws Exception {
                // Append the element's value to the accumulated string
                return accumulator + value.f1;
            }
        });
fold.print();
env.execute("q Demo");
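Since AggregateFunction is the suggested replacement, it may help to see the same string-appending fold written in the four-method accumulator shape. The following is a plain-Java sketch under stated assumptions, not Flink code: `SimpleAggregate` is a hypothetical stand-in interface defined here for illustration, and `aggregate` is a hypothetical driver that calls createAccumulator, then add per element, then getResult.

```java
import java.util.Arrays;
import java.util.List;

public class FoldAsAggregateSketch {

    // Hypothetical stand-in with a four-method accumulator shape; this is not Flink's interface.
    interface SimpleAggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
        ACC merge(ACC a, ACC b);
    }

    // Appends each value to a string; the fold's initial value becomes a prefix added at the end.
    static final class ConcatWithPrefix implements SimpleAggregate<Long, String, String> {
        private final String prefix;

        ConcatWithPrefix(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public String createAccumulator() {
            return "";                      // empty accumulator, so merge never duplicates the prefix
        }

        @Override
        public String add(Long value, String accumulator) {
            return accumulator + value;
        }

        @Override
        public String getResult(String accumulator) {
            return prefix + accumulator;
        }

        @Override
        public String merge(String a, String b) {
            return a + b;
        }
    }

    // Hypothetical driver: createAccumulator, then add per element, then getResult.
    static <IN, ACC, OUT> OUT aggregate(SimpleAggregate<IN, ACC, OUT> fn, List<IN> window) {
        ACC acc = fn.createAccumulator();
        for (IN value : window) {
            acc = fn.add(value, acc);
        }
        return fn.getResult(acc);
    }

    public static void main(String[] args) {
        String result = aggregate(new ConcatWithPrefix("qh"), Arrays.asList(88L, 99L, 100L));
        System.out.println(result); // prints qh8899100
    }
}
```

Moving the initial value out of the accumulator and into getResult is a design choice of this sketch: it keeps merge well-defined, because two partial accumulators can be concatenated without repeating the prefix.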