Flink Study Notes: Window Functions (Part 2)


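In some business scenarios, computing a more complex metric depends on all the elements in a window, and may also require access to the window's state and metadata. The full-window aggregation function ProcessWindowFunction provides this kind of support. A simple use of ProcessWindowFunction is computing the median or the mode of a field across the elements of a window.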

ProcessWindowFunction

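For full-window aggregation, Flink provides a skeleton abstract class, ProcessWindowFunction. If we do not need to work with state, we only have to implement its process() method, which defines the evaluation and output logic.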

The ProcessWindowFunction abstract class

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.streaming.api.functions.windowing;

import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.api.common.functions.AbstractRichFunction;
import org.apache.flink.api.common.state.KeyedStateStore;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * Base abstract class for functions that are evaluated over keyed (grouped) windows using a context
 * for retrieving extra information.
 *
 * @param <IN> The type of the input value.
 * @param <OUT> The type of the output value.
 * @param <KEY> The type of the key.
 * @param <W> The type of {@code Window} that this window function can be applied on.
 */
@PublicEvolving
public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> extends AbstractRichFunction {

	private static final long serialVersionUID = 1L;

	/**
	 * Evaluates the window and outputs none or several elements.
	 *
	 * @param key The key for which this window is evaluated.
	 * @param context The context in which the window is being evaluated.
	 * @param elements The elements in the window being evaluated.
	 * @param out A collector for emitting elements.
	 *
	 * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
	 */
	 // Evaluates the window and defines the elements it emits
	public abstract void process(KEY key, Context context, Iterable<IN> elements, Collector<OUT> out) throws Exception;

	/**
	 * Deletes any state in the {@code Context} when the Window is purged.
	 *
	 * @param context The context to which the window is being evaluated
	 * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
	 */
	 // Defines how per-window state is cleaned up when the window is purged
	public void clear(Context context) throws Exception {}

	/**
	 * The context holding window metadata.
	 */
	 // Context exposes the window's metadata and gives access to its state
	public abstract class Context implements java.io.Serializable {
		/**
		 * Returns the window that is being evaluated.
		 */
		 // Returns the window being evaluated (its metadata, e.g. start and end)
		public abstract W window();

		/** Returns the current processing time. */
		// Returns the current processing time
		public abstract long currentProcessingTime();

		/** Returns the current event-time watermark. */
		// Returns the current event-time watermark
		public abstract long currentWatermark();

		/**
		 * State accessor for per-key and per-window state.
		 *
		 * <p><b>NOTE:</b>If you use per-window state you have to ensure that you clean it up
		 * by implementing {@link ProcessWindowFunction#clear(Context)}.
		 */
		 // Returns the per-key, per-window state
		public abstract KeyedStateStore windowState();

		/**
		 * State accessor for per-key global state.
		 */
		 // Returns the per-key global state
		public abstract KeyedStateStore globalState();

		/**
		 * Emits a record to the side output identified by the {@link OutputTag}.
		 *
		 * @param outputTag the {@code OutputTag} that identifies the side output to emit to.
		 * @param value The record to emit.
		 */
		 // Emits a record to the side output identified by outputTag
		public abstract <X> void output(OutputTag<X> outputTag, X value);
	}
}

A simple ProcessWindowFunction example

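By implementing ProcessWindowFunction, the example below computes per-key statistics over a window: the sum, the maximum, and the minimum (an average could be derived the same way from the sum and an element count). It also reads window metadata such as the window end time.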

  public static void main(String[] args) throws Exception {
        List<Tuple2<String, Long>> source = Lists.newArrayList();
        source.add(new Tuple2<>("qh1", 88L));
        source.add(new Tuple2<>("qh1", 99L));
        source.add(new Tuple2<>("qh1", 100L));
        source.add(new Tuple2<>("qh1", 155L));
        source.add(new Tuple2<>("qh1", 8L));
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Long>> dataStreamSource = env.fromCollection(source);
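        // Caveat: with a short bounded source and processing-time windows, the job may
        // finish before the 10 s window fires, in which case nothing is printed.
        SingleOutputStreamOperator<Tuple5<String, Long, Long, Long, Long>> result = dataStreamSource.keyBy(t -> t.f0).
                timeWindow(Time.seconds(10)).process(new QhProcessWindowFunction());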
        result.print();
        env.execute("q Demo");
    }

    public static class QhProcessWindowFunction extends ProcessWindowFunction<Tuple2<String, Long>, Tuple5<String, Long, Long, Long, Long>, String, TimeWindow> {

        @Override
        public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements, Collector<Tuple5<String, Long, Long, Long, Long>> out) throws Exception {
            Long sum = 0L;
            Long max = null;
            Long min = null;
            for (Tuple2<String, Long> element : elements) {
                sum += element.f1;
                if (max == null) {
                    max = element.f1;
                }
                if (min == null) {
                    min = element.f1;
                }
                if (max < element.f1) {
                    max = element.f1;
                }
                if (min > element.f1) {
                    min = element.f1;
                }
            }
            // Read the window end time from the window metadata
            long winEndTime = context.window().getEnd();
            // Emit the result: (key, sum, max, min, window end)
            out.collect(new Tuple5<>(key, sum, max, min, winEndTime));
        }
    }
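
The introduction names the median and the mode as typical statistics that need every element of a window. The sketch below shows, under stated assumptions, the kind of helper a process() implementation could call on its elements. WindowStats and its methods are hypothetical names, not part of Flink, and the tie-breaking rule for the mode is an arbitrary choice made for this illustration. It uses only the JDK so it can be tried without a cluster.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Flink): statistics that need every element of a window. */
final class WindowStats {

    private WindowStats() {}

    /** Median of the values; the mean of the two middle values when the count is even. */
    static double median(Iterable<Long> values) {
        List<Long> sorted = new ArrayList<>();
        for (Long v : values) {
            sorted.add(v);
        }
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("window has no elements");
        }
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        if (sorted.size() % 2 == 1) {
            return sorted.get(mid).doubleValue();
        }
        return (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }

    /** Most frequent value; ties are broken in favour of the smallest value (an assumption). */
    static long mode(Iterable<Long> values) {
        Map<Long, Integer> counts = new HashMap<>();
        for (Long v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("window has no elements");
        }
        long best = 0L;
        int bestCount = 0;
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            long v = e.getKey();
            int c = e.getValue();
            if (c > bestCount || (c == bestCount && v < best)) {
                best = v;
                bestCount = c;
            }
        }
        return best;
    }
}
```

Inside process(), the f1 fields of the elements would first be copied into a List<Long> and then passed to these helpers. Both need to see all values at once, which is why they fit a full-window function rather than an incremental one.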

ProcessWindowFunction with Incremental Aggregation

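Because incremental aggregation functions compute on an intermediate state, they perform well, but they are less flexible than ProcessWindowFunction: they cannot operate on window state or read window metadata. Conversely, using a full-window function for basic statistics that could be computed incrementally wastes resources and is slower than the incremental approach. Flink therefore lets you combine an incremental aggregation function with a ProcessWindowFunction, so that both styles of computation contribute their strengths.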

AggregateFunction combined with ProcessWindowFunction

In this example, an AggregateFunction defines how the average is computed. Its output becomes the input of the ProcessWindowFunction, which emits the average at the moment the window fires together with the key.

    public static void main(String[] args) throws Exception {
        List<Tuple2<String, Long>> source = Lists.newArrayList();
        source.add(new Tuple2<>("qh1", 88L));
        source.add(new Tuple2<>("qh1", 99L));
        source.add(new Tuple2<>("qh1", 100L));
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Long>> dataStreamSource = env.fromCollection(source);
        SingleOutputStreamOperator<Tuple2<String, Double>> result = dataStreamSource.keyBy(t -> t.f0).
                timeWindow(Time.seconds(10)).aggregate(new QhAverageAggregate(), new QhProcessWindowFunction());
        result.print();
        env.execute("q Demo");
    }

    public static class QhAverageAggregate implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

        @Override
        public Tuple2<Long, Long> createAccumulator() {
            return new Tuple2<>(0L, 0L);
        }

        @Override
        public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
            return new Tuple2<Long, Long>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
        }

        @Override
        public Double getResult(Tuple2<Long, Long> accumulator) {
            return ((double) accumulator.f0) / accumulator.f1;
        }

        @Override
        public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
            return new Tuple2<Long, Long>(a.f0 + b.f0, a.f1 + b.f1);
        }
    }

    private static class QhProcessWindowFunction
            extends ProcessWindowFunction<Double, Tuple2<String, Double>, String, TimeWindow> {
        @Override
        public void process(String key,
                            Context context,
                            Iterable<Double> averages,
                            Collector<Tuple2<String, Double>> out) {
            Double average = averages.iterator().next();
            out.collect(new Tuple2<>(key, average));
        }
    }
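
To make the accumulator lifecycle above easier to follow, here is a plain-Java model of the same averaging logic that can be traced by hand. It is only a sketch: AverageAccumulatorSketch and Acc are hypothetical names, the methods merely mirror the four callbacks of the AggregateFunction shown above, and nothing here calls Flink.

```java
/** Plain-Java model of an averaging accumulator; names are illustrative, not Flink API. */
final class AverageAccumulatorSketch {

    private AverageAccumulatorSketch() {}

    /** Accumulator: running sum and element count, the only data kept per window. */
    static final class Acc {
        final long sum;
        final long count;

        Acc(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Starts an empty accumulator. */
    static Acc createAccumulator() {
        return new Acc(0L, 0L);
    }

    /** Folds one incoming value into the accumulator. */
    static Acc add(long value, Acc acc) {
        return new Acc(acc.sum + value, acc.count + 1);
    }

    /** Combines two partial accumulators, e.g. when two windows are merged. */
    static Acc merge(Acc a, Acc b) {
        return new Acc(a.sum + b.sum, a.count + b.count);
    }

    /** Turns the accumulator into the final average. */
    static double getResult(Acc acc) {
        return (double) acc.sum / acc.count;
    }

    /** Feeds the values one at a time, the way elements arrive in a window. */
    static double average(long... values) {
        Acc acc = createAccumulator();
        for (long v : values) {
            acc = add(v, acc);
        }
        return getResult(acc);
    }
}
```

For the three sample values 88, 99 and 100, the accumulator ends at a sum of 287 and a count of 3. Only those two numbers are retained while the window is open, and the single resulting average is what the ProcessWindowFunction receives in its Iterable.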

ReduceFunction combined with ProcessWindowFunction

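In this example, a ReduceFunction computes the maximum, and a ProcessWindowFunction reads the window end time from the window metadata and combines it with the maximum from the ReduceFunction into a new Tuple. As before, the output of the ReduceFunction becomes the input of the ProcessWindowFunction. A FoldFunction can be combined with ProcessWindowFunction in the same way (in Flink versions that still provide it), so that the computation stays incremental while still having access to window metadata and state.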

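// Lists is assumed to be Guava's com.google.common.collect.Lists.
import com.google.common.collect.Lists;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.List;

/**
 * @author qingh.yxb
 * @since 2019/7/28
 */
public class WindowDemo {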
    public static void main(String[] args) throws Exception {
        List<Tuple2<String, Long>> source = Lists.newArrayList();
        source.add(new Tuple2<>("qh1", 88L));
        source.add(new Tuple2<>("qh1", 99L));
        source.add(new Tuple2<>("qh1", 100L));
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Long>> dataStreamSource = env.fromCollection(source);
        SingleOutputStreamOperator<Tuple2<Long, Tuple2<String, Long>>> result = dataStreamSource.keyBy(t -> t.f0).
                timeWindow(Time.seconds(10)).reduce(new QhReduceAggregate(), new QhProcessWindowFunction());
        result.print();
        env.execute("q Demo");
    }

    public static class QhReduceAggregate implements ReduceFunction<Tuple2<String, Long>> {
        @Override
        public Tuple2<String, Long> reduce(Tuple2<String, Long> value1, Tuple2<String, Long> value2) throws Exception {
            // Keep the element with the larger value
            return value1.f1 > value2.f1 ? value1 : value2;
        }
    }

    private static class QhProcessWindowFunction
            extends ProcessWindowFunction<Tuple2<String, Long>, Tuple2<Long, Tuple2<String, Long>>, String, TimeWindow> {
        @Override
        public void process(String s, Context context, Iterable<Tuple2<String, Long>> elements, Collector<Tuple2<Long, Tuple2<String, Long>>> out) throws Exception {
            Tuple2<String, Long> max = elements.iterator().next();
            out.collect(new Tuple2<>(context.window().getEnd(), max));
        }
    }
}

Using per-window state in ProcessWindowFunction

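ProcessWindowFunction also provides a way to work with state scoped to a window. Unlike state in an ordinary RichFunction, this state is called per-window state: it is stored per window for a given key. For example, take the user ID as the key and count how often each user accesses some resource within a period. If the platform has 200 users, the window computation creates 200 window instances, each holding the state for its own key. To use this state, obtain it from the Context object of ProcessWindowFunction.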
The state reachable from the Context in ProcessWindowFunction comes in two kinds:

  • globalState: keyed state that is not confined to a single window.
  • windowState: keyed state that is confined to one specific window.

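This state is useful when the same window fires more than once, or when late data triggers the window again. For example, you can store how many times the window has fired and information about the latest firing, so that the next firing has what it needs. Per-window state must be cleaned up promptly: override clear() in ProcessWindowFunction to do so.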
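
The difference between the two scopes, and what clear() is expected to remove, can be modelled in a few lines of plain Java. This is an illustration only, not Flink's implementation: StateScopeSketch, onFire and onClear are hypothetical names, and a window is identified here simply by its end timestamp. In a real job the counters would be obtained through context.windowState() and context.globalState().

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of the two state scopes; an illustration, not Flink's implementation. */
final class StateScopeSketch {

    // Models globalState(): one counter per key, shared by all windows of that key.
    private final Map<String, Long> globalFirings = new HashMap<>();

    // Models windowState(): one counter per (key, window), identified by the window end.
    private final Map<String, Long> windowFirings = new HashMap<>();

    private static String scope(String key, long windowEnd) {
        return key + "@" + windowEnd;
    }

    /** Called each time a window fires for a key, including repeated firings for late data. */
    void onFire(String key, long windowEnd) {
        globalFirings.merge(key, 1L, Long::sum);
        windowFirings.merge(scope(key, windowEnd), 1L, Long::sum);
    }

    /** Mirrors the role of clear(Context): drop only the state scoped to the purged window. */
    void onClear(String key, long windowEnd) {
        windowFirings.remove(scope(key, windowEnd));
    }

    /** Number of firings recorded for one window of one key. */
    long windowCount(String key, long windowEnd) {
        return windowFirings.getOrDefault(scope(key, windowEnd), 0L);
    }

    /** Number of firings recorded for a key across all of its windows. */
    long globalCount(String key) {
        return globalFirings.getOrDefault(key, 0L);
    }
}
```

If the window ending at 10000 fires twice for key qh1 and the window ending at 20000 fires once, the per-window counters are 2 and 1 while the global counter is 3. Clearing the first window removes only its own counter, and the global counter keeps its value.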
