A look at Flink KeyedStream aggregation operations

This post walks through how the aggregation operations on Flink's KeyedStream (sum, max, min, maxBy, minBy) are implemented.

Example

    @Test
    public void testMax() throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        WordCount[] data = new WordCount[]{new WordCount(1,"Hello", 1), new
                WordCount(1,"World", 3), new WordCount(2,"Hello", 1)};
        env.fromElements(data)
                .keyBy("word")
                .max("frequency")
                .addSink(new SinkFunction<WordCount>() {
                    @Override
                    public void invoke(WordCount value, Context context) throws Exception {
                        LOGGER.info("value:{}",value);
                    }
                });
        env.execute("testMax");
    }
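  • Here the stream is first keyed by the word field, and KeyedStream's max method then tracks the maximum frequency per key. Note that max only guarantees the frequency field of the emitted WordCount: as ComparableAggregator.reduce (covered later in this post) shows, the other fields keep the values of the first element seen for that key. To get the whole element that holds the maximum, use maxBy instead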
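To make the emitted results concrete, here is a minimal plain-Java sketch (no Flink dependency) of a rolling keyed maximum: per-key state is updated and one result is emitted for every incoming element. The class name RollingMaxSketch and the method rollingMax are hypothetical, and the model only tracks the frequency field. In a real job with parallelism greater than 1, the relative order of results for different keys is not guaranteed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-alone model of a rolling keyed max; not Flink code.
final class RollingMaxSketch {

    // Returns one "word=maxSoFar" entry per input element.
    static List<String> rollingMax(String[] words, int[] frequencies) {
        Map<String, Integer> state = new HashMap<>(); // running maximum per key
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // merge stores the value for a new key, otherwise keeps the larger one
            int current = state.merge(words[i], frequencies[i], Integer::max);
            emitted.add(words[i] + "=" + current);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Same words and frequencies as the test data above
        System.out.println(rollingMax(
                new String[]{"Hello", "World", "Hello"},
                new int[]{1, 3, 1}));
    }
}
```

With the three elements from the example, the sketch emits three results, one per input, and the second "Hello" result still carries frequency 1.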

KeyedStream.aggregate

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/api/datastream/KeyedStream.java

	public SingleOutputStreamOperator<T> sum(int positionToSum) {
		return aggregate(new SumAggregator<>(positionToSum, getType(), getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> sum(String field) {
		return aggregate(new SumAggregator<>(field, getType(), getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> max(int positionToMax) {
		return aggregate(new ComparableAggregator<>(positionToMax, getType(), AggregationFunction.AggregationType.MAX,
				getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> max(String field) {
		return aggregate(new ComparableAggregator<>(field, getType(), AggregationFunction.AggregationType.MAX,
				false, getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> min(int positionToMin) {
		return aggregate(new ComparableAggregator<>(positionToMin, getType(), AggregationFunction.AggregationType.MIN,
				getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> min(String field) {
		return aggregate(new ComparableAggregator<>(field, getType(), AggregationFunction.AggregationType.MIN,
				false, getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy) {
		return this.maxBy(positionToMaxBy, true);
	}

	public SingleOutputStreamOperator<T> maxBy(String positionToMaxBy) {
		return this.maxBy(positionToMaxBy, true);
	}

	public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy, boolean first) {
		return aggregate(new ComparableAggregator<>(positionToMaxBy, getType(), AggregationFunction.AggregationType.MAXBY, first,
				getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> maxBy(String field, boolean first) {
		return aggregate(new ComparableAggregator<>(field, getType(), AggregationFunction.AggregationType.MAXBY,
				first, getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> minBy(int positionToMinBy) {
		return this.minBy(positionToMinBy, true);
	}

	public SingleOutputStreamOperator<T> minBy(String positionToMinBy) {
		return this.minBy(positionToMinBy, true);
	}

	public SingleOutputStreamOperator<T> minBy(int positionToMinBy, boolean first) {
		return aggregate(new ComparableAggregator<T>(positionToMinBy, getType(), AggregationFunction.AggregationType.MINBY, first,
				getExecutionConfig()));
	}

	public SingleOutputStreamOperator<T> minBy(String field, boolean first) {
		return aggregate(new ComparableAggregator(field, getType(), AggregationFunction.AggregationType.MINBY,
				first, getExecutionConfig()));
	}

	protected SingleOutputStreamOperator<T> aggregate(AggregationFunction<T> aggregate) {
		StreamGroupedReduce<T> operator = new StreamGroupedReduce<T>(
				clean(aggregate), getType().createSerializer(getExecutionConfig()));
		return transform("Keyed Aggregation", getType(), operator);
	}
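  • KeyedStream's aggregate method is protected. sum, max, min, maxBy, and minBy all delegate to it and differ only in the AggregationFunction they pass: sum creates a SumAggregator, while the others create a ComparableAggregator with AggregationType MAX, MIN, MAXBY, or MINBY respectively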
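  • sum, max, and min each have two overloads, one taking an int position and one taking a String field name; maxBy and minBy have four each, because both variants also come in a version that takes the extra first flag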
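  • Compared with sum, max, and min, maxBy and minBy take an additional first (boolean) parameter. It decides which element is returned when several elements have an equal value in the compared field: true keeps the first one seen, false the most recent one. The overloads without this parameter pass true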
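The effect of the first flag is easiest to see on tied values. The following is a small plain-Java sketch, not Flink's code: elements are modeled as int arrays of {id, frequency}, and the names FirstFlagSketch, maxBy, and winnerId are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone model of maxBy tie-breaking; each element is {id, frequency}.
final class FirstFlagSketch {

    // Picks between the accumulated element and the next one by frequency.
    static int[] maxBy(int[] acc, int[] next, boolean first) {
        int c = Integer.compare(acc[1], next[1]);
        if (c == 0) {
            return first ? acc : next; // tie: keep the earlier or the later element
        }
        return c > 0 ? acc : next;
    }

    // Folds the elements left to right, as a rolling reduce would,
    // and returns the id of the element that wins.
    static int winnerId(List<int[]> elements, boolean first) {
        int[] acc = elements.get(0);
        for (int[] next : elements.subList(1, elements.size())) {
            acc = maxBy(acc, next, first);
        }
        return acc[0];
    }

    public static void main(String[] args) {
        List<int[]> tied = Arrays.asList(new int[]{1, 5}, new int[]{2, 5}, new int[]{3, 5});
        System.out.println(winnerId(tied, true));  // id of the first element with the maximum
        System.out.println(winnerId(tied, false)); // id of the last element with the maximum
    }
}
```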

ComparableAggregator

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/api/functions/aggregation/ComparableAggregator.java

@Internal
public class ComparableAggregator<T> extends AggregationFunction<T> {

	private static final long serialVersionUID = 1L;

	private Comparator comparator;
	private boolean byAggregate;
	private boolean first;
	private final FieldAccessor<T, Object> fieldAccessor;

	private ComparableAggregator(AggregationType aggregationType, FieldAccessor<T, Object> fieldAccessor, boolean first) {
		this.comparator = Comparator.getForAggregation(aggregationType);
		this.byAggregate = (aggregationType == AggregationType.MAXBY) || (aggregationType == AggregationType.MINBY);
		this.first = first;
		this.fieldAccessor = fieldAccessor;
	}

	public ComparableAggregator(int positionToAggregate,
			TypeInformation<T> typeInfo,
			AggregationType aggregationType,
			ExecutionConfig config) {
		this(positionToAggregate, typeInfo, aggregationType, false, config);
	}

	public ComparableAggregator(int positionToAggregate,
			TypeInformation<T> typeInfo,
			AggregationType aggregationType,
			boolean first,
			ExecutionConfig config) {
		this(aggregationType, FieldAccessorFactory.getAccessor(typeInfo, positionToAggregate, config), first);
	}

	public ComparableAggregator(String field,
			TypeInformation<T> typeInfo,
			AggregationType aggregationType,
			boolean first,
			ExecutionConfig config) {
		this(aggregationType, FieldAccessorFactory.getAccessor(typeInfo, field, config), first);
	}

	@SuppressWarnings("unchecked")
	@Override
	public T reduce(T value1, T value2) throws Exception {
		Comparable<Object> o1 = (Comparable<Object>) fieldAccessor.get(value1);
		Object o2 = fieldAccessor.get(value2);

		int c = comparator.isExtremal(o1, o2);

		if (byAggregate) {
			// if they are the same we choose based on whether we want to first or last
			// element with the min/max.
			if (c == 0) {
				return first ? value1 : value2;
			}

			return c == 1 ? value1 : value2;

		} else {
			if (c == 0) {
				value1 = fieldAccessor.set(value1, o2);
			}
			return value1;
		}
	}
}
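  • ComparableAggregator extends AggregationFunction, which in turn implements the ReduceFunction interface. Its reduce method first uses the Comparator to compare the aggregated field of the two elements and then branches on byAggregate. For maxBy/minBy (byAggregate is true): when the result is 0 (the fields are equal), it returns value1 if first is true and value2 otherwise; when the result is 1, value1 is still the extreme and is returned; otherwise value2 is returned. For max/min (byAggregate is false): the comparator only returns 1 or 0, and 0 means value1's field is not strictly more extreme than value2's (for max, value1's field is less than or equal to value2's); in that case the FieldAccessor (reflection-based for POJO fields) writes value2's field value into value1. Either way value1 is returned, so every other field keeps the value from the first element seen for that key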
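The practical difference between max and maxBy follows directly from this reduce logic. Here is a minimal plain-Java sketch that mirrors the two branches. It is not Flink's implementation: the Wc class is a hypothetical stand-in for the post's WordCount, the field name id is assumed, and the sketch returns a copy where Flink writes the field into value1 in place.

```java
// Stand-alone model of the two branches of ComparableAggregator.reduce.
final class MaxVsMaxBySketch {

    // Hypothetical stand-in for WordCount; the first field's name (id) is assumed.
    static final class Wc {
        final int id;
        final String word;
        final int frequency;

        Wc(int id, String word, int frequency) {
            this.id = id;
            this.word = word;
            this.frequency = frequency;
        }

        @Override
        public String toString() {
            return "(" + id + "," + word + "," + frequency + ")";
        }
    }

    // max: always keep value1; take over value2's frequency unless value1 is strictly larger.
    static Wc reduceMax(Wc value1, Wc value2) {
        int c = value1.frequency > value2.frequency ? 1 : 0; // MaxComparator-style result
        if (c == 0) {
            // Copy instead of in-place mutation, to keep the sketch immutable
            return new Wc(value1.id, value1.word, value2.frequency);
        }
        return value1;
    }

    // maxBy with first = true: return one of the two elements as a whole.
    static Wc reduceMaxBy(Wc value1, Wc value2) {
        int c = Integer.compare(value1.frequency, value2.frequency);
        if (c == 0) {
            return value1; // tie: keep the earlier element
        }
        return c > 0 ? value1 : value2;
    }

    public static void main(String[] args) {
        Wc a = new Wc(1, "Hello", 1);
        Wc b = new Wc(2, "Hello", 5);
        System.out.println(reduceMax(a, b));   // id from a, frequency from b
        System.out.println(reduceMaxBy(a, b)); // b as a whole
    }
}
```

In this sketch, max yields a result that never existed in the input (the id of the first element combined with the largest frequency), whereas maxBy always yields one of the original elements.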

AggregationFunction

@Internal
public abstract class AggregationFunction<T> implements ReduceFunction<T> {
	private static final long serialVersionUID = 1L;

	/**
	 * Aggregation types that can be used on a windowed stream or keyed stream.
	 */
	public enum AggregationType {
		SUM, MIN, MAX, MINBY, MAXBY,
	}
}
  • AggregationFunction is declared as implementing ReduceFunction and defines the AggregationType enum with five values: SUM, MIN, MAX, MINBY, MAXBY

Comparator

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/api/functions/aggregation/Comparator.java

@Internal
public abstract class Comparator implements Serializable {

	private static final long serialVersionUID = 1L;

	public abstract <R> int isExtremal(Comparable<R> o1, R o2);

	public static Comparator getForAggregation(AggregationType type) {
		switch (type) {
		case MAX:
			return new MaxComparator();
		case MIN:
			return new MinComparator();
		case MINBY:
			return new MinByComparator();
		case MAXBY:
			return new MaxByComparator();
		default:
			throw new IllegalArgumentException("Unsupported aggregation type.");
		}
	}

	private static class MaxComparator extends Comparator {

		private static final long serialVersionUID = 1L;

		@Override
		public <R> int isExtremal(Comparable<R> o1, R o2) {
			return o1.compareTo(o2) > 0 ? 1 : 0;
		}

	}

	private static class MaxByComparator extends Comparator {

		private static final long serialVersionUID = 1L;

		@Override
		public <R> int isExtremal(Comparable<R> o1, R o2) {
			int c = o1.compareTo(o2);
			if (c > 0) {
				return 1;
			}
			if (c == 0) {
				return 0;
			} else {
				return -1;
			}
		}

	}

	private static class MinByComparator extends Comparator {

		private static final long serialVersionUID = 1L;

		@Override
		public <R> int isExtremal(Comparable<R> o1, R o2) {
			int c = o1.compareTo(o2);
			if (c < 0) {
				return 1;
			}
			if (c == 0) {
				return 0;
			} else {
				return -1;
			}
		}

	}

	private static class MinComparator extends Comparator {

		private static final long serialVersionUID = 1L;

		@Override
		public <R> int isExtremal(Comparable<R> o1, R o2) {
			return o1.compareTo(o2) < 0 ? 1 : 0;
		}

	}
}
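  • Comparator implements Serializable, declares the abstract isExtremal method, and provides the getForAggregation factory method, which creates the matching Comparator for a given AggregationType. SUM has no case of its own and falls into the default branch, which throws IllegalArgumentException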
  • Comparator defines four nested subclasses, MaxComparator, MinComparator, MinByComparator, and MaxByComparator, each of which implements isExtremal
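  • MaxComparator relies on the compareTo method of the Comparable interface but only returns 1 or 0: 1 when o1 is strictly greater than o2, and 0 otherwise. MaxByComparator also calls compareTo but distinguishes three cases: 1 when o1 is greater, 0 when the two are equal, and -1 when o1 is smaller. MinComparator and MinByComparator mirror this with the comparison reversed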
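The return values of the four comparators can be condensed into four small static methods. This is a plain-Java restatement for illustration, not the Flink class itself; the name ExtremalSketch and the method names are made up.

```java
// Stand-alone restatement of the four isExtremal variants.
final class ExtremalSketch {

    // MAX: 1 only when o1 is strictly greater, otherwise 0
    static <R> int max(Comparable<R> o1, R o2) {
        return o1.compareTo(o2) > 0 ? 1 : 0;
    }

    // MIN: 1 only when o1 is strictly smaller, otherwise 0
    static <R> int min(Comparable<R> o1, R o2) {
        return o1.compareTo(o2) < 0 ? 1 : 0;
    }

    // MAXBY: 1 when o1 is greater, 0 when equal, -1 when smaller
    static <R> int maxBy(Comparable<R> o1, R o2) {
        return Integer.signum(o1.compareTo(o2));
    }

    // MINBY: 1 when o1 is smaller, 0 when equal, -1 when greater
    static <R> int minBy(Comparable<R> o1, R o2) {
        return -Integer.signum(o1.compareTo(o2));
    }

    public static void main(String[] args) {
        Integer two = 2;
        Integer three = 3;
        System.out.println(max(two, two));     // equal values: max reports 0
        System.out.println(max(two, three));   // smaller value: max also reports 0
        System.out.println(maxBy(two, two));   // equal values: maxBy reports 0
        System.out.println(maxBy(two, three)); // smaller value: maxBy reports -1
    }
}
```

Because max and min collapse "equal" and "worse" into the same result 0, only the By variants can apply a tie-breaking rule such as the first flag.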

Summary

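  • KeyedStream's aggregation operations are sum, max, min, maxBy, and minBy. Internally they all call the protected aggregate method and differ only in the AggregationFunction they create: a SumAggregator for sum, and a ComparableAggregator with AggregationType MAX, MIN, MAXBY, or MINBY for the rest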
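  • ComparableAggregator extends AggregationFunction, which implements the ReduceFunction interface. Its reduce method compares the two elements through the Comparator and then branches on byAggregate. With byAggregate (maxBy/minBy), a result of 0 returns the earlier element if first is true and the later one otherwise; a result of 1 returns value1, and any other result returns value2. Without byAggregate (max/min), a result of 0 causes the FieldAccessor to copy value2's field value into value1, and value1 is returned in every case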
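  • Comparator defines four subclasses, MaxComparator, MinComparator, MinByComparator, and MaxByComparator, each implementing isExtremal. MaxComparator returns 1 for greater and 0 for less than or equal, whereas MaxByComparator is finer-grained: 1 for greater, 0 for equal, -1 for less. The same distinction shows up in ComparableAggregator's reduce method: maxBy and minBy take the extra first (boolean) parameter specifically to choose which element to return when the comparison result is 0, while for the non-byAggregate operations reduce always returns value1, updating its compared field first whenever value1 is not strictly more extreme than value2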
