用户定义的聚合函数(UDAGG)将一个表(一个或多个具有一个或多个属性的行)聚合为标量值。
上图显示了聚合的示例。假设您有一个包含饮料数据的表格。该表由三列的id,name和price5行。想象一下,您需要找到表中所有饮料的最高价格,即执行max()聚合。您需要检查5行中的每一行,结果将是单个数值。
用户定义的聚合函数通过扩展AggregateFunction类来实现。一个AggregateFunction作品如下。首先,它需要一个accumulator,它是保存聚合的中间结果的数据结构。通过调用createAccumulator()方法创建一个空累加器AggregateFunction。随后,accumulate()为每个输入行调用函数的方法以更新累加器。处理完所有行后,将getValue()调用该函数的方法来计算并返回最终结果。
每种方法都必须使用以下方法AggregateFunction:
createAccumulator()
accumulate()
getValue()
Flink的类型提取工具无法识别复杂的数据类型,例如,如果它们不是基本类型或简单的POJO。类似于ScalarFunction和TableFunction,AggregateFunction提供了指定TypeInformation结果类型(通过 AggregateFunction#getResultType())和累加器类型(通过AggregateFunction#getAccumulatorType())的方法。
除了上述方法之外,还有一些可以选择性实施的简约方法。虽然其中一些方法允许系统更有效地执行查询,但其他方法对于某些用例是强制性的。例如,merge()如果聚合函数应该应用于会话组窗口的上下文中,则该方法是必需的(当观察到“连接”它们的行时,需要连接两个会话窗口的累加器)。
所有方法AggregateFunction必须声明为public,而不是static完全按照上面提到的名称命名。该方法createAccumulator,getValue,getResultType,和getAccumulatorType在定义的AggregateFunction抽象类,而另一些则收缩的方法。为了定义聚合函数,必须扩展基类org.apache.flink.table.functions.AggregateFunction并实现一个(或多个)accumulate方法。该方法accumulate可以使用不同的参数类型重载,并支持可变参数。
/**
* Base class for aggregation functions.
*
* @param <T> the type of the aggregation result
* @param <ACC> the type of the aggregation accumulator. The accumulator is used to keep the
* aggregated values which are needed to compute an aggregation result.
* AggregateFunction represents its state using accumulator, thereby the state of the
* AggregateFunction must be put into the accumulator.
*/
public abstract class AggregateFunction<T, ACC> extends UserDefinedFunction {
/**
* Creates and init the Accumulator for this [[AggregateFunction]].
*
* @return the accumulator with the initial value
*/
public ACC createAccumulator(); // MANDATORY
/** Processes the input values and update the provided accumulator instance. The method
* accumulate can be overloaded with different custom types and arguments. An AggregateFunction
* requires at least one accumulate() method.
*
* @param accumulator the accumulator which contains the current aggregated results
* @param [user defined inputs] the input value (usually obtained from a new arrived data).
*/
public void accumulate(ACC accumulator, [user defined inputs]); // MANDATORY
/**
* Retracts the input values from the accumulator instance. The current design assumes the
* inputs are the values that have been previously accumulated. The method retract can be
* overloaded with different custom types and arguments. This function must be implemented for
* datastream bounded over aggregate.
*
* @param accumulator the accumulator which contains the current aggregated results
* @param [user defined inputs] the input value (usually obtained from a new arrived data).
*/
public void retract(ACC accumulator, [user defined inputs]); // OPTIONAL
/**
* Merges a group of accumulator instances into one accumulator instance. This function must be
* implemented for datastream session window grouping aggregate and dataset grouping aggregate.
*
* @param accumulator the accumulator which will keep the merged aggregate results. It should
* be noted that the accumulator may contain the previous aggregated
* results. Therefore user should not replace or clean this instance in the
* custom merge method.
* @param its an [[java.lang.Iterable]] pointed to a group of accumulators that will be
* merged.
*/
public void merge(ACC accumulator, java.lang.Iterable<ACC> its); // OPTIONAL
/**
* Called every time when an aggregation result should be materialized.
* The returned value could be either an early and incomplete result
* (periodically emitted as data arrive) or the final result of the
* aggregation.
*
* @param accumulator the accumulator which contains the current
* aggregated results
* @return the aggregation result
*/
public T getValue(ACC accumulator); // MANDATORY
/**
* Resets the accumulator for this [[AggregateFunction]]. This function must be implemented for
* dataset grouping aggregate.
*
* @param accumulator the accumulator which needs to be reset
*/
public void resetAccumulator(ACC accumulator); // OPTIONAL
/**
* Returns true if this AggregateFunction can only be applied in an OVER window.
*
* @return true if the AggregateFunction requires an OVER window, false otherwise.
*/
public Boolean requiresOver = false; // PRE-DEFINED
/**
* Returns the TypeInformation of the AggregateFunction's result.
*
* @return The TypeInformation of the AggregateFunction's result or null if the result type
* should be automatically inferred.
*/
public TypeInformation<T> getResultType = null; // PRE-DEFINED
/**
* Returns the TypeInformation of the AggregateFunction's accumulator.
*
* @return The TypeInformation of the AggregateFunction's accumulator or null if the
* accumulator type should be automatically inferred.
*/
public TypeInformation<T> getAccumulatorType = null; // PRE-DEFINED
}