Flink Aggregate Functions

Latest recommended article published on 2023-12-08 17:09:54

ItStar_

A user-defined aggregate function (UDAGG) aggregates a table (one or more rows, each with one or more attributes) into a scalar value.

[Figure: example of a max() aggregation over a table of beverages]

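The figure above shows an example of an aggregation. Suppose you have a table containing beverage data. The table has three columns (id, name, and price) and 5 rows. Imagine you need to find the highest price of all beverages in the table, i.e., perform a max() aggregation. You need to check each of the 5 rows, and the result is a single numeric value.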

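User-defined aggregate functions are implemented by extending the AggregateFunction class. An AggregateFunction works as follows. First, it needs an accumulator, which is the data structure that holds the intermediate result of the aggregation. An empty accumulator is created by calling the AggregateFunction's createAccumulator() method. Subsequently, the function's accumulate() method is called for each input row to update the accumulator. Once all rows have been processed, the function's getValue() method is called to compute and return the final result.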
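The lifecycle described above can be sketched in plain Java for the max() example. This is only an illustration under stated assumptions: the class names MaxPriceAccumulator, MaxPrice, and MaxPriceDemo are hypothetical, and MaxPrice deliberately does not extend Flink's AggregateFunction so that the snippet runs without any Flink dependency. The demo loop plays the role that the Flink runtime would play.

```java
import java.util.List;

// Hypothetical accumulator: holds the intermediate result of the aggregation.
class MaxPriceAccumulator {
    double max;
    boolean hasValue = false;
}

// Plain-Java stand-in that mirrors the three mandatory methods.
// Assumption: a real UDAGG would extend
// org.apache.flink.table.functions.AggregateFunction instead.
class MaxPrice {
    // Creates an empty accumulator.
    public MaxPriceAccumulator createAccumulator() {
        return new MaxPriceAccumulator();
    }

    // Called once per input row to update the accumulator.
    public void accumulate(MaxPriceAccumulator acc, double price) {
        if (!acc.hasValue || price > acc.max) {
            acc.max = price;
            acc.hasValue = true;
        }
    }

    // Computes the final result; null if no row was seen.
    public Double getValue(MaxPriceAccumulator acc) {
        if (!acc.hasValue) {
            return null;
        }
        return acc.max;
    }
}

class MaxPriceDemo {
    // Drives the lifecycle by hand, as the runtime would.
    static Double maxOf(List<Double> prices) {
        MaxPrice fn = new MaxPrice();
        MaxPriceAccumulator acc = fn.createAccumulator();
        for (double price : prices) {
            fn.accumulate(acc, price);
        }
        return fn.getValue(acc);
    }

    public static void main(String[] args) {
        // Five made-up beverage prices, one per row.
        System.out.println(maxOf(List.of(2.5, 4.0, 3.2, 1.8, 3.9))); // prints 4.0
    }
}
```

Keeping all state in the accumulator, rather than in fields of the function object, matches the requirement that the state of an AggregateFunction lives in its accumulator.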

The following methods are mandatory for every AggregateFunction:

createAccumulator()
accumulate()
getValue()

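Flink's type extraction facilities can fail to identify complex data types, for example if they are not basic types or simple POJOs. As with ScalarFunction and TableFunction, AggregateFunction provides methods to specify the TypeInformation of the result type (through AggregateFunction#getResultType()) and of the accumulator type (through AggregateFunction#getAccumulatorType()).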

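Besides the methods above, there are a few contracted methods that can be optionally implemented, i.e., methods that are matched by name and signature rather than declared in the base class. While some of these methods allow the system to execute queries more efficiently, others are mandatory for certain use cases. For example, the merge() method is mandatory if the aggregate function is to be applied in the context of a session group window (the accumulators of two session windows need to be joined when a row is observed that "connects" them).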
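To make the session-window case concrete, here is a minimal plain-Java sketch of a merge() method for an average. The names AvgAccumulator, AvgSketch, and MergeDemo are hypothetical, the class does not extend Flink's AggregateFunction, and the two "sessions" are simulated by hand. Note that merge() adds into the target accumulator without resetting it, since the target may already hold aggregated results.

```java
import java.util.List;

// Hypothetical accumulator for an average: running sum and row count.
class AvgAccumulator {
    long sum = 0;
    long count = 0;
}

// Plain-Java stand-in (assumption: not the real Flink base class).
class AvgSketch {
    public AvgAccumulator createAccumulator() {
        return new AvgAccumulator();
    }

    public void accumulate(AvgAccumulator acc, long value) {
        acc.sum += value;
        acc.count++;
    }

    // Folds a group of accumulators into acc. The existing content of acc
    // is kept, because it may already contain aggregated results.
    public void merge(AvgAccumulator acc, Iterable<AvgAccumulator> others) {
        for (AvgAccumulator other : others) {
            acc.sum += other.sum;
            acc.count += other.count;
        }
    }

    public Double getValue(AvgAccumulator acc) {
        if (acc.count == 0) {
            return null;
        }
        return (double) acc.sum / acc.count;
    }
}

class MergeDemo {
    static Double mergedAverage() {
        AvgSketch fn = new AvgSketch();

        // Two sessions that were aggregated separately.
        AvgAccumulator sessionA = fn.createAccumulator();
        fn.accumulate(sessionA, 10);
        fn.accumulate(sessionA, 20);

        AvgAccumulator sessionB = fn.createAccumulator();
        fn.accumulate(sessionB, 60);

        // A row "connects" the sessions, so their accumulators are joined.
        fn.merge(sessionA, List.of(sessionB));
        return fn.getValue(sessionA);
    }

    public static void main(String[] args) {
        System.out.println(mergedAverage()); // prints 30.0
    }
}
```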

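All methods of an AggregateFunction must be declared public, must not be static, and must be named exactly as listed above. The methods createAccumulator, getValue, getResultType, and getAccumulatorType are defined in the AggregateFunction abstract class, while the others are contracted methods. To define an aggregate function, you must extend the base class org.apache.flink.table.functions.AggregateFunction and implement one (or more) accumulate methods. The accumulate method can be overloaded with different parameter types and supports variable arguments.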
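As a rough illustration of overloading accumulate with different parameter types and with variable arguments, consider the following plain-Java sketch. The names SumAcc, OverloadedSum, and OverloadDemo are hypothetical, and the class does not extend Flink's AggregateFunction. The overloads here are chosen by the Java compiler at the call site; how Flink matches a query's argument types to an accumulate method at runtime is not shown and may differ in detail.

```java
// Hypothetical accumulator holding a running total.
class SumAcc {
    long total = 0;
}

// Plain-Java stand-in with several public, non-static accumulate overloads.
class OverloadedSum {
    public SumAcc createAccumulator() {
        return new SumAcc();
    }

    // Overload for a single int input.
    public void accumulate(SumAcc acc, int value) {
        acc.total += value;
    }

    // Overload for a single long input.
    public void accumulate(SumAcc acc, long value) {
        acc.total += value;
    }

    // Variable-argument overload for any number of long inputs.
    public void accumulate(SumAcc acc, long... values) {
        for (long v : values) {
            acc.total += v;
        }
    }

    public Long getValue(SumAcc acc) {
        return acc.total;
    }
}

class OverloadDemo {
    static long run() {
        OverloadedSum fn = new OverloadedSum();
        SumAcc acc = fn.createAccumulator();
        fn.accumulate(acc, 5);          // int overload
        fn.accumulate(acc, 7L);         // long overload
        fn.accumulate(acc, 1L, 2L, 3L); // varargs overload
        return fn.getValue(acc);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 18
    }
}
```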

/**	
  * Base class for aggregation functions. 	
  *	
  * @param <T>   the type of the aggregation result	
  * @param <ACC> the type of the aggregation accumulator. The accumulator is used to keep the	
  *             aggregated values which are needed to compute an aggregation result.	
  *             AggregateFunction represents its state using accumulator, thereby the state of the	
  *             AggregateFunction must be put into the accumulator.	
  */	
public abstract class AggregateFunction<T, ACC> extends UserDefinedFunction {	
	
  /**	
    * Creates and initializes the accumulator for this {@link AggregateFunction}.
    *	
    * @return the accumulator with the initial value	
    */	
  public abstract ACC createAccumulator(); // MANDATORY
	
  /** Processes the input values and updates the provided accumulator instance. The method
    * accumulate can be overloaded with different custom types and arguments. An AggregateFunction	
    * requires at least one accumulate() method.	
    *	
    * @param accumulator           the accumulator which contains the current aggregated results	
    * @param [user defined inputs] the input value (usually obtained from newly arrived data).
    */	
  public void accumulate(ACC accumulator, [user defined inputs]); // MANDATORY	
	
  /**	
    * Retracts the input values from the accumulator instance. The current design assumes the	
    * inputs are the values that have been previously accumulated. The method retract can be	
    * overloaded with different custom types and arguments. This function must be implemented for	
    * datastream bounded over aggregate.	
    *	
    * @param accumulator           the accumulator which contains the current aggregated results	
    * @param [user defined inputs] the input value to retract (assumed to have been accumulated previously).
    */	
  public void retract(ACC accumulator, [user defined inputs]); // OPTIONAL	
	
  /**	
    * Merges a group of accumulator instances into one accumulator instance. This function must be	
    * implemented for datastream session window grouping aggregate and dataset grouping aggregate.	
    *	
    * @param accumulator  the accumulator which will keep the merged aggregate results. It should	
    *                     be noted that the accumulator may contain the previous aggregated	
    *                     results. Therefore user should not replace or clean this instance in the	
    *                     custom merge method.	
    * @param its          an {@link java.lang.Iterable} pointing to a group of accumulators that will be
    *                     merged.	
    */	
  public void merge(ACC accumulator, java.lang.Iterable<ACC> its); // OPTIONAL	
	
  /**	
    * Called every time when an aggregation result should be materialized.	
    * The returned value could be either an early and incomplete result	
    * (periodically emitted as data arrive) or the final result of the	
    * aggregation.	
    *	
    * @param accumulator the accumulator which contains the current	
    *                    aggregated results	
    * @return the aggregation result	
    */	
  public abstract T getValue(ACC accumulator); // MANDATORY
	
  /**	
    * Resets the accumulator for this {@link AggregateFunction}. This function must be implemented for
    * dataset grouping aggregate.	
    *	
    * @param accumulator  the accumulator which needs to be reset	
    */	
  public void resetAccumulator(ACC accumulator); // OPTIONAL	
	
  /**	
    * Returns true if this AggregateFunction can only be applied in an OVER window.	
    *	
    * @return true if the AggregateFunction requires an OVER window, false otherwise.	
    */	
  public boolean requiresOver() { return false; } // PRE-DEFINED
	
  /**	
    * Returns the TypeInformation of the AggregateFunction's result.	
    *	
    * @return The TypeInformation of the AggregateFunction's result or null if the result type	
    *         should be automatically inferred.	
    */	
  public TypeInformation<T> getResultType() { return null; } // PRE-DEFINED
	
  /**	
    * Returns the TypeInformation of the AggregateFunction's accumulator.	
    *	
    * @return The TypeInformation of the AggregateFunction's accumulator or null if the	
    *         accumulator type should be automatically inferred.	
    */	
  public TypeInformation<ACC> getAccumulatorType() { return null; } // PRE-DEFINED
}
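
The optional retract() and resetAccumulator() methods from the listing above can be sketched for a weighted average as follows. This is a plain-Java illustration under stated assumptions: the names WeightedAvgAccum, WeightedAvgSketch, and RetractDemo are hypothetical, the class does not extend Flink's AggregateFunction, and the calls that the runtime would issue are made by hand. retract() assumes that the retracted values were accumulated earlier.

```java
// Hypothetical accumulator: weighted sum and total weight.
class WeightedAvgAccum {
    long sum = 0;
    int count = 0;
}

// Plain-Java stand-in (assumption: not the real Flink base class).
class WeightedAvgSketch {
    public WeightedAvgAccum createAccumulator() {
        return new WeightedAvgAccum();
    }

    public void accumulate(WeightedAvgAccum acc, long value, int weight) {
        acc.sum += value * weight;
        acc.count += weight;
    }

    // Undoes a previous accumulate call with the same arguments.
    public void retract(WeightedAvgAccum acc, long value, int weight) {
        acc.sum -= value * weight;
        acc.count -= weight;
    }

    // Returns the accumulator to its initial, empty state.
    public void resetAccumulator(WeightedAvgAccum acc) {
        acc.sum = 0;
        acc.count = 0;
    }

    public Double getValue(WeightedAvgAccum acc) {
        if (acc.count == 0) {
            return null;
        }
        return (double) acc.sum / acc.count;
    }
}

class RetractDemo {
    public static void main(String[] args) {
        WeightedAvgSketch fn = new WeightedAvgSketch();
        WeightedAvgAccum acc = fn.createAccumulator();

        fn.accumulate(acc, 10, 1);
        fn.accumulate(acc, 20, 3);
        fn.accumulate(acc, 40, 1);
        System.out.println(fn.getValue(acc)); // prints 22.0

        // The row (40, 1) leaves the window, so it is retracted.
        fn.retract(acc, 40, 1);
        System.out.println(fn.getValue(acc)); // prints 17.5

        fn.resetAccumulator(acc);
        System.out.println(fn.getValue(acc)); // prints null
    }
}
```

A sum-and-count accumulator makes retract() straightforward. An aggregate such as max() would need to keep more state in its accumulator to support retraction, because removing the current maximum requires knowing the next-largest value.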
