AggregationIterator
The doExecute methods of the three Aggregate physical operators follow a similar code skeleton:
protected override def doExecute(): RDD[InternalRow] = {
  child.execute().mapPartitionsWithIndex { (partIndex, iter) =>
    val hasInput = iter.hasNext
    val res = if (!hasInput && groupingExpressions.nonEmpty) {
      // This is a grouped aggregate and the input iterator is empty,
      // so return an empty iterator.
      Iterator.empty
    } else {
      val aggregationIterator =
        new XXXXAggregateIterator(
          ...,
          (expressions, inputSchema) =>
            newMutableProjection(expressions, inputSchema, ...),
          ...
        )
      aggregationIterator
    }
    res
  }
}
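The empty-input branch of the skeleton above can be tried out without Spark. The following is a minimal sketch in plain Scala collections: `DoExecuteSketch` and `aggregatePartition` are hypothetical names, rows are modeled as `(key, value)` pairs, and the aggregate is a fixed sum. It is meant only to show why a grouped aggregate returns nothing for an empty partition while an aggregate without grouping keys still emits one row.

```scala
object DoExecuteSketch {
  // Hypothetical stand-in for an aggregation iterator over one partition.
  // Rows are (groupingKey, value) pairs and the aggregate is a sum.
  def aggregatePartition(
      iter: Iterator[(String, Long)],
      grouped: Boolean): Iterator[(String, Long)] = {
    val hasInput = iter.hasNext
    if (!hasInput && grouped) {
      // Grouped aggregate over an empty partition: there are no groups to emit.
      Iterator.empty
    } else if (!grouped) {
      // No grouping keys: a single buffer, emitted even when the input is empty.
      Iterator(("", iter.map(_._2).sum))
    } else {
      // Grouped aggregate: one buffer per key, kept in first-seen order.
      val buffers = scala.collection.mutable.LinkedHashMap.empty[String, Long]
      iter.foreach { case (k, v) => buffers(k) = buffers.getOrElse(k, 0L) + v }
      buffers.iterator
    }
  }
}
```

Calling `aggregatePartition(Iterator.empty, grouped = true)` yields an empty iterator, whereas `grouped = false` yields one row holding the initial sum.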
As the skeleton shows, the actual aggregation logic lives in the different AggregationIterator implementations used by the three physical operators.
AggregationIterator constructor parameter: newMutableProjection
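All aggregation iterator implementations extend the abstract class AggregationIterator. Its constructor parameter newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection takes a function that compiles Catalyst expressions into a MutableProjection object. That object performs computations on InternalRow data, transforming one InternalRow into a new InternalRow (its role is covered later).
In all three implementations, this function is: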
(expressions, inputSchema) => newMutableProjection(expressions, inputSchema, ...)
newMutableProjection is inherited from SparkPlan:
protected def newMutableProjection(
    expressions: Seq[Expression],
    inputSchema: Seq[Attribute],
    useSubexprElimination: Boolean = false): MutableProjection = {
  GenerateMutableProjection.generate(expressions, inputSchema, useSubexprElimination)
}
GenerateMutableProjection.generate turns the Catalyst expressions into code and compiles that code into an object of type MutableProjection.
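To make the idea of "expressions in, row-to-row function out" concrete, here is a toy analogue that does not depend on Spark. Everything in it is an assumption made for illustration: the `Expr` tree stands in for Catalyst expressions, column names stand in for `Attribute`s, and closures stand in for generated code. The one property it tries to mirror is that the projection writes its results into a single reused target row.

```scala
object ProjectionSketch {
  type Row = Array[Any]

  // Hypothetical mini expression language standing in for Catalyst expressions.
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class Lit(value: Long) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Resolve column names to ordinals once and return a closure that evaluates a row.
  private def bind(e: Expr, schema: Seq[String]): Row => Long = e match {
    case Col(n) =>
      val i = schema.indexOf(n)
      require(i >= 0, s"unknown column $n")
      (row: Row) => row(i).asInstanceOf[Long]
    case Lit(v) => (_: Row) => v
    case Add(l, r) =>
      val fl = bind(l, schema)
      val fr = bind(r, schema)
      (row: Row) => fl(row) + fr(row)
  }

  // Toy analogue of newMutableProjection: the returned function fills one reused target row.
  def newProjection(exprs: Seq[Expr], schema: Seq[String]): Row => Row = {
    val bound = exprs.map(bind(_, schema)).toArray
    val target: Row = new Array[Any](bound.length)
    (input: Row) => {
      var i = 0
      while (i < bound.length) { target(i) = bound(i)(input); i += 1 }
      target
    }
  }
}
```

Because the target row is reused, a caller that needs to keep a result across calls has to copy it first.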
Main logic of the abstract class AggregationIterator
The comment on AggregationIterator describes its purpose:
/**
* The base class of [[SortBasedAggregationIterator]] and [[TungstenAggregationIterator]].
* It mainly contains two parts:
* 1. It initializes aggregate functions.
* 2. It creates two functions, `processRow` and `generateOutput` based on [[AggregateMode]] of
* its aggregate functions. `processRow` is the function to handle an input. `generateOutput`
* is used to generate result.
*/
Its method
protected def generateProcessRow(
    expressions: Seq[AggregateExpression],
    functions: Seq[AggregateFunction],
    inputAttributes: Seq[Attribute]): (InternalRow, InternalRow) => Unit
returns a function (the processRow function) that the iterator uses to process each row of input data.
AggregateFunction
Categories of AggregateFunction
AggregateFunction falls into two main categories, ImperativeAggregate and DeclarativeAggregate, both of which extend AggregateFunction.
DeclarativeAggregate
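A declarative aggregate, DeclarativeAggregate, declares the computation of an aggregate function as Catalyst expressions. The actual computation is carried out by the MutableProjection object compiled from those expressions. Subclasses must declare the following aggregation-related expressions:
- val initialValues: Seq[Expression]: expressions that initialize the aggregation buffer
- val updateExpressions: Seq[Expression]: expressions that update the aggregation buffer with one row of data
- val mergeExpressions: Seq[Expression]: expressions that merge aggregation buffers
- val evaluateExpression: Expression: the expression that returns the final aggregation result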
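As an illustration of the declarative style, the sketch below declares an average as four groups of expressions over a (sum, count) buffer. It is a toy model under stated assumptions, not Spark code: the `Expr` tree, `Buf`, `Other`, and `project` are hypothetical, and an interpreter replaces the compiled MutableProjection. Only the four member names are taken from the list above.

```scala
object DeclarativeSketch {
  // Hypothetical mini expression tree; Catalyst expressions play this role in Spark.
  sealed trait Expr
  final case class Lit(v: Double) extends Expr
  final case class Buf(i: Int) extends Expr   // slot i of the buffer being updated
  final case class Other(i: Int) extends Expr // slot i of the input row, or of the buffer merged in
  final case class Add(l: Expr, r: Expr) extends Expr
  final case class Div(l: Expr, r: Expr) extends Expr

  def eval(e: Expr, buf: Seq[Double], other: Seq[Double]): Double = e match {
    case Lit(v)    => v
    case Buf(i)    => buf(i)
    case Other(i)  => other(i)
    case Add(l, r) => eval(l, buf, other) + eval(r, buf, other)
    case Div(l, r) => eval(l, buf, other) / eval(r, buf, other)
  }

  // Average declared purely as expressions over a (sum, count) buffer.
  object Avg {
    val initialValues: Seq[Expr] = Seq(Lit(0.0), Lit(0.0))
    val updateExpressions: Seq[Expr] = Seq(Add(Buf(0), Other(0)), Add(Buf(1), Lit(1.0)))
    val mergeExpressions: Seq[Expr] = Seq(Add(Buf(0), Other(0)), Add(Buf(1), Other(1)))
    val evaluateExpression: Expr = Div(Buf(0), Buf(1))
  }

  // Stand-in for the compiled projection: evaluate every expression against the old buffer.
  def project(exprs: Seq[Expr], buf: Seq[Double], other: Seq[Double]): Seq[Double] =
    exprs.map(eval(_, buf, other))
}
```

The aggregate itself contains no loop and no mutation. The caller decides when to apply the update, merge, or evaluate expressions, which is what lets an engine compile them together with other expressions.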
ImperativeAggregate
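An imperative aggregate, ImperativeAggregate, implements the aggregation logic as a set of methods that operate on InternalRow data. These methods are called while the aggregation runs. Subclasses must implement:
- def initialize(mutableAggBuffer: InternalRow): Unit: initializes the aggregation buffer
- def update(mutableAggBuffer: InternalRow, inputRow: InternalRow): Unit: updates the aggregation buffer with one row of data
- def merge(mutableAggBuffer: InternalRow, inputAggBuffer: InternalRow): Unit: merges aggregation buffers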
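The three methods above can be sketched on plain arrays. In this hypothetical model, `ToyImperativeAggregate` mirrors the method signatures with `Array[Any]` in place of InternalRow, and `ToyMax` is an invented example that keeps the maximum of column 0. It assumes the buffer has exactly one slot and that values are `Long`.

```scala
object ImperativeSketch {
  type Row = Array[Any]

  // Hypothetical trait mirroring the three methods listed above, on plain arrays.
  trait ToyImperativeAggregate {
    def initialize(mutableAggBuffer: Row): Unit
    def update(mutableAggBuffer: Row, inputRow: Row): Unit
    def merge(mutableAggBuffer: Row, inputAggBuffer: Row): Unit
  }

  // Max over column 0 of the input. The buffer has one slot; null means "no value yet".
  object ToyMax extends ToyImperativeAggregate {
    def initialize(mutableAggBuffer: Row): Unit = { mutableAggBuffer(0) = null }

    def update(mutableAggBuffer: Row, inputRow: Row): Unit =
      combine(mutableAggBuffer, inputRow(0))

    def merge(mutableAggBuffer: Row, inputAggBuffer: Row): Unit =
      combine(mutableAggBuffer, inputAggBuffer(0))

    // Keep the larger of the buffered value and v, ignoring nulls.
    private def combine(buf: Row, v: Any): Unit = {
      val replace = v != null &&
        (buf(0) == null || v.asInstanceOf[Long] > buf(0).asInstanceOf[Long])
      if (replace) buf(0) = v
    }
  }
}
```

Unlike the declarative style, the aggregate mutates the buffer it is handed, so the caller only needs to decide which method to invoke for each row.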
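TypedImperativeAggregate extends ImperativeAggregate and stores the state held in the aggregation buffer as an arbitrary Java object. Subclasses must implement:
- def createAggregationBuffer(): T: creates the object that stores the state in the aggregation buffer
- def update(buffer: T, input: InternalRow): T: updates the state object with one row of data
- def merge(buffer: T, input: T): T: merges the state objects of several aggregation buffers
- def eval(buffer: T): Any: returns the final aggregation result from the object in the aggregation buffer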
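A small model of the typed variant, assuming nothing beyond the four method names above: `ToyTypedAggregate` is a hypothetical trait with `Seq[Any]` in place of InternalRow, and `ToyCountDistinct` is an invented example whose buffer state is a `java.util.HashSet`.

```scala
object TypedSketch {
  type State = java.util.HashSet[Any]

  // Hypothetical trait mirroring the four methods above; input rows are plain Seq[Any].
  trait ToyTypedAggregate[T] {
    def createAggregationBuffer(): T
    def update(buffer: T, input: Seq[Any]): T
    def merge(buffer: T, input: T): T
    def eval(buffer: T): Any
  }

  // Number of distinct values in column 0, with a mutable Java HashSet as the buffer state.
  object ToyCountDistinct extends ToyTypedAggregate[State] {
    def createAggregationBuffer(): State = new java.util.HashSet[Any]()

    def update(buffer: State, input: Seq[Any]): State = {
      buffer.add(input.head)
      buffer
    }

    def merge(buffer: State, input: State): State = {
      buffer.addAll(input)
      buffer
    }

    def eval(buffer: State): Any = buffer.size().toLong
  }
}
```

The state here is a set, which would be awkward to express as fixed buffer columns. That is the kind of case where storing an arbitrary object in the buffer helps.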
AggregateMode
AggregateMode values fall into two groups:
- Modes that update the aggregation buffer with one row of data: Partial, Complete
- Modes that merge aggregation buffers: PartialMerge, Final
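As mentioned above, the processRow function returned by AggregationIterator.generateProcessRow handles each row of data. generateProcessRow assembles the logic of processRow from the AggregateMode of each aggregate expression and the type of its aggregate function. If the AggregateMode is Partial or Complete, the two InternalRows passed to processRow are the aggregation buffer and one row of input data. If the AggregateMode is PartialMerge or Final, both are aggregation buffers. For each AggregateMode, processRow calls into the aggregate function according to its type: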
- Partial, Complete
  - DeclarativeAggregate: process the data with the MutableProjection compiled from updateExpressions
  - ImperativeAggregate: process the data by calling the update method
- PartialMerge, Final
  - DeclarativeAggregate: process the data with the MutableProjection compiled from mergeExpressions
  - ImperativeAggregate: process the data by calling the merge method
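The mode-based dispatch can be sketched as follows. This is a simplified model, not the Spark implementation: `Mode`, `ToyFunction`, and `sum` are hypothetical, rows are `Array[Long]`, and the declarative/imperative distinction is collapsed so that every function simply exposes an `update` and a `merge` step. What it keeps is the shape of generateProcessRow: choose a step per function from its mode once, then return a single `(buffer, other) => Unit` function.

```scala
object ProcessRowSketch {
  type Row = Array[Long]

  sealed trait Mode
  case object Partial extends Mode
  case object PartialMerge extends Mode
  case object Final extends Mode
  case object Complete extends Mode

  // Hypothetical aggregate function: `update` folds an input row into the buffer,
  // `merge` folds another buffer into it.
  final case class ToyFunction(
      update: (Row, Row) => Unit,
      merge: (Row, Row) => Unit)

  // Pick update or merge per function from its mode, then return one combined processRow.
  def generateProcessRow(functions: Seq[(ToyFunction, Mode)]): (Row, Row) => Unit = {
    val steps: Seq[(Row, Row) => Unit] = functions.map {
      case (f, Partial | Complete)   => f.update
      case (f, PartialMerge | Final) => f.merge
    }
    (buffer: Row, other: Row) => steps.foreach(step => step(buffer, other))
  }

  // Sum of input column 0, stored in buffer slot `slot`.
  def sum(slot: Int): ToyFunction = ToyFunction(
    (buf: Row, row: Row) => buf(slot) += row(0),
    (buf: Row, in: Row) => buf(slot) += in(slot))
}
```

The mode is inspected only while processRow is being built, so the per-row path contains no branching on AggregateMode.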