Overview
Preprocessing in the Optimizer
When a query computes DISTINCT aggregates over multiple distinct column sets, the Optimizer applies the RewriteDistinctAggregates rule. The rule expands the multi-distinct aggregation by inserting an Expand operator: the non-distinct aggregate columns and each distinct column set are assigned to separate groups (say N groups), each group producing one row tagged with a group id, so every input row is expanded into N rows. Two layers of Aggregate operators then process the expanded data: the first layer aggregates by the groups just described, and the second layer aggregates those partial results. The example from the RewriteDistinctAggregates scaladoc illustrates this (the snippet assumes import spark.implicits._ and org.apache.spark.sql.functions._):
val data = Seq(
  ("a", "ca1", "cb1", 10),
  ("a", "ca1", "cb2", 5),
  ("b", "ca1", "cb1", 13))
  .toDF("key", "cat1", "cat2", "value")
data.createOrReplaceTempView("data")

val agg = data.groupBy($"key")
  .agg(
    countDistinct($"cat1").as("cat1_cnt"),
    countDistinct($"cat2").as("cat2_cnt"),
    sum($"value").as("total"))
The original logical plan:
Aggregate(
  key = ['key]
  functions = [
    COUNT(DISTINCT 'cat1),
    COUNT(DISTINCT 'cat2),
    sum('value)]
  output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  LocalTableScan [...]
The rewritten logical plan:
Aggregate(
  key = ['key]
  functions = [
    count(if (('gid = 1)) 'cat1 else null),
    count(if (('gid = 2)) 'cat2 else null),
    first(if (('gid = 0)) 'total else null) ignore nulls]
  output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  Aggregate(
    key = ['key, 'cat1, 'cat2, 'gid]
    functions = [sum('value)]
    output = ['key, 'cat1, 'cat2, 'gid, 'total])
    Expand(
      projections = [
        ('key, null, null, 0, cast('value as bigint)),
        ('key, 'cat1, null, 1, null),
        ('key, null, 'cat2, 2, null)]
      output = ['key, 'cat1, 'cat2, 'gid, 'value])
      LocalTableScan [...]
Note how correctness is preserved: the inner Aggregate groups by ('key, 'cat1, 'cat2, 'gid), which de-duplicates each distinct column within its group, so the outer Aggregate can count the remaining non-null values per gid with a plain, non-distinct count. When there is only a single distinct column group, the rule leaves the plan unchanged.
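To see the rewrite in action, the optimized logical plan of the example above can be inspected directly (a usage sketch; the exact output format varies by Spark version):

// Print the optimized logical plan, which should contain the Expand operator
// inserted by RewriteDistinctAggregates (output abbreviated, version-dependent).
println(agg.queryExecution.optimizedPlan)
// or print logical plans alongside the physical plan:
agg.explain(true)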
Conversion to a physical plan
SparkPlanner's Aggregation strategy converts the logical plan into a physical plan. Its core code:
object Aggregation extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalAggregation(
        groupingExpressions, aggregateExpressions, resultExpressions, child) =>
      val (functionsWithDistinct, functionsWithoutDistinct) = ...
      val aggregateOperator =
        if (functionsWithDistinct.isEmpty) {
          aggregate.AggUtils.planAggregateWithoutDistinct(...)
        } else {
          aggregate.AggUtils.planAggregateWithOneDistinct(...)
        }
      aggregateOperator
    case _ => Nil
  }
}
PhysicalAggregation destructures the logical Aggregate into grouping expressions, aggregate expressions, and result expressions; the actual construction of the physical Aggregate operators is delegated to AggUtils.
AggUtils
AggUtils distinguishes two cases: aggregation without distinct, and aggregation with exactly one distinct column group (RewriteDistinctAggregates has already rewritten multi-distinct aggregations into non-distinct ones).
Without distinct
A logical Aggregate without distinct is converted into 2 physical Aggregates:
- 1. Partial aggregation
- 2. Final aggregation
def planAggregateWithoutDistinct(
    groupingExpressions: Seq[NamedExpression],
    aggregateExpressions: Seq[AggregateExpression],
    resultExpressions: Seq[NamedExpression],
    child: SparkPlan): Seq[SparkPlan] = {
  // Check if we can use HashAggregate.
  // 1. Create an Aggregate Operator for partial aggregations.
  val groupingAttributes = groupingExpressions.map(_.toAttribute)
  val partialAggregateExpressions = aggregateExpressions.map(_.copy(mode = Partial))
  val partialAggregateAttributes = ...
  val partialResultExpressions = ...
  val partialAggregate = createAggregate(
    ...
    child = child)

  // 2. Create an Aggregate Operator for final aggregations.
  val finalAggregateExpressions = aggregateExpressions.map(_.copy(mode = Final))
  // The attributes of the final aggregation buffer, which is presented as input to the result
  // projection:
  val finalAggregateAttributes = ...
  val finalAggregate = createAggregate(
    ...
    child = partialAggregate)

  finalAggregate :: Nil
}
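For example, a plain sum with group by compiles to exactly these two physical Aggregates separated by a shuffle. A hedged illustration using the earlier data DataFrame (attribute ids, casts, and formatting differ across Spark versions):

data.groupBy($"key").agg(sum($"value")).explain()
// == Physical Plan == (abbreviated)
// *(2) HashAggregate(keys=[key], functions=[sum(value)])               <- Final
// +- Exchange hashpartitioning(key, 200)
//    +- *(1) HashAggregate(keys=[key], functions=[partial_sum(value)]) <- Partial
//       +- LocalTableScan [key, value]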
With exactly one distinct
A logical Aggregate with exactly one distinct column group is converted into 4 physical Aggregates:
- 1. Partial aggregation, grouping by the original group-by columns plus the distinct columns
- 2. PartialMerge aggregation, grouping by the original group-by columns plus the distinct columns
- 3. PartialMerge aggregation, grouping by the original group-by columns only (the distinct functions run in Partial mode here)
- 4. Final aggregation
def planAggregateWithOneDistinct(
    groupingExpressions: Seq[NamedExpression],
    functionsWithDistinct: Seq[AggregateExpression],
    functionsWithoutDistinct: Seq[AggregateExpression],
    resultExpressions: Seq[NamedExpression],
    child: SparkPlan): Seq[SparkPlan] = {
  // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one
  // DISTINCT aggregate function, all of those functions will have the same column expressions.
  // For example, it would be valid for functionsWithDistinct to be
  // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
  // disallowed because those two distinct aggregates have different column expressions.
  ...

  // 1. Create an Aggregate Operator for partial aggregations.
  val partialAggregate: SparkPlan = {
    ...
    // We will group by the original grouping expression, plus an additional expression for the
    // DISTINCT column. For example, for AVG(DISTINCT value) GROUP BY key, the grouping
    // expressions will be [key, value].
    createAggregate(
      groupingExpressions = groupingExpressions ++ namedDistinctExpressions, // add the distinct columns to the grouping columns
      ...,
      child = child)
  }

  // 2. Create an Aggregate Operator for partial merge aggregations.
  val partialMergeAggregate: SparkPlan = {
    ...
    createAggregate(
      ...,
      groupingExpressions = groupingAttributes ++ distinctAttributes, // add the distinct columns to the grouping columns
      ...,
      child = partialAggregate)
  }

  // 3. Create an Aggregate operator for partial aggregation (for distinct)
  ...
  val partialDistinctAggregate: SparkPlan = {
    ...
    createAggregate(
      groupingExpressions = groupingAttributes,
      ...,
      child = partialMergeAggregate)
  }

  // 4. Create an Aggregate Operator for the final aggregation.
  val finalAndCompleteAggregate: SparkPlan = {
    ...
    createAggregate(
      ...,
      groupingExpressions = groupingAttributes,
      ...,
      child = partialDistinctAggregate)
  }

  finalAndCompleteAggregate :: Nil
}
Together with the RewriteDistinctAggregates rule, this shows that Spark optimizes distinct aggregation by rewriting it into group-by computations.
Aggregate physical plans
The choice and construction of the physical plan node is handled by the AggUtils.createAggregate method. There are three physical implementations of aggregation:
HashAggregateExec
ObjectHashAggregateExec
SortAggregateExec
HashAggregateExec
HashAggregateExec aggregates and emits data through TungstenAggregationIterator, which internally performs hash-based aggregation with the hash-table implementation UnsafeFixedWidthAggregationMap (hereafter "hashmap").
- If the data is small enough to fit entirely in the hashmap, the hashmap contents are emitted directly.
- If the data is large and the hashmap cannot acquire more memory (it is full), its entries are sorted and spilled to disk; the emptied hashmap is then reused to aggregate new input. These steps repeat until the input is exhausted, after which the spilled sorted runs are merged with a sort-based aggregation and emitted. In other words, when the data does not fit in memory, hash aggregation effectively degrades into sort aggregation, as the sketch below illustrates.
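The following is a minimal, self-contained sketch of this hash-then-spill-then-merge idea, not Spark's actual code: the entry-count trigger stands in for Spark's memory-acquisition failure, and a simple in-memory merge replaces the external sorter.

import scala.collection.mutable

// Hypothetical sketch: sum values by key with a bounded hash map, spilling
// sorted runs when "memory" (here: an entry limit) is exhausted, then
// finishing with sort-based aggregation over the sorted runs.
object HashAggFallbackSketch {
  def sumByKey(rows: Iterator[(String, Long)], maxEntries: Int): List[(String, Long)] = {
    val spills = mutable.ArrayBuffer.empty[Seq[(String, Long)]]
    var map = mutable.HashMap.empty[String, Long]

    for ((k, v) <- rows) {
      if (!map.contains(k) && map.size >= maxEntries) {
        spills += map.toSeq.sortBy(_._1)          // "memory full": sort and spill partial aggregates
        map = mutable.HashMap.empty[String, Long] // reuse an emptied map for new input
      }
      map(k) = map.getOrElse(k, 0L) + v
    }
    spills += map.toSeq.sortBy(_._1)

    // Sort-based aggregation over the spilled runs: after sorting, equal keys
    // are adjacent, so one linear pass merges the partial sums.
    spills.flatten.sortBy(_._1).foldLeft(List.empty[(String, Long)]) {
      case ((pk, pv) :: tail, (k, v)) if pk == k => (pk, pv + v) :: tail
      case (acc, kv) => kv :: acc
    }.reverse
  }

  def main(args: Array[String]): Unit = {
    val data = Iterator(("a", 1L), ("b", 2L), ("a", 3L), ("c", 4L), ("b", 5L))
    println(sumByKey(data, maxEntries = 2)) // List((a,4), (b,7), (c,4))
  }
}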
ObjectHashAggregateExec
ObjectHashAggregateExec is similar to HashAggregateExec: both are hash-based and both fall back to sort-based aggregation when memory cannot hold all the data. They differ in several ways, per the class's scaladoc:
/*
* A hash-based aggregate operator that supports [[TypedImperativeAggregate]] functions that may
* use arbitrary JVM objects as aggregation states.
*
* Similar to [[HashAggregateExec]], this operator also falls back to sort-based aggregation when
* the size of the internal hash map exceeds the threshold. The differences are:
*
* - It uses safe rows as aggregation buffer since it must support JVM objects as aggregation
* states.
*
* - It tracks entry count of the hash map instead of byte size to decide when we should fall back.
* This is because it's hard to estimate the accurate size of arbitrary JVM objects in a
* lightweight way.
*
* - Whenever fallen back to sort-based aggregation, this operator feeds all of the rest input rows
* into external sorters instead of building more hash map(s) as what [[HashAggregateExec]] does.
* This is because having too many JVM object aggregation states floating there can be dangerous
* for GC.
*
* - CodeGen is not supported yet.
*/
HashAggregateExec performs hash aggregation with UnsafeFixedWidthAggregationMap, whose aggregation buffers (i.e. the intermediate state of the aggregation) may only contain fixed-width, mutable types, such as:
NullType,
BooleanType,
ByteType,
ShortType,
IntegerType,
LongType,
FloatType,
DoubleType,
DateType,
TimestampType,
DecimalType
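The actual gate lives in UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema (Java); a condensed Scala paraphrase:

import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.StructType

// Every buffer field must be a mutable fixed-width type from UnsafeRow's
// point of view; a single non-mutable field disqualifies hash aggregation.
def supportsAggregationBufferSchema(schema: StructType): Boolean =
  schema.fields.forall(f => UnsafeRow.isMutable(f.dataType))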
ObjectHashAggregateExec, in contrast, supports aggregation buffers that are arbitrary Java objects. It performs hash aggregation with ObjectAggregationMap, which in fact holds an object of type java.util.LinkedHashMap[UnsafeRow, InternalRow] storing, for each key, the corresponding aggregation buffer (possibly an arbitrary Java object), i.e. the intermediate aggregation state.
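As a concrete illustration (hedged; plan text varies by Spark version): collect_list is implemented as a TypedImperativeAggregate whose state is a growing buffer of values, so a query using it typically plans an ObjectHashAggregateExec when the flag is enabled:

spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", "true")
Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
  .groupBy($"key").agg(collect_list($"value")).explain()
// == Physical Plan == (abbreviated)
// ObjectHashAggregate(keys=[key], functions=[collect_list(value, ...)])
// +- Exchange hashpartitioning(key, 200)
//    +- ObjectHashAggregate(keys=[key], functions=[partial_collect_list(value, ...)])
//       +- LocalTableScan [key, value]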
SortAggregateExec
SortAggregateExec requires its child's output to be ordered by the grouping columns:
override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
  groupingExpressions.map(SortOrder(_, Ascending)) :: Nil
}
The EnsureRequirements rule therefore inserts a SortExec between SortAggregateExec and its child, so that each partition's data is already sorted when SortAggregateExec executes.
SortAggregateExec performs sort-based aggregation directly with SortBasedAggregationIterator, iterating in order over the output of the child operator (the table being aggregated). Because each partition is sorted, rows with the same grouping key are adjacent, so a single sequential pass is enough to aggregate them.
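A hedged example: max over a string column needs a StringType buffer field, which is not a mutable fixed-width type, and Max is not a TypedImperativeAggregate, so the planner falls back to SortAggregateExec, with the SortExec inserted by EnsureRequirements visible below it (illustrative, version-dependent output):

Seq(("a", "x"), ("a", "y"), ("b", "z")).toDF("key", "name")
  .groupBy($"key").agg(max($"name")).explain()
// == Physical Plan == (abbreviated)
// SortAggregate(key=[key], functions=[max(name)])
// +- *(2) Sort [key ASC NULLS FIRST], false, 0
//    +- Exchange hashpartitioning(key, 200)
//       +- SortAggregate(key=[key], functions=[partial_max(name)])
//          +- *(1) Sort [key ASC NULLS FIRST], false, 0
//             +- LocalTableScan [key, name]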
Selection strategy
Returning to the AggUtils.createAggregate method, the selection strategy for converting a logical Aggregate into a physical plan is:
if (the aggregation buffers of all aggregate functions contain only fixed-width, mutable types):
  HashAggregateExec
else if (spark.sql.execution.useObjectHashAggregateExec=true && all aggregate functions are TypedImperativeAggregate):
  ObjectHashAggregateExec
else:
  SortAggregateExec
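For reference, a condensed paraphrase of the dispatch inside AggUtils.createAggregate (arguments elided; consult the source of your Spark version for the exact code):

private def createAggregate(..., child: SparkPlan): SparkPlan = {
  // Hash aggregation requires every buffer field to be mutable fixed-width.
  val useHash = HashAggregateExec.supportsAggregate(
    aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes))
  if (useHash) {
    HashAggregateExec(...)
  } else {
    // Object hash aggregation requires the flag plus TypedImperativeAggregate functions.
    val objectHashEnabled = child.sqlContext.conf.useObjectHashAggregation
    val useObjectHash = ObjectHashAggregateExec.supportsAggregate(aggregateExpressions)
    if (objectHashEnabled && useObjectHash) {
      ObjectHashAggregateExec(...)
    } else {
      SortAggregateExec(...) // last resort: always applicable
    }
  }
}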
TypedImperativeAggregate:
/**
* Aggregation function which allows arbitrary user-defined java object to be used as internal
* aggregation buffer.
*/
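In user code, the closest public analogue is a typed Aggregator whose buffer is a plain JVM object; in Spark 3.x, registering it via functions.udaf wraps it in an expression that extends TypedImperativeAggregate, making it eligible for ObjectHashAggregateExec. A hedged sketch (the name DistinctItems is hypothetical):

import org.apache.spark.sql.{Encoder, Encoders, functions}
import org.apache.spark.sql.expressions.Aggregator

// The aggregation buffer is an ordinary Scala Set, not a fixed-width row.
object DistinctItems extends Aggregator[String, Set[String], Int] {
  def zero: Set[String] = Set.empty
  def reduce(b: Set[String], a: String): Set[String] = b + a
  def merge(b1: Set[String], b2: Set[String]): Set[String] = b1 ++ b2
  def finish(b: Set[String]): Int = b.size
  def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// usage (assumes a SparkSession named spark):
// spark.udf.register("distinct_items", functions.udaf(DistinctItems))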