Preface
From the previous posts we know that the overall Spark SQL parsing flow looks like this:
- the sqlText is parsed by SqlParser into an Unresolved LogicalPlan;
- the analyzer module binds it against the catalog, producing a Resolved LogicalPlan;
- the optimizer module optimizes the Resolved LogicalPlan, producing an Optimized LogicalPlan;
- the SparkPlanner converts the LogicalPlan into a PhysicalPlan (a SparkPlan);
- prepareForExecution() turns the PhysicalPlan into an executable physical plan;
- execute() runs the executable physical plan.
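Each of the stages above can be inspected directly from a DataFrame's `queryExecution` handle. A minimal sketch, assuming an existing `SparkSession` named `spark` with a table `t` registered (both names are placeholders, not from the original post):

```scala
// Build a query but do not trigger an action yet.
val df = spark.sql("SELECT a FROM t WHERE a > 1")

// Each field below corresponds to one stage of the parsing pipeline:
df.queryExecution.logical       // Unresolved LogicalPlan (output of SqlParser)
df.queryExecution.analyzed      // Resolved LogicalPlan   (output of the analyzer)
df.queryExecution.optimizedPlan // Optimized LogicalPlan  (output of the optimizer)
df.queryExecution.sparkPlan     // PhysicalPlan (SparkPlan)
df.queryExecution.executedPlan  // executable plan, after prepareForExecution()
```

Calling `df.explain(true)` prints all of these plans at once, which is a convenient way to watch a query move through the pipeline.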
The optimizer module in detail
The optimizer and every stage after it run lazily: they only execute once an action is triggered. The optimizer's job is to turn a Resolved LogicalPlan into an Optimized LogicalPlan.
The optimizer distills years of accumulated SQL tuning experience into rewrite rules applied to the syntax tree, such as predicate pushdown, column pruning, and constant folding. Its mechanism closely mirrors the Analyzer's: Optimizer likewise extends RuleExecutor and defines many optimization Rules:
```scala
def batches: Seq[Batch] = {
  // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
  // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
  // However, because we also use the analyzer to canonicalized queries (for view definition),
  // we do not eliminate subqueries or compute current time in the analyzer.
  Batch("Finish Analysis", Once,
    EliminateSubqueryAliases,
    EliminateView,
    ReplaceExpressions,
    ComputeCurrentTime,
    GetCurrentDatabase(sessionCatalog),
    RewriteDistinctAggregates,
    ReplaceDeduplicateWithAggregate) ::
```
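The rules listed above all share one shape: each is a `Rule[LogicalPlan]` that pattern-matches on the tree and rewrites it. Spark even lets you inject your own rule into the optimizer without touching its source, via `spark.experimental.extraOptimizations`. A minimal sketch, assuming Spark 2.x Catalyst APIs; `RemoveMultiplyByOne` is a hypothetical rule name invented here for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A toy rule in the same shape as the built-in ones:
// rewrite `expr * 1` into `expr` wherever it appears in the plan.
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(left, Literal(1, _)) => left
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Register the extra rule; it will run alongside the built-in batches.
spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)
```

After registration, `df.queryExecution.optimizedPlan` for a query containing `a * 1` should show the multiplication removed, the same way built-in rules like constant folding leave their mark on the optimized plan.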