For a query, the SQL text is first parsed and analyzed into a logical plan, returning a LogicalPlan object. This logical plan records the tables, columns, filter predicates, aggregations, and other operations the query involves, but it does not yet account for how the query will actually be executed or how the data is distributed.
Next, this logical plan is handed to the Catalyst optimizer (spark.sessionState.optimizer), a RuleExecutor over LogicalPlan that rewrites the plan rule by rule. The optimizer's output is still a LogicalPlan: an intermediate representation in which extra information, such as the resolved input and output schemas of each operator, is available, but which is not yet a physical plan. Decisions such as whole-stage code generation are made later, during physical planning.
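To make the idea concrete, here is a minimal, self-contained sketch of a logical plan as a tree of operator nodes. The Relation/Filter/Project classes below are toy stand-ins, not Catalyst's actual classes; they only illustrate that a logical plan records what to compute, and what schema each operator produces, without saying anything about how to execute it:

```scala
// Hypothetical toy model of a logical plan: a tree of operators.
// The names mirror Catalyst's shape but these are not Spark's classes.
sealed trait ToyPlan { def output: Seq[String] } // output column names (the "schema")

case class Relation(table: String, cols: Seq[String]) extends ToyPlan {
  def output: Seq[String] = cols                 // a scan produces the table's columns
}
case class Filter(condition: String, child: ToyPlan) extends ToyPlan {
  def output: Seq[String] = child.output         // filtering keeps the child's schema
}
case class Project(cols: Seq[String], child: ToyPlan) extends ToyPlan {
  def output: Seq[String] = cols                 // projection narrows the schema
}

// SELECT name FROM people WHERE age > 21  ==>  Project(Filter(Relation))
val plan: ToyPlan =
  Project(Seq("name"), Filter("age > 21", Relation("people", Seq("name", "age"))))
```

Note how each node can answer schema questions (here via `output`) purely from the tree structure, long before any data is touched.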
def execute(plan: TreeType): TreeType = {
  var curPlan = plan
  val queryExecutionMetrics = RuleExecutor.queryExecutionMeter
  val planChangeLogger = new PlanChangeLogger[TreeType]()
  val tracker: Option[QueryPlanningTracker] = QueryPlanningTracker.get
  val beforeMetrics = RuleExecutor.getCurrentMetrics()
  val enableValidation = SQLConf.get.getConf(SQLConf.PLAN_CHANGE_VALIDATION)

  // Validate the initial input.
  if (Utils.isTesting || enableValidation) {
    validatePlanChanges(plan, plan) match {
      case Some(msg) =>
        val ruleExecutorName = this.getClass.getName.stripSuffix("$")
        throw new SparkException(
          errorClass = "PLAN_VALIDATION_FAILED_RULE_EXECUTOR",
          messageParameters = Map("ruleExecutor" -> ruleExecutorName, "reason" -> msg),
          cause = null)
      case _ =>
    }
  }

  batches.foreach { batch =>
    val batchStartPlan = curPlan
    var iteration = 1
    var lastPlan = curPlan
    var continue = true

    // Run until fix point (or the max number of iterations as specified in the strategy).
    while (continue) {
      curPlan = batch.rules.foldLeft(curPlan) {
        case (plan, rule) =>
          val startTime = System.nanoTime()
          val result = rule(plan)
          val runTime = System.nanoTime() - startTime
          val effective = !result.fastEquals(plan)

          if (effective) {
            queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName)
            queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime)
            planChangeLogger.logRule(rule.ruleName, plan, result)
            // Run the plan changes validation after each rule.
            if (Utils.isTesting || enableValidation) {
              validatePlanChanges(plan, result) match {
                case Some(msg) =>
                  throw new SparkException(
                    errorClass = "PLAN_VALIDATION_FAILED_RULE_IN_BATCH",
                    messageParameters = Map(
                      "rule" -> rule.ruleName,
                      "batch" -> batch.name,
                      "reason" -> msg),
                    cause = null)
                case _ =>
              }
            }
          }
          queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, runTime)
          queryExecutionMetrics.incNumExecution(rule.ruleName)

          // Record timing information using QueryPlanningTracker
          tracker.foreach(_.recordRuleInvocation(rule.ruleName, runTime, effective))

          result
      }
      iteration += 1
      if (iteration > batch.strategy.maxIterations) {
        // Only log if this is a rule that is supposed to run more than once.
        if (iteration != 2) {
          val endingMsg = if (batch.strategy.maxIterationsSetting == null) {
            "."
          } else {
            s", please set '${batch.strategy.maxIterationsSetting}' to a larger value."
          }
          val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}" +
            s"$endingMsg"
          if (Utils.isTesting || batch.strategy.errorOnExceed) {
            throw new RuntimeException(message)
          } else {
            logWarning(message)
          }
        }
        // Check idempotence for Once batches.
        if (batch.strategy == Once &&
            Utils.isTesting && !excludedOnceBatches.contains(batch.name)) {
          checkBatchIdempotence(batch, curPlan)
        }
        continue = false
      }

      if (curPlan.fastEquals(lastPlan)) {
        logTrace(
          s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
        continue = false
      }
      lastPlan = curPlan
    }

    planChangeLogger.logBatch(batch.name, batchStartPlan, curPlan)
  }
  planChangeLogger.logMetrics(RuleExecutor.getCurrentMetrics() - beforeMetrics)

  curPlan
}
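The control flow above — fold every rule in a batch over the current plan, then repeat the whole batch until the plan stops changing (fastEquals) or maxIterations is exceeded — can be mimicked with a toy executor. Here a "plan" is just an Int and a rule is an Int => Int function; the names Batch and execute mirror the real code, but this is only an illustrative sketch, not Spark's implementation:

```scala
// A toy rule rewrites a "plan" (here simply an Int).
type Rule = Int => Int
case class Batch(name: String, maxIterations: Int, rules: Seq[Rule])

// Mirrors the shape of RuleExecutor.execute: fold the batch's rules
// over the plan, and loop until a fixed point or the iteration cap.
def execute(batches: Seq[Batch], plan: Int): Int = {
  var curPlan = plan
  for (batch <- batches) {
    var iteration = 1
    var lastPlan = curPlan
    var continue = true
    while (continue) {
      curPlan = batch.rules.foldLeft(curPlan)((p, rule) => rule(p))
      iteration += 1
      if (iteration > batch.maxIterations) continue = false // iteration cap reached
      if (curPlan == lastPlan) continue = false             // fixed point reached
      lastPlan = curPlan
    }
  }
  curPlan
}

// A "simplification" rule: halve even plans. 40 -> 20 -> 10 -> 5, then stable,
// so the batch stops on its own well before the iteration cap.
val halveEven: Rule = n => if (n % 2 == 0) n / 2 else n
val result = execute(Seq(Batch("Simplify", maxIterations = 100, Seq(halveEven))), 40)
```

The same two stopping conditions as in the real executor are visible here: the plan reaching a fixed point, and the per-batch iteration budget running out.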
The input to spark.sessionState.optimizer.execute is the analyzed logical plan (a LogicalPlan) produced by the analyzer: the logical structure of the SQL query with all tables and columns resolved.

The function returns an optimized LogicalPlan, rewritten by rule batches (for example predicate pushdown, constant folding, and column pruning) to make execution as cheap as possible. Producing the physical plan, a SparkPlan, is a separate, later step performed by the planner: the physical plan fixes concrete execution details such as how data is read and partitioned, and how filtering and sorting are carried out.
Note that spark.sessionState.optimizer.execute only optimizes and transforms the query plan; it does not execute the query. To actually run it, the resulting physical plan must be handed to the execution engine. In practice, the query is wrapped in a DataFrame or Dataset, and calling an action such as show or collect triggers execution and returns or prints the results.
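This laziness can be illustrated without Spark at all. The LazyQuery class below is a hypothetical stand-in for a Dataset: transformations only compose a description of the work, and nothing runs until an action (collect) is called:

```scala
// Hypothetical toy model of lazy query execution (not Spark's API).
final class LazyQuery(compute: () => Seq[Int]) {
  // Transformation: builds a new description, performs no work yet.
  def filter(p: Int => Boolean): LazyQuery = new LazyQuery(() => compute().filter(p))
  // Action: triggers the whole pipeline.
  def collect(): Seq[Int] = compute()
}

var executed = false
val q = new LazyQuery(() => { executed = true; (1 to 10).toList })
  .filter(_ % 2 == 0)
val executedBeforeCollect = executed // still false: no action has run yet
val rows = q.collect()               // now the pipeline actually executes
```

Building `q` touches no data; only `collect()` flips `executed` to true, which is exactly the transformation-versus-action split that DataFrames and Datasets expose.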