How does a SparkSQL string get parsed and executed on a Spark cluster? This article walks through the process, one breakpoint at a time.
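For reference, a minimal driver like the following is what we attach the debugger to (a sketch; the app name and query are made up):

import org.apache.spark.sql.SparkSession

object SqlParseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-parse-demo")
      .getOrCreate()
    // set a breakpoint on the next line and step into SparkSession.sql
    val df = spark.sql("SELECT 1 AS id")
    df.show()
    spark.stop()
  }
}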
1. Breakpoint 1: find the parsing entry point
2. Step into sql()
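Stepping into SparkSession.sql lands on a one-liner (paraphrased from the Spark 2.x source): the text is parsed into a LogicalPlan, which is then wrapped into a DataFrame.

def sql(sqlText: String): DataFrame = {
  Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}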
3. Execute sessionState.sqlParser.parsePlan(sqlText)
sessionState
Holds all session-specific state for a given [[SparkSession]].
sqlParser is an interface (ParserInterface). AbstractSqlParser implements it, and SparkSqlParser and CatalystSqlParser in turn extend AbstractSqlParser.
The parsePlan method: SparkSqlParser does not define parsePlan itself; it is found in the parent class AbstractSqlParser:
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>
val position = Origin(None, None)
throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
}
}
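You can watch parsePlan produce an unresolved logical plan directly from a spark-shell; CatalystSqlParser is another AbstractSqlParser subclass that is handy for this (internal API, so subject to change):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

val plan = CatalystSqlParser.parsePlan("SELECT a, b FROM t WHERE a > 1")
println(plan.treeString)  // prints an unresolved plan: 'Project over 'Filter over 'UnresolvedRelation `t`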
AbstractSqlParser.parsePlan() calls parse()(), which its subclass SparkSqlParser overrides as follows:
protected override def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
super.parse(substitutor.substitute(command))(toResult)
}
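What the substitutor step buys you: substitutor is a VariableSubstitution over the session conf, so ${var} references are expanded before the text ever reaches the lexer. A hedged sketch (assuming the default spark.sql.variable.substitute=true):

spark.sql("SET myvar=42")
spark.sql("SELECT ${myvar} AS answer").show()  // the parser sees: SELECT 42 AS answer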
As you can see, SparkSqlParser delegates to its parent class's parse()() method, whose source is:
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
logInfo(s"Parsing command: $command")
//the lexer reads characters and turns them into tokens
val lexer = new SqlBaseLexer(new ANTLRNoCaseStringStream(command))
lexer.removeErrorListeners()
lexer.addErrorListener(ParseErrorListener)
val tokenStream = new CommonTokenStream(lexer)
//the parser analyzes the tokens' syntactic meaning and their relationship to the surrounding context
val parser = new SqlBaseParser(tokenStream) //SqlBaseParser is generated by ANTLR4
parser.addParseListener(PostProcessor)
parser.removeErrorListeners()
parser.addErrorListener(ParseErrorListener)
try {
try {
// first, try parsing with potentially faster SLL mode
parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
toResult(parser)
}
catch {
case e: ParseCancellationException =>
// if we fail, parse with LL mode
tokenStream.reset() // rewind input stream
parser.reset()
// Try Again.
parser.getInterpreter.setPredictionMode(PredictionMode.LL)
toResult(parser)
}
}
catch {
case e: ParseException if e.command.isDefined =>
throw e
case e: ParseException =>
throw e.withCommand(command)
case e: AnalysisException =>
val position = Origin(e.line, e.startPosition)
throw new ParseException(Option(command), e.message, position, position)
}
}
The two parameter lists of parse:
1. the SQL command string to parse;
2. a function taking a SqlBaseParser as input, whose output type is decided by the calling code.
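A toy sketch of the same curried shape, to make the two parameter lists concrete (all names here are made up):

def runWithParser[T](command: String)(toResult: String => T): T = {
  val parser = command.trim // stand-in for building a SqlBaseParser
  toResult(parser)
}

val len: Int = runWithParser("SELECT 1") { p => p.length } // output type chosen by the caller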
4. The toResult() call inside parse() brings us back to the closure from parsePlan():
parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>
val position = Origin(None, None)
throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
5. astBuilder is of type AstBuilder, which implements the ParseTreeVisitor interface; that interface defines the basic notion of a parse-tree visitor.
6. Unpacking astBuilder.visitSingleStatement(parser.singleStatement())
parser.singleStatement() (defined in SqlBaseParser) returns a SingleStatementContext:
//the return type is SingleStatementContext
public final SingleStatementContext singleStatement() throws RecognitionException {
//@param _ctx : protected ParserRuleContext, the {@link ParserRuleContext} for the rule currently being executed
//@param getState() : returns _stateNumber, the ATN state number recording which state invoked this rule context
SingleStatementContext _localctx = new SingleStatementContext(_ctx, getState());
//always called by the generated parser on rule entry; read the field {@link #_localctx} to get the current context
enterRule(_localctx, 0, RULE_singleStatement);
/*
public void enterRule(ParserRuleContext localctx, int state, int ruleIndex) {
//i.e. this just sets _stateNumber = 0 here
setState(state);
_ctx = localctx;
//start is a public Token field; _input is a TokenStream
_ctx.start = _input.LT(1);
//if we have a parent context, add the current context to the parent's children
if (_buildParseTrees) addContextToParseTree();
//notify all parse listeners of the enter-rule event
if ( _parseListeners != null) triggerEnterRuleEvent();
}
*/
try {
//if we have a new localctx, make sure it replaces the existing ctx, the previous child of the parse tree
enterOuterAlt(_localctx, 1);
{
//190 and 191 below are ATN state numbers generated by ANTLR, not token types
setState(190);
//statement() builds and returns the StatementContext
statement();
setState(191);
//the statement must be followed by end of input
match(EOF);
}
}
catch (RecognitionException re) {
_localctx.exception = re;
_errHandler.reportError(this, re);
_errHandler.recover(this, re);
}
finally {
exitRule();
}
//return the context
return _localctx;
}
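For intuition, a hedged sketch of poking at the returned context with generic ANTLR tree accessors (assuming a fresh lexer/parser built exactly as in the parse() body above):

val ctx = parser.singleStatement()
println(ctx.getText)             // the matched text, tokens concatenated
println(ctx.getChildCount)       // 2: the statement subtree plus EOF
println(ctx.statement().getText) // just the statement subtree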
astBuilder.visitSingleStatement()
override def visitSingleStatement(ctx: SingleStatementContext): LogicalPlan = withOrigin(ctx) {
//ctx.statement returns the StatementContext
//visit is declared to return the visitor's generic result type (AnyRef, since AstBuilder extends SqlBaseBaseVisitor[AnyRef])
//asInstanceOf casts the returned object to LogicalPlan
visit(ctx.statement).asInstanceOf[LogicalPlan]
}
Question: how can the result be cast to a class it seemingly has nothing to do with? Because asInstanceOf is a runtime cast, not a conversion: the object that visit returns here actually is a LogicalPlan at runtime (AstBuilder's visit methods build LogicalPlan nodes), so the cast succeeds.
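A minimal demonstration of that point:

// asInstanceOf succeeds whenever the object's actual runtime class
// is compatible with the target type, regardless of the static type.
trait Plan
class MyPlan extends Plan

val x: AnyRef = new MyPlan          // static type AnyRef, runtime class MyPlan
val p: Plan = x.asInstanceOf[Plan]  // ok: the object really is a Plan
// new Object().asInstanceOf[Plan] would throw ClassCastException at runtime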
7. Step into Dataset.ofRows(sparkSession, logicalPlan)
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
val qe = sparkSession.sessionState.executePlan(logicalPlan)
qe.assertAnalyzed()
new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
7.1 A side note: clicking into SessionState, you will not find an executePlan implementation there; it is wired up by a builder when the SparkSession is created. The code behind executePlan is:
protected def createQueryExecution: LogicalPlan => QueryExecution = { plan =>
new QueryExecution(session, plan)
}
7.2 QueryExecution is the primary workflow for executing relational queries in Spark. It is designed to let developers easily inspect the intermediate stages of a query.
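Those intermediate stages are easy to see from user code via Dataset.queryExecution (stage names per Spark 2.x; `spark` is the session from the opening sketch):

val qe = spark.sql("SELECT 1 AS id").queryExecution
println(qe.logical)        // the unresolved plan the parser produced
println(qe.analyzed)       // after the Analyzer
println(qe.optimizedPlan)  // after the Optimizer
println(qe.sparkPlan)      // the chosen physical plan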
7.3 qe.assertAnalyzed()
def assertAnalyzed(): Unit = {
// call the analyzer outside the try block, to avoid calling it again from inside the catch block below
analyzed
try {
sparkSession.sessionState.analyzer.checkAnalysis(analyzed)
} catch {
case e: AnalysisException =>
val ae = new AnalysisException(e.message, e.line, e.startPosition, Option(analyzed))
ae.setStackTrace(e.getStackTrace)
throw ae
}
}
7.3.1 The analyzed lazy val
lazy val analyzed: LogicalPlan = {
SparkSession.setActiveSession(sparkSession)
//note: `logical` here is the plan that was passed in when the QueryExecution was constructed
sparkSession.sessionState.analyzer.execute(logical)
}
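Because analyzed is a lazy val, the Analyzer runs at most once per QueryExecution and later reads reuse the cached plan; this is plain Scala lazy-val semantics, as a toy example shows:

lazy val expensive: Int = { println("computed"); 42 }
expensive // prints "computed" once, returns 42
expensive // returns the cached 42 silently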
7.3.1.1 analyzer provides a logical query plan analyzer, which uses information in [[SessionCatalog]] and [[FunctionRegistry]] to turn [[UnresolvedAttribute]]s and [[UnresolvedRelation]]s into fully typed objects.
7.3.1.2 analyzer.execute(logical)
analyzer calls the execute method of its parent class RuleExecutor, which runs the rule batches against the logical plan and returns the result:
//Executes the batches of rules defined by the subclass. The batches are executed serially using the defined execution strategy. Within each batch, rules are also executed serially.
def execute(plan: TreeType): TreeType = {
var curPlan = plan
batches.foreach { batch =>
val batchStartPlan = curPlan
var iteration = 1
var lastPlan = curPlan
var continue = true
// Run until fix point (or the max number of iterations as specified in the strategy).
while (continue) {
curPlan = batch.rules.foldLeft(curPlan) {
case (plan, rule) =>
val startTime = System.nanoTime()
val result = rule(plan)
val runTime = System.nanoTime() - startTime
RuleExecutor.timeMap.addAndGet(rule.ruleName, runTime)
if (!result.fastEquals(plan)) {
logTrace(
s"""
|=== Applying Rule ${rule.ruleName} ===
|${sideBySide(plan.treeString, result.treeString).mkString("\n")}
""".stripMargin)
}
result
}
iteration += 1
if (iteration > batch.strategy.maxIterations) {
// Only log if this is a rule that is supposed to run more than once.
if (iteration != 2) {
val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}"
if (Utils.isTesting) {
throw new TreeNodeException(curPlan, message, null)
} else {
logWarning(message)
}
}
continue = false
}
if (curPlan.fastEquals(lastPlan)) {
logTrace(
s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
continue = false
}
lastPlan = curPlan
}
if (!batchStartPlan.fastEquals(curPlan)) {
logDebug(
s"""
|=== Result of Batch ${batch.name} ===
|${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")}
""".stripMargin)
} else {
logTrace(s"Batch ${batch.name} has no effect.")
}
}
curPlan
}
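Stripped of batches, tracing and strategies, the core of execute is a fixed-point loop; a self-contained toy version (illustrative only, not Spark code):

// apply rules repeatedly until the plan stops changing or maxIterations is hit
def fixedPoint[T](plan: T, rules: Seq[T => T], maxIterations: Int): T = {
  var cur = plan
  var last: Option[T] = None
  var i = 0
  while (!last.contains(cur) && i < maxIterations) {
    last = Some(cur)
    cur = rules.foldLeft(cur)((p, rule) => rule(p))
    i += 1
  }
  cur
}

// "rules" over strings, reaching a fixed point after two iterations:
fixedPoint("  select  1 ", Seq[String => String](_.trim, _.replaceAll(" +", " ")), 10)
// => "select 1"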
7.3.2 sparkSession.sessionState.analyzer.checkAnalysis(analyzed)
analyzer calls the checkAnalysis method of its parent trait CheckAnalysis (abstract), whose checks are transformed and ordered so that the first possible failure is caught, instead of the end result of a cascade of resolution failures:
def checkAnalysis(plan: LogicalPlan): Unit = {
plan.foreachUp {
case p if p.analyzed => // Skip already analyzed sub-plans
case u: UnresolvedRelation =>
u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}")
case operator: LogicalPlan =>
operator transformExpressionsUp {
case a: Attribute if !a.resolved =>
val from = operator.inputSet.map(_.name).mkString(", ")
a.failAnalysis(s"cannot resolve '${a.sql}' given input columns: [$from]")
case e: Expression if e.checkInputDataTypes().isFailure =>
e.checkInputDataTypes() match {
case TypeCheckResult.TypeCheckFailure(message) =>
e.failAnalysis(
s"cannot resolve '${e.sql}' due to data type mismatch: $message")
}
case c: Cast if !c.resolved =>
failAnalysis(
s"invalid cast from ${c.child.dataType.simpleString} to ${c.dataType.simpleString}")
case g: Grouping =>
failAnalysis("grouping() can only be used with GroupingSets/Cube/Rollup")
case g: GroupingID =>
failAnalysis("grouping_id() can only be used with GroupingSets/Cube/Rollup")
case w @ WindowExpression(AggregateExpression(_, _, true, _), _) =>
failAnalysis(s"Distinct window functions are not supported: $w")
case w @ WindowExpression(_: OffsetWindowFunction, WindowSpecDefinition(_, order,
SpecifiedWindowFrame(frame,
FrameBoundary(l),
FrameBoundary(h))))
if order.isEmpty || frame != RowFrame || l != h =>
failAnalysis("An offset window function can only be evaluated in an ordered " +
s"row-based window frame with a single offset: $w")
case w @ WindowExpression(e, s) =>
// Only allow window functions with an aggregate expression or an offset window
// function.
e match {
case _: AggregateExpression | _: OffsetWindowFunction | _: AggregateWindowFunction =>
case _ =>
failAnalysis(s"Expression '$e' not supported within a window function.")
}
// Make sure the window specification is valid.
s.validate match {
case Some(m) =>
failAnalysis(s"Window specification $s is not valid because $m")
case None => w
}
case s @ ScalarSubquery(query, conditions, _) =>
checkAnalysis(query)
// If no correlation, the output must be exactly one column
if (conditions.isEmpty && query.output.size != 1) {
failAnalysis(
s"Scalar subquery must return only one column, but got ${query.output.size}")
} else if (conditions.nonEmpty) {
def checkAggregate(agg: Aggregate): Unit = {
// Make sure correlated scalar subqueries contain one row for every outer row by
// enforcing that they are aggregates containing exactly one aggregate expression.
// The analyzer has already checked that subquery contained only one output column,
// and added all the grouping expressions to the aggregate.
val aggregates = agg.expressions.flatMap(_.collect {
case a: AggregateExpression => a
})
if (aggregates.isEmpty) {
failAnalysis("The output of a correlated scalar subquery must be aggregated")
}
// SPARK-18504/SPARK-18814: Block cases where GROUP BY columns
// are not part of the correlated columns.
val groupByCols = AttributeSet(agg.groupingExpressions.flatMap(_.references))
// Collect the local references from the correlated predicate in the subquery.
val subqueryColumns = getCorrelatedPredicates(query).flatMap(_.references)
.filterNot(conditions.flatMap(_.references).contains)
val correlatedCols = AttributeSet(subqueryColumns)
val invalidCols = groupByCols -- correlatedCols
// GROUP BY columns must be a subset of columns in the predicates
if (invalidCols.nonEmpty) {
failAnalysis(
"A GROUP BY clause in a scalar correlated subquery " +
"cannot contain non-correlated columns: " +
invalidCols.mkString(","))
}
}
// Skip subquery aliases added by the Analyzer.
// For projects, do the necessary mapping and skip to its child.
def cleanQuery(p: LogicalPlan): LogicalPlan = p match {
case s: SubqueryAlias => cleanQuery(s.child)
case p: Project => cleanQuery(p.child)
case child => child
}
cleanQuery(query) match {
case a: Aggregate => checkAggregate(a)
case Filter(_, a: Aggregate) => checkAggregate(a)
case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail")
}
}
s
case s: SubqueryExpression =>
checkAnalysis(s.plan)
s
}
operator match {
case etw: EventTimeWatermark =>
etw.eventTime.dataType match {
case s: StructType
if s.find(_.name == "end").map(_.dataType) == Some(TimestampType) =>
case _: TimestampType =>
case _ =>
failAnalysis(
s"Event time must be defined on a window or a timestamp, but " +
s"${etw.eventTime.name} is of type ${etw.eventTime.dataType.simpleString}")
}
case f: Filter if f.condition.dataType != BooleanType =>
failAnalysis(
s"filter expression '${f.condition.sql}' " +
s"of type ${f.condition.dataType.simpleString} is not a boolean.")
case Filter(condition, _) if hasNullAwarePredicateWithinNot(condition) =>
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")
case j @ Join(_, _, _, Some(condition)) if condition.dataType != BooleanType =>
failAnalysis(
s"join condition '${condition.sql}' " +
s"of type ${condition.dataType.simpleString} is not a boolean.")
case Aggregate(groupingExprs, aggregateExprs, child) =>
def checkValidAggregateExpression(expr: Expression): Unit = expr match {
case aggExpr: AggregateExpression =>
aggExpr.aggregateFunction.children.foreach { child =>
child.foreach {
case agg: AggregateExpression =>
failAnalysis(
s"It is not allowed to use an aggregate function in the argument of " +
s"another aggregate function. Please use the inner aggregate function " +
s"in a sub-query.")
case other => // OK
}
if (!child.deterministic) {
failAnalysis(
s"nondeterministic expression ${expr.sql} should not " +
s"appear in the arguments of an aggregate function.")
}
}
case e: Attribute if groupingExprs.isEmpty =>
// Collect all [[AggregateExpressions]]s.
val aggExprs = aggregateExprs.filter(_.collect {
case a: AggregateExpression => a
}.nonEmpty)
failAnalysis(
s"grouping expressions sequence is empty, " +
s"and '${e.sql}' is not an aggregate function. " +
s"Wrap '${aggExprs.map(_.sql).mkString("(", ", ", ")")}' in windowing " +
s"function(s) or wrap '${e.sql}' in first() (or first_value) " +
s"if you don't care which value you get."
)
case e: Attribute if !groupingExprs.exists(_.semanticEquals(e)) =>
failAnalysis(
s"expression '${e.sql}' is neither present in the group by, " +
s"nor is it an aggregate function. " +
"Add to group by or wrap in first() (or first_value) if you don't care " +
"which value you get.")
case e if groupingExprs.exists(_.semanticEquals(e)) => // OK
case e => e.children.foreach(checkValidAggregateExpression)
}
def checkValidGroupingExprs(expr: Expression): Unit = {
if (expr.find(_.isInstanceOf[AggregateExpression]).isDefined) {
failAnalysis(
"aggregate functions are not allowed in GROUP BY, but found " + expr.sql)
}
// Check if the data type of expr is orderable.
if (!RowOrdering.isOrderable(expr.dataType)) {
failAnalysis(
s"expression ${expr.sql} cannot be used as a grouping expression " +
s"because its data type ${expr.dataType.simpleString} is not an orderable " +
s"data type.")
}
if (!expr.deterministic) {
// This is just a sanity check, our analysis rule PullOutNondeterministic should
// already pull out those nondeterministic expressions and evaluate them in
// a Project node.
failAnalysis(s"nondeterministic expression ${expr.sql} should not " +
s"appear in grouping expression.")
}
}
groupingExprs.foreach(checkValidGroupingExprs)
aggregateExprs.foreach(checkValidAggregateExpression)
case Sort(orders, _, _) =>
orders.foreach { order =>
if (!RowOrdering.isOrderable(order.dataType)) {
failAnalysis(
s"sorting is not supported for columns of type ${order.dataType.simpleString}")
}
}
case GlobalLimit(limitExpr, _) => checkLimitClause(limitExpr)
case LocalLimit(limitExpr, _) => checkLimitClause(limitExpr)
case p if p.expressions.exists(ScalarSubquery.hasCorrelatedScalarSubquery) =>
p match {
case _: Filter | _: Aggregate | _: Project => // Ok
case other => failAnalysis(
s"Correlated scalar sub-queries can only be used in a Filter/Aggregate/Project: $p")
}
case p if p.expressions.exists(SubqueryExpression.hasInOrExistsSubquery) =>
p match {
case _: Filter => // Ok
case _ => failAnalysis(s"Predicate sub-queries can only be used in a Filter: $p")
}
case _: Union | _: SetOperation if operator.children.length > 1 =>
def dataTypes(plan: LogicalPlan): Seq[DataType] = plan.output.map(_.dataType)
def ordinalNumber(i: Int): String = i match {
case 0 => "first"
case 1 => "second"
case i => s"${i}th"
}
val ref = dataTypes(operator.children.head)
operator.children.tail.zipWithIndex.foreach { case (child, ti) =>
// Check the number of columns
if (child.output.length != ref.length) {
failAnalysis(
s"""
|${operator.nodeName} can only be performed on tables with the same number
|of columns, but the first table has ${ref.length} columns and
|the ${ordinalNumber(ti + 1)} table has ${child.output.length} columns
""".stripMargin.replace("\n", " ").trim())
}
// Check if the data types match.
dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, dt2), ci) =>
// SPARK-18058: we shall not care about the nullability of columns
if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty) {
failAnalysis(
s"""
|${operator.nodeName} can only be performed on tables with the compatible
|column types. ${dt1.catalogString} <> ${dt2.catalogString} at the
|${ordinalNumber(ci)} column of the ${ordinalNumber(ti + 1)} table
""".stripMargin.replace("\n", " ").trim())
}
}
}
case _ => // Fallbacks to the following checks
}
operator match {
case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
val missingAttributes = o.missingInput.mkString(",")
val input = o.inputSet.mkString(",")
failAnalysis(
s"resolved attribute(s) $missingAttributes missing from $input " +
s"in operator ${operator.simpleString}")
case p @ Project(exprs, _) if containsMultipleGenerators(exprs) =>
failAnalysis(
s"""Only a single table generating function is allowed in a SELECT clause, found:
| ${exprs.map(_.sql).mkString(",")}""".stripMargin)
case j: Join if !j.duplicateResolved =>
val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet)
failAnalysis(
s"""
|Failure when resolving conflicting references in Join:
|$plan
|Conflicting attributes: ${conflictingAttributes.mkString(",")}
|""".stripMargin)
case i: Intersect if !i.duplicateResolved =>
val conflictingAttributes = i.left.outputSet.intersect(i.right.outputSet)
failAnalysis(
s"""
|Failure when resolving conflicting references in Intersect:
|$plan
|Conflicting attributes: ${conflictingAttributes.mkString(",")}
""".stripMargin)
case e: Except if !e.duplicateResolved =>
val conflictingAttributes = e.left.outputSet.intersect(e.right.outputSet)
failAnalysis(
s"""
|Failure when resolving conflicting references in Except:
|$plan
|Conflicting attributes: ${conflictingAttributes.mkString(",")}
""".stripMargin)
// TODO: although map type is not orderable, technically map type should be able to be
// used in equality comparison, remove this type check once we support it.
case o if mapColumnInSetOperation(o).isDefined =>
val mapCol = mapColumnInSetOperation(o).get
failAnalysis("Cannot have map type columns in DataFrame which calls " +
s"set operations(intersect, except, etc.), but the type of column ${mapCol.name} " +
"is " + mapCol.dataType.simpleString)
case o if o.expressions.exists(!_.deterministic) &&
!o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
!o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
// The rule above is used to check Aggregate operator.
failAnalysis(
s"""nondeterministic expressions are only allowed in
|Project, Filter, Aggregate or Window, found:
| ${o.expressions.map(_.sql).mkString(",")}
|in operator ${operator.simpleString}
""".stripMargin)
case _: UnresolvedHint =>
throw new IllegalStateException(
"Internal error: logical hint operator should have been removed during analysis")
case _ => // Analysis successful!
}
}
extendedCheckRules.foreach(_(plan))
plan.foreachUp {
case o if !o.resolved => failAnalysis(s"unresolved operator ${o.simpleString}")
case _ =>
}
//Mark this plan as analyzed.
plan.foreach(_.setAnalyzed())
}
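From the user's side, a checkAnalysis failure surfaces as an AnalysisException long before any job runs. For example (illustrative, reusing the `spark` session from the opening sketch):

import org.apache.spark.sql.AnalysisException

spark.range(3).toDF("id").createOrReplaceTempView("t")
try {
  spark.sql("SELECT no_such_col FROM t").show()
} catch {
  case e: AnalysisException =>
    // e.g. cannot resolve '`no_such_col`' given input columns: [id]
    println(e.getMessage)
}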