How does a SparkSQL string get parsed and executed on a Spark cluster? This article walks through the process, one breakpoint at a time.
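For reference, a minimal driver like the following is what we attach the debugger to (a sketch; the app name and query are made up):

import org.apache.spark.sql.SparkSession

object SqlParseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-parse-demo")
      .getOrCreate()
    // set a breakpoint on the next line and step into SparkSession.sql
    val df = spark.sql("SELECT 1 AS id")
    df.show()
    spark.stop()
  }
}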
1. Breakpoint 1: find the parsing entry point
2. Step into sql()
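Stepping into SparkSession.sql lands on a one-liner (paraphrased from the Spark 2.x source): the text is parsed into a LogicalPlan, which is then wrapped into a DataFrame.

def sql(sqlText: String): DataFrame = {
  Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}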
3. Execute sessionState.sqlParser.parsePlan(sqlText)
sessionState
Holds all session-specific state for a given [[SparkSession]].
sqlParser is an interface (ParserInterface). AbstractSqlParser implements it, and SparkSqlParser and CatalystSqlParser in turn extend AbstractSqlParser.
The parsePlan method: SparkSqlParser does not define parsePlan itself; it is found in the parent class AbstractSqlParser:
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>
val position = Origin(None, None)
throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
}
}
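You can watch parsePlan produce an unresolved logical plan directly from a spark-shell; CatalystSqlParser is another AbstractSqlParser subclass that is handy for this (internal API, so subject to change):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

val plan = CatalystSqlParser.parsePlan("SELECT a, b FROM t WHERE a > 1")
println(plan.treeString)  // prints an unresolved plan: 'Project over 'Filter over 'UnresolvedRelation `t`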
AbstractSqlParser.parsePlan() calls parse()(), which its subclass SparkSqlParser overrides as follows:
protected override def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
super.parse(substitutor.substitute(command))(toResult)
}
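What the substitutor step buys you: substitutor is a VariableSubstitution over the session conf, so ${var} references are expanded before the text ever reaches the lexer. A hedged sketch (assuming the default spark.sql.variable.substitute=true):

spark.sql("SET myvar=42")
spark.sql("SELECT ${myvar} AS answer").show()  // the parser sees: SELECT 42 AS answer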
As you can see, SparkSqlParser delegates to its parent class's parse()() method, whose source is:
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
logInfo(s"Parsing command: $command")
//the lexer reads characters and turns them into tokens
val lexer = new SqlBaseLexer(new ANTLRNoCaseStringStream(command))
lexer.removeErrorListeners()
lexer.addErrorListener(ParseErrorListener)
val tokenStream = new CommonTokenStream(lexer)
//the parser analyzes the tokens' syntactic meaning and their relationship to the surrounding context
val parser = new SqlBaseParser(tokenStream) //SqlBaseParser is generated by ANTLR4
parser.addParseListener(PostProcessor)
parser.removeErrorListeners()
parser.addErrorListener(ParseErrorListener)
try {
try {
// first, try parsing with potentially faster SLL mode
parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
toResult(parser)
}
catch {
case e: ParseCancellationException =>
// if we fail, parse with LL mode
tokenStream.reset() // rewind input stream
parser.reset()
// Try Again.
parser.getInterpreter.setPredictionMode(PredictionMode.LL)
toResult(parser)
}
}
catch {
case e: ParseException if e.command.isDefined =>
throw e
case e: ParseException =>
throw e.withCommand(command)
case e: AnalysisException =>
val position = Origin(e.line, e.startPosition)
throw new ParseException(Option(command), e.message, position, position)
}
}
The two parameter lists of parse:
1. the SQL command string to parse;
2. a function taking a SqlBaseParser as input, whose output type is decided by the calling code.
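A toy sketch of the same curried shape, to make the two parameter lists concrete (all names here are made up):

def runWithParser[T](command: String)(toResult: String => T): T = {
  val parser = command.trim // stand-in for building a SqlBaseParser
  toResult(parser)
}

val len: Int = runWithParser("SELECT 1") { p => p.length } // output type chosen by the caller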
4. The toResult() call inside parse() brings us back to the closure from parsePlan():
parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>
val position = Origin(None, None)
throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
5. astBuilder is of type AstBuilder, which implements the ParseTreeVisitor interface; that interface defines the basic notion of a parse-tree visitor.
6. Unpacking astBuilder.visitSingleStatement(parser.singleStatement())
parser.singleStatement() (defined in SqlBaseParser) returns a SingleStatementContext:
//the return type is SingleStatementContext
public final SingleStatementContext singleStatement() throws RecognitionException {
//@param _ctx : protected ParserRuleContext, the {@link ParserRuleContext} for the rule currently being executed
//@param getState() : returns _stateNumber, the ATN state number recording which state invoked this rule context
SingleStatementContext _localctx = new SingleStatementContext(_ctx, getState());
//always called by the generated parser on rule entry; read the field {@link #_localctx} to get the current context
enterRule(_localctx, 0, RULE_singleStatement);
/*
public void enterRule(ParserRuleContext localctx, int state, int ruleIndex) {
//i.e. this just sets _stateNumber = 0 here
setState(state);
_ctx = localctx;
//start is a public Token field; _input is a TokenStream
_ctx.start = _input.LT(1);
//if we have a parent context, add the current context to the parent's children
if (_buildParseTrees) addContextToParseTree();
//notify all parse listeners of the enter-rule event
if ( _parseListeners != null) triggerEnterRuleEvent();
}
*/
try {
//if we have a new localctx, make sure it replaces the existing ctx, the previous child of the parse tree
enterOuterAlt(_localctx, 1);
{
//190 and 191 below are ATN state numbers generated by ANTLR, not token types
setState(190);
//statement() builds and returns the StatementContext
statement();
setState(191);
//the statement must be followed by end of input
match(EOF);
}
}
catch (RecognitionException re) {
_localctx.exception = re;
_errHandler.reportError(this, re);
_errHandler.recover(this, re);
}
finally {
exitRule();
}
//return the context
return _localctx;
}
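For intuition, a hedged sketch of poking at the returned context with generic ANTLR tree accessors (assuming a fresh lexer/parser built exactly as in the parse() body above):

val ctx = parser.singleStatement()
println(ctx.getText)             // the matched text, tokens concatenated
println(ctx.getChildCount)       // 2: the statement subtree plus EOF
println(ctx.statement().getText) // just the statement subtree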
astBuilder.visitSingleStatement()
override def visitSingleStatement(ctx: SingleStatementContext): LogicalPlan = withOrigin(ctx) {
//ctx.statement returns the StatementContext
//visit is declared to return the visitor's generic result type (AnyRef, since AstBuilder extends SqlBaseBaseVisitor[AnyRef])
//asInstanceOf casts the returned object to LogicalPlan
visit(ctx.statement).asInstanceOf[LogicalPlan]
}
Question: how can the result be cast to a class it seemingly has nothing to do with? Because asInstanceOf is a runtime cast, not a conversion: the object that visit returns here actually is a LogicalPlan at runtime (AstBuilder's visit methods build LogicalPlan nodes), so the cast succeeds.
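A minimal demonstration of that point:

// asInstanceOf succeeds whenever the object's actual runtime class
// is compatible with the target type, regardless of the static type.
trait Plan
class MyPlan extends Plan

val x: AnyRef = new MyPlan          // static type AnyRef, runtime class MyPlan
val p: Plan = x.asInstanceOf[Plan]  // ok: the object really is a Plan
// new Object().asInstanceOf[Plan] would throw ClassCastException at runtime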
7. Step into Dataset.ofRows(sparkSession, logicalPlan)
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
val qe = sparkSession.sessionState.executePlan(logicalPlan)
qe.assertAnalyzed()
new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
7.1 A side note: clicking into SessionState, you will not find an executePlan implementation there; it is wired up by a builder when the SparkSession is created. The code behind executePlan is:
protected def createQueryExecution: LogicalPlan => QueryExecution = { plan =>
new QueryExecution(session, plan)
}
7.2 QueryExecution is the primary workflow for executing relational queries in Spark. It is designed to let developers easily inspect the intermediate stages of a query.
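Those intermediate stages are easy to see from user code via Dataset.queryExecution (stage names per Spark 2.x; `spark` is the session from the opening sketch):

val qe = spark.sql("SELECT 1 AS id").queryExecution
println(qe.logical)        // the unresolved plan the parser produced
println(qe.analyzed)       // after the Analyzer
println(qe.optimizedPlan)  // after the Optimizer
println(qe.sparkPlan)      // the chosen physical plan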
7.3 qe.assertAnalyzed()
def assertAnalyzed(): Unit = {
// call the analyzer outside the try block, to avoid calling it again from inside the catch block below
analyzed
try {
sparkSession.sessionState.analyzer.checkAnalysis(analyzed)
} catch {
case e: AnalysisException =>
val ae = new AnalysisException(e.message, e.line, e.startPosition, Option(analyzed))
ae.setStackTrace(e.getStackTrace)
throw ae
}
}
7.3.1 The analyzed lazy val
lazy val analyzed: LogicalPlan = {
SparkSession.setActiveSession(sparkSession)
//note: `logical` here is the plan that was passed in when the QueryExecution was constructed
sparkSession.sessionState.analyzer.execute(logical)
}
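Because analyzed is a lazy val, the Analyzer runs at most once per QueryExecution and later reads reuse the cached plan; this is plain Scala lazy-val semantics, as a toy example shows:

lazy val expensive: Int = { println("computed"); 42 }
expensive // prints "computed" once, returns 42
expensive // returns the cached 42 silently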
7.3.1.1 analyzer provides a logical query plan analyzer, which uses information in [[SessionCatalog]] and [[FunctionRegistry]] to turn [[UnresolvedAttribute]]s and [[UnresolvedRelation]]s into fully typed objects.
7.3.1.2 analyzer.execute(logical)
analyzer calls the execute method of its parent class RuleExecutor, which runs the rule batches against the logical plan and returns the result:
//Executes the batches of rules defined by the subclass. The batches are executed serially using the defined execution strategy. Within each batch, rules are also executed serially.
def execute(plan: TreeType): TreeType = {
var curPlan = plan
batches.foreach { batch =>
val batchStartPlan = curPlan
var iteration = 1
var lastPlan = curPlan
var continue = true
// Run until fix point (or the max number of iterations as specified in the strategy).
while (continue) {
curPlan = batch.rules.foldLeft(curPlan) {
case (plan, rule) =>
val startTime = System.nanoTime()
val result = rule(plan)
val runTime = System.nanoTime() - startTime
RuleExecutor.timeMap.addAndGet(rule.ruleName, runTime)
if (!result.fastEquals(plan)) {
logTrace(
s"""
|=== Applying Rule ${rule.ruleName} ===
|${sideBySide(plan.treeString, result.treeString).mkString("\n")}
""".stripMargin)
}
result
}
iteration += 1
if (iteration > batch.strategy.maxIterations) {
// Only log if this is a rule that is supposed to run more than once.
if (iteration != 2) {
val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}"
if (Utils.isTesting) {
throw new TreeNodeException(curPlan, message, null)
} else {
logWarning(message)
}
}
continue = false
}
if (curPlan.fastEquals(lastPlan)) {
logTrace(
s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
continue = false
}
lastPlan = curPlan
}
if (!batchStartPlan.fastEquals(curPlan)) {
logDebug(
s"""
|=== Result of Batch ${batch.name} ===
|${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")}
""".stripMargin)
} else {
logTrace(s"Batch ${batch.name} has no effect.")
}
}
curPlan
}
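Stripped of batches, tracing and strategies, the core of execute is a fixed-point loop; a self-contained toy version (illustrative only, not Spark code):

// apply rules repeatedly until the plan stops changing or maxIterations is hit
def fixedPoint[T](plan: T, rules: Seq[T => T], maxIterations: Int): T = {
  var cur = plan
  var last: Option[T] = None
  var i = 0
  while (!last.contains(cur) && i < maxIterations) {
    last = Some(cur)
    cur = rules.foldLeft(cur)((p, rule) => rule(p))
    i += 1
  }
  cur
}

// "rules" over strings, reaching a fixed point after two iterations:
fixedPoint("  select  1 ", Seq[String => String](_.trim, _.replaceAll(" +", " ")), 10)
// => "select 1"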
7.3.2 sparkSession.sessionState.analyzer.checkAnalysis(analyzed)
analyzer calls the checkAnalysis method of its parent trait CheckAnalysis (abstract), whose checks are transformed and ordered so that the first possible failure is caught, instead of the end result of a cascade of resolution failures:
def checkAnalysis(plan: LogicalPlan): Unit = {
plan.foreachUp {
case p if p.analyzed => // Skip already analyzed sub-plans
case u: UnresolvedRelation =>
u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}")
case operator: LogicalPlan =>
operator transformExpressionsUp {
case a: Attribute if !a.resolved =>
val from = operator.inputSet.map(_.name).mkString(", ")
a.failAnalysis(s"cannot resolve '${a.sql}' given input columns: [$from]")
case e: Expression if e.checkInputDataTypes().isFailure =>
e.checkInputDataTypes() match {
case TypeCheckResult.TypeCheckFailure(message) =>
e.failAnalysis(
s"cannot resolve '${e.sql}' due to data type mismatch: $message")
}
case c: Cast if !c.resolved =>
failAnalysis(
s"invalid cast from ${c.child.dataType.simpleString} to ${c.dataType.simpleString}")
case g: Grouping =>
failAnalysis("grouping() can only be used with GroupingSets/Cube/Rollup")
case g: GroupingID =>
failAnalysis("grouping_id() can only be used with GroupingSets/Cube/Rollup")
case w @ WindowExpression(AggregateExpression(_, _, true, _), _) =>
failAnalysis(s"Distinct window functions are not supported: $w")
case w @ WindowExpression(_: OffsetWindowFunction, WindowSpecDefinition(_, order,
SpecifiedWindowFrame(frame,
FrameBoundary(l),
FrameBoundary(h))))
if order.isEmpty || frame != RowFrame || l != h =>
failAnalysis("An offset window function can only be evaluated in an ordered " +
s"row-based window frame with a single offset: $w")
case w @ WindowExpression(e, s) =>
// Only allow window functions with an aggregate expression or an offset window
// function.
e match {
case _: AggregateExpression | _: OffsetWindowFunction | _: AggregateWindowFunction =>
case _ =>
failAnalysis(s"Expression '$e' not supported within a window function.")
}
// Make sure the window specification is valid.
s.validate match {
case Some(m) =>
failAnalysis(s"Window specification $s is not valid because $m")
case None => w
}
case s @ ScalarSubquery(query, conditions, _) =>
checkAnalysis(query)
// If no correlation, the output must be exactly one column
if (conditions.isEmpty && query.output.size != 1) {
failAnalysis(
s"Scalar subquery must return only one column, but got ${query.output.size}")
} else if (conditions.nonEmpty) {
def checkAggregate(agg: Aggregate): Unit = {
// Make sure correlated scalar subqueries contain one row for every outer row by
// enforcing that they are aggregates containing exactly one aggregate expression.
// The analyzer has already checked that subquery contained only one output column,
// and added all the grouping expressions to the aggregate.
val aggregates = agg.expressions.flatMap(_.collect {
case a: AggregateExpression => a
})
if (aggregates.isEmpty) {
failAnalysis("The output of a correlated scalar subquery must be aggregated")
}
// SPARK-18504/SPARK-18814: Block cases where GROUP BY columns
// are not part of the correlated columns.
val groupByCols = AttributeSet(agg.groupingExpressions.flatMap(_.references))
// Collect the local references from the correlated predicate in the subquery.
val subqueryColumns = getCorrelatedPredicates(query).flatMap(_.references)
.filterNot(conditions.flatMap(_.references).contains)
val correlatedCols = AttributeSet(subqueryColumns)
val invalidCols = groupByCols -- correlatedCols
// GROUP BY columns must be a subset of columns in the predicates
if (invalidCols.nonEmpty) {
failAnalysis(
"A GROUP BY clause in a scalar correlated subquery " +
"cannot contain non-correlated columns: " +
invalidCols.mkString(","))
}
}
// Skip subquery aliases added by the Analyzer.
// For projects, do the necessary mapping and skip to its child.
def cleanQuery(p: LogicalPlan): LogicalPlan = p match {
case s: SubqueryAlias => cleanQuery(s.child)
case p: Project => cleanQuery(p.child)
case child => child
}
cleanQuery(query) match {
case a: Aggregate => checkAggregate(a)
case Filter(_, a: Aggregate) => checkAggregate(a)
case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail")
}
}
s
case s: SubqueryExpression =>
checkAnalysis(s.plan)
s
}
operator match {
case etw: EventTimeWatermark =>
etw.eventTime.dataType match {
case s: StructType
if s.find(_.name == "end").map(_.dataType) == Some(TimestampType) =>
case _: TimestampType =>
case _ =>
failAnalysis(
s"Event time must be defined on a window or a timestamp, but " +
s"${etw.eventTime.name} is of type ${etw.eventTime.dataType.simpleString}")
}
case f: Filter if f.condition.dataType != BooleanType =>
failAnalysis(
s"filter expression '${f.condition.sql}' " +
s"of type ${f.condition.dataType.simpleString} is not a boolean.")
case Filter(condition, _) if hasNullAwarePredicateWithinNot(condition) =>
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")
case j @ Join(_, _, _, Some(condition)) if condition.dataType != BooleanType =>
failAnalysis(
s"join condition '${condition.sql}' " +
s"of type ${condition.dataType.simpleString} is not a boolean.")
case Aggregate(groupingExprs, aggregateExprs, child) =>
def checkValidAggregateExpression(expr: Expression): Unit = expr match {
case aggExpr: AggregateExpression =>
aggExpr.aggregateFunction.children.foreach { child =>
child.foreach {
case agg: AggregateExpression =>
failAnalysis(
s"It is not allowed to use an aggregate function in the argument of " +
s"another aggregate function. Please use the inner aggregate function " +
s"in a sub-query.")
case other => // OK
}
if (!child.deterministic) {
failAnalysis(
s"nondeterministic expression ${expr.sql} should not " +
s"appear in the arguments of an aggregate function.")
}
}
case e: Attribute if groupingExprs.isEmpty =>
// Collect all [[AggregateExpressions]]s.
val aggExprs = aggregateExprs.filter(_.collect {
case a: AggregateExpression => a
}.nonEmpty)
failAnalysis(
s"grouping expressions sequence is empty, " +
s"and '${e.sql}' is not an aggregate function. " +
s"Wrap '${aggExprs.map(_.sql).mkString("(", ", ", ")")}' in windowing " +
s"function(s) or wrap '${e.sql}' in first() (or first_value) " +
s"if you don't care which value you get."
)
case e: Attribute if !groupingExprs.exists(_.semanticEquals(e)) =>
failAnalysis(
s"expression '${e.sql}' is neither present in the group by, " +
s"nor is it an aggregate function. " +
"Add to group by or wrap in first() (or first_value) if you don't care " +
"which value you get.")
case e if groupingExprs.exists(_.semanticEquals(e)) => // OK
case e => e.children.foreach(checkValidAggregateExpression)
}
def checkValidGroupingExprs(expr: Expression): Unit = {
if (expr.find(_.isInstanceOf[AggregateExpression]).isDefined) {
failAnalysis(
"aggregate functions are not allowed in GROUP BY, but found " + expr.sql)
}
// Check if the data type of expr is orderable.
if (!RowOrdering.isOrderable(expr.dataType)) {
failAnalysis(
s"expression ${expr.sql} cannot be used as a grouping expression " +
s"because its data type ${expr.dataType.simpleString} is not an orderable " +
s"data type.")
}
if (!expr.deterministic) {
// This is just a sanity check, our analysis rule PullOutNondeterministic should
// already pull out those nondeterministic expressions and evaluate them in
// a Project node.
failAnalysis(s"nondeterministic expression ${expr.sql} should not " +
s"appear in grouping expression.")
}
}
groupingExprs.foreach(checkValidGroupingExprs)
aggregateExprs.foreach(checkValidAggregateExpression)
case Sort(orders, _, _) =>
orders.foreach { order =>
if (!RowOrdering.isOrderable(order.dataType)) {
failAnalysis(
s"sorting is not supported for columns of type ${order.dataType.simpleString}")
}
}
case GlobalLimit(limitExpr, _) => checkLimitClause(limitExpr)
case LocalLimit(limitExpr, _) => checkLimitClause(limitExpr)
case p if p.expressions.exists(ScalarSubquery.hasCorrelatedScalarSubquery) =>
p match {
case _: Filter | _: Aggregate | _: Project => // Ok
case other => failAnalysis(
s"Correlated scalar sub-queries can only be used in a Filter/Aggregate/Project: $p")
}
case p if p.expressions.exists(SubqueryExpression.hasInOrExistsSubquery) =>
p match {
case _: Filter => // Ok
case _ => failAnalysis(s"Predicate sub-queries can only be used in a Filter: $p")
}
case _: Union | _: SetOperation if operator.children.length > 1 =>
def dataTypes(plan: LogicalPlan): Seq[DataType] = plan.output.map(_.dataType)
def ordinalNumber(i: Int): String = i match {
case 0 => "first"
case 1 => "second"
case i => s"${i}th"
}
val ref = dataTypes(operator.children.head)
operator.children.tail.zipWithIndex.foreach { case (child, ti) =>
// Check the number of columns
if (child.output.length != ref.length) {
failAnalysis(
s"""
|${operator.nodeName} can only be performed on tables with the same number
|of columns, but the first table has ${ref.length} columns and
|the ${ordinalNumber(ti + 1)} table has ${child.output.length} columns
""".stripMargin.replace("\n", " ").trim())
}
// Check if the data types match.
dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, dt2), ci) =>
// SPARK-18058: we shall not care about the nullability of columns
if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty) {
failAnalysis(
s"""
|${operator.nodeName} can only be performed on tables with the compatible
|column types. ${dt1.catalogString} <> ${dt2.catalogString} at the
|${ordinalNumber(ci)} column of the ${ordinalNumber(ti + 1)} table
""".stripMargin.replace("\n", " ").trim())
}
}
}
case _ => // Fallbacks to the following checks
}
operator match {
case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
val missingAttributes = o.missingInput.mkString(",")
val input = o.inputSet.mkString(",")
failAnalysis(
s"resolved attribute(s) $missingAttributes missing from $input " +
s"in operator ${operator.simpleString}")
case p @ Project(exprs, _) if containsMultipleGenerators(exprs) =>
failAnalysis(
s"""Only a single table generating function is allowed in a SELECT clause, found:
| ${exprs.map(_.sql).mkString(",")}""".stripMargin)
case j: Join if !j.duplicateResolved =>
val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet)
failAnalysis(
s"""
|Failure when resolving conflicting references in Join:
|$plan
|Conflicting attributes: ${conflictingAttributes.mkString(",")}
|""".stripMargin)
case i: Intersect if !i.duplicateResolved =>
val conflictingAttributes = i.left.outputSet.intersect(i.right.outputSet)
failAnalysis(
s"""
|Failure when resolving conflicting references in Intersect:
|$plan
|Conflicting attributes: ${conflictingAttributes.mkString(",")}
""".stripMargin)
case e: Except if !e.duplicateResolved =>
val conflictingAttributes = e.left.outputSet.intersect(e.right.outputSet)
failAnalysis(
s"""
|Failure when resolving conflicting references in Except:
|$plan
|Conflicting attributes: ${conflictingAttributes.mkString(",")}
""".stripMargin)
// TODO: although map type is not orderable, technically map type should be able to be
// used in equality comparison, remove this type check once we support it.
case o if mapColumnInSetOperation(o).isDefined =>
val mapCol = mapColumnInSetOperation(o).get
failAnalysis("Cannot have map type columns in DataFrame which calls " +
s"set operations(intersect, except, etc.), but the type of column ${mapCol.name} " +
"is " + mapCol.dataType.simpleString)
case o if o.expressions.exists(!_.deterministic) &&
!o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
!o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
// The rule above is used to check Aggregate operator.
failAnalysis(
s"""nondeterministic expressions are only allowed in
|Project, Filter, Aggregate or Window, found:
| ${o.expressions.map(_.sql).mkString(",")}
|in operator ${operator.simpleString}
""".stripMargin)
case _: UnresolvedHint =>
throw new IllegalStateException(
"Internal error: logical hint operator should have been removed during analysis")
case _ => // Analysis successful!
}
}
extendedCheckRules.foreach(_(plan))
plan.foreachUp {
case o if !o.resolved => failAnalysis(s"unresolved operator ${o.simpleString}")
case _ =>
}
//Mark this plan as analyzed.
plan.foreach(_.setAnalyzed())
}
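From the user's side, a checkAnalysis failure surfaces as an AnalysisException long before any job runs. For example (illustrative, reusing the `spark` session from the opening sketch):

import org.apache.spark.sql.AnalysisException

spark.range(3).toDF("id").createOrReplaceTempView("t")
try {
  spark.sql("SELECT no_such_col FROM t").show()
} catch {
  case e: AnalysisException =>
    // e.g. cannot resolve '`no_such_col`' given input columns: [id]
    println(e.getMessage)
}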