Entry point
SQLContext
// Execute SQL with Spark and return a DataFrame as the result
def sql(sqlText: String): DataFrame = sparkSession.sql(sqlText)
A DataFrame is a new abstraction, built on top of RDDs, designed for data-query workloads; under the hood it is still RDD-based. It closely resembles a table in a relational database, but with many optimizations underneath. A DataFrame can be constructed from many sources, including structured data files, Hive tables, external relational databases, and RDDs.
def sql(sqlText: String): DataFrame = {
Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}
sessionState is the state of the SparkSession.
It holds important objects such as sharedState, sqlParser, analyzerBuilder, optimizerBuilder, and planner. A sessionState is created by cloning the parent sessionState; if there is no parent sessionState, a fresh one is initialized.
In the code above, the SQL parser sqlParser held by sessionState parses the SQL text sqlText and returns a logical plan.
The Dataset.ofRows method then processes this logicalPlan further.
// Parse the SQL statement and produce a logical plan
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>
val position = Origin(None, None)
throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
}
}
parsePlan calls parse, which is a curried function in Scala: it has two parameter lists, one taking the query string and the other taking a function that processes the parser.
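The two-parameter-list shape of parse can be illustrated with a minimal plain-Scala sketch (FakeParser and parseWith are hypothetical names, standing in for the ANTLR-generated parser and Spark's parse):

```scala
// Stand-in for the ANTLR-generated parser (hypothetical, for illustration only).
class FakeParser(command: String) {
  def tokenCount: Int = command.split("\\s+").length
}

// Curried function: the first parameter list takes the SQL text; the second
// takes a function that consumes the prepared parser and produces the result.
def parseWith[T](command: String)(toResult: FakeParser => T): T = {
  val parser = new FakeParser(command) // shared setup (lexer, listeners, ...)
  toResult(parser)                     // hand the parser to the caller's logic
}

// The second argument list can be supplied as a block, exactly like
// parsePlan does with parse(sqlText) { parser => ... }.
val n = parseWith("SELECT a FROM t") { parser => parser.tokenCount }
```

This is why parsePlan can pass its tree-building logic as a trailing block: the shared lexer/parser setup lives in parse, while each caller supplies only the toResult step.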
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
lexer.removeErrorListeners()
lexer.addErrorListener(ParseErrorListener)
lexer.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced
val tokenStream = new CommonTokenStream(lexer)
val parser = new SqlBaseParser(tokenStream)
parser.addParseListener(PostProcessor)
parser.removeErrorListeners()
parser.addErrorListener(ParseErrorListener)
parser.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced
// first, try parsing with potentially faster SLL mode
parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
toResult(parser)
}
This method uses ANTLR4 to perform syntactic analysis of the SQL statement, producing a parser over the syntax tree; the toResult function then processes that tree. The SqlBaseLexer, SqlBaseParser, and SqlBaseBaseVisitor classes are all generated by ANTLR4 from the grammar file org\apache\spark\sql\catalyst\parser\SqlBase.g4. If you are not familiar with ANTLR4, see an introduction to using ANTLR4.
The toResult function is supplied through the second parameter list of the curried function, as follows:
{ parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>
val position = Origin(None, None)
throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
}
}
astBuilder walks the parse tree and converts each tree node into a LogicalPlan node. astBuilder extends SqlBaseBaseVisitor, which is also generated by ANTLR4 and is used to visit the syntax tree; astBuilder overrides its methods to handle the different kinds of nodes.
The conversion of the SQL statement into a LogicalPlan tree described above all happens inside
sessionState.sqlParser.parsePlan(sqlText)
The resulting LogicalPlan is then passed as an argument to
Dataset.ofRows
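The node-by-node conversion that astBuilder performs can be sketched in plain Scala (the AST and plan types below are toy stand-ins, not Spark's classes):

```scala
// Hypothetical miniature AST, standing in for the ANTLR parse-tree nodes.
sealed trait AstNode
case class SelectNode(columns: Seq[String], table: String) extends AstNode
case class UnknownNode(text: String) extends AstNode

// Hypothetical miniature logical plan, standing in for Catalyst's LogicalPlan.
sealed trait MiniLogicalPlan
case class Project(columns: Seq[String], child: MiniLogicalPlan) extends MiniLogicalPlan
case class UnresolvedRelation(table: String) extends MiniLogicalPlan

// A visitor-style conversion: each recognized node type maps to a plan node;
// anything else yields None, where parsePlan would throw a ParseException.
def visit(node: AstNode): Option[MiniLogicalPlan] = node match {
  case SelectNode(cols, table) => Some(Project(cols, UnresolvedRelation(table)))
  case _                       => None
}
```

Note the output is still unresolved: the table name is carried as a string, to be bound to real metadata later by the analyzer.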
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
val qe = sparkSession.sessionState.executePlan(logicalPlan) // create a QueryExecution qe
qe.assertAnalyzed()
new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
qe calls the assertAnalyzed method to analyze the LogicalPlan:
def assertAnalyzed(): Unit = analyzed
lazy val analyzed: LogicalPlan = {
SparkSession.setActiveSession(sparkSession)
sparkSession.sessionState.analyzer.executeAndCheck(logical)
}
setActiveSession stores the sparkSession in a ThreadLocal so that it cannot be modified by other threads.
The analyzer then resolves the UnresolvedAttribute and UnresolvedRelation nodes in the LogicalPlan into typed objects, a process that relies on the information held in the SessionCatalog.
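What "resolving" means here can be sketched with a toy catalog lookup in plain Scala (all names below are hypothetical simplifications of SessionCatalog and the Unresolved* plan nodes):

```scala
// A toy schema: column name -> type name.
case class Schema(fields: Map[String, String])

sealed trait Plan
case class UnresolvedRel(name: String) extends Plan
case class ResolvedRel(name: String, schema: Schema) extends Plan

// A toy SessionCatalog: table name -> schema.
val catalog = Map("people" -> Schema(Map("name" -> "string", "age" -> "int")))

// Resolution replaces an untyped name with a typed relation by consulting
// the catalog; an unknown name fails, as checkAnalysis would.
def resolveRelations(plan: Plan): Plan = plan match {
  case UnresolvedRel(name) =>
    catalog.get(name) match {
      case Some(schema) => ResolvedRel(name, schema)
      case None         => sys.error(s"Table or view not found: $name")
    }
  case other => other
}
```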
def executeAndCheck(plan: LogicalPlan): LogicalPlan = AnalysisHelper.markInAnalyzer {
val analyzed = execute(plan)
checkAnalysis(analyzed)
analyzed
}
override def execute(plan: LogicalPlan): LogicalPlan = {
executeSameContext(plan)
}
private def executeSameContext(plan: LogicalPlan): LogicalPlan = super.execute(plan)
The execution logic of the analyzer ultimately lives in the RuleExecutor class:
def execute(plan: TreeType): TreeType = {
batches.foreach { batch =>
val batchStartPlan = curPlan
var iteration = 1
var lastPlan = curPlan
var continue = true
// Run until fix point (or the max number of iterations as specified in the strategy).
while (continue) {
curPlan = batch.rules.foldLeft(curPlan) {
case (plan, rule) =>
val startTime = System.nanoTime()
val result = rule(plan)
val runTime = System.nanoTime() - startTime
if (!result.fastEquals(plan)) {
queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName)
queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime)
logTrace(
s"""
|=== Applying Rule ${rule.ruleName} ===
|${sideBySide(plan.treeString, result.treeString).mkString("\n")}
""".stripMargin)
}
queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, runTime)
queryExecutionMetrics.incNumExecution(rule.ruleName)
// Run the structural integrity checker against the plan after each rule.
if (!isPlanIntegral(result)) {
val message = s"After applying rule ${rule.ruleName} in batch ${batch.name}, " +
"the structural integrity of the plan is broken."
throw new TreeNodeException(result, message, null)
}
result
}
iteration += 1
if (iteration > batch.strategy.maxIterations) {
// Only log if this is a rule that is supposed to run more than once.
if (iteration != 2) {
val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}"
if (Utils.isTesting) {
throw new TreeNodeException(curPlan, message, null)
} else {
logWarning(message)
}
}
continue = false
}
if (curPlan.fastEquals(lastPlan)) {
logTrace(
s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
continue = false
}
lastPlan = curPlan
}
if (!batchStartPlan.fastEquals(curPlan)) {
logDebug(
s"""
|=== Result of Batch ${batch.name} ===
|${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")}
""".stripMargin)
} else {
logTrace(s"Batch ${batch.name} has no effect.")
}
}
curPlan
}
batches contains multiple batches, and each batch is made up of a group of rules. A rule transforms an unbound LogicalPlan into a bound one. The analysis process applies every rule in batches to each node of the LogicalPlan tree, resolving it into a resolved logicalPlan.
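Stripped of metrics and logging, the fixed-point loop above can be condensed into a short plain-Scala sketch (Batch and executeBatch are simplified stand-ins for Spark's types; a rule here is just a function on some plan type T):

```scala
// A batch: a named group of rules with an iteration limit.
case class Batch[T](name: String, maxIterations: Int, rules: Seq[T => T])

// Apply every rule in the batch repeatedly until the plan stops changing
// (fixed point) or the iteration limit is reached, as RuleExecutor does.
def executeBatch[T](batch: Batch[T], plan: T): T = {
  var curPlan = plan
  var lastPlan = curPlan
  var iteration = 1
  var continue = true
  while (continue) {
    // One pass: thread the plan through every rule in order.
    curPlan = batch.rules.foldLeft(curPlan)((p, rule) => rule(p))
    iteration += 1
    if (iteration > batch.maxIterations) continue = false // strategy limit hit
    if (curPlan == lastPlan) continue = false             // fixed point reached
    lastPlan = curPlan
  }
  curPlan
}
```

For example, a "decrement until zero" rule converges to 0 at the fixed point, while a small maxIterations stops it early, mirroring the Max iterations warning in the Spark code.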
Once resolution is complete, the result analyzed is checked, and any error found raises an exception:
checkAnalysis(analyzed)
At this point we have walked through all the code behind
def sql(sqlText: String): DataFrame = sparkSession.sql(sqlText)
which mainly does the following two things:
- parse the SQL statement into a syntax tree, the unresolved logicalPlan
- use the analyzer to turn the unresolved logicalPlan into a resolved logicalPlan
It then returns a DataFrame. Note that at this point the SQL has not actually executed; the Spark job only runs when an action operation is encountered.
result = sqlContext.sql(statement)
result.collect()
The two lines above are a typical use of Spark SQL.
sqlContext.sql(statement) leads into sparkSession.sql(sqlText) as analyzed above; once we have the DataFrame result, result.collect() triggers the action.
def collect(): Array[T] = withAction("collect", queryExecution)(collectFromPlan)
private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U) = {
val result = SQLExecution.withNewExecutionId(sparkSession, qe) {
action(qe.executedPlan)
}
result
}
The logic that actually executes the query above lies in qe.executedPlan:
lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
This sparkPlan was never created anywhere in the preceding analysis, so where does it come from?
lazy val sparkPlan: SparkPlan = {
SparkSession.setActiveSession(sparkSession)
planner.plan(ReturnAnswer(optimizedPlan)).next()
}
As you can see, sparkPlan is defined lazily: it is only initialized when first accessed. The optimizedPlan here is likewise lazy:
lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)
lazy val withCachedData: LogicalPlan = {
assertAnalyzed()
assertSupported()
sparkSession.sharedState.cacheManager.useCachedData(analyzed)
}
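How this chain of lazy vals imposes the evaluation order can be shown with a small plain-Scala sketch (MiniQueryExecution is a hypothetical stand-in for QueryExecution, with strings in place of plans):

```scala
// Nothing runs at construction time; touching the last member forces each
// earlier step exactly once, in dependency order.
class MiniQueryExecution(logical: String) {
  val log = scala.collection.mutable.ArrayBuffer.empty[String]
  lazy val analyzed: String      = { log += "analyze"; s"analyzed($logical)" }
  lazy val optimizedPlan: String = { val a = analyzed; log += "optimize"; s"optimized($a)" }
  lazy val sparkPlan: String     = { val o = optimizedPlan; log += "plan"; s"physical($o)" }
}
```

Accessing sparkPlan first forces optimizedPlan, which forces analyzed; a second access reuses the cached values, just as repeated use of a QueryExecution does not re-analyze the plan.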
So the actual order is analyzed -> optimizedPlan -> sparkPlan. We have already covered analyzed; next comes optimizedPlan.
In fact optimizer.execute runs the same code path as the analyzer: the RuleExecutor.execute method. The difference lies in the Optimizer's responsibility: it takes the resolved Logical Plan produced by the Analyzer and, according to its own batches of optimization strategies, rewrites the tree, optimizing both logical plan nodes (Logical Plan) and expressions (Expression). It is also the step immediately preceding conversion into a physical execution plan.
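A classic example of such a rewrite rule is constant folding, sketched here on a toy expression tree (Expr, Lit, and Add are illustrative names, not Catalyst's classes):

```scala
// Toy expression tree standing in for Catalyst expressions.
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// One optimizer rule: fold an addition of two literals into a literal,
// otherwise recurse into the children.
def constantFold(e: Expr): Expr = e match {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
  case Add(l, r)           => Add(constantFold(l), constantFold(r))
  case leaf                => leaf
}

// Apply the rule to a fixed point, RuleExecutor-style.
def optimize(e: Expr): Expr = {
  val next = constantFold(e)
  if (next == e) e else optimize(next)
}
```

A nested expression like (1 + 2) + 3 needs two passes: the first folds the inner addition, the second folds the outer one, which is exactly why RuleExecutor iterates each batch to a fixed point.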
Next comes sparkPlan:
def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
// Collect physical plan candidates.
val candidates = strategies.iterator.flatMap(_(plan))
val pruned = prunePlans(candidates)
assert(pruned.hasNext, s"No plan for $plan")
pruned
}
strategies are the planning strategies; they turn the logical plan plan into the physical plan candidates. A logical plan can map to multiple physical plans, which is why flatMap is used.
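The 1-to-many planning step can be sketched in plain Scala (Logical, Physical, and the two strategies below are illustrative, not Spark's SparkStrategy implementations):

```scala
// Toy logical and physical plan nodes.
case class Logical(op: String)
case class Physical(op: String, impl: String)

// A strategy maps a logical plan to zero or more physical candidates.
type Strategy = Logical => Seq[Physical]

val strategies: Seq[Strategy] = Seq(
  // A join can be implemented several ways, hence several candidates.
  l => if (l.op == "join") Seq(Physical("join", "broadcast"), Physical("join", "sortMerge")) else Nil,
  // A fallback strategy that always produces one default candidate.
  l => Seq(Physical(l.op, "default"))
)

// flatMap concatenates each strategy's candidates, as in QueryPlanner.plan.
def plan(logical: Logical): Iterator[Physical] = {
  val candidates = strategies.iterator.flatMap(s => s(logical))
  assert(candidates.hasNext, s"No plan for $logical")
  candidates
}
```

In Spark, .next() then takes the first surviving candidate as the sparkPlan.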
This completes the code behind qe.executedPlan: the analyzed logical plan has gone through analyzed -> optimizedPlan -> sparkPlan and become a physical plan. That is the preprocessing stage before execution; the query job itself is then run by the action, which is defined as follows:
private def collectFromPlan(plan: SparkPlan): Array[T] = {
// This projection writes output to a `InternalRow`, which means applying this projection is not
// thread-safe. Here we create the projection inside this method to make `Dataset` thread-safe.
val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
plan.executeCollect().map { row =>
// The row returned by SafeProjection is `SpecificInternalRow`, which ignore the data type
// parameter of its `get` method, so it's safe to use null here.
objProj(row).get(0, null).asInstanceOf[T]
}
}