Generally speaking, calling Spark SQL's sql() does not trigger actual execution; it only builds the corresponding DataFrame or Dataset. Statements such as INSERT, however, are executed eagerly. Let's walk through the execution of an INSERT statement step by step.
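To make the lazy/eager difference concrete, here is a minimal sketch, assuming a SparkSession named spark and an existing table t (the table name is illustrative):

val df = spark.sql("SELECT * FROM t")  // lazy: only a DataFrame is built, nothing runs
spark.sql("INSERT INTO t VALUES (1)")  // eager: the write happens inside sql() itself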
def sql(sqlText: String): DataFrame = withActive {
  val tracker = new QueryPlanningTracker
  val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
    sessionState.sqlParser.parsePlan(sqlText)
  }
  Dataset.ofRows(self, plan, tracker)
}
The first step is sessionState.sqlParser.parsePlan(sqlText), which produces the corresponding (unresolved) logical plan; the article on the Spark SQL parsing process covers this part in more detail.
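You can watch this step in isolation. A quick sketch, assuming a spark-shell session (sessionState is a public but @Unstable API, so treat this as exploratory code only):

val plan = spark.sessionState.sqlParser.parsePlan("INSERT INTO t SELECT * FROM s")
println(plan.treeString)  // an unresolved insert plan; nothing has executed yet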
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
  : DataFrame = sparkSession.withActive {
  val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
  qe.assertAnalyzed()
  new Dataset[Row](qe, RowEncoder(qe.analyzed.schema))
}
Then ofRows is called to wrap the plan and return a Dataset object.
@transient private[sql] val logicalPlan: LogicalPlan = {
  // For various commands (like DDL) and queries with side effects, we force query execution
  // to happen right away to let these side effects take place eagerly.
  val plan = queryExecution.analyzed match {
    case c: Command =>
      LocalRelation(c.output, withAction("command", queryExecution)(_.executeCollect()))
    case u @ Union(children, _, _) if children.forall(_.isInstanceOf[Command]) =>
      LocalRelation(u.output, withAction("command", queryExecution)(_.executeCollect()))
    case _ =>
      queryExecution.analyzed
  }
  if (sparkSession.sessionState.conf.getConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED) &&
      plan.getTagValue(Dataset.DATASET_ID_TAG).isEmpty) {
    plan.setTagValue(Dataset.DATASET_ID_TAG, id)
  }
  plan
}
While the Dataset object is being constructed, a logicalPlan is created. queryExecution.analyzed returns the analyzed logical plan of the SQL statement. That plan is then pattern-matched: when it is a Command, executeCollect() is triggered immediately.
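You can observe this match yourself. A small sketch; Command is an internal Catalyst class (org.apache.spark.sql.catalyst.plans.logical), but it ships with Spark and is accessible:

import org.apache.spark.sql.catalyst.plans.logical.Command

val df = spark.sql("SHOW TABLES")  // SHOW TABLES is also a Command
println(df.queryExecution.analyzed.isInstanceOf[Command])  // true: it ran eagerly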
trait Command extends LogicalPlan {
  override def output: Seq[Attribute] = Seq.empty
  override def producedAttributes: AttributeSet = outputSet
  override def children: Seq[LogicalPlan] = Seq.empty
  // Commands are eagerly executed. They will be converted to LocalRelation after the DataFrame
  // is created. That said, the statistics of a command is useless. Here we just return a dummy
  // statistics to avoid unnecessary statistics calculation of command's children.
  override def stats: Statistics = Statistics.DUMMY
}
Command is a trait that extends LogicalPlan. Many classes mix it in, mostly the DDL statements; the INSERT statement plans also extend Command.
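As a hedged sketch of what such a subclass looks like, here is a hypothetical command built on RunnableCommand, the helper trait most DDL commands extend (the class name and behavior are made up for illustration):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.command.RunnableCommand

// Hypothetical command: because it is a Command, Dataset.logicalPlan
// forces its run() eagerly, right when spark.sql(...) returns.
case class PrintCommand(msg: String) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    println(msg)
    Seq.empty
  }
}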
private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U) = {
  SQLExecution.withNewExecutionId(qe, Some(name)) {
    qe.executedPlan.resetMetrics()
    action(qe.executedPlan)
  }
}
Continuing from the match above: once a Command is matched, LocalRelation(c.output, withAction("command", queryExecution)(_.executeCollect())) is evaluated, and building that LocalRelation calls the withAction method. Inside it, qe.executedPlan is forced; executedPlan is the final physical plan, the last stage before conversion to an RDD. I'll use DataSource V2 as the example here. In the DataSource V2 implementation, an INSERT statement is planned into the AppendDataExec physical node. It mixes in the V2TableWriteExec trait, which in turn extends the V2CommandExec abstract class.
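To see the node without actually running the write, EXPLAIN can be used on the INSERT (a sketch, assuming t is a DataSource V2 table and s is some source table):

spark.sql("EXPLAIN INSERT INTO t SELECT * FROM s").show(false)
// the root of the printed physical plan corresponds to AppendDataExec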
/**
 * The value of this field can be used as the contents of the corresponding RDD generated from
 * the physical plan of this command.
 */
private lazy val result: Seq[InternalRow] = run()

/**
 * The `execute()` method of all the physical command classes should reference `result`
 * so that the command can be executed eagerly right after the command query is created.
 */
override def executeCollect(): Array[InternalRow] = result.toArray
This snippet is from V2CommandExec. The action that withAction applies to the physical plan is executeCollect, which simply converts result into an Array; result itself comes from calling run().
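Because result is a lazy val, the side-effecting run() executes exactly once no matter which execute* method touches it first. A self-contained sketch of the same pattern (names are illustrative, not Spark's):

class EagerCommand {
  var runs = 0
  private lazy val result: Seq[Int] = { runs += 1; Seq(1, 2, 3) }  // stands in for run()
  def executeCollect(): Array[Int] = result.toArray
  def executeTake(n: Int): Array[Int] = result.take(n).toArray
}

val c = new EagerCommand
c.executeCollect()
c.executeTake(1)
println(c.runs)  // 1: the work ran only once, then was reused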
case class AppendDataExec(
    table: SupportsWrite,
    writeOptions: CaseInsensitiveStringMap,
    query: SparkPlan,
    refreshCache: () => Unit) extends V2TableWriteExec with BatchWriteHelper {

  override protected def run(): Seq[InternalRow] = {
    val writtenRows = writeWithV2(newWriteBuilder().buildForBatch())
    refreshCache()
    writtenRows
  }
}
run() is overridden in AppendDataExec, and it calls the writeWithV2 method defined in the V2TableWriteExec trait.
val rdd: RDD[InternalRow] = {
  val tempRdd = query.execute()
  // SPARK-23271 If we are attempting to write a zero partition rdd, create a dummy single
  // partition rdd to make sure we at least set up one write task to write the metadata.
  if (tempRdd.partitions.length == 0) {
    sparkContext.parallelize(Array.empty[InternalRow], 1)
  } else {
    tempRdd
  }
}

sparkContext.runJob(
  rdd,
  (context: TaskContext, iter: Iterator[InternalRow]) =>
    DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator),
  rdd.partitions.indices,
  (index, result: DataWritingSparkTaskResult) => {
    val commitMessage = result.writerCommitMessage
    messages(index) = commitMessage
    totalNumRowsAccumulator.add(result.numRows)
    batchWrite.onDataWriterCommit(commitMessage)
  }
)
query.execute() invokes SparkPlan's execute() method. That method is final and cannot be overridden; internally it calls doExecute(), which each concrete physical plan does override. If the INSERT's source is a SELECT query, the doExecute() of WholeStageCodegenExec runs here and produces the SELECT result. writeWithV2 then hands each partition's iterator to DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator).
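The final execute() / overridable doExecute() split is a classic template method. A minimal sketch of the same shape (types simplified; these are not Spark's actual classes):

abstract class MiniPlan {
  // final wrapper: shared preparation lives here, and subclasses cannot bypass it
  final def execute(): Iterator[Int] = doExecute()
  protected def doExecute(): Iterator[Int]
}

class MiniScan(data: Seq[Int]) extends MiniPlan {
  override protected def doExecute(): Iterator[Int] = data.iterator
}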
val dataWriter = writerFactory.createWriter(partId, taskId)
Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
  while (iter.hasNext) {
    // Count is here.
    count += 1
    dataWriter.write(iter.next())
  }
  ...
  dataWriter.commit()
}
This method is fairly long, so only the important part is shown. The core idea: create a DataWriter, write each row, and finally commit everything in one step to complete the insert into the data source.
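Put together, the per-task write protocol looks roughly like the following sketch, built on the public org.apache.spark.sql.connector.write interfaces (error handling and the commit coordinator are omitted; this is a simplification, not Spark's exact code):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, DataWriterFactory, WriterCommitMessage}

def writeTask(
    factory: DataWriterFactory,
    partId: Int,
    taskId: Long,
    iter: Iterator[InternalRow]): WriterCommitMessage = {
  val writer: DataWriter[InternalRow] = factory.createWriter(partId, taskId)
  try {
    iter.foreach(writer.write)  // write every row of this partition
    writer.commit()             // task-level commit; the message goes back to the driver
  } catch {
    case t: Throwable =>
      writer.abort()            // discard this task's output on failure
      throw t
  } finally {
    writer.close()
  }
}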