Generally speaking, calling Spark SQL's sql() does not trigger actual execution; it only builds the corresponding DataFrame or Dataset. Statements such as INSERT, however, are executed eagerly. Let's walk through the execution of an INSERT statement step by step.
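To make the lazy/eager difference concrete, here is a minimal sketch, assuming a SparkSession named spark and an existing table t (the table name is illustrative):

val df = spark.sql("SELECT * FROM t")  // lazy: only a DataFrame is built, nothing runs
spark.sql("INSERT INTO t VALUES (1)")  // eager: the write happens inside sql() itself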
def sql(sqlText: String): DataFrame = withActive {
  val tracker = new QueryPlanningTracker
  val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
    sessionState.sqlParser.parsePlan(sqlText)
  }
  Dataset.ofRows(self, plan, tracker)
}
The first step is sessionState.sqlParser.parsePlan(sqlText), which produces the corresponding (unresolved) logical plan; the article on the Spark SQL parsing process covers this part in more detail.
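You can watch this step in isolation. A quick sketch, assuming a spark-shell session (sessionState is a public but @Unstable API, so treat this as exploratory code only):

val plan = spark.sessionState.sqlParser.parsePlan("INSERT INTO t SELECT * FROM s")
println(plan.treeString)  // an unresolved insert plan; nothing has executed yet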
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
  : DataFrame = sparkSession.withActive {
  val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
  qe.assertAnalyzed()
  new Dataset[Row](qe, RowEncoder(qe.analyzed.schema))
}
Then ofRows is called to wrap the plan and return a Dataset object.
@transient private[sql] val logicalPlan: LogicalPlan = {
  // For various commands (like DDL) and queries with side effects, we force query execution
  // to happen right away to let these side effects take place eagerly.
  val plan = queryExecution.analyzed match {
    case c: Command =>
      LocalRelation(c.output, withAction("command", queryExecution)(_.executeCollect()))
    case u @ Union(children, _, _) if children.forall(_.isInstanceOf[Command]) =>
      LocalRelation(u.output, withAction("command", queryExecution)(_.executeCollect()))
    case _ =>
      queryExecution.analyzed
  }
  if (sparkSession.sessionState.conf.getConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED) &&
      plan.getTagValue(Dataset.DATASET_ID_TAG).isEmpty) {
    plan.setTagValue(Dataset.DATASET_ID_TAG, id)
  }
  plan
}
While the Dataset object is being constructed, a logicalPlan is created. queryExecution.analyzed returns the analyzed logical plan of the SQL statement. That plan is then pattern-matched: when it is a Command, executeCollect() is triggered immediately.
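You can observe this match yourself. A small sketch; Command is an internal Catalyst class (org.apache.spark.sql.catalyst.plans.logical), but it ships with Spark and is accessible:

import org.apache.spark.sql.catalyst.plans.logical.Command

val df = spark.sql("SHOW TABLES")  // SHOW TABLES is also a Command
println(df.queryExecution.analyzed.isInstanceOf[Command])  // true: it ran eagerly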
trait Command extends LogicalPlan {
  override def output: Seq[Attribute] = Seq.empty
  override def producedAttributes: AttributeSet = outputSet
  override def children: Seq[LogicalPlan] = Seq.empty
  // Commands are eagerly executed. They will be converted to LocalRelation after the DataFrame
  // is created. That said, the statistics of a command is useless. Here we just return a dummy
  // statistics to avoid unnecessary statistics calculation of command's children.
  override def stats: Statistics = Statistics.DUMMY
}
Command is a trait that extends LogicalPlan. Many classes mix it in, mostly the DDL statements; the INSERT statement plans also extend Command.
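As a hedged sketch of what such a subclass looks like, here is a hypothetical command built on RunnableCommand, the helper trait most DDL commands extend (the class name and behavior are made up for illustration):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.command.RunnableCommand

// Hypothetical command: because it is a Command, Dataset.logicalPlan
// forces its run() eagerly, right when spark.sql(...) returns.
case class PrintCommand(msg: String) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    println(msg)
    Seq.empty
  }
}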
private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U) = {
  SQLExecution.withNewExecutionId(qe, Some(name)) {
    qe.executedPlan.resetMetrics()
    action(qe.executedPlan)
  }
}
Continuing from the match above: once a Command is matched, LocalRelation(c.output, withAction("command", queryExecution)(_.executeCollect())) is evaluated, and building that LocalRelation calls the withAction method. Inside it, qe.executedPlan is forced; executedPlan is the final physical plan, the last stage before conversion to an RDD. I'll use DataSource V2 as the example here. In the DataSource V2 implementation, an INSERT statement is planned into the AppendDataExec physical node. It mixes in the V2TableWriteExec trait, which in turn extends the V2CommandExec abstract class.
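To see the node without actually running the write, EXPLAIN can be used on the INSERT (a sketch, assuming t is a DataSource V2 table and s is some source table):

spark.sql("EXPLAIN INSERT INTO t SELECT * FROM s").show(false)
// the root of the printed physical plan corresponds to AppendDataExec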
/**
 * The value of this field can be used as the contents of the corresponding RDD generated from
 * the physical plan of this command.
 */
private lazy val result: Seq[InternalRow] = run()

/**
 * The `execute()` method of all the physical command classes should reference `result`
 * so that the command can be executed eagerly right after the command query is created.
 */
override def executeCollect(): Array[InternalRow] = result.toArray
This snippet is from V2CommandExec. The action that withAction applies to the physical plan is executeCollect, which simply converts result into an Array; result itself comes from calling run().
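Because result is a lazy val, the side-effecting run() executes exactly once no matter which execute* method touches it first. A self-contained sketch of the same pattern (names are illustrative, not Spark's):

class EagerCommand {
  var runs = 0
  private lazy val result: Seq[Int] = { runs += 1; Seq(1, 2, 3) }  // stands in for run()
  def executeCollect(): Array[Int] = result.toArray
  def executeTake(n: Int): Array[Int] = result.take(n).toArray
}

val c = new EagerCommand
c.executeCollect()
c.executeTake(1)
println(c.runs)  // 1: the work ran only once, then was reused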
case class AppendDataExec(
    table: SupportsWrite,
    writeOptions: CaseInsensitiveStringMap,
    query: SparkPlan,
    refreshCache: () => Unit) extends V2TableWriteExec with BatchWriteHelper {

  override protected def run(): Seq[InternalRow] = {
    val writtenRows = writeWithV2(newWriteBuilder().buildForBatch())
    refreshCache()
    writtenRows
  }
}
run() is overridden in AppendDataExec, and it calls the writeWithV2 method defined in the V2TableWriteExec trait.
val rdd: RDD[InternalRow] = {
  val tempRdd = query.execute()
  // SPARK-23271 If we are attempting to write a zero partition rdd, create a dummy single
  // partition rdd to make sure we at least set up one write task to write the metadata.
  if (tempRdd.partitions.length == 0) {
    sparkContext.parallelize(Array.empty[InternalRow], 1)
  } else {
    tempRdd
  }
}

sparkContext.runJob(
  rdd,
  (context: TaskContext, iter: Iterator[InternalRow]) =>
    DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator),
  rdd.partitions.indices,
  (index, result: DataWritingSparkTaskResult) => {
    val commitMessage = result.writerCommitMessage
    messages(index) = commitMessage
    totalNumRowsAccumulator.add(result.numRows)
    batchWrite.onDataWriterCommit(commitMessage)
  }
)
query.execute() invokes SparkPlan's execute() method. That method is final and cannot be overridden; internally it calls doExecute(), which each concrete physical plan does override. If the INSERT's source is a SELECT query, the doExecute() of WholeStageCodegenExec runs here and produces the SELECT result. writeWithV2 then hands each partition's iterator to DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator).
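The final execute() / overridable doExecute() split is a classic template method. A minimal sketch of the same shape (types simplified; these are not Spark's actual classes):

abstract class MiniPlan {
  // final wrapper: shared preparation lives here, and subclasses cannot bypass it
  final def execute(): Iterator[Int] = doExecute()
  protected def doExecute(): Iterator[Int]
}

class MiniScan(data: Seq[Int]) extends MiniPlan {
  override protected def doExecute(): Iterator[Int] = data.iterator
}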
val dataWriter = writerFactory.createWriter(partId, taskId)
Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
  while (iter.hasNext) {
    // Count is here.
    count += 1
    dataWriter.write(iter.next())
  }
  ...
  dataWriter.commit()
}
This method is fairly long, so only the important part is shown. The core idea: create a DataWriter, write each row, and finally commit everything in one step to complete the insert into the data source.
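Put together, the per-task write protocol looks roughly like the following sketch, built on the public org.apache.spark.sql.connector.write interfaces (error handling and the commit coordinator are omitted; this is a simplification, not Spark's exact code):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, DataWriterFactory, WriterCommitMessage}

def writeTask(
    factory: DataWriterFactory,
    partId: Int,
    taskId: Long,
    iter: Iterator[InternalRow]): WriterCommitMessage = {
  val writer: DataWriter[InternalRow] = factory.createWriter(partId, taskId)
  try {
    iter.foreach(writer.write)  // write every row of this partition
    writer.commit()             // task-level commit; the message goes back to the driver
  } catch {
    case t: Throwable =>
      writer.abort()            // discard this task's output on failure
      throw t
  } finally {
    writer.close()
  }
}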