Spark SQL insert source code analysis

Generally speaking, a Spark SQL statement does not trigger actual execution; it only produces the corresponding DataFrame or Dataset. Statements such as insert, however, are executed eagerly. Let's walk through the execution of an insert statement step by step.
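To make the difference concrete, here is a minimal sketch (the table name t, the object name, and the local master are made up for illustration): the SELECT only builds a lazy DataFrame, while the CREATE TABLE and INSERT run as soon as spark.sql returns.

import org.apache.spark.sql.SparkSession

object EagerInsertDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("eager-insert").getOrCreate()

    // Nothing is executed here: a SELECT just builds a DataFrame.
    val lazyDf = spark.sql("SELECT id FROM range(10)")

    // These are Commands, so spark.sql() executes them eagerly,
    // even though we never call an action on the returned DataFrames.
    spark.sql("CREATE TABLE t (id BIGINT) USING parquet")
    spark.sql("INSERT INTO t SELECT id FROM range(10)")

    lazyDf.show()   // only now does the SELECT above actually run
    spark.stop()
  }
}

The entry point for both cases is SparkSession.sql: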

def sql(sqlText: String): DataFrame = withActive {
  val tracker = new QueryPlanningTracker
  val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
    sessionState.sqlParser.parsePlan(sqlText)
  }
  Dataset.ofRows(self, plan, tracker)
}

The first step, sessionState.sqlParser.parsePlan(sqlText), produces the corresponding logical plan; the Spark SQL parsing walkthrough covers that part in more detail.
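If you want to see what the parser produced for your own statement, the Dataset's public queryExecution handle exposes each stage of the plan. A quick sketch, assuming spark is an already-created SparkSession:

val df = spark.sql("SELECT id, id * 2 AS twice FROM range(10)")

// The unresolved plan the parser produced (what parsePlan returns).
println(df.queryExecution.logical)
// The same plan after the analyzer has resolved relations, functions and attributes.
println(df.queryExecution.analyzed)
// Or simply:
df.explain(extended = true)

The parsed plan is then handed to Dataset.ofRows: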

def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
  : DataFrame = sparkSession.withActive {
  val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
  qe.assertAnalyzed()
  new Dataset[Row](qe, RowEncoder(qe.analyzed.schema))
}

ofRows builds a QueryExecution from the plan, asserts that analysis succeeds, and returns a Dataset object.

@transient private[sql] val logicalPlan: LogicalPlan = {
  // For various commands (like DDL) and queries with side effects, we force query execution
  // to happen right away to let these side effects take place eagerly.
  val plan = queryExecution.analyzed match {
    case c: Command =>
      LocalRelation(c.output, withAction("command", queryExecution)(_.executeCollect()))
    case u @ Union(children, _, _) if children.forall(_.isInstanceOf[Command]) =>
      LocalRelation(u.output, withAction("command", queryExecution)(_.executeCollect()))
    case _ =>
      queryExecution.analyzed
  }
  if (sparkSession.sessionState.conf.getConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED) &&
      plan.getTagValue(Dataset.DATASET_ID_TAG).isEmpty) {
    plan.setTagValue(Dataset.DATASET_ID_TAG, id)
  }
  plan
}

While the Dataset object is being constructed, its logicalPlan field is initialized. queryExecution.analyzed is the analyzed logical plan of the SQL statement. That plan is then pattern matched: when it is a Command, executeCollect() is triggered right away.
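You can check this distinction directly: a plain query's analyzed plan is returned as-is and nothing runs, while a Command has already been executed by the time spark.sql returns. A small sketch (behaviour observed on Spark 3.x; spark is an existing session):

import org.apache.spark.sql.catalyst.plans.logical.Command

val query = spark.sql("SELECT 1 AS one")
val cmd   = spark.sql("SHOW DATABASES")

println(query.queryExecution.analyzed.isInstanceOf[Command])  // false: stays lazy
println(cmd.queryExecution.analyzed.isInstanceOf[Command])    // true: already executed

The Command trait itself is very small: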

trait Command extends LogicalPlan {
  override def output: Seq[Attribute] = Seq.empty
  override def producedAttributes: AttributeSet = outputSet
  override def children: Seq[LogicalPlan] = Seq.empty
  // Commands are eagerly executed. They will be converted to LocalRelation after the DataFrame
  // is created. That said, the statistics of a command is useless. Here we just return a dummy
  // statistics to avoid unnecessary statistics calculation of command's children.
  override def stats: Statistics = Statistics.DUMMY
}

Command is a trait extending LogicalPlan, and it has many subclasses.
(Figure: Command inheritance hierarchy)
Most of them are DDL commands, and the insert plans also mix in Command.

private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U) = {
  SQLExecution.withNewExecutionId(qe, Some(name)) {
    qe.executedPlan.resetMetrics()
    action(qe.executedPlan)
  }
}

Picking up from above: once a Command is matched, LocalRelation(c.output, withAction("command", queryExecution)(_.executeCollect())) calls withAction while the LocalRelation is being built. Inside withAction, qe.executedPlan is forced; executedPlan is the final physical plan, the last step before the plan is turned into RDDs. I'll use DataSource V2 as the example here: in the DataSource V2 path, the executedPlan of an insert is the AppendDataExec class. It mixes in the V2TableWriteExec trait, and V2TableWriteExec in turn extends the V2CommandExec abstract class.

/**
 * The value of this field can be used as the contents of the corresponding RDD generated from
 * the physical plan of this command.
 */
private lazy val result: Seq[InternalRow] = run()

/**
 * The `execute()` method of all the physical command classes should reference `result`
 * so that the command can be executed eagerly right after the command query is created.
 */
override def executeCollect(): Array[InternalRow] = result.toArray

In V2CommandExec, the method that withAction invokes on the physical plan is executeCollect(), which simply turns result into an Array; result itself comes from calling run().
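The pattern here is simply "run once, reuse everywhere": result is a lazy val, so run() fires the first time any execution entry point touches it, and never again. A stripped-down illustration of the same idea (a toy class, not the Spark one):

// Toy version of the V2CommandExec pattern: the side-effecting work in run()
// happens at most once, no matter how many entry points are called.
abstract class EagerCommand {
  // Subclasses put the actual side effect (e.g. the write) here.
  protected def run(): Seq[String]

  // Memoized: run() is invoked only on first access.
  private lazy val result: Seq[String] = run()

  def executeCollect(): Array[String] = result.toArray
  def executeTake(n: Int): Array[String] = result.take(n).toArray
}

object EagerCommandDemo extends App {
  val cmd = new EagerCommand {
    override protected def run(): Seq[String] = {
      println("running the command")   // printed exactly once
      Seq("row1", "row2")
    }
  }
  cmd.executeCollect()   // triggers run()
  cmd.executeTake(1)     // reuses the memoized result
}

In the real code, AppendDataExec overrides run() with the actual write: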

case class AppendDataExec(
    table: SupportsWrite,
    writeOptions: CaseInsensitiveStringMap,
    query: SparkPlan,
    refreshCache: () => Unit) extends V2TableWriteExec with BatchWriteHelper {

  override protected def run(): Seq[InternalRow] = {
    val writtenRows = writeWithV2(newWriteBuilder().buildForBatch())
    refreshCache()
    writtenRows
  }
}

This run() calls writeWithV2 from the V2TableWriteExec trait; its key parts are shown below.

val rdd: RDD[InternalRow] = {
  val tempRdd = query.execute()
  // SPARK-23271 If we are attempting to write a zero partition rdd, create a dummy single
  // partition rdd to make sure we at least set up one write task to write the metadata.
  if (tempRdd.partitions.length == 0) {
    sparkContext.parallelize(Array.empty[InternalRow], 1)
  } else {
    tempRdd
  }
}

sparkContext.runJob(
  rdd,
  (context: TaskContext, iter: Iterator[InternalRow]) =>
    DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator),
  rdd.partitions.indices,
  (index, result: DataWritingSparkTaskResult) => {
    val commitMessage = result.writerCommitMessage
    messages(index) = commitMessage
    totalNumRowsAccumulator.add(result.numRows)
    batchWrite.onDataWriterCommit(commitMessage)
  }
)

query.execute() calls SparkPlan's execute() method, which is final and cannot be overridden; it delegates to doExecute(), and doExecute() is overridden by each concrete physical plan. If the insert reads from a select, for example, WholeStageCodegenExec's doExecute() runs here to produce the select results. writeWithV2 then submits one write task per partition through sparkContext.runJob: each task calls DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator), and the result handler runs on the driver, recording each task's commit message. The sketch below shows the same runJob pattern in isolation.
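A standalone sketch of that runJob pattern with a toy computation (not the Spark writer code): the second argument runs on the executors once per partition, and the last argument is the per-task result handler on the driver.

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("runJob-demo"))
    val rdd = sc.parallelize(1 to 10, numSlices = 2)

    val results = new Array[Long](rdd.partitions.length)
    sc.runJob(
      rdd,
      // Runs on the executors, once per partition (analogue of DataWritingSparkTask.run).
      (_: TaskContext, iter: Iterator[Int]) => iter.size.toLong,
      rdd.partitions.indices,
      // Runs on the driver as each task finishes (analogue of collecting commit messages).
      (index: Int, count: Long) => results(index) = count
    )
    println(results.toSeq)   // e.g. List(5, 5)
    sc.stop()
  }
}

Inside each task, the core of DataWritingSparkTask.run looks like this: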

val dataWriter = writerFactory.createWriter(partId, taskId)
var count = 0L
Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
  while (iter.hasNext) {
    // Count is here.
    count += 1
    dataWriter.write(iter.next())
  }
  ...
  dataWriter.commit()
})

There is quite a lot more in this method; only the important part is shown here. The core is: create a DataWriter, write the rows one by one, and finally commit them all together, which completes the insert into the data source.
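To make those roles concrete, here is what a minimal DataWriterFactory / DataWriter pair might look like against the Spark 3.x connector API. This is only a sketch with made-up names: a real source would also implement Table, SupportsWrite and BatchWrite so that Spark can obtain the factory.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, DataWriterFactory, WriterCommitMessage}

// Toy commit message: just reports how many rows the task wrote.
case class RowCountMessage(partitionId: Int, numRows: Long) extends WriterCommitMessage

// Minimal DataWriter: "writes" rows by counting them and reports the count on commit.
class CountingDataWriter(partitionId: Int) extends DataWriter[InternalRow] {
  private var count = 0L
  override def write(record: InternalRow): Unit = { count += 1 /* write to the real sink here */ }
  override def commit(): WriterCommitMessage = RowCountMessage(partitionId, count)
  override def abort(): Unit = { /* clean up partially written data */ }
  override def close(): Unit = ()
}

// One writer per (partition, task attempt), created on the executors.
class CountingWriterFactory extends DataWriterFactory {
  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new CountingDataWriter(partitionId)
}

object DataWriterDemo extends App {
  val writer = new CountingWriterFactory().createWriter(0, 0L)
  Seq(InternalRow(1), InternalRow(2)).foreach(writer.write)
  println(writer.commit())   // RowCountMessage(0,2)
  writer.close()
}

In the real DataSource V2 path, the factory is returned by BatchWrite.createBatchWriterFactory, and the commit messages collected by runJob are passed to BatchWrite.commit on the driver to finalize the insert.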
