Spark WholeStageCodegen: A Detailed Walkthrough of the Code Generation Process

Article link: https://blog.csdn.net/u012794781/article/details/128312594

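This article introduces Spark's WholeStageCodegen mechanism, explains how code generation is used to speed up query execution, and walks through a concrete example.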

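A Spark physical plan node can take part in code generation if it implements the CodegenSupport trait.
Here we follow the process end to end, using an INSERT statement against an Iceberg table as the example.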

insert into local.ods.member2
select name, age
from local.ods.member1

The resulting execution plan looks like this:

AppendDataExec
  WholeStageCodegenExec
    ProjectExec
      BatchScan
        BatchScanExec

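AppendDataExec ultimately runs the writeWithV2 method defined in WriteToDataSourceV2Exec.scala. Inside it, val tempRdd = query.execute() is executed, which yields the RDD for the SELECT part of the statement.

This execute call ends up in the doExecute() method of WholeStageCodegenExec. doExecute() is where the code is generated and run, and it finally returns the result of the SELECT query.

The generation itself happens in doCodeGen(), which calls the produce method of the child node, ProjectExec.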

final def produce(ctx: CodegenContext, parent: CodegenSupport): String = executeQuery {
  this.parent = parent
  ctx.freshNamePrefix = variablePrefix
  s"""
     |${ctx.registerComment(s"PRODUCE: ${this.simpleString(SQLConf.get.maxToStringFields)}")}
     |${doProduce(ctx)}
   """.stripMargin
}
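To make the produce/consume handshake more concrete, here is a minimal, self-contained Java sketch. It is a toy model under stated assumptions, not Spark's API: ToyOp, ToyScan, ToyProject, and ToyStage are hypothetical names, the "generated code" is just a string, and Spark's real CodegenSupport additionally threads a CodegenContext and expression code through every call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the produce/consume protocol. All names are hypothetical.
abstract class ToyOp {
    protected ToyOp parent;

    // Same shape as produce(): remember the parent, then delegate to doProduce().
    final String produce(ToyOp parent) {
        this.parent = parent;
        return doProduce();
    }

    // Called by an operator once a row's columns are available as variables.
    final String consume(List<String> vars) {
        return parent.doConsume(vars);
    }

    abstract String doProduce();
    abstract String doConsume(List<String> vars);
}

// Leaf of the stage: emits the loop that drives everything else.
class ToyScan extends ToyOp {
    @Override String doProduce() {
        return "while (input.hasNext()) {\n"
             + "  Object[] row = input.next();\n"
             + consume(Arrays.asList("row[0]", "row[1]", "row[2]"))
             + "}\n";
    }
    @Override String doConsume(List<String> vars) {
        throw new UnsupportedOperationException("a leaf has no child to consume from");
    }
}

// Projection: emits no loop of its own, it only narrows the variables passed upward.
class ToyProject extends ToyOp {
    private final ToyOp child;
    private final int[] keep;
    ToyProject(ToyOp child, int... keep) { this.child = child; this.keep = keep; }

    @Override String doProduce() { return child.produce(this); }
    @Override String doConsume(List<String> vars) {
        List<String> out = new ArrayList<>();
        for (int i : keep) out.add(vars.get(i));
        return consume(out);
    }
}

// Root of the stage: starts the produce chain and terminates the consume chain.
class ToyStage extends ToyOp {
    private final ToyOp child;
    ToyStage(ToyOp child) { this.child = child; }

    String generate() { return child.produce(this); }
    @Override String doProduce() {
        throw new UnsupportedOperationException("the root only consumes");
    }
    @Override String doConsume(List<String> vars) {
        return "  append(new Object[] {" + String.join(", ", vars) + "});\n"
             + "  if (shouldStop()) return;\n";
    }
}

class ToyCodegenDemo {
    public static void main(String[] args) {
        System.out.print(new ToyStage(new ToyProject(new ToyScan(), 0, 1)).generate());
    }
}
```

Running the demo prints a single while loop whose body appends only row[0] and row[1]: the projection leaves no loop or intermediate row of its own, which is the point of fusing a stage into one function.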

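produce in turn calls the node's own doProduce. ProjectExec's doProduce simply delegates to the produce method of its child, so the chain descends to the leaf of the stage (here the InputAdapter that wraps the scan, as the inputadapter_ prefixes in the generated code further down show), and that leaf emits the row-reading loop. WholeStageCodegenExec, as the root of the stage, takes part through doConsume rather than doProduce. The Scaladoc on doProduce in the Spark source uses aggregation as an example of the code it generates: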

/**
 * Generate the Java source code to process, should be overridden by subclass to support codegen.
 *
 * doProduce() usually generate the framework, for example, aggregation could generate this:
 *
 *   if (!initialized) {
 *     # create a hash map, then build the aggregation hash map
 *     # call child.produce()
 *     initialized = true;
 *   }
 *   while (hashmap.hasNext()) {
 *     row = hashmap.next();
 *     # build the aggregation results
 *     # create variables for results
 *     # call consume(), which will call parent.doConsume()
 *      if (shouldStop()) return;
 *   }
 */

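The generated code roughly does the following. If the node has not been initialized yet, it creates a hash map, builds the aggregation hash map by recursively calling the child's produce, and then sets initialized to true. It then iterates over the hash map, builds the aggregation results and the variables that hold them, and calls consume(), which in turn calls the parent's doConsume(). After each row it checks shouldStop() to decide whether to return.
What gets generated here is the core logic of the processNext() function. For the example in this article, the complete generated code is: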

public Object generate(Object[] references) {
  return new GeneratedIteratorForCodegenStage1(references);
}

/*wsc_codegenStageId*/
final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
  private Object[] references;
  private scala.collection.Iterator[] inputs;
  private scala.collection.Iterator inputadapter_input_0;
  private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];

  public GeneratedIteratorForCodegenStage1(Object[] references) {
    this.references = references;
  }

  public void init(int index, scala.collection.Iterator[] inputs) {
    partitionIndex = index;
    this.inputs = inputs;
    inputadapter_input_0 = inputs[0];
    project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 32);
  }

  protected void processNext() throws java.io.IOException {
    while ( inputadapter_input_0.hasNext()) {
      InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next();

      // common sub-expressions

      boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0);
      UTF8String inputadapter_value_0 = inputadapter_isNull_0 ? null : (inputadapter_row_0.getUTF8String(0));
      boolean inputadapter_isNull_1 = inputadapter_row_0.isNullAt(1);
      int inputadapter_value_1 = inputadapter_isNull_1 ? -1 : (inputadapter_row_0.getInt(1));
      project_mutableStateArray_0[0].reset();

      project_mutableStateArray_0[0].zeroOutNullBytes();

      if (inputadapter_isNull_0) {
        project_mutableStateArray_0[0].setNullAt(0);
      } else {
        project_mutableStateArray_0[0].write(0, inputadapter_value_0);
      }

      if (inputadapter_isNull_1) {
        project_mutableStateArray_0[0].setNullAt(1);
      } else {
        project_mutableStateArray_0[0].write(1, inputadapter_value_1);
      }
      append((project_mutableStateArray_0[0].getRow()));
      if (shouldStop()) return;
    }
  }
}
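The generated class only overrides init and processNext; hasNext, next, append, and shouldStop come from its base class. The following Java sketch shows that contract in simplified form so the control flow of processNext is easier to follow. It is an illustration under assumptions: ToyBufferedIterator and ToyProjectStage are hypothetical stand-ins, and rows are plain Object[] arrays rather than InternalRow or UnsafeRow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for the base-class contract (hypothetical, rows are Object[]).
abstract class ToyBufferedIterator {
    private final Deque<Object[]> currentRows = new ArrayDeque<>();

    abstract void init(int index, Iterator<Object[]> input);
    abstract void processNext();

    protected void append(Object[] row) { currentRows.add(row); }

    // Stop producing as soon as at least one row is buffered for the caller.
    protected boolean shouldStop() { return !currentRows.isEmpty(); }

    boolean hasNext() {
        if (currentRows.isEmpty()) processNext();
        return !currentRows.isEmpty();
    }

    Object[] next() { return currentRows.remove(); }
}

// Hand-written analogue of the generated stage: read three columns, keep the first two.
class ToyProjectStage extends ToyBufferedIterator {
    private Iterator<Object[]> input;
    int partitionIndex;

    @Override void init(int index, Iterator<Object[]> input) {
        this.partitionIndex = index;
        this.input = input;
    }

    @Override void processNext() {
        while (input.hasNext()) {
            Object[] row = input.next();
            append(new Object[] {row[0], row[1]});
            if (shouldStop()) return;  // hand control back after one output row
        }
    }
}

class ToyStageDemo {
    // Drains one "partition" through the stage, the way a caller pulls from the iterator.
    static List<String> run(List<Object[]> partition) {
        ToyProjectStage stage = new ToyProjectStage();
        stage.init(0, partition.iterator());
        List<String> out = new ArrayList<>();
        while (stage.hasNext()) out.add(Arrays.toString(stage.next()));
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {"ann", 31, "x"},
            new Object[] {"bob", null, "y"});
        System.out.println(run(rows));
    }
}
```

Because shouldStop() returns true as soon as a row is buffered, processNext returns after each appended row and resumes on the next hasNext() call, so the stage is pulled lazily instead of materializing the whole partition.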

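Once the code has been generated, it is compiled. If compilation throws an exception, or if the bytecode of a compiled method is too large, execution falls back to the regular path and computes the child node's RDD[InternalRow] directly via child.execute().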

val (_, compiledCodeStats) = try {
  CodeGenerator.compile(cleanedSource)
} catch {
  case NonFatal(_) if !Utils.isTesting && sqlContext.conf.codegenFallback =>
    // We should already saw the error message
    logWarning(s"Whole-stage codegen disabled for plan (id=$codegenStageId):\n $treeString")
    return child.execute()
}

// Check if compiled code has a too large function
if (compiledCodeStats.maxMethodCodeSize > sqlContext.conf.hugeMethodLimit) {
  logInfo(s"Found too long generated codes and JIT optimization might not work: " +
    s"the bytecode size (${compiledCodeStats.maxMethodCodeSize}) is above the limit " +
    s"${sqlContext.conf.hugeMethodLimit}, and the whole-stage codegen was disabled " +
    s"for this plan (id=$codegenStageId). To avoid this, you can raise the limit " +
    s"`${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}`:\n$treeString")
  return child.execute()
}
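The two fallback conditions can be condensed into a small decision function. The Java sketch below is only a model of the logic in the snippet above: ToyFallback, CompileStats, and choosePath are hypothetical names, the isTesting check is left out, and the limit used in the demo is an example value rather than a recommendation.

```java
import java.util.function.Supplier;

class ToyFallback {
    // Hypothetical stand-in for the statistics returned by compilation.
    static final class CompileStats {
        final int maxMethodCodeSize;
        CompileStats(int maxMethodCodeSize) { this.maxMethodCodeSize = maxMethodCodeSize; }
    }

    // Returns "codegen" when the compiled stage should be used, "fallback" otherwise.
    static String choosePath(Supplier<CompileStats> compile,
                             boolean fallbackEnabled,
                             int hugeMethodLimit) {
        CompileStats stats;
        try {
            stats = compile.get();
        } catch (RuntimeException e) {
            // Condition 1: compilation failed. Fall back only if that is allowed.
            if (!fallbackEnabled) throw e;
            return "fallback";
        }
        // Condition 2: compilation succeeded, but some method is larger than the limit.
        if (stats.maxMethodCodeSize > hugeMethodLimit) return "fallback";
        return "codegen";
    }

    public static void main(String[] args) {
        int limit = 65535;  // example value only
        System.out.println(choosePath(() -> new CompileStats(1200), true, limit));
        System.out.println(choosePath(() -> new CompileStats(70000), true, limit));
        System.out.println(choosePath(() -> { throw new IllegalStateException("boom"); }, true, limit));
    }
}
```

Note the asymmetry the sketch preserves: a compilation failure leads to a fallback only when the fallback flag is on, while an oversized method always does.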

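After compilation succeeds, the inputRDDs of the child nodes are computed recursively. The main call chain is listed below; the snippets come from different classes along the chain: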

val rdds = child.asInstanceOf[CodegenSupport].inputRDDs()

def inputRDDs(): Seq[RDD[InternalRow]]

override def inputRDDs(): Seq[RDD[InternalRow]] = {
  inputRDD :: Nil
}

override def inputRDD: RDD[InternalRow] = child.execute()

Because Iceberg tables are DataSource V2 tables, child.execute() ultimately calls the doExecute method of DataSourceV2ScanExecBase:

override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")
  inputRDD.map { r =>
    numOutputRows += 1
    r
  }
}

The inputRDD used here is the one defined in BatchScanExec:

override lazy val inputRDD: RDD[InternalRow] = {
  new DataSourceRDD(sparkContext, partitions, readerFactory, supportsColumnar)
}

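Once the child's RDD[InternalRow] has been obtained, the generated code is compiled a second time. The first compilation only serves to verify that the code compiles; the result of this second compilation is what is actually used. The generated class is then instantiated, its init method is called with the partition index and the input iterator, and the RDD for the whole query is returned. That RDD is the result of val tempRdd = query.execute().

This completes the process by which WholeStageCodegen generates code and uses it to produce the query's result RDD.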
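Compiling the same source twice is less wasteful than it sounds if compilation results are cached by source text. The Java sketch below illustrates that idea together with the per-partition instantiation step. It is a toy under assumptions: ToyCompileCache and ToyGeneratedIterator are hypothetical, nothing is really compiled, and in a cluster each executor JVM would hold its own cache rather than sharing the driver's.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for an instance of the generated iterator class.
class ToyGeneratedIterator {
    int partitionIndex = -1;
    void init(int index) { this.partitionIndex = index; }
}

// Toy source-keyed compile cache (a model of the idea, not Spark's CodeGenerator).
class ToyCompileCache {
    final AtomicInteger compilations = new AtomicInteger();
    private final Map<String, Supplier<ToyGeneratedIterator>> cache = new ConcurrentHashMap<>();

    // Returns a factory for the "compiled class"; the costly step runs once per distinct source.
    Supplier<ToyGeneratedIterator> compile(String source) {
        return cache.computeIfAbsent(source, s -> {
            compilations.incrementAndGet();   // stands in for the real compilation work
            return ToyGeneratedIterator::new; // every get() yields a fresh instance
        });
    }

    public static void main(String[] args) {
        ToyCompileCache compiler = new ToyCompileCache();
        String source = "final class GeneratedIteratorForCodegenStage1 { /* ... */ }";

        compiler.compile(source);  // first pass: only checks that compilation succeeds

        for (int p = 0; p < 3; p++) {
            // Second pass, once per partition: look up the class, create and init an instance.
            ToyGeneratedIterator it = compiler.compile(source).get();
            it.init(p);
        }
        System.out.println("compilations = " + compiler.compilations.get());
    }
}
```

In this single-JVM toy the counter stays at 1 even though compile is called four times, and each partition still gets its own iterator instance with its own state.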