If a physical plan node in Spark implements the CodegenSupport trait, it can take part in whole-stage code generation.
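For orientation, the pieces of the CodegenSupport contract used throughout this walkthrough look roughly like this (an abridged sketch based on Spark 3.x source; bodies are elided, and in the real trait produce is final and wraps doProduce):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.execution.SparkPlan

// Abridged sketch: only the members discussed in this article.
trait CodegenSupport extends SparkPlan {
  // RDDs that feed input rows into the generated code.
  def inputRDDs(): Seq[RDD[InternalRow]]
  // Entry point called by the parent operator; wraps doProduce (source shown later).
  def produce(ctx: CodegenContext, parent: CodegenSupport): String
  // Generates the driving "framework" code; overridden per operator.
  protected def doProduce(ctx: CodegenContext): String
  // Invoked via a child's consume() to generate this operator's per-row logic.
  def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String
}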
Let's walk through the process with an INSERT into an Iceberg table:
insert into local.ods.member2
select name, age
from local.ods.member1
The generated physical plan is:

AppendDataExec
+- WholeStageCodegenExec
   +- ProjectExec
      +- BatchScanExec (displayed as BatchScan in explain output)
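To reproduce this, both the plan and the generated Java source can be printed with something like the following sketch (assuming a SparkSession whose catalog local is configured as an Iceberg catalog):

import org.apache.spark.sql.execution.debug._

// Show the write-side plan (AppendDataExec on top) without executing the insert.
spark.sql("explain insert into local.ods.member2 select name, age from local.ods.member1")
  .show(truncate = false)

// Print the Java source generated for each WholeStageCodegen stage of the read side.
spark.sql("select name, age from local.ods.member1").debugCodegen()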
The method AppendDataExec ultimately executes is writeWithV2 in WriteToDataSourceV2Exec, which runs val tempRdd = query.execute(), i.e. the RDD holding the result of the SELECT query.
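For context, the relevant part of writeWithV2 looks roughly like this (heavily abridged and paraphrased from Spark 3.x's WriteToDataSourceV2Exec.scala; exact signatures vary across versions):

// Paraphrased sketch, not the verbatim source.
protected def writeWithV2(batchWrite: BatchWrite): Seq[InternalRow] = {
  // Execute the child plan; in our example this is the WholeStageCodegenExec stage.
  val tempRdd = query.execute()
  // One writing task per partition: each pulls rows from the iterator produced by
  // the generated code and hands them to the DataSourceV2 writer.
  sparkContext.runJob(
    tempRdd,
    (context: TaskContext, iter: Iterator[InternalRow]) =>
      DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator))
  // ... then collect the commit messages and call batchWrite.commit(...)
  Nil
}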
That execute call lands in WholeStageCodegenExec's doExecute() method, which is where the code is generated and executed, ultimately yielding the SELECT result. The generation itself happens in doCodeGen(), which calls the produce method of the child node ProjectExec:
final def produce(ctx: CodegenContext, parent: CodegenSupport): String = executeQuery {
  this.parent = parent
  ctx.freshNamePrefix = variablePrefix
  s"""
     |${ctx.registerComment(s"PRODUCE: ${this.simpleString(SQLConf.get.maxToStringFields)}")}
     |${doProduce(ctx)}
   """.stripMargin
}
produce in turn calls doProduce. ProjectExec does not drive the loop itself: its doProduce simply delegates to its child's produce, so the recursion continues down the stage until it reaches the InputAdapter wrapping the scan (hence the inputadapter_ prefixes in the generated code below). The doProduce contract is declared on CodegenSupport, in WholeStageCodegenExec.scala, and its Scaladoc illustrates the generated framework using aggregation as an example:
/**
 * Generate the Java source code to process, should be overridden by subclass to support codegen.
 *
 * doProduce() usually generate the framework, for example, aggregation could generate this:
 *
 *   if (!initialized) {
 *     # create a hash map, then build the aggregation hash map
 *     # call child.produce()
 *     initialized = true;
 *   }
 *   while (hashmap.hasNext()) {
 *     row = hashmap.next();
 *     # build the aggregation results
 *     # create variables for results
 *     # call consume(), which will call parent.doConsume()
 *     if (shouldStop()) return;
 *   }
 */
The flow of this generated framework: if not yet initialized, create a hash map, build up the aggregation hash map (recursively calling the child's produce along the way), and mark the node as initialized; then iterate over the completed hash map, build the aggregation results and the variables holding them, and call consume, which in turn calls the parent's doConsume; shouldStop is checked on each iteration to decide whether to yield early.
What is generated here is the core logic of the processNext() function. Our INSERT example involves no aggregation: the while loop over the input iterator comes from InputAdapter, and ProjectExec's doConsume contributes the code that writes the projected columns into an UnsafeRow. The complete generated code for the example is:
public Object generate(Object[] references) {
  return new GeneratedIteratorForCodegenStage1(references);
}

/*wsc_codegenStageId*/
final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
  private Object[] references;
  private scala.collection.Iterator[] inputs;
  private scala.collection.Iterator inputadapter_input_0;
  private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];

  public GeneratedIteratorForCodegenStage1(Object[] references) {
    this.references = references;
  }

  public void init(int index, scala.collection.Iterator[] inputs) {
    partitionIndex = index;
    this.inputs = inputs;
    inputadapter_input_0 = inputs[0];
    project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 32);
  }

  protected void processNext() throws java.io.IOException {
    while (inputadapter_input_0.hasNext()) {
      InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next();
      // common sub-expressions
      boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0);
      UTF8String inputadapter_value_0 = inputadapter_isNull_0 ? null : (inputadapter_row_0.getUTF8String(0));
      boolean inputadapter_isNull_1 = inputadapter_row_0.isNullAt(1);
      int inputadapter_value_1 = inputadapter_isNull_1 ? -1 : (inputadapter_row_0.getInt(1));
      project_mutableStateArray_0[0].reset();
      project_mutableStateArray_0[0].zeroOutNullBytes();
      if (inputadapter_isNull_0) {
        project_mutableStateArray_0[0].setNullAt(0);
      } else {
        project_mutableStateArray_0[0].write(0, inputadapter_value_0);
      }
      if (inputadapter_isNull_1) {
        project_mutableStateArray_0[0].setNullAt(1);
      } else {
        project_mutableStateArray_0[0].write(1, inputadapter_value_1);
      }
      append((project_mutableStateArray_0[0].getRow()));
      if (shouldStop()) return;
    }
  }
}
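The generated class extends BufferedRowIterator, whose contract ties processNext(), append() and shouldStop() together. A Scala rendition of that (Java) base class, simplified and paraphrased, makes the code above easier to read:

import java.util.LinkedList
import org.apache.spark.sql.catalyst.InternalRow

// Simplified, paraphrased rendition of org.apache.spark.sql.execution.BufferedRowIterator.
abstract class BufferedRowIteratorSketch {
  protected val currentRows = new LinkedList[InternalRow]()
  protected var partitionIndex: Int = -1

  def hasNext: Boolean = {
    // Run the generated loop until it has appended at least one row.
    if (currentRows.isEmpty) processNext()
    !currentRows.isEmpty
  }

  def next(): InternalRow = currentRows.remove()

  // Called by the generated code to emit a result row.
  protected def append(row: InternalRow): Unit = currentRows.add(row)

  // The generated loop checks this to hand control back to the caller.
  protected def shouldStop(): Boolean = !currentRows.isEmpty

  // Implemented by the generated class: the processNext() shown above.
  protected def processNext(): Unit

  // Implemented by the generated class: the init() shown above.
  def init(index: Int, iters: Array[Iterator[InternalRow]]): Unit
}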
Once generated, the code is compiled. If compilation fails, or the compiled code contains a method that is too long, execution falls back to the regular path of computing the child's RDD[InternalRow] directly:
val (_, compiledCodeStats) = try {
  CodeGenerator.compile(cleanedSource)
} catch {
  case NonFatal(_) if !Utils.isTesting && sqlContext.conf.codegenFallback =>
    // We should already saw the error message
    logWarning(s"Whole-stage codegen disabled for plan (id=$codegenStageId):\n $treeString")
    return child.execute()
}

// Check if compiled code has a too large function
if (compiledCodeStats.maxMethodCodeSize > sqlContext.conf.hugeMethodLimit) {
  logInfo(s"Found too long generated codes and JIT optimization might not work: " +
    s"the bytecode size (${compiledCodeStats.maxMethodCodeSize}) is above the limit " +
    s"${sqlContext.conf.hugeMethodLimit}, and the whole-stage codegen was disabled " +
    s"for this plan (id=$codegenStageId). To avoid this, you can raise the limit " +
    s"`${SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key}`:\n$treeString")
  return child.execute()
}
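Both escape hatches in this snippet correspond to SQL configs that can be tuned at runtime; for example:

// Whether to fall back to interpreted execution when codegen compilation fails (default: true).
spark.conf.set("spark.sql.codegen.fallback", "true")

// Bytecode-size limit above which whole-stage codegen is abandoned for a plan
// (default: 65535, the JVM's method size limit).
spark.conf.set("spark.sql.codegen.hugeMethodLimit", "65535")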
Once compilation succeeds, the children's inputRDDs are computed recursively. The main call chain (the overrides below live on InputAdapter and the InputRDDCodegen trait it mixes in):

// WholeStageCodegenExec.doExecute
val rdds = child.asInstanceOf[CodegenSupport].inputRDDs()

// CodegenSupport: the contract
def inputRDDs(): Seq[RDD[InternalRow]]

// InputRDDCodegen: a single input RDD
override def inputRDDs(): Seq[RDD[InternalRow]] = {
  inputRDD :: Nil
}

// InputAdapter: the input RDD is the child's execution result
override def inputRDD: RDD[InternalRow] = child.execute()
Because Iceberg tables are DataSource V2 tables, child.execute() here ends up in DataSourceV2ScanExecBase's doExecute method:
override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")
  inputRDD.map { r =>
    numOutputRows += 1
    r
  }
}
The inputRDD here is the one defined in BatchScanExec:
override lazy val inputRDD: RDD[InternalRow] = {
  new DataSourceRDD(sparkContext, partitions, readerFactory, supportsColumnar)
}
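Each partition of this DataSourceRDD carries an InputPartition planned by the scan; at runtime, compute creates a reader from readerFactory and drives it. Roughly (a paraphrased sketch of the row-based path; the partition class's field names vary across Spark versions):

// Paraphrased sketch of DataSourceRDD#compute (row-based, non-columnar path).
override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
  val inputPartition = split.asInstanceOf[DataSourceRDDPartition].inputPartition
  val reader = readerFactory.createReader(inputPartition)
  new Iterator[InternalRow] {
    private var advanced = false
    private var hasMore = false
    override def hasNext: Boolean = {
      if (!advanced) { hasMore = reader.next(); advanced = true } // advance the reader once
      hasMore
    }
    override def next(): InternalRow = {
      if (!hasNext) throw new NoSuchElementException
      advanced = false
      reader.get() // the reader's current row
    }
  }
}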
After the children's RDD[InternalRow] has been obtained, the generated code is compiled a second time. The first compilation only verified that the code compiles; this second one, which happens inside the RDD closure on the executors, is the one actually used. The reflectively generated class is instantiated and initialized with the input iterators via its init method, and the resulting iterator becomes the query's overall RDD result, i.e. the result of val tempRdd = query.execute(). That completes the process by which WholeStageCodegen generates code and uses it to obtain the query's RDD result.
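Concretely, the single-input case at the end of WholeStageCodegenExec.doExecute wires the generated class onto the input RDD roughly as follows (abridged from Spark 3.x; the two-input case used by joins is analogous, and metric bookkeeping is omitted):

rdds.head.mapPartitionsWithIndex { (index, iter) =>
  // Runs on the executor: compile (cached per JVM) and instantiate the generated class.
  val (clazz, _) = CodeGenerator.compile(cleanedSource)
  val buffer = clazz.generate(references).asInstanceOf[BufferedRowIterator]
  buffer.init(index, Array(iter))
  new Iterator[InternalRow] {
    override def hasNext: Boolean = buffer.hasNext
    override def next: InternalRow = buffer.next()
  }
}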