Whole stage codegen is a feature introduced in Spark 2.0, so it gets its own section here at the end.
For background, see the Spark JIRA: https://issues.apache.org/jira/browse/SPARK-12795
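Whole stage codegen can improve performance substantially.
As the figure below shows, it translates a tree of operators into a single piece of code, which removes the per-row virtual function calls between operators and lets intermediate values stay in local variables.
More on how codegen works, along with benchmark results: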
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
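To make the idea of "translating a tree into one piece of code" concrete, here is a toy comparison in plain Scala. It is only a sketch: the names (Operator, Scan, Filter, SumAgg) are hypothetical and are not Spark's operator classes, and the fused loop is hand-written rather than generated.

```scala
// Toy model only: hypothetical names, not Spark's operator classes.
object FusedLoopSketch {
  // Iterator (Volcano) style: every operator pulls rows from its child,
  // so each row crosses several operator boundaries.
  trait Operator { def rows(): Iterator[Long] }

  final class Scan(data: Array[Long]) extends Operator {
    def rows(): Iterator[Long] = data.iterator
  }
  final class Filter(child: Operator, pred: Long => Boolean) extends Operator {
    def rows(): Iterator[Long] = child.rows().filter(pred)
  }
  final class SumAgg(child: Operator) extends Operator {
    def rows(): Iterator[Long] = Iterator.single(child.rows().sum)
  }

  // Evaluate the tree SumAgg <- Filter <- Scan through the iterator interface.
  def volcano(data: Array[Long]): Long =
    new SumAgg(new Filter(new Scan(data), _ % 2 == 0)).rows().next()

  // What a fused version of the same plan could look like:
  // one tight loop, with no operator boundaries per row.
  def fused(data: Array[Long]): Long = {
    var sum = 0L
    var i = 0
    while (i < data.length) {
      val v = data(i)
      if (v % 2 == 0) sum += v
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val data = Array.tabulate(10)(_.toLong) // 0..9
    println(volcano(data))
    println(fused(data))
  }
}
```

Both functions compute the same result; the second is the shape of code that whole stage codegen aims to produce for a chain of operators.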
Whole stage codegen is enabled by default:
val WHOLESTAGE_CODEGEN_ENABLED = buildConf("spark.sql.codegen.wholeStage")
  .internal()
  .doc("When true, the whole stage (of multiple operators) will be compiled into single java" +
    " method.")
  .booleanConf
  .createWithDefault(true)
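Since the flag is an ordinary SQL conf, it can be toggled per session to compare plans. The following is a minimal sketch, assuming a Spark 2.x `spark-sql` dependency on the classpath; the app name, master, and sample query are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

object WholeStageToggle {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are arbitrary.
    val spark = SparkSession.builder()
      .appName("whole-stage-toggle")
      .master("local[*]")
      .getOrCreate()

    // With the default setting, operators inside a codegen stage
    // are printed with a leading '*' in the physical plan.
    spark.range(0, 1000).filter("id % 2 = 0").explain()

    // Turn whole stage codegen off for this session and print the plan again.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    spark.range(0, 1000).filter("id % 2 = 0").explain()

    spark.stop()
  }
}
```

Comparing the two printed plans is a quick way to check which operators were collapsed into a codegen stage.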
The entry point is in preparations (in QueryExecution):
protected def preparations: Seq[Rule[SparkPlan]] = Seq(
  python.ExtractPythonUDFs,
  PlanSubqueries(sparkSession),
  EnsureRequirements(sparkSession.sessionState.conf),
  CollapseCodegenStages(sparkSession.sessionState.conf),
  ReuseExchange(sparkSession.sessionState.conf),
  ReuseSubquery(sparkSession.sessionState.conf))
Among these, CollapseCodegenStages is the entry point for the codegen optimization.
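The order of the rules matters: they are applied one after another, so CollapseCodegenStages sees a plan in which EnsureRequirements has already inserted the Exchange nodes. The sketch below models that sequential application with a fold. It is a toy: Rule, Tag, and the string "plans" are hypothetical stand-ins, not Spark classes.

```scala
// Toy model: `Rule`, `Tag` and the string "plans" are stand-ins, not Spark classes.
object PreparationsSketch {
  trait Rule[T] { def apply(plan: T): T }

  // Each hypothetical rule just wraps the plan in its own name,
  // which makes the order of application visible in the result.
  final case class Tag(name: String) extends Rule[String] {
    def apply(plan: String): String = s"$name($plan)"
  }

  val preparations: Seq[Rule[String]] =
    Seq(Tag("EnsureRequirements"), Tag("CollapseCodegenStages"), Tag("ReuseExchange"))

  // Apply the rules in sequence, feeding each rule's output into the next one.
  def prepare(plan: String): String =
    preparations.foldLeft(plan)((p, rule) => rule(p))

  def main(args: Array[String]): Unit =
    println(prepare("plan"))
}
```

The innermost wrapper is the first rule applied, so the later rules operate on the output of the earlier ones.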
Its apply method runs the codegen logic only when whole stage codegen is enabled:
def apply(plan: SparkPlan): SparkPlan = {
  if (conf.wholeStageEnabled) {
    WholeStageCodegenId.resetPerQuery()
    insertWholeStageCodegen(plan)
  } else {
    plan
  }
}
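The body of insertWholeStageCodegen is not shown here, so the following is only a simplified model of the general idea: walk the plan, wrap each maximal run of codegen-capable operators in a stage node, and put a boundary marker where a non-codegen operator interrupts the run. All types (Plan, Op, WholeStage, InputAdapter) and the plain counter are assumptions made for this sketch; Spark's real rule handles more cases.

```scala
// Simplified model, not Spark's implementation. All names here are hypothetical.
object CollapseSketch {
  sealed trait Plan
  final case class Op(name: String, codegen: Boolean, children: List[Plan]) extends Plan
  final case class WholeStage(id: Int, child: Plan) extends Plan // root of a fused stage
  final case class InputAdapter(child: Plan) extends Plan        // boundary of a fused stage

  // A plain counter stands in for the stage id generator (not thread safe).
  private var nextId = 1

  def collapse(plan: Plan): Plan = { nextId = 1; insertWholeStage(plan) }

  // Start a new stage at every codegen-capable operator reached from outside a stage.
  private def insertWholeStage(plan: Plan): Plan = plan match {
    case op: Op if op.codegen =>
      val fused = insertInputAdapter(op) // children first, so deeper stages get smaller ids
      val id = nextId
      nextId += 1
      WholeStage(id, fused)
    case op: Op => op.copy(children = op.children.map(insertWholeStage))
    case other  => other
  }

  // Inside a stage: keep fusing until an operator without codegen support appears.
  private def insertInputAdapter(plan: Plan): Plan = plan match {
    case op: Op if !op.codegen => InputAdapter(insertWholeStage(op))
    case op: Op                => op.copy(children = op.children.map(insertInputAdapter))
    case other                 => other
  }

  // Render operators top-down, prefixing fused ones with "*(id) ".
  def render(plan: Plan, stage: Option[Int] = None): List[String] = plan match {
    case WholeStage(id, child) => render(child, Some(id))
    case InputAdapter(child)   => render(child, None)
    case Op(name, _, children) =>
      val prefix = stage.map(id => s"*($id) ").getOrElse("")
      (prefix + name) :: children.flatMap(render(_, stage))
  }

  def main(args: Array[String]): Unit = {
    val scan = Op("Scan", codegen = true, Nil)
    val filter = Op("Filter", codegen = true, List(scan))
    val project = Op("Project", codegen = true, List(filter))
    val exchange = Op("Exchange", codegen = false, List(project))
    val sort = Op("Sort", codegen = true, List(exchange))
    render(collapse(sort)).foreach(println)
  }
}
```

In this model the operators below the Exchange form one stage and the Sort above it forms another, with the Exchange itself left outside any stage.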
WholeStageCodegenId is simply an incrementing, thread-local counter; resetPerQuery resets it to 1:
object WholeStageCodegenId {
  private val codegenStageCounter = ThreadLocal.withInitial(new Supplier[Integer] {
    override def get() = 1 // TODO: change to Scala lambda syntax when upgraded to Scala 2.12+
  })
  def resetPerQuery(): Unit = codegenStageCounter.set(1)
  def getNextStageId(): Int = {
    val counter = codegenStageCounter
    val id = counter.get()
    counter.set(id + 1)
    id
  }
}
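Because the counter lives in a ThreadLocal, each thread gets its own sequence of ids. The standalone sketch below mirrors the shape of the object above to show that behavior; StageIdSketch and onNewThread are hypothetical names introduced for this demo, and the counter is built with initialValue instead of a Supplier.

```scala
// Standalone replica for illustration; `StageIdSketch` and `onNewThread` are hypothetical.
object StageIdSketch {
  // Per-thread counter that starts at 1 on every thread.
  private val counter = new ThreadLocal[Int] {
    override def initialValue(): Int = 1
  }

  def resetPerQuery(): Unit = counter.set(1)

  // Return the current id and advance the counter of the calling thread.
  def getNextStageId(): Int = {
    val id = counter.get()
    counter.set(id + 1)
    id
  }

  // Run `body` on a fresh thread and return its result.
  def onNewThread[A](body: => A): A = {
    var result: Option[A] = None
    val t = new Thread(new Runnable {
      def run(): Unit = { result = Some(body) }
    })
    t.start()
    t.join()
    result.get
  }

  def main(args: Array[String]): Unit = {
    resetPerQuery()
    println(getNextStageId())              // first id on this thread
    println(getNextStageId())              // second id on this thread
    println(onNewThread(getNextStageId())) // a new thread starts its own sequence
    println(getNextStageId())              // this thread continues where it left off
  }
}
```

Ids handed out on one thread do not affect the sequence on another, which is why the counter can be reset at the start of each query without coordinating across threads.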
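Recall the numbers 1, 2, ... 5 in front of each stage in the physical plan shown earlier. Each one is a WholeStageCodegenId, which ties the class generated by codegen to its operators; the leading * means that the stage went through codegen. Note that Exchange is not codegen'd, because it does no computation and is only a shuffle.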
*(5) Project [B#6]
+- *(5) SortMergeJoin [B#6], [B#14], Inner
   :- *(2) Sort [B#6 ASC NULLS FIRST], false, 0
   :  +- Exchange(coordinator id: 1121577170) hashpartitioning(B#6, 200), coordinator[target post-shuffle partition size: 67108864]
   :     +- *(1) Project [B#6]
   :        +- *(1) Filter isnotnull(B#6)
   :           +- *(1) FileScan json [B#6] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:examples/src/main/resources/test.json], PartitionFilters: [], PushedFilters: [IsNotNull(B)], ReadSchema