Spark在任务提交后首先会在DAGScheduler中根据任务划分为不同的stage,起点在DAGScheduler的handleJobSubmitted()方法中。
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
这里的finalRDD是action算子之前的最后一个transform算子。
这里会以逻辑上的最后一个transform算子开始从后往前重新根据宽窄依赖构建stage,其中,最末端的stage称为finalStage,获得其的createResultStage()也是整个dag开始构造的起点。
private def createResultStage(
rdd: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
jobId: Int,
callSite: CallSite): ResultStage = {
checkBarrierStageWithDynamicAllocation(rdd)
checkBarrierStageWithNumSlots(rdd)
checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
val parents = getOrCreateParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
在createResultStage(),首先通过getOrCreateParentStages()方法不断从前置算子构造stage。
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
getShuffleDependencies(rdd).map { shuffleDep =>
getOrCreateShuffleMapStage(shuffleDep, firstJobId)
}.toList
}
在整个dag的流程中,在于区分宽窄依赖,窄依赖联系的上下游rdd可作为同一个stage,而宽依赖之间的上下游rdd就需要区分为不同的stage。
在首次进入getOrCreateParentStages()方法中,作为参数的是最后的算子,首先会通过getShuffleDependencies()方法获取所有的shuffle依赖,也就是获取所有宽依赖,在这个方法,实则在归纳宽依赖的同时归并所有遇到宽依赖之前的窄依赖。
private[scheduler] def getShuffleDependencies(
rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
val parents = new HashSet[ShuffleDependency[_, _, _]]
val visited = new HashSet[RDD[_]]
val waitingForVisit = new ArrayStack[RDD[_]]
waitingForVisit.push(rdd)
while (waitingForVisit.nonEmpty) {
val toVisit = waitingForVisit.pop()
if (!visited(toVisit)) {
visited += toVisit
toVisit.dependencies.foreach {
case shuffleDep: ShuffleDependency[_, _, _] =>
parents += shuffleDep
case dependency =>
waitingForVisit.push(dependency.rdd)
}
}
}
parents
}
以逻辑最末的算子为起点,此处存在两个set分别用来存放已经扫描过的算子和得到的宽依赖,而需要扫描的算子都会存放到堆栈中依次弹出对其上游依赖进行分析。
此处,如果其依次往上游不断检查依赖关系,直到全部到达宽依赖或者扫描完毕,得到的宽依赖集合返回,作为区分stage的依据。
回到getOrCreateParentStages(),得到的宽依赖算子会依次通过getOrCreateShuffleMapStage()中的createShuffleMapStage()方法来构造stage。
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
val rdd = shuffleDep.rdd
checkBarrierStageWithDynamicAllocation(rdd)
checkBarrierStageWithNumSlots(rdd)
checkBarrierStageWithRDDChainPattern(rdd, rdd.getNumPartitions)
val numTasks = rdd.partitions.length
val parents = getOrCreateParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
val stage = new ShuffleMapStage(
id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)
stageIdToStage(id) = stage
shuffleIdToMapStage(shuffleDep.shuffleId) = stage
updateJobIdStageIdMaps(jobId, stage)
if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
// Kind of ugly: need to register RDDs with the cache and map output tracker here
// since we can't do it in the RDD constructor because # of partitions is unknown
logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
}
stage
}
具体的构造和一开始的resultStage一样,还是通过getOrCreateParentStages()方法不断重复上述的过程首先构造上游的stage,再构造当前的stage。
按照上述的流程,将会根据宽依赖的情况,从上游开始不断生成stage直到一开始的最末尾的算子生成finalStage结束。
以下面这段代码为例子。
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("GroupBy Test")
.getOrCreate()
val numMappers = if (args.length > 0) args(0).toInt else 2
val numKVPairs = if (args.length > 1) args(1).toInt else 1000
val valSize = if (args.length > 2) args(2).toInt else 1000
val numReducers = if (args.length > 3) args(3).toInt else numMappers
val pairs1 = spark.sparkContext.parallelize(0 until numMappers, numMappers).flatMap { p =>
val ranGen = new Random
val arr1 = new Array[(Int, Array[Byte])](numKVPairs)
for (i <- 0 until numKVPairs) {
val byteArr = new Array[Byte](valSize)
ranGen.nextBytes(byteArr)
arr1(i) = (ranGen.nextInt(Int.MaxValue), byteArr)
}
arr1
}
println(pairs1.groupByKey(numReducers).count())
spark.stop()
}
其分为三个transform算子,从逻辑起始开始分别为parallel,map,groupByKey,其中map到groupByKey为宽依赖,经过dag的划分,将会被分为两个stage,finalStage只有一个shuffle算子,而其父stage将会有两个算子分别为map和parallel,符合上述的dag划分的流程。
在完成stage的划分之后,就会将最后得到的finalStage提交,准备交给taskScheduler进行调度。
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
在作为参数的finalStage进入之后将会不断得到其父stage进行递归提交,直到到最上端的stage,而子stage只会在其父stage全部提交完才会进行提交。
提交会通过submitMissingTasks()方法。
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
stage.rdd.isBarrier())
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
if (tasks.size > 0) {
logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
}
在这里会根据分区数量将stage转为对应数量的task,之后将其作为taskSet交由taskScheduler进行调度。