DAGScheduler is the high-level scheduler that implements stage-oriented scheduling. It cuts each job into stages at shuffle boundaries, computes a DAG of stages for the job, keeps track of which RDDs and stage outputs have been materialized, and finds a minimal schedule for running the job. Within each stage it then builds a set of tasks, wraps them in a TaskSet, uses the current cache status to choose a preferred location for each task (so a task runs on the node that holds its data), and hands the TaskSet to the TaskScheduler via submitTasks. It also handles stage failures caused by lost shuffle output by resubmitting the earlier stages. The scaladoc and declaration from the source:
/**
* The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster.
*
* In addition to coming up with a DAG of stages, this class also determines the preferred
* locations to run each task on, based on the current cache status, and passes these to the
* low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
* lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
* not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
* a small number of times before cancelling the whole stage.
*
*/
private[spark]
class DAGScheduler(
private[scheduler] val sc: SparkContext,
private[scheduler] val taskScheduler: TaskScheduler,
listenerBus: LiveListenerBus,
mapOutputTracker: MapOutputTrackerMaster,
blockManagerMaster: BlockManagerMaster,
env: SparkEnv,
clock: Clock = SystemClock)
extends Logging
When SparkContext is initialized, a series of things are configured, for example the memory settings are checked (this section focuses on the source code and implementation of the DAGScheduler, so the process of setting up conf and reading configuration is not covered here). SparkContext then creates and starts the scheduler:
// Create and start the scheduler
private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
Props(new HeartbeatReceiver(taskScheduler)), "HeartbeatReceiver")
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
dagScheduler = new DAGScheduler(this)
} catch {
case e: Exception => throw
new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
}
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()
When SparkContext creates and starts the scheduler, it first builds a TaskScheduler instance based on the master URL (the TaskScheduler is not started at this point, because the DAGScheduler's primary constructor has not yet run and set the DAGScheduler reference on it), then sets up the heartbeat between the driver and the executors, and finally creates the DAGScheduler instance with the following line:
dagScheduler = new DAGScheduler(this)
You may have noticed this line just before the dagScheduler object is instantiated:
@volatile private[spark] var dagScheduler: DAGScheduler = _
The @volatile annotation marks dagScheduler as a volatile field: a write made by one thread becomes immediately visible to every other thread that reads the field (a small standalone sketch of this guarantee follows). Continuing into DAGScheduler's primary constructor: it creates dagSchedulerActorSupervisor, an instance of the DAGSchedulerActorSupervisor actor that supervises the DAGScheduler's event-processing actor; the events handled by that child actor include job submission and cancellation as well as stage and task cancellation. The constructor then calls initializeEventProcessActor to set up eventProcessActor, the actor responsible for receiving and dispatching the DAGScheduler's event messages. The constructor code is shown right after the sketch.
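A minimal, standalone sketch of the visibility guarantee that @volatile provides (this is not Spark code; all names are made up):
// Standalone sketch of @volatile visibility, not Spark code.
object VolatileDemo extends App {
  @volatile var ready = false   // volatile: a write by one thread is published to all others
  var payload = 0

  new Thread(new Runnable {
    def run(): Unit = {
      payload = 42
      ready = true              // the volatile write also publishes the earlier write to payload
    }
  }).start()

  while (!ready) {}             // without @volatile this loop might never observe the update
  println(payload)              // prints 42
}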
private val dagSchedulerActorSupervisor =
env.actorSystem.actorOf(Props(new DAGSchedulerActorSupervisor(this)))
private[scheduler] var eventProcessActor: ActorRef = _
private def initializeEventProcessActor() {
// blocking the thread until supervisor is started, which ensures eventProcessActor is
// not null before any job is submitted
implicit val timeout = Timeout(30 seconds)
val initEventActorReply =
dagSchedulerActorSupervisor ? Props(new DAGSchedulerEventProcessActor(this))
eventProcessActor = Await.result(initEventActorReply, timeout.duration).
asInstanceOf[ActorRef]
}
initializeEventProcessActor()
Next, in SparkContext's runJob method the dagScheduler object calls DAGScheduler's runJob method (SparkContext.runJob is the main entry point for all actions in Spark):
/**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark. The allowLocal
* flag specifies whether the scheduler can run the computation on the driver rather than
* shipping it out to the cluster, for short actions like first().
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
allowLocal: Boolean,
resultHandler: (Int, U) => Unit) {
if (dagScheduler == null) {
throw new SparkException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
val start = System.nanoTime
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
resultHandler, localProperties.get)
logInfo(
"Job finished: " + callSite.shortForm + ", took " + (System.nanoTime - start) / 1e9 + " s")
rdd.doCheckpoint()
}
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
allowLocal: Boolean,
resultHandler: (Int, U) => Unit,
properties: Properties = null)
{
val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
waiter.awaitResult() match {
case JobSucceeded => {}
case JobFailed(exception: Exception) =>
logInfo("Failed to run " + callSite.shortForm)
throw exception
}
}
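For context, here is a hypothetical driver program; every action in it (collect, count, first, ...) funnels into one of the SparkContext.runJob overloads above, which in turn delegate to DAGScheduler.runJob. The application name, master URL and data are purely illustrative:
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program: each action below ends up in SparkContext.runJob.
object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RunJobDemo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, 4)

    // collect() is an action: it calls sc.runJob over all partitions.
    val evens = rdd.filter(_ % 2 == 0).collect()

    // runJob can also be called directly with a per-partition function
    // and an explicit subset of partitions.
    val partitionSums: Array[Int] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum, Seq(0, 1), allowLocal = false)

    println(s"even values: ${evens.length}, partition sums: ${partitionSums.mkString(", ")}")
    sc.stop()
  }
}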
Tracing further into DAGScheduler's runJob method, we find the key line that submits the job:
val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
/**
* Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
* can be used to block until the job finishes executing or can be used to cancel the job.
*/
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
allowLocal: Boolean,
resultHandler: (Int, U) => Unit,
properties: Properties = null): JobWaiter[U] =
{
// Check to make sure we are not launching a task on a partition that does not exist.
val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
throw new IllegalArgumentException(
"Attempting to access a non-existent partition: " + p + ". " +
"Total number of partitions: " + maxPartitions)
}
val jobId = nextJobId.getAndIncrement()
if (partitions.size == 0) {
return new JobWaiter[U](this, jobId, 0, resultHandler)
}
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
eventProcessActor ! JobSubmitted(
jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties)
waiter
}
Besides validating the requested partitions and allocating a jobId, the real work of submitJob happens at the end: eventProcessActor performs the job submission. The "!" operator means the job is submitted as an asynchronous Akka message (fire-and-forget), which is why JobSubmitted is defined as a case class, so that the receiving actor can pattern-match on it.
eventProcessActor ! JobSubmitted(
jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties)
private[scheduler] case class JobSubmitted(
jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
allowLocal: Boolean,
callSite: CallSite,
listener: JobListener,
properties: Properties = null)
extends DAGSchedulerEvent
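To make this send-and-match flow concrete, here is a minimal standalone Akka sketch (not Spark code; Greeter and Greet are made-up names). The ! operator is an asynchronous, fire-and-forget send, in contrast to the ? (ask) pattern used in initializeEventProcessActor, which waits for a reply:
import akka.actor.{Actor, ActorSystem, Props}

// The message is a case class so that the actor's receive method can pattern-match on it.
case class Greet(name: String)

class Greeter extends Actor {
  def receive = {
    case Greet(name) => println(s"Hello, $name")   // matched just like JobSubmitted above
  }
}

object FireAndForgetDemo extends App {
  val system = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! Greet("Spark")   // asynchronous send; the caller does not block or expect a reply
  Thread.sleep(500)          // give the actor time to process before shutting down
  system.shutdown()
}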
If you recall the dagSchedulerActorSupervisor object from the beginning, you will see that job submission is actually carried out by the DAGSchedulerEventProcessActor class: its receive method pattern-matches JobSubmitted and calls dagScheduler.handleJobSubmitted, which cuts the job into stages, completing the job-to-stage conversion and producing the finalStage (every job has exactly one finalStage).
val initEventActorReply =
dagSchedulerActorSupervisor ? Props(new DAGSchedulerEventProcessActor(this))
private[scheduler] class DAGSchedulerEventProcessActor(dagScheduler: DAGScheduler)
extends Actor with Logging {
override def preStart() {
// set DAGScheduler for taskScheduler to ensure eventProcessActor is always
// valid when the messages arrive
dagScheduler.taskScheduler.setDAGScheduler(dagScheduler)
}
/**
* The main event loop of the DAG scheduler.
*/
def receive = {
case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,
listener, properties)
case StageCancelled(stageId) =>
dagScheduler.handleStageCancellation(stageId)
case JobCancelled(jobId) =>
dagScheduler.handleJobCancellation(jobId)
case JobGroupCancelled(groupId) =>
dagScheduler.handleJobGroupCancelled(groupId)
case AllJobsCancelled =>
dagScheduler.doCancelAllJobs()
case ExecutorAdded(execId, host) =>
dagScheduler.handleExecutorAdded(execId, host)
case ExecutorLost(execId) =>
dagScheduler.handleExecutorLost(execId)
case BeginEvent(task, taskInfo) =>
dagScheduler.handleBeginEvent(task, taskInfo)
case GettingResultEvent(taskInfo) =>
dagScheduler.handleGetTaskResult(taskInfo)
case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
dagScheduler.handleTaskCompletion(completion)
case TaskSetFailed(taskSet, reason) =>
dagScheduler.handleTaskSetFailed(taskSet, reason)
case ResubmitFailedStages =>
dagScheduler.resubmitFailedStages()
}
override def postStop() {
// Cancel any active jobs in postStop hook
dagScheduler.cleanUpAfterSchedulerStop()
}
}
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
allowLocal: Boolean,
callSite: CallSite,
listener: JobListener,
properties: Properties = null)
{
var finalStage: Stage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
if (finalStage != null) {
val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
job.jobId, callSite.shortForm, partitions.length, allowLocal))
logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val shouldRunLocally =
localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
if (shouldRunLocally) {
// Compute very short actions like first() or take() with no parent stages locally.
listenerBus.post(SparkListenerJobStart(job.jobId, Array[Int](), properties))
runLocally(job)
} else {
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.resultOfJob = Some(job)
listenerBus.post(SparkListenerJobStart(job.jobId, jobIdToStageIds(jobId).toArray,
properties))
submitStage(finalStage)
}
}
submitWaitingStages()
}
At this point handleJobSubmitted builds the finalStage from the finalRDD via newStage. If that stage is not null, it creates an ActiveJob, clears the cached location map (clearCacheLocs), and then decides how to run the job: if local execution is enabled, allowLocal is set, the final stage has no parents and there is only one partition, the job is run locally on the driver (runLocally, useful for very short actions and for debugging); otherwise submitStage(finalStage) submits it to the cluster.
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
// Note: the RDD here is the job's final RDD, not the whole RDD chain; finalStage is built from this finalRDD
// newStage produces either a result stage or a shuffle map stage; the stage's isShuffleMap flag distinguishes the two
finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
The finalStage returned by newStage already carries all of the parent stages it depends on: getParentStages builds the stage's dependency relationships, and its visit function walks the RDD lineage backwards to construct the DAG. When it meets a narrow dependency, the parent RDD is kept in the current stage; when it meets a wide dependency, it cuts a stage boundary there and recursively creates the parent stages for the wide dependency.
/**
* Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
* of a shuffle map stage in newOrUsedStage. The stage will be associated with the provided
* jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
* directly.
*/
private def newStage(
rdd: RDD[_],
numTasks: Int,
shuffleDep: Option[ShuffleDependency[_, _, _]],
jobId: Int,
callSite: CallSite)
: Stage =
{
val id = nextStageId.getAndIncrement()
val stage =
new Stage(id, rdd, numTasks, shuffleDep, getParentStages(rdd, jobId), jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
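To make the narrow/wide split just described concrete, here is a hypothetical word-count job and how its lineage is cut into stages (the input path, master URL and app name are made up; in Spark versions before 1.3 the SparkContext._ import is needed for reduceByKey):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Hypothetical job: reduceByKey introduces a ShuffleDependency, so the lineage is
// cut into a shuffle map stage plus a result (final) stage.
object StageSplitDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageSplitDemo").setMaster("local[2]"))

    val counts = sc.textFile("input.txt")   // narrow: stays in the same stage
      .flatMap(_.split("\\s+"))             // narrow: stays in the same stage
      .map(word => (word, 1))               // narrow: stays in the same stage
      .reduceByKey(_ + _)                   // wide: ShuffleDependency => stage boundary
      .collect()                            // action: one job; its finalStage is the result stage

    println(counts.length)
    sc.stop()
  }
}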
In submitStage, getMissingParentStages(stage) derives the parent stages of the final stage, i.e. it follows the RDD dependency relationships: parent stages are generated by walking rdd.dependencies. If a dependency is a wide dependency (ShuffleDependency), a mapStage is created as the parent of the finalStage; in other words, a job that requires a shuffle is handled with a mapStage plus the finalStage. If a dependency is a narrow dependency, no new stage is created.
val missing = getMissingParentStages(stage).sortBy(_.id)
// Starting from the final stage, find all of its parent stages that still need to be computed
private def getMissingParentStages(stage: Stage): List[Stage] = {
val missing = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new Stack[RDD[_]]
def visit(rdd: RDD[_]) {
if (!visited(rdd)) {
visited += rdd
if (getCacheLocs(rdd).contains(Nil)) {
for (dep <- rdd.dependencies) {
dep match {
// For a ShuffleDependency, get (or create) the corresponding shuffle map stage; if that stage is not yet available, add it to missing
case shufDep: ShuffleDependency[_, _, _] => //ShuffleDependency
val mapStage = getShuffleMapStage(shufDep, stage.jobId)
if (!mapStage.isAvailable) {
missing += mapStage
}
case narrowDep: NarrowDependency[_] => //NarrowDependency
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
waitingForVisit.push(stage.rdd)
while (!waitingForVisit.isEmpty) {
visit(waitingForVisit.pop())
}
missing.toList
}
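The walk above deliberately maintains its own Stack instead of recursing, so that an extremely long lineage cannot overflow the JVM call stack. A generic, standalone sketch of the same idiom (not Spark code):
import scala.collection.mutable.{HashSet, Stack}

// Generic sketch of the explicit-stack traversal idiom: depth is bounded by the
// stack we manage ourselves, not by the JVM call stack.
object IterativeDfs {
  def reachable[A](start: A, neighbours: A => Seq[A]): Set[A] = {
    val visited = new HashSet[A]
    val waitingForVisit = new Stack[A]
    waitingForVisit.push(start)
    while (waitingForVisit.nonEmpty) {
      val node = waitingForVisit.pop()
      if (!visited(node)) {
        visited += node
        neighbours(node).foreach(waitingForVisit.push)
      }
    }
    visited.toSet
  }

  def main(args: Array[String]): Unit = {
    // A tiny DAG: 3 depends on 2 and 1, 2 on 1, 1 on 0.
    val parents = Map(3 -> Seq(2, 1), 2 -> Seq(1), 1 -> Seq(0), 0 -> Seq.empty[Int])
    println(reachable(3, parents).toList.sorted)   // List(0, 1, 2, 3)
  }
}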
If the current stage has no missing parents (it either has no dependencies or they have all been computed), submitMissingTasks is called to submit the stage; otherwise submitStage recursively submits each missing parent first (the parents returned by getMissingParentStages, in ascending stage-id order), so that all parent stages finish before their children, and eventually submitMissingTasks is called for every stage. As the code shows, the DAGScheduler submits work to the TaskScheduler one stage at a time, with each stage packaged as a TaskSet; from here on the work is handed over to the TaskScheduler.
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing == Nil) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id)
}
}
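Finally, a toy sketch of the parent-first order this recursion produces (not Spark code, and simplified: in Spark a stage with missing parents is parked in waitingStages and resubmitted once its parents finish, whereas here we simply recurse again):
import scala.collection.mutable.ListBuffer

// Toy model: a stage is "submitted" only after all of its parents have been submitted.
object SubmitOrderDemo {
  case class Stage(id: Int, parents: List[Stage])

  def submit(stage: Stage, submitted: ListBuffer[Int]): Unit = {
    val missing = stage.parents.filterNot(p => submitted.contains(p.id)).sortBy(_.id)
    if (missing.isEmpty) submitted += stage.id
    else {
      missing.foreach(submit(_, submitted))   // submit parents first
      submit(stage, submitted)                // then retry this stage
    }
  }

  def main(args: Array[String]): Unit = {
    val s0 = Stage(0, Nil)
    val s1 = Stage(1, List(s0))
    val s2 = Stage(2, List(s0, s1))           // s2 plays the role of the final stage
    val order = ListBuffer[Int]()
    submit(s2, order)
    println(order.mkString(" -> "))           // 0 -> 1 -> 2: parents complete before children
  }
}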