1. DAGScheduler.scala
Main responsibilities:
1. For each job, the DAGScheduler computes a DAG of stages, keeps track of which RDDs and which stage outputs are materialized (written to disk), and finds an optimal path for running the Spark job based on what is cached and what has been checkpointed. Once this is done, it submits each stage as a TaskSet to the TaskScheduler implementation that was passed in when the DAGScheduler object was created, and the TaskScheduler sends the tasks to executors for execution.
2. In addition, for each stage's DAG, the DAGScheduler determines the preferred locations for running each task, based on the current cache status, and passes these on to the TaskScheduler. If a stage fails because shuffle output files were lost, Spark resubmits the earlier stage for execution. For failures that occur inside a stage and are not caused by shuffle-file loss, Spark spends a little time retrying each task before cancelling the whole stage.
Key players:
EventLoop: the event queue; submitted jobs are placed into the EventLoop's queue.
ListenerBus: the listener bus, onto which job events are posted.
runJob(): runs a job; it is invoked from the runJob method of SparkContext.
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // Submit a job; the returned JobWaiter reflects the state of the running job
  val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
  waiter.awaitResult() match {
    case JobSucceeded => // the job ran successfully
      logInfo("Job %d finished: %s, took %f s".format(
        waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case JobFailed(exception: Exception) => // the job failed
      logInfo("Job %d failed: %s, took %f s".format(
        waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      throw exception
  }
}
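For context, here is how a job reaches runJob in the first place: a minimal sketch, assuming a Spark 1.x SparkContext running in local mode. Every action funnels into SparkContext.runJob, which in turn calls DAGScheduler.runJob.

import org.apache.spark.{SparkConf, SparkContext}

object RunJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("runJob-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // count() is implemented on top of sc.runJob, so this one line triggers
    // the whole runJob -> submitJob -> EventLoop path described above.
    val n = rdd.count()

    // runJob can also be called directly with a per-partition function;
    // here we count elements per partition ourselves.
    val perPartition: Array[Int] = sc.runJob(rdd, (it: Iterator[Int]) => it.size)
    println(s"count=$n, perPartition=${perPartition.mkString(",")}")
    sc.stop()
  }
}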
Next, the job-submission method itself: submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties).
Main purpose: submits the job to the job scheduler and returns a JobWaiter object; the JobWaiter can be used to block until the job finishes executing, or to cancel the job (original scaladoc: "The JobWaiter object can be used to block until the job finishes executing or can be used to cancel the job.").
Why the state is JobSucceeded when partitions.size == 0:

def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Make sure the job's tasks are only submitted to partitions that actually exist:
  // get the number of partitions of the target RDD
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Nothing to compute (zero-partition RDD, or an empty partition list):
    // finish the submission immediately with state JobSucceeded
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post the job to the eventProcessLoop queue. DAGScheduler internally creates a
  // DAGSchedulerEventProcessLoop (extending EventLoop), which posts submitted jobs
  // to a queue and runs a worker thread that keeps pulling jobs off it for execution.
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties))
  waiter
}
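The two guards above (the bounds check and the zero-partition early return) can be isolated into a self-contained sketch; SubmitGuardSketch and validateAndCheckEmpty are hypothetical names introduced only for illustration:

// Hypothetical stand-in for the two guards in submitJob: validate the requested
// partition ids against the RDD's real partition count, then short-circuit when
// there is nothing to compute.
object SubmitGuardSketch {
  def validateAndCheckEmpty(maxPartitions: Int, partitions: Seq[Int]): Boolean = {
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        s"Attempting to access a non-existent partition: $p. " +
        s"Total number of partitions: $maxPartitions")
    }
    // true means "zero-task job": the caller should return a JobWaiter with
    // totalTasks = 0, which is born finished with JobSucceeded
    partitions.isEmpty
  }

  def main(args: Array[String]): Unit = {
    println(validateAndCheckEmpty(4, Seq.empty))    // true  -> immediate JobSucceeded
    println(validateAndCheckEmpty(4, Seq(0, 1, 2))) // false -> job actually submitted
    // validateAndCheckEmpty(4, Seq(7))             // would throw IllegalArgumentException
  }
}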
Looking at the JobWaiter class confirms this: for a zero-task job, jobResult is set directly to JobSucceeded. The early return in submitJob:

if (partitions.size == 0) {
  // Nothing to compute: finish the submission immediately with state JobSucceeded
  return new JobWaiter[U](this, jobId, 0, resultHandler)
}
And JobWaiter itself:

private[spark] class JobWaiter[T](
    dagScheduler: DAGScheduler,
    val jobId: Int,
    totalTasks: Int,
    resultHandler: (Int, T) => Unit)
  extends JobListener {

  private var finishedTasks = 0

  // Is the job as a whole finished (succeeded or failed)?
  @volatile
  private var _jobFinished = totalTasks == 0

  def jobFinished = _jobFinished

  // If the job is finished, this will be its result. In the case of 0 task jobs (e.g. zero
  // partition RDDs), we set the jobResult directly to JobSucceeded.
  private var jobResult: JobResult = if (jobFinished) JobSucceeded else null
  // ... (rest of JobWaiter omitted)
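The waiting behaviour is essentially a monitor-style latch. A minimal sketch of the idea (a hypothetical SimpleJobWaiter, not Spark's class) shows why totalTasks == 0 means the waiter is finished from birth:

// Hypothetical, simplified latch in the spirit of JobWaiter: awaitResult()
// blocks until every task has reported in; with totalTasks == 0 it never blocks.
class SimpleJobWaiter[T](totalTasks: Int, resultHandler: (Int, T) => Unit) {
  private var finishedTasks = 0
  private var jobFinished = totalTasks == 0 // a zero-task job is finished at birth
  private var succeeded = jobFinished

  def taskSucceeded(index: Int, result: T): Unit = synchronized {
    resultHandler(index, result)
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      jobFinished = true
      succeeded = true
      notifyAll() // wake up awaitResult()
    }
  }

  def awaitResult(): Boolean = synchronized {
    while (!jobFinished) wait()
    succeeded
  }
}

With totalTasks = 0 the while loop in awaitResult() never runs, which is exactly why submitJob can hand back an already-finished JobWaiter for an empty partition list.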
The DAGSchedulerEventProcessLoop created inside DAGScheduler extends EventLoop[DAGSchedulerEvent]. The main job of EventLoop (the event loop): it receives the job events that the DAGScheduler posts into its internal queue, and starts a single event thread to execute all of these events.
Note: this event queue grows without bound, so subclasses must handle each event promptly inside the onReceive method, to avoid an OutOfMemoryError.
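A minimal sketch of this pattern (a hypothetical MiniEventLoop; Spark's real EventLoop additionally handles errors and orderly shutdown): a daemon thread drains an unbounded blocking queue and hands each event to onReceive.

import java.util.concurrent.LinkedBlockingDeque

// Hypothetical, stripped-down event loop: post() enqueues an event, and a daemon
// thread dequeues events one by one and dispatches them to the subclass's
// onReceive. The queue is unbounded, so a slow onReceive lets it grow until the
// JVM runs out of memory -- hence the warning above.
abstract class MiniEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          onReceive(eventQueue.take()) // take() blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // stop() interrupts a blocked take()
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  protected def onReceive(event: E): Unit
}

// Usage: a subclass only needs to implement onReceive.
// val loop = new MiniEventLoop[String]("demo-loop") {
//   override protected def onReceive(event: String): Unit = println("handled " + event)
// }
// loop.start(); loop.post("job-1")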
Summary: the job is posted to eventProcessLoop; in its parent class EventLoop a daemon thread is started that takes events from the eventQueue and hands each one to onReceive(event) in the subclass eventProcessLoop. Taking handleJobSubmitted as the example, this is where stage division starts.
Each wide (shuffle) dependency marks a stage boundary; see the sketch below, and then handleJobSubmitted itself.
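As a rough illustration, a minimal sketch assuming a live SparkContext named sc; the stage split shown in the comments is what the scheduler would typically produce for this shape of job:

// One shuffle dependency (reduceByKey) splits this job into two stages.
val words = sc.parallelize(Seq("a b", "b c", "a c"))

val counts = words
  .flatMap(_.split(" ")) // narrow dependency  -> same stage
  .map(w => (w, 1))      // narrow dependency  -> same stage
  .reduceByKey(_ + _)    // ShuffleDependency  -> new stage boundary
  .collect()             // action: submits the job to the DAGScheduler

// Stage 0: parallelize + flatMap + map (a ShuffleMapStage writing shuffle output)
// Stage 1: reduceByKey + collect     (the ResultStage reading that output)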
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    allowLocal: Boolean,
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: Stage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // 1. Create the new stage -- see below for how a stage is created
    finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  if (finalStage != null) {
    // 2. Track the active job's information in the DAGScheduler
    val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
      job.jobId, callSite.shortForm, partitions.length, allowLocal))
    logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))
    val shouldRunLocally =
      localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
    val jobSubmissionTime = clock.getTimeMillis()
    if (shouldRunLocally) {
      // Compute very short actions like first() or take() with no parent stages locally.
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, Seq.empty, properties))
      runLocally(job)
    } else {
      jobIdToActiveJob(jobId) = job
      activeJobs += job
      finalStage.resultOfJob = Some(job)
      val stageIds = jobIdToStageIds(jobId).toArray
      val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
      /**
       * 3. Post a listener event to the listener bus's queue.
       * The listener bus asynchronously passes events to registered listeners.
       * Until `start()` is called, all posted events are only buffered; only after the
       * bus has started are events actually propagated to the attached listeners. The
       * bus is stopped by calling `stop()`, and it drops further events after stopping.
       *
       * The default event capacity is 10000; if events are added faster than they are
       * drained, an explicit error is reported.
       * See start(): events are taken from the queue by postToAll(event), which must run
       * in a single thread so that events are delivered to the listeners in order.
       */
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
      submitStage(finalStage)
    }
  }
  submitWaitingStages()
}
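The asynchronous, bounded listener bus described in comment 3 can be sketched as follows; MiniListenerBus and its members are hypothetical names, and Spark's real LiveListenerBus additionally reports which events were dropped:

import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch of an asynchronous listener bus with a bounded buffer:
// post() never blocks the caller; a single drain thread calls postToAll so that
// listeners see events in submission order. When the buffer is full, events drop.
class MiniListenerBus[E](capacity: Int = 10000) {
  private val queue = new LinkedBlockingQueue[E](capacity)
  private val started = new AtomicBoolean(false)
  @volatile private var listeners = List.empty[E => Unit]

  private val drainThread = new Thread("mini-listener-bus") {
    setDaemon(true)
    override def run(): Unit =
      try { while (true) postToAll(queue.take()) }
      catch { case _: InterruptedException => }
  }

  def addListener(l: E => Unit): Unit = listeners ::= l

  def start(): Unit = if (started.compareAndSet(false, true)) drainThread.start()

  def post(event: E): Unit = {
    // offer() is non-blocking: it returns false instead of waiting when full
    if (!queue.offer(event)) System.err.println("Dropping event " + event)
  }

  private def postToAll(event: E): Unit = listeners.foreach(_.apply(event))
}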
1. How a new stage is created. A Stage is created either directly as the result stage (for the final stage of the job), or as part of creating a shuffle map stage, which should always go through newOrUsedStage rather than newStage directly (newStage itself mainly creates the finalStage). The stage is associated with its job through the jobId.
1-1. getParentStages(rdd, jobId)

private def newStage(
    rdd: RDD[_],
    numTasks: Int,
    shuffleDep: Option[ShuffleDependency[_, _, _]],
    jobId: Int,
    callSite: CallSite): Stage = {
  // 1-1. Get the parent stages
  val parentStages = getParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
getParentStages obtains the parent stages of the current stage and registers the stage with the given job. During the traversal a stack is maintained manually, to avoid the StackOverflowError that deep recursion could otherwise cause.
private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
  val parents = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(r: RDD[_]) {
    if (!visited(r)) {
      visited += r
      // Kind of ugly: need to register RDDs with the cache here since
      // we can't do it in its constructor because the # of partitions is unknown
      for (dep <- r.dependencies) {
        dep match {
          case shufDep: ShuffleDependency[_, _, _] =>
            // a shuffle dependency ends the stage: its map side becomes a parent stage
            parents += getShuffleMapStage(shufDep, jobId)
          case _ =>
            // a narrow dependency stays in the same stage: keep walking up the lineage
            waitingForVisit.push(dep.rdd)
        }
      }
    }
  }
  waitingForVisit.push(rdd)
  while (!waitingForVisit.isEmpty) {
    visit(waitingForVisit.pop())
  }
  parents.toList // return the parent stages as a list
}
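The same iterative traversal can be demonstrated standalone. A minimal sketch using a hypothetical Node/Dep graph, where wide = true stands in for a ShuffleDependency:

// Hypothetical DAG node: each dependency is either wide (shuffle) or narrow.
final class Node(val name: String, val deps: List[Dep] = Nil)
final case class Dep(child: Node, wide: Boolean)

object ParentSearchSketch {
  // Walk the lineage iteratively with an explicit stack (no recursion, so no
  // StackOverflowError on long lineages); stop at wide deps, which would each
  // become a parent stage, and keep walking through narrow deps.
  def wideFrontier(root: Node): List[Node] = {
    var parents = List.empty[Node]
    var visited = Set.empty[Node]
    var stack = List(root) // manual stack instead of the call stack
    while (stack.nonEmpty) {
      val r = stack.head
      stack = stack.tail
      if (!visited(r)) {
        visited += r
        r.deps.foreach {
          case Dep(child, true)  => parents ::= child // wide: stage boundary
          case Dep(child, false) => stack ::= child   // narrow: same stage
        }
      }
    }
    parents
  }

  def main(args: Array[String]): Unit = {
    val a = new Node("a")
    val b = new Node("b", List(Dep(a, wide = true)))  // shuffle between a and b
    val c = new Node("c", List(Dep(b, wide = false))) // narrow map on top of b
    println(wideFrontier(c).map(_.name))              // List(a): a heads the parent stage
  }
}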
Now look at the Stage.scala source. In brief: a stage is one unit of execution of a Spark job, a set of independent tasks that all compute the same function, and all tasks in a stage share the same shuffle dependencies. Each DAG of tasks run by the scheduler is split into stages at the points where a shuffle occurs, and the DAGScheduler then executes the stages in topological order. Stages come in two kinds:
ShuffleMapStage: its output is the input for downstream stages; the scheduler needs to track the shuffle output of every partition on every node.
ResultStage: corresponds to an action; it is the final stage of the job it initiates.
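Schematically, the split can be pictured with hypothetical types (the Spark version quoted above actually models both flavours with the single Stage class, using an Option[ShuffleDependency] to tell them apart; later versions introduce real subclasses):

// Hypothetical illustration of the two stage flavours.
sealed trait MiniStage {
  def id: Int
  def parents: List[MiniStage]
}

// Writes shuffle output consumed by downstream stages; the scheduler must
// track which map outputs are available on which nodes.
final case class MiniShuffleMapStage(id: Int, parents: List[MiniStage]) extends MiniStage

// The final stage of a job: computes the action's result directly.
final case class MiniResultStage(id: Int, parents: List[MiniStage]) extends MiniStage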
Each stage is bound to a jobId, identifying the job that first submitted the stage. With FIFO scheduling, this allows stages from earlier jobs to be computed first, or recovered faster after a failure.