Spark Source Code Analysis, Task Submission (Part 1): The RDD#collect Method

This article walks through the job submission flow triggered by calling RDD's collect method. RDD#collect calls SparkContext's runJob method, which passes through several overloaded runJob methods, then reaches DAGScheduler's runJob method, and finally DAGScheduler's submitJob method. The call chain is listed below, followed by a simplified sketch of the entry points.

RDD#collect()

SparkContext#runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U)

SparkContext#runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U, partitions: Seq[Int], allowLocal: Boolean)

SparkContext#runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean)

SparkContext#runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit)

DAGScheduler#runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties)

DAGScheduler#submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties)
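To make the chain concrete, here is a simplified sketch of its two ends: RDD#collect and the final SparkContext#runJob overload (condensed from the Spark 1.x source discussed here, not the verbatim code). collect materializes every partition into an array and concatenates the per-partition results on the driver, while the final runJob overload cleans the closure and hands off to DAGScheduler#runJob.

// Simplified sketch (Spark 1.x era, allowLocal still present); not the verbatim source.
def collect(): Array[T] = {
  // Run a job whose per-partition function materializes the partition into an array,
  // then concatenate the per-partition arrays on the driver.
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

// Final SparkContext#runJob overload: clean the closure and delegate to the DAGScheduler.
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit): Unit = {
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  rdd.doCheckpoint()
}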

The submitJob method submits a Job to the job scheduler. It first obtains the Job's maximum partition count, to check to make sure we are not launching a task on a partition that does not exist. It then generates the jobId for the current Job and creates a JobWaiter, an object that waits for a DAGScheduler job to complete; as tasks finish, it passes their results to the given handler function. The JobWaiter code is shown in Table 1, where resultHandler is the function passed down from runJob; it closes over an array (results) that holds the task results. Next, submitJob posts a JobSubmitted event to eventProcessLoop. eventProcessLoop is a DAGSchedulerEventProcessLoop, which extends EventLoop. EventLoop receives events and processes them on a dedicated event thread: it keeps a LinkedBlockingDeque-backed eventQueue of pending events and an eventThread that takes events off the queue and handles each one by calling onReceive, which DAGSchedulerEventProcessLoop implements. The implementations of EventLoop and DAGSchedulerEventProcessLoop#onReceive() are shown in Table 2 and Table 3. Finally, submitJob returns the JobWaiter. A lightly condensed sketch of submitJob follows.
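Before looking at the JobWaiter code in Table 1, here is a lightly condensed sketch of submitJob itself, following the steps just described (condensed from the Spark 1.x source; logging omitted).

def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
  }

  // Generate the jobId for the current Job.
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // A job with zero partitions is immediately finished (see JobWaiter below).
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  // Create the JobWaiter and post a JobSubmitted event to the event loop.
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties))
  waiter
}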

private[spark] class JobWaiter[T](
    dagScheduler: DAGScheduler,
    val jobId: Int,
    totalTasks: Int,
    resultHandler: (Int, T) => Unit)
  extends JobListener {

  private var finishedTasks = 0

  // Is the job as a whole finished (succeeded or failed)?
  @volatile
  private var _jobFinished = totalTasks == 0

  def jobFinished = _jobFinished

  // If the job is finished, this will be its result. In the case of 0 task jobs (e.g. zero
  // partition RDDs), we set the jobResult directly to JobSucceeded.
  private var jobResult: JobResult = if (jobFinished) JobSucceeded else null

  def cancel() {
    dagScheduler.cancelJob(jobId)
  }

  // A task succeeded: inside the synchronized block, resultHandler processes the result.
  // resultHandler is a closure over an array-typed results that stores the task results.
  // When the last task finishes, notifyAll() wakes up the wait() inside awaitResult().
  override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException("taskSucceeded() called on a finished JobWaiter")
    }
    resultHandler(index, result.asInstanceOf[T])
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      _jobFinished = true
      jobResult = JobSucceeded
      this.notifyAll()
    }
  }

  // The job failed: inside the synchronized block, notifyAll() wakes up the wait() in awaitResult().
  override def jobFailed(exception: Exception): Unit = synchronized {
    _jobFinished = true
    jobResult = JobFailed(exception)
    this.notifyAll()
  }

  // If the job has not finished yet, wait() inside the synchronized block until it does.
  def awaitResult(): JobResult = synchronized {
    while (!_jobFinished) {
      this.wait()
    }
    return jobResult
  }
}

Table 1: Implementation of JobWaiter
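For context, this is roughly how DAGScheduler#runJob uses the JobWaiter returned by submitJob: it blocks on awaitResult() and either returns normally or rethrows the job's failure (a condensed sketch; timing and log messages omitted).

def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
  // Block the calling thread (the one that called collect) until the job finishes.
  waiter.awaitResult() match {
    case JobSucceeded         => logInfo("Job " + waiter.jobId + " finished")
    case JobFailed(exception) => throw exception  // propagate the failure back to collect()
  }
}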

 

private[spark] abstract class EventLoop[E](name: String) extends Logging {

  // A blocking queue from java.util.concurrent used to buffer posted events.
  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  // Dedicated daemon thread that drains the queue and dispatches each event to onReceive.
  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }

  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      eventThread.interrupt()
      var onStopCalled = false
      try {
        eventThread.join()
        // Call onStop after the event thread exits to make sure onReceive happens before onStop
        onStopCalled = true
        onStop()
      } catch {
        case ie: InterruptedException =>
          Thread.currentThread().interrupt()
          if (!onStopCalled) {
            // ie is thrown from `eventThread.join()`. Otherwise, we should not call `onStop` since
            // it's already called.
            onStop()
          }
      }
    } else {
      // Keep quiet to allow calling `stop` multiple times.
    }
  }

  def post(event: E): Unit = {
    eventQueue.put(event)
  }

  def isActive: Boolean = eventThread.isAlive

  protected def onStart(): Unit = {}

  protected def onStop(): Unit = {}

  protected def onReceive(event: E): Unit

  protected def onError(e: Throwable): Unit

}

 

Table 2: Implementation of EventLoop
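The pattern is easiest to see with a hypothetical, minimal subclass (ignoring the private[spark] visibility): events can be posted from any thread, but they are handled one at a time, in order, on the dedicated event thread. DAGSchedulerEventProcessLoop does exactly this, with DAGSchedulerEvent in place of String.

// Hypothetical example, only to illustrate the EventLoop contract.
class LoggingEventLoop extends EventLoop[String]("logging-event-loop") {
  override def onReceive(event: String): Unit = println("handling: " + event)
  override def onError(e: Throwable): Unit = println("error: " + e.getMessage)
}

val loop = new LoggingEventLoop
loop.start()        // calls onStart(), then starts the daemon event thread
loop.post("job-1")  // enqueued; handled later by the event thread via onReceive
loop.post("job-2")
loop.stop()         // interrupts and joins the event thread, then calls onStop()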

override def onReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,
      listener, properties)

  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)

  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)

  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)

  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()

  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)

  case ExecutorLost(execId) =>
    dagScheduler.handleExecutorLost(execId, fetchFailed = false)

  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)

  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)

  case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
    dagScheduler.handleTaskCompletion(completion)

  case TaskSetFailed(taskSet, reason) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason)

  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}

Table 3: Implementation of DAGSchedulerEventProcessLoop#onReceive()

 

When the eventThread inside DAGSchedulerEventProcessLoop processes the JobSubmitted event, it calls DAGScheduler's handleJobSubmitted method. handleJobSubmitted first creates the finalStage, splitting the DAG into stages along the way. It then creates an ActiveJob and updates the job bookkeeping, posts a SparkListenerJobStart event to listenerBus, and submits the finalStage. Finally it submits any waiting stages. A condensed sketch of these steps is shown below.
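The following sketch condenses handleJobSubmitted along the steps above (error handling, the allowLocal local-execution path and logging omitted; exact helper names such as newStage/newResultStage vary slightly across Spark versions).

private[scheduler] def handleJobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    allowLocal: Boolean,
    callSite: CallSite,
    listener: JobListener,
    properties: Properties): Unit = {
  // 1. Build the final stage; creating it recursively registers its parent
  //    shuffle-map stages, i.e. this is where the DAG is split into stages.
  val finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)

  // 2. Create the ActiveJob and update the job bookkeeping structures.
  val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.resultOfJob = Some(job)

  // 3. Tell listeners (web UI, metrics) that the job has started.
  listenerBus.post(SparkListenerJobStart(job.jobId, clock.getTimeMillis(), Seq.empty, properties))

  // 4. Submit the final stage; any missing parent stages are submitted first.
  submitStage(finalStage)

  // 5. Re-check stages that were waiting for their parents to complete.
  submitWaitingStages()
}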
