When an action is triggered on an RDD, taking count as the example, the call chain is as follows:
1. org.apache.spark.rdd.RDD#count
2. org.apache.spark.SparkContext#runJob
3. org.apache.spark.scheduler.DAGScheduler#runJob
4. org.apache.spark.scheduler.DAGScheduler#submitJob
5. org.apache.spark.scheduler.DAGSchedulerEventProcessActor#receive(JobSubmitted)
6. org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted
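To see what this chain accomplishes end to end, here is a dependency-free sketch: count asks runJob to apply a per-partition function (here, counting the elements of each partition's iterator) and then sums the partial results on the driver. The names `CountSketch`, `Partitions`, `runJob`, and `count` below are illustrative stand-ins, not Spark's actual implementation.

```scala
object CountSketch {
  // Stand-in for an RDD's partitions: each inner Seq is one partition's data.
  type Partitions = Seq[Seq[Int]]

  // Simplified SparkContext#runJob: run `func` over every partition,
  // collecting one result per partition.
  def runJob[T](partitions: Partitions, func: Iterator[Int] => T): Seq[T] =
    partitions.map(p => func(p.iterator))

  // Simplified RDD#count: per-partition element counts, summed on the driver.
  def count(partitions: Partitions): Long =
    runJob(partitions, (it: Iterator[Int]) => it.size.toLong).sum

  def main(args: Array[String]): Unit = {
    val parts: Partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq.empty)
    println(count(parts)) // prints 5
  }
}
```

In real Spark the per-partition function runs on executors, and the submission crosses the actor boundary shown in step 5 before any tasks are scheduled.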
The DAGSchedulerEventProcessActor in step 5 is the DAGScheduler's proxy interface to the outside world: when a DAGScheduler is created, it creates an actor named eventProcessActor. The actor's role is obvious from its implementation:
```scala
/**
 * The main event loop of the DAG scheduler.
 */
def receive = {
  case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>
    // Submit a job: the message travels RDD -> SparkContext -> DAGScheduler.
    // Relaying it through this actor keeps each module's responsibilities consistent.
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,
      listener, properties)
  case StageCancelled(stageId) =>
    // Sent from org.apache.spark.ui.jobs.JobProgressTab, which displays a SparkContext's
    // job status in the web UI. A user can cancel a Stage there; the request reaches
    // this point via SparkContext -> DAGScheduler.
    dagScheduler.handleStageCancellation(stageId)
  case JobCancelled(jobId) =>
    // Sent from org.apache.spark.scheduler.JobWaiter: cancel a single job.
    dagScheduler.handleJobCancellation(jobId)
  case JobGroupCancelled(groupId) => // cancel an entire job group
    dagScheduler.handleJobGroupCancelled(groupId)
  case AllJobsCancelled => // cancel all jobs
    dagScheduler.doCancelAllJobs()
  case ExecutorAdded(execId, host) =>
    // The TaskScheduler learned that an executor was added; sent from
    // org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers.
    dagScheduler.handleExecutorAdded(execId, host)
```
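The dispatch style above is plain actor-pattern routing: each event is a case class, and a receive-style pattern match forwards it to the matching handler method. A minimal, self-contained sketch of that pattern, with illustrative event and handler names rather than Spark's actual classes:

```scala
import scala.collection.mutable

// Events modeled as an exhaustive sealed hierarchy, as the real
// DAGSchedulerEvent types are.
sealed trait SchedulerEvent
case class JobSubmitted(jobId: Int) extends SchedulerEvent
case class JobCancelled(jobId: Int) extends SchedulerEvent
case object AllJobsCancelled extends SchedulerEvent

object EventLoopSketch {
  // Records which handler each event was routed to (stand-in for real side effects).
  val log: mutable.Buffer[String] = mutable.Buffer[String]()

  // The receive-style dispatcher: one pattern per event, one handler per pattern.
  def receive(event: SchedulerEvent): Unit = event match {
    case JobSubmitted(id) => log += s"handleJobSubmitted($id)"
    case JobCancelled(id) => log += s"handleJobCancellation($id)"
    case AllJobsCancelled => log += "doCancelAllJobs()"
  }

  def main(args: Array[String]): Unit = {
    Seq(JobSubmitted(0), JobCancelled(0), AllJobsCancelled).foreach(receive)
    log.foreach(println)
  }
}
```

The sealed trait lets the compiler verify the match is exhaustive, so adding a new event type without a corresponding handler becomes a compile-time warning rather than a silent drop.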