In the previous articles we dug into the TaskScheduler: how tasks are submitted and managed, how resources are handled, and so on, building up a fairly systematic picture of it.
This time we return to SparkContext and look at another important Spark component: the DAGScheduler.
Its definition:
@volatile private var _dagScheduler: DAGScheduler = _
Note the `@volatile` modifier. It does not mean the field is of some "mutable/unstable type"; it tells the JVM that a write to this reference by one thread must be immediately visible to reads from other threads, so the DAGScheduler instance published during SparkContext initialization is safely seen across threads.
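As a side note, here is a minimal, Spark-free sketch of what `@volatile` buys us on the JVM. The object and field names are invented for illustration:

```scala
// Sketch: @volatile inserts memory barriers so that a write performed by
// one thread is guaranteed to be visible to subsequent reads from other
// threads. Without it, the reader thread below could legally spin forever
// on a stale cached value of `ready`.
object VolatileDemo {
  @volatile private var ready: Boolean = false

  def main(args: Array[String]): Unit = {
    val reader = new Thread(() => {
      while (!ready) {}                 // busy-wait until the write becomes visible
      println("writer's update observed")
    })
    reader.start()
    ready = true                        // volatile write: published to all threads
    reader.join()                       // terminates because the reader saw the write
  }
}
```

SparkContext uses the same mechanism for `_dagScheduler`: the field is assigned on one thread during initialization but may be read from others.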
1. Structure and functionality overview
Let us first look at the data structures and methods it contains.
The class is quite complex, but it is easy to see that most of its members relate to handling Jobs and Stages.
Next, look at the class comment.
It clearly explains the DAGScheduler's responsibilities, how stages are carved out of the RDD graph, key Spark concepts such as Job and Stage, and how failures are handled:
/**
* The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
* tasks that can run right away based on the data that's already on the cluster (e.g. map output
* files from previous stages), though it may fail if this data becomes unavailable.
*In other words: the DAGScheduler is the high-level scheduling layer and implements stage-oriented scheduling. It computes a DAG of stages for each Job, tracks which RDD and stage outputs have been materialized, and finds a minimal schedule for running the job. It then submits the stages to the lower-level TaskScheduler as TaskSets. A TaskSet contains fully independent tasks that can run immediately on data already present in the cluster, although the TaskSet may fail if that data becomes unavailable.
* Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with
* "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks
* in each stage, but operations with shuffle dependencies require multiple stages (one to write a
* set of map output files, and another to read those files after a barrier). In the end, every
* stage will have only shuffle dependencies on other stages, and may compute multiple operations
* inside it. The actual pipelining of these operations happens in the RDD.compute() functions of
* various RDDs
*In other words: Spark creates stages by breaking the RDD graph at shuffle boundaries. Operations with narrow dependencies, such as map() and filter(), are pipelined together into one set of tasks inside a single stage, while shuffle dependencies require multiple stages (one to write the map output files, another to read them after a barrier). In the end, each stage depends on other stages only through shuffle dependencies, and may compute many operations internally; the actual pipelining of those operations happens in the RDD.compute() functions of the various RDDs.
* In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
* locations to run each task on, based on the current cache status, and passes these to the
* low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
* lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
* not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
* a small number of times before cancelling the whole stage.
*In other words: besides building the DAG of stages, the DAGScheduler determines the preferred location for running each task, based on the current cache status, and passes this information to the low-level TaskScheduler. It also handles failures caused by lost shuffle output files, in which case old stages may need to be resubmitted. Failures *within* a stage that are not caused by shuffle-file loss are handled by the TaskScheduler, which retries each task a few times before cancelling the whole stage.
* When looking through this code, there are several key concepts:
*
* - Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler.
* For example, when the user calls an action, like count(), a job will be submitted through
* submitJob. Each Job may require the execution of multiple stages to build intermediate data.
* In other words: Jobs are the top-level work items submitted to the scheduler. For example, when the user calls an action such as count(), a job is submitted through submitJob; each job may need to execute multiple stages to build its intermediate data.
* - Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each
* task computes the same function on partitions of the same RDD. Stages are separated at shuffle
* boundaries, which introduce a barrier (where we must wait for the previous stage to finish to
* fetch outputs). There are two types of stages: [[ResultStage]], for the final stage that
* executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
* Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
*
In other words: stages are sets of tasks that compute a job's intermediate results, where every task computes the same function on a different partition of the same RDD. Stages are split at shuffle boundaries, which introduce a barrier (we must wait for the previous stage to finish before fetching its outputs). There are two kinds of stage: a ResultStage, the final stage that executes an action, and a ShuffleMapStage, which writes the map output files for a shuffle. If several jobs reuse the same RDDs, they share the corresponding stages.
* - Tasks are individual units of work, each sent to one machine.
*
* - Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them
* and likewise remembers which shuffle map stages have already produced output files to avoid
* redoing the map side of a shuffle.
*
* - Preferred locations: the DAGScheduler also computes where to run each task in a stage based
* on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.
*
* - Cleanup: all data structures are cleared when the running jobs that depend on them finish,
* to prevent memory leaks in a long-running application.
*
* To recover from failures, the same stage might need to run multiple times, which are called
* "attempts". If the TaskScheduler reports that a task failed because a map output file from a
* previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a
* CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small
* amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost
* stage(s) that compute the missing tasks. As part of this process, we might also have to create
* Stage objects for old (finished) stages where we previously cleaned up the Stage object. Since
* tasks from the old attempt of a stage could still be running, care must be taken to map any
* events received in the correct Stage object.
*
In other words: to recover from failures, the same stage may need to run multiple times; each run is called an "attempt". If the TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits the lost stage. This is detected through a CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler waits a short time to see whether other nodes or tasks fail as well, then resubmits TaskSets for every lost stage to recompute the missing tasks. As part of this, Stage objects may have to be recreated for old, already-finished stages whose objects were previously cleaned up. And since tasks from an old attempt of a stage may still be running, care must be taken to map every received event to the correct Stage object.
* Here's a checklist to use when making or reviewing changes to this class:
*
* - All data structures should be cleared when the jobs involving them end to avoid indefinite
* accumulation of state in long-running programs.
*
* - When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to
* include the new structure. This will help to catch memory leaks.
*/
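To make the stage-splitting rules from the comment concrete, here is a hedged sketch of a classic word count. The input path is made up, and the snippet assumes a local Spark installation; it is an illustration, not code from the Spark source:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; "input.txt" is a placeholder path.
val sc = new SparkContext(
  new SparkConf().setAppName("wc-sketch").setMaster("local[*]"))

val counts = sc.textFile("input.txt")   // narrow deps below pipeline into one stage
  .flatMap(_.split(" "))                // narrow: stays in the same stage
  .map(w => (w, 1))                     // narrow: stays in the same stage
  .reduceByKey(_ + _)                   // shuffle dependency: closes the ShuffleMapStage

counts.collect()                        // action: submits a job; the post-shuffle
                                        // work runs in the final ResultStage
sc.stop()
```

Everything up to `reduceByKey` is pipelined into a single ShuffleMapStage that writes map output files; the read side after the shuffle barrier becomes the ResultStage that the `collect()` action executes.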
This very long comment gives us the DAGScheduler's main responsibilities. We will therefore study the DAGScheduler from two angles:
1. Job scheduling and management
2. Stage scheduling and management
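As a preview of the job-scheduling side: an action such as count() ends up in DAGScheduler.submitJob, which does not schedule synchronously but posts a JobSubmitted event onto an internal event loop and hands the caller a waiter to block on. Below is a minimal, Spark-free model of that post-and-wait pattern; all names here (EventLoopSketch, JobSubmitted, etc.) are invented for illustration and are not Spark's actual API:

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}

// Toy model of the DAGScheduler's event-loop pattern: callers post a
// JobSubmitted event and wait; a single scheduler thread consumes events
// one at a time and "completes" each job. Invented names, not Spark code.
sealed trait SchedulerEvent
final case class JobSubmitted(jobId: Int, done: CountDownLatch) extends SchedulerEvent

object EventLoopSketch {
  private val queue = new LinkedBlockingQueue[SchedulerEvent]()

  private val loop = new Thread(() => {
    while (true) {
      queue.take() match {
        case JobSubmitted(id, done) =>
          println(s"handling job $id")  // the real code would build stages here
          done.countDown()              // signal the waiting caller
      }
    }
  })
  loop.setDaemon(true)

  def submitJob(jobId: Int): CountDownLatch = {
    val done = new CountDownLatch(1)    // plays the role of Spark's JobWaiter
    queue.put(JobSubmitted(jobId, done)) // asynchronous hand-off to the loop thread
    done
  }

  def main(args: Array[String]): Unit = {
    loop.start()
    submitJob(1).await()                // block until the "job" finishes
    println("job 1 finished")
  }
}
```

Serializing all scheduling decisions through one event-loop thread is what lets the real DAGScheduler mutate its many bookkeeping structures without locks; we will see the actual event types when we walk through the Job-scheduling path.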