In the previous articles we dug into the TaskScheduler: how tasks are submitted and managed, how resources are handled, and so on, building up a fairly systematic picture of it.
This time we return to SparkContext and look at another important Spark component: the DAGScheduler.
Its definition:
@volatile private var _dagScheduler: DAGScheduler = _
Note the `@volatile` modifier. It does not mean the field is of some "mutable/unstable type"; it tells the JVM that a write to this reference by one thread must be immediately visible to reads from other threads, so the DAGScheduler instance published during SparkContext initialization is safely seen across threads.
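As a side note, here is a minimal, Spark-free sketch of what `@volatile` buys us on the JVM. The object and field names are invented for illustration:

```scala
// Sketch: @volatile inserts memory barriers so that a write performed by
// one thread is guaranteed to be visible to subsequent reads from other
// threads. Without it, the reader thread below could legally spin forever
// on a stale cached value of `ready`.
object VolatileDemo {
  @volatile private var ready: Boolean = false

  def main(args: Array[String]): Unit = {
    val reader = new Thread(() => {
      while (!ready) {}                 // busy-wait until the write becomes visible
      println("writer's update observed")
    })
    reader.start()
    ready = true                        // volatile write: published to all threads
    reader.join()                       // terminates because the reader saw the write
  }
}
```

SparkContext uses the same mechanism for `_dagScheduler`: the field is assigned on one thread during initialization but may be read from others.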
1. Structure and functionality overview
Let us first look at the data structures and methods it contains.
The class is quite complex, but it is easy to see that most of its members relate to handling Jobs and Stages.
Next, look at the class comment.
It clearly explains the DAGScheduler's responsibilities, how stages are carved out of the RDD graph, key Spark concepts such as Job and Stage, and how failures are handled:
/**
* The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
* tasks that can run right away based on the data that's already on the cluster (e.g. map output
* files from previous stages), though it may fail if this data becomes unavailable.
*In other words: the DAGScheduler is the high-level scheduling layer and implements stage-oriented scheduling. It computes a DAG of stages for each Job, tracks which RDD and stage outputs have been materialized, and finds a minimal schedule for running the job. It then submits the stages to the lower-level TaskScheduler as TaskSets. A TaskSet contains fully independent tasks that can run immediately on data already present in the cluster, although the TaskSet may fail if that data becomes unavailable.
* Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with
* "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks
* in each stage, but operations with shuffle dependencies require multiple stages (one to write a
* set of map output files, and another to read those files after a barrier). In the end, every
* stage will have only shuffle dependencies on other stages, and may compute multiple operations
* inside it. The actual pipelining of these operations happens in the RDD.compute() functions of
* various RDDs
*In other words: Spark creates stages by breaking the RDD graph at shuffle boundaries. Operations with narrow dependencies, such as map() and filter(), are pipelined together into one set of tasks inside a single stage, while shuffle dependencies require multiple stages (one to write the map output files, another to read them after a barrier). In the end, each stage depends on other stages only through shuffle dependencies, and may compute many operations internally; the actual pipelining of those operations happens in the RDD.compute() functions of the various RDDs.
* In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
* locations to run each task on, based on the current cache status, and passes these to the
* low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
* lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
* not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
* a small number of times before cancelling the whole stage.
*In other words: besides building the DAG of stages, the DAGScheduler determines the preferred location for running each task, based on the current cache status, and passes this information to the low-level TaskScheduler. It also handles failures caused by lost shuffle output files, in which case old stages may need to be resubmitted. Failures *within* a stage that are not caused by shuffle-file loss are handled by the TaskScheduler, which retries each task a few times before cancelling the whole stage.
* When looking through this code, there are several key concepts:
*
* - Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler.
* For example, when the user calls an action, like count(), a job will be submitted through
* submitJob. Each Job may require the execution of multiple stages to build intermediate data.
* In other words: Jobs are the top-level work items submitted to the scheduler. For example, when the user calls an action such as count(), a job is submitted through submitJob; each job may need to execute multiple stages to build its intermediate data.
* - Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each
* task computes the same function on partitions of the same RDD. Stages are separated at shuffle
* boundaries, which introduce a barrier (where we must wait for the previous stage to finish to
* fetch outputs). There are two types of stages: [[ResultStage]], for the final stage that
* executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
* Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
*
In other words: stages are sets of tasks that compute a job's intermediate results, where every task computes the same function on a different partition of the same RDD. Stages are split at shuffle boundaries, which introduce a barrier (we must wait for the previous stage to finish before fetching its outputs). There are two kinds of stage: a ResultStage, the final stage that executes an action, and a ShuffleMapStage, which writes the map output files for a shuffle. If several jobs reuse the same RDDs, they share the corresponding stages.
* - Tasks are individual units of work, each sent to one machine.
*
* - Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them
* and likewise remembers which shuffle map stages have already produced output files to avoid
* redoing the map side of a shuffle.
*
* - Preferred locations: the DAGScheduler also computes where to run each task in a stage based
* on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.
*
* - Cleanup: all data structures are cleared when the running jobs that depend on them finish,
* to prevent memory leaks in a long-running application.
*
* To recover from failures, the same stage might need to run multiple times, which are called
* "attempts". If the TaskScheduler reports that a task failed because a map output file from a
* previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a
* CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small
* amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost
* stage(s) that compute the missing tasks. As part of this process, we might also have to create
* Stage objects for old (finished) stages where we previously cleaned up the Stage object. Since
* tasks from the old attempt of a stage could still be running, care must be taken to map any
* events received in the correct Stage object.
*
In other words: to recover from failures, the same stage may need to run multiple times; each run is called an "attempt". If the TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits the lost stage. This is detected through a CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler waits a short time to see whether other nodes or tasks fail as well, then resubmits TaskSets for every lost stage to recompute the missing tasks. As part of this, Stage objects may have to be recreated for old, already-finished stages whose objects were previously cleaned up. And since tasks from an old attempt of a stage may still be running, care must be taken to map every received event to the correct Stage object.
* Here's a checklist to use when making or reviewing changes to this class:
*
* - All data structures should be cleared when the jobs involving them end to avoid indefinite
* accumulation of state in long-running programs.
*
* - When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to
* include the new structure. This will help to catch memory leaks.
*/
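To make the stage-splitting rules from the comment concrete, here is a hedged sketch of a classic word count. The input path is made up, and the snippet assumes a local Spark installation; it is an illustration, not code from the Spark source:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; "input.txt" is a placeholder path.
val sc = new SparkContext(
  new SparkConf().setAppName("wc-sketch").setMaster("local[*]"))

val counts = sc.textFile("input.txt")   // narrow deps below pipeline into one stage
  .flatMap(_.split(" "))                // narrow: stays in the same stage
  .map(w => (w, 1))                     // narrow: stays in the same stage
  .reduceByKey(_ + _)                   // shuffle dependency: closes the ShuffleMapStage

counts.collect()                        // action: submits a job; the post-shuffle
                                        // work runs in the final ResultStage
sc.stop()
```

Everything up to `reduceByKey` is pipelined into a single ShuffleMapStage that writes map output files; the read side after the shuffle barrier becomes the ResultStage that the `collect()` action executes.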
This very long comment gives us the DAGScheduler's main responsibilities. We will therefore study the DAGScheduler from two angles:
1. Job scheduling and management
2. Stage scheduling and management
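As a preview of the job-scheduling side: an action such as count() ends up in DAGScheduler.submitJob, which does not schedule synchronously but posts a JobSubmitted event onto an internal event loop and hands the caller a waiter to block on. Below is a minimal, Spark-free model of that post-and-wait pattern; all names here (EventLoopSketch, JobSubmitted, etc.) are invented for illustration and are not Spark's actual API:

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}

// Toy model of the DAGScheduler's event-loop pattern: callers post a
// JobSubmitted event and wait; a single scheduler thread consumes events
// one at a time and "completes" each job. Invented names, not Spark code.
sealed trait SchedulerEvent
final case class JobSubmitted(jobId: Int, done: CountDownLatch) extends SchedulerEvent

object EventLoopSketch {
  private val queue = new LinkedBlockingQueue[SchedulerEvent]()

  private val loop = new Thread(() => {
    while (true) {
      queue.take() match {
        case JobSubmitted(id, done) =>
          println(s"handling job $id")  // the real code would build stages here
          done.countDown()              // signal the waiting caller
      }
    }
  })
  loop.setDaemon(true)

  def submitJob(jobId: Int): CountDownLatch = {
    val done = new CountDownLatch(1)    // plays the role of Spark's JobWaiter
    queue.put(JobSubmitted(jobId, done)) // asynchronous hand-off to the loop thread
    done
  }

  def main(args: Array[String]): Unit = {
    loop.start()
    submitJob(1).await()                // block until the "job" finishes
    println("job 1 finished")
  }
}
```

Serializing all scheduling decisions through one event-loop thread is what lets the real DAGScheduler mutate its many bookkeeping structures without locks; we will see the actual event types when we walk through the Job-scheduling path.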