A Newbie's Road to Learning the Spark Source Code - 4: DAGScheduler Source - Part 1

In the previous few articles, we took a deep dive into how TaskScheduler submits and manages tasks and handles resources, and built up a fairly systematic understanding of it:

TaskScheduler part 1

TaskScheduler part 2

TaskScheduler part 3

This time, we return to SparkContext and look at another important Spark component: DAGScheduler.

Its definition:

@volatile private var _dagScheduler: DAGScheduler = _

Note that it is marked @volatile: writes to this reference by one thread are guaranteed to be immediately visible to all other threads, which matters because SparkContext may be read and reassigned from multiple threads.
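To make what @volatile buys us concrete, here is a minimal sketch in plain Scala (my own illustration, not Spark code; the names VolatileDemo and ready are made up). Without the annotation, the reader thread below might never observe the writer's update:

object VolatileDemo {
  // @volatile forces cross-thread visibility of writes to `ready`;
  // without it, the JVM may let the reader keep seeing a stale value.
  @volatile private var ready = false

  def main(args: Array[String]): Unit = {
    val reader = new Thread(() => {
      while (!ready) {} // busy-waits until the write becomes visible
      println("observed ready = true")
    })
    reader.start()
    Thread.sleep(100)
    ready = true // guaranteed to become visible to the reader thread
    reader.join()
  }
}

SparkContext needs the same guarantee: _dagScheduler may be set or cleared by one thread (for example during shutdown) while another thread reads it.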

1. Structure and functionality overview

First, let's see which data structures and methods it contains.

It is quite complex, but we can clearly see that most of its members are concerned with handling Jobs and Stages.

Next, let's look at the class comment.

It clearly explains DAGScheduler's responsibilities and the stage-division process, along with key Spark concepts such as Job and Stage. It also explains how failures are handled:

/**
 * The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
 * stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
 * minimal schedule to run the job. It then submits stages as TaskSets to an underlying
 * TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
 * tasks that can run right away based on the data that's already on the cluster (e.g. map output
 * files from previous stages), though it may fail if this data becomes unavailable.
 *
 * Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with
 * "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks
 * in each stage, but operations with shuffle dependencies require multiple stages (one to write a
 * set of map output files, and another to read those files after a barrier). In the end, every
 * stage will have only shuffle dependencies on other stages, and may compute multiple operations
 * inside it. The actual pipelining of these operations happens in the RDD.compute() functions of
 * various RDDs.
 *
 * In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
 * locations to run each task on, based on the current cache status, and passes these to the
 * low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
 * lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
 * not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
 * a small number of times before cancelling the whole stage.
 *
 * When looking through this code, there are several key concepts:
 *
 *  - Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler.
 *    For example, when the user calls an action, like count(), a job will be submitted through
 *    submitJob. Each Job may require the execution of multiple stages to build intermediate data.
 *
 *  - Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each
 *    task computes the same function on partitions of the same RDD. Stages are separated at shuffle
 *    boundaries, which introduce a barrier (where we must wait for the previous stage to finish to
 *    fetch outputs). There are two types of stages: [[ResultStage]], for the final stage that
 *    executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
 *    Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
 *
 *  - Tasks are individual units of work, each sent to one machine.
 *
 *  - Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them
 *    and likewise remembers which shuffle map stages have already produced output files to avoid
 *    redoing the map side of a shuffle.
 *
 *  - Preferred locations: the DAGScheduler also computes where to run each task in a stage based
 *    on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.
 *
 *  - Cleanup: all data structures are cleared when the running jobs that depend on them finish,
 *    to prevent memory leaks in a long-running application.
 *
 * To recover from failures, the same stage might need to run multiple times, which are called
 * "attempts". If the TaskScheduler reports that a task failed because a map output file from a
 * previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a
 * CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small
 * amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost
 * stage(s) that compute the missing tasks. As part of this process, we might also have to create
 * Stage objects for old (finished) stages where we previously cleaned up the Stage object. Since
 * tasks from the old attempt of a stage could still be running, care must be taken to map any
 * events received in the correct Stage object.
 *
 * Here's a checklist to use when making or reviewing changes to this class:
 *
 *  - All data structures should be cleared when the jobs involving them end to avoid indefinite
 *    accumulation of state in long-running programs.
 *
 *  - When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to
 *    include the new structure. This will help to catch memory leaks.
 */
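To see the stage division and cache tracking described in this comment from the user's side, here is a small driver program I sketched against the public RDD API (illustrative only, not code from the Spark repository):

import org.apache.spark.{SparkConf, SparkContext}

object DagSchedulerConceptsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dag-demo").setMaster("local[2]"))

    val counts = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(w => (w, 1))     // narrow dependency: pipelined into the same stage
      .filter(_._1 != "c")  // narrow dependency: still the same stage
      .reduceByKey(_ + _)   // shuffle dependency: a stage boundary goes here
      .cache()

    // The lineage shows a ShuffledRDD; that is exactly where DAGScheduler
    // splits the job into a ShuffleMapStage and a final ResultStage.
    println(counts.toDebugString)

    counts.count() // job 1: runs both stages and materializes the cache
    counts.count() // job 2: cache tracking lets DAGScheduler skip the
                   // earlier work and place tasks near the cached blocks
    sc.stop()
  }
}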

This very long comment gives us DAGScheduler's main responsibilities. I will therefore study DAGScheduler from two angles:

1. Job scheduling and management

2. Stage scheduling and management
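Before diving into job scheduling, it helps to see how a user action becomes a job in the first place: an action like count() is implemented on top of SparkContext.runJob, which hands the job to DAGScheduler (ultimately through its submitJob method, as the comment above notes). The sketch below is again my own illustration, calling the public runJob overload that runs a function over every partition:

import org.apache.spark.{SparkConf, SparkContext}

object JobSubmissionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("job-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // rdd.count() boils down to exactly this: run a function over every
    // partition via SparkContext.runJob, which submits the job to
    // DAGScheduler and blocks until all result tasks have finished.
    val perPartition: Array[Long] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)
    println(s"total = ${perPartition.sum}") // same result as rdd.count()

    sc.stop()
  }
}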
