Please respect the original work; reposting is prohibited!
Spark is currently one of the hottest frameworks in the big data space. It efficiently supports offline batch processing, real-time computation, machine learning and more, and reading its source code will deepen your understanding of the framework.
In this series I will walk through the core components of Spark 2.x one by one, including BlockManager, TaskScheduler and others in later chapters.
I have read a number of online articles introducing the DAGScheduler source code. Some of them are quite good and touch on the key points, but they tend to be one-sided: they skip many implementation details, why things are implemented that way, and how the other components tie in, leaving you knowing the what but not the why.
This DAGScheduler walkthrough covers the underlying data structures, the implementation of each detail, the algorithms and optimizations, the interactions between the components, and corrects some mistaken claims floating around online.
The components covered in this chapter are the ones DAGScheduler mainly interacts with:
1. The most commonly used RDD operators:
① MapPartitionsRDD — the RDD produced by transformation-style operators; it creates a OneToOneDependency. Typical examples: map, filter.
② ShuffledRDD — depends on a single parent RDD and may trigger a shuffle, the operation that hurts cluster performance most; it creates a ShuffleDependency, which splits the current job into stages. Typical examples: groupByKey, reduceByKey.
③ CoGroupedRDD — depends on multiple parent RDDs and may also trigger a shuffle; it creates ShuffleDependencies that split the current job into stages. Typical example: join.
2. The two kinds of Stage:
① ShuffleMapStage — a subclass of Stage that ends right before a shuffle operation and may contain several transformations. When it runs, it saves the map-side output files, which the reduce tasks fetch later.
② ResultStage — the last stage of a job, triggered when an action is executed on an RDD.
3. MapOutputTracker — stores the output metadata of the shuffle phase; the driver (master) and worker subclasses have different implementations.
4. Partitioner — HashPartitioner by default.
5. BlockManager
and so on.
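Before diving into DAGScheduler itself, here is a minimal, self-contained sketch (assuming a local SparkContext; the class names printed via getClass are just for illustration) of where these RDD types show up in an ordinary program:
import org.apache.spark.{SparkConf, SparkContext}
object RddTypeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-types"))
    val pairs   = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1))
    val reduced = pairs.reduceByKey(_ + _)
    println(pairs.getClass.getSimpleName)    // MapPartitionsRDD (OneToOneDependency)
    println(reduced.getClass.getSimpleName)  // ShuffledRDD      (ShuffleDependency -> stage boundary)
    // join goes through cogroup: a CoGroupedRDD wrapped by further MapPartitionsRDDs,
    // all of which toDebugString makes visible along with the shuffle boundaries
    println(reduced.join(reduced).toDebugString)
    sc.stop()
  }
}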
DAGScheduler — one of Spark's most central components. In a nutshell, it is responsible for splitting a job into an optimal set of stages and submitting all the resulting tasks to the TaskScheduler.
Here is its own description from the source:
The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
* tasks that can run right away based on the data that's already on the cluster (e.g. map output
* files from previous stages), though it may fail if this data becomes unavailable.
The DAGScheduler is first created inside SparkContext on the driver (see my earlier SparkContext article for the details), and what triggers it to build the DAG of stages is an action operator.
OK, let's start from count:
count is an action; when it executes, what gets called internally is runJob.
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Step inside:
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
// 如果是停止状态就抛出异常
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
// 进入runJob核心方法
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
// ConsoleProgressBar 控制台输出的job进度条
progressBar.foreach(_.finishAll())
// 最终递归调用doCheckpoint来检查每个父RDD是否需要checkpoint
// checkpoint一般是存储数据到HDFS上,并切掉之前的RDD的lineage
// 以后的RDD若要重用的话都会先检查是否有checkpoint过
rdd.doCheckpoint()
}
This calls dagScheduler.runJob, which submits the job to the DAGScheduler and returns a JobWaiter that blocks waiting for the job to complete:
def runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): Unit = {
val start = System.nanoTime
// 提交job 里面会返回一个阻塞线程JobWaiter等待此Job的完成
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
// 根据job完成情况匹配不同的Log
waiter.completionFuture.value.get match {
case scala.util.Success(_) =>
logInfo("Job %d finished: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
case scala.util.Failure(exception) =>
logInfo("Job %d failed: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
// SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
val callerStackTrace = Thread.currentThread().getStackTrace.tail
exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
throw exception
}
}
submitJob returns the JobWaiter and posts a JobSubmitted event message onto the event loop's queue:
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
// Check to make sure we are not launching a task on a partition that does not exist.
// 检查分区是否存在,保证task正常运行
val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
throw new IllegalArgumentException(
"Attempting to access a non-existent partition: " + p + ". " +
"Total number of partitions: " + maxPartitions)
}
// 为nextJobId增加一个JobId作当前Job的标识(+1)
val jobId = nextJobId.getAndIncrement()
if (partitions.size == 0) {
// Return immediately if the job is running 0 tasks
// 如果没有task就立即返回JobWaiter
return new JobWaiter[U](this, jobId, 0, resultHandler)
}
// 为partitions做断言,确保下分区是否大于0
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
// 首先构造一个JobWaiter阻塞线程 等待job完成 然后把完成结果提交给resultHandler
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
// DAGScheduler的事件队列,结构为LinkedBlockingDeque
// 因为可能集群同时运行着多个Job,而DAGScheduler默认是FIFO先进先出的资源调度
// 这里传入的事件类型为JobSubmitted,而在eventProcessLoop会调用doOnReceive
// 来匹配事件类型并执行对应的操作,最终会匹配到dagScheduler.handleJobSubmitted(....)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
eventProcessLoop extends EventLoop and is dedicated to receiving and handling every event message that callers send during the job and stage phases.
When eventProcessLoop posts an event it actually puts it into a message queue; a single dedicated Java thread keeps taking from that queue (blocking safely while it is empty) and dispatches each event according to the matched event type.
// 专门用来接收Job和Stage阶段中调用者发来的所有事件消息并处理
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {
private[this] val timer = dagScheduler.metricsSource.messageProcessingTimer
/**
* The main event loop of the DAG scheduler.
*/
// 它的单独的Java线程会不停调用这个方法
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event)
} finally {
timerContext.stop()
}
}
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
// 提交job
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
DAGSchedulerEventProcessLoop extends the base class EventLoop; here is how the event thread processes messages:
private[spark] abstract class EventLoop[E](name: String) extends Logging {
private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()
private val stopped = new AtomicBoolean(false)
// 生成的java.lang.Thread线程
// 这个线程会不停的去eventQueue取出event事件消息然后onReceive做对应的
private val eventThread = new Thread(name) {
setDaemon(true)
override def run(): Unit = {
try {
while (!stopped.get) {
// 提取事件队列里的事件信息
val event = eventQueue.take()
try {
// 调用onReceive模式匹配做事件驱动
onReceive(event)
} catch {
case NonFatal(e) =>
try {
onError(e)
} catch {
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
} catch {
case ie: InterruptedException => // exit even if eventQueue is not empty
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
Next, the ResultStage gets created:
// 在eventProcessLoop接受到提交job的事件任务后就会触发,开始划分stage
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
// 创建ResultStage,这里才是真正开始处理提交的job划分stage的时候
// 它会从后往前找递归遍历它的每一个父RDD,从持久化中抽取反之重新计算
// 补充下:stage分为shuffleMapStage和ResultStage两种
// 每个job都是由1个ResultStage和0+个ShuffleMapStage组成
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
// 把createResultStage封装在ActiveJob中,你可以把它看做成Job的代表
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
// 清除每个被持久化的RDD分区的位置
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val jobSubmissionTime = clock.getTimeMillis()
// HashMap结构,维护着jobId和jobIdToActiveJob的映射关系
jobIdToActiveJob(jobId) = job
// HashSet结构,维护着所有ActiveJob
activeJobs += job
// finalStage一旦生成就会把封装自己的ActiveJob注册到自己的_activeJob上
// 而整个Job结束后就会清除掉
finalStage.setActiveJob(job)
// 提取出jobId对应的所有StageIds并转换成数组
val stageIds = jobIdToStageIds(jobId).toArray
// 提取出每个stage的最新尝试信息,当job启动时会告知SparkListenersJob
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
// 封装一个SparkListenerEvent,通知SparkListenersJob启动了,并传递Job相关信息
// 底层会把这个event事件post到eventQueue中,一个单独的Java的线程池会不停的poll出来并做对应的处理
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
// 开始提交Stage
submitStage(finalStage)
}
When a job is built, you can see that the ResultStage is created starting from the last RDD. Spark then keeps walking backwards through the parent dependencies, checking whether each RDD was persisted before (cached, materialized, or checkpointed); if not, it extracts that RDD's parents and keeps checking, until it finds a persisted RDD or reaches the very first one. Once those results are available, the stages are computed from front to back until the ResultStage is produced.
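Before stepping into createResultStage, a small sketch of what that splitting looks like from the outside. The listener-based stage counting below is my own illustration, not part of the DAGScheduler code; it simply shows that one shuffle yields one ShuffleMapStage plus the final ResultStage:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}
object StageCountDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("stage-demo"))
    // print every stage the DAGScheduler submits for this application
    sc.addSparkListener(new SparkListener {
      override def onStageSubmitted(s: SparkListenerStageSubmitted): Unit =
        println(s"submitted stage ${s.stageInfo.stageId}: ${s.stageInfo.name}")
    })
    sc.parallelize(1 to 100)
      .map(i => (i % 10, 1))   // narrow, stays inside the same stage
      .reduceByKey(_ + _)      // ShuffleDependency -> stage boundary
      .count()                 // expect 2 stages: one ShuffleMapStage + one ResultStage
    sc.stop()
  }
}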
/**
* Create a ResultStage associated with the provided jobId.
*/
private def createResultStage(
rdd: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
jobId: Int,
callSite: CallSite): ResultStage = {
// 开始创建ResultStage的父stage
// 里面有多个嵌套获取shuffle依赖和循环创建shuffleMapStage,若没有shuffle操作返回为空List
val parents = getOrCreateParentStages(rdd, jobId)
// 当前的stageId标识+1
val id = nextStageId.getAndIncrement()
// 放入刚刚生成的父stage等核心参数,生成ResultStage
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
// 把ResultStage和它的ID加入stageIdToStage
stageIdToStage(id) = stage
// 更新jobIds和jobIdToStageIds
updateJobIdStageIdMaps(jobId, stage)
// 返回这个ResultStage
stage
}
OK, let's start from getOrCreateParentStages. Pay close attention here: once inside this function there are a lot of nested, iterative algorithms and interactions between several components. It confused me a bit the first time I read it too.
/**
* Get or create the list of parent stages for a given RDD. The new Stages will be created with
* the provided firstJobId.
*/
// 创建每个父stage,而只有shuffle操作才会产生stage
// 所以这里返回的Stage可能为null,也就是只有一个resultStage
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
// 遍历当前父RDD的依赖关系,直到找到它包含的第一个ShuffleDependency
// (可能多个,也可能没有)然后放入HashSet并返回
// 然后用map依次对所有ShuffleDependency创建所有的父shuffleMapStage
// 补充:在后面的代码里面会无限循环调用这段代码来创建父stage
// 如果里面匹配不到ShuffleDependency 那么代码就会在此终止,也就是创建父stage循环终止
getShuffleDependencies(rdd).map { shuffleDep =>
// 里面会创建当前拿到的ShuffleDependency的所有父ShuffleMapStage
getOrCreateShuffleMapStage(shuffleDep, firstJobId)
}.toList
}
Start with getShuffleDependencies. It merely extracts the shuffle dependencies closest to the current RDD (stages are split at shuffles; a single job produces zero or more ShuffleMapStages and exactly one ResultStage). If a dependency is not a ShuffleDependency, the traversal continues into that parent RDD, iterating until shuffle dependencies are found or the lineage runs out.
/**
* Returns shuffle dependencies that are immediate parents of the given RDD.
*
* This function will not return more distant ancestors. For example, if C has a shuffle
* dependency on B which has a shuffle dependency on A:
*
* A <-- B <-- C
*
* calling this function with rdd C will only return the B <-- C dependency.
*
* This function is scheduler-visible for the purpose of unit testing.
*/
// 只会抽取出第一个包含ShuffleDependency的RDD的ShuffleDependency
private[scheduler] def getShuffleDependencies(
rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
// 用来存放ShuffleDependency的HashSet
val parents = new HashSet[ShuffleDependency[_, _, _]]
// 临时存放后面遍历过的RDD
val visited = new HashSet[RDD[_]]
// Stack是一个last-in-first-out (LIFO)后进先出的数据结构
val waitingForVisit = new Stack[RDD[_]]
// 把rdd push进waitingForVisit
waitingForVisit.push(rdd)
// 只要waitingForVisit不为空就循环下去
while (waitingForVisit.nonEmpty) {
// 取出顶部的第一个元素 RDD
val toVisit = waitingForVisit.pop()
// 如果刚刚拿出的RDD是否包含在visited中
if (!visited(toVisit)) {
// 就把这个RDD加入visited
// 这个临时visited使用来鉴别RDD之前是否有没被这里面的代码使用过
visited += toVisit
// 遍历这个RDD的所有依赖并做匹配,返回的是Seq[Dependency[_]]序列类型
// 依次遍历出来的RDD会做匹配,非ShuffleDependency的RDD会放回waitingForVisit
// 然后把后来进来的RDD第一个pop出来继续匹配,一直匹配到有ShuffleDependency为止,当然也可能没有
// 补充:返回的ShuffleDependency可能没有,可能是一个也可能是多个
// 比如像CoGroupedRDD就是多个RDD产生的结果依赖,而ShuffledRDD只有一个父RDD
toVisit.dependencies.foreach {
case shuffleDep: ShuffleDependency[_, _, _] =>
// 如果匹配到ShuffleDependency就放进parents
parents += shuffleDep
// 如果匹配到的是其他任何依赖就把这个RDD的父RDD push进waitingForVisit
case dependency =>
waitingForVisit.push(dependency.rdd)
}
}
}
// 遍历完后把存放ShuffleDependency的parents返回
parents
}
Inside the while loop it walks over all current dependencies of the RDD that was just popped. Careful: don't let the method name and return type mislead you into thinking it collects the dependencies of the RDD and all of its ancestors; it only fetches the dependencies of the RDD currently being visited. The return value is a collection because operators such as join produce a CoGroupedRDD that depends on multiple parent RDDs; dependencies itself is defined on the base RDD class, and subclasses differ only in how getDependencies is implemented.
/**
* Get the list of dependencies of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def dependencies: Seq[Dependency[_]] = {
// 查看RDD之前是否被checkpoint过
// 补充下:checkpoint了的RDD之前的父RDD的lineage会被切断清除
// OneToOneDependency的依赖关系是子RDD每个Partition只依赖父RDD的一个Partition
// 如果有被checkpoint过的RDD就返回都是OneToOneDependency依赖的数组
checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
// 如果没有被checkpoint过 就判断当前RDD的dependencies_是否存在
// dependencies_ 结构是Seq[Dependency[_]] 里面维护着这个RDD的所有依赖
if (dependencies_ == null) {
// 如果dependencies_为空,就调用getDependencies获取Dependencies
// 不同的RDD子类会复写getDependencies方法,比如ShuffledRDD,CoGroupedRDD等
// 他们都会根据父RDD或者分区数等参数来生成Dependencies
// 最后赋值给dependencies_
dependencies_ = getDependencies
}
// 返回dependencies_
dependencies_
}
}
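As a side note, the checkpoint branch above is easy to observe. The sketch below (assuming a local SparkContext and a writable /tmp path; the directory name is hypothetical) shows the lineage being cut once the RDD has been checkpointed and materialized by an action:
import org.apache.spark.{SparkConf, SparkContext}
object CheckpointLineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("ckpt-demo"))
    sc.setCheckpointDir("/tmp/ckpt-demo")            // hypothetical local dir; normally an HDFS path
    val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
    println(rdd.toDebugString)                       // full lineage before checkpointing
    rdd.checkpoint()                                 // only marks the RDD
    rdd.count()                                      // the action materializes the checkpoint
    // dependencies is now rebuilt as a single OneToOneDependency on the internal
    // checkpoint RDD, and the old lineage above it has been dropped
    println(rdd.toDebugString)
    println(rdd.dependencies)
    sc.stop()
  }
}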
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
A plain RDD simply returns the deps it was constructed with:
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {
ShuffledRDD and CoGroupedRDD override getDependencies with their own logic, while MapPartitionsRDD instead relies on the OneToOneDependency wired up by the RDD constructor it calls (shown a bit later).
Since we are here, let's also look at how a few of these operators build their dependencies:
When reduceByKey is called without an explicit partitioner, a HashPartitioner is used by default and a shuffle may be produced. It has several overloads, but they all end up calling combineByKeyWithClassTag.
Because shuffle is widely considered the step that hurts cluster performance the most, Spark tries to avoid it by design, so the partitioner is always checked before a ShuffleDependency is finally created.
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
reduceByKey(defaultPartitioner(self), func)
}
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
// 用作map端和reduce端的聚合操作
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
// 判断下当前RDD的partitioner和父RDD的partitioner的属性是否相等
// 包括:partitioner中维护着不同的分区器(Hash/RangePartitioner)以及每个Key对应的分区
if (self.partitioner == Some(partitioner)) {
// 如果都一样的话就调用mapPartitions算子(Transformation算子)
// 避免了shuffle操作
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
// 如果partitioner属性不相等的话就会引发shuffle,参数为当前RDD(shuffled后的父RDD)和partitioner
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
And defaultPartitioner, which normally yields a HashPartitioner:
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
// 如果有多的othersRDD传入就加入到rdd的Seq里(++是两个list组合成一起)
val rdds = (Seq(rdd) ++ others)
// 用filter过滤掉每个rdd是否有partitioner并且每个partitioner的numPartitions是否大于0
// 就是判断下之前的RDD有没有partitioner而且分区个数不为0
val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
// 判断刚过滤出来的hasPartitioner是否存在
if (hasPartitioner.nonEmpty) {
// 如果rdd有Partitioner则用maxBy拿到刚刚过滤出来的rdd数组中分区数量最大的那个分区器
hasPartitioner.maxBy(_.partitions.length).partitioner.get
} else {
// 如果走到这里就代表之前所有的RDD都没有设置过Partitioner
// 如果之前我们通过参数设置过 就调用参数的并行度来设置分区 并生成HashPartitioner
if (rdd.context.conf.contains("spark.default.parallelism")) {
new HashPartitioner(rdd.context.defaultParallelism)
} else {
// 同样的 默认使用HashPartitioner,分区数为上游的所有RDD中最大分区数
new HashPartitioner(rdds.map(_.partitions.length).max)
}
}
}
}
class HashPartitioner(partitions: Int) extends Partitioner {
require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
// 该分区器产生的分区个数,也就是下游reduce端的分区数
def numPartitions: Int = partitions
// reduce端划分分区的算法
def getPartition(key: Any): Int = key match {
case null => 0
case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}
// 在所有RDD生成ShuffleDependency之前都会判断下两个分区数是否相等
override def equals(other: Any): Boolean = other match {
case h: HashPartitioner =>
// 比较的仅仅是分区个数
h.numPartitions == numPartitions
case _ =>
false
}
override def hashCode: Int = numPartitions
}
As an aside, here is the algorithm HashPartitioner uses to assign keys to the downstream reduce partitions when a shuffle happens:
def nonNegativeMod(x: Int, mod: Int): Int = {
// 对于拿到的key求hashCode然后对map端的分区数求模
val rawMod = x % mod
// 如果计算出来的余数小于零就加上分区数,反之返回余数
rawMod + (if (rawMod < 0) mod else 0)
}
If the partitioners are equal, the result is built with mapPartitions, i.e. a MapPartitionsRDD (a one-to-one-dependency operator, more on that shortly), and no shuffle is produced.
Otherwise a ShuffledRDD is created. Now, back to where the dependencies were being extracted:
protected def getDependencies: Seq[Dependency[_]] = deps
ShuffledRDD overrides how its dependencies are obtained; you can see that in the end it news up a ShuffleDependency:
// 拿到RDD依赖。
override def getDependencies: Seq[Dependency[_]] = {
// 首先拿到生成ShuffleDependency的成员参数serializer,有的话就直接get
val serializer = userSpecifiedSerializer.getOrElse {
// 若get不到就从sparkEnv执行环境中的serializerManager中拿取
val serializerManager = SparkEnv.get.serializerManager
// 根据是否map端是否聚合触发不同的提取方法
if (mapSideCombine) {
serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
} else {
serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
}
}
// 生成的ShuffleDependency会被放进list返回
// 补充下:这里只返回一个父RDD的依赖
// 因为和CoGroupedRDD都是复写的RDD的protected def getDependencies: Seq[Dependency[_]] = deps
// 所以返回的时候得满足Seq[Dependency[_]]类型 就用list封装了
// 所以大家别被这个方法和返回类型的字面意思给蒙骗了
// 包括像getCacheLocs用来做task最佳位置的判断机制,它判断的也不仅仅是MEMORY级别
List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}
At this point the ShuffleDependency registers itself with the shuffleManager and the ContextCleaner; most importantly, it wraps its parent RDD.
Every later recursive walk over parent RDDs extracts them from here.
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
@transient private val _rdd: RDD[_ <: Product2[K, V]],
val partitioner: Partitioner,
val serializer: Serializer = SparkEnv.get.serializer,
val keyOrdering: Option[Ordering[K]] = None,
val aggregator: Option[Aggregator[K, V, C]] = None,
val mapSideCombine: Boolean = false)
extends Dependency[Product2[K, V]] {
// 父RDD
override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
// Note: It's possible that the combiner class tag is null, if the combineByKey
// methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
private[spark] val combinerClassName: Option[String] =
Option(reflect.classTag[C]).map(_.runtimeClass.getName)
// 生成shuffleId,也就是通过nextShuffleId加1
val shuffleId: Int = _rdd.context.newShuffleId()
// 向shuffleManager注册一个shuffle并且获得一个指定类型的ShuffleHandle
// 比如:在之前章节讲到的SparkEnv中默认使用的SortShuffleManager会复写registerShuffle
val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
shuffleId, _rdd.partitions.length, this)
// 把ShuffleDependency注册到ContextCleaner对象中
_rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
Of course, every RDD has dependencies; different RDD types just build them in different ways.
Let's look at the MapPartitionsRDD mentioned earlier, taking the map function as an example:
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
// 继承extends RDD[U](prev) 会产生OneToOneDependency依赖
// 这里的参数:var prev: RDD[T] 是父RDD
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
var prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false)
extends RDD[U](prev) {
// 默认:MapPartitionsRDD不会生成shuffle,也就不会产生ShuffleDependency,所以也就不会生成partitioner
override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
// 分区数用的第一个父RDD的分区数
override def getPartitions: Array[Partition] = firstParent[T].partitions
// 计算逻辑是根据最初RDD算子的func来决定的,如下
// runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U)
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
// 清除依赖,比如在checkpoint的时候 就会执行此方法
override def clearDependencies() {
super.clearDependencies()
prev = null
}
}
It is hard to see where its dependency comes from; you may overlook the extends RDD[U](prev) part, since the primary RDD constructor creates no dependencies by default. The trick is that it calls the auxiliary RDD constructor that wires up a one-to-one dependency:
/** Construct an RDD with just a one-to-one dependency on one parent */
// 带OneToOneDependency依赖参数的RDD构造函数
def this(@transient oneParent: RDD[_]) =
this(oneParent.context, List(new OneToOneDependency(oneParent)))
Operators with a OneToOneDependency, such as map and filter, map each child partition to exactly one parent partition, so no shuffle occurs. Like RangeDependency, it extends the narrow-dependency base class NarrowDependency.
/**
* :: DeveloperApi ::
* Represents a one-to-one dependency between partitions of the parent and child RDDs.
*/
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
ShuffleDependency sits alongside NarrowDependency; both extend Dependency:
/**
* :: DeveloperApi ::
* Base class for dependencies.
*/
// 两个直接子类,额外两个NarrowDependency的子类
// 1:ShuffleDependency
// 2: NarrowDependency ———> RangeDependency
// ———> OneToOneDependency
@DeveloperApi
abstract class Dependency[T] extends Serializable {
def rdd: RDD[T]
}
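RangeDependency, the other NarrowDependency subclass listed above, shows up for example in union, where each parent's partitions map onto a contiguous range of the child's partitions. A small sketch (assuming a local SparkContext):
import org.apache.spark.{SparkConf, SparkContext}
object RangeDependencyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("range-dep"))
    val a = sc.parallelize(1 to 4, 2)
    val b = sc.parallelize(5 to 8, 2)
    val u = a.union(b)    // UnionRDD with 4 partitions, no shuffle
    // union creates one RangeDependency per parent: partitions 0-1 of `u` map onto `a`,
    // partitions 2-3 map onto `b`
    u.dependencies.foreach(d => println(d.getClass.getSimpleName))  // RangeDependency, RangeDependency
    sc.stop()
  }
}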
Finally, one more operator: CoGroupedRDD. (This may feel like a digression, but to understand how DAGScheduler splits stages you need to know at least these core operators' underlying implementations and dependency relationships, because the splitting details differ per operator. I originally planned a separate chapter on RDD operators, but decided to cover these few alongside DAGScheduler, which also makes the interactions easier to follow.)
Take join as an example:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
)
}
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
: RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
cg.mapValues { case Array(vs, w1s) =>
(vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
}
}
Jump straight to its core dependency-building method. It is similar to reduceByKey: for any parent RDD that is already partitioned with the target partitioner, a one-to-one dependency is produced directly.
The difference is that join operates over multiple RDDs, so more than one dependency may be produced, returned as a Seq[Dependency[_]]:
override def getDependencies: Seq[Dependency[_]] = {
// 这里跟shuffledRDD的getDependencies不一样的是它是多个RDD聚合产生
// 所以这里会拿到多个RDD的ShuffleDependency,而shuffledRDD仅仅是拿到父RDD的依赖
rdds.map { rdd: RDD[_] =>
// 对比的其实是分区数是否相等
if (rdd.partitioner == Some(part)) {
logDebug("Adding one-to-one dependency with " + rdd)
// 相等的话 就生产OneToOneDependency依赖
new OneToOneDependency(rdd)
} else {
logDebug("Adding shuffle dependency with " + rdd)
// 不相等 就生成ShuffleDependency依赖
new ShuffleDependency[K, Any, CoGroupCombiner](
rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
}
}
}
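To see that per-parent decision in action, here is a sketch that builds a CoGroupedRDD directly (it is a @DeveloperApi class) from one pre-partitioned parent and one unpartitioned parent; the expected dependency classes noted in the comments follow from the code above:
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.CoGroupedRDD
object CoGroupDependencyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("cogroup-dep"))
    val part  = new HashPartitioner(4)
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(part)  // already carries `part`
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y"))                    // no partitioner yet
    val cg = new CoGroupedRDD[Int](Seq(left, right), part)
    // getDependencies decides per parent: the co-partitioned side stays narrow,
    // the other side needs a ShuffleDependency to be re-partitioned by `part`
    cg.dependencies.foreach(d => println(d.getClass.getSimpleName))
    // expected (in the order of Seq(left, right)): OneToOneDependency, ShuffleDependency
    sc.stop()
  }
}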
OK, that covers how the core operators build their dependencies. Now let's pick up where the DAGScheduler was extracting dependencies.
If you've forgotten, go back and look at:
dependencies_ = getDependencies
The traversal keeps repeating until the ShuffleDependency closest to this RDD is obtained (or none is found), and then the ShuffleMapStages start getting created.
Back in the getOrCreateParentStages method from before:
/**
* Get or create the list of parent stages for a given RDD. The new Stages will be created with
* the provided firstJobId.
*/
// 创建每个父stage,而只有shuffle操作才会产生stage
// 所以这里返回的Stage可能为null,也就是只有一个resultStage
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
// 遍历当前父RDD的依赖关系,直到找到它包含的第一个ShuffleDependency
// (可能多个,也可能没有)然后放入HashSet并返回
// 然后用map依次对所有ShuffleDependency创建所有的父shuffleMapStage
// 补充:在后面的代码里面会无限循环调用这段代码来创建父stage
// 如果里面匹配不到ShuffleDependency 那么代码就会在此终止,也就是创建父stage循环终止
getShuffleDependencies(rdd).map { shuffleDep =>
// 里面会创建当前拿到的ShuffleDependency的所有父ShuffleMapStage
getOrCreateShuffleMapStage(shuffleDep, firstJobId)
}.toList
}
Before creating a ShuffleMapStage, Spark first tries to look one up in shuffleIdToMapStage by shuffleId (any ShuffleMapStage created earlier has been registered there so the same shuffle can be reused).
Only if nothing is found does it call getMissingAncestorShuffleDependencies.
The method contains several levels of nested iteration, so follow the annotations carefully.
/**
* Gets a shuffle map stage if one exists in shuffleIdToMapStage. Otherwise, if the
* shuffle map stage doesn't already exist, this method will create the shuffle map stage in
* addition to any missing ancestor shuffle map stages.
*/
private def getOrCreateShuffleMapStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
// 通过从ShuffleDependency提取到的shuffleId来提取shuffleIdToMapStage中的ShuffleMapStage
shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
// 如果能提取到 就直接返回
case Some(stage) =>
stage
// 如果提取不到就会依次找到所有父ShuffleDependencies并且构建所有父ShuffleMapStage
case None =>
// Create stages for all missing ancestor shuffle dependencies.
// 找到之前还未注册到shuffleIdToMapStage的父RDD的shuffle dependencies
// 这个方法会拿到rdd的所有ShuffleDependency
// 里面还有个逻辑相似的迭代嵌套提取ShuffleDependency方法,所以这段代码很消耗性能
getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
// Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
// that were not already in shuffleIdToMapStage, it's possible that by the time we
// get to a particular dependency in the foreach loop, it's been added to
// shuffleIdToMapStage by the stage creation process for an earlier dependency. See
// SPARK-13902 for more information.
// 根据遍历出来的所有ShuffleDependencies依次创建所有父ShuffleMapStage
// 因为返回出来的ShuffleDependency存储结构是Stack,所以是从最第一个ShuffleDependency开始创建
if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
createShuffleMapStage(dep, firstJobId)
}
}
// Finally, create a stage for the given shuffle dependency.
// 最后会创建当前ShuffleDependency的ShuffleMapStage
createShuffleMapStage(shuffleDep, firstJobId)
}
}
The structure and purpose of shuffleIdToMapStage:
// shuffle依赖ID和对应的 ShuffleMapStage的映射关系,只包含在运行中的job,运行完毕会清除掉
// 会在创建ShuffleMapStage的时候把该shuffleId和自己的映射加入shuffleIdToMapStage以便后面相同算子复用
private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]
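Roughly how this reuse shows up in practice (a sketch; the exact bookkeeping differs slightly across Spark versions): running two actions over the same shuffled RDD lets the second job reuse the registered shuffle output, so its map stage appears as "skipped" in the web UI:
import org.apache.spark.{SparkConf, SparkContext}
object ShuffleReuseDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("reuse-demo"))
    val counts = sc.parallelize(1 to 1000)
      .map(i => (i % 10, 1))
      .reduceByKey(_ + _)      // one ShuffleDependency -> one ShuffleMapStage
    counts.count()             // job 1: runs the ShuffleMapStage and a ResultStage
    counts.collect()           // job 2: the shuffle output from job 1 is still registered with
                               // MapOutputTrackerMaster, so the map stage is already available
                               // and the UI shows it as "skipped"
    sc.stop()
  }
}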
If there is no reusable ShuffleMapStage, getMissingAncestorShuffleDependencies is called:
/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */
private def getMissingAncestorShuffleDependencies(
rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
// Stack是一个last-in-first-out (LIFO)后进先出的数据结构
// 这里之所以用stack是用来待会生成ShuffleMapStage是从最后一个ShuffleDependency开始
val ancestors = new Stack[ShuffleDependency[_, _, _]]
// 临时存放RDD
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new Stack[RDD[_]]
// 把父RDDpush进waitingForVisit
waitingForVisit.push(rdd)
while (waitingForVisit.nonEmpty) {
val toVisit = waitingForVisit.pop()
// 判断visited是否包含刚从waitingForVisit.pop出来的RDD
if (!visited(toVisit)) {
// 如果不包含就加入
visited += toVisit
// 这里会拿到父RDD的ShuffleDependency,可能没有,也可能是一个或者多个
// 简单的说里面的实现其实就是一直遍历到之前有可复用的RDD为止,然后把这个阶段遍历的所有RDD的依赖
// 都加入到ancestors中,用来待会创建ShuffleMapStage
getShuffleDependencies(toVisit).foreach { shuffleDep =>
if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {
// 如果shuffleIdToMapStage不包含ShuffleDependency的shuffleId,就push进ancestors
ancestors.push(shuffleDep)
// 把ShuffleDependency的父RDD push进waitingForVisit
// 继续while循环取出父RDD的父RDD依赖..直到遍历完所有ShuffleDependency或者被提取到
waitingForVisit.push(shuffleDep.rdd)
} // Otherwise, the dependency and its ancestors have already been registered.
}
}
}
// 返回的包含所有未注册或者已经注册进shuffleIdToMapStage的所有父RDD依赖,也可能返回为空
ancestors
}
Now the ShuffleMapStage itself is created. A ShuffleMapStage ends right before a shuffle and may contain several transformations; when it runs it saves the map-side output files, which the reduce tasks fetch later.
Inside, it keeps calling back into getOrCreateParentStages; it is worth re-reading that method, since it is effectively the entry point for building the whole stage ancestry of the ResultStage.
The recursion keeps going until a reusable stage, or the very first stage, is reached:
/**
* Creates a ShuffleMapStage that generates the given shuffle dependency's partitions. If a
* previously run stage generated the same shuffle data, this function will copy the output
* locations that are still available from the previous shuffle to avoid unnecessarily
* regenerating data.
*/
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
// ShuffleDependency的父RDD
val rdd = shuffleDep.rdd
// 多少个分区
val numTasks = rdd.partitions.length
// 用父RDD循环调用,每次调用都是前一个父RDD
// 在这里其实就会一直递归循环直到拿到首个stage才退出来
// 最后把生成的ShuffleMapStage加入shuffleIdToMapStage以便后面直接从中拿取
val parents = getOrCreateParentStages(rdd, jobId)
// 标记当前StageId nextStageId+1
val id = nextStageId.getAndIncrement()
// 拿到之前的stages等核心参数后就可以构建ShuffleMapStage了
val stage = new ShuffleMapStage(
id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)
// 把刚创建的ShuffleMapStage赋值给stageIdToStage
stageIdToStage(id) = stage
// 赋值给shuffleIdToMapStage
// 若后面的代码再次生成对应的ShuffleMapStage就可以从shuffleIdToMapStage中直接拿取了
shuffleIdToMapStage(shuffleDep.shuffleId) = stage
// 更新jobIds和jobIdToStageIds
updateJobIdStageIdMaps(jobId, stage)
// 这里会把shuffle信息注册到Driver上的MapOutputTrackerMaster的shuffleStatuses
if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
// Kind of ugly: need to register RDDs with the cache and map output tracker here
// since we can't do it in the RDD constructor because # of partitions is unknown
logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
// 把Shuffle信息注册到自己Driver的MapOutputTrackerMaster上
// 生成的是shuffleId和ShuffleStatus的映射关系
// 在后面提交Job的时候还会根据它来的验证map stage是否已经准备好
mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
}
// 最后返回生成的ShuffleMapStage
stage
}
Once the ShuffleMapStage is created, you can see the shuffle being registered into shuffleStatuses on the driver-side MapOutputTrackerMaster, which is used later both for validation and for the reduce side to fetch the map output.
def registerShuffle(shuffleId: Int, numMaps: Int) {
if (shuffleStatuses.put(shuffleId, new ShuffleStatus(numMaps)).isDefined) {
throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
}
}
// 在Driver上存放shuffleId和ShuffleStatus的映射关系
private val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatus]().asScala
ShuffleStatus mainly maintains, for one ShuffleMapStage within a job, the mapping from map partition ids to MapStatus. A MapStatus describes one map partition's output: the BlockManagerId where it lives (i.e. its location) and the sizes of the blocks destined for each reducer. All of this is used later when computing the best task locations and in other interactions.
// 索引长度是分区数,里面维护着每个partition对应的MapStatus
// MapStatus中维护的是BlockManagerId,也就是每个task运行的位置和每个reduce task接收的block大小
private[this] val mapStatuses = new Array[MapStatus](numPartitions)
private[spark] sealed trait MapStatus {
/** Location where this task was run. */
def location: BlockManagerId
/**
* Estimated size for the reduce block, in bytes.
*
* If a block is non-empty, then this method MUST return a non-zero size. This invariant is
* necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
*/
def getSizeForBlock(reduceId: Int): Long
}
OK. After repeatedly stepping back through parent RDDs/stages and finding the closest point from which stages can be built, the stages are constructed from the earliest one forward until the final ResultStage is produced (a job may of course consist of nothing but a ResultStage). Now let's return to the original handleJobSubmitted (scroll back up if you've forgotten it):
// 在eventProcessLoop接受到提交job的事件任务后就会触发,开始划分stage
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
// 创建ResultStage,这里才是真正开始处理提交的job划分stage的时候
// 它会从后往前找递归遍历它的每一个父RDD,从持久化中抽取反之重新计算
// 补充下:stage分为shuffleMapStage和ResultStage两种
// 每个job都是由1个ResultStage和0+个ShuffleMapStage组成
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
// 把createResultStage封装在ActiveJob中,你可以把它看做成Job的代表
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
// 清除每个被持久化的RDD分区的位置
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val jobSubmissionTime = clock.getTimeMillis()
// HashMap结构,维护着jobId和jobIdToActiveJob的映射关系
jobIdToActiveJob(jobId) = job
// HashSet结构,维护着所有ActiveJob
activeJobs += job
// finalStage一旦生成就会把封装自己的ActiveJob注册到自己的_activeJob上
// 而整个Job结束后就会清除掉
finalStage.setActiveJob(job)
// 提取出jobId对应的所有StageIds并转换成数组
val stageIds = jobIdToStageIds(jobId).toArray
// 提取出每个stage的最新尝试信息,当job启动时会告知SparkListenersJob
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
// 封装一个SparkListenerEvent,通知SparkListenersJob启动了,并传递Job相关信息
// 底层会把这个event事件post到eventQueue中,一个单独的Java的线程池会不停的poll出来并做对应的处理
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
// 开始提交Stage
submitStage(finalStage)
}
Now that we have the finalStage, after updating and wrapping a few properties we enter submitStage, the entry point for submitting the job:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
// 拿到第一个activeJob对应的jobId
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
// waitingStages->等待运行的stages
// runningStages->正在运行的stages
// failedStages->由于获取失败需要重新提交的stages
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
// 向前遍历父依赖,找出map输出尚未全部完成(isAvailable为false)的父ShuffleMapStage并返回
// 备注:在最开始划分DAG时,第一次进入这里的是ResultStage
// 它前面的ShuffleMapStage都还没有运行过,所以只要存在shuffle,missing就不为空
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
// 如果没有父ShuffleMapStage,或者所有父mapStage的输出都已准备好(isAvailable),missing就为空
// 如果返回为空
// 开始提交Tasks!
submitMissingTasks(stage, jobId.get)
} else {
// 若代码走到这里的话 就是之前的mapStage没准备好
for (parent <- missing) {
// 再次提交Stage
submitStage(parent)
}
// 然后放入等待waitingStages
waitingStages += stage
}
}
} else {
// 否则终止stage
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
Before submitting tasks, it first checks whether the RDDs were persisted and whether the parent map stages are ready:
private def getMissingParentStages(stage: Stage): List[Stage] = {
// 存放没准备好的mapStage
val missing = new HashSet[Stage]
// 存放被访问过的RDD的临时变量
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
// 又是后进先出的Stack结构
val waitingForVisit = new Stack[RDD[_]]
def visit(rdd: RDD[_]) {
if (!visited(rdd)) {
// 如果这个RDD没被访问过就加入visited,下次循环就不会访问这个RDD了
visited += rdd
// 这里的getCacheLocs并不是根据字面意思的缓存来理解只是检查之前有没有仅仅缓存过RDD
// 而是做的双重检查:
// ①检查cacheLocs.contains(rdd.id) ②检查rdd.getStorageLevel == StorageLevel.NONE
// getCacheLocs返回的是executor_host_executorId标识的task位置,最后判断下是否为空
// 补充:包括在后面的task最佳位置划分算法也是会用到getCacheLocs(rdd: RDD[_])
val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
// Nil表示空list,当没有被持久化过那么就是为true,需要继续遍历上一个RDD的依赖
if (rddHasUncachedPartitions) {
// 如果之前没持久化过 就遍历当前rdd的所有依赖
// 只有到下次while循环才会遍历父RDD的依赖,可能一个或者多个
// 其实这里主要是在检测之前的createResultStage有没有成功构建好ShuffleMapStage
for (dep <- rdd.dependencies) {
dep match {
case shufDep: ShuffleDependency[_, _, _] =>
// 在之前的代码若成功创建了ShuffleMapStage
// 那么就可以直接从shuffleIdToMapStage拿取
val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
// 判断当前ShuffleMapStage的所有分区的map输出是否都已经准备好(isAvailable)
// 也就是这个ShuffleMapStage是否已经成功运行过
if (!mapStage.isAvailable) {
// 若尚未就绪则加入missing
missing += mapStage
}
// 窄依赖就push回去,继续遍历
case narrowDep: NarrowDependency[_] =>
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
// 把当前stage的RDDpush进waitingForVisit
waitingForVisit.push(stage.rdd)
// 一直循环到pop出所有RDD
while (waitingForVisit.nonEmpty) {
visit(waitingForVisit.pop())
}
missing.toList
}
A note on getCacheLocs: some online write-ups describe it as "just the cache", which is wrong. Don't be misled by the name: it does not only check whether the RDD was cached in memory, it also covers the other persistence levels, including disk and off-heap, and it is reused later when computing the best task locations. More generally: don't read source code by method names alone; even though modern code reads almost like natural language, grasping the framework's essence still requires going into the low-level details of the implementation.
private[scheduler]
def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized {
// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times
// cacheLocs是一个mutable类型的HashMap,里面存储的是各个RDDId和它对应的被持久化的task位置
// rdd.id底层调用的是nextRddId.getAndIncrement()这里会把自己注册到自己的SparkContext中并返回它的rddId
// 判断传进来的rddId是否存在cacheLocs的map里
if (!cacheLocs.contains(rdd.id)) {
// Note: if the storage level is NONE, we don't need to get locations from block manager.
// 如果这个rdd不包含在cacheLocs就判断下是否它的存储级别为NONE,如果是就不需要从blockmanager里面获取
val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
// 补充 Nil是空的List (extends List[Nothing])
IndexedSeq.fill(rdd.partitions.length)(Nil)
} else {
// 如果这个rdd的StorageLevel不为NONE但却在cacheLocs中没被找到
// 说明这个rdd它是有持久化级别设置的
// 找到这个rdd的所有task持久化的位置最后赋值给cacheLocs 包括这次以后都可以从cacheLocs拿取了
// 像这种情况:如果是这个RDD有持久化级别 但是是第一次调用 就会走到这段代码里,
// 而它的持久化信息会存储到cacheLocs中 方便下次复用直接拿取task地址
val blockIds =
// 拿到rdd的分区Array[Partition]中的每个Partition对应的索引,然后用map遍历操作
// 把拿到的每个index和rddId生成RDDBlockId并把它们转换成BlockId类型的数组
// 这里RDDBlockId继承于BlockId,只是复写了父类的name方法
// 而这个被复写的name就是BlockId作为全局的标识符
// 看见网上很多在问block和partition的关系,而这就是他们的关系之一(一个block对应一个partition)
rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
// 获取到每个blockId的存放地址
// 底层是通过blockManagerMaster调用Driver端的Endpoint的receiveAndReply来做相应的处理
// 最后从Driver端的blockLocations中获取每个blockId对应的多个BlockManagerId
// BlockManagerId是BlockManager的唯一标识符,里面维护了host,executorId等核心成员
blockManagerMaster.getLocations(blockIds).map { bms =>
// 提取出blockManagerId对应的host和executorId(一个host可能会有多个executor
// 再通过提取出的2个参数传入调用TaskLocation,返回的是ExecutorCacheTaskLocation对象
// 返回对象里唯一成员toString最终会格式化成executor_host_executorId
// 也就是每个task运行的位置标记!!!
bms.map(bm => TaskLocation(bm.host, bm.executorId))
}
}
// 把拿到的locs地址信息赋值给cacheLocs里的rdd
// 下面的代码cacheLocs(rdd.id)会直接从中拿取
cacheLocs(rdd.id) = locs
}
// 最后根据rdd从cacheLocs拿去task的持久化地址
// 补充:这里只有一种情况 拿到的为空,就是cacheLocs不包含rdd并且StorageLevel为NONE
cacheLocs(rdd.id)
}
First, the structure of cacheLocs:
// 每个被持久化的RDD分区的位置,Key是RDDId,Value是对应的分区序列
// [Int, IndexedSeq[Seq[TaskLocation]]]你可以看成是[RDDId,BlockId[BlockManagerId[TaskLocation]]]
private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
To find out where an RDD was previously persisted, the DAGScheduler talks to blockManagerMaster and BlockManagerMasterEndpoint to obtain the location of every persisted partition. blockManagerMaster is created when SparkContext builds the SparkEnv; on the driver it maintains the metadata of every node's BlockManager in the cluster. BlockManagerMasterEndpoint is the message endpoint that registers itself into SparkEnv when blockManagerMaster is created on the driver, and it handles each incoming event message according to its type (see my earlier SparkEnv chapter for the details).
First, getLocations, which goes through the Netty-based RPC introduced in an earlier chapter:
/** Get locations of multiple blockIds from the driver */
def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
// askSync会触发调用driver端的receiveAndReply并匹配GetLocationsMultipleBlockIds
// 的context.reply(getLocationsMultipleBlockIds(blockIds))
driverEndpoint.askSync[IndexedSeq[Seq[BlockManagerId]]](
GetLocationsMultipleBlockIds(blockIds))
}
driverEndpoint.askSync triggers receiveAndReply (the request-reply handler) on BlockManagerMasterEndpoint, which then matches GetLocationsMultipleBlockIds:
case GetLocationsMultipleBlockIds(blockIds) =>
// 通过回调函数返回给sender多个BlockId的信息
context.reply(getLocationsMultipleBlockIds(blockIds))
private def getLocationsMultipleBlockIds(
blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
// 拿到每个blockId对应的多个BlockManagerIds
blockIds.map(blockId => getLocations(blockId))
}
Finally we get the BlockManagerIds for each BlockId (each containing host, executorId, port and other fields):
private def getLocations(blockId: BlockId): Seq[BlockManagerId] = {
// 如果blockLocations包含blockId就get出来不然就设置为空
if (blockLocations.containsKey(blockId)) blockLocations.get(blockId).toSeq else Seq.empty
}
The structure of blockLocations:
// Mapping from block id to the set of block managers that have the block.
// BlockId对应的多个BlockmanagerId,因为可能会是StorageLevel或者checkpoint的原因,
// 所以这个Block会存放在多个executor中的Blockmanager中
// 补充:JHashMap是java的HashMap
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
Finally the host and executorId are extracted and wrapped via TaskLocation into the location identifier of each partition:
blockManagerMaster.getLocations(blockIds).map { bms =>
// 提取出blockManagerId对应的host和executorId(一个host可能会有多个executor
// 再通过提取出的2个参数传入调用TaskLocation,返回的是ExecutorCacheTaskLocation对象
// 返回对象里唯一成员toString最终会格式化成executor_host_executorId
// 也就是每个task运行的位置标记!!!
bms.map(bm => TaskLocation(bm.host, bm.executorId))
}
}
This calls apply on TaskLocation's companion object:
def apply(host: String, executorId: String): TaskLocation = {
new ExecutorCacheTaskLocation(host, executorId)
}
/**
* A location that includes both a host and an executor id on that host.
*/
private [spark]
case class ExecutorCacheTaskLocation(override val host: String, executorId: String)
extends TaskLocation {
// executor_host_executorId
override def toString: String = s"${TaskLocation.executorLocationTag}${host}_$executorId"
}
With all of the above checks done (persistence, parent map stages ready), we move into the phase that prepares the tasks for submission. It involves the best-location algorithm for tasks, wrapping closures, broadcast variables, creating ShuffleMapTasks and ResultTasks (next chapter), and submitting the tasks (next chapter):
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
logDebug("submitMissingTasks(" + stage + ")")
// First figure out the indexes of partition ids to compute.
// 返回的是一个Seq[Int],索引长度是需要计算的partitionId
// 补充:shuffleStage和resultStage的实现都不一样
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
// Use the scheduling pool, job group, description, etc. from an ActiveJob associated
// with this Stage
// 拿到该job的properties
val properties = jobIdToActiveJob(jobId).properties
// 把stage加入正在运行状态
runningStages += stage
// SparkListenerStageSubmitted should be posted before testing whether tasks are
// serializable. If tasks are not serializable, a SparkListenerStageCompleted event
// will be posted, which should always come after a corresponding SparkListenerStageSubmitted
// event.
stage match {
case s: ShuffleMapStage =>
outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
case s: ResultStage =>
outputCommitCoordinator.stageStart(
stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
// map每个partitionId,根据Id和这个stage的RDD调用Task最佳位置划分算法
// 补充不同类型的RDD所调用的最优位置算法逻辑都不一样
// 假如是ShuffledRDD实现核心思想是:
// 首先会查询BlockManager是否持久化过,若有就去Driver端找BlockManagerMaster获取地址
// 否则就会去查找是否checkpoint过,若有就可能会去hdfs直接获取
// 若都没持久化过,就会去找MapOutputTracker查找之前在map端写入的shuffle文件的地址
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
} catch {
case NonFatal(e) =>
stage.makeNewStageAttempt(partitionsToCompute.size)
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
// 这里会把刚刚执行过的最新stage信息更新进_latestInfo中
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
// If there are tasks to execute, record the submission time of the stage. Otherwise,
// post the even without the submission time, which indicates that this stage was
// skipped.
if (partitionsToCompute.nonEmpty) {
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
}
// 告诉listenerBus已经提交stage了
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
// TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
// Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
// the serialized copy of the RDD and for each task we will deserialize it, which means each
// task gets a different copy of the RDD. This provides stronger isolation between tasks that
// might modify state of objects referenced in their closures. This is necessary in Hadoop
// where the JobConf/Configuration object is not thread-safe.
// 下面会把task封装成闭包然后通过Broadcast分发到各个节点
var taskBinary: Broadcast[Array[Byte]] = null
try {
// For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
// For ResultTask, serialize and broadcast (rdd, func).
// 不管是ShuffleMapStage的task或者ResultStage的task都得序列化并且广播
// 这里返回的是task字节数组的闭包
val taskBinaryBytes: Array[Byte] = stage match {
case stage: ShuffleMapStage =>
// 转换成字节数组
JavaUtils.bufferToArray(
// 底层用的是java.nio.ByteBuffer缓冲区
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
case stage: ResultStage =>
JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
// broadcast 可以把指定的对象转换成只读的广播变量发送到每个节点上
// 然后同个节点的每个executor的partition都会找worker拉取自己的闭包
// 如果这里不用broadcast 那么就会把给每个task拷贝一份闭包,这样就会产生大量IO
// 所以这里会用广播去优化,就像平时读取大的配置文件 或者避免join操作的Shuffle时候 都可以用到广播来优化
// 这里顺便提下 spark的RDD都是封装成闭包分布到各个节点的
// 闭包的特性是延迟加载和不能修改闭包外的变量(只能用累加器Accumulator实现修改变量)
taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
// In the case of a failure during serialization, abort the stage.
case e: NotSerializableException =>
abortStage(stage, "Task not serializable: " + e.toString, Some(e))
runningStages -= stage
// Abort execution
return
case NonFatal(e) =>
abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
// 划分Task
// 根据分区依次创建Task集合,tasks数量对应分区个数
val tasks: Seq[Task[_]] = try {
// 这里也会把task的指标检测对象taskMetrics封装成序列化闭包
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
// 当匹配到生成的是ShuffleMapStage
case stage: ShuffleMapStage =>
// 首先保证pendingPartitions为空
// pendingPartitions中放的是还没完成的partition,还没完成的task
// 如果完成了就会从中清除
// DAGScheduler会用它来确定此stage是否已完成
stage.pendingPartitions.clear()
// 开始遍历操作每个分区
partitionsToCompute.map { id =>
// 拿到分区地址
val locs = taskIdToLocations(id)
// 拿到此stage对应的rdd的分区
val part = stage.rdd.partitions(id)
// 加入运行状态
stage.pendingPartitions += id
// 开始构建ShuffleMapTask对象,之后会通过这个对象调用runTask,具体详情会在下个章节
// 补充:Task分为两种:一种是ShuffleMapTask,一种是ResultTask
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId)
}
// 当匹配到ResultStage时生成的是ResultTask
case stage: ResultStage =>
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val p: Int = stage.partitions(id)
val part = stage.rdd.partitions(p)
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
if (tasks.size > 0) {
logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
// 开始提交task
// 这里调用的是 实现taskScheduler特质的TaskSchedulerImpl
// 它会提交被taskSet封装的tasks
// 具体详细放在下个章节
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
} else {
// Because we posted SparkListenerStageSubmitted earlier, we should mark
// the stage as completed here in case there are no tasks to run
// 由于某些原因可能没有拿到任何task,但之前已经post过SparkListenerStageSubmitted
// 所以这里必须把这个stage标记为已完成(markStageAsFinished)
markStageAsFinished(stage, None)
val debugString = stage match {
case stage: ShuffleMapStage =>
s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})"
case stage : ResultStage =>
s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
}
logDebug(debugString)
// 父Stage完成后,继续依次提交子Stage
submitWaitingChildStages(stage)
}
}
First, find the partitions that still need to be computed, using ShuffleMapStage as the example:
/** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
override def findMissingPartitions(): Seq[Int] = {
mapOutputTrackerMaster
.findMissingPartitions(shuffleDep.shuffleId)
// 若返回为空的话 就直接返回所有分区个数
.getOrElse(0 until numPartitions)
}
}
/**
* Returns the sequence of partition ids that are missing (i.e. needs to be computed), or None
* if the MapOutputTrackerMaster doesn't know about this shuffle.
*/
def findMissingPartitions(shuffleId: Int): Option[Seq[Int]] = {
shuffleStatuses.get(shuffleId).map(_.findMissingPartitions())
}
/**
* Returns the sequence of partition ids that are missing (i.e. needs to be computed).
*/
def findMissingPartitions(): Seq[Int] = synchronized {
// 遍历每一个partitionId 看是否在mapStatuses中,若为null则过滤掉
// 这个mapStatuses会在task计算完成之后把对应的partition信息添加进去
// 所以若是第一次计算 mapStatuses是为空的
val missing = (0 until numPartitions).filter(id => mapStatuses(id) == null)
assert(missing.size == numPartitions - _numAvailableOutputs,
s"${missing.size} missing, expected ${numPartitions - _numAvailableOutputs}")
missing
}
Then, for each partition id that needs computing, the preferred task locations are calculated, again using ShuffleMapStage as the example:
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
// map每个partitionId,根据Id和这个stage的RDD调用Task最佳位置划分算法
// 补充不同类型的RDD所调用的最优位置算法逻辑都不一样
// 假如是ShuffledRDD实现核心思想是:
// 首先会查询BlockManager是否持久化过,若有就去Driver端找BlockManagerMaster获取地址
// 否则就会去查找是否checkpoint过,若有就可能会去hdfs直接获取
// 若都没持久化过,就会去找MapOutputTracker查找之前在map端写入的shuffle文件的地址
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
private[spark]
def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
getPreferredLocsInternal(rdd, partition, new HashSet)
}
It first calls the getCacheLocs we saw earlier to check whether the partition was persisted (memory, disk or off-heap). If not, it calls preferredLocations, which checks whether the RDD was checkpointed; failing that, for a ShuffledRDD it asks the MapOutputTracker which BlockManagers already hold a large enough share of this partition's map output (at least 20% of the total by default) and prefers those locations.
// 这里会根据不同的依赖调用不同的逻辑划分算法
private def getPreferredLocsInternal(
rdd: RDD[_],
partition: Int,
visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
// If the partition has already been visited, no need to re-visit.
// This avoids exponential path exploration. SPARK-695
// 如果之前访问过这个rdd的分区就直接返回空list
if (!visited.add((rdd, partition))) {
// Nil has already been returned for previously visited partitions.
return Nil
}
// If the partition is cached, return the cache locations
// 调用getCacheLocs,之前有介绍
// 这里并不是柯理化,只是在返回值后面继续提取对应的[Seq[TaskLocation]]
// 所以最初返回类型是IndexedSeq[Seq[TaskLocation]]],可以看做是BlockId[BlockManagerId[TaskLocation]
// 然后根据partition返回[Seq[TaskLocation]]
val cached = getCacheLocs(rdd)(partition)
if (cached.nonEmpty) {
// 若有持久化的task就直接返回
return cached
}
// If the RDD has some placement preferences (as is the case for input RDDs), get those
// 这里其实是根据设定的阈值,筛选出持有该分区足够大比例map输出数据的BlockManager的地址
// 补充:返回的地址格式也不同,这跟是否之前被checkpoint有关
val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
if (rddPrefs.nonEmpty) {
// 这里返回的三个对象里面封装就是task的地址
return rddPrefs.map(TaskLocation(_))
}
// If the RDD has narrow dependencies, pick the first partition of the first narrow dependency
// that has any placement preferences. Ideally we would choose based on transfer sizes,
// but this will do for now.
// 如果过来的RDD的依赖是窄依赖,就会迭代遍历所有父RDD的所有分区 直到任一一个有优先位置为止
rdd.dependencies.foreach {
case n: NarrowDependency[_] =>
// 遍历父RDD的所有分区
for (inPart <- n.getParents(partition)) {
// 回调getPreferredLocsInternal
val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
if (locs != Nil) {
// 一直到任一一个有优先位置为止
return locs
}
}
case _ =>
}
Nil
}
getCacheLocs was covered above (go back and look if you've forgotten), so let's start from preferredLocations:
/**
* Get the preferred locations of a partition, taking into account whether the
* RDD is checkpointed.
*/
final def preferredLocations(split: Partition): Seq[String] = {
// 首先会尝试从checkpoint中拿取RDD,若没有则直接调用getPreferredLocations
// 所以返回的地址格式也会不一样
checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
getPreferredLocations(split)
}
}
Taking ShuffledRDD as an example: it first obtains the driver-side MapOutputTrackerMaster (which holds the shuffle-phase output metadata of the BlockManagers across the cluster):
override protected def getPreferredLocations(partition: Partition): Seq[String] = {
// 首先拿到Driver端的MapOutputTrackerMaster
val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
// dependencies之前介绍过拿取到当前RDD的依赖
// 拿到的头个依赖强制转换成ShuffleDependency(本身就是ShuffledRDD,这样做也是多个保险)
val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
tracker.getPreferredLocationsForShuffle(dep, partition.index)
}
def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int)
: Seq[String] = {
// shuffleLocalityEnabled默认true
// SHUFFLE_PREF_MAP_THRESHOLD默认=1000
// SHUFFLE_PREF_REDUCE_THRESHOLD默认=1000
// REDUCER_PREF_LOCS_FRACTION=0.2
if (shuffleLocalityEnabled && dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&
dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
// 这里会过滤清洗出满足要求的所有BlockManagerId
// 补充:BlockManager在每个Executor和Drvier中都存在唯一一个负责数据的传输,接收和持久化,在之后的章节会介绍
val blockManagerIds = getLocationsWithLargestOutputs(dep.shuffleId, partitionId,
dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)
if (blockManagerIds.nonEmpty) {
// 拿到所有BlockManager的host地址
blockManagerIds.get.map(_.host)
} else {
Nil
}
} else {
Nil
}
}
Here the MapOutputTrackerMaster looks up the ShuffleStatus for the given shuffleId and walks the MapStatus of every map partition. There are two MapStatus implementations: CompressedMapStatus (the default) and HighlyCompressedMapStatus (used when the number of partitions is very large). The block sizes extracted from each BlockManager's MapStatus are then aggregated, judged against the threshold and filtered:
def getLocationsWithLargestOutputs(
shuffleId: Int,
reducerId: Int,
numReducers: Int,
fractionThreshold: Double)
: Option[Array[BlockManagerId]] = {
// 拿到这个shuffleId对应的shuffleStatuses
val shuffleStatus = shuffleStatuses.get(shuffleId).orNull
if (shuffleStatus != null) {
// 里面主要封装了synchronized用作访问这个shuffle中的mapStatuses数组的线程安全
// 补充下 :在创建一个ShuffleMapStage的时候就会把自己注册到Driver端的MapOutputTrackerMaster上
// 然后同时里面也会生成对应的shuffleStatus和一个分区对应一个mapStatus
// 默认情况下mapstatus会在SortShuffleManager生成SortShuffleWriter时候生成
// 也就是ShuffleMapTask调用runTask的时候会构建
// 里面主要是两种类型:①CompressedMapStatus ②HighlyCompressedMapStatus
shuffleStatus.withMapStatuses { statuses =>
if (statuses.nonEmpty) {
// HashMap to add up sizes of all blocks at the same location
// Map里面存放的是相同地址BlockManagerId对应的所有blocks大小
val locs = new HashMap[BlockManagerId, Long]
var totalOutputSize = 0L
var mapIdx = 0
// 里面会遍历出所有的mapStatus
while (mapIdx < statuses.length) {
// 从第一个mapStatu开始拿取
val status = statuses(mapIdx)
// status may be null here if we are called between registerShuffle, which creates an
// array with null entries for each output, and registerMapOutputs, which populates it
// with valid status entries. This is possible if one thread schedules a job which
// depends on an RDD which is currently being computed by another thread.
// 在registerShuffle的时候status可能会变成null,所以这里加了个判断
if (status != null) {
// 提取并解压缩block,默认是压缩的
val blockSize = status.getSizeForBlock(reducerId)
if (blockSize > 0) {
// 提取对应的BlockManagerId的blockSize并把刚刚解压缩的block大小叠加进去
locs(status.location) = locs.getOrElse(status.location, 0L) + blockSize
// 叠加到总输出中
totalOutputSize += blockSize
}
}
// 开始遍历下个mapStatus
mapIdx = mapIdx + 1
}
val topLocs = locs.filter { case (loc, size) =>
// 过滤条件:当前blockManager上的block总大小 / 该reduce分区的全部输出大小 >= 0.2(默认)
// 如果为true,说明这个blockManager已经持有该reduce分区相当大比例的输入数据
// 把reduce task优先调度到它上面可以减少大量的网络拉取
size.toDouble / totalOutputSize >= fractionThreshold
}
// Return if we have any locations which satisfy the required threshold
if (topLocs.nonEmpty) {
// 返回满足要求的BlockManagerId的数组
return Some(topLocs.keys.toArray)
}
}
}
}
None
}
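A tiny worked example of that fraction filter with made-up numbers (the host names and sizes below are purely hypothetical):
object ReducePrefLocDemo {
  def main(args: Array[String]): Unit = {
    // hypothetical sizes (bytes) of the blocks feeding one reduce partition, per BlockManager host
    val sizesByHost = Map("host1" -> 600L, "host2" -> 300L, "host3" -> 100L)
    val totalOutputSize   = sizesByHost.values.sum   // 1000
    val fractionThreshold = 0.2                      // default REDUCER_PREF_LOCS_FRACTION
    // keep every host that already holds at least 20% of this partition's input
    val preferred = sizesByHost.collect {
      case (host, size) if size.toDouble / totalOutputSize >= fractionThreshold => host
    }
    println(preferred)   // host1 (60%) and host2 (30%) qualify, host3 (10%) does not
  }
}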
If the filtered BlockManagers meet the threshold, their hosts are returned as the preferred locations. Back in getPreferredLocsInternal:
val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
if (rddPrefs.nonEmpty) {
// 这里返回的三个对象里面封装就是task的地址
return rddPrefs.map(TaskLocation(_))
}
/**
* Create a TaskLocation from a string returned by getPreferredLocations.
* These strings have the form executor_[hostname]_[executorid], [hostname], or
* hdfs_cache_[hostname], depending on whether the location is cached.
*/
// 若数据缓存在HDFS或executor上,传过来的str可能是hdfs_cache_[hostname]或executor_[hostname]_[executorid]
// 否则就是普通的[hostname]
def apply(str: String): TaskLocation = {
// inMemoryLocationTag = "hdfs_cache_"
// 截取掉前面是hdfs_cache_字符的str,若前缀没有包含就直接返回原来的str
val hstr = str.stripPrefix(inMemoryLocationTag)
// 判断是否是被持久化到过hdfs
if (hstr.equals(str)) {
// 如果不是则判断前缀是否是executor_
if (str.startsWith(executorLocationTag)) {
// 转换成[hostname]_[executorid]
val hostAndExecutorId = str.stripPrefix(executorLocationTag)
// 返回的是Array[String](hostname,executorid)
val splits = hostAndExecutorId.split("_", 2)
require(splits.length == 2, "Illegal executor location format: " + str)
val Array(host, executorId) = splits
// 生成的对象仅包含标识符:executor_host_executorId
new ExecutorCacheTaskLocation(host, executorId)
} else {
// 走到这说明没有被checkpoint过
// 生成的对象仅包含标识符:host
new HostTaskLocation(str)
}
} else {
// 走到这里说明之前有被checkpoint到hdfs
// 生成的对象仅包含标识符:hdfs_cache_host
new HDFSCacheTaskLocation(hstr)
}
}
}
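The three string formats are easy to check. Since TaskLocation is private[spark], the sketch below assumes it is compiled inside the org.apache.spark.scheduler package (e.g. in a test source tree):
package org.apache.spark.scheduler
object TaskLocationFormatDemo {
  def main(args: Array[String]): Unit = {
    // plain hostname -> HostTaskLocation
    println(TaskLocation("host1"))              // host1
    // executor_[hostname]_[executorid] -> ExecutorCacheTaskLocation
    println(TaskLocation("executor_host1_3"))   // executor_host1_3
    // hdfs_cache_[hostname] -> HDFSCacheTaskLocation (block cached in HDFS)
    println(TaskLocation("hdfs_cache_host2"))   // hdfs_cache_host2
  }
}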
With the tasks' preferred locations in hand, Spark serializes the task binary into a closure and broadcasts it to the cluster:
// 下面会把task封装成闭包然后通过Broadcast分发到各个节点
var taskBinary: Broadcast[Array[Byte]] = null
try {
// For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
// For ResultTask, serialize and broadcast (rdd, func).
// 不管是ShuffleMapStage的task或者ResultStage的task都得序列化并且广播
// 这里返回的是task字节数组的闭包
val taskBinaryBytes: Array[Byte] = stage match {
case stage: ShuffleMapStage =>
// 转换成字节数组
JavaUtils.bufferToArray(
// 底层用的是java.nio.ByteBuffer缓冲区
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
case stage: ResultStage =>
JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
// broadcast 可以把指定的对象转换成只读的广播变量发送到每个节点上
// 然后同个节点的每个executor的partition都会找worker拉取自己的闭包
// 如果这里不用broadcast 那么就会把给每个task拷贝一份闭包,这样就会产生大量IO
// 所以这里会用广播去优化,就像平时读取大的配置文件 或者避免join操作的Shuffle时候 都可以用到广播来优化
// 这里顺便提下 spark的RDD都是封装成闭包分布到各个节点的
// 闭包的特性是延迟加载和不能修改闭包外的变量(只能用累加器Accumulator实现修改变量)
taskBinary = sc.broadcast(taskBinaryBytes)
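As a side note on the broadcast-to-avoid-shuffle idea mentioned in the comments above, here is a sketch of a map-side "join" that broadcasts a small driver-side lookup table instead of shuffling both sides (the names and data are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
object BroadcastMapSideJoinDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("bcast-demo"))
    val events = sc.parallelize(Seq(1 -> "click", 2 -> "view", 1 -> "buy"))
    val users  = Map(1 -> "alice", 2 -> "bob")   // small driver-side lookup table
    // one read-only copy per executor instead of one copy inside every task closure
    val usersBc = sc.broadcast(users)
    // map-side "join": no ShuffleDependency is created, the stage stays narrow
    val joined = events.mapPartitions { iter =>
      val lookup = usersBc.value
      iter.flatMap { case (id, event) => lookup.get(id).map(name => (name, event)) }
    }
    joined.collect().foreach(println)
    sc.stop()
  }
}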
Then the packaged taskBinary, together with a set of parameters, is used to create the ShuffleMapTasks or ResultTasks (covered in the next chapter):
val tasks: Seq[Task[_]] = try {
// 这里也会把task的指标检测对象taskMetrics封装成序列化闭包
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
// 当匹配到生成的是ShuffleMapStage
case stage: ShuffleMapStage =>
// 首先保证pendingPartitions为空
// pendingPartitions中放的是还没完成的partition,还没完成的task
// 如果完成了就会从中清除
// DAGScheduler会用它来确定此stage是否已完成
stage.pendingPartitions.clear()
// 开始遍历操作每个需要计算的分区
partitionsToCompute.map { id =>
// 拿到分区地址
val locs = taskIdToLocations(id)
// 拿到此stage对应的rdd的分区
val part = stage.rdd.partitions(id)
// 加入运行状态
stage.pendingPartitions += id
// 开始构建ShuffleMapTask对象,之后会通过这个对象调用runTask,具体详情会在下个章节
// 补充:Task分为两种:一种是ShuffleMapTask,一种是ResultTask
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId)
}
// 当匹配到ResultStage时生成的是ResultTask
case stage: ResultStage =>
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val p: Int = stage.partitions(id)
val part = stage.rdd.partitions(p)
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
}
}
Finally the tasks are wrapped into a TaskSet and handed to the taskScheduler, which submits them to the executors (next chapter):
if (tasks.size > 0) {
logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
// 开始提交task
// 这里调用的是 实现taskScheduler特质的TaskSchedulerImpl
// 它会提交被taskSet封装的tasks
// 具体详细放在下个章节
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
} else {
// Because we posted SparkListenerStageSubmitted earlier, we should mark
// the stage as completed here in case there are no tasks to run
// 由于某些原因可能没有拿到任何task,但之前已经post过SparkListenerStageSubmitted
// 所以这里必须把这个stage标记为已完成(markStageAsFinished)
markStageAsFinished(stage, None)
When this stage finishes, any child stages still waiting to be submitted are submitted in turn:
// 父Stage完成后,继续依次提交子Stage
submitWaitingChildStages(stage)
}
}
/**
* Check for waiting stages which are now eligible for resubmission.
* Submits stages that depend on the given parent stage. Called when the parent stage completes
* successfully.
*/
private def submitWaitingChildStages(parent: Stage) {
logTrace(s"Checking if any dependencies of $parent are now runnable")
logTrace("running: " + runningStages)
logTrace("waiting: " + waitingStages)
logTrace("failed: " + failedStages)
// 从waitingStages中筛选出以刚完成的stage作为父stage的子stage
// 数据结构:HashSet,用来存放等待提交的stage
// 这个会在之前调用submitStage的时候把需要提交的stage加入进去
val childStages = waitingStages.filter(_.parents.contains(parent)).toArray
waitingStages --= childStages
for (stage <- childStages.sortBy(_.firstJobId)) {
// 拿到最前面的stage,再次提交
submitStage(stage)
}
}