The previous articles in this series walked through the initialization of SparkContext: starting the Driver, registering with the Master, launching Executors on the Workers, and the Workers registering back with the Master. With all of that preparation in place, the Application can finally run. The first step is submitting a job to the DAGScheduler, which is responsible for splitting the job into stages and computing the preferred location of each task, among other things. This article therefore digs into two parts of the DAGScheduler source code:
- the stage splitting algorithm
- the algorithm that computes the preferred location of each task's partition
Let's use a word count program as the running example, starting with its first line:
val rdd = sc.textFile("test.txt")
When this line runs, textFile first creates a HadoopRDD and then calls map on it to produce a MapPartitionsRDD; both HadoopRDD and MapPartitionsRDD extend RDD. The source looks like this:
/**
 * First, hadoopFile() creates a HadoopRDD whose elements are (key, value) pairs:
 * the key is the byte offset of each line in the HDFS text file, and the value is
 * the line itself. Calling map() on that HadoopRDD drops the key and keeps only
 * the value, yielding a MapPartitionsRDD whose elements are the text lines.
 * @param path
 * @param minPartitions
 * @return
 */
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)  // keep the value, drop the key
}
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
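Note that nothing has actually executed yet: textFile and map only record lineage. A quick way to inspect the DAG built so far is RDD.toDebugString, the same string that SparkContext.runJob logs later when spark.logLineage is enabled (the printed output below is illustrative):
println(rdd.toDebugString)
// e.g. (2) test.txt MapPartitionsRDD[1] at textFile at <console>:21 []
//       |  test.txt HadoopRDD[0] at textFile at <console>:21 []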
After textFile has produced the RDD, the next statement runs:
val linesRDD = rdd.flatMap(_.split(" "))
flatMap again creates a MapPartitionsRDD; inside it, every line is iterated over and split into individual words. The source:
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
The iter.flatMap it delegates to is the Scala collection library's Iterator.flatMap:
def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] {
  private var cur: Iterator[B] = empty
  def hasNext: Boolean =
    cur.hasNext || self.hasNext && { cur = f(self.next).toIterator; hasNext }
  def next(): B = (if (hasNext) cur else empty).next()
}
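A plain-Scala illustration of what this iterator does: each input element is mapped to a collection, and the per-element results are chained lazily into one flat iterator.
val lines = Iterator("hello spark", "hello scala")
val tokens = lines.flatMap(_.split(" ").iterator)   // nothing is computed until the iterator is consumed
println(tokens.toList)                               // List(hello, spark, hello, scala)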
The map call comes next:
val words = linesRDD.map(x => (x, 1))
Internally this is the same map shown earlier, again creating a MapPartitionsRDD:
def map[U: ClassTag](f: T => U): RDD[U] = {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
val wordCount = words.reduceByKey(_ + _)
Then comes reduceByKey. You may be surprised to find that RDD itself has no such method: reduceByKey lives in PairRDDFunctions. An implicit conversion wraps the RDD[(K, V)] in a PairRDDFunctions, and the method is then called on that wrapper. The implicit conversion looks like this:
// RDD has no reduceByKey operator, so calling it triggers an implicit conversion that
// wraps the RDD in a PairRDDFunctions; reduceByKey is then resolved on that class.
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}
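As a side note, from Spark 1.3 onwards this implicit lives in the RDD companion object and is picked up automatically; on earlier versions the equivalent conversion sits in the SparkContext object and requires import org.apache.spark.SparkContext._. What the compiler inserts can also be spelled out by hand (a desugared sketch, not something you would normally write):
import org.apache.spark.rdd.{PairRDDFunctions, RDD}
// Desugared form of words.reduceByKey(_ + _): the RDD[(String, Int)] is wrapped
// explicitly in a PairRDDFunctions and reduceByKey is called on the wrapper.
val wordCountExplicit: RDD[(String, Int)] =
  new PairRDDFunctions(words).reduceByKey(_ + _)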
The reduceByKey source in PairRDDFunctions:
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
  reduceByKey(defaultPartitioner(self), func)
}
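This one-argument overload simply picks a partitioner via defaultPartitioner; the other public reduceByKey overloads let the caller control the partitioning explicitly (illustrative calls):
import org.apache.spark.HashPartitioner
words.reduceByKey(_ + _, 4)                        // shorthand for a HashPartitioner with 4 partitions
words.reduceByKey(new HashPartitioner(4), _ + _)   // fully explicit partitioner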
/**
 * Choose a partitioner for the shuffle output; the number of partitions has to be
 * known before the job is actually submitted. If one of the input RDDs already has
 * a partitioner, the one with the most partitions is reused; otherwise a default
 * HashPartitioner is created.
 * @param rdd
 * @param others
 * @return
 */
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
  for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
    return r.partitioner.get
  }
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.size)
  }
}
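For reference, the partitioning rule HashPartitioner ends up applying is simple: a key goes to key.hashCode modulo the number of partitions, kept non-negative, with null keys sent to partition 0. A minimal sketch of that rule (the real class is org.apache.spark.HashPartitioner):
// Sketch of HashPartitioner.getPartition: equal keys always map to the same partition
// index, which is what makes the shuffle-side aggregation in reduceByKey work.
def hashPartition(key: Any, numPartitions: Int): Int = key match {
  case null => 0
  case _ =>
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod   // Java's % can return a negative value
}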
After reduceByKey, foreach is used to print the result:
wordCount.foreach(println)
foreach is an action, so this is the moment the job really starts executing; under the hood it calls SparkContext.runJob:
/**
 * An action ultimately goes through the underlying DAGScheduler to trigger a job.
 * @param f
 */
def foreach(f: T => Unit) {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
This call works its way down through several overloaded runJob methods until it reaches the overload below, which hands the job off to DAGScheduler.runJob:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (stopped) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // call runJob on the DAGScheduler that was created during SparkContext initialization
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
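All actions eventually funnel into an overload like this one. For illustration, runJob can also be called directly from user code; the simplest public overload runs a function over every partition and returns one result per partition:
// Hypothetical direct use of runJob: count the records in each partition of wordCount.
val recordsPerPartition: Array[Int] =
  sc.runJob(wordCount, (iter: Iterator[(String, Int)]) => iter.size)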
DAGScheduler.runJob in turn calls submitJob:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // hand the job over to submitJob
  val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
  waiter.awaitResult() match {
    case JobSucceeded => {
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    }
    case JobFailed(exception: Exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      throw exception
  }
}
/**
 * Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
 * can be used to block until the job finishes executing or can be used to cancel the job.
 */
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
  }
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties))
  waiter
}
An important component shows up here: eventProcessLoop. submitJob posts a JobSubmitted event to it, and its onReceive loop pattern-matches on that event and dispatches to the matching handler:
private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {

  /**
   * The main event loop of the DAG scheduler.
   */
  override def onReceive(event: DAGSchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>
      // dispatch to dagScheduler.handleJobSubmitted
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,
        listener, properties)
  ...
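The pattern behind this class is a single-threaded event loop. A simplified sketch of what DAGSchedulerEventProcessLoop builds on (the real base class is org.apache.spark.util.EventLoop): callers post events onto a blocking queue, and one daemon thread drains the queue and hands each event to onReceive, so all of DAGScheduler's internal state is mutated from a single thread and the handlers need no extra locking.
import java.util.concurrent.LinkedBlockingDeque

abstract class EventLoopSketch[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!Thread.currentThread().isInterrupted) {
        onReceive(eventQueue.take())   // blocks until an event is available
      }
    }
  }
  def start(): Unit = eventThread.start()
  def post(event: E): Unit = eventQueue.put(event)
  protected def onReceive(event: E): Unit
}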
handleJobSubmitted is the core entry point of DAGScheduler's job scheduling, and it brings us to the first topic of this article: the stage splitting algorithm. Before reading the code, here is the idea in a nutshell. A finalStage is created from the job's last RDD, and the scheduler then works backwards through the RDD lineage (using an explicit stack rather than deep recursion). Across a narrow dependency the parent RDD stays inside the current stage, so the traversal simply keeps walking; a wide (shuffle) dependency marks a stage boundary, so a new parent stage is created, and a stage whose parents are not yet finished is parked in the waitingStages set. The recursion bottoms out at the earliest stages, which have no missing parents and are submitted first; the waiting stages follow as their parents complete. The source of handleJobSubmitted:
/**
 * The core entry point of DAGScheduler's job scheduling.
 */
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    allowLocal: Boolean,
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  // create a finalStage from the last RDD of the job that triggered this action
  var finalStage: Stage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  if (finalStage != null) {
    // create an ActiveJob from finalStage; in other words, the job's last stage is finalStage
    val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
      job.jobId, callSite.shortForm, partitions.length, allowLocal))
    logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))
    val shouldRunLocally =
      localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
    val jobSubmissionTime = clock.getTimeMillis()
    if (shouldRunLocally) {
      // Compute very short actions like first() or take() with no parent stages locally.
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, Seq.empty, properties))
      runLocally(job)
    } else {
      // register the job in the in-memory bookkeeping structures
      jobIdToActiveJob(jobId) = job
      activeJobs += job
      finalStage.resultOfJob = Some(job)
      val stageIds = jobIdToStageIds(jobId).toArray
      val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
      // submit the final stage (and, recursively, its missing parents)
      submitStage(finalStage)
    }
  }
  // submit any waiting stages whose parents have now completed
  submitWaitingStages()
}
submitStage is the heart of the whole algorithm:
/**
 * Submits a stage, but first recursively submits any missing parent stages,
 * until a stage with no parents is reached.
 */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // parent stages are created at the RDDs' wide (shuffle) dependencies
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      // recurse until a stage with no missing parents is found; that stage is submitted
      // first, and every stage along the way is parked in waitingStages
      if (missing == Nil) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // submit this stage's tasks
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          // recursive call
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id)
  }
}
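The stage boundaries themselves come from getMissingParentStages, which submitStage calls above. A condensed sketch of that method (the full version also skips RDDs whose partitions are fully cached) makes the wide-vs-narrow rule explicit: the lineage of the stage's last RDD is walked with an explicit Stack instead of recursion (to avoid StackOverflowError on long lineages); a ShuffleDependency yields a parent ShuffleMapStage, while a NarrowDependency keeps the parent RDD inside the current stage.
import scala.collection.mutable.{HashSet, Stack}

private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new Stack[RDD[_]]
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    val rdd = waitingForVisit.pop()
    if (!visited(rdd)) {
      visited += rdd
      rdd.dependencies.foreach {
        case shufDep: ShuffleDependency[_, _, _] =>
          // wide (shuffle) dependency: the parent RDD lives in a separate ShuffleMapStage;
          // if that stage's output is not yet available, it is a missing parent
          val mapStage = getShuffleMapStage(shufDep, stage.jobId)
          if (!mapStage.isAvailable) {
            missing += mapStage
          }
        case narrowDep: NarrowDependency[_] =>
          // narrow dependency: the parent RDD stays in the current stage, keep walking
          waitingForVisit.push(narrowDep.rdd)
      }
    }
  }
  missing.toList
}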
When the recursion reaches a stage with no missing parents, submitMissingTasks creates one task per partition of that stage and submits them. While doing so it computes the preferred location of each task's partition. The idea of that algorithm: starting from the stage's last RDD, look for a partition that has been cached or checkpointed (or, for an input RDD such as HadoopRDD, that has its own placement preferences); if one is found, that location becomes the task's preferred location, because the task then does not need to recompute the upstream RDDs. The source:
/**
 * Submit a stage by creating a batch of tasks for it; the number of tasks equals
 * the number of partitions to compute.
 * @param stage
 * @param jobId
 */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  stage.pendingTasks.clear()
  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = {
    if (stage.isShuffleMap) {
      (0 until stage.numPartitions).filter(id => stage.outputLocs(id) == Nil)
    } else {
      val job = stage.resultOfJob.get
      (0 until job.numPartitions).filter(id => !job.finished(id))
    }
  }
  val properties = if (jobIdToActiveJob.contains(jobId)) {
    jobIdToActiveJob(stage.jobId).properties
  } else {
    // this stage will be assigned to "default" pool
    null
  }
  runningStages += stage
  stage.latestInfo = StageInfo.fromStage(stage, Some(partitionsToCompute.size))
  outputCommitCoordinator.stageStart(stage.id)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] =
      if (stage.isShuffleMap) {
        closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
      } else {
        closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
      }
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString)
      runningStages -= stage
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
      runningStages -= stage
      return
  }
  // create the required number of tasks for this stage
  val tasks: Seq[Task[_]] = try {
    if (stage.isShuffleMap) {
      partitionsToCompute.map { id =>
        // one task per partition, with the preferred locations computed for each task
        val locs = getPreferredLocs(stage.rdd, id)
        val part = stage.rdd.partitions(id)
        // every stage except the final stage has isShuffleMap == true,
        // so a ShuffleMapTask is created for it
        new ShuffleMapTask(stage.id, taskBinary, part, locs)
      }
    } else {
      // if isShuffleMap is false, this is the final stage,
      // and the final stage creates ResultTasks
      val job = stage.resultOfJob.get
      partitionsToCompute.map { id =>
        val p: Int = job.partitions(id)
        val part = stage.rdd.partitions(p)
        val locs = getPreferredLocs(stage.rdd, p)
        new ResultTask(stage.id, taskBinary, part, locs, id)
      }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}")
      runningStages -= stage
      return
  }
  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingTasks ++= tasks
    logDebug("New pending tasks: " + stage.pendingTasks)
    // wrap the tasks in a TaskSet and hand them to the TaskScheduler
    taskScheduler.submitTasks(
      new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    markStageAsFinished(stage, None)
    logDebug("Stage " + stage + " is actually done; %b %d %d".format(
      stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
  }
}
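Applied to the word count job (purely for illustration, assuming test.txt spans two HDFS blocks): reduceByKey introduces one ShuffleDependency, so the job is split into two stages. Stage 0 is a shuffle map stage covering textFile → flatMap → map and gets two ShuffleMapTasks, one per HadoopRDD partition; stage 1 is the final stage covering the shuffled reduceByKey output plus foreach and gets one ResultTask per partition of the HashPartitioner chosen earlier by defaultPartitioner.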
The method below is the core of the preferred-location algorithm:
/**
 * Compute the preferred locations of one task's partition. Starting from the stage's
 * last RDD, check whether the partition has been cached or has its own placement
 * preferences (e.g. it was checkpointed, or it is an input RDD); if so, those
 * locations become the task's preferred locations, because the task can then avoid
 * recomputing the upstream RDDs.
 */
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_],Int)])
  : Seq[TaskLocation] =
{
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration. SPARK-695
  if (!visited.add((rdd,partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // If the partition is cached, return the cache locations
  val cached = getCacheLocs(rdd)(partition)
  if (!cached.isEmpty) {
    return cached
  }
  // If the RDD has some placement preferences (as is the case for input RDDs), get those;
  // this also covers checkpointed RDDs
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (!rddPrefs.isEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }
  // If the RDD has narrow dependencies, pick the first partition of the first narrow dep
  // that has any placement preferences. Ideally we would choose based on transfer sizes,
  // but this will do for now.
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      // recurse into the parent RDDs' partitions to see whether any of them is cached
      // or has placement preferences
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case _ =>
  }
  // if, from the last RDD of the stage back to the first, no partition is cached or has
  // placement preferences, the task's preferred locations are Nil
  Nil
}
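To make the effect concrete, here is a hypothetical session (the HDFS path is illustrative): persisting an intermediate RDD changes where its tasks prefer to run, because getCacheLocs is consulted before the RDD's own preferredLocations.
val pairs = sc.textFile("hdfs:///test.txt")   // preferred locations = hosts of the HDFS blocks
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .cache()

pairs.reduceByKey(_ + _).count()     // job 1: map-side tasks run on the HDFS block hosts
pairs.reduceByKey(_ + _).collect()   // job 2: the partitions of pairs are now cached, so the
                                     //        map-side tasks prefer the executors holding them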
With the preferred locations computed, TaskScheduler.submitTasks is called with a TaskSet to actually launch the tasks. Inside the TaskScheduler the tasks still have to be assigned to concrete executors; that allocation algorithm is the subject of a later article. This article has dug into the stage splitting algorithm and the preferred-location algorithm for tasks; questions and corrections are very welcome in the comments!