1. Overview
1) Dividing a job into stages, the last stage is called the ResultStage and the stages before it are ShuffleMapStages; each ShuffleMapStage ends with a disk-write operation.
2) A Spark shuffle is divided into a map phase and a reduce phase, but unlike MapReduce there are no separate map and reduce programs.
3) A shuffle always involves reading and writing disk; once the ResultStage finishes, the job's run is complete.
4) The number of shuffle tasks equals the number of partitions.
2. Tracing the Shuffle Process Through the Source Code
After an Executor receives a Task sent over by the Driver, it directly calls the Task's run method, i.e. it starts executing the Task:
override def run(): Unit = {
  val res = task.run(
    taskAttemptId = taskId,
    attemptNumber = attemptNumber,
    metricsSystem = env.metricsSystem)
  threwException = false
  res
}
So what we need to look at is the Task sent over from the Driver, starting from an action operator.
——>SparkContext
—>runJob(): calls runJob() on the DAGScheduler (the directed-acyclic-graph scheduler)
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  val cleanedFunc = clean(func)  // closure cleaning, elided in the original excerpt
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
——>DAGScheduler
—>runJob(): calls submitJob()
—>submitJob(): posts a JobSubmitted event onto a BlockingQueue. In EventLoop's run() method, val event = eventQueue.take() takes it back out and invokes the subclass's onReceive(event). This is the template method design pattern: the parent class provides the skeleton, the concrete handling is implemented by the subclass, and the call is made from the parent.
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
—>doOnReceive(): the event matches the JobSubmitted case, so the handleJobSubmitted() method is executed
override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  ...
}
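The EventLoop mechanism just described can be sketched as a small template-method example: the parent class owns the blocking queue and the take() loop, while the subclass (playing the role of DAGSchedulerEventProcessLoop) supplies onReceive. The sketch below is in Java because the queue in question is literally java.util.concurrent's BlockingQueue; all other class and method names are made up for illustration, not Spark's.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Parent class: owns the queue and the run loop (the "template").
abstract class EventLoop<E> {
    private final BlockingQueue<E> eventQueue = new LinkedBlockingQueue<>();
    private volatile boolean stopped = false;

    private final Thread eventThread = new Thread(() -> {
        while (!stopped) {
            try {
                E event = eventQueue.take(); // blocks until an event is posted
                onReceive(event);            // hook implemented by the subclass
            } catch (InterruptedException ie) {
                return;                      // stop() interrupts the thread
            }
        }
    });

    void start() { eventThread.start(); }

    void stop() throws InterruptedException {
        stopped = true;
        eventThread.interrupt();
        eventThread.join();
    }

    void post(E event) { eventQueue.offer(event); }

    // The template-method hook: the parent calls it, the subclass defines it.
    protected abstract void onReceive(E event);
}

// Plays the role of DAGSchedulerEventProcessLoop.
public class EventLoopSketch extends EventLoop<String> {
    static final List<String> handled = new CopyOnWriteArrayList<>();

    @Override
    protected void onReceive(String event) {
        if (event.startsWith("JobSubmitted")) {
            // Stands in for dagScheduler.handleJobSubmitted(...)
            handled.add("handleJobSubmitted:" + event);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        EventLoopSketch loop = new EventLoopSketch();
        loop.start();
        loop.post("JobSubmitted(0)");
        long deadline = System.currentTimeMillis() + 2000;
        while (handled.isEmpty() && System.currentTimeMillis() < deadline) {
            Thread.sleep(10); // wait for the event thread to drain the queue
        }
        loop.stop();
        System.out.println(handled);
    }
}
```

The point of the pattern is decoupling: submitJob only needs post(), and the single event thread serializes all scheduler state changes without explicit locks.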
—>handleJobSubmitted(): calls submitStage()
—>submitStage(): walks the stages (recursing into any missing parent stages) and ultimately submits each stage's tasks
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      if (missing.isEmpty) {
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
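The recursion above can be illustrated with a toy model: for a chain of stages, only the earliest stage with no missing parents is submitted immediately, while its descendants are parked in waitingStages until their parents' shuffle output exists. This is a simplified sketch (hypothetical Stage class and helpers, not Spark's; the real scheduler re-submits waiting stages as parents complete):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for Spark's Stage: just an id and its parent stages.
class Stage {
    final int id;
    final List<Stage> parents;
    Stage(int id, List<Stage> parents) { this.id = id; this.parents = parents; }
}

public class SubmitStageSketch {
    static final List<Integer> submitted = new ArrayList<>();   // stages whose tasks ran
    static final Set<Integer> waiting = new LinkedHashSet<>();  // stands in for waitingStages

    // Mirrors submitStage: parents with missing results are submitted first.
    static void submitStage(Stage stage) {
        List<Stage> missing = missingParents(stage);
        if (missing.isEmpty()) {
            submitMissingTasks(stage);   // leaf of the recursion: run this stage now
        } else {
            for (Stage parent : missing) {
                submitStage(parent);     // recurse into parents first
            }
            waiting.add(stage.id);       // child waits until its parents finish
        }
    }

    // A parent is "missing" if its output has not been produced yet.
    static List<Stage> missingParents(Stage stage) {
        List<Stage> missing = new ArrayList<>();
        for (Stage p : stage.parents) {
            if (!submitted.contains(p.id)) missing.add(p);
        }
        return missing;
    }

    static void submitMissingTasks(Stage stage) { submitted.add(stage.id); }

    public static void main(String[] args) {
        // A chain: stage0 -> stage1 -> stage2 (the ResultStage)
        Stage s0 = new Stage(0, List.of());
        Stage s1 = new Stage(1, List.of(s0));
        Stage s2 = new Stage(2, List.of(s1));
        submitStage(s2);
        System.out.println(submitted + " waiting=" + waiting); // prints: [0] waiting=[1, 2]
    }
}
```

Only stage 0 runs right away; stages 1 and 2 wait, which is exactly why each ShuffleMapStage must write its results to disk before the next stage starts.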
—>submitMissingTasks(): in effect this loops over each partition to be computed, builds a ShuffleMapTask or a ResultTask for it, and then sends the tasks to Executors for execution.
val tasks: Seq[Task[_]] = try {
  stage match {
    case stage: ShuffleMapStage =>
      partitionsToCompute.map { id =>
        ...
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId)
      }
    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        ...
        new ResultTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
      }
  }
} catch {
  ...
}

if (tasks.size > 0) {
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
}
——>ShuffleMapTask: this is the write side; before the write there is a read (rdd.iterator), i.e. reading the result files produced by the previous stage.
override def runTask(context: TaskContext): MapStatus = {
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {
    ...
  }
}
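Conceptually, writer.write consumes the records from rdd.iterator and routes each one to a reduce partition; by default the partition is chosen from the key's hash, using the same non-negative-modulo rule as Spark's HashPartitioner. The sketch below (in Java, with illustrative names) only groups records into in-memory buckets; the real sort-shuffle writer additionally sorts, spills, and serializes them to disk:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleWriteSketch {
    // Same rule as Spark's HashPartitioner: non-negative hash modulo partition count.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Consume the map-side records and bucket them by reduce partition,
    // which is what a shuffle writer does before producing its output file.
    static Map<Integer, List<String>> write(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String key : keys) {
            buckets.computeIfAbsent(partitionFor(key, numPartitions),
                                    p -> new ArrayList<>()).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Both "a" records land in the same bucket, so one reducer sees all of "a".
        System.out.println(write(List.of("a", "b", "a", "c"), 2));
    }
}
```

This routing is also why the number of shuffle (reduce-side) tasks equals the number of partitions: each bucket is fetched by exactly one downstream task.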
—>ResultTask: here rdd.iterator(partition, context) is what performs the read from disk, and func applies the action's function to the result.
override def runTask(context: TaskContext): U = {
  func(context, rdd.iterator(partition, context))
}
3. Notes on the Spill-to-Disk Logic
Spark currently uses SortShuffleManager: instead of one output file per reduce task, each map task produces only two files, an index file and a data file.
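The two-file layout can be sketched as follows: the data file holds every reduce partition's records back to back, sorted by partition id, and the index file stores cumulative byte offsets, so the reducer for partition i fetches only the byte range [offset(i), offset(i+1)). The Java sketch below uses in-memory byte arrays in place of real files, and its class and field names are mine, not Spark's:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class SortShuffleFileSketch {
    // The "data file": every partition's bytes back to back, sorted by partition id.
    final byte[] dataFile;
    // The "index file": numPartitions + 1 cumulative byte offsets into the data file.
    final long[] indexFile;

    SortShuffleFileSketch(List<byte[]> partitionSegments) {
        indexFile = new long[partitionSegments.size() + 1];
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        long offset = 0;
        for (int p = 0; p < partitionSegments.size(); p++) {
            indexFile[p] = offset;
            data.writeBytes(partitionSegments.get(p));
            offset += partitionSegments.get(p).length;
        }
        indexFile[partitionSegments.size()] = offset;
        dataFile = data.toByteArray();
    }

    // A reducer fetching partition p reads only the slice [indexFile[p], indexFile[p + 1]).
    byte[] readPartition(int p) {
        return Arrays.copyOfRange(dataFile, (int) indexFile[p], (int) indexFile[p + 1]);
    }

    public static void main(String[] args) {
        SortShuffleFileSketch mapOutput = new SortShuffleFileSketch(List.of(
                "p0-records".getBytes(StandardCharsets.UTF_8),
                "p1".getBytes(StandardCharsets.UTF_8),
                "partition2-data".getBytes(StandardCharsets.UTF_8)));
        // prints "p1": only partition 1's slice of the single data file is read
        System.out.println(new String(mapOutput.readPartition(1), StandardCharsets.UTF_8));
    }
}
```

Keeping two files per map task, rather than one file per (map task, reduce task) pair, is what keeps the number of shuffle files from exploding as partition counts grow.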