WordCount
WordCount is the most basic Spark program: it counts how many times each word appears in a file. The code is short, as shown below.
package swjtu.cn.mi

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. env: prepare the SparkContext, Spark's execution environment
    val conf: SparkConf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // 2. source: read the data
    // RDD: A Resilient Distributed Dataset -- think of it as a distributed
    // collection that is as easy to use as an ordinary local collection.
    // Here the RDD holds the lines of the file.
    val lines: RDD[String] = sc.textFile("data/input/words.txt")

    // 3. transformation: process the data
    // Split: an RDD of individual words
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // Mark each word with 1: an RDD of (word, 1) pairs
    val wordAndOnes: RDD[(String, Int)] = words.map((_, 1))
    // Group + aggregate: groupBy + mapValues(_.map(_._2).reduce(_+_)) collapses
    // into a single step in Spark: reduceByKey
    val result: RDD[(String, Int)] = wordAndOnes.reduceByKey(_ + _)

    // 4. sink: output
    // Print directly
    result.foreach(println)
    // Or collect into a local collection first, then print
    //println(result.collect().toBuffer)
    // Or write to a path (file or directory)
    //result.repartition(1).saveAsTextFile("data/output/result")
    // Sleep for a while so the Web UI can be inspected
    //Thread.sleep(1000 * 6000)

    // 5. release resources
    sc.stop()
  }
}
Theoretical Analysis
The RDD chain here is textFile -> flatMap -> map -> reduceByKey -> foreach.
Spark has two kinds of operations, actions and transformations. An action triggers job submission; a transformation does not, it only chains one RDD to the next. Transformations also differ in the kind of dependency they create between the two ends of the RDD chain: Spark distinguishes narrow dependencies from wide dependencies.
In a narrow dependency the partition-to-partition mapping is fixed, with each child partition reading from a bounded set of parent partitions. A wide dependency involves a shuffle, which redistributes data across partitions, so stages are normally split at wide dependencies. In the chain above, foreach is the action and reduceByKey introduces the only wide dependency, so this application has exactly 1 job and 2 stages, and tasks are executed within each stage.
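You can confirm the split from the lineage itself: toDebugString prints the dependency chain, and each indentation step marks a shuffle boundary. A quick check (output sketched from a typical local run; exact RDD ids will vary):

// Print the lineage of the final RDD; the indented block below the
// ShuffledRDD is the parent stage. Output looks roughly like:
//   (2) ShuffledRDD[4] at reduceByKey ...
//    +-(2) MapPartitionsRDD[3] at map ...
//       |  MapPartitionsRDD[2] at flatMap ...
//       |  data/input/words.txt MapPartitionsRDD[1] at textFile ...
println(result.toDebugString)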
Source Code Analysis
Let's start the analysis from the RDD chain.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
The textFile operator returns an RDD, and the RDD chain starts here. It calls a few further functions such as hadoopFile, but none of them submits a job, so we can skip these intermediate steps and jump straight to the action, foreach.
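This laziness is easy to observe: building the chain never touches the input, so even an invalid path only fails once an action runs. A small sketch (hypothetical path):

// No job is submitted here; each call merely wraps the parent RDD.
val doomed = sc.textFile("data/no/such/file").flatMap(_.split(" "))
// Only an action forces evaluation -- the missing path blows up here,
// not at the textFile call above:
// doomed.count()   // throws InvalidInputException at action time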
Submitting the Job
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
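Other actions funnel into the same entry point; for comparison, collect is implemented along these lines (paraphrased from the RDD source):

// collect() also ends in sc.runJob; it then concatenates the
// per-partition arrays on the driver.
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}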
Following runJob down the call chain, we arrive at the following code:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
Inside DAGScheduler a chain of method calls follows. First, runJob calls submitJob to continue the submission; the caller then blocks until the job finishes or fails. Concretely, submitJob creates a JobWaiter and posts a message to DAGScheduler's inner class DAGSchedulerEventProcessLoop (which extends EventLoop). When the loop's onReceive method matches the JobSubmitted case class, it calls DAGScheduler's handleJobSubmitted method to actually submit the job, and it is that method which divides the job into stages.
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
  }
  val jobId = nextJobId.getAndIncrement()
  // If the job contains zero tasks, create and immediately return a zero-task JobWaiter
  if (partitions.isEmpty) {
    val clonedProperties = Utils.cloneProperties(properties)
    if (sc.getLocalProperty(SparkContext.SPARK_JOB_DESCRIPTION) == null) {
      clonedProperties.setProperty(SparkContext.SPARK_JOB_DESCRIPTION, callSite.shortForm)
    }
    val time = clock.getTimeMillis()
    listenerBus.post(
      SparkListenerJobStart(jobId, time, Seq.empty, clonedProperties))
    listenerBus.post(
      SparkListenerJobEnd(jobId, time, JobSucceeded))
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.nonEmpty)
  // Create a JobWaiter that waits for the job to finish, and submit the job
  // through the internal event loop
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter[U](this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    Utils.cloneProperties(properties)))
  waiter
}
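The eventProcessLoop.post plus JobWaiter handoff is a textbook single-threaded event loop with a blocking waiter. A minimal, self-contained sketch of the pattern (simplified stand-ins, not Spark's actual classes):

import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// Simplified stand-ins for DAGSchedulerEventProcessLoop and JobWaiter.
sealed trait Event
case class JobSubmitted(jobId: Int, done: Promise[Unit]) extends Event

class EventLoop {
  private val queue = new LinkedBlockingQueue[Event]()
  private val thread = new Thread(() => while (true) onReceive(queue.take()))

  def start(): Unit = { thread.setDaemon(true); thread.start() }
  def post(e: Event): Unit = queue.put(e) // callers never block here

  private def onReceive(e: Event): Unit = e match {
    case JobSubmitted(id, done) =>
      println(s"handleJobSubmitted($id): divide stages, submit tasks ...")
      done.success(()) // signal the waiting caller, like JobWaiter does
  }
}

object EventLoopDemo extends App {
  val loop = new EventLoop
  loop.start()
  val waiter = Promise[Unit]() // plays the role of JobWaiter
  loop.post(JobSubmitted(jobId = 0, done = waiter))
  Await.result(waiter.future, Duration.Inf) // runJob blocks here until completion
  println("job finished")
}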
Dividing Stages
After the job is submitted, it is divided into stages. In this example there are two: stage 0 covers reading the file, flatMap, and map; the shuffle introduced by reduceByKey starts a second stage that produces the final result.
Stage scheduling is implemented by DAGScheduler. Starting from the last RDD, it walks the whole dependency tree breadth-first and divides it into stages; the splitting criterion is whether a ShuffleDependency (wide dependency) occurs. Whenever an RDD involves a shuffle, the lineage is cut at that shuffle into two stages.
In code, this begins in DAGScheduler's handleJobSubmitted method, which builds a ResultStage from the last RDD. From there, getOrCreateParentStages checks whether any ancestor RDD involves a shuffle operation. If none does, the job consists of a single ResultStage with no parent stages; if shuffles exist, the job has one ResultStage plus at least one ShuffleMapStage. The handleJobSubmitted code is as follows:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // Walk back from the last RDD to obtain the final stage
    finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch { ... }
  // Build the job from the final stage
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Submit for execution
  submitStage(finalStage)
  submitWaitingStages()
}
/**
 * Create a ResultStage associated with the provided jobId.
 */
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val (shuffleDeps, resourceProfiles) = getShuffleDependenciesAndResourceProfiles(rdd)
  val resourceProfile = mergeResourceProfilesForStage(resourceProfiles)
  checkBarrierStageWithDynamicAllocation(rdd)
  checkBarrierStageWithNumSlots(rdd, resourceProfile)
  checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
  // Create all parent stages first; parent stages are always ShuffleMapStages
  val parents = getOrCreateParentStages(shuffleDeps, jobId)
  ...
  // Once the parent stages exist, create the final stage
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId,
    callSite, resourceProfile.id)
  ...
  stage
}
Stepping into getShuffleDependenciesAndResourceProfiles(rdd), which collects the dependencies of the finalStage:
private[scheduler] def getShuffleDependenciesAndResourceProfiles(
    rdd: RDD[_]): (HashSet[ShuffleDependency[_, _, _]], HashSet[ResourceProfile]) = {
  // Shuffle dependencies between the current RDD and its parent RDDs
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val resourceProfiles = new HashSet[ResourceProfile]
  // Tracks which RDDs have already been visited
  val visited = new HashSet[RDD[_]]
  // RDDs still to visit; every RDD is added here before being traversed
  val waitingForVisit = new ListBuffer[RDD[_]]
  // Seed the queue with the given RDD
  waitingForVisit += rdd
  // A non-empty queue means the current stage still has unvisited RDDs.
  // As analyzed earlier, the first RDD visited is the final RDD, i.e. the
  // one on which the action was called.
  while (waitingForVisit.nonEmpty) {
    // Dequeue the next RDD
    val toVisit = waitingForVisit.remove(0)
    if (!visited(toVisit)) {
      // Mark it as visited
      visited += toVisit
      Option(toVisit.getResourceProfile).foreach(resourceProfiles += _)
      toVisit.dependencies.foreach {
        // A wide dependency between this RDD and its parent: record it in
        // parents and do not walk past it -- the stage is cut at the shuffle
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep
        // A narrow dependency: fetch the parent RDD via dependency.rdd and
        // keep traversing within the same stage
        case dependency =>
          waitingForVisit.prepend(dependency.rdd)
      }
    }
  }
  (parents, resourceProfiles)
}
Once all stages have been divided, the dependency relationships between them are in place.
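The traversal is easy to restate outside Spark. Here is a toy version of the same walk over a hand-built lineage, cutting at shuffle boundaries (hypothetical types, not Spark's):

import scala.collection.mutable

// Miniature of the lineage walk: breadth-first over parent RDDs,
// stopping at every shuffle dependency, which marks a stage boundary.
sealed trait Dep { def parent: Node }
case class Narrow(parent: Node) extends Dep
case class Shuffle(parent: Node) extends Dep
case class Node(name: String, deps: List[Dep])

def directShuffleDeps(finalRdd: Node): Set[Node] = {
  val shuffleParents = mutable.Set[Node]()
  val visited = mutable.Set[Node]()
  val waiting = mutable.Queue(finalRdd)
  while (waiting.nonEmpty) {
    val rdd = waiting.dequeue()
    if (visited.add(rdd)) rdd.deps.foreach {
      case Shuffle(p) => shuffleParents += p // cut here: p ends the parent stage
      case Narrow(p)  => waiting.enqueue(p)  // same stage: keep walking
    }
  }
  shuffleParents.toSet
}

// WordCount lineage: textFile -> flatMap -> map -(shuffle)-> reduceByKey
val hadoopRdd = Node("textFile", Nil)
val words     = Node("flatMap", List(Narrow(hadoopRdd)))
val pairs     = Node("map", List(Narrow(words)))
val counts    = Node("reduceByKey", List(Shuffle(pairs)))
println(directShuffleDeps(counts).map(_.name)) // Set(map): one parent ShuffleMapStage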
Submitting Stages
In DAGScheduler's handleJobSubmitted method, generating the finalStage also establishes the dependency relationships among all stages. A job instance is then created from the finalStage, and the stages of that job are submitted for execution in dependency order.
Submission starts in submitStage, which calls getMissingParentStages to find the unfinished parent stages of the given stage. If there are none, the stage is executed via submitMissingTasks; if parents exist, the stage is parked in the waitingStages list and submitStage recurses into each parent. In this way, stages with unfinished parents wait in waitingStages, while stages without parents become the entry points of the job.
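The recursion is compact enough to restate as a toy (hypothetical types; the real DAGScheduler re-submits waiting stages asynchronously when stage-completion events arrive, which the synchronous wake-up below only approximates):

import scala.collection.mutable

// Toy shape of DAGScheduler.submitStage: run a stage only when all of its
// parent stages have finished; otherwise park it in waitingStages.
case class Stage(id: Int, parents: List[Stage])

val finished = mutable.Set[Int]()
val waitingStages = mutable.Set[Stage]()

def submitStage(stage: Stage): Unit = {
  val missing = stage.parents.filterNot(p => finished(p.id)).sortBy(_.id)
  if (missing.isEmpty) {
    println(s"submitMissingTasks(stage ${stage.id})") // entry point: ready to run
    finished += stage.id                              // pretend it ran synchronously
    // Wake any waiting stage whose parents are now all done
    waitingStages.filter(_.parents.forall(p => finished(p.id))).foreach { s =>
      waitingStages -= s
      submitStage(s)
    }
  } else {
    waitingStages += stage       // wait until the parents finish
    missing.foreach(submitStage) // submit the parents first
  }
}

// WordCount: stage 0 (ShuffleMapStage) -> stage 1 (ResultStage)
submitStage(Stage(1, List(Stage(0, Nil))))
// prints: submitMissingTasks(stage 0), then submitMissingTasks(stage 1)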
Submitting Tasks
Find the earliest stage that has not yet completed and submit its tasks. The function called is submitMissingTasks(stage, jobId.get):
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  stage.pendingPartitions.clear()
  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        val job = s.activeJob.get
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
        }
      case stage: ResultStage =>
        val job = stage.activeJob.get
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    // ... omitted
  }
}
Taking WordCount as an example, let's look at the stages.
As noted earlier, WordCount has a single job, and reduceByKey is the shuffle operation that forms the stage boundary. The stage before the boundary is a ShuffleMapStage, and the stage after it is the ResultStage: the earlier stage performs the shuffle's map side, while the later one computes the final result of the whole job, hence the name ResultStage.
A ResultStage applies a function to some of the partitions of an RDD to compute the result of an action. Note that some actions do not run on every partition, for example first().
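For instance, first() starts by running a job on just one partition. The public runJob overload that accepts a partition list makes this directly visible; a sketch against the WordCount RDDs (simplified: the real take()/first() retries more partitions when the first ones are empty):

// Compute only partition 0 of `lines`, the way first()/take(1) begin.
val headOfP0: Array[Array[String]] =
  sc.runJob(lines, (it: Iterator[String]) => it.take(1).toArray, Seq(0))
println(headOfP0.flatten.headOption) // Some(first line) unless partition 0 is empty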
Next, the execution flow of submitMissingTasks:
- First, compute partitionsToCompute, the partitions whose data the stage still has to process.
- Then call outputCommitCoordinator.stageStart. This coordinator arbitrates output commits (e.g. to HDFS); stageStart announces the stage, and its two arguments are the stageId and the maximum partition id the stage will compute.
- Then compute the location of the task corresponding to each TaskId of this stage. Since TaskIds map to partitionIds, this amounts to computing the TaskLocation for each partitionId; a TaskLocation is a host or a (host, executorId) pair.
- stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq) creates a new attempt, which records how many times this stage has been run. Stages can fail, and a failed stage is retried; the attempt number starts at 0.
- Then build the task binary and broadcast it. The broadcast is what executors use to reconstruct tasks. The payload is serialized first, and every task gets its own complete copy of the RDD, which gives tasks stronger isolation; this matters when closures reference non-thread-safe state. For a ShuffleMapTask the serialized data is (rdd, shuffleDep); for a ResultTask it is (rdd, func).
- Then create the tasks. Tasks are either ShuffleMapTasks or ResultTasks, matching the stage type, and creating them uses the stage.latestInfo.attemptId mentioned above.
- Once the tasks are created, taskScheduler.submitTasks() hands them over to the TaskScheduler for scheduling.
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
      ts.taskSet != taskSet && !ts.isZombie
    }
    if (conflictingTaskSet) {
      throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
        s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
    }
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}
The first part of this code creates a TaskSetManager and then checks whether another active TaskSet already exists for the same stage; a conflict raises an exception.
The TaskSetManager is then added to schedulableBuilder. This component is chosen at initialization according to the scheduling policy, such as FIFO or FAIR, and once added, the task set is scheduled under that policy.
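Which policy the schedulableBuilder implements is controlled by the spark.scheduler.mode configuration key (FIFO by default). For example:

import org.apache.spark.{SparkConf, SparkContext}

// Select the task-scheduling policy before creating the SparkContext.
// FIFO is the default; FAIR schedules concurrent jobs in weighted pools.
val conf = new SparkConf()
  .setAppName("wc")
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)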
Next come the checks on isLocal (whether we are in local mode) and hasReceivedTask. If the mode is not local and no task has been received yet, a TimerTask is scheduled that keeps warning until a task has been accepted; otherwise the scheduler would spin idle without anyone noticing that no resources were ever offered.
Finally, the backend revives the offers with backend.reviveOffers(). The backend here is typically CoarseGrainedSchedulerBackend: after reviveOffers, the driverEndpoint sends a message, the backend's receive function picks it up, and the corresponding action runs. Here is CoarseGrainedSchedulerBackend's receive:
override def receive: PartialFunction[Any, Unit] = {
  ...
  case ReviveOffers =>
    makeOffers()
  ...
}

private def makeOffers() {
  // Filter out executors under killing
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  launchTasks(scheduler.resourceOffers(workOffers))
}
The code above filters for the alive executors, then builds a WorkerOffer per executor with the arguments executorId, host, and freeCores.
Executing Tasks
Next comes launchTasks:
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = ser.serialize(task)
    if (serializedTask.limit >= maxRpcMessageSize) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
            "spark.rpc.message.maxSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      logInfo(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
        s"${executorData.executorHost}.")
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
This code serializes each task, looks up the executor assigned via task.executorId, and dispatches it with executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask))).
Note the executorEndpoint here; earlier we saw a driverEndpoint (in backend.reviveOffers). Both are RpcEndpointRefs: an RpcEndpointRef is a remote reference to an RpcEndpoint and is thread-safe.
An RpcEndpoint defines, in RPC (Remote Procedure Call) terms, which method a received message will trigger.
It also has a well-defined lifecycle: constructor -> onStart -> receive -> onStop.
Here "receive" covers both receive and receiveAndReply.
The difference between them:
receive expects no reply, whereas receiveAndReply blocks the calling side until a reply arrives. (Reference: http://www.07net01.com/2016/04/1434116.html)
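A small self-contained sketch of the two paths (hypothetical types mirroring Spark's private RpcEndpoint API, for illustration only):

import scala.concurrent.{Future, Promise}

// Toy stand-ins for RpcEndpoint's two message paths.
trait RpcCallContext { def reply(response: Any): Unit }

class EchoEndpoint {
  // send() path: fire-and-forget, nobody waits for a reply.
  def receive: PartialFunction[Any, Unit] = {
    case msg: String => println(s"got (no reply expected): $msg")
  }
  // ask() path: the caller's Future completes only once reply() is called.
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case msg: String => context.reply(s"echo: $msg")
  }
}

// Minimal "ask": wire a Promise into the context so reply() resolves it.
def ask(endpoint: EchoEndpoint, msg: Any): Future[Any] = {
  val p = Promise[Any]()
  endpoint.receiveAndReply(new RpcCallContext {
    def reply(response: Any): Unit = p.success(response)
  })(msg)
  p.future
}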
A message sent to the driverEndpoint is delivered to CoarseGrainedSchedulerBackend, while a message sent to the executorEndpoint is delivered to CoarseGrainedExecutorBackend. Let's look at the latter's receive code:
override def receive: PartialFunction[Any, Unit] = {
  ...
  case LaunchTask(data) =>
    if (executor == null) {
      exitExecutor(1, "Received LaunchTask command but executor was null")
    } else {
      val taskDesc = ser.deserialize[TaskDescription](data.value)
      logInfo("Got assigned task " + taskDesc.taskId)
      executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
        taskDesc.name, taskDesc.serializedTask)
    }
  ...
}
Here the received data is deserialized first, then executor.launchTask is called:
def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer): Unit = {
  val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
    serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}
A TaskRunner is created and handed to the thread pool; since TaskRunner is a Runnable, the pool will invoke its run method. That method is long, so here is the outline rather than the full listing.
It mainly sets up parameters and properties, deserializes the task, and so on, and finally calls task.runTask. The task may be either a ShuffleMapTask or a ResultTask, so we look at both runTask implementations.
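For orientation, a toy TaskRunner capturing the shape of the real one (hypothetical types; the real version also handles metrics, kill flags, large results via the block manager, and much richer failure reporting):

import java.nio.ByteBuffer

// Toy shape of Executor.TaskRunner.run.
trait Backend { def statusUpdate(taskId: Long, state: String, data: ByteBuffer): Unit }
trait Task[T] { def runTask(): T }

class TaskRunner[T](backend: Backend, taskId: Long, task: Task[T]) extends Runnable {
  override def run(): Unit = {
    backend.statusUpdate(taskId, "RUNNING", ByteBuffer.allocate(0))
    try {
      val value = task.runTask() // the user code executes here
      val bytes = ByteBuffer.wrap(value.toString.getBytes) // stand-in for result serialization
      backend.statusUpdate(taskId, "FINISHED", bytes) // tell the driver the task is done
    } catch {
      case t: Throwable =>
        backend.statusUpdate(taskId, "FAILED", ByteBuffer.allocate(0))
    }
  }
}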
ShuffleMapTask
First, ShuffleMapTask:
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val deserializeStartTime = System.currentTimeMillis()
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}
The first part of the code is just deserialization; the interesting part is in the middle: obtain the shuffleManager, then getWriter. Because a ShuffleMapTask feeds a shuffle, it must perform the shuffle write, streaming its partition's output through the ShuffleWriter.
ResultTask
Now ResultTask's runTask:
override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the func using the broadcast variables.
  val deserializeStartTime = System.currentTimeMillis()
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
  func(context, rdd.iterator(partition, context))
}
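For our WordCount job, the deserialized func is essentially the closure built back in RDD.foreach; spelled out (paraphrased):

import org.apache.spark.TaskContext

// What each ResultTask of the WordCount job effectively executes: iterate
// over its partition of the reduceByKey output and print each pair.
val func: (TaskContext, Iterator[(String, Int)]) => Unit =
  (_, iter) => iter.foreach(println) // prints (word, count) pairs on the executor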
Summary
(1) Create SparkContext, the entry point of a Spark program.
(2) Read the file and load its contents into an RDD.
(3) Distribute the work across the cluster nodes.
(4) Each node processes its share of the data: first split each line into words on spaces, producing the flatMap RDD.
(5) Map each word into a key-value pair, emitting (word, 1).
(6) Sum the pairs locally on each node, producing a partially aggregated word-count MapPartitionsRDD.
(7) Shuffle between nodes and aggregate the word counts across them, producing the final MapPartitionsRDD.
(8) Output the result.
Questions
Q1: The last stage is submitted first, yet it is the last stage to finish executing. Why?
Q2: What does the shuffle do in the map phase and in the reduce phase, respectively?