I can't resist starting with the official class comment. It is really good: you don't even need a blog post, just read this, it's the most original and accurate source.
The Spark version I'm using is 2.3.
/**
* The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
* tasks that can run right away based on the data that's already on the cluster (e.g. map output
* files from previous stages), though it may fail if this data becomes unavailable.
*
* Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with
* "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks
* in each stage, but operations with shuffle dependencies require multiple stages (one to write a
* set of map output files, and another to read those files after a barrier). In the end, every
* stage will have only shuffle dependencies on other stages, and may compute multiple operations
* inside it. The actual pipelining of these operations happens in the RDD.compute() functions of
* various RDDs
*
* In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
* locations to run each task on, based on the current cache status, and passes these to the
* low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
* lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
* not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
* a small number of times before cancelling the whole stage.
*
* When looking through this code, there are several key concepts:
*
* - Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler.
* For example, when the user calls an action, like count(), a job will be submitted through
* submitJob. Each Job may require the execution of multiple stages to build intermediate data.
*
* - Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each
* task computes the same function on partitions of the same RDD. Stages are separated at shuffle
* boundaries, which introduce a barrier (where we must wait for the previous stage to finish to
* fetch outputs). There are two types of stages: [[ResultStage]], for the final stage that
* executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
* Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
*
* - Tasks are individual units of work, each sent to one machine.
*
* - Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them
* and likewise remembers which shuffle map stages have already produced output files to avoid
* redoing the map side of a shuffle.
*
* - Preferred locations: the DAGScheduler also computes where to run each task in a stage based
* on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.
*
* - Cleanup: all data structures are cleared when the running jobs that depend on them finish,
* to prevent memory leaks in a long-running application.
*
* To recover from failures, the same stage might need to run multiple times, which are called
* "attempts". If the TaskScheduler reports that a task failed because a map output file from a
* previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a
* CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small
* amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost
* stage(s) that compute the missing tasks. As part of this process, we might also have to create
* Stage objects for old (finished) stages where we previously cleaned up the Stage object. Since
* tasks from the old attempt of a stage could still be running, care must be taken to map any
* events received in the correct Stage object.
*
* Here's a checklist to use when making or reviewing changes to this class:
*
* - All data structures should be cleared when the jobs involving them end to avoid indefinite
* accumulation of state in long-running programs.
*
* - When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to
* include the new structure. This will help to catch memory leaks.
*/
private[spark]
class DAGScheduler(
private[scheduler] val sc: SparkContext,
private[scheduler] val taskScheduler: TaskScheduler,
listenerBus: LiveListenerBus,
mapOutputTracker: MapOutputTrackerMaster,
blockManagerMaster: BlockManagerMaster,
env: SparkEnv,
clock: Clock = new SystemClock())
extends Logging {
I'll find time later to dig into the details.
The DAGScheduler is a heavyweight player with a big job. The comment above is a bit long, so let me summarize:
1. It is the high-level, stage-oriented scheduling layer (as opposed to the lower-level TaskScheduler).
2. It is the one that computes the DAG (the famous directed acyclic graph), tracks which RDDs and stage outputs are materialized, and plans how the job runs.
3. It wraps each stage's tasks into a TaskSet and hands it to the TaskScheduler to execute.
4. Preferred locations are also decided by this big guy; that is what determines the locality level each task runs at.
5. If shuffle output files are lost and earlier stages need to be resubmitted, that is also its responsibility; failures inside a stage that are not caused by shuffle file loss, however, are the TaskScheduler's business.
So when does this DAGScheduler get instantiated?
For that we have to start from the entry point of everything, SparkContext: you can see that it is instantiated right at the start, as a field of the context.
How does it set the whole machine in motion?
SparkContext.runJob is the entry point for every action, and what it actually calls is dagScheduler.runJob().
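To make that concrete, here is a tiny runnable example of my own (not from the post's original source excerpts): the count() at the end is an action, so it goes through SparkContext.runJob and therefore through the DAGScheduler that the context created at startup.

import org.apache.spark.{SparkConf, SparkContext}

object DagSchedulerEntryPoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-demo").setMaster("local[2]"))
    // count() is an action: it calls sc.runJob, which delegates to dagScheduler.runJob
    val n = sc.parallelize(1 to 100, 4).map(_ * 2).count()
    println(s"count = $n")
    sc.stop()
  }
}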
Let's keep tracing:
inside the DAGScheduler, runJob() calls dagScheduler.submitJob(), which returns a waiter (a JobWaiter); runJob then blocks on that waiter until the job has finished executing. This is why, under normal circumstances, jobs submitted from a single thread run serially.
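The shape of that pattern, as a minimal self-contained sketch (hypothetical class names of mine, not the real JobWaiter/DAGScheduler code): submission returns immediately with a waiter, and runJob blocks on it.

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// Hypothetical stand-in for Spark's JobWaiter: completed by the scheduler,
// awaited by the caller of runJob.
class JobWaiterSketch {
  private val promise = Promise[Unit]()
  def jobFinished(): Unit = promise.trySuccess(())
  def awaitResult(): Unit = Await.result(promise.future, Duration.Inf)
}

object RunJobSketch {
  // submitJob returns right away; a plain thread stands in for the event loop here.
  def submitJob(work: () => Unit): JobWaiterSketch = {
    val waiter = new JobWaiterSketch
    new Thread(new Runnable {
      override def run(): Unit = { work(); waiter.jobFinished() }
    }).start()
    waiter
  }

  // runJob blocks on the waiter, which is why jobs on one caller thread run one after another.
  def runJob(work: () => Unit): Unit = {
    val waiter = submitJob(work)
    waiter.awaitResult()
  }
}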
Here we run into an eventProcessLoop [DAGSchedulerEventProcessLoop], which is what the job is submitted through.
Let's trace it back:
first, let's translate its comment to see what this EventLoop is all about:
an event loop receives events from callers and processes all of them on a dedicated event thread.
In other words, it is a container that keeps receiving events and handling them. Let's take a look:
internally it is implemented with a double-ended blocking queue,
and a new thread is created to do the processing:
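Roughly, the mechanism looks like the following minimal sketch (a simplification I wrote, not the real org.apache.spark.util.EventLoop): a blocking deque holds the events, and a single daemon thread drains it and calls onReceive for each one.

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

abstract class EventLoopSketch[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  // The single event thread: takes events off the deque and dispatches them.
  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped.get) {
        try {
          onReceive(eventQueue.take()) // take() blocks until an event arrives
        } catch {
          case _: InterruptedException => // woken up by stop()
        }
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  // Subclasses (DAGSchedulerEventProcessLoop plays this role in Spark) implement the dispatch.
  protected def onReceive(event: E): Unit
}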
The key method of the abstract class EventLoop is onReceive; let's see how [DAGSchedulerEventProcessLoop] implements it.
It delegates to DAGSchedulerEventProcessLoop.doOnReceive, which pattern matches on the event.
Since we are submitting a job, the case that fires is of course JobSubmitted.
Keep tracing:
it calls dagScheduler.createResultStage(), which returns a ResultStage.
Why is a ResultStage created first?
Because execution is triggered from the last stage. Spark is lazy, so when it is time to compute, the scheduler traces backwards from the final stage towards the beginning, which is exactly the opposite direction to the one in which the DAG was formed.
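A small runnable example of my own to make that concrete: the collect() below causes the ResultStage to be created first; walking the lineage backwards then hits the reduceByKey shuffle, which becomes a ShuffleMapStage parent, while the narrow flatMap/map steps are pipelined into that stage (exactly as the class comment above describes).

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[2]"))
    val counts = sc.parallelize(Seq("a b", "b c", "a c"), 2)
      .flatMap(_.split(" "))   // narrow dependency: pipelined into the ShuffleMapStage
      .map(word => (word, 1))  // narrow dependency: same stage
      .reduceByKey(_ + _)      // shuffle dependency: stage boundary
      .collect()               // action: the ResultStage is built first, then its parents
    counts.foreach(println)
    sc.stop()
  }
}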
At the end of the handleJobSubmitted method:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
As the doc comment says:
submit the stage, but first recursively submit any of its parent stages that are still missing. Stages depend on each other layer by layer; a stage can only run once the stages it depends on have finished, so we have to recurse back through the parents first.
Let's step in and have a look:
getMissingParentStages(stage: Stage): List[Stage] returns the parent stages further up the dependency chain. I've always felt that translating "missing" as "lost" doesn't fit here; I read it as the parent stages that haven't been computed yet, i.e. whose output is not yet available.
private def getMissingParentStages(stage: Stage): List[Stage] = {
val missing = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new ArrayStack[RDD[_]]
//I've moved this part of the code up here so the flow reads more clearly
//push the stage's own rdd onto the stack to start the traversal
waitingForVisit.push(stage.rdd)
while (waitingForVisit.nonEmpty) {
//the matching pop: as long as waitingForVisit is non-empty, keep calling visit()
visit(waitingForVisit.pop())
}
//and here is def visit() itself
// note that what we push and visit are RDDs
def visit(rdd: RDD[_]) {
// only process this rdd if it has not been visited yet; first mark it as visited
if (!visited(rdd)) {
visited += rdd
val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
if (rddHasUncachedPartitions) {
//iterate over all of this rdd's dependencies
for (dep <- rdd.dependencies) {
//check whether each dependency is a wide (shuffle) or a narrow one
dep match {
// if a shuffle is involved, get or create the intermediate player: a ShuffleMapStage
case shufDep: ShuffleDependency[_, _, _] =>
val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
// if the mapStage's output is not yet available, add it to the missing set
if (!mapStage.isAvailable) {
missing += mapStage
}
//for a narrow dependency, just push its parent rdd onto the stack
case narrowDep: NarrowDependency[_] =>
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
missing.toList
}
Three collections are worth pointing out:
//holds the parent stages whose output is not yet available
val missing = new HashSet[Stage]
//holds the RDDs that have already been visited
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new ArrayStack[RDD[_]]
This stack is maintained manually to avoid the StackOverflowError that deep recursion could otherwise cause.
Now back to:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
//if all the parent stages are in place
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
//start submitting this stage's not-yet-submitted tasks
submitMissingTasks(stage, jobId.get)
} else {
// otherwise keep recursing into submitStage itself until all parent stages have been submitted
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
Next we need to look at:
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
stage match {
case s: ShuffleMapStage =>
outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
case s: ResultStage =>
outputCommitCoordinator.stageStart(
stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
} catch {
case NonFatal(e) =>
stage.makeNewStageAttempt(partitionsToCompute.size)
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
// here serialization enters the picture
var taskBinary: Broadcast[Array[Byte]] = null
var partitions: Array[Partition] = null
try {
var taskBinaryBytes: Array[Byte] = null
RDDCheckpointData.synchronized {
//the rdd together with the stage's dependency (or result function) is serialized here; serialization is what makes them shippable across the cluster (see the sketch after this method)
taskBinaryBytes = stage match {
// for a ShuffleMapStage, what gets serialized is (stage.rdd, stage.shuffleDep): a map stage is only an intermediate step, so it has to carry its shuffle dependency along
case stage: ShuffleMapStage =>
JavaUtils.bufferToArray(
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
//for a ResultStage, what gets shipped is (stage.rdd, stage.func)
case stage: ResultStage =>
JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
partitions = stage.rdd.partitions
}
// the taskBinaryBytes assigned above are then broadcast
taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
// In the case of a failure during serialization, abort the stage.
case e: NotSerializableException =>
abortStage(stage, "Task not serializable: " + e.toString, Some(e))
runningStages -= stage
// Abort execution
return
case NonFatal(e) =>
abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
//a new ShuffleMapTask: we're getting closer to the goal; it carries the broadcast variable taskBinary we built above
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId)
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
//same idea as above
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
//at this point we have a Seq[Task[_]]
if (tasks.size > 0) {
//time to submit the tasks, exciting! taskScheduler makes its entrance with a new TaskSet
// taskScheduler is a trait, so we have to look at its implementation; there is only one, TaskSchedulerImpl, and that is the one doing the real work
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
} else {
val debugString = stage match {
case stage: ShuffleMapStage =>
s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})"
case stage : ResultStage =>
s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
}
logDebug(debugString)
//this submits the waiting child stages; as we said before, if stages depend on each other, a child can only run after its parent has finished
submitWaitingChildStages(stage)
}
}
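The design choice behind taskBinary, sketched here without any Spark types (my own simplification, using plain Java serialization instead of Spark's closure serializer): the per-stage payload is serialized once and broadcast, and every task of the stage deserializes the same bytes rather than each task shipping its own copy.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TaskBinarySketch {
  // Serialize the stage payload once: (rdd, shuffleDep) for a ShuffleMapStage,
  // (rdd, func) for a ResultStage. In Spark the result is then wrapped in a broadcast variable.
  def serializeOnce(payload: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(payload)
    out.close()
    bytes.toByteArray
  }

  // Every task of the stage deserializes the same broadcast bytes on the executor side.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}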
There are two more points we need to look at.
TaskSchedulerImpl.submitTasks(taskSet: TaskSet): let's follow the source:
override def submitTasks(taskSet: TaskSet) {
val tasks = taskSet.tasks
logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
this.synchronized {
// a TaskSetManager is created; note the maxTaskFailures parameter, the maximum number of retries a task is allowed. As mentioned earlier, retrying failures inside a stage that are not caused by shuffle file loss is the TaskScheduler's job: it works at task granularity, not stage granularity
//the taskSet is wrapped into a TaskSetManager and will later be placed into the scheduling pool
val manager = createTaskSetManager(taskSet, maxTaskFailures)
val stage = taskSet.stageId
val stageTaskSets =
taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
stageTaskSets(taskSet.stageAttemptId) = manager
val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
ts.taskSet != taskSet && !ts.isZombie
}
if (conflictingTaskSet) {
throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
}
//another player appears here: schedulableBuilder. As the name hints, it is scheduling-related and it manages TaskSetManagers; we won't dwell on it here (see the note on FIFO vs FAIR after this method) and keep moving
//here the TaskSetManager is put into the scheduling pool to wait to be scheduled
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
if (!isLocal && !hasReceivedTask) {
starvationTimer.scheduleAtFixedRate(new TimerTask() {
override def run() {
if (!hasLaunchedTask) {
logWarning("Initial job has not accepted any resources; " +
"check your cluster UI to ensure that workers are registered " +
"and have sufficient resources")
} else {
this.cancel()
}
}
}, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
}
hasReceivedTask = true
}
//and here comes the heavyweight: backend [SchedulerBackend]. Taken literally it is a "backend", and backends are usually about communication. We will analyze it below.
backend.reviveOffers()
}
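A quick aside on the schedulableBuilder mentioned in the comments above: which pool it builds depends on spark.scheduler.mode, FIFO by default or FAIR if configured. A minimal way to switch it (the XML path below is just a placeholder; point it at a real pools file):

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-scheduling-demo")
      .setMaster("local[4]")
      .set("spark.scheduler.mode", "FAIR") // FIFO (default) or FAIR
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // pool definitions, only read in FAIR mode
    val sc = new SparkContext(conf)
    // jobs can then be tagged with a pool before an action is triggered
    sc.setLocalProperty("spark.scheduler.pool", "production")
    sc.stop()
  }
}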
SchedulerBackend is a trait; let's look at its implementation (CoarseGrainedSchedulerBackend):
override def reviveOffers() {
//you can see it calls the send method on driverEndpoint, which is this: var driverEndpoint: RpcEndpointRef
//it is a reference to an RPC endpoint; following send() any further gets into Netty territory, which we won't go into
driverEndpoint.send(ReviveOffers)
}
// Internal messages in driver
// so a message is sent out; who is it sent to, i.e. who receives it?
case object ReviveOffers extends CoarseGrainedClusterMessage
Where there is a send there must be a receive, so another heavyweight enters the scene:
CoarseGrainedExecutorBackend
With that, both core backends are on stage. Let's look at CoarseGrainedExecutorBackend's receive method first:
override def receive: PartialFunction[Any, Unit] = {
// pattern matching on the various messages
case RegisteredExecutor =>
logInfo("Successfully registered with driver")
try {
executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
} catch {
case NonFatal(e) =>
exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
}
case RegisterExecutorFailed(message) =>
exitExecutor(1, "Slave registration failed: " + message)
// and here is the message that launches a task
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
val taskDesc = TaskDescription.decode(data.value)
logInfo("Got assigned task " + taskDesc.taskId)
executor.launchTask(this, taskDesc)
}
case KillTask(taskId, _, interruptThread, reason) =>
if (executor == null) {
exitExecutor(1, "Received KillTask command but executor was null")
} else {
executor.killTask(taskId, interruptThread, reason)
}
case StopExecutor =>
stopping.set(true)
logInfo("Driver commanded a shutdown")
// Cannot shutdown here because an ack may need to be sent back to the caller. So send
// a message to self to actually do the shutdown.
self.send(Shutdown)
case Shutdown =>
stopping.set(true)
new Thread("CoarseGrainedExecutorBackend-stop-executor") {
override def run(): Unit = {
// executor.stop() will call `SparkEnv.stop()` which waits until RpcEnv stops totally.
// However, if `executor.stop()` runs in some thread of RpcEnv, RpcEnv won't be able to
// stop until `executor.stop()` returns, which becomes a dead-lock (See SPARK-14180).
// Therefore, we put this line in a new thread.
executor.stop()
}
}.start()
case UpdateDelegationTokens(tokenBytes) =>
logInfo(s"Received tokens of ${tokenBytes.length} bytes")
SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
}
By this point you may have a question: we sent ReviveOffers, so why don't we see it anywhere in this receive method?
Because the ReviveOffers message is sent to itself: it is handled by the driver's own endpoint, CoarseGrainedSchedulerBackend.DriverEndpoint (class DriverEndpoint).
Let's check whether it has some kind of start method. It does, onStart(); let's step into it:
override def onStart() {
// Periodically revive offers to allow delay scheduling to work
val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
reviveThread.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
Option(self).foreach(_.send(ReviveOffers))
}
}, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
}
//and its receive method
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId)
case None =>
// Ignoring the update since we don't know about the executor.
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId")
}
}
// matched! this is the right one; as you can see, it is a message the driver sent to itself
case ReviveOffers =>
//it calls makeOffers(), which is the resource-allocation step: put plainly, it hands out executors, filtering out the ones that have been killed or are otherwise unusable
makeOffers()
case KillTask(taskId, executorId, interruptThread, reason) =>
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.executorEndpoint.send(
KillTask(taskId, executorId, interruptThread, reason))
case None =>
// Ignoring the task kill since the executor is not registered.
logWarning(s"Attempted to kill task $taskId for unknown executor $executorId.")
}
case KillExecutorsOnHost(host) =>
scheduler.getExecutorsAliveOnHost(host).foreach { exec =>
killExecutors(exec.toSeq, replace = true, force = true)
}
case UpdateDelegationTokens(newDelegationTokens) =>
executorDataMap.values.foreach { ed =>
ed.executorEndpoint.send(UpdateDelegationTokens(newDelegationTokens))
}
case RemoveExecutor(executorId, reason) =>
// We will remove the executor's state and cannot restore it. However, the connection
// between the driver and the executor may be still alive so that the executor won't exit
// automatically, so try to tell the executor to stop itself. See SPARK-13519.
executorDataMap.get(executorId).foreach(_.executorEndpoint.send(StopExecutor))
removeExecutor(executorId, reason)
}
Let's look at makeOffers():
private def makeOffers() {
// Make sure no executor is killed while some task is launching on it
val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
// Filter out executors under killing
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map {
case (id, executorData) =>
// a WorkerOffer represents the resources available on one executor (see the sketch after this method)
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toIndexedSeq
scheduler.resourceOffers(workOffers)
}
//key point here: if resourceOffers returned any task descriptions, launchTasks is called to run them
if (!taskDescs.isEmpty) {
launchTasks(taskDescs)
}
}
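What such an offer carries, as a hedged sketch (the real class is org.apache.spark.scheduler.WorkerOffer; the names below are mine, not the actual fields): one offer per live executor, saying where it runs and how many cores are free. resourceOffers matches pending tasks against these offers, taking locality preferences into account, and returns the TaskDescriptions to launch.

object WorkerOfferSketchDemo {
  // Hypothetical stand-in for WorkerOffer, just to show the shape of a resource offer.
  case class WorkerOfferSketch(executorId: String, host: String, freeCores: Int)

  def main(args: Array[String]): Unit = {
    // makeOffers() builds one offer per alive executor and hands the whole batch to
    // TaskSchedulerImpl.resourceOffers, which decides which pending tasks to run where.
    val offers = IndexedSeq(
      WorkerOfferSketch("exec-1", "node-a", freeCores = 4),
      WorkerOfferSketch("exec-2", "node-b", freeCores = 2)
    )
    offers.foreach(println)
  }
}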
Now the launchTasks method:
// Launch tasks returned by a set of resource offers
// "tasks" here is the sequence of TaskDescription sequences returned by the resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
for (task <- tasks.flatten) {
// the task is encoded, which is really just serialization
val serializedTask = TaskDescription.encode(task)
if (serializedTask.limit() >= maxRpcMessageSize) {
//if you've read this far, congratulations: you get to understand a common production error. That makes it worth looking at what a TaskDescription actually is
scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
try {
var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
"spark.rpc.message.maxSize (%d bytes). Consider increasing " +
"spark.rpc.message.maxSize or using broadcast variables for large values."
msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
taskSetMgr.abort(msg)
} catch {
case e: Exception => logError("Exception in error callback", e)
}
}
}
else {
val executorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK
logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
s"${executorData.executorHost}.")
//executorData carries the executor's resource information; the send here is the key step: it sends the crucial LaunchTask message carrying the already-serialized task, and the real work is about to start
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
}
}
}
I wonder whether you have ever hit the production error about spark.rpc.message.maxSize being exceeded. Let's go in and see what it is about.
//here it is; you can see that its addedFiles and addedJars attributes are both Maps
private[spark] class TaskDescription(
val taskId: Long,
val attemptNumber: Int,
val executorId: String,
val name: String,
val index: Int, // Index within this task's TaskSet
val addedFiles: Map[String, Long],
val addedJars: Map[String, Long],
val properties: Properties,
val serializedTask: ByteBuffer) {
A look at the encode method:
def encode(taskDescription: TaskDescription): ByteBuffer = {
val bytesOut = new ByteBufferOutputStream(4096)
val dataOut = new DataOutputStream(bytesOut)
dataOut.writeLong(taskDescription.taskId)
dataOut.writeInt(taskDescription.attemptNumber)
dataOut.writeUTF(taskDescription.executorId)
dataOut.writeUTF(taskDescription.name)
dataOut.writeInt(taskDescription.index)
// Write files.
serializeStringLongMap(taskDescription.addedFiles, dataOut)
// Write jars.
serializeStringLongMap(taskDescription.addedJars, dataOut)
// Write properties.
dataOut.writeInt(taskDescription.properties.size())
taskDescription.properties.asScala.foreach { case (key, value) =>
dataOut.writeUTF(key)
// SPARK-19796 -- writeUTF doesn't work for long strings, which can happen for property values
val bytes = value.getBytes(StandardCharsets.UTF_8)
dataOut.writeInt(bytes.length)
dataOut.write(bytes)
}
// Write the task. The task is already serialized, so write it directly to the byte buffer.
//the already-serialized task is written into the buffer; this is what will be shipped around the cluster
Utils.writeByteBuffer(taskDescription.serializedTask, bytesOut)
dataOut.close()
bytesOut.close()
bytesOut.toByteBuffer
}
So we can see all the writes: files, jars, properties, and the serialized task itself all get packed in. Let's pause here for a moment.
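Tying this back to the error mentioned above: if the encoded TaskDescription ends up larger than spark.rpc.message.maxSize (in MB, default 128), the TaskSetManager is aborted with the message we saw in launchTasks. Two common ways out, sketched in a small example of my own (the lookup table is just an invented large value):

import org.apache.spark.{SparkConf, SparkContext}

object BigTaskWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("big-task-demo")
      .setMaster("local[2]")
      .set("spark.rpc.message.maxSize", "256") // raise the RPC limit (MB); the blunt fix
    val sc = new SparkContext(conf)

    // The better fix, as the abort message itself suggests: broadcast large values once
    // instead of letting every task carry them along.
    val bigLookup = sc.broadcast((1 to 100000).map(i => i -> i.toString).toMap)
    val hits = sc.parallelize(1 to 1000, 4).filter(i => bigLookup.value.contains(i)).count()
    println(s"hits = $hits")
    sc.stop()
  }
}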
The LaunchTask message gets sent, but to whom? Remember from above?
CoarseGrainedExecutorBackend.receive
CoarseGrainedExecutorBackend has an onStart method in which it establishes the connection with the driver. Now look again at its receive method:
in the LaunchTask case it performs a decode, i.e. deserialization, mirroring the encode above.
executor.launchTask(this, taskDesc) is where the task actually starts to run.
Another heavyweight, the Executor, enters the scene.
// here is Executor's launchTask method
// TaskRunner extends Runnable: the task is wrapped in a TaskRunner
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
val tr = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)
threadPool.execute(tr)
}
//runningTasks keeps track of all the tasks that are currently running
private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]
//threadPool: the thread pool used to execute the tasks
threadPool
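Pulling those pieces together, here is a minimal self-contained sketch of the pattern (hypothetical names, not the real Executor internals): wrap the work in a Runnable, remember it in a map of running tasks, and hand it to a cached thread pool so many tasks can run concurrently on one executor.

import java.util.concurrent.{ConcurrentHashMap, Executors}

object ExecutorSketch {
  private val runningTasks = new ConcurrentHashMap[Long, Runnable]()
  private val threadPool = Executors.newCachedThreadPool() // Spark's Executor uses a daemon cached pool

  def launchTask(taskId: Long, body: () => Unit): Unit = {
    val runner = new Runnable {
      override def run(): Unit =
        try body() finally runningTasks.remove(taskId) // drop the entry once the task finishes
    }
    runningTasks.put(taskId, runner)
    threadPool.execute(runner)
  }

  def main(args: Array[String]): Unit = {
    (1L to 4L).foreach { id =>
      launchTask(id, () => println(s"task $id running on ${Thread.currentThread().getName}"))
    }
    Thread.sleep(500)
    threadPool.shutdown()
  }
}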