Continuing from the previous article.
Enter TaskSchedulerImpl#submitTasks:
val manager = createTaskSetManager(taskSet, maxTaskFailures)
First, a TaskSetManager is created.
A TaskSetManager schedules the tasks within a single TaskSet: it tracks each task, retries a failed task until the maximum number of attempts is exceeded, and handles locality-aware placement for this TaskSet via delay scheduling. Its main interface is resourceOffer, through which it is asked to run a task on a given node, and it also receives task status-change updates.
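For context, here is a condensed sketch of the surrounding submitTasks logic (paraphrased from the Spark source; version-specific details such as duplicate-stage checks are trimmed):

override def submitTasks(taskSet: TaskSet) {
  this.synchronized {
    // one TaskSetManager per TaskSet; it owns retries and locality handling
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    // hand the manager to the FIFO/FAIR scheduling pool
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
  }
  // ask the backend for resources; this eventually triggers makeOffers()
  backend.reviveOffers()
}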
Jump straight to:
backend.reviveOffers()
This backend is the SparkDeploySchedulerBackend.
From the earlier analysis, we know this object is responsible for communicating with the Master and registering the application.
Stepping into this method, it ultimately comes down to:
driverEndpoint.send(ReviveOffers)
We find the driver-side handler for ReviveOffers, which is actually in CoarseGrainedSchedulerBackend:
case ReviveOffers =>
makeOffers()
private def makeOffers() {
// Filter out executors under killing
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map { case (id, executorData) =>
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toSeq
launchTasks(scheduler.resourceOffers(workOffers))
}
Looking at the method above:
the first step filters out executors that are no longer alive,
then each surviving executor is wrapped in a WorkerOffer object.
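A WorkerOffer is just a small value object describing the free resources on one executor; in the Spark source it is roughly (fields may differ slightly across versions):

private[spark]
case class WorkerOffer(executorId: String, host: String, cores: Int)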
Now enter scheduler.resourceOffers(workOffers).
Inside, focus on this part:
for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
do {
launchedTask = resourceOfferSingleTaskSet(
taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
} while (launchedTask)
}
This is the core algorithm for assigning tasks to executors.
The overall flow is driven by locality level, which I covered in the Spark performance tuning article. There are five locality levels:
PROCESS_LOCAL: process-local. The code and the data are in the same process, i.e., the same executor; the task that computes the data runs in that executor and the data sits in that executor's BlockManager. Best performance.
NODE_LOCAL: node-local. The code and the data are on the same node; for example, the data is an HDFS block on that node while the task runs in some executor on the node, or the data and the task are in different executors on the same node. Data has to be moved between processes.
NO_PREF: the task has no preference; it performs the same no matter where the data comes from.
RACK_LOCAL: rack-local. The data and the task are on two different nodes within the same rack; data has to be transferred between nodes over the network.
ANY: the data and the task may be anywhere in the cluster, not even in the same rack. Worst performance.
This is a double loop: for every TaskSet, resources are offered at each of the five locality levels, starting from the best one. Step into the
resourceOfferSingleTaskSet method.
taskIdToExecutorId is the map that records which executor each task is assigned to. If a task can be placed, launchedTask is set to true; if it cannot, launchedTask stays false and scheduling falls back to the next, looser locality level.
Every task is placed this way until all tasks have been assigned.
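A simplified sketch of resourceOfferSingleTaskSet (paraphrased; bookkeeping, blacklist handling and error handling are trimmed):

private def resourceOfferSingleTaskSet(
    taskSet: TaskSetManager,
    maxLocality: TaskLocality,
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int],
    tasks: Seq[ArrayBuffer[TaskDescription]]): Boolean = {
  var launchedTask = false
  // walk every executor offer and ask the TaskSetManager for a task
  // that can run there at (or better than) maxLocality
  for (i <- 0 until shuffledOffers.size) {
    val execId = shuffledOffers(i).executorId
    val host = shuffledOffers(i).host
    if (availableCpus(i) >= CPUS_PER_TASK) {
      for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
        tasks(i) += task
        // remember which executor this task landed on
        taskIdToExecutorId(task.taskId) = execId
        availableCpus(i) -= CPUS_PER_TASK
        launchedTask = true
      }
    }
  }
  launchedTask
}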
Back in the makeOffers method: once the task-to-executor assignments are done, we move into the launchTasks method:
val executorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
Here, based on the task-to-executor mapping, a LaunchTask message is sent to the corresponding executor.
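For completeness, the enclosing launchTasks loop looks roughly like this (paraphrased; the real code also aborts the TaskSet when a serialized task exceeds the RPC frame size limit):

private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    // ship the TaskDescription to the executor as bytes
    val serializedTask = ser.serialize(task)
    if (serializedTask.limit < maxRpcMessageSize) {   // size-limit name varies by version
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}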
------------
As mentioned earlier, once an Executor starts it registers itself back with the Driver, and the Executor's backend process lives in CoarseGrainedExecutorBackend.scala.
Note that CoarseGrainedExecutorBackend's onStart method sends a RegisterExecutor message to the Driver, so let's look at the handler in CoarseGrainedSchedulerBackend:
case RegisterExecutor(executorId, executorRef, hostPort, cores, logUrls) =>
if (executorDataMap.contains(executorId)) {
context.reply(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
} else {
// If the executor's rpc env is not listening for incoming connections, `hostPort`
// will be null, and the client connection should be used to contact the executor.
val executorAddress = if (executorRef.address != null) {
executorRef.address
} else {
context.senderAddress
}
logInfo(s"Registered executor $executorRef ($executorAddress) with ID $executorId")
addressToExecutorId(executorAddress) = executorId
totalCoreCount.addAndGet(cores)
totalRegisteredExecutors.addAndGet(1)
val data = new ExecutorData(executorRef, executorRef.address, executorAddress.host,
cores, cores, logUrls)
// This must be synchronized because variables mutated
// in this block are read when requesting executors
CoarseGrainedSchedulerBackend.this.synchronized {
executorDataMap.put(executorId, data)
if (numPendingExecutors > 0) {
numPendingExecutors -= 1
logDebug(s"Decremented number of pending executors ($numPendingExecutors left)")
}
}
// Note: some tests expect the reply to come after we put the executor in the map
context.reply(RegisteredExecutor(executorAddress.host))
listenerBus.post(
SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
makeOffers()
}
The key lines here are:
addressToExecutorId(executorAddress) = executorId
totalCoreCount.addAndGet(cores)
totalRegisteredExecutors.addAndGet(1)
executorDataMap.put(executorId, data)
This is what registers the executor with the driver, to be used later when assigning tasks.
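The ExecutorData entry kept in executorDataMap is essentially just this bookkeeping record (close to the class in the Spark source; the exact field list varies slightly across versions):

private[cluster] class ExecutorData(
    val executorEndpoint: RpcEndpointRef,   // used later to send LaunchTask
    val executorAddress: RpcAddress,
    val executorHost: String,
    var freeCores: Int,                     // decremented/incremented as tasks start and finish
    val totalCores: Int,
    val logUrlMap: Map[String, String])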
context.reply(RegisteredExecutor(executorAddress.host))
And a RegisteredExecutor message is sent back to the Executor.
Back in CoarseGrainedExecutorBackend:
case RegisteredExecutor(hostname) =>
logInfo("Successfully registered with driver")
executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
Upon receiving this message, it creates an Executor.
Now, continuing from where we left off:
the receive method here also handles a LaunchTask event.
case LaunchTask(data) =>
if (executor == null) {
logError("Received LaunchTask command but executor was null")
System.exit(1)
} else {
val taskDesc = ser.deserialize[TaskDescription](data.value)
logInfo("Got assigned task " + taskDesc.taskId)
executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
taskDesc.name, taskDesc.serializedTask)
}
The key call here is executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber, taskDesc.name, taskDesc.serializedTask), which boils down to:
val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
serializedTask)
runningTasks.put(taskId, tr)
threadPool.execute(tr)
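So launchTask just wraps the task in a TaskRunner (a Runnable), records it in runningTasks, and hands it to the executor's task-launch thread pool (a daemon cached thread pool in the Spark source), which is why many tasks can run concurrently inside one executor JVM. A minimal standalone illustration of that pattern (hypothetical names, not Spark code):

import java.util.concurrent.Executors

object LaunchPattern {
  // roughly what the executor does: a cached pool that grows with demand
  private val pool = Executors.newCachedThreadPool()

  def launch(taskId: Long, body: () => Unit): Unit = {
    pool.execute(new Runnable {
      // runs on a pool thread; execute() itself returns immediately
      override def run(): Unit = body()
    })
  }
}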
The actual task logic lives in TaskRunner's run method.
Find TaskRunner's run method:
updateDependencies(taskDescription.addedFiles, taskDescription.addedJars)
This code downloads any missing file and jar dependencies; internally it is synchronized.
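A rough sketch of the pattern inside updateDependencies (paraphrased; the real method compares timestamps, downloads with Utils.fetchFile and adds new jars to the task class loader):

private def updateDependencies(newFiles: Map[String, Long], newJars: Map[String, Long]) {
  synchronized {   // several TaskRunner threads may call this concurrently
    for ((name, timestamp) <- newJars if currentJars.getOrElse(name, -1L) < timestamp) {
      logInfo("Fetching " + name + " with timestamp " + timestamp)
      // fetch the jar into the executor's working directory and expose it
      // to the task's class loader (Utils.fetchFile + addURL in the real code)
      currentJars(name) = timestamp
    }
  }
}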
task = ser.deserialize[Task[Any]](
taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
Then the task itself is deserialized.
Next we see:
val res = task.run(
taskAttemptId = taskId,
attemptNumber = attemptNumber,
metricsSystem = env.metricsSystem)
threwException = false
res
Stepping in, the core line is:
runTask(context)
runTask is an abstract method; ShuffleMapTask and ResultTask implement it differently.
We will leave that for the next section.
Back in Executor, at
val res = task.run
If this task is a ShuffleMapTask, the return value is a MapStatus, which encapsulates the location (and per-partition sizes) of the ShuffleMapTask's output. A task in the next stage then contacts the MapOutputTracker to get the output locations of the upstream ShuffleMapTasks and pulls the data over the network.
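Conceptually, the MapStatus returned by a ShuffleMapTask carries little more than where the map output lives and how large each reduce partition's block is; a simplified view (close to the trait in the Spark source, comments added):

private[spark] sealed trait MapStatus {
  // which BlockManager (i.e. which executor) holds this map task's shuffle output
  def location: BlockManagerId
  // estimated size of the output block destined for a given reduce partition,
  // so the reducer knows what to fetch
  def getSizeForBlock(reduceId: Int): Long
}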
The following lines serialize the returned result:
val resultSer = env.serializer.newInstance()
val beforeSerialization = System.currentTimeMillis()
val valueBytes = resultSer.serialize(value)
val afterSerialization = System.currentTimeMillis()
The next lines record metrics that will later show up in the Spark UI:
task.metrics.setExecutorDeserializeTime(
(taskStart - deserializeStartTime) + task.executorDeserializeTime)
// We need to subtract Task.run()'s deserialization time to avoid double-counting
task.metrics.setExecutorRunTime((taskFinish - taskStart) - task.executorDeserializeTime)
task.metrics.setJvmGCTime(computeTotalGcTime() - startGCTime)
task.metrics.setResultSerializationTime(afterSerialization - beforeSerialization)
This next line is the crucial one; it is in effect a call into CoarseGrainedExecutorBackend:
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
Stepping in, it simply sends the driver a message with the task's status:
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
val msg = StatusUpdate(executorId, taskId, state, data)
driver match {
case Some(driverRef) => driverRef.send(msg)
case None => logWarning(s"Drop $msg because has not yet connected to driver")
}