The overall run flow:
- The client submits the job to the Master;
- The Master has one Worker launch the Driver, i.e. the SchedulerBackend. The Worker creates a DriverRunner thread, and the DriverRunner starts the SchedulerBackend process.
- The Master also picks other Workers to launch Executors, i.e. ExecutorBackends. Each such Worker creates an ExecutorRunner thread, and the ExecutorRunner starts the ExecutorBackend process.
- Once started, each ExecutorBackend registers with the Driver's SchedulerBackend. The SchedulerBackend process (the Driver) contains the DAGScheduler, which builds an execution plan from the user program and schedules its execution. The tasks of each stage are placed in the TaskScheduler, and when ExecutorBackends report in to the SchedulerBackend, tasks in the TaskScheduler are dispatched to them for execution.
- The job finishes once all stages are complete.
SparkContext is the entry point of every Spark program; when you write a WordCount program, new SparkContext(sparkConf) builds a SparkContext instance. SparkContext.scala performs a number of essential setup steps, the most important of which are shown below (around line 521, inside the try block that begins at line 396):
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_conf.set("spark.app.id", _applicationId)
_ui.foreach(_.setAppId(_applicationId))
_env.blockManager.initialize(_applicationId)
So when the SparkContext is instantiated, it calls createTaskScheduler to create the SchedulerBackend and the TaskScheduler, and it also creates the DAGScheduler. It then calls taskScheduler.start().
Now look at the createTaskScheduler(sc: SparkContext, master: String) method; the master argument is whatever was set via setMaster() on the SparkConf.
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._
  ...
  case SPARK_REGEX(sparkUrl) =>
    val scheduler = new TaskSchedulerImpl(sc)
    val masterUrls = sparkUrl.split(",").map("spark://" + _)
    val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
    scheduler.initialize(backend)
    (backend, scheduler)
  ...
This method decides which SchedulerBackend and TaskScheduler to create based on the master value passed in. Only the branch matching SPARK_REGEX is shown here; in this branch backend is a SparkDeploySchedulerBackend and scheduler is actually a TaskSchedulerImpl.
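As a minimal sketch of how a driver program ends up in this branch (the application name and master host below are placeholders, not from the source):

import org.apache.spark.{SparkConf, SparkContext}

// A spark:// master URL matches SPARK_REGEX, so createTaskScheduler returns a
// (SparkDeploySchedulerBackend, TaskSchedulerImpl) pair; "local[*]" or a YARN master
// would fall into other branches of the match.
val conf = new SparkConf().setAppName("WordCount").setMaster("spark://master:7077")
val sc = new SparkContext(conf)   // runs the setup code shown at the top of this post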
Here are the class signatures of TaskSchedulerImpl and SparkDeploySchedulerBackend:
private[spark] class TaskSchedulerImpl(
    val sc: SparkContext,
    val maxTaskFailures: Int,
    isLocal: Boolean = false)
  extends TaskScheduler with Logging {

private[spark] class SparkDeploySchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)
  with AppClientListener
  with Logging {
Step by step: in the createTaskScheduler snippet above, scheduler.initialize(backend) is called. In TaskSchedulerImpl.scala:
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}
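The schedulingMode used above is read from the spark.scheduler.mode configuration (FIFO by default), so switching an application to fair scheduling is purely a configuration change. A small illustration (values are only an example):

// With FAIR, initialize() builds a FairSchedulableBuilder, whose buildPools() can read pool
// definitions from the file given by spark.scheduler.allocation.file.
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.scheduler.mode", "FAIR")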
Going back to the code at the very beginning, it contains _taskScheduler.start(), i.e. TaskSchedulerImpl's start(). In TaskSchedulerImpl.scala:
override def start() {
  backend.start()
  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
backend.start() here is SparkDeploySchedulerBackend's start():
override def start() {
super.start()
launcherBackend.connect()
// The endpoint for executors to talk to us
val driverUrl = rpcEnv.uriOf(SparkEnv.driverActorSystemName,
RpcAddress(sc.conf.get("spark.driver.host"), sc.conf.get("spark.driver.port").toInt),
CoarseGrainedSchedulerBackend.ENDPOINT_NAME)
val args = Seq(
"--driver-url", driverUrl,
"--executor-id", "{{EXECUTOR_ID}}",
"--hostname", "{{HOSTNAME}}",
"--cores", "{{CORES}}",
"--app-id", "{{APP_ID}}",
"--worker-url", "{{WORKER_URL}}")
val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
.map(Utils.splitCommandString).getOrElse(Seq.empty)
val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
.map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
.map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
// When testing, expose the parent class path to the child. This is processed by
// compute-classpath.{cmd,sh} and makes all needed jars available to child processes
// when the assembly is built with the "*-provided" profiles enabled.
val testingClassPath =
if (sys.props.contains("spark.testing")) {
sys.props("java.class.path").split(java.io.File.pathSeparator).toSeq
} else {
Nil
}
// Start executors with a few necessary configs for registering with the scheduler
val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
val javaOpts = sparkJavaOpts ++ extraJavaOpts
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor)
client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
client.start()
launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
waitForRegistration()
launcherBackend.setState(SparkAppHandle.State.RUNNING)
}
At the very beginning it calls super.start(). As mentioned earlier, SparkDeploySchedulerBackend extends CoarseGrainedSchedulerBackend, whose start() is:
override def start() {
  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll) {
    if (key.startsWith("spark.")) {
      properties += ((key, value))
    }
  }
  // TODO (prashant) send conf instead of properties
  driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))
}

protected def createDriverEndpoint(properties: Seq[(String, String)]): DriverEndpoint = {
  new DriverEndpoint(rpcEnv, properties)
}
So when CoarseGrainedSchedulerBackend starts, it instantiates the DriverEndpoint.
Back in SparkDeploySchedulerBackend's start method: it builds appDesc, whose Command specifies that the entry class of the Executor process to be launched for this application is CoarseGrainedExecutorBackend, and whose args also carry the driverUrl. The appDesc is then passed to an AppClient; the AppClient object is created and its start method is called.
In AppClient.scala:
def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}
The endpoint field in AppClient is defined as follows:
private val endpoint = new AtomicReference[RpcEndpointRef]
rpcEnv.setupEndpoint() registers the ClientEndpoint with the RpcEnv and returns an RpcEndpointRef:
/**
* Register a [[RpcEndpoint]] with a name and return its [[RpcEndpointRef]]. [[RpcEnv]] does not
* guarantee thread-safety.
*/
def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
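As a reminder of the RPC mechanism from the previous post, here is a minimal sketch (a made-up endpoint, not Spark source) of that lifecycle: once an endpoint is registered via setupEndpoint, the RpcEnv invokes its onStart() and then dispatches incoming messages to receive:

import org.apache.spark.rpc.{RpcEnv, ThreadSafeRpcEndpoint}

class EchoEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {
  // Called right after registration; ClientEndpoint uses this hook to contact the Master.
  override def onStart(): Unit = {}

  // One-way messages (sent with RpcEndpointRef.send) are dispatched here.
  override def receive: PartialFunction[Any, Unit] = {
    case msg: String => println(s"got $msg")
  }
}

// val ref = rpcEnv.setupEndpoint("Echo", new EchoEndpoint(rpcEnv))
// ref.send("hello")   // delivered asynchronously to receive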
As explained in the previous post on Spark's RPC mechanism, the ClientEndpoint's onStart method will be invoked:
override def onStart(): Unit = {
  try {
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}
registerWithMaster calls tryRegisterAllMasters, which sends a RegisterApplication(appDescription, self) message to register the current application with the Master.
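Condensed to its essence, the registration amounts to something like the following sketch (the real tryRegisterAllMasters submits these sends to a thread pool; treat this as an approximation rather than the exact source):

// For each configured master URL, look up the Master endpoint and send RegisterApplication.
for (masterAddress <- masterRpcAddresses) {
  val masterRef = rpcEnv.setupEndpointRef(Master.SYSTEM_NAME, masterAddress, Master.ENDPOINT_NAME)
  masterRef.send(RegisterApplication(appDescription, self))
}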
After the Master receives the RegisterApplication message, it generates an application ID for the program and allocates compute resources through schedule(); the concrete allocation is determined by the application's run mode and configuration such as memory and cores. In Master.scala:
override def receive: PartialFunction[Any, Unit] = {
  …
  case RegisterApplication(description, driver) => {
    // TODO Prevent repeated registrations from some driver
    if (state == RecoveryState.STANDBY) {
      // ignore, don't send response
    } else {
      logInfo("Registering app " + description.name)
      val app = createApplication(description, driver)
      registerApplication(app)
      logInfo("Registered app " + description.name + " with ID " + app.id)
      persistenceEngine.addApplication(app)
      driver.send(RegisteredApplication(app.id, self))
      schedule()
    }
  }
  …
}
driver.send(RegisteredApplication(app.id, self)) sends a RegisteredApplication message back to the ClientEndpoint. Now look at the schedule() method:
/**
 * Schedule the currently available resources among waiting apps. This method will be called
 * every time a new app joins or resource availability changes.
 */
private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) { return }
  // Drivers take strict precedence over executors
  val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers
  for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
    for (driver <- waitingDrivers) {
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
      }
    }
  }
  startExecutorsOnWorkers()
}
schedule() does the resource scheduling: it starts the Driver on one worker by calling launchDriver(worker, driver):
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  worker.addDriver(driver)
  driver.worker = Some(worker)
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING
}
launchDriver sends a LaunchDriver message to the Worker:
case LaunchDriver(driverId, driverDesc) => {
  logInfo(s"Asked to launch driver $driverId")
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  drivers(driverId) = driver
  driver.start()
  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem
}
When the Worker receives this message it first creates a DriverRunner and then calls its start method. Internally the DriverRunner handles the Driver startup in a separate thread: it creates the Driver's working directory, downloads the jar files, then wraps the Driver's launch Command and starts the Driver process via ProcessBuilder. Note that in standalone mode the Worker is responsible for restarting the Driver: if a cluster-mode Driver fails and supervise is true, the Worker that launched it will restart it.
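For intuition, a rough sketch (an assumption, not the DriverRunner/ExecutorRunner source) of the ProcessBuilder pattern used to fork such a child JVM looks like this:

import java.io.File

// Launch a child JVM running mainClass inside workDir; DriverRunner and ExecutorRunner do
// essentially this, but build the command from the Command in the DriverDescription /
// ApplicationDescription and redirect stdout/stderr to files in the working directory.
def launchChildJvm(mainClass: String, args: Seq[String], workDir: File): Process = {
  val javaBin = new File(new File(System.getProperty("java.home"), "bin"), "java").getPath
  val command = Seq(javaBin, "-cp", System.getProperty("java.class.path"), mainClass) ++ args
  val builder = new ProcessBuilder(command: _*)
  builder.directory(workDir)        // run inside the driver/executor working directory
  builder.redirectErrorStream(true) // merge stderr into stdout for logging
  builder.start()
}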
In the Master's schedule() method, after launchDriver it calls startExecutorsOnWorkers() to launch executors on the selected workers; the Master decides how many cores to allocate on each worker.
/**
* Schedule and launch executors on workers
*/
private def startExecutorsOnWorkers(): Unit = {
// Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
// in the queue, then the second app, etc.
for (app <- waitingApps if app.coresLeft > 0) {
val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
// Filter out workers that don't have enough resources to launch an executor
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
.filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
worker.coresFree >= coresPerExecutor.getOrElse(1))
.sortBy(_.coresFree).reverse
val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
// Now that we've decided how many cores to allocate on each worker, let's allocate them
for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
allocateWorkerResourceToExecutors(
app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
}
}
}
/**
* Allocate a worker's resources to one or more executors.
* @param app the info of the application which the executors belong to
* @param assignedCores number of cores on this worker for this application
* @param coresPerExecutor number of cores per executor
* @param worker the worker info
*/
private def allocateWorkerResourceToExecutors(
app: ApplicationInfo,
assignedCores: Int,
coresPerExecutor: Option[Int],
worker: WorkerInfo): Unit = {
// If the number of cores per executor is specified, we divide the cores assigned
// to this worker evenly among the executors with no remainder.
// Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
for (i <- 1 to numExecutors) {
val exec = app.addExecutor(worker, coresToAssign)
launchExecutor(worker, exec)
app.state = ApplicationState.RUNNING
}
}
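As a quick worked example of the arithmetic above (the numbers are hypothetical):

// Suppose this worker was assigned 8 cores and the app set spark.executor.cores = 2.
val assignedCores    = 8
val coresPerExecutor = Some(2)
val numExecutors  = coresPerExecutor.map(assignedCores / _).getOrElse(1) // 8 / 2 = 4 executors
val coresToAssign = coresPerExecutor.getOrElse(assignedCores)            // 2 cores per executor
// If spark.executor.cores is not set, coresPerExecutor is None: a single executor is launched
// that grabs all 8 assigned cores on this worker.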
launchExecutor(worker, exec) sends a LaunchExecutor message to the chosen worker. When the worker receives it, it first creates an ExecutorRunner; in Worker.scala's receive method:
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
if (masterUrl != activeMasterUrl) {
logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
} else {
try {
logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
// Create the executor's working directory
val executorDir = new File(workDir, appId + "/" + execId)
if (!executorDir.mkdirs()) {
throw new IOException("Failed to create directory " + executorDir)
}
// Create local dirs for the executor. These are passed to the executor via the
// SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
// application finishes.
val appLocalDirs = appDirectories.get(appId).getOrElse {
Utils.getOrCreateLocalRootDirs(conf).map { dir =>
val appDir = Utils.createDirectory(dir, namePrefix = "executor")
Utils.chmod700(appDir)
appDir.getAbsolutePath()
}.toSeq
}
appDirectories(appId) = appLocalDirs
val manager = new ExecutorRunner(
appId,
execId,
appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
cores_,
memory_,
self,
workerId,
host,
webUi.boundPort,
publicAddress,
sparkHome,
executorDir,
workerUri,
conf,
appLocalDirs, ExecutorState.RUNNING)
executors(appId + "/" + execId) = manager
manager.start()
coresUsed += cores_
memoryUsed += memory_
sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
} catch {
case e: Exception => {
logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
if (executors.contains(appId + "/" + execId)) {
executors(appId + "/" + execId).kill()
executors -= appId + "/" + execId
}
sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
Some(e.toString), None))
}
}
}
The worker creates the ExecutorRunner and calls ExecutorRunner.start() to launch the Executor process. start invokes fetchAndRunExecutor, which downloads the application's files and runs the executor. Internally, the ExecutorRunner uses a thread to build a ProcessBuilder that launches another JVM process; the main class this JVM loads is exactly the class named in the Command passed in when the ClientEndpoint was created, i.e. org.apache.spark.executor.CoarseGrainedExecutorBackend, and its main method instantiates the CoarseGrainedExecutorBackend message loop itself.
CoarseGrainedExecutorBackend first uses the driverUrl argument passed to it to send a RegisterExecutor(executorId, hostPort, cores) message to org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend, registering itself with the Driver. Once registration succeeds, the Driver replies with a RegisteredExecutor message, and upon receiving it CoarseGrainedExecutorBackend initializes the Executor:
case RegisteredExecutor(hostname) =>
  logInfo("Successfully registered with driver")
  executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
CoarseGrainedExecutorBackend is the process in which the Executor runs; the Executor is what actually processes Tasks, and internally it completes the Task computation by scheduling them on a thread pool.
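A minimal sketch of that thread-pool pattern (made-up names, not the Executor source; it mirrors the Executor.launchTask code shown near the end of this post):

import java.util.concurrent.{ConcurrentHashMap, Executors}

class SketchTaskRunner(val taskId: Long, body: () => Unit) extends Runnable {
  // In the real TaskRunner this deserializes the task, runs it, and reports the result back.
  override def run(): Unit = body()
}

object SketchExecutor {
  private val threadPool   = Executors.newCachedThreadPool()
  private val runningTasks = new ConcurrentHashMap[Long, SketchTaskRunner]()

  def launchTask(taskId: Long, body: () => Unit): Unit = {
    val tr = new SketchTaskRunner(taskId, body)
    runningTasks.put(taskId, tr)   // tracked so it can be killed or reported on later
    threadPool.execute(tr)         // executed asynchronously on a pooled thread
  }
}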
Only an RDD action triggers SparkContext's runJob() method. In runJob:
/**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark.
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}
The important part is dagScheduler.runJob, which ultimately sends a JobSubmitted message to eventProcessLoop: DAGSchedulerEventProcessLoop.
The DAGScheduler receives the JobSubmitted message and handles it:
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
  dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
In handleJobSubmitted the DAGScheduler performs the stage division and calls submitStage.
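For intuition, take a hypothetical WordCount job (paths are placeholders) built on the sc from the earlier sketch: the shuffle introduced by reduceByKey is what splits the DAG into stages.

// Stage 0 (ShuffleMapStage): textFile -> flatMap -> map, ending at the shuffle write.
// Stage 1 (ResultStage):     the shuffle read for reduceByKey -> collect.
// Each stage becomes a TaskSet with one task per partition, handed to the TaskScheduler below.
val counts = sc.textFile("hdfs:///tmp/input")   // placeholder path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffle dependency => stage boundary
counts.collect()        // action => runJob => JobSubmitted => handleJobSubmitted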
After the DAGScheduler has divided the stages, the TaskSet containing all the tasks of the stage to run is managed by a TaskSetManager inside TaskSchedulerImpl.
In TaskSchedulerImpl.scala:
override def submitTasks(taskSet: TaskSet) {
val tasks = taskSet.tasks
logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
this.synchronized {
val manager = createTaskSetManager(taskSet, maxTaskFailures)
val stage = taskSet.stageId
val stageTaskSets =
taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
stageTaskSets(taskSet.stageAttemptId) = manager
val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
ts.taskSet != taskSet && !ts.isZombie
}
if (conflictingTaskSet) {
throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
}
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
if (!isLocal && !hasReceivedTask) {
starvationTimer.scheduleAtFixedRate(new TimerTask() {
override def run() {
if (!hasLaunchedTask) {
logWarning("Initial job has not accepted any resources; " +
"check your cluster UI to ensure that workers are registered " +
"and have sufficient resources")
} else {
this.cancel()
}
}
}, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
}
hasReceivedTask = true
}
backend.reviveOffers()
}
The TaskSet is handed to a TaskSetManager for management, and schedulableBuilder.addTaskSetManager() is called; the schedulableBuilder determines the scheduling order of the TaskSetManagers, and each task's data locality then determines which ExecutorBackend it will run on. backend.reviveOffers() sends a ReviveOffers message to the DriverEndpoint.
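For reference, when resources are offered (scheduler.resourceOffers in makeOffers below), tasks are matched to offers by walking the locality levels from most to least preferred, and spark.locality.wait (with optional per-level variants) controls how long the scheduler waits at one level before falling back to the next. A small configuration example, values chosen arbitrarily:

// Locality levels, best to worst: PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")        // wait per locality level (3s is also the default)
  .set("spark.locality.wait.node", "1s")   // override only the NODE_LOCAL wait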
When the Driver receives ReviveOffers it calls makeOffers():
// Make fake resource offers on all executors
private def makeOffers() {
  // Filter out executors under killing
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  launchTasks(scheduler.resourceOffers(workOffers))
}
In the launchTasks method, the Driver sends a LaunchTask message to the ExecutorBackend to launch the task:
// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
for (task <- tasks.flatten) {
val serializedTask = ser.serialize(task)
if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
try {
var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
"spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
"spark.akka.frameSize or using broadcast variables for large values."
msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
AkkaUtils.reservedSizeBytes)
taskSetMgr.abort(msg)
} catch {
case e: Exception => logError("Exception in error callback", e)
}
}
}
else {
val executorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
}
}
}
Here executorDataMap (a HashMap[String, ExecutorData]) is the collection in which the Driver keeps the ExecutorBackend information, wrapped in ExecutorData.
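A rough sketch of what each ExecutorData entry carries (the field list is approximate, not the exact class definition):

import org.apache.spark.rpc.RpcEndpointRef

// executorEndpoint is the RpcEndpointRef behind executorData.executorEndpoint.send(LaunchTask(...))
// above; freeCores is decremented by scheduler.CPUS_PER_TASK each time a task is launched there.
class ExecutorDataSketch(
    val executorEndpoint: RpcEndpointRef,   // remote CoarseGrainedExecutorBackend endpoint
    val executorHost: String,
    var freeCores: Int,
    val totalCores: Int)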
When the ExecutorBackend receives the LaunchTask message, the task actually begins to execute. In CoarseGrainedExecutorBackend.scala:
case LaunchTask(data) =>
  if (executor == null) {
    logError("Received LaunchTask command but executor was null")
    System.exit(1)
  } else {
    val taskDesc = ser.deserialize[TaskDescription](data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
      taskDesc.name, taskDesc.serializedTask)
  }
Here the TaskDescription is deserialized first, and then the executor's launchTask() is called:
def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer): Unit = {
  val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
    serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}
TaskRunner is in fact a Runnable object, executed on the Executor's thread pool.
Summary:
When the SparkContext is instantiated it calls createTaskScheduler to create the TaskSchedulerImpl and the SparkDeploySchedulerBackend; the TaskSchedulerImpl holds that SchedulerBackend, i.e. the SparkDeploySchedulerBackend.
When TaskSchedulerImpl's start method is called, it calls SparkDeploySchedulerBackend's start method. SparkDeploySchedulerBackend extends CoarseGrainedSchedulerBackend, and CoarseGrainedSchedulerBackend instantiates the DriverEndpoint when it starts. SparkDeploySchedulerBackend's start method creates an AppClient object and calls its start method, which instantiates a ClientEndpoint; the Command passed in when the ClientEndpoint is created names the entry class of the Executor to be launched for this application. The ClientEndpoint then starts and registers the current application with the Master via registerWithMaster/tryRegisterAllMasters.
After receiving the registration, the Master generates an application ID for the program and allocates compute resources via schedule. The Master then sends messages to the Workers; when a Worker allocates resources for the application it first creates an ExecutorRunner, and the ExecutorRunner launches another JVM process whose main class is the one specified in the Command passed in when the ClientEndpoint was created, i.e. CoarseGrainedExecutorBackend. When CoarseGrainedExecutorBackend is instantiated in that main method, its onStart callback sends RegisterExecutor to the DriverEndpoint to register the current ExecutorBackend.
The DriverEndpoint receives the registration and stores it in the SparkDeploySchedulerBackend instance; at that point the SparkDeploySchedulerBackend knows all the compute resources the application owns. Since the TaskSchedulerImpl holds the SchedulerBackend, the TaskScheduler runs Tasks on the compute resources that the SparkDeploySchedulerBackend owns.