Master Scheduling
In the previous sections we repeatedly saw the Master call schedule() after receiving messages from other components. What does schedule() actually do? This section walks through it.
/**
 * Schedule the currently available resources among waiting apps. This method
 * is called every time a new app joins or resource availability changes.
 * Each scheduling pass handles drivers first, then applications.
 */
private def schedule(): Unit = {
  // Only a Master whose state is ALIVE performs scheduling.
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Shuffle the workers that are currently ALIVE.
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  // Number of workers currently available.
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  // Drivers are scheduled first. When does the Master schedule a driver at all?
  // Only when an application is submitted in cluster mode; in client mode the
  // driver runs on the submitting machine and needs no scheduling.
  // Iterate over the waiting drivers and hand out workers round-robin: if the
  // previous driver landed on worker 2, the next driver's search starts at worker 3.
  for (driver <- waitingDrivers.toList) {
    // The code below places this one driver.
    var launched = false
    var numWorkersVisited = 0
    // Keep looping while there are unvisited workers and this driver has not
    // yet found a suitable one.
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // curPos tracks which worker is currently being examined.
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      // If this worker has enough free memory and enough free cores...
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // ...launch the driver on it and remove the driver from the waiting queue.
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  startExecutorsOnWorkers()
}
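To see the round-robin placement in isolation, here is a minimal, self-contained sketch. The WorkerSlot and DriverReq types and the placeDrivers helper are made up for illustration; they are not Spark classes, and the sketch keeps only the cursor movement and the resource check from schedule():

import scala.util.Random

// Hypothetical, simplified stand-ins for WorkerInfo and DriverInfo.
case class WorkerSlot(id: String, var freeMem: Int, var freeCores: Int)
case class DriverReq(id: String, mem: Int, cores: Int)

/** Assign each driver to the first fitting worker, continuing the search
  * where the previous driver left off (round-robin over a shuffled list). */
def placeDrivers(workers: Seq[WorkerSlot], drivers: Seq[DriverReq]): Map[String, String] = {
  val shuffled = Random.shuffle(workers)
  val placement = scala.collection.mutable.Map[String, String]()
  var curPos = 0
  for (d <- drivers) {
    var visited = 0
    var launched = false
    while (visited < shuffled.size && !launched) {
      val w = shuffled(curPos)
      visited += 1
      if (w.freeMem >= d.mem && w.freeCores >= d.cores) {
        w.freeMem -= d.mem        // reserve the driver's resources
        w.freeCores -= d.cores
        placement(d.id) = w.id
        launched = true
      }
      curPos = (curPos + 1) % shuffled.size  // advance the cursor even on success
    }
  }
  placement.toMap
}

Note that curPos advances even when a driver is placed successfully, so the next driver starts its search at the following worker rather than re-examining the same one.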
Worker Launches the Driver
Let's look at how a Driver is assigned to a specific worker to run. The following code is in Master:
// Launch a driver on a given worker.
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // Record the driver on the worker. This updates the worker's bookkeeping,
  // including its used memory and used core counts.
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // Send a LaunchDriver message so the worker starts the driver,
  // then mark the driver as RUNNING.
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING
}
Next, how the worker handles the LaunchDriver message from the Master:
// Handle the request to launch a driver.
case LaunchDriver(driverId, driverDesc) =>
  logInfo(s"Asked to launch driver $driverId")
  // Create a DriverRunner to manage the execution of one driver, including
  // restarting it automatically when supervision is enabled.
  // Note how workDir is derived:
  //   workDir = Option(workDirPath).map(new File(_)).getOrElse(new File(sparkHome, "work"))
  // If workDirPath is set, that path is used; otherwise the worker's working
  // directory defaults to SPARK_HOME/work.
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  // Cache the runner in the in-memory drivers map, keyed by driverId.
  drivers(driverId) = driver
  // Start the runner's thread.
  driver.start()
  // Update the worker's core and memory usage.
  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem
Now let's see how the DriverRunner starts:
/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  // Create a thread; inside it, a separate process is launched to run the driver.
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      var shutdownHook: AnyRef = null
      try {
        shutdownHook = ShutdownHookManager.addShutdownHook { () =>
          logInfo(s"Worker shutting down, killing driver $driverId")
          kill()
        }

        // Prepare the driver's jars and run the driver.
        val exitCode = prepareAndRunDriver()

        // Set the final state depending on whether the process was forcibly
        // killed and on its exit code.
        finalState = if (exitCode == 0) {
          Some(DriverState.FINISHED)
        } else if (killed) {
          Some(DriverState.KILLED)
        } else {
          Some(DriverState.FAILED)
        }
      } catch {
        case e: Exception =>
          kill()
          finalState = Some(DriverState.ERROR)
          finalException = Some(e)
      } finally {
        if (shutdownHook != null) {
          ShutdownHookManager.removeShutdownHook(shutdownHook)
        }
      }

      // Notify the worker of the driver's final state and any exception.
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}
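The three-way outcome at the end of run() is a small pure decision; factoring it out makes the precedence easy to see and test. The DriverOutcome object and the finalStateFor name below are ours for illustration, not Spark's:

// Hypothetical mirror of the relevant DriverState values, kept self-contained.
object DriverOutcome extends Enumeration {
  val FINISHED, KILLED, FAILED = Value
}

def finalStateFor(exitCode: Int, killed: Boolean): DriverOutcome.Value =
  if (exitCode == 0) DriverOutcome.FINISHED // a clean exit wins even if a kill raced it
  else if (killed) DriverOutcome.KILLED     // non-zero exit caused by our kill()
  else DriverOutcome.FAILED                 // non-zero exit the driver produced itself

// finalStateFor(0, killed = true)  => FINISHED
// finalStateFor(1, killed = true)  => KILLED
// finalStateFor(1, killed = false) => FAILED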
Next, the preparation that happens before the driver process is actually run:
private[worker] def prepareAndRunDriver(): Int = {
  // Create the local working directory: driverDir = workDir/driverId.
  val driverDir = createWorkingDirectory()
  // Download the submitted jar into the working directory. The application's
  // jar is downloaded only once per physical machine.
  val localJarFilename = downloadUserJar(driverDir)

  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }

  // Build the process from the driver's launch command and resource requirements.
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)

  // Run the driver.
  runDriver(builder, driverDir, driverDesc.supervise)
}
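substituteVariables is simply a per-argument rewrite applied to the submitted command. A standalone sketch with made-up values shows the effect:

val workerUrl = "spark://Worker@host:7078"                // assumed example value
val localJarFilename = "/opt/spark/work/driver-0/app.jar" // assumed example value

def substituteVariables(argument: String): String = argument match {
  case "{{WORKER_URL}}" => workerUrl
  case "{{USER_JAR}}" => localJarFilename
  case other => other
}

// Placeholder arguments carried in the driver's Command are expanded one by one:
val rawArgs = Seq("--worker-url", "{{WORKER_URL}}", "--jar", "{{USER_JAR}}")
val expanded = rawArgs.map(substituteVariables)
// => Seq("--worker-url", "spark://Worker@host:7078",
//        "--jar", "/opt/spark/work/driver-0/app.jar")

With the command built, runDriver launches the actual driver process: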
private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  builder.directory(baseDir)
  def initialize(process: Process): Unit = {
    // Redirect stdout and stderr to files.
    val stdout = new File(baseDir, "stdout")
    CommandUtils.redirectStream(process.getInputStream, stdout)

    val stderr = new File(baseDir, "stderr")
    val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
    val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
    Files.append(header, stderr, StandardCharsets.UTF_8)
    CommandUtils.redirectStream(process.getErrorStream, stderr)
  }
  // Create a process to run the driver. If the application was submitted with
  // the supervise flag, the driver is relaunched whenever it fails.
  runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
}
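runCommandWithRetry itself is not shown here; the sketch below captures only the supervise semantics discussed above, namely rerunning the process on a non-zero exit until it succeeds. The capped-backoff policy is our simplification, not Spark's exact retry logic:

import scala.sys.process._

/** Simplified supervise loop: rerun a command while it keeps failing.
  * Spark's real implementation also tracks kill requests and resets its
  * retry window after long-running attempts; this sketch only backs off. */
def runWithRetry(cmd: Seq[String], supervise: Boolean): Int = {
  var waitSeconds = 1
  var exitCode = Process(cmd).!   // run and block until the process exits
  while (supervise && exitCode != 0) {
    Thread.sleep(waitSeconds * 1000L)
    waitSeconds = math.min(waitSeconds * 2, 60)  // capped exponential backoff
    exitCode = Process(cmd).!
  }
  exitCode
}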
Let's briefly summarize the driver scheduling process (only applications running in cluster mode need the Master to schedule the Driver):
- The Master shuffles the currently alive Workers.
- It picks a Worker for each Driver in round-robin fashion.
- It sends a LaunchDriver message to the chosen Worker, passing along the Driver's description.
- On receiving the message, the Worker creates a DriverRunner to manage the Driver's execution. Calling its start() method spawns a thread that creates the Driver's working directory on that Worker, downloads the required jar file into it, and then launches a process to run the Driver program.
Master Schedules the Application
After scheduling drivers, the Master goes on to schedule executors.
/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler: serve the first app in
  // the queue, then the second, and so on. Iterate over each application.
  for (app <- waitingApps) {
    // Cores per executor; if not configured, each executor gets a single core.
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    // If the app's remaining cores are not enough for even one executor,
    // move on to the next application; otherwise start allocating executors.
    if (app.coresLeft >= coresPerExecutor) {
      // Filter out workers that lack the resources (both cores and memory)
      // for a single executor of this app, then sort the remaining workers
      // by free cores in descending order.
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= coresPerExecutor)
        .sortBy(_.coresFree).reverse
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

      // Now that we've decided how many cores to allocate on each worker,
      // let's allocate them. For every worker that received cores, hand those
      // cores over; the executor count is derived from cores-per-executor.
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
      }
    }
  }
}
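The filter-and-sort step is easiest to see on toy data. A sketch with a hypothetical W case class standing in for WorkerInfo:

case class W(id: String, alive: Boolean, memFree: Int, coresFree: Int)

val memoryPerExecutorMB = 1024   // assumed app settings
val coresPerExecutor = 2

val workers = Seq(
  W("w1", alive = true,  memFree = 4096, coresFree = 8),
  W("w2", alive = true,  memFree = 512,  coresFree = 8),   // not enough memory
  W("w3", alive = false, memFree = 8192, coresFree = 16),  // dead
  W("w4", alive = true,  memFree = 2048, coresFree = 4))

val usableWorkers = workers
  .filter(_.alive)
  .filter(w => w.memFree >= memoryPerExecutorMB && w.coresFree >= coresPerExecutor)
  .sortBy(_.coresFree).reverse
// => Seq(W(w1, ..., 8 cores), W(w4, ..., 4 cores)); w2 and w3 are filtered out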
Next, how executors belonging to the same Application are distributed across the workers:
private def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
  // Cores per executor, if configured.
  val coresPerExecutor = app.desc.coresPerExecutor
  // Minimum cores per executor: the configured value, or 1 if unset.
  val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
  // Whether cores-per-executor was configured decides the strategy. If it was
  // not set, we fall into oneExecutorPerWorker mode: each worker gets at most
  // one executor for this app, and that executor's core count keeps growing.
  val oneExecutorPerWorker = coresPerExecutor.isEmpty
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  // Number of usable workers.
  val numUsable = usableWorkers.length
  // The result: how many cores each usable worker is assigned.
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  // How many new executors are assigned on each usable worker.
  val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
  // Cores still to assign for this app: the smaller of the app's remaining
  // cores and the total free cores across all usable workers.
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

  /** Return whether the specified worker can launch an executor for this app. */
  def canLaunchExecutor(pos: Int): Boolean = {
    // Does the app still have enough unassigned cores for one executor?
    val keepScheduling = coresToAssign >= minCoresPerExecutor
    // Does this worker still have enough free cores for one executor?
    val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

    // If we allow multiple executors per worker, then we can always launch new executors.
    // Otherwise, if there is already an executor on this worker, just give it more cores.
    // We are launching a *new* executor if multiple executors per worker are
    // allowed (cores-per-executor was configured), or if this worker has not
    // yet been given an executor for this app.
    val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
    // In both modes the first assignment on a worker creates an executor.
    // Outside oneExecutorPerWorker mode every assignment creates a new
    // executor, so memory is checked every time; in oneExecutorPerWorker mode
    // only the first assignment on a worker checks memory.
    if (launchingNewExecutor) {
      // Does this worker have enough free memory for one more executor?
      // assignedExecutors(pos) counts the executors already assigned to this
      // worker in the current scheduling pass (decided but not yet launched).
      val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
      val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
      val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
      keepScheduling && enoughCores && enoughMemory && underLimit
    } else {
      // We're adding cores to an existing executor, so no need
      // to check memory and executor limits.
      keepScheduling && enoughCores
    }
  }

  // Keep launching executors until no more workers can accommodate any
  // more executors, or if we have reached this application's limits.
  // Initial filtering pass:
  var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
  // As long as free workers remain, keep assigning. The local keepScheduling
  // flag decides whether one pass grants a worker a single slice and moves on,
  // or keeps granting on that worker until it cannot take any more.
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      while (keepScheduling && canLaunchExecutor(pos)) {
        coresToAssign -= minCoresPerExecutor
        assignedCores(pos) += minCoresPerExecutor

        // If we are launching one executor per worker, then every iteration assigns 1 core
        // to the executor. Otherwise, every iteration assigns cores to a new executor.
        // In oneExecutorPerWorker mode, later grants on the same worker simply
        // feed more cores to the single existing executor.
        if (oneExecutorPerWorker) {
          assignedExecutors(pos) = 1
        } else {
          assignedExecutors(pos) += 1
        }

        // Spreading out an application means spreading out its executors across as
        // many workers as possible. If we are not spreading out, then we should keep
        // scheduling executors on this worker until we use all of its resources.
        // Otherwise, just move on to the next worker.
        if (spreadOutApps) {
          keepScheduling = false
        }
      }
    }
    freeWorkers = freeWorkers.filter(canLaunchExecutor)
  }
  assignedCores
}
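To make the two strategies concrete, here is a self-contained simulation of the assignment loop above, stripped of the memory and executor-limit checks (only the keepScheduling and enoughCores conditions survive in canLaunch); the assign helper and its inputs are made up for illustration:

/** Simplified re-enactment of the core-assignment loop. */
def assign(coresFree: Array[Int], coresLeft: Int,
           coresPerExecutor: Option[Int], spreadOut: Boolean): Array[Int] = {
  val minCores = coresPerExecutor.getOrElse(1)
  val assigned = new Array[Int](coresFree.length)
  var toAssign = math.min(coresLeft, coresFree.sum)
  def canLaunch(pos: Int) =
    toAssign >= minCores && coresFree(pos) - assigned(pos) >= minCores
  var free = coresFree.indices.filter(canLaunch)
  while (free.nonEmpty) {
    free.foreach { pos =>
      var keep = true
      while (keep && canLaunch(pos)) {
        toAssign -= minCores
        assigned(pos) += minCores
        if (spreadOut) keep = false   // one grant, then move to the next worker
      }
    }
    free = free.filter(canLaunch)
  }
  assigned
}

// 3 workers with 8 free cores each; the app wants 12 cores, 2 per executor:
assign(Array(8, 8, 8), 12, Some(2), spreadOut = true)   // => Array(4, 4, 4)
assign(Array(8, 8, 8), 12, Some(2), spreadOut = false)  // => Array(8, 4, 0)

With spreadOut enabled, each round grants one slice per worker and the cores land evenly; without it, the first worker is filled completely before the next one is touched. Back in the Master, allocateWorkerResourceToExecutors turns each worker's core count into concrete executors: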
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If cores-per-executor was configured, split the assigned cores evenly
  // into executors of that size; otherwise (oneExecutorPerWorker mode) give
  // all assigned cores to a single executor on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    // Update the application's bookkeeping.
    val exec = app.addExecutor(worker, coresToAssign)
    // Launch an executor.
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
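A quick worked example of that split (values made up):

val assignedCores = 6

// coresPerExecutor set to 2: three executors with 2 cores each.
Some(2).map(assignedCores / _).getOrElse(1)  // numExecutors = 3
Some(2).getOrElse(assignedCores)             // coresToAssign = 2

// coresPerExecutor unset (oneExecutorPerWorker): one executor with all 6 cores.
(None: Option[Int]).map(assignedCores / _).getOrElse(1)  // numExecutors = 1
(None: Option[Int]).getOrElse(assignedCores)             // coresToAssign = 6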
The following method carries out the plan computed above by sending the worker a message to launch each executor:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  // Update the worker's cached state.
  worker.addExecutor(exec)
  // Tell the worker to launch the executor.
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Tell the application's driver that an executor was successfully added.
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
To summarize how an Application is scheduled:
- Applications are scheduled FIFO; for each Application the following steps run.
- If the Application's remaining cores are enough for at least one more Executor, scheduling proceeds; otherwise this Application is done.
- Workers whose free cores or memory cannot fit a single Executor are filtered out, and the remaining Workers are sorted by free cores in descending order.
- The allocation mode depends on whether the cores per Executor were configured: if not, oneExecutorPerWorker mode is used; if so, the normal mode is used.

How it is decided whether a Worker can launch an Executor:
- First, the Worker must have enough free cores for one Executor.
- A flag, launchingNewExecutor, is then consulted. It is true when
  (1) no Executor of the current Application has been assigned to this Worker yet, or
  (2) the mode is not oneExecutorPerWorker, in which case it is always true.
  In other words, the first assignment on a Worker always launches a new Executor; after that, the flag is permanently false in oneExecutorPerWorker mode and permanently true in normal mode.
- What is the flag for? When it is true, the Worker must have enough free memory in addition to enough free cores. So in oneExecutorPerWorker mode only the first Executor assignment on a Worker reserves memory, and later assignments just add cores; in normal mode every assignment reserves memory for a new Executor.

Finally, a word on spreadOut: when it is enabled, executors are spread across as many workers as possible instead of filling one worker completely before moving on to the next.