Whenever the Master receives a RegisterWorker / RegisterApplication / ExecutorStateChanged / RequestSubmitDriver message, or finishes a master failover, it runs schedule() to hand the currently available resources (drivers, and executors on workers) to the waiting apps. In other words, it is called every time a new app arrives or resource availability changes.
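A minimal sketch of those trigger points in Master's message loop (handler bodies abbreviated and simplified; the real handlers do registration and persistence work first):
// Abbreviated: every relevant handler ends by calling schedule()
override def receive: PartialFunction[Any, Unit] = {
  case RegisterApplication(description, driver) =>
    // ... create and persist the ApplicationInfo, reply RegisteredApplication ...
    schedule()
  // The RegisterWorker / ExecutorStateChanged / RequestSubmitDriver handlers
  // finish the same way, as does completeRecovery() after a failover.
}
schedule() itself: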
private def schedule(): Unit = {
  // A standby Master never takes part in scheduling application resources
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  // `workers` holds every worker registered via registerWorker(); its type is HashSet[WorkerInfo].
  // Shuffle the alive workers into a random order first.
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    // Visit every worker in ALIVE state
    while (numWorkersVisited < numWorkersAlive && !launched) {
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      // If this worker's free memory and free CPU cores can satisfy the driver,
      // launch it here and remove it from waitingDrivers
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // With drivers launched, start the executors on the workers
  startExecutorsOnWorkers()
}
In client deploy mode (standalone client mode, and likewise yarn-client) the driver is launched locally and never registered with this Master; only drivers submitted in standalone cluster mode (via RequestSubmitDriver) land in waitingDrivers and need to be scheduled here.
Launching drivers takes strict precedence over launching executors.
The driver launch procedure (launchDriver, sketched below):
- 1. Add the driver to the chosen worker and record the worker on the driver, so the two in-memory structures reference each other;
- 2. Send a LaunchDriver message to the worker's endpoint, so the Worker actually starts the driver;
- 3. Set the driver's state to RUNNING.
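A sketch of launchDriver in Master.scala, reconstructed from the three steps above (Spark 2.4; minor details such as log wording may differ):
private def launchDriver(worker: WorkerInfo, driver: DriverInfo): Unit = {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // Step 1: cross-reference the driver and the worker in the Master's memory
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // Step 2: ask the Worker to start the driver process
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // Step 3: mark the driver as running
  driver.state = DriverState.RUNNING
}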
On receiving LaunchDriver, the Worker news up a DriverRunner, starts it, and adds the driver's CPU and memory to its running totals.
DriverRunner lives in org.apache.spark.deploy.worker. Its start() actually news a Thread and starts it:
it first registers kill() as a shutdown hook, so the driver process is killed if the worker shuts down;
it then prepares the driver's jars and runs the driver;
once the process exits, it sends a DriverStateChanged message reporting the driver's final state.
To run the driver, it builds a ProcessBuilder via CommandUtils.buildProcessBuilder(); this builder is what actually gets executed:
val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
  driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
It then redirects the process's stdout/stderr to log files and finally calls runCommandWithRetry().
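Condensed, the thread body of DriverRunner.start() is roughly the following (error handling and final-state bookkeeping elided; see the real class for the full flow):
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      // Register kill() as a shutdown hook so the driver process dies with the worker
      val shutdownHook = ShutdownHookManager.addShutdownHook { () => kill() }
      // Download the user jar, build the ProcessBuilder shown above,
      // redirect stdout/stderr, then runCommandWithRetry() until the process exits
      val exitCode = prepareAndRunDriver()
      // ... derive finalState / finalException from exitCode and kill status ...
      // Report the final state back to the Worker endpoint
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}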
startExecutorsOnWorkers()
Schedules and launches the executors on the workers.
In Spark 2.4, application scheduling lives in startExecutorsOnWorkers(), which uses a simple FIFO algorithm:
- Iterate over every app in waitingApps.
- If the cores the app still needs (coresLeft) are fewer than one executor's configured cores, skip it: no new executor cores are allocated for its coresLeft.
- Filter out the workers that still have schedulable CPU and memory, sort them by free cores in descending order; these become usableWorkers.
- Compute how many cores to assign on each usable worker (the scheduleExecutorsOnWorkers function).
- Walk the usable workers, allocate the resources, and launch the executors (the allocateWorkerResourceToExecutors function).
In this version, scheduleExecutorsOnWorkers() and allocateWorkerResourceToExecutors() read as if their names were swapped: schedule…() only computes the resource assignment, while allocate…() does the actual scheduling and launching.
app.coresLeft is defined in ApplicationInfo.scala; for example, an app that requested 12 cores and has been granted 8 has coresLeft = 4:
private[master] def coresLeft: Int = requestedCores - coresGranted
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler.
  for (app <- waitingApps) {
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    // If the cores left is less than the coresPerExecutor, the cores left will not be allocated
    if (app.coresLeft >= coresPerExecutor) {
      // Filter out workers that don't have enough resources to launch an executor
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= coresPerExecutor)
        .sortBy(_.coresFree).reverse
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
      // Now that we've decided how many cores to allocate on each worker, let's allocate them
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
      }
    }
  }
}
scheduleExecutorsOnWorkers()
Computes how executors should be scheduled onto the workers.
scheduleExecutorsOnWorkers() returns the assignedCores array, which records how many CPU cores to assign on each worker.
The nested function canLaunchExecutor() decides whether the current worker can launch one more executor, i.e. whether new CPU resources may be assigned there.
The assignedExecutors array records how many new executors will be started on each worker; when coresPerExecutor is not specified (the default), only one executor is started per worker.
- If each worker may host only one executor for this app, cores are assigned to that executor one at a time.
- With spreadOutApps (the default), the assignment tries to use as many of the cluster's workers as possible: each round adds minCoresPerExecutor cores on a worker, then moves on to the next.
- Without spreadOutApps, cores keep being added on the current worker, draining its freeCores, so as many cores as possible are taken from that machine before moving on.
Reference: https://blog.csdn.net/snail_gesture/article/details/50808239
private def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
  val coresPerExecutor = app.desc.coresPerExecutor
  // One core per executor by default (when spark.executor.cores is unset)
  val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
  val oneExecutorPerWorker = coresPerExecutor.isEmpty
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  val numUsable = usableWorkers.length
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
  // Total cores still to assign: min(cores the app needs, sum of free cores on all usable
  // workers). If the workers cannot fully satisfy the app, assign what is available for now.
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

  /** Return whether the specified worker can launch an executor for this app. */
  def canLaunchExecutor(pos: Int): Boolean = {
    val keepScheduling = coresToAssign >= minCoresPerExecutor
    val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor
    // If multiple executors per worker are allowed (coresPerExecutor is specified), we can
    // always launch a new executor. Otherwise, if this worker already has its executor,
    // we only give that executor more cores.
    val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
    if (launchingNewExecutor) {
      // assignedMemory = executors already planned on this worker * memory per executor
      val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
      // Free memory minus memory already earmarked must still cover one executor,
      // i.e. there is enough memory to bring up at least one more executor
      val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
      val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
      keepScheduling && enoughCores && enoughMemory && underLimit
    } else {
      // We're adding cores to an existing executor, so no need
      // to check memory and executor limits
      keepScheduling && enoughCores
    }
  }

  // Keep launching executors until no more workers can accommodate any
  // more executors, or if we have reached this application's limits
  var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      // In the default spreadOut mode, keepScheduling is reset to false after one pass,
      // so each position receives minCoresPerExecutor cores per round.
      while (keepScheduling && canLaunchExecutor(pos)) {
        // Each iteration gives this worker minCoresPerExecutor more cores:
        // 1 when spark.executor.cores is unset, otherwise the configured
        // coresPerExecutor at a time.
        coresToAssign -= minCoresPerExecutor
        assignedCores(pos) += minCoresPerExecutor
        // If we launch only one executor per worker (spark.executor.cores unset),
        // assignedExecutors stays at 1; otherwise every iteration plans one more
        // executor on this worker.
        if (oneExecutorPerWorker) {
          assignedExecutors(pos) = 1
        } else {
          assignedExecutors(pos) += 1
        }
        // Spreading out an application means spreading out its executors across as
        // many workers as possible. If we are not spreading out, then we should keep
        // scheduling executors on this worker until we use all of its resources.
        // Otherwise, just move on to the next worker.
        if (spreadOutApps) {
          keepScheduling = false
        }
      }
    }
    freeWorkers = freeWorkers.filter(canLaunchExecutor)
  }
  assignedCores
}
canLaunchExecutor() decides whether to assign CPU for a new or existing executor mainly from keepScheduling and enoughCores.
keepScheduling is determined jointly by coresToAssign (the cores the app has not yet been assigned) and minCoresPerExecutor.
spreadOut
scheduleExecutorsOnWorkers takes the spreadOutApps parameter when computing the assignment. Driven by the spark.deploy.spreadOut config, it decides whether apps are scheduled, i.e. compute resources are assigned, in the spread-out style.
spreadOut defaults to true: each app's executors are spread round-robin across all available nodes, instead of filling one node and only then spilling onto the next.
When spreadOutApps is true, the cursor moves to the next worker after every minCoresPerExecutor cores assigned; otherwise assignedCores keeps accumulating on the current worker as long as it qualifies, and the cursor moves only once that worker runs out of resources. The toy simulation below illustrates the difference.
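A self-contained toy simulation of the two strategies (everything here is local to the sketch, not Spark API): three workers with 8 free cores each, an app that still needs 12 cores, and minCoresPerExecutor = 2:
object SpreadOutDemo {
  // Each round visits every worker that can still take `step` cores; with
  // spreadOut we hop to the next worker after one step, without it we drain
  // the current worker first.
  def assign(freeCores: Array[Int], coresNeeded: Int, step: Int,
             spreadOut: Boolean): Array[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    var left = coresNeeded
    def canTake(pos: Int): Boolean =
      left >= step && freeCores(pos) - assigned(pos) >= step
    var candidates = freeCores.indices.filter(canTake)
    while (candidates.nonEmpty) {
      candidates.foreach { pos =>
        var keep = true
        while (keep && canTake(pos)) {
          left -= step
          assigned(pos) += step
          if (spreadOut) keep = false // move on to the next worker
        }
      }
      candidates = candidates.filter(canTake)
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    println(assign(Array(8, 8, 8), 12, 2, spreadOut = true).mkString(","))  // 4,4,4
    println(assign(Array(8, 8, 8), 12, 2, spreadOut = false).mkString(",")) // 8,4,0
  }
}
With spreadOut the 12 cores land as 4,4,4 across the three workers; without it the first worker is packed full (8,4,0) before the next one is touched.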
allocateWorkerResourceToExecutors()
Walk all usableWorkers and call allocateWorkerResourceToExecutors() on each, allocating the cores and actually scheduling the current app:
for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
  allocateWorkerResourceToExecutors(
    app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
}
For one scheduling pass of an application:
- If app.desc specifies the cores per executor (coresPerExecutor), the number of executors is the computed assignedCores divided by coresPerExecutor; otherwise a single executor is launched, taking all the assignedCores on that usable worker.
- Loop over the executor count: each iteration records the worker and core count for the app, launches one executor, and sets the app's state to RUNNING.
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec = app.addExecutor(worker, coresToAssign)
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
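A tiny worked example of that split (hypothetical numbers; split mirrors the two val lines above):
object AllocDemo {
  // numExecutors / coresToAssign, computed exactly as in the method above
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): (Int, Int) = {
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    (numExecutors, coresToAssign)
  }
  def main(args: Array[String]): Unit = {
    println(split(8, Some(2))) // (4,2): four executors, 2 cores each
    println(split(8, None))    // (1,8): one executor grabbing all 8 cores
  }
}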
So scheduling an app really means assigning resources to and launching all of its executors, spread evenly over the available workers.
launchExecutor() adds the ExecutorDesc to the worker,
sends the worker a LaunchExecutor message,
and finally sends an ExecutorAdded message to the app's driver.
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}