Recap
In the previous section we took a quick look at the application registration process; today we continue with one of Spark's core designs: the resource allocation implementation.
Master Handles the Registration Message
After receiving a message, the Master's receive() method pattern-matches on the message type; the RegisterApplication case below is the application-registration path.
case RegisterApplication(description, driver) =>
// First check whether this Master node is currently able to serve requests
if (state == RecoveryState.STANDBY) {
// ignore, don't send response
} else {
logInfo("Registering app " + description.name)
val app = createApplication(description, driver)
registerApplication(app)
logInfo("Registered app " + description.name + " with ID " + app.id)
// persistenceEngine persists application state, so that failed applications can be recovered and re-run
persistenceEngine.addApplication(app)
// Send the client a message confirming that registration is complete
driver.send(RegisteredApplication(app.id, self))
// Trigger resource allocation
schedule()
}
Let's look at the logic of schedule() in detail.
private def schedule(): Unit = {
if (state != RecoveryState.ALIVE) {
return
}
// Collect all alive workers and shuffle their order
val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
val numWorkersAlive = shuffledAliveWorkers.size
var curPos = 0
// waitingDrivers is a list, since multiple drivers may be waiting. Allocate resources to each waiting driver in turn.
for (driver <- waitingDrivers.toList) {
// Iterate over a copy of waitingDrivers and assign workers to each waiting driver
// in a round-robin fashion, until every alive worker has been visited for this driver.
var launched = false
var numWorkersVisited = 0
while (numWorkersVisited < numWorkersAlive && !launched) {
val worker = shuffledAliveWorkers(curPos)
numWorkersVisited += 1
// If the current worker cannot satisfy the driver's memory and core requirements,
// try the next one; launch the driver on the first worker that fits
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
// Send the launch message to the chosen worker
launchDriver(worker, driver)
waitingDrivers -= driver
launched = true
}
curPos = (curPos + 1) % numWorkersAlive
}
}
startExecutorsOnWorkers()
}
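To make the round-robin traversal concrete, here is a small runnable sketch of the same loop. SimpleWorker and SimpleDriver are made-up stand-ins for WorkerInfo and DriverInfo; only the traversal logic mirrors schedule() above.
import scala.util.Random

// Hypothetical stand-ins for WorkerInfo / DriverInfo, for illustration only.
case class SimpleWorker(id: String, var coresFree: Int, var memoryFree: Int)
case class SimpleDriver(id: String, cores: Int, mem: Int)

val workers = Random.shuffle(Seq(
  SimpleWorker("w1", 4, 4096), SimpleWorker("w2", 2, 2048), SimpleWorker("w3", 8, 8192)))
val waitingDrivers = Seq(SimpleDriver("d1", 4, 4096), SimpleDriver("d2", 2, 2048))

var curPos = 0
for (driver <- waitingDrivers) {
  var launched = false
  var visited = 0
  // curPos is kept across drivers, so consecutive drivers tend to land on different workers
  while (visited < workers.size && !launched) {
    val w = workers(curPos)
    visited += 1
    if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
      // in the real code, launchDriver() updates this bookkeeping on the WorkerInfo
      w.memoryFree -= driver.mem
      w.coresFree -= driver.cores
      println(s"driver ${driver.id} -> worker ${w.id}")
      launched = true
    }
    curPos = (curPos + 1) % workers.size
  }
}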
Resource Allocation
Every application corresponds to one driver; after the driver starts, it requests resources from the Master and dispatches tasks. Next we look at startExecutorsOnWorkers(), which iterates over all waiting applications and allocates resources to each of them in turn.
private def startExecutorsOnWorkers(): Unit = {
// Right now this is a very simple FIFO scheduler
for (app <- waitingApps if app.coresLeft > 0) {
val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
// Filter out workers that lack the resources to launch an executor
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
.filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
worker.coresFree >= coresPerExecutor.getOrElse(1))
.sortBy(_.coresFree).reverse
// spreadOutApps chooses between two placement strategies: spread executors
// across workers (the default) or pack them onto as few workers as possible
val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
// With the per-worker core counts decided, perform the actual allocation
for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
allocateWorkerResourceToExecutors(
app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
}
}
}
The scheduleExecutorsOnWorkers method is fairly long, so we won't list it in full. Its while loop keeps scanning the group of usable workers: whenever a worker can host one more executor, it subtracts one executor's worth of cores from the cores still to be assigned and adds the same amount to the corresponding slot of the assigned-cores array, which is index-aligned with the usable-workers array. If a worker is allowed to host multiple executors, every allocation round on that worker adds a new executor; otherwise the worker hosts a single executor whose core count simply grows. If spreadOutApps is true, the loop moves on to the next worker after each allocation round; otherwise it keeps allocating on the same worker until that worker can no longer fit one executor's resources. The helper canLaunchExecutor() encapsulates the check of whether a given worker can launch one more executor for this app.
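Still, the core of that loop is short enough to sketch. The following is a condensed reimplementation under some simplifying assumptions: memory and executor-count limits are elided, minCores falls back to 1 when coresPerExecutor is unset, and scheduleSketch is a made-up name, not Spark's method.
// Condensed sketch of scheduleExecutorsOnWorkers: returns how many cores each
// usable worker receives. Memory and executor-limit checks are omitted.
def scheduleSketch(
    coresToGrant: Int,                 // e.g. app.coresLeft
    coresPerExecutor: Option[Int],
    workerCoresFree: Array[Int],       // index-aligned with usableWorkers
    spreadOut: Boolean): Array[Int] = {
  val minCores = coresPerExecutor.getOrElse(1)
  val assigned = new Array[Int](workerCoresFree.length)
  var coresLeft = coresToGrant

  // Stand-in for canLaunchExecutor(): can worker `pos` take one more executor?
  def canLaunch(pos: Int): Boolean =
    coresLeft >= minCores && workerCoresFree(pos) - assigned(pos) >= minCores

  var freeWorkers = workerCoresFree.indices.filter(canLaunch)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      while (keepScheduling && canLaunch(pos)) {
        coresLeft -= minCores
        assigned(pos) += minCores      // one more executor's worth of cores
        // spreadOut: move to the next worker after one grant; otherwise keep
        // packing this worker until it can no longer fit an executor
        if (spreadOut) keepScheduling = false
      }
    }
    freeWorkers = freeWorkers.filter(canLaunch)
  }
  assigned
}
For example, with workerCoresFree = Array(8, 8, 8), coresPerExecutor = Some(2) and 6 cores to grant, spreadOut = true yields Array(2, 2, 2), while spreadOut = false yields Array(6, 0, 0).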
// spreadOutApps can be changed via the spark.deploy.spreadOut configuration property
private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)
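Since the property is read from the Master's own configuration, one way to change it is through SPARK_MASTER_OPTS in conf/spark-env.sh on the Master node (the value below is just an illustration):
# conf/spark-env.sh: pack executors onto as few workers as possible
SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"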
Next, allocateWorkerResourceToExecutors first creates each executor and adds it to the app's executor collection, then sends messages to the corresponding worker and driver.
private def allocateWorkerResourceToExecutors(
app: ApplicationInfo,
assignedCores: Int,
coresPerExecutor: Option[Int],
worker: WorkerInfo): Unit = {
val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
for (i <- 1 to numExecutors) {
// Create a new ExecutorDesc and put it into executors (a Map)
val exec = app.addExecutor(worker, coresToAssign)
// Send the launch message
launchExecutor(worker, exec)
app.state = ApplicationState.RUNNING
}
}
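To sanity-check the arithmetic on numExecutors and coresToAssign, here is a standalone sketch; split is a made-up helper, not Spark code.
// How assignedCores is carved into executors, mirroring the two lines above.
def split(assignedCores: Int, coresPerExecutor: Option[Int]): Seq[Int] = {
  val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  Seq.fill(numExecutors)(coresToAssign)
}
split(6, Some(2))  // Seq(2, 2, 2): three 2-core executors on this worker
split(6, None)     // Seq(6): a single executor takes all the assigned cores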
launchExecutor() sends executor-launch messages to the worker and to the driver respectively. With that, the executor resource-allocation flow is complete.
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
worker.addExecutor(exec)
worker.endpoint.send(LaunchExecutor(masterUrl,
exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
exec.application.driver.send(
ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
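For reference, the two messages sent here have roughly the following shapes, simplified from org.apache.spark.deploy.DeployMessages; the exact field lists vary across Spark versions, and ApplicationDescription is stubbed below just so the sketch compiles.
// Simplified message shapes (not the authoritative definitions).
class ApplicationDescription  // stub for the real Spark class
case class LaunchExecutor(
    masterUrl: String, appId: String, execId: Int,
    appDesc: ApplicationDescription, cores: Int, memory: Int)
case class ExecutorAdded(
    id: Int, workerId: String, hostPort: String, cores: Int, memory: Int)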
Worker Launches the Executor
Now let's take a quick look at what the worker does when it receives this message. First, find the place in the Worker class that handles the LaunchExecutor message.
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
There is quite a bit of handling logic here, so we'll just pick out the highlights.
...
// First create an ExecutorRunner; its constructor carries a lot of important
// configuration, and the parameter names are largely self-explanatory
val manager = new ExecutorRunner(
appId,
execId,
appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
cores_,
memory_,
self,
workerId,
host,
webUi.boundPort,
publicAddress,
sparkHome,
executorDir,
workerUri,
conf,
appLocalDirs, ExecutorState.RUNNING)
// Track the ExecutorRunner in the executors map, keyed by "appId/execId"
executors(appId + "/" + execId) = manager
// Start this executor
manager.start()
coresUsed += cores_
memoryUsed += memory_
// Report the state change to the master
sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
Next let's look at start(), which contains the details of launching the executor.
private[worker] def start() {
// First create a dedicated thread
workerThread = new Thread("ExecutorRunner for " + fullId) {
// fetchAndRunExecutor() holds the actual launch logic
override def run() { fetchAndRunExecutor() }
}
workerThread.start()
// Shutdown hook that kills actors on shutdown.
shutdownHook = ShutdownHookManager.addShutdownHook { () =>
// It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
// be `ExecutorState.RUNNING`. In this case, we should set `state` to `FAILED`.
if (state == ExecutorState.RUNNING) {
state = ExecutorState.FAILED
}
killProcess(Some("Worker shutting down")) }
}
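As an aside, the launch-and-clean-up pattern in start() can be reproduced in miniature with plain JDK APIs; this is a sketch with made-up state strings, using Runtime.addShutdownHook instead of Spark's ShutdownHookManager.
// Minimal sketch of start(): run the blocking work on a dedicated thread and
// register a hook so an abrupt shutdown marks a still-running executor FAILED.
object RunnerSketch {
  @volatile private var state = "RUNNING"

  def main(args: Array[String]): Unit = {
    val workerThread = new Thread("ExecutorRunner-sketch") {
      override def run(): Unit = {
        Thread.sleep(1000)   // stand-in for fetchAndRunExecutor()
        state = "EXITED"
      }
    }
    workerThread.start()
    Runtime.getRuntime.addShutdownHook(new Thread(() => {
      if (state == "RUNNING") state = "FAILED"   // same check as above
    }))
    workerThread.join()
  }
}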
Why does start() spin up a separate thread? To avoid blocking: a single worker can run multiple executors. The real launch logic lives in fetchAndRunExecutor(), where builder.start() spawns a new process.
try {
// Launch the process
val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
memory, sparkHome.getAbsolutePath, substituteVariables)
val command = builder.command()
val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
logInfo(s"Launch command: $formattedCommand")
builder.directory(executorDir)
builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
// In case we are running this from within the Spark Shell, avoid creating a "scala"
// parent process for the executor command
builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")
// Add webUI log urls
val baseUrl =
if (conf.getBoolean("spark.ui.reverseProxy", false)) {
s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
} else {
s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
}
builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")
// Use the ProcessBuilder to spawn the new process
process = builder.start()
val header = "Spark Executor Command: %s\n%s\n\n".format(
formattedCommand, "=" * 40)
// Redirect its stdout and stderr to files
val stdout = new File(executorDir, "stdout")
stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
val stderr = new File(executorDir, "stderr")
Files.write(header, stderr, StandardCharsets.UTF_8)
stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
// Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
// or with nonzero exit code
val exitCode = process.waitFor()
state = ExecutorState.EXITED
val message = "Command exited with code " + exitCode
worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
}
That completes our walk through Spark's resource allocation and the worker-side launch of executors; corrections are welcome where I've fallen short. In the next section we'll look at the RDD execution flow.