Note: This article follows the previous post [[spark] Driver startup process in standalone cluster mode] and continues with how the Driver requests resources after it has started...
Overview:
In standalone mode, when the cluster starts, each Worker registers with the Master, so the Master is aware of and can manage the whole cluster;
With the help of ZooKeeper, the Master can achieve high availability in a fairly simple way;
The application interacts with the cluster through SparkContext; when the SparkContext is created, the Application is registered and the Master allocates Executors for it;
After the application creates RDDs and applies a series of transformations on them, an action is triggered; the DAGScheduler splits the DAG into Stages and converts each Stage into a TaskSet, which is handed to TaskSchedulerImpl;
TaskSchedulerImpl then goes through the scheduler backend's reviveOffers (SparkDeploySchedulerBackend, now StandaloneSchedulerBackend) and finally sends a LaunchTask message to the ExecutorBackend;
After receiving the message, the ExecutorBackend launches the Task, and the computation starts running on the cluster.
Steps:
After DriverWrapper is launched on a Worker node, its main method begins the actual work, so let's look directly at DriverWrapper#main().
// Main method of DriverWrapper
def main(args: Array[String]) {
  args.toList match {
    // mainClass below is the application class we actually submitted
    case workerUrl :: userJar :: mainClass :: extraArgs =>
      ......
      // SecurityManager: Spark's implementation of authentication and authorization
      val rpcEnv = RpcEnv.create("Driver", host, port, conf, new SecurityManager(conf))
      // WorkerWatcher monitors the worker process state
      rpcEnv.setupEndpoint("workerWatcher", new WorkerWatcher(rpcEnv, workerUrl))
      ......
      Thread.currentThread.setContextClassLoader(loader)
      setupDependencies(loader, userJar)
      // Load the main class of the submitted application
      val clazz = Utils.classForName(mainClass)
      // Get the main method of the submitted application
      val mainMethod = clazz.getMethod("main", classOf[Array[String]])
      /**
       * Invoke the main method of the submitted application.
       * Starting the application here first creates SparkConf and SparkContext;
       * the try block around line 362 of SparkContext creates the TaskScheduler (line 492).
       */
      mainMethod.invoke(null, extraArgs.toArray[String])
      rpcEnv.shutdown()
      ......
  }
}
The comments in main() matter, in particular "starting the application here first creates SparkConf and SparkContext": before the application (i.e. the client job) is started, SparkConf and SparkContext must be created. Here a simple WordCount serves as the client application.
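For reference, a minimal WordCount driver might look like the sketch below (the master URL reuses the spark://node1:7077 example from this article, and the input path is a placeholder); creating the SparkContext is the step that triggers everything described in the rest of this post.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal client application, used only for illustration
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("spark://node1:7077")
    val sc = new SparkContext(conf)        // registration with the Master happens during this constructor

    sc.textFile("hdfs:///tmp/input.txt")   // placeholder input path
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)                    // the action that triggers job submission

    sc.stop()
  }
}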
In new SparkContext(conf) a series of initialization steps run; one of the core methods is createTaskScheduler(), which creates two objects: TaskSchedulerImpl and StandaloneSchedulerBackend.
/**
 * Create the scheduler. (sched, ts) correspond to the StandaloneSchedulerBackend and TaskSchedulerImpl objects respectively.
 * master is the URL used when submitting the job, e.g. spark://node1:7077
 */
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._
  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1
  master match {
    ......
    // Standalone jobs are submitted with a "spark://" master URL
    case SPARK_REGEX(sparkUrl) =>
      // scheduler is a TaskSchedulerImpl object
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      // backend here is of type StandaloneSchedulerBackend
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      // TaskSchedulerImpl.initialize stores the backend inside the scheduler; it will be used shortly
      scheduler.initialize(backend)
      // Return the StandaloneSchedulerBackend and TaskSchedulerImpl objects
      (backend, scheduler)
    ......
    case masterUrl =>
      val cm = getClusterManager(masterUrl) match {
        case Some(clusterMgr) => clusterMgr
        case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
      }
      try {
        val scheduler = cm.createTaskScheduler(sc, masterUrl)
        val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
        cm.initialize(scheduler, backend)
        (backend, scheduler)
      ......
  }
}
SparkContext holds these two objects: StandaloneSchedulerBackend and TaskSchedulerImpl. The parent class of StandaloneSchedulerBackend is CoarseGrainedSchedulerBackend (coarse-grained resource allocation), which is one of the differences between Spark and MapReduce, since MR uses fine-grained resource allocation. TaskSchedulerImpl is an implementation of TaskScheduler; the SparkContext also instantiates a DAGScheduler, which splits the job into stages and tasks.
Note that scheduler and backend reference each other. After they are instantiated, TaskSchedulerImpl's start() method is called, and internally it calls backend.start() (the backend was stored in the scheduler right after construction via scheduler.initialize(backend)). backend.start() first calls the start() method of its parent class CoarseGrainedSchedulerBackend, which registers a DriverEndpoint with the RPC environment.
SparkContext
scheduler.initialize(backend)
(backend, scheduler)
......
// start() of the TaskSchedulerImpl object
_taskScheduler.start()
......
TaskSchedulerImpl
/**
 * TaskScheduler start
 */
override def start() {
  // Start the StandaloneSchedulerBackend
  backend.start()
  ......
}
StandaloneSchedulerBackend
// around line 115: the description of the application to be submitted
override def start() {
  /**
   * super.start() creates the Driver's endpoint, i.e. a reference to the Driver.
   * Executors will later register themselves back with CoarseGrainedSchedulerBackend,
   * the parent class of StandaloneSchedulerBackend.
   */
  super.start()
  ......
When StandaloneSchedulerBackend.start() calls super.start(), the following code in the parent class CoarseGrainedSchedulerBackend is executed:
override def start() {
  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll) {
    if (key.startsWith("spark.")) {
      properties += ((key, value))
    }
  }
  // TODO (prashant) send conf instead of properties
  /**
   * Create the DriverEndpoint, i.e. the Driver's communication endpoint, and register it with the RPC environment.
   * Executors will later register themselves back with this DriverEndpoint; its receiveAndReply method
   * keeps listening for and matching incoming messages [around line 167].
   */
  driverEndpoint = createDriverEndpointRef(properties)
}
protected def createDriverEndpointRef(
    properties: ArrayBuffer[(String, String)]): RpcEndpointRef = {
  rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))
}
StandaloneSchedulerBackend
// around line 115: the description of the application to be submitted
override def start() {
  /**
   * super.start() creates the Driver's endpoint, i.e. a reference to the Driver.
   * Executors will later register themselves back with CoarseGrainedSchedulerBackend,
   * the parent class of StandaloneSchedulerBackend.
   */
  super.start()
  ......
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  val webUrl = sc.ui.map(_.webUrl).getOrElse("")
  val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
  ......
  val appDesc: ApplicationDescription = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
  // Description of the application to be submitted
  // appDesc is wrapped here and passed into the StandaloneAppClient
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  // Start the StandaloneAppClient, which will then register the application with the Master
  client.start()
  launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
  waitForRegistration()
  launcherBackend.setState(SparkAppHandle.State.RUNNING)
}
Reading the code carefully: the method first instantiates a Command object, passes it into an ApplicationDescription, then wraps that ApplicationDescription instance into a StandaloneAppClient, and finally calls start() on the StandaloneAppClient instance.
def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  /**
   * This simply fills in the empty endpoint [AtomicReference].
   * The key point is that rpcEnv.setupEndpoint creates a ClientEndpoint,
   * and registering an endpoint always triggers ClientEndpoint.onStart.
   */
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}
StandaloneAppClient.start() registers a ClientEndpoint, so ClientEndpoint.onStart() is guaranteed to be called; inside onStart() we find that it registers with the Master.
// onStart method
override def onStart(): Unit = {
  try {
    // Register the current application's information with the Master
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}
The registration method sends the Application's information to every Master; a Master that receives the RegisterApplication message handles it and replies.
// Register the Application with all Masters
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  // Iterate over all Master addresses
  for (masterAddress <- masterRpcAddresses) yield {
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = try {
        if (registered.get) {
          return
        }
        logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
        // Get a reference to the Master's endpoint
        val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
        // Register the application with the Master; the Master's receive method matches the RegisterApplication type
        masterRef.send(RegisterApplication(appDescription, self))
      } catch {
        case ie: InterruptedException => // Cancelled
        case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
      }
    })
  }
}
Next, let's look at how the Master handles a message of type RegisterApplication. Following the whole call chain, you will see that schedule() is the core scheduling method.
// The Application registration request submitted from the Driver side
case RegisterApplication(description, driver) =>
  // TODO Prevent repeated registrations from some driver
  // If this Master is in STANDBY state, ignore the request and do not submit the job
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    // Wrap the application information; note that if you step into createApplication you will see
    // that by default an application may use up to Int.MaxValue cores
    val app = createApplication(description, driver)
    // Register the app; this adds the current application to waitingApps
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)
    driver.send(RegisteredApplication(app.id, self))
    // Finally the common schedule() method runs again
    schedule()
  }
schedule() then calls startExecutorsOnWorkers(); as the name suggests, this is where executor resources start to be allocated.
/**
 * schedule() is a common method.
 * It also runs when a Driver launch is requested; in that case waitingApps is empty when
 * startExecutorsOnWorkers is reached, so only the Driver is started.
 * When an application is submitted, this schedule() method runs as well; at that point there is no
 * Driver waiting to be launched, and startExecutorsOnWorkers runs directly to allocate resources
 * for the current application.
 */
private def schedule(): Unit = {
  ......
  startExecutorsOnWorkers()
}
private def startExecutorsOnWorkers(): Unit = {
  // Get the submitted apps from waitingApps
  for (app <- waitingApps) {
    // coresPerExecutor: how many cores one Executor uses for this application.
    // It can be set with --executor-cores; as shown below, if unspecified it defaults to 1.
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    // Check whether the application still needs cores; app.coresLeft decreases every time cores are assigned to it
    if (app.coresLeft >= coresPerExecutor) {
      // Filter out workers that don't have enough resources to launch an executor
      // Filter the usable workers
      val usableWorkers : Array[WorkerInfo]= workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= coresPerExecutor)
        .sortBy(_.coresFree).reverse
      // The next step decides how many cores each worker contributes and how many Executors to launch.
      // Note: spreadOutApps is true by default.
      // The returned assignedCores is the number of cores each worker node should give to the current application.
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        // Allocate the resources to Executors on this worker
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
      }
    }
  }
}
What this method does: it matches the Workers that satisfy the resource requirements in the applicationDescription submitted by the client, and then works out the detailed per-Worker allocation (how many cores each worker contributes and how many executors it launches).
Note: spreadOutApps controls how executors are distributed.
For example: if there are 3 usable workers and the submitted application needs 4 executors, those 4 executors could all run on the same worker or be spread evenly across the usable workers; spreadOutApps decides this. It defaults to true, i.e. spread the resources horizontally.
The core of executor allocation is scheduleExecutorsOnWorkers. Its logic is fairly involved, so it is not walked through line by line; the full code is included below for anyone interested. The key variable is assignedCores, the number of cores assigned on each usable Worker; the whole method computes this array and returns it on its last line.
private def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
  // How many cores one Executor uses; None if --executor-cores was not specified on submission
  val coresPerExecutor : Option[Int]= app.desc.coresPerExecutor
  // If --executor-cores was not specified, default to 1 core per Executor
  val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
  // oneExecutorPerWorker is true in that case
  val oneExecutorPerWorker :Boolean= coresPerExecutor.isEmpty
  // By default an Executor uses 1024 MB of memory; this is set around line 464 of SparkContext.
  // If the submit command contains --executor-memory, that specified value is used instead.
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  // Number of usable workers
  val numUsable = usableWorkers.length
  // Two important arrays
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
  /**
   * coresToAssign is the number of cores to assign to this Application: the minimum of app.coresLeft
   * and the total free cores across all usable workers.
   * If --total-executor-cores was specified on submission, app.coresLeft is that value.
   */
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
  /** Return whether the specified worker can launch an executor for this app. */
  // Decide whether the worker at position pos can still launch an Executor
  def canLaunchExecutor(pos: Int): Boolean = {
    // Are there still at least minCoresPerExecutor cores left to assign?
    val keepScheduling = coresToAssign >= minCoresPerExecutor
    // Does this worker still have enough free cores?
    val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor
    // If assignedExecutors(pos) == 0, launchingNewExecutor is true
    val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
    // Launching a new Executor
    if (launchingNewExecutor) {
      val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
      // Is there enough memory?
      val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
      // Safety check: the Executors about to be launched plus the Executors the application already has
      // must stay under the application's total Executor limit
      val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
      keepScheduling && enoughCores && enoughMemory && underLimit
    } else {
      keepScheduling && enoughCores
    }
  }
  var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      while (keepScheduling && canLaunchExecutor(pos)) {
        coresToAssign -= minCoresPerExecutor
        assignedCores(pos) += minCoresPerExecutor
        if (oneExecutorPerWorker) {
          assignedExecutors(pos) = 1
        } else {
          assignedExecutors(pos) += 1
        }
        if (spreadOutApps) {
          keepScheduling = false
        }
      }
    }
    freeWorkers = freeWorkers.filter(canLaunchExecutor)
  }
  // Finally return how many cores each Worker gets
  assignedCores
}
After the Master has computed how many cores each Worker should contribute, it iterates over the usable workers and calls allocateWorkerResourceToExecutors() to allocate executors on each worker.
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  // How many cores each Executor gets
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec: ExecutorDesc = app.addExecutor(worker, coresToAssign)
    // Launch the Executor on the worker
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
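To make the arithmetic concrete (my own illustration, not Spark source): suppose this worker was assigned 4 cores.

// Toy arithmetic for the numExecutors / coresToAssign split above, assuming assignedCores = 4
val assignedCores = 4
val withExecutorCores = Some(2).map(assignedCores / _).getOrElse(1)                // --executor-cores 2 => 2 executors, 2 cores each
val withoutExecutorCores = (None: Option[Int]).map(assignedCores / _).getOrElse(1) // not specified      => 1 executor taking all 4 cores
println(s"$withExecutorCores vs $withoutExecutorCores executor(s) on this worker")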
allocateWorkerResourceToExecutors() gets a reference to the worker and sends it a message; the worker receives it, matches LaunchExecutor, and creates an ExecutorRunner to do the work. What it runs is the org.apache.spark.executor.CoarseGrainedExecutorBackend class from the Command.
// Create the ExecutorRunner
val manager = new ExecutorRunner(
  appId, execId,
  // appDesc contains Command("org.apache.spark.executor.CoarseGrainedExecutorBackend", ...);
  // its first argument is the Executor backend class
  appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
  cores_, memory_, self,
  workerId, host, webUi.boundPort,
  publicAddress, sparkHome, executorDir,
  workerUri, conf, appLocalDirs, ExecutorState.RUNNING)
executors(appId + "/" + execId) = manager
/**
 * Start the ExecutorRunner.
 * What it starts is the CoarseGrainedExecutorBackend class;
 * its main method registers back with the Driver [run method, around line 293].
 */
manager.start()
coresUsed += cores_
memoryUsed += memory_
sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
We are still not done at this point; the application's tasks have not actually been executed yet. Real execution happens in CoarseGrainedExecutorBackend: its main() method calls a run() method that registers the Executor. Only the core code is kept below.
private def run(
  ......
  val env = SparkEnv.createExecutorEnv(
    driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)
  // Register the Executor endpoint; this triggers CoarseGrainedExecutorBackend.onStart [around line 58]
  env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
    env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
  workerUrl.foreach { url =>
    env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
  }
  env.rpcEnv.awaitTermination()
  ......
}
}
This method creates a SparkEnv and then registers the Executor endpoint, i.e. the CoarseGrainedExecutorBackend. So, as usual, we next look at CoarseGrainedExecutorBackend.onStart(); the comments say it all.
override def onStart() {
  logInfo("Connecting to driver: " + driverUrl)
  // Get the Driver's reference from the RPC environment and register this Executor back with the Driver
  rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
    // Got the Driver's reference
    driver = Some(ref)
    // Register this Executor's information back with the Driver, i.e. with the DriverEndpoint
    // created earlier in CoarseGrainedSchedulerBackend;
    // the DriverEndpoint's receiveAndReply method matches RegisterExecutor
    ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    ......
When the Driver receives a message of type RegisterExecutor, it puts the executor into a HashMap and sends a message back to the executorRef telling the Executor that it has been registered.
// The Executor registering back with the Driver
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
  ......
  val data = new ExecutorData(executorRef, executorAddress, hostname,
    cores, cores, logUrls)
  ......
  executorDataMap.put(executorId, data)
  // Get the Executor's endpoint reference and send it a message saying the Executor has been registered.
  // In CoarseGrainedExecutorBackend, the receive method keeps listening for this; once matched, the Executor is created.
  executorRef.send(RegisteredExecutor)
After receiving the registration-success message, the backend creates the real Executor; the Executor contains a thread pool in which tasks run.
Spark tasks run in the cluster as threads, not as separate JVM processes, which is why an Executor object is created here; what ultimately runs the tasks is the executor's thread pool.
// Matched the message from the Driver: the Executor registration has been accepted, so create the Executor
case RegisteredExecutor =>
  // Create the real Executor; the Executor contains a thread pool used to run tasks
  executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
// Launch a Task
case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    val taskDesc = TaskDescription.decode(data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    // The Executor launches the Task
    executor.launchTask(this, taskDesc)
  }
Executor
// Run the task in the thread pool
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
  val tr = new TaskRunner(context, taskDescription)
  runningTasks.put(taskDescription.taskId, tr)
  threadPool.execute(tr)
}
Framework defaults:
1. Resources are spread out horizontally across workers.
2. Only one Executor is launched per worker, and each Executor requests 1 GB of memory by default.
3. Each Executor grabs all available cores on its worker.
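To override these defaults, you can set the corresponding properties when building the SparkConf (equivalent to --executor-cores, --executor-memory and --total-executor-cores on spark-submit); a minimal sketch, with example values only:

import org.apache.spark.SparkConf

// Overriding the standalone defaults discussed above
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://node1:7077")    // matches the example master URL used in this article
  .set("spark.executor.cores", "2")   // cores per Executor (--executor-cores)
  .set("spark.executor.memory", "2g") // memory per Executor (--executor-memory)
  .set("spark.cores.max", "6")        // total cores for the application (--total-executor-cores)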
Summary:
1. Before the Driver runs the client application, SparkConf and SparkContext objects must be created first.
2. When new SparkContext(...) runs, the TaskSchedulerImpl, StandaloneSchedulerBackend and DAGScheduler objects are initialized and held by the SparkContext.
3. TaskSchedulerImpl#start() ==> StandaloneSchedulerBackend#start() ==>
3.1. The parent class CoarseGrainedSchedulerBackend#start() registers a DriverEndpoint.
3.2. StandaloneSchedulerBackend#start() instantiates a StandaloneAppClient (the application's client); the StandaloneAppClient wraps an ApplicationDescription, which in turn wraps a Command.
4. StandaloneAppClient registers a ClientEndpoint with its RPC environment to communicate with the Master and sends the application-registration message.
5. From there it is a matter of the Master matching the message type and handling it; this whole Driver-side process is the resource application.
6. When the Master handles a RegisterApplication message, it creates and registers the Application and then calls schedule() to allocate resources for it.
6.1. scheduleExecutorsOnWorkers computes and returns the number of cores to assign on each usable Worker.
6.2. allocateWorkerResourceToExecutors() allocates executors on each worker.
7. When a worker matches a LaunchExecutor message, it instantiates an ExecutorRunner and calls start(); the most important part of the ExecutorRunner is the org.apache.spark.executor.CoarseGrainedExecutorBackend class in its Command.
8. The main() method of CoarseGrainedExecutorBackend calls run(), which eventually registers an Executor back with the Driver.
9. When the Driver receives a RegisterExecutor message, it puts the executor into a HashMap and sends a message to the executorRef telling the Executor it has been registered.
10. After receiving the registration-success message, the backend creates the real Executor, which contains a thread pool used to run tasks.