Spark 2.2 Source Code Reading Order
1. Spark 2.2 Source Code Analysis: Spark-Submit Job Submission
2. Spark 2.2 Source Code Analysis: Driver Registration and Launch
After the spark-submit command is issued, the client submits the driver to the master for registration, and the master performs a series of operations on that driver (part 1 in the diagram).
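For context, the previous article ended on the client side. Abridged from org.apache.spark.deploy.ClientEndpoint, the submission looks roughly like this; note that the driver command's mainClass is fixed to DriverWrapper, while the user's --class is passed to it as an argument:
// Abridged sketch of the client side (org.apache.spark.deploy.ClientEndpoint)
val mainClass = "org.apache.spark.deploy.worker.DriverWrapper"
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts)
val driverDescription = new DriverDescription(
  driverArgs.jarUrl, driverArgs.memory, driverArgs.cores, driverArgs.supervise, command)
// Ask the master to register and run this driver
asyncSendToMasterAndForwardReply[SubmitDriverResponse](
  RequestSubmitDriver(driverDescription))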
The Master processes the driver submission request when it receives it:
org.apache.spark.deploy.master.Master
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
// Handle a driver submission request
case RequestSubmitDriver(description) =>
// If this master is not ALIVE, reply false to the client
if (state != RecoveryState.ALIVE) {
val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
"Can only accept driver submissions in ALIVE state."
context.reply(SubmitDriverResponse(self, false, None, msg))
} else {
logInfo("Driver submitted " + description.command.mainClass)
// Create the DriverInfo; from now on this master process maintains it
val driver = createDriver(description)
// Persist the driver so it can be re-read after a master failover or restart
persistenceEngine.addDriver(driver)
// Add it to the list of drivers waiting to be scheduled
waitingDrivers += driver
// Add it to the driver list the master manages in memory
drivers.add(driver)
// A new driver needs to run, so start scheduling resources
schedule()
// Reply to the client, which then exits
context.reply(SubmitDriverResponse(self, true, Some(driver.id),
s"Driver successfully submitted as ${driver.id}"))
}
case ...
}
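A quick aside on persistenceEngine.addDriver: abridged from org.apache.spark.deploy.master.PersistenceEngine, it simply serializes the DriverInfo under a per-driver key, and the concrete engine decides where that lands:
abstract class PersistenceEngine {
  // Concrete engines (ZooKeeperPersistenceEngine, FileSystemPersistenceEngine, ...)
  // implement these two primitives
  def persist(name: String, obj: AnyRef): Unit
  def unpersist(name: String): Unit

  // Store the driver under a "driver_<id>" key so a recovered master can re-read it
  final def addDriver(driver: DriverInfo): Unit = {
    persist("driver_" + driver.id, driver)
  }

  final def removeDriver(driver: DriverInfo): Unit = {
    unpersist("driver_" + driver.id)
  }
}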
Let's analyze step one first: creating the DriverInfo.
It mainly generates an ID and wraps information such as the submission time, whether the driver should be supervised, and how many CPUs it needs.
private def createDriver(desc: DriverDescription): DriverInfo = {
val now = System.currentTimeMillis()
val date = new Date(now)
new DriverInfo(now, newDriverId(date), desc, date)
}
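The ID comes from newDriverId in the same class: the submission date plus a monotonically increasing counter, producing IDs like driver-20170923010203-0001 (abridged from Master.scala; createDateFormat is a "yyyyMMddHHmmss" SimpleDateFormat):
private def newDriverId(submitDate: Date): String = {
  val appId = "driver-%s-%04d".format(createDateFormat.format(submitDate), nextDriverNumber)
  nextDriverNumber += 1
  appId
}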
Step two: there is now a new demand (the driver) to schedule, so the schedule method is called to allocate resources.
The schedule method is called from roughly ten places in the Master class. It is important and will be analyzed in detail later; for now we only look at its driver part.
private def schedule(): Unit = {
// A standby master does nothing
if (state != RecoveryState.ALIVE) {
return
}
// Take all alive workers and shuffle them randomly
val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
val numWorkersAlive = shuffledAliveWorkers.size
// Index of the worker currently being considered
var curPos = 0
// Loop over every driver that needs to be scheduled
for (driver <- waitingDrivers.toList) {
// Flags whether the current driver has been launched; once it has, the while loop below ends
var launched = false
// Counts the workers visited so far; once it reaches the number of alive workers, the loop below also exits
var numWorkersVisited = 0
while (numWorkersVisited < numWorkersAlive && !launched) {
// Take the shuffled workers one at a time
val worker = shuffledAliveWorkers(curPos)
// Bump the visit counter
numWorkersVisited += 1
// Check whether this worker has enough memory and CPU to launch the driver
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
// Launch the driver
launchDriver(worker, driver)
// Remove the driver from the waiting list
waitingDrivers -= driver
// Mark it launched and end this round of the loop (the launch is not guaranteed to succeed)
launched = true
}
// Advance the index, wrapping so it never exceeds the number of alive workers
curPos = (curPos + 1) % numWorkersAlive
}
}
// Launch executors on workers
startExecutorsOnWorkers()
}
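To make the round-robin placement concrete, here is a tiny standalone simulation of the same loop. This is a toy sketch, not Spark code: Need and Free are hypothetical stand-ins for DriverDescription and WorkerInfo.
import scala.util.Random

case class Need(id: String, mem: Int, cores: Int)
case class Free(id: String, var mem: Int, var cores: Int)

object PlacementDemo extends App {
  val workers = Random.shuffle(Seq(Free("w1", 4096, 4), Free("w2", 2048, 2)))
  val waiting = Seq(Need("driver-0", 1024, 1), Need("driver-1", 2048, 2))
  var curPos = 0
  for (d <- waiting) {
    var launched = false
    var visited = 0
    while (visited < workers.size && !launched) {
      val w = workers(curPos)
      visited += 1
      if (w.mem >= d.mem && w.cores >= d.cores) {
        w.mem -= d.mem; w.cores -= d.cores // "launch": claim the resources
        println(s"${d.id} -> ${w.id}")
        launched = true
      }
      curPos = (curPos + 1) % workers.size // round-robin spreads drivers across workers
    }
    if (!launched) println(s"${d.id} stays in waitingDrivers")
  }
}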
The part we care about most here is launching the driver.
The loop above has already picked an alive worker with sufficient resources; it is passed into this function to launch the driver:
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
logInfo("Launching driver " + driver.id + " on worker " + worker.id)
// Give the worker and the driver references to each other
worker.addDriver(driver)
driver.worker = Some(worker)
// Use the Spark RPC machinery to notify the worker that it can launch the driver process
worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
// Mark the driver as RUNNING (send is asynchronous, so if the launch fails or something else goes wrong, the worker sends a message back to change the state)
driver.state = DriverState.RUNNING
}
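The message itself is a plain case class in org.apache.spark.deploy.DeployMessages; it carries only the id and the description, and the worker builds everything else locally:
case class LaunchDriver(driverId: String, driverDesc: DriverDescription) extends DeployMessage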
After the message is sent, the worker process receives and handles it:
override def receive: PartialFunction[Any, Unit] = synchronized {
case LaunchDriver(driverId, driverDesc) =>
logInfo(s"Asked to launch driver $driverId")
// Wrap everything in a DriverRunner instance; it maintains a Process and uses it to launch the driver as a separate process
val driver = new DriverRunner(
conf,
driverId,
workDir,
sparkHome,
// The command's mainClass was already set on the client side to
// org.apache.spark.deploy.worker.DriverWrapper,
// so launching the driver means running DriverWrapper's main method
driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
self,
workerUri,
securityMgr)
// The worker process maintains its own driver list too
drivers(driverId) = driver
// Launch the driver
driver.start()
// Accumulate the CPU and memory in use; this is reported to the Master for its free-resource accounting
coresUsed += driverDesc.cores
memoryUsed += driverDesc.mem
}
org.apache.spark.deploy.worker.DriverRunner
start() spawns a thread that issues the (blocking) command to launch the driver.
/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
new Thread("DriverRunner for " + driverId) {
override def run() {
var shutdownHook: AnyRef = null
try {
// A JVM exit might leave some state unsaved, so register a shutdown hook
shutdownHook = ShutdownHookManager.addShutdownHook { () =>
// When the JVM exits, kill the current process
logInfo(s"Worker shutting down, killing driver $driverId")
kill()
}
// prepare driver jars and run driver
// This method does a lot of work:
// it creates a local working directory named after the driver's id,
// downloads the user jar into that directory,
// creates stdout/stderr log files in the working directory,
// and, under certain conditions, keeps retrying the command that launches the driver process,
// finally returning an exit code.
// Normally the thread blocks here; if nothing goes wrong, the driver state on the master side stays RUNNING
val exitCode = prepareAndRunDriver()
// A returned exit code means the process has finished, was deliberately killed, or failed
// set final state depending on if forcibly killed and process exit code
finalState = if (exitCode == 0) {
Some(DriverState.FINISHED)
} else if (killed) {
Some(DriverState.KILLED)
} else {
Some(DriverState.FAILED)
}
} catch {
case e: Exception =>
kill()
finalState = Some(DriverState.ERROR)
finalException = Some(e)
} finally {
// Remove the shutdown hook if it is still registered
if (shutdownHook != null) {
ShutdownHookManager.removeShutdownHook(shutdownHook)
}
}
// Notify the worker's main endpoint that this driver's state has changed
// notify worker of final driver state, possible exception
worker.send(DriverStateChanged(driverId, finalState.get, finalException))
}
}.start()
}
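The comments in the middle of start() summarize prepareAndRunDriver. Abridged from the same class, it looks roughly like this:
private[worker] def prepareAndRunDriver(): Int = {
  val driverDir = createWorkingDirectory()          // work/<driverId> under the worker dir
  val localJarFilename = downloadUserJar(driverDir) // fetch the user jar into it

  // The client left "{{WORKER_URL}}" and "{{USER_JAR}}" placeholders in the command
  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }

  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)

  // Blocks until the process exits; retries it when supervise is set
  runDriver(builder, driverDir, driverDesc.supervise)
}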
At this point the driver launch is complete. DriverWrapper's main method essentially uses reflection
to invoke the main method of the --class specified to spark-submit, i.e. the Spark application we wrote.
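Abridged from org.apache.spark.deploy.worker.DriverWrapper (the class-loader setup for the user jar is elided), that main method looks roughly like this:
object DriverWrapper {
  def main(args: Array[String]) {
    args.toList match {
      case workerUrl :: userJar :: mainClass :: extraArgs =>
        val conf = new SparkConf()
        val rpcEnv = RpcEnv.create("Driver",
          Utils.localHostName(), 0, conf, new SecurityManager(conf))
        // Watch the parent worker: if it dies, this driver JVM exits too
        rpcEnv.setupEndpoint("workerWatcher", new WorkerWatcher(rpcEnv, workerUrl))

        // Delegate to the --class given to spark-submit, via reflection
        val clazz = Utils.classForName(mainClass)
        val mainMethod = clazz.getMethod("main", classOf[Array[String]])
        mainMethod.invoke(null, extraArgs.toArray[String])

        rpcEnv.shutdown()

      case _ =>
        System.err.println("Usage: DriverWrapper <workerUrl> <userJar> <driverMainClass> [options]")
        System.exit(-1)
    }
  }
}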
Finally, let's look at the important operations that run once exitCode comes back.
org.apache.spark.deploy.worker.Worker
override def receive: PartialFunction[Any, Unit] = synchronized {
case driverStateChanged @ DriverStateChanged(driverId, state, exception) =>
handleDriverStateChanged(driverStateChanged)
}
// Dedicated handler for driver state changes
private[worker] def handleDriverStateChanged(driverStateChanged: DriverStateChanged): Unit = {
val driverId = driverStateChanged.driverId
val exception = driverStateChanged.exception
val state = driverStateChanged.state
// Log a different message depending on the state
state match {
case DriverState.ERROR =>
logWarning(s"Driver $driverId failed with unrecoverable exception: ${exception.get}")
case DriverState.FAILED =>
logWarning(s"Driver $driverId exited with failure")
case DriverState.FINISHED =>
logInfo(s"Driver $driverId exited successfully")
case DriverState.KILLED =>
logInfo(s"Driver $driverId was killed by user")
case _ =>
logDebug(s"Driver $driverId changed state to $state")
}
// Notify the master that a driver it manages has changed state.
// The master removes the driver from the driver data structures it keeps in memory,
// removes the driver's persisted state,
// and then calls schedule() again: a finished driver frees its worker's resources, which can run other work
sendToMaster(driverStateChanged)
// Remove the driver from the list this worker maintains
val driver = drivers.remove(driverId).get
// Add it to the finished list; this list is not persisted, so a restart clears it
finishedDrivers(driverId) = driver
// Check whether the number of finished drivers kept in memory exceeds the limit
// (spark.worker.ui.retainedDrivers, default 1000); if so, drop max(total / 10, 1) of them
trimFinishedDriversIfNecessary()
// Release the resources
memoryUsed -= driver.driverDesc.mem
coresUsed -= driver.driverDesc.cores
}
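For completeness, on the Master side the DriverStateChanged message ends up in removeDriver, which undoes what RequestSubmitDriver set up and triggers a new scheduling round (abridged from Master.scala):
case DriverStateChanged(driverId, state, exception) =>
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }

private def removeDriver(driverId: String, finalState: DriverState, exception: Option[Exception]) {
  drivers.find(d => d.id == driverId) match {
    case Some(driver) =>
      drivers -= driver                       // drop from the live set
      completedDrivers += driver              // keep a bounded number for the UI
      persistenceEngine.removeDriver(driver)  // no longer needed for recovery
      driver.state = finalState
      driver.exception = exception
      driver.worker.foreach(w => w.removeDriver(driver))
      schedule()                              // the worker's resources were freed, reschedule
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}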
And with that, driver registration and launch is completely finished. Next, the Spark code we wrote ourselves starts to initialize and run:
3. Spark 2.2 Source Code Analysis: SparkContext Initialization