In "Spark Source Code Study (Part 1)" we saw from Spark's startup scripts that starting the Master really means launching org.apache.spark.deploy.master.Master. Starting from these two classes, Master and Worker, let's walk through Spark's startup flow by reading the source.
1. First, let's look at org.apache.spark.deploy.master.Master:
(1) Start from Master's main method:
val conf = new SparkConf
val args = new MasterArguments(argStrings, conf)
val (actorSystem, _, _, _) = startSystemAndActor(args.host, args.port, args.webUiPort, conf)
(2) The key code of the startSystemAndActor method:
val (actorSystem, boundPort) = AkkaUtils.createActorSystem(systemName, host, port, conf = conf,
securityManager = securityMgr)
val actor = actorSystem.actorOf(
Props(classOf[Master], host, boundPort, webUiPort, securityMgr, conf), actorName)
(3) The key code of createActorSystem:
val startService: Int => (ActorSystem, Int) = { actualPort =>
doCreateActorSystem(name, host, actualPort, conf, securityManager)
}
Once this function runs, it returns an ActorSystem together with the port it was bound to.
(4) In (2), the classOf[Master] argument to actorSystem.actorOf is the equivalent of Master.class in Java; at this point Master's constructor and lifecycle methods are invoked. ---- The preStart() method:
context.system.scheduler.schedule(0 millis, WORKER_TIMEOUT millis,
self, CheckForWorkerTimeOut)
This line starts a timer that periodically sends the actor itself a CheckForWorkerTimeOut (check for Worker timeouts) message. Looking at the source, CheckForWorkerTimeOut is defined in MasterMessages as a
case object CheckForWorkerTimeOut
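The same "send myself a message at a fixed rate" pattern can be imitated on the plain JVM without Akka. A minimal sketch using java.util.concurrent (an analogy only; Spark really uses Akka's scheduler, and the counter here just stands in for delivering CheckForWorkerTimeOut to self):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Stand-in for "self ! CheckForWorkerTimeOut": each firing just bumps a counter.
val ticks = new AtomicInteger(0)
val scheduler = Executors.newSingleThreadScheduledExecutor()

// Analogous to schedule(0 millis, WORKER_TIMEOUT millis, self, CheckForWorkerTimeOut):
// initial delay 0, then fire at a fixed period (50 ms here, to keep the demo short).
scheduler.scheduleAtFixedRate(
  new Runnable { def run(): Unit = { ticks.incrementAndGet() } },
  0, 50, TimeUnit.MILLISECONDS)

Thread.sleep(200) // let the timer fire a few times
scheduler.shutdown()
println(s"timer fired ${ticks.get} times")
```

With an initial delay of zero the task fires immediately and then keeps firing until the executor is shut down, which mirrors how the Master keeps checking for dead Workers for as long as it lives.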
(5) In Master's receiveWithLogging method:
case CheckForWorkerTimeOut => {
timeOutDeadWorkers()
}
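timeOutDeadWorkers() walks the registered Workers and drops every one whose last heartbeat is older than WORKER_TIMEOUT. A simplified sketch of that idea (the map below is a stand-in for the Master's real WorkerInfo bookkeeping, not Spark's actual code):

```scala
import scala.collection.mutable

// Stand-in for the Master's worker bookkeeping: worker id -> last heartbeat (ms).
val WORKER_TIMEOUT = 60 * 1000L // Spark's default worker timeout is 60 s
val lastHeartbeat = mutable.Map(
  "worker-1" -> System.currentTimeMillis(),            // heartbeating normally
  "worker-2" -> (System.currentTimeMillis() - 120000L) // silent for two minutes
)

// Core idea of timeOutDeadWorkers(): remove every worker whose
// last heartbeat is older than WORKER_TIMEOUT.
def timeOutDeadWorkers(): Unit = {
  val deadline = System.currentTimeMillis() - WORKER_TIMEOUT
  val dead = lastHeartbeat.collect { case (id, t) if t < deadline => id }.toList
  dead.foreach { id =>
    println(s"Removing $id: no heartbeat in $WORKER_TIMEOUT ms")
    lastHeartbeat -= id
  }
}

timeOutDeadWorkers() // worker-2 is removed, worker-1 survives
```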
This is how the Master handles the timeout-check message: reading timeOutDeadWorkers shows that it removes timed-out Workers from memory.
(6) Next, let's see how a Worker registers itself. In org.apache.spark.deploy.worker.Worker, start from the preStart() method: registerWithMaster() sends a registration message to the Master. The key code:
registrationRetryTimer = Some {
context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
INITIAL_REGISTRATION_RETRY_INTERVAL, self, ReregisterWithMaster)
}
This periodically sends the Worker itself a ReregisterWithMaster message; ---
case ReregisterWithMaster =>
reregisterWithMaster()
---
master ! RegisterWorker(
workerId, host, port, cores, memory, webUi.boundPort, publicAddress)
(7) The Master receives the RegisterWorker message and handles it:
case RegisterWorker(id, workerHost, workerPort, cores, memory, workerUiPort, publicAddress) =>
...... If this node has not registered before, its info is recorded in memory and a registration-success message is returned; otherwise registration fails:
persistenceEngine.addWorker(worker)
sender ! RegisteredWorker(masterUrl, masterWebUiUrl)
schedule()
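The essence of that handler is an idempotency check against the in-memory worker table: an unseen id is recorded and acknowledged, a duplicate is refused. A stripped-down sketch (the reply types and the idToWorker map are simplified stand-ins; the real messages also carry master URLs, and the real handler persists the worker and calls schedule()):

```scala
import scala.collection.mutable

// Simplified reply messages; the real RegisteredWorker also carries
// masterUrl and masterWebUiUrl.
sealed trait RegisterResponse
case object RegisteredOk extends RegisterResponse
case class RegisterFailed(reason: String) extends RegisterResponse

// Stand-in for the Master's in-memory worker table.
val idToWorker = mutable.Map[String, String]() // workerId -> "host:port"

// Core of the Master's RegisterWorker handling: only unseen ids are accepted.
def registerWorker(id: String, hostPort: String): RegisterResponse =
  if (idToWorker.contains(id)) {
    RegisterFailed(s"Duplicate worker id: $id")
  } else {
    idToWorker(id) = hostPort // record in memory
    // the real handler also calls persistenceEngine.addWorker(...) and schedule()
    RegisteredOk
  }

println(registerWorker("worker-1", "node1:7078")) // first attempt succeeds
println(registerWorker("worker-1", "node1:7078")) // duplicate id is refused
```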
(8) The Worker receives the RegisteredWorker (registration succeeded) message:
case RegisteredWorker(masterUrl, masterWebUiUrl) =>
// Update the Master's address info and schedule periodic heartbeat messages to itself:
changeMaster(masterUrl, masterWebUiUrl)
context.system.scheduler.schedule(0 millis, HEARTBEAT_MILLIS millis,
self, SendHeartbeat)
---- When the Worker itself receives the heartbeat trigger, it checks whether the connection to the Master is still up and, if so, sends the Master a heartbeat message:
case SendHeartbeat =>
if (connected) { master ! Heartbeat(workerId) }
(9) The Master receives the Worker's heartbeat message:
case Heartbeat(workerId) => {
...... If the Worker exists in memory, its "last successful heartbeat time" is updated; otherwise the Master sends the Worker a reconnect message:
idToWorker.get(workerId) match {
case Some(workerInfo) =>
workerInfo.lastHeartbeat = System.currentTimeMillis()
case None =>
if (workers.map(_.id).contains(workerId)) {
logWarning(s"Got heartbeat from unregistered worker $workerId." +
" Asking it to re-register.")
sender ! ReconnectWorker(masterUrl)
} else {
logWarning(s"Got heartbeat from unregistered worker $workerId." +
" This worker was never registered, so ignoring the heartbeat.")
}
}
(10) When the Worker receives a ReconnectWorker message, it re-registers:
case ReconnectWorker(masterUrl) =>
logInfo(s"Master with url $masterUrl requested this worker to reconnect.")
registerWithMaster()
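Putting steps (5)-(10) together, the register → heartbeat → timeout → reconnect cycle can be simulated without Akka at all, with messages reduced to method calls. Everything below is a toy model with illustrative names, not Spark's actual classes:

```scala
import scala.collection.mutable

// Toy model of the Master's side of the protocol; messages become method calls.
object ToyMaster {
  case class WorkerInfo(id: String, var lastHeartbeat: Long)
  private val idToWorker = mutable.Map[String, WorkerInfo]()

  // (7) RegisterWorker: accept only unseen ids.
  def register(id: String, now: Long): Boolean =
    if (idToWorker.contains(id)) false
    else { idToWorker(id) = WorkerInfo(id, now); true }

  // (9) Heartbeat: true = timestamp refreshed, false = unknown worker
  // (the real Master would reply with ReconnectWorker in that case).
  def heartbeat(id: String, now: Long): Boolean =
    idToWorker.get(id) match {
      case Some(w) => w.lastHeartbeat = now; true
      case None    => false
    }

  // (5) CheckForWorkerTimeOut: sweep out workers that went silent.
  def timeOutDeadWorkers(now: Long, timeoutMs: Long): Unit = {
    val dead = idToWorker.values.filter(_.lastHeartbeat < now - timeoutMs).map(_.id).toList
    dead.foreach(idToWorker.remove)
  }

  def knows(id: String): Boolean = idToWorker.contains(id)
}

// One worker through the whole cycle (times in ms, starting at 0):
val registered = ToyMaster.register("worker-1", 0L)             // accepted
val hbOk       = ToyMaster.heartbeat("worker-1", 10L)           // refreshes timestamp
ToyMaster.timeOutDeadWorkers(now = 100000L, timeoutMs = 60000L) // worker went silent -> removed
val hbRejected = !ToyMaster.heartbeat("worker-1", 100001L)      // unknown -> "please reconnect"
if (hbRejected) ToyMaster.register("worker-1", 100002L)         // (10) worker re-registers
```

Following the calls: registration succeeds, a heartbeat refreshes the timestamp, the timeout sweep removes the silent worker, the next heartbeat is rejected (where the real Master would reply ReconnectWorker), and the worker re-registers.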
That covers the main flow of starting Spark's Master and Worker and their Actor-based communication!