2.2 Worker Startup
org.apache.spark.deploy.worker
1 Entering through the main method of the Worker companion object
The main method first creates a SparkConf instance, conf. It then wraps conf and the arguments passed when launching the Worker into a WorkerArguments instance, args. Next it calls startSystemAndActor() to obtain an actorSystem instance; inside that method the Worker's actor is also created and started. The code is as follows:
The main method
private[spark] object Worker extends Logging {
  def main(argStrings: Array[String]) {
    SignalLogger.register(log)
    val conf = new SparkConf
    // Wrap the launch arguments for the Worker service
    val args = new WorkerArguments(argStrings, conf)
    val (actorSystem, _) = startSystemAndActor(args.host, args.port, args.webUiPort, args.cores,
      args.memory, args.masters, args.workDir)
    actorSystem.awaitTermination()
  }
The startSystemAndActor() method. How actorSystem.actorOf() creates and starts an actor was analyzed in the previous article on the Master startup flow.
def startSystemAndActor(
    host: String,
    port: Int,
    webUiPort: Int,
    cores: Int,
    memory: Int,
    masterUrls: Array[String],
    workDir: String,
    workerNumber: Option[Int] = None,
    conf: SparkConf = new SparkConf): (ActorSystem, Int) = {
  // The LocalSparkCluster runs multiple local sparkWorkerX actor systems
  val systemName = "sparkWorker" + workerNumber.map(_.toString).getOrElse("")
  val actorName = "Worker"
  val securityMgr = new SecurityManager(conf)
  // Create an ActorSystem instance, which is used to create actors
  val (actorSystem, boundPort) = AkkaUtils.createActorSystem(systemName, host, port,
    conf = conf, securityManager = securityMgr)
  // Build the Akka URLs of all the Masters
  val masterAkkaUrls = masterUrls.map(Master.toAkkaUrl(_, AkkaUtils.protocol(actorSystem)))
  // Create the Worker actor and start it
  actorSystem.actorOf(Props(classOf[Worker], host, boundPort, webUiPort, cores, memory,
    masterAkkaUrls, systemName, actorName, workDir, conf, securityMgr), name = actorName)
  (actorSystem, boundPort)
}
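The create-and-start behavior of actorOf() can be seen in a minimal, self-contained sketch. The actor class, names, and port below are hypothetical, and the sketch assumes the Akka 2.3-era classic API that this version of Spark is built on:

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical actor standing in for Worker; the constructor arguments
// mirror the Props(classOf[Worker], ...) call above
class DemoWorker(host: String, port: Int) extends Actor {
  // The primary constructor body runs when actorOf() instantiates the actor
  println(s"constructed for $host:$port")

  // Lifecycle hook, invoked automatically once the actor is started
  override def preStart(): Unit = println("preStart ran")

  def receive = {
    case msg => println(s"received: $msg")
  }
}

object DemoMain extends App {
  val system = ActorSystem("sparkWorkerDemo")
  // actorOf() both constructs and starts the actor; no separate start() call
  val worker = system.actorOf(Props(classOf[DemoWorker], "localhost", 7078), name = "Worker")
  worker ! "hello"
}
```

Running DemoMain prints the constructor line first, then the preStart line, and then handles the "hello" message, which is exactly the ordering the Worker relies on.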
2 Creating and running the Worker actor
When the Worker actor instance is created, the code in the primary constructor of the Worker class runs first. Since actorOf() both constructs and starts the actor, its lifecycle methods then begin to execute.
1) preStart() runs first. In preStart(), registerWithMaster() is called to send registration information to the Master. registerWithMaster() mainly calls tryRegisterAllMasters() and starts a timer that periodically re-registers with the Master. The code is as follows:
The preStart() method
// Lifecycle method
override def preStart() {
  assert(!registered)
  logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
    host, port, cores, Utils.megabytesToString(memory)))
  logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
  logInfo("Spark home: " + sparkHome)
  createWorkDir()
  context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
  shuffleService.startIfEnabled()
  webUi = new WorkerWebUI(this, workDir, webUiPort)
  webUi.bind()
  // Register with the Master
  registerWithMaster()
  metricsSystem.registerSource(workerSource)
  metricsSystem.start()
  // Attach the worker metrics servlet handler to the web ui after the metrics system is started.
  metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
}
The registerWithMaster() method; see the comments in the code for the detailed analysis
// Register with the Master
def registerWithMaster() {
  // DisassociatedEvent may be triggered multiple times, so don't attempt registration
  // if there are outstanding registration attempts scheduled.
  // Why the pattern match: a DisassociatedEvent may fire multiple times, and its
  // handler calls registerWithMaster() again. Matching on registrationRetryTimer
  // ensures no new attempt is started while one is still outstanding.
  registrationRetryTimer match {
    case None =>
      registered = false
      // Try to register with every Master, because the cluster may be running
      // in high-availability mode with multiple Masters
      tryRegisterAllMasters()
      // Reset the connection-attempt counter
      connectionAttemptCount = 0
      // Assign registrationRetryTimer so that a later call to registerWithMaster()
      // does not run tryRegisterAllMasters() again
      registrationRetryTimer = Some {
        // Start a timer that periodically sends ReregisterWithMaster, adding fault
        // tolerance against registration failures caused by network problems or a
        // faulty Master
        context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
          INITIAL_REGISTRATION_RETRY_INTERVAL, self, ReregisterWithMaster)
      }
    case Some(_) =>
      logInfo("Not spawning another attempt to register with the master, since there is an" +
        " attempt scheduled already.")
  }
}
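The guard built around registrationRetryTimer is a general idempotent-scheduling pattern that can be sketched on its own in plain Scala, with a java.util.Timer standing in for the Akka scheduler. All names here are illustrative, not from Spark:

```scala
import java.util.{Timer, TimerTask}

object RetryGuard {
  // Stand-in for registrationRetryTimer: None means no retry loop exists yet
  private var retryTimer: Option[TimerTask] = None
  private val timer = new Timer(true) // daemon timer thread

  def register(attempt: () => Unit): Unit = retryTimer match {
    case None =>
      attempt() // first call: try immediately, like tryRegisterAllMasters()
      val task = new TimerTask { def run(): Unit = attempt() }
      timer.schedule(task, 1000L, 1000L) // then retry every second
      retryTimer = Some(task) // later calls fall into the Some(_) branch
    case Some(_) =>
      println("an attempt is already scheduled; not spawning another")
  }
}
```

However many times register() is called afterwards, only one periodic retry task ever exists, which is exactly what the Worker needs when DisassociatedEvent fires repeatedly.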
The tryRegisterAllMasters() method
// Try to register with every Master
private def tryRegisterAllMasters() {
  for (masterAkkaUrl <- masterAkkaUrls) {
    logInfo("Connecting to master " + masterAkkaUrl + "...")
    val actor = context.actorSelection(masterAkkaUrl)
    // Send a message to the actor; here the actor is the Master's actor
    actor ! RegisterWorker(workerId, host, port, cores, memory, webUi.boundPort, publicAddress)
  }
}
2) receiveWithLogging then runs, looping to wait for incoming messages. The messages relevant to Worker startup are RegisteredWorker and SendHeartbeat; their roles are analyzed below together with the interaction with the Master.
2.3 The Master Receives the Worker's Registration
When the Master receives the RegisterWorker message sent by the Worker, it first checks whether that worker is already registered. If so, it replies to the worker with the failure message RegisterWorkerFailed("Duplicate worker ID"). Otherwise it wraps the information the worker sent into a WorkerInfo object, saves it in memory and on disk, and replies with the success message RegisteredWorker(masterUrl, masterWebUiUrl), which carries the masterUrl and masterWebUiUrl. On receiving RegisteredWorker, the worker sets registered = true, calls changeMaster() to record the active master's URL, and then starts a timer that periodically sends a message to itself to trigger heartbeats to the Master. The code is as follows:
override def receiveWithLogging = {
  // The Master sends back its masterUrl, which means registration succeeded
  case RegisteredWorker(masterUrl, masterWebUiUrl) =>
    logInfo("Successfully registered with master " + masterUrl)
    registered = true
    // Update to the URL the Master sent over
    changeMaster(masterUrl, masterWebUiUrl)
    // Start a timer that sends heartbeats to the Master.
    // This works by sending a SendHeartbeat message to self; when
    // receiveWithLogging receives that message it triggers the corresponding
    // action: first check whether the connection to the Master is still alive,
    // and if so send the heartbeat
    context.system.scheduler.schedule(0 millis, HEARTBEAT_MILLIS millis, self, SendHeartbeat)
    if (CLEANUP_ENABLED) {
      logInfo(s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
      context.system.scheduler.schedule(CLEANUP_INTERVAL_MILLIS millis,
        CLEANUP_INTERVAL_MILLIS millis, self, WorkDirCleanup)
    }
  case SendHeartbeat =>
    // If still connected, send a heartbeat to the Master
    if (connected) { master ! Heartbeat(workerId) }
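The heartbeat mechanism, a scheduled message to self plus a connectivity check in the receive loop, can be sketched in isolation. Names and the interval are hypothetical, and the sketch assumes the Akka 2.3-era classic scheduler API used by this version of Spark:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.duration._

case object SendHeartbeat

class HeartbeatActor extends Actor {
  import context.dispatcher // ExecutionContext required by the scheduler
  var connected = true      // stand-in for the Worker's `connected` flag

  override def preStart(): Unit =
    // Same shape as the Worker: schedule SendHeartbeat to self on a fixed interval
    context.system.scheduler.schedule(0.millis, 100.millis, self, SendHeartbeat)

  def receive = {
    case SendHeartbeat =>
      // Only heartbeat while the connection to the Master is believed alive
      if (connected) println("heartbeat")
  }
}

object HeartbeatDemo extends App {
  val system = ActorSystem("demo")
  system.actorOf(Props[HeartbeatActor], name = "heartbeater")
}
```

Routing the timer through the mailbox, rather than sending the heartbeat directly from the scheduler callback, keeps all state access (here the connected flag) on the actor's own thread, which is the design choice the Worker makes as well.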