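Source location: org.apache.spark.deploy.worker.Worker.scala
Start with the Worker's main method. Like the Master, it creates a SparkConf, parses the arguments, constructs the Worker object, and creates an ActorRef for exchanging messages with other components and with itself. Note that the masters argument may list more than one master.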
- def main(argStrings: Array[String]) {
- SignalLogger.register(log)
- val conf = new SparkConf
- val args = new WorkerArguments(argStrings, conf)
- val (actorSystem, _) = startSystemAndActor(args.host, args.port, args.webUiPort, args.cores,
- args.memory, args.masters, args.workDir)
- actorSystem.awaitTermination()
- }
Once the program is up, Akka's preStart method runs first, just as it does for the Master.
- override def preStart() {
- assert(!registered)
- logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
- host, port, cores, Utils.megabytesToString(memory)))
- logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
- logInfo("Spark home: " + sparkHome)
- createWorkDir()
-
- context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
-
- shuffleService.startIfEnabled()
- webUi = new WorkerWebUI(this, workDir, webUiPort)
- webUi.bind()
- registerWithMaster()
-
- metricsSystem.registerSource(workerSource)
- metricsSystem.start()
-
- metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
- }
Registering with the Master
- private def registerWithMaster() {
-
-
- registrationRetryTimer match {
- case None =>
- registered = false
-
- tryRegisterAllMasters()
- connectionAttemptCount = 0
-
- registrationRetryTimer = Some {
- context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
- INITIAL_REGISTRATION_RETRY_INTERVAL, self, ReregisterWithMaster)
- }
- case Some(_) =>
- logInfo("Not spawning another attempt to register with the master, since there is an" +
- " attempt scheduled already.")
- }
- }
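The guard in registerWithMaster (try once immediately, then start at most one periodic retry) can be illustrated outside Akka. The following is a minimal sketch, assuming a plain java.util.concurrent scheduler in place of context.system.scheduler; the class name RegistrationRetrier and its parameters are hypothetical and do not come from Spark.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}

// Hypothetical helper, not part of Spark: models the Option-based guard that
// keeps registerWithMaster from spawning a second retry timer.
class RegistrationRetrier(intervalMs: Long)(attempt: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private var retryTimer: Option[ScheduledFuture[_]] = None

  /** Returns true if a new retry timer was started, false if one already exists. */
  def start(): Boolean = synchronized {
    retryTimer match {
      case None =>
        attempt() // first attempt happens immediately
        val task = new Runnable { def run(): Unit = attempt() }
        // Later attempts fire every intervalMs until stop() is called.
        retryTimer = Some(
          scheduler.scheduleAtFixedRate(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS))
        true
      case Some(_) =>
        false // an attempt is already scheduled, so do nothing
    }
  }

  /** Cancels the pending retries, e.g. once registration has succeeded. */
  def stop(): Unit = synchronized {
    retryTimer.foreach(_.cancel(true))
    retryTimer = None
    scheduler.shutdown()
  }
}
```

Calling start() a second time before stop() returns false without making another attempt, which corresponds to the "Not spawning another attempt" log line in the Worker code.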
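Now look at what the Worker does when it receives the RegisteredWorker message from the Master. One point to note first: when the Worker registers, it does not know which master is active and which is standby, so it sends the registration message to every configured master. Both the active and the standby masters receive it, but only the active master replies, and the reply carries its own masterUrl. The Worker uses this reply to identify the active master's ActorRef, which it then uses for all real communication.
The Worker relies on heartbeats to stay connected to the Master, so after a successful registration a connected flag records whether the connection is healthy. The changeMaster method sets connected = true.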
- case RegisteredWorker(masterUrl, masterWebUiUrl) =>
- logInfo("Successfully registered with master " + masterUrl)
- registered = true
- changeMaster(masterUrl, masterWebUiUrl)
-
-
- context.system.scheduler.schedule(0 millis, HEARTBEAT_MILLIS millis, self, SendHeartbeat)
-
- if (CLEANUP_ENABLED) {
- logInfo(s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
- context.system.scheduler.schedule(CLEANUP_INTERVAL_MILLIS millis,
- CLEANUP_INTERVAL_MILLIS millis, self, WorkDirCleanup)
- }
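The registration handshake described above (send to every master, accept the reply from the active one) can be sketched without Akka. This is only an illustrative model under simplifying assumptions: MasterStub and WorkerStub are hypothetical names, and a direct method call stands in for the actor messages RegisterWorker and RegisteredWorker.

```scala
// Hypothetical stand-in for a master: only the active one answers a registration.
final case class MasterStub(url: String, active: Boolean) {
  // Returns Some(masterUrl) to model a RegisteredWorker reply, None to model silence.
  def handleRegister(workerId: String): Option[String] =
    if (active) Some(url) else None
}

// Hypothetical stand-in for the worker side of the handshake.
class WorkerStub(id: String, masters: Seq[MasterStub]) {
  var registered = false
  var connected = false
  var activeMasterUrl: Option[String] = None

  // The worker cannot tell active from standby, so it asks every configured master.
  def tryRegisterAllMasters(): Unit =
    masters.flatMap(_.handleRegister(id)).headOption.foreach(onRegisteredWorker)

  // Mirrors the idea of changeMaster: remember the active master, mark the link healthy.
  private def changeMaster(url: String): Unit = {
    activeMasterUrl = Some(url)
    connected = true
  }

  private def onRegisteredWorker(masterUrl: String): Unit = {
    registered = true
    changeMaster(masterUrl)
  }
}

object RegistrationDemo {
  def main(args: Array[String]): Unit = {
    val masters = Seq(MasterStub("spark://a:7077", active = false),
                      MasterStub("spark://b:7077", active = true))
    val worker = new WorkerStub("worker-1", masters)
    worker.tryRegisterAllMasters()
    println(worker.activeMasterUrl) // the URL of whichever master is active
  }
}
```

In the real Worker the reply arrives asynchronously as a message, which is why the retry timer exists: if no master answers, the Worker simply tries again later.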
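If the Worker receives a RegisterWorkerFailed message instead, it exits.
Next, look at how the Master handles a heartbeat from a Worker.
Because the Master stored the workerId in idToWorker when the Worker registered, the normal path is the Some branch, which simply updates that worker's lastHeartbeat timestamp. The None branch deserves an explanation. When a registration message arrives, the Master records the worker in both idToWorker and workers. When the Master later detects that the worker has timed out, it removes the worker from idToWorker so that new tasks can no longer be assigned to it, but it leaves the entry in workers. The entry is removed from workers only if no heartbeat arrives for a much longer period, which indicates that the worker really cannot work any more. The None branch covers the case where heartbeats resume after such a timeout: the Master sends the Worker a ReconnectWorker message asking it to register again.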
- case Heartbeat(workerId) => {
- idToWorker.get(workerId) match {
- case Some(workerInfo) =>
- workerInfo.lastHeartbeat = System.currentTimeMillis()
- case None =>
- if (workers.map(_.id).contains(workerId)) {
- logWarning(s"Got heartbeat from unregistered worker $workerId." +
- " Asking it to re-register.")
- sender ! ReconnectWorker(masterUrl)
- } else {
- logWarning(s"Got heartbeat from unregistered worker $workerId." +
- " This worker was never registered, so ignoring the heartbeat.")
- }
- }
- }
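The two-level bookkeeping behind the None branch can be modelled in a few lines. This is a simplified sketch, not the Master's actual code: MasterBookkeeping, HeartbeatResult, the dead flag, and the timeout values used in the demo are all assumptions chosen for illustration.

```scala
import scala.collection.mutable

// Hypothetical, simplified worker record; Spark's real WorkerInfo carries much more state.
final class WorkerInfo(val id: String, var lastHeartbeat: Long, var dead: Boolean = false)

sealed trait HeartbeatResult
case object Updated extends HeartbeatResult          // Some branch: timestamp refreshed
case object AskedToReconnect extends HeartbeatResult // None branch, worker still in `workers`
case object Ignored extends HeartbeatResult          // None branch, worker never registered

class MasterBookkeeping(timeoutMs: Long, reaperIterations: Int) {
  val workers = mutable.HashSet[WorkerInfo]()
  val idToWorker = mutable.HashMap[String, WorkerInfo]()

  def register(id: String, now: Long): Unit = {
    val w = new WorkerInfo(id, now)
    workers += w
    idToWorker(id) = w
  }

  def heartbeat(id: String, now: Long): HeartbeatResult =
    idToWorker.get(id) match {
      case Some(w) =>
        w.lastHeartbeat = now
        Updated
      case None =>
        if (workers.exists(_.id == id)) AskedToReconnect else Ignored
    }

  // First timeout: drop from idToWorker only. Much later: drop from workers as well.
  def timeOutDeadWorkers(now: Long): Unit = {
    val stale = workers.filter(_.lastHeartbeat < now - timeoutMs).toList
    for (w <- stale) {
      if (!w.dead) {
        w.dead = true
        idToWorker -= w.id
      } else if (w.lastHeartbeat < now - (reaperIterations + 1) * timeoutMs) {
        workers -= w
      }
    }
  }
}

object HeartbeatDemo {
  def main(args: Array[String]): Unit = {
    val master = new MasterBookkeeping(timeoutMs = 60000L, reaperIterations = 15)
    master.register("w1", now = 0L)
    println(master.heartbeat("w1", now = 1000L))   // refreshed normally
    master.timeOutDeadWorkers(now = 100000L)       // w1 has been silent past the timeout
    println(master.heartbeat("w1", now = 100001L)) // late heartbeat after the timeout
  }
}
```

Keeping the timed-out entry in workers for a while is what lets the Master tell a late heartbeat from a known worker apart from one sent by a worker it has never seen.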
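This concludes the Worker startup flow and the messages the Worker sends on its own initiative. Everything else is handling of messages it receives passively, which will be covered later in the context of a concrete job.
Reposted from: http://blog.csdn.net/yueqian_zhu/article/details/47976127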