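The Worker's main method is similar to the Master's: it creates a SparkConf, parses the arguments, constructs the Worker object, and creates the RpcEnv used to exchange messages with other components and with itself.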
private[deploy] object Worker extends Logging {
val SYSTEM_NAME = "sparkWorker"
val ENDPOINT_NAME = "Worker"
def main(argStrings: Array[String]) {
SignalLogger.register(log)
val conf = new SparkConf
val args = new WorkerArguments(argStrings, conf)
val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
args.memory, args.masters, args.workDir, conf = conf)
rpcEnv.awaitTermination()
}
def startRpcEnvAndEndpoint(
host: String,
port: Int,
webUiPort: Int,
cores: Int,
memory: Int,
masterUrls: Array[String],
workDir: String,
workerNumber: Option[Int] = None,
conf: SparkConf = new SparkConf): RpcEnv = {
// The LocalSparkCluster runs multiple local sparkWorkerX RPC Environments
val systemName = SYSTEM_NAME + workerNumber.map(_.toString).getOrElse("")
val securityMgr = new SecurityManager(conf)
val rpcEnv = RpcEnv.create(systemName, host, port, conf, securityMgr)
val masterAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))
rpcEnv.setupEndpoint(ENDPOINT_NAME, new Worker(rpcEnv, webUiPort, cores, memory,
masterAddresses, systemName, ENDPOINT_NAME, workDir, conf, securityMgr))
rpcEnv
}
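The two small transformations in startRpcEnvAndEndpoint (building the system name and turning master URLs into addresses) can be sketched in isolation. This is a minimal illustration, not Spark's code: `MasterAddress` and `MasterUrlSketch` are hypothetical names, and the parsing relies on `java.net.URI` under the assumption that master URLs have the form `spark://host:port`.

```scala
import java.net.URI

// Hypothetical stand-in for Spark's RpcAddress, for illustration only.
final case class MasterAddress(host: String, port: Int)

object MasterUrlSketch {
  // Parse "spark://host:port" into a host/port pair, rejecting anything else.
  def parse(sparkUrl: String): MasterAddress = {
    val uri = new URI(sparkUrl)
    require(uri.getScheme == "spark" && uri.getHost != null && uri.getPort > 0,
      s"Invalid master URL: $sparkUrl")
    MasterAddress(uri.getHost, uri.getPort)
  }

  // Same expression as in the listing: the suffix is only present when
  // several workers run in one process (LocalSparkCluster).
  def systemName(workerNumber: Option[Int]): String =
    "sparkWorker" + workerNumber.map(_.toString).getOrElse("")
}
```

For example, `MasterUrlSketch.systemName(Some(2))` yields `sparkWorker2`, while a standalone worker started through main keeps the plain name `sparkWorker`.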
As with the Master, the Worker's onStart method runs next; this is where the Worker registers with the master.
override def onStart() {
assert(!registered)
logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
host, port, cores, Utils.megabytesToString(memory)))
logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
logInfo("Spark home: " + sparkHome)
createWorkDir() // create the work directory
// Optionally (configurable) start an external shuffle service, so that shuffle files
// read and written by executors are preserved after the executors exit.
shuffleService.startIfEnabled()
webUi = new WorkerWebUI(this, workDir, webUiPort)
webUi.bind()
registerWithMaster() // register with the master
metricsSystem.registerSource(workerSource)
metricsSystem.start()
// Attach the worker metrics servlet handler to the web ui after the metrics system is started.
metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
}
private def registerWithMaster() {
// onDisconnected may be triggered multiple times, so don't attempt registration
// if there are outstanding registration attempts scheduled.
registrationRetryTimer match {
case None =>
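registered = false
// Send a RegisterWorker message to the RpcEnv of every configured master. As covered in
// earlier sections, a master that handles it successfully replies with RegisteredWorker;
// otherwise it replies with RegisterWorkerFailed.
registerMasterFutures = tryRegisterAllMasters()
connectionAttemptCount = 0
// After a fixed interval the timer below triggers ReregisterWithMaster, which checks
// whether the worker is registered and, if not, sends the registration message again.
// The registered state is driven by the master's reply.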
registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
Option(self).foreach(_.send(ReregisterWithMaster))
}
},
INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
TimeUnit.SECONDS))
case Some(_) =>
logInfo("Not spawning another attempt to register with the master, since there is an" +
" attempt scheduled already.")
}
}
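The Option-guarded timer in registerWithMaster can be illustrated with a standalone sketch built on `java.util.concurrent`. `RegistrationRetrySketch` and its members are hypothetical names introduced here; the counter merely stands in for the real registration attempts, and none of this is Spark's actual implementation.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ScheduledFuture, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical class; only the Option-guarded timer mirrors the listing above.
class RegistrationRetrySketch(intervalMillis: Long) {
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()
  private var retryTimer: Option[ScheduledFuture[_]] = None
  val attempts = new AtomicInteger(0)

  // Returns true if a new retry timer was started, false if one was already pending.
  def registerWithMaster(): Boolean = synchronized {
    retryTimer match {
      case None =>
        attempts.incrementAndGet() // the initial attempt
        retryTimer = Some(scheduler.scheduleAtFixedRate(new Runnable {
          // Stands in for sending ReregisterWithMaster to the worker itself.
          override def run(): Unit = { attempts.incrementAndGet(); () }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS))
        true
      case Some(_) =>
        false // an attempt is already scheduled, so do nothing
    }
  }

  // Cancel the pending timer, e.g. once registration has been confirmed.
  def cancel(): Unit = synchronized {
    retryTimer.foreach(_.cancel(true))
    retryTimer = None
  }

  def shutdown(): Unit = { scheduler.shutdownNow(); () }
}
```

Calling `registerWithMaster()` twice in a row starts only one timer; the second call returns false, which is the behavior the "Not spawning another attempt" log line in the listing describes.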
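Now look at how the worker handles the master's RegisteredWorker message. When it registers, the worker does not know which master is active and which is standby, so it sends the registration message to every configured master. Both the active and the standby masters receive it, but only the active one replies, and the reply carries its own masterUrl; the worker uses that reply to identify the active master's RpcEnv for all subsequent communication.
The worker relies on heartbeats to stay connected to the master, so once registration succeeds a connected flag records whether the connection is healthy. It is set with connected = true inside the changeMaster method.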
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
masterRpcAddresses.map { masterAddress =>
registerMasterThreadPool.submit(new Runnable {
override def run(): Unit = {
try {
logInfo("Connecting to master " + masterAddress + "...")
val masterEndpoint =
rpcEnv.setupEndpointRef(Master.SYSTEM_NAME, masterAddress, Master.ENDPOINT_NAME)
registerWithMaster(masterEndpoint)
} catch {
case ie: InterruptedException => // Cancelled
case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
}
}
})
}
}
private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
workerId, host, port, self, cores, memory, webUi.boundPort, publicAddress))
.onComplete {
// This is a very fast action so we can use "ThreadUtils.sameThread"
case Success(msg) =>
Utils.tryLogNonFatalError {
handleRegisterResponse(msg)
}
case Failure(e) =>
logError(s"Cannot register with master: ${masterEndpoint.address}", e)
System.exit(1)
}(ThreadUtils.sameThread)
}
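The "ask every master, trust whichever one answers" idea described earlier can be modeled without any RPC machinery. The sketch below is a simplified, synchronous toy: `FakeMaster`, `RegisteredReply`, and `RegisterAllMastersSketch` are hypothetical names, and a standby master is modeled as simply not replying, as the text describes.

```scala
// Hypothetical types, for illustration only.
final case class RegisteredReply(masterUrl: String)

final case class FakeMaster(url: String, active: Boolean) {
  // Only the active master answers a registration request.
  def register(workerId: String): Option[RegisteredReply] =
    if (active) Some(RegisteredReply(url)) else None
}

object RegisterAllMastersSketch {
  // Send the request to every configured master and keep the URL of the one that replies.
  def findActiveMaster(masters: Seq[FakeMaster], workerId: String): Option[String] =
    masters.flatMap(_.register(workerId)).headOption.map(_.masterUrl)
}
```

In the real code the requests are submitted to a thread pool and answered asynchronously, so the worker learns the active master only when handleRegisterResponse runs.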
case RegisteredWorker(masterRef, masterWebUiUrl) =>
logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
registered = true // registration succeeded
changeMaster(masterRef, masterWebUiUrl) // save the active master's information
// Only after registration succeeds is the timer started that sends heartbeats to the master.
forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
// Send a heartbeat every (heartbeat timeout) / 4 milliseconds,
// i.e. every 15 seconds with the default 60-second timeout.
self.send(SendHeartbeat)
}
}, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
if (CLEANUP_ENABLED) {
logInfo(
s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
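// Periodically clean up directories under workDir that have not been updated
// for a long time and whose application is no longer running.
forwordMessageScheduler.scheduleAtFixedRate(new Runnable {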
override def run(): Unit = Utils.tryLogNonFatalError {
self.send(WorkDirCleanup)
}
}, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
}
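The heartbeat period follows from simple arithmetic on the worker timeout. The sketch below assumes the timeout is configured in seconds with a default of 60 (the `spark.worker.timeout` setting); `HeartbeatIntervalSketch` is a hypothetical name, and the formula mirrors the "(heartbeat timeout) / 4" rule quoted in the listing.

```scala
object HeartbeatIntervalSketch {
  // Convert the timeout from seconds to milliseconds and send four heartbeats
  // per timeout window, so a few lost heartbeats do not mark the worker as dead.
  def heartbeatMillis(workerTimeoutSeconds: Long = 60L): Long =
    workerTimeoutSeconds * 1000 / 4
}
```

With the assumed default this gives 15000 ms, so the master would need to miss four consecutive heartbeats before the timeout elapses.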
If the worker receives a RegisterWorkerFailed message instead, it exits.
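The two outcomes of registration can be summarized as a pattern match over the reply. This is a sketch with hypothetical types (`RegisteredSketch`, `FailedSketch`, `WorkerAction`); it returns a description of what to do instead of calling `System.exit`, so it can be exercised safely, and the exit code 1 is an assumption borrowed from the failure branch of registerWithMaster shown earlier.

```scala
// Hypothetical reply and action types, for illustration only.
sealed trait RegisterReplySketch
final case class RegisteredSketch(masterUrl: String) extends RegisterReplySketch
final case class FailedSketch(reason: String) extends RegisterReplySketch

sealed trait WorkerAction
final case class StartHeartbeats(masterUrl: String) extends WorkerAction
final case class Exit(code: Int) extends WorkerAction

object HandleResponseSketch {
  // Success: remember the master and begin heartbeating. Failure: give up.
  def handle(reply: RegisterReplySketch): WorkerAction = reply match {
    case RegisteredSketch(url) => StartHeartbeats(url)
    case FailedSketch(_)       => Exit(1)
  }
}
```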
Next, let's look at how the master handles a heartbeat once it receives one from the worker.