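1. Overview
This is the first post in a series analyzing the Spark source code. It starts from the most basic cluster setup and works toward a thorough dissection of Spark. I hope it deepens my own understanding of Spark and is useful to readers. To the point: Spark standalone mode, architecture diagram:
This post first covers how the Master and Worker start up and how they communicate: the Worker registers with the Master, and the Worker sends heartbeats to the Master.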
2. Master and its startup flow
Master extends the ThreadSafeRpcEndpoint class. When the Master starts, its own onStart function is executed.
2.1. Run the start-master.sh script -> spark-daemon.sh "org.apache.spark.deploy.master.Master" -> spark-class start -> org.apache.spark.launcher.Main -> org.apache.spark.deploy.master.Master
2.2. The Master's main method is called, which executes:
val (rpcEnv, _, _) = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, conf)
This method creates the RpcEnv object and the Master object:
def startRpcEnvAndEndpoint(
    host: String,
    port: Int,
    webUiPort: Int,
    conf: SparkConf): (RpcEnv, Int, Option[Int]) = {
  val securityMgr = new SecurityManager(conf)
  val rpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)
  val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME,
    new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))
  val portsResponse = masterEndpoint.askSync[BoundPortsResponse](BoundPortsRequest)
  (rpcEnv, portsResponse.webUIPort, portsResponse.restPort)
}
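To make the relationship between setupEndpoint and onStart easier to picture, here is a minimal sketch that does not depend on Spark. ToyRpcEnv, ToyEndpoint and ToyMaster are hypothetical names invented for this illustration. The toy calls onStart directly and delivers messages synchronously, whereas in Spark, as far as I understand it, both go through the endpoint's message inbox.

```scala
import scala.collection.mutable

object EndpointLifecycleSketch {
  // Toy counterpart of an RPC endpoint: a lifecycle hook plus a message handler.
  trait ToyEndpoint {
    def onStart(): Unit = ()
    def receive: PartialFunction[Any, Unit]
  }

  // Toy counterpart of RpcEnv: owns the endpoints and delivers messages synchronously.
  final class ToyRpcEnv {
    private val endpoints = mutable.Map.empty[String, ToyEndpoint]

    // Registering an endpoint runs onStart before any message can be delivered.
    def setupEndpoint(name: String, endpoint: ToyEndpoint): String = {
      endpoints(name) = endpoint
      endpoint.onStart()
      name
    }

    // Returns true if the endpoint exists and its receive function handled the message.
    def send(name: String, msg: Any): Boolean =
      endpoints.get(name).exists { e =>
        val handler = e.receive
        if (handler.isDefinedAt(msg)) { handler(msg); true } else false
      }
  }

  // Hypothetical endpoint that records what happened to it, in order.
  final class ToyMaster extends ToyEndpoint {
    val events = mutable.ArrayBuffer.empty[String]
    override def onStart(): Unit = events += "onStart"
    override def receive: PartialFunction[Any, Unit] = {
      case s: String => events += s"received $s"
    }
  }
}
```

The point of the sketch is only the ordering: creating the endpoint object does nothing by itself, and the lifecycle hook fires when the endpoint is registered with the environment.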
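2.3. The Master object is created and its onStart method is executed.
(1). The onStart function does the following:
Starts the web UI
Periodically sends a CheckForWorkerTimeOut message to itself, which removes timed-out workers.
Decides, based on configuration, whether to start the REST interface
Registers the Master's metrics source with the master MetricsSystem
Performs recovery according to the recovery mode specified in the configuration.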
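The timeout check in the list above boils down to comparing each worker's last heartbeat time against the current time. Below is a minimal sketch of that comparison, with no Spark dependency. WorkerInfo and partitionByTimeout are hypothetical names for this illustration, and the real Master also updates worker state and reschedules work, which is omitted here.

```scala
object WorkerTimeoutSketch {
  // Hypothetical, simplified record of what the Master knows about a worker.
  final case class WorkerInfo(id: String, lastHeartbeatMs: Long)

  // Splits workers into (alive, timedOut): a worker is timed out when its
  // last heartbeat is older than timeoutMs relative to nowMs.
  def partitionByTimeout(
      workers: Seq[WorkerInfo],
      nowMs: Long,
      timeoutMs: Long): (Seq[WorkerInfo], Seq[WorkerInfo]) =
    workers.partition(w => nowMs - w.lastHeartbeatMs <= timeoutMs)
}
```

For example, with a timeout of 60000 ms and a current time of 70000 ms, a worker last heard from at 1000 ms is timed out, while one heard from at 50000 ms is still alive.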
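(2) Messages the Master handles (the Request* messages and BoundPortsRequest at the end of the list expect a reply, so they are handled in receiveAndReply rather than receive):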
ElectedLeader
CompleteRecovery
RevokedLeadership: the Master's leadership is revoked
RegisterWorker: registers a worker; on success the Master sends a RegisteredWorker message back to the worker
RegisterApplication: registers an application
ExecutorStateChanged
DriverStateChanged
Heartbeat
MasterChangeAcknowledged
WorkerSchedulerStateResponse
WorkerLatestState
CheckForWorkerTimeOut
RequestSubmitDriver: a request to submit a driver
RequestKillDriver
RequestDriverStatus
RequestMasterState
BoundPortsRequest
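The message handling above is essentially one large pattern match. Here is a small sketch of that style for just two of the messages, RegisterWorker and Heartbeat. The case classes are heavily simplified stand-ins (the real messages carry endpoint references and more fields), and ToyMaster and lastSeen are hypothetical names for this illustration.

```scala
import scala.collection.mutable

object MasterReceiveSketch {
  // Simplified stand-ins for messages sent by a worker.
  sealed trait Message
  final case class RegisterWorker(id: String, cores: Int, memoryMb: Int) extends Message
  final case class Heartbeat(workerId: String) extends Message

  // Simplified stand-ins for what the master sends back.
  sealed trait Reply
  case object RegisteredWorker extends Reply
  final case class RegisterWorkerFailed(reason: String) extends Reply
  case object ReconnectWorker extends Reply

  final class ToyMaster {
    // Worker id -> time of registration or last heartbeat, in milliseconds.
    private val lastHeartbeat = mutable.Map.empty[String, Long]

    def lastSeen(id: String): Option[Long] = lastHeartbeat.get(id)

    // Handles one message; returns the reply to send to the worker, if any.
    def handle(msg: Message, nowMs: Long): Option[Reply] = msg match {
      case RegisterWorker(id, _, _) if lastHeartbeat.contains(id) =>
        Some(RegisterWorkerFailed("Duplicate worker ID"))
      case RegisterWorker(id, _, _) =>
        lastHeartbeat(id) = nowMs
        Some(RegisteredWorker)
      case Heartbeat(id) if lastHeartbeat.contains(id) =>
        lastHeartbeat(id) = nowMs // known worker: just refresh the timestamp
        None
      case Heartbeat(_) =>
        Some(ReconnectWorker) // unknown worker: ask it to register again
    }
  }
}
```

Sealing the message trait lets the compiler warn about unhandled message types, which is one reason this style works well for a protocol with many messages.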
3. Worker and its startup flow
Worker likewise extends the ThreadSafeRpcEndpoint class.
3.1. Run the start-slave.sh script -> spark-daemon.sh start "org.apache.spark.deploy.worker.Worker" -> spark-class start -> org.apache.spark.launcher.Main -> org.apache.spark.deploy.worker.Worker
3.2. The Worker's main method is executed
It starts the RpcEnv object and the Worker object.
val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores, args.memory, args.masters, args.workDir, conf = conf)
def startRpcEnvAndEndpoint(
    host: String,
    port: Int,
    webUiPort: Int,
    cores: Int,
    memory: Int,
    masterUrls: Array[String],
    workDir: String,
    workerNumber: Option[Int] = None,
    conf: SparkConf = new SparkConf): RpcEnv = {
  // The LocalSparkCluster runs multiple local sparkWorkerX RPC Environments
  val systemName = SYSTEM_NAME + workerNumber.map(_.toString).getOrElse("")
  val securityMgr = new SecurityManager(conf)
  val rpcEnv = RpcEnv.create(systemName, host, port, conf, securityMgr)
  val masterAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))
  rpcEnv.setupEndpoint(ENDPOINT_NAME, new Worker(rpcEnv, webUiPort, cores, memory,
    masterAddresses, ENDPOINT_NAME, workDir, conf, securityMgr))
  rpcEnv
}
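The call to RpcAddress.fromSparkURL turns each master URL of the form spark://host:port into a host and port pair. The sketch below shows one way such parsing could be done with java.net.URI. SparkUrlSketch and parse are hypothetical names, and the validation here is an assumption about what a reasonable check looks like, not a copy of Spark's implementation.

```scala
import java.net.URI

object SparkUrlSketch {
  // Parses "spark://host:port" into (host, port).
  // Throws an exception when the URL is malformed or uses another scheme.
  def parse(sparkUrl: String): (String, Int) = {
    val uri = new URI(sparkUrl)
    val valid = uri.getScheme == "spark" && uri.getHost != null && uri.getPort >= 0
    require(valid, s"Invalid master URL: $sparkUrl")
    (uri.getHost, uri.getPort)
  }
}
```

Calling SparkUrlSketch.parse("spark://node1:7077") yields the pair ("node1", 7077), while a URL such as "http://node1:7077" is rejected.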
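3.3. The Worker object is created and its onStart method is executed.
(1). The onStart function does the following:
Creates the work directory under the SPARK_HOME directory
Starts the external shuffle service
Starts the worker web UI
Registers with the Master, obtaining masterRef via val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME).
Starts the MetricsSystem
3.4. After receiving the RegisteredWorker message, which the Master sends once registration is complete, the Worker calls the handleRegisterResponse method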
private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
  msg match {
    case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress) =>
      if (preferConfiguredMasterAddress) {
        logInfo("Successfully registered with master " + masterAddress.toSparkURL)
      } else {
        logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      }
      registered = true
      // Set masterRef
      changeMaster(masterRef, masterWebUiUrl, masterAddress)
      // Send heartbeats at a fixed rate
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      // Clean up the work directory periodically, if enabled
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
      val execs = executors.values.map { e =>
        new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
      }
      masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))

    case RegisterWorkerFailed(message) =>
      if (!registered) {
        logError("Worker registration failed: " + message)
        System.exit(1)
      }

    case MasterInStandby =>
      // Ignore. Master not yet ready.
  }
}
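The heartbeat part of the method above relies on the JDK's ScheduledExecutorService. Here is a self-contained sketch of the same pattern outside Spark. HeartbeatSchedulerSketch and start are hypothetical names, and the try/catch stands in for Utils.tryLogNonFatalError. The wrapper matters because scheduleAtFixedRate suppresses all later runs once one run throws.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object HeartbeatSchedulerSketch {
  // Runs sendHeartbeat immediately and then every intervalMs milliseconds.
  // Returns a function that stops the schedule.
  def start(intervalMs: Long)(sendHeartbeat: () => Unit): () => Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit =
        // An uncaught exception would cancel all future runs, so log and continue.
        try sendHeartbeat()
        catch { case e: Exception => println(s"heartbeat failed: ${e.getMessage}") }
    }, 0, intervalMs, TimeUnit.MILLISECONDS)
    () => { scheduler.shutdownNow(); () }
  }
}
```

A caller would keep the returned function and invoke it on shutdown, for example val stop = HeartbeatSchedulerSketch.start(15000L)(() => println("heartbeat")) followed later by stop().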
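4. Summary
It is best to walk through the source code yourself alongside the explanation above; things become much clearer that way. Next, I will analyze how the driver starts.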