To get a deeper understanding of Spark, we walk through its source code. The analysis follows the flow below, covering standalone cluster startup and the job-submission execution path:
- Spark RPC analysis
- start-all.sh
- Master startup analysis
- Worker startup analysis
- spark-submit.sh script analysis
- SparkSubmit analysis
- SparkContext initialization
2.start-all.sh
This source-code analysis uses Spark 2.4.7 with Scala 2.11.
1. start-all.sh internally runs start-master.sh and start-slaves.sh:
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
# Load the Spark configuration
. "${SPARK_HOME}/sbin/spark-config.sh"
# Start Master
"${SPARK_HOME}/sbin"/start-master.sh
# Start Workers
"${SPARK_HOME}/sbin"/start-slaves.sh
2. Both start-master.sh and start-slaves.sh delegate to spark-daemon.sh.
start-master.sh
"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
--host $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \
$ORIGINAL_ARGS
start-slaves.sh
"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
--webui-port "$WEBUI_PORT" $PORT_FLAG $PORT_NUM $MASTER "$@"
3. spark-daemon.sh in turn calls spark-class, which ultimately runs:
- Master startup: spark-class org.apache.spark.deploy.master.Master
- Worker startup: spark-class org.apache.spark.deploy.worker.Worker
Execution then enters the corresponding class's main method.
3.Master startup analysis
def main(argStrings: Array[String]) {
  Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(exitOnUncaughtException = false))
  // Initialize daemon logging
  Utils.initDaemon(log)
  // Creating the conf object loads Spark's default configuration parameters
  val conf = new SparkConf
  /**
   * 1. Parse the arguments passed to main
   * 2. Parse the configuration files needed at cluster startup
   */
  val args = new MasterArguments(argStrings, conf)
  /**
   * Initialize the RPC service and endpoint:
   * 1. RpcEnv
   * 2. Endpoint
   */
  val (rpcEnv, _, _) = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, conf)
  // Block until the RpcEnv terminates
  rpcEnv.awaitTermination()
}
Now look at the startRpcEnvAndEndpoint method, which initializes the RpcEnv and starts the Master endpoint.
def startRpcEnvAndEndpoint(host: String, port: Int, webUiPort: Int, conf: SparkConf): (RpcEnv, Int, Option[Int]) = {
  // Initialize the SecurityManager
  val securityMgr = new SecurityManager(conf)
  // Initialize the RpcEnv
  val rpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)
  /**
   * Register the Master endpoint with the RpcEnv; this instantiates Master.
   * rpcEnv.setupEndpoint initializes and starts the endpoint directly.
   * Parameters:
   * 1. ENDPOINT_NAME: the endpoint's name
   * 2. new Master(...): the endpoint instance
   * Since this endpoint is the Master instance, execution flows to two places:
   * 1. Master's constructor
   * 2. Master's onStart(), which does three things:
   *    1. starts the Master web UI
   *    2. starts a scheduled task that checks worker liveness
   *    3. if HA is enabled, has this Master take part in leader election
   */
  val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))
  // Ask the Master endpoint for its bound ports (web UI and REST)
  val portsResponse = masterEndpoint.askSync[BoundPortsResponse](BoundPortsRequest)
  // Return the results
  (rpcEnv, portsResponse.webUIPort, portsResponse.restPort)
}
We will not walk through every line in detail; here we focus on the scheduled task started in Master's onStart method.
/**
 * scheduleAtFixedRate starts a scheduled task: the Master's heartbeat-check mechanism.
 * It sends itself a CheckForWorkerTimeOut message to detect Workers that have timed out.
 * Following the code shows this ultimately calls timeOutDeadWorkers(), which removes
 * timed-out Workers. The web UI then shows the live worker nodes.
 */
checkForWorkerTimeOutTask = forwardMessageThread.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = Utils.tryLogNonFatalError {
    /**
     * CheckForWorkerTimeOut triggers the worker-state check.
     * 1. a ref sends to another endpoint
     * 2. self sends to this endpoint itself
     */
    self.send(CheckForWorkerTimeOut)
  }
  // run() above executes every WORKER_TIMEOUT_MS milliseconds
}, 0, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
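The check itself boils down to comparing each worker's last-heartbeat timestamp against the timeout. A minimal plain-Java sketch of that idea (the class, method, and map names here are illustrative, not Spark's; Spark's default `spark.worker.timeout` is 60 seconds):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WorkerTimeoutSketch {
    static final long WORKER_TIMEOUT_MS = 60_000; // mirrors Spark's 60s default

    // Returns the ids of workers whose last heartbeat is older than the timeout,
    // mirroring what timeOutDeadWorkers() checks before removing a worker.
    static List<String> timedOutWorkers(Map<String, Long> lastHeartbeat, long now) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > WORKER_TIMEOUT_MS) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        Map<String, Long> beats = new HashMap<>();
        long now = 1_000_000L;
        beats.put("worker-1", now - 10_000);  // fresh heartbeat: alive
        beats.put("worker-2", now - 70_000);  // stale heartbeat: timed out
        System.out.println(timedOutWorkers(beats, now)); // [worker-2]
    }
}
```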
The following code implements the Master HA election mechanism.
/**
 * HA: leader election among multiple Masters.
 * Configured in spark-env.sh:
 * RECOVERY_MODE = spark.deploy.recoveryMode = ZOOKEEPER
 */
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
  /**
   * ZOOKEEPER: Masters are managed automatically via ZooKeeper.
   */
  case "ZOOKEEPER" =>
    logInfo("Persisting recovery state to ZooKeeper")
    val zkFactory = new ZooKeeperRecoveryModeFactory(conf, serializer)
    // zkFactory.createLeaderElectionAgent(this) handles leader election
    (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
  /**
   * FILESYSTEM: after a Master failure, the machine must be restarted manually;
   * once restarted, it immediately becomes the active Master and serves requests
   * (accepting application submissions and new job requests).
   */
  case "FILESYSTEM" =>
    val fsFactory = new FileSystemRecoveryModeFactory(conf, serializer)
    (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
  /**
   * CUSTOM: lets users plug in their own Master HA implementation,
   * which is particularly useful for advanced users.
   */
  case "CUSTOM" =>
    val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
    val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer]).newInstance(conf, serializer)
      .asInstanceOf[StandaloneRecoveryModeFactory]
    (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
  /**
   * Default: what a stock Spark download uses. Cluster state
   * (drivers, applications, workers, executors) is not persisted;
   * the Master manages the cluster as soon as it starts.
   */
  case _ => (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
}
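For reference, ZooKeeper-based recovery is typically enabled through spark-env.sh; the ZooKeeper host list and directory below are placeholders:

```
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```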
4.Worker startup analysis
The main method in Worker:
def main(argStrings: Array[String]) {
  Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(exitOnUncaughtException = false))
  Utils.initDaemon(log)
  val conf = new SparkConf
  val args = new WorkerArguments(argStrings, conf)
  /**
   * Initialize the RPC service and endpoint:
   * 1. RpcEnv
   * 2. Endpoint
   */
  val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores, args.memory, args.masters, args.workDir, conf = conf)
  // With the external shuffle service enabled, at most one Worker may run per host
  val externalShuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
  val sparkWorkerInstances = scala.sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
  require(externalShuffleServiceEnabled == false || sparkWorkerInstances <= 1,
    "Starting multiple workers on one host is failed because we may launch no more than one " +
    "external shuffle service on each host, please set spark.shuffle.service.enabled to " +
    "false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.")
  // Block until the RpcEnv terminates
  rpcEnv.awaitTermination()
}
The Worker instance is created in startRpcEnvAndEndpoint, which closely mirrors the Master's version:
def startRpcEnvAndEndpoint(
    host: String, port: Int, webUiPort: Int, cores: Int, memory: Int, masterUrls: Array[String], workDir: String,
    workerNumber: Option[Int] = None, conf: SparkConf = new SparkConf): RpcEnv = {
  // The LocalSparkCluster runs multiple local sparkWorkerX RPC Environments
  val systemName = SYSTEM_NAME + workerNumber.map(_.toString).getOrElse("")
  val securityMgr = new SecurityManager(conf)
  val rpcEnv = RpcEnv.create(systemName, host, port, conf, securityMgr)
  val masterAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))
  /**
   * Create the Worker instance. This line triggers two important things:
   * 1. the Worker constructor
   * 2. the Worker's onStart method
   */
  rpcEnv.setupEndpoint(ENDPOINT_NAME, new Worker(rpcEnv, webUiPort, cores, memory, masterAddresses, ENDPOINT_NAME, workDir, conf, securityMgr))
  rpcEnv
}
The key part is the Worker's onStart method:
/**
 * Lifecycle method. It:
 * 1. creates the work directory: createWorkDir()
 * 2. starts the shuffle and RPC services: startExternalShuffleService()
 * 3. starts the Worker's web UI
 * 4. registers with the Master once started: registerWithMaster()
 */
override def onStart() {
  assert(!registered)
  logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(host, port, cores, Utils.megabytesToString(memory)))
  logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
  logInfo("Spark home: " + sparkHome)
  // Create the work directory
  createWorkDir()
  /**
   * Create the RPC service.
   * ExternalShuffleService is a separate process, disabled by default,
   * that serves shuffle data on behalf of Executors to reduce their load.
   */
  startExternalShuffleService()
  // Start the web UI
  webUi = new WorkerWebUI(this, workDir, webUiPort)
  webUi.bind()
  workerWebUiUrl = s"http://$publicAddress:${webUi.boundPort}"
  // Register with the Master
  registerWithMaster()
  // Start the metrics system, used to measure various metrics
  metricsSystem.registerSource(workerSource)
  metricsSystem.start()
  // Attach the worker metrics servlet handler to the web UI after the metrics system is started
  metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
}
Following startExternalShuffleService() shows how the Worker-side RPC server is created. The core code:
def start() {
  require(server == null, "Shuffle server already started")
  val authEnabled = securityManager.isAuthenticationEnabled()
  logInfo(s"Starting shuffle service on port $port (auth enabled = $authEnabled)")
  val bootstraps: Seq[TransportServerBootstrap] = if (authEnabled) {
    Seq(new AuthServerBootstrap(transportConf, securityManager))
  } else {
    Nil
  }
  // Create the RPC server
  server = transportContext.createServer(port, bootstraps.asJava)
  shuffleServiceSource.registerMetricSet(server.getAllMetrics)
  shuffleServiceSource.registerMetricSet(blockHandler.getAllMetrics)
  masterMetricsSystem.registerSource(shuffleServiceSource)
  masterMetricsSystem.start()
}
Once the Worker is up, it registers with the Master via registerWithMaster(). The core code:
/**
 * There may be multiple Masters.
 */
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        try {
          logInfo("Connecting to master " + masterAddress + "...")
          /**
           * Obtain a reference to the Master's endpoint:
           * to send the Master a message, we need the ref of its endpoint.
           */
          val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
          // Send the registration message to the Master
          sendRegisterMessageToMaster(masterEndpoint)
        } catch {
          case ie: InterruptedException => // Cancelled
          case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
        }
      }
    })
  }
}
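The pattern above submits one registration attempt per Master address to a thread pool. A simplified plain-Java sketch of the same idea (the queue below is a hypothetical stand-in for "a register message was sent"; the real code calls setupEndpointRef and sendRegisterMessageToMaster):

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class RegisterAllMastersSketch {
    // Submit one registration attempt per master address to a thread pool,
    // mirroring tryRegisterAllMasters, and return the addresses contacted.
    static List<String> registerAll(List<String> masterAddresses) throws Exception {
        ConcurrentLinkedQueue<String> sent = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(masterAddresses.size());
        List<Future<?>> futures = new java.util.ArrayList<>();
        for (String addr : masterAddresses) {
            // In the real Worker this runnable resolves the Master endpoint
            // ref and sends RegisterWorker to it.
            futures.add(pool.submit(() -> sent.add(addr)));
        }
        for (Future<?> f : futures) f.get(5, TimeUnit.SECONDS); // wait for every attempt
        pool.shutdown();
        return new java.util.ArrayList<>(sent);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(registerAll(List.of("master-1:7077", "master-2:7077")).size()); // 2
    }
}
```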
This obtains a reference to the Master endpoint and sends it a message. Drilling into the sending code:
private def sendRegisterMessageToMaster(masterEndpoint: RpcEndpointRef): Unit = {
  /**
   * Send the message. RegisterWorker is the Worker's registration message, so:
   * - the Master's receive method will receive the RegisterWorker message
   * - the Worker's receive method will later receive the Master's success reply
   */
  masterEndpoint.send(RegisterWorker(workerId, host, port, self, cores, memory, workerWebUiUrl, masterEndpoint.address))
}
When the Worker sends RegisterWorker, the Master's receive method handles it:
/**
 * Handle Worker registration.
 */
case RegisterWorker(id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  if (state == RecoveryState.STANDBY) {
    workerRef.send(MasterInStandby)
  } else if (idToWorker.contains(id)) {
    workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  } else {
    // Wrap the registering worker's information
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory, workerRef, workerWebUiUrl)
    // Add the worker to the Master's workers set
    if (registerWorker(worker)) {
      // Registration succeeded: persist the worker
      persistenceEngine.addWorker(worker)
      /**
       * Reply with a registration-success message:
       * 1. workerRef is the proxy for the Worker's endpoint
       * 2. send a RegisteredWorker message back to the Worker
       */
      workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
      // The worker can now take part in scheduling
      schedule()
    } else {
      val workerAddress = worker.endpoint.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " + "address: " + workerAddress)
      workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: " + workerAddress))
    }
  }
The Master then sends the Worker a RegisteredWorker message. Note that this class mixes in RegisterWorkerResponse:
case class RegisteredWorker(
master: RpcEndpointRef,
masterWebUiUrl: String,
masterAddress: RpcAddress) extends DeployMessage with RegisterWorkerResponse
So the Worker's receive method matches on RegisterWorkerResponse and delegates to handleRegisterResponse:
override def receive: PartialFunction[Any, Unit] = synchronized {
case msg: RegisterWorkerResponse => handleRegisterResponse(msg)
}
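The receive method is a partial function that dispatches on the concrete message type. A simplified Java sketch of that dispatch (the nested classes below are illustrative stand-ins for Spark's DeployMessage hierarchy, and the returned strings summarize what each branch does):

```java
public class RegisterResponseSketch {
    // Hypothetical stand-ins for Spark's RegisterWorkerResponse messages
    interface RegisterWorkerResponse {}
    static class RegisteredWorker implements RegisterWorkerResponse {}
    static class RegisterWorkerFailed implements RegisterWorkerResponse {
        final String message;
        RegisterWorkerFailed(String message) { this.message = message; }
    }
    static class MasterInStandby implements RegisterWorkerResponse {}

    // Mirrors handleRegisterResponse: dispatch on the concrete response type
    static String handle(RegisterWorkerResponse msg) {
        if (msg instanceof RegisteredWorker) {
            return "registered";   // set registered = true, start heartbeat timer
        } else if (msg instanceof RegisterWorkerFailed) {
            return "failed: " + ((RegisterWorkerFailed) msg).message; // exit if never registered
        } else {
            return "standby";      // ignore; this Master is not yet active
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(new RegisteredWorker())); // registered
        System.out.println(handle(new MasterInStandby()));  // standby
    }
}
```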
In handleRegisterResponse, on a successful registration the Worker schedules a SendHeartbeat message to itself:
private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
  msg match {
    /**
     * Handle successful registration.
     */
    case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress) =>
      if (preferConfiguredMasterAddress) {
        logInfo("Successfully registered with master " + masterAddress.toSparkURL)
      } else {
        logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      }
      registered = true
      changeMaster(masterRef, masterWebUiUrl, masterAddress)
      /**
       * After the Worker starts, it registers with the Master; once registered
       * (i.e. on receiving this RegisterWorkerResponse), it starts a heartbeat timer.
       */
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          // Send a heartbeat message to itself
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      if (CLEANUP_ENABLED) {
        logInfo(s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
      val execs = executors.values.map { e =>
        new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
      }
      masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))
    case RegisterWorkerFailed(message) =>
      if (!registered) {
        logError("Worker registration failed: " + message)
        System.exit(1)
      }
    case MasterInStandby => // Ignore. Master not yet ready.
  }
}
When the Worker receives its own SendHeartbeat, it sends a Heartbeat message to the Master:
case SendHeartbeat => if (connected) {
sendToMaster(Heartbeat(workerId, self))
}
The corresponding handling in the Master:
/**
 * Heartbeat handling.
 * After the Master starts, registered Workers report in periodically;
 * this handles the heartbeats they send after successful registration.
 */
case Heartbeat(workerId, worker) => idToWorker.get(workerId) match {
  // On every heartbeat, the Master updates that worker's last-heartbeat time
  case Some(workerInfo) => workerInfo.lastHeartbeat = System.currentTimeMillis()
  case None => if (workers.map(_.id).contains(workerId)) {
    logWarning(s"Got heartbeat from unregistered worker $workerId." + " Asking it to re-register.")
    worker.send(ReconnectWorker(masterUrl))
  } else {
    logWarning(s"Got heartbeat from unregistered worker $workerId." + " This worker was never registered, so ignoring the heartbeat.")
  }
}
Once a Worker registers successfully, the Master runs the schedule() method:
/**
 * Every worker that comes online registers with the Master; once registered,
 * it can take on work. schedule() then does two things as needed:
 * 1. possibly launchDriver(worker, driver)
 * 2. startExecutorsOnWorkers()
 */
private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Count the alive workers
  // Drivers take strict precedence over executors
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  // Launch the drivers sitting in the waiting queue
  var curPos = 0
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      // If this worker has enough CPU and memory for the driver, launch it there
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // Launch the driver
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // Start executors on the worker nodes
  startExecutorsOnWorkers()
}
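The round-robin driver placement above can be sketched in isolation. A minimal Java model (the Worker and Driver classes here are hypothetical simplifications carrying only the free-resource fields the loop inspects):

```java
import java.util.ArrayList;
import java.util.List;

public class DriverScheduleSketch {
    // Simplified worker: free memory (MB) and free cores
    static class Worker {
        final String id; int memFree, coresFree;
        Worker(String id, int memFree, int coresFree) { this.id = id; this.memFree = memFree; this.coresFree = coresFree; }
    }
    // Simplified driver resource request
    static class Driver {
        final String id; final int mem, cores;
        Driver(String id, int mem, int cores) { this.id = id; this.mem = mem; this.cores = cores; }
    }

    // Round-robin placement over alive workers, mirroring the loop in schedule():
    // for each waiting driver, walk the worker list from where the last placement
    // left off until a worker with enough free memory and cores is found.
    static List<String> place(List<Worker> aliveWorkers, List<Driver> waitingDrivers) {
        List<String> assignments = new ArrayList<>();
        int curPos = 0;
        int n = aliveWorkers.size();
        for (Driver driver : waitingDrivers) {
            boolean launched = false;
            int visited = 0;
            while (visited < n && !launched) {
                Worker w = aliveWorkers.get(curPos);
                visited++;
                if (w.memFree >= driver.mem && w.coresFree >= driver.cores) {
                    w.memFree -= driver.mem;       // "launchDriver": reserve the resources
                    w.coresFree -= driver.cores;
                    assignments.add(driver.id + "->" + w.id);
                    launched = true;
                }
                curPos = (curPos + 1) % n;
            }
        }
        return assignments;
    }

    public static void main(String[] args) {
        List<Worker> workers = new ArrayList<>();
        workers.add(new Worker("w1", 1024, 2));
        workers.add(new Worker("w2", 4096, 8));
        List<Driver> drivers = new ArrayList<>();
        drivers.add(new Driver("d1", 2048, 4)); // only w2 can fit this
        drivers.add(new Driver("d2", 512, 1));  // round-robin continues at w1
        System.out.println(place(workers, drivers)); // [d1->w2, d2->w1]
    }
}
```

Note that the real schedule() shuffles the alive workers first, so placements are spread across the cluster rather than always starting from the same node.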