This article borrows its architectural breakdown from the book 图解Spark核心技术与案例实战 (Illustrated Spark: Core Techniques and Case Studies). The book's structure gave me a good map of the territory, but this article studies the 2.4 codebase, so the source-code reading on top of that structure is my own.
First, a look at Spark's components.
A Spark deployment consists of the Driver (the main function where your logic lives), the Cluster Manager (the resource manager; in standalone mode this is the Master node of the Spark cluster), the Workers (the machines that run whatever Spark assigns to each compute node), and the Executors (processes running on the Workers).
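Everything that follows is driven from the driver side by constructing a SparkContext. For orientation, here is a minimal driver program of my own (the app name and the resource numbers in the master URL are placeholders, not from the original article); local-cluster[2,1,1024] asks for 2 workers, each with 1 core and 1024 MB:

import org.apache.spark.{SparkConf, SparkContext}

object MiniDriver {
  def main(args: Array[String]): Unit = {
    // local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
    val conf = new SparkConf()
      .setAppName("component-walkthrough")      // placeholder app name
      .setMaster("local-cluster[2,1,1024]")     // the pseudo-distributed mode analyzed below
    val sc = new SparkContext(conf)             // the initialization steps below happen here
    val sum = sc.parallelize(1 to 100).reduce(_ + _) // a trivial action so the executors do some work
    println(s"sum = $sum")
    sc.stop()
  }
}

Note that local-cluster mode still launches executors as separate JVM processes (see the launch log near the end), so it needs a built Spark distribution available locally; it is mostly used for testing Spark itself. The rest of this section walks through what the SparkContext constructor does, step by step.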
- Validate the configuration and set the Spark Driver's host and port
_conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
_conf.setIfMissing("spark.driver.port", "0")
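As a user-side illustration of the two settings involved here (the host value below is a placeholder of mine, not from the original):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.host", "10.0.0.5") // placeholder; defaults to the local host address
  .set("spark.driver.port", "7078")     // the default "0" means pick a random free port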
- Initialize the event log directory and compression codec
_eventLogDir =
if (isEventLogEnabled) {
val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
.stripSuffix("/")
Some(Utils.resolveURI(unresolvedDir))
} else {
None
}
_eventLogCodec = {
val compress = _conf.getBoolean("spark.eventLog.compress", false)
if (compress && isEventLogEnabled) {
Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
} else {
None
}
}
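These two fields are driven by the user-facing event-log settings; a minimal sketch (the directory is a placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")                  // makes isEventLogEnabled true
  .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // placeholder; the default is /tmp/spark-events
  .set("spark.eventLog.compress", "true")                 // triggers the _eventLogCodec resolution above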
- Initialize the app status store and the LiveListenerBus event bus
_listenerBus = new LiveListenerBus(_conf)
// Initialize the app status store and listener before SparkEnv is created so that it gets
// all events.
_statusStore = AppStatusStore.createLiveStore(conf)
listenerBus.addToStatusQueue(_statusStore.listener.get)
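Since the LiveListenerBus is what later delivers every scheduler event, here is a small user-side illustration (my own, not from the original) of hooking a custom listener onto the same bus:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// A tiny listener that just logs job completions; it rides on the same LiveListenerBus
class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
}

// usage, given an existing SparkContext sc:
// sc.addSparkListener(new JobEndLogger)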
- Create SparkEnv, Spark's execution environment
The environment variables collected in executorEnvs are sent to the Master when the application registers; after the Master schedules the application onto Workers, each Worker uses the information in executorEnvs to launch its Executors.
The Executor memory size is specified with spark.executor.memory; it can also be set through the environment variables SPARK_EXECUTOR_MEMORY or SPARK_MEM (a condensed sketch of the precedence follows the createSparkEnv snippet below).
// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus): SparkEnv = {
SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master, conf))
}
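Condensed from the 2.4 SparkContext (details may differ slightly), the precedence among those memory settings is roughly:

// spark.executor.memory wins over SPARK_EXECUTOR_MEMORY, which wins over SPARK_MEM;
// the final fallback is 1024 MB
_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM")).map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(1024)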
- Initialize the status tracker SparkStatusTracker (a small usage sketch follows the progress-bar snippet below)
_statusTracker = new SparkStatusTracker(this, _statusStore)
- Create a ConsoleProgressBar depending on the configuration
_progressBar =
if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
Some(new ConsoleProgressBar(this))
} else {
None
}
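The status tracker created above is also exposed to user code; an illustrative (assumed) usage:

// given an existing SparkContext sc, poll running jobs and stages through the tracker
val tracker = sc.statusTracker
val activeJobs = tracker.getActiveJobIds()     // IDs of currently running jobs
val activeStages = tracker.getActiveStageIds() // IDs of currently running stages
println(s"active jobs: ${activeJobs.mkString(",")}; active stages: ${activeStages.mkString(",")}")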
- Create and initialize the Spark UI
_ui =
if (conf.getBoolean("spark.ui.enabled", true)) {
Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
startTime))
} else {
// For tests, do not enable the UI
None
}
// Bind the UI before starting the task scheduler to communicate
// the bound port to the cluster manager properly
_ui.foreach(_.bind())
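The user-facing knobs involved here, for reference (the port value is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.enabled", "false") // skip creating the SparkUI entirely (common in tests)
  .set("spark.ui.port", "4050")     // default is 4040; only relevant when the UI is enabled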
- Hadoop-related configuration and Executor environment variables
_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
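SparkHadoopUtil.newConfiguration also copies any spark.hadoop.* entries from the SparkConf into the resulting Hadoop Configuration, so Hadoop settings can be passed through Spark properties; a sketch (the namenode address is a placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // anything prefixed with spark.hadoop. ends up in the Hadoop Configuration
  .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")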
- Register the HeartbeatReceiver
// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
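The HeartbeatReceiver endpoint is where the executors' periodic heartbeats land; the related timing knobs (the values shown are, to my knowledge, the defaults) are:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s") // how often each executor heartbeats the driver
  .set("spark.network.timeout", "120s")           // executors silent for longer than this are considered lost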
- Taking the pseudo-distributed (local-cluster) deploy mode as an example
The createTaskScheduler() method in SparkContext.scala produces the backend and the scheduler:
...
case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
// Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
val memoryPerSlaveInt = memoryPerSlave.toInt
if (sc.executorMemory > memoryPerSlaveInt) {
throw new SparkException(
"Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
memoryPerSlaveInt, sc.executorMemory))
}
val scheduler = new TaskSchedulerImpl(sc)
val localCluster = new LocalSparkCluster(
numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
val masterUrls = localCluster.start()
val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend)
backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
localCluster.stop()
}
(backend, scheduler)
...
Here localCluster holds the local pseudo-distributed components, namely the Master and the Workers.
localCluster.start() brings the Master and the Workers up on the local machine:
def start(): Array[String] = {
logInfo("Starting a local Spark cluster with " + numWorkers + " workers.")
// Disable REST server on Master in this mode unless otherwise specified
val _conf = conf.clone()
.setIfMissing("spark.master.rest.enabled", "false")
.set(config.SHUFFLE_SERVICE_ENABLED.key, "false")
/* Start the Master */
val (rpcEnv, webUiPort, _) = Master.startRpcEnvAndEndpoint(localHostname, 0, 0, _conf)
masterWebUIPort = webUiPort
masterRpcEnvs += rpcEnv
val masterUrl = "spark://" + Utils.localHostNameForURI() + ":" + rpcEnv.address.port
val masters = Array(masterUrl)
/* Start the Workers */
for (workerNum <- 1 to numWorkers) {
val workerEnv = Worker.startRpcEnvAndEndpoint(localHostname, 0, 0, coresPerWorker,
memoryPerWorker, masters, null, Some(workerNum), _conf)
workerRpcEnvs += workerEnv
}
masters
}
When the Master and the Workers start up, they register with each other.
- Worker.scala's onStart() method calls registerWithMaster() -> tryRegisterAllMasters()
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
masterRpcAddresses.map { masterAddress =>
registerMasterThreadPool.submit(new Runnable {
override def run(): Unit = {
try {
logInfo("Connecting to master " + masterAddress + "...")
val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
// the worker registers itself with the master
sendRegisterMessageToMaster(masterEndpoint)
} catch {
case ie: InterruptedException => // Cancelled
case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
}
}
})
}
}
private def sendRegisterMessageToMaster(masterEndpoint: RpcEndpointRef): Unit = {
masterEndpoint.send(RegisterWorker(
workerId,
host,
port,
self,
cores,
memory,
workerWebUiUrl,
masterEndpoint.address))
}
- In Master.scala, the receive() method handles the worker registering itself with the Master. The Master first checks whether it is a standby master and whether this worker is already registered. If everything looks fine it calls registerWorker(worker), which records the worker in the Master's bookkeeping structures (a condensed sketch of registerWorker() follows the receive() excerpt below), and then tells the worker that it has been registered.
override def receive: PartialFunction[Any, Unit] = {
...
case RevokedLeadership =>
logError("Leadership has been revoked -- master shutting down.")
System.exit(0)
case RegisterWorker(
id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
workerHost, workerPort, cores, Utils.megabytesToString(memory)))
if (state == RecoveryState.STANDBY) {
workerRef.send(MasterInStandby)
} else if (idToWorker.contains(id)) {
workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
} else {
val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
workerRef, workerWebUiUrl)
if (registerWorker(worker)) {
persistenceEngine.addWorker(worker)
workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
schedule()
} else {
val workerAddress = worker.endpoint.address
logWarning("Worker registration failed. Attempted to re-register worker at same " +
"address: " + workerAddress)
workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
+ workerAddress))
}
}
...
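For completeness, registerWorker() does roughly the following (condensed from the 2.4 Master; recovery-related details are simplified):

private def registerWorker(worker: WorkerInfo): Boolean = {
  // drop DEAD entries previously registered from the same host and port
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach(workers -= _)
  val workerAddress = worker.endpoint.address
  if (addressToWorker.contains(workerAddress)) {
    // an ALIVE worker is already registered at this address; in the real code a stale
    // UNKNOWN worker left over from recovery is replaced instead of rejected
    return false
  }
  // record the worker in the Master's bookkeeping structures
  workers += worker
  idToWorker(worker.id) = worker
  addressToWorker(workerAddress) = worker
  true
}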
- Now look at an important product of createTaskScheduler() in SparkContext.scala: the backend, of type StandaloneSchedulerBackend. When TaskSchedulerImpl's start() runs, it calls the backend's start(). In StandaloneSchedulerBackend.start(), the command is first defined to run org.apache.spark.executor.CoarseGrainedExecutorBackend; the command is then packed into an appDesc object which, besides the command, also carries the appName, the maximum number of CPU cores, the executor memory size, and so on. The appDesc is then handed to a client object of type StandaloneAppClient, and the client's start() method is called (a condensed sketch of the args carried by the command follows the excerpt below).
...
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
client.start()
...
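The args packed into that command look roughly like this (condensed from StandaloneSchedulerBackend.start() in 2.4; the {{...}} placeholders are substituted later by ExecutorRunner, which is exactly what shows up in the launch log further below):

val driverUrl = RpcEndpointAddress(
  sc.conf.get("spark.driver.host"),
  sc.conf.get("spark.driver.port").toInt,
  CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
val args = Seq(
  "--driver-url", driverUrl,
  "--executor-id", "{{EXECUTOR_ID}}",
  "--hostname", "{{HOSTNAME}}",
  "--cores", "{{CORES}}",
  "--app-id", "{{APP_ID}}",
  "--worker-url", "{{WORKER_URL}}")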
StandaloneAppClient's start() then registers the application with the Master via registerWithMaster() -> tryRegisterAllMasters():
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
for (masterAddress <- masterRpcAddresses) yield {
registerMasterThreadPool.submit(new Runnable {
override def run(): Unit = try {
...
logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
// register the application with the master
masterRef.send(RegisterApplication(appDescription, self))
} catch {
case ie: InterruptedException => // Cancelled
case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
}
})
}
}
- On the Master.scala side, just as with worker registration, this is handled in receive(); this time it is the case RegisterApplication branch that runs.
override def receive: PartialFunction[Any, Unit] = {
...
case RegisterApplication(description, driver) =>
// TODO Prevent repeated registrations from some driver
if (state == RecoveryState.STANDBY) {
// ignore, don't send response
} else {
logInfo("Registering app " + description.name)
val app = createApplication(description, driver)
registerApplication(app)
logInfo("Registered app " + description.name + " with ID " + app.id)
persistenceEngine.addApplication(app)
driver.send(RegisteredApplication(app.id, self))
schedule()
}
...
}
- registerApplication(app) mainly puts the app into the waitingApps queue
- driver.send(RegisteredApplication(app.id, self)) replies to the driver that the Master has registered the app
- schedule() assigns workers to this application and calls startExecutorsOnWorkers()
- startExecutorsOnWorkers() loops over the waitingApps queue and calls allocateWorkerResourceToExecutors() for every worker that gets executors assigned (a condensed sketch of startExecutorsOnWorkers() follows the launchExecutor() code below)
- allocateWorkerResourceToExecutors() then calls launchExecutor(worker, exec) for each executor
- launchExecutor() looks like this:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
worker.addExecutor(exec)
// tell the chosen worker to launch an executor
worker.endpoint.send(LaunchExecutor(masterUrl,
exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
// tell the driver that an executor has been added
exec.application.driver.send(
ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
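And here is the condensed sketch of startExecutorsOnWorkers() referred to above (simplified from the 2.4 Master; it is a simple FIFO scheduler over waitingApps):

private def startExecutorsOnWorkers(): Unit = {
  for (app <- waitingApps) {
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    if (app.coresLeft >= coresPerExecutor) {
      // keep only ALIVE workers with enough free memory and cores, most free cores first
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(w => w.memoryFree >= app.desc.memoryPerExecutorMB &&
          w.coresFree >= coresPerExecutor)
        .sortBy(_.coresFree).reverse
      // decide how many cores each usable worker contributes (spread-out vs. pack strategy)
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
      // turn the per-worker core counts into concrete executors
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
      }
    }
  }
}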
- On the Worker.scala side it is likewise the receive() method that handles events sent by other nodes; the main logic of case LaunchExecutor is as follows:
override def receive: PartialFunction[Any, Unit] = synchronized {
...
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
if (masterUrl != activeMasterUrl) {
logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
} else {
try {
logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
// (a large chunk omitted here; it mainly creates the executor's working directory)
...
// create an ExecutorRunner object named manager
val manager = new ExecutorRunner(
appId,
execId,
// note that appDesc's command is passed along too
appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
cores_,
memory_,
self,
workerId,
host,
webUi.boundPort,
publicAddress,
sparkHome,
executorDir,
workerUri,
conf,
appLocalDirs, ExecutorState.RUNNING)
executors(appId + "/" + execId) = manager
// start the manager; start() spins up a plain java.lang.Thread (yes, a plain Java thread),
// and that thread runs fetchAndRunExecutor()
manager.start()
coresUsed += cores_
memoryUsed += memory_
// report back to the master that the executor on this side is ready
sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
} catch {
case e: Exception =>
logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
if (executors.contains(appId + "/" + execId)) {
executors(appId + "/" + execId).kill()
executors -= appId + "/" + execId
}
sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
Some(e.toString), None))
}
}
...
}
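Before getting to fetchAndRunExecutor(), a quick look at what manager.start() itself does (condensed from ExecutorRunner in 2.4): it wraps fetchAndRunExecutor() in a plain Java thread and registers a shutdown hook:

private[worker] def start() {
  workerThread = new Thread("ExecutorRunner for " + fullId) {
    override def run() { fetchAndRunExecutor() }
  }
  workerThread.start()
  // kill the executor process if the worker JVM is shutting down
  shutdownHook = ShutdownHookManager.addShutdownHook { () =>
    if (state == ExecutorState.RUNNING) {
      state = ExecutorState.FAILED
    }
    killProcess(Some("Worker shutting down"))
  }
}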
- The main source of fetchAndRunExecutor() in ExecutorRunner.scala is as follows:
private def fetchAndRunExecutor() {
try {
// Launch the process
val subsOpts = appDesc.command.javaOpts.map {
Utils.substituteAppNExecIds(_, appId, execId.toString)
}
val subsCommand = appDesc.command.copy(javaOpts = subsOpts)
val builder = CommandUtils.buildProcessBuilder(subsCommand, new SecurityManager(conf),
memory, sparkHome.getAbsolutePath, substituteVariables)
val command = builder.command()
val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
logInfo(s"Launch command: $formattedCommand")
builder.directory(executorDir)
builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
// In case we are running this from within the Spark Shell, avoid creating a "scala"
// parent process for the executor command
builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")
// Add webUI log urls
val baseUrl =
if (conf.getBoolean("spark.ui.reverseProxy", false)) {
s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
} else {
s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
}
builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")
process = builder.start()
val header = "Spark Executor Command: %s\n%s\n\n".format(
formattedCommand, "=" * 40)
...
}
In essence, this launches the executor process from a command line.
19/02/13 11:37:27 INFO ExecutorRunner: Launch command: "C:\Program Files\Java\jdk1.8.0_73\bin\java" "-cp" "D:\spark-2.4.0\conf\;D:\spark-2.4.0\assembly\target\scala-2.11\jars\*" "-Xmx1024M" "-Dspark.driver.port=2859" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@PAFSH-L1917.paicdom.local:2859" "--executor-id" "0" "--hostname" "10.28.137.223" "--cores" "1" "--app-id" "app-20190213113712-0000" "--worker-url" "spark://Worker@10.28.137.223:2942"
The line above is from a Spark job run on my local machine with the local-cluster deploy mode: it is the log ExecutorRunner prints when launching the executor. Note "org.apache.spark.executor.CoarseGrainedExecutorBackend": the executor's main logic runs inside CoarseGrainedExecutorBackend. Remember where that class name came from? It was put into appDesc's command on the driver side, in the start() of the StandaloneSchedulerBackend created by SparkContext.scala's createTaskScheduler().
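Once that process is up, CoarseGrainedExecutorBackend's onStart() connects back to the --driver-url from the command line and registers the executor with the driver (condensed from the 2.4 source; error handling trimmed), which is where the next chapter picks up:

override def onStart() {
  logInfo("Connecting to driver: " + driverUrl)
  rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
    driver = Some(ref)
    // ask the driver-side scheduler backend to register this executor
    ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
  }(ThreadUtils.sameThread).onComplete {
    case Success(_) => // the ack is always true; the RegisteredExecutor message arrives separately
    case Failure(e) =>
      exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
  }(ThreadUtils.sameThread)
}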
- At this point, for a local-cluster deployment, the driver, master, and worker components are all up, resources have been allocated, and the executor processes are standing by.
- Standalone mode works much like the local pseudo-distributed mode, except that there is no LocalSparkCluster to start and stop locally, because that work was already done when the Spark cluster itself was brought up:
case SPARK_REGEX(sparkUrl) =>
val scheduler = new TaskSchedulerImpl(sc)
val masterUrls = sparkUrl.split(",").map("spark://" + _)
val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend)
(backend, scheduler)
case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
// Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just ...
val scheduler = new TaskSchedulerImpl(sc)
val localCluster = new LocalSparkCluster(
numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
val masterUrls = localCluster.start()
val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend)
backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
localCluster.stop()
}
(backend, scheduler)
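Note the sparkUrl.split(",") in the SPARK_REGEX branch: in standalone mode the driver can be pointed at several masters for HA by listing them in one URL. A minimal illustration (the hostnames are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// each comma-separated entry becomes one spark://... master URL in the code above
val conf = new SparkConf()
  .setAppName("standalone-ha-example")
  .setMaster("spark://master1:7077,master2:7077")
val sc = new SparkContext(conf)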
- As for YARN and Mesos modes, Spark 2.4 differs from earlier versions. Older releases had a YarnXXXSchedulerBackend or a CoarseMesosSchedulerBackend for deployment; now a clusterManager is created first, and the schedulerBackend is produced through that clusterManager. (I prefer not to call it simply "backend", to keep it distinct from the executorBackend running inside the executor process on the worker side.)
case masterUrl =>
val cm = getClusterManager(masterUrl) match {
case Some(clusterMgr) => clusterMgr
case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
}
try {
val scheduler = cm.createTaskScheduler(sc, masterUrl)
val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
cm.initialize(scheduler, backend)
(backend, scheduler)
} catch {
case se: SparkException => throw se
case NonFatal(e) =>
throw new SparkException("External scheduler cannot be instantiated", e)
}
}
Here getClusterManager(masterUrl) borrows the Java SPI (Service Provider Interface) idea: a ServiceLoader discovers and instantiates the clusterManager dynamically. Since I have neither YARN nor Mesos installed locally, this path would not run for me (that is my guess at the reason; corrections are welcome if someone verifies otherwise), so I did not experiment further.
private def getClusterManager(url: String): Option[ExternalClusterManager] = {
val loader = Utils.getContextOrSparkClassLoader
val serviceLoaders =
ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala.filter(_.canCreate(url))
if (serviceLoaders.size > 1) {
throw new SparkException(
s"Multiple external cluster managers registered for the url $url: $serviceLoaders")
}
serviceLoaders.headOption
}
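To make the SPI mechanism concrete, a hypothetical (not from Spark) ExternalClusterManager could look roughly like the sketch below. It would only be discovered if its fully qualified class name were listed in a META-INF/services/org.apache.spark.scheduler.ExternalClusterManager resource file on the classpath, which is how the YARN module registers its YarnClusterManager. Note that the trait is private[spark], so real implementations have to live inside the org.apache.spark package tree:

package org.apache.spark.scheduler

import org.apache.spark.SparkContext

// hypothetical manager handling master URLs of the form "mycluster://..."
class MyClusterManager extends ExternalClusterManager {
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("mycluster://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(
      sc: SparkContext, masterURL: String, scheduler: TaskScheduler): SchedulerBackend = {
    // a real implementation would return a backend that talks to the external resource manager
    throw new UnsupportedOperationException("illustrative sketch only")
  }

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}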
Incidentally, in earlier versions local-cluster did not use StandaloneSchedulerBackend as the schedulerBackend instance either; older releases used SparkDeploySchedulerBackend. Interested readers can dig into that on their own.
To sum up: using the pseudo-distributed (local-cluster) deploy mode as an example, we have walked in detail through what happens when the driver constructs its SparkContext and createTaskScheduler() in SparkContext.scala runs: how each component gets started and how the executor processes are allocated and left standing by. In the next chapter we will look at how a job actually runs once the environment is ready.