This article walks through what happens in standalone mode when an application starts up: the creation of the SparkConf and then of the SparkContext.
In an application, one typically creates a SparkConf object first, sets the desired parameters on it, and then uses it to construct the SparkContext, as in the sketch below.
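A minimal driver program illustrating this pattern (the app name, master URL, and settings are placeholders, not values from this article):

// Typical entry point of a standalone-mode application: build a SparkConf,
// then hand it to the SparkContext, which drives everything described below.
import org.apache.spark.{SparkConf, SparkContext}

object MiniApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MiniApp")                  // becomes sc.appName
      .setMaster("spark://master-host:7077")  // placeholder standalone Master URL
      .set("spark.executor.memory", "1g")     // arbitrary example setting
    val sc = new SparkContext(conf)           // triggers the flow in this article
    println(sc.parallelize(1 to 10).sum())    // trivial job to prove the context works
    sc.stop()
  }
}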
1. SparkConf
SparkConf manages a Spark application's configuration; parameters are represented as key-value pairs.
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

  import SparkConf._

  /** Create a SparkConf that loads defaults from system properties and the classpath */
  def this() = this(true)

  private val settings = new ConcurrentHashMap[String, String]()

  if (loadDefaults) {
    // Load any spark.* system properties
    for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
      set(key, value)
    }
  }
If loadDefaults is true, the constructor picks up the Spark properties that SparkSubmit set as JVM system properties, keeping every key that starts with "spark.".
All parameters are stored as key-value pairs in a ConcurrentHashMap.
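A short sketch of this behavior (the property value here is an arbitrary example):

// Any "spark.*" JVM system property set before the no-arg constructor runs
// is picked up as a default; explicit set(...) calls still override it.
import org.apache.spark.SparkConf

object ConfDemo {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.app.name", "FromSystemProps") // simulates what SparkSubmit does
    val conf = new SparkConf()                 // loadDefaults = true
    println(conf.get("spark.app.name"))        // FromSystemProps
    conf.set("spark.app.name", "Overridden")   // later writes win
    println(conf.get("spark.app.name"))        // Overridden
  }
}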
2. SparkContext
2.1 Creating SparkEnv
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
}

private[spark] val env = createSparkEnv(conf, isLocal, listenerBus)
SparkEnv.set(env)
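SparkEnv.createDriverEnv assembles the driver-side runtime services (serializer, BlockManager, MapOutputTracker, and so on). In the 1.x line SparkEnv is exposed as a DeveloperApi, so, assuming a live SparkContext on the driver, the environment registered above can be inspected:

// SparkEnv.get returns the environment registered above via SparkEnv.set(env).
val env = org.apache.spark.SparkEnv.get
println(env.blockManager)   // the driver-side BlockManager discussed in section 2.5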
2.2 Creating TaskScheduler
private[spark] var (schedulerBackend, taskScheduler) =
SparkContext.createTaskScheduler(this, master)
2.2.1 SparkContext.createTaskScheduler
In standalone mode, the SPARK_REGEX branch of the match executes:
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  ...
  master match {
    ...
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
    ...
  }
}
What this branch does:
(1) Creates a TaskSchedulerImpl object; this class implements the TaskScheduler trait.
(2) Creates a SparkDeploySchedulerBackend object. Its inheritance chain: SparkDeploySchedulerBackend extends CoarseGrainedSchedulerBackend, which implements the SchedulerBackend trait; it also mixes in AppClientListener so it can receive callbacks from the AppClient created in section 2.4.1.
(3) Initializes the TaskSchedulerImpl via scheduler.initialize(backend), which sets its backend field to the SparkDeploySchedulerBackend.
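For reference, a self-contained sketch of how this branch turns a comma-separated standalone master URL into individual Master addresses; the regex mirrors SparkContext's SPARK_REGEX, while the surrounding demo code is ours:

// "spark://host1:7077,host2:7077" matches SPARK_REGEX with
// sparkUrl = "host1:7077,host2:7077"; splitting on ',' and re-prefixing
// "spark://" yields one URL per Master, which is how HA master lists work.
object MasterUrlDemo {
  val SPARK_REGEX = """spark://(.*)""".r

  def main(args: Array[String]): Unit = {
    "spark://host1:7077,host2:7077" match {
      case SPARK_REGEX(sparkUrl) =>
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        masterUrls.foreach(println)   // spark://host1:7077, spark://host2:7077
    }
  }
}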
2.3 Creating DAGScheduler
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
  dagScheduler = new DAGScheduler(this)
} catch {
  ...
}
2.4 Starting the TaskScheduler
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()
The method actually invoked is TaskSchedulerImpl.start.
The call flow initiated from this method involves interactions among several actors:
(1) DriverActor;
(2) ClientActor;
(3) Master;
(4) Worker;
(5) CoarseGrainedExecutorBackend.
All of these interactions are initiated by the ClientActor, which is created by the AppClient.
2.4.1 SparkDeploySchedulerBackend.start
override def start() {
  super.start()

  // The endpoint for executors to talk to us
  val driverUrl = AkkaUtils.address(
    AkkaUtils.protocol(actorSystem),
    SparkEnv.driverActorSystemName,
    conf.get("spark.driver.host"),
    conf.get("spark.driver.port"),
    CoarseGrainedSchedulerBackend.ACTOR_NAME)
  val args = Seq(
    "--driver-url", driverUrl,
    "--executor-id", "{{EXECUTOR_ID}}",
    "--hostname", "{{HOSTNAME}}",
    "--cores", "{{CORES}}",
    "--app-id", "{{APP_ID}}",
    "--worker-url", "{{WORKER_URL}}")
  val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
    .map(Utils.splitCommandString).getOrElse(Seq.empty)
  val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
  val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

  // When testing, expose the parent class path to the child. This is processed by
  // compute-classpath.{cmd,sh} and makes all needed jars available to child processes
  // when the assembly is built with the "*-provided" profiles enabled.
  val testingClassPath =
    if (sys.props.contains("spark.testing")) {
      sys.props("java.class.path").split(java.io.File.pathSeparator).toSeq
    } else {
      Nil
    }

  // Start executors with a few necessary configs for registering with the scheduler
  val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
  val javaOpts = sparkJavaOpts ++ extraJavaOpts
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    appUIAddress, sc.eventLogDir, sc.eventLogCodec)
  client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
  client.start()

  waitForRegistration()
}
Responsibilities of this method:
(1) Call the superclass's start method, i.e. CoarseGrainedSchedulerBackend.start, which creates the DriverActor;
(2) Assemble the parameters needed to launch executors and build an ApplicationDescription object (a sketch of the option-splitting idiom follows this list);
(3) Create an AppClient object and start it.
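Note that the {{EXECUTOR_ID}}-style tokens in args are placeholders that the Worker substitutes when it actually launches the executor process. The getOption(...).map(...).getOrElse(...) idiom used for the class/library path options can be tried in isolation; the conf keys match the source above, the rest of this demo is ours:

// How the extraClassPath option becomes a Seq of path entries: absent key -> Nil.
import java.io.File

object PathSplitDemo {
  def main(args: Array[String]): Unit = {
    val opts = Map("spark.executor.extraClassPath" ->
      Seq("/opt/libs/a.jar", "/opt/libs/b.jar").mkString(File.pathSeparator))

    val classPathEntries = opts.get("spark.executor.extraClassPath")
      .map(_.split(File.pathSeparator).toSeq).getOrElse(Nil)
    println(classPathEntries)     // List(/opt/libs/a.jar, /opt/libs/b.jar)

    val libraryPathEntries = opts.get("spark.executor.extraLibraryPath")
      .map(_.split(File.pathSeparator).toSeq).getOrElse(Nil)
    println(libraryPathEntries)   // List() -- key absent, so Nil
  }
}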
2.4.2 AppClient.start
def start() {
  // Just launch an actor; it will call back into the listener.
  actor = actorSystem.actorOf(Props(new ClientActor))
}
This creates the ClientActor.
In its preStart method, the ClientActor initiates the application-registration handshake with the Master.
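A minimal Akka sketch of this preStart-driven registration pattern; the actor and message names below are simplified stand-ins, not Spark's actual DeployMessages:

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case class RegisterApplication(appName: String)   // stand-in for Spark's registration message
case class RegisteredApplication(appId: String)   // stand-in for the Master's reply

class MasterStub extends Actor {
  def receive = {
    case RegisterApplication(name) =>
      // The real Master assigns an application id and replies to the client.
      sender() ! RegisteredApplication(s"app-$name-0001")
  }
}

class ClientActorSketch(master: ActorRef) extends Actor {
  // Like Spark's ClientActor, registration starts in preStart, so it happens
  // as soon as actorOf(...) materializes the actor.
  override def preStart(): Unit = {
    master ! RegisterApplication("demo")
  }

  def receive = {
    case RegisteredApplication(appId) =>
      println(s"registered with Master, appId = $appId")
  }
}

object RegistrationDemo extends App {
  val system = ActorSystem("sketch")
  val master = system.actorOf(Props[MasterStub], "master")
  system.actorOf(Props(new ClientActorSketch(master)), "client")
}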
2.5 Initializing the BlockManager on the Driver
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()
val applicationId: String = taskScheduler.applicationId()
conf.set("spark.app.id", applicationId)
env.blockManager.initialize(applicationId)
(1) taskScheduler.start() starts the TaskScheduler as described in section 2.4;
(2) The application id is obtained; it is assigned by the Master, which is why this step must come after the scheduler has started and the application has registered;
(3) The BlockManager is initialized with that id. Construction and initialization are deliberately split: the BlockManager object is created inside SparkEnv before the application id exists, and initialize(applicationId) is called here once the id is known.
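Once the context is up, the Master-assigned id is visible through the configuration (and via the public applicationId member shown in the snippet above). A quick check, assuming a live SparkContext named sc:

// "spark.app.id" was set in the constructor snippet above, right after
// taskScheduler.start(), so any code running after SparkContext creation sees it.
println(sc.getConf.get("spark.app.id"))   // e.g. app-20150101120000-0000
println(sc.applicationId)                 // same value, via the accessor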
2.6 Initializing the BlockManager on the Executor
After the DriverActor receives the RegisterExecutor message sent by CoarseGrainedExecutorBackend, it normally replies with a RegisteredExecutor message, which CoarseGrainedExecutorBackend then handles:
override def receiveWithLogging = {
  case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    val (hostname, _) = Utils.parseHostPort(hostPort)
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
This creates an Executor object, whose primary constructor initiates the BlockManager initialization:
if (!isLocal) {
  env.metricsSystem.registerSource(executorSource)
  env.blockManager.initialize(conf.getAppId)
}
The app id in conf travels from the Master to the Worker and is set into the SparkConf by the CoarseGrainedExecutorBackend singleton object before the backend actor starts.