Spark-Core Source Code Study Notes
This series records a review of the Spark source code. The goal is to clarify the mechanism and flow by which Spark distributes and runs programs, tracing the key parts of the source so that we understand not just what happens but why. Peripheral code is only described in prose, without drilling down, so that the main thread stays clear.
Starting with this article we enter the core of Spark, which we will work through step by step.
SparkContext: The Cornerstone
SparkContext plays an important role throughout the lifetime of a Spark application, and many key components are initialized inside it, so it can fairly be called the cornerstone of Spark. Starting from the getOrCreate method seen in the previous section, let's step into SparkContext and take a closer look:
def getOrCreate(): SparkSession = synchronized {
  val sparkContext = userSuppliedContext.getOrElse {
    // Instantiate SparkConf and copy in the properties; `options` holds the parameters set on the builder earlier
    val sparkConf = new SparkConf()
    options.foreach { case (k, v) => sparkConf.set(k, v) }
    // set a random app name if not given.
    if (!sparkConf.contains("spark.app.name")) {
      sparkConf.setAppName(java.util.UUID.randomUUID().toString)
    }
    // Actually instantiate the SparkContext, a key component detailed below.
    // Note: once the SparkContext is created, the SparkConf can no longer be changed.
    SparkContext.getOrCreate(sparkConf)
  }
  ... // the rest of getOrCreate (building the SparkSession itself) is omitted here
}
First comes the initialization of SparkConf. The source is straightforward, mostly storing key/value assignments. It is worth noting that the no-argument constructor delegates to SparkConf(true), which means the defaults are loaded from the JVM system properties (// Load any spark.* system properties); this is why tests frequently use the SparkConf(false) constructor instead.
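To make the difference concrete, here is a minimal sketch (the system property value is made up purely for illustration):
import org.apache.spark.SparkConf

// Hypothetical spark.* JVM system property that the default constructor would pick up
System.setProperty("spark.app.name", "from-system-props")

val withDefaults = new SparkConf()      // same as new SparkConf(true): loads spark.* system properties
val noDefaults   = new SparkConf(false) // ignores system properties; handy for isolated unit tests

println(withDefaults.contains("spark.app.name")) // true
println(noDefaults.contains("spark.app.name"))   // false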
With SparkConf initialized, we move on to the initialization of SparkContext itself:
def getOrCreate(config: SparkConf): SparkContext = {
  // Synchronize to ensure that multiple create requests don't trigger an exception
  // from assertNoOtherContextIsRunning within setActiveContext
  SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
    if (activeContext.get() == null) {
      // If no active context exists, create one and record it via setActiveContext
      setActiveContext(new SparkContext(config))
    } else {
      // If one already exists, the newly passed-in SparkConf has no effect; log a warning
      if (config.getAll.nonEmpty) {
        logWarning("Using an existing SparkContext; some configuration may not take effect.")
      }
    }
    activeContext.get()
  }
}
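So a second caller simply gets the already-active context back, and the configuration it passed in is dropped (apart from the warning). A minimal sketch of that behaviour, with arbitrary master and app-name values chosen purely for illustration:
import org.apache.spark.{SparkConf, SparkContext}

val first = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("first"))
// This conf is non-empty, so the "Using an existing SparkContext..." warning is logged,
// but its settings are ignored and the same instance is returned.
val second = SparkContext.getOrCreate(new SparkConf().setMaster("local[4]").setAppName("second"))
println(first eq second) // true: one active SparkContext per JVM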
Next we step into new SparkContext(config). As usual, let's read the class comment first:
/**
* Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
* cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
* @note Only one `SparkContext` should be active per JVM. You must `stop()` the
* active `SparkContext` before creating a new one.
* @param config a Spark Config object describing the application configuration. Any settings in
* this config overrides the default configs as well as system properties.
*/
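The @note is worth keeping in mind: within a single JVM the active context must be stop()-ed before another one is constructed. A minimal sketch of that rule (the local master is chosen just for illustration):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("demo")
val sc1 = new SparkContext(conf)
// ... create RDDs, accumulators, broadcast variables with sc1 ...
sc1.stop()                       // stop the active context first
val sc2 = new SparkContext(conf) // only now is it safe to create a new one
sc2.stop()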
The code block below is fairly long; the important parts are annotated. Keep in mind that all of the following runs on the driver, regardless of whether the deploy mode is client or cluster:
class SparkContext(config: SparkConf) extends Logging {
  private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)
  // log out Spark Version in Spark driver log
  logInfo(s"Running Spark version $SPARK_VERSION")
  /* ------------------------------------------------------------------------------------- *
   | Private variables. These variables keep the internal state of the context, and are    |
   | not accessible by the outside world. They're mutable since we want to initialize all  |
   | of them to some neutral value ahead of time, so that calling "stop()" while the       |
   | constructor is still running is safe.                                                  |
   * ------------------------------------------------------------------------------------- */
  private var _conf: SparkConf = _
  private var _env: SparkEnv = _
  private var _schedulerBackend: SchedulerBackend = _
  private var _taskScheduler: TaskScheduler = _
  @volatile private var _dagScheduler: DAGScheduler = _
  ... // many more private fields are declared here; omitted
  /* ------------------------------------------------------------------------------------- *
   | Accessors and public fields. These provide access to the internal state of the        |
   | context.                                                                               |
   * ------------------------------------------------------------------------------------- */
  ... // Scala-style getters and setters for some of the fields above; omitted
  // _listenerBus and _statusStore are used to track the application's status
  _listenerBus = new LiveListenerBus(_conf)
  // Initialize the app status store and listener before SparkEnv is created so that it gets
  // all events.
  val appStatusSource = AppStatusSource.createSource(conf)
  _statusStore = AppStatusStore.createLiveStore(conf, appStatusSource)
  listenerBus.addToStatusQueue(_statusStore.listener.get)
  // Create the Spark execution environment (cache, map output tracker, etc)
  // createSparkEnv internally calls SparkEnv.createDriverEnv, which ultimately calls RpcEnv.create(...)
  // to build the driver-side env; see the earlier RpcEnv article for the details
  _env = createSparkEnv(_conf, isLocal, listenerBus)
  SparkEnv.set(_env)
  // As can be seen here, the default executor memory is 1 GB (1024 MB)
  _executorMemory = _conf.getOption(EXECUTOR_MEMORY.key)
    .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
    .orElse(Option(System.getenv("SPARK_MEM"))
    .map(warnSparkMem))
    .map(Utils.memoryStringToMb)
    .getOrElse(1024)
  // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
  // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
  // Instantiate the driver-side heartbeat endpoint, which receives heartbeat messages from executors
  _heartbeatReceiver = env.rpcEnv.setupEndpoint(
    HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
  // From here on is the most important part: three schedulers are initialized,
  // and each of them deserves to be traced separately
  // Create and start the scheduler
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  // Send the TaskSchedulerIsSet message
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
  /* When the heartbeatReceiver gets this message, it simply binds the SparkContext's taskScheduler and replies true:
  case TaskSchedulerIsSet =>
    scheduler = sc.taskScheduler
    context.reply(true)
  */
  // create and start the heartbeater for collecting memory metrics
  _heartbeater = new Heartbeater(...)
  _heartbeater.start()
  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's constructor
  _taskScheduler.start()
  // What follows is initialization of application metadata, e.g. obtaining the Spark app id
  // Under YARN, dynamic resource allocation is also supported; in that mode an ExecutorAllocationManager
  //   is constructed, which adds or removes executors based on cluster load (see the config sketch after this block)
  // Finally come some cleaner hooks and runtime-status monitoring; that code is omitted to keep to the main thread
}
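As a side note on the dynamic allocation mentioned above, here is a minimal sketch of the kind of configuration that enables it (the numbers are arbitrary examples, not recommendations):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")   // triggers construction of ExecutorAllocationManager
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
  .set("spark.shuffle.service.enabled", "true")     // typically needed so shuffle data outlives removed executors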
Now we can focus on how _schedulerBackend, _taskScheduler and _dagScheduler are instantiated. The first two are created by SparkContext.createTaskScheduler(this, master, deployMode), while _dagScheduler is created via new DAGScheduler(this).
SchedulerBackend and TaskScheduler
Let's start by looking inside createTaskScheduler: