SparkContext Source Code Analysis
1. SparkContext is the entry point for Spark application development and is responsible for the connection to the Spark cluster. Through it you can create RDDs, accumulators, broadcast variables, and so on, on the cluster.
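A minimal usage sketch of these three facilities, using the 1.x-era API that matches the source analyzed below (the master URL, app name, and numbers are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // Build a configuration and the context; "local[2]" and the name are illustrative
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkContextDemo")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100)        // RDD from a local collection
    val acc = sc.accumulator(0, "evenCount")  // accumulator, updated on executors
    val threshold = sc.broadcast(50)          // broadcast variable, shipped once per executor

    rdd.foreach { n => if (n % 2 == 0) acc += 1 }
    val big = rdd.filter(_ > threshold.value).count()

    println(s"even numbers: ${acc.value}, above threshold: $big")
    sc.stop()
  }
}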
2. SparkContext initialization mainly does the following:
- initialize SparkEnv
- initialize TaskScheduler
- initialize DAGScheduler
- initialize SparkUI
3. SparkEnv-related code
During initialization, SparkContext calls its own createSparkEnv() method:
// Create the Spark execution environment (cache, map output tracker, etc)
// The Spark runtime environment is created at initialization time
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
SparkContext's createSparkEnv() method in turn calls SparkEnv.createDriverEnv():
// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  // Create the driver-side environment via SparkEnv.createDriverEnv()
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}
SparkEnv.createDriverEnv() then calls SparkEnv.create(), which instantiates a series of components such as the cacheManager, blockManagerMaster, blockManager, broadcastManager (manages broadcast variables), and mapOutputTracker (tracks the output of map-stage tasks), and only then news up the SparkEnv instance.
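As a minimal sketch of what this buys us: once the environment is built, Spark's own code reaches these components through the SparkEnv.get accessor. Most of them are private[spark], so this pattern is for illustration, not for user code:
import org.apache.spark.SparkEnv

val env = SparkEnv.get                       // driver- or executor-side environment
val blockManager = env.blockManager          // block storage for this node
val mapOutputTracker = env.mapOutputTracker  // map-stage output locations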
4. After SparkEnv comes the TaskScheduler
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
// Assign sched to _schedulerBackend
_schedulerBackend = sched
// Assign ts to _taskScheduler
_taskScheduler = ts
createTaskScheduler() creates a different scheduler depending on the run mode. The TaskScheduler is responsible for the physical scheduling of each individual Task.
// The standalone mode commonly used when submitting Spark jobs.
// Step 1: create a TaskSchedulerImpl
// Step 2: assemble the master URLs
// Step 3: create the backend, which submits tasks to the executors
// Step 4: call TaskSchedulerImpl.initialize() to create the scheduling pool
case SPARK_REGEX(sparkUrl) =>
  // Create the task scheduler
  val scheduler = new TaskSchedulerImpl(sc)
  val masterUrls = sparkUrl.split(",").map("spark://" + _)
  // The backend is responsible for acquiring and scheduling cluster resources.
  // It extends CoarseGrainedSchedulerBackend, which represents the executors
  // that run the actual tasks on the workers
  val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
  // Call TaskSchedulerImpl.initialize(), passing in the backend
  scheduler.initialize(backend)
  // Return the backend and the scheduler
  (backend, scheduler)
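For comparison, the local-mode branch of the same match looks like this (paraphrased from the 1.x source; the exact constructor arguments vary slightly across versions):
case "local" =>
  // Single-threaded local mode: tasks run inside the driver JVM on one core.
  // MAX_LOCAL_TASK_FAILURES is 1: failed tasks are not retried locally
  val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
  val backend = new LocalBackend(sc.getConf, scheduler, 1)
  scheduler.initialize(backend)
  (backend, scheduler)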
Below is the initialize() method of TaskSchedulerImpl:
// The TaskScheduler's initialization method; takes a SchedulerBackend
def initialize(backend: SchedulerBackend) {
  // Store the backend on the TaskSchedulerImpl
  this.backend = backend
  // temporarily set rootPool name to empty
  // rootPool is the root scheduling pool
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      // Scheduling mode; FIFO (first in, first out) is the default
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  // Build the scheduling pools
  schedulableBuilder.buildPools()
}
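Which branch of the match is taken is controlled by the spark.scheduler.mode setting. A minimal sketch of switching from the default FIFO to FAIR scheduling (the allocation-file path is a placeholder):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fair-scheduling-demo") // illustrative name
  .set("spark.scheduler.mode", "FAIR")
  // Optional pool definitions; the path below is hypothetical
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")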
5. After the TaskScheduler, the DAGScheduler is created
The DAGScheduler splits a job into multiple batches of tasks, organized as stages with dependencies between them (derived from the RDD dependencies), and then submits them to the TaskScheduler for the actual processing.
Here a DAGScheduler instance is created. It is mainly responsible for splitting jobs into stages, finding the best locations to run tasks, tracking whether RDDs and stage outputs have been materialized (cached), and other related bookkeeping.
// Initialize the DAGScheduler; `this` is the SparkContext
_dagScheduler = new DAGScheduler(this)
// The line above calls the this(sc, sc.taskScheduler) auxiliary constructor,
// passing in the SparkContext and the TaskScheduler initialized in the previous step
def this(sc: SparkContext) = this(sc, sc.taskScheduler)
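To see where the DAGScheduler comes into play: every action ends up in SparkContext.runJob, which hands the job to the DAGScheduler to be cut into stages at shuffle boundaries. A small sketch, assuming an existing SparkContext sc:
val result = sc.parallelize(1 to 10)
  .map(_ * 2)      // narrow dependency: stays inside one stage
  .groupBy(_ % 3)  // shuffle dependency: forces a stage boundary
  .count()         // action: triggers sc.runJob -> dagScheduler.runJob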
6. Start the TaskScheduler
// Start the TaskScheduler (after the DAGScheduler's constructor has set
// its reference on the TaskScheduler)
_taskScheduler.start()
This calls TaskSchedulerImpl.start(), which in turn calls start() on its SchedulerBackend. In standalone mode that backend is SparkDeploySchedulerBackend, so we look at its start() method. Here is part of the code:
// ApplicationDescription describes the currently running application,
// e.g. its maxCores and name
val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
  command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor)
// Create the AppClient, which handles communication between the application
// and the cluster and sends the application registration request to the master
client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
// Start the client; it registers an RPC endpoint that listens for messages
client.start()
// Mark the application as submitted
launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
// Block until the master acknowledges the registration
waitForRegistration()
// Mark the application as running
launcherBackend.setState(SparkAppHandle.State.RUNNING)
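For context, just before the quoted lines start() builds the command that each Worker uses to launch its executor JVMs; paraphrased from the 1.x source (the exact argument lists vary slightly across versions):
// The executor launch command embedded in the ApplicationDescription: every
// Worker that accepts this application spawns a CoarseGrainedExecutorBackend
// process with these arguments
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
  args, sc.executorEnvs, classPathEntries, libraryPathEntries, javaOpts)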
7. SparkUI
Spark's web UI:
_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
      _env.securityManager, appName, startTime = startTime))
  } else {
    // For tests, do not enable the UI
    None
  }
This calls SparkUI.createLiveUI() to create the UI:
def createLiveUI(
    sc: SparkContext,
    conf: SparkConf,
    listenerBus: SparkListenerBus,
    jobProgressListener: JobProgressListener,
    securityManager: SecurityManager,
    appName: String,
    startTime: Long): SparkUI = {
  create(Some(sc), conf, listenerBus, securityManager, appName,
    jobProgressListener = Some(jobProgressListener), startTime = startTime)
}
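The settings read here are ordinary Spark configuration; a minimal sketch (the values shown are the defaults):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.enabled", "true") // set to "false" to skip the UI, e.g. in tests
  .set("spark.ui.port", "4040")    // first port the UI tries; incremented on conflict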
8. Initialize the BlockManager
// Initialize the blockManager, passing in the applicationId so that the
// blockManager can manage the application's blocks
_env.blockManager.initialize(_applicationId)
This calls BlockManager.initialize():
/**
 * Initializes the BlockManager with the given appId. This is not performed in the constructor as
 * the appId may not be known at BlockManager instantiation time (in particular for the driver,
 * where it is only learned after registration with the TaskScheduler).
 *
 * This method initializes the BlockTransferService and ShuffleClient, registers with the
 * BlockManagerMaster, starts the BlockManagerWorker endpoint, and registers with a local shuffle
 * service if configured.
 */
def initialize(appId: String): Unit = {
  // Initialize the blockTransferService and the shuffleClient
  blockTransferService.init(this)
  shuffleClient.init(appId)
  // Build this node's blockManagerId
  blockManagerId = BlockManagerId(
    executorId, blockTransferService.hostName, blockTransferService.port)
  shuffleServerId = if (externalShuffleServiceEnabled) {
    logInfo(s"external shuffle service port = $externalShuffleServicePort")
    BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
  } else {
    blockManagerId
  }
  // Register this blockManager with the blockManagerMaster, passing the
  // blockManagerId, the maximum memory, and the slave endpoint reference
  master.registerBlockManager(blockManagerId, maxMemory, slaveEndpoint)
  // Register Executors' configuration with the local shuffle service, if one should exist.
  if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
    registerWithExternalShuffleServer()
  }
}
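The externalShuffleServiceEnabled branch above is driven by configuration; a minimal sketch (7337 is the default port):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true") // makes externalShuffleServiceEnabled true
  .set("spark.shuffle.service.port", "7337")    // port the external shuffle service listens on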
9. At this point, the important work inside SparkContext is essentially complete! If you spot any mistakes, please point them out promptly so no one is led astray. Thank you!