1. Introduction
SparkContext is the entry point of every Spark program, playing much the same role as a program's main function, which alone says a lot about its importance. The official definition of SparkContext is given in the following comment:
/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before
 * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
Translated: SparkContext is the main entry point for Spark functionality; it represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one SparkContext may be active per JVM; if you need a new one, you must first stop the currently active SparkContext.
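To make that concrete, here is a minimal usage sketch (the app name, master URL, and numbers are placeholders chosen for illustration) showing a SparkContext being created, used to build an RDD, a broadcast variable, and an accumulator, and then stopped:

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // App name and master URL are placeholders for illustration.
    val conf = new SparkConf().setAppName("sparkcontext-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100)

    // Broadcast variable: a read-only value shipped once to each executor.
    val threshold = sc.broadcast(50)
    val above = rdd.filter(_ > threshold.value).count()

    // Accumulator: a write-only counter that is aggregated back on the driver.
    val evens = sc.longAccumulator("even-count")
    rdd.foreach(n => if (n % 2 == 0) evens.add(1))

    println(s"values above threshold = $above, even values seen = ${evens.value}")

    // Only one SparkContext may be active per JVM: stop it before creating another.
    sc.stop()
  }
}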
In the usual diagram of how a Spark cluster works, SparkContext sits at the core of the Driver Program: every interaction with the cluster and its workers goes through SparkContext. Its initialization centers on the creation of five core components:
1. SparkEnv, which provides the communication environment for the Spark cluster;
2. SparkUI, which exposes the running state of the Spark application through a web interface;
3. SchedulerBackend, through which the Driver communicates with the corresponding Executors and distributes tasks;
4. DAGScheduler, which splits the job into Stages according to the program's logic, wraps each Stage into a TaskSet, and sends it to the TaskScheduler;
5. TaskScheduler, which receives the TaskSets sent by the DAGScheduler and manages them through TaskSetManagers.
Below we walk through the SparkContext source code and the creation of these five core components. The components themselves are not covered in detail here; this post is meant to help you understand what SparkContext is responsible for and what happens inside it, and later posts will dig into how each piece actually works.
/* ------------------------------------------------------------------------------------- *
| Private variables. These variables keep the internal state of the context, and are |
| not accessible by the outside world. They're mutable since we want to initialize all |
| of them to some neutral value ahead of time, so that calling "stop()" while the |
| constructor is still running is safe. |
* ------------------------------------------------------------------------------------- */
private var _conf: SparkConf = _
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _listenerBus: LiveListenerBus = _
private var _env: SparkEnv = _
private var _statusTracker: SparkStatusTracker = _
private var _progressBar: Option[ConsoleProgressBar] = None
private var _ui: Option[SparkUI] = None
private var _hadoopConfiguration: Configuration = _
private var _executorMemory: Int = _
private var _schedulerBackend: SchedulerBackend = _
private var _taskScheduler: TaskScheduler = _
private var _heartbeatReceiver: RpcEndpointRef = _
@volatile private var _dagScheduler: DAGScheduler = _
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None
private var _executorAllocationManager: Option[ExecutorAllocationManager] = None
private var _cleaner: Option[ContextCleaner] = None
private var _listenerBusStarted: Boolean = false
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
private var _shutdownHookRef: AnyRef = _
private var _statusStore: AppStatusStore = _
As you can see, SparkContext declares a large number of fields, including the SparkConf, the LiveListenerBus, and the five core components mentioned above. Let's start with the creation of SparkEnv:
private[spark] def env: SparkEnv = _env
Stepping into the SparkEnv class:
class SparkEnv (
val executorId: String,
private[spark] val rpcEnv: RpcEnv,
val serializer: Serializer,
val closureSerializer: Serializer,
val serializerManager: SerializerManager,
val mapOutputTracker: MapOutputTracker,
val shuffleManager: ShuffleManager,
val broadcastManager: BroadcastManager,
val blockManager: BlockManager,
val securityManager: SecurityManager,
val metricsSystem: MetricsSystem,
val memoryManager: MemoryManager,
val outputCommitCoordinator: OutputCommitCoordinator,
val conf: SparkConf) extends Logging
This class pulls together many components, including the serializers, the ShuffleManager, the BlockManager, and so on. Inside SparkEnv, the createDriverEnv and createExecutorEnv methods are called to build the RpcEnv-based environment for the Driver and for the Executors respectively:
private[spark] def createDriverEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus,
numCores: Int,
mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
assert(conf.contains(DRIVER_HOST_ADDRESS),
s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
val port = conf.get("spark.driver.port").toInt
val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
Some(CryptoStreamUtils.createKey(conf))
} else {
None
}
create(
conf,
SparkContext.DRIVER_IDENTIFIER,
bindAddress,
advertiseAddress,
Option(port),
isLocal,
numCores,
ioEncryptionKey,
listenerBus = listenerBus,
mockOutputCommitCoordinator = mockOutputCommitCoordinator
)
}
/**
* Create a SparkEnv for an executor.
* In coarse-grained mode, the executor provides an RpcEnv that is already instantiated.
*/
private[spark] def createExecutorEnv(
conf: SparkConf,
executorId: String,
hostname: String,
numCores: Int,
ioEncryptionKey: Option[Array[Byte]],
isLocal: Boolean): SparkEnv = {
val env = create(
conf,
executorId,
hostname,
hostname,
None,
isLocal,
numCores,
ioEncryptionKey
)
SparkEnv.set(env)
env
}
With SparkEnv in place, let's look at the creation of the SparkUI:
private[spark] def ui: Option[SparkUI] = _ui
Stepping into the SparkUI class, the method to focus on is initialize:
def initialize(): Unit = {
val jobsTab = new JobsTab(this, store)
attachTab(jobsTab)
val stagesTab = new StagesTab(this, store)
attachTab(stagesTab)
attachTab(new StorageTab(this, store))
attachTab(new EnvironmentTab(this, store))
attachTab(new ExecutorsTab(this))
addStaticHandler(SparkUI.STATIC_RESOURCE_DIR)
attachHandler(createRedirectHandler("/", "/jobs/", basePath = basePath))
attachHandler(ApiRootResource.getServletHandler(this))
// These should be POST only, but, the YARN AM proxy won't proxy POSTs
attachHandler(createRedirectHandler(
"/jobs/job/kill", "/jobs/", jobsTab.handleKillRequest, httpMethods = Set("GET", "POST")))
attachHandler(createRedirectHandler(
"/stages/stage/kill", "/stages/", stagesTab.handleKillRequest,
httpMethods = Set("GET", "POST")))
}
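From the user's point of view, the SparkUI is controlled by configuration such as spark.ui.enabled and spark.ui.port, and the address it binds to is exposed through SparkContext.uiWebUrl. A small sketch (the port and app name are arbitrary values chosen for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object UiConfigDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ui-demo")           // placeholder app name
      .setMaster("local[2]")           // placeholder master
      .set("spark.ui.enabled", "true") // default is already true; shown here for clarity
      .set("spark.ui.port", "4050")    // default is 4040; changed only for illustration

    val sc = new SparkContext(conf)

    // uiWebUrl is defined only when the UI was actually started.
    sc.uiWebUrl.foreach(url => println(s"Spark UI available at $url"))

    sc.stop()
  }
}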
Next, the creation of the SchedulerBackend:
private[spark] def schedulerBackend: SchedulerBackend = _schedulerBackend
Then the creation of the TaskScheduler and the DAGScheduler:
private[spark] def taskScheduler: TaskScheduler = _taskScheduler
private[spark] def taskScheduler_=(ts: TaskScheduler): Unit = {
_taskScheduler = ts
}
private[spark] def dagScheduler: DAGScheduler = _dagScheduler
private[spark] def dagScheduler_=(ds: DAGScheduler): Unit = {
_dagScheduler = ds
}
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
The TaskScheduler (together with its SchedulerBackend) is created mainly through the createTaskScheduler method, and the DAGScheduler is created last.
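createTaskScheduler chooses the TaskScheduler / SchedulerBackend pair by pattern matching on the master URL. The following is a simplified, self-contained sketch of that dispatch idea, not Spark's actual code; the strings name the backends typically chosen in Spark 2.x:

// Simplified sketch of the master-URL dispatch in SparkContext.createTaskScheduler.
// Illustrative only; the real method constructs and initializes the actual scheduler objects.
object MasterDispatchSketch {
  def backendFor(master: String): String = master match {
    case "local"                       => "TaskSchedulerImpl + LocalSchedulerBackend (1 thread)"
    case m if m.startsWith("local[")   => "TaskSchedulerImpl + LocalSchedulerBackend (N threads)"
    case m if m.startsWith("spark://") => "TaskSchedulerImpl + StandaloneSchedulerBackend"
    case other                         => s"backend provided by an external cluster manager for: $other"
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[4]", "spark://host:7077", "yarn").foreach { m =>
      println(s"$m -> ${backendFor(m)}")
    }
  }
}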
2. Summary
SparkContext is the entry point of a Spark program and is extremely important: the details of the whole Spark execution flow all start here. That is why people often say that once you understand SparkContext you understand Spark, although this is only true at the level of the framework; the internal details still take time and careful reading to digest. Since SparkContext is only the entry point, this post is purely introductory and does not dig very deep; later posts will cover the individual modules in more detail.