Initialization of SparkContext

1. Introduction

SparkContext is the entry point of a Spark program, comparable to a program's main function, which alone says a lot about its importance. The official definition of SparkContext is given in the following source comment:
/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before
 * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
In plain terms: SparkContext is the main entry point for Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one SparkContext may be active per JVM; to create a new one, you must first stop() the currently active SparkContext.
[Figure: simplified diagram of how a Spark cluster works, with SparkContext at the core of the Driver Program]

The figure above is a simplified view of how a Spark cluster works. SparkContext sits at the core of the Driver Program, and all interaction with the cluster manager and the workers goes through it. Its initialization centers on the creation of five components:

1. SparkEnv, which provides the communication environment for the Spark cluster;
2. SparkUI, which exposes the running state of a Spark application through a web interface;
3. SchedulerBackend, which lets the Driver communicate with the corresponding Executors and distribute tasks;
4. DAGScheduler, which splits the job into Stages according to the program's logic and wraps each Stage into a TaskSet handed to the TaskScheduler;
5. TaskScheduler, which receives the TaskSets sent by the DAGScheduler and manages them through TaskSetManagers.

The rest of this post walks through the SparkContext source code and the creation of these five core components. The individual components are not explained in depth here; the goal is to show what SparkContext is responsible for and what gets set up inside it. How each piece actually works will be covered in later posts.
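Before going into the internals, here is a minimal usage sketch showing how a SparkContext is typically created and stopped on the application side (the application name and master URL below are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // Build the application configuration; any setting made here overrides
    // the defaults and system properties.
    val conf = new SparkConf()
      .setAppName("SparkContextDemo") // placeholder application name
      .setMaster("local[*]")          // placeholder master URL

    // Only one active SparkContext is allowed per JVM.
    val sc = new SparkContext(conf)

    // Use the context to create an RDD and run a trivial job.
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")

    // stop() the context before (if ever) creating a new one.
    sc.stop()
  }
}

The rest of the initialization happens inside the SparkContext constructor, which starts by declaring the private variables that hold its internal state: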


/* ------------------------------------------------------------------------------------- *
 | Private variables. These variables keep the internal state of the context, and are    |
 | not accessible by the outside world. They're mutable since we want to initialize all  |
 | of them to some neutral value ahead of time, so that calling "stop()" while the       |
 | constructor is still running is safe.                                                 |
 * ------------------------------------------------------------------------------------- */

private var _conf: SparkConf = _
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _listenerBus: LiveListenerBus = _
private var _env: SparkEnv = _
private var _statusTracker: SparkStatusTracker = _
private var _progressBar: Option[ConsoleProgressBar] = None
private var _ui: Option[SparkUI] = None
private var _hadoopConfiguration: Configuration = _
private var _executorMemory: Int = _
private var _schedulerBackend: SchedulerBackend = _
private var _taskScheduler: TaskScheduler = _
private var _heartbeatReceiver: RpcEndpointRef = _
@volatile private var _dagScheduler: DAGScheduler = _
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None
private var _executorAllocationManager: Option[ExecutorAllocationManager] = None
private var _cleaner: Option[ContextCleaner] = None
private var _listenerBusStarted: Boolean = false
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
private var _shutdownHookRef: AnyRef = _
private var _statusStore: AppStatusStore = _

As you can see, SparkContext declares many internal variables, including the SparkConf, the LiveListenerBus, and the five core components mentioned above. Let's start with SparkEnv, which SparkContext exposes through the following accessor:

private[spark] def env: SparkEnv = _env

Stepping into the SparkEnv class:

class SparkEnv (
    val executorId: String,
    private[spark] val rpcEnv: RpcEnv,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val serializerManager: SerializerManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val metricsSystem: MetricsSystem,
    val memoryManager: MemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging 

This class holds many components, including the serializers, the ShuffleManager, the BlockManager, and so on. Its companion object provides the createDriverEnv and createExecutorEnv methods, which build the RpcEnv-based environment for the Driver and the Executors respectively:

private[spark] def createDriverEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus,
    numCores: Int,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
  assert(conf.contains(DRIVER_HOST_ADDRESS),
    s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
  assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
  val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
  val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
  val port = conf.get("spark.driver.port").toInt
  val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
    Some(CryptoStreamUtils.createKey(conf))
  } else {
    None
  }
  create(
    conf,
    SparkContext.DRIVER_IDENTIFIER,
    bindAddress,
    advertiseAddress,
    Option(port),
    isLocal,
    numCores,
    ioEncryptionKey,
    listenerBus = listenerBus,
    mockOutputCommitCoordinator = mockOutputCommitCoordinator
  )
}

/**
 * Create a SparkEnv for an executor.
 * In coarse-grained mode, the executor provides an RpcEnv that is already instantiated.
 */
private[spark] def createExecutorEnv(
    conf: SparkConf,
    executorId: String,
    hostname: String,
    numCores: Int,
    ioEncryptionKey: Option[Array[Byte]],
    isLocal: Boolean): SparkEnv = {
  val env = create(
    conf,
    executorId,
    hostname,
    hostname,
    None,
    isLocal,
    numCores,
    ioEncryptionKey
  )
  SparkEnv.set(env)
  env
}
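Back in SparkContext, it is the driver that triggers this creation. A simplified excerpt of the relevant SparkContext code (based on Spark 2.x; details differ slightly between versions):

// Inside SparkContext (simplified): the driver-side SparkEnv is built via
// SparkEnv.createDriverEnv and then registered as the global environment.
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master, conf))
}

// ... later in the constructor:
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)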

With SparkEnv in place, next comes the creation of SparkUI, exposed as:

private[spark] def ui: Option[SparkUI] = _ui

Stepping into the SparkUI class, the interesting part is the initialize method, which attaches the web UI's tabs and handlers:

def initialize(): Unit = {
  val jobsTab = new JobsTab(this, store)
  attachTab(jobsTab)
  val stagesTab = new StagesTab(this, store)
  attachTab(stagesTab)
  attachTab(new StorageTab(this, store))
  attachTab(new EnvironmentTab(this, store))
  attachTab(new ExecutorsTab(this))
  addStaticHandler(SparkUI.STATIC_RESOURCE_DIR)
  attachHandler(createRedirectHandler("/", "/jobs/", basePath = basePath))
  attachHandler(ApiRootResource.getServletHandler(this))

  // These should be POST only, but, the YARN AM proxy won't proxy POSTs
  attachHandler(createRedirectHandler(
    "/jobs/job/kill", "/jobs/", jobsTab.handleKillRequest, httpMethods = Set("GET", "POST")))
  attachHandler(createRedirectHandler(
    "/stages/stage/kill", "/stages/", stagesTab.handleKillRequest,
    httpMethods = Set("GET", "POST")))
}
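Back in SparkContext, the UI is only built when spark.ui.enabled is true, and it is bound to a port before the task scheduler starts so that the cluster manager learns the UI address. A simplified excerpt (based on Spark 2.x):

// Inside SparkContext (simplified): create the SparkUI only if it is enabled,
// then bind it to its port.
_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
      startTime))
  } else {
    // The UI is disabled, e.g. in tests.
    None
  }
_ui.foreach(_.bind())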

The SchedulerBackend is exposed in the same way:

private[spark] def schedulerBackend: SchedulerBackend = _schedulerBackend

Next come the TaskScheduler and the DAGScheduler:

private[spark] def taskScheduler: TaskScheduler = _taskScheduler
private[spark] def taskScheduler_=(ts: TaskScheduler): Unit = {
  _taskScheduler = ts
}

private[spark] def dagScheduler: DAGScheduler = _dagScheduler
private[spark] def dagScheduler_=(ds: DAGScheduler): Unit = {
  _dagScheduler = ds
}

// In the SparkContext constructor: create the scheduler backend and the task
// scheduler together, then the DAGScheduler, and finally tell the
// HeartbeatReceiver endpoint that the task scheduler is ready.
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

The key call here is SparkContext.createTaskScheduler, which builds both the SchedulerBackend and the TaskScheduler according to the master URL; the DAGScheduler is created last, directly from this SparkContext.
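To make the role of the master URL concrete, here is an abbreviated sketch of createTaskScheduler (based on Spark 2.x source; only the "local" and standalone cases are shown):

// Simplified excerpt of SparkContext.createTaskScheduler: the master URL
// decides which SchedulerBackend gets paired with the TaskSchedulerImpl.
master match {
  case "local" =>
    val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
    val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
    scheduler.initialize(backend)
    (backend, scheduler)

  case SPARK_REGEX(sparkUrl) =>
    val scheduler = new TaskSchedulerImpl(sc)
    val masterUrls = sparkUrl.split(",").map("spark://" + _)
    val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
    scheduler.initialize(backend)
    (backend, scheduler)

  // ... other cases: local[N], local[N, maxFailures], external cluster managers, etc.
}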

2. Summary

To sum up: SparkContext is the entry point of a Spark application, and the details of the whole Spark workflow start from here. Many people say that understanding SparkContext means understanding Spark; that holds at the level of the overall framework, but the internal details still take time to work through. Since SparkContext is only the entry point, this post is introductory and does not analyze any single component in depth; later posts will dig into the individual modules.
