1. Introduction
SparkContext is the entry point of every Spark program, playing much the same role as a program's main function, which alone says a lot about its importance. The official definition of SparkContext is given in the following comment:
/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before
 * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
Translated: SparkContext is the main entry point for Spark functionality; it represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one SparkContext may be active per JVM; if you need a new one, you must first stop the currently active SparkContext.
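To make that concrete, here is a minimal usage sketch (the app name, master URL, and numbers are placeholders chosen for illustration) showing a SparkContext being created, used to build an RDD, a broadcast variable, and an accumulator, and then stopped:

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // App name and master URL are placeholders for illustration.
    val conf = new SparkConf().setAppName("sparkcontext-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100)

    // Broadcast variable: a read-only value shipped once to each executor.
    val threshold = sc.broadcast(50)
    val above = rdd.filter(_ > threshold.value).count()

    // Accumulator: a write-only counter that is aggregated back on the driver.
    val evens = sc.longAccumulator("even-count")
    rdd.foreach(n => if (n % 2 == 0) evens.add(1))

    println(s"values above threshold = $above, even values seen = ${evens.value}")

    // Only one SparkContext may be active per JVM: stop it before creating another.
    sc.stop()
  }
}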
In the usual diagram of how a Spark cluster works, SparkContext sits at the core of the Driver Program: every interaction with the cluster and its workers goes through SparkContext. Its initialization centers on the creation of five core components:
1. SparkEnv, which provides the communication environment for the Spark cluster;
2. SparkUI, which exposes the running state of the Spark application through a web interface;
3. SchedulerBackend, through which the Driver communicates with the corresponding Executors and distributes tasks;
4. DAGScheduler, which splits the job into Stages according to the program's logic, wraps each Stage into a TaskSet, and sends it to the TaskScheduler;
5. TaskScheduler, which receives the TaskSets sent by the DAGScheduler and manages them through TaskSetManagers.
Below we walk through the SparkContext source code and the creation of these five core components. The components themselves are not covered in detail here; this post is meant to help you understand what SparkContext is responsible for and what happens inside it, and later posts will dig into how each piece actually works.
/* ------------------------------------------------------------------------------------- *
| Private variables. These variables keep the internal state of the context, and are |
| not accessible by the outside world. They're mutable since we want to initialize all |
| of them to some neutral value ahead of time, so that calling "stop()" while the |
| constructor is still running is safe. |
* ------------------------------------------------------------------------------------- */
private var _conf: SparkConf = _
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _listenerBus: LiveListenerBus = _
private var _env: SparkEnv = _
private var _statusTracker: SparkStatusTracker = _
private var _progressBar: Option[ConsoleProgressBar] = None
private var _ui: Option[SparkUI] = None
private var _hadoopConfiguration: Configuration = _
private var _executorMemory: Int = _
private var _schedulerBackend: SchedulerBackend = _
private var _taskScheduler: TaskScheduler = _
private var _heartbeatReceiver: RpcEndpointRef = _
@volatile private var _dagScheduler: DAGScheduler = _
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None
private var _executorAllocationManager: Option[ExecutorAllocationManager] = None
private var _cleaner: Option[ContextCleaner] = None
private var _listenerBusStarted: Boolean = false
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
private var _shutdownHookRef: AnyRef = _
private var _statusStore: AppStatusStore = _
As you can see, SparkContext declares a large number of fields, including the SparkConf, the LiveListenerBus, and the five core components mentioned above. Let's start with the creation of SparkEnv:
private[spark] def env: SparkEnv = _env
Stepping into the SparkEnv class:
class SparkEnv (
val executorId: String,
private[spark] val rpcEnv: RpcEnv,
val serializer: Serializer,
val closureSerializer: Serializer,
val serializerManager: SerializerManager,
val mapOutputTracker: MapOutputTracker,
val shuffleManager: ShuffleManager,
val broadcastManager: BroadcastManager,
val blockManager: BlockManager,
val securityManager: SecurityManager,
val metricsSystem: MetricsSystem,
val memoryManager: MemoryManager,
val outputCommitCoordinator: OutputCommitCoordinator,
val conf: SparkConf) extends Logging
This class pulls together many components, including the serializers, the ShuffleManager, the BlockManager, and so on. Inside SparkEnv, the createDriverEnv and createExecutorEnv methods are called to build the RpcEnv-based environment for the Driver and for the Executors respectively:
private[spark] def createDriverEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus,
numCores: Int,
mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
assert(conf.contains(DRIVER_HOST_ADDRESS),
s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
val port = conf.get("spark.driver.port").toInt
val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
Some(CryptoStreamUtils.createKey(conf))
} else {
None
}
create(
conf,
SparkContext.DRIVER_IDENTIFIER,
bindAddress,
advertiseAddress,
Option(port),
isLocal,
numCores,
ioEncryptionKey,
listenerBus = listenerBus,
mockOutputCommitCoordinator = mockOutputCommitCoordinator
)
}
/**
* Create a SparkEnv for an executor.
* In coarse-grained mode, the executor provides an RpcEnv that is already instantiated.
*/
private[spark] def createExecutorEnv(
conf: SparkConf,
executorId: String,
hostname: String,
numCores: Int,
ioEncryptionKey: Option[Array[Byte]],
isLocal: Boolean): SparkEnv = {
val env = create(
conf,
executorId,
hostname,
hostname,
None,
isLocal,
numCores,
ioEncryptionKey
)
SparkEnv.set(env)
env
}
With SparkEnv in place, let's look at the creation of the SparkUI:
private[spark] def ui: Option[SparkUI] = _ui
Stepping into the SparkUI class, the method to focus on is initialize:
def initialize(): Unit = {
val jobsTab = new JobsTab(this, store)
attachTab(jobsTab)
val stagesTab = new StagesTab(this, store)
attachTab(stagesTab)
attachTab(new StorageTab(this, store))
attachTab(new EnvironmentTab(this, store))
attachTab(new ExecutorsTab(this))
addStaticHandler(SparkUI.STATIC_RESOURCE_DIR)
attachHandler(createRedirectHandler("/", "/jobs/", basePath = basePath))
attachHandler(ApiRootResource.getServletHandler(this))
// These should be POST only, but, the YARN AM proxy won't proxy POSTs
attachHandler(createRedirectHandler(
"/jobs/job/kill", "/jobs/", jobsTab.handleKillRequest, httpMethods = Set("GET", "POST")))
attachHandler(createRedirectHandler(
"/stages/stage/kill", "/stages/", stagesTab.handleKillRequest,
httpMethods = Set("GET", "POST")))
}
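From the user's point of view, the SparkUI is controlled by configuration such as spark.ui.enabled and spark.ui.port, and the address it binds to is exposed through SparkContext.uiWebUrl. A small sketch (the port and app name are arbitrary values chosen for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object UiConfigDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ui-demo")           // placeholder app name
      .setMaster("local[2]")           // placeholder master
      .set("spark.ui.enabled", "true") // default is already true; shown here for clarity
      .set("spark.ui.port", "4050")    // default is 4040; changed only for illustration

    val sc = new SparkContext(conf)

    // uiWebUrl is defined only when the UI was actually started.
    sc.uiWebUrl.foreach(url => println(s"Spark UI available at $url"))

    sc.stop()
  }
}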
Next, the creation of the SchedulerBackend:
private[spark] def schedulerBackend: SchedulerBackend = _schedulerBackend
Then the creation of the TaskScheduler and the DAGScheduler:
private[spark] def taskScheduler: TaskScheduler = _taskScheduler
private[spark] def taskScheduler_=(ts: TaskScheduler): Unit = {
_taskScheduler = ts
}
private[spark] def dagScheduler: DAGScheduler = _dagScheduler
private[spark] def dagScheduler_=(ds: DAGScheduler): Unit = {
_dagScheduler = ds
}
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
The TaskScheduler (together with its SchedulerBackend) is created mainly through the createTaskScheduler method, and the DAGScheduler is created last.
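createTaskScheduler chooses the TaskScheduler / SchedulerBackend pair by pattern matching on the master URL. The following is a simplified, self-contained sketch of that dispatch idea, not Spark's actual code; the strings name the backends typically chosen in Spark 2.x:

// Simplified sketch of the master-URL dispatch in SparkContext.createTaskScheduler.
// Illustrative only; the real method constructs and initializes the actual scheduler objects.
object MasterDispatchSketch {
  def backendFor(master: String): String = master match {
    case "local"                       => "TaskSchedulerImpl + LocalSchedulerBackend (1 thread)"
    case m if m.startsWith("local[")   => "TaskSchedulerImpl + LocalSchedulerBackend (N threads)"
    case m if m.startsWith("spark://") => "TaskSchedulerImpl + StandaloneSchedulerBackend"
    case other                         => s"backend provided by an external cluster manager for: $other"
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[4]", "spark://host:7077", "yarn").foreach { m =>
      println(s"$m -> ${backendFor(m)}")
    }
  }
}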
2. Summary
SparkContext is the entry point of a Spark program and is extremely important: the details of the whole Spark execution flow all start here. That is why people often say that once you understand SparkContext you understand Spark, although this is only true at the level of the framework; the internal details still take time and careful reading to digest. Since SparkContext is only the entry point, this post is purely introductory and does not dig very deep; later posts will cover the individual modules in more detail.