Deep Dive into the Spark 2.0.x Source Code: SparkContext

WeChat: 519292115

Email: taosiyuan163@163.com


Please respect the original work; reproduction without permission is prohibited!


Spark is currently one of the hottest frameworks in the big data space. It efficiently supports offline batch processing, real-time computation, machine learning, and other diverse workloads, and reading its source code will deepen your understanding of the framework.

In this and the following chapters I will analyze the core components of Spark 2.0.x one by one, including SparkContext, SparkEnv, RpcEnv, NettyRpc, BlockManager, MapOutputTracker, TaskScheduler, DAGScheduler, and more.


SparkContext is the first object created in a Spark program, and it is constructed on the driver. Besides connecting to the cluster, its constructor also initializes the core components, including the DAGScheduler, TaskScheduler, SparkEnv, accumulators, and so on.
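For orientation, here is a minimal sketch of the driver program that triggers the initialization walked through in this chapter; the application name and master URL are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sparkcontext-demo") // placeholder application name
      .setMaster("local[2]")           // placeholder master URL
    // Constructing the SparkContext performs all of the initialization described below
    val sc = new SparkContext(conf)
    // ... create RDDs and run jobs ...
    sc.stop()
  }
}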

/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM.  You must `stop()` the active SparkContext before
 * creating a new one.  This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
class SparkContext(config: SparkConf) extends Logging {
The listenerBus is one of the first objects created here. It handles event listening for the Spark cluster, similar to the MetricsSystem, and the two also exchange messages with each other.

// An asynchronous listener bus for Spark events
private[spark] val listenerBus = new LiveListenerBus(this)
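
From the user's side, the effect of the listenerBus can be seen by registering a custom listener; the sketch below assumes an existing SparkContext named sc, and listeners added through addSparkListener receive the events dispatched by this bus.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// A custom listener; onJobEnd is only one of the many callbacks that can be overridden
val jobLogger = new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"Job ${jobEnd.jobId} ended at ${jobEnd.time}")
  }
}
sc.addSparkListener(jobLogger)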

First, the driver's identity, including its address, host name, and executor ID, is written into the SparkConf:

// Set Spark driver host and port system properties. This explicitly sets the configuration
// instead of relying on the default value of the config constant.
_conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
_conf.setIfMissing("spark.driver.port", "0")

_conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)

Before SparkEnv is created, a JobProgressListener is created and registered on the listenerBus. It handles every event posted to the bus, including the events generated while SparkEnv itself is being created:

// "_jobProgressListener" should be set up before creating SparkEnv because when creating
// "SparkEnv", some messages will be posted to "listenerBus" and we should not miss them.
// Listens for events; it is registered on the previously created listenerBus
_jobProgressListener = new JobProgressListener(_conf)
listenerBus.addListener(jobProgressListener)

// Create the Spark execution environment (cache, map output tracker, etc)
// Now create the Spark execution environment
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
What is actually created here is the driver-side SparkEnv:

// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  // Create the driver's SparkEnv
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}
Execution then reaches the initialization method in the SparkEnv companion object. SparkEnv itself and the components it creates, such as RpcEnv, MapOutputTracker, blockTransferService, and blockManager, will each get their own chapter later on.

/**
 * Create a SparkEnv for the driver.
 */
private[spark] def createDriverEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus,
    numCores: Int,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
  // First assert that DRIVER_HOST_ADDRESS has been set
  assert(conf.contains(DRIVER_HOST_ADDRESS),
    s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
  // Assert that spark.driver.port has been set
  assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
  // Get the bind address
  val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
  // Get the advertised host address
  val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
  val port = conf.get("spark.driver.port").toInt
  // Check whether I/O encryption is enabled for data transfer
  val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
    Some(CryptoStreamUtils.createKey(conf))
  } else {
    None
  }
  // Call the generic create method of SparkEnv
  create(
    conf,
    SparkContext.DRIVER_IDENTIFIER,
    bindAddress,
    advertiseAddress,
    Option(port),
    isLocal,
    numCores,
    ioEncryptionKey,
    listenerBus = listenerBus,
    mockOutputCommitCoordinator = mockOutputCommitCoordinator
  )
}
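
Note that the ioEncryptionKey above is only generated when I/O encryption has been switched on in the configuration. A sketch of the relevant settings (values are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.io.encryption.enabled", "true")    // makes createDriverEnv generate an ioEncryptionKey
  .set("spark.io.encryption.keySizeBits", "256") // optional; the default key size is 128 bits
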
// A low-level status reporting API that monitors job and stage progress
_statusTracker = new SparkStatusTracker(this)
// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
// Register the HeartbeatReceiver endpoint on the rpcEnv and get back its RpcEndpointRef.
// Note: from here on, all master-slave style components register themselves and look up
// their counterpart endpoints through setupEndpoint and setupEndpointRef.
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  // Registered by its name together with the endpoint object
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
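
The register/lookup pattern is easiest to see in a simplified sketch. The RPC API is private[spark], so the code below only works inside the org.apache.spark package; EchoEndpoint is a made-up name, and rpcEnv stands for an existing RpcEnv instance such as env.rpcEnv above.

import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEndpointRef, RpcEnv}

// A trivial endpoint that answers every String message it receives
class EchoEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case msg: String => context.reply(s"echo: $msg")
  }
}

// Register by name, exactly as HeartbeatReceiver is registered above, and get back a reference
val echoRef: RpcEndpointRef = rpcEnv.setupEndpoint("echo", new EchoEndpoint(rpcEnv))
// Anyone holding (or looking up) the reference can now send it messages
val futureReply = echoRef.ask[String]("hello")   // returns a Future[String]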

Next comes the creation and startup of two of the most important components, the TaskScheduler and the DAGScheduler:

// Create and start the scheduler
// createTaskScheduler matches on the master URL to decide how the corresponding
// schedulerBackend and taskScheduler are created
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
// Create the DAGScheduler; the details are covered in a later chapter
_dagScheduler = new DAGScheduler(this)
// Notify the HeartbeatReceiver that the taskScheduler has been created
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
// Start the taskScheduler
_taskScheduler.start()
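
At its core, createTaskScheduler is a pattern match on the master URL. The sketch below is not the real implementation (which constructs and initializes the scheduler objects); it only illustrates how the shape of the URL selects the SchedulerBackend / TaskScheduler pair, with regex names mirroring those used in the source.

val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r   // local[N] or local[*]
val SPARK_REGEX   = """spark://(.*)""".r           // standalone master URL

def describeScheduler(master: String): String = master match {
  case "local"                => "TaskSchedulerImpl + LocalSchedulerBackend, single task thread"
  case LOCAL_N_REGEX(threads) => s"TaskSchedulerImpl + LocalSchedulerBackend, $threads task threads"
  case SPARK_REGEX(sparkUrl)  => s"TaskSchedulerImpl + StandaloneSchedulerBackend against spark://$sparkUrl"
  case other                  => s"delegated to an ExternalClusterManager for '$other' (e.g. YARN)"
}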

Next, the application ID is obtained from the task scheduler and the BlockManager is initialized:

// Get the Spark application ID; by default it has the form "spark-application-" + timestamp
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
// Store the application ID that was just obtained in the conf
_conf.set("spark.app.id", _applicationId)
if (_conf.getBoolean("spark.ui.reverseProxy", false)) {
  System.setProperty("spark.ui.proxyBase", "/proxy/" + _applicationId)
}
_ui.foreach(_.setAppId(_applicationId))
// Initialize the blockManager with the application ID obtained above
_env.blockManager.initialize(_applicationId)
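
Once initialization is complete, the same application ID is also visible from user code (assuming a SparkContext named sc):

// Prints e.g. "spark-application-1508230210312" in local mode,
// or an ID assigned by the cluster manager (e.g. "application_..." on YARN)
println(sc.applicationId)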

If dynamic resource allocation is enabled, an ExecutorAllocationManager is constructed; currently this mode is only supported on YARN.

// Optionally scale number of executors dynamically based on workload. Exposed for testing.
// Whether dynamic resource allocation is enabled; currently only supported in YARN mode.
// It also requires the blockManager's external shuffle service to be enabled; this is
// covered again in a later chapter.
val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
_executorAllocationManager =
  if (dynamicAllocationEnabled) {
    schedulerBackend match {
      // Responsible for connecting to the cluster manager to request new executors or kill existing ones
      case b: ExecutorAllocationClient =>
        // This object dynamically triggers adding or removing executors based on the workload
        Some(new ExecutorAllocationManager(
          schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf))
      case _ =>
        None
    }
  } else {
    None
  }
_executorAllocationManager.foreach(_.start())
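
A sketch of the configuration that switches this code path on (the executor counts are placeholders); as noted above, dynamic allocation also relies on the external shuffle service running on the executors.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // external shuffle service must be on
  .set("spark.dynamicAllocation.minExecutors", "1")   // placeholder lower bound
  .set("spark.dynamicAllocation.maxExecutors", "20")  // placeholder upper bound
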
Finally, a weak-reference-based cleaner is created to clean up RDDs, ShuffleDependencies, and Broadcasts:

_cleaner =
  if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
    // ContextCleaner holds weak references to RDDs, ShuffleDependencies and Broadcasts and
    // cleans them up once they go out of scope. For example, when an RDD object has been
    // garbage collected but its associated data still exists, ContextCleaner is responsible
    // for cleaning up that RDD's data.
    Some(new ContextCleaner(this))
  } else {
    None
  }
_cleaner.foreach(_.start())
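
The related settings, shown here with their default values, give a feel for how the cleaner behaves; setting referenceTracking to false disables the ContextCleaner entirely.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cleaner.referenceTracking", "true")                   // create a ContextCleaner at all
  .set("spark.cleaner.referenceTracking.blocking", "true")          // block on cleanup of RDDs and broadcasts
  .set("spark.cleaner.referenceTracking.blocking.shuffle", "false") // shuffle cleanup is non-blocking by default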

