This article walks through what happens in standalone mode when an application starts up: the creation of the SparkConf and then of the SparkContext.
In an application, one typically creates a SparkConf object first, sets the desired parameters on it, and then uses it to construct the SparkContext, as in the sketch below.
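A minimal driver program illustrating this pattern (the app name, master URL, and settings are placeholders, not values from this article):

// Typical entry point of a standalone-mode application: build a SparkConf,
// then hand it to the SparkContext, which drives everything described below.
import org.apache.spark.{SparkConf, SparkContext}

object MiniApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MiniApp")                  // becomes sc.appName
      .setMaster("spark://master-host:7077")  // placeholder standalone Master URL
      .set("spark.executor.memory", "1g")     // arbitrary example setting
    val sc = new SparkContext(conf)           // triggers the flow in this article
    println(sc.parallelize(1 to 10).sum())    // trivial job to prove the context works
    sc.stop()
  }
}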
1. SparkConf
SparkConf manages a Spark application's configuration; parameters are represented as key-value pairs.
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

  import SparkConf._

  /** Create a SparkConf that loads defaults from system properties and the classpath */
  def this() = this(true)

  private val settings = new ConcurrentHashMap[String, String]()

  if (loadDefaults) {
    // Load any spark.* system properties
    for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
      set(key, value)
    }
  }
If loadDefaults is true, the constructor picks up the Spark properties that SparkSubmit set as JVM system properties, keeping every key that starts with "spark.".
All parameters are stored as key-value pairs in a ConcurrentHashMap.
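A short sketch of this behavior (the property value here is an arbitrary example):

// Any "spark.*" JVM system property set before the no-arg constructor runs
// is picked up as a default; explicit set(...) calls still override it.
import org.apache.spark.SparkConf

object ConfDemo {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.app.name", "FromSystemProps") // simulates what SparkSubmit does
    val conf = new SparkConf()                 // loadDefaults = true
    println(conf.get("spark.app.name"))        // FromSystemProps
    conf.set("spark.app.name", "Overridden")   // later writes win
    println(conf.get("spark.app.name"))        // Overridden
  }
}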
2. SparkContext
2.1 Creating SparkEnv
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
}

private[spark] val env = createSparkEnv(conf, isLocal, listenerBus)
SparkEnv.set(env)
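SparkEnv.createDriverEnv assembles the driver-side runtime services (serializer, BlockManager, MapOutputTracker, and so on). In the 1.x line SparkEnv is exposed as a DeveloperApi, so, assuming a live SparkContext on the driver, the environment registered above can be inspected:

// SparkEnv.get returns the environment registered above via SparkEnv.set(env).
val env = org.apache.spark.SparkEnv.get
println(env.blockManager)   // the driver-side BlockManager discussed in section 2.5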
2.2 Creating TaskScheduler
private[spark] var (schedulerBackend, taskScheduler) =
SparkContext.createTaskScheduler(this, master)
2.2.1 SparkContext.createTaskScheduler
In standalone mode, the SPARK_REGEX branch of the match executes:
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  ...
  master match {
    ...
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
    ...
  }
}
What this branch does:
(1) Creates a TaskSchedulerImpl object; this class implements the TaskScheduler trait.
(2) Creates a SparkDeploySchedulerBackend object. Its inheritance chain: SparkDeploySchedulerBackend extends CoarseGrainedSchedulerBackend, which implements the SchedulerBackend trait; it also mixes in AppClientListener so it can receive callbacks from the AppClient created in section 2.4.1.
(3) Initializes the TaskSchedulerImpl via scheduler.initialize(backend), which sets its backend field to the SparkDeploySchedulerBackend.
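For reference, a self-contained sketch of how this branch turns a comma-separated standalone master URL into individual Master addresses; the regex mirrors SparkContext's SPARK_REGEX, while the surrounding demo code is ours:

// "spark://host1:7077,host2:7077" matches SPARK_REGEX with
// sparkUrl = "host1:7077,host2:7077"; splitting on ',' and re-prefixing
// "spark://" yields one URL per Master, which is how HA master lists work.
object MasterUrlDemo {
  val SPARK_REGEX = """spark://(.*)""".r

  def main(args: Array[String]): Unit = {
    "spark://host1:7077,host2:7077" match {
      case SPARK_REGEX(sparkUrl) =>
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        masterUrls.foreach(println)   // spark://host1:7077, spark://host2:7077
    }
  }
}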
2.3 Creating DAGScheduler
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
  dagScheduler = new DAGScheduler(this)
} catch {
  ...
}
2.4 Starting the TaskScheduler
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()
The method actually invoked is TaskSchedulerImpl.start.
The call flow initiated from this method involves interactions among several actors:
(1) DriverActor;
(2) ClientActor;
(3) Master;
(4) Worker;
(5) CoarseGrainedExecutorBackend.
All of these interactions are initiated by the ClientActor, which is created by the AppClient.
2.4.1 SparkDeploySchedulerBackend.start
override def start() {
  super.start()

  // The endpoint for executors to talk to us
  val driverUrl = AkkaUtils.address(
    AkkaUtils.protocol(actorSystem),
    SparkEnv.driverActorSystemName,
    conf.get("spark.driver.host"),
    conf.get("spark.driver.port"),
    CoarseGrainedSchedulerBackend.ACTOR_NAME)
  val args = Seq(
    "--driver-url", driverUrl,
    "--executor-id", "{{EXECUTOR_ID}}",
    "--hostname", "{{HOSTNAME}}",
    "--cores", "{{CORES}}",
    "--app-id", "{{APP_ID}}",
    "--worker-url", "{{WORKER_URL}}")
  val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
    .map(Utils.splitCommandString).getOrElse(Seq.empty)
  val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
  val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

  // When testing, expose the parent class path to the child. This is processed by
  // compute-classpath.{cmd,sh} and makes all needed jars available to child processes
  // when the assembly is built with the "*-provided" profiles enabled.
  val testingClassPath =
    if (sys.props.contains("spark.testing")) {
      sys.props("java.class.path").split(java.io.File.pathSeparator).toSeq
    } else {
      Nil
    }

  // Start executors with a few necessary configs for registering with the scheduler
  val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
  val javaOpts = sparkJavaOpts ++ extraJavaOpts
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    appUIAddress, sc.eventLogDir, sc.eventLogCodec)
  client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
  client.start()

  waitForRegistration()
}
Responsibilities of this method:
(1) Call the superclass's start method, i.e. CoarseGrainedSchedulerBackend.start, which creates the DriverActor;
(2) Assemble the parameters needed to launch executors and build an ApplicationDescription object (a sketch of the option-splitting idiom follows this list);
(3) Create an AppClient object and start it.
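Note that the {{EXECUTOR_ID}}-style tokens in args are placeholders that the Worker substitutes when it actually launches the executor process. The getOption(...).map(...).getOrElse(...) idiom used for the class/library path options can be tried in isolation; the conf keys match the source above, the rest of this demo is ours:

// How the extraClassPath option becomes a Seq of path entries: absent key -> Nil.
import java.io.File

object PathSplitDemo {
  def main(args: Array[String]): Unit = {
    val opts = Map("spark.executor.extraClassPath" ->
      Seq("/opt/libs/a.jar", "/opt/libs/b.jar").mkString(File.pathSeparator))

    val classPathEntries = opts.get("spark.executor.extraClassPath")
      .map(_.split(File.pathSeparator).toSeq).getOrElse(Nil)
    println(classPathEntries)     // List(/opt/libs/a.jar, /opt/libs/b.jar)

    val libraryPathEntries = opts.get("spark.executor.extraLibraryPath")
      .map(_.split(File.pathSeparator).toSeq).getOrElse(Nil)
    println(libraryPathEntries)   // List() -- key absent, so Nil
  }
}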
2.4.2 AppClient.start
def start() {
  // Just launch an actor; it will call back into the listener.
  actor = actorSystem.actorOf(Props(new ClientActor))
}
This creates the ClientActor.
In its preStart method, the ClientActor initiates the application-registration handshake with the Master.
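A minimal Akka sketch of this preStart-driven registration pattern; the actor and message names below are simplified stand-ins, not Spark's actual DeployMessages:

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case class RegisterApplication(appName: String)   // stand-in for Spark's registration message
case class RegisteredApplication(appId: String)   // stand-in for the Master's reply

class MasterStub extends Actor {
  def receive = {
    case RegisterApplication(name) =>
      // The real Master assigns an application id and replies to the client.
      sender() ! RegisteredApplication(s"app-$name-0001")
  }
}

class ClientActorSketch(master: ActorRef) extends Actor {
  // Like Spark's ClientActor, registration starts in preStart, so it happens
  // as soon as actorOf(...) materializes the actor.
  override def preStart(): Unit = {
    master ! RegisterApplication("demo")
  }

  def receive = {
    case RegisteredApplication(appId) =>
      println(s"registered with Master, appId = $appId")
  }
}

object RegistrationDemo extends App {
  val system = ActorSystem("sketch")
  val master = system.actorOf(Props[MasterStub], "master")
  system.actorOf(Props(new ClientActorSketch(master)), "client")
}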
2.5 Initializing the BlockManager on the Driver
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()
val applicationId: String = taskScheduler.applicationId()
conf.set("spark.app.id", applicationId)
env.blockManager.initialize(applicationId)
(1) taskScheduler.start() starts the TaskScheduler as described in section 2.4;
(2) The application id is obtained; it is assigned by the Master, which is why this step must come after the scheduler has started and the application has registered;
(3) The BlockManager is initialized with that id. Construction and initialization are deliberately split: the BlockManager object is created inside SparkEnv before the application id exists, and initialize(applicationId) is called here once the id is known.
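Once the context is up, the Master-assigned id is visible through the configuration (and via the public applicationId member shown in the snippet above). A quick check, assuming a live SparkContext named sc:

// "spark.app.id" was set in the constructor snippet above, right after
// taskScheduler.start(), so any code running after SparkContext creation sees it.
println(sc.getConf.get("spark.app.id"))   // e.g. app-20150101120000-0000
println(sc.applicationId)                 // same value, via the accessor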
2.6 Initializing the BlockManager on the Executor
After the DriverActor receives the RegisterExecutor message sent by CoarseGrainedExecutorBackend, it normally replies with a RegisteredExecutor message, which CoarseGrainedExecutorBackend then handles:
override def receiveWithLogging = {
  case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    val (hostname, _) = Utils.parseHostPort(hostPort)
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
This creates an Executor object, whose primary constructor initiates the BlockManager initialization:
if (!isLocal) {
  env.metricsSystem.registerSource(executorSource)
  env.blockManager.initialize(conf.getAppId)
}
The app id in conf travels from the Master to the Worker and is set into the SparkConf by the CoarseGrainedExecutorBackend singleton object before the backend actor starts.