Spark Source Code Study (1): The SparkContext Initialization Process

beTree_fc

Category: Spark source code. Tags: spark, sparkcontext, initialization

Article link: https://blog.csdn.net/u013560925/article/details/79617819

Background

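SparkContext is the entry point of a Spark application, and SparkSession also wraps a SparkContext object. During initialization, SparkContext sets up the DAGScheduler, TaskScheduler, SchedulerBackend and MapOutputTrackerMaster. TaskScheduler and SchedulerBackend are both interfaces (Scala traits), and different implementations are instantiated depending on the environment. In standalone mode the TaskScheduler is TaskSchedulerImpl and the SchedulerBackend is StandaloneSchedulerBackend, which extends CoarseGrainedSchedulerBackend. StandaloneSchedulerBackend is mainly responsible for communication and resource management, such as registering the application with the master. The sections below walk through how these key components are initialized in the source code.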

Process

1. SparkContext.scala

    // This stage is mostly configuration setup, for example:
    if (!_conf.contains("spark.master")) {
      throw new SparkException("A master URL must be set in your configuration")
    }
    if (!_conf.contains("spark.app.name")) {
      throw new SparkException("An application name must be set in your configuration")
    }

    // log out spark.app.name in the Spark driver logs
    logInfo(s"Submitted application: $appName")

    // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
    if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
      throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
        "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
    }

    // ... (most of the code is omitted here) ...

    // Initialization of the main components
    // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
    // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
      HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

The most important calls here are:

SparkContext.createTaskScheduler(this, master, deployMode)

_taskScheduler.start()

2. The SparkContext.createTaskScheduler(this, master, deployMode) method

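createTaskScheduler picks how to initialize the scheduler and backend objects based on the configured master URL. The cases fall mainly into three groups: a spark:// URL, a local master, and a generic masterUrl fallback. The masterUrl fallback applies when the master URL matches none of the built-in patterns (for example, when the application runs on an external cluster manager); it uses masterUrl to look up a cluster manager and then uses that cluster manager object to create the scheduler and backend. For a spark:// URL, the initialization code is as follows: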

case SPARK_REGEX(sparkUrl) =>
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
        
        // Initialization call (important)
        scheduler.initialize(backend)
        // Return both objects
        (backend, scheduler)
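To make the branching on the master URL more concrete, here is a minimal, self-contained sketch of the same pattern-matching idea. The `Mode` types, the `classify` helper and the two regexes are assumptions written for illustration; they mimic the shape of the match in createTaskScheduler but are not Spark's actual definitions, and only some of the local variants are covered.

```scala
// Simplified sketch: names and regexes are illustrative, not Spark's own.
object MasterUrlSketch {
  sealed trait Mode
  case object Local extends Mode
  final case class LocalN(threads: String) extends Mode
  final case class Standalone(masterUrls: List[String]) extends Mode
  final case class External(masterUrl: String) extends Mode

  // A Scala Regex used as a pattern must match the whole input string.
  private val LocalNRegex = """local\[([0-9]+|\*)\]""".r
  private val SparkRegex = """spark://(.*)""".r

  def classify(master: String): Mode = master match {
    case "local" => Local
    case LocalNRegex(threads) => LocalN(threads)
    case SparkRegex(sparkUrl) =>
      // Same split-and-prefix step as in the excerpt above:
      // one spark:// URL may list several masters separated by commas.
      Standalone(sparkUrl.split(",").map("spark://" + _).toList)
    case other =>
      // Fallback: anything else is left to an external cluster manager.
      External(other)
  }
}
```

For example, `MasterUrlSketch.classify("spark://h1:7077,h2:7077")` yields a `Standalone` value holding both master URLs, each with the `spark://` prefix restored.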

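The scheduler.initialize() method mainly builds the Pool that holds TaskSetManagers. Two scheduling modes are supported, FIFO and FAIR. TaskSets submitted later are temporarily queued in this pool. The source is as follows: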

def initialize(backend: SchedulerBackend) {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }
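As a rough illustration of what FIFO ordering means for queued TaskSets, the sketch below keeps a small pool and sorts its entries by priority first and stage id second. `TaskSetInfo`, `Pool` and `fifoLessThan` are hypothetical names, and the assumption that a lower priority value means an earlier job is mine; FAIR scheduling, which also considers weights and minimum shares, is deliberately left out.

```scala
// Simplified model of a FIFO pool; all names here are hypothetical.
object PoolSketch {
  final case class TaskSetInfo(name: String, priority: Int, stageId: Int)

  // FIFO rule assumed here: lower priority value first,
  // ties broken by the lower stage id.
  def fifoLessThan(a: TaskSetInfo, b: TaskSetInfo): Boolean =
    if (a.priority != b.priority) a.priority < b.priority
    else a.stageId < b.stageId

  final class Pool {
    private var queue = Vector.empty[TaskSetInfo]

    // Submitted task sets are simply appended to the pool.
    def add(ts: TaskSetInfo): Unit = { queue = queue :+ ts }

    // The scheduling order is computed when tasks are about to be offered.
    def sortedFifo: Vector[TaskSetInfo] = queue.sortWith(fifoLessThan)
  }
}
```

The point of the design is that submission order and scheduling order are decoupled: the pool only stores entries, and the scheduling mode decides how they are sorted.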

3. _taskScheduler.start()

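The previous step passed the backend object into scheduler.initialize(), so the start logic mainly calls the backend's start() method, which registers the application with the master and asks the master to launch executors. The source of TaskSchedulerImpl.start() is as follows: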

 override def start() {
    
    // Main logic for registration and resource requests: delegate to the backend
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      speculationScheduler.scheduleWithFixedDelay(new Runnable {
        override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
          checkSpeculatableTasks()
        }
      }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
  }
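The two things start() does, delegating to the backend and optionally scheduling a periodic speculation check, can be modeled without Spark. In the sketch below `Backend` and `Scheduler` are hypothetical stand-ins rather than Spark's SchedulerBackend and TaskSchedulerImpl, and the `check` callback plays the role of checkSpeculatableTasks().

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Simplified model: Backend and Scheduler are hypothetical stand-ins.
object StartSketch {
  trait Backend { def start(): Unit }

  final class Scheduler(speculation: Boolean, intervalMs: Long) {
    private var backend: Option[Backend] = None
    private val timer = Executors.newSingleThreadScheduledExecutor()

    // Step 1: remember the backend handed over during initialization.
    def initialize(b: Backend): Unit = { backend = Some(b) }

    // Step 2: start the backend first, then optionally schedule a periodic check.
    def start(check: () => Unit): Unit = {
      backend
        .getOrElse(throw new IllegalStateException("initialize() must be called first"))
        .start()
      if (speculation) {
        val task = new Runnable { override def run(): Unit = check() }
        timer.scheduleWithFixedDelay(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
        ()
      }
    }

    // Stop the periodic check so the JVM can exit.
    def stop(): Unit = { timer.shutdownNow(); () }
  }
}
```

Calling start() before initialize() fails fast in this sketch, which makes the ordering dependency between the two steps explicit.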

4. The StandaloneSchedulerBackend.start() method

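StandaloneSchedulerBackend.start() does the following:
1. Builds a Command object from the relevant arguments, including --driver-url, --executor-id, --cores (the number of cores to run with) and --app-id.
2. Builds an ApplicationDescription object, which mainly carries limits and descriptive information such as the number of cores and the amount of memory.
3. Builds a StandaloneAppClient and calls StandaloneAppClient.start.
4. StandaloneAppClient.start uses an RPC endpoint to send the registration request.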

The source is as follows:

override def start() {
    super.start()

    // SPARK-21159. The scheduler backend should only try to connect to the launcher when in client
    // mode. In cluster mode, the code that submits the application to the Master needs to connect
    // to the launcher instead.
    if (sc.deployMode == "client") {
      launcherBackend.connect()
    }

    // The endpoint for executors to talk to us
    val driverUrl = RpcEndpointAddress(
      sc.conf.get("spark.driver.host"),
      sc.conf.get("spark.driver.port").toInt,
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
    val args = Seq(
      "--driver-url", driverUrl,
      "--executor-id", "{{EXECUTOR_ID}}",
      "--hostname", "{{HOSTNAME}}",
      "--cores", "{{CORES}}",
      "--app-id", "{{APP_ID}}",
      "--worker-url", "{{WORKER_URL}}")
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
    val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

    // Some code is omitted here
    
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
    val webUrl = sc.ui.map(_.webUrl).getOrElse("")
    val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
  
  
    val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    client.start()  // sends the registration request over RPC
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
   
  }

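The Command object is initialized with the class name of CoarseGrainedExecutorBackend, and the executor starts running from that class's main function. In other words, an executor is launched by running a class that has a main function (CoarseGrainedExecutorBackend) from the command line with a set of arguments.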
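The argument list in the code above contains placeholders such as {{EXECUTOR_ID}} and {{CORES}}, which have to be replaced with real values before the executor process is launched. The helper below is a hypothetical sketch of such a substitution step, written only to show the idea; it is not the code Spark uses, and the function name and map-based interface are my own assumptions.

```scala
// Hypothetical helper illustrating placeholder substitution in a command line.
object CommandSketch {
  // Replace every {{NAME}} placeholder with its value;
  // arguments without placeholders are passed through unchanged.
  def substitute(args: Seq[String], values: Map[String, String]): Seq[String] =
    args.map { arg =>
      values.foldLeft(arg) { case (acc, (name, value)) =>
        acc.replace("{{" + name + "}}", value)
      }
    }
}
```

With this helper, `Seq("--executor-id", "{{EXECUTOR_ID}}")` and `Map("EXECUTOR_ID" -> "0")` produce `Seq("--executor-id", "0")`, while a placeholder with no entry in the map is left as is.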

5. CoarseGrainedExecutorBackend.main()

On the worker side, execution starts from the main entry point of CoarseGrainedExecutorBackend.

The main function mostly validates the arguments passed to it and then calls run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath).
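The argument check in main can be pictured as a small recursive parser over the command-line list. The sketch below uses the same flag names as the Command built earlier, but `ExecutorArgs`, `parse` and `hasRequired` are hypothetical names, and treating --worker-url as optional is an assumption on my part.

```scala
// Hypothetical sketch of parsing the executor's command-line arguments.
object ArgsSketch {
  final case class ExecutorArgs(
      driverUrl: Option[String] = None,
      executorId: Option[String] = None,
      hostname: Option[String] = None,
      cores: Option[Int] = None,
      appId: Option[String] = None,
      workerUrl: Option[String] = None)

  // Consume the list two items at a time: a flag followed by its value.
  @scala.annotation.tailrec
  def parse(argv: List[String], acc: ExecutorArgs = ExecutorArgs()): ExecutorArgs =
    argv match {
      case "--driver-url" :: v :: tail => parse(tail, acc.copy(driverUrl = Some(v)))
      case "--executor-id" :: v :: tail => parse(tail, acc.copy(executorId = Some(v)))
      case "--hostname" :: v :: tail => parse(tail, acc.copy(hostname = Some(v)))
      case "--cores" :: v :: tail => parse(tail, acc.copy(cores = Some(v.toInt)))
      case "--app-id" :: v :: tail => parse(tail, acc.copy(appId = Some(v)))
      case "--worker-url" :: v :: tail => parse(tail, acc.copy(workerUrl = Some(v)))
      case Nil => acc
      case other =>
        throw new IllegalArgumentException("Unrecognized options: " + other.mkString(" "))
    }

  // Assumed rule: everything except the worker URL is required, and cores must be positive.
  def hasRequired(a: ExecutorArgs): Boolean =
    a.driverUrl.isDefined && a.executorId.isDefined && a.hostname.isDefined &&
      a.cores.exists(_ > 0) && a.appId.isDefined
}
```

Only when the required values are present would the equivalent of run(...) be called; otherwise the process would print usage information and exit.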

The run method mainly fetches the driver's Spark properties over RPC, initializes the local SparkEnv object, sets up the endpoints, and so on.

Conclusion

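After CoarseGrainedExecutorBackend starts, its receive method waits for tasks sent from the driver side (scheduled through the TaskSetManager) and then launches threads to run them. For details, see another post: http://blog.csdn.net/u013560925/article/details/79577957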
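The receive-then-run pattern described here can be sketched with a message type and a thread pool. Everything in the block below is a simplified assumption: the `Message` hierarchy, the `Backend` class and the result queue are hypothetical, and a real executor deserializes task descriptions and reports status back to the driver, which this sketch does not attempt.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

// Hypothetical model of an executor backend that runs received tasks on a thread pool.
object ExecutorSketch {
  sealed trait Message
  final case class LaunchTask(taskId: Long, body: () => String) extends Message
  case object Shutdown extends Message

  final class Backend(threads: Int) {
    private val pool = Executors.newFixedThreadPool(threads)
    val results = new ConcurrentLinkedQueue[(Long, String)]()

    // receive does not run the task itself; it hands the work to the pool and returns.
    def receive(msg: Message): Unit = msg match {
      case LaunchTask(id, body) =>
        pool.execute(new Runnable {
          override def run(): Unit = { results.add((id, body())); () }
        })
      case Shutdown =>
        // Already submitted tasks still finish; no new ones are accepted.
        pool.shutdown()
    }

    def awaitTermination(timeoutMs: Long): Boolean =
      pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)
  }
}
```

Keeping receive non-blocking is the key design point: the message loop stays responsive while tasks run concurrently on worker threads.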