Spark 2.0.2 Source Code Analysis: Registering and Launching an Application (Standalone Mode)

This post continues from the previous Spark 2.0.2 source analysis of TaskScheduler startup.

In StandaloneSchedulerBackend's start method, besides calling the parent class CoarseGrainedSchedulerBackend's start method to create the DriverEndpoint, the main work is creating an AppClient. The AppClient registers the Application with the Master, and the Master allocates Workers to it based on the Application's information.

//  Get the Driver's RpcEndpoint address
    val driverUrl = RpcEndpointAddress(
      sc.conf.get("spark.driver.host"),
      sc.conf.get("spark.driver.port").toInt,
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString

//  Executor launch arguments
    val args = Seq(
      "--driver-url", driverUrl,
      "--executor-id", "{{EXECUTOR_ID}}",
      "--hostname", "{{HOSTNAME}}",
      "--cores", "{{CORES}}",
      "--app-id", "{{APP_ID}}",
      "--worker-url", "{{WORKER_URL}}")
    //  Extra Java options for executors
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    //  Extra classpath entries for executors
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
    //  Extra library path entries for executors
    val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

//  Spark Java options that must be passed to executors at startup
    val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
    //  Combine the Java options
    val javaOpts = sparkJavaOpts ++ extraJavaOpts

//  Wrap the collected information into a Command
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
    //  Get the application UI address
    val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
    //  Number of cores per executor
    val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
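To illustrate the classpath splitting above, here is a small self-contained sketch (the jar paths are made up) of how an extraClassPath-style value is divided on the platform path separator, exactly as the `getOption(...).map(_.split(...))` chain does:

```scala
// Hypothetical values, mirroring how spark.executor.extraClassPath is split
// into individual entries on the platform path separator.
val sep = java.io.File.pathSeparator // ":" on Unix, ";" on Windows

// Simulate a configured value; None would fall through to Nil via getOrElse.
val raw: Option[String] = Some(Seq("/opt/libs/a.jar", "/opt/libs/b.jar").mkString(sep))

val classPathEntries: Seq[String] = raw.map(_.split(sep).toSeq).getOrElse(Nil)
// classPathEntries now holds the two jar paths as separate entries
```

The same `Option`-based pattern is used for the library path and Java options, which is why an unset configuration key cleanly yields an empty sequence instead of a null.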

The parameters prepared above, together with the relevant SparkContext information, are wrapped into an ApplicationDescription object.

The ApplicationDescription is then passed in to create a StandaloneAppClient. Constructing the StandaloneAppClient object triggers its initialization, which is explained below:

// Create the ApplicationDescription
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    //  Create the StandaloneAppClient, handing it the ApplicationDescription; this initializes the AppClient
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    //  After creating the AppClient, call its start method; events are reported back via the listener
    client.start()
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    //  Wait for the registration status to update
    waitForRegistration()
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }

The ApplicationDescription that is sent has the following format:

private[spark] case class ApplicationDescription(
    name: String,    // Application name; set via spark.app.name or with setAppName() in code
    maxCores: Option[Int],  // Maximum number of CPU cores the application may use
    memoryPerExecutorMB: Int, // Memory used by each executor, in MB
    command: Command,   // Launch command
    appUiUrl: String, // URL of the application's web UI
    eventLogDir: Option[URI] = None, // Event log directory
    //  Short name of the compression codec used when writing event logs, if configured (e.g. lzf)
    eventLogCodec: Option[String] = None,
    coresPerExecutor: Option[Int] = None,
    // Number of executors this application wants to launch; only used when dynamic allocation is enabled
    initialExecutorLimit: Option[Int] = None,
    user: String = System.getProperty("user.name", "<unknown>")) {

  override def toString: String = "ApplicationDescription(" + name + ")"
}

The AppClient is created with the arguments rpcEnv, masters, appDesc, this, and conf.

The this here is worth a closer look: the corresponding parameter is the listener.

StandaloneAppClientListener is a trait that lets the AppClient promptly notify StandaloneSchedulerBackend of certain state changes when various events occur.

When StandaloneSchedulerBackend creates the AppClient, a listener is required, so the backend passes itself as that argument; that is the this above.

/**
  *   Callbacks that promptly notify StandaloneSchedulerBackend when various events occur.
  *   There are currently four event callbacks:
  *   1. connected  2. disconnected  3. executor added  4. executor removed
  */
private[spark] trait StandaloneAppClientListener {
  //  The application was successfully registered with the master; we are connected to the cluster
  def connected(appId: String): Unit

  //  Disconnected from the master
  // The disconnection may be temporary, e.g. the original master failed; the connection recovers after a master failover
  def disconnected(): Unit

  //  The application stopped due to an unrecoverable error
  def dead(reason: String): Unit

  //  An executor was added
  def executorAdded(
      fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int): Unit
  //  An executor was removed
  def executorRemoved(
      fullId: String, message: String, exitStatus: Option[Int], workerLost: Boolean): Unit
}
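The "pass this as the listener" pattern above can be sketched with a toy model. Every name below (ToyBackend, ToyClient, AppLifecycleListener, the app id) is illustrative and not part of Spark's real API; the point is only how the backend hands itself to the client so the client can call back on events:

```scala
// Minimal sketch of the listener callback pattern, with made-up names.
trait AppLifecycleListener {
  def connected(appId: String): Unit
  def dead(reason: String): Unit
}

// The client only knows the listener interface, not the concrete backend.
class ToyClient(listener: AppLifecycleListener) {
  def register(): Unit = listener.connected("app-0001") // simulate a successful registration
}

// The backend implements the listener trait and passes itself to the client.
class ToyBackend extends AppLifecycleListener {
  var state: String = "SUBMITTED"
  def connected(appId: String): Unit = { state = s"RUNNING($appId)" }
  def dead(reason: String): Unit = { state = s"DEAD($reason)" }
  def start(): Unit = new ToyClient(this).register() // `this` is the listener argument
}
```

This decoupling is why StandaloneAppClient can notify the scheduler backend of connection, disconnection, and executor changes without depending on the backend's concrete type.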

client.start():

def start() {
    // Just launch an rpcEndpoint; it will call back into the listener.
    endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
  }

When client.start() registers the ClientEndpoint with the RpcEnv, the endpoint's onStart initialization method is invoked; for the details, see the earlier post on Spark 2.0.2's RPC communication mechanism (message handling).
The initialization method calls registerWithMaster to register with the master:

override def onStart(): Unit = {
      try { // Register with the master
        registerWithMaster(1)
      } catch {
        case e: Exception =>
          logWarning("Failed to connect to master", e)
          markDisconnected()
          stop()
      }
    }

The registerWithMaster method actually calls tryRegisterAllMasters to register with all of the Masters. In Spark, the Master can be made highly available (HA); when HA is configured there are multiple Masters, so registration is sent to every one of them, but only the active Master responds:

/**   Register with all of the masters asynchronously. registerWithMaster keeps being called
      *   until the registration retry limit is exceeded.
      *   Once we successfully connect to a master, all scheduled work and Futures are cancelled.
      */
    private def registerWithMaster(nthRetry: Int) {
      // Actually calls tryRegisterAllMasters to register with all of the masters
      registerMasterFutures.set(tryRegisterAllMasters())
      registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
        override def run(): Unit = {
          if (registered.get) {
            registerMasterFutures.get.foreach(_.cancel(true))
            registerMasterThreadPool.shutdownNow()
          } else if (nthRetry >= REGISTRATION_RETRIES) {
            markDead("All masters are unresponsive! Giving up.")
          } else {
            registerMasterFutures.get.foreach(_.cancel(true))
            registerWithMaster(nthRetry + 1)
          }
        }
      }, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
    }
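The retry structure above (attempt, schedule a check, give up or recurse) can be reduced to a runnable sketch. The names RetrySketch, tryRegister, and MaxRetries are made up for illustration; only the shape of the logic follows registerWithMaster:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// Toy model of bounded, timer-driven registration retries.
object RetrySketch {
  val registered = new AtomicBoolean(false)       // flipped when a master replies
  val attempts   = new AtomicInteger(0)           // counts registration attempts
  val scheduler  = Executors.newSingleThreadScheduledExecutor()
  val MaxRetries = 3

  def tryRegister(): Unit = attempts.incrementAndGet() // stand-in for tryRegisterAllMasters

  def registerWithMaster(nthRetry: Int): Unit = {
    tryRegister()
    // After a timeout, check whether we registered; if not, retry or give up.
    scheduler.schedule(new Runnable {
      override def run(): Unit = {
        if (registered.get) ()                    // success: Spark cancels pending futures here
        else if (nthRetry >= MaxRetries) ()       // give up: Spark calls markDead here
        else registerWithMaster(nthRetry + 1)     // otherwise cancel old futures and retry
      }
    }, 10, TimeUnit.MILLISECONDS)
  }
}
```

With registered never set, the sketch makes exactly MaxRetries attempts and then stops, which mirrors the "All masters are unresponsive! Giving up." path in the real code.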

tryRegisterAllMasters 方法,向所有 master 注册:

/**
      *   Register with the masters asynchronously. Returns an array of [Future]s that can be cancelled later.
      */
    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
    // Iterate over all masterRpcAddresses
      for (masterAddress <- masterRpcAddresses) yield {
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = try {
            if (registered.get) {
              return
            }
            logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
            val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
            // Register the application with the master
            masterRef.send(RegisterApplication(appDescription, self))
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        })
      }
    }
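The fan-out in tryRegisterAllMasters (one task per master address, submitted to a thread pool, with the futures kept for later cancellation) can be sketched as follows; the addresses and the "connected to ..." result strings are made up, standing in for masterRef.send(RegisterApplication(...)):

```scala
import java.util.concurrent.{Callable, Executors, Future => JFuture}

// Hypothetical master addresses; in Spark these come from masterRpcAddresses.
val masterAddresses = Seq("host1:7077", "host2:7077", "host3:7077")
val pool = Executors.newFixedThreadPool(masterAddresses.size)

// Submit one registration attempt per master; keep the futures so that a
// later success (or the retry timer) can cancel the still-pending attempts.
val futures: Seq[JFuture[String]] = masterAddresses.map { addr =>
  pool.submit(new Callable[String] {
    override def call(): String = s"connected to $addr" // stand-in for the RPC send
  })
}
```

Keeping the futures is the whole point: registerWithMaster's timer calls `cancel(true)` on them either when one master answers or before the next retry round.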

After the registration request is sent to the Master, the Master receives the ApplicationDescription, wraps it into an ApplicationInfo, and calls registerApplication to register the application:

case RegisterApplication(description, driver) =>
      // TODO Prevent repeated registrations from some driver
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        val app = createApplication(description, driver)
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app)
        driver.send(RegisteredApplication(app.id, self))
        schedule()
      }
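The STANDBY guard above can be modeled with a toy dispatch function. All names below (RecoverySketch types, handleRegister, the "app-" id scheme) are illustrative, not Spark's real classes; only the shape matches: a standby master silently ignores RegisterApplication, while the active master registers the app and replies:

```scala
// Toy model of the Master-side dispatch on RegisterApplication.
sealed trait RecoveryState
case object Alive   extends RecoveryState
case object Standby extends RecoveryState

final case class RegisterApplication(name: String)

def handleRegister(state: RecoveryState, msg: RegisterApplication): Option[String] =
  state match {
    case Standby => None                      // ignore, don't send a response
    case Alive   => Some("app-" + msg.name)   // register and reply with an app id
  }
```

This is also why the client can safely register with every master at once: the standby masters produce no response, so only the active master's RegisteredApplication reaches the driver.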

After successful registration, the Master sends a message back to the StandaloneAppClient inside StandaloneSchedulerBackend. When the listener receives the message, the state changes to connected:

// The Master sends RegisteredApplication to the driver-side StandaloneAppClient, confirming the Application is registered

driver.send(RegisteredApplication(app.id, self))
case RegisteredApplication(appId_, masterRef) =>
        // FIXME How to handle the following cases?
        // 1. A master receives multiple registrations and sends back multiple
        // RegisteredApplications due to an unstable network.
        // 2. Receive multiple RegisteredApplication from different masters because the master is
        // changing.
        appId.set(appId_)
        registered.set(true)
        master = Some(masterRef)
        listener.connected(appId.get)

Finally, the schedule method is called to perform scheduling. It first calls launchDriver to register and launch the driver on a Worker, then startExecutorsOnWorkers, which allocates each Worker's resources to executors via allocateWorkerResourceToExecutors and launches the executors on the Workers; the Tasks are then executed on those executors.

Reposted from: Spark Source Reading: Registering an Application
