Spark 1.6.0 core source code analysis 4: worker startup flow



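The worker's main method is similar to the master's: it creates a SparkConf, parses the arguments, creates an rpcEnv for exchanging messages with other components and with itself, and constructs the Worker object that is registered on that rpcEnv.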

private[deploy] object Worker extends Logging {
  val SYSTEM_NAME = "sparkWorker"
  val ENDPOINT_NAME = "Worker"

  def main(argStrings: Array[String]) {
    SignalLogger.register(log)
    val conf = new SparkConf
    val args = new WorkerArguments(argStrings, conf)
    val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
      args.memory, args.masters, args.workDir, conf = conf)
    rpcEnv.awaitTermination()
  }

  def startRpcEnvAndEndpoint(
      host: String,
      port: Int,
      webUiPort: Int,
      cores: Int,
      memory: Int,
      masterUrls: Array[String],
      workDir: String,
      workerNumber: Option[Int] = None,
      conf: SparkConf = new SparkConf): RpcEnv = {

    // The LocalSparkCluster runs multiple local sparkWorkerX RPC Environments
    val systemName = SYSTEM_NAME + workerNumber.map(_.toString).getOrElse("")
    val securityMgr = new SecurityManager(conf)
    val rpcEnv = RpcEnv.create(systemName, host, port, conf, securityMgr)
    val masterAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))
    rpcEnv.setupEndpoint(ENDPOINT_NAME, new Worker(rpcEnv, webUiPort, cores, memory,
      masterAddresses, systemName, ENDPOINT_NAME, workDir, conf, securityMgr))
    rpcEnv
  }
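
Two small steps in startRpcEnvAndEndpoint are easy to try in isolation: deriving the RPC system name from the optional worker number, and turning a spark:// master URL into a host and port. The following is only a sketch under assumptions: WorkerBootstrapSketch and parseSparkUrl are hypothetical names, and the parsing uses java.net.URI as a stand-in for RpcAddress.fromSparkURL rather than reproducing Spark's implementation.

```scala
import java.net.URI

// Hypothetical helper object for illustration; not part of Spark.
object WorkerBootstrapSketch {
  val SystemName = "sparkWorker"

  // Same expression as in startRpcEnvAndEndpoint: append the worker number if there is one.
  def systemName(workerNumber: Option[Int]): String =
    SystemName + workerNumber.map(_.toString).getOrElse("")

  // Assumed stand-in for RpcAddress.fromSparkURL: split "spark://host:port" into (host, port).
  def parseSparkUrl(url: String): (String, Int) = {
    val uri = new URI(url)
    require(
      uri.getScheme == "spark" && uri.getHost != null && uri.getPort >= 0,
      s"Invalid master URL: $url")
    (uri.getHost, uri.getPort)
  }
}
```

With a LocalSparkCluster-style worker number of 1 the system name becomes sparkWorker1; a standalone worker passes None and keeps the plain sparkWorker name.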

As with the master, the onStart method then runs and registers the worker with the master.

override def onStart() {
    assert(!registered)
    logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
      host, port, cores, Utils.megabytesToString(memory)))
    logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
    logInfo("Spark home: " + sparkHome)
    createWorkDir() // create the working directory
    // Optionally start an external shuffle service so that shuffle files read and written by
    // executors are kept after the executor exits; this is configurable.
    shuffleService.startIfEnabled()
    webUi = new WorkerWebUI(this, workDir, webUiPort)
    webUi.bind()
    registerWithMaster() // register with the master

    metricsSystem.registerSource(workerSource)
    metricsSystem.start()
    // Attach the worker metrics servlet handler to the web ui after the metrics system is started.
    metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
  }

private def registerWithMaster() {
    // onDisconnected may be triggered multiple times, so don't attempt registration
    // if there are outstanding registration attempts scheduled.
    registrationRetryTimer match {
      case None =>
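        registered = false
        // Send RegisterWorker to the rpcEnv of every master. As covered in earlier parts, a master
        // that handles the message successfully replies with RegisteredWorker; otherwise it
        // replies with RegisterWorkerFailed.
        registerMasterFutures = tryRegisterAllMasters()
        // After a fixed delay the worker handles ReregisterWithMaster, which checks whether it is
        // registered and, if not, sends the registration again. The registered state itself is
        // set from the master's reply.
        connectionAttemptCount = 0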
        registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
          new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              Option(self).foreach(_.send(ReregisterWithMaster))
            }
          },
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          TimeUnit.SECONDS))
      case Some(_) =>
        logInfo("Not spawning another attempt to register with the master, since there is an" +
          " attempt scheduled already.")
    }
  }
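
The guard on registrationRetryTimer can be tried in isolation. The following is a minimal sketch, not Spark's code: RegistrationRetrier, start, stop and attempts are hypothetical names, and the periodic task only counts attempts instead of sending ReregisterWithMaster.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the worker's retry bookkeeping; not Spark's actual class.
class RegistrationRetrier(intervalMillis: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private var retryTimer: Option[ScheduledFuture[_]] = None

  // Number of times the periodic retry task has fired.
  val attempts = new AtomicInteger(0)

  /** Schedules the periodic retry only if none is outstanding; returns true if one was created. */
  def start(): Boolean = synchronized {
    retryTimer match {
      case None =>
        val task = new Runnable {
          override def run(): Unit = attempts.incrementAndGet()
        }
        retryTimer = Some(
          scheduler.scheduleAtFixedRate(task, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS))
        true
      case Some(_) =>
        // A retry is already scheduled, so do not spawn another one.
        false
    }
  }

  /** Cancels the outstanding retry, if any, and shuts the scheduler down. */
  def stop(): Unit = synchronized {
    retryTimer.foreach(_.cancel(true))
    retryTimer = None
    scheduler.shutdownNow()
  }
}
```

Calling start() twice in a row returns true and then false, which mirrors the None and Some(_) branches above: repeated disconnect events cannot pile up several retry timers.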



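Next, consider how the worker comes to receive the master's RegisteredWorker message. When it registers, the worker does not know which master is active and which is standby, so it sends the registration message to every configured master. Both the active and the standby masters receive it, but only the active master confirms the registration, and its reply carries its own master information. The worker uses that reply to identify the active master's rpcEnv, which it then uses for all real communication.
The worker keeps its connection to the master alive through heartbeats, so once registration succeeds a connected flag records whether the connection is healthy. The flag is set with connected = true inside the changeMaster method.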


private def tryRegisterAllMasters(): Array[JFuture[_]] = {
    masterRpcAddresses.map { masterAddress =>
      registerMasterThreadPool.submit(new Runnable {
        override def run(): Unit = {
          try {
            logInfo("Connecting to master " + masterAddress + "...")
            val masterEndpoint =
              rpcEnv.setupEndpointRef(Master.SYSTEM_NAME, masterAddress, Master.ENDPOINT_NAME)
            registerWithMaster(masterEndpoint)
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        }
      })
    }
  }

  private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
    masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
      workerId, host, port, self, cores, memory, webUi.boundPort, publicAddress))
      .onComplete {
        // This is a very fast action so we can use "ThreadUtils.sameThread"
        case Success(msg) =>
          Utils.tryLogNonFatalError {
            handleRegisterResponse(msg)
          }
        case Failure(e) =>
          logError(s"Cannot register with master: ${masterEndpoint.address}", e)
          System.exit(1)
      }(ThreadUtils.sameThread)
  }
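
The register-with-every-master behaviour described earlier can be modelled without any RPC machinery. This is a sketch under stated assumptions: ActiveMasterDiscovery, FakeMaster, Registered, NoConfirmation and findActive are all hypothetical names, the calls are synchronous, and a standby master is modelled simply as one that never confirms the registration.

```scala
// Toy model of a worker registering with every configured master; not Spark's code.
object ActiveMasterDiscovery {
  sealed trait Response
  case class Registered(masterUrl: String) extends Response // confirmation from the active master
  case object NoConfirmation extends Response               // assumed behaviour of a standby master

  // Hypothetical master: only an active one confirms a registration.
  case class FakeMaster(url: String, active: Boolean) {
    def register(workerId: String): Response =
      if (active) Registered(url) else NoConfirmation
  }

  // Send the registration to all masters and adopt the first one that confirms it.
  def findActive(masters: Seq[FakeMaster], workerId: String): Option[String] =
    masters.map(_.register(workerId)).collectFirst { case Registered(url) => url }
}
```

Given one standby and one active master, findActive returns the URL of the active one; if no master confirms, it returns None, which corresponds to the worker staying unregistered until a later retry.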

 
case RegisteredWorker(masterRef, masterWebUiUrl) =>
        logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
        registered = true // registration succeeded
        changeMaster(masterRef, masterWebUiUrl) // store the active master's information
        // Only after successful registration is the timer started that sends heartbeats to the master.
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            // Send a heartbeat every (heartbeat timeout) / 4 milliseconds,
            // i.e. every 15 s with the default 60 s timeout.
            self.send(SendHeartbeat)
          }
        }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
        if (CLEANUP_ENABLED) {
          logInfo(
            s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
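          // Periodically clean up directories under workDir that have not been updated for a long
          // time and whose application is no longer running.
          forwordMessageScheduler.scheduleAtFixedRate(new Runnable {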
            override def run(): Unit = Utils.tryLogNonFatalError {
              self.send(WorkDirCleanup)
            }
          }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
        }
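
The heartbeat period is a quarter of the worker timeout, not a fixed number of minutes. A worked example of the arithmetic, as a sketch: HeartbeatInterval and heartbeatMillis are hypothetical names, and the timeout is assumed to be given in seconds.

```scala
// Hypothetical helper showing how the heartbeat period follows from the worker timeout.
object HeartbeatInterval {
  // timeout in seconds -> milliseconds, then divided by 4
  def heartbeatMillis(workerTimeoutSeconds: Long): Long =
    workerTimeoutSeconds * 1000 / 4
}
```

With a 60 s timeout this gives 15000 ms, so the master should normally see about four heartbeats within one timeout window before it considers the worker lost.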


If the worker receives a RegisterWorkerFailed message instead, it exits.


Next, let's look at how the master handles a heartbeat it receives from a worker.


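Because the master stored the workerId in idToWorker when the worker registered, the Some branch of the heartbeat handler is taken. It is simple: it only updates a timestamp for that worker. The None branch deserves an explanation. When a registration message arrives, the master records the worker in both idToWorker and workers. When the master detects that a worker has timed out, it removes the worker from idToWorker, so new tasks can no longer be placed on it, but it does not remove it from workers. An entry is removed from workers only if, after a much longer interval, there is still no heartbeat, which shows that the worker really cannot work any more. The None branch covers the case where heartbeats resume after the timeout: the master then sends the worker a ReconnectWorker message asking it to register again.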
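
The two branches can be modelled with two plain collections. This is a minimal sketch, not the master's actual handler: HeartbeatHandling, FakeMaster, the Outcome values and the timeOut method are hypothetical, and the Ignored outcome for a worker that was never registered is an assumption added so the model is total.

```scala
import scala.collection.mutable

// Toy model of the master's heartbeat bookkeeping; names are illustrative only.
object HeartbeatHandling {
  sealed trait Outcome
  case object Refreshed extends Outcome        // Some branch: timestamp updated
  case object AskedToReconnect extends Outcome // None branch: timed out, but still in workers
  case object Ignored extends Outcome          // assumption: worker was never registered

  class FakeMaster {
    val idToWorker = mutable.HashMap.empty[String, Long] // workerId -> last heartbeat time
    val workers = mutable.HashSet.empty[String]

    // Registration stores the worker in both collections.
    def register(id: String, now: Long): Unit = {
      idToWorker(id) = now
      workers += id
    }

    // A timeout removes the worker from idToWorker only; it stays in workers.
    def timeOut(id: String): Unit = {
      idToWorker.remove(id)
      ()
    }

    def heartbeat(id: String, now: Long): Outcome = idToWorker.get(id) match {
      case Some(_) =>
        idToWorker(id) = now
        Refreshed
      case None =>
        if (workers.contains(id)) AskedToReconnect else Ignored
    }
  }
}
```

A worker that registers, times out and then heartbeats again goes through Refreshed and then AskedToReconnect, which is the point at which the real master would send ReconnectWorker.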


