Learning Spark, Part 2: The Worker Startup Process

1. Startup script

sbin/start-slaves.sh

# Launch the slaves
if [ "$SPARK_WORKER_INSTANCES" = "" ]; then
  exec "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"
else
  if [ "$SPARK_WORKER_WEBUI_PORT" = "" ]; then
    SPARK_WORKER_WEBUI_PORT=8081
  fi  
  for ((i=0; i<$SPARK_WORKER_INSTANCES; i++)); do
    "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" $(( $i + 1 ))  "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT" --webui-port $(( $SPARK_WORKER_WEBUI_PORT + $i ))
  done
fi

Assume each node starts a single Worker, that is, SPARK_WORKER_INSTANCES is unset.

The command actually executed is:

  exec "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"

This statement has two parts:

(1)

exec "$sbin/slaves.sh" cd "$SPARK_HOME"

slaves.sh logs in to each worker server (over ssh) and changes to the SPARK_HOME directory.

(2)

"$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"

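The worker server then runs the sbin/start-slave.sh script.

The argument "1" is the worker instance number. It is used to tell apart the log and pid files of different worker instances, for example: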

spark-xxx-org.apache.spark.deploy.worker.Worker-1-CentOS-02.out
spark-xxx-org.apache.spark.deploy.worker.Worker-1.pid

In "Worker-1", the "1" is the worker instance number.

This argument is not passed to the Worker class. The only argument passed to the Worker class is:

spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT
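The branching in start-slaves.sh can be modeled in a few lines of Scala. This is only an illustrative sketch: the names StartSlavesSketch, Launch and plan are hypothetical and do not exist in Spark. It shows which instance number and web UI port each start-slave.sh invocation receives.

```scala
object StartSlavesSketch {
  // One planned start-slave.sh invocation: the instance number plus the
  // optional --webui-port value (None means the flag is not passed).
  final case class Launch(instance: Int, webUiPort: Option[Int])

  // `instances` models SPARK_WORKER_INSTANCES; an empty variable becomes None.
  // `baseWebUiPort` models SPARK_WORKER_WEBUI_PORT with the script's default 8081.
  def plan(instances: Option[Int], baseWebUiPort: Int = 8081): Seq[Launch] =
    instances match {
      // Single-worker branch: instance 1, no explicit web UI port.
      case None => Seq(Launch(1, None))
      // Multi-worker branch: instance i + 1 gets port base + i.
      case Some(n) => (0 until n).map(i => Launch(i + 1, Some(baseWebUiPort + i)))
    }
}
```

With two instances per node, the sketch yields instance 1 on port 8081 and instance 2 on port 8082, matching the arithmetic in the script's for loop.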

2. Worker.main

  def main(argStrings: Array[String]) {
    SignalLogger.register(log)
    val conf = new SparkConf
    val args = new WorkerArguments(argStrings, conf)
    val (actorSystem, _) = startSystemAndActor(args.host, args.port, args.webUiPort, args.cores,
      args.memory, args.masters, args.workDir)
    actorSystem.awaitTermination()
  }

Responsibilities of main:

(1) Create a WorkerArguments object and initialize its members;

(2) Call startSystemAndActor to create the ActorSystem and start the Worker actor.

2.1. WorkerArguments

  var cores = inferDefaultCores()
  var memory = inferDefaultMemory()

(1) Infer the default number of cores

(2) Infer the default memory size

  parse(args.toList)
  // This mutates the SparkConf, so all accesses to it must be made after this line
  propertiesFile = Utils.loadDefaultSparkProperties(conf, propertiesFile)

(1) parse handles the command-line arguments passed in by the startup script;

(2) loadDefaultSparkProperties loads Spark properties from a properties file; the default file is spark-defaults.conf
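The "defaults first, then command-line overrides" order can be sketched with a recursive match over the argument list. This is a simplified stand-in, not the real WorkerArguments: the names WorkerArgsSketch and Args are hypothetical, memory is taken as a plain number of megabytes, and every spark:// argument is appended as one master URL.

```scala
object WorkerArgsSketch {
  // Hypothetical, trimmed-down stand-in for the fields WorkerArguments holds.
  final case class Args(cores: Int, memoryMb: Int, webUiPort: Int, masters: List[String])

  // Consume the argument list from the left, overriding the defaults in `acc`
  // whenever a recognized flag is seen.
  @annotation.tailrec
  def parse(args: List[String], acc: Args): Args = args match {
    case ("--cores" | "-c") :: value :: tail =>
      parse(tail, acc.copy(cores = value.toInt))
    case ("--memory" | "-m") :: value :: tail =>
      parse(tail, acc.copy(memoryMb = value.toInt))
    case "--webui-port" :: value :: tail =>
      parse(tail, acc.copy(webUiPort = value.toInt))
    case value :: tail if value.startsWith("spark://") =>
      parse(tail, acc.copy(masters = acc.masters :+ value))
    case Nil =>
      acc
    case other :: _ =>
      throw new IllegalArgumentException(s"Unrecognized argument: $other")
  }
}
```

Because the defaults are supplied as the initial accumulator, any value not mentioned on the command line keeps its inferred default.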

2.2. startSystemAndActor

    val (actorSystem, boundPort) = AkkaUtils.createActorSystem(systemName, host, port,
      conf = conf, securityManager = securityMgr)
    val masterAkkaUrls = masterUrls.map(Master.toAkkaUrl(_, AkkaUtils.protocol(actorSystem)))
    actorSystem.actorOf(Props(classOf[Worker], host, boundPort, webUiPort, cores, memory,
      masterAkkaUrls, systemName, actorName,  workDir, conf, securityMgr), name = actorName)

(1) Create the ActorSystem through AkkaUtils.createActorSystem

(2) Create and start the Worker actor
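The Master.toAkkaUrl call in the snippet above turns a spark:// URL into an Akka actor path. The excerpt does not show its body, so the following is a sketch under assumptions: the Master's actor system is taken to be named "sparkMaster", its actor "Master", and the protocol "akka.tcp". MasterUrlSketch is a hypothetical name.

```scala
object MasterUrlSketch {
  // Matches spark://host:port and captures the host and port.
  private val SparkUrl = """spark://([^:/]+):(\d+)""".r

  // Assumed naming: actor system "sparkMaster", actor "Master".
  def toAkkaUrl(sparkUrl: String, protocol: String = "akka.tcp"): String =
    sparkUrl match {
      case SparkUrl(host, port) => s"$protocol://sparkMaster@$host:$port/user/Master"
      case _ => throw new IllegalArgumentException(s"Invalid master URL: $sparkUrl")
    }
}
```

Under these assumptions, spark://master:7077 maps to akka.tcp://sparkMaster@master:7077/user/Master, which is the kind of path the Worker later passes to context.actorSelection.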

3. Worker Actor

3.1. Key data members

  val executors = new HashMap[String, ExecutorRunner]
  val finishedExecutors = new HashMap[String, ExecutorRunner]
  val drivers = new HashMap[String, DriverRunner]
  val finishedDrivers = new HashMap[String, DriverRunner]
  val appDirectories = new HashMap[String, Seq[String]]
  val finishedApps = new HashSet[String]

3.2. Worker.preStart

    createWorkDir()
    context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
    shuffleService.startIfEnabled()
    webUi = new WorkerWebUI(this, workDir, webUiPort)
    webUi.bind()
    registerWithMaster()

(1) Create the Worker node's working directory;

(2) Subscribe to RemotingLifecycleEvent, which is a trait:

sealed trait RemotingLifecycleEvent extends Serializable {
  def logLevel: Logging.LogLevel
}

The Worker only handles the DisassociatedEvent message.

(3) Create and start the WorkerWebUI

(4) Register with the Master: registerWithMaster calls tryRegisterAllMasters to send a registration message to the Master nodes

3.3. Worker.registerWithMaster

    registrationRetryTimer match {
      case None =>
        registered = false
        tryRegisterAllMasters()
        connectionAttemptCount = 0
        registrationRetryTimer = Some {
          context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
            INITIAL_REGISTRATION_RETRY_INTERVAL, self, ReregisterWithMaster)
        }
      case Some(_) =>
        logInfo("Not spawning another attempt to register with the master, since there is an" +
          " attempt scheduled already.")
    }

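(1) Call tryRegisterAllMasters to send the registration message to the Master;

(2) Create the registration retry timer, which periodically sends a ReregisterWithMaster message to the Worker actor itself.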
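The Option-based guard around the timer makes registerWithMaster safe to call more than once. The following Akka-free model illustrates that behavior; FakeTimer, WorkerModel and registerMessagesSent are hypothetical names introduced only for this sketch.

```scala
object RegistrationGuardSketch {
  // Hypothetical stand-in for the Cancellable returned by the Akka scheduler.
  final class FakeTimer {
    var cancelled = false
    def cancel(): Unit = { cancelled = true }
  }

  final class WorkerModel {
    var registered = false
    var connectionAttemptCount = 0
    var registrationRetryTimer: Option[FakeTimer] = None
    // Counts how many rounds of RegisterWorker messages would have been sent.
    var registerMessagesSent = 0

    private def tryRegisterAllMasters(): Unit = { registerMessagesSent += 1 }

    def registerWithMaster(): Unit = registrationRetryTimer match {
      case None =>
        // No attempt in flight: send the first round and arm the retry timer.
        registered = false
        tryRegisterAllMasters()
        connectionAttemptCount = 0
        registrationRetryTimer = Some(new FakeTimer)
      case Some(_) =>
        // An attempt is already scheduled, so a second call does nothing.
        ()
    }
  }
}
```

Calling registerWithMaster twice in a row on the model sends only one round of registration messages, because the second call finds the timer already set.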

3.3.1. Worker.tryRegisterAllMasters

    for (masterAkkaUrl <- masterAkkaUrls) {
      logInfo("Connecting to master " + masterAkkaUrl + "...")
      val actor = context.actorSelection(masterAkkaUrl)
      actor ! RegisterWorker(workerId, host, port, cores, memory, webUi.boundPort, publicAddress)
    }

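(1) Create a remote reference to the Master actor;

(2) Send a RegisterWorker message to the Master; if registration succeeds, the Master replies to the Worker with a RegisteredWorker message.

workerId is a string, defined as: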

  val workerId = generateWorkerId()
  ...
  def generateWorkerId(): String = {
    "worker-%s-%s-%d".format(createDateFormat.format(new Date), host, port)
  }

Format: worker-<timestamp>-<host>-<port>
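A standalone version of the id generation helps to see what such an id looks like. The excerpt does not show createDateFormat, so the timestamp pattern "yyyyMMddHHmmss" used here is an assumption, and the explicit clock and time zone parameters are added only to make the sketch deterministic. WorkerIdSketch is a hypothetical name.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object WorkerIdSketch {
  // Assumed timestamp pattern; the original createDateFormat is not shown.
  def createDateFormat(zone: TimeZone): SimpleDateFormat = {
    val fmt = new SimpleDateFormat("yyyyMMddHHmmss")
    fmt.setTimeZone(zone)
    fmt
  }

  // Same format string as generateWorkerId in the excerpt, with the clock,
  // host, and port passed in explicitly instead of read from the Worker.
  def generateWorkerId(now: Date, host: String, port: Int,
      zone: TimeZone = TimeZone.getDefault): String =
    "worker-%s-%s-%d".format(createDateFormat(zone).format(now), host, port)
}
```

Because the id embeds the start time, a Worker restarted on the same host and port gets a new id.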

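3.4. Worker message handling

3.4.1. RegisteredWorker message

This message tells the Worker that it has registered with the Master successfully. The main purpose of handling it is to start the heartbeat timer.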

    case RegisteredWorker(masterUrl, masterWebUiUrl) =>
      logInfo("Successfully registered with master " + masterUrl)
      registered = true
      changeMaster(masterUrl, masterWebUiUrl)
      context.system.scheduler.schedule(0 millis, HEARTBEAT_MILLIS millis, self, SendHeartbeat)
      if (CLEANUP_ENABLED) {
        logInfo(s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        context.system.scheduler.schedule(CLEANUP_INTERVAL_MILLIS millis,
          CLEANUP_INTERVAL_MILLIS millis, self, WorkDirCleanup)
      }

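(1) Mark the Worker as registered;

(2) Call the changeMaster method;

(3) Create the heartbeat timer, which periodically sends a SendHeartbeat message to the Worker actor itself;

(4) If cleanup is enabled, create a second timer that periodically sends a WorkDirCleanup message to the Worker actor itself.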

3.4.1.1. Worker.changeMaster

    // activeMasterUrl it's a valid Spark url since we receive it from master.
    activeMasterUrl = url
    activeMasterWebUiUrl = uiUrl
    master = context.actorSelection(
      Master.toAkkaUrl(activeMasterUrl, AkkaUtils.protocol(context.system)))
    masterAddress = Master.toAkkaAddress(activeMasterUrl, AkkaUtils.protocol(context.system))
    connected = true
    // Cancel any outstanding re-registration attempts because we found a new master
    registrationRetryTimer.foreach(_.cancel())
    registrationRetryTimer = None

Responsibilities:

(1) Create a remote reference to the Master and assign it to master;

(2) Set the connection state to true;

(3) Cancel the registrationRetryTimer timer.

3.4.2. SendHeartbeat message

    case SendHeartbeat =>
      if (connected) { master ! Heartbeat(workerId) }

If the Worker is connected, it sends a Heartbeat message to the master.

3.4.3. ReregisterWithMaster message

    case ReregisterWithMaster =>
      reregisterWithMaster()

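Responsibilities of reregisterWithMaster:

(1) If registration has already succeeded, cancel the registrationRetryTimer timer;

(2) If registration has not succeeded yet, resend the RegisterWorker message to the master. The first 6 retries use the short initial interval, and the total number of retries is capped at 16.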

  // The first six attempts to reconnect are in shorter intervals (between 5 and 15 seconds)
  // Afterwards, the next 10 attempts are between 30 and 90 seconds.
  // A bit of randomness is introduced so that not all of the workers attempt to reconnect at
  // the same time.
  val INITIAL_REGISTRATION_RETRIES = 6
  val TOTAL_REGISTRATION_RETRIES = INITIAL_REGISTRATION_RETRIES + 10

The first 6 attempts and the following 10 attempts use different intervals.
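The two-phase retry policy described in the source comment can be sketched as a pure function. Several details here are assumptions rather than facts from the excerpt: the base intervals of 10 and 60 seconds and the random multiplier in [0.5, 1.5) are chosen because they reproduce the stated 5 to 15 second and 30 to 90 second ranges, and the GiveUp outcome stands in for whatever the Worker does once the limit is exceeded. RetryPolicySketch and its members are hypothetical names.

```scala
object RetryPolicySketch {
  val InitialRetries = 6
  val TotalRetries = InitialRetries + 10

  // Possible outcomes of handling one ReregisterWithMaster message.
  sealed trait Action
  case object CancelTimer extends Action
  final case class Retry(intervalSeconds: Long) extends Action
  case object GiveUp extends Action

  // Assumed: base 10 s for the first six attempts, 60 s afterwards, scaled by
  // a per-worker random multiplier `fuzz` drawn once from [0.5, 1.5).
  def intervalSeconds(attempt: Int, fuzz: Double): Long =
    math.round((if (attempt <= InitialRetries) 10 else 60) * fuzz)

  // `attempt` is the 1-based count of retries so far, including this one.
  def onReregister(attempt: Int, registered: Boolean, fuzz: Double): Action =
    if (registered) CancelTimer
    else if (attempt <= TotalRetries) Retry(intervalSeconds(attempt, fuzz))
    else GiveUp
}
```

Drawing the multiplier once per worker keeps each worker's retries evenly spaced while still spreading different workers apart, which is the purpose the source comment gives for the randomness.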

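4. Startup complete

At this point the Worker node has finished starting, and it sends heartbeats to the Master node at a fixed interval. When an Application is submitted through SparkSubmit, the Worker receives a message from the Master telling it to launch an Executor; the Executor and the Driver then exchange messages with each other.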
