Reading the Spark source: the basic process of startup message communication (Part 1)

The basic process of Spark startup message communication

 

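Spark startup consists mainly of communication between the Master and Worker processes:

1. The Worker node sends a registration message to the Master node.

2. The Master processes the registration and replies with either a success message or a failure message.

3. The Worker periodically sends heartbeats to the Master.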
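To make the three steps concrete, here is a hypothetical model of the message vocabulary. The names RegisterWorker, RegisteredWorker and RegisterWorkerFailed follow the Spark messages discussed in this post, but here they are bare case objects with no payload, Heartbeat stands for step 3, and startupSequence is an illustrative helper, not Spark code.

```scala
// Hypothetical model of the three startup steps. Names follow the messages in this post,
// but they carry no payload, and startupSequence is an illustrative helper, not Spark code.
sealed trait StartupMsg
case object RegisterWorker extends StartupMsg       // step 1: Worker -> Master
case object RegisteredWorker extends StartupMsg     // step 2, success: Master -> Worker
case object RegisterWorkerFailed extends StartupMsg // step 2, failure: Master -> Worker
case object Heartbeat extends StartupMsg            // step 3: Worker -> Master, repeated

// Messages exchanged for one Worker, in order.
def startupSequence(accepted: Boolean, heartbeats: Int): List[StartupMsg] =
  if (accepted) RegisterWorker :: RegisteredWorker :: List.fill(heartbeats)(Heartbeat)
  else List(RegisterWorker, RegisterWorkerFailed)

println(startupSequence(accepted = true, heartbeats = 2))
// List(RegisterWorker, RegisteredWorker, Heartbeat, Heartbeat)
```

A failed registration ends the exchange after step 2, so no heartbeats follow.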

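The detailed flow is shown in the diagram below:

[Figure: flowchart of the Master/Worker startup message exchange; image not available]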

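1.

a) After the Master starts, the Workers are started. On startup, each Worker creates its communication environment (RpcEnv) and its endpoint (Endpoint),

and sends the Master a RegisterWorker message to register itself.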

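Because a Worker may need to register with several Masters (HA), the Worker class's tryRegisterAllMasters method creates a registration thread pool, registerMasterThreadPool, submits one registration task per Master to it, and lets the pool run them.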
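The fan-out pattern can be sketched with a plain java.util.concurrent pool. This is a minimal sketch, not Spark's code: tryRegisterAll and its `register` callback are hypothetical stand-ins for tryRegisterAllMasters and "setupEndpointRef + registerWithMaster", and the Master URLs are made up.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, ExecutorService, Executors, Future => JFuture}
import scala.util.control.NonFatal

// Hypothetical stand-in for Worker.tryRegisterAllMasters: one task per Master address is
// submitted to a shared pool, so registrations with several (HA) Masters run in parallel.
// `register` replaces "setupEndpointRef + registerWithMaster" and is an assumption.
def tryRegisterAll(masters: Seq[String], pool: ExecutorService)(
    register: String => Unit): Seq[JFuture[_]] =
  masters.map { address =>
    pool.submit(new Runnable {
      override def run(): Unit =
        try register(address)
        catch {
          case _: InterruptedException => // registration was cancelled
          case NonFatal(e) => println(s"Failed to connect to master $address: $e")
        }
    })
  }

// Usage with two made-up Master URLs; "registering" just records the address.
val pool = Executors.newFixedThreadPool(2)
val contacted = new ConcurrentLinkedQueue[String]()
val futures = tryRegisterAll(Seq("spark://m1:7077", "spark://m2:7077"), pool)(addr => contacted.add(addr))
futures.foreach(_.get()) // block until both tasks have run
pool.shutdown()
println(contacted.size()) // 2
```

Returning the futures, as the real method's Array[JFuture[_]] signature suggests, lets the caller cancel pending registrations later, which is why the task treats InterruptedException as a normal cancellation.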

b) Registration process:

Obtain a reference to the Master endpoint and call registerWithMaster (this is the method used in version 2.1.1; version 2.2.0 uses sendRegisterMessageToMaster instead).

2.1.1:

Class: Worker

private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        try {
          logInfo("Connecting to master " + masterAddress + "...")
          // Obtain a reference to the Master endpoint
          val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
          // Register with this Master by calling registerWithMaster
          registerWithMaster(masterEndpoint)
        } catch {
          case ie: InterruptedException => // Cancelled
          case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
        }
      }
    })
  }
}

2.2.0:

private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        try {
          logInfo("Connecting to master " + masterAddress + "...")
          // Obtain a reference to the Master endpoint
          val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
          // Send the registration message by calling sendRegisterMessageToMaster
          sendRegisterMessageToMaster(masterEndpoint)
        } catch {
          case ie: InterruptedException => // Cancelled
          case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
        }
      }
    })
  }
}

 

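sendRegisterMessageToMaster differs from registerWithMaster in that it uses the send method, whereas registerWithMaster uses ask. One difference between the two is that ask returns a result (a Future carrying the reply), while send returns nothing. Why version 2.2 made this change is left for further study.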
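The difference in shape between the two calls can be illustrated with a toy reference. ToyEndpointRef below is hypothetical and is not Spark's RpcEndpointRef; it only mirrors the return types: send gives back Unit, ask gives back a Future holding the reply.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical toy reference, NOT Spark's RpcEndpointRef. It only mirrors the shape of the
// two calls: send is fire-and-forget (returns Unit), ask returns a Future with the reply.
final class ToyEndpointRef(handler: String => String) {
  val inbox = ArrayBuffer.empty[String] // every message that reached the "endpoint"

  def send(msg: String): Unit = {
    inbox += msg // delivered, but the caller gets nothing back
  }

  def ask(msg: String): Future[String] = {
    inbox += msg
    Future.successful(handler(msg)) // the caller can wait on or react to the reply
  }
}

val master = new ToyEndpointRef(msg => if (msg == "RegisterWorker") "RegisteredWorker" else "Unknown")
master.send("RegisterWorker") // nothing to wait for
val reply = Await.result(master.ask("RegisterWorker"), 1.second)
println(reply) // RegisteredWorker
```

With send, any answer has to travel back as a separate message; the 2.2.0 Master code quoted in this post does exactly that with workerRef.send(RegisteredWorker(...)).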

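c) On receiving the message, the Master validates and records the information sent by the Worker. If registration succeeds, it sends a RegisteredWorker message to that Worker to confirm the registration, and the process moves on to step 3, in which the Worker periodically sends heartbeats to the Master. If registration fails, the Master sends a RegisterWorkerFailed message, and the Worker logs an error and aborts its startup.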
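The Worker's side of this step can be summarised as a pattern match over the reply. This is a hypothetical simplification: the message names follow the post, but their payloads, the WorkerAction type and onResponse are invented for illustration, and treating MasterInStandby as "ignore" is an assumption of the sketch.

```scala
// Hypothetical, simplified model of the Worker's reaction to the Master's reply.
// Message names follow the post; the payloads and the WorkerAction type are invented here.
sealed trait RegisterResponse
case object RegisteredWorker extends RegisterResponse
final case class RegisterWorkerFailed(message: String) extends RegisterResponse
case object MasterInStandby extends RegisterResponse

sealed trait WorkerAction
case object StartHeartbeats extends WorkerAction                       // continue with step 3
final case class LogErrorAndExit(message: String) extends WorkerAction // abort Worker startup
case object Ignore extends WorkerAction                                // wait for another reply

def onResponse(resp: RegisterResponse): WorkerAction = resp match {
  case RegisteredWorker          => StartHeartbeats
  case RegisterWorkerFailed(msg) => LogErrorAndExit(s"Worker registration failed: $msg")
  case MasterInStandby           => Ignore
}

println(onResponse(RegisterWorkerFailed("Duplicate worker ID")))
// LogErrorAndExit(Worker registration failed: Duplicate worker ID)
```

Because the trait is sealed, the compiler can warn if a new reply type is added without a matching case.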

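d) In more detail: when the Master receives a Worker's registration message, it first checks whether it is currently in STANDBY state; if so, it does not register the Worker and only replies with MasterInStandby. If the Worker's ID is already in its registration table, it sends a registration failure message. Once these checks pass, the registerWorker method adds the Worker to the Master's list, so that the Worker can be scheduled when the cluster processes tasks. The code that registers a Worker in Master.receiveAndReply: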

Class: Master

case RegisterWorker(
    id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
    workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  if (state == RecoveryState.STANDBY) {
    workerRef.send(MasterInStandby)
  } else if (idToWorker.contains(id)) {
    workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  } else {
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      workerRef, workerWebUiUrl)
    // registerWorker registers the Worker by adding it to the Master's worker list,
    // where it is available when tasks are scheduled later
    if (registerWorker(worker)) {
      persistenceEngine.addWorker(worker)
      workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
      schedule()
    } else {
      val workerAddress = worker.endpoint.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
        + workerAddress))
    }
  }
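The order of the checks above (standby, then duplicate ID, then duplicate address inside registerWorker) can be reproduced in a small standalone model. Everything here is hypothetical: ToyMaster, the two-field WorkerInfo and the Reply type are simplifications, and the persistence engine and schedule() call are left out.

```scala
import scala.collection.mutable

// Hypothetical, simplified version of the Master-side checks. WorkerInfo here keeps only
// an id and an address; the real class holds much more.
final case class WorkerInfo(id: String, address: String)

sealed trait Reply
case object MasterInStandby extends Reply
final case class RegisterWorkerFailed(message: String) extends Reply
case object RegisteredWorker extends Reply

final class ToyMaster(var standby: Boolean = false) {
  private val idToWorker = mutable.Map.empty[String, WorkerInfo]
  private val addressToWorker = mutable.Map.empty[String, WorkerInfo]

  // Simplified stand-in for Master.registerWorker: refuses a second worker at the same address.
  private def registerWorker(w: WorkerInfo): Boolean =
    if (addressToWorker.contains(w.address)) false
    else {
      idToWorker(w.id) = w
      addressToWorker(w.address) = w
      true
    }

  def onRegisterWorker(w: WorkerInfo): Reply =
    if (standby) MasterInStandby
    else if (idToWorker.contains(w.id)) RegisterWorkerFailed("Duplicate worker ID")
    else if (registerWorker(w)) RegisteredWorker
    else RegisterWorkerFailed("Attempted to re-register worker at same address: " + w.address)
}

val master = new ToyMaster()
println(master.onRegisterWorker(WorkerInfo("worker-1", "host1:7078"))) // RegisteredWorker
println(master.onRegisterWorker(WorkerInfo("worker-1", "host2:7078"))) // RegisterWorkerFailed(Duplicate worker ID)
```

The ID check happens before registerWorker is called, so a duplicate ID is rejected even when it arrives from a new address.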

 

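e) After the Worker learns that registration succeeded, it periodically sends heartbeats to the Master so that the Master knows the Worker's current state. The interval is derived from spark.worker.timeout; note that the heartbeat interval is one quarter of that setting.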

 

private val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4
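As a quick check of the arithmetic: spark.worker.timeout is read in seconds (default 60 in the line above), multiplied by 1000 to get milliseconds, and divided by 4. heartbeatMillis is a hypothetical helper that only mirrors that expression.

```scala
// Mirrors the HEARTBEAT_MILLIS expression: timeout in seconds -> a quarter of it in ms.
// heartbeatMillis is a hypothetical helper, not part of Spark.
def heartbeatMillis(workerTimeoutSeconds: Long): Long = workerTimeoutSeconds * 1000 / 4

println(heartbeatMillis(60))  // 15000, i.e. one heartbeat every 15 s with the default
println(heartbeatMillis(120)) // 30000
```

With this ratio, four heartbeats fall within one timeout window, so presumably a single late or lost heartbeat is not enough for the Worker to reach the timeout.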

 

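When the Worker receives the registration success message, it first logs it and updates its record of the Master, then starts a scheduled task that sends heartbeats at the interval HEARTBEAT_MILLIS defined above.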


 

private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
  msg match {
    case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress) =>
      if (preferConfiguredMasterAddress) {
        logInfo("Successfully registered with master " + masterAddress.toSparkURL)
      } else {
        logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      }
      registered = true
      changeMaster(masterRef, masterWebUiUrl, masterAddress)
      // Every HEARTBEAT_MILLIS ms, send a SendHeartbeat message to this Worker itself
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      // To clean up directories used by earlier applications, set spark.worker.cleanup.enabled
      // to true, which sets CLEANUP_ENABLED to true
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
      // Report the latest state of this Worker's executors to the Master
      val execs = executors.values.map { e =>
        new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
      }
      masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))

    // ... remaining cases omitted in this excerpt
  }
}
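The heartbeat scheduling above follows a common JDK pattern, sketched here with a plain ScheduledExecutorService. This is not Spark's forwordMessageScheduler; startHeartbeats and its `sendHeartbeat` callback are hypothetical stand-ins for the scheduling call and self.send(SendHeartbeat), and the demo uses a 10 ms interval instead of HEARTBEAT_MILLIS so that it finishes quickly.

```scala
import java.util.concurrent.{CountDownLatch, Executors, ScheduledExecutorService, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import scala.util.control.NonFatal

// Sketch of the pattern in handleRegisterResponse: after a successful registration,
// run a task at a fixed rate that sends a heartbeat. `sendHeartbeat` is a hypothetical
// stand-in for self.send(SendHeartbeat).
def startHeartbeats(intervalMillis: Long)(sendHeartbeat: () => Unit): ScheduledExecutorService = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    // Catch non-fatal errors, similar in spirit to Utils.tryLogNonFatalError:
    // an exception escaping run() would suppress all later executions.
    override def run(): Unit =
      try sendHeartbeat()
      catch { case NonFatal(e) => println(s"Heartbeat failed: $e") }
  }, 0, intervalMillis, TimeUnit.MILLISECONDS)
  scheduler
}

// Demo with a 10 ms interval instead of HEARTBEAT_MILLIS.
val sent = new AtomicInteger(0)
val threeSent = new CountDownLatch(3)
val scheduler = startHeartbeats(intervalMillis = 10) { () =>
  sent.incrementAndGet()
  threeSent.countDown()
}
val done = threeSent.await(5, TimeUnit.SECONDS) // first run at delay 0, then every 10 ms
scheduler.shutdownNow()
println(s"heartbeats sent: ${sent.get()}")
```

Wrapping the task body matters with scheduleAtFixedRate: per the JDK contract, if one execution throws, subsequent executions are suppressed, so an unguarded error would silently stop the heartbeats.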



 
