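Spark's standalone mode supports active/standby Master failover, so more than one Master can be configured (for example, two). When the Active Master node fails, a Standby Master can be promoted to Active Master.
The code path involved in Master failover is as follows:
1 Set RECOVERY_MODE; if it is not configured, the default is NONE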
private val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
Configuration: conf/spark-env.sh
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/nfs/spark_recovery"
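The -D options in SPARK_DAEMON_JAVA_OPTS become JVM system properties of the Master daemon, and a default-constructed SparkConf loads system properties whose names start with "spark.". The lookup with a fallback to NONE can be sketched as below. This is a minimal illustration, not Spark's code: RecoveryModeLookup and recoveryMode are hypothetical names, and a plain Map stands in for SparkConf.

```scala
object RecoveryModeLookup {
  // Hypothetical helper: mimics conf.get(key, default) on top of a plain key/value map.
  // In the real Master the map would be the SparkConf built from spark.* system properties.
  def recoveryMode(props: collection.Map[String, String]): String =
    props.getOrElse("spark.deploy.recoveryMode", "NONE")

  def main(args: Array[String]): Unit = {
    // Nothing configured: falls back to NONE.
    println(recoveryMode(Map.empty[String, String]))
    // Equivalent to passing -Dspark.deploy.recoveryMode=FILESYSTEM to the daemon.
    println(recoveryMode(Map("spark.deploy.recoveryMode" -> "FILESYSTEM")))
    // sys.props exposes the JVM system properties of the current process.
    println(recoveryMode(sys.props))
  }
}
```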
2 Set up the persistence engine
// Master onStart
override def onStart(): Unit = {
  ...
  ...
  val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
    case "ZOOKEEPER" =>
      logInfo("Persisting recovery state to ZooKeeper")
      val zkFactory =
        new ZooKeeperRecoveryModeFactory(conf, serializer)
      (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
    case "FILESYSTEM" =>
      val fsFactory =
        new FileSystemRecoveryModeFactory(conf, serializer)
      (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
    case "CUSTOM" =>
      val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
      val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
        .newInstance(conf, serializer)
        .asInstanceOf[StandaloneRecoveryModeFactory]
      (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
    case _ =>
      (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
  }
  persistenceEngine = persistenceEngine_
  leaderElectionAgent = leaderElectionAgent_
}
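The CUSTOM branch of onStart loads a factory class by name and calls a constructor with a fixed signature through reflection. The sketch below isolates that pattern with simplified stand-ins. RecoveryFactory, DemoFactory, CustomFactoryLoader and the "demo.engine" key are all hypothetical, and a Map replaces the (SparkConf, Serializer) constructor arguments used by the real code.

```scala
// Simplified stand-in for StandaloneRecoveryModeFactory (hypothetical).
trait RecoveryFactory {
  def engineName: String
}

// A user-supplied factory; it must expose the constructor signature the loader expects.
class DemoFactory(conf: Map[String, String]) extends RecoveryFactory {
  def engineName: String = conf.getOrElse("demo.engine", "in-memory")
}

object CustomFactoryLoader {
  // Same steps as the CUSTOM branch: resolve the class, pick the constructor,
  // instantiate it, then cast to the expected factory type.
  def load(className: String, conf: Map[String, String]): RecoveryFactory = {
    val clazz = Class.forName(className)
    clazz.getConstructor(classOf[Map[String, String]])
      .newInstance(conf)
      .asInstanceOf[RecoveryFactory]
  }

  def main(args: Array[String]): Unit = {
    val factory = load(classOf[DemoFactory].getName, Map("demo.engine" -> "custom"))
    println(factory.engineName)
  }
}
```

If the named class lacks the expected constructor, getConstructor throws NoSuchMethodException, so with this pattern a misconfigured factory fails when it is loaded rather than later.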
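As the onStart code shows, the persistence engine can be created in one of four ways.
(1) ZOOKEEPER. ZooKeeper elects one Master as the leader while the other Masters stay in the Standby state. To use it, start several Masters for the standalone cluster and connect them all to the same ZooKeeper instance; ZooKeeper provides both the leader election and the state storage. If the current Master dies, another Master is elected, recovers the old Master's state, and then resumes scheduling. The whole recovery process can take 1-2 minutes.
(2) FILESYSTEM. Spark writes the registration information of applications and Workers, as recovery state, to a configured directory (spark.deploy.recoveryDirectory). When the Master goes down, restarting it lets it read the application and Worker registration information back from that directory. The switch has to be done manually.
(3) CUSTOM. The user supplies a StandaloneRecoveryModeFactory implementation, named by spark.deploy.recoveryMode.factory. Its createPersistenceEngine() can return a user-defined subclass of the abstract class PersistenceEngine that saves and restores Application, Driver, and Worker information, and its createLeaderElectionAgent() returns the matching leader election agent.
(4) NONE. BlackHolePersistenceEngine is used. As its implementation shows, it does not persist any Application, Driver, or Worker information, so all previous Application, Driver, and Worker information is discarded on a Master switch.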
private[master] class BlackHolePersistenceEngine extends PersistenceEngine {
  override def persist(name: String, obj: Object): Unit = {} // no-op: nothing is stored
  override def unpersist(name: String): Unit = {} // no-op: nothing to remove
  override def read[T: ClassTag](name: String): Seq[T] = Nil // always returns an empty result
}
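To make the effect of the empty implementation concrete, the sketch below contrasts a black-hole engine with one that actually stores objects. It is an illustration under stated assumptions: SimplePersistenceEngine is a cut-down stand-in for Spark's PersistenceEngine, InMemoryEngine and PersistenceDemo are hypothetical, and the "worker_" / "app_" name prefixes are used here only as example keys.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Cut-down stand-in for PersistenceEngine (hypothetical; only the three methods shown above).
trait SimplePersistenceEngine {
  def persist(name: String, obj: Object): Unit
  def unpersist(name: String): Unit
  def read[T: ClassTag](prefix: String): Seq[T]
}

// Mirrors BlackHolePersistenceEngine: every call is a no-op.
class BlackHoleEngine extends SimplePersistenceEngine {
  def persist(name: String, obj: Object): Unit = {}
  def unpersist(name: String): Unit = {}
  def read[T: ClassTag](prefix: String): Seq[T] = Nil
}

// Keeps objects in a map so that they can be read back after a "restart".
class InMemoryEngine extends SimplePersistenceEngine {
  private val store = mutable.LinkedHashMap.empty[String, Object]
  def persist(name: String, obj: Object): Unit = store(name) = obj
  def unpersist(name: String): Unit = { store.remove(name); () }
  // Returns the stored values whose key starts with the prefix and whose type is T.
  def read[T: ClassTag](prefix: String): Seq[T] =
    store.toSeq.collect { case (k, v: T) if k.startsWith(prefix) => v }
}

object PersistenceDemo {
  // Persists one worker and one application, then reads the workers back.
  def recoveredWorkers(engine: SimplePersistenceEngine): Seq[String] = {
    engine.persist("worker_w1", "w1")
    engine.persist("app_a1", "a1")
    engine.read[String]("worker_")
  }

  def main(args: Array[String]): Unit = {
    println(recoveredWorkers(new BlackHoleEngine)) // nothing was stored
    println(recoveredWorkers(new InMemoryEngine))  // the worker entry is read back
  }
}
```

With the black-hole engine a newly active Master has nothing to read back, which is why the NONE mode loses all previously registered state.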
3 When Master onStart() is called
Master extends ThreadSafeRpcEndpoint, which in turn extends RpcEndpoint
private[deploy] class Master(
override val rpcEnv: RpcEnv,
address: RpcAddress,
webUiPort: Int,
val securityMg