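Introduction
I have read quite a few source-code walkthroughs online. Some are very detailed and some are brief, but neither kind resolved the questions I had while reading the source. I later answered them myself by debugging, so this post is a short record of my process and my understanding. It focuses on the leader election flow; data synchronization is not covered for now.
Downloading the source
Download the source directly from GitHub. In the source package from the ZooKeeper website, many files are plain text rather than Java files.
After unpacking the source, build it with ant; see online tutorials for installing ant and running the build. Some posts say that versions after 3.5 do not need this build step, but I have not tested that.
Environment setup
Directory structure
Problems after importing into IDEA
After the first import, many places are highlighted in red. One cause is the missing Info class; another is that some dependencies in the pom file have their scope set to provided. The exact problems depend on your setup and solutions can be found online; the ones I ran into are listed here.
Add the class that carries the version number. It goes in the version package under zookeeper-server.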
public interface Info {
int MAJOR=1;
int MINOR=0;
int MICRO=0;
String QUALIFIER=null;
int REVISION=-1; //TODO: remove as related to SVN VCS
String REVISION_HASH="1";
String BUILD_DATE="2019-3-4";
}
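Change the scope in the pom file of the zookeeper-server project
Configuration file
# Heartbeat interval in milliseconds
tickTime=2000
# Time limit for the initial connection and sync: tickTime * initLimit
initLimit=10
# Time limit for sync messages: tickTime * syncLimit
syncLimit=5
# Data directory
dataDir=/zkData
# Client port
clientPort=2181
# Cluster members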
server.1=hadoop101:2888:3888
server.2=hadoop102:2888:3888
server.4=windowpc:2888:3888
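The two limits in this file are counted in ticks, so their value in milliseconds depends on tickTime. A minimal sketch of that arithmetic with the values above; TickLimits is a made-up helper for illustration, not a ZooKeeper class:

```java
import java.util.Properties;

// Hypothetical helper for illustration; not a ZooKeeper class.
class TickLimits {
    // Converts a limit expressed in ticks into milliseconds.
    static long limitMillis(Properties conf, String limitKey) {
        long tickTime = Long.parseLong(conf.getProperty("tickTime"));
        long ticks = Long.parseLong(conf.getProperty(limitKey));
        return tickTime * ticks;
    }

    // The three timing values from the configuration file above.
    static Properties sample() {
        Properties conf = new Properties();
        conf.setProperty("tickTime", "2000");
        conf.setProperty("initLimit", "10");
        conf.setProperty("syncLimit", "5");
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = sample();
        System.out.println("initLimit in ms: " + limitMillis(conf, "initLimit"));
        System.out.println("syncLimit in ms: " + limitMillis(conf, "syncLimit"));
    }
}
```

With this file, the initial connection window works out to 20 seconds and the sync window to 10 seconds.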
hosts file on the Windows machine
192.168.147.128 hadoop101
192.168.147.129 hadoop102
192.168.147.130 hadoop103
192.168.147.1 windowpc
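The myid file on each server
Create a myid file in the data directory named in the configuration file. Its content is the number that follows "server." in the configuration; on the local Windows machine, for example, it is 4.
Startup class
Look at zkServer.sh in the bin directory. Only part of it is shown here; the point is the class name QuorumPeerMain, which is used later at startup and is the entry point for starting a cluster node.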
then
# for some reason these two options are necessary on jdk6 on Ubuntu
# accord to the docs they are not necessary, but otw jconsole cannot
# do a local attach
ZOOMAIN="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=$JMXLOCALONLY org.apache.zookeeper.server.quorum.QuorumPeerMain"
else
if [ "x$JMXAUTH" = "x" ]
then
JMXAUTH=false
fi
if [ "x$JMXSSL" = "x" ]
then
JMXSSL=false
fi
if [ "x$JMXLOG4J" = "x" ]
then
JMXLOG4J=true
fi
echo "ZooKeeper remote JMX Port set to $JMXPORT" >&2
echo "ZooKeeper remote JMX authenticate set to $JMXAUTH" >&2
echo "ZooKeeper remote JMX ssl set to $JMXSSL" >&2
echo "ZooKeeper remote JMX log4j set to $JMXLOG4J" >&2
ZOOMAIN="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=$JMXPORT -Dcom.sun.management.jmxremote.authenticate=$JMXAUTH -Dcom.sun.management.jmxremote.ssl=$JMXSSL -Dzookeeper.jmx.log4j.disable=$JMXLOG4J org.apache.zookeeper.server.quorum.QuorumPeerMain"
fi
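Starting the analysis
The classic diagram from 尚硅谷, with the incorrect parts corrected
Add the log configuration and the program arguments to the run configuration of the startup class. The log4j used here is the bundled one, and so is its configuration file. The sample configuration is renamed to zoo.cfg, and its location is passed in here as well.
Parsing the configuration file
The entry point is the main method of QuorumPeerMain. It essentially makes one call, initializeAndRun, which initializes and then runs the server.
Partial code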
public static void main(String[] args) {
QuorumPeerMain main = new QuorumPeerMain();
try {
// initialize the configuration and start
main.initializeAndRun(args);
} catch (IllegalArgumentException e) {
LOG.error("Invalid arguments, exiting abnormally", e);
LOG.info(USAGE);
System.err.println(USAGE);
System.exit(2);
}
This method does not call much either, but the calls nest deeply. The important parts are analyzed step by step below.
protected void initializeAndRun(String[] args)
throws ConfigException, IOException, AdminServerException
{
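// configuration object; the values parsed from the configuration file end up here
QuorumPeerConfig config = new QuorumPeerConfig();
// check whether there is exactly one argument, i.e. the path to zoo.cfg
if (args.length == 1) {
// parse the configuration file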
config.parse(args[0]);
}
// Start and schedule the the purge task
DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
.getDataDir(), config.getDataLogDir(), config
.getSnapRetainCount(), config.getPurgeInterval());
// start the data cleanup manager
purgeMgr.start();
if (args.length == 1 && config.isDistributed()) {
runFromConfig(config);
} else {
LOG.warn("Either no config or no quorum defined in config, running "
+ " in standalone mode");
// there is only server in the quorum -- run as standalone
ZooKeeperServerMain.main(args);
}
}
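The parseProperties method
Step into the parse method. The part to focus on is parseProperties, whose main job is to read the configuration file and set parameters that are used later. The snippet below shows some parameters used by code analyzed further down; see the comments for what they mean.
// omitted
} else if (key.equals("autopurge.snapRetainCount")) {
// number of most recent snapshots to retain
snapRetainCount = Integer.parseInt(value);
} else if (key.equals("autopurge.purgeInterval")) {
// purge interval in hours; the default 0 disables purging
purgeInterval = Integer.parseInt(value);
}
// omitted
if (dynamicConfigFileStr == null) {
// set up the remaining configuration parameters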
setupQuorumPeerConfig(zkProp, true);
if (isDistributed() && isReconfigEnabled()) {
// we don't backup static config for standalone mode.
// we also don't backup if reconfig feature is disabled.
backupOldConfig();
}
}
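Setting the configuration parameters
void setupQuorumPeerConfig(Properties prop, boolean configBackwardCompatibilityMode)
throws IOException, ConfigException {
quorumVerifier = parseDynamicConfig(prop, electionAlg, true, configBackwardCompatibilityMode);
// set the server id: read the myid file under the data directory and use its content as the id
setupMyId();
// set the client address and port (2181 here)
setupClientPort();
// set the peer type, which mainly decides whether this server has voting rights
setupPeerType();
// validate the resulting configuration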
checkValidity();
}
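To make the server.N lines concrete, here is a minimal sketch of how such entries could be turned into a map from server id to host and ports. It is a simplification for illustration only: ServerLines is a made-up class, and ZooKeeper's own parsing (reached through parseDynamicConfig) accepts more formats than the plain host:port:port form assumed here.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch; ZooKeeper's own parser handles more formats than this.
class ServerLines {
    // Returns sid -> {host, quorum port, election port} for every server.N entry.
    static Map<Long, String[]> parse(Properties conf) {
        Map<Long, String[]> servers = new TreeMap<>();
        for (String key : conf.stringPropertyNames()) {
            if (!key.startsWith("server.")) {
                continue; // tickTime, dataDir and so on are not cluster members
            }
            long sid = Long.parseLong(key.substring("server.".length()));
            String[] parts = conf.getProperty(key).trim().split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected host:port:port in " + key);
            }
            servers.put(sid, parts);
        }
        return servers;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("server.1", "hadoop101:2888:3888");
        conf.setProperty("server.2", "hadoop102:2888:3888");
        conf.setProperty("server.4", "windowpc:2888:3888");
        for (Map.Entry<Long, String[]> e : parse(conf).entrySet()) {
            System.out.println(e.getKey() + " -> " + String.join(" ", e.getValue()));
        }
    }
}
```

The number after "server." is the same id that each machine stores in its myid file.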
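Periodic cleanup of snapshots and log files
First, it helps to understand how ZooKeeper stores its data. The data directory looks like this
The data consists of snapshots and transaction logs. Recovery also relies on both: they are replayed to rebuild the tree structure in memory. ZooKeeper is not suited to storing large data, since each node can hold only a limited amount. Snapshot and log files are ordered, and the suffix in each file name is a transaction id (zxid). With that background, we continue.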
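A minimal sketch of how the transaction id can be read back from such a file name, assuming the usual prefix.hex-zxid naming (for example snapshot.200000002). ZxidNames is a made-up class that only mirrors the idea behind Util.getZxidFromName; the split of a zxid into a high 32-bit epoch and a low 32-bit counter is included for illustration.

```java
// Hypothetical helper for illustration; not the ZooKeeper Util class.
class ZxidNames {
    // Parses the hexadecimal zxid from "<prefix>.<hex>", or returns -1 if the name does not match.
    static long zxidFromName(String name, String prefix) {
        String[] parts = name.split("\\.");
        if (parts.length == 2 && parts[0].equals(prefix)) {
            try {
                return Long.parseLong(parts[1], 16);
            } catch (NumberFormatException e) {
                return -1L;
            }
        }
        return -1L;
    }

    // High 32 bits of a zxid: the epoch.
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    // Low 32 bits of a zxid: the counter within that epoch.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = zxidFromName("snapshot.200000002", "snapshot");
        System.out.println(zxid + " epoch=" + epochOf(zxid) + " counter=" + counterOf(zxid));
    }
}
```

Because the suffix is a number, sorting the files by it gives the order in which they were written, which is what the cleanup logic relies on.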
DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
.getDataDir(), config.getDataLogDir(), config
.getSnapRetainCount(), config.getPurgeInterval());
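// start the data cleanup manager
purgeMgr.start();
This is the second part of the analysis. Step straight into the start method. Its main job is to check whether periodic purging is enabled: if not, it returns immediately; if so, it starts a timer task to do the work.
public void start() {
// if the task has already been started, return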
if (PurgeTaskStatus.STARTED == purgeTaskStatus) {
LOG.warn("Purge task is already running.");
return;
}
// Don't schedule the purge task with zero or negative purge interval.
// if the purge interval is less than or equal to 0, return
if (purgeInterval <= 0) {
LOG.info("Purge task is not scheduled.");
return;
}
// purging is enabled, so start a timer task to handle it
timer = new Timer("PurgeTask", true);
TimerTask task = new PurgeTask(dataLogDir, snapDir, snapRetainCount);
timer.scheduleAtFixedRate(task, 0, TimeUnit.HOURS.toMillis(purgeInterval));
purgeTaskStatus = PurgeTaskStatus.STARTED;
}
Next, look at the run method of PurgeTask
public void run() {
LOG.info("Purge task started.");
try {
PurgeTxnLog.purge(logsDir, snapsDir, snapRetainCount);
} catch (Exception e) {
LOG.error("Error occurred while purging.", e);
}
LOG.info("Purge task completed.");
}
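This method validates the configured retain count, finds the most recent snapshots, and then purges what is older
public static void purge(File dataDir, File snapDir, int num) throws IOException {
// A configured retain count below 3 is rejected. 3 is also the default. This is only the configured value; there may be fewer snapshots on disk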
if (num < 3) {
throw new IllegalArgumentException(COUNT_ERR_MSG);
}
FileTxnSnapLog txnLog = new FileTxnSnapLog(dataDir, snapDir);
// find the most recent snapshots
List<File> snaps = txnLog.findNRecentSnapshots(num);
int numSnaps = snaps.size();
if (numSnaps > 0) {
// if any snapshot exists, enter the purge logic; nothing is necessarily deleted, because there may be fewer than 3 snapshots and logs on disk
purgeOlderSnapshots(txnLog, snaps.get(numSnaps - 1));
}
}
Finding the most recent snapshots is simple: sort the files, then take the most recent n whose name has the snapshot prefix
public List<File> findNRecentSnapshots(int n) throws IOException {
List<File> files = Util.sortDataDir(snapDir.listFiles(), SNAPSHOT_FILE_PREFIX, false);
int count = 0;
List<File> list = new ArrayList<File>();
for (File f: files) {
if (count == n)
break;
if (Util.getZxidFromName(f.getName(), SNAPSHOT_FILE_PREFIX) != -1) {
count++;
list.add(f);
}
}
return list;
}
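The purge logic is also simple. Snapshots and logs are ordered by transaction id, so it is enough to take the oldest of the retained snapshots and delete the files that are older than it
// logs to delete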
File[] logs = txnLog.getDataDir().listFiles(new MyFileFilter(PREFIX_LOG));
List<File> files = new ArrayList<>();
if (logs != null) {
files.addAll(Arrays.asList(logs));
}
// add all non-excluded snapshot files to the deletion list
// snapshots to delete
File[] snapshots = txnLog.getSnapDir().listFiles(new MyFileFilter(PREFIX_SNAPSHOT));
if (snapshots != null) {
files.addAll(Arrays.asList(snapshots));
}
// remove the old files
for(File f: files)
{
final String msg = "Removing file: "+
DateFormat.getDateTimeInstance().format(f.lastModified())+
"\t"+f.getPath();
LOG.info(msg);
System.out.println(msg);
if(!f.delete()){
System.err.println("Failed to remove "+f.getPath());
}
}
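The selection step described above can be sketched on plain numbers. This is a simplified model, not the PurgeTxnLog code: PurgeSketch is a made-up class, it only deals with snapshot zxids, and it leaves out how log files are matched against the oldest retained snapshot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the purge selection; works on zxids instead of files.
class PurgeSketch {
    // Returns the snapshot zxids that fall outside the `retain` most recent ones.
    static List<Long> toDelete(List<Long> snapshotZxids, int retain) {
        if (retain < 3) {
            throw new IllegalArgumentException("retain count must be at least 3");
        }
        List<Long> newestFirst = new ArrayList<>(snapshotZxids);
        newestFirst.sort(Collections.reverseOrder());
        List<Long> doomed = new ArrayList<>();
        int kept = Math.min(retain, newestFirst.size());
        if (kept == 0) {
            return doomed; // nothing on disk, nothing to purge
        }
        long oldestKept = newestFirst.get(kept - 1);
        for (long zxid : newestFirst) {
            if (zxid < oldestKept) {
                doomed.add(zxid);
            }
        }
        return doomed;
    }

    public static void main(String[] args) {
        System.out.println(toDelete(Arrays.asList(1L, 5L, 9L, 12L, 20L), 3));
    }
}
```

With fewer snapshots on disk than the retain count, the oldest existing snapshot becomes the boundary and nothing is selected.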
Cluster startup
if (args.length == 1 && config.isDistributed()) {
// check the argument count and whether this is a cluster (distributed) setup
runFromConfig(config);
} else {
LOG.warn("Either no config or no quorum defined in config, running "
+ " in standalone mode");
// there is only server in the quorum -- run as standalone
ZooKeeperServerMain.main(args);
}
Preparing for cluster startup
The main work here is registering the log4j JMX beans and creating the network connection factory
public void runFromConfig(QuorumPeerConfig config)
throws IOException, AdminServerException
{
try {
// register the log4j MBeans; the log4j configuration was passed in at startup
ManagedUtil.registerLog4jMBeans();
} catch (JMException e) {
LOG.warn("Unable to register log4j JMX control", e);
}
LOG.info("Starting quorum peer");
try {
// create the network connection factory
ServerCnxnFactory cnxnFactory = null;
ServerCnxnFactory secureCnxnFactory = null;
if (config.getClientPortAddress() != null) {
cnxnFactory = ServerCnxnFactory.createFactory();
// this binds the client port and opens it for communication
cnxnFactory.configure(config.getClientPortAddress(),
config.getMaxClientCnxns(),
false);
}
if (config.getSecureClientPortAddress() != null) {
secureCnxnFactory = ServerCnxnFactory.createFactory();
secureCnxnFactory.configure(config.getSecureClientPortAddress(),
config.getMaxClientCnxns(),
true);
}
quorumPeer = getQuorumPeer();
quorumPeer.setTxnFactory(new FileTxnSnapLog(
config.getDataLogDir(),
config.getDataDir()));
quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
quorumPeer.enableLocalSessionsUpgrading(
config.isLocalSessionsUpgradingEnabled());
//quorumPeer.setQuorumPeers(config.getAllMembers());
quorumPeer.setElectionType(config.getElectionAlg());
quorumPeer.setMyid(config.getServerId());
quorumPeer.setTickTime(config.getTickTime());
quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
quorumPeer.setInitLimit(config.getInitLimit());
quorumPeer.setSyncLimit(config.getSyncLimit());
quorumPeer.setConfigFileName(config.getConfigFilename());
quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
if (config.getLastSeenQuorumVerifier()!=null) {
quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
}
quorumPeer.initConfigInZKDatabase();
quorumPeer.setCnxnFactory(cnxnFactory);
quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
quorumPeer.setSslQuorum(config.isSslQuorum());
quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
quorumPeer.setLearnerType(config.getPeerType());
quorumPeer.setSyncEnabled(config.getSyncEnabled());
quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
if (config.sslQuorumReloadCertFiles) {
quorumPeer.getX509Util().enableCertFileReloading();
}
// sets quorum sasl authentication configurations
quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
if(quorumPeer.isQuorumSaslAuthEnabled()){
quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
}
quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
quorumPeer.initialize();
// analyzed in detail below
quorumPeer.start();
quorumPeer.join();
} catch (InterruptedException e) {
// warn, but generally this is ok
LOG.warn("Quorum Peer interrupted", e);
}
}
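The factory class name is read from configuration and the instance is created through reflection. Here is the relevant explanation from zookeeperAdmin.md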
- serverCnxnFactory :
(Java system property: zookeeper.serverCnxnFactory)
Specifies ServerCnxnFactory implementation.
This should be set to NettyServerCnxnFactory
in order to use TLS based server communication.
Default is NIOServerCnxnFactory.
public static final String ZOOKEEPER_SERVER_CNXN_FACTORY = "zookeeper.serverCnxnFactory";
static public ServerCnxnFactory createFactory() throws IOException {
String serverCnxnFactoryName =
System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
if (serverCnxnFactoryName == null) {
serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
}
try {
ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(serverCnxnFactoryName)
.getDeclaredConstructor().newInstance();
LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
return serverCnxnFactory;
} catch (Exception e) {
IOException ioe = new IOException("Couldn't instantiate "
+ serverCnxnFactoryName);
ioe.initCause(e);
throw ioe;
}
}
Startup
The key call to analyze:
quorumPeer.start();
Loading data and starting the network listener
@Override
public synchronized void start() {
if (!getView().containsKey(myid)) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
// load the snapshot and transaction log data
loadDataBase();
// start the connection factory created earlier
startServerCnxnFactory();
try {
adminServer.start();
} catch (AdminServerException e) {
LOG.warn("Problem starting AdminServer", e);
System.out.println(e);
}
// start leader election
startLeaderElection();
super.start();
}
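Creating the vote object
This mainly creates the vote. Its parameters are the server id, the last logged transaction id (zxid), and the current epoch (a number that only increases)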
synchronized public void startLeaderElection() {
try {
// if the peer is in LOOKING state, create the vote
if (getPeerState() == ServerState.LOOKING) {
currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
}
} catch(IOException e) {
RuntimeException re = new RuntimeException(e.getMessage());
re.setStackTrace(e.getStackTrace());
throw re;
}
// if (!getView().containsKey(myid)) {
// throw new RuntimeException("My id " + myid + " not in the peer list");
//}
// electionType is 3 at this point, so the if is skipped and we go straight to createElectionAlgorithm
if (electionType == 0) {
try {
udpSocket = new DatagramSocket(getQuorumAddress().getPort());
responder = new ResponderThread();
responder.start();
} catch (SocketException e) {
throw new RuntimeException(e);
}
}
this.electionAlg = createElectionAlgorithm(electionType);
}
Starting the threads that handle votes
protected Election createElectionAlgorithm(int electionAlgorithm){
Election le=null;
//TODO: use a factory rather than a switch
switch (electionAlgorithm) {
case 0:
le = new LeaderElection(this);
break;
case 1:
le = new AuthFastLeaderElection(this);
break;
case 2:
le = new AuthFastLeaderElection(this, true);
break;
case 3:
QuorumCnxManager qcm = createCnxnManager();
QuorumCnxManager oldQcm = qcmRef.getAndSet(qcm);
if (oldQcm != null) {
LOG.warn("Clobbering already-set QuorumCnxManager (restarting leader election?)");
oldQcm.halt();
}
QuorumCnxManager.Listener listener = qcm.listener;
if(listener != null){
// start the listener
listener.start();
FastLeaderElection fle = new FastLeaderElection(this, qcm);
fle.start();
le = fle;
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
default:
assert false;
}
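return le;
}
A first look at QuorumCnxManager.Listener
Look at the run method of Listener. It mainly opens the listening socket on port 3888 and then blocks there, waiting for messages from the other servers. That is enough for now; we come back to it later.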
public void run() {
int numRetries = 0;
InetSocketAddress addr;
Socket client = null;
Exception exitException = null;
while ((!shutdown) && (portBindMaxRetry == 0 || numRetries < portBindMaxRetry)) {
try {
if (self.shouldUsePortUnification()) {
LOG.info("Creating TLS-enabled quorum server socket");
ss = new UnifiedServerSocket(self.getX509Util(), true);
} else if (self.isSslQuorum()) {
LOG.info("Creating TLS-only quorum server socket");
ss = new UnifiedServerSocket(self.getX509Util(), false);
} else {
// debugging reaches this branch, which creates a plain ServerSocket
ss = new ServerSocket();
}
ss.setReuseAddress(true);
if (self.getQuorumListenOnAllIPs()) {
int port = self.getElectionAddress().getPort();
addr = new InetSocketAddress(port);
} else {
// Resolve hostname for this server in case the
// underlying ip address has changed.
self.recreateSocketAddresses(self.getId());
addr = self.getElectionAddress();
}
LOG.info("My election bind port: " + addr.toString());
setName(addr.toString());
ss.bind(addr);
while (!shutdown) {
try {
client = ss.accept();
setSockOpts(client);
LOG.info("Received connection request "
+ formatInetAddr((InetSocketAddress)client.getRemoteSocketAddress()));
// Receive and handle the connection request
// asynchronously if the quorum sasl authentication is
// enabled. This is required because sasl server
// authentication process may take few seconds to finish,
// this may delay next peer connection requests.
if (quorumSaslAuthEnabled) {
receiveConnectionAsync(client);
} else {
receiveConnection(client);
}
numRetries = 0;
} catch (SocketTimeoutException e) {
LOG.warn("The socket is listening for the election accepted "
+ "and it timed out unexpectedly, but will retry."
+ "see ZOOKEEPER-2836");
}
}
} catch (IOException e) {
if (shutdown) {
break;
}
LOG.error("Exception while listening", e);
exitException = e;
numRetries++;
try {
ss.close();
Thread.sleep(1000);
} catch (IOException ie) {
LOG.error("Error closing server socket", ie);
} catch (InterruptedException ie) {
LOG.error("Interrupted while sleeping. " +
"Ignoring exception", ie);
}
closeSocket(client);
}
}
LOG.info("Leaving listener");
if (!shutdown) {
LOG.error("As I'm leaving the listener thread after "
+ numRetries + " errors. "
+ "I won't be able to participate in leader "
+ "election any longer: "
+ formatInetAddr(self.getElectionAddress())
+ ". Use " + ELECTION_PORT_BIND_RETRY + " property to "
+ "increase retry count.");
if (exitException instanceof SocketException) {
// After leaving listener thread, the host cannot join the
// quorum anymore, this is a severe error that we cannot
// recover from, so we need to exit
socketBindErrorHandler.run();
}
} else if (ss != null) {
// Clean up for shutdown.
try {
ss.close();
} catch (IOException ie) {
// Don't log an error for shutdown.
LOG.debug("Error closing server socket", ie);
}
}
}
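At this point the thread is blocked waiting for incoming data. We continue with this class later.
The start method of FastLeaderElection
Continuing with the start method of FastLeaderElection: followed to the end, it mainly starts two threads, WorkerSender and WorkerReceiver, which move messages around later on. Keep the earlier diagram in mind.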
void start(){
this.wsThread.start();
this.wrThread.start();
}
Starting QuorumPeer
Now analyze super.start(), which mainly means following its run method
Here is the flow once more
public synchronized void start() {
if (!getView().containsKey(myid)) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
// load the snapshot and transaction log data
loadDataBase();
// start the connection factory created earlier
startServerCnxnFactory();
try {
adminServer.start();
} catch (AdminServerException e) {
LOG.warn("Problem starting AdminServer", e);
System.out.println(e);
}
// start leader election
startLeaderElection();
super.start();
}
Now look at the run method of the current class. Only the important part is shown; the focus is the lookForLeader() method
// omitted
case LOOKING:
LOG.info("LOOKING");
if (Boolean.getBoolean("readonlymode.enabled")) {
LOG.info("Attempting to start ReadOnlyZooKeeperServer");
// Create read-only server but don't start it immediately
final ReadOnlyZooKeeperServer roZk =
new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
// Instead of starting roZk immediately, wait some grace
// period before we decide we're partitioned.
//
// Thread is used here because otherwise it would require
// changes in each of election strategy classes which is
// unnecessary code coupling.
Thread roZkMgr = new Thread() {
public void run() {
try {
// lower-bound grace period to 2 secs
sleep(Math.max(2000, tickTime));
if (ServerState.LOOKING.equals(getPeerState())) {
roZk.startup();
}
} catch (InterruptedException e) {
LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
} catch (Exception e) {
LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
}
}
};
try {
roZkMgr.start();
reconfigFlagClear();
if (shuttingDownLE) {
shuttingDownLE = false;
startLeaderElection();
}
// set the current vote to the result of the leader search; see lookForLeader
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
} finally {
// If the thread is in the the grace period, interrupt
// to come out of waiting.
roZkMgr.interrupt();
roZk.shutdown();
}
} else {
try {
reconfigFlagClear();
if (shuttingDownLE) {
shuttingDownLE = false;
startLeaderElection();
}
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
}
break;
case OBSERVING:
try {
LOG.info("OBSERVING");
setObserver(makeObserver(logFactory));
observer.observeLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e );
} finally {
observer.shutdown();
setObserver(null);
updateServerState();
}
break;
case FOLLOWING:
try {
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
follower.shutdown();
setFollower(null);
updateServerState();
}
break;
The lookForLeader() method *****
Step into lookForLeader() of FastLeaderElection. First look at sendNotifications(), which builds the vote message for each voter and puts it on the send queue sendqueue
private void sendNotifications() {
for (long sid : self.getCurrentAndNextConfigVoters()) {
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(ToSend.mType.notification,
proposedLeader,
proposedZxid,
logicalclock.get(),
QuorumPeer.ServerState.LOOKING,
sid,
proposedEpoch, qv.toString().getBytes());
if(LOG.isDebugEnabled()){
LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x" +
Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get()) +
" (n.round), " + sid + " (recipient), " + self.getId() +
" (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
}
sendqueue.offer(notmsg);
}
}
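Continuing in lookForLeader, the code enters a while loop that keeps polling vote notifications from recvqueue. If the poll returns null, it checks manager.haveDelivered(). Each server keeps one outgoing vote queue per peer in queueSendMap. If queueSendMap is not empty and at least one of its queues has been drained, haveDelivered returns true and the notifications are sent again: an empty poll means no leader has emerged yet, so the server keeps talking to its peers (the message content changes over time, as discussed later). If queueSendMap itself is empty, haveDelivered returns false, and connectAll is called to connect to the other servers, create their queues, and start a SendWorker and RecvWorker for each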
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if(n == null){
if(manager.haveDelivered()){
sendNotifications();
} else {
manager.connectAll();
}
// rest of the loop omitted; haveDelivered() of QuorumCnxManager follows
boolean haveDelivered() {
for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
LOG.debug("Queue size: " + queue.size());
if (queue.size() == 0) {
return true;
}
}
return false;
}
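The two outcomes of haveDelivered are easy to mix up, so here is a minimal sketch of the same check on ordinary collections. DeliveredSketch is a made-up class; the map stands in for queueSendMap.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for the haveDelivered check; not the QuorumCnxManager code.
class DeliveredSketch {
    // True if at least one per-server queue exists and has been fully drained.
    static boolean haveDelivered(Map<Long, ? extends Collection<?>> queueSendMap) {
        for (Collection<?> queue : queueSendMap.values()) {
            if (queue.isEmpty()) {
                return true;
            }
        }
        return false; // also the result when no queue has been created yet
    }

    public static void main(String[] args) {
        Map<Long, Queue<String>> queues = new HashMap<>();
        System.out.println("no queues yet: " + haveDelivered(queues));
        queues.put(2L, new ArrayDeque<>());
        System.out.println("one drained queue: " + haveDelivered(queues));
    }
}
```

An empty map therefore leads to the connectAll branch, while a map with a drained queue leads to another round of sendNotifications.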
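This needs to be read alongside the earlier diagram, shown here again
The queue creation described above can in fact also be done by WorkerSender in FastLeaderElection. As mentioned, FastLeaderElection starts two threads. WorkerSender reads each message, creates the queue for the target server if needed, and puts the message on it; SendWorker then takes over. Note the two similar names: WorkerSender is a thread in FastLeaderElection, while SendWorker is a thread in QuorumCnxManager
Part of the code is shown here and discussed later
Look here: on the first debug run the map is empty, because none of the queues mentioned above have been created yet, so connectAll is called directly. It feels as if the code were saying "you are too slow, I will trigger it by hand", since connectAll eventually reaches the method in the figure above as well. It makes little practical difference
On later iterations there are messages
Once there are messages they have to be processed. Everyone is still in LOOKING state here; the code is pasted below with brief comments
The rules for comparing votes:
- The larger epoch wins outright
- If the epochs are equal, the larger transaction id (zxid) wins
- If the zxids are equal, the larger server id wins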
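The three rules can be written as a single comparison. A minimal sketch, assuming only the rules listed above; VoteOrder is a made-up class, and the real totalOrderPredicate has additional checks that are left out here.

```java
// Hypothetical sketch of the vote ordering; not the FastLeaderElection code.
class VoteOrder {
    // True if the new vote (id, zxid, epoch) beats the current one.
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // rule 1: larger epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // rule 2: larger zxid wins
        }
        return newId > curId;           // rule 3: larger server id wins
    }

    public static void main(String[] args) {
        // Same epoch and zxid, so the server id decides.
        System.out.println(newVoteWins(2, 8589934594L, 7, 4, 8589934594L, 7));
        System.out.println(newVoteWins(4, 8589934594L, 7, 2, 8589934594L, 7));
    }
}
```

With equal epoch and zxid, as in the debug values shown further down, the vote for server 2 loses against the vote for server 4.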
// check that both the sender and the proposed leader in this message are valid voters
else if (validVoter(n.sid) && validVoter(n.leader)) {
/*
* Only proceed if the vote comes from a replica in the current or next
* voting view for a replica in the current or next voting view.
*/
switch (n.state) {
case LOOKING:
// If notification > current, replace and send messages out
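// Compare electionEpoch to check whether both sides are in the same election round; it is incremented each time a leader election runs
// When I paused in the debugger here, the peers were at 41 while this server was still at 1, so we were not in the same round
// Some posts online claim that electionEpoch and the peer epoch are the same value; they are not
if (n.electionEpoch > logicalclock.get()) {
// adopt the election epoch from the notification as our own logical clock
logicalclock.set(n.electionEpoch);
// clear the votes collected so far, because this server was behind
recvset.clear();
// compare the votes; the values from my debug session were
// (2 8589934594 7) (4 8589934594 7), so my host machine is larger and the else branch runs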
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
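// update the proposal; it does not actually change because the host machine wins, but the logical clock sent with the notification has changed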
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
// send the notifications out again
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {
// the notification belongs to an older round than ours; ignore it
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
// same round: check whether the sender's candidate beats our current proposal; if it does, adopt it and send it out. In my run it did not
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
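// record the vote keyed by the sender's server id, so a later vote from the same server replaces the earlier one
// don't care about the version if it's in LOOKING state
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
// check whether the current proposal is backed by more than half of the cluster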
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
// drain the remaining notifications and compare each with the current proposal; if none is better, the queue finally returns null
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
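// so we reach the final check: the state becomes LEADING if the proposed leader is this server (otherwise the learner state), and the loop is left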
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid, logicalclock.get(),
proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
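The "more than half" check can be illustrated with a simple count. This is a deliberately reduced model: MajoritySketch is a made-up class that only compares the proposed leader id, whereas the real termPredicate compares complete votes and delegates the quorum decision to the configured QuorumVerifier.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical majority check; a simplification of the idea behind termPredicate.
class MajoritySketch {
    // votesBySid maps each voter's server id to the leader id it currently proposes.
    static boolean hasQuorum(Map<Long, Long> votesBySid, long proposedLeader, int clusterSize) {
        int supporters = 0;
        for (long leader : votesBySid.values()) {
            if (leader == proposedLeader) {
                supporters++;
            }
        }
        return supporters > clusterSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(4L, 4L); // server 4 votes for itself
        System.out.println("one vote of three: " + hasQuorum(votes, 4L, 3));
        votes.put(2L, 4L); // server 2 switches its vote to 4
        System.out.println("two votes of three: " + hasQuorum(votes, 4L, 3));
    }
}
```

Because the votes are keyed by the sender's id, a server that changes its mind overwrites its earlier entry rather than being counted twice.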
Analyzing the other threads
QuorumCnxManager.Listener revisited
public void run() {
int numRetries = 0;
InetSocketAddress addr;
Socket client = null;
Exception exitException = null;
while ((!shutdown) && (portBindMaxRetry == 0 || numRetries < portBindMaxRetry)) {
try {
if (self.shouldUsePortUnification()) {
LOG.info("Creating TLS-enabled quorum server socket");
ss = new UnifiedServerSocket(self.getX509Util(), true);
} else if (self.isSslQuorum()) {
LOG.info("Creating TLS-only quorum server socket");
ss = new UnifiedServerSocket(self.getX509Util(), false);
} else {
ss = new ServerSocket();
}
ss.setReuseAddress(true);
if (self.getQuorumListenOnAllIPs()) {
int port = self.getElectionAddress().getPort();
addr = new InetSocketAddress(port);
} else {
// Resolve hostname for this server in case the
// underlying ip address has changed.
self.recreateSocketAddresses(self.getId());
addr = self.getElectionAddress();
}
LOG.info("My election bind port: " + addr.toString());
setName(addr.toString());
ss.bind(addr);
while (!shutdown) {
try {
// blocks here waiting for a connection
client = ss.accept();
setSockOpts(client);
LOG.info("Received connection request "
+ formatInetAddr((InetSocketAddress)client.getRemoteSocketAddress()));
// Receive and handle the connection request
// asynchronously if the quorum sasl authentication is
// enabled. This is required because sasl server
// authentication process may take few seconds to finish,
// this may delay next peer connection requests.
if (quorumSaslAuthEnabled) {
// asynchronous
receiveConnectionAsync(client);
} else {
// synchronous
receiveConnection(client);
}
numRetries = 0;
} catch (SocketTimeoutException e) {
LOG.warn("The socket is listening for the election accepted "
+ "and it timed out unexpectedly, but will retry."
+ "see ZOOKEEPER-2836");
}
}
} catch (IOException e) {
if (shutdown) {
break;
}
LOG.error("Exception while listening", e);
exitException = e;
numRetries++;
try {
ss.close();
Thread.sleep(1000);
} catch (IOException ie) {
LOG.error("Error closing server socket", ie);
} catch (InterruptedException ie) {
LOG.error("Interrupted while sleeping. " +
"Ignoring exception", ie);
}
closeSocket(client);
}
}
LOG.info("Leaving listener");
if (!shutdown) {
LOG.error("As I'm leaving the listener thread after "
+ numRetries + " errors. "
+ "I won't be able to participate in leader "
+ "election any longer: "
+ formatInetAddr(self.getElectionAddress())
+ ". Use " + ELECTION_PORT_BIND_RETRY + " property to "
+ "increase retry count.");
if (exitException instanceof SocketException) {
// After leaving listener thread, the host cannot join the
// quorum anymore, this is a severe error that we cannot
// recover from, so we need to exit
socketBindErrorHandler.run();
}
} else if (ss != null) {
// Clean up for shutdown.
try {
ss.close();
} catch (IOException ie) {
// Don't log an error for shutdown.
LOG.debug("Error closing server socket", ie);
}
}
}
Continue into the synchronous method
public void receiveConnection(final Socket sock) {
DataInputStream din = null;
try {
din = new DataInputStream(
new BufferedInputStream(sock.getInputStream()));
handleConnection(sock, din);
} catch (IOException e) {
LOG.error("Exception handling connection, addr: {}, closing server connection",
sock.getRemoteSocketAddress());
closeSocket(sock);
}
}
private void handleConnection(Socket sock, DataInputStream din)
throws IOException {
Long sid = null, protocolVersion = null;
InetSocketAddress electionAddr = null;
try {
protocolVersion = din.readLong();
// read the protocol version; in my debug run it was -65536. A value >= 0 means it is a server id instead
if (protocolVersion >= 0) { // this is a server id and not a protocol version
sid = protocolVersion;
} else {
try {
InitialMessage init = InitialMessage.parse(protocolVersion, din);
sid = init.sid;
electionAddr = init.electionAddr;
} catch (InitialMessage.InitialMessageException ex) {
LOG.error(ex.toString());
closeSocket(sock);
return;
}
}
// if the sid is OBSERVER_ID, assign an identifier from a counter that is decremented each time
if (sid == QuorumPeer.OBSERVER_ID) {
/*
* Choose identifier at random. We need a value to identify
* the connection.
*/
sid = observerCounter.getAndDecrement();
LOG.info("Setting arbitrary identifier to observer: " + sid);
}
} catch (IOException e) {
LOG.warn("Exception reading or writing challenge: {}", e);
closeSocket(sock);
return;
}
// do authenticating learner
authServer.authenticate(sock, din);
//If wins the challenge, then close the new connection.
// the sid that was read is smaller than this server's id
if (sid < self.getId()) {
/*
* This replica might still believe that the connection to sid is
* up, so we have to shut down the workers before trying to open a
* new connection.
*/
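// Look up the SendWorker for sid in senderWorkerMap. If there is one, stop it: its running flag is set to false, its socket is closed,
// it is removed from the map and its thread is interrupted. If it has a RecvWorker, that is stopped and interrupted as well. Then a new connection is opened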
SendWorker sw = senderWorkerMap.get(sid);
if (sw != null) {
sw.finish();
}
/*
* Now we start a new connection
*/
LOG.debug("Create new connection to server: {}", sid);
closeSocket(sock);
// connect using the server's election address
if (electionAddr != null) {
connectOne(sid, electionAddr);
} else {
connectOne(sid);
}
} else { // Otherwise start worker threads to receive data.
SendWorker sw = new SendWorker(sock, sid);
RecvWorker rw = new RecvWorker(sock, din, sid, sw);
sw.setRecv(rw);
SendWorker vsw = senderWorkerMap.get(sid);
if (vsw != null) {
vsw.finish();
}
senderWorkerMap.put(sid, sw);
queueSendMap.putIfAbsent(sid,
new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));
sw.start();
rw.start();
}
}
Step into connectOne, follow it to initiateConnection, and then into startConnection
private boolean startConnection(Socket sock, Long sid)
throws IOException {
DataOutputStream dout = null;
DataInputStream din = null;
try {
// Use BufferedOutputStream to reduce the number of IP packets. This is
// important for x-DC scenarios.
BufferedOutputStream buf = new BufferedOutputStream(sock.getOutputStream());
dout = new DataOutputStream(buf);
// Sending id and challenge
// represents protocol version (in other words - message type)
dout.writeLong(PROTOCOL_VERSION);
dout.writeLong(self.getId());
String addr = formatInetAddr(self.getElectionAddress());
byte[] addr_bytes = addr.getBytes();
dout.writeInt(addr_bytes.length);
dout.write(addr_bytes);
dout.flush();
din = new DataInputStream(
new BufferedInputStream(sock.getInputStream()));
} catch (IOException e) {
LOG.warn("Ignoring exception reading or writing challenge: ", e);
closeSocket(sock);
return false;
}
// authenticate learner
QuorumPeer.QuorumServer qps = self.getVotingView().get(sid);
if (qps != null) {
// TODO - investigate why reconfig makes qps null.
authLearner.authenticate(sock, qps.hostname);
}
// If lost the challenge, then drop the new connection
// if the sid of the server we are connecting to is greater than this server's id, close this socket
if (sid > self.getId()) {
LOG.info("Have smaller server identifier, so dropping the " +
"connection: (" + sid + ", " + self.getId() + ")");
closeSocket(sock);
// Otherwise proceed with the connection
} else {
SendWorker sw = new SendWorker(sock, sid);
RecvWorker rw = new RecvWorker(sock, din, sid, sw);
sw.setRecv(rw);
SendWorker vsw = senderWorkerMap.get(sid);
if(vsw != null)
vsw.finish();
senderWorkerMap.put(sid, sw);
queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(
SEND_CAPACITY));
sw.start();
rw.start();
return true;
}
return false;
}
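The opening bytes written by startConnection and read back by handleConnection can be sketched with in-memory streams. This is an illustration under stated assumptions: InitialMessageSketch is a made-up class, the protocol version constant is simply the -65536 seen in the debug session above, and the address is encoded as UTF-8 here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical round trip of the connection preamble; not the QuorumCnxManager code.
class InitialMessageSketch {
    static final long PROTOCOL_VERSION = -65536L; // value observed while debugging

    // Layout: version (long), sender sid (long), address length (int), address bytes.
    static byte[] encode(long sid, String electionAddr) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(PROTOCOL_VERSION);
            out.writeLong(sid);
            byte[] addr = electionAddr.getBytes(StandardCharsets.UTF_8);
            out.writeInt(addr.length);
            out.write(addr);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {sid, address}; the address is null when the first long is already a sid.
    static String[] decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            long first = in.readLong();
            if (first >= 0) {
                return new String[] { Long.toString(first), null };
            }
            long sid = in.readLong();
            byte[] addr = new byte[in.readInt()];
            in.readFully(addr);
            return new String[] { Long.toString(sid), new String(addr, StandardCharsets.UTF_8) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] decoded = decode(encode(4L, "windowpc:3888"));
        System.out.println("sid=" + decoded[0] + " addr=" + decoded[1]);
    }
}
```

The negative version number is what lets the receiver tell this format apart from a bare server id, which is never negative.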
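How the sid decides which connection is kept
For this we need the run method of WorkerSender in FastLeaderElection. As mentioned earlier, FastLeaderElection starts two threads, WorkerSender and WorkerReceiver
Here is the WorkerSender code; follow it in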
public void run() {
while (!stop) {
try {
// take a message from the send queue and process it
ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
if(m == null) continue;
process(m);
} catch (InterruptedException e) {
break;
}
}
LOG.info("WorkerSender is down");
}
void process(ToSend m) {
ByteBuffer requestBuffer = buildMsg(m.state.ordinal(),
m.leader,
m.zxid,
m.electionEpoch,
m.peerEpoch,
m.configData);
manager.toSend(m.sid, requestBuffer);
}
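Now look at the toSend method. If the message is addressed to this server itself, it is put straight onto the receive queue. Otherwise the send queue for the target sid is created if it does not exist yet, the message is added to it, and connectOne is called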
public void toSend(Long sid, ByteBuffer b) {
/*
* If sending message to myself, then simply enqueue it (loopback).
*/
if (this.mySid == sid) {
b.position(0);
addToRecvQueue(new Message(b.duplicate(), sid));
/*
* Otherwise send to the corresponding thread to send.
*/
} else {
/*
* Start a new connection if doesn't have one already.
*/
ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(
SEND_CAPACITY);
ArrayBlockingQueue<ByteBuffer> oldq = queueSendMap.putIfAbsent(sid, bq);
if (oldq != null) {
addToSendQueue(oldq, b);
} else {
addToSendQueue(bq, b);
}
connectOne(sid);
}
}
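The create-or-reuse step with putIfAbsent can be tried in isolation. A minimal sketch with assumptions spelled out: SendQueues is a made-up class, the capacity of 1 is chosen for the sketch, and replacing the oldest message when the queue is full is this sketch's own policy for a full queue.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the per-server send queues; not the QuorumCnxManager code.
class SendQueues {
    static final int SEND_CAPACITY = 1; // assumed for this sketch

    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();

    // Adds msg to the queue for sid, creating the queue on first use.
    ArrayBlockingQueue<ByteBuffer> enqueue(long sid, ByteBuffer msg) {
        ArrayBlockingQueue<ByteBuffer> fresh = new ArrayBlockingQueue<>(SEND_CAPACITY);
        // putIfAbsent returns the existing queue, or null if ours was stored
        ArrayBlockingQueue<ByteBuffer> existing = queueSendMap.putIfAbsent(sid, fresh);
        ArrayBlockingQueue<ByteBuffer> queue = (existing != null) ? existing : fresh;
        while (!queue.offer(msg)) {
            queue.poll(); // full: drop the oldest message and retry
        }
        return queue;
    }

    public static void main(String[] args) {
        SendQueues queues = new SendQueues();
        queues.enqueue(2L, ByteBuffer.allocate(4));
        queues.enqueue(2L, ByteBuffer.allocate(4));
        System.out.println("queues: " + queues.queueSendMap.size());
    }
}
```

Only the newest vote matters during an election, which is why a small queue that discards stale messages is a reasonable fit.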
Here is the inner connectOne, since there is an outer wrapper around it. It continues into initiateConnection and finally into the startConnection method
synchronized private boolean connectOne(long sid, InetSocketAddress electionAddr){
if (senderWorkerMap.get(sid) != null) {
LOG.debug("There is a connection already for server " + sid);
return true;
}
Socket sock = null;
try {
LOG.debug("Opening channel to server " + sid);
if (self.isSslQuorum()) {
SSLSocket sslSock = self.getX509Util().createSSLSocket();
setSockOpts(sslSock);
sslSock.connect(electionAddr, cnxTO);
sslSock.startHandshake();
sock = sslSock;
LOG.info("SSL handshake complete with {} - {} - {}", sslSock.getRemoteSocketAddress(), sslSock.getSession().getProtocol(), sslSock.getSession().getCipherSuite());
} else {
sock = new Socket();
setSockOpts(sock);
sock.connect(electionAddr, cnxTO);
}
LOG.debug("Connected to server " + sid);
// Sends connection request asynchronously if the quorum
// sasl authentication is enabled. This is required because
// sasl server authentication process may take few seconds to
// finish, this may delay next peer connection requests.
if (quorumSaslAuthEnabled) {
initiateConnectionAsync(sock, sid);
} else {
initiateConnection(sock, sid);
}
return true;
} catch (UnresolvedAddressException e) {
// Sun doesn't include the address that causes this
// exception to be thrown, also UAE cannot be wrapped cleanly
// so we log the exception in order to capture this critical
// detail.
LOG.warn("Cannot open channel to " + sid
+ " at election address " + electionAddr, e);
closeSocket(sock);
throw e;
} catch (X509Exception e) {
LOG.warn("Cannot open secure channel to " + sid
+ " at election address " + electionAddr, e);
closeSocket(sock);
return false;
} catch (IOException e) {
LOG.warn("Cannot open channel to " + sid
+ " at election address " + electionAddr,
e);
closeSocket(sock);
return false;
}
}
The important part of the code
// If lost the challenge, then drop the new connection
// if the sid we are sending to is greater than this server's id, close the socket that was just opened
if (sid > self.getId()) {
LOG.info("Have smaller server identifier, so dropping the " +
"connection: (" + sid + ", " + self.getId() + ")");
closeSocket(sock);
// Otherwise proceed with the connection
} else {
// for example, with my local machine as 4 and the VM as 2, the else branch runs and creates the SendWorker and RecvWorker for sending messages
SendWorker sw = new SendWorker(sock, sid);
RecvWorker rw = new RecvWorker(sock, din, sid, sw);
sw.setRecv(rw);
SendWorker vsw = senderWorkerMap.get(sid);
if(vsw != null)
vsw.finish();
senderWorkerMap.put(sid, sw);
queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(
SEND_CAPACITY));
sw.start();
rw.start();
return true;
}
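Summary
The server ids are 1, 2 and 4
Before any incoming connection has been handled:
Each server checks whether the sid it is connecting to is greater than its own id. If it is, the server closes the socket it just opened and sets up nothing for that peer
In the tables below, "connects" therefore also means that the outgoing vote queue and the corresponding workers have been created
Server id | Connections |
---|---|
1 | does not connect to 2; does not connect to 4 |
2 | connects to 1 and creates the vote queue and workers; does not connect to 4 |
4 | connects to 1 and to 2, creating the vote queue and workers for each |
After an incoming connection has been received:
Server id | Connections |
---|---|
1 | with 2: creates the vote queue and workers; with 4: creates the vote queue and workers |
2 | from 1: drops the incoming connection, stops any existing workers for 1, then connects to 1 itself; with 4: creates the vote queue and workers |
4 | from 1: drops the incoming connection, stops any existing workers for 1, then connects to 1 itself; from 2: the same, then connects to 2 itself |
The end result after everything has run: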
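The server with the larger id opens the connection to the server with the smaller id, and the server with the smaller id talks to the larger one through the queue and workers attached to the connection it accepted.
So during leader election, once incoming connections have been handled, a server never keeps a connection that it opened to a peer with a larger id. Each pair of servers ends up with the one connection opened by the larger id, and since both sides attach a SendWorker and a RecvWorker to it, votes in both directions travel over that connection.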
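The two id comparisons from startConnection and handleConnection can be put side by side. A minimal sketch, assuming only the conditions shown in the code above; ConnectionOwner is a made-up class.

```java
// Hypothetical summary of the two sid checks; not the QuorumCnxManager code.
class ConnectionOwner {
    // Outgoing side (startConnection): the socket is dropped if the remote sid is larger.
    static boolean keepsOutgoing(long self, long remote) {
        return !(remote > self);
    }

    // Incoming side (handleConnection): the socket is dropped if the remote sid is smaller.
    static boolean keepsIncoming(long self, long remote) {
        return !(remote < self);
    }

    public static void main(String[] args) {
        long[] ids = {1L, 2L, 4L};
        for (long self : ids) {
            for (long remote : ids) {
                if (self == remote) {
                    continue;
                }
                System.out.println(self + " -> " + remote
                        + " outgoing kept: " + keepsOutgoing(self, remote)
                        + ", incoming kept: " + keepsIncoming(self, remote));
            }
        }
    }
}
```

For any pair, exactly one direction survives: the connection opened by the larger id and accepted by the smaller one.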