The component discussed here comes from GitHub: https://github.com/wenweihu86/raft-java.
This document only covers how the component elects a leader; its internal log replication process is not discussed.
Initialization
// initialize the RPC server
RpcServer server = new RpcServer(localServer.getEndpoint().getPort());
// set Raft options, for example:
// just for test snapshot
RaftOptions raftOptions = new RaftOptions();
raftOptions.setDataDir(dataPath);
raftOptions.setSnapshotMinLogSize(110 * 1024);
raftOptions.setSnapshotPeriodSeconds(30);
raftOptions.setMaxSegmentFileSize(1024 * 1024);
// the application state machine
ExampleStateMachine stateMachine = new ExampleStateMachine(raftOptions.getDataDir());
// construct the RaftNode
RaftNode raftNode = new RaftNode(raftOptions, serverList, localServer, stateMachine);
// register the service that Raft nodes call on each other
RaftConsensusService raftConsensusService = new RaftConsensusServiceImpl(raftNode);
server.registerService(raftConsensusService);
// register the Raft service exposed to clients
RaftClientService raftClientService = new RaftClientServiceImpl(raftNode);
server.registerService(raftClientService);
// register the application's own service
ExampleService exampleService = new ExampleServiceImpl(raftNode, stateMachine);
server.registerService(exampleService);
// start the RPC server, then initialize the Raft node
server.start();
// initialize the RaftNode
raftNode.init();
Initialization can be roughly divided into:
1. RPC component initialization
2. Application state machine initialization
3. Node configuration initialization
4. Service registration
5. Server startup
6. Node startup
Since this document only focuses on the leader election process, we go straight to the node startup code.
RaftNode is the core class of the component. It holds the key election state, including NodeState (the node's current role), currentTerm (the current term), and leaderId (the id of the leader node). The getLeader method exposed by the component simply looks up this leaderId in RaftNode.
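For orientation, below is a condensed view of the election-related fields in RaftNode (abridged; the real class contains far more, but these names match the snippets quoted in this document):
public class RaftNode {
    public enum NodeState {
        STATE_FOLLOWER,
        STATE_PRE_CANDIDATE, // extra state added by this component for the pre-vote phase
        STATE_CANDIDATE,
        STATE_LEADER
    }
    private NodeState state = NodeState.STATE_FOLLOWER; // current role of this node
    private long currentTerm;  // latest term this node has seen
    private int votedFor;      // serverId this node voted for in currentTerm (0 = none)
    private int leaderId;      // serverId of the known leader (0 = unknown)
    private Lock lock = new ReentrantLock(); // guards the election state above
    // the client-facing getLeader service resolves this leaderId to an endpoint
}
raftNode.init() is where the election actually starts: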
public void init() {
// initialize peerMap, i.e. the other nodes in the cluster
for (RaftProto.Server server : configuration.getServersList()) {
if (!peerMap.containsKey(server.getServerId())
&& server.getServerId() != localServer.getServerId()) {
Peer peer = new Peer(server);
peer.setNextIndex(raftLog.getLastLogIndex() + 1);
peerMap.put(server.getServerId(), peer);
}
}
// init thread pool, reused repeatedly later
executorService = new ThreadPoolExecutor(
raftOptions.getRaftConsensusThreadNum(),
raftOptions.getRaftConsensusThreadNum(),
60,
TimeUnit.SECONDS,
new LinkedBlockingQueue<Runnable>());
scheduledExecutorService = Executors.newScheduledThreadPool(2);
scheduledExecutorService.scheduleWithFixedDelay(new Runnable() {
@Override
public void run() {
// related to data synchronization (snapshots); skipped for now
takeSnapshot();
}
}, raftOptions.getSnapshotPeriodSeconds(), raftOptions.getSnapshotPeriodSeconds(), TimeUnit.SECONDS);
// start election
// election timer
resetElectionTimer();
}
During node startup, the key thing to watch is the process of starting the election timer, resetElectionTimer():
/**
 * election timer
*/
private void resetElectionTimer() {
if (electionScheduledFuture != null && !electionScheduledFuture.isDone()) {
electionScheduledFuture.cancel(true);
}
electionScheduledFuture = scheduledExecutorService.schedule(new Runnable() {
@Override
public void run() {
startPreVote();
}
}, getElectionTimeoutMs(), TimeUnit.MILLISECONDS);
// getElectionTimeoutMs() returns a random value between 5000 and 10000 ms
}
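The getElectionTimeoutMs() mentioned in the comment produces that randomized timeout. A minimal sketch of such a method, assuming raftOptions.getElectionTimeoutMilliseconds() is 5000 (which yields the 5000-10000 ms range noted above); the repository's actual implementation may differ in detail:
private int getElectionTimeoutMs() {
    // base timeout plus a random extra of up to one full timeout,
    // so that nodes rarely time out at exactly the same moment
    ThreadLocalRandom random = ThreadLocalRandom.current();
    return raftOptions.getElectionTimeoutMilliseconds()
            + random.nextInt(0, raftOptions.getElectionTimeoutMilliseconds());
}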
The election timer schedules a countdown task; when the countdown expires, the node performs a pre-vote (a step that does not exist in basic Raft; it was added to prevent a node that has lost its network from endlessly incrementing its term and starting elections), i.e. startPreVote():
/**
 * The client side initiates a pre-vote request.
 * pre-vote/vote is a typical two-phase implementation.
 * Its purpose is to prevent a node that has been cut off from the network from endlessly incrementing its term to request votes;
 * otherwise, once that node's network recovers, the other nodes would be forced to raise their terms and the cluster state would be disturbed.
*/
private void startPreVote() {
lock.lock();
try {
if (!ConfigurationUtils.containsServer(configuration, localServer.getServerId())) {
resetElectionTimer();
return;
}
LOG.info("Running pre-vote in term {}", currentTerm);
// switch this node to STATE_PRE_CANDIDATE; like the pre-vote itself, this state is an addition made by the component
state = NodeState.STATE_PRE_CANDIDATE;
} finally {
lock.unlock();
}
for (RaftProto.Server server : configuration.getServersList()) {
if (server.getServerId() == localServer.getServerId()) {
continue;
}
final Peer peer = peerMap.get(server.getServerId());
executorService.submit(new Runnable() {
@Override
public void run() {
// send the pre-vote request over RPC
preVote(peer);
}
});
}
// reset the election timer
resetElectionTimer();
}
After the pre-vote RPC is sent, the result is recorded in peerMap. When a response comes back, the following happens:
1. If the current term differs from the term the request was sent with, or this node is no longer in STATE_PRE_CANDIDATE, the response is ignored.
2. If the other node's term is higher than this node's, stepDown is executed.
3. If the pre-vote is granted, the granted votes are counted from peerMap; if they exceed 1/2 of the cluster (for a 3-node cluster, the node's own vote plus one granted pre-vote is enough), the real election startVote() begins.
4. If the pre-vote is not granted, nothing further is done.
The details are as follows:
// pre-vote: if more than 1/2 of the peers grant the pre-vote, start the formal election
// if our term is lower than another node's, step down and the election will be retried later
// if the pre-vote is denied, only a log line is printed
@Override
public void success(RaftProto.VoteResponse response) {
lock.lock();
try {
peer.setVoteGranted(response.getGranted());
if (currentTerm != request.getTerm() || state != NodeState.STATE_PRE_CANDIDATE) {
LOG.info("ignore preVote RPC result");
return;
}
if (response.getTerm() > currentTerm) {
LOG.info("Received pre vote response from server {} " +
"in term {} (this server's term was {})",
peer.getServer().getServerId(),
response.getTerm(),
currentTerm);
stepDown(response.getTerm());
} else {
if (response.getGranted()) {
LOG.info("get pre vote granted from server {} for term {}",
peer.getServer().getServerId(), currentTerm);
int voteGrantedNum = 1;
for (RaftProto.Server server : configuration.getServersList()) {
if (server.getServerId() == localServer.getServerId()) {
continue;
}
Peer peer1 = peerMap.get(server.getServerId());
if (peer1.isVoteGranted() != null && peer1.isVoteGranted() == true) {
voteGrantedNum += 1;
}
}
LOG.info("preVoteGrantedNum={}", voteGrantedNum);
if (voteGrantedNum > configuration.getServersCount() / 2) {
LOG.info("get majority pre vote, serverId={} when pre vote, start vote",
localServer.getServerId());
startVote();
}
} else {
LOG.info("pre vote denied by server {} with term {}, my term is {}",
peer.getServer().getServerId(), response.getTerm(), currentTerm);
}
}
} finally {
lock.unlock();
}
}
@Override
public void fail(Throwable e) {
LOG.warn("pre vote with peer[{}:{}] failed",
peer.getServer().getEndpoint().getHost(),
peer.getServer().getEndpoint().getPort());
peer.setVoteGranted(new Boolean(false));
}
The two key calls in the pre-vote step are stepDown and startVote(); let's look at stepDown first.
// in lock
// executed when this node's term is lower than another node's, or when handling a heartbeat from the leader
// demote to follower, stop the heartbeat timer, restart the election timer
public void stepDown(long newTerm) {
if (currentTerm > newTerm) {
LOG.error("can't be happened");
return;
}
// if the terms differ, reset the node's internal leader/vote metadata
if (currentTerm < newTerm) {
currentTerm = newTerm;
leaderId = 0;
votedFor = 0;
raftLog.updateMetaData(currentTerm, votedFor, null, null);
}
// demote to follower
state = NodeState.STATE_FOLLOWER;
// stop heartbeat
if (heartbeatScheduledFuture != null && !heartbeatScheduledFuture.isDone()) {
heartbeatScheduledFuture.cancel(true);
}
// restart the election timer
resetElectionTimer();
}
stepDown has two main purposes:
1. Synchronize the node's term: as soon as this node discovers its term is lower than another node's, it resets its metadata and demotes itself.
2. Reset the election timer countdown whenever a heartbeat from the leader is received.
Now look at startVote(). This step is almost the same as pre-vote; the differences are that the term is incremented and the node state becomes STATE_CANDIDATE.
/**
 * The client side initiates the formal vote; only valid for a candidate.
*/
private void startVote() {
lock.lock();
try {
if (!ConfigurationUtils.containsServer(configuration, localServer.getServerId())) {
resetElectionTimer();
return;
}
// increment the term
currentTerm++;
LOG.info("Running for election in term {}", currentTerm);
state = NodeState.STATE_CANDIDATE;
leaderId = 0;
votedFor = localServer.getServerId();
} finally {
lock.unlock();
}
for (RaftProto.Server server : configuration.getServersList()) {
if (server.getServerId() == localServer.getServerId()) {
continue;
}
final Peer peer = peerMap.get(server.getServerId());
executorService.submit(new Runnable() {
@Override
public void run() {
// send the formal vote request
requestVote(peer);
}
});
}
}
The RPC logic of the vote phase is identical to the pre-vote phase; the only difference is that once more than 1/2 of the votes are granted, the method executed is no longer startVote() but becomeLeader().
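As a rough sketch, the vote-phase success callback can be imagined as a copy of the pre-vote callback quoted earlier, with the candidate state and the final call changed (the names here follow that callback; the repository's actual code may differ in detail):
@Override
public void success(RaftProto.VoteResponse response) {
    lock.lock();
    try {
        peer.setVoteGranted(response.getGranted());
        // stale response: the term has moved on, or this node is no longer a candidate
        if (currentTerm != request.getTerm() || state != NodeState.STATE_CANDIDATE) {
            LOG.info("ignore requestVote RPC result");
            return;
        }
        if (response.getTerm() > currentTerm) {
            // someone has a newer term: fall back to follower
            stepDown(response.getTerm());
        } else if (response.getGranted()) {
            // count our own vote plus every peer that granted one
            int voteGrantedNum = 1;
            for (RaftProto.Server server : configuration.getServersList()) {
                if (server.getServerId() == localServer.getServerId()) {
                    continue;
                }
                Peer peer1 = peerMap.get(server.getServerId());
                if (peer1.isVoteGranted() != null && peer1.isVoteGranted()) {
                    voteGrantedNum += 1;
                }
            }
            if (voteGrantedNum > configuration.getServersCount() / 2) {
                // majority reached: this is where vote differs from pre-vote
                becomeLeader();
            }
        }
    } finally {
        lock.unlock();
    }
}
becomeLeader() itself is shown next: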
// in lock
private void becomeLeader() {
// change the node state to leader
state = NodeState.STATE_LEADER;
leaderId = localServer.getServerId();
// stop vote timer
if (electionScheduledFuture != null && !electionScheduledFuture.isDone()) {
electionScheduledFuture.cancel(true);
}
// start heartbeat timer: periodically send "I am the leader" heartbeats
startNewHeartbeat();
}
// in lock, start heartbeats; only valid for the leader
private void startNewHeartbeat() {
LOG.debug("start new heartbeat, peers={}", peerMap.keySet());
for (final Peer peer : peerMap.values()) {
executorService.submit(new Runnable() {
@Override
public void run() {
// send the "I am the leader" request to the other peers
appendEntries(peer);
}
});
}
resetHeartbeatTimer();
}
// heartbeat timer, append entries
// in lock
// heartbeat timer, 500 ms
private void resetHeartbeatTimer() {
if (heartbeatScheduledFuture != null && !heartbeatScheduledFuture.isDone()) {
heartbeatScheduledFuture.cancel(true);
}
heartbeatScheduledFuture = scheduledExecutorService.schedule(new Runnable() {
@Override
public void run() {
startNewHeartbeat();
}
// 500ms
}, raftOptions.getHeartbeatPeriodMilliseconds(), TimeUnit.MILLISECONDS);
}
After the node becomes leader, it first changes its state to STATE_LEADER, and then asserts its leadership every 500 ms via appendEntries(peer):
public void appendEntries(Peer peer) {
...
lock.lock();
try {
long firstLogIndex = raftLog.getFirstLogIndex();
Validate.isTrue(peer.getNextIndex() >= firstLogIndex);
prevLogIndex = peer.getNextIndex() - 1;
long prevLogTerm;
if (prevLogIndex == 0) {
prevLogTerm = 0;
} else if (prevLogIndex == lastSnapshotIndex) {
prevLogTerm = lastSnapshotTerm;
} else {
prevLogTerm = raftLog.getEntryTerm(prevLogIndex);
}
requestBuilder.setServerId(localServer.getServerId());
requestBuilder.setTerm(currentTerm);
requestBuilder.setPrevLogTerm(prevLogTerm);
requestBuilder.setPrevLogIndex(prevLogIndex);
numEntries = packEntries(peer.getNextIndex(), requestBuilder);
requestBuilder.setCommitIndex(Math.min(commitIndex, prevLogIndex + numEntries));
} finally {
lock.unlock();
}
RaftProto.AppendEntriesRequest request = requestBuilder.build();
// send the request to the peer over RPC
RaftProto.AppendEntriesResponse response = peer.getRaftConsensusServiceAsync().appendEntries(request);
lock.lock();
try {
// ----------------- after a network partition, split-brain can occur ---------------------
if (response == null) {
LOG.warn("appendEntries with peer[{}:{}] failed",
peer.getServer().getEndpoint().getHost(),
peer.getServer().getEndpoint().getPort());
// check whether the peer is still part of the configuration
if (!ConfigurationUtils.containsServer(configuration, peer.getServer().getServerId())) {
peerMap.remove(peer.getServer().getServerId());
peer.getRpcClient().stop();
}
return;
}
LOG.info("AppendEntries response[{}] from server {} " +
"in term {} (my term is {})",
response.getResCode(), peer.getServer().getServerId(),
response.getTerm(), currentTerm);
// our term is lower than the peer's term
if (response.getTerm() > currentTerm) {
stepDown(response.getTerm());
} else {
if (response.getResCode() == RaftProto.ResCode.RES_CODE_SUCCESS) {
peer.setMatchIndex(prevLogIndex + numEntries);
peer.setNextIndex(peer.getMatchIndex() + 1);
// if the peer is in the configuration
if (ConfigurationUtils.containsServer(configuration, peer.getServer().getServerId())) {
// related to advancing the state machine; not examined in detail here
advanceCommitIndex();
} else {
if (raftLog.getLastLogIndex() - peer.getMatchIndex() <= raftOptions.getCatchupMargin()) {
LOG.debug("peer catch up the leader");
peer.setCatchUp(true);
// signal the caller thread
catchUpCondition.signalAll();
}
}
} else {
peer.setNextIndex(response.getLastLogIndex() + 1);
}
}
} finally {
lock.unlock();
}
}
The appendEntries(peer) method above has its log-replication code removed. Notice that when the leader is partitioned from the network and receives no replies from its peers, it does not change its own state; fetching the leaderIp from this node will still return the old leader, while the other nodes may have already elected a new one, which is exactly the Raft split-brain phenomenon.
Let's now look at how a peer handles the leader's appendEntries request.
// if our own term is higher, return failure together with our term
// otherwise stepDown
// if no leader is known yet, the sender becomes our leader
// if the locally stored leader differs from the sender, bump the term by one, stepDown, and a new round of election will start
@Override
public RaftProto.AppendEntriesResponse appendEntries(RaftProto.AppendEntriesRequest request) {
raftNode.getLock().lock();
try {
RaftProto.AppendEntriesResponse.Builder responseBuilder
= RaftProto.AppendEntriesResponse.newBuilder();
responseBuilder.setTerm(raftNode.getCurrentTerm());
responseBuilder.setResCode(RaftProto.ResCode.RES_CODE_FAIL);
responseBuilder.setLastLogIndex(raftNode.getRaftLog().getLastLogIndex());
if (request.getTerm() < raftNode.getCurrentTerm()) {
return responseBuilder.build();
}
// reset the election countdown
raftNode.stepDown(request.getTerm());
// if no leader is known yet, record the node in the request as the leader
if (raftNode.getLeaderId() == 0) {
raftNode.setLeaderId(request.getServerId());
LOG.info("new leaderId={}, conf={}",
raftNode.getLeaderId(),
PRINTER.printToString(raftNode.getConfiguration()));
}
// if the locally stored leader differs from the leader in the request, trigger a new round of election
if (raftNode.getLeaderId() != request.getServerId()) {
LOG.warn("Another peer={} declares that it is the leader " +
"at term={} which was occupied by leader={}",
request.getServerId(), request.getTerm(), raftNode.getLeaderId());
raftNode.stepDown(request.getTerm() + 1);
responseBuilder.setResCode(RaftProto.ResCode.RES_CODE_FAIL);
responseBuilder.setTerm(request.getTerm() + 1);
return responseBuilder.build();
}
// log-replication related code removed
...
}
Finally, look at the getLeader method the component exposes externally:
@Override
public RaftProto.GetLeaderResponse getLeader(RaftProto.GetLeaderRequest request) {
LOG.info("receive getLeader request");
RaftProto.GetLeaderResponse.Builder responseBuilder = RaftProto.GetLeaderResponse.newBuilder();
responseBuilder.setResCode(RaftProto.ResCode.RES_CODE_SUCCESS);
RaftProto.Endpoint.Builder endPointBuilder = RaftProto.Endpoint.newBuilder();
raftNode.getLock().lock();
try {
int leaderId = raftNode.getLeaderId();
if (leaderId == 0) {
responseBuilder.setResCode(RaftProto.ResCode.RES_CODE_FAIL);
} else if (leaderId == raftNode.getLocalServer().getServerId()) {
endPointBuilder.setHost(raftNode.getLocalServer().getEndpoint().getHost());
endPointBuilder.setPort(raftNode.getLocalServer().getEndpoint().getPort());
} else {
RaftProto.Configuration configuration = raftNode.getConfiguration();
for (RaftProto.Server server : configuration.getServersList()) {
if (server.getServerId() == leaderId) {
endPointBuilder.setHost(server.getEndpoint().getHost());
endPointBuilder.setPort(server.getEndpoint().getPort());
break;
}
}
}
} finally {
raftNode.getLock().unlock();
}
responseBuilder.setLeader(endPointBuilder.build());
RaftProto.GetLeaderResponse response = responseBuilder.build();
LOG.info("getLeader response={}", jsonFormat.printToString(response));
return responseBuilder.build();
}
That is, it simply looks up the leaderId in RaftNode and returns the corresponding leader's ip + port.
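Because of the split-brain behavior described earlier, a caller that cannot tolerate a stale answer should not trust a single node's getLeader reply. The following is a purely illustrative sketch (not part of the component) that queries every node and only accepts a leader acknowledged by a majority; how each RaftClientService stub is constructed is transport-specific and omitted, and the helper name findAgreedLeader is hypothetical:
// Illustrative only: cross-check getLeader across all nodes and require majority agreement.
public static RaftProto.Endpoint findAgreedLeader(List<RaftClientService> clients, int clusterSize) {
    Map<String, Integer> votes = new HashMap<>();
    RaftProto.GetLeaderRequest request = RaftProto.GetLeaderRequest.newBuilder().build();
    for (RaftClientService client : clients) {
        try {
            RaftProto.GetLeaderResponse response = client.getLeader(request);
            if (response == null
                    || response.getResCode() != RaftProto.ResCode.RES_CODE_SUCCESS) {
                continue; // this node does not know a leader
            }
            RaftProto.Endpoint leader = response.getLeader();
            String key = leader.getHost() + ":" + leader.getPort();
            int agreeCount = votes.merge(key, 1, Integer::sum);
            if (agreeCount > clusterSize / 2) {
                return leader; // a majority of nodes agree on this leader
            }
        } catch (Exception e) {
            // unreachable node: skip it and keep asking the others
        }
    }
    return null; // no majority agreement; treat the leader as unknown
}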