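RocketMQ DLedger automatic master-slave failover
Since version 4.5, RocketMQ has provided automatic failover: when the master of a master-slave cluster fails, a new master is elected automatically from the slaves and failover completes without manual intervention.
RocketMQ implements automatic failover with DLedger, a Raft-based commitLog storage library whose two main parts are master election and log replication.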
*********************
Master election
Node states: leader, candidate, follower
public class MemberState {
...
public static enum Role {
UNKNOWN,
CANDIDATE,
LEADER,
FOLLOWER;
private Role() {
}
}
}
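leader: accepts client requests, writes log data locally and replicates it to the followers; periodically sends heartbeats to the followers to maintain its leader state
candidate: the intermediate state a node enters after the master fails; only nodes in the candidate state send vote requests, and once the election completes the node becomes either leader or follower
follower: synchronizes log data from the leader; on receiving a leader heartbeat it resets its countdown timer to stay a follower and returns a heartbeat response to the leader
Master election is triggered when:
the cluster first starts: all nodes are in the candidate state, so a master has to be elected;
the master fails, or a network fault leaves more than half of the followers without heartbeats, and their countdown timers expire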
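The timer-driven trigger can be sketched as a single comparison. This is a minimal illustration, not DLedger's code: the class and method names are hypothetical, the parameter names are borrowed from the fields that appear in the heartbeat code later in this post, and the assumption is that the countdown expires after maxHeartBeatLeak heartbeat intervals without a leader heartbeat. The values 2000 ms and 3 are only example inputs.

```java
// Sketch only: class and method names are hypothetical, not DLedger's API.
final class FollowerTimeoutSketch {
    enum Role { CANDIDATE, LEADER, FOLLOWER }

    /**
     * Returns the role a follower should hold at time nowMs.
     * Assumption: the election timer expires once no leader heartbeat has
     * arrived for more than maxHeartBeatLeak heartbeat intervals.
     */
    static Role nextRole(long nowMs, long lastLeaderHeartBeatTimeMs,
                         long heartBeatTimeIntervalMs, int maxHeartBeatLeak) {
        long silentMs = nowMs - lastLeaderHeartBeatTimeMs;
        long timeoutMs = (long) maxHeartBeatLeak * heartBeatTimeIntervalMs;
        return silentMs > timeoutMs ? Role.CANDIDATE : Role.FOLLOWER;
    }

    public static void main(String[] args) {
        // Example values: 2000 ms heartbeat interval, 3 tolerated misses.
        System.out.println(nextRole(5_000, 0, 2_000, 3)); // FOLLOWER
        System.out.println(nextRole(6_001, 0, 2_000, 3)); // CANDIDATE
    }
}
```

Each heartbeat received from the leader moves lastLeaderHeartBeatTimeMs forward, which is what "resetting the countdown timer" amounts to in this sketch.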
****************
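Master election process
When a follower's countdown timer expires, it switches to the candidate state and sends a vote request to itself and to every other node (it votes in favor of itself);
A node that receives a vote request rejects it if any of the following conditions holds:
reject_already_voted: the current node has already voted;
reject_already_has_leader: the cluster has already elected a leader;
reject_expired_term: the requester's vote term is smaller than the current node's vote term;
reject_term_not_ready: the requester's vote term is greater than the current node's vote term;
reject_term_small_than_ledger: the requester's vote term is smaller than the current node's log term (ledgerEndTerm);
reject_expired_ledger: the requester's log term (ledgerEndTerm) is smaller than the current node's log term (ledgerEndTerm);
reject_small_ledger_end_index: the requester's log term (ledgerEndTerm) equals the current node's, but its log index (ledgerEndIndex) is smaller than the current node's
Otherwise, the current node votes for the requesting node as master (accepted)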
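The rejection rules above can be written as one pure function. The following is a hedged sketch rather than DLedger's implementation: the type, field and enum names are illustrative, and the order in which the checks run is an assumption of this sketch (the list above only states the conditions). It assumes that "already voted" and "already has a leader" refer to the same vote term as the request.

```java
// Sketch only: names are illustrative and the order of the checks is an
// assumption; this is not DLedger's source code.
final class VoteDecisionSketch {
    enum Result {
        ACCEPTED,
        REJECT_ALREADY_VOTED,
        REJECT_ALREADY_HAS_LEADER,
        REJECT_EXPIRED_TERM,
        REJECT_TERM_NOT_READY,
        REJECT_TERM_SMALL_THAN_LEDGER,
        REJECT_EXPIRED_LEDGER,
        REJECT_SMALL_LEDGER_END_INDEX
    }

    /** State of the node that receives the vote request. */
    static final class Local {
        final long currTerm;
        final String votedFor;     // null if no vote was cast in currTerm
        final String leaderId;     // null if no leader is known
        final long ledgerEndTerm;
        final long ledgerEndIndex;

        Local(long currTerm, String votedFor, String leaderId,
              long ledgerEndTerm, long ledgerEndIndex) {
            this.currTerm = currTerm;
            this.votedFor = votedFor;
            this.leaderId = leaderId;
            this.ledgerEndTerm = ledgerEndTerm;
            this.ledgerEndIndex = ledgerEndIndex;
        }
    }

    static Result decide(Local local, String candidateId, long term,
                         long ledgerEndTerm, long ledgerEndIndex) {
        // Vote-term checks.
        if (term < local.currTerm) return Result.REJECT_EXPIRED_TERM;
        if (term > local.currTerm) return Result.REJECT_TERM_NOT_READY;
        // Same vote term from here on.
        if (local.leaderId != null) return Result.REJECT_ALREADY_HAS_LEADER;
        if (local.votedFor != null && !local.votedFor.equals(candidateId)) {
            return Result.REJECT_ALREADY_VOTED;
        }
        // Log checks: the candidate's log must be at least as new as ours.
        if (term < local.ledgerEndTerm) return Result.REJECT_TERM_SMALL_THAN_LEDGER;
        if (ledgerEndTerm < local.ledgerEndTerm) return Result.REJECT_EXPIRED_LEDGER;
        if (ledgerEndTerm == local.ledgerEndTerm && ledgerEndIndex < local.ledgerEndIndex) {
            return Result.REJECT_SMALL_LEDGER_END_INDEX;
        }
        return Result.ACCEPTED;
    }
}
```

For example, a node at vote term 3 with ledgerEndTerm 2 and ledgerEndIndex 10 accepts a request carrying the same three values, and rejects the same request with reject_small_ledger_end_index if the candidate's ledgerEndIndex is 9.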
Handling the vote results:
The DLedgerLeaderElector.maintainAsCandidate() method
final AtomicInteger allNum = new AtomicInteger(0); //total number of vote responses
final AtomicInteger validNum = new AtomicInteger(0); //valid votes
final AtomicInteger acceptedNum = new AtomicInteger(0); //votes in favor
final AtomicInteger notReadyTermNum = new AtomicInteger(0); //peers not ready to vote: the requester's vote term is greater than the remote node's, so the remote node returns reject_term_not_ready
final AtomicInteger biggerLedgerNum = new AtomicInteger(0); //peers with a newer log: the requester's ledgerEndTerm is smaller than the remote node's, or the terms are equal and the requester's ledgerEndIndex is smaller
final AtomicBoolean alreadyHasLeader = new AtomicBoolean(false); //the cluster already has a leader
if (knownMaxTermInGroup.get() > term) {
parseResult = VoteResponse.ParseResult.WAIT_TO_VOTE_NEXT;
nextTimeToRequestVote = getNextTimeToRequestVote();
changeRoleToCandidate(knownMaxTermInGroup.get());
} else if (alreadyHasLeader.get()) {
parseResult = VoteResponse.ParseResult.WAIT_TO_VOTE_NEXT;
nextTimeToRequestVote = getNextTimeToRequestVote() + heartBeatTimeIntervalMs * maxHeartBeatLeak;
} else if (!memberState.isQuorum(validNum.get())) {
parseResult = VoteResponse.ParseResult.WAIT_TO_REVOTE;
nextTimeToRequestVote = getNextTimeToRequestVote();
} else if (memberState.isQuorum(acceptedNum.get())) {
parseResult = VoteResponse.ParseResult.PASSED;
} else if (memberState.isQuorum(acceptedNum.get() + notReadyTermNum.get())) {
parseResult = VoteResponse.ParseResult.REVOTE_IMMEDIATELY;
} else if (memberState.isQuorum(acceptedNum.get() + biggerLedgerNum.get())) {
parseResult = VoteResponse.ParseResult.WAIT_TO_REVOTE;
nextTimeToRequestVote = getNextTimeToRequestVote();
} else {
parseResult = VoteResponse.ParseResult.WAIT_TO_VOTE_NEXT;
nextTimeToRequestVote = getNextTimeToRequestVote();
}
lastParseResult = parseResult;
logger.info("[{}] [PARSE_VOTE_RESULT] cost={} term={} memberNum={} allNum={} acceptedNum={} notReadyTermNum={} biggerLedgerNum={} alreadyHasLeader={} maxTerm={} result={}",
memberState.getSelfId(), lastVoteCost, term, memberState.peerSize(), allNum, acceptedNum, notReadyTermNum, biggerLedgerNum, alreadyHasLeader, knownMaxTermInGroup.get(), parseResult);
if (parseResult == VoteResponse.ParseResult.PASSED) {
logger.info("[{}] [VOTE_RESULT] has been elected to be the leader in term {}", memberState.getSelfId(), term);
changeRoleToLeader(term);
}
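Election succeeds (PASSED): the votes in favor (acceptedNum) exceed half of the nodes
Revote immediately (REVOTE_IMMEDIATELY): acceptedNum + notReadyTermNum exceeds half
Revote in the same vote term (WAIT_TO_REVOTE): the valid votes (validNum) do not exceed half, or acceptedNum + biggerLedgerNum exceeds half
Revote with an incremented vote term (WAIT_TO_VOTE_NEXT): the requester's vote term is smaller than the largest vote term in the cluster, the cluster already has a leader (in which case the node becomes a follower once it receives the leader's heartbeat), and all remaining cases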
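To make the branch order concrete, the decision can be restated as a standalone function and tried on a three-node cluster. This is a sketch under stated assumptions: the class name and parameter list are made up for illustration, and isQuorum is assumed to mean "more than half of the peers", which matches the wording above but is not copied from DLedger.

```java
// Sketch only: a standalone restatement of the branches shown above.
// Assumption: a quorum is a strict majority of the peers.
final class VoteResultSketch {
    enum ParseResult { WAIT_TO_REVOTE, REVOTE_IMMEDIATELY, PASSED, WAIT_TO_VOTE_NEXT }

    static boolean isQuorum(int num, int peerSize) {
        return num >= peerSize / 2 + 1;
    }

    static ParseResult parse(int peerSize, long term, long knownMaxTermInGroup,
                             boolean alreadyHasLeader, int validNum, int acceptedNum,
                             int notReadyTermNum, int biggerLedgerNum) {
        if (knownMaxTermInGroup > term) return ParseResult.WAIT_TO_VOTE_NEXT;
        if (alreadyHasLeader) return ParseResult.WAIT_TO_VOTE_NEXT;
        if (!isQuorum(validNum, peerSize)) return ParseResult.WAIT_TO_REVOTE;
        if (isQuorum(acceptedNum, peerSize)) return ParseResult.PASSED;
        if (isQuorum(acceptedNum + notReadyTermNum, peerSize)) return ParseResult.REVOTE_IMMEDIATELY;
        if (isQuorum(acceptedNum + biggerLedgerNum, peerSize)) return ParseResult.WAIT_TO_REVOTE;
        return ParseResult.WAIT_TO_VOTE_NEXT;
    }

    public static void main(String[] args) {
        // Three nodes, vote term 5, no leader known.
        System.out.println(parse(3, 5, 5, false, 3, 2, 0, 0)); // PASSED
        System.out.println(parse(3, 5, 5, false, 3, 1, 1, 0)); // REVOTE_IMMEDIATELY
        System.out.println(parse(3, 5, 6, false, 3, 2, 0, 0)); // WAIT_TO_VOTE_NEXT
    }
}
```

The third call shows why the term check comes first: two accepted votes out of three would otherwise pass, but a larger term seen in the group means those votes belong to a stale term.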
****************
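Node state changes
candidate state
becomes leader: it receives accepted votes from more than half of the nodes during the election
becomes follower: the cluster already has a leader and the candidate receives that leader's heartbeat
stays candidate: all other cases, in which another round of voting is needed
leader state: the leader sends heartbeats and, depending on the heartbeat responses, either stays leader or becomes a candidate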
private void maintainAsLeader() throws Exception {
if (DLedgerUtils.elapsed(lastSendHeartBeatTime) > heartBeatTimeIntervalMs) {
//the heartbeat interval has elapsed, so send a heartbeat
long term;
String leaderId;
synchronized (memberState) {
if (!memberState.isLeader()) { //a node that is not the leader returns immediately
//stop sending
return;
}
term = memberState.currTerm();
leaderId = memberState.getLeaderId();
lastSendHeartBeatTime = System.currentTimeMillis();
}
sendHeartbeats(term, leaderId); //the leader sends heartbeats
}
}
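private void sendHeartbeats(long term, String leaderId) throws Exception { //the leader sends heartbeats
final AtomicInteger allNum = new AtomicInteger(1); //number of nodes accounted for (starts at 1 for the leader itself)
final AtomicInteger succNum = new AtomicInteger(1); //number of nodes that answered SUCCESS (starts at 1 for the leader itself)
final AtomicInteger notReadyNum = new AtomicInteger(0); //number of nodes whose vote term is smaller than the heartbeat's vote term
final AtomicLong maxTerm = new AtomicLong(-1); //largest vote term reported by the other nodes
final AtomicBoolean inconsistLeader = new AtomicBoolean(false); //whether a node reported a different leader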
final CountDownLatch beatLatch = new CountDownLatch(1);
long startHeartbeatTimeMs = System.currentTimeMillis();
for (String id : memberState.getPeerMap().keySet()) {
if (memberState.getSelfId().equals(id)) {
continue;
}
HeartBeatRequest heartBeatRequest = new HeartBeatRequest();
heartBeatRequest.setGroup(memberState.getGroup());
heartBeatRequest.setLocalId(memberState.getSelfId()); //id of the current node
heartBeatRequest.setRemoteId(id); //id of the remote node that receives the heartbeat
heartBeatRequest.setLeaderId(leaderId); //leaderId known to the current node
heartBeatRequest.setTerm(term); //vote term of the current node
CompletableFuture<HeartBeatResponse> future = dLedgerRpcService.heartBeat(heartBeatRequest);
//handle the heartbeat response
future.whenComplete((HeartBeatResponse x, Throwable ex) -> {
try {
if (ex != null) {
throw ex;
}
switch (DLedgerResponseCode.valueOf(x.getCode())) {
case SUCCESS: //count the nodes that answered SUCCESS
succNum.incrementAndGet();
break;
case EXPIRED_TERM: //the sender's vote term is smaller than the receiver's vote term
maxTerm.set(x.getTerm()); //record the larger term reported by the receiver
break;
case INCONSISTENT_LEADER: //the leaders known in the cluster are inconsistent
inconsistLeader.compareAndSet(false, true);
break;
case TERM_NOT_READY: //the sender's vote term is greater than the receiver's vote term
notReadyNum.incrementAndGet();
break;
default:
break;
}
if (memberState.isQuorum(succNum.get())
|| memberState.isQuorum(succNum.get() + notReadyNum.get())) {
//if more than half of the nodes answered SUCCESS,
//or succNum + notReadyNum exceeds half (in which case the heartbeat is resent right away), count the latch down
beatLatch.countDown();
}
} catch (Throwable t) {
logger.error("Parse heartbeat response failed", t);
} finally {
allNum.incrementAndGet();
if (allNum.get() == memberState.peerSize()) {
//every node in the cluster has responded, so count the latch down
beatLatch.countDown();
}
}
});
}
beatLatch.await(heartBeatTimeIntervalMs, TimeUnit.MILLISECONDS);
//wait at most one heartbeat interval; the wait ends early as soon as one of these holds:
//memberState.isQuorum(succNum.get())
//memberState.isQuorum(succNum.get() + notReadyNum.get())
//allNum.get() == memberState.peerSize()
if (memberState.isQuorum(succNum.get())) {
//more than half of the nodes answered SUCCESS: record the time of this successful heartbeat; the node stays leader
lastSuccHeartBeatTime = System.currentTimeMillis();
} else {
logger.info("[{}] Parse heartbeat responses in cost={} term={} allNum={} succNum={} notReadyNum={} inconsistLeader={} maxTerm={} peerSize={} lastSuccHeartBeatTime={}",
memberState.getSelfId(), DLedgerUtils.elapsed(startHeartbeatTimeMs), term, allNum.get(), succNum.get(), notReadyNum.get(), inconsistLeader.get(), maxTerm.get(), memberState.peerSize(), new Timestamp(lastSuccHeartBeatTime));
if (memberState.isQuorum(succNum.get() + notReadyNum.get())) {
lastSendHeartBeatTime = -1;
} else if (maxTerm.get() > term) {
changeRoleToCandidate(maxTerm.get());
} else if (inconsistLeader.get()) {
changeRoleToCandidate(term);
} else if (DLedgerUtils.elapsed(lastSuccHeartBeatTime) > maxHeartBeatLeak * heartBeatTimeIntervalMs) {
changeRoleToCandidate(term);
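} //the leader becomes a candidate if the largest vote term in the cluster is greater than its own vote term,
//if an inconsistent leader was reported,
//or if the time since the last successful heartbeat exceeds maxHeartBeatLeak * heartBeatTimeIntervalMs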
}
}
public CompletableFuture<HeartBeatResponse> handleHeartBeat(HeartBeatRequest request) throws Exception {
//handle a heartbeat sent by the leader
if (!memberState.isPeerMember(request.getLeaderId())) {
//if the leaderId in the request is not a member of the cluster, return UNKNOWN_MEMBER
logger.warn("[BUG] [HandleHeartBeat] remoteId={} is an unknown member", request.getLeaderId());
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.UNKNOWN_MEMBER.getCode()));
}
if (memberState.getSelfId().equals(request.getLeaderId())) {
//if the current node's id equals the leaderId in the request (a leader does not send heartbeats to itself), return UNEXPECTED_MEMBER
logger.warn("[BUG] [HandleHeartBeat] selfId={} but remoteId={}", memberState.getSelfId(), request.getLeaderId());
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.UNEXPECTED_MEMBER.getCode()));
}
if (request.getTerm() < memberState.currTerm()) {
//if the requester's vote term is smaller than the current node's vote term, return EXPIRED_TERM
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.EXPIRED_TERM.getCode()));
} else if (request.getTerm() == memberState.currTerm()) {
//the current node's vote term equals the requester's
if (request.getLeaderId().equals(memberState.getLeaderId())) {
//the leaderIds match as well, so return SUCCESS
lastLeaderHeartBeatTime = System.currentTimeMillis();
return CompletableFuture.completedFuture(new HeartBeatResponse());
}
}
//abnormal case
//hold the lock to get the latest term and leaderId
synchronized (memberState) { //abnormal case: take the lock on the node state, then re-check against the latest vote term and leaderId
if (request.getTerm() < memberState.currTerm()) { //the requester's vote term is smaller than the current node's vote term: return EXPIRED_TERM
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.EXPIRED_TERM.getCode()));
} else if (request.getTerm() == memberState.currTerm()) { //the requester's vote term equals the current node's vote term
if (memberState.getLeaderId() == null) {
//the current node's leaderId is null (the node is a candidate): switch it to follower and return SUCCESS
changeRoleToFollower(request.getTerm(), request.getLeaderId());
return CompletableFuture.completedFuture(new HeartBeatRespo