[raft] Study Notes 6: etcd/raft Elections and Election Optimizations

从未想放弃

Last modified 2022-04-25 16:58:46

First published 2022-04-15 17:00:45

Article link: https://blog.csdn.net/qq_40859492/article/details/124179468

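Preface

It is time again for Wangjia Village's annual election of the village secretary. By long-standing custom, every villager who legally holds the right to vote and to stand for election may take part. The post is highly regarded in the village, so the election is fiercely contested every year. It is also a yearly headache for the election committee: most villagers live away from home, which regularly causes problems during an election, so the village holds its election online. Many villagers sign up each year, but all of them follow the rules set by the election committee, and the committee itself is impartial. Even so, the committee faces the following questions:

1. How is an election started, and how does it proceed in an orderly way?
2. How do we choose between candidates with equal ability and influence?
3. How do we decide whether someone is eligible to run?
4. How does each voter learn a candidate's ability and influence?
5. How do we elect a secretary in the shortest possible time, that is, without pointless elections?

Just as the election committee was struggling with these questions, a sharp programmer returned to the village. Hearing that an election was under way, and knowing that the villagers are honest and impartial, the programmer decided to bring the Raft election he knew well into the village election. With these questions in mind, let's see how Raft would run an election for village secretary.

Applying Raft to the Wangjia Village Election

Basics

Since this is an election for village secretary, there are naturally different roles, different phases, and at most one secretary at a time.
Raft has the following roles and corresponding phases. In etcd/raft's abstraction they look like this: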

func (r *raft) becomeFollower(term uint64, lead uint64) {
	r.step = stepFollower
	r.reset(term)
	r.tick = r.tickElection
	r.lead = lead
	r.state = StateFollower
	r.logger.Infof("%x became follower at term %d", r.id, r.Term)
}
func (r *raft) becomeCandidate() {
	// TODO(xiangli) remove the panic when the raft implementation is stable
	if r.state == StateLeader {
		panic("invalid transition [leader -> candidate]")
	}
	r.step = stepCandidate
	r.reset(r.Term + 1)
	r.tick = r.tickElection
	r.Vote = r.id
	r.state = StateCandidate
	r.logger.Infof("%x became candidate at term %d", r.id, r.Term)
}
func (r *raft) becomePreCandidate() {
	// TODO(xiangli) remove the panic when the raft implementation is stable
	if r.state == StateLeader {
		panic("invalid transition [leader -> pre-candidate]")
	}
	// Becoming a pre-candidate changes our step functions and state,
	// but doesn't change anything else. In particular it does not increase
	// r.Term or change r.Vote.
	r.step = stepCandidate
	r.prs.ResetVotes()
	r.tick = r.tickElection
	r.lead = None
	r.state = StatePreCandidate
	r.logger.Infof("%x became pre-candidate at term %d", r.id, r.Term)
}
func (r *raft) becomeLeader() {
	// TODO(xiangli) remove the panic when the raft implementation is stable
	if r.state == StateFollower {
		panic("invalid transition [follower -> leader]")
	}
	r.step = stepLeader
	r.reset(r.Term)
	r.tick = r.tickHeartbeat
	r.lead = r.id
	r.state = StateLeader
	// Followers enter replicate mode when they've been successfully probed
	// (perhaps after having received a snapshot as a result). The leader is
	// trivially in this state. Note that r.reset() has initialized this
	// progress with the last index already.
	r.prs.Progress[r.id].BecomeReplicate()

	// Conservatively set the pendingConfIndex to the last index in the
	// log. There may or may not be a pending config change, but it's
	// safe to delay any future proposals until we commit all our
	// pending log entries, and scanning the entire tail of the log
	// could be expensive.
	r.pendingConfIndex = r.raftLog.lastIndex()

	emptyEnt := pb.Entry{Data: nil}
	if !r.appendEntry(emptyEnt) {
		// This won't happen because we just called reset() above.
		r.logger.Panic("empty entry was dropped")
	}
	// As a special case, don't count the initial empty entry towards the
	// uncommitted log quota. This is because we want to preserve the
	// behavior of allowing one entry larger than quota if the current
	// usage is zero.
	r.reduceUncommittedSize([]pb.Entry{emptyEnt})
	r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}

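From the becomeXXX functions above we can see that an election involves four roles and, correspondingly, four phases:

Role	Phase
Leader (village secretary)	A leader has been elected; the election is over and the other nodes are informed
Follower (villager with the right to vote)	Waits for messages from the leader or for vote requests; may start an election or a pre-election
Candidate	Has become a candidate and tells the other nodes that it is running in the real election
PreCandidate (preliminary candidate)	Asks the other nodes whether it would be eligible to run, without yet increasing its term or changing its vote
With these states clear, let's look at how Raft answers the first question.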
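
As a quick way to read the guards in those becomeXXX functions, here is a minimal sketch of the role transitions they permit. `canTransition` is a hypothetical helper, not part of etcd/raft, and it encodes only the panics visible in the code quoted above; the real library restricts transitions further through its message handling.

```go
package main

import "fmt"

// StateType mirrors the four roles named in the becomeXXX functions.
type StateType int

const (
	StateFollower StateType = iota
	StatePreCandidate
	StateCandidate
	StateLeader
)

func (s StateType) String() string {
	return [...]string{"Follower", "PreCandidate", "Candidate", "Leader"}[s]
}

// canTransition is a hypothetical helper (not an etcd/raft API). It reports
// whether the quoted becomeXXX function for `to` would run without panicking
// when the node is currently in state `from`.
func canTransition(from, to StateType) bool {
	switch to {
	case StateCandidate, StatePreCandidate:
		return from != StateLeader // leader -> (pre-)candidate panics
	case StateLeader:
		return from != StateFollower // follower -> leader panics
	default:
		return true // becomeFollower has no guard
	}
}

func main() {
	states := []StateType{StateFollower, StatePreCandidate, StateCandidate, StateLeader}
	for _, from := range states {
		for _, to := range states {
			fmt.Printf("%-12v -> %-12v allowed=%v\n", from, to, canTransition(from, to))
		}
	}
}
```

The point of the two guards is that a follower must campaign before it can lead, and a leader must step down before it can campaign again.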

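Defining the Election Rules

To keep the votes from being split evenly, Raft adopts the following rules:

Each voting villager casts at most one vote per election (that is, per term);
When several candidates are eligible, a voter gives its vote to the first candidate whose request arrives and rejects all later ones;
At the start of an election the participants start at random times; etcd/raft implements this with a timer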

// tickElection is run by followers and candidates after r.electionTimeout.
func (r *raft) tickElection() {
	r.electionElapsed++

	if r.promotable() && r.pastElectionTimeout() {
		r.electionElapsed = 0
		r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
	}
}

These rules largely settle questions 1 and 2 from the preface. In short: random start, first come first served, and one vote per voter.
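
The "random start" rule can be sketched as follows. This is a simplified model, not etcd/raft code: the `node` type and its methods are hypothetical, and the choice of drawing the timeout from [electionTimeout, 2*electionTimeout-1] ticks is an assumption based on my reading of etcd/raft rather than something stated in this post.

```go
package main

import (
	"fmt"
	"math/rand"
)

// node is a hypothetical stand-in for the timer fields of a Raft node.
type node struct {
	electionTimeout           int // base timeout, in ticks
	randomizedElectionTimeout int // assumed range: [electionTimeout, 2*electionTimeout-1]
	electionElapsed           int // ticks since the timer was last reset
}

// resetTimer clears the elapsed counter and draws a fresh random timeout.
// intn must return a value in [0, n), like (*rand.Rand).Intn.
func (n *node) resetTimer(intn func(int) int) {
	n.electionElapsed = 0
	n.randomizedElectionTimeout = n.electionTimeout + intn(n.electionTimeout)
}

// tick advances the timer by one tick and reports whether the node
// should start campaigning now.
func (n *node) tick() bool {
	n.electionElapsed++
	return n.electionElapsed >= n.randomizedElectionTimeout
}

func main() {
	rng := rand.New(rand.NewSource(1))
	nodes := make([]*node, 3)
	for i := range nodes {
		nodes[i] = &node{electionTimeout: 10}
		nodes[i].resetTimer(rng.Intn)
	}
	// Drive all timers in lockstep; the node with the smallest random
	// timeout campaigns first (ties go to the lowest index in this loop).
	for t := 1; ; t++ {
		for i, n := range nodes {
			if n.tick() {
				fmt.Printf("node %d campaigns first, at tick %d\n", i, t)
				return
			}
		}
	}
}
```

Because each node draws its own timeout, it is unlikely that two nodes time out on the same tick, which is what keeps split votes rare.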

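How Do We Measure a Candidate's Ability?

The village secretary manages important information, and that information is updated from one term to the next, so we borrow the following rule from Raft:

A higher term and newer information (Raft compares the term and index of the last log entry; this is the latest index in the log, not necessarily a committed one)

Put simply, when a villager asks the others for their vote, it must present both its election term and the information it holds, and both requirements must be met: its term must not be behind, and its log must be at least as up to date as the voter's. (By default every Raft node takes part in elections, although there can also be members that only do work and do not vote.) The reasons are covered later, along with a number of other special cases.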

case pb.MsgVote, pb.MsgPreVote:
		// We can vote if this is a repeat of a vote we've already cast...
		canVote := r.Vote == m.From ||
			// ...we haven't voted and we don't think there's a leader yet in this term...
			(r.Vote == None && r.lead == None) ||
			// ...or this is a PreVote for a future term...
			(m.Type == pb.MsgPreVote && m.Term > r.Term)
		// ...and we believe the candidate is up to date.
		if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
			// Note: it turns out that that learners must be allowed to cast votes.
			// This seems counter- intuitive but is necessary in the situation in which
			// a learner has been promoted (i.e. is now a voter) but has not learned
			// about this yet.
			// For example, consider a group in which id=1 is a learner and id=2 and
			// id=3 are voters. A configuration change promoting 1 can be committed on
			// the quorum `{2,3}` without the config change being appended to the
			// learner's log. If the leader (say 2) fails, there are de facto two
			// voters remaining. Only 3 can win an election (due to its log containing
			// all committed entries), but to do so it will need 1 to vote. But 1
			// considers itself a learner and will continue to do so until 3 has
			// stepped up as leader, replicates the conf change to 1, and 1 applies it.
			// Ultimately, by receiving a request to vote, the learner realizes that
			// the candidate believes it to be a voter, and that it should act
			// accordingly. The candidate's config may be stale, too; but in that case
			// it won't win the election, at least in the absence of the bug discussed
			// in:
			// https://github.com/etcd-io/etcd/issues/7625#issuecomment-488798263.
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] cast %s for %x [logterm: %d, index: %d] at term %d",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			// When responding to Msg{Pre,}Vote messages we include the term
			// from the message, not the local term. To see why, consider the
			// case where a single node was previously partitioned away and
			// it's local term is now out of date. If we include the local term
			// (recall that for pre-votes we don't update the local term), the
			// (pre-)campaigning node on the other end will proceed to ignore
			// the message (it ignores all out of date messages).
			// The term in the original message and current local term are the
			// same in the case of regular votes, but different for pre-votes.
			r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
			if m.Type == pb.MsgVote {
				// Only record real votes.
				r.electionElapsed = 0
				r.Vote = m.From
			}
		} else {
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected %s from %x [logterm: %d, index: %d] at term %d",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
		}
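
The `isUpToDate` check called in the handler above is not shown in this post. A minimal sketch of the comparison it performs, assuming it follows the usual Raft rule (compare the term of the last entry first, then the index), could look like this. The `logPos` type and the free function are hypothetical names for illustration.

```go
package main

import "fmt"

// logPos is a hypothetical pair describing the last entry of a node's log.
type logPos struct {
	term  uint64 // term of the last entry
	index uint64 // index of the last entry (not necessarily committed)
}

// isUpToDate reports whether the candidate's log is at least as up to date
// as the voter's: a higher last term wins outright, and with equal last
// terms the log that is at least as long wins.
func isUpToDate(voter, candidate logPos) bool {
	return candidate.term > voter.term ||
		(candidate.term == voter.term && candidate.index >= voter.index)
}

func main() {
	voter := logPos{term: 3, index: 10}
	fmt.Println(isUpToDate(voter, logPos{term: 4, index: 2}))  // true: higher last term wins
	fmt.Println(isUpToDate(voter, logPos{term: 3, index: 10})) // true: an equal log is acceptable
	fmt.Println(isUpToDate(voter, logPos{term: 3, index: 9}))  // false: same term, shorter log
	fmt.Println(isUpToDate(voter, logPos{term: 2, index: 99})) // false: older last term
}
```

Note that a long log does not help a candidate whose last entry comes from an older term, which is why the term is compared before the index.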

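Voter

When the election call comes in, a voter first checks whether it has already voted, whether a secretary has already been elected, and whether the candidate's information is at least as up to date as its own. It then decides whether to vote, and updates its own state according to that decision:

Grant the vote: reset its own timer for starting the next election, send the vote, and record whom it voted for;
Reject the vote, and tell the candidate which term it is currently in;

Candidate

The candidate tallies the responses as they come in. As soon as more than half of the voters have granted their vote, the election is won. The candidate changes its state according to the result:

Won: it is promoted to village secretary (called the Leader in Raft);
Lost: it turns back into an ordinary villager and waits for the next election.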
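
The candidate's tally can be sketched as below. This is an illustrative model with hypothetical names (`tally`, `voteResult`), assuming a non-empty voter set; it distinguishes "lost" from "still pending" by checking whether the votes not yet received could still produce a majority.

```go
package main

import "fmt"

type voteResult int

const (
	votePending voteResult = iota
	voteLost
	voteWon
)

func (r voteResult) String() string {
	return [...]string{"pending", "lost", "won"}[r]
}

// tally counts the responses received so far. votes maps a voter ID to
// true (granted) or false (rejected); voters with no entry have not
// answered yet.
func tally(voters []uint64, votes map[uint64]bool) voteResult {
	granted, missing := 0, 0
	for _, id := range voters {
		v, ok := votes[id]
		switch {
		case !ok:
			missing++
		case v:
			granted++
		}
	}
	quorum := len(voters)/2 + 1 // strictly more than half
	switch {
	case granted >= quorum:
		return voteWon
	case granted+missing >= quorum:
		return votePending // a majority is still reachable
	default:
		return voteLost
	}
}

func main() {
	voters := []uint64{1, 2, 3, 4, 5}
	votes := map[uint64]bool{1: true} // the candidate votes for itself
	fmt.Println(tally(voters, votes)) // pending
	votes[2] = false
	votes[3] = true
	fmt.Println(tally(voters, votes)) // pending: 2 granted, 2 still missing
	votes[4] = true
	fmt.Println(tally(voters, votes)) // won: 3 of 5
}
```

Deciding "lost" early lets a candidate return to follower without waiting for every reply.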

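Barring accidents, a complicated village election is completed just that simply! Time to celebrate.

=-=-=-=-=-=-=-=-=-=-=-=-=- Down to business =-=-=-=-=-=-=-=-=-=-=-=-=-

Since nobody knows whether tomorrow or an accident will come first, the election committee has to plan for the unexpected. The village head's cow might die, Zhang San might beat up Li Si, and so on; any of these can affect an election. So what can go wrong during an election?

Some Trouble in Wangjia Village

The villagers live far apart and cannot vote face to face, and any two of them may lose contact with each other;
An ordinary villager who loses contact with the secretary will start an election on its own;
A secretary who loses contact with most of the villagers can no longer publish important information;
What if new villagers arrive?
What if the new villagers outnumber the old ones?
And so on...

So an election is not that simple after all, because nobody can think the whole thing through from the start.
Next, let's look at the solutions that Raft actually provides.

Solutions

To handle network partitions, etcd/raft implements three election-related optimizations: Pre-Vote, Check Quorum, and Leader Lease.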
Figure 1: Raft network partition diagram (image not available)
Figure 2 (image not available)

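Mechanism	Description
Pre-Vote	A pre-election mechanism. When the network partitions and the leader sits in the majority partition (node D in Figure 1), the nodes in the minority partition keep timing out and increasing their term without ever winning. After the partition heals, their inflated term keeps them from rejoining the healthy cluster smoothly. With Pre-Vote, a node first asks whether it could win an election and only increases its term if a majority agrees.
Check Quorum	A step-down mechanism. When the network partitions and the leader is in the minority partition (node A in Figure 2), a leader that does not step down leaves the cluster with two leaders. The minority leader cannot commit anything, but clients can still read stale data from it. To avoid this, Check Quorum has the leader regularly check whether it is still in contact with a majority of nodes; if not, it steps down to follower.
Leader Lease	A check that works together with Check Quorum: a node that has recently heard from a leader refuses to vote for anyone else. Consider an incomplete partition with a bridge node (the only node still connected to the leader) plus some isolated nodes, so that the connected nodes form a majority only if the bridge node is included. If Leader Lease were enabled without Check Quorum, the bridge node would keep receiving the leader's heartbeats and therefore refuse to elect a new leader, so the connected partition could not elect a working leader.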
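
To make the last two rows more concrete, here is a minimal sketch of both checks. All types and names are hypothetical. The assumptions, which come from my reading of etcd/raft and not from this post, are that the leader evaluates Check Quorum once per election-timeout window using a "recently active" flag per voter, and that the lease applies only while Check Quorum is on, a leader is known, and less than one election timeout has elapsed since it was last heard from.

```go
package main

import "fmt"

// leader is a hypothetical leader-side view: which voters were heard from
// during the current election-timeout window.
type leader struct {
	id           uint64
	recentActive map[uint64]bool
}

// checkQuorum reports whether the leader may stay in office. It counts the
// voters heard from in this window (the leader always counts itself) and
// then clears the flags so the next window starts fresh.
func (l *leader) checkQuorum() bool {
	active := 0
	for id, ok := range l.recentActive {
		if ok || id == l.id {
			active++
		}
		l.recentActive[id] = false
	}
	return active >= len(l.recentActive)/2+1
}

// follower is a hypothetical follower-side view used for the lease check.
type follower struct {
	checkQuorum     bool
	lead            uint64 // 0 means no known leader
	electionElapsed int    // ticks since the leader was last heard from
	electionTimeout int
}

// inLease reports whether the follower should ignore an incoming vote
// request because it still believes in its current leader.
func (f follower) inLease() bool {
	return f.checkQuorum && f.lead != 0 && f.electionElapsed < f.electionTimeout
}

func main() {
	l := &leader{id: 1, recentActive: map[uint64]bool{1: true, 2: true, 3: false, 4: false, 5: false}}
	fmt.Println("leader keeps office:", l.checkQuorum()) // false: only 2 of 5 reachable

	f := follower{checkQuorum: true, lead: 1, electionElapsed: 3, electionTimeout: 10}
	fmt.Println("follower ignores vote request:", f.inLease()) // true
}
```

The two checks cover each other: the lease stops a healthy follower from being pulled into needless elections, and Check Quorum guarantees that a leader which can no longer reach a majority gives up the office that the lease is protecting.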

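New Problems Introduced (see Reference)

Scenario 1: With Check Quorum / Leader Lease enabled (and assuming Pre-Vote is not), the leader in a partition that cannot reach a quorum steps down, and the nodes in that partition can never elect a leader, so their term keeps growing. When that partition reconnects to the rest of the cluster, the nodes of the healthy partition are working normally, so with Check Quorum / Leader Lease enabled they drop the election requests from the returning nodes even though those carry a larger term. At the same time, a returning node's term is larger than that of the healthy partition's leader, so it drops that leader's requests. As a result the node can never rejoin the cluster, and it can never be elected leader either.
Scenario 2: Pre-Vote has a similar problem. Suppose a node wins its pre-vote and a network partition occurs just as it is about to send the real vote requests. The node's term is now higher than the cluster's. The cluster never received a real vote request, so it does not update its term and keeps running normally. After the partition heals, the cluster's term is lower than the partitioned node's, but its log is newer. The node's pre-vote requests are then rejected because its log is behind, and the requests the cluster's leader sends to the node are dropped because their term is smaller than the node's. Again, the node can never rejoin the cluster or be elected leader.
Scenario 3: In more complicated cases, for example when Pre-Vote is switched on during a configuration change where it was previously off, something similar to the previous scenario can happen: a node with a higher term but an older log can deadlock the whole cluster, so that no node can win a pre-vote. This is more dangerous than the previous scenario. There, only the previously partitioned node could not rejoin; here, the entire cluster becomes unavailable. (See issue #8501 and issue #8525 for details.)

To solve these problems, a node needs special handling when it receives a request whose term is lower than its own. The logic is simple:

If it receives a message from a leader whose term is lower than its own, and the cluster has Check Quorum / Leader Lease or Pre-Vote enabled, it replies with a message carrying its current term, which makes the lower-term node step down to follower. (Addresses scenarios 1 and 2.)
For a pre-vote request whose term is lower than its own, regardless of whether Check Quorum / Leader Lease or Pre-Vote is enabled, it replies with a message carrying its current term, forcing the sender to become a follower and update its term. (Addresses scenario 3.)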
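
The two rules above can be written as one small decision function. This is a sketch with hypothetical names (`onLowerTerm`, `msgType`, `action`), not the etcd/raft handler itself; it only classifies what a node does with a message whose term is lower than its own, and treating `msgApp` and `msgHeartbeat` as "messages from a leader" is an assumption.

```go
package main

import "fmt"

type msgType int

const (
	msgApp       msgType = iota // log replication message from a leader
	msgHeartbeat                // heartbeat from a leader
	msgPreVote                  // pre-vote request
	msgVote                     // real vote request
)

type action string

const (
	replyWithCurrentTerm action = "reply with current term"
	rejectPreVote        action = "reject pre-vote with current term"
	ignore               action = "ignore"
)

// onLowerTerm decides how to handle a message whose term is lower than the
// receiver's. checkQuorum and preVote are the receiver's settings.
func onLowerTerm(t msgType, checkQuorum, preVote bool) action {
	switch {
	case (checkQuorum || preVote) && (t == msgApp || t == msgHeartbeat):
		// Scenarios 1 and 2: tell the stale leader our term so it steps down.
		return replyWithCurrentTerm
	case t == msgPreVote:
		// Scenario 3: always answer, whatever the configuration.
		return rejectPreVote
	default:
		return ignore
	}
}

func main() {
	fmt.Println(onLowerTerm(msgHeartbeat, true, false)) // reply with current term
	fmt.Println(onLowerTerm(msgHeartbeat, false, false)) // ignore
	fmt.Println(onLowerTerm(msgPreVote, false, false))   // reject pre-vote with current term
	fmt.Println(onLowerTerm(msgVote, true, true))        // ignore
}
```

Either reply carries the receiver's higher term, so the sender learns about it and becomes a follower at that term, which is what breaks the stalemates in the three scenarios.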

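Summary

Continuing this study of Raft, this post used the Wangjia Village election as a running example to introduce Raft leader election, compared it against some basic election rules, and then looked at the special scenarios that arise in Raft elections.

Reference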

https://mrcroxx.github.io/posts/code-reading/etcdraft-made-simple/3-election/#14-%E5%BC%95%E5%85%A5%E7%9A%84%E6%96%B0%E9%97%AE%E9%A2%98%E4%B8%8E%E8%A7%A3%E5%86%B3%E6%96%B9%E6%A1%88