实验介绍
lab2总体是要复现一个简易的Raft系统,这个实验被分为了四个部分
- 2a ,Leader的选举与保持;
- 2b ,log的添加;
- 2c ,数据持久化与服务器恢复
- 2d ,snapshot实现
实验的重点是2a和2b,如果这两个部分没有打好基础,后面的实验会反复修改前面的代码,使得整个程序需要多次修改和调整,最后代码到处是corner case需要到处打补丁。
实验建议
1.熟悉整个实验
因此虽然实验要求是逐步推进,但是想要优雅的完成实验,建议在最一开始就反复通读Raft的paper和助教给的指导,让自己对整个实验有一个完整清晰的认识。
一定要对原文第五章,Figure.2和第七章的内容都了然于心再进行代码工作,不然需要后期反复调整代码以及参数。尤其是Figure.2的处理步骤。此外需要熟悉官方提供给的一些工具代码。
2.提前做好后两个实验的准备
在进行2a和2b实验的时候,最好可以顾及到2c与2d的内容,主要是以下两个方面:
- 在需要数据持久化的地方添加persist()方法的调用
- 计算log和commit的下标时,直接使用已有snapshot的方式计算下标,而在测试时将snapshot的长度设为0
这样实验2c和2d可以节省很多时间
3.写好功能函数
如返回log的最后一个项目的Tern
// 返回全部log的最后一个的Term,包括snapshot
func (rf *Raft)LastTerm() int{
lastTerm := 0
if len(rf.log) > 0{
lastTerm = rf.log[len(rf.log)-1].Term
}else if len(rf.log) == 0{
lastTerm = rf.LastIncludedTerm
}else{
panic("term wrong,len(log) smaller than 0")
}
return lastTerm
}
这些功能函数在2d会有很大用处
4.做足够的测试
这里贴上官方的 测试代码,个人建议前一个实验可以保证1000次连续测试最多只有两到三个错误时再进行下一个,这时候应该还是某些地方有bug但是无伤大雅,而且这时候很难再进行进一步的debug。
但是注意 所有的bug都会累积到后面的实验中出现,而且后面实验更加复杂,更加难以找出bug,所以应该尽可能保证自己的代码逻辑无误,并且考虑到所有的corner case 然后再进行下一步实验,稳扎稳打,不然最后debug的过程漫长又痛苦
测试文件使用方法:放在raft文件夹下
./go-test-many.sh [测试次数] [测试并行数] [测试项目]
./go-test-many.sh 1000 5 2C //对2C实验测试1000次,五个并行测试
//测试项目可以是具体测试方法,在test_test.go内查看
2a实验内容
2a的实验内容很简单,复现raft leader选举和心跳(没有数据的 AppendEntries rpc请求)。这个lab2A的目标是选出leader,leader挂掉或者断网后新的leader上位。
同时要满足以下接口
// create a new Raft server instance:
func Make(peers []*labrpc.ClientEnd, me int,
persister *Persister, applyCh chan ApplyMsg) *Raft
// start agreement on a new log entry:
func (rf *Raft) Start(command interface{}) (int, int, bool)
// ask a Raft for its current term, and whether it thinks it is leader
func (rf *Raft) GetState() (int, bool)
按照Raft.go文件给出的结构,应该完成以下功能
- 一个计时器方法,用于判断server何时超时
- 选举方法
- server收到candidate的选举请求时回应
- 保持心跳
具体逻辑都应该严格参照论文Figure.2。首先测试代码通过make() 方法创建raft服务器,服务器等待超时后成为candidate,给所有其他raft服务器发送投票请求,通过则成为leader,此时不断给其他服务器发送空的log以保持心跳。
具体实现
我2a的完整实现在 这里,但是需要指出,在这里的很多逻辑都是错误或者片面的,仅作为2a实验的实现,想要查看完整实现的可以查看该仓库的 raft.go。
- Make方法
func Make(peers []*labrpc.ClientEnd, me int,
persister *Persister, applyCh chan ApplyMsg) *Raft {
rf := &Raft{}
rf.peers = peers
rf.persister = persister
rf.me = me
// Your initialization code here (2A, 2B, 2C).
rf.currentTerm = 0
rf.votedFor = NILL
rf.log = make([]Entry,1024)
rf.commitIndex = 0
rf.lastApplied = 0
rf.nextIndex = make([]int,len(peers))
rf.matchIndex = make([]int,len(peers))
for i:=0;i<len(rf.peers);i++{
rf.nextIndex[i] = 0;
rf.matchIndex[i] = 0;
}
rf.state = Follower
rf.candidateOut = 0
rf.followerOut = 0
rf.applyMessage = applyCh
// initialize from state persisted before a crash
rf.readPersist(persister.ReadRaftState())
// start ticker goroutine to start elections
go rf.Timer()
return rf
}
- 计时器Timer
func (rf *Raft) Timer() {
for {
time.Sleep(time.Duration(100+rand.Intn(50)) * time.Millisecond)
rf.mu.Lock()
// Your code here to check if a leader election should
// be started and to randomize sleeping time using
// time.Sleep().
if rf.killed(){
rf.mu.Unlock()
return
}
if rf.state == Leader || rf.state == Candidate{
rf.mu.Unlock()
continue
}else if rf.state == Follower{
rf.followerOut += 1
if rf.followerOut >= FTimeOut{
rf.followerOut = 0
rf.state = Candidate
go rf.BeCanidate()
///log.Printf("Peer %d(%d) : start Candidate", rf.me,rf.currentTerm)
}
rf.mu.Unlock()
}
}
}
- 成为candidate后获取投票
func (rf *Raft) BeCanidate() {
for {
///log.Printf("Peer %d(%d,%d) : Being Candidate", rf.me,rf.currentTerm,rf.state)
if rf.killed() {
return
}
rf.mu.Lock()
if rf.state != Candidate{
rf.mu.Unlock()
return
}
rf.currentTerm += 1
rf.votedFor = rf.me
rf.followerOut = 0
voteForMe := 1
totalVote := 1
Term := rf.currentTerm
rf.mu.Unlock()
waitFlag := sync.WaitGroup{}
///log.Printf("Peer %d(%d,%d) : start RequestVote", rf.me,rf.currentTerm,rf.state)
for i := 0;i<len(rf.peers);i++{
if rf.votedFor == i{
continue
}else{
args := &RequestVoteArgs{
Term : rf.currentTerm,
CandidateId : rf.me,
LastLogTerm : len(rf.log),
LastLogIndex : rf.log[len(rf.log)-1].Term,
}
reply := &RequestVoteReply{
Term : NILL,
VoteGranted : false,
}
waitFlag.Add(1)
go func(server int){
///log.Printf("Peer %d(%d,%d) : RequestVote to Peer %d", rf.me,rf.currentTerm,rf.state,server)
ok := rf.sendRequestVote(server, args, reply)
rf.mu.Lock()
if ok {
if reply.VoteGranted{
voteForMe++
}else{
if reply.Term > rf.currentTerm{
rf.BeFollower(reply.Term,NILL)
rf.mu.Unlock()
return
}else if reply.Term == rf.currentTerm{
// pass
}else if reply.Term < rf.currentTerm{
// pass
}
}
}
totalVote++
waitFlag.Done()
rf.mu.Unlock()
}(i)
}
}
time_out := make(chan bool)
vote_done := make(chan bool)
vote_succ := make(chan bool)
// 检查是否超时
go func(){
time.Sleep(CTimeOut * 100 * time.Millisecond)
time_out <- true
}()
// 检查是否所有requestVote都返回
go func(){
waitFlag.Wait()
vote_done <- true
}()
// 检查是否已经可以成为Leader
go func(){
for readtime := 0;readtime < CTimeOut;readtime++{
time.Sleep(100 * time.Millisecond)
if voteForMe*2 >= len(rf.peers){
vote_succ <- true
}
}
}()
select{
case <- time_out:
log.Printf("Peer %d(%d,%d) : timeout,totalVote: %d:,voteForMe:%d", rf.me,rf.currentTerm,rf.state,totalVote,voteForMe)
case <- vote_done:
log.Printf("Peer %d(%d,%d)%d : Vote_Done", rf.me,rf.currentTerm,rf.state,voteForMe)
case <- vote_succ:
log.Printf("Peer %d(%d,%d) : Vote_Success", rf.me,rf.currentTerm,rf.state)
}
rf.mu.Lock()
if rf.state == Follower || Term != rf.currentTerm{
log.Printf("Peer %d(%d,%d) : has been follower", rf.me,rf.currentTerm,rf.state)
defer rf.mu.Unlock()
return
}
if voteForMe*2 >= len(rf.peers){
log.Printf("Peer %d(%d,%d) : BeLeader", rf.me,rf.currentTerm,rf.state)
go rf.BeLeader()
rf.mu.Unlock()
return
}else{
rf.mu.Unlock()
}
}
}
- 处理获取投票
func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
// Your code here (2A, 2B).
///log.Printf("Peer %d(%d,%d) : receive RequestVote", rf.me,rf.currentTerm,rf.state)
rf.followerOut = 0
rf.mu.Lock()
defer rf.mu.Unlock()
reply.Term = rf.currentTerm
reply.VoteGranted = false
// 如果收到的term<=自己的term,说明对方开始竞选前的term小于自己
if args.Term < rf.currentTerm{
log.Printf("Peer %d(%d,%d) : hav higher term than %d", rf.me,rf.currentTerm,rf.state,args.Term)
return
}else if args.Term == rf.currentTerm{
if (rf.votedFor == NILL || args.CandidateId == rf.votedFor) &&
(args.LastLogTerm>rf.log[len(rf.log)-1].Term ||
(args.LastLogTerm==rf.log[len(rf.log)-1].Term && args.LastLogIndex>=len(rf.log)) ){
// 如果候选者的log不比receiver新
///log.Printf("Peer %d(%d,%d) : receive RequestVote,Be Follower", rf.me,rf.currentTerm,rf.state)
reply.VoteGranted = true
rf.BeFollower(args.Term, args.CandidateId)
rf.votedFor = args.CandidateId
return
}else{
// 如果votedfor不为NILL则说明已经进行了投票
log.Printf("Peer %d(%d,%d) : already vote for %d", rf.me,rf.currentTerm,rf.state,rf.votedFor)
return
}
}else{ //如果收到的term更大,则转换为follower
///log.Printf("Peer %d(%d,%d) : receive RequestVote,Be Follower", rf.me,rf.currentTerm,rf.state)
rf.votedFor = args.CandidateId
reply.VoteGranted = true
rf.BeFollower(args.Term, args.CandidateId)
return
}
}
- 成为Leader后维持心跳
func (rf * Raft) BeLeader(){
rf.state = Leader
rf.votedFor = rf.me
rf.candidateOut = 0
rf.followerOut = 0
for i:=0;i<len(rf.peers);i++{
rf.nextIndex[i] = len(rf.log)
rf.matchIndex[i] = 0
}
for {
//log.Printf("peer:%d(%d,%d) is leader? follower", rf.me,rf.currentTerm,rf.state)
if rf.state != Leader{
//log.Printf("peer:%d(%d) became follower", rf.me,rf.currentTerm)
return
}
for i:=0;i<len(rf.peers);i++{
if rf.me == i{
rf.followerOut = 0
continue
}
args:=&AppendEntriesArgs{
Term:rf.currentTerm,
LeaderId:rf.me,
PrevLogIndex:len(rf.log),
PrevLogTerm:rf.log[len(rf.log)-1].Term,
Entries:make([]Entry,2),
LeaderCommit:len(rf.log),
}
reply:=&AppendEntriesReply{
Term:0,
Success:false,
}
go func(server int){
//log.Printf("Peer %d(%d) : Be Leader and Sending void entry to %d", rf.me,rf.currentTerm,server)
rf.sendAppendEntry(server, args, reply)
return
}(i)
}
time.Sleep(110*time.Millisecond)
}
}
- 处理心跳
func (rf *Raft) AppendEntry(args *AppendEntriesArgs, reply *AppendEntriesReply) {
// Your code here (2A, 2B).
rf.mu.Lock()
defer rf.mu.Unlock()
if args.Term > rf.currentTerm{
//log.Printf("Peer %d(%d,%d) : change to follower", rf.me,rf.currentTerm,rf.state)
rf.BeFollower(args.Term, args.LeaderId)
rf.followerOut = 0
return
}else if args.Term == rf.currentTerm{
//log.Printf("Peer %d(%d,%d) : change to follower", rf.me,rf.currentTerm,rf.state)
rf.followerOut = 0
return
}
}
还是要说这里逻辑与参数选择有诸多错误,只是作为2a的代码进行示意,真正可行的代码查看raft.go。
测试
2a的测试很简单,只有三个
TestInitialElection2A:能在指定时间选出一个Leader,并且该Leader正常发出心跳
TestReElection2A:在选出一个Leader的情况下,断掉该Leader查看是否能及时选出新的Leader,重连断掉的旧Leader,看服务集群能否重新达到共识
TestManyElections2A:在10个服务器集群里,随机断掉三个,看剩下的能否选出Leader
后两个实验都具有一定的随机性,因而一定要进行大量的测试保证代码无误
错误信息
Term x has y (>1) leaders : Term x有了多个leader,
expected one leader, got none:没有选举出leader