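2A covers leader election and heartbeats; implement it by following Figure 2.
Most of the logic lives in the ticker function. There are two cases: leader and non-leader.
Leader: (1) sleep 50 ms; (2) check that it is still the leader; (3) send heartbeats.
Non-leader: (1) sleep 150-300 ms; (2) check whether a heartbeat arrived; (3) start an election; (4) sleep 100 ms; (5) check the candidate state and the vote count.
Points to watch: every election runs in a new term; lock and unlock carefully, both around sleeps and around function calls; debug with print statements; heartbeats are empty AppendEntries; pick the sleep durations with care.
Two cases must be handled correctly: if the leader crashes, a follower has to take over; while the leader is alive, no follower may usurp it.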
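The decisions made inside the ticker can be sketched as small pure functions. This is only an illustration under assumptions: the names role, randomElectionTimeout, shouldStartElection and wonElection are hypothetical and do not come from the lab skeleton, and the real ticker also has to handle locking and RPCs.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// role is a hypothetical enum for the three Raft states.
type role int

const (
	follower role = iota
	candidate
	leader
)

// Durations taken from the notes above.
const (
	heartbeatInterval  = 50 * time.Millisecond
	electionTimeoutMin = 150 * time.Millisecond
	electionTimeoutMax = 300 * time.Millisecond
)

// randomElectionTimeout returns a duration in [150ms, 300ms) so that
// peers rarely time out at the same moment.
func randomElectionTimeout() time.Duration {
	span := int64(electionTimeoutMax - electionTimeoutMin)
	return electionTimeoutMin + time.Duration(rand.Int63n(span))
}

// shouldStartElection is evaluated after a non-leader wakes up:
// it starts an election only if no heartbeat arrived during the sleep.
func shouldStartElection(r role, heartbeatSeen bool) bool {
	return r != leader && !heartbeatSeen
}

// wonElection is evaluated after the candidate's second sleep:
// it must still be a candidate and hold a strict majority of votes.
func wonElection(r role, votes, peers int) bool {
	return r == candidate && votes > peers/2
}

func main() {
	fmt.Println("heartbeat every", heartbeatInterval)
	fmt.Println("next election timeout", randomElectionTimeout())
	fmt.Println("start election:", shouldStartElection(follower, false))
	fmt.Println("won with 2 of 3:", wonElection(candidate, 2, 3))
}
```

The role check in wonElection covers the "leader is alive, no usurping" case: a candidate that received a valid AppendEntries during its wait has already fallen back to follower, so its votes no longer count.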
go test -run 2A -race
Test (2A): initial election ...
... Passed -- 3.0 3 106 27448 0
Test (2A): election after network failure ...
... Passed -- 4.9 3 240 48758 0
Test (2A): multiple elections ...
... Passed -- 6.5 7 1032 201534 0
PASS
ok 6.824/raft 14.522s
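2B is much harder than 2A. Besides the log and commit logic, the election code has to be completed as well, and changes here may break 2A. Everything in Figure 2 has to be implemented. The tests check correctness through the Start function and ApplyMsg.
Leader: update commitIndex and commit. On each heartbeat, consider whether the follower needs entries synced.
Points to watch: a term has at most one leader; comparing logs only requires index and term; on receiving AppendEntries, compare terms to decide whether to step down to follower; indexes start at 1; release the lock around sleeps and re-check the state after waking; track how nextIndex and matchIndex change, and handle a nextIndex that is too large or a peer that is unreachable.
The leader only commits entries from its current term. In this implementation it also does not proactively send out entries from earlier terms; they are sent once the current term has an entry of its own. Pay attention to the commitIndex carried by heartbeats.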
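The current-term commit rule can be sketched as follows. The function name advanceCommitIndex, the entry type, and the dummy entry at index 0 are assumptions made for this illustration, not names from the lab code.

```go
package main

import "fmt"

// entry is a hypothetical log entry; only the term matters here.
type entry struct{ Term int }

// advanceCommitIndex returns the highest index N > commitIndex such that
// log[N].Term == currentTerm and a majority of peers have matchIndex >= N.
// log[0] is assumed to be a dummy entry so that real indexes start at 1.
func advanceCommitIndex(log []entry, matchIndex []int, me, currentTerm, commitIndex int) int {
	for n := len(log) - 1; n > commitIndex; n-- {
		if log[n].Term != currentTerm {
			// Terms never increase going backward, so no earlier entry
			// can belong to the current term either.
			break
		}
		count := 1 // the leader itself always has the entry
		for peer, m := range matchIndex {
			if peer != me && m >= n {
				count++
			}
		}
		if count > len(matchIndex)/2 {
			return n
		}
	}
	return commitIndex
}

func main() {
	log := []entry{{Term: 0}, {Term: 1}, {Term: 1}, {Term: 2}}
	// Peer 1 has replicated up to index 3, peer 2 up to index 2.
	fmt.Println(advanceCommitIndex(log, []int{0, 3, 2}, 0, 2, 0))
	// Index 2 is on a majority but belongs to term 1, so nothing commits.
	fmt.Println(advanceCommitIndex(log, []int{0, 2, 2}, 0, 2, 0))
}
```

Once an entry of the current term is committed, every earlier entry is committed with it, which is why older-term entries never need to be counted on their own.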
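When voting, decide which of a and b is more up-to-date by first comparing the terms of their last log entries; if the terms are equal, compare the indexes.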
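A minimal sketch of that comparison, as it would be used when deciding whether to grant a vote. The function name candidateUpToDate and its parameters are hypothetical.

```go
package main

import "fmt"

// candidateUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: the later last term wins, and on equal
// terms the longer log (larger last index) wins.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(candidateUpToDate(3, 1, 2, 9)) // true: later term beats a longer log
	fmt.Println(candidateUpToDate(2, 4, 2, 5)) // false: same term, shorter log
	fmt.Println(candidateUpToDate(2, 5, 2, 5)) // true: identical logs
}
```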
go test -run 2B -race
Test (2B): basic agreement ...
... Passed -- 0.8 3 16 3828 3
Test (2B): RPC byte count ...
... Passed -- 1.8 3 46 112212 11
Test (2B): agreement despite follower disconnection ...
... Passed -- 6.0 3 186 45330 8
Test (2B): no agreement if too many followers disconnect ...
... Passed -- 3.5 5 276 53898 4
Test (2B): concurrent Start()s ...
... Passed -- 0.7 3 14 3344 6
Test (2B): rejoin of partitioned leader ...
... Passed -- 6.2 3 278 64214 4
Test (2B): leader backs up quickly over incorrect follower logs ...
... Passed -- 26.0 5 3048 2313497 104
Test (2B): RPC counts aren't too high ...
... Passed -- 2.3 3 64 16584 12
PASS
ok 6.824/raft 48.405s
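2C implements persistence. It is much simpler than the previous two parts.
Implement the save and restore functions, and call the save function at every place where persisted state changes.
Figure 8 (unreliable) did not pass at first. The DPrintf output showed that nextIndex kept failing to reach a matching position, so I improved it following the last hint in the assignment.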
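One common way to back nextIndex up by more than one entry per rejection is the conflict-term scheme described in the Raft paper. The sketch below assumes that scheme; it is not necessarily what was done here, and the names conflictHint, nextIndexAfterReject, and the dummy entry at index 0 are hypothetical.

```go
package main

import "fmt"

// entry is a hypothetical log entry; only the term matters here.
type entry struct{ Term int }

// conflictHint runs on the follower when the AppendEntries consistency
// check fails at prevLogIndex. It returns the term of the conflicting
// entry and the first index the follower stores for that term, or
// (-1, len(log)) when the follower's log is too short.
// log[0] is assumed to be a dummy entry so that real indexes start at 1.
func conflictHint(log []entry, prevLogIndex int) (conflictTerm, conflictIndex int) {
	if prevLogIndex >= len(log) {
		return -1, len(log)
	}
	conflictTerm = log[prevLogIndex].Term
	conflictIndex = prevLogIndex
	for conflictIndex > 1 && log[conflictIndex-1].Term == conflictTerm {
		conflictIndex--
	}
	return conflictTerm, conflictIndex
}

// nextIndexAfterReject runs on the leader. If the leader has entries of
// conflictTerm, it resumes right after its last one; otherwise it skips
// the follower's whole run of that term.
func nextIndexAfterReject(leaderLog []entry, conflictTerm, conflictIndex int) int {
	if conflictTerm == -1 {
		return conflictIndex
	}
	for i := len(leaderLog) - 1; i >= 1; i-- {
		if leaderLog[i].Term == conflictTerm {
			return i + 1
		}
	}
	return conflictIndex
}

func main() {
	followerLog := []entry{{0}, {1}, {1}, {2}, {2}, {2}}
	leaderLog := []entry{{0}, {1}, {1}, {3}, {3}}
	term, index := conflictHint(followerLog, 4)
	fmt.Println(term, index, nextIndexAfterReject(leaderLog, term, index))
}
```

With this hint the leader needs roughly one rejected RPC per conflicting term instead of one per conflicting entry, which matters under the unreliable network of Figure 8 (unreliable).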
go test -run 2C -race
Test (2C): basic persistence ...
labgob warning: Decoding into a non-default variable/field int may not work
... Passed -- 3.8 3 118 25653 6
Test (2C): more persistence ...
... Passed -- 16.4 5 1384 267718 16
Test (2C): partitioned leader and one follower crash, leader restarts ...
... Passed -- 1.8 3 44 9794 4
Test (2C): Figure 8 ...
... Passed -- 43.3 5 1476 293394 25
Test (2C): unreliable agreement ...
... Passed -- 6.3 5 412 114070 246
Test (2C): Figure 8 (unreliable) ...
... Passed -- 49.0 5 6684 11634487 583
Test (2C): churn ...
... Passed -- 16.4 5 1080 483804 255
Test (2C): unreliable churn ...
... Passed -- 16.6 5 768 192555 96
PASS
ok 6.824/raft 155.078s
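2D implements snapshots. It is the most complex part; all of the earlier code has to be modified.
Start with the index offset: introduce an offset X and make sure the 2A, 2B and 2C tests still pass.
Implement the InstallSnapshot RPC according to Figure 13. If the installation succeeds, fix up nextIndex and matchIndex.
The AppendEntries RPC, the nextIndex speedup from 2C, heartbeats, commit and voting all need to be redesigned; it is tedious work.
When the log holds nothing that still needs committing: the snapshot can be delivered through applyCh, which delivers everything up to the snapshot point.
When the log no longer holds the entries that need appending: call the InstallSnapshot RPC.
When the log is empty and a vote is needed: compare against the lastIncluded values.
The nextIndex jump that was changed in 2C has to be changed again here, taking both the InstallSnapshot RPC and the AppendEntries RPC into account.
The snapshot itself must be persisted.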
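The index offset and the empty-log voting case above can be sketched with a small wrapper around the log. The type raftLog and its methods are hypothetical; lastIncludedIndex plays the role of the offset X, and the field names follow Figure 13.

```go
package main

import "fmt"

// entry is a hypothetical log entry; only the term matters here.
type entry struct{ Term int }

// raftLog keeps only the entries after the snapshot.
// entries[0] holds absolute index lastIncludedIndex+1.
type raftLog struct {
	lastIncludedIndex int
	lastIncludedTerm  int
	entries           []entry
}

// lastIndex returns the absolute index of the last entry.
func (l *raftLog) lastIndex() int { return l.lastIncludedIndex + len(l.entries) }

// lastTerm falls back to lastIncludedTerm when the log is empty,
// which is the value to compare when voting.
func (l *raftLog) lastTerm() int {
	if len(l.entries) == 0 {
		return l.lastIncludedTerm
	}
	return l.entries[len(l.entries)-1].Term
}

// termAt translates an absolute index into a slice position.
// ok is false if the entry was discarded or does not exist yet.
func (l *raftLog) termAt(index int) (term int, ok bool) {
	if index == l.lastIncludedIndex {
		return l.lastIncludedTerm, true
	}
	pos := index - l.lastIncludedIndex - 1
	if pos < 0 || pos >= len(l.entries) {
		return 0, false
	}
	return l.entries[pos].Term, true
}

// compact discards all entries up to and including index.
func (l *raftLog) compact(index int) {
	term, ok := l.termAt(index)
	if !ok || index <= l.lastIncludedIndex {
		return
	}
	drop := index - l.lastIncludedIndex // number of entries to discard
	l.entries = append([]entry(nil), l.entries[drop:]...)
	l.lastIncludedIndex = index
	l.lastIncludedTerm = term
}

func main() {
	l := &raftLog{entries: []entry{{Term: 1}, {Term: 1}, {Term: 2}, {Term: 3}}}
	l.compact(3)
	fmt.Println(l.lastIncludedIndex, l.lastIncludedTerm, l.lastIndex(), l.lastTerm())
	_, ok := l.termAt(2)
	fmt.Println("index 2 still available:", ok)
}
```

When termAt reports that the entry before a follower's nextIndex was discarded, AppendEntries cannot be built and the leader has to fall back to the InstallSnapshot RPC.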
go test -run 2D -race
Test (2D): snapshots basic ...
... Passed -- 4.6 3 142 44750 251
Test (2D): install snapshots (disconnect) ...
... Passed -- 61.2 3 2066 491612 343
Test (2D): install snapshots (disconnect+unreliable) ...
... Passed -- 92.2 3 3134 705353 386
Test (2D): install snapshots (crash) ...
labgob warning: Decoding into a non-default variable/field int may not work
... Passed -- 31.3 3 926 236012 344
Test (2D): install snapshots (unreliable+crash) ...
... Passed -- 42.1 3 1194 285557 346
PASS
ok 6.824/raft 232.538s
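I found 2A and 2B the most interesting; they are the core of Raft. 2C is fairly simple and worth doing. 2D is tedious: it is an extension of Raft and has fewer clever design ideas.
Raft indexes start at 1, so always think about whether to add 1 or subtract 1.
During debugging, fixing X often broke Y. Or Z failed only occasionally, was hard to reproduce, and the cause was hard to guess.
Overall, the difficulty is still high.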
go test -race
Test (2A): initial election ...
... Passed -- 3.6 3 104 25004 0
Test (2A): election after network failure ...
... Passed -- 5.0 3 166 30642 0
Test (2A): multiple elections ...
... Passed -- 6.3 7 690 130308 0
Test (2B): basic agreement ...
... Passed -- 0.8 3 14 3344 3
Test (2B): RPC byte count ...
... Passed -- 1.8 3 46 111596 11
Test (2B): agreement despite follower disconnection ...
... Passed -- 6.0 3 172 39892 8
Test (2B): no agreement if too many followers disconnect ...
... Passed -- 3.8 5 252 50384 3
Test (2B): concurrent Start()s ...
... Passed -- 1.2 3 22 5328 6
Test (2B): rejoin of partitioned leader ...
... Passed -- 4.7 3 208 43486 4
Test (2B): leader backs up quickly over incorrect follower logs ...
... Passed -- 19.9 5 2344 1800172 102
Test (2B): RPC counts aren't too high ...
... Passed -- 2.7 3 70 17168 12
Test (2C): basic persistence ...
labgob warning: Decoding into a non-default variable/field int may not work
... Passed -- 4.5 3 114 25071 6
Test (2C): more persistence ...
... Passed -- 17.3 5 1360 271270 16
Test (2C): partitioned leader and one follower crash, leader restarts ...
... Passed -- 2.1 3 42 9234 4
Test (2C): Figure 8 ...
... Passed -- 48.2 5 1168 221631 17
Test (2C): unreliable agreement ...
... Passed -- 7.4 5 428 116286 246
Test (2C): Figure 8 (unreliable) ...
... Passed -- 54.7 5 7148 24082212 884
Test (2C): churn ...
... Passed -- 16.5 5 1460 1470495 158
Test (2C): unreliable churn ...
... Passed -- 16.4 5 1068 357124 279
Test (2D): snapshots basic ...
... Passed -- 5.2 3 144 45522 251
Test (2D): install snapshots (disconnect) ...
... Passed -- 47.5 3 1612 384681 377
Test (2D): install snapshots (disconnect+unreliable) ...
... Passed -- 60.2 3 1996 457602 348
Test (2D): install snapshots (crash) ...
... Passed -- 34.5 3 928 238636 377
Test (2D): install snapshots (unreliable+crash) ...
... Passed -- 40.2 3 1104 274032 399
PASS
ok 6.824/raft 411.847s