A Brief Look at the Implementation of the Raft Library in Etcd

Latest recommended article published on 2022-08-16 13:58:43
而鱼儿and-fish
Article link: https://blog.csdn.net/weixin_46253576/article/details/125701609
I. Introduction to Etcd Raft
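		· Raft is a protocol with which a cluster of nodes maintains a replicated state machine; the state machines are kept in sync through a replicated log.

		· Etcd's Raft library implements only the core Raft algorithm; it does not include engineering concerns such as storage handling, message serialization, or network transport.
			Network and disk I/O are implemented by the user: the user must supply the transport layer and must also persist the Raft log and state.

		· Raft's responses must be deterministic, so the library models Raft as a state machine.
		  The state machine takes a Message as input and outputs a 3-tuple {[]Messages, []LogEntries, NextState}, consisting of an array of Messages, log entries, and Raft state changes.
			Given the same state and the same input, the state machine always produces the same output.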

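		· Features:
				· The library is built on the Raft protocol, so it includes the protocol's core features:
					1. Leader election
					2. Log replication
					3. Log compaction (snapshots)
					4. Membership changes
					5. Leadership transfer extension
					6. Efficient linearizable read-only queries served by both the Leader and Followers:
							a. The Leader does not write read-only operations to the log; instead it checks with a quorum to make sure its commitIndex is not stale.
							b. Before serving a read-only request, a Follower asks the Leader for a safe read index, so it knows which entries are committed (safe).

				· It also supports several optional optimizations:
					1. Pipelining: reduces log replication latency by streaming entries to Followers without waiting for their responses.
					2. Flow control: with pipelining enabled, the Leader tracks the index each Follower has safely acknowledged so that it can retransmit when necessary.
					3. Batching Raft messages to reduce network I/O calls.
					4. Batching Raft log writes to reduce disk I/O calls.
					5. Parallel writes: the Leader may write to its own log[] while sending entries to Followers.
					6. Built-in redirection of proposals from Followers to the Leader.
					7. A Leader that loses quorum steps down automatically.
					8. Snapshots prevent the log from growing without bound.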
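
		· The deterministic state-machine model described above can be illustrated with a toy example. This is only a sketch: Message, Output, machine, and replay are hypothetical stand-ins invented for the illustration, not types of the etcd library. The point is that Step() reads nothing but its own state and its input, so replaying the same inputs always yields the same outputs.

```go
package main

import "fmt"

// Message is a hypothetical, simplified stand-in for raftpb.Message.
type Message struct {
	Type string
	From uint64
	Term uint64
}

// Output mirrors the {[]Messages, []LogEntries, NextState} triple.
type Output struct {
	Messages  []Message
	Entries   []string
	NextState string
}

// machine keeps only in-memory state: no clock, disk, or network access.
type machine struct {
	id    uint64
	term  uint64
	state string
}

// Step consumes one input message and returns the resulting output triple.
func (m *machine) Step(in Message) Output {
	if in.Term > m.term {
		// A higher term always turns the node into a follower.
		m.term = in.Term
		m.state = "follower"
	}
	out := Output{NextState: m.state}
	if in.Type == "heartbeat" {
		out.Messages = append(out.Messages, Message{Type: "heartbeatResp", From: m.id, Term: m.term})
	}
	return out
}

// replay feeds the inputs to a fresh machine and renders every output.
func replay(inputs []Message) string {
	m := &machine{id: 1, state: "candidate"}
	s := ""
	for _, in := range inputs {
		s += fmt.Sprintf("%+v;", m.Step(in))
	}
	return s
}

func main() {
	inputs := []Message{{Type: "heartbeat", From: 2, Term: 3}, {Type: "heartbeat", From: 2, Term: 3}}
	// Two replays of the same inputs produce identical outputs.
	fmt.Println(replay(inputs) == replay(inputs))
}
```

		  Because all I/O stays outside Step(), such a machine is easy to test by replaying recorded inputs.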

II. Entry
### https://github.com/etcd-io/etcd/blob/main/raft/raftpb/raft.pb.go#L267		
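		· Overall, every node in the cluster is a state machine, and what Raft manages is the sequence of operations that change this state machine; each operation is wrapped in an Entry,
			which is the log entry of the Raft protocol.
			Etcd defines it as: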
        	type Entry struct {
          	Term  uint64    `protobuf:"varint,2,opt,name=Term" json:"Term"`
        		Index uint64    `protobuf:"varint,3,opt,name=Index" json:"Index"`
        		Type  EntryType `protobuf:"varint,1,opt,name=Type,enum=raftpb.EntryType" json:"Type"`
        		Data  []byte    `protobuf:"bytes,4,opt,name=Data" json:"Data,omitempty"`
    			}
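			Term is the term of the Leader that created the entry.
			Index is the position the entry occupies in log[].
			Type is an EntryType, a named type based on int32; its values are EntryNormal (0), EntryConfChange (1), and EntryConfChangeV2 (2).
			EntryConfChangeV2 supports joint consensus.
			Data is the actual payload; in etcd it carries key-value operations (the state of etcd's state machine is key-value data).

		· ConfChangeV2 supports single-node changes as well as joint consensus for changing any number of nodes at once.
			ConfChangeV2 is defined as: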
					type ConfChangeV2 struct {
						Transition ConfChangeTransition `protobuf:"varint,1,opt,name=transition,enum=raftpb.ConfChangeTransition" json:"transition"`
						Changes    []ConfChangeSingle   `protobuf:"bytes,2,rep,name=changes" json:"changes"`
						Context    []byte               `protobuf:"bytes,3,opt,name=context" json:"context,omitempty"`
					}
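			Transition (a ConfChangeTransition) specifies whether and how joint consensus is used.
			Each ConfChangeSingle in Changes is one individual configuration change.

		· ConfChange describes a configuration change to a single node (a single-node change needs no joint configuration).
			ConfChange is defined as: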
					type ConfChange struct {
						Type    ConfChangeType `protobuf:"varint,2,opt,name=type,enum=raftpb.ConfChangeType" json:"type"`
						NodeID  uint64         `protobuf:"varint,3,opt,name=node_id,json=nodeId" json:"node_id"`
						Context []byte         `protobuf:"bytes,4,opt,name=context" json:"context,omitempty"`
						ID uint64 `protobuf:"varint,1,opt,name=id" json:"id"`
					}
			ConfChangeType includes ConfChangeAddNode, ConfChangeRemoveNode, ConfChangeUpdateNode, and ConfChangeAddLearnerNode.
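
		· To make the role of the Type field concrete, here is a minimal sketch of how an application might route entries by type. The Entry and EntryType definitions below are trimmed local copies made for the example (the protobuf tags are dropped), and classify is a hypothetical helper, not a function of the library.

```go
package main

import "fmt"

// EntryType and its values mirror the constants named in the text.
type EntryType int32

const (
	EntryNormal       EntryType = 0
	EntryConfChange   EntryType = 1
	EntryConfChangeV2 EntryType = 2
)

// Entry is a trimmed local copy of the struct shown above.
type Entry struct {
	Term  uint64
	Index uint64
	Type  EntryType
	Data  []byte
}

// classify (hypothetical) decides what the application does with an entry.
func classify(e Entry) string {
	switch e.Type {
	case EntryNormal:
		if len(e.Data) == 0 {
			// Empty normal entries exist, e.g. the one a new Leader appends.
			return "skip-empty"
		}
		return "apply-to-state-machine"
	case EntryConfChange, EntryConfChangeV2:
		return "apply-conf-change"
	default:
		return "unknown"
	}
}

func main() {
	entries := []Entry{
		{Term: 1, Index: 1, Type: EntryNormal},
		{Term: 1, Index: 2, Type: EntryNormal, Data: []byte("put k v")},
		{Term: 1, Index: 3, Type: EntryConfChangeV2},
	}
	for _, e := range entries {
		fmt.Println(e.Index, classify(e))
	}
}
```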


			
III. Message
### https://github.com/etcd-io/etcd/blob/main/raft/raftpb/raft.pb.go#L384
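		· Message is the carrier for communication between Raft nodes; a single Message struct covers the fields needed by every kind of message.
		  The receiver only needs to check the Type field to determine the kind of Message and then read the corresponding fields.
			The library defines Message as: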
					type Message struct {
						Type MessageType `protobuf:"varint,1,opt,name=type,enum=raftpb.MessageType" json:"type"`
						To   uint64      `protobuf:"varint,2,opt,name=to" json:"to"`
						From uint64      `protobuf:"varint,3,opt,name=from" json:"from"`
						Term uint64      `protobuf:"varint,4,opt,name=term" json:"term"`
						LogTerm    uint64   `protobuf:"varint,5,opt,name=logTerm" json:"logTerm"`
						Index      uint64   `protobuf:"varint,6,opt,name=index" json:"index"`
						Entries    []Entry  `protobuf:"bytes,7,rep,name=entries" json:"entries"`
						Commit     uint64   `protobuf:"varint,8,opt,name=commit" json:"commit"`
						Snapshot   Snapshot `protobuf:"bytes,9,opt,name=snapshot" json:"snapshot"`
						Reject     bool     `protobuf:"varint,10,opt,name=reject" json:"reject"`
						RejectHint uint64   `protobuf:"varint,11,opt,name=rejectHint" json:"rejectHint"`
						Context    []byte   `protobuf:"bytes,12,opt,name=context" json:"context,omitempty"`
					}

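		· Fields:
				Type:		the kind of Message, e.g. MsgVote, MsgHeartbeat, MsgApp (append), and so on;
				To:			the receiver;
				From:		the sender;
				Term:		the term;
				LogTerm:	prevLogTerm; for Type=MsgVote, the sender's lastLogTerm;
				Index:		prevLogIndex; for Type=MsgVote, the sender's lastLogIndex;
				Entries:	the entries being sent;
				Commit:		the index of the highest committed log entry;
				Snapshot:	for MsgSnap, the snapshot is carried in this field;
				Reject, RejectHint:	used in responses; Reject marks a rejection, and RejectHint carries an index hint that helps the sender decide where to retry;
				Context:	opaque context bytes attached to the message, e.g. for tracing and debugging;

		· At the library level a snapshot travels as a single MsgSnap message carrying the Snapshot struct below; splitting it into chunks, if desired, is left to the user's transport layer: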
					type Snapshot struct {
						Data     []byte           `protobuf:"bytes,1,opt,name=data" json:"data,omitempty"`
						Metadata SnapshotMetadata `protobuf:"bytes,2,opt,name=metadata" json:"metadata"`
					}
					type SnapshotMetadata struct {
						ConfState ConfState `protobuf:"bytes,1,opt,name=conf_state,json=confState" json:"conf_state"`
						Index     uint64    `protobuf:"varint,2,opt,name=index" json:"index"`
						Term      uint64    `protobuf:"varint,3,opt,name=term" json:"term"`
					}
					type ConfState struct {
						Voters []uint64 `protobuf:"varint,1,rep,name=voters" json:"voters,omitempty"`
						Learners []uint64 `protobuf:"varint,2,rep,name=learners" json:"learners,omitempty"`
						VotersOutgoing []uint64 `protobuf:"varint,3,rep,name=voters_outgoing,json=votersOutgoing" json:"voters_outgoing,omitempty"`
						LearnersNext []uint64 `protobuf:"varint,4,rep,name=learners_next,json=learnersNext" json:"learners_next,omitempty"`
						AutoLeave bool `protobuf:"varint,5,opt,name=auto_leave,json=autoLeave" json:"auto_leave"`
					}
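			Snapshot.Data holds the snapshot content.
			Metadata records the Index and Term of the last log entry covered by the snapshot.
			ConfState records the cluster configuration (membership) at that point.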


IV. log_unstable.go
### https://github.com/etcd-io/etcd/blob/main/raft/log_unstable.go#L23
		· From the Raft protocol we know that the state of the state machine is preserved as Snapshot + Log.
		  The unstable struct holds the part of log[] that the user has not yet persisted:
					type unstable struct {
						snapshot *pb.Snapshot
						entries []pb.Entry
						offset  uint64
						logger Logger
					}
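			unstable consists of two parts: a snapshot and log entries.
			entries holds the log entries that have not been persisted yet. Because the log cannot grow forever, part of it is discarded under certain conditions and replaced by a Snapshot.
			offset is the log index of entries[0], so entries[i] sits at log index offset+i; everything before offset is already in stable storage or covered by a Snapshot.
			This matters when a new node joins the cluster and the Leader has to bring it fully up to date:
			the Leader only needs to send the Snapshot to the new Follower, which then applies the remaining entries on top of it.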
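
		· The role of offset is easiest to see in code. The following is a simplified sketch, not the library's implementation: it drops the snapshot and logger fields, and its stableTo omits the term check that a real implementation needs.

```go
package main

import "fmt"

// Entry is a trimmed stand-in for pb.Entry.
type Entry struct{ Index, Term uint64 }

// unstable is a simplified version of the struct shown above:
// entries[i] sits at log index offset+i.
type unstable struct {
	entries []Entry
	offset  uint64
}

// term returns the term of the entry at log index i, if it is held here.
func (u *unstable) term(i uint64) (uint64, bool) {
	if i < u.offset || i >= u.offset+uint64(len(u.entries)) {
		return 0, false
	}
	return u.entries[i-u.offset].Term, true
}

// stableTo drops all entries up to and including log index i, which the
// caller has persisted, and moves offset forward accordingly.
func (u *unstable) stableTo(i uint64) {
	if i < u.offset || i >= u.offset+uint64(len(u.entries)) {
		return // nothing held at that index
	}
	u.entries = u.entries[i+1-u.offset:]
	u.offset = i + 1
}

func main() {
	u := &unstable{offset: 5, entries: []Entry{{5, 1}, {6, 1}, {7, 2}}}
	t, ok := u.term(6)
	fmt.Println(t, ok)
	u.stableTo(6) // entries 5 and 6 were written to stable storage
	fmt.Println(u.offset, len(u.entries))
}
```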


V. storage.go
### https://github.com/etcd-io/etcd/blob/main/raft/storage.go#L46
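		· storage.go defines a Storage interface. Because Etcd's raft library has no built-in persistence,
		  it only exposes this interface; once the application implements it, the library can read the persisted log back through it.
					type Storage interface {
						// Returns the saved HardState and ConfState.
						InitialState() (pb.HardState, pb.ConfState, error)
						// Returns a slice of entries in the range [lo, hi), limited in total size by maxSize.
						Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
						// Returns the term of entry i.
						Term(i uint64) (uint64, error)
						// Returns the index of the last entry in the log.
						LastIndex() (uint64, error)
						// Returns the index of the first entry available through Entries
						// (older entries may already have been folded into a Snapshot; see unstable).
						FirstIndex() (uint64, error)
						// Returns the most recent Snapshot.
						Snapshot() (pb.Snapshot, error)
					}
			The library also ships an in-memory implementation: MemoryStorage implements the Storage interface.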
					type MemoryStorage struct {
						sync.Mutex
						hardState pb.HardState
						snapshot  pb.Snapshot
						ents []pb.Entry
					}
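
		· To illustrate the index arithmetic behind FirstIndex, LastIndex, and the half-open range of Entries, here is a minimal in-memory sketch. memStore and its error values are hypothetical names for this example; it assumes, like the description above, that older entries may have been compacted away, and it leaves out maxSize, locking, and the snapshot itself.

```go
package main

import (
	"errors"
	"fmt"
)

// Entry is a trimmed stand-in for pb.Entry.
type Entry struct{ Index, Term uint64 }

var (
	errCompacted   = errors.New("requested index was compacted")
	errUnavailable = errors.New("requested index is not available yet")
)

// memStore keeps a dummy entry at ents[0] that marks the last compacted
// index; real entries start at ents[1].
type memStore struct{ ents []Entry }

// FirstIndex is the first index that Entries can still return.
func (s *memStore) FirstIndex() uint64 { return s.ents[0].Index + 1 }

// LastIndex is the index of the newest stored entry.
func (s *memStore) LastIndex() uint64 { return s.ents[0].Index + uint64(len(s.ents)) - 1 }

// Entries returns the entries in the half-open range [lo, hi).
func (s *memStore) Entries(lo, hi uint64) ([]Entry, error) {
	offset := s.ents[0].Index
	if lo <= offset {
		return nil, errCompacted
	}
	if lo > hi || hi > s.LastIndex()+1 {
		return nil, errUnavailable
	}
	return s.ents[lo-offset : hi-offset], nil
}

func main() {
	// Index 3 is the compaction point; entries 4, 5, and 6 are stored.
	s := &memStore{ents: []Entry{{3, 1}, {4, 1}, {5, 2}, {6, 2}}}
	fmt.Println(s.FirstIndex(), s.LastIndex())
	ents, err := s.Entries(4, 6) // returns entries 4 and 5, not 6
	fmt.Println(ents, err)
}
```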


VI. log.go
### https://github.com/etcd-io/etcd/blob/main/raft/log.go#L24
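		· unstable and Storage were introduced above; raftLog is responsible for all operations on the Raft log.
			raftLog is built from those two components, plus a few fields that track the current progress: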
					type raftLog struct {
						storage Storage
						unstable unstable
						committed uint64
						applied uint64
						logger Logger
						maxNextEntsSize uint64
					}
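			storage: the Storage from which already persisted data is read;
			unstable: holds the data that has not been persisted yet;
			committed: the highest index known to be committed;
			applied: the highest index that has been applied to the state machine;
			An entry is applied only after it has been committed, so applied <= committed always holds.

		· So the log[] of the Raft algorithm is actually represented as (taking MemoryStorage as the Storage):
			MemoryStorage.snapshot + MemoryStorage.ents + unstable.entries
			This can be seen in newLog():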
					func newLog(storage Storage, logger Logger) *raftLog {
						return newLogWithSize(storage, logger, noLimit)
					}
					func newLogWithSize(storage Storage, logger Logger, maxNextEntsSize uint64) *raftLog {
						if storage == nil {
							log.Panic("storage must not be nil")
						}
						log := &raftLog{
							storage:         storage,
							logger:          logger,
							maxNextEntsSize: maxNextEntsSize,
						}
						firstIndex, err := storage.FirstIndex()
						if err != nil {
							panic(err) // TODO(bdarnell)
						}
						lastIndex, err := storage.LastIndex()
						if err != nil {
							panic(err) // TODO(bdarnell)
						}
						log.unstable.offset = lastIndex + 1
						log.unstable.logger = logger
						log.committed = firstIndex - 1
						log.applied = firstIndex - 1
						return log
					}

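		· A raftLog is built from a Storage.
			newLogWithSize() first creates the raftLog and fills in the fields it was given;
			it then uses the Storage's FirstIndex and LastIndex to initialize unstable.offset, committed, and applied.
			unstable.snapshot is not set in newLog();
			it is assigned later, when the node receives snapshot data after startup.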
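
		· How the stable part, the unstable part, committed, and applied fit together can be sketched as follows. This is a simplified model, not the library's raftLog: the stable slice stands in for Storage, it is assumed to start at index 1 with no compaction, and the method names at and nextEnts are chosen for the example.

```go
package main

import "fmt"

// Entry is a trimmed stand-in for pb.Entry.
type Entry struct{ Index, Term uint64 }

// raftLog is a simplified model: stable[i] holds log index i+1 and
// unstable[i] holds log index offset+i, with offset == len(stable)+1.
type raftLog struct {
	stable    []Entry
	unstable  []Entry
	offset    uint64
	committed uint64
	applied   uint64
}

// lastIndex prefers the unstable tail and falls back to the stable part.
func (l *raftLog) lastIndex() uint64 {
	if n := len(l.unstable); n > 0 {
		return l.offset + uint64(n) - 1
	}
	return l.offset - 1
}

// at looks the index up in whichever part of the log holds it.
func (l *raftLog) at(i uint64) (Entry, bool) {
	switch {
	case i == 0 || i > l.lastIndex():
		return Entry{}, false
	case i >= l.offset:
		return l.unstable[i-l.offset], true
	default:
		return l.stable[i-1], true
	}
}

// nextEnts returns the entries that are committed but not yet applied,
// i.e. the indexes in (applied, committed].
func (l *raftLog) nextEnts() []Entry {
	var out []Entry
	for i := l.applied + 1; i <= l.committed; i++ {
		if e, ok := l.at(i); ok {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	l := &raftLog{
		stable:    []Entry{{1, 1}, {2, 1}},
		unstable:  []Entry{{3, 2}, {4, 2}},
		offset:    3,
		committed: 3,
		applied:   1,
	}
	fmt.Println(l.lastIndex(), l.nextEnts())
}
```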


VII. progress.go
### https://github.com/etcd-io/etcd/blob/main/raft/tracker/progress.go#L30
		· The Leader tracks the state of each Follower with a Progress struct
			and uses the information in it to decide which log entries to send in each round:
					type Progress struct {
						Match, Next uint64
						State StateType
						PendingSnapshot uint64
						RecentActive bool
						ProbeSent bool
						Inflights *Inflights
						IsLearner bool
					}
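			Match, Next: the matchIndex and nextIndex of the Raft protocol;
			State: how the Leader should interact with the Follower:
						 StateProbe: Next and Match still have to be determined (e.g. the node was offline for a while and its progress must be rediscovered);
						 StateReplicate: entries may be sent in a pipelined fashion;
						 StateSnapshot: a snapshot has to be sent;
			PendingSnapshot: the index of the snapshot currently being sent; it is only used in StateSnapshot, during which log replication to this Follower is paused;
			RecentActive: whether the Follower has been active recently (it has not timed out);
			Inflights: a sliding window used for flow control; the number of messages that may be in flight differs per state (one in StateProbe, up to n in StateReplicate, none in StateSnapshot), so it must be limited;
			Progress is used in a node as follows: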
					type Status struct {
						BasicStatus
						Config   tracker.Config
						Progress map[uint64]tracker.Progress
					}
			Only the Leader holds Progress, stored as a map keyed by node ID.

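		· State transitions:
			1. Receives msgAppResp(rej=false && index > match):
				 the append was accepted and the acknowledged index is greater than Match, so Match is set to index and Next to Match+1;
				 State changes from StateProbe to StateReplicate, and the Leader starts sending entries from Next.

			2. Receives msgAppResp(rej=true):
				 the append was rejected, typically because a node that was offline for a while needs entries but the Leader does not know from which index to send them;
				 State changes from StateReplicate to StateProbe, and the two sides work out the right index.

			3. After sending a snapshot, State changes from StateProbe to StateSnapshot,
			   meaning that a snapshot is in flight.

			4. Receives snapshot success: the snapshot was installed, so Next becomes the snapshot's Index+1,
				 i.e. the entry right after the snapshot is sent next; State changes from StateSnapshot to StateProbe.

			5. Receives snapshot failure: the snapshot was not installed;
				 State changes from StateSnapshot to StateProbe and the Leader prepares to send again.

			6. Receives msgAppResp(rej=false && index > lastsnap.index):
			   the append was accepted and the Follower's acknowledged index is beyond the last entry contained in the snapshot;
				 Match is set to index and Next to Match+1, and State changes from StateSnapshot to StateProbe, ready to send further entries.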
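
		· The six transitions listed above can be written down as a small state machine. This is a sketch that follows the list literally, not the library's code: the method names are hypothetical, Inflights is left out, and Next is simply set to Match+1 on every accepted append, whereas the library never moves Next backwards on an acknowledgement.

```go
package main

import "fmt"

type StateType int

const (
	StateProbe StateType = iota
	StateReplicate
	StateSnapshot
)

// Progress keeps only the fields the transitions need.
type Progress struct {
	Match, Next     uint64
	State           StateType
	PendingSnapshot uint64
}

func (p *Progress) becomeProbe() {
	p.State = StateProbe
	p.PendingSnapshot = 0
}

// onAppendResp handles a msgAppResp (transitions 1, 2, and 6).
func (p *Progress) onAppendResp(reject bool, index uint64) {
	if reject {
		if p.State == StateReplicate { // transition 2
			p.becomeProbe()
			p.Next = p.Match + 1
		}
		return
	}
	if index <= p.Match {
		return // stale acknowledgement, nothing to update
	}
	p.Match, p.Next = index, index+1
	switch {
	case p.State == StateProbe: // transition 1
		p.State = StateReplicate
	case p.State == StateSnapshot && index > p.PendingSnapshot: // transition 6
		p.becomeProbe()
	}
}

// onSnapshotSent records that a snapshot is in flight (transition 3).
func (p *Progress) onSnapshotSent(snapIndex uint64) {
	p.State = StateSnapshot
	p.PendingSnapshot = snapIndex
}

// onSnapshotReport handles snapshot success or failure (transitions 4, 5).
func (p *Progress) onSnapshotReport(ok bool) {
	if p.State != StateSnapshot {
		return
	}
	if ok {
		p.Next = p.PendingSnapshot + 1
	}
	p.becomeProbe()
}

func main() {
	p := &Progress{Next: 1}
	p.onAppendResp(false, 5) // probe -> replicate
	p.onAppendResp(true, 0)  // replicate -> probe
	p.onSnapshotSent(20)     // probe -> snapshot
	p.onSnapshotReport(true) // snapshot -> probe, Next = 21
	fmt.Println(p.Match, p.Next, p.State == StateProbe)
}
```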


VIII. raft.go
### https://github.com/etcd-io/etcd/blob/main/raft/raft.go
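		· The logic of the Raft protocol lives in raft.go, and it is driven by Step().
			The messages a raft node receives may come from other nodes, from the application above, or from the storage below;
			all of them are converted to the Message type.
			Step() uses a series of switch/case and if/else statements to inspect the Type of the incoming msg and respond to each type accordingly (send response).
			### https://github.com/etcd-io/etcd/blob/main/raft/raft.go#L847
			Finally, messages whose handling depends on the node's role fall through to the default branch:
					default:
						err := r.step(r, m)
						if err != nil {
							return err
						}
			So in the end the function stored in the step field, one of stepLeader/stepCandidate/stepFollower, handles the specific msg.

		· The timer hook tick is similar to the step field: it is also a function value that depends on the role, either tickHeartbeat (Leader) or tickElection (Follower and Candidate).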
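
		· The idea of swapping the step and tick functions when the role changes can be sketched in a few lines. This is a toy model, not the library's code: message types are plain strings here, and the handlers only return a description of what they would do.

```go
package main

import "fmt"

// raft is a toy node whose behavior is selected by two function fields.
type raft struct {
	role string
	step func(r *raft, msgType string) string
	tick func() string
}

// Role-specific message handlers.
func stepFollower(r *raft, t string) string { return "follower handles " + t }
func stepLeader(r *raft, t string) string   { return "leader handles " + t }

// Role-specific timer handlers.
func (r *raft) tickElection() string  { return "election tick" }
func (r *raft) tickHeartbeat() string { return "heartbeat tick" }

// Changing the role swaps both function fields at once.
func (r *raft) becomeFollower() {
	r.role, r.step, r.tick = "follower", stepFollower, r.tickElection
}

func (r *raft) becomeLeader() {
	r.role, r.step, r.tick = "leader", stepLeader, r.tickHeartbeat
}

// Step handles role-independent messages itself and delegates the rest
// to the current role's step function via the default branch.
func (r *raft) Step(msgType string) string {
	switch msgType {
	case "MsgHup":
		return "start campaign"
	default:
		return r.step(r, msgType)
	}
}

func main() {
	r := &raft{}
	r.becomeFollower()
	fmt.Println(r.Step("MsgApp"), "/", r.tick())
	r.becomeLeader()
	fmt.Println(r.Step("MsgProp"), "/", r.tick())
}
```

		  With this layout Step() needs no per-role branching of its own; a role change is just an assignment to two fields.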
		


IX. node.go
### https://github.com/etcd-io/etcd/blob/main/raft/node.go
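		· The main job of node is to connect the application layer with the Raft protocol layer:
		  it passes application messages to the protocol layer and hands the results of the protocol's processing back to the application.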
					type node struct {
						propc      chan msgWithResult
						recvc      chan pb.Message
						confc      chan pb.ConfChangeV2
						confstatec chan pb.ConfState
						readyc     chan Ready
						advancec   chan struct{}
						tickc      chan struct{}
						done       chan struct{}
						stop       chan struct{}
						status     chan chan Status
					
						rn *RawNode
					}
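			propc:	proposals from the application layer
			recvc:	messages from other nodes
			confc:	configuration changes, submitted by the application through ApplyConfChange()
			readyc:	Ready batches (node state and pending work) handed back to the application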

		· run() consumes these channels;
			it uses the for-select-channel pattern to read events in a loop and handle them (simplified):
					for {
						select {
						case pm := <-propc:
								r.Step(pm.m)
						case m := <-n.recvc:
								r.Step(m)
						case cc := <-n.confc:
								...
						case <-n.tickc:
								r.tick()
						case readyc <- rd:
								...
						case <-advancec:
								...
						case c := <-n.status:
								...
						case <-n.stop:
								close(n.done)
								return
						}
					}
		
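		· In Etcd, node is not responsible for persisting data, sending messages over the network, or applying the log to the state machine.
		  Instead, node uses the readyc channel to tell the outside world that there is data to process, and packs everything those external operations may need into a Ready struct.
			To the application, the data in Ready is read-only: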
					type Ready struct {
						*SoftState
						pb.HardState
						ReadStates []ReadState
						Entries []pb.Entry
						Snapshot pb.Snapshot
						CommittedEntries []pb.Entry
						Messages []pb.Message
						MustSync bool
					}
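			It contains the node's role (SoftState), the current Vote/Term/Commit (HardState), the queue of read requests (each ReadState carries an Index and a request ID, which is what makes linearizable reads possible), Entries, Snapshot, CommittedEntries, and Messages.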

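		· When the application receives a Ready, it should:
				1. persist HardState, Entries, and Snapshot to Storage;
				2. send the Messages to the nodes named in their To fields;
				3. apply CommittedEntries to the state machine;
				4. if CommittedEntries contains membership-change entries, call node.ApplyConfChange() to let node know;
				5. finally call node.Advance() to tell raft that this batch has been processed and the next Ready can be delivered.
				(node.* here refers to the Node interface.)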
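
		· The order of the five steps above can be sketched as one handler function. Everything here is a hypothetical stand-in: Ready and Entry are heavily trimmed, the app type fakes storage and transport with slices, and applyConfChange and advance are callbacks that play the role of node.ApplyConfChange() and node.Advance().

```go
package main

import "fmt"

// Entry and Ready are trimmed stand-ins for the library types.
type Entry struct {
	Data         string
	IsConfChange bool
}

type Ready struct {
	HardState        string
	Entries          []Entry
	CommittedEntries []Entry
	Messages         []string
}

// app fakes the application's storage, transport, and state machine.
type app struct {
	hardState string
	stable    []Entry
	outbox    []string
	applied   []string
	trace     string // records the order of the steps
}

// handleReady processes one Ready batch in the order listed above.
func (a *app) handleReady(rd Ready, applyConfChange func(Entry), advance func()) {
	// 1. Persist HardState and Entries before anything else.
	if rd.HardState != "" {
		a.hardState = rd.HardState
	}
	a.stable = append(a.stable, rd.Entries...)
	a.trace += "persist;"

	// 2. Hand the messages to the transport.
	a.outbox = append(a.outbox, rd.Messages...)
	a.trace += "send;"

	// 3 and 4. Apply committed entries; report conf changes back.
	for _, e := range rd.CommittedEntries {
		if e.IsConfChange {
			applyConfChange(e)
			a.trace += "confchange;"
			continue
		}
		a.applied = append(a.applied, e.Data)
		a.trace += "apply;"
	}

	// 5. Signal that the batch is done.
	advance()
	a.trace += "advance;"
}

func main() {
	a := &app{}
	rd := Ready{
		HardState:        "term=2,vote=1,commit=2",
		Entries:          []Entry{{Data: "put k v"}},
		CommittedEntries: []Entry{{Data: "put a b"}, {Data: "add node 4", IsConfChange: true}},
		Messages:         []string{"MsgApp to 2", "MsgApp to 3"},
	}
	a.handleReady(rd, func(Entry) {}, func() {})
	fmt.Println(a.trace)
}
```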


X. Life of a Request
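		· Life of a Vote Request
				1. The node loop has a tick channel that periodically triggers the function stored in the raft.tick field.
				   When the node's role is Follower/Candidate/PreCandidate, that function is raft.tickElection().
					 Once the election timeout has elapsed, tickElection() sends a MsgHup to the node itself; Step() sees Type == MsgHup and, via hup(), calls campaign() to start campaigning: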
							func (r *raft) tickElection() {
								r.electionElapsed++
								if r.promotable() && r.pastElectionTimeout() {
									r.electionElapsed = 0
									if err := r.Step(pb.Message{From: r.id, Type: pb.MsgHup}); err != nil {
										r.logger.Debugf("error occurred during election: %v", err)
									}
								}
							}
							func (r *raft) Step(m pb.Message) error {
								switch m.Type {
									case pb.MsgHup:
										if r.preVote {
											r.hup(campaignPreElection)
										} else {
											r.hup(campaignElection)
										}
									}
							}
							func (r *raft) hup(t CampaignType) {
								// ...
								r.campaign(t)
							}

				2. campaign() calls r.becomeCandidate() to switch the role to Candidate and increment Term,
					 then sends its Term and log information to the other nodes to request their votes:
							func (r *raft) campaign(t CampaignType) {
								// ...
								r.becomeCandidate()
								voteMsg = pb.MsgVote
								term = r.Term
								// iterate over all nodes in the cluster configuration
								for _, id := range ids {
									// ...
									r.send(pb.Message{Term: term, To: id, Type: voteMsg, Index: r.raftLog.lastIndex(), LogTerm: r.raftLog.lastTerm(), Context: ctx})
								}
							}

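				3. A node that receives pb.MsgVote first compares the message's Term with its own, then checks whether it can still vote and whether the log information the Candidate attached to MsgVote is at least as up to date as its own log,
					 and decides whether to grant its vote accordingly; this logic is also in Step():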
							func (r *raft) Step(m pb.Message) error {
								switch m.Type {
									case pb.MsgVote, pb.MsgPreVote:
										canVote := r.Vote == m.From ||
											(r.Vote == None && r.lead == None) ||
											(m.Type == pb.MsgPreVote && m.Term > r.Term)
									if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
										r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
										if m.Type == pb.MsgVote {
											r.electionElapsed = 0
											r.Vote = m.From
										}
									} else {
										r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
									}
								}
							}

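				4. Finally, when the Candidate receives a vote response, Step()'s default branch leads into stepCandidate(),
				   which tallies the votes: if a majority has granted its vote, the node becomes Leader and broadcasts an append message; if the vote is lost, it steps back to Follower: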
							func stepCandidate(r *raft, m pb.Message) error {
								switch m.Type {
								case myVoteRespType:
									gr, rj, res := r.poll(m.From, m.Type, !m.Reject)
									r.logger.Infof("%x has received %d %s votes and %d vote rejections", r.id, gr, m.Type, rj)
									switch res {
									case quorum.VoteWon:
										if r.state == StatePreCandidate {
											r.campaign(campaignElection)
										} else {
											r.becomeLeader()
											r.bcastAppend()
										}
									case quorum.VoteLost:
										r.becomeFollower(r.Term, None)
									}
								}
							}
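
		· Two rules carry the vote decision described above: the "at least as up to date" comparison of logs and the majority tally. The sketch below shows both as free functions. The signatures are simplified for the example (the library's isUpToDate is a method of raftLog, and its tally lives in the quorum package), so treat the names as illustrative.

```go
package main

import "fmt"

// isUpToDate reports whether a candidate's log (candIndex, candTerm) is
// at least as up to date as the local log (myIndex, myTerm): a higher
// last term wins; with equal terms the longer log wins.
func isUpToDate(myIndex, myTerm, candIndex, candTerm uint64) bool {
	return candTerm > myTerm || (candTerm == myTerm && candIndex >= myIndex)
}

type VoteResult int

const (
	VotePending VoteResult = iota
	VoteLost
	VoteWon
)

// tally counts the votes received so far. votes maps a voter ID to
// granted (true) or rejected (false); voters missing from the map have
// not answered yet.
func tally(voters []uint64, votes map[uint64]bool) VoteResult {
	granted, missing := 0, 0
	for _, id := range voters {
		v, ok := votes[id]
		switch {
		case !ok:
			missing++
		case v:
			granted++
		}
	}
	quorum := len(voters)/2 + 1
	if granted >= quorum {
		return VoteWon
	}
	if granted+missing >= quorum {
		return VotePending // the outstanding votes could still decide it
	}
	return VoteLost
}

func main() {
	voters := []uint64{1, 2, 3}
	fmt.Println(isUpToDate(5, 2, 3, 3))                                  // higher term, shorter log
	fmt.Println(tally(voters, map[uint64]bool{1: true}) == VotePending) // one of three so far
	fmt.Println(tally(voters, map[uint64]bool{1: true, 2: true}) == VoteWon)
}
```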


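		· Life of a Write Request
				1. A write request normally starts at the Propose() method of the Node interface:
					 the application calls Propose(), which wraps the request in a MsgProp message and sends it to the propc channel;
					 the protocol layer picks the message up in run():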
							func (n *node) run() {
								select {
									case pm := <-propc:
										m := pm.m
										m.From = r.id
										err := r.Step(m)
										if pm.result != nil {
											pm.result <- err
											close(pm.result)
										}
								}
							}
				
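				2. In Step(), MsgProp falls through to default and reaches the role-specific step function, which handles it according to the current role.
					 If the node is a Follower, it forwards the message to the Leader: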
							func stepFollower(r *raft, m pb.Message) error {
								switch m.Type {
									case pb.MsgProp:
										if r.lead == None {
											return ErrProposalDropped
										} else if r.disableProposalForwarding {
											return ErrProposalDropped
										}
										m.To = r.lead
										r.send(m)
									}
							}
				
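				3. When the Leader receives the message, stepLeader appends the entries to its own log and then broadcasts MsgApp to the other Followers:
							func stepLeader(r *raft, m pb.Message) error {
								switch m.Type {
									case pb.MsgProp:
										// (the check for confchange entries is omitted here)
										if !r.appendEntry(m.Entries...) {
											return ErrProposalDropped
										}
										r.bcastAppend()
										return nil
								}
								// ... other message types omitted
								return nil
							}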
							func (r *raft) bcastAppend() {
								r.prs.Visit(func(id uint64, _ *tracker.Progress) {
									if id == r.id {
										return
									}
									r.sendAppend(id)
								})
							}
				
				4. When a Follower receives this message, it handles it in handleAppendEntries(), which replies with a MsgAppResp:
							func stepFollower(r *raft, m pb.Message) error {
								switch m.Type {
									case pb.MsgApp:
										r.electionElapsed = 0
										r.lead = m.From
										r.handleAppendEntries(m)
								}
							}
				
				5. When the Leader receives a MsgAppResp, it checks whether a quorum of nodes has stored the entry; if so, it commits the entry
				   and broadcasts again to tell the Followers the new committed index.
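
		· The quorum check in step 5 boils down to finding the highest index that a majority of nodes has stored. A minimal sketch, assuming match holds the Match index of every voter including the Leader itself; committedIndex is a name chosen for this example. Raft additionally requires that the entry at that index belongs to the Leader's current term before it is committed; that check is omitted here.

```go
package main

import (
	"fmt"
	"sort"
)

// committedIndex returns the highest log index that is stored on a
// majority of nodes, given the match index of every voter.
func committedIndex(match []uint64) uint64 {
	if len(match) == 0 {
		return 0
	}
	// Work on a copy so the caller's slice keeps its order.
	sorted := append([]uint64(nil), match...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	// After sorting in descending order, the quorum-th largest value is
	// stored on at least quorum nodes.
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// Three nodes: the Leader is at 10, the Followers acknowledged 7 and 5.
	fmt.Println(committedIndex([]uint64{10, 7, 5}))
	// Five nodes: three of them hold index 4 or higher.
	fmt.Println(committedIndex([]uint64{9, 9, 4, 3, 1}))
}
```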