【Raft】学习九：成员变更ConfChangeV2

从未想放弃

已于 2022-04-29 10:07:29 修改

阅读量666

点赞数 1

分类专栏： raft linux golang 文章标签： golang 分布式 linux

于 2022-04-29 00:05:38 首次发布

本文链接：https://blog.csdn.net/qq_40859492/article/details/124428555

版权

golang 同时被 3 个专栏收录

16 篇文章 0 订阅

订阅专栏

linux

13 篇文章 0 订阅

订阅专栏

raft

8 篇文章 2 订阅

订阅专栏

前言

在分布式系统中，节点的增删是常见也是必须的操作，对于实用的共识算法Raft自然也提供了关于节点变更的理论基础。在Raft算法中一次变更一个节点是天然支持的，如果一次涉及多个节点的变更，对于一个稳定的系统来说就具有一定风险。例如，在前面的介绍leader选举时，leader的选举依赖集群中的大多数，如果改变了这个大多数，就会导致leader在选举出现问题。如下图（图来源于这里）所示，当新配置占据大多数，而原leader在老配置中就可能会产生新两个leader。因而本文基于Raft的理论和etcd/raft时间，探讨学习Raft的成员变更。本文主要尝试解决的问题：

Raft提供的方案为什么能够解决多成员变更带来的不一致问题？
etxd/raft的是如何实现成员变更的问题？

理论基础

Raft在成员变更提出联合共识方案来解决成员添加的问题。联合共识有如下要求：

日志条目会复制到两个配置的所有服务节点；
任何一个配置的任何服务节点都可以成为leader；
一致（用来选举和日志条目提交）需要在旧配置和新配置中的大多数分别达成。

如图所示，这个配置更改过程分为两个步骤，当有收到新的配置时，首先会生成的新老联合配置，这个配置在未提交以前老配置可以独立决策，只要这个配置提交，就需要新老配置共同决策，也就是说需要两份配置中的各自的大多数，这时候就会生成新的配置，这个配置提交后，新配置才可以独立做决策。
简要分析： 这种方式为什么能够避免数据不一致，首先数据不一致一般是集群中产生多个leader，正常情况下，只要在任意时刻避免多个leader产生就可以达到数据的一致性要求。而发生多个leader主要是当新节点的加入影响了大多数这个成为leader的条件。例如在当前配置中我们有【A,B,C】三个节点，现在新加入四个节点【D,E,F,G】，如果不采用联合配置，那么【D,E,F,G】就可能在收到leader的心跳之前选出新的leader，集群中就会出现两个leader。如果采用联合配置，就需要两个配置中的大多数，然而在老配置中【A,B,C】就会拒绝选举，这样【D,E,F,G】选举leader就不会成功，因为没有达到老配置的大多数。只要保证了在新配置应用之前没有发生多个leader的分化，配置更新就是安全地。只要在节点在等待主节点发送心跳之前没有选举leader成功，这样集群就算更新配置成功了。

etcd/raft 实现

etcd在实现中给出了两个方案：one by one 和联合共识方案。本文跟着上报消息流程来看看一次变更经历哪些步骤。
对于消息上报，etcd/Raft提供了两个方法在Node接口中

	// Propose proposes that data be appended to the log. Note that proposals can be lost without
	// notice, therefore it is user's job to ensure proposal retries.
	Propose(ctx context.Context, data []byte) error
	// ProposeConfChange proposes a configuration change. Like any proposal, the
	// configuration change may be dropped with or without an error being
	// returned. In particular, configuration changes are dropped unless the
	// leader has certainty that there is no prior unapplied configuration
	// change in its log.
	//
	// The method accepts either a pb.ConfChange (deprecated) or pb.ConfChangeV2
	// message. The latter allows arbitrary configuration changes via joint
	// consensus, notably including replacing a voter. Passing a ConfChangeV2
	// message is only allowed if all Nodes participating in the cluster run a
	// version of this library aware of the V2 API. See pb.ConfChangeV2 for
	// usage details and semantics.
	ProposeConfChange(ctx context.Context, cc pb.ConfChangeI) error

这两个方法都可以上报成员变更（confchange）的消息，我们这里看看Propose方法，经过Propose的方法最后都会来到stepLeader的处理阶段，也就是只有leader节点才能够处理上报消息的权力。这里我们着重看看处理配置改变的工作，EntryConfChange和EntryConfChangeV2分别代表第一个版本和第二个版本的配置变更，第二种主要是增加联合配置。

if e.Type == pb.EntryConfChange {
				var ccc pb.ConfChange
				if err := ccc.Unmarshal(e.Data); err != nil {
					panic(err)
				}
				cc = ccc
			} else if e.Type == pb.EntryConfChangeV2 {
				var ccc pb.ConfChangeV2
				if err := ccc.Unmarshal(e.Data); err != nil {
					panic(err)
				}
				cc = ccc
			}

在广播消息之前，需要先做几件事情：

if cc != nil {
				alreadyPending := r.pendingConfIndex > r.raftLog.applied
				alreadyJoint := len(r.prs.Config.Voters[1]) > 0
				wantsLeaveJoint := len(cc.AsV2().Changes) == 0

				var refused string
				if alreadyPending {
					refused = fmt.Sprintf("possible unapplied conf change at index %d (applied to %d)", r.pendingConfIndex, r.raftLog.applied)
				} else if alreadyJoint && !wantsLeaveJoint {
					refused = "must transition out of joint config first"
				} else if !alreadyJoint && wantsLeaveJoint {
					refused = "not in joint state; refusing empty conf change"
				}

				if refused != "" {
					r.logger.Infof("%x ignoring conf change %v at config %s: %s", r.id, cc, r.prs.Config, refused)
					m.Entries[i] = pb.Entry{Type: pb.EntryNormal}
				} else {
					r.pendingConfIndex = r.raftLog.lastIndex() + uint64(i) + 1
				}
			}

pendingConfIndex：是否有配置正在变更，这个和切主不一样，这里如果有配置正在变更，会舍弃当前的配置变更，返回失败；
wantsLeaveJoint：是否是联合配置，换句话说就是是不是涉及多个节点的变更。
alreadyJoint：之前的联合配置是否已经更新为最新配置，也就是有没有清理联合配置，保留最新配置。

注意：如果alreadyJoint && !wantsLeaveJoint为真，就表示涉及多个节点变更但上次的变更还没有清理联合配置，返回失败；不是联合配置便拒绝空配置。

如果正常就将消息广播，然后返回应用层处理。
广播之后就等待消息提交然后被状态机应用。。。

two years later…

我们的confchange消息来到了这里，这里一般是由状态机应用调用，如下所示

func (n *node) ApplyConfChange(cc pb.ConfChangeI) *pb.ConfState {
	var cs pb.ConfState
	select {
	case n.confc <- cc.AsV2():
	case <-n.done:
	}
	select {
	case cs = <-n.confstatec:
	case <-n.done:
	}
	return &cs
}

这里给n.confc里面塞入消息，然后被raft监听到。也就是说应用层告诉raft说准备好了可以应用这个配置了，然后通过node就进入raft协议层

case cc := <-n.confc:
			_, okBefore := r.prs.Progress[r.id]
			// 进入raft协议层进行配置变更应用
			cs := r.applyConfChange(cc)
			// If the node was removed, block incoming proposals. Note that we
			// only do this if the node was in the config before. Nodes may be
			// a member of the group without knowing this (when they're catching
			// up on the log and don't have the latest config) and we don't want
			// to block the proposal channel in that case.
			//
			// NB: propc is reset when the leader changes, which, if we learn
			// about it, sort of implies that we got readded, maybe? This isn't
			// very sound and likely has bugs.
			if _, okAfter := r.prs.Progress[r.id]; okBefore && !okAfter {
				var found bool
			outer:
				for _, sl := range [][]uint64{cs.Voters, cs.VotersOutgoing} {
					for _, id := range sl {
						if id == r.id {
							found = true
							break outer
						}
					}
				}
				if !found {
					propc = nil
				}
			}
			select {
			case n.confstatec <- cs:
			case <-n.done:
			}

正片开始。。。
下面这一段代码，看似简单，其实只是冰山一角，接下来正式撸这一段

func (r *raft) applyConfChange(cc pb.ConfChangeV2) pb.ConfState {
	// 整个配置变更准备就在这个虚拟函数中做了
	cfg, prs, err := func() (tracker.Config, tracker.ProgressMap, error) {
		// 首先初始化一个Changer结构体
		changer := confchange.Changer{
			Tracker:   r.prs,
			LastIndex: r.raftLog.lastIndex(),
		}
		// 判断是不是来解除联合配置的，这里为什么能够判断我们暂时按下不表 ①
		if cc.LeaveJoint() {
		// 进入解除流程
			return changer.LeaveJoint()
		} else if autoLeave, ok := cc.EnterJoint(); ok { // 判断能否进入联合配置，就是来看是否有多个节点变更②
		// 然后进入联合配置变更 ③
			return changer.EnterJoint(autoLeave, cc.Changes...)
		}
		// 如果不是就施行one by one模式，就比较简单 
		return changer.Simple(cc.Changes...)
	}()
	if err != nil {
		// TODO(tbg): return the error to the caller.
		panic(err)
	}
	// 切换到准备好的配置 ④
	return r.switchToConfig(cfg, prs)
}

首先看第①点，要追述第①点我们还要看上一次配置变更的收尾工作
这段代码在raft.advance()函数中，就是当应用层把Ready结构中的数据处理完了，这是就该raft做一些收尾工作

if r.prs.Config.AutoLeave && oldApplied <= r.pendingConfIndex && newApplied >= r.pendingConfIndex && r.state == StateLeader {
			// If the current (and most recent, at least for this leader's term)
			// 这里做了一件事，就是如果AutoLeave为真，上一个配置还没收尾，就该收尾了，当然只能是leader收尾
			// 怎么收尾呢，就是发一个空配置
			ent := pb.Entry{
				Type: pb.EntryConfChangeV2,
				Data: nil,
			}
			// There's no way in which this proposal should be able to be rejected.
			if !r.appendEntry(ent) {
				panic("refused un-refusable auto-leaving ConfChangeV2")
			}
			r.pendingConfIndex = r.raftLog.lastIndex()
			r.logger.Infof("initiating automatic transition out of joint configuration %s", r.prs.Config)
		}

这个空消息在stepLeader阶段会被解析为ConfChangeV2{}，然后传到applyConfChange函数中，这样就可以通过下面这个函数

func (c ConfChangeV2) LeaveJoint() bool {
	// NB: c is already a copy.
	c.Context = nil
	return proto.Equal(&c, &ConfChangeV2{})
}

来判断是否是来解除（等价于 $C_{old，new}$ 到 $C_{new}$ ）上一次的联合配置的，这样也能看出为什么是两阶段的形式。
然后来看第②点，这一点比较简单，看代码

func (c ConfChangeV2) EnterJoint() (autoLeave bool, ok bool) {
	// NB: in theory, more config changes could qualify for the "simple"
	// protocol but it depends on the config on top of which the changes apply.
	// For example, adding two learners is not OK if both nodes are part of the
	// base config (i.e. two voters are turned into learners in the process of
	// applying the conf change). In practice, these distinctions should not
	// matter, so we keep it simple and use Joint Consensus liberally.
	if c.Transition != ConfChangeTransitionAuto || len(c.Changes) > 1 {
		// Use Joint Consensus.
		var autoLeave bool
		switch c.Transition {
		// ConfChangeTransitionAuto和ConfChangeTransitionJointImplicit都会在raft的收尾工作中自动完成联合配置解除
		case ConfChangeTransitionAuto:
			autoLeave = true
		case ConfChangeTransitionJointImplicit:
			autoLeave = true
			// 需要应用层显示发一个空配置告诉Raft，解除上一次的联合配置
		case ConfChangeTransitionJointExplicit:
		default:
			panic(fmt.Sprintf("unknown transition: %+v", c))
		}
		return autoLeave, true
	}
	return false, false
}

主要是判断是否自动解除联合配置以及是否需要应用联合共识
再来看第③点，这个应该是是整个联合配置最复杂的一段，先看代码

func (c Changer) EnterJoint(autoLeave bool, ccs ...pb.ConfChangeSingle) (tracker.Config, tracker.ProgressMap, error) {
	// 拷贝出要正在应用的配置和Tracker.Progressyi ③-①
	cfg, prs, err := c.checkAndCopy()
	if err != nil {
		return c.err(err)
	}
	// 如果已经是联合配置了，就返回错误 ③-②
	if joint(cfg) {
		err := errors.New("config is already joint")
		return c.err(err)
	}
	// 只是为了测试，别当真，正常运行的raft集群怎么会没有Voters
	if len(incoming(cfg.Voters)) == 0 {
		// We allow adding nodes to an empty config for convenience (testing and
		// bootstrap), but you can't enter a joint state.
		err := errors.New("can't make a zero-voter config joint")
		return c.err(err)
	}
	// 保留之前的配置放到joint[1]中
	*outgoingPtr(&cfg.Voters) = quorum.MajorityConfig{}
	// Copy incoming to outgoing.
	for id := range incoming(cfg.Voters) {
		// 赋值为原来配置，意思是即将不用的配置outgoing，即将走了。。。有没有字面意思的感觉哈哈哈
		outgoing(cfg.Voters)[id] = struct{}{}
	}
	// 应用为联合配置，这里很重要③-③
	if err := c.apply(&cfg, prs, ccs...); err != nil {
		return c.err(err)
	}
	// 这里的变更主要是让raft后面进行收尾工作
	cfg.AutoLeave = autoLeave
	//  这个我们在第①点里说一下
	return checkAndReturn(cfg, prs)
}

首先看第③-①点

func (c Changer) checkAndCopy() (tracker.Config, tracker.ProgressMap, error) {
	cfg := c.Tracker.Config.Clone()
	prs := tracker.ProgressMap{}

	for id, pr := range c.Tracker.Progress {
		// A shallow copy is enough because we only mutate the Learner field.
		ppr := *pr
		// 这里拷贝的是每个progress地址
		prs[id] = &ppr
	}
	return checkAndReturn(cfg, prs)
}

其实就是把我们在前面初始化的changer中的数据拷贝出来，初始化时传入的是当前正在应用的prs和lastIndex，因此在函数中才能拷贝出当前运行的配置。
然后看第③-②点，就更简单了，就是看看原来的配置中有没有老配置没有清理

func joint(cfg tracker.Config) bool {
	return len(outgoing(cfg.Voters)) > 0
}

继续看第③-③点，最重要的一点

// 这个函数主要做什么呢?就是根据ConfChangeSingle的内容，对原来的配置做一个变更
func (c Changer) apply(cfg *tracker.Config, prs tracker.ProgressMap, ccs ...pb.ConfChangeSingle) error {
	for _, cc := range ccs {
		if cc.NodeID == 0 {
			// etcd replaces the NodeID with zero if it decides (downstream of
			// raft) to not apply a change, so we have to have explicit code
			// here to ignore these.
			continue
		}
		switch cc.Type {
		// 如果是增加节点，就在prs中加一个，顺便清理Learners 和LearnersNext中相关的节点，下同就不啰嗦了
		case pb.ConfChangeAddNode:
			c.makeVoter(cfg, prs, cc.NodeID)
		case pb.ConfChangeAddLearnerNode:
			c.makeLearner(cfg, prs, cc.NodeID)
		case pb.ConfChangeRemoveNode:
			c.remove(cfg, prs, cc.NodeID)
		case pb.ConfChangeUpdateNode:
		default:
			return fmt.Errorf("unexpected conf type %d", cc.Type)
		}
	}
	// 这个就是如果变更后没有节点了，就是说全给删了或者变为learner节点，这必然不行啊，那干嘛不直接整一个新的集群，然后把这个集群给下了
	if len(incoming(cfg.Voters)) == 0 {
		return errors.New("removed all voters")
	}
	return nil
}

这样一看最重要但不一定难哈
总的来说，第③点做了什么事呢，就是生成联合配置，将Config中Joint的增加一个MajorityConfig，将老的命名为outgoing在joint[1],变更后为incomming在joint[0]，意思为即将应用的的配置，这样就组成一个联合配置了，然后把这个配置返回给applyConfChange。现在我们就可以来看第④点了，变更切换，先看代码

// switchToConfig reconfigures this node to use the provided configuration. It
// updates the in-memory state and, when necessary, carries out additional
// actions such as reacting to the removal of nodes or changed quorum
// requirements.
//
// The inputs usually result from restoring a ConfState or applying a ConfChange.
func (r *raft) switchToConfig(cfg tracker.Config, prs tracker.ProgressMap) pb.ConfState {
// 把老配置应用程序新配置，以及prs也换一下
	r.prs.Config = cfg
	r.prs.Progress = prs

	r.logger.Infof("%x switched to configuration %s", r.id, r.prs.Config)
	// 拿到之前配置状态
	cs := r.prs.ConfState()
	// 检查当前节点是否在prs中，就是看看是否被移除
	pr, ok := r.prs.Progress[r.id]

	// Update whether the node itself is a learner, resetting to false when the
	// node is removed.
	// 当前节点是否被变为learner节点
	r.isLearner = ok && pr.IsLearner
	// 如果如果是主节点且被降级了或者移除了
	if (!ok || r.isLearner) && r.state == StateLeader {
	// 直接返回
		// This node is leader and was removed or demoted. We prevent demotions
		// at the time writing but hypothetically we handle them the same way as
		// removing the leader: stepping down into the next Term.
		//
		// TODO(tbg): step down (for sanity) and ask follower with largest Match
		// to TimeoutNow (to avoid interruption). This might still drop some
		// proposals but it's better than nothing.
		//
		// TODO(tbg): test this branch. It is untested at the time of writing.
		return cs
	}

	// Follower或者candidate
	if r.state != StateLeader || len(cs.Voters) == 0 {
		return cs
	}
	// 如果是主节点，就准备提交配置,看看大多数节点的Match Index是否大于当前commitIndex，如果大于就可提交到MatchIndex
	if r.maybeCommit() {
		// If the configuration change means that more entries are committed now,
		// broadcast/append to everyone in the updated config.
		// 然后广播给其他节点，我准备提交到MatchIndex
		r.bcastAppend()
	} else {
		// Otherwise, still probe the newly added replicas; there's no reason to
		// let them wait out a heartbeat interval (or the next incoming
		// proposal).
		r.prs.Visit(func(id uint64, pr *tracker.Progress) {
			// 或者给落后的节点发送append消息
			r.maybeSendAppend(id, false /* sendIfEmpty */)
		})
	}
	// 如果将要切主的节点被移除了，就驳回切主请求
	if _, tOK := r.prs.Config.Voters.IDs()[r.leadTransferee]; !tOK && r.leadTransferee != 0 {
		r.abortLeaderTransfer()
	}
	// 返回变更后的ConfState
	return cs
}

到这里我们在applyConfChange函数中的四点就说清楚了，然后会到node.run()中

case cc := <-n.confc:
			_, okBefore := r.prs.Progress[r.id]
			cs := r.applyConfChange(cc)
			// If the node was removed, block incoming proposals. Note that we
			// only do this if the node was in the config before. Nodes may be
			// a member of the group without knowing this (when they're catching
			// up on the log and don't have the latest config) and we don't want
			// to block the proposal channel in that case.
			//
			// NB: propc is reset when the leader changes, which, if we learn
			// about it, sort of implies that we got readded, maybe? This isn't
			// very sound and likely has bugs.
			if _, okAfter := r.prs.Progress[r.id]; okBefore && !okAfter {
				var found bool
			outer:
				for _, sl := range [][]uint64{cs.Voters, cs.VotersOutgoing} {
					for _, id := range sl {
						if id == r.id {
							found = true
							break outer
						}
					}
				}
				if !found {
					propc = nil
				}
			}
			select {
			case n.confstatec <- cs:
			case <-n.done:
			}

这里主要检查当前节点是否被移除，如果移除就停止接受上报信息，然后把变更后的配置返回给应用层。这个阶段对应的是从 $C_{old}$ 到 $C_{old,new}$ 。后面的leaveJoint就是从 $C_{old，new}$ 到 $C_{new}$ 。后面就比较简单了，大同小异，读者可以自己去看。