In Search of an Understandable Consensus Algorithm(寻找可理解的共识算法)

  • 原文见 Raft Consensus Protocol
  • 题目:In Search of an Understandable Consensus Algorithm
  • 作者:Diego Ongaro and John Ousterhout Stanford University(Diego Ongaro 和 John Ousterhout,斯坦福大学)

本文主要介绍一种不同于 Paxos、且更易于理解的共识算法 Raft。关于 Paxos 算法的介绍可以查看 The Part-Time Parliament,或者查看更易于理解的版本 Paxos Made Simple。



0 Abstract(摘要)

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

Raft 是一种用于管理复制日志的共识算法,它产生的结果等价于 (multi-)Paxos,并且与 Paxos 一样高效,但它的结构与 Paxos 不同;这使得 Raft 比 Paxos 更易于理解,也为构建实际系统提供了更好的基础。为了增强可理解性,Raft 将共识的关键要素(例如 leader 选举、日志复制和安全性)分离,并强制实施更强的一致性(coherency)以减少必须考虑的状态数量。用户研究的结果表明,Raft 比 Paxos 更容易让学生学习。Raft 还包括一种用于变更集群成员的新机制,该机制使用重叠多数(overlapping majorities)来保证安全性。


1 Introduction(介绍)

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [13, 14] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.

共识算法允许一组机器作为一个协调一致的整体(coherent group)工作,即使其中某些成员发生故障也能继续运行,因此它们在构建可靠的大型软件系统方面发挥着关键作用。Paxos [13, 14] 在过去十年中主导了关于共识算法的讨论:大多数共识的实现都基于 Paxos 或受其影响,而 Paxos 已成为向学生讲授共识的主要工具。

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.

不幸的是,尽管人们多次尝试让 Paxos 更平易近人,它仍然非常难以理解。此外,其架构需要进行复杂的修改才能支持实际系统。结果,系统构建者和学生都在为 Paxos 苦苦挣扎。

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.

在自己与 Paxos 苦苦挣扎之后,我们着手寻找一种新的共识算法,为系统构建和教学提供更好的基础。我们的做法不同寻常,因为我们的首要目标是可理解性:我们能否为实际系统定义一个共识算法,并用一种比 Paxos 容易学得多的方式来描述它?此外,我们希望该算法有助于形成对系统构建者至关重要的直觉。重要的不仅是算法能够工作,还要清楚它为什么能够工作。

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.

这项工作的结果是一种称为 Raft 的共识算法。在设计 Raft 时,我们应用了特定的技术来提高可理解性,包括分解(Raft 将 leader 选举、日志复制和安全性分开)和状态空间简化(相对于 Paxos,Raft 降低了不确定性的程度,减少了服务彼此不一致的方式)。对两所大学 43 名学生进行的用户研究表明,Raft 明显比 Paxos 更容易理解:在学习了这两种算法后,其中 33 名学生回答 Raft 相关问题的成绩好于回答 Paxos 相关问题。

Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [27, 20]), but it has several novel features:
Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.
Membership changes: Raft's mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

Raft 在许多方面与现有的共识算法相似(最著名的是 Oki 和 Liskov 的 Viewstamped Replication [27, 20]),但它有几个新颖的特点:

  • 强 leader:Raft 使用比其他共识算法更强的领导形式,例如日志条目仅从 leader 流向其他服务器,这简化了复制日志的管理,使 Raft 更容易理解。
  • Leader 选举:Raft 使用随机计时器(randomized timers)来选举 leaders,这仅为任何共识算法已经需要的心跳增加了少量机制,同时简单快速地解决了冲突。
  • 成员变更(Membership changes):Raft 用于变更集群中服务集合的机制使用一种新的联合共识方法(joint consensus approach),其中两种不同配置的多数派在转换期间相互重叠,这允许集群在配置变更期间继续正常运行。

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.

我们相信,无论是出于教学目的还是作为实现的基础,Raft 都优于 Paxos 和其他共识算法。它比其他算法更简单、更易理解;它的描述足够完整,能够满足实际系统的需要;它有多个开源实现,并被多家公司使用;其安全特性已被形式化描述并证明;其效率可与其他算法相媲美。

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–7), evaluates Raft (Section 8), and discusses related work (Section 9). A few elements of the Raft algorithm have been omitted here because of space limitations, but they are available in an extended technical report [29]. The additional material describes how clients interact with the system, and how space in the Raft log can be reclaimed.

论文的其余部分介绍了复制状态机问题(第 2 节),讨论了 Paxos 的优缺点(第 3 节),描述了我们实现可理解性的一般方法(第 4 节),介绍了 Raft 共识算法(第 5-7 节),评估 Raft(第 8 节),并讨论相关工作(第 9 节)。由于篇幅限制,这里省略了 Raft 算法的一些内容,但可以在扩展的技术报告 [29] 中找到。附加资料描述了客户端如何与系统交互,以及如何回收 Raft 日志中的空间。


2 Replicated state machines(复制状态机)

Consensus algorithms typically arise in the context of replicated state machines [33]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [7], HDFS [34], and RAMCloud [30], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [9].

共识算法通常出现在复制状态机的背景下 [33]。在这种方法中,一组服务上的状态机计算同一状态的相同副本,即使某些服务器宕机也可以继续运行。复制状态机用于解决分布式系统中的各种容错问题,例如具有单一集群 leader 的大型系统,如 GFS [7]、HDFS [34] 和 RAMCloud [30],通常使用一个单独的复制状态机来管理 leader 选举,并存储必须在 leader 崩溃后仍然保留的配置信息。复制状态机的例子包括 Chubby [2] 和 ZooKeeper [9]。

Figure 1: 复制状态机架构.
共识算法管理包含来自客户端的状态机命令的复制日志。
状态机处理来自日志的相同命令序列,因此它们产生相同的输出。

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

复制状态机通常使用复制日志来实现,如图 1 所示。每个服务存储一个包含一系列命令的日志,其状态机按顺序执行这些命令。每个日志以相同的顺序包含相同的命令,因此每个状态机处理相同的命令序列。由于状态机是确定性的,每个状态机都计算相同的状态和相同的输出序列。
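
为了直观说明“确定性状态机按相同顺序执行相同命令就会得到相同状态”,下面补充一个极简的 Go 示意(译者补充的假设性草图,KVStateMachine、LogEntry 等名称均为虚构,仅用于说明,并非论文作者的实现):

```go
package main

import "fmt"

// LogEntry 表示复制日志中的一条命令(这里简化为对键值存储的赋值)。
type LogEntry struct {
	Key   string
	Value string
}

// KVStateMachine 是一个确定性的状态机:给定相同的命令序列,总是得到相同的状态。
type KVStateMachine struct {
	data map[string]string
}

func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}

// Apply 按日志顺序执行一条命令,并返回执行结果。
func (sm *KVStateMachine) Apply(e LogEntry) string {
	sm.data[e.Key] = e.Value
	return e.Value
}

func main() {
	log := []LogEntry{{"x", "1"}, {"y", "2"}, {"x", "3"}}

	// 两台“服务”各自按相同顺序应用同一份日志,最终状态必然相同。
	a, b := NewKVStateMachine(), NewKVStateMachine()
	for _, e := range log {
		a.Apply(e)
		b.Apply(e)
	}
	fmt.Println(a.data, b.data) // map[x:3 y:2] map[x:3 y:2]
}
```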

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.

保持复制日志的一致性是共识算法的工作。服务上的共识模块接收来自客户端的命令并将它们添加到其日志中,它与其他服务上的共识模块进行通信,以确保即使某些服务出现故障,每个日志最终也包含相同顺序的相同请求。一旦命令被正确复制,每个服务的状态机就会按日志顺序处理它们,并将输出返回给客户端。结果,这些服务看起来就像形成了一个单一的、高度可靠的状态机。

Consensus algorithms for practical systems typically have the following properties:
• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.
• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.

实际系统的共识算法通常具有以下特性:

  • 它们在所有非拜占庭条件(non-Byzantine conditions)下确保安全性(永远不会返回错误的结果),包括网络延迟、分区,以及数据包丢失、重复和重新排序。
  • 只要大多数服务器可以运行,并且能够相互通信以及与客户端通信,它们就是完全可用的。因此,一个典型的五台服务器集群可以容忍任意两台服务器的故障。假设服务器以停止的方式发生故障;它们稍后可以从稳定存储中保存的状态恢复并重新加入集群。
  • 它们不依赖时间来确保日志的一致性:错误的时钟和极端的消息延迟在最坏的情况下只会导致可用性问题。
  • 在一般情况下,只要集群的多数派响应了单轮远程过程调用(RPC),一条命令就可以完成;少数慢速服务器不必影响整个系统的性能。
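
关于上面提到的“多数派”容错能力,可以用一个小例子说明:N 台服务器的集群需要 ⌊N/2⌋+1 台构成多数,因此最多可以容忍 ⌊(N-1)/2⌋ 台故障。下面是译者补充的假设性 Go 小函数,仅作演示:

```go
package main

import "fmt"

// quorum 返回 n 台服务构成多数所需的最少台数。
func quorum(n int) int { return n/2 + 1 }

// maxFailures 返回在仍能形成多数的前提下可容忍的最大故障数。
func maxFailures(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("集群 %d 台:多数 = %d,可容忍故障 = %d\n", n, quorum(n), maxFailures(n))
	}
	// 例如典型的五台集群:多数为 3,可容忍任意两台故障。
}
```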

3 What’s wrong with Paxos? (Paxos有什么问题?)

Over the last ten years, Leslie Lamport’s Paxos protocol [13] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

在过去的十年中,Leslie Lamport 的 Paxos 协议 [13] 几乎成为共识的同义词:它是课程中最常教授的协议,大多数共识的实现都以它为起点。Paxos 首先定义了一个能够就单个决策(例如单个复制的日志条目)达成一致的协议,我们将这个子集称为单法令 Paxos(single-decree Paxos)。然后,Paxos 组合该协议的多个实例来促成一系列决策,例如一个日志(multi-Paxos)。Paxos 同时保证安全性和活性,并支持集群成员变更,其正确性已被证明,并且在正常情况下是高效的。

Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [13] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [14, 18, 19]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.

不幸的是,Paxos 有两个明显的缺点。第一个缺点是 Paxos 异常难以理解,其完整的解释 [13] 是出了名的晦涩,很少有人能成功理解它,而且只有付出巨大努力才行。因此,人们多次尝试用更简单的术语来解释 Paxos [14, 18, 19],这些解释侧重于单法令子集,但理解它们仍然具有挑战性。在对 NSDI 2012 与会者的非正式调查中,我们发现很少有人对 Paxos 感到得心应手,即使是经验丰富的研究人员也是如此。我们自己也在与 Paxos 斗争;直到阅读了几个简化的解释并设计了我们自己的替代协议之后,我们才理解了完整的协议,这个过程花了将近一年的时间。

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

我们推测,Paxos 的晦涩源于它选择单法令子集作为其基础。单法令 Paxos 密集而微妙:它分为两个阶段,这两个阶段没有简单直观的解释,也不能被独立理解,因此很难对单法令协议为何有效形成直觉。multi-Paxos 的组合规则又显著增加了复杂性和微妙性。我们认为,就多个决策(即一个日志而不是单个条目)达成共识这一整体问题,可以用其他更直接、更明显的方式进行分解。

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [24], [35], and [11], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.

Paxos 的第二个问题是它没有为构建实际实现提供良好的基础。原因之一是对于 multi-Paxos 没有被广泛认可的算法。Lamport 的描述主要是关于单法令 Paxos;他勾画了实现 multi-Paxos 的可能方法,但缺少许多细节。已经有几次尝试充实和优化 Paxos,例如 [24]、[35] 和 [11],但这些尝试彼此不同,也与 Lamport 的草图不同。Chubby [4] 等系统已经实现了类似 Paxos 的算法,但在大多数情况下,它们的细节尚未公开。

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

此外,Paxos 架构对于构建实际系统来说是一种糟糕的架构;这是单法令分解的另一个结果。例如独立选择一组日志条目,然后将它们融合到一个顺序日志中几乎没有什么好处;这只会增加复杂性。围绕日志设计一个系统更简单、更有效,其中新条目以受约束的顺序依次追加。另一个问题是 Paxos 在其核心使用对称的点对点方法(尽管它最终提议了一种弱领导形式作为性能优化),这在一个只做出一个决定的简化世界中是有意义的,但很少有实际系统使用这种方法。如果必须做出一系列决策,首先选举一个 leader,然后让 leader 协调决策会更简单、更快捷。

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].

因此,实际系统与 Paxos 几乎没有相似之处。每个实现都从 Paxos 开始,发现实现它的困难,然后开发出明显不同的架构,这既费时又容易出错,而理解 Paxos 的困难加剧了这个问题。Paxos 的表述形式或许很适合用来证明关于其正确性的定理,但实际实现与 Paxos 差别如此之大,以至于这些证明没有什么价值。以下来自 Chubby 实现者的评论很有代表性:

Paxos 算法的描述与现实世界系统的需求之间存在重大差距……最终的系统将基于一个未经证明的协议 [4]。

Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

由于这些问题,我们得出结论:Paxos 没有为系统构建或教学提供良好的基础。考虑到共识在大型软件系统中的重要性,我们决定看看能否设计一种具有比 Paxos 更好特性的替代共识算法,Raft 就是那个实验的结果。


4 Designing for understandability(可理解性设计)

We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

我们在设计 Raft 时有几个目标:它必须为系统构建提供完整且实用的基础,从而显著减少开发人员所需的设计工作量;它必须在所有条件下都是安全的,并且在典型的运行条件下可用;并且它必须对常见操作高效。但我们最重要的目标(也是最困难的挑战)是可理解性,必须让广大受众能够轻松地理解这个算法。此外,必须能够形成对该算法的直觉,以便系统构建者能够进行实际实现中不可避免的扩展。

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?

在 Raft 的设计中,有很多地方我们不得不在多种替代方法之间进行选择。在这些情况下,我们根据可理解性来评估备选方案:解释每个备选方案有多难(例如,它的状态空间有多复杂,是否有微妙的含义?),以及读者完全理解该方法及其含义有多容易?

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

我们意识到这种分析具有高度的主观性;尽管如此,我们还是使用了两种普遍适用的技术。第一种技术是众所周知的问题分解方法:在可能的情况下,我们将问题分成可以相对独立地解决、解释和理解的部分。例如,在 Raft 中我们将 leader 选举、日志复制、安全性和成员变更分开。

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.

我们的第二种方法是通过减少需要考虑的状态数量来简化状态空间,使系统更加连贯,并尽可能消除不确定性。具体来说,日志不允许有空洞,并且 Raft 限制了日志之间彼此不一致的方式。尽管在大多数情况下我们试图消除不确定性,但在某些情况下,不确定性实际上提高了可理解性。特别是,随机化方法引入了不确定性,但它们倾向于通过以类似的方式处理所有可能的选择来减少状态空间(“选择任何一个,无关紧要”)。我们使用随机化来简化 Raft 的 leader 选举算法。


5 The Raft consensus algorithm (Raft 共识算法)

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft 是一种用于管理第 2 节中描述的形式的复制日志的算法,图 2 以精简形式总结了该算法以供参考,图 3 列出了该算法的关键特性;这些图的元素将在本节的其余部分逐个讨论。

State(状态)

所有服务上的持久状态(在响应 RPC 之前先更新到稳定存储):

  • currentTerm:服务所见到的最新任期(首次启动时初始化为 0,单调递增)
  • votedFor:当前任期内获得本服务选票的候选者 ID(如果没有则为 null)
  • log[]:日志条目;每个条目包含状态机命令,以及 leader 接收该条目时的任期(第一个索引为 1)

所有服务上的易失状态(Volatile state):

  • commitIndex:已知已提交的最高日志条目的索引(初始化为 0,单调递增)
  • lastApplied:已应用到状态机的最高日志条目的索引(初始化为 0,单调递增)

leader 上的易失状态(选举后重新初始化):

  • nextIndex[]:对每个服务,下一个要发送给该服务的日志条目的索引(初始化为 leader 最后一个日志索引 + 1)
  • matchIndex[]:对每个服务,已知已复制到该服务的最高日志条目的索引(初始化为 0,单调递增)

AppendEntries RPC(追加条目 RPC)

由 leader 调用以复制日志条目(§5.3);也用作心跳(§5.2)。

参数:

  • term:leader 的任期
  • leaderId:leader 的 ID,便于 follower 重定向客户端
  • prevLogIndex:紧接在新条目之前的日志条目的索引
  • prevLogTerm:prevLogIndex 处条目的任期
  • entries[]:要存储的日志条目(心跳时为空;为了效率可以一次发送多个)
  • leaderCommit:leader 的 commitIndex

结果:

  • term:currentTerm,供 leader 更新自己
  • success:如果 follower 包含与 prevLogIndex 和 prevLogTerm 匹配的条目,则为 true

Receiver 实现:

  1. 如果 term < currentTerm,回复 false(§5.1)
  2. 如果日志在 prevLogIndex 处不包含任期与 prevLogTerm 匹配的条目,回复 false(§5.3)
  3. 如果已有条目与新条目冲突(索引相同但任期不同),删除该条目及其之后的所有条目(§5.3)
  4. 追加日志中尚不存在的所有新条目
  5. 如果 leaderCommit > commitIndex,设置 commitIndex = min(leaderCommit, 最后一个新条目的索引)

RequestVote RPC(请求投票 RPC)

由候选者调用以收集选票(§5.2)。

参数:

  • term:候选者的任期
  • candidateId:请求投票的候选者
  • lastLogIndex:候选者最后一个日志条目的索引(§5.4)
  • lastLogTerm:候选者最后一个日志条目的任期(§5.4)

结果:

  • term:currentTerm,供候选者更新自己
  • voteGranted:true 表示候选者获得了这张选票

Receiver 实现:

  1. 如果 term < currentTerm,回复 false(§5.1)
  2. 如果 votedFor 为 null 或 candidateId,并且候选者的日志至少与接收者的日志一样新,则投票给它(§5.2, §5.4)

Rules for Servers(服务规则)

所有服务:

  • 如果 commitIndex > lastApplied:递增 lastApplied,将 log[lastApplied] 应用到状态机(§5.3)
  • 如果 RPC 请求或响应包含任期 T > currentTerm:设置 currentTerm = T,转换为 follower(§5.1)

Followers(§5.2):

  • 响应候选者和 leader 的 RPC
  • 如果在选举超时时间内既没有收到来自当前 leader 的 AppendEntries RPC,也没有投票给某个候选者:转换为候选者

候选者(§5.2):

  • 转换为候选者后,开始选举:
    • 递增 currentTerm
    • 为自己投票
    • 重置选举计时器
    • 向所有其它服务发送 RequestVote RPC
  • 如果收到来自大多数服务的选票:成为 leader
  • 如果收到来自新 leader 的 AppendEntries RPC:转换为 follower
  • 如果选举超时:开始新的选举

Leaders:

  • 当选时:向每个服务发送初始的空 AppendEntries RPC(心跳);在空闲期间重复发送,以防止选举超时(§5.2)
  • 如果收到来自客户端的命令:将条目追加到本地日志,在条目应用到状态机后再响应(§5.3)
  • 如果某个 follower 的最后一个日志索引 ≥ nextIndex:发送从 nextIndex 开始的日志条目的 AppendEntries RPC
    • 如果成功:更新该 follower 的 nextIndex 和 matchIndex(§5.3)
    • 如果 AppendEntries 因日志不一致而失败:递减 nextIndex 并重试(§5.3)
  • 如果存在一个 N,使得 N > commitIndex、大多数的 matchIndex[i] ≥ N、且 log[N].term == currentTerm:设置 commitIndex = N(§5.3, §5.4)

Figure 2: Raft 共识算法的简要总结(不包括成员变更和日志压缩)。左上框中的服务行为被描述为一组独立且重复触发的规则,诸如 §5.2 之类的编号指出讨论相应特性的章节,正式的规范 [28] 更准确地描述了该算法。
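
为了更直观地对应图 2 中列出的状态与 RPC 参数,下面补充一组仅作示意的 Go 类型定义(译者补充的假设性草图,字段名沿用论文术语,但并非任何真实实现的代码):

```go
package raft

// 所有服务上的状态(持久状态需在响应 RPC 前写入稳定存储)。
type ServerState struct {
	// 持久状态
	CurrentTerm int        // 服务所见到的最新任期
	VotedFor    *int       // 当前任期内投票给的候选者 ID(nil 表示尚未投票)
	Log         []LogEntry // 日志条目,第一个索引为 1

	// 所有服务上的易失状态
	CommitIndex int // 已知已提交的最高日志条目索引
	LastApplied int // 已应用到状态机的最高日志条目索引

	// leader 上的易失状态(选举后重新初始化)
	NextIndex  []int // 每个 follower:下一个要发送的日志条目索引
	MatchIndex []int // 每个 follower:已知已复制的最高日志条目索引
}

type LogEntry struct {
	Term    int    // leader 接收该条目时的任期
	Command []byte // 状态机命令
}

// AppendEntries RPC 的参数与结果。
type AppendEntriesArgs struct {
	Term         int        // leader 的任期
	LeaderID     int        // 便于 follower 重定向客户端
	PrevLogIndex int        // 紧接在新条目之前的日志条目索引
	PrevLogTerm  int        // prevLogIndex 处条目的任期
	Entries      []LogEntry // 要存储的日志条目(心跳时为空)
	LeaderCommit int        // leader 的 commitIndex
}

type AppendEntriesReply struct {
	Term    int  // currentTerm,供 leader 更新自己
	Success bool // follower 是否包含匹配 prevLogIndex/prevLogTerm 的条目
}

// RequestVote RPC 的参数与结果。
type RequestVoteArgs struct {
	Term         int // 候选者的任期
	CandidateID  int // 请求投票的候选者
	LastLogIndex int // 候选者最后一个日志条目的索引
	LastLogTerm  int // 候选者最后一个日志条目的任期
}

type RequestVoteReply struct {
	Term        int  // currentTerm,供候选者更新自己
	VoteGranted bool // true 表示候选者获得了这张选票
}
```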

Election Safety(选举安全):在给定的任期内最多选出一个 leader。§5.2
Leader Append-Only(leader 只追加):leader 永远不会覆盖或删除其日志中的条目;它只追加新条目。§5.3
Log Matching(日志匹配):如果两个日志包含具有相同索引和任期的条目,那么这两个日志在直到该索引(含)为止的所有条目中都相同。§5.3
Leader Completeness(leader 完整性):如果一个日志条目在给定任期内被提交,那么该条目将出现在所有更高任期的 leader 的日志中。§5.4
State Machine Safety(状态机安全):如果某台服务已将给定索引处的日志条目应用到其状态机,那么其它服务永远不会对同一索引应用不同的日志条目。§5.4.3
Figure 3: Raft 保证这些属性在任何时候都成立,属性后的章节编号指出讨论该属性的位置。

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Raft 通过首先选举一个唯一的 leader 来实现共识,然后让这个 leader 全权负责管理复制日志。leader 接受来自客户端的日志条目,将它们复制到其他服务器上,并告诉服务器何时可以安全地将日志条目应用到它们的状态机。拥有 leader 简化了复制日志的管理,例如 leader 可以在不咨询其他服务器的情况下决定在日志中放置新条目的位置,并且数据以简单的方式从 leader 流向其他服务器。leader 可能会失败或与其他服务器断开连接,在这种情况下会选出新的 leader。

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:
• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).
• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
• Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

鉴于 leader 方式,Raft 将共识问题分解为三个相对独立的子问题,这些子问题将在以下小节中讨论:

  • Leader 选举:当现有 leader 失败时,必须选择新的leader(第 5.2 节)。
  • 日志复制:leader 必须接受来自客户端的日志条目并在整个集群中复制它们,迫使其他日志与自己的一致(第 5.3 节)。
  • 安全性:Raft 的关键安全属性是图 3 中的状态机安全属性:如果任何服务器已将特定日志条目应用到其状态机,则其他服务器不能对相同的日志索引应用不同的命令。 5.4 节描述了 Raft 如何保证这个特性;该解决方案涉及对第 5.2 节中描述的选举机制的额外限制。

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.

在介绍了共识算法之后,本节将讨论可用性问题和时序 (role of timing)在系统中的作用。

5.1 Raft basics(Raft基础)

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

一个 Raft 集群包含多个服务;五台是一个典型的数字,它允许系统容忍两台服务器故障。在任何给定时间,每个服务都处于以下三种状态之一:leader、follower 或候选者。在正常运行中,恰好有一个 leader,所有其他服务都是 follower。follower 是被动的:它们不会自己发出请求,而只是响应来自 leader 和候选者的请求。leader 处理所有客户端请求(如果客户端联系 follower,follower 会将其重定向到 leader)。第三种状态(候选者)用于选举新的 leader,如第 5.2 节所述。图 4 显示了这些状态及其转换;下面将讨论这些转换。

Figure 4: Server states

Figure 4: 服务状态。
follower 只响应来自其他服务的请求,如果 follower 没有收到任何通信,它就会成为候选者并发起选举,从整个集群的大多数那里获得投票的候选者成为新的 leader,leader 通常会一直运作直到失败为止。

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Raft 将时间划分为任意长度的任期 (term),如图 5 所示,任期为连续的整数编号,每个任期都以选举开始,其中一个或多个候选人尝试成为 leader,如第 5.2 节所述,如果一个候选者赢得了选举,那么他将在余下的任期内担任 leader 。在某些情况下,选举会导致分裂投票,在这种情况下,任期将在没有领导者的情况下结束;新的任期(有新的选举)将很快开始,Raft 确保在给定的任期内最多有一个 leader。
Figure 5: Time is divided into terms, and each term begins with an election

Figure 5: 时间划分为几个任期,每个任期都以选举开始。
选举成功后,由一个 leader 管理集群直到任期结束。在某些情况下选举会失败,任期结束时没有选出 leader。不同的服务可能在不同的时间观察到任期之间的转换。

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [12] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.

不同的服务可能会在不同的时间观察到任期之间的转换,在某些情况下,一个服务可能观察不到某次选举甚至整个任期。任期在 Raft 中充当逻辑时钟 [12],它们允许服务检测过时的信息,例如过时的 leader。每个服务存储一个当前任期编号,它随时间单调递增。服务之间通信时总会交换当前任期;如果一台服务的当前任期小于另一台服务的当前任期,则它将自己的当前任期更新为较大的值。如果候选者或 leader 发现其任期已过时,它会立即恢复到 follower 状态。如果服务收到带有过期任期号的请求,它会拒绝该请求。
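
下面用一个简短的 Go 片段示意“任期作为逻辑时钟”的更新规则与过期请求的拒绝(译者补充的假设性草图,observeTerm 为虚构的函数名):

```go
package main

import "fmt"

// observeTerm 在收到任何 RPC 请求或响应时调用:
// 如果对方任期更大,更新本地任期并退回 follower 状态;
// 返回值表示对方携带的任期是否已经过期(应拒绝该请求)。
func observeTerm(currentTerm *int, isLeaderOrCandidate *bool, rpcTerm int) (stale bool) {
	if rpcTerm > *currentTerm {
		*currentTerm = rpcTerm
		*isLeaderOrCandidate = false // 立即恢复为 follower
	}
	return rpcTerm < *currentTerm
}

func main() {
	term, leading := 5, true
	fmt.Println(observeTerm(&term, &leading, 7), term, leading) // false 7 false:更新任期并退回 follower
	fmt.Println(observeTerm(&term, &leading, 3), term, leading) // true 7 false:过期请求,应当拒绝
}
```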

Raft servers communicate using remote procedure calls (RPCs), and the consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

Raft 服务使用远程过程调用 (RPC) 进行通信,共识算法只需要两种类型的 RPC,RequestVote RPC 由候选者在选举期间发起(第 5.2 节),而 AppendEntries RPC 由 leader 发起以复制日志条目并提供一种心跳形式(第 5.3 节)。如果服务没有及时收到响应,它们会重试 RPC,并且它们并行发出 RPC 以获得最佳性能。

5.2 Leader election(Leader选举)

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

Raft 使用心跳机制来触发 leader 选举。当服务启动时,它们以 follower 身份开始。只要服务能从 leader 或候选者那里收到有效的 RPC,它就保持 follower 状态。leader 定期向所有 follower 发送心跳(不携带日志条目的 AppendEntries RPC)以维持自己的权威。如果 follower 在称为选举超时(election timeout)的一段时间内没有收到任何通信,它就假定当前没有可用的 leader,并开始一次选举以选出新的 leader。

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.

开始选举时,follower 增加其当前任期并转换到候选者状态,然后它为自己投票,并行地向集群中的每个其它服务发出 RequestVote RPC。候选者会一直处于这种状态,直到发生以下三种情况之一:(a) 它赢得选举,(b) 另一个服务成为 leader,或 (c) 一段时间过去仍没有胜出者。这些结果将在以下段落中分别讨论。
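
作为示意,下面用 Go 勾勒“转为候选者、并行请求投票并统计选票”的大致流程(译者补充的假设性草图,grantVote 代表一次假想的 RequestVote RPC;真实实现会在获得多数票时立即当选,这里为简化起见等待所有回复):

```go
package main

import (
	"fmt"
	"sync"
)

// runElection 是一个极简的选举示意:候选者给自己投一票,
// 并行向其它服务请求投票,获得多数票则当选。
func runElection(myID, term int, peers []int, grantVote func(peer, term int) bool) bool {
	votes := 1 // 自己的一票
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, p := range peers {
		wg.Add(1)
		go func(peer int) {
			defer wg.Done()
			if grantVote(peer, term) { // 假设的 RequestVote RPC
				mu.Lock()
				votes++
				mu.Unlock()
			}
		}(p)
	}
	// 为简化示意,这里等所有回复都返回后再判断;真实实现会在达到多数时立即当选。
	wg.Wait()
	return votes > (len(peers)+1)/2 // 多数票才能当选
}

func main() {
	// 假设 5 台服务,编号 1..5,服务 1 在任期 2 发起选举,服务 2、3 投票给它。
	granted := map[int]bool{2: true, 3: true, 4: false, 5: false}
	won := runElection(1, 2, []int{2, 3, 4, 5}, func(peer, term int) bool { return granted[peer] })
	fmt.Println("won election:", won) // true:3/5 已构成多数
}
```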

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

如果一个候选者在同一个任期内获得了整个集群中大多数服务的选票,它就赢得了选举。每个服务将在给定的任期内以先到先得的方式最多投票给一名候选者(注意:第 5.4 节对投票增加了额外的限制)。多数规则确保在某个特定任期内最多只有一名候选者能赢得选举(图 3 中的选举安全属性)。一旦一个候选者赢得了选举,它就会成为 leader,然后它向所有其它服务发送心跳消息以建立其权威并防止新的选举。

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

在等待投票时,候选者可能会收到来自另一台声称是 leader 服务的 AppendEntries RPC。如果 leader 的任期(包含在其 RPC 中)至少与候选者的当前任期一样大,则候选者将 leader 视为合法并返回 follower 状态。如果 RPC 中的任期小于候选者的当前任期,则候选者拒绝 RPC 并继续处于候选者状态。

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.

第三种可能的结果是这次选举候选者既没赢也没输:如果许多 followers 同时成为候选者,可能会分裂选票,从而没有候选者获得多数票,发生这种情况时,每个候选者将超时并通过增加其任期并启动另一轮 RequestVote RPC 来开始新的选举。然而如果没有额外的措施,分裂选票可能会无限期地重复。

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 8.3 shows that this approach elects a leader rapidly.

Raft 使用随机化的选举超时时间来确保分裂选票很少发生,并且即使发生也能被快速解决。为了从一开始就防止分裂投票,选举超时时间是从一个固定区间(例如 150-300ms)中随机选择的。这样可以把各服务的超时时间错开,使得在大多数情况下只有一台服务会超时;它赢得选举并在任何其它服务超时之前发出心跳。相同的机制还用于处理分裂投票:每个候选者在选举开始时重新选择随机的选举超时时间,并等待该超时时间过去后才开始下一次选举;这降低了在新选举中再次出现分裂投票的可能性。8.3 节表明这种方法可以快速选出 leader。
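
随机化选举超时本身非常简单,下面给出一个译者补充的 Go 示意(沿用论文示例的 150-300ms 区间,仅作演示):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout 在固定区间(这里沿用论文示例的 150–300ms)内随机选取选举超时,
// 使得大多数情况下只有一台服务率先超时并发起选举。
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout())
	}
}
```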

Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.

选举是一个例子,说明可理解性如何指导我们在设计备选方案之间进行选择。最初我们计划使用一个排名系统:每个候选者都被分配一个唯一的排名,用于在相互竞争的候选者之间进行选择。如果一个候选者发现另一个排名更高的候选者,它会回到 follower 状态,这样排名更高的候选者可以更容易地赢得下一次选举。我们发现这种方法在可用性方面产生了微妙的问题(如果排名较高的服务出现故障,排名较低的服务可能需要超时并再次成为候选者,但如果它这样做得太早,就可能重置选举 leader 的进度)。我们对算法进行了多次调整,但每次调整后都会出现新的极端情况,最终我们得出结论:随机重试的方法更加直观和易于理解。

5.3 Log replication(日志复制)

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

一旦选举出 leader 它就会开始为客户端请求提供服务,每个客户端请求都包含一个要由复制状态机执行的命令,leader 将命令作为新条目附加到其日志中,然后并行地向其它每个服务发出 AppendEntries RPC 以复制条目。当条目被安全复制后(如下所述),leader 将条目应用于其状态机并将该执行的结果返回给客户端。如果 follower 崩溃或运行缓慢,或者网络数据包丢失,leader 会无限期地重试 AppendEntries RPC(即使它已经响应了客户端),直到所有 follower 最终存储所有日志条目。

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

日志的组织方式如图 6 所示,每个日志条目存储一个状态机命令以及 leader 收到条目时的任期号。日志条目中的任期编号用于检测日志之间的不一致并确保图 3 中的某些属性,每个日志条目还有一个整数索引,用于标识其在日志中的位置。


Figure 6: 日志由按顺序编号的条目组成。
每个条目包含创建它的任期(每个框中的数字)和状态机的命令,如果该条目可以安全地应用于状态机,则该条目被视为已提交。

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

leader 决定何时将日志条目应用到状态机是安全的;这样的条目称为已提交(committed)。Raft 保证已提交的条目是持久的,并且最终会被所有可用的状态机执行。一旦创建条目的 leader 将其复制到了大多数服务器上(例如,图 6 中的条目 7),该日志条目就被提交。这也会提交 leader 日志中所有在它之前的条目,包括由以前的 leader 创建的条目。第 5.4 节讨论了在 leader 变更后应用此规则时的一些微妙之处,并且还表明这种提交的定义是安全的。leader 跟踪它所知道的已提交的最高索引,并将该索引包含在未来的 AppendEntries RPC(包括心跳)中,以便其他服务最终获知。一旦 follower 得知一个日志条目已提交,它就会将该条目应用到它的本地状态机(按日志顺序)。

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:
• If two entries in different logs have the same index and term, then they store the same command.
• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

我们设计了 Raft 日志机制来保持不同服务上的日志之间的高度一致性,这不仅简化了系统的行为并使其更具可预测性,而且还是确保安全的重要组成部分。Raft 维护了以下属性,它们共同构成了图 3 中的日志匹配属性:

  • 如果不同日志中的两个条目具有相同的索引和任期,则它们存储相同的命令。
  • 如果不同日志中的两个条目具有相同的索引和任期,则这两个日志在所有之前的条目中都相同。

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.

第一个属性源于这样一个事实:leader 在给定任期内对给定的日志索引最多创建一个条目,并且日志条目永远不会改变它们在日志中的位置。第二个属性由 AppendEntries 执行的简单一致性检查来保证:在发送 AppendEntries RPC 时,leader 会带上其日志中紧接在新条目之前的那个条目的索引和任期,如果 follower 在其日志中没有找到具有相同索引和任期的条目,它就拒绝这些新条目。这个一致性检查起到了归纳步骤的作用:日志的初始空状态满足日志匹配属性(Log Matching Property),并且每当日志被扩展时,一致性检查都会保持日志匹配属性。因此,每当 AppendEntries 成功返回时,leader 就知道 follower 的日志直到新条目为止与自己的日志相同。
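
下面用一段可运行的 Go 代码示意 follower 端的这一一致性检查,以及冲突条目的删除与追加(译者补充的假设性草图,handleAppendEntries 为虚构函数名,忽略了任期检查等细节):

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// handleAppendEntries 示意 follower 端的一致性检查(日志索引从 1 开始,log[0] 为占位)。
func handleAppendEntries(log []LogEntry, prevLogIndex, prevLogTerm int, entries []LogEntry) ([]LogEntry, bool) {
	// 一致性检查:prevLogIndex 处必须存在任期匹配的条目,否则拒绝。
	if prevLogIndex >= len(log) || log[prevLogIndex].Term != prevLogTerm {
		return log, false
	}
	// 逐条检查新条目:遇到冲突(同索引不同任期)则删除该条目及其之后的所有条目,再追加。
	for i, e := range entries {
		idx := prevLogIndex + 1 + i
		if idx < len(log) && log[idx].Term != e.Term {
			log = log[:idx]
		}
		if idx >= len(log) {
			log = append(log, e)
		}
	}
	return log, true
}

func main() {
	// follower 的日志:索引 1、2 的任期均为 1(log[0] 为占位条目)。
	log := []LogEntry{{}, {1, "x=1"}, {1, "y=2"}}

	// leader 发送紧跟在索引 2(任期 1)之后的新条目:检查通过,条目被追加。
	log, ok := handleAppendEntries(log, 2, 1, []LogEntry{{2, "x=3"}})
	fmt.Println(ok, len(log)-1) // true 3

	// 若 prevLogTerm 不匹配(leader 以为索引 3 的任期是 3),检查失败,follower 拒绝。
	_, ok = handleAppendEntries(log, 3, 3, nil)
	fmt.Println(ok) // false
}
```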

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

正常运行时,leader 和 follower 的日志保持一致,因此 AppendEntries 一致性检查永远不会失败。但是 leader 崩溃会使日志不一致(旧的 leader 可能没有完全复制其日志中的所有条目),这些不一致可能在一系列 leader 和 follower 崩溃中不断累积。图 7 说明了 follower 的日志可能与新 leader 的日志不同的方式:follower 可能缺少 leader 上存在的条目,也可能有 leader 上不存在的额外条目,或两者兼而有之。日志中缺失和多余的条目可能跨越多个任期。

Figure 7: When the leader at the top comes to power, it is possible that any of scenarios (a–f) could occur in follower logs

Figure 7: 当顶部的 leader 上任时,任何情况 (a–f) 都可能出现在 follower 日志中。
每个框代表一个日志条目;框中的数字是它的任期。follower 可能缺少条目 (a–b),可能有额外的未提交条目 (c–d),或两者都有 (e–f)。例如,如果该服务是第 2 任期的 leader,在其日志中添加了几个条目,然后在提交任何条目之前崩溃,则可能会发生场景 (f);它很快重新启动,成为第 3 任期的 leader,并在其日志中添加了更多条目; 在提交第 2 任期或第 3 任期中的任何条目之前,服务再次崩溃并保持停机数个任期。

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.

在 Raft 中 leader 通过强制 follower 的日志复制自己 (leader) 的日志来处理不一致,这意味着 follower 日志中的冲突条目将被 leader 日志中的条目覆盖,第 5.4 节将表明,当再加上一个限制时,这是安全的。

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

为了使 follower 的日志与自己的一致,leader 必须找到两个日志一致的最新日志条目,删除 follower 日志中该点之后的所有条目,并将 leader 在该点之后的所有条目发送给 follower。所有这些操作都是在响应 AppendEntries RPC 执行的一致性检查时发生的。leader 为每个 follower 维护一个 nextIndex,这是 leader 将发送给该 follower 的下一个日志条目的索引。当 leader 刚上任时,它将所有 nextIndex 值初始化为其日志中最后一个条目之后的索引(图 7 中的 11)。如果一个 follower 的日志与 leader 的不一致,则 AppendEntries 一致性检查将在下一次 AppendEntries RPC 中失败;被拒绝后,leader 递减 nextIndex 并重试 AppendEntries RPC,最终 nextIndex 会达到 leader 和 follower 日志匹配的位置。此时 AppendEntries 将成功,它会删除 follower 日志中的所有冲突条目,并附加来自 leader 日志的条目(如果有)。一旦 AppendEntries 成功,follower 的日志就与 leader 的日志一致,并且在该任期的剩余时间内将保持这种状态。
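
下面是译者补充的一个小的 Go 示意,模拟 leader 在被拒绝后递减 nextIndex 并重试,直到两份日志一致的过程(syncFollower、followerAccepts 均为虚构名称,忽略了网络与任期等细节):

```go
package main

import "fmt"

type LogEntry struct{ Term int }

// followerAccepts 返回 follower 对一次 AppendEntries 一致性检查的结果(仅示意)。
func followerAccepts(followerLog []LogEntry, prevLogIndex, prevLogTerm int) bool {
	return prevLogIndex < len(followerLog) && followerLog[prevLogIndex].Term == prevLogTerm
}

// syncFollower 示意 leader 侧的重试过程:被拒绝就递减 nextIndex,
// 直到找到两份日志一致的位置,然后一次性补齐其后的所有条目。
func syncFollower(leaderLog, followerLog []LogEntry, nextIndex int) []LogEntry {
	for {
		prev := nextIndex - 1
		if followerAccepts(followerLog, prev, leaderLog[prev].Term) {
			// 一致性检查通过:删除 follower 中冲突的后缀,并追加 leader 的条目。
			return append(followerLog[:nextIndex], leaderLog[nextIndex:]...)
		}
		nextIndex-- // 被拒绝:递减 nextIndex 后重试
	}
}

func main() {
	// 索引 0 为占位;leader 日志的任期为 [1 1 2 3],follower 为 [1 1 2 2](索引 4 处冲突)。
	leader := []LogEntry{{}, {1}, {1}, {2}, {3}}
	follower := []LogEntry{{}, {1}, {1}, {2}, {2}}

	// leader 上任后 nextIndex 初始化为其最后一个日志索引 + 1(这里为 5),随后逐步回退。
	synced := syncFollower(leader, follower, 5)
	fmt.Println(len(synced)-1, synced[3].Term, synced[4].Term) // 4 2 3:两份日志已一致
}
```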

The protocol can be optimized to reduce the number of rejected AppendEntries RPCs; see [29] for details.

可以对协议进行优化以减少被拒绝的 AppendEntries RPC 的数量,有关详细信息请参见 [29]。

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

有了这种机制,leader 在上任时不需要采取任何特殊的行动来恢复日志的一致性。它只需开始正常运行,日志就会在 AppendEntries 一致性检查失败的响应过程中自动收敛。leader 永远不会覆盖或删除自己日志中的条目(图 3 中的 Leader Append-Only 属性)。

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.

这种日志复制机制展示了第 2 节中描述的理想共识属性:只要大多数服务器都启动,Raft 就可以接受、复制和应用新的日志条目; 在正常情况下,可以通过一轮 RPC 将新条目复制到集群的大多数; 并且单个慢 follower 不会影响性能。

5.4 Safety(安全)

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.

前面的部分描述了 Raft 如何选举 leader 和复制日志条目,然而到目前为止描述的机制还不足以确保每个状态机以相同的顺序执行完全相同的命令,例如当 leader 提交多个日志条目时 follower 可能不可用,然后它可以被选为 leader 并用新的条目覆盖这些条目; 因此,不同的状态机可能会执行不同的命令序列。

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

本节通过增加对哪些服务可以被选为 leader 的限制来完成 Raft 算法。该限制确保任何给定任期的 leader 都包含之前任期中提交的所有条目(图 3 中的 Leader 完整性属性)。在给出选举限制之后,我们会把提交规则表述得更加精确。最后,我们给出 Leader 完整性属性的证明概要,并说明它如何引出复制状态机的正确行为。

5.4.1 Election restriction(选举限制)

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [20], a leader can be elected even if it doesn’t initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

在任何基于 leader 的共识算法中,leader 最终必须存储所有提交的日志条目。在某些共识算法中,例如 Viewstamped Replication [20],即使 leader 最初不包含所有已提交的条目,也可以选举出领导者。这些算法包含额外的机制来识别丢失的条目并将它们传输给新的 leader,无论是在选举过程中还是之后不久。不幸的是,这会导致相当多的额外机制和复杂性,Raft 使用了一种更简单的方法,它保证从选举的那一刻起,每个新 leader 都存在以前任期的所有提交条目,而无需将这些条目传输给 leader,这意味着日志条目只向一个方向流动,从 leader 到 follower,leader 永远不会覆盖他们日志中的现有条目。

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft 使用投票过程来阻止日志中不包含所有已提交条目的候选者赢得选举。候选者必须与集群的多数派通信才能当选,这意味着每个已提交的条目必须至少存在于这些服务中的一个上。如果候选者的日志至少与该多数派中的任何其它日志一样新(“最新”的精确定义见下文),那么它就包含所有已提交的条目。RequestVote RPC 实现了这个限制:RPC 中包含有关候选者日志的信息,如果投票者自己的日志比候选者的日志更新,则投票者拒绝投票。

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

Raft 通过比较日志中最后一个条目的索引和任期来确定两个日志中的哪一个是最新的,如果日志的最后条目具有不同的任期,则具有较晚任期的日志是最新的,如果日志以相同的任期结束,则以更长的日志为准。
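
这个“谁更新”的比较规则可以写成一个很小的函数,下面是译者补充的 Go 示意(moreUpToDate、grantVote 为虚构名称):

```go
package main

import "fmt"

// moreUpToDate 按上述规则比较两份日志谁更“新”:
// 先比较最后一个条目的任期,任期相同时比较日志长度。
func moreUpToDate(lastTermA, lastIndexA, lastTermB, lastIndexB int) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB
	}
	return lastIndexA > lastIndexB
}

// 投票规则的核心:只有当候选者的日志“至少和自己一样新”时才投票。
func grantVote(voterLastTerm, voterLastIndex, candLastTerm, candLastIndex int) bool {
	return !moreUpToDate(voterLastTerm, voterLastIndex, candLastTerm, candLastIndex)
}

func main() {
	fmt.Println(grantVote(2, 5, 3, 3)) // true:候选者最后的任期更大
	fmt.Println(grantVote(3, 5, 3, 4)) // false:任期相同但候选者日志更短
	fmt.Println(grantVote(3, 5, 3, 5)) // true:一样新
}
```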

5.4.2 Committing entries from previous terms(提交前任期的条目)

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

如第 5.3 节所述,对于当前任期中的条目,一旦它存储在大多数服务器上,leader 就知道该条目已提交。如果 leader 在提交条目之前崩溃,未来的 leader 将尝试完成该条目的复制。然而,leader 不能仅凭某个来自前一任期的条目已存储在大多数服务器上,就立即断定它已提交。图 8 说明了一种情况:旧的日志条目已存储在大多数服务器上,但仍然可能被未来的 leader 覆盖。


Figure 8: 一个时间序列,显示了为什么 leader 不能使用来自旧任期的日志条目来确定提交。
在 (a) 中,S1 是 leader,并且部分复制了索引 2 处的日志条目。在 (b) 中,S1 崩溃;S5 获得 S3、S4 及其自身的选票,被选为第 3 任期的 leader,并在日志索引 2 处接受了一个不同的条目。在 (c) 中,S5 崩溃;S1 重新启动,被选为 leader,并继续复制。此时,第 2 任期的日志条目已在大多数服务器上复制,但尚未提交。如果 S1 如 (d) 中那样崩溃,则 S5 可以被选为 leader(获得 S2、S3 和 S4 的投票),并用它自己第 3 任期的条目覆盖该条目。但是,如果 S1 在崩溃前已将其当前任期的一个条目复制到大多数服务上,如 (e) 所示,那么该条目就已被提交(S5 无法赢得选举),此时日志中所有之前的条目也都已提交。

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

为了消除类似图 8 中的问题,Raft 从不通过计算副本数来提交以前任期的日志条目;只有 leader 当前任期的日志条目才通过计算副本数来提交。一旦当前任期中的某个条目以这种方式被提交,那么由于日志匹配属性,所有在它之前的条目都被间接提交。在某些情况下,leader 可以安全地断定一个较旧的日志条目已提交(例如,如果该条目存储在每台服务器上),但为了简单起见,Raft 采取了更保守的方法。
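
下面用一段 Go 代码示意这个保守的提交规则:只有已复制到多数派、且属于当前任期的条目才通过计数提交(译者补充的假设性草图,advanceCommitIndex 为虚构名称,并简化地假设 matchIndex 中包含 leader 自身):

```go
package main

import (
	"fmt"
	"sort"
)

type LogEntry struct{ Term int }

// advanceCommitIndex 示意 leader 推进 commitIndex 的保守规则:
// 只有当某索引已复制到多数服务、且该索引处条目属于当前任期时才提交。
func advanceCommitIndex(log []LogEntry, matchIndex []int, commitIndex, currentTerm int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// 排序后取中位位置,即至少有多数派复制到的最高索引(这里假设 matchIndex 含 leader 自身)。
	n := sorted[len(sorted)/2]
	if n > commitIndex && log[n].Term == currentTerm {
		return n
	}
	return commitIndex
}

func main() {
	// 索引 0 为占位;索引 1、2 的任期分别为 1、2,当前任期为 2。
	log := []LogEntry{{}, {1}, {2}}

	// 5 台服务(含 leader)各自复制到的最高索引:索引 2 已在 3 台上存在,构成多数。
	match := []int{2, 2, 2, 1, 0}
	fmt.Println(advanceCommitIndex(log, match, 0, 2)) // 2:索引 2 属于当前任期,可以提交

	// 若当前任期是 3(即索引 2 来自旧任期),即使已复制到多数也暂不通过计数提交。
	fmt.Println(advanceCommitIndex(log, match, 0, 3)) // 0
}
```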

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).

Raft 之所以在提交规则中产生这种额外的复杂性,是因为当 leader 复制以前任期的条目时,日志条目会保留其原始任期号。在其他共识算法中,如果一个新的 leader 重新复制之前“任期”中的条目,它必须使用新的“任期号”。Raft 的方法使得对日志条目的推理更容易,因为它们随着时间的推移、在不同日志之间都保持相同的任期号。此外,与其他算法相比,Raft 中的新 leader 发送的来自以前任期的日志条目更少(其他算法必须发送冗余的日志条目来重新编号,然后才能提交它们)。

5.4.3 Safety argument(安全性论证)

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 8.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.

给定完整的 Raft 算法,我们现在可以更精确地论证 Leader 完整性属性成立(这个论证基于安全性证明;参见第 8.2 节)。我们先假设 Leader 完整性属性不成立,然后推出矛盾。假设任期 T 的 leader(leaderT)提交了其任期内的一个日志条目,但该日志条目没有被某个未来任期的 leader 存储。考虑不存储该条目的 leader(leaderU)所在的最小任期 U > T。


Figure 9: 如果 S1(T 任期的 leader)在其任期内提交了一个新的日志条目,并且 S5 被选为下一任期 U 的 leader,那么必须至少有一个服务(S3)接受该日志条目并投票给 S5。
  1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries).
  2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.
  3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).
  4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.
  5. The voter granted its vote to leaderU , so leaderU ’s log must have been as up-to-date as the voter’s. This leads to one of two contradictions.
  6. First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.
  7. Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.
  8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
  9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).
  1. 在 leaderU 选举时,这个已提交的条目必然不在 leaderU 的日志中(leader 永远不会删除或覆盖条目)。
  2. leaderT 把该条目复制到了集群的多数派上,而 leaderU 获得了集群多数派的投票。因此,至少有一个服务(“投票者”)既接受了来自 leaderT 的该条目,又投票给了 leaderU,如图 9 所示,这个投票者是推出矛盾的关键。
  3. 投票者必须在投票给 leaderU 之前就已经接受了 leaderT 提交的该条目;否则它会拒绝来自 leaderT 的 AppendEntries 请求(它的当前任期将高于 T)。
  4. 投票者在投票给 leaderU 时仍然存储着该条目,因为(根据假设)中间的每个 leader 都包含该条目,leader 从不删除条目,而 follower 只有在与 leader 冲突时才删除条目。
  5. 投票者把票投给了 leaderU,所以 leaderU 的日志必须至少和投票者的一样新,这导致了以下两个矛盾之一。
  6. 首先,如果投票者和 leaderU 的最后一个日志任期相同,那么 leaderU 的日志必须至少和投票者的一样长,因此它的日志包含投票者日志中的每一个条目。这是一个矛盾,因为投票者包含这个已提交的条目,而根据假设 leaderU 不包含它。
  7. 否则,leaderU 的最后一个日志任期必须大于投票者的。此外,它还大于 T,因为投票者的最后一个日志任期至少是 T(它包含来自任期 T 的已提交条目)。创建 leaderU 最后一个日志条目的更早的 leader 的日志中必然包含该已提交的条目(根据假设)。那么根据日志匹配属性,leaderU 的日志也必须包含该已提交的条目,这是一个矛盾。
  8. 至此矛盾得证。因此,所有任期大于 T 的 leader 都必须包含在任期 T 中提交的所有来自任期 T 的条目。
  9. 日志匹配属性保证未来的 leader 也将包含被间接提交的条目,例如图 8(d) 中的索引 2。

Given the Leader Completeness Property, it is easy to prove the State Machine Safety Property from Figure 3 and that all state machines apply the same log entries in the same order (see [29]).

给定 leader 完整性属性,很容易证明图 3 中的状态机安全属性,并且所有状态机都以相同的顺序应用相同的日志条目(参见 [29])。

5.5 Follower and candidate crashes(Follower和候选者崩溃)

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.

到目前为止我们一直专注于 leader 的失败,follower 和候选者崩溃比 leader 崩溃更容易处理,并且它们都以相同的方式处理。 如果 follower 或候选人崩溃,那么未来发送给它的 RequestVote 和 AppendEntries RPC 将失败。 Raft 通过无限重试来处理这些失败; 如果崩溃的服务重新启动则 RPC 将成功完成。 如果服务器在完成 RPC 之后但在响应之前崩溃,那么它会在重新启动后再次收到相同的 RPC。 Raft RPC 是幂等的,所以这不会造成损害,例如如果一个 follower 收到一个 AppendEntries 请求,其中包括其日志中已经存在的日志条目,它会忽略新请求中的这些条目。

5.6 Timing and availability(时间和可用性)

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

我们对 Raft 的要求之一是安全性不能依赖于时间:系统不能仅仅因为某些事件发生得比预期的快或慢而产生错误的结果。然而,可用性(系统及时响应客户的能力)必须不可避免地取决于时间。 例如,如果消息交换比服务器崩溃之间的典型时间要长,候选者就不会坚持到赢得选举;没有稳定的 leader ,Raft 无法取得进展。

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

broadcastTime ≪ electionTimeout ≪ MTBF

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.

Leader 选举是 Raft 中对时间要求最关键的方面。只要系统满足以下时序要求,Raft 就能够选举并维持一个稳定的 leader:

broadcastTime ≪ electionTimeout ≪ MTBF

在这个不等式中,broadcastTime 是一台服务器向集群中的每台服务器并行发送 RPC 并接收它们的响应所花费的平均时间;electionTimeout 是第 5.2 节中描述的选举超时时间;MTBF 是单台服务器的平均故障间隔时间。广播时间应该比选举超时时间小一个数量级,以便 leader 可以可靠地发送心跳消息,阻止 followers 发起选举;考虑到选举超时采用的随机化方法,这个不等式也使得选票分裂不太可能发生。选举超时应该比 MTBF 小几个数量级,以便系统稳步前进。当 leader 崩溃时,系统将在大约一个选举超时的时间内不可用;我们希望这只占总时间的一小部分。

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

广播时间和 MTBF 是底层系统的属性,而选举超时是我们必须选择的。Raft 的 RPC 通常需要接收方将信息持久化到稳定存储中,因此广播时间可能在 0.5 毫秒到 20 毫秒之间,具体取决于存储技术。因此,选举超时很可能在 10 毫秒到 500 毫秒之间。典型的服务器 MTBF 为几个月或更长,很容易满足时序要求。
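
As a rough illustration of how these numbers combine, the sketch below (assumed constant and helper names, not from the paper) picks a fresh randomized election timeout in the conservative 150–300ms range recommended later in Section 8.3, with a heartbeat interval well below the lower bound so that broadcastTime ≪ electionTimeout ≪ MTBF is easy to satisfy.

```go
// A small sketch (assumptions, not the paper's code) of picking a randomized
// election timeout that respects broadcastTime << electionTimeout << MTBF.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	minElectionTimeout = 150 * time.Millisecond
	maxElectionTimeout = 300 * time.Millisecond
	heartbeatInterval  = 50 * time.Millisecond // must stay well below minElectionTimeout
)

// randomElectionTimeout returns a fresh timeout in [min, max); each follower
// re-randomizes it whenever it hears from the current leader, which keeps
// split votes rare.
func randomElectionTimeout() time.Duration {
	return minElectionTimeout +
		time.Duration(rand.Int63n(int64(maxElectionTimeout-minElectionTimeout)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout())
	}
}
```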


6 Cluster membership changes(集群成员变更)

Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

到目前为止我们假设集群配置(参与共识算法的服务器集合)是固定的。在实践中,有时需要更改配置,例如在服务器出现故障时替换服务器,或者改变副本数量。虽然这可以通过让整个集群下线、更新配置文件、然后重新启动集群来完成,但这会使集群在切换期间不可用。此外,如果其中有任何手动步骤,就存在误操作的风险。为了避免这些问题,我们决定将配置更改自动化,并将其纳入 Raft 共识算法中。

For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).

为了使配置更改机制安全,在过渡期间不能出现可能在同一任期内选出两个 leader 的情况。不幸的是,让服务器直接从旧配置切换到新配置的任何方法都是不安全的。一次性原子地切换所有服务器是不可能的,因此在切换过程中,集群可能会分裂成两个独立的多数派(参见图 10)。

Figure 10: 直接从一种配置切换到另一种配置是不安全的,因为不同的服务器会在不同的时间切换。在此示例中,集群从三台服务器增长到五台。不幸的是,在某个时间点,同一任期内可以选出两个不同的 leader:一个由旧配置(C old)的多数派选出,另一个由新配置(C new)的多数派选出。

In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [20]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:
• Log entries are replicated to all servers in both configurations.
• Any server from either configuration may serve as leader.
• Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.
The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.

为了确保安全,配置更改必须使用两阶段方法,有多种方法可以实现这两个阶段,例如一些系统(例如,[20])使用第一阶段禁用旧配置,使其无法处理客户端请求;然后第二阶段启用新配置。在 Raft 中,集群首先切换到我们称之为联合共识的过渡配置;一旦达成联合共识,系统就会转换到新的配置。联合共识结合了新旧配置:

  • 日志条目被复制到两种配置中的所有服务器。
  • 任一配置中的任何服务器都可以充当 leader。
  • 达成一致(无论是选举还是日志条目的提交)都需要分别获得旧配置和新配置各自的多数派同意。

联合共识允许各个服务器在不同时间在配置之间转换,而不会影响安全性。此外,联合共识允许集群在整个配置更改期间继续为客户端请求提供服务。
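
A minimal sketch of the joint-consensus agreement rule is shown below (illustrative Go with hypothetical helper names, not the paper's code): while Cold,new is in effect, an election or a commit succeeds only if it gathers a majority of the old configuration and, separately, a majority of the new one.

```go
// A minimal sketch (not from the paper) of the joint-consensus quorum rule:
// during C_old,new, agreement requires a majority of the old configuration
// AND a majority of the new one.
package main

import "fmt"

// majorityOf reports whether votes covers a majority of the given configuration.
func majorityOf(config []string, votes map[string]bool) bool {
	count := 0
	for _, server := range config {
		if votes[server] {
			count++
		}
	}
	return count > len(config)/2
}

// jointQuorum is the agreement rule used while C_old,new is in effect.
func jointQuorum(oldConfig, newConfig []string, votes map[string]bool) bool {
	return majorityOf(oldConfig, votes) && majorityOf(newConfig, votes)
}

func main() {
	oldConfig := []string{"s1", "s2", "s3"}
	newConfig := []string{"s1", "s2", "s3", "s4", "s5"}

	// A majority of C_old alone (s1, s2) is no longer enough during C_old,new.
	fmt.Println(jointQuorum(oldConfig, newConfig, map[string]bool{"s1": true, "s2": true})) // false

	// Majorities in both configurations: agreement is reached.
	fmt.Println(jointQuorum(oldConfig, newConfig, map[string]bool{"s1": true, "s2": true, "s4": true})) // true
}
```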

Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from Cold to Cnew, it stores the configuration for joint consensus (Cold,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of Cold,new to determine when the log entry for Cold,new is committed. If the leader crashes, a new leader may be chosen under either Cold or Cold,new, depending on whether the winning candidate has received Cold,new. In any case, Cnew cannot make unilateral decisions during this period.

集群配置使用复制日志中的特殊条目进行存储和传播;图 11 展示了配置更改的过程。当 leader 收到将配置从 Cold 更改为 Cnew 的请求时,它会把用于联合共识的配置(图中的 Cold,new)作为一个日志条目存储下来,并使用前面描述的机制复制该条目。一旦某台服务器将新的配置条目添加到其日志中,它就会在之后的所有决策中使用该配置(服务器始终使用其日志中最新的配置,无论该条目是否已提交)。这意味着 leader 将使用 Cold,new 的规则来判断 Cold,new 这个日志条目何时被提交。如果 leader 崩溃,新的 leader 可能是在 Cold 或 Cold,new 之下选出的,这取决于获胜的候选者是否已经收到了 Cold,new。无论哪种情况,Cnew 在此期间都不能单方面做出决定。
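
The rule that a server always acts on the latest configuration entry in its log, committed or not, can be sketched as follows (hypothetical types; the joint configuration is simplified here to just its member set):

```go
// A small sketch of the rule that a server uses the newest configuration
// entry anywhere in its log, whether or not that entry is committed.
package main

import "fmt"

type Entry struct {
	IsConfig bool
	Servers  []string // only meaningful when IsConfig is true
}

// latestConfig scans backwards and returns the newest configuration entry;
// the commit index is deliberately not consulted.
func latestConfig(log []Entry, bootstrap []string) []string {
	for i := len(log) - 1; i >= 0; i-- {
		if log[i].IsConfig {
			return log[i].Servers
		}
	}
	return bootstrap
}

func main() {
	log := []Entry{
		{IsConfig: true, Servers: []string{"s1", "s2", "s3"}},             // C_old
		{},                                                                // a normal command entry
		{IsConfig: true, Servers: []string{"s1", "s2", "s3", "s4", "s5"}}, // C_old,new, not yet committed
	}
	fmt.Println(latestConfig(log, nil)) // the joint configuration, even though uncommitted
}
```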

Figure 11: 配置更改的时间轴。虚线表示已创建但尚未提交的配置条目,实线表示最新提交的配置条目。leader 首先在其日志中创建 C old,new 配置条目,并将其提交到 C old,new(即 C old 的多数派和 C new 的多数派);然后它创建 C new 条目,并将其提交给 C new 的多数派。不存在 C old 和 C new 能够各自独立做出决定的时间点。

Once Cold,new has been committed, neither Cold nor Cnew can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the Cold,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing Cnew and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of Cnew, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when Cold and Cnew can both make unilateral decisions; this guarantees safety.

一旦 Cold,new 被提交,Cold 和 Cnew 都不能在未经对方同意的情况下做出决定,并且 Leader 完整性属性确保只有日志中含有 Cold,new 条目的服务器才能被选举为 leader。此时 leader 就可以安全地创建一个描述 Cnew 的日志条目并将其复制到集群中。同样,每台服务器一旦看到这个配置就会立即使其生效。当新配置按照 Cnew 的规则被提交后,旧配置就不再起作用,不在新配置中的服务器可以被关闭。如图 11 所示,不存在 Cold 和 Cnew 能够同时单方面做出决定的时间点,这保证了安全性。

There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.

重新配置还有三个问题需要解决。第一个问题是新加入的服务器最初可能没有存储任何日志条目。如果在这种状态下把它们加入集群,它们可能需要相当长的时间才能赶上进度,在此期间可能无法提交新的日志条目。为了避免可用性空窗,Raft 在配置更改之前引入了一个额外的阶段:新服务器以无投票权成员的身份加入集群(leader 会向它们复制日志条目,但在统计多数派时不把它们计算在内)。一旦新服务器赶上了集群中的其它服务器,重新配置就可以按照上面描述的方式进行。

The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the Cnew log entry. This means that there will be a period of time (while it is committing Cnew ) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when Cnew is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from Cnew ). Before this point, it may be the case that only a server from Cold can be elected leader.

第二个问题是集群的 leader 可能不属于新配置。在这种情况下,leader 在提交了 Cnew 日志条目之后就会退位(回到 follower 状态)。这意味着会有一段时间(即它提交 Cnew 期间),leader 管理着一个不包含它自己的集群;它复制日志条目,但在统计多数派时不把自己计算在内。leader 的交接发生在 Cnew 被提交时,因为这是新配置能够独立运转的最早时间点(从此之后总是可以从 Cnew 中选出 leader)。在此之前,可能只有来自 Cold 的服务器才能被选举为 leader。

The third issue is that removed servers (those not in Cnew) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.

第三个问题是被移除的服务器(那些不在 Cnew 中的服务器)可能会扰乱集群。这些服务器收不到心跳,因此会超时并发起新的选举。它们随后会发送带有新任期号的 RequestVote RPC,这会导致当前 leader 退回到 follower 状态。最终会选出新的 leader,但被移除的服务器会再次超时,这个过程会不断重复,导致可用性变差。

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

为防止这个问题,服务器在认为当前存在 leader 时会忽略 RequestVote RPC。具体来说,如果服务器在听到当前 leader 消息后的最小选举超时时间内收到 RequestVote RPC,它不会更新自己的任期,也不会投票。这不会影响正常选举,因为在正常选举中,每台服务器在发起选举之前至少会等待一个最小选举超时时间。但它有助于避免被移除的服务器造成干扰:只要 leader 能够把心跳发送到它的集群,它就不会被更大的任期号废黜。
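
The disruption check described above can be sketched as a single predicate (illustrative Go with assumed field names, not the paper's implementation): a RequestVote received within the minimum election timeout of the last leader contact is simply dropped, without updating the term or granting a vote.

```go
// A sketch of the Section 6 rule: a server ignores a RequestVote RPC if it has
// heard from a current leader within the minimum election timeout.
package main

import (
	"fmt"
	"time"
)

const minElectionTimeout = 150 * time.Millisecond

type server struct {
	currentTerm       int
	lastLeaderContact time.Time // updated on every AppendEntries from the current leader
}

// shouldIgnoreRequestVote drops a vote request while a current leader is
// believed to exist: the term is not updated and no vote is granted.
func (s *server) shouldIgnoreRequestVote(now time.Time) bool {
	return now.Sub(s.lastLeaderContact) < minElectionTimeout
}

func main() {
	s := &server{currentTerm: 5, lastLeaderContact: time.Now()}
	fmt.Println(s.shouldIgnoreRequestVote(time.Now()))                             // true: heard from leader recently
	fmt.Println(s.shouldIgnoreRequestVote(time.Now().Add(200 * time.Millisecond))) // false: timeout has elapsed
}
```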

7 Clients and log compaction(客户端和日志压缩)

This section has been omitted due to space limitations, but the material is available in the extended version of this paper [29]. It describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics [8]. The extended version also describes how space in the replicated log can be reclaimed using a snapshotting approach. These issues apply to all consensus-based systems, and Raft’s solutions are similar to other systems.

由于篇幅限制本节已被省略,但该材料可在本文的扩展版本中获得 [29],它描述了客户端如何与 Raft 交互,包括客户端如何找到集群 leader 以及 Raft 如何支持线性化语义[8]。扩展版本还描述了如何使用快照方法回收复制日志中的空间,这些问题适用于所有基于共识的系统,Raft 的解决方案与其他系统类似。

8 Implementation and evaluation(实施和评估)

We have implemented Raft as part of a replicated state machine that stores configuration information for RAMCloud [30] and assists in failover of the RAMCloud coordinator. The Raft implementation contains roughly 2000 lines of C++ code, not including tests, comments, or blank lines. The source code is freely available [21]. There are also about 25 independent third-party open source implementations [31] of Raft in various stages of development, based on drafts of this paper. Also, various companies are deploying Raft-based systems [31].

我们已经将 Raft 实现为一个复制状态机的一部分,该状态机存储 RAMCloud [30] 的配置信息,并协助 RAMCloud 协调器进行故障转移。这个 Raft 实现包含大约 2000 行 C++ 代码,不包括测试、注释和空行,源代码可以免费获取 [21]。此外,还有大约 25 个基于本文草稿的独立第三方 Raft 开源实现 [31],处于不同的开发阶段。另外,多家公司正在部署基于 Raft 的系统 [31]。

The remainder of this section evaluates Raft using three criteria: understandability, correctness, and performance.

本节的其余部分使用三个标准评估 Raft:可理解性、正确性和性能。

8.1 Understandability(可理解性)

To measure Raft’s understandability relative to Paxos, we conducted an experimental study using upper-level undergraduate and graduate students in an Advanced Operating Systems course at Stanford University and a Distributed Computing course at U.C. Berkeley. We recorded a video lecture of Raft and another of Paxos, and created corresponding quizzes. The Raft lecture covered the content of this paper; the Paxos lecture covered enough material to create an equivalent replicated state machine, including single-decree Paxos, multi-decree Paxos, reconfiguration, and a few optimizations needed in practice (such as leader election). The quizzes tested basic understanding of the algorithms and also required students to reason about corner cases. Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.

为了衡量 Raft 相对于 Paxos 的可理解性,我们对斯坦福大学的高级操作系统(Advanced Operating Systems)课程和加州大学伯克利分校的分布式计算课程(Distributed Computing)的高年级本科生和研究生进行了一项实验研究。我们录制了一个 Raft 和另一个 Paxos 的视频讲座,并创建了相应的测验。 Raft 讲座涵盖了本文的内容; Paxos 讲座涵盖了足够的材料来创建等效的复制状态机,包括单决策 Paxos、多决策 Paxos、重新配置和一些实践中需要的优化(例如 leader 选举)。测验测试了对算法的基本理解,还要求学生对极端情况进行推理。每个学生观看一个视频,参加相应的测验,观看第二个视频,并参加第二个测验。大约一半的参与者先做 Paxos 部分,另一半先做 Raft 部分,以考虑到从研究的第一部分中获得的表现和经验的个体差异。我们比较了参与者在每个测验中的分数,以确定参与者是否对 Raft 表现出更好的理解。

We tried to make the comparison between Paxos and Raft as fair as possible. The experiment favored Paxos in two ways: 15 of the 43 participants reported having some prior experience with Paxos, and the Paxos video is 14% longer than the Raft video. As summarized in Table 1, we have taken steps to mitigate potential sources of bias. All of our materials are available for review [26, 28].

我们试图让 Paxos 和 Raft 之间的比较尽可能公平,该实验在两方面对 Paxos 有利:43 名参与者中有 15 人报告说之前有使用 Paxos 的经验,Paxos 视频比 Raft 视频长 14%。如表 1 所述,我们已采取措施减轻潜在的偏见来源,我们所有的材料都可供审查 [26, 28]。

Table 1: 对研究中可能对 Paxos 存在偏见的担忧、针对每种偏见采取的措施以及可用的其它材料

| Concern(担忧) | 为减轻偏见而采取的措施 | 审查材料 [26, 28] |
| --- | --- | --- |
| 同等的授课质量 | 两者的讲师相同。Paxos 讲座基于并改进了几所大学使用的现有材料。Paxos 讲座的时间长 14%。 | 视频 |
| 相同的测验难度 | 问题按难度分组并在两份测验中配对。 | 测试题 |
| 公平分级 | 使用评分准则。以随机顺序评分,在测验之间交替。 | 评分准则 |

On average, participants scored 4.9 points higher on the Raft quiz than on the Paxos quiz (out of a possible 60 points, the mean Raft score was 25.7 and the mean Paxos score was 20.8); Figure 12 shows their individual scores. A paired t-test states that, with 95% confidence, the true distribution of Raft scores has a mean at least 2.5 points larger than the true distribution of Paxos scores.

平均而言,参与者在 Raft 测验中的得分比 Paxos 测验高 4.9 分(满分 60 分,Raft 平均得分为 25.7,Paxos 平均得分为 20.8);图 12 显示了他们的个人得分。配对 t 检验表明,在 95% 的置信度下,Raft 分数真实分布的均值至少比 Paxos 分数真实分布的均值高 2.5 分。

Figure 12: 比较 43 名参与者在 Raft 和 Paxos 测验中表现的散点图。对角线以上的点(33 个)代表 Raft 得分更高的参与者。

We also created a linear regression model that predicts a new student’s quiz scores based on three factors: which quiz they took, their degree of prior Paxos experience, and the order in which they learned the algorithms. The model predicts that the choice of quiz produces a 12.5-point difference in favor of Raft. This is significantly higher than the observed difference of 4.9 points, because many of the actual students had prior Paxos experience, which helped Paxos considerably, whereas it helped Raft slightly less. Curiously, the model also predicts scores 6.3 points lower on Raft for people that have already taken the Paxos quiz; although we don’t know why, this does appear to be statistically significant.

我们还创建了一个线性回归模型,根据三个因素预测新学生的测验分数:他们参加的是哪个测验、他们先前的 Paxos 经验程度,以及他们学习这两种算法的顺序。该模型预测,测验的选择会带来有利于 Raft 的 12.5 分差异。这明显高于观察到的 4.9 分差异,因为许多实际参与的学生之前有 Paxos 经验,这对 Paxos 帮助很大,而对 Raft 的帮助略小。奇怪的是,该模型还预测,已经参加过 Paxos 测验的人在 Raft 测验上的分数会低 6.3 分;虽然我们不知道原因,但这在统计上确实显著。

We also surveyed participants after their quizzes to see which algorithm they felt would be easier to implement or explain; these results are shown in Figure 13. An overwhelming majority of participants reported Raft would be easier to implement and explain (33 of 41 for each question). However, these self-reported feelings may be less reliable than participants’ quiz scores, and participants may have been biased by knowledge of our hypothesis that Raft is easier to understand.

我们还在测验后对参与者进行了调查,了解他们认为哪种算法更容易实现或解释;结果显示在图 13 中。绝大多数参与者报告说 Raft 更容易实现也更容易解释(两个问题各有 41 人中的 33 人这样认为)。然而,这些自我报告的感受可能不如参与者的测验分数可靠,而且参与者可能因为知道我们的假设(Raft 更易理解)而产生偏见。

Figure 13: 使用 5 分制,参与者被问到(左)他们认为哪种算法更容易在一个功能正常、正确且高效的系统中实现,以及(右)哪种算法更容易向一位 CS 研究生解释。

A detailed discussion of the Raft user study is available at [28].

Raft 用户研究的详细讨论可在 [28] 中找到。

8.2 Correctness(正确性)

We have developed a formal specification and a proof of safety for the consensus mechanism described in Section 5. The formal specification [28] makes the information summarized in Figure 2 completely precise using the TLA+ specification language [15]. It is about 400 lines long and serves as the subject of the proof. It is also useful on its own for anyone implementing Raft. We have mechanically proven the Log Completeness Property using the TLA proof system [6]. However, this proof relies on invariants that have not been mechanically checked (for example, we have not proven the type safety of the specification). Furthermore, we have written an informal proof [28] of the State Machine Safety property which is complete (it relies on the specification alone) and relatively precise (it is about 3500 words long).

我们已经为第 5 节中描述的共识机制制定了正式规范和安全证明。正式规范 [28] 使用 TLA+ 规范语言 [15] 使图 2 中总结的信息完全准确,它大约有 400 行长,是证明的主题。对于任何实现 Raft 的人来说,它本身也很有用。 我们已经使用 TLA 证明系统 [6] 机械证明了日志完整性属性。 然而,这个证明依赖于没有经过机械检查的不变量(例如,我们没有证明规范的类型安全),此外我们编写了状态机安全属性的非正式证明 [28],该证明是完整的(仅依赖于规范)且相对精确(大约 3500 字长)。

8.3 Performance(性能)

Raft’s performance is similar to other consensus algorithms such as Paxos. The most important case for performance is when an established leader is replicating new log entries. Raft achieves this using the minimal number of messages (a single round-trip from the leader to half the cluster). It is also possible to further improve Raft’s performance. For example, it easily supports batching and pipelining requests for higher throughput and lower latency. Various optimizations have been proposed in the literature for other algorithms; many of these could be applied to Raft, but we leave this to future work.

Raft 的性能类似于 Paxos 等其他共识算法。对性能而言最重要的情况是一个已确立的 leader 复制新日志条目的时候。Raft 用最少数量的消息(从 leader 到半数集群的一次往返)就能做到这一点。Raft 的性能还可以进一步提升,例如它很容易支持批处理和流水线化请求,以获得更高的吞吐量和更低的延迟。文献中针对其他算法提出了各种优化,其中许多可以应用于 Raft,但我们把这留给未来的工作。
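
As one illustration of the batching optimization mentioned above, a leader might drain all pending client commands into a single AppendEntries call rather than sending one RPC per command; the Go sketch below (hypothetical names, not part of the paper or its implementation) shows the idea.

```go
// An illustrative sketch of batching: the leader ships whatever commands are
// pending in a single AppendEntries RPC instead of one RPC per command.
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// drainBatch collects up to maxBatch pending client commands into the entry
// slice for a single AppendEntries RPC.
func drainBatch(pending chan string, term, maxBatch int) []Entry {
	var batch []Entry
	for len(batch) < maxBatch {
		select {
		case cmd := <-pending:
			batch = append(batch, Entry{Term: term, Command: cmd})
		default:
			return batch // nothing more is waiting; ship what we have
		}
	}
	return batch
}

func main() {
	pending := make(chan string, 8)
	pending <- "x=1"
	pending <- "y=2"
	pending <- "z=3"
	fmt.Println(len(drainBatch(pending, 1, 64))) // 3 commands, one RPC
}
```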

We used our Raft implementation to measure the performance of Raft’s leader election algorithm and answer two questions. First, does the election process converge quickly? Second, what is the minimum downtime that can be achieved after leader crashes?

我们使用我们的 Raft 实现来测量 Raft leader 选举算法的性能,并回答两个问题:第一,选举过程是否能快速收敛?第二,leader 崩溃后能达到的最短停机时间是多少?

To measure leader election, we repeatedly crashed the leader of a cluster of five servers and timed how long it took to detect the crash and elect a new leader (see Figure 14). To generate a worst-case scenario, the servers in each trial had different log lengths, so some candidates were not eligible to become leader. Furthermore, to encourage split votes, our test script triggered a synchronized broadcast of heartbeat RPCs from the leader before terminating its process (this approximates the behavior of the leader replicating a new log entry prior to crashing). The leader was crashed uniformly randomly within its heartbeat interval, which was half of the minimum election timeout for all tests. Thus, the smallest possible downtime was about half of the minimum election timeout.

为了衡量 leader 选举,我们反复使由五台服务器组成的集群的 leader 崩溃,并对检测到崩溃和选举新 leader 所需的时间进行计时(见图 14)。为了产生最坏的情况,每次试验中的服务器都有不同的日志长度,因此一些候选者没有资格成为 leader。此外,为了鼓励分裂投票,我们的测试脚本在终止进程之前触发了来自 leader 的心跳 RPC 的同步广播(这近似于 leader 在崩溃之前复制新日志条目的行为)。leader 在其心跳间隔内均匀随机崩溃,这是所有测试的最小选举超时时间的一半,因此最小可能的停机时间大约是最小选举超时时间的一半。

Figure 14: 检测并替换崩溃 leader 所需的时间。顶部图表改变选举超时的随机化程度,底部图表则缩放最小选举超时。每条线代表 1000 次试验(“150–150ms” 的情况为 100 次试验),对应于一种特定的选举超时选择;例如,“150–155ms” 表示选举超时在 150ms 和 155ms 之间均匀随机选择。测量是在一个五台服务器的集群上进行的,广播时间大约为 15 毫秒;九台服务器集群的结果与之类似。

The top graph in Figure 14 shows that a small amount of randomization in the election timeout is enough to avoid split votes in elections. In the absence of randomness, leader election consistently took longer than 10 seconds in our tests due to many split votes. Adding just 5ms of randomness helps significantly, resulting in a median downtime of 287ms. Using more randomness improves worst-case behavior: with 50ms of randomness the worstcase completion time (over 1000 trials) was 513ms.

图 14 中的顶部图表显示,选举超时中的少量随机化就足以避免选举中的选票分裂。在没有随机性的情况下,由于出现大量分裂选票,我们的测试中 leader 选举的耗时总是超过 10 秒。仅仅加入 5 毫秒的随机性就有很大帮助,中位停机时间降为 287 毫秒。使用更多的随机性可以改善最坏情况:随机范围为 50 毫秒时,1000 次试验中最坏情况的完成时间为 513 毫秒。

The bottom graph in Figure 14 shows that downtime can be reduced by reducing the election timeout. With an election timeout of 12–24ms, it takes only 35ms on average to elect a leader (the longest trial took 152ms). However, lowering the timeouts beyond this point violates Raft’s timing requirement: leaders have difficulty broadcasting heartbeats before other servers start new elections. This can cause unnecessary leader changes and lower overall system availability. We recommend using a conservative election timeout such as 150–300ms; such timeouts are unlikely to cause unnecessary leader changes and will still provide good availability.

图 14 中的底部图表显示,可以通过缩短选举超时来减少停机时间。当选举超时为 12–24 毫秒时,平均只需要 35 毫秒就能选出一个 leader(最长的一次试验耗时 152 毫秒)。然而,把超时缩短到超过这个程度就会违反 Raft 的时序要求:leader 很难在其他服务器发起新选举之前广播出心跳。这会导致不必要的 leader 更换,并降低系统的整体可用性。我们建议使用较为保守的选举超时,例如 150–300 毫秒;这样的超时不太可能导致不必要的 leader 更换,并且仍能提供良好的可用性。

9 Related work(相关工作)

There have been numerous publications related to consensus algorithms, many of which fall into one of the following categories:
• Lamport’s original description of Paxos [13], and attempts to explain it more clearly [14, 18, 19].
• Elaborations of Paxos, which fill in missing details and modify the algorithm to provide a better foundation for implementation [24, 35, 11].
• Systems that implement consensus algorithms, such as Chubby [2, 4], ZooKeeper [9, 10], and Spanner [5]. The algorithms for Chubby and Spanner have not been published in detail, though both claim to be based on Paxos. ZooKeeper’s algorithm has been published in more detail, but it is quite different from Paxos.
• Performance optimizations that can be applied to Paxos [16, 17, 3, 23, 1, 25].
• Oki and Liskov’s Viewstamped Replication (VR), an alternative approach to consensus developed around the same time as Paxos. The original description [27] was intertwined with a protocol for distributed transactions, but the core consensus protocol has been separated in a recent update [20]. VR uses a leader-based approach with many similarities to Raft.

有许多与共识算法相关的出版物,其中许多属于以下类别之一:

  • Lamport 对 Paxos 的原始描述 [13],并试图更清楚地解释它 [14, 18, 19]。
  • Paxos 的详细说明,填补缺失的细节并修改算法,为实现提供更好的基础 [24, 35, 11]。
  • 实现共识算法的系统,例如 Chubby [2, 4]、ZooKeeper [9, 10]和 Spanner [5]。Chubby 和 Spanner 的算法尚未详细发布,但都声称基于 Paxos。ZooKeeper 的算法已经更详细的公布了,但是和 Paxos 有很大的不同。
  • 可应用于 Paxos [16, 17, 3, 23, 1, 25] 的性能优化。
  • Oki 和 Liskov 的 Viewstamped Replication (VR),一种与 Paxos 大约同时开发的共识替代方法。最初的描述 [27] 与分布式事务协议交织在一起,但核心共识协议在最近的更新中被分离 [20]。VR 使用基于 leader 的方法,与 Raft 有许多相似之处。

The greatest difference between Raft and Paxos is Raft’s strong leadership: Raft uses leader election as an essential part of the consensus protocol, and it concentrates as much functionality as possible in the leader. This approach results in a simpler algorithm that is easier to understand. For example, in Paxos, leader election is orthogonal to the basic consensus protocol: it serves only as a performance optimization and is not required for achieving consensus. However, this results in additional mechanism: Paxos includes both a two-phase protocol for basic consensus and a separate mechanism for leader election. In contrast, Raft incorporates leader election directly into the consensus algorithm and uses it as the first of the two phases of consensus. This results in less mechanism than in Paxos.

Raft 和 Paxos 最大的区别在于 Raft 的强领导性(strong leadership):Raft 把 leader 选举作为共识协议不可或缺的一部分,并把尽可能多的功能集中在 leader 身上。这种方式产生了更简单、更容易理解的算法。例如在 Paxos 中,leader 选举与基本共识协议是正交的:它只是一种性能优化,并不是达成共识所必需的。然而这带来了额外的机制:Paxos 既包含用于基本共识的两阶段协议,又包含一套独立的 leader 选举机制。相比之下,Raft 将 leader 选举直接纳入共识算法,并将其作为共识两个阶段中的第一个阶段,因此其机制比 Paxos 更少。

Like Raft, VR and ZooKeeper are leader-based and therefore share many of Raft’s advantages over Paxos. However, Raft has less mechanism than VR or ZooKeeper because it minimizes the functionality in non-leaders. For example, log entries in Raft flow in only one direction: outward from the leader in AppendEntries RPCs. In VR log entries flow in both directions (leaders can receive log entries during the election process); this results in additional mechanism and complexity. The published description of ZooKeeper also transfers log entries both to and from the leader, but the implementation is apparently more like Raft [32].

与 Raft 一样,VR 和 ZooKeeper 也是基于 leader 的,因此同样具备 Raft 相对于 Paxos 的许多优势。然而,Raft 的机制比 VR 或 ZooKeeper 更少,因为它把非 leader 的功能压缩到最小。例如,Raft 中的日志条目只沿一个方向流动:通过 AppendEntries RPC 从 leader 向外流出。而在 VR 中,日志条目是双向流动的(leader 可以在选举过程中接收日志条目),这带来了额外的机制和复杂性。已发表的 ZooKeeper 描述中,日志条目既会传给 leader 也会由 leader 传出,但其实现显然更像 Raft [32]。

Raft has fewer message types than any other algorithm for consensus-based log replication that we are aware of. For example, VR and ZooKeeper each define 10 different message types, while Raft has only 4 message types (two RPC requests and their responses). Raft’s messages are a bit more dense than the other algorithms’, but they are simpler collectively. In addition, VR and ZooKeeper are described in terms of transmitting entire logs during leader changes; additional message types will be required to optimize these mechanisms so that they are practical.

Raft 的消息类型比我们所知的任何其他基于共识的日志复制算法都少。例如,VR 和 ZooKeeper 各自定义了 10 种不同的消息类型,而 Raft 只有 4 种消息类型(两种 RPC 请求及其响应)。Raft 的消息比其他算法的消息稍微紧凑一些,但总体上更简单。另外,VR 和 ZooKeeper 在描述中都要求在 leader 更换时传输整个日志;要让这些机制变得实用,还需要额外的消息类型来对其进行优化。

Several different approaches for cluster membership changes have been proposed or implemented in other work, including Lamport’s original proposal [13], VR [20], and SMART [22]. We chose the joint consensus approach for Raft because it leverages the rest of the consensus protocol, so that very little additional mechanism is required for membership changes. Lamport’s α-based approach was not an option for Raft because it assumes consensus can be reached without a leader. In comparison to VR and SMART, Raft’s reconfiguration algorithm has the advantage that membership changes can occur without limiting the processing of normal requests; in contrast, VR stops all normal processing during configuration changes, and SMART imposes an α-like limit on the number of outstanding requests. Raft’s approach also adds less mechanism than either VR or SMART.

在其他工作中已经提出或实现了几种不同的集群成员变更方法,包括 Lamport 最初的提议 [13]、VR [20] 和 SMART [22]。我们为 Raft 选择了联合共识方法,因为它充分利用了共识协议的其余部分,因此成员变更只需要很少的额外机制。Lamport 基于 α 的方法不适用于 Raft,因为它假设可以在没有 leader 的情况下达成共识。与 VR 和 SMART 相比,Raft 的重新配置算法的优势在于成员变更可以在不限制正常请求处理的情况下进行;相比之下,VR 在配置变更期间会停止所有正常处理,而 SMART 对未完成请求的数量施加了类似 α 的限制。Raft 的方法引入的机制也比 VR 或 SMART 更少。


10 Conclusion(结论)

Algorithms are often designed with correctness, efficiency, and/or conciseness as the primary goals. Although these are all worthy goals, we believe that understandability is just as important. None of the other goals can be achieved until developers render the algorithm into a practical implementation, which will inevitably deviate from and expand upon the published form. Unless developers have a deep understanding of the algorithm and can create intuitions about it, it will be difficult for them to retain its desirable properties in their implementation.

算法的设计通常以正确性、效率和/或简洁性为主要目标。虽然这些都是有价值的目标,但我们认为可理解性同样重要。在开发人员把算法变成实际可用的实现之前,其他任何目标都无法实现,而实现不可避免地会偏离并扩展已发表的形式。除非开发人员对算法有深刻的理解并能对其形成直觉,否则他们将很难在实现中保留算法的这些理想性质。

In this paper we addressed the issue of distributed consensus, where a widely accepted but impenetrable algorithm, Paxos, has challenged students and developers for many years. We developed a new algorithm, Raft, which we have shown to be more understandable than Paxos. We also believe that Raft provides a better foundation for system building. Using understandability as the primary design goal changed the way we approached the design of Raft; as the design progressed we found ourselves reusing a few techniques repeatedly, such as decomposing the problem and simplifying the state space. These techniques not only improved the understandability of Raft but also made it easier to convince ourselves of its correctness.

在本文中,我们解决了分布式共识的问题,其中一种被广泛接受但难以理解的算法 Paxos 多年来一直在挑战学生和开发人员。我们开发了一种新算法 Raft,我们已经证明它比 Paxos 更容易理解,我们也相信 Raft 为系统构建提供了更好的基础,使用可理解性作为主要设计目标改变了我们处理 Raft 设计的方式;随着设计的进展,我们发现自己重复使用了一些技术,例如分解问题和简化状态空间,这些技术不仅提高了 Raft 的可理解性,而且更容易让我们相信它的正确性。

11 Acknowledgments(致谢)

The user study would not have been possible without the support of Ali Ghodsi, David Mazieres, and the students of CS 294-91 at Berkeley and CS 240 at Stanford. Scott Klemmer helped us design the user study, and Nelson Ray advised us on statistical analysis. The Paxos slides for the user study borrowed heavily from a slide deck originally created by Lorenzo Alvisi. Special thanks go to David Mazie`res and Ezra Hoch for finding subtle bugs in Raft. Many people provided helpful feedback on the paper and user study materials, including Ed Bugnion, Michael Chan, Hugues Evrard, Daniel Giffin, Arjun Gopalan, Jon Howell, Vimalkumar Jeyakumar, Ankita Kejriwal, Aleksandar Kracun, Amit Levy, Joel Martin, Satoshi Matsushita, Oleg Pesok, David Ramos, Robbert van Renesse, Mendel Rosenblum, Nicolas Schiper, Deian Stefan, Andrew Stone, Ryan Stutsman, David Terei, Stephen Yang, Matei Zaharia, 24 anonymous conference reviewers (with duplicates), and especially our shepherd Eddie Kohler. Werner Vogels tweeted a link to an earlier draft, which gave Raft significant exposure. This work was supported by the Gigascale Systems Research Center and the Multiscale Systems Center, two of six research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program, by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, by the National Science Foundation under Grant No. 0963859, and by grants from Facebook, Google, Mellanox, NEC, NetApp, SAP, and Samsung. Diego Ongaro is supported by The Junglee Corporation Stanford Graduate Fellowship.

如果没有 Ali Ghodsi、David Mazieres 以及伯克利 CS 294-91 和斯坦福 CS 240 课程学生们的支持,这项用户研究是不可能完成的。Scott Klemmer 帮助我们设计了用户研究,Nelson Ray 为我们提供了统计分析方面的建议。用户研究所用的 Paxos 幻灯片大量借用了最初由 Lorenzo Alvisi 制作的幻灯片。特别感谢 David Mazieres 和 Ezra Hoch 发现了 Raft 中的细微错误。许多人对论文和用户研究材料提供了有益的反馈,包括 Ed Bugnion、Michael Chan、Hugues Evrard、Daniel Giffin、Arjun Gopalan、Jon Howell、Vimalkumar Jeyakumar、Ankita Kejriwal、Aleksandar Kracun、Amit Levy、Joel Martin、Satoshi Matsushita、Oleg Pesok、David Ramos、Robbert van Renesse、Mendel Rosenblum、Nicolas Schiper、Deian Stefan、Andrew Stone、Ryan Stutsman、David Terei、Stephen Yang、Matei Zaharia、24 位匿名会议审稿人(有重复),尤其是我们的论文指导人(shepherd)Eddie Kohler。Werner Vogels 在推特上发布了指向本文早期草稿的链接,这让 Raft 获得了大量关注。这项工作得到了 Gigascale Systems Research Center 和 Multiscale Systems Center(Focus Center Research Program 资助的六个研究中心中的两个,该计划是 Semiconductor Research Corporation 的项目)、STARnet(由 MARCO 和 DARPA 赞助的 Semiconductor Research Corporation 项目)、美国国家科学基金会(Grant No. 0963859)的支持,以及 Facebook、谷歌、Mellanox、NEC、NetApp、SAP 和三星的资助。Diego Ongaro 获得 Junglee Corporation 斯坦福研究生奖学金的资助。

References(参考文献)

[1] BOLOSKY, W. J., BRADSHAW, D., HAAGENS, R. B., KUSTERS, N. P., AND LI, P. Paxos replicated state machines as the basis of a high-performance data store. In Proc. NSDI’11, USENIX Conference on Networked Systems Design and Implementation (2011), USENIX, pp. 141–154.
[2] BURROWS, M. The Chubby lock service for loosely-coupled distributed systems. In Proc. OSDI’06, Symposium on Operating Systems Design and Implementation (2006), USENIX, pp. 335–350.
[3] CAMARGOS, L. J., SCHMIDT, R. M., AND PEDONE, F. Multicoordinated Paxos. In Proc. PODC’07, ACM Symposium on Principles of Distributed Computing (2007), ACM, pp. 316–317.
[4] CHANDRA, T. D., GRIESEMER, R., AND REDSTONE, J. Paxos made live: an engineering perspective. In Proc. PODC’07, ACM Symposium on Principles of Distributed Computing (2007), ACM, pp. 398–407.
[5] CORBETT, J. C., DEAN, J., EPSTEIN, M., FIKES, A., FROST, C., FURMAN, J. J., GHEMAWAT, S., GUBAREV, A., HEISER, C., HOCHSCHILD, P., HSIEH, W., KANTHAK, S., KOGAN, E., LI, H., LLOYD, A., MELNIK, S., MWAURA, D., NAGLE, D., QUINLAN, S., RAO, R., ROLIG, L., SAITO, Y., SZYMANIAK, M., TAYLOR, C., WANG, R., AND WOODFORD, D. Spanner: Google’s globally-distributed database. In Proc. OSDI’12, USENIX Conference on Operating Systems Design and Implementation (2012), USENIX, pp. 251–264.
[6] COUSINEAU, D., DOLIGEZ, D., LAMPORT, L., MERZ, S., RICKETTS, D., AND VANZETTO, H. TLA+ proofs. In Proc. FM’12, Symposium on Formal Methods (2012), D. Giannakopoulou and D. Méry, Eds., vol. 7436 of Lecture Notes in Computer Science, Springer, pp. 147–154.
[7] GHEMAWAT, S., GOBIOFF, H., AND LEUNG, S.-T. The Google file system. In Proc. SOSP’03, ACM Symposium on Operating Systems Principles (2003), ACM, pp. 29–43.
[8] HERLIHY, M. P., AND WING, J. M. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12 (July 1990), 463–492.
[9] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: wait-free coordination for internet-scale systems. In Proc. ATC’10, USENIX Annual Technical Conference (2010), USENIX, pp. 145–158.
[10] JUNQUEIRA, F. P., REED, B. C., AND SERAFINI, M. Zab: High-performance broadcast for primary-backup systems. In Proc. DSN’11, IEEE/IFIP Int’l Conf. on Dependable Systems & Networks (2011), IEEE Computer Society, pp. 245–256.
[11] KIRSCH, J., AND AMIR, Y. Paxos for system builders. Tech. Rep. CNDS-2008-2, Johns Hopkins University, 2008.
[12] LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commununications of the ACM 21, 7 (July 1978), 558–565.
[13] LAMPORT, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.
[14] LAMPORT, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.
[15] LAMPORT, L. Specifying Systems, The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.
[16] LAMPORT, L. Generalized consensus and Paxos. Tech. Rep. MSR-TR-2005-33, Microsoft Research, 2005.
[17] LAMPORT, L. Fast Paxos. Distributed Computing 19, 2 (2006), 79–103.
[18] LAMPSON, B. W. How to build a highly available system using consensus. In Distributed Algorithms, O. Baboaglu and K. Marzullo, Eds. Springer-Verlag, 1996, pp. 1–17.
[19] LAMPSON, B. W. The ABCD’s of Paxos. In Proc. PODC’01, ACM Symposium on Principles of Distributed Computing (2001), ACM, pp. 13–13.
[20] LISKOV, B., AND COWLING, J. Viewstamped replication revisited. Tech. Rep. MIT-CSAIL-TR-2012-021, MIT, July 2012.
[21] LogCabin source code. http://github.com/logcabin/logcabin.
[22] LORCH, J. R., ADYA, A., BOLOSKY, W. J., CHAIKEN, R., DOUCEUR, J. R., AND HOWELL, J. The SMART way to migrate replicated stateful services. In Proc. EuroSys’06, ACM SIGOPS/EuroSys European Conference on Computer Systems (2006), ACM, pp. 103–115.
[23] MAO, Y., JUNQUEIRA, F. P., AND MARZULLO, K. Mencius: building efficient replicated state machines for WANs. In Proc. OSDI’08, USENIX Conference on Operating Systems Design and Implementation (2008), USENIX, pp. 369–384.
[24] MAZIERES, D. Paxos made practical. http://www.scs.stanford.edu/~dm/home/papers/paxos.pdf, Jan. 2007.
[25] MORARU, I., ANDERSEN, D. G., AND KAMINSKY, M. There is more consensus in egalitarian parliaments. In Proc. SOSP’13, ACM Symposium on Operating System Principles (2013), ACM.
[26] Raft user study. http://ramcloud.stanford.edu/~ongaro/userstudy/.
[27] OKI, B. M., AND LISKOV, B. H. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proc. PODC’88, ACM Symposium on Principles of Distributed Computing (1988), ACM, pp. 8–17.
[28] ONGARO, D. Consensus: Bridging Theory and Practice. PhD thesis, Stanford University, 2014 (work in progress). http://ramcloud.stanford.edu/~ongaro/userstudy/thesis.pdf.
[29] ONGARO, D., AND OUSTERHOUT, J. In search of an understandable consensus algorithm (extended version). http://ramcloud.stanford.edu/raft.pdf.
[30] OUSTERHOUT, J., AGRAWAL, P., ERICKSON, D., KOZYRAKIS, C., LEVERICH, J., MAZIÈRES, D., MITRA, S., NARAYANAN, A., ONGARO, D., PARULKAR, G., ROSENBLUM, M., RUMBLE, S. M., STRATMANN, E., AND STUTSMAN, R. The case for RAMCloud. Communications of the ACM 54 (July 2011), 121–130.
[31] Raft consensus algorithm website. http://raftconsensus.github.io.
[32] REED, B. Personal communications, May 17, 2013.
[33] SCHNEIDER, F. B. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys 22, 4 (Dec. 1990), 299–319.
[34] SHVACHKO, K., KUANG, H., RADIA, S., AND CHANSLER, R. The Hadoop distributed file system. In Proc. MSST’10, Symposium on Mass Storage Systems and Technologies (2010), IEEE Computer Society, pp. 1–10.
[35] VAN RENESSE, R. Paxos made moderately complex. Tech. rep., Cornell University, 2012.
