[etcd] Raft Theory Study (1): Why Was Raft Developed?

flyfox_1988

Category: raft

Article link: https://blog.csdn.net/flyfox_1988/article/details/104975486

I. Raft Theory

In Search of an Understandable Consensus Algorithm (Extended Version) (the paper can be downloaded from https://raft.github.io/raft.pdf)

Abstract
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

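Abstract (translation)

Raft is a consensus algorithm for managing a replicated log. Its results are equivalent to (multi-)Paxos and it is as efficient as Paxos, but it is structured differently; this makes Raft easier to understand than Paxos and a better basis for building practical systems. To make it easier to understand, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and enforces a stronger degree of coherency so that fewer states have to be considered. A user study shows that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing cluster membership, which relies on overlapping majorities to guarantee safety.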

1 Introduction

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.

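1 Introduction (translation)

Consensus algorithms let a set of machines work as one coherent group that keeps running when some of its members fail. For that reason they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most consensus implementations are based on Paxos or influenced by it, and Paxos has become the main tool for teaching students about consensus.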

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.

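Translation: Unfortunately, Paxos remains hard to understand despite many attempts to make it more approachable. In addition, its architecture needs complex changes before it can support a practical system. As a result, both system builders and learners find Paxos frustrating to work with.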

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.

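Translation: After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could serve as a better foundation for system building and education. Our approach was unusual because our main goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is much easier to learn than Paxos? We also wanted the algorithm to help system builders develop the intuitions they need. It mattered not only that the algorithm works, but that it is obvious why it works.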

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.

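Translation: The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (compared with Paxos, Raft reduces the degree of nondeterminism and the number of ways servers can be inconsistent with one another). A user study with 43 students at two universities shows that Raft is much easier to understand than Paxos: after learning both algorithms, 33 of these students answered the questions about Raft better than the questions about Paxos.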

Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features:

• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.

• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.

• Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

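Translation: Raft resembles existing consensus algorithms in many ways (most notably Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features:

• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries flow only from the leader to the other servers. This simplifies management of the replicated log and makes Raft easier to understand.

• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism on top of the heartbeats that any consensus algorithm already needs, and it resolves conflicts simply and quickly.

• Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach, in which the majorities of the two different configurations overlap during the transition. This lets the cluster keep operating normally while the configuration changes.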
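The last two features can be illustrated with a minimal Python sketch. It is not taken from the post or from etcd: the function names are hypothetical, the 150–300 ms timeout range is only an assumed example, and the rule that a decision during a membership change needs a separate majority from both configurations is a simplified reading of joint consensus.

```python
import random

def majority(config):
    """Smallest number of servers that forms a majority of `config`."""
    return len(config) // 2 + 1

def joint_agreement(acks, old_config, new_config):
    """True only if `acks` holds a majority of BOTH configurations (assumed rule)."""
    acks = set(acks)
    return (len(acks & set(old_config)) >= majority(old_config)
            and len(acks & set(new_config)) >= majority(new_config))

def election_timeout_ms(rng, low=150, high=300):
    """Pick a randomized election timeout; the range is an assumed example."""
    return rng.uniform(low, high)

old = {"s1", "s2", "s3"}
new = {"s3", "s4", "s5"}
only_old = joint_agreement({"s1", "s2"}, old, new)        # majority of old only
both = joint_agreement({"s1", "s3", "s4"}, old, new)      # majority of old and new
print(only_old, both)

rng = random.Random(0)  # seeded only to make the sketch repeatable
print([round(election_timeout_ms(rng)) for _ in range(3)])
```

Because each server draws its own timeout, it is unlikely that two servers time out at the same moment, which is how randomization keeps split votes rare.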

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.

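Translation: We believe Raft is superior to Paxos and other consensus algorithms, both for teaching and as a foundation for implementation. It is simpler and easier to understand than other algorithms; its description is complete enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to that of other algorithms.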

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).

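Translation: The rest of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).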

2 Replicated state machines

Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].

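2 Replicated state machines (translation)

Consensus algorithms usually appear in the context of replicated state machines [37]. In this approach, the state machines on a set of servers compute identical copies of the same state and can keep running even when some of the servers are down. Replicated state machines are used to solve a range of fault tolerance problems in distributed systems. For example, large-scale systems with a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and to store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].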

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

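Translation: Replicated state machines are usually implemented with a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Every log holds the same commands in the same order, so every state machine processes the same command sequence. Because the state machines are deterministic, each one computes the same state and the same sequence of outputs.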
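The determinism argument above can be made concrete with a small Python sketch. The key-value command format and the `apply_log` helper are illustrative assumptions, not something defined in the paper or the post.

```python
def apply_log(log):
    """Run a deterministic key-value state machine over `log`, in order."""
    state, outputs = {}, []
    for op, key, value in log:
        if op == "set":
            state[key] = value
            outputs.append("ok")
        elif op == "get":
            outputs.append(state.get(key))
        else:
            raise ValueError(f"unknown op: {op}")
    return state, outputs

log = [("set", "x", 1), ("set", "y", 2), ("get", "x", None), ("set", "x", 3)]

# Three replicas that apply the same log in the same order end up identical.
replicas = [apply_log(list(log)) for _ in range(3)]
print(all(r == replicas[0] for r in replicas))

# The same commands in a different order can give a different state,
# which is why the log order has to be agreed on.
reordered_state, _ = apply_log(list(reversed(log)))
print(reordered_state == replicas[0][0])
```

The sketch only covers the state machine side; making sure every server really holds the same log is the job of the consensus algorithm.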

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.

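Translation: Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and appends them to its log. It communicates with the consensus modules on the other servers to make sure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands have been properly replicated, each server’s state machine processes them in log order and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.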
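As a rough illustration of "properly replicated", the Python sketch below has a leader append a command, copy it to its followers, and report whether a majority of the cluster now stores it. The `Server` class and `replicate` function are hypothetical, and the sketch deliberately leaves out what a real consensus module needs: terms, retries, log consistency checks, and persistence.

```python
class Server:
    def __init__(self, name, up=True):
        self.name, self.up, self.log = name, up, []

    def append(self, entry):
        """Store the entry if this server is reachable; report success."""
        if self.up:
            self.log.append(entry)
        return self.up

def replicate(leader, followers, command):
    """Append on the leader, copy to followers, and report whether a
    majority of the whole cluster now holds the command."""
    leader.log.append(command)
    acks = 1 + sum(f.append(command) for f in followers)  # leader counts itself
    cluster_size = 1 + len(followers)
    return acks >= cluster_size // 2 + 1

leader = Server("s1")
followers = [Server("s2"), Server("s3"),
             Server("s4", up=False), Server("s5", up=False)]

first = replicate(leader, followers, "set x 1")   # 3 of 5 servers hold the entry
followers[1].up = False                           # a third server becomes unreachable
second = replicate(leader, followers, "set y 2")  # only 2 of 5 servers hold it
print(first, second)
```

In this toy model the second command stays in the leader's log without being applied; only entries held by a majority would be handed to the state machine.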

Consensus algorithms for practical systems typically have the following properties:

• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.

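Translation: Consensus algorithms for practical systems usually have the following properties:

• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

• They are fully functional (available) as long as any majority of the servers are running and can communicate with each other and with clients. A typical cluster of five servers can therefore tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

• They do not depend on timing to keep the logs consistent: faulty clocks and extreme message delays can, at worst, cause availability problems.

• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not affect overall system performance.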
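The availability property is simple arithmetic, sketched here in Python with hypothetical helper names: a cluster of n servers needs n // 2 + 1 of them to form a majority, so it can lose the rest.

```python
def majority(n):
    """Smallest number of servers that is more than half of n."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many servers can stop while a majority is still available."""
    return n - majority(n)

for n in (3, 4, 5, 7):
    print(f"{n} servers: majority {majority(n)}, tolerates {tolerated_failures(n)} failures")
```

For five servers this gives a majority of three and two tolerated failures, matching the example in the list above. A four-server cluster tolerates only one failure, the same as three servers, which is one reason odd cluster sizes are the usual choice.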

3 What’s wrong with Paxos?

Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

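3 What’s wrong with Paxos? (translation)

Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most often taught in courses, and most consensus implementations use it as their starting point. Paxos first defines a protocol that can reach agreement on a single decision, such as a single replicated log entry. We call this subset single-decree Paxos. Paxos then combines multiple instances of this protocol to support a series of decisions, such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.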

Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.

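Translation: Unfortunately, Paxos has two significant drawbacks. The first is that Paxos is exceptionally hard to understand. The full explanation [15] is notoriously opaque; few people manage to understand it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we could not understand the complete protocol until we had read several simplified explanations and designed our own alternative protocol, a process that took almost a year.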

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

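Translation: We hypothesize that Paxos’ opaqueness comes from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that have no simple intuitive explanation and cannot be understood independently. Because of this, it is hard to build an intuition for why the single-decree protocol works. The composition rules for multi-Paxos add considerable further complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (that is, a log rather than a single entry) can be decomposed in other ways that are more direct and obvious.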

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

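Translation: The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but they differ from one another and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.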

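Translation: Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit in choosing a collection of log entries independently and then merging them into a sequential log; this only adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (although it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to elect a leader first and then have the leader coordinate the decisions.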

Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

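Translation: Because of these problems, we concluded that Paxos does not provide a good foundation for either system building or education. Given how important consensus is in large-scale software systems, we decided to see whether we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.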
