In Search of an Understandable Consensus Algorithm (Extended Version), Part 2

3 What’s wrong with Paxos?

Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.
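
The contrast between choosing log entries independently and appending them in order can be illustrated with a small sketch. It is not taken from the paper: `SlotLog` and `SequentialLog` are hypothetical names, and neither class implements any consensus protocol. The sketch only shows why per-slot decisions can leave gaps that must later be reconciled, while an append-only log is contiguous by construction.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SlotLog:
    """Hypothetical log whose slots are decided independently of each other."""
    slots: Dict[int, str] = field(default_factory=dict)

    def decide(self, index: int, command: str) -> None:
        # Any slot may be decided at any time, regardless of its neighbours.
        self.slots[index] = command

    def holes(self) -> List[int]:
        """Indices below the highest decided slot that are still undecided."""
        if not self.slots:
            return []
        return [i for i in range(1, max(self.slots) + 1) if i not in self.slots]


@dataclass
class SequentialLog:
    """Hypothetical log that only supports appending at the end."""
    entries: List[str] = field(default_factory=list)

    def append(self, command: str) -> int:
        self.entries.append(command)
        return len(self.entries)  # 1-based index of the new entry


slot_log = SlotLog()
slot_log.decide(1, "x=1")
slot_log.decide(3, "x=3")      # slot 2 is still undecided
print(slot_log.holes())        # [2]

seq_log = SequentialLog()
print(seq_log.append("x=1"))   # 1
print(seq_log.append("x=3"))   # 2
```

With the slot-based structure, a reader has to reason about which slots are filled before the log can be applied in order; with the sequential structure that question never arises.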

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].

Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

3 What's wrong with Paxos? (translation)

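Over the past decade, Leslie Lamport's Paxos protocol has become almost synonymous with consensus: it is the protocol most often taught in courses, and most consensus implementations take it as their starting point. Paxos first defines a protocol for reaching agreement on a single decision, such as a single replicated log entry; we call this subset single-decree Paxos. Paxos then combines multiple instances of that protocol to support a series of decisions, such as a log (multi-Paxos). Paxos guarantees both safety and liveness and supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.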

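Unfortunately, Paxos has two major drawbacks. First, Paxos is exceptionally hard to understand. The full explanation [15] is notoriously opaque; few people manage to understand it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations concentrate on the single-decree subset, yet they remain challenging. In an informal survey of NSDI 2012 attendees, we found few people who were comfortable with Paxos, even among experienced researchers. We struggled with Paxos ourselves: we could not understand the complete protocol until we had read several simplified explanations and designed our own alternative protocol, a process that took almost a year.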

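We hypothesize that Paxos' opaqueness comes from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is split into two stages that have no simple intuitive explanation and cannot be understood independently. This makes it hard to build an intuition for why the single-decree protocol works. The composition rules for multi-Paxos add considerable further complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (that is, a log rather than a single entry) can be decomposed in other ways that are more direct and obvious.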

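The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport's descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but they differ from one another and from Lamport's sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.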

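Furthermore, the Paxos architecture is a poor choice for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit in choosing a collection of log entries independently and then merging them into a sequential log; doing so only adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (although it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision is made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to elect a leader first and then have the leader coordinate the decisions.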

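As a result, practical systems bear little resemblance to Paxos. Each implementation starts from Paxos, discovers how difficult it is to implement, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulty of understanding Paxos makes the problem worse. Paxos' formulation may be a good one for proving theorems about its correctness, but real implementations differ from Paxos so much that the proofs have little value. The following comment from the Chubby implementers is typical: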

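There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].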

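Because of these problems, we concluded that Paxos does not provide a good foundation for either system building or education. Given the importance of consensus in large-scale software systems, we decided to see whether we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.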

4 Designing for understandability

We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion ("choose any; it doesn't matter"). We used randomization to simplify the Raft leader election algorithm.
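
The "choose any; it doesn't matter" idea behind randomized leader election can be sketched in a few lines. This is a minimal illustration, not the election algorithm itself: the function names, the five-server cluster, and the 150 to 300 ms range are assumptions made for the example and do not come from this section. Each server draws its own random timeout, and whichever one expires first becomes the candidate, so there is no need for a ranking scheme or tie-breaking rules among servers.

```python
import random
from typing import Dict


def election_timeout(rng: random.Random, low_ms: float = 150.0, high_ms: float = 300.0) -> float:
    """Draw a timeout uniformly at random from [low_ms, high_ms] (assumed range)."""
    return rng.uniform(low_ms, high_ms)


def first_to_time_out(timeouts: Dict[str, float]) -> str:
    """Return the server whose timeout expires first; it would start an election."""
    return min(timeouts, key=timeouts.get)


rng = random.Random(42)  # seeded only so that repeated runs give the same draw
servers = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical cluster
timeouts = {name: election_timeout(rng) for name in servers}

candidate = first_to_time_out(timeouts)
print(candidate, round(timeouts[candidate], 1))
```

Every server is handled identically, which is the sense in which randomization reduces the state space: the reasoning does not depend on which server happens to time out first.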

4 Designing for understandability (translation)

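We had several goals in designing Raft: it must provide a complete and practical foundation for building systems, so that it significantly reduces the design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal, and our hardest challenge, was understandability. A large audience must be able to understand the algorithm comfortably. It must also be possible to develop intuitions about the algorithm, so that system builders can make the extensions that real-world implementations inevitably require.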

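At many points in Raft's design we had to choose among alternative approaches. In those cases we evaluated the alternatives by understandability: how hard is each one to explain (for example, how complex is its state space, and does it have subtle implications?), and how easily can a reader fully understand the approach and its implications?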

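We recognize that such analysis is highly subjective; nonetheless, we used two generally applicable techniques. The first is the well-known approach of problem decomposition: wherever possible, we split problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.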

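Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with one another. Although we tried to eliminate nondeterminism in most cases, in some situations nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to shrink the state space by handling all possible choices in the same way ("choose any; it doesn't matter"). We used randomization to simplify Raft's leader election algorithm.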

Reposted from: https://my.oschina.net/daidetian/blog/488295