In Search of an Understandable Consensus Algorithm (Extended Version)

Abstract 

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

1 Introduction 

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus. 

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos. 

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos. 

Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features: 

  • Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.

  • Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly. 

  • Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes. 
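The randomized-timer idea behind the leader election bullet above can be sketched in a few lines. The timeout range and the function name here are illustrative assumptions, not part of the paper's specification; the point is only that each server draws an independent random timeout, so one server usually times out first, wins the election, and re-establishes heartbeats before anyone else's timer fires.

```python
import random

# Illustrative bounds (milliseconds); real deployments tune these to be
# well above the broadcast time and well below the mean time between failures.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def random_election_timeout() -> float:
    """Each follower independently draws a fresh random timeout.

    Because the draws are independent and spread over a range, split
    votes are rare: in most elections a single server's timer expires
    first, it becomes a candidate, and it wins before others time out.
    """
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```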

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms. 

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).

2 Replicated state machines

Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server's state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.
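The determinism argument above can be made concrete with a toy example. This is a minimal sketch, not the paper's design: the key-value commands and the `StateMachine` class are invented for illustration, and the consensus machinery that actually replicates the log is omitted entirely.

```python
class StateMachine:
    """A deterministic key-value state machine.

    Given the same log of commands applied in the same order, every
    replica computes exactly the same state.
    """

    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Each command is a ("set", key, value) tuple in this sketch.
        op, key, value = command
        if op == "set":
            self.state[key] = value
        return self.state.get(key)

# Two replicas applying the same replicated log end up identical.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]
a, b = StateMachine(), StateMachine()
for entry in log:
    a.apply(entry)
    b.apply(entry)
assert a.state == b.state == {"x": 3, "y": 2}
```

Note that determinism is what makes this work: if `apply` consulted a clock or a random number generator, identical logs would no longer guarantee identical states.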

Consensus algorithms for practical systems typically have the following properties:

  • They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

  • They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

  • They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

  • In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
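The majority arithmetic in the second bullet is easy to verify directly. The helper names below are hypothetical, introduced only to check the "five servers tolerate two failures" claim:

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority of the cluster."""
    return cluster_size // 2 + 1

def fault_tolerance(cluster_size: int) -> int:
    """How many servers can fail while a majority can still be reached."""
    return cluster_size - quorum_size(cluster_size)

# A five-server cluster needs 3 responses per round and tolerates 2 failures,
# matching the example in the text.
assert quorum_size(5) == 3
assert fault_tolerance(5) == 2
```

This is also why clusters usually have an odd number of servers: going from 5 to 6 servers raises the quorum to 4 without improving fault tolerance.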

Reposted from: https://my.oschina.net/daidetian/blog/485216
