Paper translation
5.3 Log replication
Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
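Once a leader has been elected, it starts serving client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then replicates the entry to every other server in parallel. Once the entry has been safely replicated, the leader applies it to its state machine and returns the result to the client. If a follower crashes or runs slowly, or if the network drops packets, the leader retries the AppendEntries RPC indefinitely (even after it has replied to the client) until every follower eventually stores every log entry.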
Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.
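Logs are organized as shown in Figure 6. Each log entry stores a state machine command together with the term number in effect when the leader received that command. Term numbers are used to detect inconsistencies between log replicas and to ensure some of the properties in Figure 3. Each entry also has an integer index that identifies its position in the log.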
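The log layout described above can be sketched as a small data structure. This is only an illustration: the names `LogEntry`, `Log`, `term_at`, and `last_index` are hypothetical, commands are plain strings, and index 0 is treated as "before the first entry" with term 0, which is a convenience assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    term: int      # term in which the leader received the command
    command: str   # opaque state machine command


class Log:
    """A 1-indexed log: the first entry has index 1."""

    def __init__(self):
        self._entries = []

    def append(self, term, command):
        self._entries.append(LogEntry(term, command))
        return len(self._entries)  # index of the entry just appended

    def term_at(self, index):
        # Index 0 means "before the first entry"; its term is taken to be 0.
        return 0 if index == 0 else self._entries[index - 1].term

    def last_index(self):
        return len(self._entries)


log = Log()
log.append(1, "x <- 3")
log.append(1, "y <- 1")
idx = log.append(2, "x <- 2")
```

Each entry is identified by its (index, term) pair, which is what the consistency check later compares.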
Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.
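Figure 6: Logs are made up of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in the box) and a command for the state machine. An entry is considered committed if it is safe to apply it to the state machines.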
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
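The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all available state machines. A log entry is committed once the leader that created it has replicated it on a majority of the servers (for example, entry 7 in Figure 6). This also commits all preceding entries in the leader's log, including entries created by previous leaders. Section 5.4 discusses some subtleties of applying this rule after a leader change and shows that this definition of commitment is safe. The leader tracks the highest index it knows to be committed and includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually learn it. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).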
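A minimal sketch of the commit rule, assuming the leader keeps a per-server `match_index` (the highest index known to be stored on that server, with the leader's own last index included). The function name and argument layout are hypothetical. The check that the entry belongs to the current term reflects the Section 5.4 restriction mentioned above, not anything stated in this section.

```python
def advance_commit_index(commit_index, match_index, log_terms, current_term):
    """Return the new commit index for the leader.

    match_index: highest replicated log index per server, leader included.
    log_terms:   terms of the leader's entries; entry i is log_terms[i - 1].
    """
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is stored on at least a majority of servers.
    candidate = sorted(match_index, reverse=True)[majority - 1]
    # Only commit by counting replicas if the entry is from the current term
    # (the restriction from Section 5.4); earlier entries commit implicitly.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index


# Five servers: three of them (including the leader) store entries up to 7.
leader_terms = [1, 1, 1, 2, 3, 3, 3]
new_commit = advance_commit_index(0, [7, 7, 7, 5, 2], leader_terms, current_term=3)
```

Committing index 7 also commits indexes 1 through 6, since the commit index covers everything before it.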
We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:
• If two entries in different logs have the same index and term, then they store the same command.
• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
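We designed the Raft log mechanism to keep the logs on different servers highly consistent with one another. This simplifies the system's behavior and makes it more predictable, and it is also an important part of ensuring safety. Raft maintains the following properties, which together make up the Log Matching Property in Figure 3:
• If two entries in different logs have the same index and term, then they store the same command.
• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.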
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
First property: a leader creates at most one log entry at a given log index in a given term, and log entries never change their position in the log. This guarantees the first property above.
Second property: this is guaranteed by a simple consistency check performed by the AppendEntries RPC. When sending an AppendEntries RPC, the leader includes the index and term of the entry that immediately precedes the new entries.
If the follower cannot find an entry in its log with the same index and term, it refuses the new entries.
The consistency check acts as an induction step: the initial empty log trivially satisfies the Log Matching Property, and the consistency check preserves the property whenever the log is extended. As a result, whenever an AppendEntries RPC returns successfully, the leader knows that the follower's log is identical to its own (from the first entry up through the new entries).
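The consistency check can be sketched from the follower's side as follows. This is a simplified illustration: the log is a plain list of `(term, command)` tuples with entry `i` stored at `log[i - 1]`, the function name is hypothetical, and the term comparison and commit-index handling that a real AppendEntries handler also performs are left out.

```python
def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower side of AppendEntries; mutates log, returns True if accepted."""
    # Consistency check: the follower must hold an entry at prev_log_index
    # whose term equals prev_log_term (index 0 always matches).
    if prev_log_index > len(log):
        return False
    if prev_log_index > 0 and log[prev_log_index - 1][0] != prev_log_term:
        return False

    for offset, entry in enumerate(entries):
        index = prev_log_index + 1 + offset
        if index <= len(log):
            if log[index - 1][0] == entry[0]:
                continue  # same index and term: entry is already stored
            del log[index - 1:]  # conflict: drop this entry and all after it
        log.append(entry)
    return True


follower = [(1, "a"), (1, "b"), (2, "c")]
accepted = handle_append_entries(follower, 2, 1, [(3, "d")])
rejected = handle_append_entries(follower, 5, 3, [(3, "e")])
```

In the first call the entry at index 3 conflicts (term 2 versus term 3), so it is replaced. The second call is refused because the follower has no entry at index 5.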
During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.
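During normal operation, the logs of the leader and the followers stay consistent, so the AppendEntries consistency check never fails.
However, a leader crash can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 shows the ways in which a follower's log may differ from that of a new leader.
- A follower may be missing entries that are present on the new leader,
- it may have extra entries that are not present on the new leader, or both.
- Missing and extra entries may span multiple terms.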
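Figure 7: When the leader at the top comes to power, any of scenarios (a-f) may occur in follower logs.
Each box represents one log entry; the number in the box is its term.
A follower may be missing entries (a-b),
may have extra uncommitted entries (c-d),
or both (e-f).
For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader again for term 3, and added a few more entries to its log; before any of the entries from term 2 or term 3 were committed, the server crashed again and remained down for the next several terms.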
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.
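In Raft, the leader resolves inconsistencies by forcing the followers' logs to duplicate its own. This means that conflicting entries in a follower's log are overwritten with entries from the leader's log. Section 5.4 shows that this is safe once one more restriction is added.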
To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
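To bring a follower's log into consistency with its own, the leader must find the latest log entry (the one with the highest index) on which the two logs agree, delete every entry in the follower's log after that point, and send the follower all of the leader's entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a new leader is elected, it initializes every nextIndex to the index of its last log entry plus 1 (11 in Figure 7). If a follower's log is inconsistent with the leader's, the consistency check in the next AppendEntries RPC fails. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex reaches a point where the leader's and follower's logs match. At that point the AppendEntries RPC succeeds, which removes all conflicting entries from the follower's log and appends the entries from the leader's log (if there are any to append). Once AppendEntries succeeds, the follower's log is consistent with the leader's, and it stays that way for the rest of the term.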
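The nextIndex back-off can be sketched as a small simulation. Everything here is illustrative: logs are reduced to lists of term numbers, `append_entries` and `sync_follower` are hypothetical names, the RPC is a direct function call, and the two example logs are made up (loosely modeled on the leader and scenario (f) described above) rather than copied from the figure.

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Simplified follower: logs are lists of terms, entry i is log[i - 1]."""
    if prev_index > len(follower_log):
        return False
    if prev_index > 0 and follower_log[prev_index - 1] != prev_term:
        return False
    # Simplification: replace everything after prev_index in one step.
    # A real follower only truncates on an actual conflict, so that delayed
    # or reordered RPCs cannot remove entries that already match.
    follower_log[prev_index:] = entries
    return True


def sync_follower(leader_log, follower_log):
    """Decrement next_index until the check passes; return the RPC count."""
    next_index = len(leader_log) + 1  # just past the leader's last entry
    rpcs = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1] if prev_index > 0 else 0
        rpcs += 1
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return rpcs
        next_index -= 1  # rejected: back up one entry and retry


leader = [1, 1, 1, 4, 4, 5, 5, 6, 6, 6]
follower = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
rpc_count = sync_follower(leader, follower)
```

The loop always terminates, because a request with `prev_index` 0 has nothing to compare against and is accepted. With these two logs, the logs first agree at index 3, so the leader walks back one entry per rejected RPC until it gets there.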
If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.
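Optimization: if desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrease nextIndex far enough to skip all of the conflicting entries in that term, so that one AppendEntries RPC is needed per term with conflicting entries rather than one per entry. In practice, the authors doubt this optimization is necessary, since failures are infrequent and there are unlikely to be many inconsistent entries.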
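A sketch of this optimization, under the same simplifications as before (logs as lists of terms, hypothetical function names, made-up example logs). The rejection now carries the conflicting term and the first index the follower stores for that term. The case where the follower's log is simply too short is not covered by the text; returning "last index + 1" there is an assumption made for this sketch.

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Return (ok, conflict_term, conflict_index); logs are lists of terms."""
    if prev_index > len(follower_log):
        # Assumption: no entry at prev_index, so point just past the log end.
        return False, None, len(follower_log) + 1
    if prev_index > 0 and follower_log[prev_index - 1] != prev_term:
        conflict_term = follower_log[prev_index - 1]
        first = prev_index
        # Walk back to the first index stored for the conflicting term.
        while first > 1 and follower_log[first - 2] == conflict_term:
            first -= 1
        return False, conflict_term, first
    follower_log[prev_index:] = entries  # simplified truncate-and-append
    return True, None, None


def sync_follower(leader_log, follower_log):
    """Use the follower's hint to skip a whole term per rejected RPC."""
    next_index = len(leader_log) + 1
    rpcs = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1] if prev_index > 0 else 0
        rpcs += 1
        ok, _conflict_term, conflict_index = append_entries(
            follower_log, prev_index, prev_term, leader_log[prev_index:])
        if ok:
            return rpcs
        next_index = conflict_index  # always smaller than the old next_index


leader = [1, 1, 1, 4, 4, 5, 5, 6, 6, 6]
follower = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
rpc_count = sync_follower(leader, follower)
```

With these logs the follower holds conflicting entries from two terms (2 and 3), so two RPCs are rejected and the third succeeds, compared with one rejection per conflicting entry without the hint.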
With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).
This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.
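With this mechanism, a leader does not need to do anything special to restore log consistency when it comes to power. It simply begins normal operation, and the logs converge automatically in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).
This log replication mechanism exhibits the consensus properties described in Section 2: as long as a majority of the servers are up, Raft can accept, replicate, and apply new log entries; in the normal case a new entry can be replicated to a majority of the cluster in a single round of RPCs; and a single slow follower does not affect overall performance.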
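Summary
Request path
1. The client sends a request to the leader; the leader writes it to its raft log and sends it to the followers.
2. Each follower writes the entry to its raft log and replies that replication succeeded.
If a request from the leader times out, the leader keeps retrying.
3. Once the entry is stored on a majority of the servers (counting the leader itself), the leader applies it, updates the commit index, and tells the client that the write succeeded.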
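The three steps above can be sketched as a toy, single-process simulation. It is not how any real implementation works: `Server` and `client_request` are hypothetical names, an unreachable follower is modeled by a flag, a successful AppendEntries is replaced by copying the leader's log, and the retry loop from step 2 is omitted.

```python
class Server:
    def __init__(self):
        self.log = []          # list of (term, command)
        self.reachable = True  # False simulates a crashed or partitioned server


def client_request(leader, followers, term, command):
    """Return True if the entry reached a majority and can be committed."""
    # Step 1: the leader appends the command to its own log.
    leader.log.append((term, command))
    acks = 1  # the leader counts as one replica
    # Step 2: followers that can be reached store the entry and acknowledge.
    for follower in followers:
        if follower.reachable:
            follower.log = list(leader.log)  # stand-in for AppendEntries
            acks += 1
    # Step 3: commit once a majority of the whole cluster stores the entry.
    cluster_size = len(followers) + 1
    return acks > cluster_size // 2


leader = Server()
followers = [Server() for _ in range(4)]
followers[2].reachable = False
followers[3].reachable = False
committed = client_request(leader, followers, term=1, command="x <- 3")
```

With two of five servers down, the entry is still stored on three servers, so the write can be acknowledged. Losing one more follower would leave only two copies and the entry would stay uncommitted until retries succeed.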
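Failure cases
Because of network or machine failures, leadership may change repeatedly while writes are still in progress.
The following situations can then arise:
- A follower may be missing some log entries (a-b)
- A follower may have extra uncommitted log entries (c-d)
- Or both (e-f)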
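Solution
The leader resolves inconsistencies by forcing followers to duplicate its log. This means that conflicting entries in a follower's log are overwritten with entries from the leader's log. Section 5.4 shows that this is safe once one more restriction is added.
- The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.
How this is actually carried out in braft: to be followed up.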