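This article discusses the CAP theorem's trade-offs among consistency, availability, and partition tolerance in large distributed systems, and emphasizes strategies for improving availability through fault tolerance, containment, and isolation. It introduces harvest and yield as metrics of system behavior, and presents random data distribution and replication of critical data as ways to improve yield.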

by Shubheksha

Harvest, Yield, and Scalable Tolerant Systems: A Summary

This article presents a summary of the paper “Harvest, Yield, and Scalable Tolerant Systems” published by Eric Brewer & Armando Fox in 1999. All unattributed quotes are from this paper.

The paper deals with the trade-offs between consistency and availability (CAP) for large systems. It’s very easy to point to CAP and assert that no system can have both consistency and availability.

But there is a catch. CAP has been misunderstood in a variety of ways. As Coda Hale explains in his excellent blog post “You Can’t Sacrifice Partition Tolerance”:

Of the CAP theorem’s Consistency, Availability, and Partition Tolerance, Partition Tolerance is mandatory in distributed systems. You cannot not choose it. Instead of CAP, you should think about your availability in terms of yield (percent of requests answered successfully) and harvest (percent of required data actually included in the responses) and which of these two your system will sacrifice when failures happen.

The paper focuses on increasing the availability of large-scale systems through fault tolerance, containment, and isolation:

We assume that clients make queries to servers, in which case there are at least two metrics for correct behavior: yield, which is the probability of completing a request, and harvest, which measures the fraction of the data reflected in the response, i.e. the completeness of the answer to the query.

The two metrics, harvest and yield, can be summarized as follows:

Harvest: data in response / total data

For example, if one node is down in a 100-node cluster, the harvest is 99% for the duration of the fault.

Yield: requests completed successfully / total number of requests

Note: Yield is different from uptime. Yield counts requests rather than time, so an outage during peak traffic costs far more yield than an equally long outage during a quiet period.
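
To make the two ratios concrete, here is a minimal Python sketch. The function names and the traffic numbers are assumptions made up for this illustration; they do not come from the paper. It also shows why yield and uptime can diverge: the system is down for one hour out of two in both scenarios, yet the yield differs sharply.

```python
def harvest(data_in_response: int, total_data: int) -> float:
    """Fraction of the complete data that a response reflects."""
    return data_in_response / total_data


def request_yield(completed: int, total_requests: int) -> float:
    """Fraction of requests that completed successfully."""
    return completed / total_requests


# Harvest: 1 node down in a 100-node cluster, assuming equal data per node.
cluster_harvest = harvest(99, 100)

# Yield vs. uptime: hypothetical request counts for two one-hour windows.
peak_hour_requests = 90_000
quiet_hour_requests = 10_000
total_requests = peak_hour_requests + quiet_hour_requests

# In both scenarios the system is down for exactly one of the two hours
# (uptime = 50%), and we assume every request during the outage fails.
yield_down_at_peak = request_yield(
    total_requests - peak_hour_requests, total_requests
)
yield_down_when_quiet = request_yield(
    total_requests - quiet_hour_requests, total_requests
)

print(f"harvest with one node down: {cluster_harvest:.2%}")
print(f"yield, outage at peak:      {yield_down_at_peak:.0%}")
print(f"yield, outage off-peak:     {yield_down_when_quiet:.0%}")
```

With these made-up numbers the uptime is 50% in both cases, but the yield is 10% when the outage hits the peak hour and 90% when it hits the quiet hour.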

The paper argues that there are certain systems which require perfect responses to queries every single time. Also, there are systems that can tolerate imperfect answers once in a while.

To increase the overall availability of our systems, we need to think carefully through the consistency and availability guarantees they need to provide.

Trading Harvest for Yield: Probabilistic Availability

Nearly all systems are probabilistic whether they realize it or not. In particular, any system that is 100% available under single faults is probabilistically available overall (since there is a non-zero probability of multiple failures).
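
The "non-zero probability of multiple failures" can be estimated with a short calculation. The sketch below assumes that nodes fail independently with the same probability, which is a simplification of mine and not a claim from the paper; the cluster size and failure rate are hypothetical.

```python
from math import comb


def prob_at_least(k: int, nodes: int, p_fail: float) -> float:
    """P(at least k of `nodes` are down), assuming independent failures."""
    p_fewer = sum(
        comb(nodes, i) * p_fail**i * (1 - p_fail) ** (nodes - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# Hypothetical cluster: 100 nodes, each down 1% of the time.
p_multi = prob_at_least(2, nodes=100, p_fail=0.01)
print(f"P(two or more nodes down at once) = {p_multi:.3f}")
```

Under these assumptions, a design that only survives single faults would see two or more nodes down roughly a quarter of the time, which is why its availability is probabilistic rather than absolute.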

The paper talks about understanding the probabilistic nature of availability. This helps in understanding and limiting the impact of faults: we decide what needs to stay available and which kinds of faults the system must be able to handle.

The authors outline the linear degradation of harvest under multiple node faults. Harvest is directly proportional to the number of nodes that are functioning correctly, so it falls or rises linearly as nodes fail or recover.
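
The linear relationship can be written down directly. This is a minimal sketch assuming data is spread evenly across nodes and nothing is replicated; `harvest_after_faults` is a name invented for the example.

```python
def harvest_after_faults(total_nodes: int, failed_nodes: int) -> float:
    """Harvest when data is spread evenly and nothing is replicated."""
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    return (total_nodes - failed_nodes) / total_nodes


# Each extra failed node in an 8-node cluster removes the same 12.5%.
curve = [harvest_after_faults(8, failed) for failed in range(5)]
print(curve)  # [1.0, 0.875, 0.75, 0.625, 0.5]
```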

Two strategies are suggested for increasing the yield:

Random distribution of data on the nodes

With random placement, losing a node has the same average-case and worst-case effect, whichever node it is. If the distribution isn’t random, the impact of a fault depends on which data the failed node happened to hold.

For example, if the only node storing users’ account balances goes down, the entire banking system will be unable to work.

Replicating the most important data

This reduces the impact when one of the nodes containing a subset of high-priority data goes down.

It also improves harvest.
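
The two strategies above can be sketched as a small simulation. Everything here (the 10-node cluster, the 10,000 records, the 100 "important" records, and the helper names `place` and `harvest`) is a hypothetical setup chosen for illustration, not something taken from the paper.

```python
import random


def place(items, num_nodes, important, rng, replicas=2):
    """Put each item on a random node; important items get `replicas` copies."""
    placement = {}
    for item in items:
        copies = replicas if item in important else 1
        placement[item] = set(rng.sample(range(num_nodes), copies))
    return placement


def harvest(placement, failed_nodes, subset=None):
    """Fraction of items (or of `subset`) that still have a live copy."""
    keys = list(placement) if subset is None else list(subset)
    alive = sum(1 for key in keys if placement[key] - failed_nodes)
    return alive / len(keys)


NUM_NODES = 10
items = range(10_000)
important = set(range(100))  # hypothetical high-priority records
rng = random.Random(0)       # fixed seed so runs are repeatable

random_placement = place(items, NUM_NODES, important, rng)

# Non-random contrast: every important record lives only on node 0.
clustered_placement = {
    item: {0} if item in important else {item % NUM_NODES} for item in items
}

# Columns: failed node, overall harvest (random placement),
# important-data harvest (random + replicated), same for clustered.
for node in range(NUM_NODES):
    failed = {node}
    print(
        node,
        round(harvest(random_placement, failed), 3),
        harvest(random_placement, failed, important),
        harvest(clustered_placement, failed, important),
    )
```

With random placement, every single-node failure costs about a tenth of the overall harvest, and the replicated important records stay fully available. In the clustered layout, losing node 0 wipes out all of the important data, which is the banking scenario described above.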

Another notable observation in the paper is that it is possible to replicate all your data, but doing so does little to improve harvest or yield while substantially increasing the cost of operation. This is because the Internet runs on best-effort protocols, which can never guarantee 100% harvest or yield.
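
A back-of-the-envelope calculation shows why the last fraction of a percent buys so little. The sketch assumes that network loss and service failures are independent, so the yield a client sees is their product; both that model and the numbers are my own assumptions for illustration.

```python
def end_to_end_yield(network_delivery: float, service_yield: float) -> float:
    """Yield seen by clients, assuming the two loss sources are independent."""
    return network_delivery * service_yield


NETWORK_DELIVERY = 0.99  # hypothetical best-effort delivery rate

# Hypothetical service yields: replicating only key data vs. everything.
partial = end_to_end_yield(NETWORK_DELIVERY, service_yield=0.999)
full = end_to_end_yield(NETWORK_DELIVERY, service_yield=1.0)

print(f"key data replicated: {partial:.4f}")
print(f"all data replicated: {full:.4f}")
print(f"gain from full replication: {full - partial:.4f}")
```

However much is replicated, the client-visible yield in this model cannot exceed what the network itself delivers.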

Application Decomposition and Orthogonal Mechanisms

The second strategy focuses on the benefits of orthogonal system design.

It starts out by stating that large systems are composed of subsystems that individually cannot tolerate failures, but that fail in a way that lets the overall system keep functioning, with some loss of utility.

The actual benefit is the ability to provision each subsystem’s state management separately, providing strong consistency or persistent state only for the subsystems that need it, not for the entire application. The savings can be significant if only a few small subsystems require the extra complexity.

The paper states that orthogonal components are completely independent of each other. They have no runtime interface to other components, except possibly a configuration interface. This allows each component to fail independently and minimizes the impact of its failure on the overall system.

Composition of orthogonal subsystems shifts the burden of checking for possibly harmful interactions from runtime to compile time, and deployment of orthogonal guard mechanisms improves robustness for the runtime interactions that do occur, by providing improved fault containment.
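
As a rough illustration of decomposition with fault containment, the sketch below assembles a page from independent subsystems and wraps each call in a guard. The shopping-site subsystems and the `guarded` and `render_page` helpers are hypothetical names invented for this example; the paper does not prescribe this design.

```python
from typing import Callable, Dict, Optional


def guarded(fetch: Callable[[], str]) -> Optional[str]:
    """Contain a subsystem fault: report it as None instead of raising."""
    try:
        return fetch()
    except Exception:
        return None


def render_page(
    subsystems: Dict[str, Callable[[], str]]
) -> Dict[str, Optional[str]]:
    """Call every subsystem independently; one failure does not stop the rest."""
    return {name: guarded(fetch) for name, fetch in subsystems.items()}


# Hypothetical subsystems of a shopping site.
def catalog() -> str:
    return "42 products"


def recommendations() -> str:
    raise RuntimeError("recommendation store unreachable")


def cart() -> str:
    return "3 items in cart"


page = render_page(
    {"catalog": catalog, "recommendations": recommendations, "cart": cart}
)
print(page)
```

The request still completes, so yield is preserved, but the recommendations section is missing from the response: the application keeps working with reduced utility.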

The goal of this paper was to motivate research in the field of designing fault-tolerant and highly available large scale systems.

It also encourages thinking carefully about the consistency and availability guarantees an application needs to provide, as well as the trade-offs it can make between harvest and yield.
