CAP Confusion: Problems with ‘Partition Tolerance’
2011-11-20 12:21:17
The ‘CAP’ theorem is a hot topic in the design of distributed data storage systems. However, it’s often widely misused. In this post I hope to highlight why the common ‘consistency, availability and partition tolerance: pick two’ formulation is inadequate for distributed systems. In fact, the lesson of the theorem is that the choice is almost always between sequential consistency and high availability.
It’s very common to invoke the ‘CAP theorem’ when designing, or talking about designing, distributed data storage systems. The theorem, as commonly stated, gives system designers a choice between three competing guarantees:
Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1.
Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g., write operations.
Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly?
This is often summarised as a single sentence: “consistency, availability, partition tolerance. Pick two.” Short, snappy and useful.
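The consistency example above (Client A writes 1 then 2; Client B must never read 2 followed by 1) can be made concrete with a toy single-register store. This is a minimal sketch, not any real system's API; the `Register` class and its names are purely illustrative.

```python
# A toy in-memory register with a single total order of writes.
# Because every write lands in one shared history, no client can
# observe the values in an order that contradicts the writer.

class Register:
    """An illustrative linearizable register: one value, totally ordered writes."""
    def __init__(self):
        self.history = []            # ordered log of all written values

    def write(self, value):
        self.history.append(value)

    def read(self):
        return self.history[-1] if self.history else None

r = Register()
r.write(1)                           # Client A writes 1...
first = r.read()                     # ...Client B reads
r.write(2)                           # ...then Client A writes 2
second = r.read()                    # ...Client B reads again

# With a single totally ordered history, B cannot see 2 before 1.
assert (first, second) != (2, 1)
print(first, second)                 # → 1 2
```

The whole difficulty of the theorem is that once the register is replicated across machines that can lose messages to each other, this single shared history can no longer be maintained cheaply.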
At least, that’s the conventional wisdom. Many modern distributed data stores, including those often caught under the ‘NoSQL’ net, pride themselves on offering availability and partition tolerance over strong consistency; the reasoning being that short periods of application misbehavior are less problematic than short periods of unavailability. Indeed, Dr. Michael Stonebraker posted an article on the ACM’s blog bemoaning the preponderance of systems that are choosing the ‘AP’ data point, and arguing that consistency and availability are the two to choose. However, for the vast majority of systems, I contend that the choice is almost always between consistency and availability, and unavoidably so.
Dr. Stonebraker’s central thesis is that, since partitions are rare, we might simply sacrifice ‘partition tolerance’ in favour of sequential consistency and availability – a model that is well suited to traditional transactional data processing and the maintenance of the good old ACID invariants of most relational databases. I want to illustrate why this is a misinterpretation of the CAP theorem.
We first need to get exactly what is meant by ‘partition tolerance’ straight. Dr. Stonebraker asserts that a system is partition tolerant if processing can continue in both partitions in the case of a network failure:
“If there is a network failure that splits the processing nodes into two groups that cannot talk to each other, then the goal would be to allow processing to continue in both subgroups.”
This is actually a very strong partition tolerance requirement. Digging into the history of the CAP theorem reveals some divergence from this definition.
Seth Gilbert and Professor Nancy Lynch provided both a formalisation and a proof of the CAP theorem in their 2002 SIGACT paper. We should defer to their definition of partition tolerance – if we are going to invoke CAP as a mathematical truth, we should formalize our foundations, otherwise we are building on very shaky ground. Gilbert and Lynch define partition tolerance as follows:
“The network will be allowed to lose arbitrarily many messages sent from one node to another”
Note that Gilbert and Lynch’s definition isn’t a property of a distributed application, but a property of the network in which it executes. This is often misunderstood: partition tolerance is not something we have a choice about designing into our systems. If you have a partition in your network, you lose either consistency (because you allow updates to both sides of the partition) or you lose availability (because you detect the error and shut down the system until the error condition is resolved). Partition tolerance means simply developing a coping strategy by choosing which of the other system properties to drop. This is the real lesson of the CAP theorem – if you have a network that may drop messages, then you cannot have both availability and consistency; you must choose one. We should really be writing Possibility of Network Partitions => not(availability and consistency), but that’s not nearly so snappy.
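The forced choice during a partition can be sketched in a few lines. This is a deliberately simplified model (the `Cluster` and `Replica` names are hypothetical, not any real system's API): once two replicas cannot exchange messages, a write must either be refused (sacrificing availability) or be accepted on one side only (sacrificing consistency once the sides diverge).

```python
# Toy model of a two-replica store during a network partition.
# "CP" mode refuses writes it cannot replicate; "AP" mode accepts
# them locally and lets the replicas diverge.

class Replica:
    def __init__(self):
        self.value = None

class Cluster:
    def __init__(self, mode):
        self.mode = mode                   # "CP" or "AP"
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        if not self.partitioned:           # healthy network: replicate everywhere
            self.a.value = self.b.value = value
            return "ok"
        if self.mode == "CP":              # keep consistency: refuse the write
            return "unavailable"
        replica.value = value              # keep availability: accept locally
        return "ok"

cp, ap = Cluster("CP"), Cluster("AP")
cp.partitioned = ap.partitioned = True     # the network drops all cross-replica messages

print(cp.write(cp.a, 1))                   # → unavailable   (C kept, A lost)
print(ap.write(ap.a, 1), ap.write(ap.b, 2))  # → ok ok       (A kept...)
print(ap.a.value == ap.b.value)            # → False         (...C lost: replicas disagree)
```

Neither mode escapes the theorem: the partition forces every write request into one of these two outcomes.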
Dr. Stonebraker’s definition of partition tolerance is actually a measure of availability – if a write may go to either partition, will it eventually be responded to? This is a very meaningful question for systems distributed across many geographic locations, but for the LAN case it is less common to have two partitions available for writes. However, it is encompassed by the requirement for availability that we already gave – if your system is available for writes at all times, then it is certainly available for writes during a network partition.
So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition from Gilbert and Lynch: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.
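The indistinguishability of a crashed machine and a lost message can be seen from the caller's side: in both cases the reply simply never arrives before the timeout. A tiny sketch (the `rpc` helper is hypothetical, standing in for any request/response call):

```python
# From the waiting caller's perspective, "peer crashed", "message
# dropped", and "peer is just slow" all look identical: an empty
# reply channel when the timeout expires.

import queue

def rpc(reply_queue, timeout=0.1):
    """Wait for a reply; report an indeterminate failure on timeout."""
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        return "no reply: crashed? dropped? slow? cannot tell"

crashed = queue.Queue()   # peer died before replying: nothing is ever enqueued
dropped = queue.Queue()   # network lost the reply: the same empty queue

# The two failure modes produce byte-identical observations.
print(rpc(crashed) == rpc(dropped))   # → True
```

This is why a failed machine is, from everyone else's point of view, simply a partition of size one.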
This is why defining P as ‘allowing partitioned groups to remain available’ is misleading – machine failures are partitions, almost tautologously, and by definition cannot be available while they are failed. Yet Dr. Stonebraker says that he would suggest choosing CA rather than P. This feels rather like we are invited to both have our cake and eat it. Not ‘choosing’ P is analogous to building a network that will never experience multiple correlated failures. This is unreasonable for a distributed system – precisely for all the valid reasons that are laid out in the CACM post about correlated failures, OS bugs and cluster disasters – so what a designer has to do is to decide between maintaining consistency and availability. Dr. Stonebraker tells us to choose consistency, in fact, because availability will unavoidably be impacted by large failure incidents. This is a legitimate design choice, and one that the traditional RDBMS lineage of systems has explored to its fullest, but it implicitly protects us neither from availability problems stemming from smaller failure incidents, nor from the high cost of maintaining sequential consistency.
When the scale of a system increases to many hundreds or thousands of machines, writing in such a way as to allow consistency in the face of potential failures can become very expensive (you have to write to one more machine than the number of failures you are prepared to tolerate at once). This kind of nuance is not captured by the CAP theorem: consistency is often much more expensive in terms of throughput or latency to maintain than availability. Systems such as ZooKeeper are explicitly sequentially consistent because there are few enough nodes in a cluster that the cost of writing to quorum is relatively small. The Hadoop Distributed File System (HDFS) also chooses consistency – three failed datanodes can render a file’s blocks unavailable if you are unlucky. Both systems are designed to work in real networks, however, where partitions and failures will occur*, and when they do both systems will become unavailable, having made their choice between consistency and availability. That choice remains the unavoidable reality for distributed data stores.
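The replication arithmetic behind the parenthetical above can be written down directly. A sketch of the standard counting argument (the function names are mine, not from any particular system): to keep a write readable through f simultaneous failures it must reach at least f + 1 replicas, and majority-quorum systems deploy n = 2f + 1 nodes so that any two quorums intersect.

```python
# The counting behind "write to one more machine than the failures
# you tolerate": if a write reaches f+1 replicas, then even after f
# of them fail, at least one surviving replica still holds it.

def min_write_copies(f):
    """Copies a write needs so that f simultaneous failures cannot erase it."""
    return f + 1

def majority_quorum(n):
    """Replicas that must acknowledge a write in an n-node majority-quorum system."""
    return n // 2 + 1

def tolerated_failures(n):
    """Failures an n-node majority-quorum cluster can survive."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(n, "nodes:", "quorum", majority_quorum(n),
          "tolerates", tolerated_failures(n), "failures")

# E.g. a replication factor of 3 (as HDFS uses by default) survives
# two lost copies but not three - matching the "three failed
# datanodes" caveat above.
assert min_write_copies(2) == 3
```

The cost in the text follows directly: every increase in tolerated failures adds another synchronous copy to the write path.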
*For more on the inevitability of failure modes in large distributed systems, the interested reader is referred to James Hamilton’s LISA ’07 paper On Designing and Deploying Internet-Scale Services.
Daniel Abadi has written an excellent critique of the CAP theorem.
James Hamilton also responds to Dr. Stonebraker’s blog entry, agreeing (as I do) with the problems of eventual consistency but taking issue with the notion of infrequent network partitions.