Cassandra Distributed Deletes (DistributedDeletes)

As background, recall that a Cassandra cluster defines a ReplicationFactor that determines how many nodes each key and associated columns are written to. In Cassandra (as in Dynamo), the client controls how many replicas to block for on writes, which includes deletions. In particular, the client may, and typically will, specify a ConsistencyLevel of less than the cluster's ReplicationFactor, that is, the coordinating server node should report the write successful even if some replicas are down or otherwise not responsive to the write.

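The coordinator's behavior described above can be sketched as follows. This is a minimal simulation, not Cassandra's actual code: the replica list, field names, and `coordinate_write` are all illustrative, and real coordinators contact replicas in parallel rather than in a loop.

```python
# Minimal sketch (not real Cassandra code): a coordinator sends a write
# (deletes included) to all replicas, but reports success to the client
# as soon as `consistency_level` replicas acknowledge -- even if the
# remaining replicas are down.

REPLICATION_FACTOR = 3

def coordinate_write(replicas, consistency_level):
    """Return True once enough replicas ack, even if others are down."""
    acks = 0
    for replica in replicas:          # in reality these are sent in parallel
        if replica["up"]:
            replica["data"] = "new-value"
            acks += 1
        if acks >= consistency_level:
            return True               # report success to the client now
    return acks >= consistency_level  # not enough live replicas: fail

# ConsistencyLevel.ONE succeeds even with two of three replicas down:
cluster = [{"up": True, "data": "old"},
           {"up": False, "data": "old"},
           {"up": False, "data": "old"}]
print(coordinate_write(cluster, consistency_level=1))  # True
```

Note that the two down replicas still hold `"old"` after the "successful" write; that stale state is exactly what the next paragraph's eventual-consistency discussion is about.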
(Thus, the "eventual" in eventual consistency: if a client reads from a replica that did not get the update with a low enough ConsistencyLevel, it will potentially see old data. Cassandra uses HintedHandoff, ReadRepair, and AntiEntropy to reduce the inconsistency window, as well as offering higher consistency levels such as ConsistencyLevel.QUORUM, but it's still something we have to be aware of.)

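The timestamp-based reconciliation that ReadRepair relies on can be sketched like this. It is a simplified illustration, assuming last-write-wins by timestamp; the function and field names are hypothetical, not Cassandra's actual classes.

```python
# Sketch of last-write-wins reconciliation, the rule ReadRepair uses to
# close the inconsistency window: every value carries a timestamp, the
# newest one wins, and the winner is pushed back to stale replicas.

def reconcile(versions):
    """Pick the value with the highest timestamp among replica responses."""
    return max(versions, key=lambda v: v["timestamp"])

def read_repair(replicas, key):
    versions = [r[key] for r in replicas]
    newest = reconcile(versions)
    for r in replicas:                 # push the winning version back out
        r[key] = newest
    return newest["value"]

replicas = [{"k": {"value": "old", "timestamp": 1}},
            {"k": {"value": "new", "timestamp": 2}},
            {"k": {"value": "old", "timestamp": 1}}]
print(read_repair(replicas, "k"))      # "new", and all replicas now agree
```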
Thus, a delete operation can't just wipe out all traces of the data being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special value called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request.

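The key idea — a delete is just another write — can be sketched as below. This is an illustrative in-memory model, not Cassandra's storage engine; the dictionary layout and function names are assumptions for the example.

```python
# Sketch: delete writes a tombstone (a timestamped marker) instead of
# erasing the row, so a replica that missed the delete can learn about
# it later through the normal update/repair path.

import time

def delete(store, key):
    # A tombstone is just another write; last-write-wins still applies,
    # so it beats the older live value when replicas reconcile.
    store[key] = {"tombstone": True, "timestamp": time.time()}

def read(store, key):
    entry = store.get(key)
    if entry is None or entry.get("tombstone"):
        return None                   # deleted (or never existed)
    return entry["value"]

store = {"user:1": {"value": "alice", "timestamp": 100.0}}
delete(store, "user:1")
print(read(store, "user:1"))          # None: the data is hidden, not gone
print("user:1" in store)              # True: the tombstone still takes space
```

The tombstone occupying space until it can be garbage-collected is precisely the problem the next paragraphs address.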
There's one more piece to the problem: how do we know when it's safe to remove tombstones? In a fully distributed system, we can't. We could add a coordinator like ZooKeeper, but that would pollute the simplicity of the design, as well as complicating ops -- then you'd essentially have two systems to monitor, instead of one. (This is not to say ZK is bad software -- I believe it is best in class at what it does -- only that it solves a problem that we do not wish to add to our system.)

So, Cassandra does what distributed systems designers frequently do when confronted with a problem we don't know how to solve: define some additional constraints that turn it into one that we do. Here, we defined a constant, GCGraceSeconds, and had each node track tombstone age locally. Once it has aged past the constant, it can be GC'd during compaction (see MemtableSSTable). This means that if you have a node down for longer than GCGraceSeconds, you should treat it as a failed node and replace it as described in Operations. The default setting is very conservative, at 10 days; you can reduce that once you have Anti Entropy configured to your satisfaction. And of course if you are only running a single Cassandra node, you can reduce it to zero, and tombstones will be GC'd at the first major compaction.

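The compaction-time GC rule can be sketched as follows. This is a simplified model of the behavior described above, not the real compaction code; the entry format and `compact` function are hypothetical.

```python
# Sketch of tombstone GC at compaction time: each node tracks tombstone
# age locally, and a tombstone older than GCGraceSeconds is dropped
# during compaction instead of being rewritten into the new SSTable.

GC_GRACE_SECONDS = 10 * 24 * 3600    # the conservative 10-day default

def compact(sstable_entries, now):
    """Keep live data; drop tombstones older than GCGraceSeconds."""
    kept = []
    for entry in sstable_entries:
        expired = entry.get("tombstone") and \
                  now - entry["timestamp"] > GC_GRACE_SECONDS
        if not expired:
            kept.append(entry)
    return kept

now = 1_000_000_000.0
entries = [
    {"key": "a", "value": 1, "timestamp": now - 50},                     # live row
    {"key": "b", "tombstone": True, "timestamp": now - 3600},            # fresh tombstone: kept
    {"key": "c", "tombstone": True, "timestamp": now - 11 * 24 * 3600},  # past grace: dropped
]
print([e["key"] for e in compact(entries, now)])  # ['a', 'b']
```

With `GC_GRACE_SECONDS = 0` (the single-node case mentioned above), every tombstone would expire at the first compaction.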
××××××××××××××××××××××××

So when exactly does deleted data actually get removed from a Cassandra node?

When a client reads data from Cassandra, before returning the result the node checks whether the data carries a deletion marker (tombstone) and whether that marker has been in place longer than GCGraceSeconds; if so, the node deletes the data locally first and then returns.
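The read-time check described above can be sketched like this. It is a simulation of the behavior this post describes, under its own assumption that an expired tombstone is purged on read; the function name and store layout are illustrative.

```python
# Sketch of the read-time check described above: on read, a tombstone
# older than GCGraceSeconds is purged locally before responding, and in
# either case the client sees no data for the deleted key.

GC_GRACE_SECONDS = 10 * 24 * 3600

def read_with_purge(store, key, now):
    entry = store.get(key)
    if entry is None:
        return None
    if entry.get("tombstone"):
        if now - entry["timestamp"] > GC_GRACE_SECONDS:
            del store[key]            # physically remove the expired tombstone
        return None                   # either way, the client sees nothing
    return entry["value"]

now = 2_000_000_000.0
store = {"k": {"tombstone": True, "timestamp": now - 11 * 24 * 3600}}
print(read_with_purge(store, "k", now))  # None
print("k" in store)                      # False: tombstone purged on read
```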
