Cassandra Distributed Deletes (DistributedDeletes)

As background, recall that a Cassandra cluster defines a ReplicationFactor that determines how many nodes each key and associated columns are written to. In Cassandra (as in Dynamo), the client controls how many replicas to block for on writes, which includes deletions. In particular, the client may, and typically will, specify a ConsistencyLevel of less than the cluster's ReplicationFactor, that is, the coordinating server node should report the write successful even if some replicas are down or otherwise not responsive to the write.

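The coordinator's behavior described above can be sketched as follows. This is a minimal simulation, not Cassandra's actual code: the replica list, field names, and `coordinate_write` are all illustrative, and real coordinators contact replicas in parallel rather than in a loop.

```python
# Minimal sketch (not real Cassandra code): a coordinator sends a write
# (deletes included) to all replicas, but reports success to the client
# as soon as `consistency_level` replicas acknowledge -- even if the
# remaining replicas are down.

REPLICATION_FACTOR = 3

def coordinate_write(replicas, consistency_level):
    """Return True once enough replicas ack, even if others are down."""
    acks = 0
    for replica in replicas:          # in reality these are sent in parallel
        if replica["up"]:
            replica["data"] = "new-value"
            acks += 1
        if acks >= consistency_level:
            return True               # report success to the client now
    return acks >= consistency_level  # not enough live replicas: fail

# ConsistencyLevel.ONE succeeds even with two of three replicas down:
cluster = [{"up": True, "data": "old"},
           {"up": False, "data": "old"},
           {"up": False, "data": "old"}]
print(coordinate_write(cluster, consistency_level=1))  # True
```

Note that the two down replicas still hold `"old"` after the "successful" write; that stale state is exactly what the next paragraph's eventual-consistency discussion is about.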
(Thus, the "eventual" in eventual consistency: if a client reads from a replica that did not get the update with a low enough ConsistencyLevel, it will potentially see old data. Cassandra uses HintedHandoff, ReadRepair, and AntiEntropy to reduce the inconsistency window, as well as offering higher consistency levels such as ConsistencyLevel.QUORUM, but it's still something we have to be aware of.)

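The timestamp-based reconciliation that ReadRepair relies on can be sketched like this. It is a simplified illustration, assuming last-write-wins by timestamp; the function and field names are hypothetical, not Cassandra's actual classes.

```python
# Sketch of last-write-wins reconciliation, the rule ReadRepair uses to
# close the inconsistency window: every value carries a timestamp, the
# newest one wins, and the winner is pushed back to stale replicas.

def reconcile(versions):
    """Pick the value with the highest timestamp among replica responses."""
    return max(versions, key=lambda v: v["timestamp"])

def read_repair(replicas, key):
    versions = [r[key] for r in replicas]
    newest = reconcile(versions)
    for r in replicas:                 # push the winning version back out
        r[key] = newest
    return newest["value"]

replicas = [{"k": {"value": "old", "timestamp": 1}},
            {"k": {"value": "new", "timestamp": 2}},
            {"k": {"value": "old", "timestamp": 1}}]
print(read_repair(replicas, "k"))      # "new", and all replicas now agree
```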
Thus, a delete operation can't just wipe out all traces of the data being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special value called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request.

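The key idea — a delete is just another write — can be sketched as below. This is an illustrative in-memory model, not Cassandra's storage engine; the dictionary layout and function names are assumptions for the example.

```python
# Sketch: delete writes a tombstone (a timestamped marker) instead of
# erasing the row, so a replica that missed the delete can learn about
# it later through the normal update/repair path.

import time

def delete(store, key):
    # A tombstone is just another write; last-write-wins still applies,
    # so it beats the older live value when replicas reconcile.
    store[key] = {"tombstone": True, "timestamp": time.time()}

def read(store, key):
    entry = store.get(key)
    if entry is None or entry.get("tombstone"):
        return None                   # deleted (or never existed)
    return entry["value"]

store = {"user:1": {"value": "alice", "timestamp": 100.0}}
delete(store, "user:1")
print(read(store, "user:1"))          # None: the data is hidden, not gone
print("user:1" in store)              # True: the tombstone still takes space
```

The tombstone occupying space until it can be garbage-collected is precisely the problem the next paragraphs address.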
There's one more piece to the problem: how do we know when it's safe to remove tombstones? In a fully distributed system, we can't. We could add a coordinator like ZooKeeper, but that would pollute the simplicity of the design, as well as complicating ops -- then you'd essentially have two systems to monitor, instead of one. (This is not to say ZK is bad software -- I believe it is best in class at what it does -- only that it solves a problem that we do not wish to add to our system.)

So, Cassandra does what distributed systems designers frequently do when confronted with a problem we don't know how to solve: define some additional constraints that turn it into one that we do. Here, we defined a constant, GCGraceSeconds, and had each node track tombstone age locally. Once it has aged past the constant, it can be GC'd during compaction (see MemtableSSTable). This means that if you have a node down for longer than GCGraceSeconds, you should treat it as a failed node and replace it as described in Operations. The default setting is very conservative, at 10 days; you can reduce that once you have Anti Entropy configured to your satisfaction. And of course if you are only running a single Cassandra node, you can reduce it to zero, and tombstones will be GC'd at the first major compaction.

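The compaction-time GC rule can be sketched as follows. This is a simplified model of the behavior described above, not the real compaction code; the entry format and `compact` function are hypothetical.

```python
# Sketch of tombstone GC at compaction time: each node tracks tombstone
# age locally, and a tombstone older than GCGraceSeconds is dropped
# during compaction instead of being rewritten into the new SSTable.

GC_GRACE_SECONDS = 10 * 24 * 3600    # the conservative 10-day default

def compact(sstable_entries, now):
    """Keep live data; drop tombstones older than GCGraceSeconds."""
    kept = []
    for entry in sstable_entries:
        expired = entry.get("tombstone") and \
                  now - entry["timestamp"] > GC_GRACE_SECONDS
        if not expired:
            kept.append(entry)
    return kept

now = 1_000_000_000.0
entries = [
    {"key": "a", "value": 1, "timestamp": now - 50},                     # live row
    {"key": "b", "tombstone": True, "timestamp": now - 3600},            # fresh tombstone: kept
    {"key": "c", "tombstone": True, "timestamp": now - 11 * 24 * 3600},  # past grace: dropped
]
print([e["key"] for e in compact(entries, now)])  # ['a', 'b']
```

With `GC_GRACE_SECONDS = 0` (the single-node case mentioned above), every tombstone would expire at the first compaction.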
××××××××××××××××××××××××

So when exactly does deleted data actually get removed from a Cassandra node?

When a client reads data from Cassandra, before returning the result the node checks whether the data carries a deletion marker (tombstone) and whether that marker has been in place longer than GCGraceSeconds; if so, the node deletes the data locally first and then returns.
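The read-time check described above can be sketched like this. It is a simulation of the behavior this post describes, under its own assumption that an expired tombstone is purged on read; the function name and store layout are illustrative.

```python
# Sketch of the read-time check described above: on read, a tombstone
# older than GCGraceSeconds is purged locally before responding, and in
# either case the client sees no data for the deleted key.

GC_GRACE_SECONDS = 10 * 24 * 3600

def read_with_purge(store, key, now):
    entry = store.get(key)
    if entry is None:
        return None
    if entry.get("tombstone"):
        if now - entry["timestamp"] > GC_GRACE_SECONDS:
            del store[key]            # physically remove the expired tombstone
        return None                   # either way, the client sees nothing
    return entry["value"]

now = 2_000_000_000.0
store = {"k": {"tombstone": True, "timestamp": now - 11 * 24 * 3600}}
print(read_with_purge(store, "k", now))  # None
print("k" in store)                      # False: tombstone purged on read
```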
