Secondary Indexes in Cassandra

Latest recommended article published on 2023-11-07 10:30:00

张兆坤的那些事


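How to build a secondary index on the columns of a row is a common question in Cassandra. This post describes one way to do it, though it is certainly not the only way. Experienced Cassandra users may find it interesting because the approach described here does not use Super Columns at all, and so avoids the complexity and constraints that come with them. It should also be noted that Cassandra 0.7 and later implement native secondary indexes, which make much of what follows simpler. Even so, the approach remains a useful way of thinking about secondary indexes in Cassandra and still applies in many scenarios.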

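First, assume the following scenario. There is a Container (for example, a department) that holds many Items (for example, the users in that department). Each user has an arbitrary set of properties, and Items can be searched by property value within the context of the Container. Items can also be members of other Containers, but that case is not considered here.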

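One way to model this in Cassandra is with two ColumnFamilies (abbreviated CF below). The first CF, named Item_Properties, describes the properties of an Item. It uses Cassandra's simplest data model: each row in Item_Properties is located by a key, which in this example is a UUID. The column names are the Item's property names, and the column values are the corresponding property values.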

CF: Item_Properties
Key: item_id
Compare with: BytesType
Name	Value
property_name	property_value
...	...

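The second CF is for the Container and holds its set of Items. It is named Container_Items, and its columns are the row keys of Item_Properties. This is one of the harder things to grasp in Cassandra: although a Column Family can be treated like a simple relational table, with each row as a record, each row can itself serve as a simple table, or even as a join table. In Container_Items, each column name is a row key from Item_Properties, and the column value is the timestamp at insertion time. Rows in Container_Items can grow quite large. Each column is roughly 42 bytes (UUID + timestamp), so in Cassandra versions before 0.7 a row tops out at about 40 million Items. For the users in a department that may be a reasonable limit, but it would be unacceptable if you used this approach to store status updates such as Tweets, which would certainly exceed it. In Cassandra 0.7 and later this limit no longer applies, and a row can hold up to 2 billion columns.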

CF: Container_Items
Key: container_id
Compare with: TimeUUIDType
Name	Value
item_id	insertion_timestamp
...	...
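
As a rough illustration of the two column families above, the sketch below models each CF as a Python dictionary that maps a row key to a dictionary of columns. The names `item_properties`, `container_items`, and `add_item` are hypothetical stand-ins for illustration only; nothing here talks to a real Cassandra cluster.

```python
import time
import uuid

# Hypothetical in-memory stand-ins for the two column families:
# each maps row key -> {column name -> column value}.
item_properties = {}   # item_id -> {property_name: property_value}
container_items = {}   # container_id -> {item_id: insertion_timestamp}


def add_item(container_id, properties):
    """Create an item, store its properties, and register it in the container."""
    item_id = uuid.uuid1()  # time-based UUID, in the spirit of TimeUUIDType
    item_properties[item_id] = dict(properties)
    now_micros = int(time.time() * 1_000_000)
    container_items.setdefault(container_id, {})[item_id] = now_micros
    return item_id


department = uuid.uuid4()
alice = add_item(department, {"name": "alice", "city": "Austin"})
bob = add_item(department, {"name": "bob", "city": "Boston"})

# Listing a container is one row read; each item is then a lookup by key.
members = [item_properties[i]["name"] for i in container_items[department]]

# Rough size of one Container_Items row at the figures quoted above.
approx_row_bytes = 42 * 40_000_000
```

At roughly 42 bytes per column and 40 million Items, a single Container_Items row comes to about 1.68 billion bytes, which gives a sense of why a per-row cap matters for this model.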

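So far, this is fairly basic Cassandra data modeling. Things get more complicated when you want to find the Items in a Container that have a given property value. To do that you have to manage your own index, which goes well beyond Cassandra's minimalist design. Two more ColumnFamilies are needed. The first stores the actual index and uses the Container ID plus the name of the indexed Item_Properties property as its row key. Its structure is as follows: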

CF: Container_Items_Property_Index
Key: container_id + property_name
Compare with: compositecomparer.CompositeType
Name	Value
composite(property_value, item_id, entry_timestamp)	item_id
...	...

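What distinguishes the indexing technique described here from others is how each column in the index is constructed. Cassandra provides a set of column types that determine how the columns within a row are sorted. You specify a sort type when the CF is created, and Cassandra also allows custom column types, such as the composite type used above. A composite column lets us combine several components into a single column name and still sort by it. This makes it possible to build unique columns even when the underlying values are not unique, by appending extra components that tell them apart.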
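
The effect of a composite column name can be sketched with Python tuples, which compare component by component. This is only an analogy for the ordering idea, assuming the components are compared left to right; it is not the comparator from the composite type implementation, and the item IDs and timestamps are made up for illustration.

```python
import uuid

# Composite column names modelled as tuples:
# (property_value, item_id, entry_timestamp).
item_a = uuid.UUID(int=1)  # hypothetical item IDs
item_b = uuid.UUID(int=2)

index_row = {
    ("Boston", item_b, 1002): item_b,
    ("Austin", item_a, 1001): item_a,
    # Same property value as item_a, yet still a distinct column name
    # because the item_id and timestamp components differ.
    ("Austin", item_b, 1000): item_b,
}

# Sorting the names groups equal property values next to each other.
sorted_columns = sorted(index_row)
values_in_order = [name[0] for name in sorted_columns]
```

Because columns with the same property value end up adjacent, all Items sharing a value can be read as one contiguous slice of the row.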

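The last problem to handle is what happens when a property value changes and the index must be updated. The answer sounds simple: insert the new value as a column in the Container_Items_Property_Index column family and delete the column for the old value. However, because of Cassandra's eventual consistency model and its lack of transactions, simply reading the previous value from Item_Properties, updating it, and then deleting the old value's entry from Container_Items_Property_Index will not work reliably. To do this we maintain a list of previous values for the specific property of a given item and use that to remove these values from the index before adding the new value to the index. These are stored in the following CF: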

CF: Container_Item_Property_Index_Entries
Key: container_id + item_id + property_name
Compare with: LongType
Name	Value
entry_timestamp	property_value
...	...

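These columns are deleted once they have been retrieved, so the rows never grow very large: in most cases no more than one or two columns, somewhat more if the property is modified frequently. It's a really good idea to make sure you understand why this CF is necessary, because you can use variations of it to solve a lot of problems with "eventual consistency" datastores.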

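So, overall, there are two basic operations: (1) setting a property value on an Item in a Container, and (2) retrieving the list of Items in a Container that match a particular value. They look like this: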

To set a property (property_name) of an Item in a Container to a value (property_value):
1. Take the current time as the timestamp for the new entry;
2. Call get_slice on Container_Item_Property_Index_Entries with the key container_id + item_id + property_name to fetch the existing columns;
3. Call batch_mutate to perform the following in one batch:
   delete from Container_Items_Property_Index the columns that correspond to the entries found in step 2;
   delete from Container_Item_Property_Index_Entries the columns found in step 2;
   insert a column into Item_Properties (column name property_name, value property_value);
   insert the new index column into Container_Items_Property_Index;
   insert the new value into Container_Item_Property_Index_Entries, for use by later updates;
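
The steps above can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not a Cassandra client: the three dictionaries stand in for the column families, a simple counter replaces a real timestamp source, and the function `set_property` and the sample keys are hypothetical. A real implementation would issue the mutations in step 3 as a single batch.

```python
import itertools

# Hypothetical in-memory stand-ins: row key -> {column name -> column value}.
item_properties = {}   # Item_Properties
property_index = {}    # Container_Items_Property_Index
index_entries = {}     # Container_Item_Property_Index_Entries

_clock = itertools.count(1)  # stand-in for a real timestamp source


def set_property(container_id, item_id, name, value):
    # Step 1: timestamp for the new index entry.
    ts = next(_clock)

    # Step 2: read the previously indexed values for this item and property.
    entries_key = (container_id, item_id, name)
    previous = dict(index_entries.get(entries_key, {}))

    # Step 3: apply all mutations together.
    index_row = property_index.setdefault((container_id, name), {})
    entries_row = index_entries.setdefault(entries_key, {})
    for old_ts, old_value in previous.items():
        index_row.pop((old_value, item_id, old_ts), None)  # stale index column
        entries_row.pop(old_ts, None)                      # consumed entry
    item_properties.setdefault(item_id, {})[name] = value
    index_row[(value, item_id, ts)] = item_id
    entries_row[ts] = value


set_property("dept-1", "user-1", "city", "Austin")
set_property("dept-1", "user-1", "city", "Boston")
```

After the second call, the index row for ("dept-1", "city") holds only the "Boston" column, because the "Austin" column was located through the entries row rather than by re-reading Item_Properties.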

To query the Items in a Container by property_value:
1. Call get_slice on the Container_Items_Property_Index column family with the key container_id + property_name to find the columns that match property_value.
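
The lookup can be sketched as a range scan over sorted composite column names. The example below assumes the index row is available as a sorted list of (property_value, item_id, entry_timestamp) tuples; the function `items_with_value` and the sample data are hypothetical, and `bisect` only imitates the effect of a slice that starts at the requested value.

```python
import bisect

# Hypothetical index row: composite column names kept in sorted order,
# as a comparator-ordered row would be.
index_row = sorted([
    ("Austin", "user-1", 1001),
    ("Austin", "user-4", 1007),
    ("Boston", "user-2", 1003),
    ("Chicago", "user-3", 1005),
])


def items_with_value(columns, value):
    """Return the item_ids whose composite column starts with `value`."""
    # A one-element tuple sorts before any longer tuple with the same prefix,
    # so this finds the first column for `value`.
    start = bisect.bisect_left(columns, (value,))
    matches = []
    for prop_value, item_id, _ts in columns[start:]:
        if prop_value != value:
            break  # past the end of the slice for this value
        matches.append(item_id)
    return matches


austin_users = items_with_value(index_row, "Austin")
```

The scan stops at the first column whose value differs, so the cost depends on the number of matches rather than on the size of the whole row.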

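It looks like a lot of steps, but in practice they are all wrapped by middleware and invisible from the outside. You can find an implementation of the composite column comparator in CassandraCompositeType on GitHub, and a simple implementation of the indexing technique described above in CassandraIndexedCollections on GitHub.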

Update: Mike Malone points out that, since Cassandra already stores a timestamp along with the column value, it's redundant to store it in the column value as well, so it can be omitted in the Container_Items and Container_Item_Property_Index_Entries column families, which would reduce storage space by about 20%.

The translation is rough, and I'm not happy with it myself, but I still want to keep at it!

For reference, the original article is at: http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html