Part 3: About Data Partitioning in Cassandra 1.0.x

名剑传奇

About Data Partitioning in Cassandra

Original

When you start a Cassandra cluster, you must choose how the data will be divided across the nodes in the cluster. This is done by choosing a partitioner for the cluster.

Translation

When you start a Cassandra cluster, you must choose how its data is distributed among the nodes. That distribution is determined by selecting a partitioner for the cluster.

Original

In Cassandra, the total data managed by the cluster is represented as a circular space or ring. The ring is divided up into ranges equal to the number of nodes, with each node being responsible for one or more ranges of the overall data. Before a node can join the ring, it must be assigned a token. The token determines the node's position on the ring and the range of data it is responsible for.

Translation

In Cassandra, all the data managed by the cluster is represented as a ring. The ring is divided into as many ranges as there are nodes, and each node is responsible for one or more ranges of the overall data. Before a node can join the ring it must be assigned a token, which determines the node's position on the ring and the range of data it holds.

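Original

Column family data is partitioned across the nodes based on the row key. To determine the node where the first replica of a row will live, the ring is walked clockwise until it locates the node with a token value greater than that of the row key. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). With the nodes sorted in token order, the last node is considered the predecessor of the first node; hence the ring representation.

Translation

Column family data is partitioned across the nodes by row key. To find the node that holds the first replica of a row, the ring is walked clockwise until a node is found whose token value is greater than that of the row key. Each node is responsible for the region of the ring between its predecessor (exclusive) and itself (inclusive). Nodes are sorted by token, and the last node is treated as the predecessor of the first, which is why the structure forms a ring.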

Original

For example, consider a simple 4 node cluster where all of the row keys managed by the cluster were numbers in the range of 0 to 100. Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75. The first node, the one with token 0, is responsible for the wrapping range (75-0). The node with the lowest token also accepts row keys less than the lowest token and more than the highest token.

Translation

For example, consider a 4-node cluster in which every row key managed by the cluster is a number from 0 to 100. Each node is assigned a token representing a point in that range; in this simple example the token values are 0, 25, 50, and 75. The first node, with token 0, is responsible for the wrapping range (75-0). The node with the lowest token also accepts row keys below the lowest token and above the highest token.
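
The ring walk in this example can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Cassandra's implementation: the function name `owner_token` is made up for this sketch, and it follows the "predecessor (exclusive), itself (inclusive)" wording, so a row key equal to a token belongs to that token's node.

```python
from bisect import bisect_left

# Sorted node tokens from the 4-node example above.
TOKENS = [0, 25, 50, 75]

def owner_token(row_key, tokens=TOKENS):
    """Return the token of the node that holds the first replica of row_key.

    Hypothetical helper for illustration only.
    """
    # Find the first token >= row_key: each node owns (predecessor, itself].
    i = bisect_left(tokens, row_key)
    # If row_key is above the highest token, wrap around to the lowest token.
    return tokens[i % len(tokens)]

for key in (0, 10, 25, 60, 80, 100):
    print(key, "->", owner_token(key))
```

Keys 80 and 100 lie above the highest token, so they wrap around to the node with token 0, which matches the wrapping range (75-0) in the text.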

About Partitioning in Multi-Data Center Clusters

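Original

In multi-data center deployments, replica placement is calculated per data center when using the NetworkTopologyStrategy replica placement strategy. In each data center (or replication group) the first replica for a particular row is determined by the token value assigned to a node. Additional replicas in the same data center are placed by walking the ring clockwise until it reaches the first node in another rack.

Translation

In multi-data center deployments, replica placement is calculated per data center when the NetworkTopologyStrategy replica placement strategy is used. In each data center (or replication group), the first replica of a row is determined by the token value assigned to a node. Additional replicas in the same data center are placed by walking the ring clockwise until the first node in another rack is reached.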

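Original

If you do not calculate partitioner tokens so that the data ranges are evenly distributed for each data center, you could end up with uneven data distribution within a data center.

Translation

If you do not calculate the partitioner tokens so that the data ranges are evenly divided within each data center, the data inside a data center may end up unevenly distributed.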

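Original

The goal is to ensure that the nodes for each data center have token assignments that evenly divide the overall range. Otherwise, you could end up with nodes in each data center that own a disproportionate number of row keys. Each data center should be partitioned as if it were its own distinct ring, however token assignments within the entire cluster cannot conflict with each other (each node must have a unique token). See Calculating Tokens for a Multi-Data Center Cluster for strategies on how to generate tokens for multi-data center clusters.

Translation

The goal is to make sure the nodes in each data center receive tokens that divide the overall range evenly. Otherwise, nodes in each data center may end up owning a disproportionate number of row keys. Each data center should be partitioned as though it were a separate ring, but token assignments across the whole cluster must not conflict (every node needs a unique token). See Calculating Tokens for a Multi-Data Center Cluster for ways to generate tokens for multi-data center clusters.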
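
One way to picture "each data center as its own ring, but with unique tokens" is the sketch below. It assumes the RandomPartitioner hash range of 0 to 2**127 and resolves conflicts by shifting each data center's tokens by a small per-data-center offset. That offset scheme, the data center names, and the function names are assumptions made for this illustration; the referenced DataStax page is the place to check for the recommended strategies.

```python
RING_SIZE = 2**127  # assumed RandomPartitioner hash range (0 to 2**127)

def dc_tokens(node_count, offset=0):
    """Evenly spaced tokens for one data center, shifted by a small offset."""
    return [i * RING_SIZE // node_count + offset for i in range(node_count)]

def multi_dc_tokens(nodes_per_dc):
    """Build a token plan for several data centers.

    nodes_per_dc maps a (hypothetical) data center name to its node count.
    Each data center is divided as if it were its own ring; the offset
    (0, 1, 2, ...) keeps tokens unique across the whole cluster.
    """
    plan = {}
    for offset, (dc, count) in enumerate(nodes_per_dc.items()):
        plan[dc] = dc_tokens(count, offset)
    return plan

plan = multi_dc_tokens({"DC1": 3, "DC2": 3})
for dc, tokens in plan.items():
    print(dc, tokens)
```

Within each data center the ranges stay (almost exactly) equal in size, while no two nodes in the cluster share a token.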

Understanding the Partitioner Types

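Original

Unlike almost every other configuration choice in Cassandra, the partitioner may not be changed without reloading all of your data. It is important to choose and configure the correct partitioner before initializing your cluster.

Cassandra offers a number of partitioners out-of-the-box, but the random partitioner is the best choice for most Cassandra deployments.

Translation

Unlike almost every other configuration choice in Cassandra, the partitioner cannot be changed once the cluster is in use without reloading all of your data. Choosing and configuring the right partitioner before the cluster is initialized is therefore important.

Cassandra ships with several partitioners out of the box, but the random partitioner is the best choice for most Cassandra deployments.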

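Original

About the Random Partitioner

The RandomPartitioner is the default partitioning strategy for a Cassandra cluster, and in almost all cases is the right choice.

Translation

The Random Partitioner

The random partitioner is the default partitioning strategy for a cluster, and in almost all cases it is the right choice.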

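Original

Random partitioning uses consistent hashing to determine which node will store a particular row. Unlike naive modulus-by-node-count, consistent hashing ensures that when nodes are added to the cluster, the minimum possible set of data is affected.

Translation

Random partitioning uses consistent hashing to decide which node stores a particular row. Unlike the naive approach of taking the hash modulo the node count, consistent hashing ensures that when nodes are added to the cluster, as little data as possible is affected.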
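
The difference between the two schemes can be made concrete with a small experiment. The sketch below is an illustration under stated assumptions, not Cassandra code: the hash function `h`, the key names, and the choice of where the new node's token goes are all made up for the example. It counts how many of 2000 keys change owner when a cluster grows from 4 to 5 nodes.

```python
import hashlib
from bisect import bisect_left

RING = 2**127  # assumed size of the hash ring

def h(key):
    # Illustrative hash: MD5 digest reduced into the ring range.
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % RING

def ring_owner(key, tokens):
    # First token >= hash, wrapping to the lowest token past the end.
    i = bisect_left(tokens, h(key))
    return tokens[i % len(tokens)]

keys = [f"user{i}" for i in range(2000)]

# Naive scheme: node index = hash mod node count; grow from 4 to 5 nodes.
moved_mod = sum(h(k) % 4 != h(k) % 5 for k in keys)

# Consistent hashing: 4 evenly spaced tokens, then one new token is added
# in the middle of an existing range.
old = [i * RING // 4 for i in range(4)]
new = sorted(old + [RING // 8])
moved_ring = sum(ring_owner(k, old) != ring_owner(k, new) for k in keys)

print("moved with modulus:", moved_mod)
print("moved with consistent hashing:", moved_ring)
```

With the modulus scheme most keys are reassigned, because the divisor changes for every key. On the ring, only the keys that fall into the range taken over by the new node move.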

Original

To distribute the data evenly across the number of nodes, a hashing algorithm creates an MD5 hash value of the row key. The possible range of hash values is from 0 to 2**127. Each node in the cluster is assigned a token that represents a hash value within this range. A node then owns the rows with a hash value less than its token number. For single data center deployments, tokens are calculated by dividing the hash range by the number of nodes in the cluster. For multi data center deployments, tokens are calculated per data center (the hash range should be evenly divided for the nodes in each replication group).

Translation

To spread the data evenly across the nodes, a hashing algorithm creates an MD5 hash value of the row key. The possible range of hash values is 0 to 2**127. Each node in the cluster is assigned a token that represents a hash value within this range, and a node owns the rows whose hash value is less than its token. For single data center deployments, tokens are calculated by dividing the hash range by the number of nodes in the cluster. For multi-data center deployments, tokens are calculated per data center (the hash range should be divided evenly among the nodes in each replication group).
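
The single data center calculation described above amounts to a one-line formula: token i of N is i * 2**127 / N. A minimal sketch follows. The function names are hypothetical, and `row_token` rests on an assumption about how the 128-bit MD5 digest is folded into the 0 to 2**127 range (read as a signed integer, then take the absolute value); check the Cassandra source before relying on that detail.

```python
import hashlib

RING_SIZE = 2**127  # hash range stated in the text: 0 to 2**127

def initial_tokens(node_count):
    """Divide the hash range evenly among node_count nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]

def row_token(row_key: bytes) -> int:
    """Map a row key to a position on the ring.

    Assumption: the MD5 digest is read as a signed 128-bit integer and its
    absolute value is used, which keeps the result within 0..2**127.
    """
    digest = hashlib.md5(row_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

for i, token in enumerate(initial_tokens(4)):
    print(f"node {i}: {token}")

print("token for b'alice':", row_token(b"alice"))
```

For four nodes this yields 0, 2**125, 2**126, and 3 * 2**125, so each node covers one quarter of the hash range.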

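Original

The primary benefit of this approach is that once your tokens are set appropriately, data from all of your column families is evenly distributed across the cluster with no further effort. For example, one column family could be using user names as the row key and another column family timestamps, but the row keys from each individual column family are still spread evenly. This also means that read and write requests to the cluster will also be evenly distributed.

Translation

The main benefit of this approach is that once the tokens are set appropriately, data from all of your column families is spread evenly across the cluster with no further effort. For example, one column family might use user names as row keys and another might use timestamps, yet the row keys of each column family are still evenly distributed. It also means that read and write requests to the cluster are evenly distributed.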

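Original

Another benefit of using random partitioning is the simplification of load balancing a cluster. Because each part of the hash range will receive an equal number of rows on average, it is easier to correctly assign tokens to new nodes.

Translation

Another benefit of random partitioning is that it simplifies load balancing the cluster. Because each part of the hash range receives an equal number of rows on average, it is easier to assign correct tokens to new nodes.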

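Original

About Ordered Partitioners

Using an ordered partitioner ensures that row keys are stored in sorted order. Unless absolutely required by your application, DataStax strongly recommends choosing the random partitioner over an ordered partitioner.

Translation

Ordered Partitioners

Using an ordered partitioner ensures that row keys are stored in sorted order. Unless your application absolutely requires it, DataStax strongly recommends choosing the random partitioner instead.

Translator's note: the remaining material on ordered partitioners is not translated here. Readers can read it on their own or skip it for now.

Untranslated original text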

Using an ordered partitioner allows range scans over rows, meaning you can scan rows as though you were moving a cursor through a traditional index. For example, if your application has user names as the row key, you can scan rows for users whose names fall between Jake and Joe. This type of query would not be possible with randomly partitioned row keys, since the keys are stored in the order of their MD5 hash (not sequentially).

Although having the ability to do range scans on rows sounds like a desirable feature of ordered partitioners, there are ways to achieve the same functionality using column family indexes. Most applications can be designed with a data model that supports ordered queries as slices over a set of columns rather than range scans over a set of rows.
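
The Jake-to-Joe range scan mentioned above can be illustrated with plain Python lists. This is a conceptual sketch only: the names are made-up sample data and no Cassandra API is involved. It shows why keys kept in sorted order form one contiguous slice for a range query, while the same keys ordered by MD5 hash do not.

```python
import hashlib
from bisect import bisect_left, bisect_right

# Made-up sample row keys.
names = ["Jack", "Jake", "James", "Jane", "Jim", "Joe", "John", "Zoe"]

# Ordered partitioner: keys are kept in sorted order, so a range scan is
# a single contiguous slice between the two bounds.
ordered = sorted(names)
lo = bisect_left(ordered, "Jake")
hi = bisect_right(ordered, "Joe")
in_range = ordered[lo:hi]
print("range scan Jake..Joe:", in_range)

# Random partitioner: keys are effectively ordered by their MD5 hash, so
# neighbouring names end up scattered and no contiguous slice exists.
by_hash = sorted(names, key=lambda n: hashlib.md5(n.encode()).hexdigest())
print("order by MD5 hash:", by_hash)
```
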

Using an ordered partitioner is not recommended for the following reasons:

Sequential writes can cause hot spots. If your application tends to write or update a sequential block of rows at a time, then the writes will not be distributed across the cluster; they will all go to one node. This is frequently a problem for applications dealing with timestamped data.
More administrative overhead to load balance the cluster. An ordered partitioner requires administrators to manually calculate token ranges based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Uneven load balancing for multiple column families. If your application has multiple column families, chances are that those column families have different row keys and different distributions of data. An ordered partitioner that is balanced for one column family may cause hot spots and uneven distribution for another column family in the same cluster.

There are three choices of built-in ordered partitioners that come with Cassandra. Note that the OrderPreservingPartitioner and CollatingOrderPreservingPartitioner are deprecated as of Cassandra 0.7 in favor of the ByteOrderedPartitioner:

ByteOrderedPartitioner - Row keys are stored in order of their raw bytes rather than converting them to encoded strings. Tokens are calculated by looking at the actual values of your row key data and using a hexadecimal representation of the leading character(s) in a key. For example, if you wanted to partition rows alphabetically, you could assign an A token using its hexadecimal representation of 41.
OrderPreservingPartitioner - Row keys are stored in order based on the UTF-8 encoded value of the row keys. Requires row keys to be UTF-8 encoded strings.
CollatingOrderPreservingPartitioner - Row keys are stored in order based on the United States English locale (EN_US). Also requires row keys to be UTF-8 encoded strings.
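
The ByteOrderedPartitioner example above (an A token written as hexadecimal 41) can be reproduced with a short sketch. The helper name `byte_ordered_token` and the choice of the letters A, G, N, and T as split points for a 4-node alphabetical layout are assumptions for illustration; real tokens should be based on the actual distribution of your row keys.

```python
def byte_ordered_token(prefix: str) -> str:
    """Hex representation of the leading character(s) of a row key.

    Hypothetical helper: encodes the prefix as UTF-8 and returns its bytes
    as a lowercase hexadecimal string.
    """
    return prefix.encode("utf-8").hex()

print(byte_ordered_token("A"))

# Assumed split points for partitioning rows alphabetically over 4 nodes.
for letter in ("A", "G", "N", "T"):
    print(letter, "->", byte_ordered_token(letter))
```
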