1 Key-prefix Accounting and Zones
Arbitrarily fine-grained accounting is specified via key prefixes. Key prefixes can overlap, as is necessary for capturing hierarchical relationships. For illustrative purposes, let's say keys specifying rows in a set of databases have the following format:
<db>:<table>:<primary-key>[:<secondary-key>]
In this case, we might collect accounting with the following key prefixes:
db1, db1:user, db1:order,
Accounting is kept for the entire map by default.
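To make the overlap between prefixes concrete, here is a minimal sketch of how a single row key can fall under several accounting prefixes at once. The helper name and the use of an empty prefix to stand for the entire map are assumptions made for illustration; the source only gives the key format and the example prefixes.

```python
# Illustrative only: the empty prefix "" is an assumed stand-in for
# "the entire map", which is accounted for by default.
ACCOUNTING_PREFIXES = ["", "db1", "db1:user", "db1:order"]

def matching_prefixes(key, prefixes=ACCOUNTING_PREFIXES):
    """Return every accounting prefix that covers `key`, shortest first."""
    return sorted((p for p in prefixes if key.startswith(p)), key=len)

# A row in db1's user table is counted at three levels of the hierarchy.
print(matching_prefixes("db1:user:42"))
# A row in another database is only covered by the whole-map accounting.
print(matching_prefixes("db2:order:7"))
```

Note that this is plain prefix matching on the raw key, so a prefix such as db1 would also cover a hypothetical database named db10 unless the prefix includes the trailing separator.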
1.1 Accounting
To keep accounting for a range defined by a key prefix, an entry is created in the accounting system table. The format of accounting table keys is:
\0acct<key-prefix>
In practice, we assume each node is capable of caching the entire accounting table, as it is likely to be relatively small.
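As a rough sketch of the key format above, the following builds accounting table keys and keeps them in a node-local cache. It assumes that \0 denotes a single NUL byte and that the cache can be modeled as an in-memory set; both are illustrative choices, and the function names are hypothetical.

```python
# Assumption: "\0acct" is a NUL byte followed by the ASCII text "acct".
ACCT_TABLE_PREFIX = b"\x00acct"

def acct_table_key(key_prefix: bytes) -> bytes:
    """Build the accounting table key for a key prefix."""
    return ACCT_TABLE_PREFIX + key_prefix

def prefix_from_table_key(table_key: bytes) -> bytes:
    """Recover the key prefix from an accounting table key."""
    if not table_key.startswith(ACCT_TABLE_PREFIX):
        raise ValueError("not an accounting table key")
    return table_key[len(ACCT_TABLE_PREFIX):]

# Hypothetical node-local cache holding the whole (small) accounting table.
cache = {acct_table_key(p) for p in (b"db1", b"db1:user", b"db1:order")}

print(acct_table_key(b"db1:user"))
print(acct_table_key(b"db1:user") in cache)
```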
Accounting is kept for key prefix ranges with eventual consistency, for efficiency. Two types of values comprise accounting: counts and occurrences, for lack of better terms. Counts describe system state, such as the total number of bytes, rows, etc. Occurrences include transient performance and load metrics. Both types of accounting are captured as time series with minute granularity. The length of time accounting metrics are kept is configurable. Below are examples of each type of accounting value.
System State Counters/Performance
- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values < 16B
- Count of values < 64B
- Count of values < 256B
- Count of values < 1K
- Count of values < 4K
- Count of values < 16K
- Count of values < 64K
- Count of values < 256K
- Count of values < 1M
- Count of values > 1M
- Total bytes of accounting
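The value-size counters in the list above form a histogram. A minimal sketch of how a value length could be assigned to one of them is shown below. It assumes that K and M mean 1024 and 1024*1024 bytes, that each value is counted only in the smallest bucket it fits, and that a value of exactly 1M lands in the last bucket; the source does not settle any of these points.

```python
import bisect

# Upper bounds (exclusive) of the size buckets, in bytes. K = 1024 is assumed.
LIMITS = [16, 64, 256, 1 << 10, 4 << 10, 16 << 10, 64 << 10, 256 << 10, 1 << 20]
LABELS = ["< 16B", "< 64B", "< 256B", "< 1K", "< 4K",
          "< 16K", "< 64K", "< 256K", "< 1M", "> 1M"]

def size_bucket(value_len: int) -> str:
    """Return the label of the smallest bucket that holds `value_len`."""
    if value_len < 0:
        raise ValueError("value length must be non-negative")
    # bisect_right skips bounds equal to value_len, so 16 falls into "< 64B".
    return LABELS[bisect.bisect_right(LIMITS, value_len)]

print(size_bucket(15))
print(size_bucket(16))
print(size_bucket(5000))
```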
Load Occurrences
- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count
Because accounting information is kept as time series and over many possible metrics of interest, the data can become numerous. Accounting data are stored in the map near the key prefix described, in order to distribute load (for both aggregation and storage).
Accounting keys for system state have the form: <key-prefix>|acctd<metric-name>*. Notice the leading 'pipe' character. It is meant to sort the root-level account AFTER any other system tables. These entries must increment the same underlying values, since they are permanent counts and not transient activity. Logic at the node takes care of snapshotting the value into an appropriately suffixed (e.g. with timestamp hour) multi-value time series entry.
Keys for perf/load metrics: <key-prefix>acctd<metric-name><hourly-timestamp>.
<hourly-timestamp>-suffixed accounting entries are multi-valued, containing a varint64 entry for each minute with activity during the specified hour.
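The sketch below illustrates the perf/load key layout and the per-minute, multi-valued entry. Several details are assumptions, because the source does not specify them: the hourly timestamp is rendered as YYYYMMDDHH, the metric name get_op_count is hypothetical, and the varint is encoded as an unsigned base-128 varint in the style of Protocol Buffers.

```python
from datetime import datetime, timezone

def perf_key(key_prefix: str, metric: str, ts: datetime) -> str:
    """Build <key-prefix>acctd<metric-name><hourly-timestamp>."""
    hour = ts.strftime("%Y%m%d%H")  # assumed encoding of the hourly timestamp
    return f"{key_prefix}acctd{metric}{hour}"

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned base-128 varint."""
    if n < 0:
        raise ValueError("only non-negative values are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)

class HourlyEntry:
    """Multi-valued entry: one counter per minute that saw activity."""
    def __init__(self):
        self.per_minute = {}

    def record(self, ts: datetime, amount: int = 1) -> None:
        self.per_minute[ts.minute] = self.per_minute.get(ts.minute, 0) + amount

    def encoded(self):
        # Minutes without activity are simply absent from the entry.
        return [(m, encode_varint(v)) for m, v in sorted(self.per_minute.items())]

ts = datetime(2015, 3, 1, 14, 7, tzinfo=timezone.utc)
entry = HourlyEntry()
entry.record(ts, 300)
entry.record(ts.replace(minute=9), 5)
print(perf_key("db1:user", "get_op_count", ts))
print(entry.encoded())
```

Keeping one entry per hour bounds the number of keys per metric, while the per-minute values inside it preserve the minute granularity described earlier.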
To efficiently keep accounting over large key ranges, the task of aggregation must be distributed. If activity occurs within the same range as the key prefix for accounting, the updates are made as part of the consensus write. If the ranges differ, then a message is sent to the parent range to increment the accounting. If, upon receiving the message, the parent range also does not include the key prefix, it in turn forwards the message to its parent or left child in the balanced binary tree which is maintained to describe the range hierarchy. This limits the number of messages before an update is visible at the root to 2*logN, where N is the number of ranges in the key prefix.
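To get a feel for the 2*logN bound, the following sketch evaluates it for a few range counts. It assumes the logarithm is base 2 (natural for a balanced binary tree) and rounds up to whole tree levels; the source states only the formula, so treat the numbers as an estimate.

```python
import math

def max_forwarded_messages(num_ranges: int) -> int:
    """Upper bound 2*log2(N) on messages before an update reaches the root."""
    if num_ranges < 1:
        raise ValueError("a key prefix spans at least one range")
    # ceil() rounds the tree height up to a whole number of levels (assumption).
    return 2 * math.ceil(math.log2(num_ranges))

for n in (1, 16, 1000, 1024):
    print(n, max_forwarded_messages(n))
```

Under these assumptions, even a key prefix spanning about a thousand ranges needs at most 20 forwarded messages per update, which is why aggregation can be pushed up the tree rather than performed centrally.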
1.2 Zones
Zones are stored in the map with keys prefixed by \0zone followed by the key prefix to which the zone configuration applies. Zone values specify a protobuf (Protocol Buffers, Google's data interchange format) containing the datacenters from which replicas for ranges which fall under the zone must be chosen.
Please see pkg/config/config.proto for the up-to-date data structures used, the best entry point being message ZoneConfig.
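The zone key layout can be sketched as a small lookup. This example is not the ZoneConfig structure from config.proto: it models a zone value as a plain list of datacenter names, uses made-up datacenter names, and assumes that the longest matching key prefix wins when several zones cover a key.

```python
# Assumption: "\0zone" is a NUL byte followed by the text "zone".
ZONE_PREFIX = "\x00zone"

# Hypothetical zone table: zone key -> datacenters replicas must be chosen from.
zones = {
    ZONE_PREFIX + "": ["dc-east", "dc-west", "dc-central"],  # default for the map
    ZONE_PREFIX + "db1": ["dc-east", "dc-west"],
    ZONE_PREFIX + "db1:order": ["dc-east"],
}

def zone_for_key(key: str, zone_table=zones):
    """Return (key_prefix, datacenters) of the most specific matching zone."""
    best = None
    for zone_key, datacenters in zone_table.items():
        prefix = zone_key[len(ZONE_PREFIX):]
        if key.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, datacenters)
    return best

print(zone_for_key("db1:order:1001"))
print(zone_for_key("db1:user:42"))
print(zone_for_key("db2:user:1"))
```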
If zones are modified in situ, each node verifies the existing zones for its ranges against the zone configuration. If it discovers differences, it reconfigures ranges in the same way that it rebalances away from busy nodes, via a special-case 1:1 split to a duplicate range comprising the new configuration.