1 About
This document is an updated version of the original design documents written by Spencer Kimball in early 2014.
2 Overview
CockroachDB is a distributed SQL database. The primary design goals are scalability, strong consistency, and survivability (hence the name). CockroachDB aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. CockroachDB nodes are symmetric; a design goal is homogeneous deployment (one binary) with minimal configuration and no required external dependencies.
The entry point for database clients is the SQL interface. Every node in a CockroachDB cluster can act as a client SQL gateway. A SQL gateway transforms client SQL statements into key-value (KV) operations and executes them, distributing the operations across the cluster as necessary and returning results to the client. CockroachDB implements a single, monolithic sorted map from key to value where both keys and values are byte strings.
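The idea of a single sorted map over byte-string keys can be illustrated with a small in-memory sketch. This is only a toy model, not CockroachDB's storage layer: the `SortedKV` class and the `/<table>/<primary key>/<column>` key encoding are assumptions made up for this example.

```python
import bisect


class SortedKV:
    """Toy in-memory sorted map from byte-string keys to byte-string values."""

    def __init__(self):
        self._keys = []   # keys kept in lexicographic byte order
        self._vals = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, key: bytes):
        return self._vals.get(key)

    def scan(self, start: bytes, end: bytes):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]


kv = SortedKV()
# Hypothetical encoding: "/<table>/<primary key>/<column>" -> column value.
kv.put(b"/users/2/name", b"bob")
kv.put(b"/users/1/name", b"alice")
kv.put(b"/orders/7/total", b"42")

# Because keys are sorted, all rows of one table are adjacent, so a gateway
# could serve a query over "users" with a single ordered scan.
# b"/users0" is the smallest key that sorts after every b"/users/..." key.
rows = kv.scan(b"/users/", b"/users0")
```

Keeping the map sorted is what makes contiguous slices of the keyspace meaningful units to scan and to move between nodes.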
The KV map is logically composed of smaller segments of the keyspace called ranges. Each range is backed by data stored in a local KV storage engine (we use RocksDB, a variant of LevelDB). Range data is replicated to a configurable number of additional CockroachDB nodes. Ranges are merged and split to maintain a target size, by default 64 MB. The relatively small size facilitates quick repair and rebalancing to address node failures, new capacity, and even read/write load. However, the size must be balanced against the pressure on the system from having more ranges to manage.
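A rough sketch of how a keyspace might be carved into ranges and split once a range outgrows its target size is shown below. The `RangeIndex` class, its method names, and the even 50/50 division of bytes on a split are all simplifying assumptions for illustration; the real split and merge logic is not described in this section.

```python
import bisect

TARGET_RANGE_BYTES = 64 * 1024 * 1024  # assumed reading of the 64M default


class RangeIndex:
    """Toy index of ranges; range i covers keys in [starts[i], starts[i+1])."""

    def __init__(self, target_bytes: int = TARGET_RANGE_BYTES):
        self.target_bytes = target_bytes
        self.starts = [b""]  # sorted start keys; the first range starts at the empty key
        self.sizes = [0]     # approximate bytes held by each range

    def lookup(self, key: bytes) -> int:
        """Return the position of the range whose span contains key."""
        return bisect.bisect_right(self.starts, key) - 1

    def add_bytes(self, key: bytes, nbytes: int) -> bool:
        """Account for a write; return True when the range exceeds its target."""
        i = self.lookup(key)
        self.sizes[i] += nbytes
        return self.sizes[i] > self.target_bytes

    def split(self, split_key: bytes) -> None:
        """Split the range containing split_key into two at that key."""
        i = self.lookup(split_key)
        if split_key == self.starts[i]:
            return  # already a range boundary
        half = self.sizes[i] // 2  # toy assumption: data divides evenly
        self.sizes[i] -= half
        self.starts.insert(i + 1, split_key)
        self.sizes.insert(i + 1, half)


idx = RangeIndex(target_bytes=100)  # tiny target so the example triggers a split
if idx.add_bytes(b"k", 150):
    idx.split(b"m")
```

After the split, keys below `b"m"` resolve to the first range and keys at or above it resolve to the second, which is the property that lets small ranges be repaired or rebalanced independently.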
CockroachDB achieves horizontal scalability:
- adding more nodes increases the capacity of the cluster by the amount of storage on each node (divided by a configurable replication factor), theoretically up to 4 exabytes (4E) of logical data;
- client queries can be sent to any node in the cluster, and queries can operate independently (without conflicts), meaning that overall throughput is a linear factor of the number of nodes in the cluster;
- queries are distributed (ref: distributed SQL) so that the overall throughput of single queries can be increased by adding more nodes.
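The capacity claim in the first point is simple arithmetic, worked through here. The node count, per-node storage, and replication factor are illustrative numbers chosen for the example, not recommendations, and `logical_capacity_tb` is a hypothetical helper name.

```python
def logical_capacity_tb(nodes: int, storage_per_node_tb: float,
                        replication_factor: int) -> float:
    """Logical (deduplicated) capacity: raw storage divided by replica count."""
    return nodes * storage_per_node_tb / replication_factor


# Illustrative cluster: 2 TB per node, every range stored 3 times.
before = logical_capacity_tb(nodes=10, storage_per_node_tb=2.0, replication_factor=3)
after = logical_capacity_tb(nodes=11, storage_per_node_tb=2.0, replication_factor=3)

# Adding one node contributes its storage divided by the replication factor.
gained = after - before
```

With these numbers, one extra 2 TB node adds roughly 0.67 TB of logical capacity, because each byte of data occupies three bytes of raw storage.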
CockroachDB achieves strong consistency:
- it uses a distributed consensus protocol for synchronous replication of data in each key-value range. We've chosen to use the Raft consensus algorithm; all consensus state is stored in RocksDB;
- single or batched mutations to a single range are mediated via the range's Raft instance. Raft guarantees ACID semantics;
- logical mutations which affect multiple ranges employ distributed transactions for ACID semantics. CockroachDB uses an efficient non-locking distributed commit protocol.
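Raft treats a log entry as committed once a majority of the replicas in the group have persisted it. The sketch below shows only that majority arithmetic; the helper names are made up for this example, and leader election, log matching, and the rest of the protocol are left out.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - quorum(replicas)


def is_committed(acks: int, replicas: int) -> bool:
    """A mutation is durable once a majority has acknowledged it."""
    return acks >= quorum(replicas)


# With 3 replicas, 2 acknowledgements commit a write and 1 failure is tolerated.
three_way = (quorum(3), tolerated_failures(3))
# With 5 replicas, 3 acknowledgements are needed and 2 failures are tolerated.
five_way = (quorum(5), tolerated_failures(5))
```

Note that an even replica count buys no extra fault tolerance: 4 replicas still tolerate only 1 failure, which is why odd replication factors are the usual choice.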
CockroachDB achieves survivability:
- range replicas can be co-located within a single datacenter for low latency replication and survive disk or machine failures. They can be distributed across racks to survive some network switch failures;
- range replicas can be located in datacenters spanning increasingly disparate geographies to survive ever-greater failure scenarios, from datacenter power or networking loss to regional power failures (e.g. { US-East-1a, US-East-1b, US-East-1c }, { US-East, US-West, Japan }, { Ireland, US-East, US-West }, { Ireland, US-East, US-West, Japan, Australia }).
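The trade-off between the placements listed above can be made concrete with a small check: a range stays available after a failure only if a majority of its replicas are outside the failed location. This sketch assumes the majority rule of Raft-style replication; the `survives` helper and the site labels are illustrative only.

```python
def survives(replica_sites, failed_site: str) -> bool:
    """True if a majority of replicas remain after one site fails."""
    total = len(replica_sites)
    alive = sum(1 for site in replica_sites if site != failed_site)
    return alive >= total // 2 + 1


# All three replicas in one availability zone: low latency, but losing
# that zone takes every replica with it.
single_zone = ["US-East-1a", "US-East-1a", "US-East-1a"]

# One replica per zone: survives the loss of any single zone.
multi_zone = ["US-East-1a", "US-East-1b", "US-East-1c"]

# One replica per region: survives a regional outage, at higher latency.
multi_region = ["US-East", "US-West", "Japan"]

results = (
    survives(single_zone, "US-East-1a"),
    survives(multi_zone, "US-East-1a"),
    survives(multi_region, "Japan"),
)
```

The same function applies at any granularity (disk, machine, rack, datacenter, region); only the labels attached to the replicas change.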
CockroachDB provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes, both from a historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation level; clients must consciously decide to trade correctness for performance. CockroachDB implements a limited form of linearizability, providing ordering for any observer or chain of observers.
Translator's note: wall clock time here means the time read from a timekeeping device such as a clock or a computer; it is a measurement of real time and can never match it exactly. Write skew arises under SI because a transaction cannot see the updates made by other concurrent transactions, so the combined result after all of them commit may violate consistency.
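Write skew is easiest to see with the classic "at least one doctor must stay on call" example. The toy model below is a sketch under stated assumptions, not CockroachDB's transaction protocol: the `Store` and `Txn` classes are invented here, SI is modelled as "abort only on write-write conflicts", and the serializable mode is approximated by also validating the read set at commit time, which is stricter and simpler than a real SSI implementation.

```python
class Store:
    def __init__(self, data):
        self.data = dict(data)
        self.commit_ts = {k: 0 for k in data}  # last commit timestamp per key
        self.clock = 0

    def begin(self):
        return Txn(self)


class Txn:
    def __init__(self, store):
        self.store = store
        self.start_ts = store.clock
        self.snapshot = dict(store.data)  # private snapshot taken at start
        self.reads = set()
        self.writes = {}

    def read(self, key):
        self.reads.add(key)
        return self.writes.get(key, self.snapshot.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self, serializable: bool) -> bool:
        s = self.store
        to_check = set(self.writes)
        if serializable:
            to_check |= self.reads  # also reject stale reads
        if any(s.commit_ts.get(k, 0) > self.start_ts for k in to_check):
            return False  # conflicting commit happened after our snapshot
        s.clock += 1
        for k, v in self.writes.items():
            s.data[k] = v
            s.commit_ts[k] = s.clock
        return True


def run(serializable: bool):
    store = Store({"alice": "on_call", "bob": "on_call"})
    t1, t2 = store.begin(), store.begin()  # concurrent transactions
    for txn, me in ((t1, "alice"), (t2, "bob")):
        on_call = [d for d in ("alice", "bob") if txn.read(d) == "on_call"]
        if len(on_call) >= 2:  # "someone else is covering, I can leave"
            txn.write(me, "off")
    outcomes = [t1.commit(serializable), t2.commit(serializable)]
    return outcomes, store.data


si_outcomes, si_data = run(serializable=False)
ssi_outcomes, ssi_data = run(serializable=True)
```

In the SI run both transactions commit because their write sets are disjoint, leaving nobody on call; in the serializable run the second transaction is rejected and would have to retry, which is the performance cost under contention that the text mentions.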
Similar to Spanner directories, CockroachDB allows configuration of arbitrary zones of data. This allows replication factor, storage device type, and/or datacenter location to be chosen to optimize performance and/or availability. Unlike Spanner, zones are monolithic and don't allow movement of fine-grained data on the level of entity groups.
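As a rough illustration of what a zone carries, the sketch below attaches a replication factor, a storage device type, and a list of datacenters to slices of the keyspace. Everything here is hypothetical: the `ZoneConfig` fields, the key prefixes, and the longest-prefix lookup are assumptions for the example and do not reflect CockroachDB's actual zone configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneConfig:
    replication_factor: int
    storage: str         # e.g. "ssd" or "hdd" (illustrative labels)
    datacenters: tuple   # where replicas may be placed


# Hypothetical mapping from key prefix to zone; b"" is the default zone.
ZONES = {
    b"": ZoneConfig(3, "hdd", ("US-East-1a", "US-East-1b", "US-East-1c")),
    b"/payments/": ZoneConfig(5, "ssd", ("Ireland", "US-East", "US-West",
                                         "Japan", "Australia")),
}


def zone_for(key: bytes) -> ZoneConfig:
    """Pick the zone with the longest prefix matching the key."""
    best = max((p for p in ZONES if key.startswith(p)), key=len)
    return ZONES[best]


default_zone = zone_for(b"/logs/2014-01-01")
payments_zone = zone_for(b"/payments/42")
```

The point of the sketch is that the settings apply to a whole zone at once, matching the statement that zones are monolithic.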