ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system. Unlike normal file systems, ZooKeeper provides its clients with high-throughput, low-latency, highly available, strictly ordered access to the znodes. The performance aspects of ZooKeeper allow it to be used in large distributed systems. The reliability aspects keep it from becoming a single point of failure in big systems. Its strict ordering allows sophisticated synchronization primitives to be implemented at the client.
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash ("/"). Every znode in ZooKeeper's name space is identified by a path, and every znode has a parent whose path is a prefix of the znode's path with one less element; the exception to this rule is the root ("/"), which has no parent. Also, exactly as in a standard file system, a znode cannot be deleted if it has any children.
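These path rules can be seen through the Java client API. The sketch below is illustrative only: the connect string, session timeout, and the /app paths are placeholders, not anything prescribed by ZooKeeper.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PathDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder address and session timeout for this sketch.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Every znode is identified by a slash-separated path; the parent
        // must exist before a child can be created under it.
        zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/app/config", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        List<String> children = zk.getChildren("/app", false);
        System.out.println("children of /app: " + children); // [config]

        try {
            // A znode that still has children cannot be deleted.
            zk.delete("/app", -1);
        } catch (KeeperException.NotEmptyException e) {
            System.out.println("/app still has children; delete /app/config first");
        }
        zk.close();
    }
}
```

Deleting /app/config first would then allow /app itself to be removed.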
The main differences between ZooKeeper and standard file systems are that every znode can have data associated with it (every file can also be a directory and vice-versa) and that znodes are limited in the amount of data they can have. ZooKeeper was designed to store coordination data: status information, configuration, location information, etc. This kind of meta-information is usually measured in kilobytes, if not bytes. ZooKeeper has a built-in sanity check of 1 MB to prevent it from being used as a large data store, but in general it is used to store much smaller pieces of data.
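As a rough sketch of the intended usage pattern, the example below stores a few bytes of location information and reads them back; the server address, the /service paths, and the endpoint string are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SmallDataDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { }); // placeholder address

        // A typical piece of coordination data: a few bytes of location info.
        byte[] endpoint = "10.0.0.5:8080".getBytes(StandardCharsets.UTF_8);
        zk.create("/service", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/service/worker-1", endpoint, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        Stat stat = new Stat();
        byte[] stored = zk.getData("/service/worker-1", false, stat);
        System.out.println(new String(stored, StandardCharsets.UTF_8)
                + " (" + stat.getDataLength() + " bytes)");
        // Anything approaching the ~1 MB default limit (the jute.maxbuffer setting)
        // would be rejected; znodes are meant for metadata, not bulk storage.

        zk.close();
    }
}
```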
The service itself is replicated over a set of machines that comprise the service. These machines maintain an in-memory image of the data tree along with transaction logs and snapshots in a persistent store. Because the data is kept in memory, ZooKeeper is able to achieve very high throughput and low latency. The downside of an in-memory database is that the size of the database ZooKeeper can manage is limited by memory. This limitation is a further reason to keep the amount of data stored in znodes small.
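A hedged zoo.cfg fragment is shown below; the directory paths are placeholders, while the key names (dataDir for snapshots, dataLogDir for the transaction log) are ZooKeeper's standard settings.

```
tickTime=2000
clientPort=2181
# fuzzy snapshots of the in-memory data tree go here
dataDir=/var/lib/zookeeper/data
# transaction log; a dedicated device helps write latency
dataLogDir=/var/lib/zookeeper/txnlog
```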
The servers that make up the ZooKeeper service must all know about each other. As long as a majority of the servers are available, the ZooKeeper service will be available. Clients must also know the list of servers. The clients create a handle to the ZooKeeper service using this list of servers.
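Continuing that hypothetical zoo.cfg, a three-node ensemble might be declared as follows; the hostnames and ports are assumptions, though the server.N syntax and the 2888/3888 peer and election ports are the usual convention.

```
# replicated ("quorum") mode: every server lists the whole ensemble
initLimit=10
syncLimit=5
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
```

Each server also stores its own id (1, 2, or 3) in a myid file under dataDir, and clients are handed the matching list as a comma-separated connect string such as "zoo1:2181,zoo2:2181,zoo3:2181".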
Clients only connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server. When a client first connects to the ZooKeeper service, the first ZooKeeper server will set up a session for the client. If the client later needs to connect to another server, this session will be reestablished with the new server.
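A minimal sketch of creating such a handle and observing the session state, assuming the three placeholder servers above; the session timeout value is arbitrary.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConnectDemo {
    public static void main(String[] args) throws Exception {
        // The client is given the whole server list but talks to one server at a time.
        String connectString = "zoo1:2181,zoo2:2181,zoo3:2181";

        ZooKeeper zk = new ZooKeeper(connectString, 5000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                switch (event.getState()) {
                    case SyncConnected:
                        System.out.println("session established (or re-established on another server)");
                        break;
                    case Disconnected:
                        System.out.println("lost the current server; the client will try another one");
                        break;
                    case Expired:
                        System.out.println("session expired; a new handle must be created");
                        break;
                    default:
                        break;
                }
            }
        });

        System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
        Thread.sleep(2000); // give the connection watcher a moment to fire
        zk.close();
    }
}
```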
Read requests sent by a ZooKeeper client are processed locally at the ZooKeeper server to which the client is connected. If the read request registers a watch on a znode, that watch is also tracked locally at the ZooKeeper server. Write requests are forwarded to other ZooKeeper servers and go through consensus before a response is generated. Sync requests are also forwarded to another server, but do not actually go through consensus. Thus, the throughput of read requests scales with the number of servers and the throughput of write requests decreases with the number of servers.
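The sketch below contrasts the three kinds of requests using the Java client; the paths and connect string are again placeholders.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zoo1:2181,zoo2:2181,zoo3:2181", 5000, event -> { });

        // Write: forwarded through the ensemble and agreed on before the response returns.
        zk.create("/demo", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read: served locally by the connected server; passing true also registers
        // the default watcher on /demo, tracked locally on that same server.
        Stat stat = new Stat();
        byte[] data = zk.getData("/demo", true, stat);

        // Another write; slower than the read above because it goes through consensus.
        zk.setData("/demo", "v2".getBytes(), stat.getVersion());

        // sync() is forwarded but skips consensus: it just brings the connected
        // server up to date with the leader before subsequent reads.
        zk.sync("/demo", (rc, path, ctx) -> System.out.println("synced " + path), null);

        Thread.sleep(500); // let the async sync callback arrive before cleanup
        zk.delete("/demo", -1);
        zk.close();
    }
}
```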
Order is very important to ZooKeeper; almost bordering on obsessive–compulsive disorder. All updates are totally ordered. ZooKeeper actually stamps each update with a number that reflects this order. We call this number the zxid (ZooKeeper Transaction Id). Each update will have a unique zxid. Reads (and watches) are ordered with respect to updates. Read responses will be stamped with the last zxid processed by the server that services the read.
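The zxids are visible to clients through a znode's Stat structure. A small sketch, assuming a placeholder server address:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZxidDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { }); // placeholder address

        zk.create("/ordered", "a".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.setData("/ordered", "b".getBytes(), -1);

        // The Stat of a znode records the zxids of the updates that touched it.
        Stat stat = zk.exists("/ordered", false);
        System.out.println("czxid (creating txn)  : 0x" + Long.toHexString(stat.getCzxid()));
        System.out.println("mzxid (last modifying): 0x" + Long.toHexString(stat.getMzxid()));
        // mzxid is greater than czxid here because the setData() was ordered after the create().

        zk.delete("/ordered", -1);
        zk.close();
    }
}
```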