Design Goals
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace.
ZooKeeper data is kept in memory, which means ZooKeeper can achieve high throughput and low latency.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
[Figure: ZooKeeper Service]
Clients connect over TCP to a single ZooKeeper server. If the TCP connection to the server breaks, the client will connect to a different server.
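The failover behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the real client library: the host names, the connect_with_failover helper, the ConnectionFailed exception, and the choice to try servers in shuffled order are all assumptions made for the example.

```python
import random

class ConnectionFailed(Exception):
    """Raised by the stand-in connect function when a server is unreachable."""

def connect_with_failover(servers, connect, rng=None):
    """Try servers in shuffled order; return (server, connection) for the first success."""
    order = list(servers)
    (rng or random).shuffle(order)
    for server in order:
        try:
            return server, connect(server)
        except ConnectionFailed:
            continue  # this server is down, try the next one in the ensemble
    raise ConnectionFailed("no server in the ensemble is reachable")

# Hypothetical stand-in for a TCP connect: zk1 is down, the others are up.
def fake_connect(server):
    if server == "zk1:2181":
        raise ConnectionFailed(server)
    return f"session-to-{server}"

server, conn = connect_with_failover(
    ["zk1:2181", "zk2:2181", "zk3:2181"], fake_connect
)
```

The client only ever holds one connection at a time; the list of ensemble members is used purely to find a replacement when that connection fails.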
ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.
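As a rough illustration of the ordering stamp, the sketch below hands every update the next number from a single increasing counter. The UpdateStamper class and its method names are hypothetical and only model the idea; they are not part of ZooKeeper.

```python
import itertools

class UpdateStamper:
    """Assigns each update the next number in a single increasing sequence."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.history = []  # (stamp, path, data) tuples in the order they were applied

    def stamp(self, path, data):
        number = next(self._counter)
        self.history.append((number, path, data))
        return number

stamper = UpdateStamper()
first = stamper.stamp("/lock", b"client-a")
second = stamper.stamp("/lock", b"client-b")
```

Because every update draws from the same sequence, comparing two stamps is enough to tell which update happened first.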
ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and ZooKeeper performs best where reads are more common than writes, at ratios of around 10:1.
Data model and the hierarchical namespace
[Figure: ZooKeeper's Hierarchical Namespace]
Nodes and ephemeral nodes
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children.
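One way to picture this is a tree in which every node carries both a payload and a map of children. The sketch below assumes slash-separated paths; the Znode dataclass, the resolve helper, and the example paths are illustrative names, not ZooKeeper's own types.

```python
from dataclasses import dataclass, field

@dataclass
class Znode:
    data: bytes = b""
    children: dict = field(default_factory=dict)  # child name -> Znode

def resolve(root, path):
    """Walk a slash-separated path such as '/app/config' starting from the root."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # skip the empty segment produced by the root path "/"
            node = node.children[name]
    return node

root = Znode()
root.children["app"] = Znode(data=b"app-settings")
root.children["app"].children["config"] = Znode(data=b"timeout=30")
```

Here /app behaves like a file and a directory at once: it holds its own data and also has the child /app/config.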
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases.
For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically.
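The version number makes a conditional update possible: a client writes only if the version it read is still current. The sketch below is a single-process model under that assumption; VersionedNode and BadVersion are hypothetical names, and a lock stands in for the atomic read and write.

```python
import threading

class BadVersion(Exception):
    """Raised when the caller's expected version is stale."""

class VersionedNode:
    def __init__(self, data=b""):
        self._lock = threading.Lock()  # makes each read and write atomic
        self._data = data
        self._version = 0

    def get(self):
        with self._lock:
            return self._data, self._version

    def set(self, data, expected_version):
        with self._lock:
            if expected_version != self._version:
                raise BadVersion(
                    f"expected {expected_version}, node is at {self._version}"
                )
            self._data = data
            self._version += 1  # every data change bumps the version
            return self._version

node = VersionedNode(b"a")
_, seen_by_a = node.get()
_, seen_by_b = node.get()
node.set(b"from-a", seen_by_a)      # succeeds, version becomes 1
try:
    node.set(b"from-b", seen_by_b)  # stale version 0, rejected
    outcome = "applied"
except BadVersion:
    outcome = "rejected"
```

The second writer has to re-read the node and retry, so two clients cannot silently overwrite each other.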
ZooKeeper also has the notion of ephemeral (temporary) nodes. These znodes exist as long as the session that created them is active. When the session ends, the znode is deleted.
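The lifetime rule for ephemeral nodes can be modelled by remembering which session owns each one. In the sketch below, EphemeralRegistry, its methods, and the session identifiers are assumptions made for illustration only.

```python
class EphemeralRegistry:
    def __init__(self):
        self.nodes = {}   # path -> data, for all nodes
        self.owners = {}  # path -> session id, for ephemeral nodes only

    def create(self, path, data, session_id=None):
        """Create a node; passing a session id marks it as ephemeral."""
        self.nodes[path] = data
        if session_id is not None:
            self.owners[path] = session_id

    def close_session(self, session_id):
        """Delete every ephemeral node owned by the session that just ended."""
        expired = [p for p, owner in self.owners.items() if owner == session_id]
        for path in expired:
            del self.nodes[path]
            del self.owners[path]
        return expired

registry = EphemeralRegistry()
registry.create("/config", b"persistent")
registry.create("/workers/w1", b"alive", session_id="session-1")
removed = registry.close_session("session-1")
```

This is why ephemeral nodes are a natural fit for presence information: a worker's node disappears when its session ends, without any explicit cleanup by the worker.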
Conditional updates and watches
Clients can set a watch on a znode. A watch is triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client receives a local notification. These can be used to [tbd].
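The "triggered and removed" behaviour means a watch fires at most once. A minimal single-process sketch, with WatchedNode and the "changed" event string as illustrative assumptions:

```python
class WatchedNode:
    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def get(self, watcher=None):
        """Read the data and optionally register a one-shot watch."""
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data

    def set(self, data):
        self.data = data
        # Detach the watchers before notifying, so each one fires only once.
        watchers, self._watchers = self._watchers, []
        for notify in watchers:
            notify("changed")

events = []
node = WatchedNode(b"v1")
node.get(watcher=events.append)
node.set(b"v2")  # fires the watch and removes it
node.set(b"v3")  # no watch is registered any more, so nothing is delivered
```

A client that wants to keep following the znode has to set a new watch after each notification, typically as part of re-reading the data.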
Simple API
ZooKeeper supports only these operations:
- create: creates a node at a location in the tree
- delete: deletes a node
- exists: tests if a node exists at a location
- get data: reads the data from a node
- set data: writes data to a node
- get children: retrieves a list of children of a node
- sync: waits for data to be propagated
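To show how small this interface is, the sketch below implements the same seven operations against a plain dictionary keyed by path. It is an in-memory toy, not the ZooKeeper client: the InMemoryTree class, its method names, and its error types are assumptions, and sync does nothing because there is only one copy of the data.

```python
class InMemoryTree:
    def __init__(self):
        self._data = {"/": b""}  # full path -> data

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if path in self._data:
            raise KeyError(f"node already exists: {path}")
        if parent not in self._data:
            raise KeyError(f"parent does not exist: {parent}")
        self._data[path] = data

    def delete(self, path):
        if self.get_children(path):
            raise ValueError(f"node has children: {path}")
        del self._data[path]

    def exists(self, path):
        return path in self._data

    def get_data(self, path):
        return self._data[path]

    def set_data(self, path, data):
        if path not in self._data:
            raise KeyError(f"no such node: {path}")
        self._data[path] = data

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        names = (p[len(prefix):] for p in self._data
                 if p != path and p.startswith(prefix))
        return sorted(n for n in names if "/" not in n)  # direct children only

    def sync(self):
        return None  # single copy of the data, so there is nothing to wait for

tree = InMemoryTree()
tree.create("/app", b"settings")
tree.create("/app/config", b"timeout=30")
tree.set_data("/app/config", b"timeout=60")
```

Higher-level patterns are built by combining these calls rather than by adding new operations.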
Implementation
ZooKeeper Components shows the high-level components of the ZooKeeper service.
[Figure: ZooKeeper Components]
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
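The log-before-apply rule can be sketched as follows. This is a simplified model assuming one JSON record per line in a single file; the LoggedStore class, the record layout, and the file name are hypothetical and do not reflect ZooKeeper's actual on-disk format.

```python
import json
import os
import tempfile

class LoggedStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.tree = {}   # in-memory copy of the data, path -> data
        self._replay()   # rebuild the in-memory state from the log on startup

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.tree[record["path"]] = record["data"]

    def write(self, path, data):
        record = json.dumps({"path": path, "data": data})
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the record to disk first
        self.tree[path] = data      # only then apply it in memory

log_file = os.path.join(tempfile.mkdtemp(), "txn.log")
store = LoggedStore(log_file)
store.write("/app/config", "timeout=30")
recovered = LoggedStore(log_file)   # simulates a restart after a crash
```

Because the record reaches disk before memory is touched, anything visible in memory can always be rebuilt by replaying the log.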
As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
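The write path through the leader can be modelled in a few lines. The sketch below assumes that a write commits once a majority of the ensemble, counting the leader, has accepted the proposal; that majority rule, the Leader and Follower classes, and the transaction tuple are assumptions for illustration, and leader replacement and follower resynchronization are left out.

```python
class Follower:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []  # proposals this follower has accepted

    def propose(self, txn):
        """Accept a proposal from the leader; return True as the acknowledgement."""
        if not self.up:
            return False
        self.log.append(txn)
        return True

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.committed = []
        self._next_id = 1

    def write(self, path, data):
        txn = (self._next_id, path, data)
        self._next_id += 1
        acks = 1 + sum(f.propose(txn) for f in self.followers)  # leader counts itself
        ensemble_size = len(self.followers) + 1
        if acks * 2 > ensemble_size:  # strict majority has agreed
            self.committed.append(txn)
            return True
        return False

healthy = Leader([Follower("f1"), Follower("f2", up=False)])
ok_with_one_down = healthy.write("/app", b"v1")       # 2 of 3 agree
degraded = Leader([Follower("f1", up=False), Follower("f2", up=False)])
ok_with_two_down = degraded.write("/app", b"v1")      # only 1 of 3 agrees
```

Funnelling every write through one server is what lets the ensemble agree on a single order of updates.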