Distributed Coordinator ZooKeeper 3.4: Overview

guxch

[ZooKeeper is open-source software from the Apache Hadoop ecosystem and acts as a distributed coordinator. This article is taken from the official ZooKeeper website: http://zookeeper.apache.org/doc/r3.4.5/zookeeperOver.html]

ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, group membership, and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.

Design Goals

ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in memory, which means ZooKeeper can achieve high throughput and low latency.

The ZooKeeper implementation puts a premium on high-performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.

ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.

Figure: ZooKeeper Service

The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.

Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server.
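The majority rule above comes down to simple arithmetic. The following Python sketch only illustrates that arithmetic; the function names are invented for this example and are not part of ZooKeeper.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def is_available(ensemble_size: int, servers_up: int) -> bool:
    """True when the servers that are up still form a majority."""
    return servers_up >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 5, 6, 7):
        print(n, "servers: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failures")
```

Under this rule an ensemble of 6 tolerates no more failures than an ensemble of 5, which is one reason odd ensemble sizes are a common choice.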

ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
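To see why a global order is useful, consider the toy Python sketch below. It is an assumption-laden illustration, not ZooKeeper code: the Stamper class and replay function are hypothetical names. Every update receives an increasing number, so any replica that applies updates in stamp order reaches the same state, whatever order the updates arrived in.

```python
import itertools


class Stamper:
    """Hands out strictly increasing numbers, one per update (toy model)."""

    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, path, value):
        # Pair the update with the next number in the global order.
        return (next(self._counter), path, value)


def replay(stamped_updates):
    """Apply updates in stamp order and return the resulting state."""
    state = {}
    for _, path, value in sorted(stamped_updates, key=lambda u: u[0]):
        state[path] = value
    return state


if __name__ == "__main__":
    stamper = Stamper()
    updates = [stamper.stamp("/lock", "client-a"),
               stamper.stamp("/lock", "client-b")]
    # Delivery order does not matter once the stamps are used for ordering.
    print(replay(updates) == replay(list(reversed(updates))))
```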

ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

Data model and the hierarchical namespace

The namespace provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's namespace is identified by a path.

Figure: ZooKeeper's Hierarchical Namespace
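As a rough illustration of the naming rule, the Python sketch below splits a slash-separated name into its path elements and finds a node's parent. The function names split_path and parent_of are made up for this example, and the validation rules are simplified assumptions rather than ZooKeeper's actual path checks.

```python
from typing import List


def split_path(path: str) -> List[str]:
    """Return the path elements of an absolute, slash-separated name."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/': " + repr(path))
    if path == "/":
        return []  # the root has no path elements
    elements = path[1:].split("/")
    if any(element == "" for element in elements):
        raise ValueError("empty path element in: " + repr(path))
    return elements


def parent_of(path: str) -> str:
    """Return the path of the node one level above the given node."""
    elements = split_path(path)
    if not elements:
        raise ValueError("the root has no parent")
    return "/" + "/".join(elements[:-1])


if __name__ == "__main__":
    print(split_path("/app1/p_1"))
    print(parent_of("/app1/p_1"))
```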

Nodes and ephemeral nodes

Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

Znodes maintain a stat structure that includes version numbers for data changes and ACL changes, plus timestamps, to allow cache validation and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
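The version and session behaviour described above can be modelled in a few lines of Python. This is a toy in-memory sketch, not the ZooKeeper client API: ToyZnodeStore, BadVersionError, and the method names are hypothetical, and the conventions that versions start at 0 and that -1 means "any version" are assumptions made for the example.

```python
class BadVersionError(Exception):
    """Raised when a conditional update names a stale version."""


class ToyZnodeStore:
    def __init__(self):
        # path -> [data, version, owner_session]; the root always exists.
        self._nodes = {"/": [b"", 0, None]}

    def create(self, path, data=b"", session=None):
        # A node with an owning session is treated as ephemeral.
        parent = path.rsplit("/", 1)[0] or "/"
        if path in self._nodes or parent not in self._nodes:
            raise KeyError(path)
        self._nodes[path] = [data, 0, session]

    def exists(self, path):
        return path in self._nodes

    def get(self, path):
        # A read returns all the data together with its version.
        data, version, _ = self._nodes[path]
        return data, version

    def set(self, path, data, expected_version=-1):
        # A write replaces all the data and bumps the version.
        node = self._nodes[path]
        if expected_version != -1 and expected_version != node[1]:
            raise BadVersionError(path)
        node[0] = data
        node[1] += 1
        return node[1]

    def close_session(self, session):
        # Ending a session deletes every node that the session owns.
        if session is None:
            return
        owned = [p for p, n in self._nodes.items() if n[2] == session]
        for path in owned:
            del self._nodes[path]
```

A client that read version 0 and then writes with expected_version=0 succeeds only if nobody else wrote in between, which is how a version number supports coordinated updates.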

Conditional updates and watches

ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered the client receives a packet saying that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
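The "triggered and removed" behaviour means a watch fires at most once. The Python sketch below models only that one-shot property in memory; WatchedNode and its methods are hypothetical names, and real watches are delivered over the client's connection rather than through a direct callback.

```python
class WatchedNode:
    """Toy znode whose watches fire once and are then discarded."""

    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def get(self, watcher=None):
        # Reading may leave a watch behind for the next change.
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data

    def set(self, data):
        self.data = data
        # Detach all watches before notifying, so each fires only once.
        watchers, self._watchers = self._watchers, []
        for notify in watchers:
            notify("changed")


if __name__ == "__main__":
    events = []
    node = WatchedNode(b"a")
    node.get(watcher=events.append)
    node.set(b"b")  # fires the watch
    node.set(b"c")  # no watch is left, so nothing fires
    print(events)
```

A client that wants to keep following a znode therefore has to set a new watch each time it reads the data again.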

Guarantees

ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

For more information on these, and how they can be used, see [tbd]

Simple API

One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:

create - creates a node at a location in the tree
delete - deletes a node
exists - tests if a node exists at a location
get data - reads the data from a node
set data - writes data to a node
get children - retrieves a list of children of a node
sync - waits for data to be propagated

For a more in-depth discussion on these, and how they can be used to implement higher-level operations, please refer to [tbd]
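To make the operations concrete, here is a toy in-memory tree in Python that mimics six of them (sync is left out because a single-process model has nothing to propagate). This is a sketch under stated assumptions, not the real client API: the ToyTree class, its method names, and its error types are invented for the example. Refusing to delete a node that still has children is modelled on ZooKeeper's behaviour as I understand it.

```python
class ToyTree:
    def __init__(self):
        self._data = {"/": b""}  # path -> data; the root always exists

    @staticmethod
    def _parent(path):
        return path.rsplit("/", 1)[0] or "/"

    def create(self, path, data=b""):
        if path in self._data:
            raise KeyError("node exists: " + path)
        if self._parent(path) not in self._data:
            raise KeyError("missing parent for: " + path)
        self._data[path] = data

    def delete(self, path):
        if self.get_children(path):
            raise ValueError("node has children: " + path)
        del self._data[path]

    def exists(self, path):
        return path in self._data

    def get_data(self, path):
        return self._data[path]

    def set_data(self, path, data):
        if path not in self._data:
            raise KeyError("no such node: " + path)
        self._data[path] = data

    def get_children(self, path):
        # Children are the paths exactly one level below the given node.
        prefix = path.rstrip("/") + "/"
        names = []
        for candidate in self._data:
            if candidate != "/" and candidate.startswith(prefix):
                rest = candidate[len(prefix):]
                if "/" not in rest:
                    names.append(rest)
        return sorted(names)
```

With only these calls, a simple group-membership list is just the children of one node: each member creates a child under a group node, and get_children returns the current members.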

Implementation

The figure ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.

Figure: ZooKeeper Components

The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server's database. Requests that change the state of the service (write requests) are processed by an agreement protocol.

As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms this into a transaction that captures this new state.
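The last step, turning a request into a transaction that captures the new state, can be sketched in Python as follows. This is a simplified assumption about the idea, not ZooKeeper's implementation: make_transaction and apply_transaction are hypothetical names, and state is modelled as a plain dict mapping a path to a (data, version) pair.

```python
def make_transaction(state, txn_id, path, data, expected_version=-1):
    """Leader side: resolve a write request against the current state.

    The result records the outcome (the new data and the new version),
    not the original request, so applying it needs no further checks.
    """
    _, current_version = state.get(path, (b"", 0))
    if expected_version not in (-1, current_version):
        raise ValueError("version mismatch for " + path)
    return {"id": txn_id, "path": path,
            "data": data, "version": current_version + 1}


def apply_transaction(state, txn):
    """Any replica: install the state captured in the transaction."""
    state[txn["path"]] = (txn["data"], txn["version"])


if __name__ == "__main__":
    leader, follower = {}, {}
    txn = make_transaction(leader, 1, "/cfg", b"a")
    apply_transaction(leader, txn)
    apply_transaction(follower, txn)
    apply_transaction(follower, txn)  # applying twice changes nothing
    print(leader == follower)
```

Because the transaction carries the resulting state, applying it a second time leaves a replica unchanged, which makes recovery by replaying a log straightforward in this toy model.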

Uses

The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher-order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]

Performance

ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)

Figure: ZooKeeper Throughput as the Read-Write Ratio Varies

The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2 GHz Xeon processors and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. "Servers" indicates the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.

Note

In version 3.2, read/write performance improved by ~2x compared to the previous 3.1 release.

Benchmarks also indicate that it is reliable. The figure Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:

Failure and recovery of a follower
Failure and recovery of a different follower
Failure of the leader
Failure and recovery of two followers
Failure of another leader

Reliability

To show the behavior of the system over time as failures are injected, we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.

Figure: Reliability in the Presence of Errors

There are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. Second, and maybe more importantly, the leader election algorithm allows the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200 ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.

The ZooKeeper Project

ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for the Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.

All users and developers are encouraged to join the community and contribute their expertise. See the ZooKeeper Project on Apache for more information.
