ZooKeeper: enables coordination for distributed system
Similar to multithread programming, but shared nothing. Easier with a component provide share store, like ZooKeeper.
ZooKeeper manage data related to coordination, not for bulk storage
Master-Slave Key Problems:
1. Master Crashes
2. Slave Crashes
3. Communication Failure(Master and Slave cannot exchange message)
Master Failures:
Store the state of the system at the time old primary master crashed.
Split-Brain Problem: more than one master running independently caused by false suspicion of master failed
Question: if there are 5 servers at beginning, when 3 separated from other 2, what will happen? will there be two separate master?(is this actually what split-brain means?) or just the group has the majority of 3?
Worker Failures:
Detect worker failure by master. If the computation has side effects, some recovery procedure might be necessary to clean up the state.
Communication Failure:
Reassigning a task may cause two workers executing the same task, which depends on application may unacceptable.
ephemeral state.
Cannot tell is a node is crashed or just be slow.
Summary of Master-Worker Tasks
1. Master Election
2. Crash Detection
3. Group Membership Management: figure out which workers are available, group member is worker
4. Metadata Management: master and worker be able to store assignments and execution statuses in a reliable manner
ZooKeeper Basics
ZooKeeper not provide primitives, but expose file system like API comprised for a small set of calls.
Recipe to denote these implementation of primitives
API
create /path data
delete /path
exists /path
setData /path data
getData /path
getChildren /path
Modes
Persistent
Ephemeral: delete if client created it crashes or simply closed the connection. (Cannot have children)
Sequential znode: tasks-1, tasks-2 ....
All modes: persistent, ephemeral, persistent-sequential, ephemeral-sequential
Watches and Notifications
Clients register to watch notification instead of POLLING.
【IMPORTANT】One-Shot Notification, Example:
A want to watch on /tasks, but tasks has changed before A successfully set the watch, A may lost the notification, if the notification is important for A, so A get status of /tasks when setting the watch
【IMPORTANT】Notification Guarantee, notifications are delivered to a client before any other change is made to the znode.
Question: What if the Notification is lost?
Versions
use version to check valid update
ZooKeeper Architecture
Client Library: responsible for talking to servers
ZooKeeper Quorums
In Quorum mode, ZooKeeper replicates its data tree across all servers in the ensemble nodes.
quorum minimum number of legislators required to be present for a note. (How Quorum is be used? Solve the problem of delay to store data across all servers before return to client? Solve the partition problem?Solve the split-brain problem,抽屉原理)
当一个网络中被分隔成两部分以后,不管哪一部分被使用,都会有最新的更新
There are other quorums other than majority quorums
ZooKeeper Session
when a session ends for any reason, the ephemeral nodes created during session disappeared.(能有哪些原因?)
Moving a session to different server if has not heard from its current server for a some time. Moving session to a different server is transparently by ZooKeeper library.
Session offer order guarantee
ZooKeeper Lifecycle
Case: Waiting on CONNECTING During Network Partitions
ZooKeeper Quorums
1. server need to know each other, each zookeeper server is configured with a list of servers, if quorum is reached, then the service is available.
2. client specify the host:port pair it tries to connect(we can limit the hosts client can reached to achieve simple location based balancing)
Can we enhance third-party framework or application to use ZooKeeper do coordination work?
Implementing a Primitive: Locks with ZooKeeper
To acquire a lock, create a ephemeral znode: /lock
Others watch changes for /lock
Implementation of a Master-Worker Example
Roles
Master / Worker / Client
Master
master相当于一个竞争资源,对应lock recipe,创建一个ephemeral node用来表示master
create ephemeral znode: /master
stat /master true
监听(watch)/master节点
Workers, Tasks, Assignments
persistent
create /workers ""
create /tasks ""
create /assigns ""
ls /workers true
ls /tasks true
监听 /workers 子节点是否发生改变
The Work Role
create -e /workers/worker1.example.com "worker1.example.com:2224"
此时针对 ls /workers true 的watcher将会接受到通知
create -e /assign/worker1.example.com
创建子节点接收任务分配
ls /assign/worker1.example.com true
The Client Role
create -s /tasks/task- "cmd"
created /tasks/task-000000
-s 表明自增
ls /tasks/task-000000 true
监听task完成情况
当task创建完成以后,服务器监听到此事件,就去找当前的 tasks 和 workers 去完成任务分配
ls /tasks
=> [task-000000]
ls /workers
=> [worker1.example.com]
create /assign/worker1.example.com/task-000000 ""
此时worker将会接收到新分配任务的通知,执行任务
当worker完成任务以后,创建一个task的对应status node
/tasks/task-000000/status "done"
此时client会监听到task完成
IMPORTANT: ZooKeeper Internals
ZooKeeper Internals
in Ensemble, One leader handling all requests; followers receive and vote for updates;observer not participate in decision process, they only learn what have been decided
Requests, Transactions, and Identifiers
Requests读写分离
ZooKeeper server process read requests locally (exists, getData, getChildren) .
ZooKeeper leader process write requests (create, delete, setData).
Question: 如果client和Server A建立了session,发起了一个write request,完成。再向Server A发起了一个read request,并不能保证Server A一定能读到最新的?
Transaction: one state update operation (write request)
check the version number
idempotent: apply transaction multi times get same result
zxid: ZooKeeper transaciton id
Leader Election
leader election notification: server identifier (sid) and most transaction it executes (zxid)
p158
Zab: ZooKeeper Atomic Broadcast protocol
References
<<ZooKeeper>>
http://blog.cloudera.com/blog/2013/02/how-to-use-apache-zookeeper-to-build-distributed-apps-and-why/
http://zookeeper.apache.org/doc/trunk/recipes.html
http://www.ibm.com/developerworks/library/bd-zookeeper/
http://highscalability.com/blog/2008/7/15/zookeeper-a-reliable-scalable-distributed-coordination-syste.html
https://engineering.pinterest.com/blog/zookeeper-resilience-pinterest
kafka
http://kafka.apache.org/documentation.html
paxos算法[ZooKeeper内部实现的算法模型]
http://baike.baidu.com/link?url=GZCf2VymTzeZDfGCRs2QgPpjLm6xJEX5TzvtWEQOULy77yEqO0nc6Gy6JOxJjIaBSPTCqv9E1fEKUx420AyX9q
http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
http://www.ux.uis.no/~meling/papers/2013-paxostutorial-opodis.pdf
https://www.youtube.com/watch?v=JEpsBg0AO6o 【非常好的解释,需要翻墙】
https://distributedthoughts.wordpress.com/2013/09/22/understanding-paxos-part-1/