Chain Replication
- a distributed replication scheme that provides high throughput and high availability
- based on the primary/backup (p/b) architecture, with improvements over the traditional scheme
Primary/backup basic idea
- the primary orders all operations arriving from clients
- the primary forwards each operation to the backups
- the primary waits until all backups respond, then replies to the client
- the primary can answer reads directly
- if the primary fails, one of the backups is elected as the new primary
- all N replicas can keep serving even if N-1 of them fail (better than Raft, which needs a majority)
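The primary/backup write path above can be sketched as a toy in-memory model; all class and method names here are illustrative, not from any real system.

```python
# Minimal sketch of the primary/backup write path: the primary orders
# operations, forwards to ALL backups (ROWA), and replies to the client
# only after every backup has acknowledged.

class Backup:
    def __init__(self):
        self.log = []          # ordered list of applied operations

    def apply(self, seq, op):
        self.log.append((seq, op))
        return True            # ack back to the primary

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []
        self.next_seq = 0

    def handle_write(self, op):
        # 1. the primary picks a total order for client operations
        seq = self.next_seq
        self.next_seq += 1
        self.log.append((seq, op))
        # 2. forward to every backup and wait for all acks before
        #    replying to the client (this is why a dead backup must be
        #    removed from the configuration)
        acks = [b.apply(seq, op) for b in self.backups]
        return all(acks)       # reply to client

    def handle_read(self):
        # the primary can answer reads from its local state
        return self.log

primary = Primary([Backup(), Backup()])
assert primary.handle_write("x=1")
assert primary.handle_read() == [(0, "x=1")]
```

Note how `handle_write` blocks on every backup: a single slow or dead backup stalls all writes, which is exactly why the CFG (below) must detect failures and reconfigure.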
- this requires a separate "configuration service (CFG)" to manage replica failures; it is usually built on Raft or ZooKeeper, and it pings the replicas to detect failures
- if a backup fails, the CFG must remove it from the configuration, because the primary waits on every backup
- this scheme is also called "ROWA (Read One, Write All)":
  - since every write goes to all replicas, it can tolerate N-1 failures
-
CR basic steps
- [clients, S1=head, S2, S3=tail, CFG (= master)]
  (can be more replicas)
- clients send update requests to the head
- the head picks an order (assigns sequence numbers)
- the head updates its local replica, sends to S2
- S2 updates its local replica, sends to S3
- S3 (the tail) updates its local replica, sends the response to the client
- updates move along the chain in order:
  at each server, earlier updates are delivered before later ones
- clients send read requests to the tail
- the tail reads its local replica and responds to the client
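The steps above can be sketched as a toy chain where updates enter at the head, flow down in sequence order, and the tail both replies to writes and serves reads. Class and method names are illustrative.

```python
# Toy model of the chain update path: head assigns sequence numbers,
# each server applies the update locally and forwards to its successor,
# and the tail responds to the client. Reads go only to the tail.

class Server:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None       # successor in the chain

    def update(self, seq, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.update(seq, key, value)
        return ("ok", seq)     # the tail responds to the client

class Chain:
    def __init__(self, names):
        self.servers = [Server(n) for n in names]
        for a, b in zip(self.servers, self.servers[1:]):
            a.next = b
        self.head, self.tail = self.servers[0], self.servers[-1]
        self.next_seq = 0

    def write(self, key, value):
        # the head picks the order by assigning sequence numbers
        seq = self.next_seq
        self.next_seq += 1
        return self.head.update(seq, key, value)

    def read(self, key):
        # reads are served from the tail's local replica only
        return self.tail.store.get(key)

chain = Chain(["S1", "S2", "S3"])
assert chain.write("x", 1) == ("ok", 0)
assert chain.read("x") == 1
# by the time the client hears back, every replica holds the write
assert all(s.store["x"] == 1 for s in chain.servers)
```

Because the tail only replies after the update has passed through every server, a read at the tail can never observe an uncommitted write.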
Benefits:
- the head sends fewer messages than a p/b primary (one successor rather than all backups)
- each client talks only to the head (writes) and the tail (reads), which keeps the protocol simple
-
Fault tolerance
- if the head fails:
  - the CFG picks the second server as the new head
  - some updates may be lost:
    - namely those that only the failed head had received and had not yet forwarded
    - clients will re-send them later
- if the tail fails:
  - the CFG picks the tail's predecessor as the new tail
  - the new tail is guaranteed to hold every update the old tail had committed
- if a middle node fails:
  - the CFG tells the failed node's predecessor and successor to talk to each other
  - the predecessor re-sends updates to its new successor, so each node must remember the updates it has already passed on
  - when can that state be freed? when the tail completes an update, it sends an ack back up the chain
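The resend-and-ack mechanism for a failed middle node can be sketched as follows; the ack here is driven explicitly to mimic its asynchronous travel back up the chain, and all names are illustrative.

```python
# Sketch of "when can state be freed?": each node keeps every update it
# has forwarded until the tail's ack flows back up the chain; that saved
# copy is what the predecessor re-sends when its successor fails.

class Node:
    def __init__(self, name):
        self.name = name
        self.pending = {}      # seq -> op, not yet acked by the tail
        self.next = None
        self.prev = None

    def update(self, seq, op):
        self.pending[seq] = op
        if self.next is not None:
            self.next.update(seq, op)
        # at the tail the update is committed; the ack travels back
        # asynchronously (driven explicitly below)

    def ack(self, seq):
        # the update reached the tail, so this saved copy can be freed
        self.pending.pop(seq, None)
        if self.prev is not None:
            self.prev.ack(seq)

    def resend_to(self, new_next):
        # recovery step after this node's successor fails: the CFG wires
        # this node to the new successor, and it re-sends all un-acked updates
        self.next = new_next
        new_next.prev = self
        for seq, op in sorted(self.pending.items()):
            new_next.update(seq, op)

s1, s2, s3 = Node("S1"), Node("S2"), Node("S3")
s1.next, s2.prev, s2.next, s3.prev = s2, s1, s3, s2
s1.update(0, "x=1")            # reaches the tail; no ack sent yet
assert 0 in s1.pending         # S1 must keep the update until acked
# middle node S2 fails: the CFG connects S1 to S3, and S1 re-sends
s1.resend_to(s3)
# the tail acks; the ack wave frees the saved copies up the chain
s3.ack(0)
assert s1.pending == {} and s3.pending == {}
```

Re-delivery is harmless here because updates are keyed by sequence number, so the new successor simply ignores (overwrites) duplicates it already holds.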
Comparison
- p/b versus chain replication?
- chain (or p/b) versus Raft/Paxos/Zab (quorum)?
Sharding
- with too much data, sharding is needed
- a better plan ("rndpar" in Section 5.4):
  - split data into many more shards than servers
    (so each shard is much smaller than in the previous arrangement)
  - each server is a replica in many shard groups:
      shard A: S1 S2 S3
      shard B: S2 S3 S1
      shard C: S3 S1 S2
    (this is a regular arrangement, but in general it would be random)
  - for p/b, a server is the primary in some groups and a backup in others
  - for chain, a server is the head in some, the tail in others, and a middle node in the rest
  - now request-processing work is likely to be more balanced
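The rotated placement above can be generated mechanically; this sketch uses the regular (round-robin) arrangement from the example rather than the random one, and the function name is illustrative.

```python
# Sketch of the "rndpar"-style placement: many small shards, each
# replicated on a chain of servers, with the starting server rotated so
# head/middle/tail roles are spread evenly across servers.

def place_shards(servers, num_shards, replicas=3):
    """Assign each shard a chain of `replicas` servers, rotating the
    start position (regular arrangement; in general it would be random)."""
    chains = {}
    n = len(servers)
    for shard in range(num_shards):
        chains[shard] = [servers[(shard + i) % n] for i in range(replicas)]
    return chains

chains = place_shards(["S1", "S2", "S3"], num_shards=3)
assert chains == {0: ["S1", "S2", "S3"],   # shard A
                  1: ["S2", "S3", "S1"],   # shard B
                  2: ["S3", "S1", "S2"]}   # shard C
# each server is the head of exactly one chain and the tail of exactly one,
# so the head's ordering work and the tail's read work are balanced
heads = [c[0] for c in chains.values()]
tails = [c[-1] for c in chains.values()]
assert sorted(heads) == sorted(tails) == ["S1", "S2", "S3"]
```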
Summary
- Chain Replication is one of the clearest descriptions of a ROWA scheme
- it does a good job of balancing work
- it has a simple approach to re-syncing replicas after a failure
- influential: used in EBS, Ceph, Parameter Server, COPS, FAWN
- it's one of a number of designs (p/b, quorums) with different properties