CAP Theorem
- Consistency: all nodes see the same data at the same time
A service that is consistent operates fully or not at all.
- Availability: a guarantee that every request receives a response about whether it succeeded or failed
- Partition Tolerance: the system continues to operate despite arbitrary partitioning due to network failures
No set of failures less than total network failure is allowed to cause the system to respond incorrectly.
The three CAP properties cannot all be satisfied at the same time.
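Problems Dynamo faced in its design and the solutions it adopted
Excerpted from Yang Chuanhui (杨传辉), 《大规模分布式存储系统》
Problem | Technique adopted |
---|---|
Data distribution | Improved consistent hashing (virtual nodes) |
Replication protocol | Replicated-write protocol (tunable NWR parameters) |
Data conflict resolution | Vector clocks |
Temporary failure handling | Hinted handoff |
Recovery after permanent failure | Merkle hash tree |
Membership and failure detection | Gossip-based membership and failure detection protocol |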
DHT
(To be added once organized.)
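NWR strategy (quorum protocol)
NWR is a strategy for controlling the consistency level in a distributed storage system.
* N: the number of replicas of a given piece of data;
* W: the number of replicas that must be updated successfully when a data object is written;
* R: the number of replicas that must be read when a data object is read
* W+R>N : guarantees that a piece of data cannot be read and written by two different transactions at the same time, because every read quorum overlaps every write quorum
* W>N/2 : guarantees that two transactions cannot write the same piece of data concurrently, because any two write quorums overlap
A distributed system must not keep any piece of data on a single point. Once that one replica fails, the data may be damaged permanently. If N is set to 2, the failure of a single storage node leaves the data on a single point, so N>2 (that is, at least 3 replicas).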
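The conditions above can be checked mechanically for a given configuration. The following is a minimal sketch; the function name `check_nwr` and the keys of the returned dictionary are hypothetical and chosen only for illustration.

```python
from typing import Dict


def check_nwr(n: int, w: int, r: int) -> Dict[str, bool]:
    """Evaluate the NWR conditions for one (N, W, R) configuration."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    return {
        # W + R > N: every read quorum shares a replica with every write quorum.
        "read_write_overlap": w + r > n,
        # W > N/2: any two write quorums share at least one replica.
        "write_write_overlap": 2 * w > n,
        # N > 2: losing one node still leaves more than one replica.
        "no_single_point": n > 2,
    }


strict = check_nwr(3, 2, 2)   # all three conditions hold
relaxed = check_nwr(3, 1, 1)  # quorums no longer overlap
```

With N=3, choosing W=2 and R=2 satisfies every condition, while W=1 and R=1 trades the overlap guarantees for lower latency.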
The following is compiled from Carnegie Mellon University (CMU) course slides.
Vector Clock
Lamport’s Logical Clock
happened-before relation
- if a and b are events in the same process, and a occurs before b, then a->b is true
- if a is an event of message m being sent by a process, and b is the event of m being received by another process, then a->b
happened-before relation is transitive
if a->b and b->c, then a->c
property of logical clock
- if two events a and b occur within the same process and a->b, then the logical time values C(a) and C(b) assigned to them satisfy C(a) < C(b)
- the clock time C must always go forward, and never backward
Lamport’s clock algorithm
- when a message is being sent: each message carries a timestamp according to the sender’s logical clock
- when a message is received: if the receiver’s logical clock is less than the sending time carried in the message, then adjust the receiver’s clock such that
currentTime = timestamp + 1
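The send and receive rules above can be sketched as a small class. The class name `LamportClock` and its method names are hypothetical. Two details are assumptions not stated in the notes: the sender advances its clock before stamping a message, and a receive that needs no adjustment still advances the clock by one so that the receive event is later than the send event.

```python
class LamportClock:
    """Logical clock of a single process (illustrative sketch)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Advance the clock and return the timestamp the message carries."""
        return self.tick()

    def receive(self, timestamp: int) -> int:
        """Apply the receive rule from the notes."""
        if self.time < timestamp:
            # Receiver is behind the sender: jump past the message timestamp.
            self.time = timestamp + 1
        else:
            # Assumption: the receive itself counts as an event.
            self.time += 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()
p.tick()
ts = p.send()   # p's clock is now 3
q.receive(ts)   # q was at 0, so it jumps to 4
```

Both branches of `receive` give the same result as setting the clock to `max(current, timestamp) + 1`, so the clock never moves backward.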
Vector clock
Lamport’s clock cannot guarantee perfect ordering of events by just observing the time values of two arbitrary events
definition
- vector clocks were proposed to overcome the limitation of Lamport’s clock (i.e., C(a) < C(b) doesn’t mean that a->b)
- a vector clock for a system of N processes is an array of N integers
- every process Pi stores its own vector clock VCi
- Lamport’s time values for events are stored in VCi; VCi(a) is the vector assigned to an event a
- VCi(a) < VCi(b) ==> a->b, where < means that every entry of VCi(a) is less than or equal to the corresponding entry of VCi(b) and at least one entry is strictly less
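Comparing two vector timestamps can be sketched as follows, assuming the usual componentwise reading of `<` (every entry less than or equal, at least one strictly less). The function names `vc_less` and `concurrent` are hypothetical.

```python
from typing import Sequence


def vc_less(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True if vector a < vector b in the componentwise sense."""
    if len(a) != len(b):
        raise ValueError("vector clocks must have the same length")
    pairs = list(zip(a, b))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def concurrent(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True if neither vector is less than the other and they differ."""
    return tuple(a) != tuple(b) and not vc_less(a, b) and not vc_less(b, a)


ordered = vc_less([1, 0, 0], [2, 1, 0])       # first event happened before second
unordered = concurrent([1, 0, 0], [0, 1, 0])  # neither event precedes the other
```

Unlike Lamport timestamps, two vectors can be incomparable, which is how concurrent events are told apart from causally ordered ones.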
update algorithm
- whenever there is a new event at Pi, increment VCi[i]
- when a process Pi sends a message m to Pj:
  - increment VCi[i]
  - set m’s timestamp ts(m) to the vector VCi
- when message m is received by process Pj:
  - for k in ts(m):
    VCj[k] = max(VCj[k], ts(m)[k]);
  - increment VCj[j]
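The update rules can be sketched as a small class. The class name `VectorClock` and its method names are hypothetical; processes are assumed to be numbered 0 to N-1, and the receive step takes the entry-wise maximum of the receiver's own vector and the message timestamp before incrementing the receiver's own entry.

```python
from typing import List, Sequence


class VectorClock:
    """Vector clock of process `pid` in a system of `n` processes (sketch)."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.vc: List[int] = [0] * n

    def local_event(self) -> None:
        """New event at this process: increment its own entry."""
        self.vc[self.pid] += 1

    def send(self) -> List[int]:
        """Increment the own entry and return a copy to use as ts(m)."""
        self.vc[self.pid] += 1
        return list(self.vc)

    def receive(self, ts: Sequence[int]) -> None:
        """Merge ts(m) entry by entry, then increment the own entry."""
        for k, t in enumerate(ts):
            self.vc[k] = max(self.vc[k], t)
        self.vc[self.pid] += 1


p0, p1 = VectorClock(0, 3), VectorClock(1, 3)
p0.local_event()   # p0.vc == [1, 0, 0]
ts = p0.send()     # ts == [2, 0, 0]
p1.receive(ts)     # p1.vc == [2, 1, 0]
```

Returning a copy from `send` keeps later updates at the sender from changing a timestamp that is already in flight.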
causal communication
to enforce causally-ordered multicasting, the delivery of a message m sent from Pi to Pj can be delayed until the following two conditions are met:
* ts(m)[i] = VCj[i] + 1
* ts(m)[k] <= VCj[k] for all k in ts(m) with k != i
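The two delivery conditions translate directly into a predicate. This is a minimal sketch; the function name `can_deliver` and its parameter names are hypothetical, and processes are assumed to be numbered from 0.

```python
from typing import Sequence


def can_deliver(ts: Sequence[int], vc_j: Sequence[int], i: int) -> bool:
    """Check whether Pj may deliver a message with timestamp ts sent by Pi."""
    if len(ts) != len(vc_j):
        raise ValueError("timestamp and local clock must have the same length")
    # Condition 1: m is the next message Pj expects from Pi.
    next_from_sender = ts[i] == vc_j[i] + 1
    # Condition 2: Pj has already seen everything Pi had seen from the others.
    dependencies_seen = all(
        ts[k] <= vc_j[k] for k in range(len(ts)) if k != i
    )
    return next_from_sender and dependencies_seen


ready = can_deliver([1, 0, 0], [0, 0, 0], 0)    # next message from P0, no dependencies
blocked = can_deliver([1, 1, 0], [0, 0, 0], 1)  # depends on a P0 message not yet seen
```

A message that fails the check would be held in a buffer and re-checked whenever the local clock changes.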
Merkle tree
A Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its child nodes, and every leaf is labelled with the hash of its own value.
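Computing the root label of such a tree can be sketched as follows. Several choices here are assumptions made only for illustration: SHA-256 as the hash function, a binary tree, and duplicating the last node when a level has an odd number of nodes. The names `merkle_root` and `_hash` are hypothetical.

```python
import hashlib
from typing import List


def _hash(data: bytes) -> bytes:
    """Hash helper (assumption: SHA-256)."""
    return hashlib.sha256(data).digest()


def merkle_root(values: List[bytes]) -> bytes:
    """Return the root label of a binary Merkle tree over `values`."""
    if not values:
        raise ValueError("at least one value is required")
    # Leaves are labelled with the hash of their values.
    level = [_hash(v) for v in values]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            left = level[i]
            # Assumption: an unpaired last node is paired with itself.
            right = level[i + 1] if i + 1 < len(level) else left
            # A non-leaf node is labelled with the hash of its children's labels.
            parents.append(_hash(left + right))
        level = parents
    return level[0]


same = merkle_root([b"k1=v1", b"k2=v2"]) == merkle_root([b"k1=v1", b"k2=v2"])
changed = merkle_root([b"k1=v1", b"k2=v2"]) != merkle_root([b"k1=v1", b"k2=XX"])
```

Because a change in any leaf changes the root, two replicas can compare root labels first and descend into subtrees only when the labels differ.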
(To be added once organized.)