Overview
These are course notes for Lecture 3 of the MIT 6.824 Distributed Systems course.
Distributed Storage Systems
Difficulties and Challenges
Building a good distributed storage system raises many challenges, chiefly the following chain of consequences.
- high performance → shard data over many servers
- many servers → constant faults
- fault tolerance → replication
- replication → potential inconsistencies
- better consistency → low performance
As the chain above shows, high performance and better consistency pull in opposite directions in a distributed storage system; when designing a system we must trade one off against the other according to the actual requirements.
Consistency
Strong Consistency
Ideally, under strong consistency, multiple servers behave exactly as a single server would. The following simple example illustrates what strong consistency means.
Suppose a single server S1 uses its local disk as storage and processes only one request at a time. Clients C1 and C2 concurrently write data to S1, and clients C3 and C4 read from S1 after both writes have completed.
C1: Write X 1
C2: Write X 2
C3: Read X
C4: Read X
Whether C1's write arrives first or C2's does, C3 and C4 read the same value (either 1 or 2). That is strong consistency. A single server guarantees strong consistency, but its fault tolerance is poor.
A Broken Multi-Server Model
Here is a multi-server replication model that is easy to build but flawed.
Suppose we create a server S2 as a replica of S1, and C1 and C2 send each write to both servers so that the data is updated on both. Because the two servers may process the write requests in different orders, their data can diverge: say S1 processes C1's request first while S2 processes C2's first, leaving X = 2 on S1 and X = 1 on S2. If C3 reads from S1 it sees X = 2, while C4 reading from S2 sees X = 1, so strong consistency is violated!
GFS
Features
The GFS file system has the following characteristics.
- automatic sharding of each file over many servers
- just one datacenter per deployment
- just google internal use
- aimed at sequential access to large files, write or read, not random
- successful use of weak consistency
- successful use of single master
Master
The master's storage spans both RAM and disk. The data structures kept in RAM are:
- file name → array of chunk handles (non-volatile)
- chunk handle → version # (non-volatile); list of chunkservers (volatile); primary (volatile); lease time (volatile)
Changes to the non-volatile data must be written to the log before they commit; the remaining data need not be persisted.
The data structures stored on disk are:
- log
- checkpoint
The log records mutations to the non-volatile data, and a checkpoint captures the master's state; together they let the master recover quickly after a crash.
Read and Write Operations
Reads
A client read is handled as follows:
- C sends filename and offset to master M (if not cached)
- M finds chunk handle for that offset
- M replies with list of chunkservers (only those with latest version)
- C caches handle + chunkserver list
- C sends request to nearest chunkserver (chunk handle, offset)
- chunkserver reads from chunk file on disk, returns
Writes
A client write is handled as follows:
- C asks M about file’s last chunk
- if M sees chunk has no primary (or lease expired):
  - if no chunkserver has the latest version #, error
  - pick primary P and secondaries from those with the latest version #
  - increment version #, write to log on disk
  - tell P and secondaries who they are, and the new version #
  - replicas write new version # to disk
- M tells C the primary and secondaries
- C sends data to all (just temporary…), waits
- C tells P to append
- P checks that lease hasn’t expired, and chunk has space
- P picks an offset (at end of chunk)
- P writes chunk file (a Linux file)
- P tells each secondary the offset, tells to append to chunk file
- P waits for all secondaries to reply, or timeout; a secondary can reply "error", e.g. out of disk space
- P tells C “ok” or “error”
- C retries from start if error