Overview
These are course notes for Lecture 3 of the MIT 6.824 Distributed Systems course.
Distributed Storage Systems
Difficulties and Challenges
Building a good distributed storage system raises many challenges, chiefly the following chain of consequences.
- high performance → shard data over many servers
- many servers → constant faults
- fault tolerance → replication
- replication → potential inconsistencies
- better consistency → low performance
As the chain above shows, high performance and better consistency pull in opposite directions in a distributed storage system; when designing a system we must trade one off against the other according to the actual requirements.
Consistency
Strong Consistency
Ideally, under strong consistency, multiple servers behave exactly as a single server would. The following simple example illustrates what strong consistency means.
Suppose a single server S1 uses its local disk as storage and processes only one request at a time. Clients C1 and C2 concurrently write data to S1, and clients C3 and C4 read from S1 after both writes have completed.
C1: Write X 1
C2: Write X 2
C3: Read X
C4: Read X
Whether C1's write arrives first or C2's does, C3 and C4 read the same value (either 1 or 2). That is strong consistency. A single server guarantees strong consistency, but its fault tolerance is poor.
A Broken Multi-Server Model
Here is a multi-server replication model that is easy to build but flawed.
Suppose we create a server S2 as a replica of S1, and C1 and C2 send each write to both servers so that the data is updated on both. Because the two servers may process the write requests in different orders, their data can diverge: say S1 processes C1's request first while S2 processes C2's first, leaving X = 2 on S1 and X = 1 on S2. If C3 reads from S1 it sees X = 2, while C4 reading from S2 sees X = 1, so strong consistency is violated!
GFS
Features
The GFS file system has the following characteristics.
- automatic sharding of each file over many servers
- just one datacenter per deployment
- just google internal use
- aimed at sequential access to large files, write or read, not random
- successful use of weak consistency
- successful use of single master
Master
The master's storage spans both RAM and disk. The data structures kept in RAM are:
- file name → array of chunk handles (non-volatile)
- chunk handle → version # (non-volatile); list of chunkservers (volatile); primary (volatile); lease time (volatile)
Changes to the non-volatile data must be written to the log before they commit; the remaining data need not be persisted.
The data structures stored on disk are:
- log
- checkpoint
The log records mutations to the non-volatile data, and a checkpoint captures the master's state; together they let the master recover quickly after a crash.
Read and Write Operations
Reads
A client read is handled as follows:
- C sends filename and offset to master M (if not cached)
- M finds chunk handle for that offset
- M replies with list of chunkservers (only those with latest version)
- C caches handle + chunkserver list
- C sends request to nearest chunkserver (chunk handle, offset)
- chunkserver reads from chunk file on disk, returns
Writes
A client write is handled as follows:
- C asks M about file’s last chunk
- if M sees chunk has no primary (or lease expired):
  - if no chunkserver has the latest version #, error
  - pick primary P and secondaries from those with the latest version #
  - increment version #, write to log on disk
  - tell P and secondaries who they are, and the new version #
  - replicas write new version # to disk
- M tells C the primary and secondaries
- C sends data to all (just temporary…), waits
- C tells P to append
- P checks that lease hasn’t expired, and chunk has space
- P picks an offset (at end of chunk)
- P writes chunk file (a Linux file)
- P tells each secondary the offset, tells to append to chunk file
- P waits for all secondaries to reply, or timeout; a secondary can reply "error", e.g. out of disk space
- P tells C “ok” or “error”
- C retries from start if error