The Google File System, Part 4: MASTER OPERATION

4. MASTER OPERATION
The master executes all namespace operations. 
In addition, it manages chunk replicas throughout the system: 
it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage. 
We now discuss each of these topics.

4.1 Namespace Management and Locking
Many master operations can take a long time: 
for example, a snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot. 
We do not want to delay other master operations while they are running. 
Therefore, we allow multiple operations to be active and use locks over regions of the namespace to ensure proper serialization.
Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. 
Nor does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms). 
GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. 
With prefix compression, this table can be efficiently represented in memory. 
Each node in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock.

Each master operation acquires a set of locks before it runs. 
Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf. 
Note that leaf may be a file or directory depending on the operation.
We now illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. 
The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. 
The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. 
The two operations will be serialized properly because they try to obtain conflicting locks on /home/user. 
File creation does not require a write lock on the parent directory because there is no “directory”, or inode-like, data structure to be protected from modification.
The read lock on the name is sufficient to protect the parent directory from deletion.
One nice property of this locking scheme is that it allows concurrent mutations in the same directory. 
For example, multiple file creations can be executed concurrently in the same directory: 
each acquires a read lock on the directory name and a write lock on the file name. 
The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. 
The write locks on file names serialize attempts to create a file with the same name twice.
Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are not in use. 
Also, locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.
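To make the lock discipline concrete, here is a minimal Python sketch (not from the paper; the helper names lock_set and conflicts are invented for illustration) of how an operation's lock set could be computed, ordered for deadlock avoidance, and checked against the snapshot/creation example above:

```python
def lock_set(path, write_leaf):
    """Locks one master operation needs for `path`: read locks on every
    ancestor name, plus a read or write lock on the full pathname itself."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    locks = [(p, "r") for p in ancestors] + [(path, "w" if write_leaf else "r")]
    # Deadlock avoidance: acquire in a consistent total order --
    # first by level in the namespace tree, then lexicographically.
    return sorted(locks, key=lambda nm: (nm[0].count("/"), nm[0]))

def conflicts(set_a, set_b):
    """Two operations conflict if they lock the same node and at least one lock is a write."""
    modes = dict(set_a)
    return any(name in modes and "w" in (mode, modes[name]) for name, mode in set_b)

# Snapshotting /home/user to /save/user vs. creating /home/user/foo:
snapshot = lock_set("/home/user", True) + lock_set("/save/user", True)
create = lock_set("/home/user/foo", True)
print(conflicts(snapshot, create))  # True -- both need /home/user, one as a write lock
```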

4.2 Replica Placement
A GFS cluster is highly distributed at more levels than one. 
It typically has hundreds of chunkservers spread across many machine racks. 
These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. 
Communication between two machines on different racks may cross one or more network switches. 
Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack.
Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability.
The chunk replica placement policy serves two purposes:
maximize data reliability and availability, and maximize network bandwidth utilization. 
For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine’s network bandwidth. 
We must also spread chunk replicas across racks. 
This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). 
It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. 
On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.
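As a rough illustration of the rack-spreading policy (hypothetical code, not GFS's; the data layout and function name are assumptions), a placement pass might take one chunkserver per rack before reusing any rack, preferring lightly loaded disks within each rack:

```python
from collections import defaultdict

def place_across_racks(chunkservers, replicas=3):
    """Pick `replicas` chunkservers for a new chunk, spreading across racks first.
    `chunkservers` is a list of dicts: {"id": ..., "rack": ..., "disk_util": ...}."""
    by_rack = defaultdict(list)
    for cs in chunkservers:
        by_rack[cs["rack"]].append(cs)
    for rack in by_rack.values():              # lightly used disks first within a rack
        rack.sort(key=lambda cs: cs["disk_util"])
    chosen = []
    while len(chosen) < replicas and any(by_rack.values()):
        # One replica per rack per round; later rounds reuse racks only if needed.
        for rack in sorted(by_rack, key=lambda r: len(by_rack[r]), reverse=True):
            if by_rack[rack] and len(chosen) < replicas:
                chosen.append(by_rack[rack].pop(0)["id"])
    return chosen

servers = [
    {"id": "cs1", "rack": "r1", "disk_util": 0.40},
    {"id": "cs2", "rack": "r1", "disk_util": 0.20},
    {"id": "cs3", "rack": "r2", "disk_util": 0.70},
    {"id": "cs4", "rack": "r3", "disk_util": 0.50},
]
print(place_across_racks(servers))  # ['cs2', 'cs3', 'cs4'] -- three distinct racks
```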

4.3 Creation, Re-replication, Rebalancing
Chunk replicas are created for three reasons: 
chunk creation, re-replication, and rebalancing.
When the master creates a chunk, it chooses where to place the initially empty replicas. 
It considers several factors. 
(1) We want to place new replicas on chunkservers with below-average disk space utilization. 
    Over time this will equalize disk utilization across chunkservers. 
(2) We want to limit the number of “recent” creations on each chunkserver.
Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. 
(3) As discussed above, we want to spread replicas of a chunk across racks.
The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. 
This could happen for various reasons: 
a chunkserver becomes unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. 
Each chunk that needs to be re-replicated is prioritized based on several factors. 
One is how far it is from its replication goal. 
For example, we give higher priority to a chunk that has lost two replicas than to a chunk that has lost only one. 
In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files (see Section 4.4). 
Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.
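The paper names these factors but not their weights; the following is a hedged sketch, with invented weights, of how such a priority might be scored:

```python
def rereplication_priority(goal, live_replicas, file_is_live, blocks_client):
    """Order chunks for re-replication: farthest below its replication goal first,
    then chunks of live files before those of recently deleted files, and anything
    blocking client progress gets a large boost. Higher value = re-replicate sooner."""
    missing = max(0, goal - live_replicas)   # e.g. lost two replicas > lost one
    priority = missing * 10
    if file_is_live:
        priority += 5
    if blocks_client:
        priority += 100
    return priority

# A chunk down to one replica of three, in a live file, stalling a client,
# outranks a chunk that merely lost one replica of a recently deleted file.
print(rereplication_priority(3, 1, True, True))    # 125
print(rereplication_priority(3, 2, False, False))  # 10
```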

The master picks the highest priority chunk and “clones” it by instructing some chunkserver to copy the chunk data directly from an existing valid replica. 
The new replica is placed with goals similar to those for creation: 
equalizing disk space utilization, limiting active clone operations on any single chunkserver, and spreading replicas across racks.
To keep cloning traffic from overwhelming client traffic, the master limits the numbers of active clone operations both for the cluster and for each chunkserver. 
Additionally, each chunkserver limits the amount of bandwidth it spends on each clone operation by throttling its read requests to the source chunkserver.
Finally, the master rebalances replicas periodically: 
it examines the current replica distribution and moves replicas for better disk space and load balancing. 
Also through this process, the master gradually fills up a new chunkserver rather than instantly swamps it with new chunks and the heavy write traffic that comes with them. 
The placement criteria for the new replica are similar to those discussed above. 
In addition, the master must also choose which existing replica to remove. 
In general, it prefers to remove those on chunkservers with below-average free space so as to equalize disk space usage.
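A toy sketch of the clone admission limits described above (the actual limits and bookkeeping in GFS are not published; everything here is illustrative):

```python
class CloneLimiter:
    """Toy admission control for clone (re-replication) traffic: the master caps
    active clones both cluster-wide and per chunkserver so that cloning cannot
    overwhelm client traffic. The limit values are made-up numbers."""
    def __init__(self, cluster_max=50, per_server_max=2):
        self.cluster_max = cluster_max
        self.per_server_max = per_server_max
        self.active = 0
        self.per_server = {}

    def try_start(self, src, dst):
        busy = lambda s: self.per_server.get(s, 0) >= self.per_server_max
        if self.active >= self.cluster_max or busy(src) or busy(dst):
            return False                     # defer this clone; retry later
        self.active += 1
        for s in (src, dst):
            self.per_server[s] = self.per_server.get(s, 0) + 1
        return True

    def finish(self, src, dst):
        self.active -= 1
        for s in (src, dst):
            self.per_server[s] -= 1
```

The per-operation bandwidth throttling mentioned above would live on the chunkserver itself, which rates-limits its read requests to the source replica, and is not modeled here.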

4.4 Garbage Collection
After a file is deleted, GFS does not immediately reclaim the available physical storage. 
It does so only lazily during regular garbage collection at both the file and chunk levels.
We find that this approach makes the system much simpler and more reliable.

4.4.1 Mechanism
When a file is deleted by the application, the master logs the deletion immediately just like other changes. 
However, instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp. 
During the master’s regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable).
Until then, the file can still be read under the new, special name and can be undeleted by renaming it back to normal.
When the hidden file is removed from the namespace, its in-memory metadata is erased. 
This effectively severs its links to all its chunks.
In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks. 
In a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master’s metadata. 
The chunkserver is free to delete its replicas of such chunks.
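A minimal master-side sketch of this mechanism (hypothetical data structures; the three-day window is the configurable default mentioned above):

```python
import time

RETENTION_SECS = 3 * 24 * 3600   # hidden files older than this are reclaimed

files = {}       # pathname -> list of chunk handles (file-to-chunk mapping)
chunks = set()   # chunk handles the master still has metadata for

def delete_file(path):
    """Deletion is logged, then the file is just renamed to a hidden name
    carrying the deletion timestamp; it can still be read or undeleted."""
    hidden = f"{path}.__deleted__.{int(time.time())}"
    files[hidden] = files.pop(path)

def scan_namespace(now=None):
    """Regular namespace scan: drop hidden files past the retention window,
    which severs their links to their chunks."""
    now = now or time.time()
    for name in [n for n in files if ".__deleted__." in n]:
        if now - int(name.rsplit(".", 1)[1]) > RETENTION_SECS:
            del files[name]

def scan_chunks():
    """Regular chunk scan: erase metadata for orphaned chunks (reachable from no file)."""
    reachable = {h for handles in files.values() for h in handles}
    chunks.intersection_update(reachable)

def heartbeat_reply(reported):
    """Given a chunkserver's reported subset of chunks, reply with the ones the
    master no longer knows about; the chunkserver is free to delete those replicas."""
    return [h for h in reported if h not in chunks]
```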

4.4.2 Discussion
Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, it is quite simple in our case. 
We can easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master.
We can also easily identify all the chunk replicas: 
they are Linux files under designated directories on each chunkserver.
Any such replica not known to the master is “garbage.”

The garbage collection approach to storage reclamation offers several advantages over eager deletion. 
First, it is simple and reliable in a large-scale distributed system where component failures are common. 
Chunk creation may succeed on some chunkservers but not others, leaving replicas that the master does not know exist. 
Replica deletion messages may be lost, and the master has to remember to resend them across failures, both its own and the chunkserver’s.
Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. 
Second, it merges storage reclamation into the regular background activities of the master, such as the regular scans of namespaces and handshakes with chunkservers. 
Thus, it is done in batches and the cost is amortized. 
Moreover, it is done only when the master is relatively free. 
The master can respond more promptly to client requests that demand timely attention. 
Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion.
In our experience, the main disadvantage is that the delay sometimes hinders user effort to fine tune usage when storage is tight. 
Applications that repeatedly create and delete temporary files may not be able to reuse the storage right away. 
We address these issues by expediting storage reclamation if a deleted file is explicitly deleted again. 
We also allow users to apply different replication and reclamation policies to different parts of the namespace. 
For example, users can specify that all the chunks in the files within some directory tree are to be stored without replication, and any deleted files are immediately and irrevocably removed from the file system state.

4.5 Stale Replica Detection
Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down. 
For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.
Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas. 
The master and these replicas all record the new version number in their persistent state. 
This occurs before any client is notified and therefore before it can start writing to the chunk. 
If another replica is currently unavailable, its chunk version number will not be advanced. 
The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks and their associated version numbers. 
If the master sees a version number greater than the one in its records, the master assumes that it failed when granting the lease and so takes the higher version to be up-to-date.
The master removes stale replicas in its regular garbage collection. 
Before that, it effectively considers a stale replica not to exist at all when it replies to client requests for chunk information. 
As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver to read the chunk from another chunkserver in a cloning operation. 
The client or the chunkserver verifies the version number when it performs the operation so that it is always accessing up-to-date data.
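A small sketch of the version-number bookkeeping (illustrative only; field and method names are invented):

```python
class ChunkRecord:
    """Master-side record for one chunk: its current version and known replicas."""
    def __init__(self, version=1):
        self.version = version
        self.replica_versions = {}        # chunkserver id -> version it holds

    def grant_lease(self, up_to_date_servers):
        """Bump the version and inform the up-to-date replicas before any client
        is notified of the lease (both sides would persist the new number)."""
        self.version += 1
        for cs in up_to_date_servers:
            self.replica_versions[cs] = self.version

    def report(self, cs, reported_version):
        """Handle a chunkserver's restart report of the version it holds."""
        if reported_version > self.version:
            # The master must have failed while granting the lease;
            # take the higher version to be up-to-date.
            self.version = reported_version
        self.replica_versions[cs] = reported_version

    def stale_replicas(self):
        """Replicas behind the current version; these are removed in regular garbage
        collection and never returned to clients asking for chunk locations."""
        return [cs for cs, v in self.replica_versions.items() if v < self.version]
```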
