The Google File System, Part 5: FAULT TOLERANCE AND DIAGNOSIS

5. FAULT TOLERANCE AND DIAGNOSIS
One of our greatest challenges in designing the system is dealing with frequent component failures. 
The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. 
Component failures can result in an unavailable system or, worse, corrupted data. 
We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.

5.1 High Availability
Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. 
We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

5.1.1 Fast Recovery
Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. 
In fact, we do not distinguish between normal and abnormal termination; 
servers are routinely shut down just by killing the process. 
Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry. 
Section 6.2.2 reports observed startup times.
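
The client-side behavior described here is essentially a timeout-and-retry loop around each outstanding request. Below is a minimal sketch in Go of that loop; the names (`callServer`, `doWithRetry`) and the retry and backoff parameters are hypothetical, not part of GFS.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callServer is a stand-in for issuing one RPC to a chunkserver or the master.
// Here it fails while the server is "restarting" and succeeds once it is back up.
func callServer(addr string, attempt int) (string, error) {
	if attempt < 3 {
		return "", errors.New("connection refused: server restarting")
	}
	return "ok", nil
}

// doWithRetry hides a server restart from the caller: it times out,
// reconnects, and retries until the request succeeds or retries run out.
func doWithRetry(addr string, maxRetries int, backoff time.Duration) (string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		reply, err := callServer(addr, attempt)
		if err == nil {
			return reply, nil
		}
		lastErr = err
		time.Sleep(backoff) // the "minor hiccup" a client experiences
	}
	return "", fmt.Errorf("giving up after %d attempts: %w", maxRetries, lastErr)
}

func main() {
	reply, err := doWithRetry("chunkserver-17:7105", 5, 100*time.Millisecond)
	fmt.Println(reply, err)
}
```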

5.1.2 Chunk Replication
As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. 
Users can specify different replication levels for different parts of the file namespace.
The default is three. 
The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification (see Section 5.2). 
Although replication has served us well, we are exploring other forms of cross-server redundancy such as parity or erasure codes for our increasing read-only storage requirements. 
We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.
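
The master's cloning decision boils down to comparing each chunk's count of live replicas against its replication goal. The following is a simplified, hypothetical sketch of that bookkeeping; the `chunkState` structure and its fields are illustrative, not the master's actual data structures.

```go
package main

import "fmt"

// chunkState is a hypothetical, simplified view of what the master tracks per chunk.
type chunkState struct {
	chunkHandle string
	replicas    []string // chunkservers currently holding a valid replica
	goal        int      // user-specified replication level, default 3
}

// needsCloning returns the chunks whose live replica count has fallen below
// their goal, e.g. because a chunkserver went offline or a replica failed
// checksum verification and was discarded.
func needsCloning(chunks []chunkState) []chunkState {
	var deficient []chunkState
	for _, c := range chunks {
		if len(c.replicas) < c.goal {
			deficient = append(deficient, c)
		}
	}
	return deficient
}

func main() {
	chunks := []chunkState{
		{"chunk-a1", []string{"cs1", "cs2", "cs3"}, 3}, // fully replicated
		{"chunk-b7", []string{"cs2"}, 3},               // two replicas lost
	}
	for _, c := range needsCloning(chunks) {
		fmt.Printf("clone %s: %d/%d replicas\n", c.chunkHandle, len(c.replicas), c.goal)
	}
}
```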

5.1.3 Master Replication
The master state is replicated for reliability. 
Its operation log and checkpoints are replicated on multiple machines. 
A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas. 
For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection that change the system internally.
When it fails, it can restart almost instantly. 
If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. 
Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.
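
The commit rule can be read as: a metadata mutation is acknowledged only once its log record is durable on the local disk and on every master replica. A minimal sketch under that reading follows; `flushLocal` and `flushReplica` are hypothetical placeholders for the real log-flushing code.

```go
package main

import "fmt"

// logRecord is a hypothetical operation-log entry for one metadata mutation.
type logRecord struct {
	seq int64
	op  string
}

// flushLocal and flushReplica stand in for writing the record durably to the
// local disk and to one master replica's disk; both must succeed before commit.
func flushLocal(r logRecord) error                 { return nil }
func flushReplica(addr string, r logRecord) error  { return nil }

// commitMutation follows the rule above: the mutation is considered committed
// only after its log record has been flushed locally and on all master replicas.
func commitMutation(r logRecord, replicas []string) error {
	if err := flushLocal(r); err != nil {
		return fmt.Errorf("local flush failed: %w", err)
	}
	for _, addr := range replicas {
		if err := flushReplica(addr, r); err != nil {
			return fmt.Errorf("flush to %s failed: %w", addr, err)
		}
	}
	// Only now may the master apply the mutation to its in-memory state
	// and reply to the client.
	return nil
}

func main() {
	err := commitMutation(logRecord{seq: 42, op: "CREATE /foo/bar"},
		[]string{"master-replica-1", "master-replica-2"})
	fmt.Println("committed:", err == nil)
}
```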

Moreover, “shadow” masters provide read-only access to the file system even when the primary master is down. 
They are shadows, not mirrors, in that they may lag the primary slightly, typically fractions of a second. 
They enhance read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results.
In fact, since file content is read from chunkservers, applications do not observe stale file content. 
What could be stale within short windows is file metadata, like directory contents or access control information.

To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does.
Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. 
It depends on the primary master only for replica location updates resulting from the primary’s decisions to create and delete replicas.
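
In other words, a shadow master is a consumer of the operation log: it tails a replica of the log and applies each record in order, exactly as the primary did. A deliberately simplified, hypothetical sketch of that replay loop:

```go
package main

import "fmt"

// namespace is a hypothetical, drastically simplified stand-in for the
// master's metadata: a map from file path to chunk handles.
type namespace map[string][]string

// apply replays one operation-log record against the in-memory metadata,
// in the same order the primary applied it.
func (ns namespace) apply(op, path, chunk string) {
	switch op {
	case "CREATE":
		ns[path] = nil
	case "ADD_CHUNK":
		ns[path] = append(ns[path], chunk)
	case "DELETE":
		delete(ns, path)
	}
}

func main() {
	shadow := namespace{}
	// A shadow master reads these records from a replica of the growing
	// operation log; here they are hard-coded for illustration.
	log := [][3]string{
		{"CREATE", "/logs/web-0001", ""},
		{"ADD_CHUNK", "/logs/web-0001", "chunk-a1"},
		{"ADD_CHUNK", "/logs/web-0001", "chunk-b7"},
	}
	for _, rec := range log {
		shadow.apply(rec[0], rec[1], rec[2])
	}
	fmt.Println(shadow) // read-only metadata, possibly a fraction of a second stale
}
```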

5.2 Data Integrity
Each chunkserver uses checksumming to detect corruption of stored data. 
Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) 
We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. 
Moreover, divergent replicas may be legal: 
the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. 
Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks. 
Each has a corresponding 32 bit checksum. 
Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.
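
Concretely, a 64 MB chunk has up to 1024 such 64 KB checksum blocks. The sketch below builds that per-block checksum table; using CRC-32 as the 32 bit checksum is an assumption for illustration, since the paper does not name the algorithm.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const checksumBlockSize = 64 * 1024 // 64 KB checksum blocks

// buildChecksums computes one 32-bit checksum per 64 KB block of a chunk.
// The chunkserver keeps this table in memory and logs it persistently,
// separate from user data.
func buildChecksums(chunk []byte) []uint32 {
	var sums []uint32
	for off := 0; off < len(chunk); off += checksumBlockSize {
		end := off + checksumBlockSize
		if end > len(chunk) {
			end = len(chunk) // the last block may be partial
		}
		sums = append(sums, crc32.ChecksumIEEE(chunk[off:end]))
	}
	return sums
}

func main() {
	chunk := make([]byte, 3*checksumBlockSize+100) // three full blocks plus a partial one
	sums := buildChecksums(chunk)
	fmt.Printf("%d checksum blocks, e.g. block 0 = %08x\n", len(sums), sums[0])
}
```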

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver.
Therefore chunkservers will not propagate corruptions to other machines. 
If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. 
In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. 
After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.
Checksumming has little effect on read performance for several reasons. 
Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. 
GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. 
Moreover, checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/Os.
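
On the read path, only the checksum blocks that overlap the requested byte range are verified before data is returned. A hedged sketch of that check, reusing the hypothetical 64 KB block layout and CRC-32 checksums from the previous sketch:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const checksumBlockSize = 64 * 1024

// verifyRead checks every 64 KB checksum block that overlaps the requested
// range [offset, offset+length) before any data is returned; on a mismatch
// the chunkserver would return an error to the requestor and report the bad
// block to the master.
func verifyRead(chunk []byte, sums []uint32, offset, length int) ([]byte, error) {
	first := offset / checksumBlockSize
	last := (offset + length - 1) / checksumBlockSize
	for b := first; b <= last; b++ {
		start := b * checksumBlockSize
		end := start + checksumBlockSize
		if end > len(chunk) {
			end = len(chunk)
		}
		if crc32.ChecksumIEEE(chunk[start:end]) != sums[b] {
			return nil, fmt.Errorf("checksum mismatch in block %d", b)
		}
	}
	return chunk[offset : offset+length], nil
}

func main() {
	chunk := make([]byte, 4*checksumBlockSize)
	sums := make([]uint32, 4)
	for b := 0; b < 4; b++ {
		sums[b] = crc32.ChecksumIEEE(chunk[b*checksumBlockSize : (b+1)*checksumBlockSize])
	}
	data, err := verifyRead(chunk, sums, 100_000, 50_000) // spans checksum blocks 1 and 2
	fmt.Println(len(data), err)
}
```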

Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. 
We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled by the append. 
Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read.
In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. 
If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.
During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. 
Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. 
This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.
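
The append optimization amounts to extending the checksum of the last partial block incrementally and computing fresh checksums only for the brand-new blocks the append fills. A hypothetical sketch, again assuming CRC-32 over 64 KB blocks:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const checksumBlockSize = 64 * 1024

// appendWithChecksums updates the checksum table for a record append: the
// last partial block's checksum is extended incrementally, and brand-new
// blocks filled by the append get fresh checksums.
func appendWithChecksums(chunk []byte, sums []uint32, data []byte) ([]byte, []uint32) {
	oldLen := len(chunk)
	chunk = append(chunk, data...)

	// Incrementally extend the checksum of the last, partial block.
	if oldLen%checksumBlockSize != 0 {
		blk := oldLen / checksumBlockSize
		end := (blk + 1) * checksumBlockSize
		if end > len(chunk) {
			end = len(chunk)
		}
		sums[blk] = crc32.Update(sums[blk], crc32.IEEETable, chunk[oldLen:end])
		oldLen = end
	}
	// Compute new checksums for any brand-new blocks filled by the append.
	for off := oldLen; off < len(chunk); off += checksumBlockSize {
		end := off + checksumBlockSize
		if end > len(chunk) {
			end = len(chunk)
		}
		sums = append(sums, crc32.ChecksumIEEE(chunk[off:end]))
	}
	return chunk, sums
}

func main() {
	chunk := make([]byte, 70_000) // one full block plus a partial second block
	sums := []uint32{
		crc32.ChecksumIEEE(chunk[:checksumBlockSize]),
		crc32.ChecksumIEEE(chunk[checksumBlockSize:]),
	}
	chunk, sums = appendWithChecksums(chunk, sums, make([]byte, 100_000))
	fmt.Println(len(chunk), len(sums)) // 170000 bytes, 3 checksum blocks
}
```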

5.3 Diagnostic Tools
Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. 
Without logs, it is hard to understand transient, non-repeatable interactions between machines. 
GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. 
These diagnostic logs can be freely deleted without affecting the correctness of the system. 
However, we try to keep these logs around as far as space permits.
The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. 
By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. 
The logs also serve as traces for load testing and performance analysis.
The performance impact of logging is minimal (and far outweighed by the benefits) because these logs are written sequentially and asynchronously. 
The most recent events are also kept in memory and available for continuous online monitoring.
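
Reconstructing an interaction history is then largely a matter of merging RPC records collected from different machines and lining up requests with replies. A small, hypothetical sketch of that collation step; the record fields are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// rpcRecord is a hypothetical diagnostic-log entry; the real GFS logs record
// the exact requests and responses, minus the file data being read or written.
type rpcRecord struct {
	timeMicros int64
	machine    string
	rpcID      int64
	kind       string // "request" or "reply"
	summary    string
}

// reconstruct merges records from several machines' logs and orders them by
// time; the shared rpcID lets a request be lined up with its reply.
func reconstruct(logs ...[]rpcRecord) []string {
	var all []rpcRecord
	for _, l := range logs {
		all = append(all, l...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].timeMicros < all[j].timeMicros })
	var history []string
	for _, r := range all {
		history = append(history, fmt.Sprintf("%8d %-12s %s#%d %s",
			r.timeMicros, r.machine, r.kind, r.rpcID, r.summary))
	}
	return history
}

func main() {
	client := []rpcRecord{{100, "client-3", 7, "request", "Read(chunk-a1, off=0)"}}
	server := []rpcRecord{{180, "cs-17", 7, "reply", "Read OK, 64KB"}}
	for _, line := range reconstruct(client, server) {
		fmt.Println(line)
	}
}
```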
