The Google File System, Part 6: Measurements

In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.


6.1 Micro-benchmarks
We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. 
Note that this configuration was set up for ease of testing. 
Typical clusters have hundreds of chunkservers and hundreds of clients.
All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. 
All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. 
The two switches are connected with a 1 Gbps link.


6.1.1 Reads
N clients read simultaneously from the file system. 
Each client reads a randomly selected 4 MB region from a 320 GB file set. 
This is repeated 256 times so that each client ends up reading 1 GB of data. 
The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. 
Our results should be close to cold cache results.
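The cache bound follows directly from the sizes quoted above. A minimal sketch of that arithmetic, assuming the best case in which every byte of chunkserver memory caches file data (the variable names are illustrative and not from the paper):

```python
# Back-of-the-envelope check of the read benchmark setup.
# All figures come from the text; the names are illustrative only.
REGION_MB = 4            # size of each randomly selected read
READS_PER_CLIENT = 256   # repetitions per client
FILE_SET_GB = 320        # total size of the file set
TOTAL_MEMORY_GB = 32     # memory across all 16 chunkservers (16 x 2 GB)

# Total data each client reads over the whole run.
data_per_client_gb = REGION_MB * READS_PER_CLIENT / 1024

# Best case: all chunkserver memory holds distinct file data.
max_cache_hit_rate = TOTAL_MEMORY_GB / FILE_SET_GB

print(f"data read per client: {data_per_client_gb:.0f} GB")
print(f"upper bound on buffer cache hit rate: {max_cache_hit_rate:.0%}")
```

In practice the hit rate would be lower still, since memory is shared with everything else running on the chunkservers.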

Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. 

The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. 
The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. 
The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. 
The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunkserver.
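The theoretical read limit described above is simply the smaller of two bottlenecks. A short sketch, using only the bandwidth figures from the text (the function names are illustrative):

```python
# Theoretical aggregate read limit for n clients, per the topology in the text:
# each client has a 100 Mbps NIC (12.5 MB/s), and all read traffic crosses
# the single 1 Gbps link between the two switches (125 MB/s).
CLIENT_NIC_MBPS = 12.5
INTER_SWITCH_MBPS = 125.0


def read_limit(n_clients: int) -> float:
    """Aggregate read limit in MB/s: whichever bottleneck applies first."""
    return min(n_clients * CLIENT_NIC_MBPS, INTER_SWITCH_MBPS)


def efficiency(observed_mbps: float, n_clients: int) -> float:
    """Observed aggregate rate as a fraction of the theoretical limit."""
    return observed_mbps / read_limit(n_clients)


# Observed aggregate rates quoted in the text for 1 and 16 readers.
for n, observed in [(1, 10.0), (16, 94.0)]:
    print(n, read_limit(n), f"{efficiency(observed, n):.0%}")
```

With these numbers the inter-switch link becomes the binding limit at 10 clients, since 10 x 12.5 MB/s = 125 MB/s.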


6.1.2 Writes
N clients write simultaneously to N distinct files. 
Each client writes 1 GB of data to a new file in a series of 1 MB writes. 

The aggregate write rate and its theoretical limit are shown in Figure 3(b). 

The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s input connection.
The write rate for one client is 6.3 MB/s, about half of the limit. 
The main culprit for this is our network stack. 
It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. 
Delays in propagating data from one replica to another reduce the overall write rate.
Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. 
As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunkserver as the number of clients increases. 
Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas.
Writes are slower than we would like. 
In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.
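The 67 MB/s plateau can be reproduced from the replication factor and the chunkserver input bandwidth. The sketch below also caps each client at its own 12.5 MB/s network interface; that cap is an assumption inferred from the statement that 6.3 MB/s is about half of the single-client limit.

```python
# Theoretical aggregate write limit, using figures from the text.
CHUNKSERVERS = 16
REPLICAS = 3
NIC_MBPS = 12.5  # 100 Mbps connection per chunkserver and per client


def write_limit(n_clients: int) -> float:
    """Aggregate write limit in MB/s.

    Every byte written lands on REPLICAS chunkservers, so the total
    chunkserver input bandwidth is shared REPLICAS ways. The client-side
    cap (n_clients * NIC_MBPS) is an assumption, see the lead-in.
    """
    server_side = CHUNKSERVERS * NIC_MBPS / REPLICAS
    client_side = n_clients * NIC_MBPS
    return min(server_side, client_side)


print(f"plateau: {write_limit(16):.1f} MB/s")
print(f"one client: {6.3 / write_limit(1):.0%} of its limit")
print(f"16 clients: {35 / write_limit(16):.0%} of the limit, "
      f"{35 / 16:.1f} MB/s per client")
```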


6.1.3 Record Appends

Figure 3(c) shows record append performance. 

N clients append simultaneously to a single file. 
Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients. 
It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients.
Our applications tend to produce multiple such files concurrently. 
In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. 
Therefore, the chunkserver network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunkservers for another file are busy.


6.2 Real World Clusters
We now examine two clusters in use within Google that are representative of several others like them. 
Cluster A is used regularly for research and development by over a hundred engineers. 
A typical task is initiated by a human user and runs up to several hours. 
It reads through a few MBs to a few TBs of data, transforms or analyzes the data, and writes the results back to the cluster. 
Cluster B is primarily used for production data processing. 
The tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention. 
In both cases, a single “task” consists of many processes on many machines reading and writing many files simultaneously.


6.2.1 Storage
As shown by the first five entries in the table, both clusters have hundreds of chunkservers, support many TBs of disk space, and are fairly but not completely full. 
“Used space” includes all chunk replicas. 
Virtually all files are replicated three times. 
Therefore, the clusters store 18 TB and 52 TB of file data respectively.
The two clusters have similar numbers of files, though B has a larger proportion of dead files, namely files which were deleted or replaced by a new version but whose storage has not yet been reclaimed. 
It also has more chunks because its files tend to be larger. 


6.2.2 Metadata
The chunkservers in aggregate store tens of GBs of metadata, mostly the checksums for 64 KB blocks of user data.
The only other metadata kept at the chunkservers is the chunk version number discussed in Section 4.5.
The metadata kept at the master is much smaller, only tens of MBs, or about 100 bytes per file on average. 
This agrees with our assumption that the size of the master’s memory does not limit the system’s capacity in practice.
Most of the per-file metadata is the file names stored in a prefix-compressed form. 
Other metadata includes file ownership and permissions, mapping from files to chunks, and each chunk’s current version. 
In addition, for each chunk we store the current replica locations and a reference count for implementing copy-on-write.
Each individual server, both chunkservers and the master, has only 50 to 100 MB of metadata. 
Therefore recovery is fast: 
it takes only a few seconds to read this metadata from disk before the server is able to answer queries. 
However, the master is somewhat hobbled for a period – typically 30 to 60 seconds – until it has fetched chunk location information from all chunkservers.


6.2.3 Read and Write Rates
Table 3 shows read and write rates for various time periods. 
Both clusters had been up for about one week when these measurements were taken. 
(The clusters had been restarted recently to upgrade to a new version of GFS.)
The average write rate was less than 30 MB/s since the restart. 
When we took these measurements, B was in the middle of a burst of write activity generating about 100 MB/s of data, which produced a 300 MB/s network load because writes are propagated to three replicas.

Figure 3: Aggregate Throughputs. 
Top curves show theoretical limits imposed by our network topology. 
Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.

Table 3: Performance Metrics for Two GFS Clusters

The read rates were much higher than the write rates.
The total workload consists of more reads than writes as we have assumed. 
Both clusters were in the middle of heavy read activity. 
In particular, A had been sustaining a read rate of 580 MB/s for the preceding week. 
Its network configuration can support 750 MB/s, so it was using its resources efficiently. 
Cluster B can support peak read rates of 1300 MB/s, but its applications were using just 380 MB/s.
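The rates quoted in this section can be turned into a network load and a utilization figure with a few lines of arithmetic. This is only a sketch over the numbers in the text; the helper names are illustrative.

```python
# Figures from the text; helper names are illustrative.
REPLICAS = 3


def replication_network_load(write_rate_mbps: float) -> float:
    """Network load when each written byte is propagated to every replica."""
    return write_rate_mbps * REPLICAS


def utilization(used_mbps: float, capacity_mbps: float) -> float:
    """Fraction of the supported read rate that applications actually use."""
    return used_mbps / capacity_mbps


# (sustained read rate, supported read rate) in MB/s for each cluster.
read_rates = {"A": (580, 750), "B": (380, 1300)}

print(f"B write burst: {replication_network_load(100):.0f} MB/s on the network")
for name, (used, capacity) in read_rates.items():
    print(f"cluster {name}: {utilization(used, capacity):.0%} of read capacity")
```

Cluster A comes out at roughly 77% of its supported read rate and cluster B at roughly 29%, which matches the qualitative description above.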



6.2.4 Master Load
Table 3 also shows that the rate of operations sent to the master was around 200 to 500 operations per second. 
The master can easily keep up with this rate, and therefore is not a bottleneck for these workloads.
In an earlier version of GFS, the master was occasionally a bottleneck for some workloads. 
It spent most of its time sequentially scanning through large directories 
(which contained hundreds of thousands of files) looking for particular files. 
We have since changed the master data structures to allow efficient binary searches through the namespace. 
It can now easily support many thousands of file accesses per second. 
If necessary, we could speed it up further by placing name lookup caches in front of the namespace data structures.
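The text does not describe the master's actual data structures. As an illustration of why binary search helps, the following sketch assumes a hypothetical namespace kept as one sorted list of full pathnames and compares a sequential scan with a lookup through Python's `bisect` module.

```python
import bisect


class Namespace:
    """Toy namespace: full pathnames kept in one sorted list (an assumption,
    not the structure used by the real master)."""

    def __init__(self, paths):
        self._paths = sorted(paths)

    def contains_linear(self, path: str) -> bool:
        # O(n): the kind of sequential scan that made large directories slow.
        for p in self._paths:
            if p == path:
                return True
        return False

    def contains(self, path: str) -> bool:
        # O(log n): binary search over the sorted pathnames.
        i = bisect.bisect_left(self._paths, path)
        return i < len(self._paths) and self._paths[i] == path

    def list_dir(self, directory: str):
        # Everything under a directory forms one contiguous sorted range.
        prefix = directory.rstrip("/") + "/"
        start = bisect.bisect_left(self._paths, prefix)
        # "0" is the character right after "/" in ASCII, so this is the
        # smallest string that sorts after every path starting with prefix.
        end = bisect.bisect_left(self._paths, prefix[:-1] + "0")
        return self._paths[start:end]


ns = Namespace(["/home/a/f1", "/home/a/f2", "/home/ab/f3", "/tmp/x"])
print(ns.contains("/home/a/f2"), ns.list_dir("/home/a"))
```

A name lookup cache, as suggested above, could sit in front of `contains` as a plain dictionary from pathname to result.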


6.2.5 Recovery Time
After a chunkserver fails, some chunks will become under-replicated and must be cloned to restore their replication levels. 
The time it takes to restore all such chunks depends on the amount of resources. 
In one experiment, we killed a single chunkserver in cluster B. 
The chunkserver had about 15,000 chunks containing 600 GB of data. 
To limit the impact on running applications and provide leeway for scheduling decisions, our default parameters limit this cluster to 91 concurrent clonings (40% of the number of chunkservers) where each clone operation is allowed to consume at most 6.25 MB/s (50 Mbps). 
All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s.
In another experiment, we killed two chunkservers each with roughly 16,000 chunks and 660 GB of data. 
This double failure reduced 266 chunks to having a single replica. 
These 266 chunks were cloned at a higher priority, and were all restored to at least 2x replication within 2 minutes, thus putting the cluster in a state where it could tolerate another chunkserver failure without data loss.
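The figures from the first experiment can be cross-checked with simple arithmetic. The sketch below assumes 1 GB = 1024 MB; with decimal units the effective rate comes out closer to 431 MB/s.

```python
# Figures from the first recovery experiment; names are illustrative.
MAX_CONCURRENT_CLONES = 91
PER_CLONE_MBPS = 6.25      # 50 Mbps per clone operation
DATA_GB = 600              # data held by the killed chunkserver
RESTORE_MINUTES = 23.2

# Upper bound on cloning bandwidth imposed by the default parameters.
clone_cap_mbps = MAX_CONCURRENT_CLONES * PER_CLONE_MBPS

# Effective rate actually achieved (assumes 1 GB = 1024 MB).
effective_mbps = DATA_GB * 1024 / (RESTORE_MINUTES * 60)

print(f"cloning cap: {clone_cap_mbps:.2f} MB/s")
print(f"effective rate: {effective_mbps:.0f} MB/s "
      f"({effective_mbps / clone_cap_mbps:.0%} of the cap)")
```

The achieved rate is a bit under four fifths of the configured ceiling, consistent with the reported figure of about 440 MB/s.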


6.3 Workload Breakdown
In this section, we present a detailed breakdown of the workloads on two GFS clusters comparable but not identical to those in Section 6.2. Cluster X is for research and development while cluster Y is for production data processing.


6.3.1 Methodology and Caveats
These results include only client originated requests so that they reflect the workload generated by our applications for the file system as a whole. 
They do not include inter-server requests to carry out client requests or internal background activities, such as forwarded writes or rebalancing.
Statistics on I/O operations are based on information heuristically reconstructed from actual RPC requests logged by GFS servers. 
For example, GFS client code may break a read into multiple RPCs to increase parallelism, from which we infer the original read. 
Since our access patterns are highly stylized, we expect any error to be in the noise. 
Explicit logging by applications might have provided slightly more accurate data, but it is logistically impossible to recompile and restart thousands of running clients to do so and cumbersome to collect the results from as many machines.
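The reconstruction heuristic itself is not specified in the text. One plausible approach, sketched here purely as an assumption, is to merge logged RPCs from the same client and file when they cover adjacent byte ranges within a short time window. The `ReadRpc` record and the `max_gap_s` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadRpc:
    """Hypothetical log record for one read RPC."""
    client: str
    file: str
    offset: int   # bytes
    length: int   # bytes
    time: float   # seconds


def reconstruct_reads(rpcs, max_gap_s=0.1):
    """Infer application-level reads as (client, file, offset, length).

    RPCs from the same client and file are merged when each one starts
    exactly where the previous one ended and arrives within max_gap_s.
    """
    reads = []  # each entry: [client, file, offset, length, last_time]
    for rpc in sorted(rpcs, key=lambda r: (r.client, r.file, r.offset)):
        last = reads[-1] if reads else None
        contiguous = (
            last is not None
            and last[0] == rpc.client
            and last[1] == rpc.file
            and last[2] + last[3] == rpc.offset
            and abs(rpc.time - last[4]) <= max_gap_s
        )
        if contiguous:
            last[3] += rpc.length
            last[4] = max(last[4], rpc.time)
        else:
            reads.append([rpc.client, rpc.file, rpc.offset, rpc.length, rpc.time])
    return [tuple(r[:4]) for r in reads]


log = [
    ReadRpc("c1", "/f", 0, 64, 1.00),
    ReadRpc("c1", "/f", 64, 64, 1.01),    # continues the first RPC
    ReadRpc("c1", "/f", 1024, 32, 5.00),  # separate read
    ReadRpc("c2", "/f", 0, 64, 1.00),     # different client
]
print(reconstruct_reads(log))
```

Any such heuristic can misjoin or split reads occasionally, which is the kind of error the text expects to be lost in the noise.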

One should be careful not to overly generalize from our workload. 
Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. 
Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.

Table 4: Operations Breakdown by Size (%). 
For reads, the size is the amount of data actually read and transferred, rather than the amount requested.




6.3.2 Chunkserver Workload

Table 4 shows the distribution of operations by size. 

Read sizes exhibit a bimodal distribution. 
The small reads (under 64 KB) come from seek-intensive clients that look up small pieces of data within huge files. 
The large reads (over 512 KB) come from long sequential reads through entire files.

A significant number of reads return no data at all in cluster Y. Our applications, especially those in the production systems, often use files as producer-consumer queues. 
Producers append concurrently to a file while a consumer reads the end of file. 
Occasionally, no data is returned when the consumer outpaces the producers. 
Cluster X shows this less often because it is usually used for short-lived data analysis tasks rather than long-lived distributed applications.
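The empty reads are easy to reproduce with a toy model of a file used as a queue. The `AppendOnlyFile` class below is an in-memory stand-in invented for this illustration; it is not the GFS client API.

```python
class AppendOnlyFile:
    """In-memory stand-in for a file used as a producer-consumer queue."""

    def __init__(self):
        self._data = bytearray()

    def append(self, record: bytes) -> None:
        self._data += record

    def read(self, offset: int, max_bytes: int) -> bytes:
        # Reading at or past the end of file returns no data.
        return bytes(self._data[offset:offset + max_bytes])


def consume(f: AppendOnlyFile, offset: int, max_bytes: int = 1024):
    """Read whatever is available at offset and advance past it."""
    data = f.read(offset, max_bytes)
    return data, offset + len(data)


queue_file = AppendOnlyFile()
offset = 0

queue_file.append(b"record-1\n")
first, offset = consume(queue_file, offset)   # picks up record-1
second, offset = consume(queue_file, offset)  # consumer is ahead: empty read
queue_file.append(b"record-2\n")
third, offset = consume(queue_file, offset)   # picks up record-2

print(first, second, third)
```

The second call corresponds to the zero-byte reads counted in the operations breakdown: the consumer polled before any producer had appended new data.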
Write sizes also exhibit a bimodal distribution. 

The large writes (over 256 KB) typically result from significant buffering within the writers. 
Writers that buffer less data, checkpoint or synchronize more often, or simply generate less data account for the smaller writes (under 64 KB).
As for record appends, cluster Y sees a much higher percentage of large record appends than cluster X does because our production systems, which use cluster Y, are more aggressively tuned for GFS.

Table 5 shows the total amount of data transferred in operations of various sizes. 

For all kinds of operations, the larger operations (over 256 KB) generally account for most of the bytes transferred. 
Small reads (under 64 KB) do transfer a small but significant portion of the read data because of the random seek workload.
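The difference between a breakdown by operation count and a breakdown by bytes transferred can be seen with a small sketch. The size classes below use only the thresholds mentioned in the text and are coarser than a real table would be; the sample sizes are made up for illustration.

```python
from collections import Counter

KB = 1024


def size_class(nbytes: int) -> str:
    """Coarse size classes based on the thresholds quoted in the text."""
    if nbytes == 0:
        return "no data"
    if nbytes < 64 * KB:
        return "small (<64 KB)"
    if nbytes <= 256 * KB:
        return "medium (64-256 KB)"
    return "large (>256 KB)"


def breakdown(sizes):
    """Return (percent of operations, percent of bytes) per size class."""
    ops, volume = Counter(), Counter()
    for nbytes in sizes:
        label = size_class(nbytes)
        ops[label] += 1
        volume[label] += nbytes
    total_ops, total_bytes = len(sizes), sum(sizes)
    by_ops = {k: 100.0 * v / total_ops for k, v in ops.items()}
    by_bytes = (
        {k: 100.0 * v / total_bytes for k, v in volume.items()}
        if total_bytes else {}
    )
    return by_ops, by_bytes


# Made-up sample: one empty read, two 4 KB reads, one 1 MB read.
sample = [0, 4 * KB, 4 * KB, 1024 * KB]
by_ops, by_bytes = breakdown(sample)
print(by_ops)
print(by_bytes)
```

In this sample small reads are half of the operations but under 1% of the bytes, which is the same effect the two tables capture: a few large operations dominate the data volume.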


6.3.3 Appends versus Writes
Record appends are heavily used especially in our production systems. 
For cluster X, the ratio of writes to record appends is 108:1 by bytes transferred and 8:1 by operation counts. 
For cluster Y, used by the production systems, the ratios are 3.7:1 and 2.5:1 respectively. 
Moreover, these ratios suggest that for both clusters record appends tend to be larger than writes. 
For cluster X, however, the overall usage of record append during the measured period is fairly low and so the results are likely skewed by one or two applications with particular buffer size choices.
As expected, our data mutation workload is dominated by appending rather than overwriting. 
We measured the amount of data overwritten on primary replicas. 
This approximates the case where a client deliberately overwrites previously written data rather than appending new data. 
For cluster X, overwriting accounts for under 0.0001% of bytes mutated and under 0.0003% of mutation operations. 
For cluster Y, the ratios are both 0.05%. Although this is minute, it is still higher than we expected. 
It turns out that most of these overwrites came from client retries due to errors or timeouts. 
They are not part of the workload per se but a consequence of the retry mechanism.


6.3.4 Master Workload

Table 6 shows the breakdown by type of requests to the master. 

Most requests ask for chunk locations (FindLocation) for reads and lease holder information (FindLeaseLocker) for data mutations.
Clusters X and Y see significantly different numbers of Delete requests because cluster Y stores production data sets that are regularly regenerated and replaced with newer versions. 
Some of this difference is further hidden in the difference in Open requests because an old version of a file may be implicitly deleted by being opened for write from scratch (mode “w” in Unix open terminology).
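The Unix behavior referred to here can be shown on a local file system. This is an illustration of mode "w" semantics in Python, not of the GFS client interface: opening an existing file for writing truncates it, so the old version disappears without any explicit delete call.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset")

    # Write the first version of the data set.
    with open(path, "w") as f:
        f.write("old version of the data set")

    # Opening with mode "w" again truncates the file: the old version
    # is implicitly discarded, with no separate delete request.
    with open(path, "w") as f:
        f.write("new")

    with open(path) as f:
        contents = f.read()

print(contents)
```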
FindMatchingFiles is a pattern matching request that supports “ls” and similar file system operations. 
Unlike other requests for the master, it may process a large part of the namespace and so may be expensive. 
Cluster Y sees it much more often because automated data processing tasks tend to examine parts of the file system to understand global application state. 
In contrast, cluster X’s applications are under more explicit user control and usually know the names of all needed files in advance.
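The pattern syntax accepted by FindMatchingFiles is not given in the text. Assuming shell-style wildcards, a minimal sketch with Python's `fnmatch` module shows why such a request can be expensive: every name in the searched part of the namespace has to be tested. The function name and sample paths are hypothetical.

```python
import fnmatch


def find_matching_files(namespace, pattern):
    """Return every pathname matching a shell-style pattern.

    Each entry is examined once, so the cost grows with the size of the
    namespace rather than with the number of matches.
    """
    return sorted(p for p in namespace if fnmatch.fnmatchcase(p, pattern))


namespace = ["/logs/2003-01.log", "/logs/2003-02.log", "/data/a"]
print(find_matching_files(namespace, "/logs/*.log"))
```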






