How Memory and Disk Performance Affects Your MongoDB Database

This article was originally published on MongoDB. Thank you for supporting the partners who make SitePoint possible.

Understanding the relationships between various internal caches and disk performance, and how those relationships affect database and application performance, can be challenging. We’ve used the YCSB benchmark, varying the working set (number of documents used for the test) and disk performance, to better show how these relate. While reviewing the results, we’ll cover some MongoDB internals to improve understanding of common database usage patterns.

Key Takeaways

  1. Knowing disk baseline performance is important for understanding overall database performance.
  2. High disk await and utilization are indicative of a disk bottleneck.
  3. WiredTiger IO is random.
  4. A query targeting a single replica set is single threaded and sequential.
  5. Disk performance and working set size are closely related.

Summary

The primary contributors to overall system performance are how the working set relates to both the storage engine cache size (the memory dedicated for storing data) and disk performance (which provides a physical limit to how quickly data can be accessed).

Using YCSB, we explore the interactions between disk performance and cache size, demonstrating how these two factors can affect performance. While YCSB was used for this testing, synthetic benchmarks are not representative of production workloads; latency and throughput numbers obtained with these methods do not map to production performance. We utilized MongoDB 3.4.10, YCSB 0.14, and the MongoDB 3.6.0 driver for these tests. YCSB was configured with 16 threads and the “uniform” read-only workload.
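
For reference, the sketch below shows what such a YCSB run might look like. The record count, operation count, and connection string are illustrative assumptions; Workload C is used here as YCSB’s standard read-only workload.

```bash
# Load the data set, then run the read-only workload with 16 client threads
# and a uniform request distribution (values shown are illustrative).
./bin/ycsb load mongodb -s -P workloads/workloadc \
    -p recordcount=2000000 \
    -p mongodb.url="mongodb://localhost:27017/ycsb"

./bin/ycsb run mongodb -s -threads 16 -P workloads/workloadc \
    -p recordcount=2000000 \
    -p operationcount=10000000 \
    -p requestdistribution=uniform \
    -p mongodb.url="mongodb://localhost:27017/ycsb"
```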

We show that fitting your working set inside memory provides optimal application performance and that, as with any database, exceeding this limit negatively affects latency and overall throughput.

Understanding Disk Metrics

There are four important metrics when considering disk performance:

  1. Disk throughput, or the number of requests multiplied by the request size. This is usually measured in megabytes per second. Random read and write performance in the 4KB range is the most representative of standard database workloads. Note that many cloud providers limit disk throughput or bandwidth.
  2. Disk latency. On Linux this is represented by await, the time in milliseconds between an application issuing a read or write and the data being returned or written. For SSDs, latencies are typically under 3ms; HDDs are typically above 7ms. High latencies indicate that disks have trouble keeping up with the given workload.
  3. Disk IOPS (Input/Output Operations Per Second). iostat reports this metric as tps. A given cloud provider may guarantee a certain number of IOPS for a given drive. Should you reach this threshold, any further accesses will be queued, resulting in a disk bottleneck. A high-end PCIe-attached NVMe device could offer 1,500,000 IOPS while a typical hard disk may only support 150 IOPS.
  4. Disk utilization, reported by util in iostat. Linux maintains multiple queues per device for servicing IO; utilization indicates what percentage of these queues is busy at a given time. While this number can be confusing, it is a good indicator of overall disk health. (A sample iostat invocation follows this list.)
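
With the sysstat package installed, all four of these metrics can be watched live via iostat’s extended output:

```bash
# Per-device extended statistics refreshed every second, sizes in megabytes.
# Columns of interest: r/s and w/s (IOPS; the default output combines these
# as tps), rMB/s and wMB/s (throughput), await / r_await / w_await (latency
# in ms, depending on the sysstat version), and %util (utilization).
iostat -xm 1
```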

Testing Disk Performance

While cloud providers may provide an IOPS threshold for a given volume and disk, and disk manufacturers publish expected performance numbers, the actual results on your system may vary. If the observed disk performance is in question, performing an IO test can be very helpful.

We generally test with fio, the Flexible IO Tester. We performed tests on 10GB of data, using the psync ioengine, with reads ranging between 4KB and 32KB. While the default fio settings are not representative of the WiredTiger workload, we have found this configuration to be a good approximation of WiredTiger disk utilization.
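
A hedged sketch of an fio invocation approximating these settings follows; the file path, runtime, and direct-IO flag are assumptions rather than the exact job definition used for these tests.

```bash
# Random reads through the psync ioengine against a 10GB file,
# with block sizes varying between 4KB and 32KB.
fio --name=wiredtiger-approx \
    --filename=/data/fio-testfile \
    --ioengine=psync \
    --rw=randread \
    --bsrange=4k-32k \
    --size=10G \
    --direct=1 \
    --runtime=120 --time_based \
    --group_reporting
```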

All tests were repeated under three disk scenarios:

Scenario 1

Default disk settings provided by an AWS c5 io1 100GB volume: 5,000 IOPS.

  • 1144 IOPS / 5025 physical reads per second / 99.85% util

Scenario 2

Limiting the disk to 600 IOPS and introducing 7ms of latency. This should mirror the performance of a typical RAID10 SAN with hard drives.

  • 134 IOPS / 150 physical reads per second / 95.72% util

Scenario 3

Further limiting the disk to 150 IOPS with 7ms latency. This should model a commodity spinning hard drive.

  • 34 IOPS / 150 physical reads per second / 98.2% utilization
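
If you want to reproduce something like scenarios 2 and 3 on your own hardware, one option is the Linux device-mapper delay target, which inserts a fixed latency in front of an existing block device. The sketch below is only an illustration under an assumed device name, not necessarily how these scenarios were produced.

```bash
# Create /dev/mapper/slowdisk, which forwards IO to $SRC with ~7ms of added delay.
SRC=/dev/nvme1n1                          # hypothetical backing device
SECTORS=$(sudo blockdev --getsz "$SRC")   # device size in 512-byte sectors
echo "0 $SECTORS delay $SRC 0 7" | sudo dmsetup create slowdisk
# Point the MongoDB dbPath at a filesystem on /dev/mapper/slowdisk; IOPS
# throttling can additionally be layered on with the blkio cgroup controller.
```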

How Is a Query Serviced from Disk?

The WiredTiger storage engine performs its own caching. By default, the WiredTiger cache is sized at 50% of system memory minus 1GB to allow adequate space for other system processes, the filesystem cache, and internal MongoDB operations that consume additional memory, such as building indexes, performing in-memory sorts, deduplicating results, text scoring, connection handling, and aggregations. To prevent performance degradation from a totally full cache, WiredTiger automatically begins evicting data from the cache when utilization grows above 80%. For our tests, this means the effective cache size is (7634MB – 1024MB) * 0.5 * 0.8, or 2644MB.
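
To compare this budget against what a running mongod actually reports, the WiredTiger section of serverStatus can be inspected from the mongo shell; a minimal sketch (connection defaults assumed) is:

```bash
mongo --quiet --eval '
  var cache = db.serverStatus().wiredTiger.cache;
  print("maximum bytes configured:     " + cache["maximum bytes configured"]);
  print("bytes currently in the cache: " + cache["bytes currently in the cache"]);
'
```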

All queries are serviced from the WiredTiger cache. This means a query will cause indexes and documents to be read from disk through the filesystem cache into the WiredTiger cache before returning results. If the requested data is already in the cache, this step is skipped.

WiredTiger stores documents with the snappy compression algorithm by default. Any data read from the filesystem cache is first decompressed before storing in the WiredTiger cache. Indexes utilize prefix compression by default and are compressed both on disk and inside the WiredTiger cache.
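
A quick way to confirm which block compressor a collection was created with is to inspect its WiredTiger creation string; the sketch below assumes the ycsb.usertable namespace that the YCSB MongoDB binding creates by default.

```bash
mongo --quiet ycsb --eval '
  var cs = db.usertable.stats().wiredTiger.creationString;
  print(cs.match(/block_compressor=[^,]*/)[0]);  // e.g. block_compressor=snappy
'
```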

The filesystem cache is an operating system construct to store frequently accessed files in memory to facilitate faster accesses. Linux is very aggressive in caching files and will attempt to consume all free memory with the filesystem cache. If additional memory is needed, the filesystem cache is evicted to allow more memory for applications.

Here is an animated graphic showing the disk accesses for the YCSB collection resulting from 100 YCSB read operations. Each operation is an individual find that provides the _id for a single document.

The upper left hand corner represents the first byte in the WiredTiger collection file. Disk locations increment to the right hand side and wrap around. Each row represents a 3.5MB segment of the WiredTiger collection file. The accesses are ordered by time and represented by the frame of animation. Accesses are represented in red and green boxes to highlight the current disk access.

[Animation: 3.5 MB vs 4 KB]

Here we see the data file for our collection read into memory. Because the data is stored in B+ trees, we may need to find the disk location of our document (the smaller accesses) by visiting one or more locations on disk before our document is found and read (the wider accesses).

This demonstrates the typical access patterns of a MongoDB query – documents are unlikely to be close to each other on disk. This also shows it is highly unlikely for documents, even when inserted after each other, to be in consecutive disk locations.

The WiredTiger storage engine is designed to “read completely”: it will issue a read for all of the data it needs at once. This leads to our recommendation to limit the disk read ahead for WiredTiger deployments to zero, as subsequent accesses are unlikely to take advantage of the additional data retrieved through read ahead.
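
Read-ahead can be checked and changed with blockdev; the device name below is illustrative, the value is expressed in 512-byte sectors, and the setting should also be persisted (for example via a udev rule) to survive reboots.

```bash
sudo blockdev --getra /dev/nvme1n1   # show the current read-ahead setting
sudo blockdev --setra 0 /dev/nvme1n1 # disable read-ahead for this device
```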

Working Set Fits in Cache

For our first set of tests, we set the record count to 2 million, resulting in a total size for both data and indexes of 2.43 GB or 92% of cache.
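
The data and index footprint can be checked against the effective cache size with collStats; a minimal sketch against the default YCSB namespace (an assumption) is:

```bash
mongo --quiet ycsb --eval '
  var s = db.usertable.stats(1024 * 1024);       // scale all sizes to MB
  print("data size (MB):        " + s.size);
  print("total index size (MB): " + s.totalIndexSize);
'
```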

Here we see strong scenario 1 performance of 76,113 requests per second. Checking the filesystem cache statistics, we observe a WiredTiger cache hit rate of 100% with no accesses and zero bytes read into the filesystem cache, meaning no additional IO is required throughout this test.
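
The cache hit rates referenced throughout these tests can be approximated from two serverStatus counters; the counter names below exist in serverStatus, but the ratio itself is an approximation rather than an official metric.

```bash
mongo --quiet --eval '
  var c = db.serverStatus().wiredTiger.cache;
  var readIn    = c["pages read into cache"];
  var requested = c["pages requested from the cache"];
  print("approx. cache hit rate: " +
        (100 * (1 - readIn / requested)).toFixed(2) + "%");
'
```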

Unsurprisingly, in scenarios 2 and 3, changing the disk performance (adding 7ms of latency and limiting IOPS to either 600 or 150) affected throughput minimally (69,579.5 and 70,252 operations per second respectively).

Our 99% response latencies for all three tests are between 0.40 and 0.44 ms.

Working Set Larger than WiredTiger Cache, but Still Fits in Filesystem Cache

Modern operating systems cache frequently accessed files to improve read performance. Because the file is already in memory, accessing cached files does not result in physical reads. The cached statistics displayed by the free Linux command detail the size of the filesystem cache.
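
For example, in megabytes:

```bash
# The "buff/cache" column (labelled "cached" on older versions of free)
# shows how much memory the filesystem cache currently holds.
free -m
```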

When we increase our record count from 2 million to 3 million we increase our total size of data and indexes to 3.66GB, 38% greater than can be serviced solely from the WiredTiger cache.

The metrics show clearly that we are reading an average of 548 mbps into the WiredTiger cache, but we observe a 99.9% hit rate when checking the filesystem cache metrics.

For this test we begin to see a reduction in performance, with only 66,720 operations per second, an 8% reduction compared to our previous test serviced solely from the WiredTiger cache.

As expected, reduced disk performance for this case does not significantly affect our overall throughput (64,484 and 64,229 operations respectively). In cases where the documents are more compressible, or the CPU is a limiting factor, the penalty of reading from the filesystem cache would be more pronounced.

We note a 54% increase in observed p99 latency to .53 – .55ms.

Working Set Slightly Larger Than WiredTiger and Filesystem Cache

We have established that the WiredTiger and filesystem caches work together to provide data to service our queries. However, when we grow our record count from 3 million to 4 million, we can no longer rely solely on these caches to service queries. Our data size grows to 4.8GB, 82% larger than our WiredTiger cache.

Here, we read into the WiredTiger cache at a rate of 257.4 mbps. Our filesystem cache hit rate lowers to 93-96%, meaning 4-7% of our reads result in physical reads from disk.

Varying the available IOPS and disk latency has a huge impact on performance for this test.

The 99th percentile response latencies further increase. Scenario 1: 19ms, scenario 2: 171ms, and scenario 3: 770ms, an increase of 43x, 389x, and 1751x over the in-cache case.

We see 75% lower performance when MongoDB is provided the full 5,000 IOPS compared to our earlier test, which fit fully in cache. Scenarios 2 and 3 achieved 5,139.5 and 737.95 operations per second respectively, further demonstrating the IO bottleneck.

Working Set Much Larger Than WiredTiger and Filesystem Cache

Moving up to 5 million records, we grow our data and index size to 6.09GB, larger than our combined WiredTiger and filesystem caches. We see our throughput dip below our IOPS. In this case we are still servicing 81% of WiredTiger reads from the filesystem cache, but the reads that overflow to disk are saturating our IO. We see 71, 8.3, and 1.9 Mbps read into the filesystem cache for this test.

The 99th percentile response latencies further increase. Scenario 1: 22ms, scenario 2: 199ms, and scenario 3: 810ms, an increase of 52x, 454x, and 1841x over the in-cache response latencies. Here, changing the disk IOPS significantly affects our throughput.

Summary

Through this series of tests we demonstrate two major points.

  1. If the working set fits in cache, disk performance does not greatly affect application performance.
  2. When the working set exceeds available memory, disk performance quickly becomes the limiting factor for throughput.

Understanding how MongoDB utilizes both memory and disk is an important part of both sizing a deployment and understanding performance. The inner workings of the WiredTiger storage engine attempt to use hardware to the fullest extent, but memory and disk are two critical pieces of infrastructure contributing to the overall performance characteristics of your workload.

Translated from: https://www.sitepoint.com/how-memory-disk-performance-affects-your-mongodb-database/
