Stefan is the founder of www.RED-SOFT-ADAIR.com, a company that rescues software development projects in technical crisis. He can be contacted at StefanWoe@gmail.com.
Over the past year or so, we've heard a lot about multithreading, deadlocks, cache invalidation, and superlinear scalability. I, for one, have introduced multithreading for the CPU-intensive parts of my applications. During this time, however, I've come to realize that the biggest bottleneck often involves loading data from files. Consequently, I've built a small test suite that performs the following actions:
- Read a single file sequentially
- Read multiple files sequentially
- Read from a single file at random positions
- Read from multiple files at random positions
- Write a single file sequentially
- Write multiple files sequentially
- Write to a single file at random positions
- Write to multiple files at random positions
I repeated each of these tests with 1, 2, 4, 8, 16, and 32 threads. From each test, I recorded the total throughput in MB/s. The source code consists of a single C++ source file and is available here.
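To give an idea of what one subtest boils down to, here is a minimal sketch of the sequential read of multiple files. This is not the actual test suite; the file names, the fixed thread count, and the use of portable C++ streams are my choices for illustration:

#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// One subtest: N threads each read one 20MB file sequentially in
// 512-byte blocks; total throughput is reported in MB/s.
static void ReadFileSequentially(const std::string &path)
{
    std::ifstream file(path, std::ios::binary);
    char block[512];
    while (file.read(block, sizeof(block)))
        ; // consume the file in 512-byte blocks
}

int main()
{
    const int threadCount = 4;            // the tests used 1, 2, ..., 32
    const double megabytesPerFile = 20.0; // one 20MB file per thread

    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) // each thread reads its own file
        threads.emplace_back(ReadFileSequentially,
                             "testfile_" + std::to_string(i) + ".dat");
    for (auto &t : threads)
        t.join();

    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::printf("%.1f MB/s\n",
                threadCount * megabytesPerFile / elapsed.count());
    return 0;
}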
I've run all of these tests on:
- A dual quad-core XEON with 10GB RAM and an internal RAID 5 system consisting of four 10k SAS Disks running Windows XP x64
- A dual XEON with 4 GB RAM and 10k U360 SCSI Disks running Windows XP
- A Core Duo Laptop with 1.5 GB RAM and a 5400 RPM SATA disk running Windows XP
For single-file I/O, I used files of 200MB in size; all threads in a run accessed the same file. For multiple-file I/O, each file was 20MB, and each thread used its own separate file. All accesses were made in blocks of 512 bytes.
I repeated the entire test suite three times; the values I present here are the average of the three runs. In most cases the standard deviation did not exceed 10-20%. All tests were also run three times with a reboot after every run, so that no file was served from cache. These runs were slower, of course, but the influence of the number of threads was similar to the results shown here (with the exception of randomly reading multiple files; see below). To suppress caching effects within a run, I used a different set of files for each subtest with 1, 2, ..., 32 threads.
Theoretical Aspects
Before discussing the results, it is worth considering a few important I/O issues.
Read caching. All modern operating systems use all available memory to cache the most recently used files -- or more specifically, the most recently read or written blocks. The RAID machine with 10GB RAM has a clear advantage here, as it can cache all files used (some 6GB in total). The other machines cannot reuse any cached data from one of the three runs to the next, because the files used first must be dropped from the cache once more file space has been read and written than fits in RAM.
Write-back caches. Operating systems, and more specifically RAID controllers, offer write caches that buffer written data before it is actually stored on the hard disks. The write command returns even though the data is only in the cache. In my first tests, I encountered a slowdown on the RAID system in the middle of the write tests: the write cache was full, so each test was influenced by the tests run before it. I therefore introduced a pause of a few seconds after each subtest before starting with a different number of threads. Read and write caches explain some of the huge throughput of the RAID system used.
Native command queuing. Most current drives (even my three-year-old laptop's SATA drive) support native command queuing (NCQ). This technology enables asynchronous execution of read and write commands: the commands complete in whatever order best fits the drive's rotational position. For instance, if four read requests (A, B, C, D) are sent to the disk, they might return in the order D, C, B, A, because the drive's heads happened to be close to the position of the data requested by D when the commands were received, the next-nearest position was that of C, and so on.
Now consider a single-threaded application waiting to receive first A, then B, then C, and so on, compared to an application with four threads requesting A, B, C, and D simultaneously. By the time the multithreaded application has received A, B, C, and D together, the single-threaded one might only have received A. (For more information, see Serial ATA: Native Command Queuing.) While there are many more theoretical aspects (operating system, drivers, disk fragmentation, sector and block size, to mention a few), going deeper is beyond the scope of this article.
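For the drive to have anything to reorder, the application must keep several requests outstanding at once, which one thread per request achieves. A minimal sketch of the idea (the file name and the offsets of A through D are placeholders of my own; note that each thread opens its own file handle, so no seek position is shared):

#include <fstream>
#include <thread>

// Issue one 512-byte read at the given offset, on a private handle.
static void ReadAt(const char *path, std::streamoff offset)
{
    std::ifstream file(path, std::ios::binary);
    char block[512];
    file.seekg(offset);
    file.read(block, sizeof(block)); // one outstanding request
}

int main()
{
    // Four requests A, B, C, D at different positions; the drive's
    // command queue may serve them in whatever order is fastest.
    const std::streamoff offsets[] = { /* A */ 0,          /* B */ 50 << 20,
                                       /* C */ 100 << 20,  /* D */ 150 << 20 };
    std::thread threads[4];
    for (int i = 0; i < 4; ++i)
        threads[i] = std::thread(ReadAt, "testfile.dat", offsets[i]);
    for (auto &t : threads)
        t.join();
    return 0;
}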
Sequential Read
For sequential access, I read files from the first to the last byte. With multiple files, each thread read one separate 20MB file. For single-file access, a 200MB file was divided into slices of equal length and each thread read one of these slices.
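A sketch of how each thread can derive its slice from its index (the names and the remainder handling are mine, not necessarily those of the test suite):

#include <cstdio>

struct Slice { long long begin, end; };

// Divide a file into equal slices, one per thread. Each thread then
// reads only bytes [begin, end) of the shared file.
static Slice SliceForThread(long long fileSize, int threadCount, int threadIndex)
{
    const long long sliceSize = fileSize / threadCount;
    Slice s;
    s.begin = threadIndex * sliceSize;
    // The last slice absorbs any remainder from the division.
    s.end = (threadIndex == threadCount - 1) ? fileSize
                                             : s.begin + sliceSize;
    return s;
}

int main()
{
    const long long fileSize = 200LL << 20; // 200MB, as in the tests
    for (int i = 0; i < 4; ++i) {
        Slice s = SliceForThread(fileSize, 4, i);
        std::printf("thread %d reads [%lld, %lld)\n", i, s.begin, s.end);
    }
    return 0;
}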
In short, the results show that:
- On single disks, performance decreases significantly with additional threads
- Only on a RAID system do additional threads (up to four) increase performance significantly
This result was surprising because I recently introduced multithreading in a large application and reduced the load time for large datasets by more than 35%, even on the laptop. The reason for this is obvious: the application did not just read the file; it also processed the data it read, storing it into arrays, lists, and so on. With multiple threads, both cores were utilized at up to 100%. In this case, additional threads did not improve the performance of the file access itself, but overall performance increased, even though the main task of the process was reading files.
Random Read
For random read access, I positioned the file pointer at a random position within the file, then read a block of 512 bytes. I did this 10,000 times for each file in the case of multiple files. In the case of a single file, the 10,000 accesses were divided among all threads.
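A sketch of this access pattern (the random number generator and the file parameters are my choices; the actual suite may generate positions differently):

#include <fstream>
#include <random>

// Seek to a random position, read a 512-byte block, repeat.
static void RandomReads(const char *path, long long fileSize, int count)
{
    std::ifstream file(path, std::ios::binary);
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<long long> pos(0, fileSize - 512);
    char block[512];
    for (int i = 0; i < count; ++i) {
        file.seekg(pos(rng));
        file.read(block, sizeof(block));
    }
}

int main()
{
    // 10,000 accesses per file, as in the tests (one 20MB file assumed).
    RandomReads("testfile_0.dat", 20LL << 20, 10000);
    return 0;
}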
Randomly reading multiple files was the only case where the behavior differed after a reboot: when reading the files for the first time, all machines showed increased performance with more threads -- even the laptop performed best with 32 threads. The reason is that even the laptop's hard drive supports native command queuing; this is a perfect example of that technology at work.
When reading a single file, two threads perform a little better than one; on the RAID, four threads are better still, but more threads decrease performance on all systems. These results did not differ strongly after a reboot.
Sequential Write
For sequential access, files were written from the first to the last byte. With multiple files, each thread wrote one separate 20MB file. For a single file, a 200MB file was divided into slices of equal length and each thread wrote one of these slices. All files existed in full length before the test started, but their entire content was overwritten.
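Because the files already existed in full length, the tests measured pure overwriting rather than file extension. One portable way to preallocate such a file is sketched below; this is my approach, assuming the filesystem zero-fills the gap (some filesystems may create a sparse file instead):

#include <fstream>

// Preallocate a file to its full length before the write test, so the
// test overwrites existing blocks instead of extending the file.
static void PreallocateFile(const char *path, long long size)
{
    std::ofstream file(path, std::ios::binary);
    // Seek to the last byte and write it; the bytes in between are
    // filled in by the filesystem.
    file.seekp(size - 1);
    file.put('\0');
}

int main()
{
    PreallocateFile("testfile_0.dat", 20LL << 20); // one 20MB file
    return 0;
}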
The results for multiple files are similar to those for sequential read: performance generally decreases with multiple threads on single disks, but increases on a RAID system (for sequential writing, with up to 8 threads). When writing a single file, the results are surprising: up to 8 threads seem not to affect performance on single disks, and they do increase performance on a RAID system. More than 8 threads always decreases performance.
Random Write
For random write access, I positioned the file pointer at a random position within the file, then wrote a block of 512 bytes. I did this 10,000 times for each file in the case of multiple files. In the case of a single file, the 10,000 accesses were divided among all threads.
The results of random write show that:
- With multiple files, performance generally increases with more threads. On single disks, saturation seems to be reached at 2-4 threads; on the RAID system, at 8 threads.
- When writing a single file, two threads perform better than one. On the RAID, four threads are best, but more threads decrease performance on all systems. However, the decrease is less drastic on the RAID system.
What This All Means
Overall, the results show that multithreaded file I/O can either improve or degrade performance significantly. Keep in mind that an application typically does not just read data, but also processes the data it reads in a more or less CPU-intensive way. This leads to different results for every application, and even for different tasks within an application. The same may or may not hold for writing data. Furthermore, files can be read or written in very different ways and at very different times, and an application will meet very different hardware and software configurations. There is no general advice software developers can follow. For example, in one application I measured clearly that using multiple threads per sequentially read file increased performance significantly in the 64-bit version, but in the 32-bit version more threads decreased performance on the same machine, the same operating system (Windows XP x64), and the same source code. In another case, where an application opened and appended to thousands of files, the best solution was to create 8 threads that did nothing but close files (on an average dual-core machine).
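That last case deserves a sketch: closing a file can block, so handing the close over to dedicated threads lets the appending threads move straight on to the next file. A minimal version of the idea (the queue design and all names are mine, not the original code):

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Worker code pushes finished FILE* handles onto a queue, and N
// background threads do nothing but close them.
class FileCloser {
public:
    explicit FileCloser(int threadCount)
    {
        for (int i = 0; i < threadCount; ++i)
            mThreads.emplace_back(&FileCloser::Run, this);
    }
    ~FileCloser() // drains the queue, then joins the closer threads
    {
        { std::lock_guard<std::mutex> lock(mMutex); mDone = true; }
        mWake.notify_all();
        for (auto &t : mThreads) t.join();
    }
    void CloseLater(std::FILE *file) // called by the appending threads
    {
        { std::lock_guard<std::mutex> lock(mMutex); mQueue.push(file); }
        mWake.notify_one();
    }
private:
    void Run()
    {
        for (;;) {
            std::unique_lock<std::mutex> lock(mMutex);
            mWake.wait(lock, [this] { return mDone || !mQueue.empty(); });
            if (mQueue.empty()) return; // done and fully drained
            std::FILE *file = mQueue.front();
            mQueue.pop();
            lock.unlock();
            std::fclose(file); // the potentially slow call
        }
    }
    std::mutex mMutex;
    std::condition_variable mWake;
    std::queue<std::FILE *> mQueue;
    std::vector<std::thread> mThreads;
    bool mDone = false;
};

int main()
{
    FileCloser closer(8); // 8 closer threads, as in the case described
    for (int i = 0; i < 1000; ++i) {
        std::FILE *f = std::fopen(("out_" + std::to_string(i)).c_str(), "ab");
        // ... append data ...
        if (f) closer.CloseLater(f); // instead of fclose(f)
    }
    return 0;
}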
The bottom line is:
- Make multithreading configurable! The number of threads used in a program should always be configurable, from 0 (no additional threads at all) up to an arbitrary number. This not only allows customization for optimal performance, but also proves to be a good debugging tool, and sometimes a lifesaver when unknown race conditions occur on client systems. I remember more than one situation where customers were able to work around fatal bugs by switching off multithreading. This, of course, does not apply only to multithreaded file I/O.
Consider the following pseudocode:
int CMyThreadManager::AddThread(CThreadObj theTask)
{
    // If all configured threads are already in use, fall back to
    // executing the task synchronously in the calling thread.
    if (mUsedThreadCount >= gConfiguration.MaxThreadCount())
        return theTask.Execute(); // execute task in main thread

    // otherwise, add the task to the thread pool and start the thread
    ...
}
Such a mechanism is not very complicated (though a little more work will probably be needed than shown here), but it is sometimes very effective. It can also be used with prebuilt threading libraries such as OpenMP or Intel's Threading Building Blocks. Considering the measurements shown here, it's a good idea to include more than one configurable thread count (for example, one for file I/O and one for core CPU tasks). The default might be 0 for file I/O and <number of cores found> for CPU tasks. But it should always be possible to switch all multithreading off. A more sophisticated approach might even include some code to test multithreaded performance and set the number of threads automatically, maybe even individually for different tasks.
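Such a self-test might look like the following sketch, which times a representative task at several thread counts and keeps the fastest; the benchmark callback and the candidate counts are placeholders of mine:

#include <chrono>
#include <cstdio>
#include <functional>

// Time one run of 'task' executed with 'threadCount' threads.
// 'task' stands in for a representative I/O or CPU job.
static double TimeTask(const std::function<void(int)> &task, int threadCount)
{
    auto start = std::chrono::steady_clock::now();
    task(threadCount);
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Try several thread counts and return the fastest one.
static int FindBestThreadCount(const std::function<void(int)> &task)
{
    int best = 1;
    double bestTime = 1e300;
    for (int n : { 1, 2, 4, 8, 16, 32 }) {
        double t = TimeTask(task, n);
        std::printf("%2d threads: %.3f s\n", n, t);
        if (t < bestTime) { bestTime = t; best = n; }
    }
    return best; // store in the configuration, per task type
}

int main()
{
    // Placeholder task: real code would run a representative subtest.
    int best = FindBestThreadCount([](int threads) { /* ... run subtest ... */ });
    std::printf("using %d threads\n", best);
    return 0;
}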
Conclusion
So far, multithreaded file I/O is an under-researched field. Although it is simple to measure, there is not much common knowledge about it. The measurements I present here show that multithreading can improve the performance of file access directly, as well as indirectly by utilizing available cores to process the data read. In general, however, there are no rules of thumb. In this article, I've tried to bring a few "hard" numbers into this area, although I think more measurements and theoretical analysis are needed to build applications that perform better on I/O.