MySQL Performance: Linux I/O

For a long time I've wanted to run some benchmark tests to better understand the surprises I've met in the past with Linux I/O performance during MySQL benchmarks. It finally happened last year, but I was able to organize and present my results only now..

My main questions were:

what is so different between the various I/O schedulers in Linux (cfq, noop, deadline)?..
what is wrong or right with O_DIRECT on Linux?..
what makes XFS more attractive compared to EXT3/EXT4?..

There were already several posts in the past about the impact on MySQL performance when one or another Linux I/O layer feature was used (for ex. Domas about I/O schedulers, Vadim regarding TPCC-like performance, and many others) - but I still did not find any answer to WHY (for ex.) the cfq I/O scheduler is worse than noop, etc, etc..

So, I'd like to share here some answers to my WHY questions ;-))
(while for today I still have more questions than answers ;-))

Test Platform

First of all, the system I've used for my tests:

HW server: 64 cores (Intel), 128GB RAM, running RHEL 5.5
the kernel is 2.6.18 - as it was until now the most common Linux kernel used on Linux boxes hosting MySQL servers
installed filesystems: ext3, ext4, XFS
Storage: ST6140 (1TB on x16 HDD striped in RAID0, 4GB cache on controller) - not a monster, but fast enough to see if the bottleneck is coming from the storage level or not ;-))

Test Plan

Then, my initial test plan:

see what the max possible Read/Write I/O performance is that I can obtain from the given HW on the raw level (just RAW devices, without any filesystem, etc.) - mainly I'm interested here in the impact of the Linux I/O scheduler
then, based on the observed results, set up each filesystem (ext3, ext4, XFS) more optimally and try to understand their bottlenecks..
I/O workload: I'm mainly focusing here on the random reads and random writes - they are the most problematic for any I/O related performance (and particularly painful for databases), while sequential read/writes may be very well optimized on the HW level already and hide any other problems you have..
Test Tool: I'm using here my IObench tool (at least I know exactly what it's doing ;-))

TESTING RAW DEVICES

The implementation of raw devices in Linux is quite surprising.. - it simply involves O_DIRECT access to a block device. So to use a disk in raw mode you have to open() it with the O_DIRECT option (or use the "raw" command, which will create an alias device in your system that always uses the O_DIRECT flag on any open() system call involved). Using the O_DIRECT flag when opening a file disables any I/O buffering on that file (or device, as a device is also a file in UNIX ;-) - NOTE: by default all I/O requests on block devices (e.g. hard disk) in Linux are buffered, so if you start a kind of I/O write test on, say, your /dev/sda1 - you'll obtain incredibly great performance ;-)) as the data will probably not even have reached your storage yet, and in reality you'll simply be testing the speed of your RAM.. ;-))

Now, what is "fun" with O_DIRECT:

the block size of all your I/O requests (read, write, etc.) should be aligned to 512 bytes (i.e. be a multiple of 512 bytes), otherwise your I/O request is simply rejected and you get an error message.. - and for RAW devices it's quite surprising compared to Solaris, for ex., where you simply use /dev/rdsk/device instead of /dev/dsk/device and may use any block size you want..
but that's not all.. - the buffer you pass to the system call making the I/O request should also be allocated aligned to 512 bytes, so you mainly have to allocate it via the posix_memalign() function, otherwise you'll also get an error.. (it seems that some kind of direct memory mapping is used during O_DIRECT operations)
then, reading the manual: "The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC that data and necessary metadata are transferred. To guarantee synchronous I/O the O_SYNC must be used in addition to O_DIRECT" - quite surprising again..
and, finally, you'll be unable to use O_DIRECT within your C code unless you declare #define _GNU_SOURCE
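
The alignment rules above can be sketched in Python rather than C. This is only a minimal illustration, not the IObench code: it assumes Linux and Python 3.7+, where `os.O_DIRECT` and `os.preadv()` are available, and it uses an anonymous `mmap` as the buffer because such a mapping starts on a page boundary and so also satisfies the 512-byte rule. The helper names `is_aligned`, `aligned_buffer` and `direct_read` are hypothetical.

```python
import mmap
import os

SECTOR = 512  # alignment quoted in the post; some devices may need more


def is_aligned(value, alignment=SECTOR):
    """Return True if value is a multiple of alignment."""
    return value % alignment == 0


def aligned_buffer(size):
    """Return a writable buffer whose start address is page-aligned.

    An anonymous mmap always starts on a page boundary, so it is
    also aligned to 512 bytes.
    """
    if size <= 0 or not is_aligned(size):
        raise ValueError("size must be a positive multiple of %d" % SECTOR)
    return mmap.mmap(-1, size)


def direct_read(path, offset, size):
    """Read size bytes at offset with O_DIRECT (no page cache buffering).

    Misaligned requests raise ValueError here instead of being
    rejected by the kernel with EINVAL.
    """
    if not (is_aligned(offset) and is_aligned(size)):
        raise ValueError("offset and size must be multiples of %d" % SECTOR)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        with aligned_buffer(size) as buf:
            nread = os.preadv(fd, [buf], offset)
            return buf[:nread]  # slicing an mmap copies into bytes
    finally:
        os.close(fd)
```

Something like `direct_read("/dev/sda1", 0, 4096)` would then read the first 4K of the device (root privileges needed). Opening with O_DIRECT may fail on filesystems that do not support it, such as tmpfs.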

Interesting that the man page is also quoting Linus about O_DIRECT:
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." Linus

But we have to live with it ;-))

And if you need an example of C or C++ code, instead of showing you mine, there is a great dev page on the Fusion-io site.

So far, what about my storage performance on the RAW devices now?..

Test scenario on RAW devices:

I/O Schedulers: cfq, noop, deadline
Block size: 1K, 4K, 16K
Workload: Random Read, Random Write

NOTE: I'm using here 1K block size as the smallest "useful" size for databases :-)) then 4K as the most aligned to the Linux page size (4K), and 16K - as the default InnoDB block size until now.

The following graphs represent 9 tests executed one after another: cfq with 3 different block sizes (1K, 4K, 16K), then noop, then deadline. Each test runs a growing workload of 1, 4, 16, 64 concurrent users (processes) bombarding my storage subsystem non-stop with I/O requests.
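
To make the access pattern concrete, here is a small Python sketch of how a random I/O workload could pick its offsets. How IObench actually chooses them is not described here, so the uniform block-aligned choice below is an assumption, and `random_offsets` is a hypothetical helper.

```python
import random


def random_offsets(device_size, block_size, count, seed=None):
    """Return count random byte offsets, each a multiple of block_size.

    Every offset leaves room for one full block before device_size,
    so a read or write of block_size bytes never runs past the end.
    """
    if block_size <= 0 or device_size < block_size:
        raise ValueError("device must hold at least one block")
    rng = random.Random(seed)
    nblocks = device_size // block_size
    return [rng.randrange(nblocks) * block_size for _ in range(count)]


# Block sizes and concurrency levels used in the tests above.
BLOCK_SIZES = (1024, 4096, 16384)
USER_LEVELS = (1, 4, 16, 64)
```

Each concurrent process would run its own loop over such offsets, ideally with a different seed so the processes do not all hit the same blocks.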

Read-Only@RAW-device:

Observations :

Random Read is scaling well for all Linux I/O Schedulers
Reads reported by application (IObench) are matching numbers reported by the system I/O stats
1K reads are running slightly faster than 4K (as expected with "normal" disks: transferring a bigger data volume reduces overall performance, which is normal)..

Write-Only @RAW-device:

Observations :

looking at the graph you may easily understand now what is wrong with the "cfq" I/O scheduler.. - it's serializing write operations!
while "noop" and "deadline" are continuing to scale with a growing workload..
so, it's clear now WHY performance gains were observed by many people on MySQL workloads by simply switching from "cfq" to "noop" or "deadline"

To check which I/O scheduler is used for your storage device:
# cat /sys/block/{DEVICE-NAME}/queue/scheduler

For ex. for "sda": # cat /sys/block/sda/queue/scheduler

Then set "deadline" for "sda": # echo deadline > /sys/block/sda/queue/scheduler
To set "deadline" as the default I/O scheduler for all your storage devices you may boot your system with the "elevator=deadline" boot option. Interesting that by default many Linux systems use "cfq". All recent Oracle Linux systems are shipped with "deadline" by default.
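
The same check can be scripted. The sysfs file lists all available schedulers on one line and marks the active one with square brackets (for ex. `noop anticipatory deadline [cfq]`). The sketch below parses that line; `parse_scheduler` and `read_scheduler` are hypothetical helper names.

```python
def parse_scheduler(line):
    """Parse a sysfs scheduler line into (active, available).

    Example input: "noop anticipatory deadline [cfq]"
    """
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # strip the brackets marking the active one
            active = name
        available.append(name)
    return active, available


def read_scheduler(device):
    """Return (active, available) schedulers for a device such as "sda"."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        return parse_scheduler(f.read())
```

For ex., `read_scheduler("sda")[0]` gives the scheduler currently in use for "sda".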

TESTING FILESYSTEMS

As you understand, there is no reason to continue any further tests with the "cfq" I/O scheduler.. - if it's already bad on the raw level, it cannot get better due to any filesystem features ;-)) (While I was also told that in recent Linux kernels the "cfq" I/O scheduler should perform much better, let's see)..

Anyway, my filesystem test scenario:

Linux I/O Scheduler: deadline
Filesystems: ext3, ext4, XFS
File flags/options: osync (O_SYNC), direct (O_DIRECT), fsync (fsync() is invoked after each write()), fdatasync (same as fsync, but calling fdatasync() instead of fsync())
Block size: 1K, 4K, 16K
Workloads: Random Reads, Random Writes on a single 128GB file - it's the most critical file access for any database (having a hot table, or a hot tablespace)
NOTE: to avoid most of the background effects of caching, I've limited the RAM available for the FS cache to 8GB only! (all other RAM was allocated to a huge SHM segment with huge pages, so not swappable)..
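
The four flags/options differ only in how the file is opened and in what is called after each write. A minimal Python sketch of that difference follows. It is not IObench: the names `open_flags` and `write_blocks` are hypothetical, and for brevity it writes blocks sequentially, while the tests here use random offsets.

```python
import os

MODES = ("osync", "direct", "fsync", "fdatasync")


def open_flags(mode):
    """Return the open() flags matching one of the four test modes."""
    flags = os.O_WRONLY | os.O_CREAT
    if mode == "osync":
        flags |= os.O_SYNC      # every write() returns only after it is on disk
    elif mode == "direct":
        flags |= os.O_DIRECT    # bypass the page cache; needs aligned buffers
    elif mode not in ("fsync", "fdatasync"):
        raise ValueError("unknown mode: %r" % (mode,))
    return flags


def write_blocks(path, mode, block, count):
    """Write count copies of block to path using the given mode.

    For "direct", block must be a 512-byte aligned buffer (for ex. an
    anonymous mmap) whose length is a multiple of 512.
    """
    fd = os.open(path, open_flags(mode), 0o600)
    try:
        for i in range(count):
            os.pwrite(fd, block, i * len(block))
            if mode == "fsync":
                os.fsync(fd)        # flush data and metadata
            elif mode == "fdatasync":
                os.fdatasync(fd)    # flush data, skip non-essential metadata
    finally:
        os.close(fd)
    return count * len(block)
```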

Also, we have to keep in mind now the highest I/O levels observed on RAW devices:

Random Read: ~4500 op/sec
Random Write: ~5000 op/sec

So, if for any reason Read or Write performance is higher on any of the filesystems - it'll be clear there is some buffering/caching happening on the SW level ;-))

Now, let me explain what you'll see on the following graphs:

there are already too many of them, so I've tried to put more data on each graph :-))
there are 12 tests on each graph (x3 series of x4 tests)
each series of tests is executed using the same block size (1K, then 4K, then 16K)
within a series of 4 tests, 4 flags/options are used one after another (osync, direct, fsync, fdatasync)
each test is executed as before with 1, 4, 16, 64 concurrent user processes (IObench)
only one filesystem per graph :-))
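
The layout of each graph described above can be written down as a small test matrix. The Python sketch below just enumerates it in the plotted order (block size in the outer loop, flag in the inner loop). The names are hypothetical.

```python
from itertools import product

BLOCK_SIZES = (1024, 4096, 16384)                    # 1K, 4K, 16K
FLAGS = ("osync", "direct", "fsync", "fdatasync")
USERS = (1, 4, 16, 64)                               # concurrent processes


def graph_tests():
    """Return the 12 (block_size, flag) tests of one graph, in order."""
    return list(product(BLOCK_SIZES, FLAGS))


def graph_runs():
    """Return every (block_size, flag, users) step of one graph."""
    return [(bs, flag, users)
            for bs, flag in graph_tests()
            for users in USERS]
```

So the 5th test on a graph, `graph_tests()[4]`, is the 4K series with osync, and one graph covers 48 load steps in total.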

So, let's start now with Read-Only results.

Read-Only @EXT3:

Observations :

scaling pretty well, reaching 4500 reads/sec at most
on 1K reads: only "direct" reads are really reading 1K blocks, all other options involve reading 4K blocks
nothing unexpected in the end :-)

Read-Only @EXT4:

Observations :

same as on ext3, nothing unexpected

Read-Only @XFS:

Observations :

no surprise here either..
but there was one surprise anyway ;-))

While the results on Random Read workloads look exactly the same on all 3 filesystems, there is still some difference in how the O_DIRECT feature is implemented on them! ;-))

The following graphs represent the same tests, but only those executed with the O_DIRECT flag (direct). The first 3 tests are with EXT3, then 3 with XFS, then 3 with EXT4:

Direct I/O & Direct I/O

Observations :

the most important here is the last graph, showing the memory usage on the system during the O_DIRECT tests
as you may see, only with XFS is the filesystem cache usage near zero!
while EXT3 and EXT4 still continue cache buffering.. - which may be a very painful surprise when you're expecting to use this RAM for something else ;-))
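
To watch the same effect on your own system, the filesystem cache usage can be read from /proc/meminfo while an O_DIRECT test is running. Here is a minimal Python sketch. Summing the "Cached" and "Buffers" fields as "FS cache" is a simplification on my side, and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def page_cache_kb(text):
    """Return an approximate filesystem cache size in kB."""
    info = parse_meminfo(text)
    return info.get("Cached", 0) + info.get("Buffers", 0)
```

For ex., calling `page_cache_kb(open("/proc/meminfo").read())` before and during a test shows whether the cache is growing even though O_DIRECT is used.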

Well, let's see now what is different on the Write Performance.

Write-Only @EXT3:

Observations :

the worst performance here is with 1K blocks.. - as the default EXT3 block size is 4K, a 1K write involves a read-on-write (it has to read the 4K block first, then change the corresponding 1K within it, and then write the 4K block back with the changes applied..)
read-on-write is not happening on 1K when the O_DIRECT flag is used: we're really writing 1K here
however, O_DIRECT writes are not scaling at all on EXT3! - and it finally explains to me WHY I've always got worse performance when I tried to use the O_DIRECT flush option in InnoDB on an EXT3 filesystem! ;-))
interesting that the highest performance here is obtained with the O_SYNC flag, and we're not far from the 5000 writes/sec the storage is capable of..
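
The read-on-write cost mentioned above is easy to put into numbers. The Python sketch below uses a deliberately simplified model, which is my assumption and not something measured here: every filesystem block only partially covered by a buffered write has to be read first (as if it were never cached), and every touched block is written back in full. `write_io_cost` is a hypothetical helper.

```python
def write_io_cost(write_size, offset, fs_block=4096):
    """Return (bytes_read, bytes_written) for one buffered write.

    Simplified model: partially covered blocks are read before being
    modified, and all touched blocks are written back completely.
    """
    if write_size <= 0 or offset < 0:
        raise ValueError("write_size must be > 0 and offset >= 0")
    end = offset + write_size
    first = offset // fs_block
    last = (end - 1) // fs_block
    touched = last - first + 1
    partial = set()
    if offset % fs_block:
        partial.add(first)   # write starts in the middle of a block
    if end % fs_block:
        partial.add(last)    # write ends in the middle of a block
    return len(partial) * fs_block, touched * fs_block
```

With a 4K filesystem block, `write_io_cost(1024, 0)` gives `(4096, 4096)`: 4K read plus 4K written for 1K of useful data, while `write_io_cost(4096, 0)` gives `(0, 4096)`.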

Write-Only @EXT4:

Observations :

similar to EXT3, but performance is worse compared to EXT3
interesting that only with the O_SYNC flag is the performance comparable with EXT3, while in all other cases it's simply worse..
I may suppose here that EXT3 is not flushing on every fsync() or fdatasync(), and that's why it's performing better with these options ;-)) need to investigate here.. But anyway, the result is the result..

What about XFS?..

Write-Only @XFS:

Observations :

XFS results are quite different from those of EXT3 and EXT4
I've used a default XFS setup here, and, curiously, did not observe the impact of the missing "nobarrier" option which was reported by Vadim in the past..
on 1K block writes only O_DIRECT is working well, but unlike on EXT3/EXT4 it's also scaling ;-) (the other options give poor results due to the same read-on-write issue..)
4K block writes are scaling well with O_SYNC and O_DIRECT, but still remain poor with the other options
16K writes show some anomalies: while with O_SYNC nothing goes wrong and it scales well, with O_DIRECT there is some kind of serialization happening on 4 and 16 concurrent user processes.. - and then on 64 users things come back to normal.. Interesting that it was not observed with 4K block writes.. Which reminds me of last year's discussion about the page block size in InnoDB for SSD, and the gain reported by using a 4K page size vs 16K.. - just keep in mind that sometimes it may not be related to SSD at all, but just to some filesystem internals ;-))
anyway, no doubt - if you have to use O_DIRECT in your MySQL server - use XFS! :-)

Now, what is the difference between a "default" XFS configuration and "tuned" ??..

I've recreated XFS with a 64MB log size and mounted it with the following options:
# mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8,logbsize=32k

The results follow..

Write-Only @XFS-tuned:

Observations :

everything is similar to "default" config, except that there is no more problem with 16K block size performance
and probably this 16K anomaly observed before is something random, hard to say.. - but at least I saw it, so cannot ignore ;-))

Then, keeping in mind that XFS performs so well with a 1K block size, I was curious to see if things would go even better if I created my XFS filesystem with a 1K block size instead of the default 4K..

Write-Only @XFS-1K:

Observations :

when XFS is created with 1K block size there is no more read-on-write issue on 1K writes..
and we're really writing 1K..
however, the performance is completely poor.. even on 1K writes with O_DIRECT !!!
why?..

The answer came from the Random Reads test on the same XFS, created with a 1K block size.

Read-Only @XFS-1K:

Observations :

if you followed me until now, you'll understand everything from the last graph, reporting RAM usage.. ;-))
the previously free 8GB of RAM is no longer free here..
so, XFS is not using O_DIRECT here!
and you may see also that for all reads except O_DIRECT, it's reading 4K for every 1K, which is abnormal..

Instead of SUMMARY

I'd say the main point here is - "test your I/O subsystem performance before deploying your MySQL server" ;-))
avoid using the "cfq" I/O scheduler :-)
if you've decided to use the O_DIRECT flush method in your MySQL server - deploy your data on XFS..
it seems to me the main reason why people use O_DIRECT with MySQL is a wish to avoid dealing with the various issues of the filesystem cache.. - and there is probably something that needs to be improved in the Linux kernel, no? ;-)
it could be very interesting to see similar test results on other filesystems too..
things may look better with a newer Linux kernel..

So far, I've got some answers to my WHY questions.. It will be good now to find time to test it directly with MySQL ;-)

Any comments are welcome!

Rgds,
-Dimitri
