Jay Kreps: Why write() sometimes takes a long time on Linux

About the author:

Jay Kreps is a former senior engineer at LinkedIn and is currently the CEO of Confluent. His self-introduction reads:

I am the co-founder of Confluent and also the co-creator of Apache Kafka as well as various other open source projects. 

Summary:

In this article, Jay Kreps analyzes, simply and directly, why write() sometimes takes a long time on the ext4 filesystem. The whole piece flows smoothly, without a wasted word.




Why do write calls sometimes block for a long time in Linux?


Update: Explanation at the bottom.

Anyone know why fwrite() calls sometimes block?

Here is a test I did. I sequentially append 4096 byte chunks to 10 files in a round robin fashion. Throughput is artificially limited to a fixed rate set at about ¼ the maximum throughput of the drives, which I accomplish by sleeping at subsecond intervals in between my writes. I time each call to write. My expectation is that writes go immediately to pagecache and are asynchronously written out to disk (unless I were to call fsync, which I don’t).
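The original benchmark code is not shown in the post, but a minimal sketch of the test described above might look like this in C (the file names and the 100 MB/sec target rate are assumptions, not values from the post):

/*
 * Minimal sketch of the test described above: append 4096-byte chunks to
 * ten files round robin, throttle to a fixed rate by sleeping between
 * writes, and time every write() call.  File names and the target rate
 * are placeholder assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NFILES     10
#define CHUNK      4096
#define TARGET_MBS 100.0          /* assumed ~1/4 of the drive's max rate */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char buf[CHUNK];
    int fds[NFILES];
    char name[64];
    double sum_ms = 0.0, max_ms = 0.0, window_start;
    long writes = 0;
    /* sub-second sleep between writes that keeps us near TARGET_MBS */
    double interval = CHUNK / (TARGET_MBS * 1024 * 1024);

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "testfile.%d", i);
        fds[i] = open(name, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fds[i] < 0) { perror("open"); return 1; }
    }

    window_start = now_sec();
    for (long n = 0; ; n++) {               /* run until interrupted */
        double t0 = now_sec();
        if (write(fds[n % NFILES], buf, CHUNK) != CHUNK) {   /* no fsync() */
            perror("write");
            return 1;
        }
        double ms = (now_sec() - t0) * 1000.0;
        sum_ms += ms;
        if (ms > max_ms)
            max_ms = ms;
        writes++;

        /* report throughput and avg/max write latency once per second */
        double elapsed = now_sec() - window_start;
        if (elapsed >= 1.0) {
            printf("%.3f MB/sec  avg %.3f ms  max %.3f ms\n",
                   writes * (double)CHUNK / (1024 * 1024) / elapsed,
                   sum_ms / writes, max_ms);
            writes = 0; sum_ms = 0.0; max_ms = 0.0;
            window_start = now_sec();
        }

        struct timespec sleep_ts = { 0, (long)(interval * 1e9) };
        nanosleep(&sleep_ts, NULL);
    }
}

On an older toolchain this needs something like gcc -std=gnu99 -O2 bench.c -o bench -lrt (the -lrt is for clock_gettime); it prints one line per second, in the same spirit as the table below.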

In fact this is usually what happens: the average time is just a few microseconds, but sometimes the write calls block for somewhere between 400 ms and a few seconds. I am using Linux 2.6.32 (RHEL 6) with ext4. I am using default configurations otherwise (no change to any of the /proc/sys/vm stuff, and fiddling with those parameters doesn't seem to help).
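For reference, the /proc/sys/vm knobs in question are presumably the dirty-writeback settings: dirty_background_ratio (when background flushing starts), dirty_ratio (when writers themselves get throttled), and dirty_expire_centisecs and dirty_writeback_centisecs (how old dirty pages may get and how often the flush threads wake up).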

Here is a trace of average and max latencies taken over 1-second intervals. Note the regular spikes. What is the cause of this? Locking between the flush threads and the write thread? Is there anything you can do to mitigate it? This affects anything that does logging, i.e. almost everything.

I understand why this would happen if I exceeded the throughput that Linux's background flush threads can handle, but that is clearly not the case here, as the drive can sustain 400MB/sec of writes over a long period.

I tried this on a few different machines, some with RAID, some with a single disk and all showed the same behavior to varying degrees.

Throughput (MB/sec)   Avg. Latency (ms)   Max Latency (ms)
99.915                0.000               0.054
99.973                0.000               0.087
100.005               0.000               0.057
100.089               0.000               0.041
99.965                0.000               0.071
99.977                0.000               0.098
99.999                0.000               0.076
99.995                0.000               0.104
99.961                0.000               0.057
100.016               0.000               0.226
56.977                0.000               756.174
99.966                0.000               0.100
99.925                0.000               0.093
100.001               0.000               3.070
100.074               0.000               0.084
100.193               0.000               0.054
100.207               0.000               0.053
99.998                0.000               0.055
99.908                0.000               107.069
99.980                0.000               117.124
99.985                0.000               0.054
99.948                0.000               0.061
99.991                0.000               0.090
99.973                0.000               0.046
99.989                0.000               11.923
100.035               0.000               0.041
100.355               0.000               2.698
99.999                0.000               0.052
100.000               0.000               0.055
99.963                0.000               12.534
99.975                0.000               0.058
100.351               0.000               0.044
99.990                0.000               2.889
100.284               0.000               0.042
99.931                0.000               0.042
100.218               0.000               0.056
99.992                0.000               0.065
100.191               0.000               0.057
100.023               0.000               0.401
99.918                0.000               1.957
100.004               0.000               61.265
99.938                0.000               0.092
100.179               0.000               0.057
99.996                0.000               0.062
Update

A number of people gave interesting and helpful suggestions, such as "this is your punishment for not using Solaris." The best suggestion, from Mark Wolfe, was to install latencytop and measure it. To do this on Red Hat you need to install their debug kernel and reboot into it; latencytop will then capture the highest-latency kernel operations for each process. This gives a great deal of insight into what is going on.

For those who are curious here are a few of the traces that pop up as causing hundreds of ms of latency:

vfs_write()
  do_sync_write()
    ext4_file_write()
      generic_file_aio_write()
        ext4_da_write_begin() [in this case da means delayed allocation]
          block_write_begin()
            __block_prepare_write()
              ext4_da_get_block_prep()
                ext4_get_block_prep()
                  ext4_get_blocks()
                     call_rw_sem_down_read_failed()

This trace seems to be due to delayed allocation. Turning off delayed allocation makes it go away, though probably at the cost of some throughput.
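(For reference, delayed allocation on ext4 is controlled by the delalloc/nodelalloc mount options, so mounting the filesystem with nodelalloc is a quick way to test this.)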

Here is another one; this one seems to be related to journalling.

sys_write()
  vfs_write()
    do_sync_write()
      ext4_file_write()
        generic_file_aio_write()
          __generic_file_aio_write()
            file_update_time()
              __mark_inode_dirty()
                ext4_dirty_inode()
                  ext4_journal_start_sb()
                    jbd2_journal_start()
                      start_this_handle()

You can check out the details of the code in one of these nice Linux kernel code browsers. My takeaway from all this was that maybe it is time to look at XFS, since that allegedly also has better file locking for fsync, which is my other problem.

