Source: http://idning.github.io/redis-aof-latency.html
My redis AOF configuration is as follows:

    appendonly yes
    appendfsync everysec
redis-mgr schedules aof_rewrite and rdb between 6:00 and 8:00 every morning, so every day during that window we receive twemproxy forward_err alerts, losing roughly 5000 requests per minute.
That is a failure rate of about 10 in 10,000.
This can be reproduced on a production machine: a 10 GB file write is enough to trigger the problem:

    dd if=/dev/zero of=xxxxx bs=1M count=10000 &
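As a complementary probe (the path, size, and use of conv=fsync here are my own choices, not part of the original test), timing a write that ends with a forced flush gives a rough idea of how long a flush on this disk takes:

```shell
# Hedged sketch: time a 100 MB write that ends in a single fsync().
# GNU dd's conv=fsync flushes the file to disk before dd exits, so the
# reported elapsed time includes the kind of flush the AOF waits on.
f=/tmp/flush-probe.$$
dd if=/dev/zero of="$f" bs=1M count=100 conv=fsync 2>&1 | tail -n 1
rm -f "$f"
```

Running this while the dd reproduction above is active should show the per-flush time ballooning.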
We changed the setting to appendfsync no, which alleviates the problem but does not solve it.
antirez's article on Redis latency already explains the various sources of delay very clearly. What we are hitting here is the AOF being affected by disk I/O.
1 Some analysis
1.1 Why doesn't the slowlog show it?
- The slowlog only measures CPU time; the AOF write is not counted as part of the query time (nor should it be).
1.2 Observation
Using the following command:
strace -f -p $(pidof redis-server) -T -e trace=fdatasync,write 2>&1 | grep -v '0.0' | grep -v unfinished
While the copy is running, we can observe:

    [pid 24734] write(42, "*4\r\n$5\r\nhmset\r\n$37\r\np-lc-d687791"..., 272475) = 272475 <0.036430>
    [pid 24738] <... fdatasync resumed> ) = 0 <2.030435>
    [pid 24738] <... fdatasync resumed> ) = 0 <0.012418>
    [pid 24734] write(42, "*4\r\n$5\r\nHMSET\r\n$37\r\np-lc-6787211"..., 73) = 73 <0.125906>
    [pid 24738] <... fdatasync resumed> ) = 0 <4.476948>
    [pid 24734] <... write resumed> ) = 294594 <2.477184>    (a 2.47s write)
Meanwhile redis-cli reports:

    $ ./_binaries/redis-cli --latency-history -h 10.38.114.60 -p 2000
    min: 0, max: 223, avg: 1.24 (1329 samples) -- 15.01 seconds range
    min: 0, max: 2500, avg: 3.46 (1110 samples) -- 15.00 seconds range   (2.5s observed here)
    min: 0, max: 5, avg: 1.01 (1355 samples) -- 15.01 seconds range
The watchdog logs:

    [24734] 07 Jul 10:54:41.006 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
    [24734 | signal handler] (1404701682) --- WATCHDOG TIMER EXPIRED ---
    bin/redis-server *:2000(logStackTrace+0x4b)[0x443bdb]
    /lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
    /lib64/tls/libpthread.so.0[0x302b80c420]
    /lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
    bin/redis-server *:2000(flushAppendOnlyFile+0x76)[0x43f616]
    bin/redis-server *:2000(serverCron+0x325)[0x41b5b5]
    bin/redis-server *:2000(aeProcessEvents+0x2b2)[0x416a22]
    bin/redis-server *:2000(aeMain+0x3f)[0x416bbf]
    bin/redis-server *:2000(main+0x1c8)[0x41dcd8]
    /lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x302af1c4bb]
    bin/redis-server *:2000[0x415b1a]
    [24734 | signal handler] (1404701682) --------
So we confirmed that it is write() that hangs.
1.3 Why appendfsync no does not help
When the kernel's write buffer fills up, write() blocks until some of it is freed; only then can writing continue.
So even if the program never calls sync, the kernel will sync at some unpredictable moment, and write() will hang at that point.
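This is easy to see on a scratch machine (a hedged sketch, not from the original post: it needs root, and the 8 MB budget below is an arbitrary value chosen for the demo). Shrink vm.dirty_bytes so the kernel throttles buffered writers, then time a plain buffered write; no fsync is issued anywhere, yet write() stalls once the dirty budget is exhausted:

```shell
# Sketch only -- run as root on a scratch box; all values are assumptions.
old_bytes=$(cat /proc/sys/vm/dirty_bytes)
old_ratio=$(cat /proc/sys/vm/dirty_ratio)
echo $((8 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes   # tiny dirty budget

time dd if=/dev/zero of=/tmp/noflush.$$ bs=1M count=500   # buffered, no fsync

# Restore: writing dirty_bytes zeroed dirty_ratio, so put back whichever
# setting was in effect before.
if [ "$old_bytes" -gt 0 ]; then
    echo "$old_bytes" > /proc/sys/vm/dirty_bytes
else
    echo "$old_ratio" > /proc/sys/vm/dirty_ratio
fi
rm -f /tmp/noflush.$$
```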
2 Some ideas
- Could we rate-limit rdb/aof_rewrite/cp and similar operations?
  - We cannot throttle every process that writes (there are other processes writing logs, for example), so this is best avoided.
- Increase the proxy timeout, currently 400ms, to 2000ms?
  - A 400ms timeout amounts to failing fast; a client retry achieves the same effect, so there is no need to change it.
- Disable AOF on the master.
  - This requires no code change, costs little, and works best; the downsides are higher operational complexity and lower data safety. redis-mgr could support this.
- The blocking inside write() seems unavoidable; could a separate thread do the write?
  - I wrote a patch for this idea and submitted it to the author: https://github.com/antirez/redis/pull/1862
  - The author did not seem very interested, though.
3 About the page cache
- I/O scheduling is generally optimized for reads, because reads are synchronous: a process that cannot get its data sleeps. Writes are asynchronous; they only go to the page cache.
3.1 Checking the current page cache state

    grep ^Cached:    /proc/meminfo   # page cache size
    grep ^Dirty:     /proc/meminfo   # total size of all dirty pages
    grep ^Writeback: /proc/meminfo   # total size of dirty pages currently being written back
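To watch these counters move while a rewrite or a big copy is in flight, a small sampling loop is enough (the one-second interval and five samples are arbitrary choices):

```shell
# Print Dirty and Writeback once per second, five times.
for i in 1 2 3 4 5; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '
    echo
    sleep 1
done
```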
3.2 Parameters

    ning@ning-laptop ~/test$ sysctl -a | grep dirty
    vm.dirty_background_ratio = 10
    vm.dirty_background_bytes = 0
    vm.dirty_ratio = 20
    vm.dirty_bytes = 0
    vm.dirty_writeback_centisecs = 1500
    vm.dirty_expire_centisecs = 3000
For details see: https://www.kernel.org/doc/Documentation/sysctl/vm.txt

    /proc/sys/vm/dirty_expire_centisecs    # 3000, i.e. 3000*0.01s = 30s: pages dirty for longer than 30s are flushed to disk.
    /proc/sys/vm/dirty_writeback_centisecs # 1500, i.e. 1500*0.01s = 15s: the kernel flusher threads wake up every 15s.
    /proc/sys/vm/dirty_background_ratio
    /proc/sys/vm/dirty_ratio

Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the first threshold (dirty_background_ratio), write-outs begin in the background via the "flush" kernel threads. When the second threshold is reached, processes will block, flushing in the foreground. The problem with these variables is their minimum value: even 1% can be too much. This is why another two controls were introduced in 2.6.29:

    /proc/sys/vm/dirty_background_bytes
    /proc/sys/vm/dirty_bytes
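To see what those percentages mean in absolute terms on a given box, the thresholds can be derived from MemTotal (this arithmetic is mine; it just multiplies the ratios read from procfs):

```shell
# Convert dirty_background_ratio / dirty_ratio into kilobytes of RAM.
# Note: if vm.dirty_bytes is in use, both ratios read 0 and so do these.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
bg_ratio=$(cat /proc/sys/vm/dirty_background_ratio)
fg_ratio=$(cat /proc/sys/vm/dirty_ratio)
echo "background flushing starts at: $((mem_kb * bg_ratio / 100)) kB"
echo "writers block (foreground) at: $((mem_kb * fg_ratio / 100)) kB"
```

On a 128 GB machine with the default 10%, background flushing only starts around 12.8 GB of dirty pages.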
x_bytes and x_ratio are mutually exclusive: setting dirty_bytes clears dirty_ratio to 0:

    root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
    0
    root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
    20
    root@ning-laptop:~# echo '5000000' > /proc/sys/vm/dirty_bytes
    root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
    5000000
    root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
    0
Lower values generate more I/O requests (and more interrupts), significantly decrease sequential I/O bandwidth, but also decrease random I/O latency. That is, small values trade system I/O bandwidth for lower random I/O latency.
3.3 Stable Page Write
http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done. This is called Stable Page Write.
This may cause write() stalls, especially when using slower disks. Without write cache, flushing to disk takes ~10ms usually, ~100ms in bad cases.
On newer kernels a patch mitigates this problem; it works by reducing the chance that write() has to wait in wait_on_page_writeback:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0

Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     WriteX        109347     0.028    59.817
     ReadX         347180     0.004     3.391
     Flush          15514    29.828   287.283
    Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

    3.8.0-rc3 + patches:
     WriteX        105556     0.029     4.273
     ReadX         335004     0.005     4.112
     Flush          14982    30.540   298.634
    Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this patch enabled.
XFS reportedly avoids this problem as well.
3.4 Checking production machines

    $ cat /proc/sys/vm/dirty_background_ratio
    10
    $ cat /proc/sys/vm/dirty_ratio
    20
Dirty pages at normal times (the machine has 128 GB of RAM):

    $ grep ^Dirty: /proc/meminfo
    Dirty:          104616 kB
During the morning rdb/aof_rewrite, Dirty rises to about 500,000 kB (500 MB).
Neither figure comes close to the configured dirty_background_ratio or dirty_ratio thresholds, so tuning those two parameters is unlikely to help.
Test:

    # 1. Let pages stay dirty for up to 90s: vm.dirty_expire_centisecs = 9000
    echo '9000' > /proc/sys/vm/dirty_expire_centisecs
    # 2. Raise dirty_ratio
    echo '80' > /proc/sys/vm/dirty_ratio
3.4.1 Tuning dirty_ratio
On a 48 GB machine with poor I/O, setting dirty_ratio = 80 makes Dirty grow very high, but there is no clear improvement in redis latency:

    $ grep ^Dirty: /proc/meminfo
    Dirty:         8598180 kB
    => echo '80' > /proc/sys/vm/dirty_ratio
    $ grep ^Dirty: /proc/meminfo
    Dirty:        11887180 kB
    $ grep ^Dirty: /proc/meminfo
    Dirty:        21295624 kB
4 Summary
- Disabling AOF on the master is currently the most acceptable approach.
- antirez is working on latency sampling.
- XFS/Solaris apparently do not have this problem.
5 Related
- A discussion dating back to 2011: https://groups.google.com/forum/#!msg/redis-db/jgGuGngDEb0/ZwnvUdx-gdAJ
  - The author originally wanted to move both write and fsync to another thread; in the end only fsync was moved to a separate thread.
- A LinkedIn engineer ran an experiment measuring the latency produced when writing with only 1/4 of the disk bandwidth: