Overview
The Ceph OSD heartbeat parameters determine how quickly the MON detects a failed OSD. If the MON is slow to notice the failure, client workloads suffer significantly.
Because Ceph is a strongly consistent storage system, every write must be committed to every replica before it counts as complete. If an OSD goes down mid-write and therefore cannot acknowledge the write to the primary OSD or the client, the primary OSD or the client keeps waiting for that write to finish (there is no way to tell whether the OSD is merely slow or actually down), which results in high client latency.
If the MON can quickly detect the unresponsive OSD and mark it down, subsequent operations can route around that OSD, reducing the impact on the workload.
The MON determines whether an OSD is alive via heartbeats; by tuning the heartbeat parameters, the MON can be made to detect a down OSD much faster.
Parameters
Official documentation:
https://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
Parameters recommended for tuning (a ceph.conf sketch follows the list):
osd heartbeat interval
How often an OSD pings its peer OSDs
osd heartbeat grace
Heartbeat grace period: after sending a heartbeat, how long to wait at most for the reply before considering the peer unresponsive
mon osd min down reporters
How many distinct OSDs must report an OSD as failed before the MON actually marks it down
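For reference, a minimal ceph.conf sketch of these three options as a hedged starting point: the interval and grace values are the jewel defaults reported by config show later in this post, and the reporter count of 2 is inferred from the "2 reporters from different host" MON log line below.
[global]
osd heartbeat interval = 6        # seconds between pings to peer OSDs (default)
osd heartbeat grace = 20          # seconds to wait for a heartbeat reply (default)
mon osd min down reporters = 2    # distinct reporters needed to mark an OSD down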
Testing
Environment and commands
cephnode: Ceph version 10.2.5, 9 OSDs
cephclient: maps the RBD image and acts as a simulated client, using ioping to measure latency
Create an RBD image for testing:
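(The image was presumably created with something along these lines; the pool name test, the size, and the layering-only feature set are taken from the rbd info output below, while the exact flags are an assumption about the jewel rbd CLI.)
[root@cephnode ceph]# rbd create test/rbd1 --size 102400 --image-format 2 --image-feature layering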
[root@cephnode ceph]# rbd info test/rbd1
rbd image 'rbd1':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.3eddc238e1f29
format: 2
features: layering
flags:
[root@cephnode ceph]#
Map the RBD image on the client:
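(Presumably mapped with the kernel RBD client, something like the line below; since the image enables only the layering feature, the jewel-era kernel client can map it.)
[root@cephclient ceph]# rbd map test/rbd1   # appears as /dev/rbd0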
[root@cephclient ceph]# lsblk | grep rbd
rbd0 252:0 0 100G 0 disk
[root@cephclient ceph]#
Simulating the client
Use ioping to measure the block device's read/write latency: one 4 KiB read/write every 0.1 s. Once ioping is running, manually kill an OSD to simulate an OSD going down.
[root@cephclient ~]# ioping -i 0.1 /dev/rbd0        # read
[root@cephclient ~]# ioping -i 0.1 /dev/rbd0 -WWW   # write
Simulating an OSD failure
Manually kill an OSD to simulate a failure, then watch the ioping latency. (If the ceph-osd processes are managed by systemctl, they are automatically restarted on failure, so you may need to run the OSD manually to keep it from coming straight back after the kill and skewing the test.)
[root@cephnode ~]# ps aux | grep ceph | grep "id 8"
ceph 13459 3.2 0.1 1017328 138688 pts/0 Sl+ 15:09 0:02 /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph
[root@cephnode ~]# kill -9 13459
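One way to keep systemd's auto-restart out of the test, as hinted above, is to stop the unit and run the OSD by hand in the foreground; this is a sketch assuming the standard jewel unit name ceph-osd@<id>, with the command line copied from the ps output above.
[root@cephnode ~]# systemctl stop ceph-osd@8
[root@cephnode ~]# /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph &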
Latency tests
Configured parameters
The defaults are currently in effect; this test focuses on osd heartbeat interval and osd heartbeat grace.
[root@cephnode ~]# ceph daemon osd.7 config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "20",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]# ceph daemon mon.cephnode config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "20",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]#
Read test
Simulate client reads and kill an OSD, then observe the client read latency while the OSD is down: the stalled request took 18.4 s.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0
......
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=798 time=741.3 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=799 time=869.1 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=800 time=809.0 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=801 time=681.3 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=802 time=699.9 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=803 time=749.5 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=804 time=18.4 s (slow) # this IO only completed once the OSD was marked down
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=805 time=3.22 ms (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=806 time=777.2 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=807 time=708.2 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=808 time=768.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=809 time=726.1 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=810 time=801.1 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=811 time=802.8 us (fast)
Write test
Likewise, kill an OSD by hand to simulate a failure while running client writes, and observe the client write latency while the OSD is down: the stalled request took 21.4 s.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0 -WWW
......
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=166 time=2.90 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=167 time=2.52 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=168 time=2.48 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=169 time=2.57 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=170 time=2.63 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=171 time=2.76 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=172 time=2.83 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=173 time=2.44 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=174 time=2.66 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=175 time=3.02 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=176 time=2.81 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=177 time=21.4 s (slow) # this IO only completed once the OSD was marked down
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=178 time=4.88 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=179 time=2.75 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=180 time=2.65 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=181 time=4.36 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=182 time=2.37 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=183 time=2.56 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=184 time=3.10 ms (fast)
Changing the parameters
Cut osd_heartbeat_grace and osd_heartbeat_interval in half: osd_heartbeat_grace 20 → 10, osd_heartbeat_interval 6 → 3. (osd.8 is still down from the earlier kill, hence the error for it in the output below.)
[root@cephnode ceph]# ceph tell osd.* injectargs '--osd_heartbeat_grace 10'
osd.0: osd_heartbeat_grace = '10' (unchangeable)
osd.1: osd_heartbeat_grace = '10' (unchangeable)
osd.2: osd_heartbeat_grace = '10' (unchangeable)
osd.3: osd_heartbeat_grace = '10' (unchangeable)
osd.4: osd_heartbeat_grace = '10' (unchangeable)
osd.5: osd_heartbeat_grace = '10' (unchangeable)
osd.6: osd_heartbeat_grace = '10' (unchangeable)
osd.7: osd_heartbeat_grace = '10' (unchangeable)
Error ENXIO: problem getting command descriptions from osd.8
osd.8: problem getting command descriptions from osd.8
[root@cephnode ceph]# ceph tell osd.* injectargs '--osd_heartbeat_interval 3'
osd.0: osd_heartbeat_interval = '3' (unchangeable)
osd.1: osd_heartbeat_interval = '3' (unchangeable)
osd.2: osd_heartbeat_interval = '3' (unchangeable)
osd.3: osd_heartbeat_interval = '3' (unchangeable)
osd.4: osd_heartbeat_interval = '3' (unchangeable)
osd.5: osd_heartbeat_interval = '3' (unchangeable)
osd.6: osd_heartbeat_interval = '3' (unchangeable)
osd.7: osd_heartbeat_interval = '3' (unchangeable)
[root@cephnode ~]# ceph daemon osd.7 config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "3",
"osd_heartbeat_grace": "10",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]# ceph tell mon.* injectargs '--osd_heartbeat_grace 10'
injectargs:osd_heartbeat_grace = '10' (unchangeable)
[root@cephnode ceph]# ceph daemon mon.cephnode config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "10",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]#
Read test
ioping read test; the stall is clearly shorter than with the default parameters above.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0
......
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9882 time=602.6 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9883 time=777.6 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9884 time=604.0 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9885 time=596.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9886 time=733.4 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9887 time=639.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9888 time=757.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9889 time=599.7 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9890 time=768.3 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9891 time=776.2 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9892 time=8.45 s (fast) # this IO only completed once the OSD was marked down, but the stall is clearly shorter than before
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9893 time=1.06 ms (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9894 time=3.14 ms (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9895 time=627.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9896 time=733.3 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9897 time=660.4 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9898 time=754.6 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9899 time=748.6 us (fast)
Write test
ioping write test; again the stall is clearly shorter than with the default parameters above.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0 -WWW
......
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=456 time=2.56 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=457 time=2.77 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=458 time=2.53 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=459 time=2.41 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=460 time=2.70 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=461 time=2.60 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=462 time=2.50 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=463 time=2.53 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=464 time=10.8 s (slow) # this IO only completed once the OSD was marked down, but the stall is clearly shorter than before
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=465 time=2.94 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=466 time=2.58 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=467 time=2.52 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=468 time=2.55 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=469 time=2.53 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=470 time=2.46 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=471 time=2.83 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=472 time=2.78 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=473 time=2.66 ms (fast)
MON log when OSD heartbeats are lost
MON log after the OSD is killed:
2020-02-02 20:20:46.804677 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 192.168.1.1:6801/13459 from osd.2 192.168.1.1:6804/16339 is reporting failure:1
2020-02-02 20:20:46.804698 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 192.168.1.1:6801/13459 reported failed by osd.2 192.168.1.1:6804/16339
2020-02-02 20:20:46.880512 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 192.168.1.1:6801/13459 from osd.7 192.168.1.1:6803/40358 is reporting failure:1
2020-02-02 20:20:46.880535 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 192.168.1.1:6801/13459 reported failed by osd.7 192.168.1.1:6803/40358
2020-02-02 20:20:46.880586 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 we have enough reporters to mark osd.8 down
2020-02-02 20:20:46.880601 7fb392bbd700 0 log_channel(cluster) log [INF] : osd.8 192.168.1.1:6801/13459 failed (2 reporters from different host after 21.010763 >= grace 20.000000)
Here osd.8 is reported failed first by osd.2 and then by osd.7; once the number of distinct reporters reaches mon osd min down reporters, the OSD is marked down.
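To inspect or adjust that threshold at runtime, the same patterns used elsewhere in this post should apply (the value 3 below is only an example, not a recommendation):
[root@cephnode ~]# ceph daemon mon.cephnode config show | grep mon_osd_min_down_reporters
[root@cephnode ~]# ceph tell mon.* injectargs '--mon_osd_min_down_reporters 3'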
Do not set the heartbeat timings too low
If the heartbeat parameters are set too low, frequent false positives will result.
[root@cephnode ceph]# ceph daemon mon.cephnode config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "3",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ~]# ceph daemon osd.7 config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "3",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
OSD states then flap, and OSDs are wrongly marked down:
[root@test92 ~]# sudo ceph -s
cluster 27081438-f5fb-4862-b7c2-b87626bd19ec
health HEALTH_WARN
1 pgs stuck unclean
recovery 7/5238 objects degraded (0.134%)
recovery 7/5238 objects misplaced (0.134%)
monmap e1: 1 mons at {cephnode=192.168.1.1:6789/0}
election epoch 33, quorum 0 cephnode
osdmap e1592: 9 osds: 7 up, 7 in; 1 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v618211: 304 pgs, 7 pools, 3213 MB data, 1746 objects
4711 MB used, 12473 GB / 12478 GB avail
7/5238 objects degraded (0.134%)
7/5238 objects misplaced (0.134%)
303 active+clean
1 active+remapped
client io 69325 B/s rd, 0 B/s wr, 67 op/s rd, 45 op/s wr
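If the cluster ends up flapping like this, reverting to the defaults with the same injectargs pattern used above should settle it (default values per the earlier config dumps; remember to revert the MON as well):
[root@cephnode ceph]# ceph tell osd.* injectargs '--osd_heartbeat_grace 20 --osd_heartbeat_interval 6'
[root@cephnode ceph]# ceph tell mon.* injectargs '--osd_heartbeat_grace 20'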
Caveats
This test was run on Ceph jewel 10.2.x. During testing we found that osd heartbeat interval is configurable on both the MON and the OSDs. Following the jewel documentation and changing it only on the OSDs has no actual effect; it must be changed on both the MON and the OSDs. Worse, if you set the OSDs' osd heartbeat interval lower than the MON's, OSDs will never be marked down, and the clients' writes will block forever.
The documentation for recent Ceph releases has been updated accordingly: it asks for osd heartbeat interval to be set either under [global] or under both the [osd] and [mon] sections.
https://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#osds-check-heartbeats
So whenever you change osd heartbeat interval, change it on both the OSDs and the MON, as sketched below.
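The simplest persistent form is the [global] section of ceph.conf, which covers both daemon types. A minimal sketch with the values tested above; note that injectargs only changes the running daemons, so the change should also be persisted here (and the daemons restarted, or both kept in sync):
[global]
osd heartbeat interval = 3
osd heartbeat grace = 10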
Suggestions for production changes
Lowering the OSD heartbeat timings makes the cluster more responsive, letting the MON detect OSD failures quickly, but a heartbeat timeout that is too short can get OSDs misjudged as down and destabilize the cluster. If you need to make this change on a production cluster, consider the following measures (one or both; see the sketch after this list):
- Set the nodown flag and watch the OSD logs for "no reply" reports. Keep observing for at least 24h, and only lift the nodown flag once nothing abnormal has shown up.
- Raise mon osd min down reporters, possibly even above the total number of OSDs in the cluster; then even a false report cannot mark an OSD down, because the reporter count stays below mon osd min down reporters.
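A sketch of that workflow, combining the nodown flag mentioned below with the injectargs pattern from earlier (the reporter value 100 is an arbitrary number above this cluster's 9 OSDs):
[root@cephnode ~]# ceph osd set nodown          # OSDs cannot be marked down while observing
[root@cephnode ~]# ceph tell mon.* injectargs '--mon_osd_min_down_reporters 100'
... watch the OSD logs for "no reply" reports for at least 24h ...
[root@cephnode ~]# ceph osd unset nodown        # lift the flag once the cluster proves stable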
Log examples
OSD log:
2020-02-02 20:20:46.843525 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a65a0810 osd.8 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:22.843404)
2020-02-02 20:20:46.843678 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4582090 osd.0 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
2020-02-02 20:20:46.843708 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4652690 osd.1 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
2020-02-02 20:20:46.843719 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4582210 osd.2 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
2020-02-02 20:20:46.843729 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4667e90 osd.4 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
MON log (if ceph osd set nodown has been set, the MON receives the failure reports but will not mark the OSD down):
2020-02-02 20:20:46.804677 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 10.32.3.92:6801/13459 from osd.2 10.32.3.90:6804/16339 is reporting failure:1
2020-02-02 20:20:46.804698 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 10.32.3.92:6801/13459 reported failed by osd.2 10.32.3.90:6804/16339
2020-02-02 20:20:46.880512 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 10.32.3.92:6801/13459 from osd.7 10.32.3.92:6803/40358 is reporting failure:1
2020-02-02 20:20:46.880535 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 10.32.3.92:6801/13459 reported failed by osd.7 10.32.3.92:6803/40358
2020-02-02 20:20:46.880586 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 we have enough reporters to mark osd.8 down
2020-02-02 20:20:46.880601 7fb392bbd700 0 log_channel(cluster) log [INF] : osd.8 10.32.3.92:6801/13459 failed (2 reporters from different host after 21.010763 >= grace 20.000000)