Overview
The Ceph OSD heartbeat parameters determine how quickly the MON detects a failed OSD. If the MON is slow to notice the failure, client workloads suffer significantly.
Because Ceph is a strongly consistent storage system, every write must be committed to every replica before it counts as complete. If an OSD goes down mid-write and therefore cannot acknowledge the write to the primary OSD or the client, the primary OSD or the client keeps waiting for that write to finish (there is no way to tell whether the OSD is merely slow or actually down), which results in high client latency.
If the MON can quickly detect the unresponsive OSD and mark it down, subsequent operations can route around that OSD, reducing the impact on the workload.
The MON determines whether an OSD is alive via heartbeats; by tuning the heartbeat parameters, the MON can be made to detect a down OSD much faster.
Parameters
Official documentation:
https://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
Parameters recommended for tuning (a ceph.conf sketch follows the list):
osd heartbeat interval
How often an OSD pings its peer OSDs
osd heartbeat grace
Heartbeat grace period: after sending a heartbeat, how long to wait at most for the reply before considering the peer unresponsive
mon osd min down reporters
How many distinct OSDs must report an OSD as failed before the MON actually marks it down
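For reference, a minimal ceph.conf sketch of these three options as a hedged starting point: the interval and grace values are the jewel defaults reported by config show later in this post, and the reporter count of 2 is inferred from the "2 reporters from different host" MON log line below.
[global]
osd heartbeat interval = 6        # seconds between pings to peer OSDs (default)
osd heartbeat grace = 20          # seconds to wait for a heartbeat reply (default)
mon osd min down reporters = 2    # distinct reporters needed to mark an OSD down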
Testing
Environment and commands
cephnode: Ceph version 10.2.5, 9 OSDs
cephclient: maps the RBD image and acts as a simulated client, using ioping to measure latency
Create an RBD image for testing:
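(The image was presumably created with something along these lines; the pool name test, the size, and the layering-only feature set are taken from the rbd info output below, while the exact flags are an assumption about the jewel rbd CLI.)
[root@cephnode ceph]# rbd create test/rbd1 --size 102400 --image-format 2 --image-feature layering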
[root@cephnode ceph]# rbd info test/rbd1
rbd image 'rbd1':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.3eddc238e1f29
format: 2
features: layering
flags:
[root@cephnode ceph]#
Map the RBD image on the client:
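(Presumably mapped with the kernel RBD client, something like the line below; since the image enables only the layering feature, the jewel-era kernel client can map it.)
[root@cephclient ceph]# rbd map test/rbd1   # appears as /dev/rbd0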
[root@cephclient ceph]# lsblk | grep rbd
rbd0 252:0 0 100G 0 disk
[root@cephclient ceph]#
Simulating the client
Use ioping to measure the block device's read/write latency: one 4 KiB read/write every 0.1 s. Once ioping is running, manually kill an OSD to simulate an OSD going down.
[root@cephclient ~]# ioping -i 0.1 /dev/rbd0        # read
[root@cephclient ~]# ioping -i 0.1 /dev/rbd0 -WWW   # write
Simulating an OSD failure
Manually kill an OSD to simulate a failure, then watch the ioping latency. (If the ceph-osd processes are managed by systemctl, they are automatically restarted on failure, so you may need to run the OSD manually to keep it from coming straight back after the kill and skewing the test.)
[root@cephnode ~]# ps aux | grep ceph | grep "id 8"
ceph 13459 3.2 0.1 1017328 138688 pts/0 Sl+ 15:09 0:02 /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph
[root@cephnode ~]# kill -9 13459
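One way to keep systemd's auto-restart out of the test, as hinted above, is to stop the unit and run the OSD by hand in the foreground; this is a sketch assuming the standard jewel unit name ceph-osd@<id>, with the command line copied from the ps output above.
[root@cephnode ~]# systemctl stop ceph-osd@8
[root@cephnode ~]# /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph &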
Latency tests
Configured parameters
The defaults are currently in effect; this test focuses on osd heartbeat interval and osd heartbeat grace.
[root@cephnode ~]# ceph daemon osd.7 config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "20",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]# ceph daemon mon.cephnode config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "20",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]#
Read test
Simulate client reads and kill an OSD, then observe the client read latency while the OSD is down: the stalled request took 18.4 s.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0
......
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=798 time=741.3 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=799 time=869.1 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=800 time=809.0 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=801 time=681.3 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=802 time=699.9 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=803 time=749.5 us
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=804 time=18.4 s (slow) # this IO only completed once the OSD was marked down
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=805 time=3.22 ms (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=806 time=777.2 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=807 time=708.2 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=808 time=768.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=809 time=726.1 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=810 time=801.1 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=811 time=802.8 us (fast)
Write test
Likewise, kill an OSD by hand to simulate a failure while running client writes, and observe the client write latency while the OSD is down: the stalled request took 21.4 s.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0 -WWW
......
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=166 time=2.90 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=167 time=2.52 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=168 time=2.48 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=169 time=2.57 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=170 time=2.63 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=171 time=2.76 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=172 time=2.83 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=173 time=2.44 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=174 time=2.66 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=175 time=3.02 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=176 time=2.81 ms
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=177 time=21.4 s (slow) # this IO only completed once the OSD was marked down
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=178 time=4.88 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=179 time=2.75 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=180 time=2.65 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=181 time=4.36 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=182 time=2.37 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=183 time=2.56 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=184 time=3.10 ms (fast)
Changing the parameters
Cut osd_heartbeat_grace and osd_heartbeat_interval in half: osd_heartbeat_grace 20 → 10, osd_heartbeat_interval 6 → 3. (osd.8 is still down from the earlier kill, hence the error for it in the output below.)
[root@cephnode ceph]# ceph tell osd.* injectargs '--osd_heartbeat_grace 10'
osd.0: osd_heartbeat_grace = '10' (unchangeable)
osd.1: osd_heartbeat_grace = '10' (unchangeable)
osd.2: osd_heartbeat_grace = '10' (unchangeable)
osd.3: osd_heartbeat_grace = '10' (unchangeable)
osd.4: osd_heartbeat_grace = '10' (unchangeable)
osd.5: osd_heartbeat_grace = '10' (unchangeable)
osd.6: osd_heartbeat_grace = '10' (unchangeable)
osd.7: osd_heartbeat_grace = '10' (unchangeable)
Error ENXIO: problem getting command descriptions from osd.8
osd.8: problem getting command descriptions from osd.8
[root@cephnode ceph]# ceph tell osd.* injectargs '--osd_heartbeat_interval 3'
osd.0: osd_heartbeat_interval = '3' (unchangeable)
osd.1: osd_heartbeat_interval = '3' (unchangeable)
osd.2: osd_heartbeat_interval = '3' (unchangeable)
osd.3: osd_heartbeat_interval = '3' (unchangeable)
osd.4: osd_heartbeat_interval = '3' (unchangeable)
osd.5: osd_heartbeat_interval = '3' (unchangeable)
osd.6: osd_heartbeat_interval = '3' (unchangeable)
osd.7: osd_heartbeat_interval = '3' (unchangeable)
[root@cephnode ~]# ceph daemon osd.7 config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "3",
"osd_heartbeat_grace": "10",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]# ceph tell mon.* injectargs '--osd_heartbeat_grace 10'
injectargs:osd_heartbeat_grace = '10' (unchangeable)
[root@cephnode ceph]# ceph daemon mon.cephnode config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "10",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ceph]#
Read test
ioping read test; the stall is clearly shorter than with the default parameters above.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0
......
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9882 time=602.6 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9883 time=777.6 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9884 time=604.0 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9885 time=596.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9886 time=733.4 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9887 time=639.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9888 time=757.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9889 time=599.7 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9890 time=768.3 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9891 time=776.2 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9892 time=8.45 s (fast) # this IO only completed once the OSD was marked down, but the stall is clearly shorter than before
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9893 time=1.06 ms (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9894 time=3.14 ms (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9895 time=627.5 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9896 time=733.3 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9897 time=660.4 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9898 time=754.6 us (fast)
4 KiB <<< /dev/rbd0 (block device 100 GiB): request=9899 time=748.6 us (fast)
Write test
ioping write test; again the stall is clearly shorter than with the default parameters above.
[root@cephclient ~]# ioping -i 0.1 -D /dev/rbd0 -WWW
......
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=456 time=2.56 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=457 time=2.77 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=458 time=2.53 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=459 time=2.41 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=460 time=2.70 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=461 time=2.60 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=462 time=2.50 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=463 time=2.53 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=464 time=10.8 s (slow) # this IO only completed once the OSD was marked down, but the stall is clearly shorter than before
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=465 time=2.94 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=466 time=2.58 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=467 time=2.52 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=468 time=2.55 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=469 time=2.53 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=470 time=2.46 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=471 time=2.83 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=472 time=2.78 ms (fast)
4 KiB >>> /dev/rbd0 (block device 100 GiB): request=473 time=2.66 ms (fast)
MON log when OSD heartbeats are lost
MON log after the OSD is killed:
2020-02-02 20:20:46.804677 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 192.168.1.1:6801/13459 from osd.2 192.168.1.1:6804/16339 is reporting failure:1
2020-02-02 20:20:46.804698 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 192.168.1.1:6801/13459 reported failed by osd.2 192.168.1.1:6804/16339
2020-02-02 20:20:46.880512 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 192.168.1.1:6801/13459 from osd.7 192.168.1.1:6803/40358 is reporting failure:1
2020-02-02 20:20:46.880535 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 192.168.1.1:6801/13459 reported failed by osd.7 192.168.1.1:6803/40358
2020-02-02 20:20:46.880586 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 we have enough reporters to mark osd.8 down
2020-02-02 20:20:46.880601 7fb392bbd700 0 log_channel(cluster) log [INF] : osd.8 192.168.1.1:6801/13459 failed (2 reporters from different host after 21.010763 >= grace 20.000000)
Here osd.8 is reported failed first by osd.2 and then by osd.7; once the number of distinct reporters reaches mon osd min down reporters, the OSD is marked down.
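To inspect or adjust that threshold at runtime, the same patterns used elsewhere in this post should apply (the value 3 below is only an example, not a recommendation):
[root@cephnode ~]# ceph daemon mon.cephnode config show | grep mon_osd_min_down_reporters
[root@cephnode ~]# ceph tell mon.* injectargs '--mon_osd_min_down_reporters 3'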
Do not set the heartbeat timings too low
If the heartbeat parameters are set too low, frequent false positives will result.
[root@cephnode ceph]# ceph daemon mon.cephnode config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "3",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
[root@cephnode ~]# ceph daemon osd.7 config show | grep osd_heartbeat
"osd_heartbeat_addr": ":\/0",
"osd_heartbeat_interval": "6",
"osd_heartbeat_grace": "3",
"osd_heartbeat_min_peers": "10",
"osd_heartbeat_use_min_delay_socket": "false",
"osd_heartbeat_min_healthy_ratio": "0.33",
OSD states then flap, and OSDs are wrongly marked down:
[root@test92 ~]# sudo ceph -s
cluster 27081438-f5fb-4862-b7c2-b87626bd19ec
health HEALTH_WARN
1 pgs stuck unclean
recovery 7/5238 objects degraded (0.134%)
recovery 7/5238 objects misplaced (0.134%)
monmap e1: 1 mons at {cephnode=192.168.1.1:6789/0}
election epoch 33, quorum 0 cephnode
osdmap e1592: 9 osds: 7 up, 7 in; 1 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v618211: 304 pgs, 7 pools, 3213 MB data, 1746 objects
4711 MB used, 12473 GB / 12478 GB avail
7/5238 objects degraded (0.134%)
7/5238 objects misplaced (0.134%)
303 active+clean
1 active+remapped
client io 69325 B/s rd, 0 B/s wr, 67 op/s rd, 45 op/s wr
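If the cluster ends up flapping like this, reverting to the defaults with the same injectargs pattern used above should settle it (default values per the earlier config dumps; remember to revert the MON as well):
[root@cephnode ceph]# ceph tell osd.* injectargs '--osd_heartbeat_grace 20 --osd_heartbeat_interval 6'
[root@cephnode ceph]# ceph tell mon.* injectargs '--osd_heartbeat_grace 20'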
Caveats
This test was run on Ceph jewel 10.2.x. During testing we found that osd heartbeat interval is configurable on both the MON and the OSDs. Following the jewel documentation and changing it only on the OSDs has no actual effect; it must be changed on both the MON and the OSDs. Worse, if you set the OSDs' osd heartbeat interval lower than the MON's, OSDs will never be marked down, and the clients' writes will block forever.
The documentation for recent Ceph releases has been updated accordingly: it asks for osd heartbeat interval to be set either under [global] or under both the [osd] and [mon] sections.
https://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#osds-check-heartbeats
So whenever you change osd heartbeat interval, change it on both the OSDs and the MON, as sketched below.
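The simplest persistent form is the [global] section of ceph.conf, which covers both daemon types. A minimal sketch with the values tested above; note that injectargs only changes the running daemons, so the change should also be persisted here (and the daemons restarted, or both kept in sync):
[global]
osd heartbeat interval = 3
osd heartbeat grace = 10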
Suggestions for production changes
Lowering the OSD heartbeat timings makes the cluster more responsive, letting the MON detect OSD failures quickly, but a heartbeat timeout that is too short can get OSDs misjudged as down and destabilize the cluster. If you need to make this change on a production cluster, consider the following measures (one or both; see the sketch after this list):
- Set the nodown flag and watch the OSD logs for "no reply" reports. Keep observing for at least 24h, and only lift the nodown flag once nothing abnormal has shown up.
- Raise mon osd min down reporters, possibly even above the total number of OSDs in the cluster; then even a false report cannot mark an OSD down, because the reporter count stays below mon osd min down reporters.
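A sketch of that workflow, combining the nodown flag mentioned below with the injectargs pattern from earlier (the reporter value 100 is an arbitrary number above this cluster's 9 OSDs):
[root@cephnode ~]# ceph osd set nodown          # OSDs cannot be marked down while observing
[root@cephnode ~]# ceph tell mon.* injectargs '--mon_osd_min_down_reporters 100'
... watch the OSD logs for "no reply" reports for at least 24h ...
[root@cephnode ~]# ceph osd unset nodown        # lift the flag once the cluster proves stable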
Log examples
OSD log:
2020-02-02 20:20:46.843525 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a65a0810 osd.8 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:22.843404)
2020-02-02 20:20:46.843678 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4582090 osd.0 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
2020-02-02 20:20:46.843708 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4652690 osd.1 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
2020-02-02 20:20:46.843719 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4582210 osd.2 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
2020-02-02 20:20:46.843729 7f508bdf6700 -1 osd.3 1633 heartbeat_check: no reply from 0x7f50a4667e90 osd.4 since back 2020-02-21 14:49:21.878833 front 2020-02-21 14:49:21.878833 (cutoff 2020-02-21 14:49:23.843677)
MON log (if ceph osd set nodown has been set, the MON receives the failure reports but will not mark the OSD down):
2020-02-02 20:20:46.804677 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 10.32.3.92:6801/13459 from osd.2 10.32.3.90:6804/16339 is reporting failure:1
2020-02-02 20:20:46.804698 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 10.32.3.92:6801/13459 reported failed by osd.2 10.32.3.90:6804/16339
2020-02-02 20:20:46.880512 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 prepare_failure osd.8 10.32.3.92:6801/13459 from osd.7 10.32.3.92:6803/40358 is reporting failure:1
2020-02-02 20:20:46.880535 7fb392bbd700 0 log_channel(cluster) log [DBG] : osd.8 10.32.3.92:6801/13459 reported failed by osd.7 10.32.3.92:6803/40358
2020-02-02 20:20:46.880586 7fb392bbd700 1 mon.cephnode@0(leader).osd e1330 we have enough reporters to mark osd.8 down
2020-02-02 20:20:46.880601 7fb392bbd700 0 log_channel(cluster) log [INF] : osd.8 10.32.3.92:6801/13459 failed (2 reporters from different host after 21.010763 >= grace 20.000000)