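Problem: a Ceph cluster running over the RDMA protocol keeps reporting "daemons have recently crashed", and the count keeps growing, yet no related errors can be found in the logs.
Solution: see the guidance in the official documentation, quoted below.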
RECENT_CRASH
One or more Ceph daemons have crashed recently, and the crash has not yet been archived (acknowledged) by the administrator. This may indicate a software bug, a hardware problem (e.g., a failing disk), or some other problem.
New crashes can be listed with:
#ceph crash ls-new
Information about a specific crash can be examined with:
#ceph crash info <crash-id>
This warning can be silenced by “archiving” the crash (perhaps after being examined by an administrator) so that it does not generate this warning:
#ceph crash archive <crash-id>
Similarly, all new crashes can be archived with:
#ceph crash archive-all
Archived crashes will still be visible via ceph crash ls but not ceph crash ls-new.
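When the number of new crashes keeps growing, it can help to review each one before archiving it rather than calling archive-all blindly. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes that `ceph crash ls-new` prints a header row followed by one row per crash with the crash ID in the first column.

```shell
# Sketch only: helper names are hypothetical, and the parsing assumes that
# `ceph crash ls-new` prints one header row, then one row per crash with
# the crash ID in the first column.

# Print the ID of every new (unarchived) crash, one per line.
new_crash_ids() {
    ceph crash ls-new | awk 'NR > 1 && NF > 0 { print $1 }'
}

# Show the details of each new crash, then archive it.
review_and_archive() {
    for id in $(new_crash_ids); do
        printf '=== %s ===\n' "$id"
        ceph crash info "$id"
        ceph crash archive "$id"
    done
}
```

After loading these functions into a shell on a node with admin access, running review_and_archive prints the details of each crash and then acknowledges it. The crash info output is where to look for a backtrace when the regular logs show nothing.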
The time period for what “recent” means is controlled by the option mgr/crash/warn_recent_interval (default: two weeks).
These warnings can be disabled entirely with:
#ceph config set mgr mgr/crash/warn_recent_interval 0
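Instead of disabling the warning entirely, the window can be shortened. The sketch below only prints the command for a given number of days. It assumes the option value is expressed in seconds (consistent with the two-week default being 1209600) and that the option is set on the `mgr` target; the helper names are hypothetical.

```shell
# Sketch only: assumes warn_recent_interval is expressed in seconds and is
# set on the "mgr" target. Helper names are hypothetical.

# Convert a number of days to seconds.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# Print (do not run) the command that sets the "recent" window to N days.
recent_interval_cmd() {
    echo "ceph config set mgr mgr/crash/warn_recent_interval $(days_to_seconds "$1")"
}

recent_interval_cmd 7
```

The printed command can be reviewed and then run by hand. The current value can be checked afterwards with `ceph config get mgr mgr/crash/warn_recent_interval`.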
References:
https://docs.ceph.com/docs/master/rados/operations/health-checks/?highlight=backfillfull%20ratio
https://docs.ceph.com/docs/master/mgr/crash/?highlight=crash