Ceph reports clients failing to respond to cache pressure
Environment
- Red Hat OpenShift Container Storage 4.x up to 4.8
- Red Hat OpenShift Data Foundation 4.9 or higher
- Red Hat Ceph Storage 4.x or higher
Issue
ceph -s reports {x} clients failing to respond to cache pressure:
# ceph -s
cluster:
id: 11111111-2222-3333-4444-555555666666
health: HEALTH_WARN
1 clients failing to respond to cache pressure
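Resolution
- On OCP/ODF, this warning can be reported when a pod takes too long (minutes or hours) to start. Typically the pod is trying to mount a CephFS PV that contains millions of files, and it stays in CreateContainerError. The Ceph warning clears once the pod finally starts. The workaround and root cause for this case are described in "Pods using persistent volumes with high file counts fail to start or take an excessive amount of time in OpenShift" and/or "Workaround for skipping SELinux relabeling in OpenShift Data Foundation/OpenShift Container Storage".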
- In scenarios other than the one above, you can raise or lower the following settings to help clients release caps faster:
ceph config set mds mds_recall_max_caps xxxx (increase this value)
ceph config set mds mds_recall_max_decay_rate x.xx (decrease this value)
ceph config set mds mds_session_cache_liveness_decay_rate xxx (decrease or increase this value, depending on the issue)
- If further assistance is required, contact the Red Hat Ceph Storage team for further investigation and recommendations.
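Before changing the MDS recall settings listed above, it can help to record their current values so they can be restored later. The following is a minimal sketch, not an official procedure: the function name `show_recall_settings` and the `CEPH` override variable are hypothetical, and it assumes a shell where the `ceph` CLI has admin access to the cluster.

```shell
# Hypothetical helper: print the current value of each MDS recall option
# mentioned in this article. Set CEPH="echo ceph" for a dry run that only
# prints the commands instead of contacting the cluster.
CEPH="${CEPH:-ceph}"

show_recall_settings() {
    for opt in mds_recall_max_caps \
               mds_recall_max_decay_rate \
               mds_session_cache_liveness_decay_rate; do
        printf '%s = ' "$opt"
        # "ceph config get <who> <option>" reads the value stored for the mds daemons.
        $CEPH config get mds "$opt"
    done
}
```

Run `show_recall_settings` once before applying any `ceph config set` change and keep the output with the case notes.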
Root Cause
- The CephFS client holds only a limited number of caps, so it has little left to release. The MDS is recalling those caps, but the client session has gone quiet and is not releasing them. The MDS recalls caps to reduce the number outstanding and so reduce its future work. The client is likely still using those caps (open files or I/O), so it will not give them up.
- How does mds_min_caps_per_client relate to inode usage here? When a CephFS client wants to operate on an inode, it queries the MDS in various ways, and the MDS then grants the client a set of capabilities (caps). These caps give the client permission to operate on the inode in various ways. If a client exceeds the mds_min_caps_per_client limit and does not release caps when the MDS revokes them, the MDS reports this warning.
Diagnostic Steps
Run the mds session ls command and note the recall_caps section. Its value should be higher than the value currently defined for mds_recall_max_caps:
# oc rsh <active-mds-pod>
# ceph daemon mds.${mds_id} session ls
...
{
"id": 765432,
"entity": {
"name": {
"type": "client",
"num": 765432
},
"addr": {
"type": "v1",
"addr": "10.0.0.1:0",
"nonce": 3231639080
}
},
"state": "open",
"num_leases": 0,
"num_caps": 1151,
"request_load_avg": 0,
"uptime": 138612.869259214,
"requests_in_flight": 0,
"completed_requests": 1,
"reconnecting": false,
"recall_caps": {
"value": 91353.23456789, <------------ recall caps by mds
"halflife": 60
},
...
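With many client sessions, the check described above can be scripted. The sketch below is only an illustration: the function name `sessions_over_recall` is hypothetical, and it assumes that `session ls` emits a JSON array of session objects shaped like the sample above and that `python3` is available in the shell where it runs.

```shell
# Hypothetical helper: read "session ls" JSON on stdin and print
# "<client id> <num_caps> <recall_caps value>" for every session whose
# recall_caps value is above the threshold given as the first argument.
sessions_over_recall() {
    threshold="$1"
    python3 -c '
import json, sys
threshold = float(sys.argv[1])
for s in json.load(sys.stdin):
    recall = s.get("recall_caps", {}).get("value", 0)
    if recall > threshold:
        print(s["id"], s["num_caps"], recall)
' "$threshold"
}
```

For example, `ceph daemon mds.${mds_id} session ls | sessions_over_recall "$(ceph config get mds mds_recall_max_caps)"` lists only the sessions whose recall counter is above the configured value.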