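Symptom
The following warning often shows up in the output of ceph -s:
1 clients failing to respond to cache pressure
The rough cause is that the CephFS MDS asks a client to release part of its cached metadata; if the client does not release it promptly, the MDS reports this warning to the monitor.
Below we go through the code to find the detailed cause and see how it can be worked around or fixed.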
Root cause analysis
First, locate the place in the MDS code where this warning is reported:
void Beacon::notify_health(MDSRank const *mds)
{
  // The function is long, so only the part that raises this warning is shown.
  set<Session*> sessions;
  mds->sessionmap.get_client_session_set(sessions);
  const auto recall_warning_threshold = g_conf().get_val<Option::size_t>("mds_recall_warning_threshold");
  const auto max_completed_requests = g_conf()->mds_max_completed_requests;
  const auto max_completed_flushes = g_conf()->mds_max_completed_flushes;
  std::vector<MDSHealthMetric> late_recall_metrics;
  std::vector<MDSHealthMetric> large_completed_requests_metrics;
  for (auto& session : sessions) {
    // Read the recall caps counter of each client session.
    const uint64_t recall_caps = session->get_recall_caps();
    if (recall_caps > recall_warning_threshold) {
      dout(2) << "Session " << *session <<
        " is not releasing caps fast enough. Recalled caps at " << recall_caps
        << " > " << recall_warning_threshold << " (mds_recall_warning_threshold)." << dendl;
      std::ostringstream oss;
      oss << "Client " << session->get_human_name() << " failing to respond to cache pressure";
      MDSHealthMetric m(MDS_HEALTH_CLIENT_RECALL, HEALTH_WARN, oss.str());
      m.metadata["client_id"] = stringify(session->get_client());
      late_recall_metrics.emplace_back(std::move(m));
    }
  }
  // ... (rest of the function omitted)
}
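It is easy to see that the warning is reported whenever the recall caps counter of a session exceeds recall_warning_threshold. That threshold is read from the configuration option mds_recall_warning_threshold, which here has its default value of 32K.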
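The check above can be condensed into a small stand-alone model. This is only a sketch, not Ceph code: SessionModel, kRecallWarningThreshold and late_recall_warnings are hypothetical names introduced for illustration, and the threshold assumes that "32K" means 32 * 1024.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the MDS Session object.
struct SessionModel {
  std::string human_name;  // what get_human_name() would return
  uint64_t recall_caps;    // what get_recall_caps() would return
};

// Assumed default of mds_recall_warning_threshold: "32K" read as 32 * 1024.
constexpr uint64_t kRecallWarningThreshold = 32 * 1024;

// Returns one warning message per session whose recall counter is
// strictly greater than the threshold, mirroring the comparison in
// Beacon::notify_health.
std::vector<std::string> late_recall_warnings(
    const std::vector<SessionModel>& sessions,
    uint64_t threshold = kRecallWarningThreshold) {
  std::vector<std::string> warnings;
  for (const auto& s : sessions) {
    if (s.recall_caps > threshold) {
      std::ostringstream oss;
      oss << "Client " << s.human_name
          << " failing to respond to cache pressure";
      warnings.push_back(oss.str());
    }
  }
  return warnings;
}
```

Note that the comparison is strict: in this model a session sitting exactly at the threshold does not produce a warning, only one above it does.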
So why would recall_caps grow beyond 32K? Let's keep digging.
Here is where the recall caps counter is modified:
uint64_t Session::notify_recall_sent(size_t new_limit)
{
  const auto num_caps = caps.size();
  ceph_assert(new_limit < num_caps); // Behaviour of Server::recall_client_state
  const auto count = num_caps-new_limit;
  uint64_t new_change;
  if (recall_limit != new_limit) {
    new_change = count;
  } else {
    new_change = 0; /* no change! */
  }
/* Always hit the session counter as a RECALL message is still sent to the