Background: we reproduced this issue on CentOS 7.6, but it actually exists on many kernel versions. How to monitor and control the various Linux caches has long been a hot topic in cloud computing, yet these needs belong to niche scenarios and rarely make it into the mainline kernel. As eBPF gradually stabilizes, programming against and observing a stock Linux kernel may open up new possibilities. Below is how we tracked down and resolved this problem.
1. Symptoms
The OPPO cloud kernel team noticed that snmpd's CPU usage had spiked across a cluster:
snmpd was pinning almost a full core for long stretches. perf showed the following hot spots:
+   92.00%   3.96%  [kernel]  [k] __d_lookup
-   48.95%  48.95%  [kernel]  [k] _raw_spin_lock
   - 20.95% 0x70692f74656e2f73
        __fopen_internal
        __GI___libc_open
        system_call
        sys_open
        do_sys_open
        do_filp_open
        path_openat
        link_path_walk
      + lookup_fast
-   45.71%  44.58%  [kernel]  [k] proc_sys_compare
   - 5.48% 0x70692f74656e2f73
        __fopen_internal
        __GI___libc_open
        system_call
        sys_open
        do_sys_open
        do_filp_open
        path_openat
      + 1.13% proc_sys_compare
Almost all of the time is spent in the kernel inside __d_lookup. strace then showed where the latency sits:
open("/proc/sys/net/ipv4/neigh/kube-ipvs0/retrans_time_ms", O_RDONLY) = 8 <0.000024>------v4的比较快
open("/proc/sys/net/ipv6/neigh/ens7f0_58/retrans_time_ms", O_RDONLY) = 8 <0.456366>-------v6很慢
Probing by hand confirmed that merely entering the ipv6 directory is slow:
time cd /proc/sys/net
real 0m0.000s
user 0m0.000s
sys 0m0.000s
time cd /proc/sys/net/ipv6
real 0m2.454s
user 0m0.000s
sys 0m0.509s
time cd /proc/sys/net/ipv4
real 0m0.000s
user 0m0.000s
sys 0m0.000s
As you can see, entering the ipv6 path costs far more time than entering the ipv4 path.
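To take snmpd and shell overhead out of the picture, a small user-space timer reproduces the same asymmetry. This is illustrative only; the two paths are the exact files seen in the strace above, so substitute interfaces that exist on your host:

/* open_timer.c: average the cost of open()+close() on a sysctl file */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double time_open(const char *path, int iters)
{
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		int fd = open(path, O_RDONLY);
		if (fd >= 0)
			close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return ((t1.tv_sec - t0.tv_sec) +
		(t1.tv_nsec - t0.tv_nsec) / 1e9) / iters;
}

int main(void)
{
	/* the files observed in the strace above */
	const char *v4 = "/proc/sys/net/ipv4/neigh/kube-ipvs0/retrans_time_ms";
	const char *v6 = "/proc/sys/net/ipv6/neigh/ens7f0_58/retrans_time_ms";

	printf("ipv4: %.6f s per open\n", time_open(v4, 100));
	printf("ipv6: %.6f s per open\n", time_open(v6, 100));
	return 0;
}

Compile with gcc -O2 open_timer.c -o open_timer. Going by the strace numbers, the gap is roughly four orders of magnitude.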
2. Analysis
We need to understand why the perf hot spot shows so much time in proc_sys_compare under __d_lookup, and what that code path looks like.
proc_sys_compare has only one call path: the d_compare callback. From the call chain:
__d_lookup ---> if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
	.....
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

		if (dentry->d_name.hash != hash)
			continue;

		spin_lock(&dentry->d_lock);
		if (dentry->d_parent != parent)
			goto next;
		if (d_unhashed(dentry))
			goto next;

		/*
		 * It is safe to compare names since d_move() cannot
		 * change the qstr (protected by d_lock).
		 */
		if (parent->d_flags & DCACHE_OP_COMPARE) {
			int tlen = dentry->d_name.len;
			const char *tname = dentry->d_name.name;
			if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
				goto next; /* caq: a return value of 1 means the names differ */
		} else {
			if (dentry->d_name.len != len)
				goto next;
			if (dentry_cmp(dentry, str, len))
				goto next;
		}
		....
next:
		spin_unlock(&dentry->d_lock); /* caq: move on to the next element of the chain */
	}
	.....
}
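For dentries under /proc/sys, DCACHE_OP_COMPARE is set and d_compare points at proc_sys_compare. As a sketch of why each chain element costs real work, this is roughly what proc_sys_compare does in mainline kernels of this era (the exact signature varies across versions):

static int proc_sys_compare(const struct dentry *parent,
		const struct dentry *dentry,
		unsigned int len, const char *str, const struct qstr *name)
{
	struct ctl_table_header *head;
	struct inode *inode;

	/* Although proc doesn't have negative dentries, rcu-walk means
	 * that the inode here can be NULL */
	inode = ACCESS_ONCE(dentry->d_inode);
	if (!inode)
		return 1;
	if (name->len != len)
		return 1;
	if (memcmp(name->name, str, len))
		return 1;
	/* even on a name match, the sysctl entry must still be
	 * visible in the caller's namespace */
	head = rcu_dereference(PROC_I(inode)->sysctl);
	return !head || !sysctl_is_seen(head);
}

Note also that __d_lookup takes dentry->d_lock for every chain element whose hash matches, which is why _raw_spin_lock shows up nearly as hot as proc_sys_compare: both costs scale with the length of the collision chain.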
Machines in the cluster with identical hardware run the same snmpd flow, so the natural suspicion was that hlist_bl_for_each_entry_rcu was looping too many times, forcing parent->d_op->d_compare to grind through a long collision chain again and again. If entering ipv6 triggers a very large number of comparisons, the walk will also suffer plenty of cache misses along the list, and with enough chain elements that would produce exactly this profile. Let's verify, with a counter that walks the same hash bucket a lookup would walk:
#define COUNT_THRES 512 /* caq: report only suspiciously long chains; threshold is illustrative */

static inline long hlist_count(const struct dentry *parent, const struct qstr *name)
{
	long count = 0;
	unsigned int hash = name->hash;
	struct hlist_bl_head *b = d_hash(parent, hash);
	struct hlist_bl_node *node;
	struct dentry *dentry;

	rcu_read_lock();
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
		count++;
	}
	rcu_read_unlock();

	if (count > COUNT_THRES) {
		printk("hlist_bl_head=%p,count=%ld,name=%s,hash=%u\n",
		       b, count, name->name, name->hash);
	}
	return count;
}
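To make hlist_count fire on every lookup, we hooked it onto __d_lookup with a kprobe. Below is a minimal sketch of such a probe module, under loudly stated assumptions: an x86_64 3.10-era kernel (arguments in rdi/rsi, kallsyms_lookup_name still exported) whose d_hash() matches fs/dcache.c of that era; dentry_hashtable, d_hash_shift and d_hash_mask are static in fs/dcache.c, so the sketch resolves them at load time:

/* dhash_probe.c: sketch of a kprobe module that runs hlist_count()
 * on every __d_lookup(); assumes an x86_64 3.10-era kernel */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/kallsyms.h>
#include <linux/dcache.h>
#include <linux/list_bl.h>
#include <linux/rculist_bl.h>
#include <linux/cache.h>

/* fs/dcache.c keeps these static, so resolve them when loading */
static struct hlist_bl_head *dentry_hashtable;
static unsigned int d_hash_shift, d_hash_mask;

/* mirrors d_hash() in fs/dcache.c of 3.10-era kernels */
static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
					   unsigned int hash)
{
	hash += (unsigned long)parent / L1_CACHE_BYTES;
	hash = hash + (hash >> d_hash_shift);
	return dentry_hashtable + (hash & d_hash_mask);
}

/* ... hlist_count() exactly as above ... */

/* x86_64 calling convention: arg0 = parent in rdi, arg1 = name in rsi */
static int on_d_lookup(struct kprobe *p, struct pt_regs *regs)
{
	hlist_count((const struct dentry *)regs->di,
		    (const struct qstr *)regs->si);
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "__d_lookup",
	.pre_handler = on_d_lookup,
};

static int __init dhash_probe_init(void)
{
	unsigned long addr;

	addr = kallsyms_lookup_name("dentry_hashtable");
	if (!addr)
		return -ENOENT;
	dentry_hashtable = *(struct hlist_bl_head **)addr;

	addr = kallsyms_lookup_name("d_hash_shift");
	if (!addr)
		return -ENOENT;
	d_hash_shift = *(unsigned int *)addr;

	addr = kallsyms_lookup_name("d_hash_mask");
	if (!addr)
		return -ENOENT;
	d_hash_mask = *(unsigned int *)addr;

	return register_kprobe(&kp);
}

static void __exit dhash_probe_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(dhash_probe_init);
module_exit(dhash_probe_exit);
MODULE_LICENSE("GPL");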
The kprobe reported:
[20327461.948219] hlist_bl_head=ffffb0d7029ae3b0 count = 799259,name=ipv6/neigh
Almost 800,000 dentries were hashed onto the single chain that a lookup of ipv6/neigh has to walk, which matches both the perf hot spots and the second-scale latency of entering the directory.