Background:
When a pod starts, it fails with: error loading seccomp filter into kernel: error loading seccomp filter: errno 524
Initial diagnosis and workaround
Environment:
OS: Kylin v10 SP2
Kernel (host and container): 4.19.90-25.21
K8S: v1.20.15
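1. Searching for the error message points to a known seccomp-related memory leak.
2. Check how much memory the host has actually allocated for the BPF JIT (the amount counted against bpf_jit_limit)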
#cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
67239936 (about 64 MiB)
3. Check the bpf_jit_limit configured on the host
#sysctl net.core.bpf_jit_limit
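33554432 (32 MiB)
4. Cause
The cause is now clear: the memory actually consumed by BPF JIT allocations (about 64 MiB) exceeds the configured bpf_jit_limit (32 MiB), so loading a new seccomp filter fails and the pod cannot be created. This affects pods that are deleted and recreated as well as individual containers that are deleted and recreated.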
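Steps 2 to 4 can be folded into a single check. The following is a minimal sketch rather than part of the original procedure: the function names bpf_jit_usage and bpf_jit_over_limit are hypothetical, and the optional file argument exists only so the logic can be run against a saved copy of /proc/vmallocinfo.

```shell
#!/bin/sh
# Sum the sizes (column 2, in bytes) of all bpf_jit allocations.
# $1: a file in /proc/vmallocinfo format; defaults to /proc/vmallocinfo,
#     which is normally readable by root only.
bpf_jit_usage() {
    awk '/bpf_jit/ { s += $2 } END { printf "%.0f\n", s }' "${1:-/proc/vmallocinfo}"
}

# Succeed (exit status 0) when usage ($1) is greater than the limit ($2).
bpf_jit_over_limit() {
    [ "$1" -gt "$2" ]
}

# Example use on a live host (run as root):
#   usage=$(bpf_jit_usage)
#   limit=$(sysctl -n net.core.bpf_jit_limit)
#   if bpf_jit_over_limit "$usage" "$limit"; then
#       echo "BPF JIT usage $usage exceeds bpf_jit_limit $limit"
#   fi
```

With the values observed above (67239936 used against a limit of 33554432), bpf_jit_over_limit succeeds, which matches the failure seen when creating pods.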
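5. Recommended bpf_jit_limit setting
The Rocky Linux LXD server guide (see References) recommends the following for LXC/LXD hosts:
# This is a limit on the size of eBPF JIT allocations which is usually set to PAGE_SIZE * 40000. Set this to 1000000000 if you are running Rocky Linux 9.x
net.core.bpf_jit_limit = 3000000000
Note that 3000000000 bytes is roughly 2.8 GiB.
Description of the bpf_jit_limit parameter (from the Red Hat release notes, see References):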
bpf_jit_limit
This parameter enforces a global limit for memory allocations to the Berkeley Packet Filter Just-in-Time (BPF JIT) compiler in order to reject the unprivileged JIT requests once it has been surpassed.
The bpf_jit_limit parameter contains the value of the global limit in bytes.
6. Workaround
After raising bpf_jit_limit, pods can be created successfully
# sysctl net.core.bpf_jit_limit=3000000000
Add the setting to sysctl.conf so that it survives a reboot
#vi /etc/sysctl.conf
net.core.bpf_jit_limit = 3000000000
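Editing /etc/sysctl.conf by hand can leave duplicate or stale entries behind. As an illustration only, the sketch below replaces any existing assignment of a key and appends the new value. The helper name persist_sysctl is hypothetical, and the sketch assumes the file consists of simple "key = value" lines.

```shell
#!/bin/sh
# Set "key = value" in a sysctl-style config file, dropping any earlier
# assignment of the same key. Comments and other keys are kept as they are.
# $1: key, $2: value, $3: config file (created if it does not exist)
persist_sysctl() {
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Split each line on "=", strip whitespace from the left-hand side,
        # and keep every line whose key differs from the one being set.
        awk -v k="$key" -F= '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back through cat so the target keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example use (as root), followed by reloading the file:
#   persist_sysctl net.core.bpf_jit_limit 3000000000 /etc/sysctl.conf
#   sysctl -p /etc/sysctl.conf
```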
Root cause analysis
The workaround above gets pods running again, but the bpf_jit_binary_alloc entries in /proc/vmallocinfo keep growing.
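One way to confirm the growth is to sample the number and total size of bpf_jit_binary_alloc entries at intervals and print the difference. This is a sketch under assumptions: the function names bpf_jit_snapshot and bpf_jit_growth are hypothetical, and the 60-second interval in the usage comment is an arbitrary choice.

```shell
#!/bin/sh
# Print "<allocation count> <total bytes>" for bpf_jit_binary_alloc entries.
# $1: a file in /proc/vmallocinfo format; defaults to /proc/vmallocinfo.
bpf_jit_snapshot() {
    awk '/bpf_jit_binary_alloc/ { n++; s += $2 }
         END { printf "%d %.0f\n", n, s }' "${1:-/proc/vmallocinfo}"
}

# Print "<count delta> <bytes delta>" between two snapshots.
# $1: earlier snapshot, $2: later snapshot (both as printed by bpf_jit_snapshot)
bpf_jit_growth() {
    echo "$1 $2" | awk '{ printf "%d %.0f\n", $3 - $1, $4 - $2 }'
}

# Example use (as root):
#   prev=$(bpf_jit_snapshot)
#   while sleep 60; do
#       cur=$(bpf_jit_snapshot)
#       echo "$(date '+%T') total: $cur delta: $(bpf_jit_growth "$prev" "$cur")"
#       prev=$cur
#   done
```

A delta that stays positive over many intervals indicates that allocations are being made and not released.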
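1. Stop the exporter process, which is known to use eBPF, and check whether the growth continues
2. After step 1, the growth still continues
3. Trace the memory allocations with eBPF tooling
3.1 Trace bpf_jit_binary_alloc events with bpftrace
bpftrace -e 'kprobe:bpf_jit_binary_alloc {printf("pid %d uid %d process name %s\n", pid, uid, comm);}'
3.1.1 The trace shows allocations made by the vfreadlat.py and klockstat commands
3.1.2 while true;do ps -ef |grep vfreadlat.py; done
This shows the full command line and its PPID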
root 264844 2532042 0 10:17 ? 00:00:00 /usr/bin/python /usr/share/bcc/tools/klockstat -d 1
3.1.3 Look up the parent process: both commands are launched by the exporter process
ps -ef |grep 2532042
root 2532042 2532028 4 Feb02 ? 0:02:51 /exporter
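This approach cannot identify the parents of short-lived processes that allocate memory, because they exit before ps can catch them.
3.2 Write a bpftrace script that traces bpf_jit_binary_alloc and prints the caller's ancestry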
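# cat > getmsgforjit.bt <<'EOF'
#include <linux/sched.h> /* struct task_struct, needed to walk curtask */
BEGIN {
    printf("%-8s %-16s %-8s %-16s => %-8s %-16s\n", "PPPID", "PPPIDCOMM", "PPID", "PPIDCOMM", "PID", "COMM");
}
kprobe:bpf_jit_binary_alloc {
    /* Cast explicitly: some bpftrace versions expose curtask as a plain integer. */
    $task = (struct task_struct *)curtask;
    /* Arguments follow the header order: grandparent pid/comm, parent pid/comm, current pid/comm. */
    printf("%-8d %-16s %-8d %-16s => %-8d %-16s\n", $task->parent->parent->pid, $task->parent->parent->comm, $task->parent->pid, $task->parent->comm, pid, comm);
}
EOF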
# bpftrace getmsgforjit.bt
PPPID PPPIDCOMM PPID PPIDCOMM PID COMM
685405 containerd-shim 685579 systemd 3252084 (auditd)
Find the container that owns these processes
# ps -ef |grep 685405
# docker exec -it container_id /bin/bash # enter the container and inspect its auditd service
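When the ps output alone does not make the container obvious, the cgroup path of a process inside the container usually carries the container ID. The sketch below is an assumption, not something the original steps used: it presumes the cgroup path embeds the full 64-hex-digit ID, as is typical for Docker and containerd, and the function name container_id_from_cgroup is hypothetical.

```shell
#!/bin/sh
# Print the first 64-hex-digit container ID found in a cgroup file.
# $1: path to a cgroup file, for example /proc/<pid>/cgroup
container_id_from_cgroup() {
    grep -oE '[0-9a-f]{64}' "$1" | head -n 1
}

# Example use with the PPID column from the trace (the container's systemd):
#   id=$(container_id_from_cgroup /proc/685579/cgroup)
#   docker ps --no-trunc | grep "$id"
```

If nothing is printed, the process is probably not running inside a container cgroup, or the runtime uses a different path layout.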
Confirm the finding
a) Open two terminals and watch both in real time
a.1 Follow the logs inside the container
# journalctl -xe -f
a.2 Keep watching the bpf_jit_binary_alloc allocation events
# bpftrace getmsgforjit.bt
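b) Correlating the two outputs shows that
the auditd service in the container fails to start, and every time systemd tries to start it again, a new bpf_jit_binary_alloc event attributed to (auditd) appears
4. Fix
Disable the auditd service in the container
a) systemctl disable auditd
b) rm -f /usr/lib/systemd/system/auditd.service
c) systemctl daemon-reload
d) systemctl list-units
The auditd.service unit now shows the not-found state
e) Keep watching the bpf_jit_binary_alloc allocation events; no more auditd events appear for this container
# bpftrace getmsgforjit.bt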
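To compare the situation before and after the fix, the output of getmsgforjit.bt can be saved to a file and summarised per command. This is a rough sketch: the function name count_jit_events and the file name jit_events.log are hypothetical, and it assumes each event line contains the "=>" separator printed by the script and that COMM contains no spaces.

```shell
#!/bin/sh
# Count bpf_jit_binary_alloc events per COMM in saved getmsgforjit.bt output.
# $1: file holding the captured output. Prints "<count> <comm>", busiest first.
count_jit_events() {
    awk 'index($0, "=>") && $1 ~ /^[0-9]+$/ {
             # Everything after "=>" is "<pid> <comm>"; take the second field.
             n = split(substr($0, index($0, "=>") + 2), f, " ")
             if (n >= 2) count[f[2]]++
         }
         END { for (c in count) print count[c], c }' "$1" | sort -k1,1nr
}

# Example use:
#   bpftrace getmsgforjit.bt > jit_events.log    # stop with Ctrl-C after a while
#   count_jit_events jit_events.log
```

The header line and bpftrace's own status lines are skipped because they do not start with a numeric PID.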
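Conclusion:
1. Raise net.core.bpf_jit_limit to 3000000000 bytes (roughly 2.8 GiB)
2. Identify the processes that actually allocate memory through bpf_jit_binary_alloc, and stop any that are not needed
3. Depending on the situation, upgrade or patch the host kernel
References: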
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.1_release_notes/kernel_parameters_changes
https://github.com/moby/moby/issues/45498
https://docs.rockylinux.org/books/lxd_server/01-install/