kernel: redirfs: loading out-of-tree module taints kernel.
kernel: redirfs: module verification failed: signature and/or required key missing - tainting kernel
kernel: Redirecting File System Framework Version 0.10 <www.redirfs.org> with TrendMicro Patch 9.6.2.8793
kernel: register_chrdev() done: 244
kernel: gsch: loading vfs-filter 9.6.2.8793: OK
kernel: gsch_dev_open() doing
kernel: gsch_dev_open() done: pid:100128(ds_am)
kernel: cannot uninstall hooks if location of sys_call_table is unknown
kernel: gsch_remove_hooks(&gsch_hooks, &orig_hooks) done: -22
kernel: lookup sys_call_table yields ffffffff816c6ee0
kernel: lookup sys_execve yields ffffffff8120b550
kernel: lookup do_execve yields ffffffff8120b2e0
kernel: lookup ia32_sys_call_table yields ffffffff816cdf80
kernel: running: awk '/[0-9a-f]+ [RTrt] compat_do_execve$/ { print "0x" $1 >"/proc/driver/gsch/syshook/addr_compat_do_execve" ; exit }' /boot/System.map-3.10.0-693.11.6.el7.x86_64 doing
kernel: running: awk '/[0-9a-f]+ [RTrt] compat_do_execve$/ { print "0x" $1 >"/proc/driver/gsch/syshook/addr_compat_do_execve" ; exit }' /boot/System.map-3.10.0-693.11.6.el7.x86_64 done(0
kernel: lookup compat_do_execve yields 0
kernel: lookup int_ret_from_sys_call yields ffffffff816b8c91
kernel: lookup getname yields ffffffff81213d40
kernel: lookup putname yields ffffffff81213b40
kernel: hooking open NR=2 ... ffffffff81201ac0 -> ffffffffc05f0a80
kernel: hooking close NR=3 ... ffffffff81201b20 -> ffffffffc05ef8a0
kernel: hooking exit NR=60 ... ffffffff8108fc10 -> ffffffffc05ee9a0
kernel: hooking getpgid NR=121 ... ffffffff810a50d0 -> ffffffffc05ee940
kernel: hooking unlink NR=87 ... ffffffff812153e0 -> ffffffffc05eefa0
kernel: hooking unlinkat NR=263 ... ffffffff812153a0 -> ffffffffc05ef240
kernel: hooking write NR=1 ... ffffffff81203880 -> ffffffffc05eecf0
kernel: hooking pwrite64 NR=18 ... ffffffff81203a20 -> ffffffffc05eedd0
kernel: hooking writev NR=20 ... ffffffff81204030 -> ffffffffc05eeec0
kernel: hooking dup2 NR=33 ... ffffffff812227b0 -> ffffffffc05efa80
kernel: hooking mount NR=165 ... ffffffff81227030 -> ffffffffc05f0cf0
kernel: hooking umount NR=166 ... ffffffff81225720 -> ffffffffc05f01e0
kernel: hooking exit_group NR=231 ... ffffffff8108fcd0 -> ffffffffc05eea20
kernel: hooking stub NR=59 @ ffffffff816b8f30 ... ffffffff8120b550 (-4905512) -> ffffffffc05f08f0 (1056143736)
kernel: gsch_install_hooks(&gsch_hooks, &orig_hooks) done: 0
kernel: gsch_flt: loaded
kernel: gsch_flt_add_mnt(/dev @ Unknown[1021994(devtmpfs)]) done: 0
kernel: gsch_flt_add_mnt(/dev/shm @ Unknown[1021994(tmpfs)]) done: 0
kernel: gsch_flt_add_mnt(/run @ Unknown[1021994(tmpfs)]) done: 0
kernel: gsch_flt_add_mnt(/sys/fs/cgroup @ Unknown[1021994(tmpfs)]) done: 0
kernel: gsch_flt_add_mnt(/sys/fs/pstore @ Unknown[6165676c(pstore)]) done: 0
kernel: gsch_flt_add_mnt(/sys/kernel/config @ Unknown[62656570(configfs)]) done: 0
kernel: ds_am: page allocation failure: order:0, mode:0x200020
kernel: CPU: 3 PID: 100156 Comm: ds_am Tainted: G OE ------------ 3.10.0-693.11.6.el7.x86_64 #1
kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
kernel: Call Trace:
kernel: [<ffffffff816a5ea1>] dump_stack+0x19/0x1b
kernel: [<ffffffff8118a510>] warn_alloc_failed+0x110/0x180
kernel: [<ffffffff816a1e7a>] __alloc_pages_slowpath+0x6b6/0x724
kernel: [<ffffffff8118eaa5>] __alloc_pages_nodemask+0x405/0x420
kernel: [<ffffffff811d3038>] alloc_pages_current+0x98/0x110
kernel: [<ffffffff811dddcc>] new_slab+0x2fc/0x310
kernel: [<ffffffff811df65c>] ___slab_alloc+0x3ac/0x4f0
kernel: [<ffffffffc05d7881>] ? rfs_dcache_get_subs+0xd1/0x2b0 [redirfs]
kernel: [<ffffffffc05d7881>] ? rfs_dcache_get_subs+0xd1/0x2b0 [redirfs]
kernel: [<ffffffff816a31de>] __slab_alloc+0x40/0x5c
kernel: [<ffffffff811e17e7>] kmem_cache_alloc_trace+0x1a7/0x200
kernel: [<ffffffffc05d7fb0>] ? rfs_dcache_rdentry_del+0x160/0x160 [redirfs]
kernel: [<ffffffffc05d7881>] rfs_dcache_get_subs+0xd1/0x2b0 [redirfs]
kernel: [<ffffffffc05d7fb0>] ? rfs_dcache_rdentry_del+0x160/0x160 [redirfs]
kernel: [<ffffffffc05d7b55>] rfs_dcache_walk+0xb5/0x270 [redirfs]
kernel: [<ffffffffc05d00c6>] redirfs_add_path+0x416/0x460 [redirfs]
kernel: [<ffffffffc05f34a9>] gsch_flt_add_mnt+0x169/0x290 [gsch]
kernel: [<ffffffffc05f373c>] gsch_flt_add_mntpoint+0x16c/0x180 [gsch]
kernel: [<ffffffff812029b5>] ? vfs_read+0xf5/0x170
kernel: [<ffffffffc05f38e9>] gsch_flt_add_mnts+0x199/0x2e0 [gsch]
kernel: [<ffffffffc05f4645>] gsch_redirfs_enable_write+0x165/0x190 [gsch]
kernel: [<ffffffffc05ed038>] write_wrapper+0x38/0x60 [gsch]
kernel: [<ffffffff8127230d>] proc_reg_write+0x3d/0x80
kernel: [<ffffffff81202aed>] vfs_write+0xbd/0x1e0
kernel: [<ffffffff812038ff>] SyS_write+0x7f/0xe0
kernel: [<ffffffffc05eed51>] gsch_write_hook_fn+0x61/0xe0 [gsch]
kernel: [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
kernel: Mem-Info:
kernel: active_anon:142341 inactive_anon:95395 isolated_anon:0#012 active_file:163239 inactive_file:106774 isolated_file:0#012 unevictable:0 dirty:60 writeback:0 unstable:0#012 slab_recl
kernel: Node 0 DMA free:15872kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB pres
kernel: lowmem_reserve[]: 0 2975 16013 16013
kernel: Node 0 DMA32 free:56596kB min:12544kB low:15680kB high:18816kB active_anon:244552kB inactive_anon:77844kB active_file:161308kB inactive_file:91552kB unevictable:0kB isolated(anon
kernel: lowmem_reserve[]: 0 0 13037 13037
kernel: Node 0 Normal free:20496kB min:54968kB low:68708kB high:82452kB active_anon:324812kB inactive_anon:303736kB active_file:491648kB inactive_file:335544kB unevictable:0kB isolated(a
kernel: lowmem_reserve[]: 0 0 0 0
kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15872kB
kernel: Node 0 DMA32: 318*4kB (UE) 0*8kB 2*16kB (UE) 2*32kB (UE) 2*64kB (UE) 0*128kB 103*256kB (UEM) 56*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 56536kB
kernel: Node 0 Normal: 1326*4kB (EM) 3*8kB (UEM) 2*16kB (UE) 1*32kB (M) 2*64kB (UE) 1*128kB (U) 28*256kB (UEM) 15*512kB (EM) 0*1024kB 0*2048kB 0*4096kB = 20496kB
kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
kernel: 371106 total pagecache pages
kernel: 518 pages in swap cache
kernel: Swap cache stats: add 658, delete 140, find 2093/2095
kernel: Free swap = 16774648kB
kernel: Total swap = 16777212kB
kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8020)
kernel: cache: kmalloc-32, object size: 32, buffer size: 32, default order: 0, min order: 0
kernel: node 0: slabs: 53196, objs: 6809088, free: 0
kernel: sched: RT throttling activated
kernel: gsch_flt_add_mnt(/ @ Unknown[58465342(xfs)]) done: 0
kernel: gsch_flt_add_mnt(/dev/mqueue @ Unknown[19800202(mqueue)]) done: 0
kernel: gsch_flt_add_mnt(/dev/hugepages @ Unknown[958458f6(hugetlbfs)]) done: 0
kernel: gsch_flt_add_mnt(/proc/fs/nfsd @ Unknown[6e667364(nfsd)]) done: 0
For the "SLUB: Unable to allocate memory on node -1" error, a fix is described in the following article:
https://pingcap.com/blog/try-to-fix-two-linux-kernel-bugs-while-testing-tidb-operator-in-k8s/
Try to Fix Two Linux Kernel Bugs While Testing TiDB Operator in K8s
Author: Wenbo Zhang (Linux Kernel Engineer of the EE team at PingCAP)
Kubernetes (K8s) is an open-source container orchestration system that automates application deployment, scaling, and management. It’s the operating system of the cloud-native world. Any defects in K8s or the operating system beneath it might put the applications in the upper layers at risk.
As the EE (Efficiency Engineering) team at PingCAP, we optimize our office infrastructure and improve office automation and teamwork effectiveness. When we tested TiDB Operator (which creates and manages TiDB clusters) in K8s, we found two Linux kernel bugs. These bugs have plagued our PingCAP K8s environment for quite a while and aren’t thoroughly fixed in the K8s community as a whole.
After extensive investigation and diagnosis, we’ve identified ways to handle these bugs. In this post, I’ll share with you how we tackled these problems. However, as useful as these solutions are, we consider them workarounds, and we believe that more elegant solutions are possible. The goal of this post is to inspire the K8s community, RHEL, and CentOS so that they can help fix these bugs thoroughly in the near future.
Bug #1: Unstable kmem accounting
Keyword: SLUB: Unable to allocate memory on node -1
Related issues in the community:
- https://github.com/kubernetes/kubernetes/issues/61937
- https://github.com/opencontainers/runc/issues/1725
- https://support.mesosphere.com/s/article/Critical-Issue-KMEM-MSPH-2018-0006
Issue description
When we tested TiKV’s online transaction processing (OLTP) performance in K8s, I/O performance occasionally jittered. However, the following items appeared normal:
- TiKV and RocksDB log files; RocksDB is TiKV’s underlying storage engine
- CPU (no bottleneck)
- Memory and disk statistics from the load information
The only negative indicator was the message "SLUB: Unable to allocate memory on node -1" in the output of the dmesg command.
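The message is easy to confirm from the kernel ring buffer; a minimal check (the pattern is just the keyword above):
$ dmesg -T | grep 'SLUB: Unable to allocate memory'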
Issue analysis
We used funcslower in perf-tools to trace kernel functions that executed slowly, and adjusted the threshold value of the hung_task_timeout_secs kernel parameter. When we performed write operations in TiKV, the captured kernel paths showed that the I/O jitter was related to the file system executing writepage. At the same time, before and after each captured jitter, the dmesg output contained a large amount of "SLUB: Unable to allocate memory on node -1" messages, even when the node had sufficient memory.
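A rough sketch of that tracing setup (the threshold, the traced function, and the tool path are illustrative, not from the original post):
# Report hung tasks sooner by lowering the detector threshold (in seconds).
$ sysctl -w kernel.hung_task_timeout_secs=30
# Trace kernel functions slower than 10,000 microseconds with funcslower from perf-tools.
$ ./perf-tools/bin/funcslower generic_writepages 10000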
We analyzed the call stack information output by hung_task together with the kernel code. We found that when executing the bvec_alloc function, the kernel first tried to allocate the bio_vec object via kmem_cache_alloc. If that allocation failed, the kernel fell back to allocating the bio_vec object from the mempool. Inside the mempool, the system first tried to execute the pool->alloc callback; if that also failed, the kernel set the process to an uninterruptible state and put it in a wait queue. Only after other processes returned memory to the mempool, or a 5-second timer expired, did the scheduler wake the process to retry. The wait time in this queue was consistent with the jitter delay we saw in our application monitoring.
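One way to observe a process stuck in that uninterruptible wait (a sketch; <pid> is a placeholder for the stalled TiKV thread):
# Dump kernel stacks of all D-state (uninterruptible) tasks via sysrq.
$ echo w > /proc/sysrq-trigger
$ dmesg | tail -n 50
# Or read the kernel stack of one stalled process directly.
$ cat /proc/<pid>/stack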
When we created the Docker container, we didn’t set memory.kmem.limit_in_bytes. So why was kmem insufficient? To determine whether memory.kmem.limit_in_bytes had been set, we checked the container’s kmem information in the cgroup memory controller. Kmem statistics were enabled, and the kmem limit was set to a very large value.
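Checking this is a matter of reading the container’s memory cgroup files; a sketch (the cgroup path is illustrative and depends on the container runtime):
$ cd /sys/fs/cgroup/memory/kubepods/burstable/pod<pod-uid>/<container-id>
# The default limit is a huge value such as 9223372036854771712 ("unlimited");
# non-zero usage shows that kmem accounting has been activated.
$ cat memory.kmem.limit_in_bytes
$ cat memory.kmem.usage_in_bytes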
Because we knew that kmem accounting was unstable in the RHEL 3.10 kernel, we suspected that a kernel bug caused the SLUB allocation failure. Searching the kernel patch history confirmed that it was a kernel bug, fixed in Linux 4.x by "slub: make dead caches discard free slabs immediately". There was also a namespace leak issue associated with kmem accounting, fixed by "mm: memcontrol: fix cgroup creation failure after many small jobs".
So how was kmem accounting enabled? We used the opensnoop tool in bcc to monitor the kmem configuration file and captured runc as the file modifier. From the K8s code, we found that kmem accounting was enabled by default in the K8s-dependent runc project.
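A sketch of that capture with bcc’s opensnoop (the tool path varies by distro; the grep filter is illustrative):
# Print every open() on the system and keep only kmem cgroup files;
# in our case the writer turned out to be runc.
$ /usr/share/bcc/tools/opensnoop | grep memory.kmem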
Solution
Based on our issue analysis, we can fix the bug in either of the following ways:
- Upgrade the existing kernel to a later version.
- Disable the kmem accounting feature when starting the container.
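For reference, the kernel in the log above is one of the affected RHEL 3.10 builds; a quick version check (the output shown is the kernel from this report):
$ uname -r
3.10.0-693.11.6.el7.x86_64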
Now runc includes a conditional compilation option, and you can disable kmem accounting via Build Tags. After we disabled this feature, our test results showed that the jitter disappeared, and so did the namespace leak and SLUB allocation failure problems.
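If you build runc directly, a sketch of the equivalent build (the tag list besides nokmem is illustrative; pick the feature tags your environment needs):
$ git clone https://github.com/opencontainers/runc && cd runc
$ make BUILDTAGS="seccomp nokmem"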
Operation steps
We need to disable the kmem accounting feature on both kubelet and Docker.
1. Recompile kubelet according to the corresponding kubelet version:
- For kubelet v1.14 or later, add Build Tags when you compile kubelet to disable kmem accounting:
$ make BUILDTAGS="nokmem"
- For kubelet v1.13 or earlier, we cannot add Build Tags when compiling kubelet. Instead, manually replace the two functions that enable kmem accounting with the following code:
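// These stubs replace the kmem-enabling helpers in kubelet's vendored
// runc/libcontainer cgroup code (the exact file varies by kubelet version);
// returning nil turns kmem accounting setup into a no-op.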
func EnableKernelMemoryAccounting(path string) error { return nil }
func setKernelMemory(path string, kernelMemoryLimit int64) error { return nil }
Then recompile kubelet.
2. Upgrade docker-ce to v18.09.1 or later. In these versions, kmem accounting of runc is disabled in Docker.
3. Reboot the system.
4. To verify that kmem accounting is disabled for all containers in a newly created pod, run the following command:
$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod<pod-uid>/<container-id>/memory.kmem.slabinfo
If the output is the following error, the kmem accounting feature was disabled successfully:
> cat: memory.kmem.slabinfo: Input/output error
Bug #2: Network device reference count leak
Keyword: kernel:unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Related issues in the community: