环境
- Red Hat Enterprise Linux (RHEL) 6.x
- Red Hat Enterprise Linux (RHEL) 7.x
- Using the native Gluster Clients
问题
- Gluster clients are getting hang.
-
The following errors were reported:
[11430(bash)]: gsch_scan(1179828,1,0) - interrupted & wait(1000) [11430(bash)]: gsch_scan(1179828,1,0) - interrupted & wait(1000): timeout = -512
-
Trend Micro Deep Security Agent is tainting the kernel, dsa_filter P(U), redirfs (U), gsch (U)
决议
Please contact Trend Micro with this issue. Trend Micro will request both the vmcore and sosreport.
Workaround
One possible workaround from Trend Micro is to exclude gluster binaries from their Real-Time Scanner:
- On the Deep Security Manager (DSM) console, go to the security profile or policy assigned to the machine.
- Go to Anti-malware > Real-time Scan > Exclusions tab.
- Under Directory List, click Create New or Edit the existing exclusions.
- Add the following entries:
/usr/sbin/glusterfs
/usr/sbin/glusterfsd
- Then save the changes.
根源
Currently, Trend Micro Deep Security Agent is not compatible with Gluster filesystem.
诊断步骤
The following vmcore was captured during slow system performance:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-642.11.1.el6.x86_64/vmlinux
DUMPFILE: /crash/vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Wed Jul 26 05:05:57 2017
UPTIME: 03:13:51
LOAD AVERAGE: 0.03, 0.02, 0.00
TASKS: 2326
NODENAME: gluster-client.example.com
RELEASE: 2.6.32-642.11.1.el6.x86_64
VERSION: #1 SMP Wed Oct 26 10:25:23 EDT 2016
MACHINE: x86_64 (2197 Mhz)
MEMORY: 128 GB
PANIC: "SysRq : Trigger a crash"
PID: 6503
COMMAND: "bash"
TASK: ffff881edbdb6ab0 [THREAD_INFO: ffff881edbddc000]
CPU: 4
STATE: TASK_RUNNING (SYSRQ)
The following Trend Micro kernel modules were loaded
crash> mod -t
NAME TAINTS
dsa_filter P(U)
redirfs (U)
gsch (U)
crash> sys -t
TAINTED_MASK: 1 P
The following dmesg errors were found:
[11430(bash)]: gsch_scan(1179828,1,0) - interrupted & wait(1000)
[11430(bash)]: gsch_scan(1179828,1,0) - interrupted & wait(1000): timeout = -512
[11452(bash)]: gsch_scan(1179671,1,0) - interrupted & wait(1000)
[11452(bash)]: gsch_scan(1179671,1,0) - interrupted & wait(1000): timeout = -512
[11455(bash)]: gsch_scan(535633,1,0) - interrupted & wait(1000)
[11455(bash)]: gsch_scan(535633,1,0) - interrupted & wait(1000): timeout = -512
[11485(bash)]: gsch_scan(1179671,1,0) - interrupted & wait others(1000)
[11485(bash)]: gsch_scan(1179671,1,0) - interrupted & wait others(1000): timeout = -512
[11678(bash)]: gsch_scan(1179671,1,0) - interrupted & wait others(1000)
[11678(bash)]: gsch_scan(1179671,1,0) - interrupted & wait others(1000): timeout = -512
[11794(bash)]: gsch_scan(1179671,1,0) - interrupted & wait others(1000)
[11794(bash)]: gsch_scan(1179671,1,0) - interrupted & wait others(1000): timeout = -512
[11806(bash)]: gsch_scan(1179828,1,0) - interrupted & wait others(1000)
[11806(bash)]: gsch_scan(1179828,1,0) - interrupted & wait others(1000): timeout = -512
SysRq : Trigger a crash
BUG: unable to handle kernel NULL pointer dereference at (null)
The following shows a huge number of processes in the sleep state S
(IN) currently waiting for 'gsch' module to return.
crash> foreach IN bt | grep ' #1' | grep gsch | wc -l
979
Some were even there for more than 2 hours.
crash> ps -m 9526
[0 02:12:40.030] [IN] PID: 9526 TASK: ffff881d771a4ab0 CPU: 1 COMMAND: "siebmtshmw"
crash> bt 9526
PID: 9526 TASK: ffff881d771a4ab0 CPU: 1 COMMAND: "siebmtshmw"
#0 [ffff881d2dba7a48] schedule at ffffffff81548980
#1 [ffff881d2dba7b20] gsch_scan at ffffffffa04a9efc [gsch]
#2 [ffff881d2dba7bf0] gsch_policy_handle_open at ffffffffa04a81b6 [gsch]
#3 [ffff881d2dba7c00] gsch_flt_open at ffffffffa04a679c [gsch]
#4 [ffff881d2dba7c50] rfs_precall_flts at ffffffffa0492aa0 [redirfs]
#5 [ffff881d2dba7c80] rfs_open at ffffffffa048b323 [redirfs]
#6 [ffff881d2dba7d30] __dentry_open at ffffffff81196c52
#7 [ffff881d2dba7d90] nameidata_to_filp at ffffffff81196fc4
#8 [ffff881d2dba7db0] do_filp_open at ffffffff811aceb0
#9 [ffff881d2dba7f20] do_sys_open at ffffffff811969f7
#10 [ffff881d2dba7f70] compat_sys_open at ffffffff811eeb4a
#11 [ffff881d2dba7f80] sysenter_dispatch at ffffffff8105b080
RIP: 0000000055555430 RSP: 0000000080dfd738 RFLAGS: 00000296
RAX: 0000000000000005 RBX: ffffffff8105b080 RCX: 0000000000000000
RDX: 00000000000001b6 RSI: 000000000060b40b RDI: 0000000000000002
RBP: 0000000080dfd798 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff811eeb4a
R13: ffff881d2dba7f78 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000005 CS: 0023 SS: 002b
From the above 979 processes, randomly picked the following process.
crash> ps -m 9538
[0 02:12:25.296] [IN] PID: 9538 TASK: ffff881d4605f520 CPU: 1 COMMAND: "siebmtshmw"
crash> bt 9538
PID: 9538 TASK: ffff881d4605f520 CPU: 1 COMMAND: "siebmtshmw"
#0 [ffff881d2dbefa48] schedule at ffffffff81548980
#1 [ffff881d2dbefb20] gsch_scan at ffffffffa04a9efc [gsch]
#2 [ffff881d2dbefbf0] gsch_policy_handle_open at ffffffffa04a81b6 [gsch]
#3 [ffff881d2dbefc00] gsch_flt_open at ffffffffa04a679c [gsch]
#4 [ffff881d2dbefc50] rfs_precall_flts at ffffffffa0492aa0 [redirfs]
#5 [ffff881d2dbefc80] rfs_open at ffffffffa048b323 [redirfs]
#6 [ffff881d2dbefd30] __dentry_open at ffffffff81196c52
#7 [ffff881d2dbefd90] nameidata_to_filp at ffffffff81196fc4
#8 [ffff881d2dbefdb0] do_filp_open at ffffffff811aceb0
#9 [ffff881d2dbeff20] do_sys_open at ffffffff811969f7
#10 [ffff881d2dbeff70] compat_sys_open at ffffffff811eeb4a
#11 [ffff881d2dbeff80] sysenter_dispatch at ffffffff8105b080
RIP: 0000000055555430 RSP: 0000000082dfb738 RFLAGS: 00000296
RAX: 0000000000000005 RBX: ffffffff8105b080 RCX: 0000000000000000
RDX: 00000000000001b6 RSI: 000000000060b40b RDI: 0000000000000002
RBP: 0000000082dfb798 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff811eeb4a
R13: ffff881d2dbeff78 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000005 CS: 0023 SS: 002b
The above process is trying to open '/etc/resolve.conf', but redirfs' intercepted the original 'open()' call and called 'gsch_scan()' function in 'gsch' module and this went to sleep with S
state.
crash> do_filp_open
do_filp_open = $4 =
{struct file *(int, struct filename *, int, int, int)} 0xffffffff811ac7e0 <do_filp_open>
crash> dis -lr ffffffff811969f7 | tail
0xffffffff811969dd <do_sys_open+0x4d>: js 0xffffffff81196a87 <do_sys_open+0xf7>
/usr/src/debug/kernel-2.6.32-642.11.1.el6/linux-2.6.32-642.11.1.el6.x86_64/fs/open.c: 907
0xffffffff811969e3 <do_sys_open+0x53>: xor %r8d,%r8d
0xffffffff811969e6 <do_sys_open+0x56>: mov %r15d,%ecx
0xffffffff811969e9 <do_sys_open+0x59>: mov %r14d,%edx
0xffffffff811969ec <do_sys_open+0x5c>: mov %r13,%rsi
0xffffffff811969ef <do_sys_open+0x5f>: mov %r12d,%edi
0xffffffff811969f2 <do_sys_open+0x62>: callq 0xffffffff811ac7e0 <do_filp_open>
/usr/src/debug/kernel-2.6.32-642.11.1.el6/linux-2.6.32-642.11.1.el6.x86_64/fs/open.c: 908
0xffffffff811969f7 <do_sys_open+0x67>: cmp $0xfffffffffffff000,%rax
/usr/src/debug/kernel-2.6.32-642.11.1.el6/linux-2.6.32-642.11.1.el6.x86_64/fs/namei.c: 2168
0xffffffff811ac7e0 <do_filp_open>: push %rbp
0xffffffff811ac7e1 <do_filp_open+0x1>: mov %rsp,%rbp
0xffffffff811ac7e4 <do_filp_open+0x4>: sub $0x160,%rsp
0xffffffff811ac7eb <do_filp_open+0xb>: mov %rbx,-0x28(%rbp)
0xffffffff811ac7ef <do_filp_open+0xf>: mov %r12,-0x20(%rbp)
0xffffffff811ac7f3 <do_filp_open+0x13>: mov %r13,-0x18(%rbp)
crash> bt -f | grep do_sys_open -B 1
ffff881d2dba7f18: ffff881d2dba7f68 ffffffff811969f7
#9 [ffff881d2dba7f20] do_sys_open at ffffffff811969f7
crash> px 0xffff881d2dba7f18-0x18
$5 = 0xffff881d2dba7f00
crash> rd 0xffff881d2dba7f00
ffff881d2dba7f00: ffff881d2d932000 . .-....
crash> struct filename ffff881d2d932000
struct filename {
name = 0xffff881d2d932020 "/etc/resolv.conf",
uptr = 0x60d18c <Address 0x60d18c out of bounds>,
aname = 0xffff881db791c460,
separate = 0x0
}
This needs to be further investigated by gsch
module provider (Trend Micro).