RHEL 7.6: NFS share hangs when NFS operations are in-progress and a network partition occurs involving an unexpected RST
SOLUTION VERIFIED - Updated January 16, 2020 07:13 - English
Environment
- Red Hat Enterprise Linux 7.6
- Red Hat Enterprise Linux as the NFS client
- kernel at least 3.10.0-957.el7 and before 3.10.0-957.5.1.el7
Issue
- After upgrading from the RHEL 7.5 kernel (3.10.0-862.el7) to the RHEL 7.6 kernel (3.10.0-957.el7), reads or writes to an NFS share by an NFS client hang indefinitely following a network partition and recovery.
- The NFS share is hung, with a task such as kworker consuming 100% of CPU, and /proc/<pid>/stack or another backtrace shows execution in the sunrpc transmit path.
- Processes cannot be killed, and a reboot of the NFS client is required.
- This is new behaviour, not seen in versions of RHEL 7 prior to RHEL 7.6.
Resolution
Red Hat Enterprise Linux 7
- This issue (Bug 1660784) has been resolved by errata RHSA-2019:0163 with the following package(s): kernel-3.10.0-957.5.1.el7 or later.
- As an alternative to the above errata, a kpatch on kernel-3.10.0-957.1.3.el7 was tracked in bugzilla 1730410: Bug 1730410 - RHEL7.6: NFS client sunrpc layer hangs and requires a reboot after a network disruption involving an RST from the server [kpatch 3.10.0-957.1.3.el7] (RHEL 7.7.0). As of Fri, July 19 2019, bugzilla 1730410 is CLOSED and the problem it describes is fixed in the current supported release. For details on which specific releases this bug is present or fixed in, contact Red Hat support, or check the changelog (and bz1730410) for the kpatch-patch package for the change that resolved this issue.
Workarounds
- Downgrade to a kernel prior to kernel-3.10.0-957.el7
- Use some approach to ensure the NFS (and lockd) TCP connections do not go idle and drop (note that the idle timer will drop a connection after 5 minutes). One possibility is to use a variation of a script to monitor for hung NFS mount points as described in https://access.redhat.com/solutions/97873.
- Use NFSv4.1 or NFSv4.0 (not fully reliable). NOTE: This issue can still occur on NFSv4.0 or NFSv4.1, but it is believed to be harder to reproduce because NFSv4.x periodically issues NFSv4 RENEW or SEQUENCE operations that keep the TCP connection active. However, on NFSv4.x any NFS operation completing with an error code (e.g. NFS4ERR_BAD_STATEID) will cause the TCP connection to drop, so this issue can still occur.
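The second workaround (keeping the connections from going idle) can be sketched as a simple polling loop. This is an assumption-laden sketch, not the script from the referenced solution: the interval, the 10-second timeout, and the nfs-keepalive logger tag are all arbitrary choices.

```shell
#!/bin/bash
# Keepalive sketch (an assumption, not a Red Hat-supplied tool): generate
# periodic traffic on every NFS mount so the sunrpc TCP connection never
# reaches the 5-minute idle timeout that triggers the disconnect.

# List every NFS/NFSv4 mount point; the optional argument (for testing)
# defaults to /proc/mounts.
list_nfs_mounts() {
    awk '$3 ~ /^nfs4?$/ {print $2}' "${1:-/proc/mounts}"
}

keepalive_loop() {
    # Interval is an assumption; anything well under 300 seconds works.
    local interval=${1:-120}
    while true; do
        list_nfs_mounts | while read -r mnt; do
            # A bounded stat issues an NFS request without hanging the
            # script itself if the mount is already stuck.
            timeout 10 stat -f "$mnt" >/dev/null 2>&1 \
                || logger -t nfs-keepalive "stat of $mnt timed out"
        done
        sleep "$interval"
    done
}

# Example invocation (runs forever, so backgrounded):
# keepalive_loop 120 &
```

A stat that itself times out is also an early warning that the mount may already be hung, hence the logger call.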
Root Cause
- A patch was backported to RHEL 7 to fix another bug (Bug 1614950), but one portion of the patch caused more severe connection side effects, such as this issue.
- The bug occurs because the TCP layer's state can get out of sync with the sunrpc layer's transport state. Under certain timing conditions, when an unexpected RST is handled, the TCP state changes to TCP_CLOSE while the sunrpc transport's connected bit (XPRT_CONNECTED) is not cleared. In addition, the sunrpc RPC state machine's transmit code treats the error it receives as retryable and does not release the transport lock bit (XPRT_LOCKED), thus blocking the connect_worker thread from running. As a result, the task driving the RPC state machine for the rpc_task enters an infinite loop, and the only way to recover the NFS mount and the machine is a reboot.
Diagnostic Steps
Simple diagnostic steps on a live system or sosreport: look for specific xprt state in debugfs for the hung NFS share
- If this issue is occurring, an NFS share will be hung / unresponsive, and the rpc_xprt state will be 0x1b. This can be checked either on a live system or from a sosreport.
# grep -B 2 "state: 0x1b" /sys/kernel/debug/sunrpc/rpc_xprt/*/info
addr: 192.0.2.1
port: 2049
state: 0x1b
# grep 192.0.2.1 /proc/mounts
nfs-server:/exports /mnt/nfs nfs rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.0.2.1,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.0.2.1 0 0
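The state value 0x1b can be decoded without the crash utility. The helper below is hypothetical (its name and structure are not from any Red Hat tool), but the bit names and positions mirror the definitions in include/linux/sunrpc/xprt.h for the RHEL 7 kernel, shown later in this article.

```shell
# Hypothetical helper: decode an rpc_xprt state value from debugfs into the
# bit names defined in include/linux/sunrpc/xprt.h (RHEL 7 kernel).
decode_xprt_state() {
    local val=$(( $1 )) i out=""
    # Bits 0-6 in order; XPRT_CONGESTED is bit 9 and handled separately.
    local names=(XPRT_LOCKED XPRT_CONNECTED XPRT_CONNECTING XPRT_CLOSE_WAIT \
                 XPRT_BOUND XPRT_BINDING XPRT_CLOSING)
    for i in "${!names[@]}"; do
        (( val & (1 << i) )) && out="$out ${names[$i]}"
    done
    (( val & (1 << 9) )) && out="$out XPRT_CONGESTED"
    echo "${out# }"
}

decode_xprt_state 0x1b
# -> XPRT_LOCKED XPRT_CONNECTED XPRT_CLOSE_WAIT XPRT_BOUND
```

XPRT_CONNECTED together with XPRT_LOCKED and XPRT_CLOSE_WAIT is exactly the contradictory state this issue produces: the transport still claims to be connected while the socket is closing.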
Simple diagnostic steps on a live system: one process consuming 100% CPU and a specific stack
- To recognize this issue, look for two things:
- a task spinning at 100% CPU
- the backtrace of the task inside the sunrpc code in the call_transmit code path
NOTE: Depending on the system call and whether the resulting NFS operation is asynchronous, the task will be a kworker, as in the example provided here. If the NFS operation is synchronous, it could be any task.
Use ‘top’ to get the highest CPU consumers and, assuming you see a task at the top running at near 100% CPU, get a backtrace of that task. The output of top and the backtrace should be similar to the following:
# top -b -o +%CPU | head -15
top - 22:08:08 up 13 min, 2 users, load average: 2.08, 1.71, 0.91
Tasks: 128 total, 2 running, 126 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1015000 total, 562112 free, 150196 used, 302692 buff/cache
KiB Swap: 1048572 total, 1048572 free, 0 used. 689744 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
42 root 20 0 0 0 0 R 94.1 0.0 8:22.04 kworker/u2:1
3954 root 20 0 161880 2088 1524 R 5.9 0.2 0:00.01 top
1 root 20 0 129564 6824 2836 S 0.0 0.7 0:01.57 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.05 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kworker/u2:0
7 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
# cat /proc/42/stack
[<ffffffff882d31b6>] __cond_resched+0x26/0x30
[<ffffffff8881d548>] lock_sock_nested+0x18/0x50
[<ffffffff88893ffb>] tcp_sendmsg+0x3b/0xc60
[<ffffffff888bfeb9>] inet_sendmsg+0x69/0xb0
[<ffffffff888193a6>] sock_sendmsg+0xb6/0xf0
[<ffffffff88819417>] kernel_sendmsg+0x37/0x50
[<ffffffffc04b1ef9>] xs_send_kvec+0xa9/0xb0 [sunrpc]
[<ffffffffc04b21e7>] xs_sendpages+0x67/0x200 [sunrpc]
[<ffffffffc04b3bb1>] xs_tcp_send_request+0x91/0x220 [sunrpc]
[<ffffffffc04b01a8>] xprt_transmit+0x78/0x390 [sunrpc]
[<ffffffffc04abf50>] call_transmit+0x1d0/0x2c0 [sunrpc]
[<ffffffffc04b9639>] __rpc_execute+0x99/0x420 [sunrpc]
[<ffffffffc04b99d2>] rpc_async_schedule+0x12/0x20 [sunrpc]
[<ffffffff882b9d4f>] process_one_work+0x17f/0x440
[<ffffffff882bade6>] worker_thread+0x126/0x3c0
[<ffffffff882c1c31>] kthread+0xd1/0xe0
[<ffffffff88974c37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
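The two checks above (a top CPU consumer plus a sunrpc transmit backtrace) can be combined into a quick one-shot test. This is a sketch under assumptions: the helper name is hypothetical, and the function list grepped for is taken from the example stack above.

```shell
# Returns success if a saved /proc/<pid>/stack capture shows execution in
# the sunrpc transmit path implicated in this issue.
stack_in_transmit_path() {
    grep -Eq 'xs_tcp_send_request|xprt_transmit|call_transmit' "$1"
}

# Find the top CPU consumer and check its kernel stack (requires root to
# read /proc/<pid>/stack; errors are suppressed if unreadable).
pid=$(ps -eo pid= --sort=-pcpu | awk 'NR==1 {print $1}')
if stack_in_transmit_path "/proc/$pid/stack" 2>/dev/null; then
    echo "PID $pid is spinning in the sunrpc transmit path"
fi
```

A match is only suggestive on its own; confirm with the debugfs state check (state: 0x1b) described earlier.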
Simple diagnostic steps on a live system: enable rpcdebug and look for loop
- Enable rpcdebug for the ‘rpc’ module and the ‘call’, ‘trans’, and ‘xprt’ flags
# rpcdebug -m rpc -s call -s trans -s xprt
- Look for a specific trace in /var/log/messages. If this issue is occurring, the rpcdebug output in /var/log/messages shows the following sequence repeatedly, where -EPIPE (-32) is returned from xs_tcp_send_request and seen by call_status, even though call_connect shows the xprt is connected.
Dec 7 15:13:10 rhel7u6-node1 kernel: RPC: 92 xprt_transmit(524516)
Dec 7 15:13:10 rhel7u6-node1 kernel: RPC: xs_tcp_send_request(524516) = -32
Dec 7 15:13:10 rhel7u6-node1 kernel: RPC: 92 call_status (status -32)
Dec 7 15:13:10 rhel7u6-node1 kernel: RPC: 92 call_bind (status 0)
Dec 7 15:13:10 rhel7u6-node1 kernel: RPC: 92 call_connect xprt ffff8ff8abf44000 is connected
Dec 7 15:13:10 rhel7u6-node1 kernel: RPC: 92 call_transmit (status 0)
Dec 7 15:13:10 rhel7u6-node1 kernel: RPC: 92 xprt_prepare_transmit
- Disable rpcdebug
# rpcdebug -m rpc -c call -c trans -c xprt
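To confirm the tight loop from the captured output, count the repeated -EPIPE returns per RPC request id; a large and growing count for a single id indicates the infinite retry loop. The log path assumes the default syslog target shown above.

```shell
# Count -32 (-EPIPE) returns from xs_tcp_send_request per request id.
# In the looping state, one id (e.g. 524516 in the trace above) dominates
# and the count climbs every time the command is re-run.
grep -o 'xs_tcp_send_request([0-9]*) = -32' /var/log/messages 2>/dev/null \
    | sort | uniq -c | sort -rn | head -5
```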
Simple diagnostic steps from a vmcore
- In a vmcore, look for the following two things
- a running task with a backtrace in the sunrpc transmit path
- on the rpc_xprt, the XPRT_CONNECTED bit set, while either there is no socket or the socket is in TCP_CLOSE status
- Example of vmcore diagnostics
crash> foreach active bt | grep -B 30 "call_transmit at"
#8 [ffff90ff294e3ee0] cpu_startup_entry at ffffffff87afc3ba
#9 [ffff90ff294e3f28] start_secondary at ffffffff87a57db7
#10 [ffff90ff294e3f50] start_cpu at ffffffff87a000d5
PID: 173 TASK: ffff90efa8912080 CPU: 1 COMMAND: "khungtaskd"
#0 [ffff90fe7fdfbcb0] machine_kexec at ffffffff87a63674
#1 [ffff90fe7fdfbd10] __crash_kexec at ffffffff87b1ce12
#2 [ffff90fe7fdfbde0] panic at ffffffff8815b4db
#3 [ffff90fe7fdfbe60] watchdog at ffffffff87b488ee
#4 [ffff90fe7fdfbec8] kthread at ffffffff87ac1c31
PID: 62746 TASK: ffff90ee756c4100 CPU: 20 COMMAND: "kworker/u289:7"
#0 [ffff90fe7ee88e48] crash_nmi_callback at ffffffff87a55d07
#1 [ffff90fe7ee88e58] nmi_handle at ffffffff8816c8fc
#2 [ffff90fe7ee88eb0] do_nmi at ffffffff8816cb1d
#3 [ffff90fe7ee88ef0] end_repeat_nmi at ffffffff8816bd69
[exception RIP: xdr_encode_opaque_fixed+25]
RIP: ffffffffc0790669 RSP: ffff90fc27293c70 RFLAGS: 00000202
RAX: 0000000020000000 RBX: ffff90fe5a44a0c8 RCX: ffff90fdb3a31064
RDX: 0000000000000020 RSI: ffff90fe5a44a0ca RDI: ffff90fdb3a31068
RBP: ffff90fc27293c90 R8: ffff90fe70302550 R9: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff90fdb3a31068
R13: 0000000000000020 R14: ffffffffc0800870 R15: ffff90fdb3a31064
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff90fc27293c70] xdr_encode_opaque_fixed at ffffffffc0790669 [sunrpc]
#5 [ffff90fc27293c98] xdr_encode_opaque at ffffffffc07906c9 [sunrpc]
#6 [ffff90fc27293ca8] encode_nfs_fh3 at ffffffffc07ffde6 [nfsv3]
#7 [ffff90fc27293cc8] nfs3_xdr_enc_read3args at ffffffffc0800893 [nfsv3]
#8 [ffff90fc27293cf0] rpcauth_wrap_req at ffffffffc0786670 [sunrpc]
#9 [ffff90fc27293d78] call_transmit at ffffffffc0775f1a [sunrpc]
crash> set 62746
PID: 62746
COMMAND: "kworker/u289:7"
TASK: ffff90ee756c4100 [THREAD_INFO: ffff90fc27290000]
CPU: 20
STATE: TASK_RUNNING (ACTIVE)
crash> bt
PID: 62746 TASK: ffff90ee756c4100 CPU: 20 COMMAND: "kworker/u289:7"
#0 [ffff90fe7ee88e48] crash_nmi_callback at ffffffff87a55d07
#1 [ffff90fe7ee88e58] nmi_handle at ffffffff8816c8fc
#2 [ffff90fe7ee88eb0] do_nmi at ffffffff8816cb1d
#3 [ffff90fe7ee88ef0] end_repeat_nmi at ffffffff8816bd69
[exception RIP: xdr_encode_opaque_fixed+25]
RIP: ffffffffc0790669 RSP: ffff90fc27293c70 RFLAGS: 00000202
RAX: 0000000020000000 RBX: ffff90fe5a44a0c8 RCX: ffff90fdb3a31064
RDX: 0000000000000020 RSI: ffff90fe5a44a0ca RDI: ffff90fdb3a31068
RBP: ffff90fc27293c90 R8: ffff90fe70302550 R9: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff90fdb3a31068
R13: 0000000000000020 R14: ffffffffc0800870 R15: ffff90fdb3a31064
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff90fc27293c70] xdr_encode_opaque_fixed at ffffffffc0790669 [sunrpc]
#5 [ffff90fc27293c98] xdr_encode_opaque at ffffffffc07906c9 [sunrpc]
#6 [ffff90fc27293ca8] encode_nfs_fh3 at ffffffffc07ffde6 [nfsv3]
#7 [ffff90fc27293cc8] nfs3_xdr_enc_read3args at ffffffffc0800893 [nfsv3]
#8 [ffff90fc27293cf0] rpcauth_wrap_req at ffffffffc0786670 [sunrpc]
#9 [ffff90fc27293d78] call_transmit at ffffffffc0775f1a [sunrpc]
#10 [ffff90fc27293da8] __rpc_execute at ffffffffc0783639 [sunrpc]
#11 [ffff90fc27293e10] rpc_async_schedule at ffffffffc07839d2 [sunrpc]
#12 [ffff90fc27293e20] process_one_work at ffffffff87ab9d4f
#13 [ffff90fc27293e68] worker_thread at ffffffff87abade6
#14 [ffff90fc27293ec8] kthread at ffffffff87ac1c31
crash> bt -f | grep call_transmit
#9 [ffff90fc27293d78] call_transmit at ffffffffc0775f1a [sunrpc]
crash> px (0xffff90fc27293d78 - 6*8)
$1 = 0xffff90fc27293d48
crash> rd 0xffff90fc27293d48
ffff90fc27293d48: ffff90fe70302398 .#0p....
crash> p ((struct rpc_task *)0xffff90fe70302398)->tk_client->cl_xprt
$2 = (struct rpc_xprt *) 0xffff90efa884a800
crash> rpc_xprt.state 0xffff90efa884a800
state = 27
crash> set hex
output radix: 16 (hex)
crash> rpc_xprt.state 0xffff90efa884a800
state = 0x1b
crash> eval -b 0x1b | grep bits
bits set: 4 3 1 0
382 /*
383 * Reserved bit positions in xprt->state
384 */
385 #define XPRT_LOCKED (0)
386 #define XPRT_CONNECTED (1) <-------- This bit is set indicating transport connected
387 #define XPRT_CONNECTING (2)
388 #define XPRT_CLOSE_WAIT (3)
389 #define XPRT_BOUND (4)
390 #define XPRT_BINDING (5)
391 #define XPRT_CLOSING (6)
392 #define XPRT_CONGESTED (9)
393
include/linux/sunrpc/xprt.h
crash> p ((struct sock_xprt *)0xffff90efa884a800)->inet->__sk_common.skc_state
$5 = 0x7 <-------------- The TCP state is TCP_CLOSE which contradicts the above RPC xprt state
crash> p (int)TCP_CLOSE
$7 = 0x7
16 enum {
17 TCP_ESTABLISHED = 1,
18 TCP_SYN_SENT,
19 TCP_SYN_RECV,
20 TCP_FIN_WAIT1,
21 TCP_FIN_WAIT2,
22 TCP_TIME_WAIT,
23 TCP_CLOSE,
24 TCP_CLOSE_WAIT,
25 TCP_LAST_ACK,
26 TCP_LISTEN,
27 TCP_CLOSING, /* Now a valid state */
28
29 TCP_MAX_STATES /* Leave at the end! */
30 };
include/net/tcp_states.h
Reproducer with detailed diagnostic steps
WARNING: The following test script will cause the NFS client to hang, a kworker task will spin at 100% CPU, and you will need to reboot the system to recover.
- The behavior can be reproduced by running the below steps
- Test script
# cat hang_nfs
mkdir -p /srv/export
bash -c "echo /srv/export 127.0.0.1\(rw,insecure,sync,no_root_squash\) > /etc/exports"
systemctl start nfs-server
exportfs -a
mkdir -p /mnt/nfs
mount -t nfs -o vers=4.1,noresvport,_netdev 127.0.0.1:/ /mnt/nfs
mkdir -p /mnt/nfs/srv/export/dir
bash -c "while true; do date ; time dd if=/dev/zero oflag=direct of=/mnt/nfs/srv/export/dir/file bs=16M count=100 ; done"
- Execute the script
# ./hang_nfs
- In a separate tty, add an iptables rule to block output traffic to TCP port 2049, the well-known port for NFS
# iptables -I OUTPUT -p tcp -m tcp --dport 2049 -j REJECT --reject-with tcp-reset
- Process ID for dd and its process stack
# pidof dd
3877
# cat /proc/$(pidof dd)/stack
[<ffffffffc07fc626>] nfs_file_direct_write+0x396/0x470 [nfs]
[<ffffffffc07f1a79>] nfs_file_write+0x139/0x1e0 [nfs]
[<ffffffff884405b3>] do_sync_write+0x93/0xe0
[<ffffffff884410a0>] vfs_write+0xc0/0x1f0
[<ffffffff88441ebf>] SyS_write+0x7f/0xf0
[<ffffffff88974ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
- Remove the rule
# iptables -D OUTPUT -p tcp -m tcp --dport 2049 -j REJECT --reject-with tcp-reset
- Kill the process to start a new dd within the while loop
# pkill --signal 9 dd
- The process hangs in inode_dio_wait
# cat /proc/$(pidof dd)/stack
[<ffffffff8845dce9>] inode_dio_wait+0xd9/0x100
[<ffffffffc084e17b>] nfs4_file_fsync+0x3b/0x170 [nfsv4]
[<ffffffff88475a6e>] vfs_fsync+0x2e/0x40
[<ffffffffc084e3ee>] nfs4_file_flush+0x5e/0x90 [nfsv4]
[<ffffffff8843e4f7>] filp_close+0x37/0x90
[<ffffffff88461338>] put_files_struct+0x88/0xe0
[<ffffffff88461439>] exit_files+0x49/0x50
[<ffffffff8829dc3f>] do_exit+0x2af/0xa40
[<ffffffff8829e44f>] do_group_exit+0x3f/0xa0
[<ffffffff882af24e>] get_signal_to_deliver+0x1ce/0x5e0
[<ffffffff8822b527>] do_signal+0x57/0x6f0
[<ffffffff8822bc32>] do_notify_resume+0x72/0xc0
[<ffffffff88975124>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
- Top CPU consumers
# top -b -o +%CPU | head -15
top - 22:08:08 up 13 min, 2 users, load average: 2.08, 1.71, 0.91
Tasks: 128 total, 2 running, 126 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1015000 total, 562112 free, 150196 used, 302692 buff/cache
KiB Swap: 1048572 total, 1048572 free, 0 used. 689744 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
42 root 20 0 0 0 0 R 94.1 0.0 8:22.04 kworker/u2:1
3954 root 20 0 161880 2088 1524 R 5.9 0.2 0:00.01 top
1 root 20 0 129564 6824 2836 S 0.0 0.7 0:01.57 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.05 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kworker/u2:0
7 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
- Looking at the kworker thread's stack, it is spinning in the sunrpc transmit path, repeatedly trying to lock and send on the TCP socket.
# cat /proc/42/stack
[<ffffffff882d31b6>] __cond_resched+0x26/0x30
[<ffffffff8881d548>] lock_sock_nested+0x18/0x50
[<ffffffff88893ffb>] tcp_sendmsg+0x3b/0xc60
[<ffffffff888bfeb9>] inet_sendmsg+0x69/0xb0
[<ffffffff888193a6>] sock_sendmsg+0xb6/0xf0
[<ffffffff88819417>] kernel_sendmsg+0x37/0x50
[<ffffffffc04b1ef9>] xs_send_kvec+0xa9/0xb0 [sunrpc]
[<ffffffffc04b21e7>] xs_sendpages+0x67/0x200 [sunrpc]
[<ffffffffc04b3bb1>] xs_tcp_send_request+0x91/0x220 [sunrpc]
[<ffffffffc04b01a8>] xprt_transmit+0x78/0x390 [sunrpc]
[<ffffffffc04abf50>] call_transmit+0x1d0/0x2c0 [sunrpc]
[<ffffffffc04b9639>] __rpc_execute+0x99/0x420 [sunrpc]
[<ffffffffc04b99d2>] rpc_async_schedule+0x12/0x20 [sunrpc]
[<ffffffff882b9d4f>] process_one_work+0x17f/0x440
[<ffffffff882bade6>] worker_thread+0x126/0x3c0
[<ffffffff882c1c31>] kthread+0xd1/0xe0
[<ffffffff88974c37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
- Socket information from ss
# ss -plit '( sport = :2049 )'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 64 *:nfs *:*
cubic rto:1000 mss:536 cwnd:10 segs_in:30 lastsnd:775312 lastrcv:775312 lastack:775312
LISTEN 0 64 :::nfs :::*
cubic rto:1000 mss:536 cwnd:10 lastsnd:775313 lastrcv:775313 lastack:775313