RHEL 7.6: NFS share hangs when NFS operations are in-progress and a network partition occurs involving an unexpected RST

SOLUTION VERIFIED - Updated January 16 2020 at 07:13 - English

Environment

  • Red Hat Enterprise Linux 7.6
  • Red Hat Enterprise Linux as the NFS client
    • kernel 3.10.0-957.el7 or later, but earlier than 3.10.0-957.5.1.el7

Issue

  • After upgrading from the RHEL 7.5 kernel (3.10.0-862.el7) to the RHEL 7.6 kernel (3.10.0-957.el7), reads or writes to an NFS share by an NFS client hang indefinitely following a network partition and recovery.
  • The NFS share is hung with a task, such as a kworker, consuming 100% of a CPU, and /proc/<pid>/stack or another backtrace shows execution in the sunrpc transmit path. Processes cannot be killed and a reboot of the NFS client is required.
  • This is new behaviour that is not seen in versions of RHEL 7 prior to RHEL 7.6.

Resolution

Red Hat Enterprise Linux 7
  • The issue (bug 1660784) has been resolved with errata RHSA-2019:0163 and the following package(s): kernel-3.10.0-957.5.1.el7 or later. A quick way to check the running kernel against that version is sketched after these bullets.
  • As an alternative to the above errata, a kpatch for kernel-3.10.0-957.1.3.el7 was tracked in Bug 1730410 - RHEL7.6: NFS client sunrpc layer hangs and requires a reboot after a network disruption involving an RST from the server [kpatch 3.10.0-957.1.3.el7] (RHEL 7, 7.7.0). As of Friday, July 19, 2019, the status of Bug 1730410 is CLOSED and the problem described has been fixed in the current supported release. For details on which specific releases contain or fix this bug, contact Red Hat support. Check the changelog (and bz1730410) for the kpatch-patch package to find the change that resolved this issue.
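
To check whether the running kernel already contains the fix from RHSA-2019:0163, compare its version with kernel-3.10.0-957.5.1.el7. Below is a minimal sketch, assuming the rpmdev-vercmp helper from the rpmdevtools package is available (it applies RPM's own version-comparison rules and exits with 11 when the first argument is newer and 12 when it is older):

running=$(uname -r | sed 's/\.x86_64$//')   # strip the arch suffix (x86_64 assumed)
fixed="3.10.0-957.5.1.el7"                  # first kernel containing the fix
rpmdev-vercmp "$running" "$fixed" >/dev/null
case $? in
    0|11) echo "$running already contains the fix (RHSA-2019:0163)" ;;
    12)   echo "$running is older than $fixed and may be affected" ;;
esac
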
Workarounds
  1. Downgrade to a kernel prior to kernel-3.10.0-957.el7

  2. Use some approach to ensure the NFS (and lockd) TCP connections do not go idle and drop (note that the idle timer will drop a connection after 5 minutes). One possibility is to use a variation of a script to monitor for hung NFS mount points as described in https://access.redhat.com/solutions/97873; a minimal keepalive sketch is also shown after this list.

  3. Use NFSv4.1 or NFSv4.0 (not fully reliable). NOTE: This issue can still occur on NFSv4.0 or NFSv4.1, but it is believed to be harder to reproduce because NFSv4.x periodically issues NFSv4 RENEW or SEQUENCE operations to keep the TCP connection active. However, on NFSv4.x any NFS operation completing with an error code (e.g. NFS4ERR_BAD_STATEID) will cause the TCP connection to drop, so this issue can still occur.
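
A minimal keepalive sketch for workaround 2 is shown below. The interval, timeout, and logging are illustrative assumptions rather than part of the original solution, and the script only keeps the NFS mounts themselves busy (lockd's separate connection is not covered). Running it from cron at an interval well under 5 minutes keeps traffic flowing on each NFS mount so the idle timer never fires:

#!/bin/bash
# Issue a statfs against every NFS mount so its TCP connection never goes idle.
while read -r _ mountpoint fstype _; do
    case "$fstype" in
        nfs|nfs4)
            # timeout prevents this script from hanging if a mount is already stuck
            if ! timeout 10 stat -f "$mountpoint" >/dev/null 2>&1; then
                logger -t nfs-keepalive "WARNING: $mountpoint did not respond"
            fi
            ;;
    esac
done < /proc/mounts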

Root Cause

  • A patch was backported to RHEL 7 to fix another bug (1614950), but one portion of that patch introduced more severe connection-handling side effects, including this issue.
  • The bug occurs because the TCP layer's state can get out of sync with the sunrpc layer's transport state. Under certain timing conditions, when an unexpected RST is handled, the TCP state can change to TCP_CLOSE while the sunrpc transport's connected bit (XPRT_CONNECTED) is never cleared. In addition, the sunrpc RPC state machine's transmit code treats the error it receives as a 'retryable' error and does not release the transport lock bit (XPRT_LOCKED), which blocks the connect_worker thread from running. As a result, the task driving the RPC state machine for the rpc_task goes into an infinite loop, and the only way to recover the NFS mount and the machine is to reboot.

Diagnostic Steps

Simple diagnostic steps on a live system or sosreport: look for specific xprt state in debugfs for the hung NFS share

  • If this issue is occurring, an NFS share will be hung / unresponsive, and the rpc_xprt state will be 0x1b. This can be checked either on a live system or from a sosreport.
# grep -B 2 "state: 0x1b" /sys/kernel/debug/sunrpc/rpc_xprt/*/info 
addr:  192.0.2.1
port:  2049
state: 0x1b
# grep 192.0.2.1 /proc/mounts 
nfs-server:/exports /mnt/nfs nfs rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.0.2.1,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.0.2.1 0 0
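
  • The same check can be run across all transports at once. The following is a small sketch (assuming debugfs is mounted at /sys/kernel/debug, the default on RHEL 7) that walks every rpc_xprt entry, flags any transport reporting state 0x1b, and maps it back to a mount via the addr= mount option:
for info in /sys/kernel/debug/sunrpc/rpc_xprt/*/info; do
    addr=$(awk '/^addr:/ {print $2}' "$info")
    state=$(awk '/^state:/ {print $2}' "$info")
    if [ "$state" = "0x1b" ]; then
        echo "POSSIBLE HIT: transport to $addr is in state 0x1b"
        grep "addr=$addr" /proc/mounts
    fi
done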

Simple diagnostic steps on a live system: one process consuming 100% CPU and a specific stack

  • To recognize this issue, look for two things:
    1. a task spinning at 100% CPU
    2. the backtrace of the task inside the sunrpc code in the call_transmit code path

NOTE: Depending on the system call and whether the resulting NFS operation is asynchronous, the task will be a kworker, as in the example provided here. If the NFS operation is synchronous, it could be any task.

Use ‘top’ to get the highest CPU consumers, and assuming you see a task at the top running at near 100% CPU, get a backtrace of the task. The output of top and the backtrace should be similar to the following:

# top -b -o +%CPU | head -15
top - 22:08:08 up 13 min,  2 users,  load average: 2.08, 1.71, 0.91
Tasks: 128 total,   2 running, 126 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1015000 total,   562112 free,   150196 used,   302692 buff/cache
KiB Swap:  1048572 total,  1048572 free,        0 used.   689744 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   42 root      20   0       0      0      0 R 94.1  0.0   8:22.04 kworker/u2:1
 3954 root      20   0  161880   2088   1524 R  5.9  0.2   0:00.01 top
    1 root      20   0  129564   6824   2836 S  0.0  0.7   0:01.57 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 S  0.0  0.0   0:00.05 ksoftirqd/0
    5 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0       0      0      0 S  0.0  0.0   0:00.02 kworker/u2:0
    7 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 migration/0
# cat /proc/42/stack
[<ffffffff882d31b6>] __cond_resched+0x26/0x30
[<ffffffff8881d548>] lock_sock_nested+0x18/0x50
[<ffffffff88893ffb>] tcp_sendmsg+0x3b/0xc60
[<ffffffff888bfeb9>] inet_sendmsg+0x69/0xb0
[<ffffffff888193a6>] sock_sendmsg+0xb6/0xf0
[<ffffffff88819417>] kernel_sendmsg+0x37/0x50
[<ffffffffc04b1ef9>] xs_send_kvec+0xa9/0xb0 [sunrpc]
[<ffffffffc04b21e7>] xs_sendpages+0x67/0x200 [sunrpc]
[<ffffffffc04b3bb1>] xs_tcp_send_request+0x91/0x220 [sunrpc]
[<ffffffffc04b01a8>] xprt_transmit+0x78/0x390 [sunrpc]
[<ffffffffc04abf50>] call_transmit+0x1d0/0x2c0 [sunrpc]
[<ffffffffc04b9639>] __rpc_execute+0x99/0x420 [sunrpc]
[<ffffffffc04b99d2>] rpc_async_schedule+0x12/0x20 [sunrpc]
[<ffffffff882b9d4f>] process_one_work+0x17f/0x440
[<ffffffff882bade6>] worker_thread+0x126/0x3c0
[<ffffffff882c1c31>] kthread+0xd1/0xe0
[<ffffffff88974c37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
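
  • The two checks can be combined into a quick one-shot sketch. It assumes, as in the example above, that the spinning task shows call_transmit / xs_tcp_send_request in its kernel stack; note that the %CPU column from ps is averaged over the task's lifetime, so a very recent hang may still require top:
for pid in $(ps -eo pid= --sort=-%cpu | head -5); do
    if grep -Eq 'call_transmit|xs_tcp_send_request' "/proc/$pid/stack" 2>/dev/null; then
        echo "PID $pid appears stuck in the sunrpc transmit path:"
        cat "/proc/$pid/stack"
    fi
done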

Simple diagnostic steps on a live system: enable rpcdebug and look for loop

  • Enable rpcdebug for the ‘rpc’ module and the ‘call’, ‘trans’, and ‘xprt’ flags
# rpcdebug -m rpc -s call -s trans -s xprt
  • Look for a specific trace in /var/log/messages. If this issue is occurring, the rpcdebug output in /var/log/messages shows the following sequence repeatedly, where -EPIPE (-32) is returned by xs_tcp_send_request and reported by call_status, even though call_connect shows that the xprt is connected (a quick way to count the repetitions is sketched after this list).
Dec  7 15:13:10 rhel7u6-node1 kernel: RPC:    92 xprt_transmit(524516)
Dec  7 15:13:10 rhel7u6-node1 kernel: RPC:       xs_tcp_send_request(524516) = -32
Dec  7 15:13:10 rhel7u6-node1 kernel: RPC:    92 call_status (status -32) 
Dec  7 15:13:10 rhel7u6-node1 kernel: RPC:    92 call_bind (status 0)
Dec  7 15:13:10 rhel7u6-node1 kernel: RPC:    92 call_connect xprt ffff8ff8abf44000 is connected  
Dec  7 15:13:10 rhel7u6-node1 kernel: RPC:    92 call_transmit (status 0)
Dec  7 15:13:10 rhel7u6-node1 kernel: RPC:    92 xprt_prepare_transmit
  • Disable rpcdebug
# rpcdebug -m rpc -c call -c trans -c xprt
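  • While the debug flags are still enabled, the transmit loop can also be confirmed by counting how often the -EPIPE send failure repeats. This assumes kernel messages reach /var/log/messages via rsyslog, as on a default RHEL 7 installation; a count that climbs rapidly between two runs a few seconds apart indicates the loop described above.
# grep -c 'xs_tcp_send_request(.*) = -32' /var/log/messages
# sleep 5; grep -c 'xs_tcp_send_request(.*) = -32' /var/log/messages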

Simple diagnostic steps from a vmcore

  • In a vmcore, look for the following two things
    1. a running task with a backtrace in the sunrpc transmit path
    2. on the rpc_xprt, the XPRT_CONNECTED bit set, while either there is no socket or the socket is in TCP_CLOSE status
  • Example of vmcore diagnostics
crash> foreach active bt | grep -B 30 "call_transmit at"
 #8 [ffff90ff294e3ee0] cpu_startup_entry at ffffffff87afc3ba
 #9 [ffff90ff294e3f28] start_secondary at ffffffff87a57db7
#10 [ffff90ff294e3f50] start_cpu at ffffffff87a000d5

PID: 173    TASK: ffff90efa8912080  CPU: 1   COMMAND: "khungtaskd"
 #0 [ffff90fe7fdfbcb0] machine_kexec at ffffffff87a63674
 #1 [ffff90fe7fdfbd10] __crash_kexec at ffffffff87b1ce12
 #2 [ffff90fe7fdfbde0] panic at ffffffff8815b4db
 #3 [ffff90fe7fdfbe60] watchdog at ffffffff87b488ee
 #4 [ffff90fe7fdfbec8] kthread at ffffffff87ac1c31

PID: 62746  TASK: ffff90ee756c4100  CPU: 20  COMMAND: "kworker/u289:7"
 #0 [ffff90fe7ee88e48] crash_nmi_callback at ffffffff87a55d07
 #1 [ffff90fe7ee88e58] nmi_handle at ffffffff8816c8fc
 #2 [ffff90fe7ee88eb0] do_nmi at ffffffff8816cb1d
 #3 [ffff90fe7ee88ef0] end_repeat_nmi at ffffffff8816bd69
    [exception RIP: xdr_encode_opaque_fixed+25]
    RIP: ffffffffc0790669  RSP: ffff90fc27293c70  RFLAGS: 00000202
    RAX: 0000000020000000  RBX: ffff90fe5a44a0c8  RCX: ffff90fdb3a31064
    RDX: 0000000000000020  RSI: ffff90fe5a44a0ca  RDI: ffff90fdb3a31068
    RBP: ffff90fc27293c90   R8: ffff90fe70302550   R9: 0000000000000001
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff90fdb3a31068
    R13: 0000000000000020  R14: ffffffffc0800870  R15: ffff90fdb3a31064
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff90fc27293c70] xdr_encode_opaque_fixed at ffffffffc0790669 [sunrpc]
 #5 [ffff90fc27293c98] xdr_encode_opaque at ffffffffc07906c9 [sunrpc]
 #6 [ffff90fc27293ca8] encode_nfs_fh3 at ffffffffc07ffde6 [nfsv3]
 #7 [ffff90fc27293cc8] nfs3_xdr_enc_read3args at ffffffffc0800893 [nfsv3]
 #8 [ffff90fc27293cf0] rpcauth_wrap_req at ffffffffc0786670 [sunrpc]
 #9 [ffff90fc27293d78] call_transmit at ffffffffc0775f1a [sunrpc]
crash> set 62746
    PID: 62746
COMMAND: "kworker/u289:7"
   TASK: ffff90ee756c4100  [THREAD_INFO: ffff90fc27290000]
    CPU: 20
  STATE: TASK_RUNNING (ACTIVE)
crash> bt
PID: 62746  TASK: ffff90ee756c4100  CPU: 20  COMMAND: "kworker/u289:7"
 #0 [ffff90fe7ee88e48] crash_nmi_callback at ffffffff87a55d07
 #1 [ffff90fe7ee88e58] nmi_handle at ffffffff8816c8fc
 #2 [ffff90fe7ee88eb0] do_nmi at ffffffff8816cb1d
 #3 [ffff90fe7ee88ef0] end_repeat_nmi at ffffffff8816bd69
    [exception RIP: xdr_encode_opaque_fixed+25]
    RIP: ffffffffc0790669  RSP: ffff90fc27293c70  RFLAGS: 00000202
    RAX: 0000000020000000  RBX: ffff90fe5a44a0c8  RCX: ffff90fdb3a31064
    RDX: 0000000000000020  RSI: ffff90fe5a44a0ca  RDI: ffff90fdb3a31068
    RBP: ffff90fc27293c90   R8: ffff90fe70302550   R9: 0000000000000001
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff90fdb3a31068
    R13: 0000000000000020  R14: ffffffffc0800870  R15: ffff90fdb3a31064
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff90fc27293c70] xdr_encode_opaque_fixed at ffffffffc0790669 [sunrpc]
 #5 [ffff90fc27293c98] xdr_encode_opaque at ffffffffc07906c9 [sunrpc]
 #6 [ffff90fc27293ca8] encode_nfs_fh3 at ffffffffc07ffde6 [nfsv3]
 #7 [ffff90fc27293cc8] nfs3_xdr_enc_read3args at ffffffffc0800893 [nfsv3]
 #8 [ffff90fc27293cf0] rpcauth_wrap_req at ffffffffc0786670 [sunrpc]
 #9 [ffff90fc27293d78] call_transmit at ffffffffc0775f1a [sunrpc]
#10 [ffff90fc27293da8] __rpc_execute at ffffffffc0783639 [sunrpc]
#11 [ffff90fc27293e10] rpc_async_schedule at ffffffffc07839d2 [sunrpc]
#12 [ffff90fc27293e20] process_one_work at ffffffff87ab9d4f
#13 [ffff90fc27293e68] worker_thread at ffffffff87abade6
#14 [ffff90fc27293ec8] kthread at ffffffff87ac1c31
crash> bt -f | grep call_transmit
 #9 [ffff90fc27293d78] call_transmit at ffffffffc0775f1a [sunrpc]
crash> px (0xffff90fc27293d78 - 6*8)
$1 = 0xffff90fc27293d48
crash> rd 0xffff90fc27293d48
ffff90fc27293d48:  ffff90fe70302398                    .#0p....
crash> p ((struct rpc_task *)0xffff90fe70302398)->tk_client->cl_xprt
$2 = (struct rpc_xprt *) 0xffff90efa884a800
crash> rpc_xprt.state 0xffff90efa884a800
  state = 27
crash> set hex
output radix: 16 (hex)
crash> rpc_xprt.state 0xffff90efa884a800
  state = 0x1b
crash> eval -b 0x1b | grep bits
   bits set: 4 3 1 0 
/*
 * Reserved bit positions in xprt->state
 */
#define XPRT_LOCKED             (0)
#define XPRT_CONNECTED          (1)   <-------- This bit is set indicating transport connected
#define XPRT_CONNECTING         (2)
#define XPRT_CLOSE_WAIT         (3)
#define XPRT_BOUND              (4)
#define XPRT_BINDING            (5)
#define XPRT_CLOSING            (6)
#define XPRT_CONGESTED          (9)
include/linux/sunrpc/xprt.h

crash> p ((struct sock_xprt *)0xffff90efa884a800)->inet->__sk_common.skc_state
$5 = 0x7         <-------------- The TCP state is TCP_CLOSE which contradicts the above RPC xprt state
crash> p (int)TCP_CLOSE
$7 = 0x7

enum {
        TCP_ESTABLISHED = 1,
        TCP_SYN_SENT,
        TCP_SYN_RECV,
        TCP_FIN_WAIT1,
        TCP_FIN_WAIT2,
        TCP_TIME_WAIT,
        TCP_CLOSE,
        TCP_CLOSE_WAIT,
        TCP_LAST_ACK,
        TCP_LISTEN,
        TCP_CLOSING,    /* Now a valid state */

        TCP_MAX_STATES  /* Leave at the end! */
};
include/net/tcp_states.h

Reproducer with detailed diagnostic steps

WARNING: The following test script will cause the NFS client to hang, a kworker task will spin at 100% CPU, and you will need to reboot the system to recover.

  • The behavior can be reproduced by running the steps below
  • Test script
# cat hang_nfs
mkdir -p /srv/export
bash -c "echo /srv/export 127.0.0.1\(rw,insecure,sync,no_root_squash\) > /etc/exports"
systemctl start nfs-server
exportfs -a
mkdir -p /mnt/nfs
mount -t nfs -o vers=4.1,noresvport,_netdev 127.0.0.1:/ /mnt/nfs
mkdir -p /mnt/nfs/srv/export/dir
bash -c "while true; do date ; time dd if=/dev/zero oflag=direct of=/mnt/nfs/srv/export/dir/file bs=16M count=100 ; done""
  • Execute the script
# ./hang_nfs
  • In a separate tty, add an iptables rule to block outbound traffic to TCP port 2049, the well-known port for NFS
# iptables -I OUTPUT -p tcp -m tcp --dport 2049 -j REJECT --reject-with tcp-reset
  • Process ID of the dd process and its kernel stack
# pidof dd
3877

# cat /proc/$(pidof dd)/stack
[<ffffffffc07fc626>] nfs_file_direct_write+0x396/0x470 [nfs]
[<ffffffffc07f1a79>] nfs_file_write+0x139/0x1e0 [nfs]
[<ffffffff884405b3>] do_sync_write+0x93/0xe0
[<ffffffff884410a0>] vfs_write+0xc0/0x1f0
[<ffffffff88441ebf>] SyS_write+0x7f/0xf0
[<ffffffff88974ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
  • Remove the rule
# iptables -D OUTPUT -p tcp -m tcp --dport 2049 -j REJECT --reject-with tcp-reset
  • Kill the process so that a new dd starts within the while loop
# pkill --signal 9 dd
  • The process hangs in inode_dio_wait
# cat /proc/$(pidof dd)/stack
[<ffffffff8845dce9>] inode_dio_wait+0xd9/0x100
[<ffffffffc084e17b>] nfs4_file_fsync+0x3b/0x170 [nfsv4]
[<ffffffff88475a6e>] vfs_fsync+0x2e/0x40
[<ffffffffc084e3ee>] nfs4_file_flush+0x5e/0x90 [nfsv4]
[<ffffffff8843e4f7>] filp_close+0x37/0x90
[<ffffffff88461338>] put_files_struct+0x88/0xe0
[<ffffffff88461439>] exit_files+0x49/0x50
[<ffffffff8829dc3f>] do_exit+0x2af/0xa40
[<ffffffff8829e44f>] do_group_exit+0x3f/0xa0
[<ffffffff882af24e>] get_signal_to_deliver+0x1ce/0x5e0
[<ffffffff8822b527>] do_signal+0x57/0x6f0
[<ffffffff8822bc32>] do_notify_resume+0x72/0xc0
[<ffffffff88975124>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
  • Top CPU consumers
# top -b -o +%CPU | head -15
top - 22:08:08 up 13 min,  2 users,  load average: 2.08, 1.71, 0.91
Tasks: 128 total,   2 running, 126 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1015000 total,   562112 free,   150196 used,   302692 buff/cache
KiB Swap:  1048572 total,  1048572 free,        0 used.   689744 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   42 root      20   0       0      0      0 R 94.1  0.0   8:22.04 kworker/u2:1
 3954 root      20   0  161880   2088   1524 R  5.9  0.2   0:00.01 top
    1 root      20   0  129564   6824   2836 S  0.0  0.7   0:01.57 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 S  0.0  0.0   0:00.05 ksoftirqd/0
    5 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0       0      0      0 S  0.0  0.0   0:00.02 kworker/u2:0
    7 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 migration/0
  • Looking at the kworker thread's stack, it is spinning in the sunrpc transmit path, repeatedly rescheduling and trying to take the socket lock to send on the TCP socket.
# cat /proc/42/stack
[<ffffffff882d31b6>] __cond_resched+0x26/0x30
[<ffffffff8881d548>] lock_sock_nested+0x18/0x50
[<ffffffff88893ffb>] tcp_sendmsg+0x3b/0xc60
[<ffffffff888bfeb9>] inet_sendmsg+0x69/0xb0
[<ffffffff888193a6>] sock_sendmsg+0xb6/0xf0
[<ffffffff88819417>] kernel_sendmsg+0x37/0x50
[<ffffffffc04b1ef9>] xs_send_kvec+0xa9/0xb0 [sunrpc]
[<ffffffffc04b21e7>] xs_sendpages+0x67/0x200 [sunrpc]
[<ffffffffc04b3bb1>] xs_tcp_send_request+0x91/0x220 [sunrpc]
[<ffffffffc04b01a8>] xprt_transmit+0x78/0x390 [sunrpc]
[<ffffffffc04abf50>] call_transmit+0x1d0/0x2c0 [sunrpc]
[<ffffffffc04b9639>] __rpc_execute+0x99/0x420 [sunrpc]
[<ffffffffc04b99d2>] rpc_async_schedule+0x12/0x20 [sunrpc]
[<ffffffff882b9d4f>] process_one_work+0x17f/0x440
[<ffffffff882bade6>] worker_thread+0x126/0x3c0
[<ffffffff882c1c31>] kthread+0xd1/0xe0
[<ffffffff88974c37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
  • Socket information from ss
# ss -plit '( sport = :2049 )'
State       Recv-Q Send-Q                                                          Local Address:Port                                                                           Peer Address:Port
LISTEN      0      64                                                                          *:nfs                                                                                       *:*
         cubic rto:1000 mss:536 cwnd:10 segs_in:30 lastsnd:775312 lastrcv:775312 lastack:775312
LISTEN      0      64                                                                         :::nfs                                                                                      :::*
         cubic rto:1000 mss:536 cwnd:10 lastsnd:775313 lastrcv:775313 lastack:775313