Environment
SUSE Linux Enterprise Server 15 SP2
SUSE Linux Enterprise Server 12 SP5
Situation
This problem manifests itself in several ways. The one observed in the field was a Linux NFS client timing out, with the following message logged in the system log:
nfs: server *HOSTNAME* not responding, still trying
and after several minutes
nfs: server *HOSTNAME* OK
Because of an in-kernel retransmit timer (or another packet being queued), the stuck packet will eventually be sent out, after a delay.
In tcpdump packet capture analysis, this problem can be identified by spurious resend attempts of the same packet (with equal TSVal) a long time apart.
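The symptoms described above can be checked for with a short script. The following is a minimal sketch, not an official diagnostic: the helper name count_nfs_stalls, the log path /var/log/messages, and the capture file name are assumptions for illustration, and port 2049 assumes NFS is running on its standard port.

```shell
# Count NFS timeout messages in a log file.
# The helper name is hypothetical; pass the path of your system log.
count_nfs_stalls() {
    grep -c 'nfs: server .* not responding, still trying' "$1"
}

# Example usage (assumed log location on a default SLES setup):
#   count_nfs_stalls /var/log/messages
#
# To look for resends of the same packet, capture NFS traffic and
# inspect the TCP timestamp values (shown by tcpdump as "TS val"):
#   tcpdump -i "$devname" -w /tmp/nfs-stall.pcap port 2049
#   tcpdump -nn -r /tmp/nfs-stall.pcap | grep 'TS val'
```

A non-zero count only indicates that NFS timeouts occurred. Confirming this particular problem still requires finding repeated packets with equal TSVal in the capture.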
Resolution
This problem has been reported upstream [3], and the Linux kernel community is still working on a proper fix.
SUSE has released a kernel maintenance update that mitigates this problem by disabling the lockless optimization on the pfifo_fast qdisc (the only qdisc currently making use of this optimization) [4].
The issue is solved in the following kernel versions:
- SLES15 SP2: 5.3.18-24.61
- SLES12 SP5: 4.12.14-122.66
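To check whether a system already runs a kernel with the mitigation, the running release can be compared against the versions listed above. This is a sketch under stated assumptions: the helper name kernel_has_fix is hypothetical, it assumes `uname -r` reports the version followed by a flavor suffix such as "-default", and it relies on GNU `sort -V`. The comparison is only meaningful within the same codestream, so pass the fixed version that matches the installed service pack.

```shell
# Succeeds if the running (or given) kernel release is at or above
# the fixed version. Hypothetical helper, for illustration only.
#   $1 = fixed version, for example 5.3.18-24.61
#   $2 = release to test (optional, defaults to `uname -r`)
kernel_has_fix() {
    fixed="$1"
    running="${2:-$(uname -r)}"
    # Strip a trailing flavor suffix such as "-default" (assumed naming).
    running="${running%-[a-z]*}"
    # Version-sort both strings; the fix is present when the fixed
    # version sorts first or both are equal.
    lowest=$(printf '%s\n%s\n' "$fixed" "$running" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}

# Example usage on SLES15 SP2:
#   kernel_has_fix 5.3.18-24.61 && echo "mitigation included"
# Example usage on SLES12 SP5:
#   kernel_has_fix 4.12.14-122.66 && echo "mitigation included"
```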
Alternatively, to address the problem without a kernel update and reboot, switch away from the pfifo_fast qdisc and make the change persistent across reboots with the following commands:
echo 'net.core.default_qdisc = fq_codel' >>/etc/sysctl.conf
sysctl -w net.core.default_qdisc=fq_codel
tc qdisc add dev $devname root handle 1: mq
tc qdisc del dev $devname root
If $devname is not a multiqueue-capable device, use the following commands instead:
echo 'net.core.default_qdisc = fq_codel' >>/etc/sysctl.conf
sysctl -w net.core.default_qdisc=fq_codel
tc qdisc add dev $devname root handle 1: fq_codel
tc qdisc del dev $devname root
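Before and after applying the workaround, it can help to see whether a device has more than one transmit queue and whether pfifo_fast is still attached. The following is a rough sketch with hypothetical helper names (tx_queue_count, has_pfifo_fast). It assumes that more than one tx-* entry under /sys/class/net/<device>/queues indicates a multiqueue device, and that `tc qdisc show` prints one line per qdisc beginning with "qdisc <name>".

```shell
# Print the number of transmit queues a device exposes in sysfs.
# Assumption: a count above 1 indicates a multiqueue-capable device.
tx_queue_count() {
    ls -d "/sys/class/net/$1/queues"/tx-* 2>/dev/null | wc -l
}

# Read `tc qdisc show` output on stdin; succeed if pfifo_fast is listed.
has_pfifo_fast() {
    grep -q '^qdisc pfifo_fast '
}

# Example usage:
#   tx_queue_count "$devname"
#   tc qdisc show dev "$devname" | has_pfifo_fast && echo "pfifo_fast still in use"
#   sysctl net.core.default_qdisc
```

After the workaround, has_pfifo_fast should no longer succeed for the device, and the sysctl should report fq_codel.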
Cause
The Linux kernel implements various algorithms, called queuing disciplines (qdiscs), for scheduling outgoing network packets. Starting with Linux kernel 4.16, there is an optimization that allows these algorithms to process packets without acquiring any locks, with the ultimate goal of improving throughput. These changes were implemented upstream in commits [1] [2], and SUSE has backported those commits to the SLE12-SP5 and SLE15-SP2 codestreams.
However, the lockless optimization has a design flaw which (under certain very specific circumstances) opens a window for a race condition that causes the "last" packet in the queue to be stuck (and not sent out to the wire) for a potentially unbounded amount of time, causing network stalls.
Additional Information
[1] kernel/git/torvalds/linux.git - Linux kernel source tree
[2] kernel/git/torvalds/linux.git - Linux kernel source tree
[3] Packet gets stuck in NOLOCK pfifo_fast qdisc
[4] https://github.com/openSUSE/kernel-source/commit/1c59b584ef0cc166f6f5c9f8ed6f47e2e811e1c0
[5] https://github.com/openSUSE/kernel-source/commit/3aa0c01fad38360cc9cd840d49bdfdc565e2e718