Many people, myself included, assume that Linux is mature enough by now that it should not harbor any surprising bugs. I recently came across one that was not fixed until 2017. Here is the upstream commit:
From 39e6c8208d7b6fb9d2047850fb3327db567b564b Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 28 Feb 2017 10:34:50 -0800
Subject: [PATCH] net: solve a NAPI race
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...
Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.
This would happen with a very low probability, but hurting RPC
workloads.
A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.
Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.
This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.
thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is
                                          available.

                                          device hard irq

                                          // does nothing because
                                          NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();
Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)
This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED
Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.
Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
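The heart of the fix lives in net/core/dev.c. Below is a simplified sketch of the post-patch logic, keeping only the race-relevant parts; the real functions also handle GRO flushing, timers and list state, and use a branch-free trick to set the MISSED bit:

/* Simplified sketch of the patched logic (not the verbatim upstream code). */

bool napi_schedule_prep(struct napi_struct *n)
{
        unsigned long val, new;

        do {
                val = READ_ONCE(n->state);
                if (unlikely(val & NAPIF_STATE_DISABLE))
                        return false;
                new = val | NAPIF_STATE_SCHED;

                /* If SCHED was already owned by someone else, record
                 * that we raced by also setting MISSED. */
                if (val & NAPIF_STATE_SCHED)
                        new |= NAPIF_STATE_MISSED;
        } while (cmpxchg(&n->state, val, new) != val);

        /* Only the caller that actually grabbed SCHED may schedule. */
        return !(val & NAPIF_STATE_SCHED);
}

bool napi_complete_done(struct napi_struct *n, int work_done)
{
        unsigned long val, new;

        for (;;) {
                val = READ_ONCE(n->state);
                new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);

                /* If someone set MISSED while we were polling, keep SCHED
                 * so that napi->poll() runs again. */
                if (val & NAPIF_STATE_MISSED)
                        new |= NAPIF_STATE_SCHED;

                if (cmpxchg(&n->state, val, new) == val)
                        break;
        }

        if (unlikely(val & NAPIF_STATE_MISSED)) {
                __napi_schedule(n);
                return false;    /* caller should not re-arm the IRQ yet */
        }
        return true;
}

Because both bits are updated in a single cmpxchg() transaction, a hard irq that loses the race on NAPI_STATE_SCHED is no longer silently dropped: it leaves NAPI_STATE_MISSED behind, and napi_complete_done() turns that into a reschedule instead of letting the already-signaled packet sit in the ring until the next interrupt.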
The relevant mlx5_core call chain looks like this:
alloc_comp_eqs
  mlx5_create_map_eq
    request_irq(mlx5_eq_int)

mlx5_eq_int                           # hard irq handler
  mlx5_eq_cq_completion
    cq->comp -> mlx5e_completion_event
      napi_schedule
        napi_schedule_prep
          set NAPIF_STATE_SCHED
          if NAPIF_STATE_SCHED is already set
            set NAPIF_STATE_MISSED
        __napi_schedule
          ____napi_schedule
            __raise_softirq_irqoff(NET_RX_SOFTIRQ)

mlx5e_napi_poll                       # softirq handler
  mlx5e_poll_tx_cq
  mlx5e_poll_rx_cq
  napi_complete_done
    clear NAPIF_STATE_SCHED
    if NAPIF_STATE_MISSED is set
      __napi_schedule
  mlx5e_cq_arm
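For context, a driver's poll callback typically follows the shape below. This is a hypothetical sketch: the example_* names are invented, but the structure mirrors the mlx5e_napi_poll flow above.

/* Hypothetical NAPI poll handler illustrating where the race lived:
 * napi_complete_done() clears NAPI_STATE_SCHED, and only afterwards
 * is the completion queue / IRQ re-armed. */
static int example_napi_poll(struct napi_struct *napi, int budget)
{
        struct example_channel *c = container_of(napi, struct example_channel,
                                                 napi);
        int work_done;

        example_poll_tx_cq(c);                      /* hypothetical TX completions */
        work_done = example_poll_rx_cq(c, budget);  /* hypothetical RX processing  */

        if (work_done == budget)
                return budget;          /* more work pending, stay scheduled */

        /* With the fix, this returns false and reschedules the napi if
         * NAPIF_STATE_MISSED was set by a hard irq in the meantime. */
        if (napi_complete_done(napi, work_done))
                example_arm_cq(c);      /* hypothetical IRQ/CQ re-arm */

        return work_done;
}

With the MISSED bit, a hard irq that arrives while the poll loop still owns NAPI_STATE_SCHED is no longer lost: napi_complete_done() sees MISSED, keeps the napi scheduled, and the driver skips re-arming until the extra pass has run.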