Subject: [PATCH] net: solve a NAPI race

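Many people, myself included, assume that Linux is mature by now and should not have any baffling bugs left. I recently came across one that was not fixed until 2017.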

From 39e6c8208d7b6fb9d2047850fb3327db567b564b Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 28 Feb 2017 10:34:50 -0800
Subject: [PATCH] net: solve a NAPI race

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is available.
                                          device hard irq

                                          // does nothing because
                                          // NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
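
The two state transitions described in the commit message can be sketched in userspace with C11 atomics. This is only a model under stated assumptions, not the kernel code: the `model_` names, the bit values and the `reschedules` counter are hypothetical, `atomic_compare_exchange_weak()` stands in for the kernel's `cmpxchg()`, and the real functions also handle details (such as a disabled NAPI instance) that are left out here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model only: names and bit values are illustrative. */
#define MODEL_STATE_SCHED  (1UL << 0)  /* poll is scheduled or running */
#define MODEL_STATE_MISSED (1UL << 1)  /* a request arrived while SCHED was set */

struct model_napi {
    atomic_ulong state;
    int reschedules;  /* how often completion had to schedule another poll */
};

static void model_napi_init(struct model_napi *n)
{
    atomic_init(&n->state, 0UL);
    n->reschedules = 0;
}

/* Try to take SCHED; if someone already owns it, record MISSED instead.
 * Returns true only when the caller became the owner of SCHED. */
static bool model_schedule_prep(struct model_napi *n)
{
    unsigned long val = atomic_load(&n->state);
    unsigned long next;

    do {
        next = val | MODEL_STATE_SCHED;
        if (val & MODEL_STATE_SCHED)
            next |= MODEL_STATE_MISSED;
        /* On failure val is refreshed with the current state and we retry,
         * so both bits are always updated as one transaction. */
    } while (!atomic_compare_exchange_weak(&n->state, &val, next));

    return !(val & MODEL_STATE_SCHED);
}

/* Returns true when polling is really finished (the IRQ may be re-armed),
 * false when a missed request forced another poll round. */
static bool model_complete_done(struct model_napi *n)
{
    unsigned long val = atomic_load(&n->state);
    unsigned long next;

    do {
        assert(val & MODEL_STATE_SCHED);  /* caller must own SCHED */
        next = val & ~(MODEL_STATE_SCHED | MODEL_STATE_MISSED);
        if (val & MODEL_STATE_MISSED)
            next |= MODEL_STATE_SCHED;    /* keep ownership for the extra poll */
    } while (!atomic_compare_exchange_weak(&n->state, &val, next));

    if (val & MODEL_STATE_MISSED) {
        n->reschedules++;                 /* stands in for __napi_schedule() */
        return false;
    }
    return true;
}
```

In this model, a second `model_schedule_prep()` call made while the first owner is still polling returns false but leaves MISSED behind, so the following `model_complete_done()` reports that one more poll round is needed instead of silently dropping the request.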

 

The relevant call chains in mlx5_core are as follows:


alloc_comp_eqs
        mlx5_create_map_eq
                request_irq(mlx5_eq_int)

mlx5_eq_int   # hard irq handler
        mlx5_eq_cq_completion
                cq->comp        ->      mlx5e_completion_event
                        napi_schedule
                                napi_schedule_prep
                                        set NAPIF_STATE_SCHED
                                        if NAPIF_STATE_SCHED is already set
                                                set NAPIF_STATE_MISSED
                                __napi_schedule
                                        ____napi_schedule
                                                __raise_softirq_irqoff(NET_RX_SOFTIRQ)

mlx5e_napi_poll   # softirq handler
        mlx5e_poll_tx_cq
        mlx5e_poll_rx_cq
        napi_complete_done
                clear NAPIF_STATE_SCHED and NAPIF_STATE_MISSED
                if NAPIF_STATE_MISSED was set
                        keep NAPIF_STATE_SCHED set
                        __napi_schedule
        mlx5e_cq_arm
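
To connect this call path back to the timeline in the commit message, here is a small single-threaded replay of that sequence, with and without the MISSED bit. It is a hypothetical sketch, not driver code: the `replay_` names, the plain `bool` flags (instead of atomic bits) and the packet counters are all assumptions made for illustration, and the interleaving of the two threads is fixed by hand.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded replay of the timeline; not kernel code. */
struct replay {
    bool sched;     /* stands in for NAPI_STATE_SCHED */
    bool missed;    /* stands in for NAPI_STATE_MISSED */
    int ring;       /* packets waiting in the RX ring */
    int delivered;  /* packets handed to the stack */
};

/* Busy-poll or hard-IRQ entry: returns true if the caller now owns SCHED. */
static bool replay_schedule(struct replay *r, bool use_missed)
{
    if (r->sched) {
        if (use_missed)
            r->missed = true;  /* remember that a request was refused */
        return false;
    }
    r->sched = true;
    return true;
}

/* Completion: returns true if another poll round is needed. */
static bool replay_complete(struct replay *r)
{
    bool again = r->missed;

    assert(r->sched);          /* only the owner of SCHED may complete */
    r->missed = false;
    r->sched = again;          /* stay scheduled when a request was missed */
    return again;
}

/* Returns how many packets reach the stack when the IRQ for the 3rd
 * packet fires while thread 1 is still polling. */
static int replay_timeline(bool use_missed)
{
    struct replay r = { .ring = 2 };

    replay_schedule(&r, use_missed);   /* thread 1: busy poll grabs SCHED */
    r.delivered += r.ring;             /* thread 1: reads 2 packets */
    r.ring = 0;

    r.ring++;                          /* 3rd packet arrives */
    replay_schedule(&r, use_missed);   /* thread 2: hard irq, SCHED is taken */

    while (replay_complete(&r)) {      /* thread 1: napi_complete_done() */
        r.delivered += r.ring;         /* rescheduled poll drains the ring */
        r.ring = 0;
    }
    /* rearm_irq() would run here; it raises no new IRQ for packet 3. */
    return r.delivered;
}
```

Calling `replay_timeline(false)` models the behavior before the patch, where the third packet stays in the ring until some later event triggers a poll; `replay_timeline(true)` models the patched behavior, where the missed request causes one more poll round before the IRQ is re-armed.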
 
