NIC Packet Reception Flow Analysis (Part 1)

JuoJuo6

Last modified: 2022-02-24 09:47:00

Tags: networking, embedded, kernel, linux

First published: 2022-02-24 00:02:40

Article link: https://blog.csdn.net/qq_34539312/article/details/123101938

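My work focuses mainly on the kernel's networking subsystem. I am new to this module, so I wanted to walk through how a NIC driver receives packets. What follows is my own understanding; if anything is wrong, corrections are welcome so that we can all learn together.

I will keep posting about the kernel networking subsystem and aim for one update per week.
Writing original content takes effort; please credit the source when reposting.

OK, on to the main topic.

NIC packet reception currently falls into two categories: interrupt-driven and polling.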

Interrupt-driven reception

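A NIC is a device for sending and receiving data: it converts our data into signals and transmits them over the medium into the network. When data arrives from the network, how do we know? The simplest approach is for the NIC to raise an interrupt to tell the CPU that a packet has arrived. The CPU then enters the NIC's interrupt service routine, reads the data, builds an skb, and hands it to the networking subsystem for further processing.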

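Interrupt-driven reception responds promptly: each packet is handled as soon as it arrives. Under a heavy packet load, however, it has a fatal flaw: the CPU is interrupted so often that other tasks cannot run, which is unacceptable for the kernel. The kernel therefore introduced a different receive mechanism for high network load, NAPI (New API).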

NAPI reception

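With NAPI, the driver provides a poll function. When many packets are arriving, the NIC's receive interrupt is disabled and the driver's poll function runs in softirq context to receive packets. This keeps the CPU from constantly entering and leaving interrupt handlers, and because the softirq runs with interrupts enabled and under a packet budget, the rest of the system stays responsive.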

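NAPI does place a requirement on how the NIC receives packets, though: only a NIC with a ring buffer (that is, one that supports DMA) can truly use NAPI. A NIC without DMA still has to receive its packets in the interrupt handler. The figure below shows how a DMA-capable NIC receives packets:
[Figure: NIC ring buffer]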

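The NIC driver allocates a ring buffer for receiving packets. The ring holds descriptors (note: descriptors, not the packet data itself). The driver allocates skbs ahead of time, stores the address of each skb's data area in a descriptor, and marks that descriptor as ready. When data arrives, the NIC finds a ready descriptor, writes the data into the skb's data area via DMA, and marks the descriptor as used. The driver then reads the completed descriptors, passes the skbs up the stack, and maintains the ring's state by refilling the descriptors with fresh buffers.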
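
The descriptor hand-off described above can be sketched in user-space C. This is only a minimal model under stated assumptions, not how any real driver lays out its ring: `rx_desc`, `rx_ring`, `ring_init`, `nic_receive` and `drv_poll` are hypothetical names, `memcpy` stands in for the DMA write, and a plain `ready` flag stands in for the hardware ownership bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4   /* hypothetical ring length */
#define BUF_SIZE  64  /* hypothetical per-buffer size */

/* One RX descriptor: it holds the buffer address and status, not the data. */
struct rx_desc {
	uint8_t *buf;   /* where the "NIC" may write a packet */
	size_t   len;   /* filled in by the "NIC" after the write */
	int      ready; /* 1: owned by the NIC, 0: owned by the driver */
};

struct rx_ring {
	struct rx_desc desc[RING_SIZE];
	uint8_t bufs[RING_SIZE][BUF_SIZE];
	unsigned int nic_next; /* next descriptor the NIC will fill */
	unsigned int drv_next; /* next descriptor the driver will reap */
};

/* Driver side: attach a buffer to every descriptor and hand it to the NIC. */
static void ring_init(struct rx_ring *r)
{
	memset(r, 0, sizeof(*r));
	for (unsigned int i = 0; i < RING_SIZE; i++) {
		r->desc[i].buf = r->bufs[i];
		r->desc[i].ready = 1;
	}
}

/* NIC side: "DMA" a packet into the next ready descriptor.
 * Returns 0 on success, -1 if no descriptor is ready (packet dropped). */
static int nic_receive(struct rx_ring *r, const void *pkt, size_t len)
{
	struct rx_desc *d = &r->desc[r->nic_next];

	if (!d->ready || len == 0 || len > BUF_SIZE)
		return -1;
	memcpy(d->buf, pkt, len);
	d->len = len;
	d->ready = 0; /* mark as used: ownership returns to the driver */
	r->nic_next = (r->nic_next + 1) % RING_SIZE;
	return 0;
}

/* Driver side: reap up to budget completed descriptors and re-arm them. */
static int drv_poll(struct rx_ring *r, int budget)
{
	int work = 0;

	while (work < budget) {
		struct rx_desc *d = &r->desc[r->drv_next];

		if (d->ready || d->len == 0)
			break; /* nothing completed at this position */
		/* a real driver would pass the skb up the stack here */
		d->len = 0;
		d->ready = 1; /* give the descriptor back to the NIC */
		r->drv_next = (r->drv_next + 1) % RING_SIZE;
		work++;
	}
	return work;
}
```

Once all `RING_SIZE` descriptors are used and the driver has not reaped any, `nic_receive` fails. That is the model's version of a ring overrun.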

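With the ring buffer explained, NAPI is easier to understand. Because the NIC supports DMA, it only needs to interrupt the CPU once when data starts arriving. From then on the DMA engine moves the data into memory, and the CPU only has to drain the unread entries from the ring buffer in softirq context by calling the driver's poll function. That is the idea behind NAPI.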
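
To make the "one interrupt, then poll" idea concrete, here is a small user-space sketch. It is a simplified model rather than kernel code: `fake_nic`, `packet_arrives`, `fake_irq_handler` and `fake_poll` are hypothetical names, a counter stands in for the ring buffer, and the `scheduled` flag plays the role of the NAPI_STATE_SCHED bit.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_nic {
	bool irq_enabled; /* the device may raise RX interrupts */
	bool scheduled;   /* stands in for NAPI_STATE_SCHED */
	int  pending;     /* packets waiting in the ring */
	int  irq_count;   /* how many interrupts the CPU took */
};

/* Interrupt handler: mask the device interrupt and schedule polling. */
static void fake_irq_handler(struct fake_nic *nic)
{
	nic->irq_count++;
	if (!nic->scheduled) {
		nic->scheduled = true;
		nic->irq_enabled = false;
	}
}

/* A packet lands in the ring; an interrupt fires only if it is unmasked. */
static void packet_arrives(struct fake_nic *nic)
{
	nic->pending++;
	if (nic->irq_enabled)
		fake_irq_handler(nic);
}

/* Poll function: handle at most budget packets. If the ring was drained
 * before the budget ran out, stop polling and unmask the interrupt. */
static int fake_poll(struct fake_nic *nic, int budget)
{
	int work = 0;

	while (work < budget && nic->pending > 0) {
		nic->pending--;
		work++;
	}
	if (work < budget) {
		nic->scheduled = false;
		nic->irq_enabled = true;
	}
	return work;
}
```

In this model a burst of 100 packets costs a single interrupt. Two `fake_poll` calls with a budget of 64 drain the burst, and only then is the interrupt unmasked again.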

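What about NICs that do not support DMA? To apply the NAPI model uniformly, the kernel has the interrupt handler of such a NIC do nothing more than append the packet to the input_pkt_queue list, and the kernel supplies its own poll function, process_backlog, to process the packets.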

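Data structure analysis

As mentioned above, the kernel makes NICs without DMA compatible with NAPI so that everything uses one model. Next, let's go through the receive path in the code from the bottom up.

First is softnet_data, the per-CPU entry point for packet reception:

A few of its important members are annotated in the comments: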

struct softnet_data {
	struct list_head	poll_list;   // poll list: each driver's napi_struct (and with it the driver's poll method) is linked here
	struct sk_buff_head	process_queue; 

	/* stats */
	unsigned int		processed;
	unsigned int		time_squeeze;
	unsigned int		received_rps;
#ifdef CONFIG_RPS
	struct softnet_data	*rps_ipi_list;
#endif
#ifdef CONFIG_NET_FLOW_LIMIT
	struct sd_flow_limit __rcu *flow_limit;
#endif
	struct Qdisc		*output_queue;
	struct Qdisc		**output_queue_tailp;
	struct sk_buff		*completion_queue;

#ifdef CONFIG_RPS
	/* input_queue_head should be written by cpu owning this struct,
	 * and only read by other cpus. Worth using a cache line.
	 */
	unsigned int		input_queue_head ____cacheline_aligned_in_smp;

	/* Elements below can be accessed between CPUs for RPS/RFS */
	struct call_single_data	csd ____cacheline_aligned_in_smp;
	struct softnet_data	*rps_ipi_next;
	unsigned int		cpu;
	unsigned int		input_queue_tail;
#endif
	unsigned int		dropped;
	struct sk_buff_head	input_pkt_queue;  // input queue: packets received by NICs without DMA are queued here
	struct napi_struct	backlog;          // napi_struct the kernel builds for NICs without DMA, to unify NAPI reception

};

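Next is the napi_struct structure:

Its most important member is the poll callback that the driver registers. Because the kernel now uses NAPI throughout, each NIC needs its own napi_struct instance (with multiple receive queues, a NIC may create one NAPI instance per queue, often one per CPU).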

struct napi_struct {
	/* The poll_list must only be managed by the entity which
	 * changes the state of the NAPI_STATE_SCHED bit.  This means
	 * whoever atomically sets that bit can add this napi_struct
	 * to the per-CPU poll_list, and whoever clears that bit
	 * can remove from the list right before clearing the bit.
	 */
	struct list_head	poll_list;  // linked under the poll_list head in softnet_data

	unsigned long		state;
	int			weight;   // weight: upper bound on packets handled per poll call
	unsigned int		gro_count;
	int			(*poll)(struct napi_struct *, int);  // poll callback registered by the driver
#ifdef CONFIG_NETPOLL
	spinlock_t		poll_lock;
	int			poll_owner;
#endif
	struct net_device	*dev;
	struct sk_buff		*gro_list;
	struct sk_buff		*skb;
	struct hrtimer		timer;
	struct list_head	dev_list;
	struct hlist_node	napi_hash_node;
	unsigned int		napi_id;
};

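Receive flow analysis and comparison

With the two key data structures covered, let's take the e100 NIC (which supports DMA) as an example and look at the function calls on its receive path.

First, e100_probe sets up the NAPI structure: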

netif_napi_add(netdev, &nic->napi, e100_poll, E100_NAPI_WEIGHT);
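
The registration above passes a pointer to a napi_struct embedded in the driver's private structure, and the poll callback later receives only that pointer. The user-space sketch below shows the general pattern of embedding the structure and recovering the private data with container_of. All names here (`fake_napi`, `fake_nic_priv`, `fake_napi_add`, `fake_nic_poll`) are hypothetical, and the macro is a simplified stand-in for the kernel's container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down model of napi_struct: just the weight and the poll callback. */
struct fake_napi {
	int weight;
	int (*poll)(struct fake_napi *napi, int budget);
};

/* Driver-private state with the NAPI context embedded in it. */
struct fake_nic_priv {
	int rx_pending;        /* packets waiting to be received */
	struct fake_napi napi; /* registered with the "core" */
};

/* Model of registration: record the callback and the weight. */
static void fake_napi_add(struct fake_napi *napi,
			  int (*poll)(struct fake_napi *, int), int weight)
{
	napi->poll = poll;
	napi->weight = weight;
}

/* Poll callback: recover the private struct from the embedded member,
 * then receive at most budget packets. */
static int fake_nic_poll(struct fake_napi *napi, int budget)
{
	struct fake_nic_priv *nic =
		container_of(napi, struct fake_nic_priv, napi);
	int work = 0;

	while (work < budget && nic->rx_pending > 0) {
		nic->rx_pending--;
		work++;
	}
	return work;
}
```

The caller only ever holds `&nic.napi`, yet `nic.napi.poll(&nic.napi, nic.napi.weight)` can still reach `rx_pending`, because the callback computes the address of the enclosing structure.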

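In the NIC's interrupt handler, the driver disables the NIC's own interrupts (not the CPU's local interrupts) and starts NAPI reception: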

	if (likely(napi_schedule_prep(&nic->napi))) {
		e100_disable_irq(nic);
		__napi_schedule(&nic->napi); // link the driver's napi_struct into the poll_list of this CPU's softnet_data and raise NET_RX_SOFTIRQ to start receiving
	}

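In the net_rx_action softirq handler, the kernel walks the CPU's poll_list, takes each napi_struct off it, and runs the poll function registered by the driver to receive packets: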

struct softnet_data *sd = this_cpu_ptr(&softnet_data);
unsigned long time_limit = jiffies + 2;
int budget = netdev_budget;
LIST_HEAD(list);
LIST_HEAD(repoll);

local_irq_disable();
list_splice_init(&sd->poll_list, &list); // move every entry on poll_list to the local list and reinitialize poll_list
local_irq_enable();
for (;;) {
	struct napi_struct *n;

	if (list_empty(&list)) {
		if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
			return;
		break;
	}

	n = list_first_entry(&list, struct napi_struct, poll_list); // take the first napi_struct
	budget -= napi_poll(n, &repoll); // run the poll function registered by the driver

	/* If softirq window is exhausted then punt.
	 * Allow this to run for 2 jiffies since which will allow
	 * an average latency of 1.5/HZ.
	 */
	if (unlikely(budget <= 0 ||
		     time_after_eq(jiffies, time_limit))) {
		sd->time_squeeze++;
		break;
	}
}
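
The interplay between the global budget and each instance's weight in the loop above can be modelled in a few lines of user-space C. This is a rough sketch under stated assumptions: `sim_napi`, `sim_result`, `sim_poll` and `sim_net_rx_action` are hypothetical names, the time limit and RPS handling are left out, and an instance that uses its full weight is simply counted as still scheduled instead of being moved to a repoll list.

```c
#include <assert.h>
#include <stddef.h>

struct sim_napi {
	int weight;  /* per-poll limit; assumed to be greater than zero */
	int pending; /* packets waiting on this instance */
};

struct sim_result {
	int processed;       /* packets handled in this softirq run */
	int still_scheduled; /* instances left on the poll list afterwards */
	int time_squeeze;    /* 1 if the run stopped because budget ran out */
};

/* One poll call: handle at most weight packets. */
static int sim_poll(struct sim_napi *n)
{
	int work = n->pending < n->weight ? n->pending : n->weight;

	n->pending -= work;
	return work;
}

/* One softirq run over the scheduled instances, in list order. */
static struct sim_result sim_net_rx_action(struct sim_napi **list,
					   size_t count, int budget)
{
	struct sim_result res = { 0, 0, 0 };
	size_t polled = 0;

	while (polled < count) {
		struct sim_napi *n = list[polled++];
		int work = sim_poll(n);

		res.processed += work;
		budget -= work;
		if (work == n->weight)
			res.still_scheduled++; /* used its whole weight */
		if (budget <= 0) {
			res.time_squeeze = 1;
			break;
		}
	}
	/* instances that were never reached also stay scheduled */
	res.still_scheduled += (int)(count - polled);
	return res;
}
```

With a large budget every instance gets one poll call per run. With a small budget the run stops early, `time_squeeze` is recorded, and the remaining instances wait for the next softirq run.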

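Next, let's see how a NIC without DMA receives packets, taking the DM9000 as an example:

The call chain below shows what the DM9000 interrupt handler does on receive; the kernel ends up in enqueue_to_backlog. That function uses two members that were annotated earlier in softnet_data: input_pkt_queue and backlog. input_pkt_queue holds the skbs built in the interrupt handler, and backlog is the napi_struct instance the kernel builds for NICs without DMA to unify NAPI reception.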

dm9000_interrupt
	dm9000_rx
		netif_rx
			netif_rx_internal
				enqueue_to_backlog

static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
			      unsigned int *qtail)
{
	struct softnet_data *sd;
	unsigned long flags;
	unsigned int qlen;

	sd = &per_cpu(softnet_data, cpu);

	local_irq_save(flags);

	rps_lock(sd);
	if (!netif_running(skb->dev))
		goto drop;
	qlen = skb_queue_len(&sd->input_pkt_queue);
	if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
		if (qlen) {
enqueue:
			__skb_queue_tail(&sd->input_pkt_queue, skb); // append the skb to input_pkt_queue
			input_queue_tail_incr_save(sd, qtail);
			rps_unlock(sd);
			local_irq_restore(flags);
			return NET_RX_SUCCESS;
		}

		/* Schedule NAPI for backlog device
		 * We can use non atomic operation since we own the queue lock
		 */
		if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
			if (!rps_ipi_queued(sd))
				____napi_schedule(sd, &sd->backlog);  // link the backlog napi_struct into this CPU's poll_list
		}
		goto enqueue;
	}

drop:
	sd->dropped++;
	rps_unlock(sd);

	local_irq_restore(flags);

	atomic_long_inc(&skb->dev->rx_dropped);
	kfree_skb(skb);
	return NET_RX_DROP;
}
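
The decision logic in enqueue_to_backlog (queue, schedule the backlog NAPI only when the queue was empty, or drop) can be isolated in a small user-space model. This is a simplified sketch: `sim_backlog` and `sim_enqueue` are hypothetical names, a counter replaces the skb list, and the netif_running check, flow limiting, locking and RPS handling are all left out.

```c
#include <assert.h>
#include <stdbool.h>

enum { SIM_RX_SUCCESS = 0, SIM_RX_DROP = 1 }; /* hypothetical return codes */

struct sim_backlog {
	unsigned int qlen;           /* length of the modelled input_pkt_queue */
	unsigned int max_backlog;    /* stands in for netdev_max_backlog */
	bool         napi_scheduled; /* stands in for NAPI_STATE_SCHED on backlog */
	unsigned int schedule_calls; /* how often the backlog NAPI was scheduled */
	unsigned int dropped;        /* packets dropped because the queue was full */
};

/* Model of the enqueue path for one packet. */
static int sim_enqueue(struct sim_backlog *sd)
{
	if (sd->qlen <= sd->max_backlog) {
		/* An empty queue means the poll side may be idle, so schedule
		 * the backlog NAPI unless it is already scheduled. */
		if (sd->qlen == 0 && !sd->napi_scheduled) {
			sd->napi_scheduled = true;
			sd->schedule_calls++;
		}
		sd->qlen++;
		return SIM_RX_SUCCESS;
	}
	sd->dropped++;
	return SIM_RX_DROP;
}
```

Because the comparison is `qlen <= max_backlog`, the model accepts one packet more than `max_backlog` before it starts dropping. However many packets are queued in a burst, the backlog NAPI is scheduled only once.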

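The interrupt handler has received the packet and queued the skb, so what happens next? From the earlier description of NAPI, the softirq naturally calls the poll callback registered in the napi_struct instance. Here that is the kernel's own poll function for backlog, which is set up in net_dev_init: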

static int __init net_dev_init(void)
{
	int i, rc = -ENOMEM;

	BUG_ON(!dev_boot_phase);

	if (dev_proc_init())
		goto out;

	if (netdev_kobject_init())
		goto out;

	INIT_LIST_HEAD(&ptype_all);
	for (i = 0; i < PTYPE_HASH_SIZE; i++)
		INIT_LIST_HEAD(&ptype_base[i]);

	INIT_LIST_HEAD(&offload_base);

	if (register_pernet_subsys(&netdev_net_ops))
		goto out;

	/*
	 *	Initialise the packet receive queues.
	 */

	for_each_possible_cpu(i) {  // initialize each CPU's softnet_data and its backlog instance here
		struct work_struct *flush = per_cpu_ptr(&flush_works, i);
		struct softnet_data *sd = &per_cpu(softnet_data, i);

		INIT_WORK(flush, flush_backlog);

		skb_queue_head_init(&sd->input_pkt_queue);
		skb_queue_head_init(&sd->process_queue);
		INIT_LIST_HEAD(&sd->poll_list);
		sd->output_queue_tailp = &sd->output_queue;
#ifdef CONFIG_RPS
		sd->csd.func = rps_trigger_softirq;
		sd->csd.info = sd;
		sd->cpu = i;
#endif

		sd->backlog.poll = process_backlog; // the kernel's own poll function is process_backlog
		sd->backlog.weight = weight_p;
	}

	dev_boot_phase = 0;

	/* The loopback device is special if any other network devices
	 * is present in a network namespace the loopback device must
	 * be present. Since we now dynamically allocate and free the
	 * loopback device ensure this invariant is maintained by
	 * keeping the loopback device as the first device on the
	 * list of network devices.  Ensuring the loopback devices
	 * is the first device that appears and the last network device
	 * that disappears.
	 */
	if (register_pernet_device(&loopback_net_ops))
		goto out;

	if (register_pernet_device(&default_device_ops))
		goto out;

	open_softirq(NET_TX_SOFTIRQ, net_tx_action);
	open_softirq(NET_RX_SOFTIRQ, net_rx_action);

	hotcpu_notifier(dev_cpu_callback, 0);
	dst_subsys_init();
	rc = 0;
out:
	return rc;
}

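From the analysis above, process_backlog runs in softirq context. It drains process_queue in a while loop, handing each skb to the networking subsystem through __netif_receive_skb, and whenever process_queue runs empty it moves everything on input_pkt_queue over to process_queue and continues.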

static int process_backlog(struct napi_struct *napi, int quota)
{
	struct softnet_data *sd = container_of(napi, struct softnet_data, backlog);
	bool again = true;
	int work = 0;

	/* Check if we have pending ipi, its better to send them now,
	 * not waiting net_rx_action() end.
	 */
	if (sd_has_rps_ipi_waiting(sd)) {
		local_irq_disable();
		net_rps_action_and_irq_enable(sd);
	}

	napi->weight = weight_p;
	while (again) {
		struct sk_buff *skb;

		while ((skb = __skb_dequeue(&sd->process_queue))) {
			rcu_read_lock();
			__netif_receive_skb(skb);
			rcu_read_unlock();
			input_queue_head_incr(sd);
			if (++work >= quota)
				return work;

		}

		local_irq_disable();
		rps_lock(sd);
		if (skb_queue_empty(&sd->input_pkt_queue)) {
			/*
			 * Inline a custom version of __napi_complete().
			 * only current cpu owns and manipulates this napi,
			 * and NAPI_STATE_SCHED is the only possible flag set
			 * on backlog.
			 * We can use a plain write instead of clear_bit(),
			 * and we dont need an smp_mb() memory barrier.
			 */
			napi->state = 0;
			again = false;
		} else {
			skb_queue_splice_tail_init(&sd->input_pkt_queue,
						   &sd->process_queue);
		}
		rps_unlock(sd);
		local_irq_enable();
	}

	return work;
}
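
The two-queue pattern in process_backlog can be modelled without any kernel types. The sketch below is a simplification under stated assumptions: `sim_sd` and `sim_process_backlog` are hypothetical names, plain counters replace the skb lists, and the interrupt disabling and rps_lock that protect the splice in the kernel are only mentioned in a comment.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_sd {
	int  input_pkt_queue; /* packets queued by the interrupt side */
	int  process_queue;   /* packets private to the poll function */
	bool scheduled;       /* stands in for NAPI_STATE_SCHED on backlog */
};

/* Model of the poll function: returns the number of packets handled. */
static int sim_process_backlog(struct sim_sd *sd, int quota)
{
	int work = 0;

	for (;;) {
		/* Drain the private queue; no lock is needed for this part. */
		while (sd->process_queue > 0) {
			sd->process_queue--; /* "deliver" one packet */
			if (++work >= quota)
				return work; /* quota used up, stay scheduled */
		}

		/* In the kernel this part runs with IRQs off and rps_lock held. */
		if (sd->input_pkt_queue == 0) {
			sd->scheduled = false; /* nothing left: complete */
			return work;
		}
		/* Move the whole input queue over in one step. */
		sd->process_queue += sd->input_pkt_queue;
		sd->input_pkt_queue = 0;
	}
}
```

The shared queue is touched only briefly to move its whole contents, and the per-packet work then runs on the private queue. When the quota is reached the instance stays scheduled and the remaining packets are handled on the next call.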

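Summary

That completes the walk-through of how the kernel receives packets from a NIC, with a function-call analysis for both kinds of NIC. For the sake of system performance, the kernel receives packets by polling (NAPI) when the network load is high. Each NIC has its own napi_struct instance: for a NIC with DMA the driver provides it, and for a NIC without DMA the kernel, to unify NAPI reception, provides its own backlog instance and its own poll function, process_backlog.

Finally, the figure below compares the two receive paths:
[Figure: comparison of NAPI reception for the two kinds of NIC]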