Linux Startup

当当响

Article link: https://blog.csdn.net/zhpCSDN921011/article/details/125309185

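Before the Linux driver, the kernel protocol stack, and related modules can receive packets from the NIC, a good deal of preparation has to happen. The ksoftirqd kernel threads must be created, the handler for each protocol must be registered, the network device subsystem must be initialized, and the NIC must be brought up. Only when all of this is ready can packet reception actually begin. The sections below walk through how each of these preparation steps is done.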

Creating the ksoftirqd kernel threads

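Linux handles softirqs in dedicated kernel threads (ksoftirqd), so it is well worth seeing how these threads are initialized; that makes the receive path easier to follow later. There is not one such thread but N of them, where N is the number of CPU cores on the machine. During system initialization, spawn_ksoftirqd (in kernel/softirq.c) runs and calls smpboot_register_percpu_thread (in kernel/smpboot.c), which creates the ksoftirqd threads.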
The relevant code is as follows:

//file: kernel/softirq.c
static struct smp_hotplug_thread softirq_threads = {
    .store = &ksoftirqd,
    .thread_should_run = ksoftirqd_should_run,
    .thread_fn = run_ksoftirqd,
    .thread_comm = "ksoftirqd/%u",
};
static __init int spawn_ksoftirqd(void) {
	register_cpu_notifier(&cpu_nfb);
	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
	return 0;
}
early_initcall(spawn_ksoftirqd);

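Once a ksoftirqd thread has been created, it enters its thread loop, which is built from ksoftirqd_should_run and run_ksoftirqd: it keeps checking whether any softirq is pending and, if so, handles it. Note that networking is not the only source of softirqs; there are other types as well.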

//file: include/linux/interrupt.h
enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	BLOCK_IOPOLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */
	NR_SOFTIRQS
};
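
To make the should_run / thread_fn loop described above more concrete, here is a minimal user-space sketch in C. It is not kernel code: struct percpu_thread, pending, handled, percpu_thread_iterate, and percpu_thread_name are hypothetical names invented for illustration, and NR_CPUS is an assumed value of 4. It only mimics the shape of the softirq_threads descriptor shown earlier.

```c
#include <assert.h>
#include <stdio.h>

#define NR_CPUS 4  /* assumed CPU count for this sketch */

/* Hypothetical user-space stand-in for struct smp_hotplug_thread. */
struct percpu_thread {
    int (*thread_should_run)(unsigned int cpu);
    void (*thread_fn)(unsigned int cpu);
    const char *thread_comm;           /* printf-style name template */
};

static unsigned int pending[NR_CPUS];  /* per-CPU pending softirq bits */
static unsigned int handled[NR_CPUS];  /* passes that found work, per CPU */

static int ksoftirqd_should_run(unsigned int cpu)
{
    return pending[cpu] != 0;
}

static void run_ksoftirqd(unsigned int cpu)
{
    pending[cpu] = 0;   /* pretend every pending softirq was serviced */
    handled[cpu]++;
}

static const struct percpu_thread softirq_threads = {
    .thread_should_run = ksoftirqd_should_run,
    .thread_fn = run_ksoftirqd,
    .thread_comm = "ksoftirqd/%u",
};

/* One iteration of the per-CPU loop; returns 1 if work was done. */
static int percpu_thread_iterate(const struct percpu_thread *t, unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    if (!t->thread_should_run(cpu))
        return 0;       /* a real thread would go to sleep here */
    t->thread_fn(cpu);
    return 1;
}

/* Format the thread name for one CPU from the name template. */
static int percpu_thread_name(const struct percpu_thread *t, unsigned int cpu,
                              char *buf, size_t len)
{
    return snprintf(buf, len, t->thread_comm, cpu);
}
```

Calling percpu_thread_iterate for a CPU whose pending word is zero does nothing, which corresponds to the thread sleeping; setting any bit in pending for that CPU makes the next iteration call run_ksoftirqd.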

Network subsystem initialization

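The Linux kernel initializes its subsystems through subsys_initcall, and grepping the source tree turns up many uses of it. The one that matters here is the network subsystem, whose initialization runs the net_dev_init function.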

//file: net/core/dev.c
static int __init net_dev_init(void) {
 ......
	for_each_possible_cpu(i) {
		struct softnet_data *sd = &per_cpu(softnet_data, i);
		memset(sd, 0, sizeof(*sd));
		skb_queue_head_init(&sd->input_pkt_queue);
		skb_queue_head_init(&sd->process_queue);
		sd->completion_queue = NULL;
		INIT_LIST_HEAD(&sd->poll_list);
		......
	}
	......
	open_softirq(NET_TX_SOFTIRQ, net_tx_action);
	open_softirq(NET_RX_SOFTIRQ, net_rx_action);
}
subsys_initcall(net_dev_init);

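This function sets up one softnet_data structure for each CPU. The poll_list in that structure waits for drivers to register their poll functions on it; the NIC driver initialization section later shows how that happens.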

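In addition, open_softirq registers a handler for each softirq type: net_tx_action for NET_TX_SOFTIRQ and net_rx_action for NET_RX_SOFTIRQ. Following open_softirq shows that the registration is simply recorded in the softirq_vec variable. Later, when a ksoftirqd thread has a softirq to process, it uses this same variable to find the handler for that softirq type.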

//file: kernel/softirq.c
void open_softirq(int nr, void (*action)(struct softirq_action*))
{
	softirq_vec[nr].action = action; 
}
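
The registration and lookup through softirq_vec can be sketched in user space as follows. This is a simplified illustration rather than the kernel implementation: the enum is shortened, softirq_pending is a single global word instead of per-CPU state, and record, run_log, run_count, and the simplified raise_softirq and do_softirq are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { HI_SOFTIRQ = 0, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, NR_SOFTIRQS };

struct softirq_action {
    void (*action)(struct softirq_action *);
};

static struct softirq_action softirq_vec[NR_SOFTIRQS];
static unsigned int softirq_pending;   /* one bit per softirq number */
static int run_log[16];                /* order in which handlers ran */
static int run_count;

/* Record which handler slot was invoked (sketch-only helper). */
static void record(struct softirq_action *h)
{
    if (run_count < 16)
        run_log[run_count++] = (int)(h - softirq_vec);
}

static void net_tx_action(struct softirq_action *h) { record(h); }
static void net_rx_action(struct softirq_action *h) { record(h); }

/* Registration: store the handler in the slot for this softirq number. */
static void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    assert(nr >= 0 && nr < NR_SOFTIRQS);
    softirq_vec[nr].action = action;
}

/* Mark a softirq as pending. */
static void raise_softirq(int nr)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS);
    softirq_pending |= 1u << nr;
}

/* Run every pending handler, lowest number first; returns handlers run. */
static int do_softirq(void)
{
    unsigned int pending = softirq_pending;
    int ran = 0;

    softirq_pending = 0;
    for (int nr = 0; pending != 0; nr++, pending >>= 1) {
        if ((pending & 1u) && softirq_vec[nr].action != NULL) {
            softirq_vec[nr].action(&softirq_vec[nr]);
            ran++;
        }
    }
    return ran;
}
```

After registering both network handlers with open_softirq, raising NET_RX_SOFTIRQ and then NET_TX_SOFTIRQ and calling do_softirq runs net_tx_action before net_rx_action, because the sketch scans the pending bits from the lowest number upward.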

Protocol stack registration

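The kernel implements the IP protocol at the network layer and the TCP and UDP protocols at the transport layer. Their receive functions are ip_rcv(), tcp_v4_rcv(), and udp_rcv() respectively. Unlike the way we usually write code, the kernel wires these up through registration. fs_initcall is similar to subsys_initcall and is likewise an entry point for initialization code. fs_initcall invokes inet_init, which starts the registration of the network protocol stack: through inet_init these functions are registered in the inet_protos and ptype_base data structures, as the figure shows.
[Figure: protocol handlers registered in inet_protos and ptype_base]
The relevant code is as follows: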

//file: net/ipv4/af_inet.c
static struct packet_type ip_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IP),
	.func = ip_rcv,
};
static const struct net_protocol udp_protocol = {
	.handler = udp_rcv,
	.err_handler = udp_err,
	.no_policy = 1,
	.netns_ok = 1,
};
static const struct net_protocol tcp_protocol = {
	.early_demux = tcp_v4_early_demux,
	.handler = tcp_v4_rcv,
	.err_handler = tcp_v4_err,
	.no_policy = 1,
	.netns_ok = 1,
};
static int __init inet_init(void) {
 ......
	if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
		pr_crit("%s: Cannot add ICMP protocol\n", __func__);
	if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
		pr_crit("%s: Cannot add UDP protocol\n", __func__);
	if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
		pr_crit("%s: Cannot add TCP protocol\n", __func__);
 	......
	dev_add_pack(&ip_packet_type);
}

In the code above, the handler in the udp_protocol structure is udp_rcv and the handler in the tcp_protocol structure is tcp_v4_rcv; both are registered through inet_add_protocol.

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol) {
	if (!prot->netns_ok) {
		pr_err("Protocol %u is not namespace aware, cannot register.\n", protocol);
		return -EINVAL;
	}
	return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],  NULL, prot) ? 0 : -1; 
}

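inet_add_protocol registers the TCP and UDP handlers in the inet_protos array. Now look at the line dev_add_pack(&ip_packet_type);. In the ip_packet_type structure, type is the protocol identifier (ETH_P_IP) and func is the ip_rcv function; dev_add_pack registers the structure in the ptype_base hash table.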

//file: net/core/dev.c
void dev_add_pack(struct packet_type *pt) {
	struct list_head *head = ptype_head(pt);
	......
}
static inline struct list_head *ptype_head(const struct packet_type *pt) {
	if (pt->type == htons(ETH_P_ALL))
		return &ptype_all;
	else
		return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

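The point to remember is that inet_protos holds the addresses of the UDP and TCP handlers, while ptype_base holds the address of ip_rcv(). Later we will see that the softirq finds ip_rcv through ptype_base and so delivers IP packets to ip_rcv() correctly. ip_rcv in turn uses inet_protos to find the TCP or UDP handler and passes the packet on to udp_rcv() or tcp_v4_rcv().

As an aside, reading the code of ip_rcv, udp_rcv, and similar functions reveals a lot of protocol processing. For example, ip_rcv handles netfilter and iptables filtering; if you have many or very complex netfilter or iptables rules, they all run in softirq context and add to network latency. Likewise, udp_rcv checks whether the socket receive queue is full; the related kernel parameters are net.core.rmem_max and net.core.rmem_default. If you are interested, the code of inet_init is well worth reading carefully.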
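
The two-level lookup described above (ptype_base selects the layer 3 handler, inet_protos selects the layer 4 handler) can be sketched in user space like this. It is a deliberately simplified illustration: enum verdict, deliver, inet_init_sketch, and the handler signatures are assumptions for the example, byte-order conversion is omitted, and a plain NULL check stands in for the cmpxchg used by the real inet_add_protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP        0x0800
#define IPPROTO_TCP     6
#define IPPROTO_UDP     17
#define PTYPE_HASH_SIZE 16
#define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)
#define MAX_INET_PROTOS 256

/* Sketch-only result codes so the path a packet took can be observed. */
enum verdict { DROP_NO_L3, DROP_NO_L4, TO_TCP, TO_UDP };

typedef enum verdict (*l4_handler_t)(void);
typedef enum verdict (*l3_handler_t)(uint8_t l4_proto);

struct packet_type {
    uint16_t type;             /* EtherType, host byte order in this sketch */
    l3_handler_t func;
    struct packet_type *next;  /* hash bucket chain */
};

static l4_handler_t inet_protos[MAX_INET_PROTOS];
static struct packet_type *ptype_base[PTYPE_HASH_SIZE];

/* Register an L4 handler; fails if the slot is already taken. */
static int inet_add_protocol(l4_handler_t handler, uint8_t protocol)
{
    if (inet_protos[protocol] != NULL)
        return -1;
    inet_protos[protocol] = handler;
    return 0;
}

/* Register an L3 handler in the bucket chosen by its EtherType. */
static void dev_add_pack(struct packet_type *pt)
{
    unsigned int bucket = pt->type & PTYPE_HASH_MASK;
    pt->next = ptype_base[bucket];
    ptype_base[bucket] = pt;
}

static enum verdict tcp_v4_rcv(void) { return TO_TCP; }
static enum verdict udp_rcv(void)    { return TO_UDP; }

/* L3 receive: pick the L4 handler by IP protocol number. */
static enum verdict ip_rcv(uint8_t l4_proto)
{
    l4_handler_t h = inet_protos[l4_proto];
    return h != NULL ? h() : DROP_NO_L4;
}

static struct packet_type ip_packet_type = { .type = ETH_P_IP, .func = ip_rcv };

/* Call once: mirrors the registration order in inet_init. */
static void inet_init_sketch(void)
{
    inet_add_protocol(udp_rcv, IPPROTO_UDP);
    inet_add_protocol(tcp_v4_rcv, IPPROTO_TCP);
    dev_add_pack(&ip_packet_type);
}

/* Deliver one frame: EtherType picks L3, protocol number picks L4. */
static enum verdict deliver(uint16_t ethertype, uint8_t l4_proto)
{
    struct packet_type *pt = ptype_base[ethertype & PTYPE_HASH_MASK];
    for (; pt != NULL; pt = pt->next)
        if (pt->type == ethertype)
            return pt->func(l4_proto);
    return DROP_NO_L3;
}
```

After a single call to inet_init_sketch, deliver(ETH_P_IP, IPPROTO_TCP) ends up in tcp_v4_rcv, while an EtherType with no registered packet_type or a protocol number with no registered handler is dropped at the corresponding level.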

NIC driver initialization

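Every driver (not just NIC drivers) uses module_init to register an initialization function with the kernel, and the kernel calls that function when the driver is loaded. For example, the code of the igb NIC driver is in drivers/net/ethernet/intel/igb/igb_main.c.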

//file: drivers/net/ethernet/intel/igb/igb_main.c
static struct pci_driver igb_driver = {
	.name = igb_driver_name,
	.id_table = igb_pci_tbl,
	.probe = igb_probe,
	.remove = igb_remove,
	......
};
static int __init igb_init_module(void) {
	......
	ret = pci_register_driver(&igb_driver);
	return ret; 
}
module_init(igb_init_module);

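Once the driver's pci_register_driver call completes, the Linux kernel knows about the driver, for example the igb driver's igb_driver_name and the address of its igb_probe function. When the NIC device is recognized, the kernel calls its driver's probe method (for igb_driver, that is igb_probe). The purpose of the probe method is to get the device ready. For the igb NIC, igb_probe is in drivers/net/ethernet/intel/igb/igb_main.c. Its main steps are as follows:
[Figure: the main steps performed by igb_probe]
In step 5, the NIC driver implements the interfaces that ethtool needs and registers their function addresses here. When ethtool issues a system call, the kernel finds the callback for the requested operation. For the igb NIC, these implementations are all in drivers/net/ethernet/intel/igb/igb_ethtool.c. The ethtool command can show NIC send and receive statistics, change the NIC's auto-negotiation mode, and adjust the number and size of RX queues because it ultimately calls into the corresponding driver methods, not because ethtool itself has such powers.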
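
The register-then-probe flow described above can be pictured with a small user-space sketch. It is only an illustration under stated assumptions: the structures are cut down to a few fields, the probe callback takes fewer arguments than a real PCI probe would, register_driver_sketch and igb_probe_calls are hypothetical, and the device IDs in the table are example values rather than the contents of the real igb_pci_tbl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pci_dev;

struct pci_device_id {
    uint16_t vendor;
    uint16_t device;
};

/* Cut-down stand-in for struct pci_driver. */
struct pci_driver {
    const char *name;
    const struct pci_device_id *id_table;  /* terminated by a {0, 0} entry */
    int (*probe)(struct pci_dev *dev);     /* simplified signature */
};

struct pci_dev {
    uint16_t vendor;
    uint16_t device;
    const struct pci_driver *driver;       /* NULL while unbound */
};

/* Does any entry of the driver's ID table match this device? */
static int pci_match(const struct pci_driver *drv, const struct pci_dev *dev)
{
    for (const struct pci_device_id *id = drv->id_table; id->vendor != 0; id++)
        if (id->vendor == dev->vendor && id->device == dev->device)
            return 1;
    return 0;
}

/* Hypothetical helper: offer the driver to every unbound device and
 * bind it where probe succeeds. Returns the number of devices bound. */
static int register_driver_sketch(const struct pci_driver *drv,
                                  struct pci_dev *devs, size_t n)
{
    int bound = 0;
    for (size_t i = 0; i < n; i++) {
        if (devs[i].driver != NULL || !pci_match(drv, &devs[i]))
            continue;
        if (drv->probe(&devs[i]) == 0) {
            devs[i].driver = drv;
            bound++;
        }
    }
    return bound;
}

static int igb_probe_calls;

/* Stand-in probe: a real one would make the device ready for use. */
static int igb_probe(struct pci_dev *dev)
{
    (void)dev;
    igb_probe_calls++;
    return 0;
}

static const struct pci_device_id igb_pci_tbl[] = {
    { 0x8086, 0x1521 },   /* example ID, chosen for illustration */
    { 0, 0 },
};

static const struct pci_driver igb_driver = {
    .name = "igb",
    .id_table = igb_pci_tbl,
    .probe = igb_probe,
};
```

With a device list containing one matching and one non-matching entry, register_driver_sketch calls igb_probe exactly once and binds only the matching device, which is the essence of what the text describes: the driver announces what it supports, and the kernel decides when probe runs.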

The igb_netdev_ops registered in step 6 contains functions such as igb_open, which is called when the NIC is brought up.

//file: drivers/net/ethernet/intel/igb/igb_main.c
......
static const struct net_device_ops igb_netdev_ops = {
	.ndo_open = igb_open,
	.ndo_stop = igb_close,
	.ndo_start_xmit = igb_xmit_frame,
	.ndo_get_stats64 = igb_get_stats64,
	.ndo_set_rx_mode = igb_set_rx_mode,
	.ndo_set_mac_address = igb_set_mac,
	.ndo_change_mtu = igb_change_mtu,
	.ndo_do_ioctl = igb_ioctl,
	......
};

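In step 7, during igb_probe initialization, igb_alloc_q_vector is also called. It registers the poll function that the NAPI mechanism requires; for the igb NIC driver this function is igb_poll, as the following code shows.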

static int igb_alloc_q_vector(struct igb_adapter *adapter,
	int v_count, int v_idx,
	int txr_count, int txr_idx,
	int rxr_count, int rxr_idx) {
	......
	/* initialize NAPI */
	netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
}
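
To show what registering a poll function with a weight of 64 amounts to, here is a rough user-space sketch of a poll list being drained under a budget. It is not the kernel's NAPI code: napi_add_sketch, napi_schedule_sketch, net_rx_action_sketch, and the backlog and scheduled fields are hypothetical, there is a single global poll_list instead of one per CPU, and the stand-in igb_poll only counts packets.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct napi_struct. */
struct napi_struct {
    int (*poll)(struct napi_struct *napi, int budget);
    int weight;                 /* max packets per poll call */
    int backlog;                /* packets waiting (sketch only) */
    int scheduled;              /* already on the poll list? */
    struct napi_struct *next;
};

static struct napi_struct *poll_list;

/* Registration: remember the driver's poll function and its weight. */
static void napi_add_sketch(struct napi_struct *napi,
                            int (*poll)(struct napi_struct *, int), int weight)
{
    napi->poll = poll;
    napi->weight = weight;
    napi->scheduled = 0;
    napi->next = NULL;
}

/* Put the instance on the poll list (the interrupt handler's job). */
static void napi_schedule_sketch(struct napi_struct *napi)
{
    if (napi->scheduled)
        return;
    napi->scheduled = 1;
    napi->next = poll_list;
    poll_list = napi;
}

/* Stand-in poll: consume up to budget packets, return how many. */
static int igb_poll(struct napi_struct *napi, int budget)
{
    int done = napi->backlog < budget ? napi->backlog : budget;
    napi->backlog -= done;
    return done;
}

/* Softirq side: poll queued instances until the budget runs out. */
static int net_rx_action_sketch(int budget)
{
    int total = 0;

    while (poll_list != NULL && budget > 0) {
        struct napi_struct *napi = poll_list;
        int quota = napi->weight < budget ? napi->weight : budget;
        int done = napi->poll(napi, quota);

        total += done;
        budget -= done;
        if (done < quota) {     /* queue drained: take it off the list */
            poll_list = napi->next;
            napi->next = NULL;
            napi->scheduled = 0;
        }
    }
    return total;
}
```

With a backlog of 100 packets and a weight of 64, one pass with a large budget polls twice (64, then 36) and removes the instance from the list; with a budget of only 80, the instance stays on the list with 20 packets left for the next pass.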

Bringing up the NIC

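Once all of the initialization above is complete, the NIC can be brought up. Recall from the NIC driver initialization section that the driver registered a struct net_device_ops variable with the kernel, containing callbacks (function pointers) for enabling the NIC, transmitting packets, setting the MAC address, and so on. When a NIC is enabled (for example, with ifconfig eth0 up), the igb_open method in net_device_ops is called. It typically does the following:
[Figure: the steps performed when the NIC is brought up]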

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming) {
	/* allocate transmit descriptors */
	err = igb_setup_all_tx_resources(adapter);
	/* allocate receive descriptors */
	err = igb_setup_all_rx_resources(adapter);
 
	/* register the interrupt handler */
	err = igb_request_irq(adapter);
	if (err)
		goto err_req_irq;
	/* enable NAPI */
	for (i = 0; i < adapter->num_q_vectors; i++)
		napi_enable(&(adapter->q_vector[i]->napi));
	......
}

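The __igb_open function above calls igb_setup_all_tx_resources and igb_setup_all_rx_resources. The igb_setup_all_rx_resources step allocates the RingBuffer and sets up the mapping between memory and the Rx queues. (The number and size of the Rx and Tx queues can be configured with ethtool.) Next, look at interrupt handler registration in igb_request_irq: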

static int igb_request_irq(struct igb_adapter *adapter) {
	if (adapter->msix_entries) {
		err = igb_request_msix(adapter);
		if (!err)
			goto request_done;
		......
	}
	......
}
static int igb_request_msix(struct igb_adapter *adapter) {
	......
	for (i = 0; i < adapter->num_q_vectors; i++) {
		......
		err = request_irq(adapter->msix_entries[vector].vector,
				  igb_msix_ring, 0, q_vector->name,
				  ......);
	}
	......
}

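Tracing the calls in the code above, __igb_open => igb_request_irq => igb_request_msix, we see in igb_request_msix that a multi-queue NIC registers an interrupt for each queue, with igb_msix_ring as the handler (this function is also in drivers/net/ethernet/intel/igb/igb_main.c). We can also see that in MSI-X mode each RX queue has its own MSI-X interrupt, so already at the hardware interrupt level received packets can be steered to different CPUs. (The CPU binding can be changed with irqbalance or by modifying /proc/irq/IRQ_NUMBER/smp_affinity.)
With all of this preparation done, the doors are open to welcome guests (packets)!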
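
For the smp_affinity setting mentioned above, the value written to the file is a hexadecimal CPU bitmask. The following small C sketch computes that mask and the file path; cpu_to_affinity_mask and irq_affinity_path are hypothetical helper names, and the sketch assumes a single target CPU numbered below 32.

```c
#include <assert.h>
#include <stdio.h>

/* Write the hex bitmask that selects exactly one CPU into buf.
 * Returns the string length, or -1 if the CPU is out of range
 * for this sketch (only CPUs 0..31 are handled). */
static int cpu_to_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
    if (cpu >= 32)
        return -1;
    return snprintf(buf, len, "%x", 1u << cpu);
}

/* Build the path of the affinity file for one IRQ number. */
static int irq_affinity_path(unsigned int irq, char *buf, size_t len)
{
    return snprintf(buf, len, "/proc/irq/%u/smp_affinity", irq);
}
```

For example, CPU 0 corresponds to the mask 1 and CPU 5 to the mask 20. Writing such a value to the file named by irq_affinity_path normally requires root privileges, and irqbalance may later overwrite it if it is running.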
