The journey of a packet through the linux 2.4 network stack
--------------------------------------------------------------------------------
This document describes the journey of a network packet inside the linux kernel 2.4.x. This has changed drastically since 2.2 because the globally serialized bottom half was abandoned in favor of the new softirq system.
--------------------------------------------------------------------------------

1. Preface
I have to apologize for my ignorance, but this document has a strong focus on the "default case": the x86 architecture and IP packets which get forwarded.


I am definitely no kernel guru and the information provided by this document may be wrong. So don't expect too much; I'll always appreciate your comments and bugfixes.


2. Receiving the packet
2.1 The receive interrupt
If the network card receives an ethernet frame which matches the local MAC address or is a link layer broadcast, it issues an interrupt. The network driver for this particular card handles the interrupt and fetches the packet data into RAM via DMA / PIO / whatever. It then allocates an skb and calls a function of the protocol independent device support routines: net/core/dev.c:netif_rx(skb).
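
To make the driver side concrete, here is a minimal sketch of such a receive path for a hypothetical card; mycard_rx() and mycard_copy_from_card() are made-up names, but dev_alloc_skb(), eth_type_trans() and netif_rx() are the real 2.4 interfaces a driver uses at this point:

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/skbuff.h>

    /* provided by the (hypothetical) hardware specific part of the driver */
    static void mycard_copy_from_card(struct net_device *dev,
                                      unsigned char *buf, int len);

    static void mycard_rx(struct net_device *dev, int pkt_len)
    {
        struct sk_buff *skb;

        /* reserve 2 bytes so the IP header ends up word aligned */
        skb = dev_alloc_skb(pkt_len + 2);
        if (skb == NULL)
            return;                 /* out of memory: drop the frame */
        skb_reserve(skb, 2);
        skb->dev = dev;

        /* move the frame data into the skb (DMA / PIO / whatever) */
        mycard_copy_from_card(dev, skb_put(skb, pkt_len), pkt_len);

        /* record which protocol handler should get the packet later */
        skb->protocol = eth_type_trans(skb, dev);

        /* enqueue on this CPU's backlog and raise NET_RX_SOFTIRQ */
        netif_rx(skb);
    }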

If the driver didn't already timestamp the skb, it is timestamped now. Afterwards the skb gets enqueued in the appropriate queue of the processor handling this packet. If the queue backlog is full, the packet is dropped at this point. After enqueuing the skb, the receive softirq is marked for execution via include/linux/interrupt.h:__cpu_raise_softirq().

The interrupt handler exits and all interrupts are reenabled.


2.2 The network RX softirq
Now we encounter one of the big changes between 2.2 and 2.4: the whole network stack is no longer a bottom half, but a softirq. Softirqs have the major advantage that they may run on more than one CPU simultaneously, whereas bottom halves were guaranteed to run on only one CPU at a time.

Our network receive softirq is registered in net/core/dev.c:net_init() using the function kernel/softirq.c:open_softirq() provided by the softirq subsystem.
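
The registration itself is a single call; the following is a minimal sketch (not the real dev.c code), with my_rx_action() standing in for net_rx_action():

    #include <linux/init.h>
    #include <linux/interrupt.h>

    /* stand-in for net_rx_action(); see the dispatch sketch below */
    static void my_rx_action(struct softirq_action *h)
    {
        /* dequeue skbs from this CPU's backlog and dispatch them */
    }

    static int __init my_softirq_setup(void)
    {
        /* from now on, marking NET_RX_SOFTIRQ with __cpu_raise_softirq()
         * makes do_softirq() run my_rx_action() on that CPU */
        open_softirq(NET_RX_SOFTIRQ, my_rx_action, NULL);
        return 0;
    }
    __initcall(my_softirq_setup);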

Further handling of our packet is done in the network receive softirq (NET_RX_SOFTIRQ) which is called from kernel/softirq.c:do_softirq(). do_softirq() itself is called from three places within the kernel:

from arch/i386/kernel/irq.c:do_IRQ(), which is the generic IRQ handler
from arch/i386/kernel/entry.S in case the kernel just returned from a syscall
inside the main process scheduler in kernel/sched.c:schedule()
So if execution passes one of these points, do_softirq() is called; it detects that NET_RX_SOFTIRQ is marked and calls net/core/dev.c:net_rx_action(). Here the skb is dequeued from this CPU's receive queue and afterwards handed to the appropriate packet handler. In case of IPv4 this is the IPv4 packet handler.
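
What net_rx_action() does can be summarized by the following heavily simplified sketch; rx_queue and ptype_list are stand-ins for the per-CPU backlog queue and the list of handlers registered with dev_add_pack(), and the budget and time limits of the real code are omitted:

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static struct sk_buff_head rx_queue;     /* stand-in: per-CPU backlog */
    static struct packet_type *ptype_list;   /* stand-in: registered handlers */

    static void rx_action_sketch(struct softirq_action *h)
    {
        struct sk_buff *skb;

        while ((skb = skb_dequeue(&rx_queue)) != NULL) {
            struct packet_type *pt;

            for (pt = ptype_list; pt != NULL; pt = pt->next) {
                if (pt->type == skb->protocol) {
                    /* for ETH_P_IP this calls ip_rcv(); the real code
                     * clones the skb if several handlers match */
                    pt->func(skb, skb->dev, pt);
                    break;
                }
            }
        }
    }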


2.3 The IPv4 packet handler
The IP packet handler is registered via net/core/dev.c:dev_add_pack() called from net/ipv4/ip_output.c:ip_init().
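
A protocol registers itself by filling in a struct packet_type and handing it to dev_add_pack(); the sketch below is modeled on what ip_init() does for IPv4, with all my_* names made up for the example:

    #include <linux/if_ether.h>
    #include <linux/init.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static int my_ip_rcv(struct sk_buff *skb, struct net_device *dev,
                         struct packet_type *pt)
    {
        /* in the real stack this is net/ipv4/ip_input.c:ip_rcv() */
        kfree_skb(skb);
        return 0;
    }

    static struct packet_type my_ip_packet_type = {
        type:   __constant_htons(ETH_P_IP),   /* which ethertype we want */
        func:   my_ip_rcv,                    /* called from net_rx_action() */
    };

    static void __init my_proto_init(void)
    {
        dev_add_pack(&my_ip_packet_type);
    }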

The IPv4 packet handling function is net/ipv4/ip_input.c:ip_rcv(). After some initial checks (whether the packet is addressed to this host, ...), the IP header checksum is verified. Additional checks are done on the length and on the IP protocol version (which must be 4).

Every packet failing one of the sanity checks is dropped at this point.

If the packet passes the tests, we determine the size of the ip packet and trim the skb in case the transport medium has appended some padding.
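
The checks described above amount to something like the following sketch (simplified; the real ip_rcv() also deals with shared skbs and otherhost packets and uses __pskb_trim()):

    #include <asm/byteorder.h>
    #include <linux/ip.h>
    #include <linux/skbuff.h>
    #include <net/checksum.h>

    static int rcv_checks_sketch(struct sk_buff *skb)
    {
        struct iphdr *iph = skb->nh.iph;
        unsigned short tot_len;

        /* header must be at least 20 bytes (ihl counts 32 bit words)
         * and the version field must say IPv4 */
        if (iph->ihl < 5 || iph->version != 4)
            goto drop;

        /* the header checksum must be correct */
        if (ip_fast_csum((unsigned char *)iph, iph->ihl) != 0)
            goto drop;

        /* the frame may be longer than the IP packet (link layer
         * padding): trim the skb down to the IP total length */
        tot_len = ntohs(iph->tot_len);
        if (skb->len < tot_len || tot_len < iph->ihl * 4)
            goto drop;
        if (skb->len > tot_len)
            skb_trim(skb, tot_len);
        return 0;

    drop:
        kfree_skb(skb);
        return -1;
    }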

Now, for the first time, one of the netfilter hooks is called: NF_IP_PRE_ROUTING.

Netfilter provides a generic and abstract interface to the standard routing code. This is currently used for packet filtering, mangling, NAT and queuing packets to userspace. For further reference see my conference paper 'The netfilter subsystem in Linux 2.4' or one of Rusty's unreliable guides, e.g. the netfilter-hacking-guide.
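
For illustration, this is what hooking into netfilter looks like from a module's point of view; the NF_IP_PRE_ROUTING hook shown here is the one ip_rcv() traverses, and all my_* names are invented for this sketch:

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/skbuff.h>
    #include <linux/socket.h>

    static unsigned int my_prerouting_hook(unsigned int hooknum,
                                           struct sk_buff **pskb,
                                           const struct net_device *in,
                                           const struct net_device *out,
                                           int (*okfn)(struct sk_buff *))
    {
        /* let every packet continue its journey; a packet filter would
         * return NF_DROP here, NAT would rewrite (*pskb)->nh.iph, ... */
        return NF_ACCEPT;
    }

    static struct nf_hook_ops my_ops = {
        hook:     my_prerouting_hook,
        pf:       PF_INET,
        hooknum:  NF_IP_PRE_ROUTING,
        priority: NF_IP_PRI_FIRST,
    };

    static int __init my_nf_init(void)
    {
        return nf_register_hook(&my_ops);
    }

    static void __exit my_nf_exit(void)
    {
        nf_unregister_hook(&my_ops);
    }

    module_init(my_nf_init);
    module_exit(my_nf_exit);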

After successful traversal of the netfilter hook, net/ipv4/ip_input.c:ip_rcv_finish() is called.

Inside ip_rcv_finish(), the packet's destination is determined by calling the routing function net/ipv4/route.c:ip_route_input(). Furthermore, if our IP packet has IP options, they are processed now. Depending on the routing decision made by net/ipv4/route.c:ip_route_input_slow(), the journey of our packet continues in one of the following functions (a sketch of how this dispatch works follows the list):


net/ipv4/ip_input.c:ip_local_deliver()
The packet's destination is local; we have to process the layer 4 protocol and pass it to a userspace process.


net/ipv4/ip_forward.c:ip_forward()
The packet's destination is not local; we have to forward it to another network.


net/ipv4/route.c:ip_error()
An error occurred; we are unable to find an appropriate routing table entry for this packet.


net/ipv4/ipmr.c:ip_mr_input()
It is a multicast packet and we have to do some multicast routing.
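
The dispatch itself is compact: ip_route_input() attaches a dst_entry to the skb whose input function pointer already refers to one of the four functions above, so ip_rcv_finish() essentially ends with the following (simplified sketch):

    #include <linux/skbuff.h>
    #include <net/dst.h>

    static inline int route_dispatch_sketch(struct sk_buff *skb)
    {
        /* dst->input was set by ip_route_input()/ip_route_input_slow() to
         * ip_local_deliver, ip_forward, ip_error or ip_mr_input */
        return skb->dst->input(skb);
    }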


3. Packet forwarding to another device
If the routing decided that this packet has to be forwarded to another device, the function net/ipv4/ip_forward.c:ip_forward() is called.


The first task of this function is to check the ip header's TTL. If it is <= 1 we drop the packet and return an ICMP time exceeded message to the sender.

We check whether the skb has enough headroom for the destination device's link layer header and expand the skb if necessary.

Next the TTL is decremented by one.

If our new packet is bigger than the MTU of the destination device and the don't fragment bit in the IP header is set, we drop the packet and send an ICMP frag needed message to the sender.
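
Condensed into code, the checks described in this section look roughly like the following sketch (modeled on ip_forward(); the headroom expansion and some error handling are left out):

    #include <linux/icmp.h>
    #include <linux/ip.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <net/icmp.h>
    #include <net/ip.h>
    #include <net/route.h>

    static int forward_checks_sketch(struct sk_buff *skb)
    {
        struct iphdr *iph = skb->nh.iph;
        struct rtable *rt = (struct rtable *)skb->dst;
        unsigned int mtu = rt->u.dst.pmtu;

        /* packet has run out of hops */
        if (iph->ttl <= 1) {
            icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
            goto drop;
        }

        /* we forward it, so use up one hop (this also fixes the checksum) */
        ip_decrease_ttl(iph);

        /* too big for the outgoing device and "don't fragment" set */
        if (skb->len > mtu && (iph->frag_off & htons(IP_DF))) {
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
            goto drop;
        }
        return NET_RX_SUCCESS;

    drop:
        kfree_skb(skb);
        return NET_RX_DROP;
    }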


Finally it is time to call another one of the netfilter hooks - this time it is the NF_IP_FORWARD hook.


Assuming that the netfilter hook returns an NF_ACCEPT verdict, the function net/ipv4/ip_forward.c:ip_forward_finish() is the next step in our packet's journey.


ip_forward_finish() itself checks if we need to set any additional options in the IP header and has ip_forward_options() do this. Afterwards it calls include/net/ip.h:ip_send().


If the packet needs fragmentation, ip_fragment() gets called; otherwise we continue in net/ipv4/ip_output.c:ip_finish_output().
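
From memory, the relevant part of include/net/ip.h:ip_send() boils down to something like this (not a verbatim copy):

    #include <linux/skbuff.h>
    #include <net/dst.h>
    #include <net/ip.h>

    static inline int ip_send_sketch(struct sk_buff *skb)
    {
        /* fragment if the packet exceeds the path MTU of its route,
         * otherwise go straight to ip_finish_output() */
        if (skb->len > skb->dst->pmtu)
            return ip_fragment(skb, ip_finish_output);
        return ip_finish_output(skb);
    }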


ip_finish_output() again does nothing else than call the netfilter postrouting hook NF_IP_POST_ROUTING and, on successful traversal of this hook, ip_finish_output2().


ip_finish_output2() prepends the hardware (link layer) header to our skb and hands the packet to the link layer by calling the destination's output function (hh->hh_output() or dst->neighbour->output()).
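
The very last IPv4 step can be sketched as follows (heavily simplified; the real ip_finish_output2() copies the cached header with the proper alignment and handles locking):

    #include <linux/errno.h>
    #include <linux/skbuff.h>
    #include <net/dst.h>
    #include <net/neighbour.h>

    static inline int finish_output2_sketch(struct sk_buff *skb)
    {
        struct dst_entry *dst = skb->dst;

        if (dst->hh)
            /* a prebuilt link layer header is cached: the real code
             * copies it in front of skb->data, then sends the frame */
            return dst->hh->hh_output(skb);
        if (dst->neighbour)
            /* no cached header yet: let the neighbour (ARP) code
             * resolve and build it before transmission */
            return dst->neighbour->output(skb);

        kfree_skb(skb);
        return -EINVAL;
    }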

 

 

Because the 2.4 kernel uses the softirq mechanism, the journey of a network packet through the kernel is quite different from 2.2, so it is worth explaining again :). For an explanation of softirq/tasklet/bottom half, see the diagram I drew.

 

1. Receiving the packet

a) The receive interrupt

If the network card detects that the destination MAC address of an ethernet frame matches its own address, or that the frame is a link layer broadcast, it accepts the frame, puts it into its own buffer and raises an interrupt. The CPU responds to the interrupt and calls the card's interrupt handler, which copies the packet from the card's buffer into RAM, via DMA, PIO or whatever. The handler then allocates an skb structure and finally calls the protocol independent function net/core/dev.c:netif_rx(skb). This function timestamps the packet and puts its skb on the receive queue of the CPU handling it; if that queue's backlog is full, the packet is dropped. Finally, netif_rx() marks the network receive softirq for execution with __cpu_raise_softirq() (include/linux/interrupt.h). At this point the hardware interrupt handler is done; all further processing is left to the RX softirq.

b) The RX softirq

This part is completely different from the 2.2 kernel: 2.2 used the bottom half mechanism, while 2.4 uses softirqs. Compared with bottom halves, softirqs have the advantage that they can run on several CPUs simultaneously, whereas a bottom half is guaranteed to run on only one CPU at any given time.

The RX softirq is registered in net/core/dev.c:net_init() using kernel/softirq.c:open_softirq().

The later stages of packet processing are done in NET_RX_SOFTIRQ, which is invoked from kernel/softirq.c:do_softirq(). do_softirq() itself is called at three points:

1. arch/i386/kernel/irq.c:do_IRQ(), the (hardware) IRQ handling function.

2. arch/i386/kernel/entry.S, when the kernel returns from a system call.

3. In the process scheduler, kernel/sched.c:schedule().

So whenever the CPU passes one of these three points, do_softirq() is called. If it detects that NET_RX_SOFTIRQ has been marked, it runs net/core/dev.c:net_rx_action(). In net_rx_action() the skb is dequeued from the receive queue of the corresponding CPU and, according to its type, handed to the matching handler; for IP packets this is ip_rcv().

c) The IP packet handler is registered via net/core/dev.c:dev_add_pack(), which is called from net/ipv4/ip_output.c:ip_init(); ip_init() initializes various structures and registers the IP packet handler. For IPv4 the handler is net/ipv4/ip_input.c:ip_rcv(). This function first performs some checks, such as verifying the IP header checksum, the packet length and the IP protocol version; if any of these checks fails, the packet is dropped. It then determines the length of the IP packet and trims off any useless padding the transport medium may have appended. After that the first NETFILTER hook is invoked. Once the netfilter hook has been traversed, net/ipv4/ip_input.c:ip_rcv_finish() is called. In ip_rcv_finish() the packet's destination is computed by net/ipv4/route.c:ip_route_input(); if the IP packet carries IP options, they are processed here as well. Depending on the decision made by net/ipv4/route.c:ip_route_input_slow(), our IP packet continues down one of the following paths:

1. net/ipv4/ip_input.c:ip_local_deliver(): the packet is addressed to this host; the kernel should hand it to the upper layer protocol.

2. net/ipv4/ip_forward.c:ip_forward(): the packet needs to be forwarded.

3. net/ipv4/route.c:ip_error(): an error occurred; no matching routing table entry could be found for this packet.

4. net/ipv4/ipmr.c:ip_mr_input(): the packet is a multicast packet and multicast routing has to be done.

2. Packets that need to be forwarded

If the packet has to be forwarded, net/ipv4/ip_forward.c:ip_forward() is called.

This function does the following:

i. Check the TTL; if it is <= 1, drop the packet and send an ICMP time exceeded message to the sender.

ii. Check whether the skb has enough headroom for the destination device's link layer header, and expand the skb if necessary.

iii. Decrement the TTL by one.

iv. If the packet is larger than the destination device's MTU and the don't fragment bit in the IP header is set, send an ICMP frag needed message to the sender and drop the packet.

v. Call the NF_IP_FORWARD hook.

vi. If the NF_IP_FORWARD hook returns NF_ACCEPT, call net/ipv4/ip_forward.c:ip_forward_finish().

In ip_forward_finish() we check whether some additional options need to be set in the IP header, and then include/net/ip.h:ip_send() is called.

ip_send() checks whether the IP packet needs to be fragmented; if so, ip_fragment() is called, otherwise net/ipv4/ip_output.c:ip_finish_output().

ip_finish_output() does nothing but invoke another NETFILTER hook, NF_IP_POST_ROUTING, and then call ip_finish_output2().

ip_finish_output2() calls the output function of the skb's cached hardware header, hh->hh_output(skb).

 
