Inside the Linux Packet Filter

In Part I of this two-part series on the Linux Packet Filter, Gianluca describes a packet's journey through the kernel.

Network geeks among you may remember my article, “Linux Socket Filter: Sniffing Bytes over the Network”, in the June 2001 issue of LJ, regarding the use of the packet filter built inside the Linux kernel. In that article I provided an overview of the functionality of the packet filter itself; this time, I delve into the depths of the kernel mechanisms that allow the filter to work and share some insights on Linux packet processing internals.

Last Article's Points

In the previous article, some arguments regarding kernel packet processing were raised. It is worthwhile to recall briefly the most important of them:

  • Packet reception is first dealt with at the network card's driver level, more precisely in the interrupt service routine. The service routine looks up the protocol type inside the received frame and queues it appropriately for later processing.

  • During reception and protocol processing, packets might be discarded if the machine is congested. Furthermore, as they travel upward toward user land, packets lose lower-level network information.

  • At the socket level, just before reaching user land, the kernel checks whether an open socket for the given packet exists. If it does not, the packet is discarded.

  • The Linux kernel implements a general-purpose protocol, called PF_PACKET, which allows you to create a socket that receives packets directly from the network card driver. Hence, any other protocols' handling is skipped, and any packets can be received.

  • An Ethernet card usually passes only the packets destined to itself to the kernel, discarding all the others. Nevertheless, it is possible to configure the card in such a way that all the packets flowing through the network are captured, independent of their MAC address (promiscuous mode).

  • Finally, you can attach a filter to a socket, so that only packets matching your filter's rules are accepted and passed to the socket. Combined with PF_PACKET sockets, this mechanism allows you to sniff selected packets efficiently from your LAN.

Even though we built our sniffer using PF_PACKET sockets, the Linux socket filter (LSF) is not limited to those. In fact, the filter also can be used on plain TCP and UDP sockets to filter out unwanted packets; of course, this use of the filter is much less common.

In the following, I sometimes refer either to a socket or to a sock structure. As far as this article is concerned, both forms indicate the same object, and the latter corresponds to the kernel's internal representation of the former. Actually, the kernel holds both a socket structure and a sock structure, but the difference between the two is not relevant here.

Another data structure that will recur quite often is the sk_buff (short for socket buffer), which represents a packet inside the kernel. The structure is arranged in such a way that addition and removal of header and trailer information to the packet data can be done in a relatively inexpensive way: no data actually needs to be copied, since everything is done by just shifting pointers.
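As a rough illustration, here is how that pointer shifting looks with the kernel's sk_buff helpers. This is a minimal sketch using the 2.4-era calls; the buffer sizes are arbitrary and the demo function itself is hypothetical:

    #include <linux/skbuff.h>

    /* Hypothetical demonstration of header handling by pointer shifting. */
    static void skb_pointer_demo(void)
    {
        struct sk_buff *skb;

        skb = dev_alloc_skb(1514 + 16);   /* room for a frame plus some headroom */
        if (!skb)
            return;

        skb_reserve(skb, 16);   /* set aside headroom: only moves data/tail pointers */
        skb_put(skb, 1500);     /* "append" payload: advances the tail pointer */
        skb_push(skb, 14);      /* prepend an Ethernet header: moves data backward */
        skb_pull(skb, 14);      /* strip it again before layer-3 processing */

        kfree_skb(skb);
    }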

Before going on, it may be useful to clear up possible ambiguities. Despite having a similar name, the Linux socket filter has a completely different purpose with respect to the Netfilter framework introduced into the kernel in early 2.3 versions. Even if Netfilter allows you to bring packets up to user space and feed them to your programs, the focus there is to handle network address translation (NAT), packet mangling, connection tracking, packet filtering for security purposes and so on. If you just need to sniff packets and filter them according to certain rules, the most straightforward tool is LSF.

Now we are going to follow the trip of a packet from its very ingress into the computer to its delivery to user land at the socket level. We first consider the general case of a plain (i.e., not PF_PACKET) socket. Our analysis at the link layer level is based on Ethernet, since this is the most widespread and representative LAN technology. Cases of other link layer technologies do not present significant differences.

Ethernet Card and Lower-Kernel Reception

As we mentioned in the previous article, the Ethernet card is hard-wired with a particular link layer (or MAC) address and is always listening for packets on its interface. When it sees a packet whose MAC address matches either its own address or the link layer broadcast address (i.e., FF:FF:FF:FF:FF:FF for Ethernet), it starts reading it into memory.

Upon completion of packet reception, the network card generates an interrupt request. The interrupt service routine that handles the request is the card driver itself, which runs with interrupts disabled and typically performs the following operations (a condensed sketch follows the list):

  • Allocates a new sk_buff structure, defined in include/linux/skbuff.h, which represents the kernel's view of a packet.

  • Fetches packet data from the card buffer into the freshly allocated sk_buff, possibly using DMA.

  • Invokes netif_rx(), the generic network reception handler.

  • When netif_rx() returns, re-enables interrupts and terminates the service routine.
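Put together, the receive path of a typical 2.4-era driver boils down to something like the following sketch. The names my_card_rx() and my_card_copy_packet() are hypothetical stand-ins for the card-specific code:

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>

    /* Hypothetical card-specific routine that copies (or DMAs) a frame off the card. */
    extern void my_card_copy_packet(struct net_device *dev, void *dst, int len);

    /* Hypothetical receive handler, called from the card's interrupt service routine. */
    static void my_card_rx(struct net_device *dev, int pkt_len)
    {
        struct sk_buff *skb;

        skb = dev_alloc_skb(pkt_len + 2);          /* fresh sk_buff for this frame */
        if (!skb)
            return;                                /* out of memory: frame is dropped */

        skb_reserve(skb, 2);                       /* align the IP header */
        skb->dev = dev;

        my_card_copy_packet(dev, skb_put(skb, pkt_len), pkt_len);

        skb->protocol = eth_type_trans(skb, dev);  /* e.g., htons(ETH_P_IP) */
        netif_rx(skb);                             /* hand the packet to the network core */
    }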

The netif_rx() function prepares the kernel for the next reception step; it puts the sk_buff into the incoming packets queue for the current CPU and marks the NET_RX softirq (softirqs are explained below) for execution via the __cpu_raise_softirq() call. Two points are worth noticing at this stage. First, if the queue is full, the packet is discarded and lost forever. Second, we have one queue for each CPU; together with the new deferred kernel processing model (softirqs instead of bottom halves), this allows for concurrent packet reception on SMP machines.

If you want to see a real-world Ethernet driver in action, you can refer to the simple NE 2000 PCI card driver, located in drivers/net/8390.c; the interrupt service routine, called ei_interrupt(), calls ei_receive(), which in turn performs the following procedure:

  • Allocates a new sk_buff structure via the dev_alloc_skb() call.

  • Reads the packet from the card buffer (ei_block_input() call) and sets skb->protocol accordingly.

  • Calls netif_rx().

  • Repeats the procedure for a maximum of ten consecutive packets.

A slightly more complex example is provided by the 3COM driver, located in 3c59x.c, which uses DMA to transfer the packet from the card memory to the sk_buff.


Network Core Processing

Let's take a closer look at the netif_rx() function. As mentioned before, this function has the task of receiving a packet from a network driver and queuing it for upper-layer processing. It acts as a single gathering point for all the packets collected by the different network card drivers, providing input to the upper protocols' processing.

Since this function runs in interrupt context (that is, its execution flow follows the interrupt service path) with other interrupts disabled, it has to be quick and short. It cannot perform lengthy checks or other complex tasks, since the system is potentially losing packets while netif_rx() runs. So, what this function does is basically select the packet queue from an array called softnet_data, whose index is based on the CPU currently running. It then checks the status of the queue, identifying one of five possible congestion levels: NET_RX_SUCCESS (no congestion), NET_RX_CN_LOW, NET_RX_CN_MOD, NET_RX_CN_HIGH (low, moderate and high congestion, respectively) or NET_RX_DROP (packet dropped due to critical congestion).

Should the critical congestion level be reached, netif_rx() engages a throttling policy that allows the queue to go back to a noncongested status, avoiding service disruption due to kernel overload. Among other benefits, this helps avert possible DoS attacks.

Under normal conditions, the packet is finally queued (__skb_queue_tail()), and __cpu_raise_softirq(cpuid, NET_RX_SOFTIRQ) is called. The latter function has the effect of scheduling a softirq for execution.
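Stripped of the congestion accounting, the queueing logic can be sketched as follows. This is a simplified view of the 2.4 code, not a verbatim copy; the real function returns the exact congestion level rather than just success or drop:

    /* Simplified view of the queueing done by netif_rx() (net/core/dev.c, 2.4). */
    int netif_rx_sketch(struct sk_buff *skb)
    {
        int cpu = smp_processor_id();
        struct softnet_data *queue = &softnet_data[cpu];   /* per-CPU input queue */

        if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
            __skb_queue_tail(&queue->input_pkt_queue, skb);
            __cpu_raise_softirq(cpu, NET_RX_SOFTIRQ);       /* schedule net_rx_action() */
            return NET_RX_SUCCESS;
        }

        kfree_skb(skb);          /* queue full: the packet is lost forever */
        return NET_RX_DROP;
    }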

The netif_rx() function terminates, returning a value indicating the current congestion level to the caller. At this point, interrupt context processing is done, and the packet is ready to be taken care of by upper-layer protocols. This processing is deferred to a later time, when interrupts will have been re-enabled and execution timing will not be as critical. The deferred execution mechanism has changed radically from kernel versions 2.2 (where it was based on bottom halves) to versions 2.4 (where it is based on softirqs).

softirqs vs. Bottom Halves

Explaining bottom halves (BHs) and their evolution in detail is out of the scope of this article, but some points are worth recalling briefly.

First off, their design was based on the principle that the kernel should perform as few computations as possible while in interrupt context. Thus, when long operations were to be done in response to an interrupt, the corresponding driver would mark the appropriate BH for execution, without actually doing anything complex. Then, at a later time, the kernel would check the BH mask to determine whether some BHs were marked for execution and execute them before any application-level task.

BHs worked quite well, with one important drawback: due to their structure, their execution was serialized strictly among CPUs. That is, the same BH could not be executed by more than one CPU at the same time. This obviously prevented any kind of kernel parallelism on SMP machines and seriously affected performance.

softirqs represent the 2.4-era evolution of BHs and, together with tasklets, belong to the family of kernel software interrupts, pieces of code that can be executed by the kernel when requested, without strict response-time guarantees.

The major difference with respect to BHs is that the same softirq may be run on more than one CPU at a time. Serialization, if required, now must be obtained explicitly by using kernel spinlocks.

softirq's Internals

The core of softirq processing is performed in the do_softirq() routine, located in kernel/softirq.c. This function checks a bit mask, and if the bit corresponding to a given softirq is set, it calls the appropriate handling routine. In the case of NET_RX_SOFTIRQ, the one we are interested in at this time, the relevant function is net_rx_action(), located in net/core/dev.c. The do_softirq() function may get called from three distinct places inside the kernel: do_IRQ(), in kernel/irq.c, which is the generic interrupt handler; the system calls' exit point, in kernel/entry.S; and schedule(), in kernel/sched.c, which is the main process scheduling function.
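Schematically, the dispatching looks like this. It is a simplified sketch of the do_softirq() loop; the real code also re-reads the pending mask and guards against reentrancy:

    /* Simplified view of how do_softirq() dispatches pending softirqs. */
    void do_softirq_sketch(void)
    {
        __u32 pending = softirq_pending(smp_processor_id());  /* per-CPU bit mask */
        struct softirq_action *h = softirq_vec;

        while (pending) {
            if (pending & 1)
                h->action(h);   /* for NET_RX_SOFTIRQ this is net_rx_action() */
            h++;
            pending >>= 1;
        }
    }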

In other words, execution of a softirq may happen either when a hardware interrupt has been processed, when an application-level process invokes a system call or when a new process is scheduled for execution. This way, softirqs are drained frequently enough that none of them will lie waiting for their turn for too long.

The trigger mechanism was exactly the same for the old-style bottom halves, too.


The NET_RX softirq

We've seen the packet come in through the network interface and get queued for later processing. Then, we've considered how this processing is resumed by a call to the net_rx_action() function. It's now time to see what this function does. Basically, its operation is pretty simple: it just dequeues the first packet (sk_buff) from the current CPU's queue and runs through the two lists of packet handlers, calling the relevant processing functions.

Some more words are worth spending on those lists and how they are built. The two lists are called ptype_all and ptype_base and contain, respectively, protocol handlers for generic packets and for specific packet types. Protocol handlers register themselves, either at kernel startup time or when a particular socket type is created, declaring which protocol type they can handle; the involved function is dev_add_pack() in net/core/dev.c, which adds a packet type structure (see include/linux/netdevice.h) containing a pointer to the function that will be called when a packet of that type is received. Upon registration, each handler's structure is either put in the ptype_all list (for the ETH_P_ALL type) or hashed into the ptype_base list (for other ETH_P_* types).
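As an illustration, this is roughly how a handler hooks itself into those lists via dev_add_pack(). The handler name my_pkt_rcv() is hypothetical, while the structure layout follows the 2.4 definition of struct packet_type:

    #include <linux/init.h>
    #include <linux/skbuff.h>
    #include <linux/netdevice.h>
    #include <linux/if_ether.h>

    /* Hypothetical receive function, invoked by net_rx_action() for matching packets. */
    static int my_pkt_rcv(struct sk_buff *skb, struct net_device *dev,
                          struct packet_type *pt)
    {
        /* ... examine the packet ... */
        kfree_skb(skb);
        return 0;
    }

    static struct packet_type my_packet_type = {
        type:   __constant_htons(ETH_P_IP),  /* ETH_P_ALL would land in ptype_all instead */
        func:   my_pkt_rcv,
    };

    static int __init my_proto_init(void)
    {
        dev_add_pack(&my_packet_type);       /* hashed into ptype_base by protocol type */
        return 0;
    }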

So, what the NET_RX softirq does is call in sequence each protocol handler function registered to handle the packet's protocol type. Generic handlers (that is, ptype_all protocols) are called first, regardless of the packet's protocol; specific handlers follow. As we will see, the PF_PACKET protocol is registered in one of the two lists, depending on the socket type chosen by the application. On the other hand, the normal IP handler is registered in the second list, hashed with the key ETH_P_IP.
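Conceptually, the per-packet work of net_rx_action() can be summarized by the following sketch; reference counting, cloning and locking are omitted for brevity:

    /* Sketch of how net_rx_action() walks the handler lists for one packet. */
    static void deliver_packet_sketch(struct sk_buff *skb)
    {
        struct packet_type *ptype;

        /* Generic handlers first (e.g., PF_PACKET sockets opened with ETH_P_ALL). */
        for (ptype = ptype_all; ptype; ptype = ptype->next)
            ptype->func(skb, skb->dev, ptype);

        /* Then the handlers registered for this specific protocol type. */
        for (ptype = ptype_base[ntohs(skb->protocol) & 15]; ptype; ptype = ptype->next)
            if (ptype->type == skb->protocol)
                ptype->func(skb, skb->dev, ptype);
    }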

If the queue contains more than one packet, net_rx_action() loops on the packets until either a maximum number of packets has been processed (netdev_max_backlog) or too much time has been spent here (the time limit is 1 jiffy, i.e., 10ms on most kernels). If net_rx_action() breaks the loop leaving a non-empty queue, the NET_RX_SOFTIRQ is enabled again to allow the processing to be resumed at a later time.

The IP Packet Handler

The IP protocol receive function, namely ip_rcv() (in net/ipv4/ip_input.c), is pointed to by the packet type structure registered within the kernel at startup time (ip_init(), in net/ipv4/ip_output.c). Obviously, the registered protocol type for IP is ETH_P_IP.

Thus, ip_rcv() gets called from within net_rx_action() during the processing of a softirq, whenever a packet with type ETH_P_IP is dequeued. This function performs all the initial checks on the IP packet, which mainly involve verifying its integrity (IP checksum, IP header fields and minimum significant packet length). If the packet looks correct, ip_rcv_finish() is called. As a side note, the call to this function passes through the Netfilter prerouting control point, which is practically implemented with the NF_HOOK macro.

ip_rcv_finish(), still in ip_input.c, mainly deals with the routing functionality of IP. It checks whether the packet should be forwarded to another machine or if it is destined to the local host. In the former case, routing is performed, and the packet is sent out via the appropriate interface; otherwise, local delivery is performed. All the magic is realized by the ip_route_input() function, called at the very beginning of ip_rcv_finish(), which determines the next processing step by setting the appropriate function pointer in skb->dst->input. In the case of locally bound packets, this pointer is the address of the ip_local_deliver() function. ip_rcv_finish() terminates with a call to skb->dst->input().
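In the spirit of the real ip_rcv_finish(), the decision can be condensed into a few lines; this is a sketch, with error handling and the Netfilter interaction left out:

    /* Sketch of the routing decision taken in ip_rcv_finish() (net/ipv4/ip_input.c). */
    static int ip_rcv_finish_sketch(struct sk_buff *skb)
    {
        struct iphdr *iph = skb->nh.iph;

        /* Consult the routing table; this fills in skb->dst. */
        if (skb->dst == NULL &&
            ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev) != 0) {
            kfree_skb(skb);                     /* no route: drop the packet */
            return 0;
        }

        /* For local packets skb->dst->input points to ip_local_deliver(),
         * for forwarded ones to ip_forward(). */
        return skb->dst->input(skb);
    }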

At this point, packets definitely are traveling toward the upper-layer protocols. Control is passed to ip_local_deliver(); this function just deals with IP fragment reassembly (in case the IP datagram is fragmented) and then goes over to the ip_local_deliver_finish() function. Just before calling it, another Netfilter hook (the local input one, NF_IP_LOCAL_IN) is executed.

The latter is the last call involving IP-level processing; ip_local_deliver_finish() carries out the tasks still pending to complete the upper part of layer 3. IP header data is trimmed so that the packet is ready to be transferred to the layer 4 protocol. A check is done to assess whether the packet belongs to a raw IP socket, in which case the corresponding handler (raw_v4_input()) is called.

Raw IP is a protocol that allows applications to forge and receive their own IP packets directly, without incurring actual layer 4 processing. Its main use is for network tools that need to send particular packets to perform their tasks. Well-known examples of such tools are ping and traceroute, which use raw IP to build packets with specific header values. Another possible application of raw IP is, for example, realizing custom network protocols at the user level (such as RSVP, the resource reservation protocol). Raw IP may be considered a standard equivalent of the PF_PACKET protocol family, just shifted up one open systems interconnection (OSI) level.
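From user space, obtaining such a socket takes only a couple of calls. The snippet below is a minimal sketch (it requires root privileges, and ICMP is just one possible choice of protocol):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>

    int main(void)
    {
        int fd, on = 1;

        /* Raw IP socket: receives ICMP packets and lets us forge our own. */
        fd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        /* Optionally build the IP header ourselves instead of letting the kernel do it. */
        if (setsockopt(fd, IPPROTO_IP, IP_HDRINCL, &on, sizeof(on)) < 0)
            perror("setsockopt(IP_HDRINCL)");

        /* ... sendto()/recvfrom() with hand-crafted packets would go here ... */
        return 0;
    }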

Most commonly, though, packets will be headed toward a further kernel protocol handler. In order to determine which one it is, the Protocol field inside the IP header is examined. The method used by the kernel at this point is very similar to the one adopted by the net_rx_action() function; a hash is defined, called inet_protos, which contains all the registered post-IP protocol handlers. The hash key is, of course, derived from the IP header's protocol field. The inet_protos hash is filled in at kernel startup time by inet_init() (in net/ipv4/af_inet.c), which repeatedly calls inet_add_protocol() to register the TCP, UDP, ICMP and IGMP handlers (the latter only if multicast is enabled). The complete protocol table is defined in net/ipv4/protocol.c.
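The lookup done at this stage is conceptually similar to the following sketch, based on the 2.4 inet_protocol structure; locking and raw-socket delivery are omitted:

    /* Sketch of the post-IP protocol dispatch performed in ip_local_deliver_finish(). */
    static void dispatch_layer4_sketch(struct sk_buff *skb)
    {
        struct inet_protocol *ipprot;
        unsigned char protocol = skb->nh.iph->protocol;
        int hash = protocol & (MAX_INET_PROTOS - 1);

        for (ipprot = inet_protos[hash]; ipprot; ipprot = ipprot->next) {
            if (ipprot->protocol == protocol) {
                ipprot->handler(skb);   /* e.g., tcp_v4_rcv(), udp_rcv(), ... */
                return;
            }
        }
        /* Nobody claimed the packet: an ICMP Destination Unreachable may be sent. */
    }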

For each protocol, a handler function is defined: tcp_v4_rcv(), udp_rcv(), icmp_rcv() and igmp_rcv() are the obvious names corresponding to the above-mentioned protocols. One of these functions is thus called to proceed with packet processing. The function's return value is used to determine whether an ICMP Destination Unreachable message has to be returned to the sender. This is the case when the upper-level protocols do not recognize the packet as belonging to an existing socket. As you will recall from the previous article, one of the issues in sniffing network data was to have a socket able to receive packets independent of their port/address values. Here (and in the just-mentioned *_rcv() functions) is the point where that limitation arises.

Conclusion

At this point, the packet is more than halfway through its journey. Since space is limited in our beloved magazine, we will leave the packet in the capable hands of the upper-layer protocols until next month. What still remains to be explored is layer 4 processing (TCP and UDP), PF_PACKET handling and, of course, the socket filter hooks and implementation. Be patient!

Resources

Creation of PF_PACKET Sockets

Gianluca Insolvibile has been a Linux enthusiast since kernel 0.99pl4. He currently deals with networking and digital video research and development. He can be reached at g.insolvibile@cpr.it.