Translation: Understanding Linux Network Internals, 2.1. The Socket Buffer: sk_buff Structure

Table of contents: http://www.cnblogs.com/WuCountry/archive/2008/11/15/1333960.html

2.1. The Socket Buffer: sk_buff Structure

This is probably the most important data structure in the Linux networking code, representing the headers for data that has been received or is about to be transmitted. Defined in the <include/linux/skbuff.h> include file, it consists of a tremendous heap of variables that try to be all things to all people.


The structure has changed many times in the history of the kernel, both to add new options and to reorganize existing fields into a cleaner layout. Its fields can be classified roughly into the following categories:

  • Layout

  • General

  • Feature-specific

  • Management functions

This structure is used by several different network layers (MAC or another link protocol on the L2 layer, IP on L3, TCP or UDP on L4), and various fields of the structure change as it is passed from one layer to another. L4 prepends a header before passing it to L3, which in turn puts on its own header before passing it to L2. Prepending headers is more efficient than copying the data from one layer to another. Since adding space to the beginning of a buffer (which means changing the variable that points to it) is a complicated operation, the kernel provides the skb_reserve function (described later in this chapter) to carry it out. Thus, one of the first things done by each protocol, as the buffer passes down through layers, is to call skb_reserve to reserve space for the protocol's header.[*] In the later section "Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull," we will see an example of how the kernel makes sure enough space is reserved at the head of the buffer to allow each layer to add its own header while the buffer traverses the layers.

[*] skb_reserve is also used by device drivers to align the IP header of ingress frames. See Chapter 10.

When the buffer passes up through the network layers, each header from the old layer is no longer of interest. The L2 header, for instance, is used only by the device drivers that handle the L2 protocol, so it is of no interest to L3. Instead of removing the L2 header from the buffer, the pointer to the beginning of the payload is moved ahead to the beginning of the L3 header, which requires fewer CPU cycles.

The rest of this section explains a basic principle about conditional (optional) fields, and then covers each of the categories just listed.

2.1.1. Networking Options and Kernel Structures

As you can see from glancing at TCP/IP specifications or configuring a kernel, network code provides an enormous number of options that are useful but not always required, such as a Firewall, Multicasting, and other features. Most of these options require additional fields in kernel data structures. Therefore, sk_buff is peppered with C preprocessor #ifdef directives. For example, near the bottom of the sk_buff definition you can find:

struct sk_buff {
    ... ... ...
#ifdef CONFIG_NET_SCHED
    __u32    tc_index;
#ifdef CONFIG_NET_CLS_ACT
    __u32    tc_verd;
    __u32    tc_classid;
#endif
#endif
};

This shows that the field tc_index is part of the data structure only if the CONFIG_NET_SCHED symbol is defined at compile time, which means that the right option (in this example, "Device Drivers -> Networking support -> Networking options -> QoS and/or fair queueing") has been enabled with some version of make config by an administrator or by an automated installation utility.

The previous example actually shows two nested options: the fields used by CONFIG_NET_CLS_ACT (packet classifier) are considered for inclusion only if support for "QoS and/or fair queueing" is present.

Notice, by the way, that the QoS option cannot be compiled as a module. The reason is that most of the consequences of enabling the option will not be reversible after the kernel is compiled. In general, any option that causes a change in a kernel data structure (such as adding the tc_index field to the sk_buff structure) renders the option unfit to be compiled as a module.
(Translator's note: in other words, this feature has to be built together with the kernel, using the kernel's own build options; it is not possible to compile the kernel first and then configure and build QoS as a module afterward. For background on the kernel and modules, see Understanding the Linux Kernel.)
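The effect of conditional fields on the structure layout can be seen in a small userspace sketch. Everything in it is hypothetical: demo_buf stands in for sk_buff, and DEMO_NET_SCHED and DEMO_NET_CLS_ACT stand in for the CONFIG_ symbols. It is not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for CONFIG_NET_SCHED and CONFIG_NET_CLS_ACT. */
#define DEMO_NET_SCHED 1
/* #define DEMO_NET_CLS_ACT 1 */

struct demo_buf {
    uint32_t len;
#ifdef DEMO_NET_SCHED
    uint32_t tc_index;          /* present only with DEMO_NET_SCHED */
#ifdef DEMO_NET_CLS_ACT
    uint32_t tc_verd;           /* these two need both options */
    uint32_t tc_classid;
#endif
#endif
};

/* Number of 32-bit fields compiled into this build of the structure. */
size_t demo_buf_fields(void)
{
    assert(sizeof(struct demo_buf) % sizeof(uint32_t) == 0);
    return sizeof(struct demo_buf) / sizeof(uint32_t);
}
```

With the definitions as written, the structure holds two fields. Defining DEMO_NET_CLS_ACT as well makes it four, so two pieces of code built with different settings would disagree about the layout of the same structure.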

You'll often want to find out which compile option from make config or its variants is associated with a given #ifdef symbol, to understand when a block of code is included in the kernel. The fastest way to make the association, in the 2.6 kernels, is to look for the symbol in the Kconfig files that are spread all over the source tree (one per directory). In 2.4 kernels, you can consult the file Documentation/Configure.help.

2.1.2. Layout Fields

A few of the sk_buff's fields exist just to facilitate searching and to organize the data structure itself. The kernel maintains all sk_buff structures in a doubly linked list. But the organization of this list is somewhat more complicated than that of a traditional doubly linked list.

Like any doubly linked list, this one is tied together by next and prev fields in each sk_buff structure, the next field pointing forward and the prev field pointing backward. But this list has another requirement: each sk_buff structure must be able to find the head of the whole list quickly. To implement this requirement, an extra structure of type sk_buff_head is inserted at the beginning of the list, as a kind of dummy element. The sk_buff_head structure is:

struct sk_buff_head {
    /* These two members must be first. */
    struct sk_buff    * next;
    struct sk_buff    * prev;

    __u32         qlen;
    spinlock_t    lock;
};

qlen represents the number of elements in the list. lock is used to prevent simultaneous accesses to the list and is described in the section "List management functions," later in this chapter.

The first two elements of both sk_buff and sk_buff_head are the same: the next and prev pointers. This allows the two structures to coexist in the same list, even though sk_buff_head is positively skimpy in comparison to sk_buff. In addition, the same functions can be used to manipulate both sk_buff and sk_buff_head.

To add to the complexity, every sk_buff structure contains a pointer to the single sk_buff_head structure. This pointer has the field name list. See Figure 2-1 for help finding your way around these data structures.
(Translator's note: or, quite the opposite, the list pointer may be there to reduce complexity.)
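The list layout described above can be sketched in userspace as follows. This is a minimal illustration, not the kernel's code: all demo_ names are hypothetical, locking is left out, and the shared next and prev pointers are expressed as a common first member (demo_links) so that the pointer conversions stay within ISO C rules, whereas the kernel simply declares the two pointers first in both structures.

```c
#include <assert.h>
#include <stddef.h>

/* Common first member: plays the role of the shared next/prev pair. */
struct demo_links { struct demo_links *next, *prev; };

struct demo_skb_head {
    struct demo_links links;        /* must be first */
    unsigned int qlen;              /* number of elements in the list */
};

struct demo_skb {
    struct demo_links links;        /* must be first */
    struct demo_skb_head *list;     /* back pointer to the list head */
    int id;
};

/* An empty list is a head whose links point to itself. */
void demo_head_init(struct demo_skb_head *h)
{
    h->links.next = h->links.prev = &h->links;
    h->qlen = 0;
}

/* Append skb at the tail and record which list it belongs to. */
void demo_queue_tail(struct demo_skb_head *h, struct demo_skb *skb)
{
    struct demo_links *last = h->links.prev;

    assert(skb->list == NULL);      /* must not already be queued */
    skb->list = h;
    skb->links.next = &h->links;
    skb->links.prev = last;
    last->next = &skb->links;
    h->links.prev = &skb->links;
    h->qlen++;
}

/* Remove and return the first element, or NULL if the list is empty. */
struct demo_skb *demo_dequeue(struct demo_skb_head *h)
{
    struct demo_links *first = h->links.next;
    struct demo_skb *skb;

    if (first == &h->links)
        return NULL;
    h->links.next = first->next;
    first->next->prev = &h->links;
    h->qlen--;
    skb = (struct demo_skb *)first; /* valid: links is the first member */
    skb->links.next = skb->links.prev = NULL;
    skb->list = NULL;
    return skb;
}
```

Because the head is an ordinary node in the ring, insertion and removal need no special case for an empty list, and any element can reach the head in one step through its list pointer.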

Figure 2-1. List of sk_buff elements


Other interesting fields of sk_buff follow:

struct sock *sk

This is a pointer to a sock data structure of the socket that owns this buffer. This pointer is needed when data is either locally generated or being received by a local process, because the data and socket-related information is used by L4 (TCP or UDP) and by the user application. When a buffer is merely being forwarded (that is, neither the source nor the destination is on the local machine), this pointer is NULL.


unsigned int len

This is the size of the block of data in the buffer. This length includes both the data in the main buffer (i.e., the one pointed to by head) and the data in the fragments.[*] Its value changes as the buffer moves from one network layer to the next, because headers are discarded while moving up in the stack and are added while moving down the stack. len accounts for protocol headers as well, as shown in Figure 2-8 in the section "Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull."

[*] See Chapter 21 for a discussion of fragmented buffers.

unsigned int data_len

Unlike len, data_len accounts only for the size of the data in the fragments.
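The relationship between the two length fields can be written down in a few lines. This is a userspace sketch with hypothetical demo_ names. The kernel offers helpers named skb_headlen and skb_is_nonlinear for the same two questions, but the code below is only an illustration of the arithmetic.

```c
#include <assert.h>

struct demo_buf {
    unsigned int len;       /* main buffer plus fragments */
    unsigned int data_len;  /* fragments only */
};

/* Bytes held in the main buffer. */
unsigned int demo_headlen(const struct demo_buf *b)
{
    assert(b->data_len <= b->len);
    return b->len - b->data_len;
}

/* Nonzero when part of the data lives in fragments. */
int demo_is_nonlinear(const struct demo_buf *b)
{
    return b->data_len != 0;
}
```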

unsigned int mac_len

This is the size of the MAC header.

atomic_t users

This is the reference count, or the number of entities using this sk_buff buffer. The main use of this parameter is to avoid freeing the sk_buff structure when someone is still using it. For this reason, each user of the buffer should increment and decrement this field when necessary. This counter covers only the users of the sk_buff data structure; the buffer containing the actual data is covered by a similar field (dataref) that will be introduced later in the chapter, in the section "The skb_shared_info structure and the skb_shinfo function."

users is sometimes incremented and decremented directly with the atomic_inc and atomic_dec functions, but most of the time it is manipulated with skb_get and kfree_skb.


unsigned int truesize

This field represents the total size of the buffer, including the sk_buff structure itself. It is initially set by the function alloc_skb to len+sizeof(sk_buff) when the buffer is allocated for a requested data space of len bytes.

struct sk_buff *alloc_skb(unsigned int size,int gfp_mask)
{
     ... ... ...
     skb->truesize = size + sizeof(struct sk_buff);
     ... ... ...
}

The field gets updated whenever skb->len is increased.


unsigned char *head


unsigned char *end


unsigned char *data


unsigned char *tail

These represent the boundaries of the buffer and the data within it. When each layer prepares the buffer for its activities, it may allocate more space for a header or for more data. head and end point to the beginning and end of the space allocated to the buffer, and data and tail point to the beginning and end of the actual data. See Figure 2-2. The layer can then fill in the gap between head and data with a protocol header, or the gap between tail and end with new data. You will see in the later section "Allocating memory: alloc_skb and dev_alloc_skb" that the buffer on the right side of Figure 2-2 includes an additional header at the bottom.
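How the four pointers move can be made concrete with a toy buffer in userspace. The sketch below is an assumption-laden model, not the kernel implementation: the demo_ functions only imitate what the text says skb_reserve, skb_put, skb_push, and skb_pull do, and they use assert where the kernel has its own error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_buf {
    unsigned char *head, *data, *tail, *end;
    size_t len;                       /* bytes between data and tail */
};

/* Allocate size bytes; the buffer starts empty: data == tail == head. */
int demo_buf_init(struct demo_buf *b, size_t size)
{
    b->head = malloc(size);
    if (b->head == NULL)
        return -1;
    b->data = b->tail = b->head;
    b->end = b->head + size;
    b->len = 0;
    return 0;
}

size_t demo_headroom(const struct demo_buf *b) { return (size_t)(b->data - b->head); }
size_t demo_tailroom(const struct demo_buf *b) { return (size_t)(b->end - b->tail); }

/* Leave n bytes of headroom in a still-empty buffer. */
void demo_reserve(struct demo_buf *b, size_t n)
{
    assert(b->len == 0 && n <= demo_tailroom(b));
    b->data += n;
    b->tail += n;
}

/* Grow the data area at the tail; returns where the new bytes start. */
unsigned char *demo_put(struct demo_buf *b, size_t n)
{
    unsigned char *start = b->tail;

    assert(n <= demo_tailroom(b));
    b->tail += n;
    b->len += n;
    return start;
}

/* Grow the data area at the front, for example to add a header. */
unsigned char *demo_push(struct demo_buf *b, size_t n)
{
    assert(n <= demo_headroom(b));
    b->data -= n;
    b->len += n;
    return b->data;
}

/* Drop n bytes from the front, for example to step over a header. */
unsigned char *demo_pull(struct demo_buf *b, size_t n)
{
    assert(n <= b->len);
    b->data += n;
    b->len -= n;
    return b->data;
}
```

A sender would typically reserve room for all expected headers, put the payload, and then push one header per layer. A receiver does the opposite and pulls one header per layer. No data is copied in either direction.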

Figure 2-2. head/end versus data/tail pointers


void (*destructor)(...)

This function pointer can be initialized to a routine that performs some activity when the buffer is removed. When the buffer does not belong to a socket, the destructor is usually not initialized. When the buffer belongs to a socket, it is usually set to sock_rfree or sock_wfree (by the skb_set_owner_r and skb_set_owner_w initialization functions, respectively). The two sock_xxx routines are used to update the amount of memory held by the socket in its queues.

2.1.3. General Fields

This section covers the majority of sk_buff fields, which are not associated with specific kernel features:


struct timeval stamp

This is usually meaningful only for a received packet. It is a timestamp that represents when a packet was received or (occasionally) when one is scheduled for transmission. It is set by the function netif_rx with net_timestamp, which is called by the device driver after the reception of each packet and is described in Chapter 21.


struct net_device *dev

This field, whose type (net_device) will be described in more detail later in the chapter, describes a network device. The role of the device represented by dev depends on whether the packet stored in the buffer is about to be transmitted or has just been received.

When a packet is received, the device driver updates this field with the pointer to the data structure representing the receiving interface, as illustrated by the following piece of code from vortex_rx, the function called by the driver of the 3c59x Ethernet card series when receiving a frame (in drivers/net/3c59x.c):

static int vortex_rx(struct net_device *dev)
{
           ... ... ...
        skb->dev = dev;
           ... ... ...
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb); /* Pass the packet to the higher layer */
           ... ... ...
}

When a packet is to be transmitted, this parameter represents the device through which it will be sent out. The code that sets the value is more complicated than the code for receiving a packet, so I will postpone a discussion until Chapter 21 and Chapter 35.

Some network features allow a few devices to be grouped together to represent a single virtual interface (that is, one that is not directly associated with a hardware device), served by a virtual device driver. When the device driver is invoked, the dev parameter points to the virtual device's net_device data structure. The driver chooses a specific device from its group and changes the dev parameter to point to the net_device data structure of that device. Under these circumstances, therefore, the pointer to the transmitting device may be changed during packet processing.


struct net_device *input_dev

This is the device the packet has been received from. It is a NULL pointer when the packet has been generated locally. For Ethernet devices, it is initialized in eth_type_trans (see Chapters 10 and 13). It is used mainly by Traffic Control.


struct net_device *real_dev

This field is meaningful only for virtual devices, and represents the real device the virtual one is associated with. The Bonding and VLAN interfaces use it, for example, to remember where the real device ingress traffic is received from.


union {...} h


union {...} nh


union {...} mac

These are pointers to the protocol headers of the TCP/IP stack: h for L4, nh for L3, and mac for L2. Each field points to a union of various structures, one structure for each protocol understood by the kernel at that layer. For instance, h is a union that includes a field for the header of each L4 protocol understood by the kernel. One member of each union is called raw and is used for initialization; all later accesses are through the protocol-specific members.

When receiving a data packet, the function responsible for processing the layer n header receives a buffer from layer n-1 with skb->data pointing to the beginning of the layer n header. The function that handles layer n initializes the proper pointer for this layer (for instance, skb->nh for L3 handlers) to preserve the skb->data field, because the contents of this pointer will be lost during the processing at the next layer, when skb->data is initialized to a different offset within the buffer. The function then completes the layer n processing and, before passing the packet to the layer n+1 handler, updates skb->data to make it point to the end of the layer n header, which is the beginning of the layer n+1 header (see Figure 2-3).

Sending a packet reverses this process, with the added complexity of adding a new header at each layer.
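The receive-side bookkeeping described above can be sketched for an Ethernet frame carrying IPv4. This is a simplified userspace model under stated assumptions: the demo_ names are hypothetical, the mac, nh, and h members are plain byte pointers playing the role of the raw members of the kernel unions, and only the header-length field of IPv4 is parsed.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ETH_HLEN 14    /* Ethernet header: 6 + 6 + 2 bytes */

struct demo_rx_buf {
    unsigned char *data;    /* start of the part still to be processed */
    size_t len;
    unsigned char *mac;     /* L2 header, in the role of skb->mac.raw */
    unsigned char *nh;      /* L3 header, in the role of skb->nh.raw */
    unsigned char *h;       /* L4 header, in the role of skb->h.raw */
};

/* L2 step: remember where the MAC header is, then step over it. */
void demo_l2_receive(struct demo_rx_buf *b)
{
    assert(b->len >= DEMO_ETH_HLEN);
    b->mac = b->data;
    b->data += DEMO_ETH_HLEN;
    b->len -= DEMO_ETH_HLEN;
}

/* L3 step (IPv4): the header length is the low nibble of byte 0,
 * counted in 32-bit words. */
void demo_l3_receive(struct demo_rx_buf *b)
{
    size_t ihl;

    assert(b->len >= 20);
    b->nh = b->data;                /* preserve the L3 header position */
    ihl = (size_t)(b->data[0] & 0x0f) * 4;
    assert(ihl >= 20 && ihl <= b->len);
    b->data += ihl;                 /* data now points at the L4 header */
    b->len -= ihl;
    b->h = b->data;
}
```

After both steps the earlier headers are still in memory and reachable through mac and nh, while data has moved forward to the start of the L4 header.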

Figure 2-3. Header's pointer initializations while moving from layer two to layer three


struct dst_entry *dst

This is used by the routing subsystem. Because the data structure is quite complex and requires knowledge of how other subsystems work, I'll postpone a description of it until Part VII.


char cb[40]

This is a "control buffer," or storage for private information, maintained by each layer for internal use. It is statically allocated within the sk_buff structure (currently with a size of 40 bytes) and is large enough to hold whatever private data is needed by each layer. In the code for each layer, access is done through macros to make the code more readable. TCP, for example, uses that space to store a tcp_skb_cb data structure, which is defined in include/net/tcp.h:

struct tcp_skb_cb {
    ... ... ...
    __u32        seq;        /* Starting sequence number */
    __u32        end_seq;    /* SEQ + FIN + SYN + datalen*/
    __u32        when;       /* used to compute rtt's    */
    __u8         flags;      /* TCP header flags.        */
    ... ... ...
};

And this is the macro used by the TCP code to access the structure. The macro consists simply of a pointer cast:

#define TCP_SKB_CB(__skb)    ((struct tcp_skb_cb *)&((__skb)->cb[0]))

Here is an example where the TCP subsystem fills in the structure upon receipt of a segment:

int tcp_v4_rcv(struct sk_buff *skb)
{
        ... ... ...
        th = skb->h.th;
        TCP_SKB_CB(skb)->seq = ntohl(th->seq);
        TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
                                    skb->len - th->doff * 4);
        TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
        TCP_SKB_CB(skb)->when = 0;
        TCP_SKB_CB(skb)->flags = skb->nh.iph->tos;
        TCP_SKB_CB(skb)->sacked = 0;
        ... ... ...
}
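The idea of a per-layer record stored in a fixed-size scratch area can be tried outside the kernel. The following sketch is not the kernel's approach in one respect: the kernel overlays the structure on cb with a pointer cast, while this version copies the record in and out with memcpy to stay within ISO C aliasing and alignment rules. The demo_ types are hypothetical and much smaller than tcp_skb_cb.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct demo_buf {
    char cb[40];            /* private scratch area for the current layer */
};

/* Hypothetical per-layer record; not the kernel's tcp_skb_cb. */
struct demo_tcp_cb {
    uint32_t seq;
    uint32_t end_seq;
    uint8_t  flags;
};

/* The record must fit in the control buffer; checked at compile time. */
_Static_assert(sizeof(struct demo_tcp_cb) <= sizeof(((struct demo_buf *)0)->cb),
               "demo_tcp_cb does not fit in cb");

void demo_cb_store(struct demo_buf *b, const struct demo_tcp_cb *cb)
{
    memcpy(b->cb, cb, sizeof(*cb));
}

struct demo_tcp_cb demo_cb_load(const struct demo_buf *b)
{
    struct demo_tcp_cb cb;

    memcpy(&cb, b->cb, sizeof(cb));
    return cb;
}

/* Fill the record using the same end_seq arithmetic as the excerpt:
 * seq + SYN + FIN + payload length. */
void demo_fill(struct demo_buf *b, uint32_t seq,
               unsigned int syn, unsigned int fin, uint32_t payload_len)
{
    struct demo_tcp_cb cb = { seq, seq + syn + fin + payload_len, 0 };

    assert(syn <= 1 && fin <= 1);
    demo_cb_store(b, &cb);
}
```

The compile-time size check is the part worth keeping in any variant: a layer whose private record outgrows the control buffer would otherwise overwrite the fields that follow it.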

To see how the parameters in the cb buffer are retrieved, take a look at the function tcp_transmit_skb in net/ipv4/tcp_output.c. That function is used by TCP to push a data segment down to the IP layer for transmission.

In Chapter 22, you will also see how IPv4 uses cb to store information about IP fragmentation.


unsigned int csum


unsigned char ip_summed

These represent the checksum and associated status flag. Their use is described in Chapter 19.


unsigned char cloned

A boolean flag that, when set, indicates that this structure is a clone of another sk_buff buffer. See the later section "Cloning and copying buffers."


unsigned char pkt_type

This field classifies the type of frame based on its L2 destination address. The possible values are listed in include/linux/if_packet.h. For Ethernet devices, this parameter is initialized by the function eth_type_trans, which is described in Chapter 13.

The main values it can be assigned are:


PACKET_HOST

The destination address of the received frame is that of the receiving interface; in other words, the packet has reached its destination.


PACKET_MULTICAST

The destination address of the received frame is one of the multicast addresses to which the interface is registered.


PACKET_BROADCAST

The destination address of the received frame is the broadcast address of the receiving interface.


PACKET_OTHERHOST

The destination address of the received frame does not belong to the ones associated with the interface (unicast, multicast, and broadcast); thus, the frame will have to be forwarded if forwarding is enabled, and dropped otherwise.


PACKET_OUTGOING

The packet is being sent out; among the users of this flag are the Decnet protocol and the function that gives each network tap a copy of the outgoing packet (see dev_queue_xmit_nit in Chapter 11).


PACKET_LOOPBACK

The packet is being sent out to the loopback device. Thanks to this flag, when dealing with the loopback device, the kernel can skip some operations needed for real devices.


PACKET_FASTROUTE

The packet is being routed using the Fastroute feature. Fastroute support is not available anymore in 2.6 kernels.

Chapter 13 details how those values are set based on the L2 destination address value.
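For the four receive-side values, the classification by L2 destination address can be sketched as below. This is a simplified userspace model and not the code of eth_type_trans: the demo_ names are hypothetical, and the sketch treats every group address other than the broadcast address as multicast without checking whether the interface is registered for it.

```c
#include <assert.h>
#include <string.h>

#define DEMO_ETH_ALEN 6     /* bytes in an Ethernet address */

enum demo_pkt_type {
    DEMO_PACKET_HOST,
    DEMO_PACKET_BROADCAST,
    DEMO_PACKET_MULTICAST,
    DEMO_PACKET_OTHERHOST
};

/* Classify a received frame from its destination address and the
 * unicast and broadcast addresses of the receiving interface. */
enum demo_pkt_type demo_classify(const unsigned char *dest,
                                 const unsigned char *dev_addr,
                                 const unsigned char *dev_broadcast)
{
    assert(dest && dev_addr && dev_broadcast);
    if (dest[0] & 1) {      /* group bit set: broadcast or multicast */
        if (memcmp(dest, dev_broadcast, DEMO_ETH_ALEN) == 0)
            return DEMO_PACKET_BROADCAST;
        return DEMO_PACKET_MULTICAST;
    }
    if (memcmp(dest, dev_addr, DEMO_ETH_ALEN) == 0)
        return DEMO_PACKET_HOST;
    return DEMO_PACKET_OTHERHOST;
}
```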


__u32 priority

This indicates the Quality of Service (QoS) class of a packet being transmitted or forwarded. If the packet is generated locally, the socket layer defines the priority value. If instead the packet is being forwarded, the function rt_tos2priority (called from the ip_forward function) defines the value of the field according to the value of the Type of Service (ToS) field in the IP header itself. The value of this parameter has nothing to do with the DiffServ Code Point (DSCP) described in Chapter 18. I will discuss its role in the section "ip_forward Function" in Chapter 20.


unsigned short protocol

This is the protocol used at the next-higher layer from the perspective of the device driver at L2. Typical protocols listed here are IP, IPv6, and ARP; a complete list is available in include/linux/if_ether.h. Since each protocol has its own function handler for the processing of incoming packets, this field is used by the driver to inform the layer above it what handler to use. Each driver calls netif_rx to invoke the handler for the upper network layer, so the protocol field must be initialized before that function is invoked. See Chapters 10 and 13 for more detail.
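The role of the protocol field as a selector for the upper-layer handler can be illustrated with a lookup table. This is a userspace sketch with hypothetical demo_ names and dummy handlers. The kernel keeps its registered handlers in its own data structures and stores the protocol in network byte order, whereas this example uses host-order constants for simplicity. The three EtherType values are the standard ones for IPv4, ARP, and IPv6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_P_IP   0x0800
#define DEMO_ETH_P_ARP  0x0806
#define DEMO_ETH_P_IPV6 0x86DD

typedef int (*demo_handler_t)(const unsigned char *data, size_t len);

/* Dummy handlers: each returns a distinct value so calls can be told apart. */
int demo_ip_rcv(const unsigned char *d, size_t n)   { (void)d; (void)n; return 4; }
int demo_arp_rcv(const unsigned char *d, size_t n)  { (void)d; (void)n; return 1; }
int demo_ipv6_rcv(const unsigned char *d, size_t n) { (void)d; (void)n; return 6; }

struct demo_packet_type {
    uint16_t type;
    demo_handler_t func;
};

static const struct demo_packet_type demo_handlers[] = {
    { DEMO_ETH_P_IP,   demo_ip_rcv },
    { DEMO_ETH_P_ARP,  demo_arp_rcv },
    { DEMO_ETH_P_IPV6, demo_ipv6_rcv },
};

/* Pass the packet to the handler registered for its protocol.
 * Returns -1 when no handler is known (the packet would be dropped). */
int demo_deliver(uint16_t protocol, const unsigned char *data, size_t len)
{
    size_t i;

    assert(data != NULL);
    for (i = 0; i < sizeof(demo_handlers) / sizeof(demo_handlers[0]); i++)
        if (demo_handlers[i].type == protocol)
            return demo_handlers[i].func(data, len);
    return -1;
}
```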


unsigned short security

This is the security level of the packet. This field was originally introduced for use with IPsec but is no longer used.

2.1.4. Feature-Specific Fields

The Linux kernel is modular, allowing you to select what to include and what to leave out. Thus, some fields are included in the sk_buff data structure only if the kernel is compiled with support for particular features such as firewalling (Netfilter) or QoS:


unsigned long nfmark


__u32 nfcache


__u32 nfctinfo


struct nf_conntrack *nfct


unsigned int nfdebug


struct nf_bridge_info *nf_bridge

These parameters are used by Netfilter (the firewall code), and more specifically by the kernel option "Device Drivers -> Networking support -> Networking options -> Network packet filtering" and its two suboptions, "Network packet filtering debugging" and "Bridged IP/ARP packets filtering."


union {...} private

This union is used by the High Performance Parallel Interface (HIPPI). The associated kernel option is "Device Drivers -> Networking support -> Network device support -> HIPPI driver support."


__u32 tc_index


__u32 tc_verd


__u32 tc_classid

These parameters are used by Traffic Control, and more specifically by the kernel option "Device Drivers -> Networking support -> Networking options -> QoS and/or fair queueing" and its suboption, "Packet classifier API."


struct sec_path *sp

This is used by the IPsec protocol suite to keep track of transformations.

2.1.5. Management Functions

Lots of functions, usually very short and simple, are offered by the kernel to manipulate sk_buff elements or lists of elements. With the help of Figure 2-4, I'll describe the most important ones. First we will see the functions used to allocate and free buffers, and then the ones used to manipulate the pointers (e.g., skb->data) to reserve space at the head or at the tail of a frame.

If you take a look at the files include/linux/skbuff.h and net/core/skbuff.c, you will notice that almost all of the functions exist in two versions, with names like do_something and __do_something. Usually, the first one is a wrapper that adds extra sanity checks or locking mechanisms around a call to the second one. The internal __do_something form is generally not called directly (unless specific conditions are met, such as lock requirements, to name one). Exceptions to that rule are usually poorly coded functions that will be fixed eventually.

Figure 2-4. Before and after: (a)skb_put, (b)skb_push, (c)skb_pull, and (d)skb_reserve


2.1.5.1. Allocating memory: alloc_skb and dev_alloc_skb

alloc_skb is the main function for the allocation of buffers and is defined in net/core/skbuff.c. We have already seen that the data buffer and the header (the sk_buff data structure) are two different entities, which means that creating a single buffer involves two allocations of memory (one for the buffer and one for the sk_buff structure).
(Translator's note: the data field of the sk_buff structure points into the allocated data buffer.)

alloc_skb takes an sk_buff data structure from a cache by calling the function kmem_cache_alloc, and gets a data buffer by calling kmalloc, which also uses cached memory if it is available. The code (slightly simplified) is:

    skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask & ~__GFP_DMA);
    ... ... ...
    size = SKB_DATA_ALIGN(size);
    data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);

Before calling kmalloc, the size parameter is tuned with the macro SKB_DATA_ALIGN to force alignment. Before returning, the function initializes a few parameters in the structure, producing the final result shown in Figure 2-5.

At the bottom of the memory block on the right side of Figure 2-5 you can see the padding area introduced to force the alignment. The skb_shared_info block is mainly used to handle IP fragments and is described later in this chapter. The fields shown on the left side of the figure were explained earlier.
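The two allocations and the size rounding can be imitated in userspace. The sketch below rests on several assumptions that are not taken from the kernel source: malloc replaces kmem_cache_alloc and kmalloc, DEMO_ALIGN_BYTES is an arbitrary 64-byte alignment, demo_shared_info is a one-field placeholder for skb_shared_info, and truesize is computed from the rounded-up size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_ALIGN_BYTES 64u    /* assumed alignment; must be a power of two */
/* Round x up to the next multiple of DEMO_ALIGN_BYTES. */
#define DEMO_DATA_ALIGN(x) \
    (((x) + (DEMO_ALIGN_BYTES - 1)) & ~(size_t)(DEMO_ALIGN_BYTES - 1))

struct demo_shared_info { unsigned int nr_frags; };  /* placeholder trailer */

struct demo_buf {
    unsigned char *head, *data, *tail, *end;
    size_t truesize;            /* data area plus the descriptor itself */
};

struct demo_buf *demo_alloc(size_t size)
{
    struct demo_buf *b;
    unsigned char *data;

    /* Guard against overflow in the size arithmetic below. */
    assert(size <= (size_t)-1 - DEMO_ALIGN_BYTES - sizeof(struct demo_shared_info));

    b = malloc(sizeof(*b));                 /* first allocation: descriptor */
    if (b == NULL)
        return NULL;
    size = DEMO_DATA_ALIGN(size);
    data = malloc(size + sizeof(struct demo_shared_info));  /* second: data */
    if (data == NULL) {
        free(b);
        return NULL;
    }
    b->head = b->data = b->tail = data;     /* empty buffer */
    b->end = data + size;                   /* the trailer starts at end */
    b->truesize = size + sizeof(*b);
    ((struct demo_shared_info *)b->end)->nr_frags = 0;
    return b;
}

void demo_free(struct demo_buf *b)
{
    free(b->head);
    free(b);
}
```

A request for 100 bytes therefore yields a 128-byte data area, and the padding between the requested size and the rounded size is the area shown at the bottom of the figure.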

Figure 2-5. alloc_skb function


dev_alloc_skb is the buffer allocation function meant for use by device drivers and expected to be executed in interrupt mode. It is simply a wrapper around alloc_skb that adds 16 bytes to the requested size for optimization reasons and asks for an atomic operation (GFP_ATOMIC) since it will be called from within an interrupt handler routine:

static inline struct sk_buff *dev_alloc_skb(unsigned int length)
{
    return __dev_alloc_skb(length, GFP_ATOMIC);
}


static inline
struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask)
{
    struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
    if (likely(skb))
            skb_reserve(skb, 16);
    return skb;
}

This definition of __dev_alloc_skb is the default one, used when there is no architecture-specific definition.

2.1.5.2. Freeing memory: kfree_skb and dev_kfree_skb

These two functions release a buffer, returning it to the buffer pool (cache). kfree_skb is called both directly and through the dev_kfree_skb wrapper. The latter is meant for device drivers and exists only to provide a name that parallels dev_alloc_skb; it is a simple macro that does nothing but call kfree_skb. kfree_skb actually releases a buffer only when the skb->users counter is 1, that is, when the caller holds the last remaining reference. Otherwise, the function simply decrements the counter. So if a buffer had three users, only the third call to dev_kfree_skb or kfree_skb would free memory.

The flowchart in Figure 2-6 shows all the steps involved in freeing a buffer. As you will see in Chapter 33, an sk_buff structure can hold a reference to a dst_entry data structure. Therefore, when the sk_buff structure is freed, dst_release must also be called to decrement the reference count on the associated dst_entry data structure.

If the destructor function pointer has been initialized, the function it points to is called at this point (see the section "Layout Fields" earlier in this chapter).

Figure 2-5 showed a simple scenario: an sk_buff data structure associated with another memory block where the actual data is stored. However, the skb_shared_info data structure at the bottom of that data block, as shown in Figure 2-5, can hold pointers to other memory fragments. See Chapter 21 for some examples. When such fragments are present, kfree_skb releases the memory they hold as well. Finally, the sk_buff data structure is returned to the skbuff_head_cache cache.
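The reference-counting rule for skb->users can be sketched as follows. This is a simplified user-space model under stated assumptions: toy_skb and toy_kfree_skb are hypothetical names, a plain int replaces the kernel's atomic counter, and the dst_entry and fragment handling from Figure 2-6 is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for sk_buff. */
struct toy_skb {
    int users;                               /* references to this descriptor */
    void (*destructor)(struct toy_skb *);    /* optional cleanup hook */
    unsigned char *head;                     /* start of the data block */
};

/*
 * Drop one reference. Returns 1 when the memory was actually released,
 * 0 when only the counter was decremented.
 */
int toy_kfree_skb(struct toy_skb *skb)
{
    if (!skb)
        return 0;
    if (skb->users > 1) {        /* other users remain: just drop a reference */
        skb->users--;
        return 0;
    }
    if (skb->destructor)         /* last user: run the destructor if one is set */
        skb->destructor(skb);
    free(skb->head);             /* release the data block ... */
    free(skb);                   /* ... and then the descriptor */
    return 1;
}
```

With users set to 3, the first two calls return 0 and only the third one frees the memory, which matches the three-user example in the text.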

2.1.5.3. Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull

skb_reserve reserves some space (headroom) at the head of the buffer and is commonly used to allow the insertion of a header or to force data to be aligned on some boundary. The function shifts the data and tail pointers (discussed earlier in the section "Layout Fields"), which mark the beginning and the end of the payload, respectively. Figure 2-4(d) shows the result of calling skb_reserve(skb,n). This function is usually called soon after the buffer is allocated, when data and tail still point to the same location.

If you look at the receive functions of the Ethernet drivers (for instance, vortex_rx in drivers/net/3c59x.c), you will see that they all use the following statement before storing any data in the buffer they have just allocated:

skb_reserve(skb, 2);    /* Align IP on 16 byte boundaries */

Figure 2-6. kfree_skb function


Because the drivers know that they are about to copy into the buffer an Ethernet frame whose header is 14 octets long, they pass an argument of 2 to shift the start of the data by 2 bytes. As a result, the IP header, which immediately follows the Ethernet header, starts 16 bytes (2 + 14) from the beginning of the buffer and is therefore aligned on a 16-byte boundary, as shown in Figure 2-7.

Figure 2-7. (a) before skb_reserve, (b) after skb_reserve, and (c) after copying the frame on the buffer
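The arithmetic behind the 2-byte reservation is small enough to check directly. The helper below is a hypothetical illustration (toy_ip_offset and TOY_ETH_HLEN are names made up for this sketch); it only computes where the IP header lands for a given headroom.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14   /* Ethernet header length in bytes */

/* Offset of the IP header from the start of the buffer for a given headroom. */
size_t toy_ip_offset(size_t headroom)
{
    return headroom + TOY_ETH_HLEN;
}
```

With a headroom of 2 the IP header starts at offset 16; with no headroom it would start at offset 14, which is not a multiple of 16 (or even of 4).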


Figure 2-8 shows an example of using skb_reserve in the opposite direction, during data transmission.

Figure 2-8. Buffer that is filled in while traversing the stack from the TCP layer down to the link layer


  1. When TCP is asked to transmit some data, it allocates a buffer following certain criteria (TCP Maximum Segment Size (mss), support for scatter-gather I/O, etc.).

  2. TCP reserves (with skb_reserve) enough space at the head of the buffer to hold the headers of all layers (TCP, IP, link layer). The parameter MAX_TCP_HEADER is the sum of the headers of all layers, calculated for the worst case: because the TCP layer does not know what type of interface will be used for the transmission (translator's note: it need not be Ethernet), it reserves the biggest possible header for each layer. It even accounts for the possibility of multiple IP headers (which can occur when the kernel is compiled with support for IP over IP).

  3. The TCP payload is copied into the buffer. Note that Figure 2-8 is just an example. The TCP payload could be organized differently; for example, it could be stored as fragments. In Chapter 21, we will see what a fragmented buffer (also commonly called a paged buffer) looks like.

  4. The TCP layer adds its header.

  5. The TCP layer hands the buffer to the IP layer, which adds its header as well.

  6. The IP layer hands the IP packet to the neighboring layer, which adds the link layer header.

Note that while the buffer travels down the network stack, each protocol moves skb->data toward the head of the buffer (into the reserved headroom), copies in its header, and updates skb->len. All of this is accomplished with the functions we saw in Figure 2-4.
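The six transmit steps can be traced with a toy buffer that tracks offsets instead of pointers. This is a sketch under loud assumptions: the header sizes (14, 20, 20) are illustrative minimums, TOY_MAX_HEADER is not how the kernel computes MAX_TCP_HEADER, and toy_buf, toy_prepend, and toy_tcp_transmit are hypothetical names. Each "header" is just filled with a tag byte so the layout can be inspected.

```c
#include <assert.h>
#include <string.h>

/* Illustrative header sizes; not the kernel's worst-case values. */
enum {
    TOY_LL_HLEN = 14,
    TOY_IP_HLEN = 20,
    TOY_TCP_HLEN = 20,
    TOY_MAX_HEADER = TOY_LL_HLEN + TOY_IP_HLEN + TOY_TCP_HLEN
};

struct toy_buf {
    unsigned char mem[256];
    unsigned int data;   /* offset of the first used byte */
    unsigned int tail;   /* offset one past the last used byte */
};

/* Prepend hdr_len bytes in the headroom and fill them with a tag byte. */
void toy_prepend(struct toy_buf *b, unsigned int hdr_len, unsigned char tag)
{
    assert(b->data >= hdr_len);          /* enough headroom must remain */
    b->data -= hdr_len;
    memset(b->mem + b->data, tag, hdr_len);
}

/* Walk the transmit path; returns the final frame length. */
unsigned int toy_tcp_transmit(struct toy_buf *b, const void *payload,
                              unsigned int n)
{
    assert(n <= sizeof(b->mem) - TOY_MAX_HEADER);
    b->data = b->tail = TOY_MAX_HEADER;      /* step 2: reserve headroom */
    memcpy(b->mem + b->tail, payload, n);    /* step 3: copy the payload */
    b->tail += n;
    toy_prepend(b, TOY_TCP_HLEN, 'T');       /* step 4: TCP header */
    toy_prepend(b, TOY_IP_HLEN, 'I');        /* step 5: IP header */
    toy_prepend(b, TOY_LL_HLEN, 'L');        /* step 6: link layer header */
    return b->tail - b->data;
}
```

Because the reservation here equals the exact sum of the three headers, the headroom is fully consumed once the link layer header is added. In the kernel the reservation is a worst-case figure, so some headroom usually remains.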

Note that the skb_reserve function does not really move anything into or within the data buffer; it simply updates the two pointers, as depicted in Figure 2-4(d).

static inline void skb_reserve(struct sk_buff *skb, unsigned int len)
{
    skb->data+=len;
    skb->tail+=len;
}

skb_push adds one block of data to the beginning of the buffer, and skb_put adds one to the end. Like skb_reserve, these functions don't really add any data to the buffer; they simply move the data or tail pointer, respectively, and adjust skb->len accordingly. The new data is supposed to be copied in explicitly by other functions. skb_pull removes a block of data from the head of the buffer by moving the data pointer forward. Figure 2-4 shows how these functions work.
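The pointer movements of the four functions can be modeled in a few lines of user-space C. This is a minimal sketch, not the kernel implementation: toy_skb, toy_init, and the toy_* operations are hypothetical names, and the bounds checks use assert where the kernel has its own over-run handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the layout fields of sk_buff. */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

/* Attach an empty buffer of the given size. */
void toy_init(struct toy_skb *skb, unsigned char *mem, size_t size)
{
    skb->head = skb->data = skb->tail = mem;
    skb->end = mem + size;
    skb->len = 0;
}

/* Reserve headroom: both pointers move, len is untouched. */
void toy_reserve(struct toy_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Extend the used area at the end; returns where the new bytes start. */
unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;
    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Extend the used area at the front, into the headroom. */
unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Drop n bytes from the front of the used area. */
unsigned char *toy_pull(struct toy_skb *skb, unsigned int n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

None of these helpers writes to the buffer. The caller copies data into the region returned by toy_put or toy_push, mirroring the way the text says the new data is copied in by other functions.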

2.1.5.4. The skb_shared_info structure and the skb_shinfo function

As shown in Figure 2-5, there is a structure called skb_shared_info at the end of the data buffer that keeps additional information about the data block. The structure starts at the address held in the end pointer, which marks the end of the data. This is the definition of the data structure:

struct skb_shared_info {
    atomic_t        dataref;
    unsigned int    nr_frags;
    unsigned short  tso_size;
    unsigned short  tso_seqs;
    struct sk_buff  *frag_list;
    skb_frag_t      frags[MAX_SKB_FRAGS];
};

dataref represents the number of "users" of the data block and is described in the next section, "Cloning and copying buffers." nr_frags, frag_list, and frags are used to handle IP fragments and are described in Chapter 21. The skb_is_nonlinear routine can be used to check whether the buffer is fragmented, and skb_linearize[*] can be used to collapse the fragments into a single flat buffer. Collapsing the fragments involves copying, which introduces a performance penalty.

[*] See the section "dev_queue_xmit Function" in Chapter 11 for an example of its use.

Some network interface cards (NICs) can handle in hardware some of the tasks that have traditionally been done by the CPU. The most common example is the computation of the L3 and L4 checksums. Some NICs can even maintain the L4 protocol's state machines. For the code shown here, the relevant feature is TCP segmentation offload, where the NIC implements a subset of the TCP layer. tso_size and tso_seqs are used by this feature.

Note that there is no field inside the sk_buff structure pointing at the skb_shared_info data structure. To access that structure, functions need to use the skb_shinfo macro, which simply returns the end pointer cast to the appropriate type:

#define skb_shinfo(SKB)    ((struct skb_shared_info *)((SKB)->end))

The following statement, for instance, shows how the macro is used to increment a field of the private block:

skb_shinfo(skb)->dataref++;

2.1.5.5. Cloning and copying buffers

When the same buffer needs to be processed independently by different consumers, and they may need to change the content of the sk_buff descriptor (the h and nh pointers to the protocol headers), the kernel does not need to make a complete copy of both the sk_buff structure and the associated data buffers. Instead, to be more efficient, the kernel can clone the original, which consists of copying only the sk_buff structure and adjusting the reference counts so that the shared data block is not released prematurely. Buffer cloning is done with the skb_clone function.

One situation that calls for cloning is when an ingress packet needs to be delivered to multiple recipients, such as the protocol handler and one or more network taps (see Chapter 21).

The sk_buff clone is not linked to any list and has no reference to the socket owner. The field skb->cloned is set to 1 in both the clone and the original buffer. skb->users is set to 1 in the clone so that the first attempt to free it succeeds, and the number of references (dataref) to the buffer containing the data is incremented (since there is now one more sk_buff data structure pointing to it). Figure 2-9 shows an example of a cloned buffer.

Figure 2-9. skb_clone function
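The bookkeeping performed when cloning can be sketched in user space as shown below. This is a simplified model, not skb_clone itself: toy_skb, toy_data, and toy_skb_clone are hypothetical names, plain ints replace atomic counters, and for brevity dataref is kept at the start of the data block rather than in a structure at its end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical data block with its reference counter. */
struct toy_data {
    int dataref;                 /* descriptors pointing at this block */
    unsigned char bytes[64];
};

/* Hypothetical, reduced stand-in for sk_buff. */
struct toy_skb {
    struct toy_skb *next;        /* list linkage */
    void *sk;                    /* owning socket */
    int users;
    int cloned;
    struct toy_data *data;       /* shared data block */
};

struct toy_skb *toy_skb_clone(struct toy_skb *orig)
{
    struct toy_skb *c = malloc(sizeof(*c));
    if (!c)
        return NULL;

    *c = *orig;                  /* copy the descriptor only, not the data */
    c->next = NULL;              /* the clone is not on any list */
    c->sk = NULL;                /* and has no socket owner */
    c->users = 1;                /* so the first free of the clone succeeds */
    c->cloned = 1;
    orig->cloned = 1;            /* both descriptors are marked as cloned */
    c->data->dataref++;          /* one more descriptor shares the data */
    return c;
}
```

After a clone, both descriptors point at the same toy_data block, so a write through either one would be visible through the other. That is why the text requires a copy before the data is modified.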


The skb_cloned routine can be used to check the cloned status of an skb buffer.

Figure 2-9 shows an example of a fragmented buffer, that is to say, a buffer that has some data stored in data fragments linked with the frags array. We will see how fragmented buffers are used in Chapter 21; for now, let's not bother with those details.

The skb_share_check routine can be used to check the reference count skb->users and to clone the buffer skb when the users field says the buffer is shared.

When a buffer is cloned, the contents of the data block must not be modified. This means that code can access the data without any need for locking. When, however, a function needs to modify not only the contents of the sk_buff structure but the data too, it needs to copy the data block as well. In this case, the programmer has two options. A programmer who knows that only the data in the area between skb->head and skb->end needs to be modified can use pskb_copy to copy just that area. A programmer who may need to modify the content of the fragment data blocks too must use skb_copy. The result of both pskb_copy and skb_copy is shown in Figure 2-10. You will see in Chapter 21 that the skb_shared_info data structure can include a list of sk_buff structures too (linked to a field called frag_list). That list is handled by pskb_copy and skb_copy in the same way as the frags array (this detail has been omitted from Figure 2-10 to keep the latter more readable).

Figure 2-10. (a) pskb_copy function and (b) skb_copy function
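The difference between the two kinds of copy can be illustrated with a toy buffer that has one linear area and at most one fragment. This is only a conceptual sketch: toy_skb, toy_frag, toy_partial_copy, and toy_full_copy are hypothetical names, and the real pskb_copy and skb_copy differ in many details (for instance, in how the copied data is laid out).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fragment with its own reference counter. */
struct toy_frag {
    int refs;
    unsigned char bytes[32];
};

struct toy_skb {
    unsigned char linear[32];   /* models the area between head and end */
    struct toy_frag *frag;      /* models one entry of the frags array */
};

/* In the spirit of pskb_copy: private linear area, shared fragment. */
struct toy_skb *toy_partial_copy(const struct toy_skb *orig)
{
    struct toy_skb *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    memcpy(c->linear, orig->linear, sizeof(c->linear));
    c->frag = orig->frag;
    if (c->frag)
        c->frag->refs++;        /* the fragment gains one more user */
    return c;
}

/* In the spirit of skb_copy: private linear area and private fragment data. */
struct toy_skb *toy_full_copy(const struct toy_skb *orig)
{
    struct toy_skb *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    memcpy(c->linear, orig->linear, sizeof(c->linear));
    c->frag = NULL;
    if (orig->frag) {
        c->frag = malloc(sizeof(*c->frag));
        if (!c->frag) {
            free(c);
            return NULL;
        }
        memcpy(c->frag->bytes, orig->frag->bytes, sizeof(c->frag->bytes));
        c->frag->refs = 1;      /* the new fragment has a single user */
    }
    return c;
}
```

After a partial copy it is safe to write to the copy's linear area but not to the fragment, which is still shared. After a full copy both can be written without affecting the original.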


You may not be able to appreciate all of the details in Figures 2-9 and 2-10 at this point. Later in the book, especially once you have gone through Part V, everything will make more sense.

While discussing the various topics of this book, I will sometimes emphasize that a given function needs to clone or copy a buffer. When deciding whether to clone or copy a buffer, the programmers of each subsystem cannot anticipate whether other kernel components (or other users of their subsystems) will need the original information in that buffer. The kernel is very modular and changes in a very dynamic and unpredictable way, so each subsystem is ignorant of what other subsystems may do with a buffer. Therefore, the programmers of each subsystem just keep track of any modifications they make to the buffer, and take care to make a copy before modifying anything, in case some other part of the kernel needs the original information.

2.1.5.6. List management functions

These functions manipulate the lists of sk_buff elements, also called queues. For a complete list of functions, see <include/linux/skbuff.h> and <net/core/skbuff.c>. Some of the most commonly used functions are:


skb_queue_head_init

Initializes an sk_buff_head with an empty queue of elements.


skb_queue_head, skb_queue_tail

Adds one buffer to the head or to the tail of a queue, respectively.


skb_dequeue, skb_dequeue_tail

Dequeues an element from the head or from the tail, respectively. The first function should probably have been called skb_dequeue_head to be consistent with the names of the other queueing functions.


skb_queue_purge

Empties a queue.


skb_queue_walk

Runs a loop on each element of a queue in turn.

All functions of this class must be executed atomically; that is, they must grab the spin lock provided by the sk_buff_head structure for the queue. Otherwise, they could be interrupted by asynchronous events that enqueue or dequeue elements from the queues, such as functions invoked by expired timers, which would lead to race conditions.

Thus, each function is implemented as follows:

static inline function_name ( parameter_list )
{
        unsigned long flags;

        spin_lock_irqsave(...);
        __function_name ( parameter_list );
        spin_unlock_irqrestore(...);
}

Each function is thus a wrapper that grabs the lock, does its work by invoking a function whose name begins with two underscores, and releases the lock.
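The wrapper pattern can be reproduced in user space with a C11 atomic_flag playing the role of the spin lock. This is a rough sketch under several assumptions: toy_skb, toy_skb_head, and all toy_* functions are hypothetical names, the queue is a simple singly linked list rather than the kernel's list layout, and nothing here disables interrupts the way spin_lock_irqsave does. The unlocked workers carry a _nolock suffix because identifiers beginning with two underscores are reserved in ordinary C programs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_skb {
    struct toy_skb *next;
    int id;
};

struct toy_skb_head {
    struct toy_skb *first, *last;
    unsigned int qlen;
    atomic_flag lock;          /* stands in for the kernel spin lock */
};

void toy_queue_head_init(struct toy_skb_head *q)
{
    q->first = q->last = NULL;
    q->qlen = 0;
    atomic_flag_clear(&q->lock);
}

static void toy_lock(struct toy_skb_head *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                      /* spin until the flag is released */
}

static void toy_unlock(struct toy_skb_head *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Unlocked worker: the caller must hold q->lock. */
static void toy_queue_tail_nolock(struct toy_skb_head *q, struct toy_skb *skb)
{
    skb->next = NULL;
    if (q->last)
        q->last->next = skb;
    else
        q->first = skb;
    q->last = skb;
    q->qlen++;
}

/* Unlocked worker: the caller must hold q->lock. */
static struct toy_skb *toy_dequeue_nolock(struct toy_skb_head *q)
{
    struct toy_skb *skb = q->first;
    if (!skb)
        return NULL;
    q->first = skb->next;
    if (!q->first)
        q->last = NULL;
    skb->next = NULL;
    q->qlen--;
    return skb;
}

/* Locked wrappers: grab the lock, call the worker, release the lock. */
void toy_queue_tail(struct toy_skb_head *q, struct toy_skb *skb)
{
    toy_lock(q);
    toy_queue_tail_nolock(q, skb);
    toy_unlock(q);
}

struct toy_skb *toy_dequeue(struct toy_skb_head *q)
{
    toy_lock(q);
    struct toy_skb *skb = toy_dequeue_nolock(q);
    toy_unlock(q);
    return skb;
}
```

Keeping the unlocked workers separate lets code that already holds the lock perform several queue operations in one critical section, while ordinary callers use the wrappers and never touch the lock themselves.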
