The sk_buff structure represents a packet. SKB stands for socket buffer. A packet can be generated by a local socket in the local machine, which was created by a userspace application; the packet can be sent outside or to another socket in the same machine. A packet can also be created by a kernel socket. In the Rx path, you can receive a physical frame from a network device (Layer 2), attach it to an sk_buff, and pass it on to Layer 3. When the packet destination is your local machine, it will continue to Layer 4. If the packet is not for your machine, it will be forwarded according to your routing table rules, if your machine supports forwarding. If the packet is damaged for any reason, it will be dropped.
[ include/linux/skbuff.h ]
- ktime_t tstamp
Timestamp of the arrival of the packet. Timestamps are stored in the SKB as offsets to a base timestamp. Note: do not confuse tstamp of the SKB with hardware timestamping, which is implemented with the hwtstamps of skb_shared_info.
- struct sock *sk
The socket that owns the SKB, for locally generated traffic and for traffic that is destined for the local host. For packets that are being forwarded, sk is NULL. Usually when talking about sockets you deal with sockets which are created by calling the socket() system call from userspace. It should be mentioned that there are also kernel sockets, which are created by calling the sock_create_kern() method.
- struct net_device *dev
The dev member is a net_device object which represents the network interface device associated with the SKB; you will sometimes encounter the term NIC (Network Interface Card) for such a network device. It can be the network device on which the packet arrives, or the network device on which the packet will be sent.
- char cb[48]
This is the control buffer. It is free to use by any layer. This is an opaque area used to store private information.
- unsigned long _skb_refdst
The destination entry (dst_entry) address. The dst_entry struct represents the routing entry for a given destination. For each packet, incoming or outgoing, you perform a lookup in the routing tables. Sometimes this lookup is called FIB lookup. The result of this lookup determines how you should handle this packet; for example, whether it should be forwarded, and if so, on which interface it should be transmitted; or should it be thrown, should an ICMP error message be sent, and so on. The dst_entry object has a reference counter (the __refcnt field). There are cases when you use this reference count, and there are cases when you do not use it.
skb_dst_set(struct sk_buff *skb, struct dst_entry *dst): Sets the skb dst, assuming a reference was taken on dst and should be released by the dst_release() method (which is invoked by the skb_dst_drop() method).
skb_dst_set_noref(struct sk_buff *skb, struct dst_entry *dst): Sets the skb dst, assuming a reference was not taken on dst. In this case, the skb_dst_drop() method will not call the dst_release() method for the dst.
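The difference between the two setters can be modeled in userspace as a tagged pointer: the low-order bit of the stored value records whether a reference was taken. The following is only a sketch under that assumption, not kernel code. All toy_* names are hypothetical; the two masks mirror the idea behind the SKB_DST_NOREF and SKB_DST_PTRMASK macros of include/linux/skbuff.h, and the plain int counter stands in for the real atomic reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; the kernel stores an unsigned long. */
#define TOY_DST_NOREF   ((uintptr_t)1)
#define TOY_DST_PTRMASK (~TOY_DST_NOREF)

struct toy_dst { int refcnt; };        /* stand-in for dst_entry */
struct toy_skb { uintptr_t refdst; };  /* stand-in for _skb_refdst */

/* Caller already holds a reference on dst; the skb takes it over. */
static void toy_dst_set(struct toy_skb *skb, struct toy_dst *dst)
{
    assert(((uintptr_t)dst & TOY_DST_NOREF) == 0); /* low bit must be free */
    skb->refdst = (uintptr_t)dst;
}

/* No reference was taken; remember that by setting the low-order bit. */
static void toy_dst_set_noref(struct toy_skb *skb, struct toy_dst *dst)
{
    assert(((uintptr_t)dst & TOY_DST_NOREF) == 0);
    skb->refdst = (uintptr_t)dst | TOY_DST_NOREF;
}

/* Strip the tag bit to recover the real pointer. */
static struct toy_dst *toy_dst(const struct toy_skb *skb)
{
    return (struct toy_dst *)(skb->refdst & TOY_DST_PTRMASK);
}

static bool toy_dst_is_noref(const struct toy_skb *skb)
{
    return (skb->refdst & TOY_DST_NOREF) != 0;
}

/* Release the reference only if one was taken, then detach the dst. */
static void toy_dst_drop(struct toy_skb *skb)
{
    if (skb->refdst) {
        if (!toy_dst_is_noref(skb))
            toy_dst(skb)->refcnt--;   /* plays the role of dst_release() */
        skb->refdst = 0;
    }
}
```

Storing the flag inside the pointer value avoids a separate field in the structure; it works because the pointed-to object is aligned, so its address always has a zero low-order bit.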
Note: The SKB might have a dst_entry pointer attached to it; it can be reference counted or not. The low order bit of _skb_refdst is set if the reference counter was not taken.
- struct sec_path *sp
The security path pointer. It includes an array of IPsec XFRM transformation states (xfrm_state objects). IPsec (IP Security) is a Layer 3 protocol which is used mostly in VPNs. It is mandatory in IPv6 and optional in IPv4. Linux, like many other operating systems, implements IPsec both for IPv4 and IPv6.
- unsigned int len
The total number of packet bytes.
- unsigned int data_len
The data length. This field is used only when the packet has nonlinear data (paged data).
- __u16 mac_len
The length of the MAC (Layer 2) header.
- __wsum csum
The checksum.
- __u32 priority
The queuing priority of the packet. In the Tx path, the priority of the SKB is set according to the socket priority (the sk_priority field of the socket). The socket priority in turn can be set by calling the setsockopt() system call with the SO_PRIORITY socket option. Using the net_prio cgroup kernel module, you can also define a rule which will set the priority for the SKB. For forwarded packets, the priority is set according to the TOS (Type Of Service) field in the IP header: the rt_tos2priority() method maps the TOS value to a priority using a table named ip_tos2prio, which consists of 16 elements.
- __u8 local_df:1
Allow local fragmentation flag. If the value of the pmtudisc field of the socket which sends the packet is IP_PMTUDISC_DONT or IP_PMTUDISC_WANT, local_df is set to 1; if the value of the pmtudisc field of the socket is IP_PMTUDISC_DO or IP_PMTUDISC_PROBE, local_df is set to 0. Only when the packet local_df is 0 do you set the IP header don’t fragment flag, IP_DF:
. . .
if (ip_dont_fragment(sk, &rt->dst) && !skb->local_df)
    iph->frag_off = htons(IP_DF);
else
    iph->frag_off = 0;
. . .
The frag_off field in the IP header is a 16-bit field, which represents the offset and the flags of the fragment. The 3 leftmost (MSB) bits are the flags and the 13 rightmost (LSB) bits are the offset (the offset unit is 8 bytes). The flags can be IP_MF (there are more fragments), IP_DF (do not fragment), or IP_CE (for congestion); IP_OFFSET is the mask that selects the offset part.
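As an illustration of the frag_off layout, the following userspace sketch splits a frag_off value into its flags and its byte offset. It assumes the value has already been converted to host byte order (for example with ntohs()), and it uses the mask values defined in include/net/ip.h, where the flags occupy the three high-order bits and the offset the thirteen low-order bits. The frag_info struct and the decode_frag_off() helper are hypothetical names introduced only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask values as defined in include/net/ip.h. */
#define IP_CE     0x8000  /* congestion flag */
#define IP_DF     0x4000  /* don't fragment flag */
#define IP_MF     0x2000  /* more fragments flag */
#define IP_OFFSET 0x1FFF  /* mask for the 13-bit fragment offset */

/* Hypothetical helper type for this example. */
struct frag_info {
    bool df;                  /* IP_DF is set */
    bool mf;                  /* IP_MF is set */
    unsigned int byte_offset; /* fragment offset converted to bytes */
};

/* frag_off must be in host byte order. */
static struct frag_info decode_frag_off(uint16_t frag_off)
{
    struct frag_info fi;

    fi.df = (frag_off & IP_DF) != 0;
    fi.mf = (frag_off & IP_MF) != 0;
    /* The offset is counted in units of 8 bytes. */
    fi.byte_offset = (unsigned int)(frag_off & IP_OFFSET) * 8;
    return fi;
}
```

For instance, a value of 0x20B9 decodes to IP_MF set, IP_DF clear, and an offset of 185 units, which is 1480 bytes.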
The reason behind this is that there are cases when you do not want to allow IP fragmentation.
For example, in Path MTU Discovery (PMTUD), you set the DF (don’t fragment) flag of the IP header. Thus, you don’t fragment the outgoing packets. Any network device along the path whose MTU is smaller than the packet will drop it and send back an ICMP packet (“Fragmentation Needed”). Getting these ICMP “Fragmentation Needed” packets is required in order to determine the Path MTU. From userspace, setting IP_PMTUDISC_DO is done, for example, thus (the following code snippet is taken from the source code of the tracepath utility from the iputils package; the tracepath utility finds the path MTU):
. . .
int on = IP_PMTUDISC_DO;
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on));
. . .
- __u8 cloned:1
When the packet is cloned with the __skb_clone() method, this field is set to 1 in both the cloned packet and the primary packet. Cloning SKB means creating a private copy of the sk_buff struct; the data block is shared between the clone and the primary SKB.
- __u8 ip_summed:2
Indicator of IP (Layer 3) checksum; can be one of these values:
- CHECKSUM_NONE: When the device driver does not support hardware checksumming, it sets the ip_summed field to be CHECKSUM_NONE. This is an indication that checksumming should be done in software.
- CHECKSUM_UNNECESSARY: No need for any checksumming.
- CHECKSUM_COMPLETE: Calculation of the checksum was completed by the hardware, for incoming packets.
- CHECKSUM_PARTIAL: A partial checksum was computed for outgoing packets; the hardware should complete the checksum calculation.
CHECKSUM_COMPLETE and CHECKSUM_PARTIAL replace the CHECKSUM_HW flag, which is now deprecated.
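To give an idea of what checksumming in software (the CHECKSUM_NONE case) involves, here is a minimal userspace sketch of the 16-bit one's complement Internet checksum described in RFC 1071. It is not the kernel implementation, which relies on optimized, architecture-specific routines; the inet_checksum() name is hypothetical. The sketch assumes packet-sized buffers, so the 32-bit accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: computes the Internet checksum over buf.
 * Bytes are combined as big-endian 16-bit words, so the result is the
 * numeric value that would be written to the wire in network byte order.
 */
static uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    /* Add up all complete 16-bit words. */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

    /* A trailing odd byte is padded with a zero byte. */
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

When the function is run over a header whose checksum field is zeroed, it returns the value to store in that field; when it is run over a header that already carries a correct checksum, it returns 0, which is how a receiver can verify the header.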
- __u8 nohdr:1
Payload reference only, must not modify header. There are cases when the owner of the SKB no longer needs to access the header at all. In such cases, you can call the skb_header_release() method, which sets the nohdr field of the SKB; this indicates that the header of this SKB should not be modified.
- __u8 nfctinfo:3
Connection Tracking info. Connection Tracking allows the kernel to keep track of all logical network connections or sessions. NAT relies on Connection Tracking information for its translations. The value of the nfctinfo field corresponds to the ip_conntrack_info enum values. So, for example, when a new connection is starting to be tracked, the value of nfctinfo is IP_CT_NEW. When the connection is established, the value of nfctinfo is IP_CT_ESTABLISHED. The value of nfctinfo can change to IP_CT_RELATED when the packet is related to an existing connection—for example, when the traffic is part of some FTP session or SIP session, and so on.
The nfctinfo field of the SKB is set in the resolve_normal_ct() method. This method performs a Connection Tracking lookup, and if there is a miss, it creates a new Connection Tracking entry.
- __u8 pkt_type:3
For Ethernet, the packet type depends on the destination MAC address in the ethernet header, and is determined by the eth_type_trans() method:
- PACKET_BROADCAST for broadcast
- PACKET_MULTICAST for multicast
- PACKET_HOST if the destination MAC address is the MAC address of the device which was passed as a parameter
- PACKET_OTHERHOST if these conditions are not met
- __u8 ipvs_property:1
This flag indicates whether the SKB is owned by ipvs (IP Virtual Server), which is a kernel-based transport layer load-balancing solution. This field is set to 1 in the transmit methods of ipvs.
- __u8 peeked:1
This packet has already been seen, so statistics have already been gathered for it and should not be gathered again.
- __u8 nf_trace:1
The netfilter packet trace flag. This flag is set by the netfilter packet flow tracing module, xt_TRACE, which is used to mark packets for tracing.
- __be16 protocol
The protocol field is initialized in the Rx path by the eth_type_trans() method to be ETH_P_IP when working with Ethernet and IP.
- void (*destructor)(struct sk_buff *skb)
A callback that is invoked when freeing the SKB by calling the kfree_skb() method.
- struct nf_conntrack *nfct
The associated Connection Tracking object, if it exists. The nfct field, like the nfctinfo field, is set in the resolve_normal_ct() method.
- int skb_iif
The ifindex of the network device on which the packet arrived.
- __u32 rxhash
The rxhash of the SKB is calculated in the receive path, according to the source and destination address of the IP header and the ports from the transport header. A value of zero indicates that the hash is not valid. The rxhash is used to ensure that packets of the same flow will be handled by the same CPU when working with Symmetric Multiprocessing (SMP). This decreases the number of cache misses and improves network performance. The rxhash is part of the Receive Packet Steering (RPS) feature, which improves performance in SMP environments.
- __be16 vlan_proto
The VLAN protocol used—usually it is the 802.1q protocol. Recently support for the 802.1ad protocol (also known as Stacked VLAN) was added.
The following is an example of creating 802.1q and 802.1ad VLAN devices in userspace using the ip command of the iproute2 package:
ip link add link eth0 eth0.1000 type vlan proto 802.1ad id 1000
ip link add link eth0.1000 eth0.1000.1000 type vlan proto 802.1q id 100
- __u16 vlan_tci
The VLAN tag control information (2 bytes), composed of ID and priority.
- __u16 queue_mapping
Queue mapping for multiqueue devices.
- __u8 pfmemalloc
Allocate the SKB from PFMEMALLOC reserves.
- __u8 ooo_okay:1
The ooo_okay flag is set to avoid ooo (out of order) packets.
- __u8 l4_rxhash:1
A flag that is set when a canonical 4-tuple hash over transport ports is used.
- __u8 no_fcs:1
A flag that is set when you request the NIC to treat the last 4 bytes as Ethernet Frame Check Sequence (FCS).
- __u8 encapsulation:1
The encapsulation field denotes that the SKB is used for encapsulation. It is used, for example, in the VXLAN driver. VXLAN is a standard protocol to transfer Layer 2 Ethernet packets over a UDP kernel socket. It can be used as a solution when there are firewalls that block tunnels and allow, for example, only TCP or UDP traffic. The VXLAN driver uses UDP encapsulation and sets the SKB encapsulation to 1 in the vxlan_init_net() method. Also the ip_gre module and the ipip tunnel module use encapsulation and set the SKB encapsulation to 1.
- __u32 secmark
Security mark field. The secmark field is set by an iptables SECMARK target, which labels packets with any valid security context. For example:
iptables -t mangle -A INPUT -p tcp --dport 80 -j SECMARK --selctx system_u:object_r:httpd_packet_t:s0
iptables -t mangle -A OUTPUT -p tcp --sport 80 -j SECMARK --selctx system_u:object_r:httpd_packet_t:s0
In the preceding rules, you are statically labeling packets arriving at and leaving from port 80 as httpd_packet_t.
- __u32 mark
This field enables identifying the SKB by marking it.
You can set the mark field of the SKB, for example, with the iptables MARK target in an iptables PREROUTING rule with the mangle table.
iptables -A PREROUTING -t mangle -i eth1 -j MARK --set-mark 0x1234
This rule will assign the value of 0x1234 to every SKB mark field for incoming traffic on eth1 before performing a routing lookup. You can also run an iptables rule which will check the mark field of every SKB to match a specified value and act upon it. Netfilter targets and iptables are discussed in Chapter 9, which deals with the netfilter subsystem.
- __u32 dropcount
The dropcount counter represents the number of dropped packets (sk_drops) of the sk_receive_queue of the assigned sock object (sk). See the sock_queue_rcv_skb() method.
- __u32 reserved_tailroom
Used in the sk_stream_alloc_skb() method.
- sk_buff_data_t transport_header
The transport layer (L4) header.
- sk_buff_data_t network_header
The network layer (L3) header.
- sk_buff_data_t mac_header
The link layer (L2) header.
- sk_buff_data_t tail
The tail of the data.
- sk_buff_data_t end
The end of the buffer. The tail cannot exceed end.
- unsigned char *head
The head of the buffer.
- unsigned char *data
The data head. The data block is allocated separately from the sk_buff allocation.
- skb_headroom(const struct sk_buff *skb): This method returns the headroom, which is the number of bytes of free space at the head of the specified skb (skb->data - skb->head).
- skb_tailroom(const struct sk_buff *skb): This method returns the tailroom, which is the number of bytes of free space at the tail of the specified skb (skb->end - skb->tail).
- skb_put(struct sk_buff *skb, unsigned int len): Adds data to the end of a buffer; this method extends the used data area of the specified skb by len bytes at the tail and increments the length of the specified skb by the specified len.
- skb_push(struct sk_buff *skb, unsigned int len): Adds data to the start of a buffer; this method decrements the data pointer of the specified skb by the specified len and increments the length of the specified skb by the specified len.
- skb_pull(struct sk_buff *skb, unsigned int len): Removes data from the start of a buffer; this method increments the data pointer of the specified skb by the specified len and decrements the length of the specified skb by the specified len.
- skb_reserve(struct sk_buff *skb, int len): Increases the headroom of an empty skb by reducing the tailroom.
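The pointer arithmetic behind these methods can be sketched with a small userspace model. This is an illustration only: the toy_* names are hypothetical, the model covers only a linear buffer (no paged data), and the head, data, tail, and end members are plain pointers here, whereas the kernel uses sk_buff_data_t for some of them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the four buffer pointers of an SKB. */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

/* Start with an empty buffer: data and tail both point at head. */
static void toy_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

static size_t toy_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

static size_t toy_tailroom(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Like skb_reserve(): only meaningful while the buffer is still empty. */
static void toy_reserve(struct toy_skb *skb, size_t len)
{
    assert(skb->len == 0 && len <= toy_tailroom(skb));
    skb->data += len;
    skb->tail += len;
}

/* Like skb_put(): grow the data area at the tail; returns the old tail. */
static unsigned char *toy_put(struct toy_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;

    assert(len <= toy_tailroom(skb));
    skb->tail += len;
    skb->len += (unsigned int)len;
    return old_tail;
}

/* Like skb_push(): grow the data area at the start, e.g. to add a header. */
static unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
    assert(len <= toy_headroom(skb));
    skb->data -= len;
    skb->len += (unsigned int)len;
    return skb->data;
}

/* Like skb_pull(): shrink the data area at the start, e.g. to strip a header. */
static unsigned char *toy_pull(struct toy_skb *skb, size_t len)
{
    assert(len <= skb->len);
    skb->data += len;
    skb->len -= (unsigned int)len;
    return skb->data;
}
```

With a 128-byte buffer, reserving 32 bytes leaves a headroom of 32 and a tailroom of 96; putting 40 bytes of payload and then pushing a 14-byte header gives a length of 54 and a headroom of 18.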
- unsigned int truesize
The total memory allocated for the SKB (including the SKB structure itself and the size of the allocated data block).
- atomic_t users
A reference counter, initialized to 1; it is incremented by the skb_get() method and decremented by the kfree_skb() method or by the consume_skb() method. When the kfree_skb() method decrements the counter and it reaches 0, the method frees the SKB; otherwise, the method returns without freeing it.
skb_share_check(struct sk_buff *skb, gfp_t pri): If the buffer is not shared, the original buffer is returned. If the buffer is shared, the buffer is cloned, and the old copy drops a reference. A new clone with a single reference is returned. When being called from interrupt context or with spinlocks held, the pri parameter (priority) must be GFP_ATOMIC. If memory allocation fails, NULL is returned.
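The interplay between the users counter and skb_share_check() can be modeled in userspace as follows. This is a simplified sketch with hypothetical toy_* names: the counter is a plain int rather than an atomic_t, there is no gfp_t parameter, and "cloning" merely allocates a second descriptor that points at the same data.

```c
#include <stdlib.h>

/* Hypothetical userspace model of an SKB descriptor. */
struct toy_skb {
    int users;            /* stand-in for the atomic users counter */
    unsigned char *data;  /* data block, shared between clones */
};

/* A freshly allocated descriptor starts with a single reference. */
static struct toy_skb *toy_alloc(unsigned char *data)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (skb) {
        skb->users = 1;
        skb->data = data;
    }
    return skb;
}

/* Like skb_get(): take an additional reference. */
static struct toy_skb *toy_get(struct toy_skb *skb)
{
    skb->users++;
    return skb;
}

/* Like kfree_skb(): drop a reference; returns 1 if the descriptor was freed. */
static int toy_free(struct toy_skb *skb)
{
    if (--skb->users > 0)
        return 0;
    free(skb);
    return 1;
}

/*
 * Like skb_share_check(): if nobody else holds a reference, hand back the
 * original; otherwise return a private clone and drop our reference on the
 * original. Returns NULL if the clone cannot be allocated.
 */
static struct toy_skb *toy_share_check(struct toy_skb *skb)
{
    struct toy_skb *nskb;

    if (skb->users == 1)
        return skb;
    nskb = toy_alloc(skb->data);  /* new descriptor, same data block */
    toy_free(skb);                /* the old copy drops a reference */
    return nskb;
}
```

A caller that intends to modify fields of the descriptor would typically continue with the pointer returned by the check and treat a NULL return as a reason to drop the packet.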
struct skb_shared_info
The skb_shared_info struct is located at the end of the data block (skb_end_pointer(SKB)).
- nr_frags: Represents the number of elements in the frags array.
- tx_flags can be:
- SKBTX_HW_TSTAMP: Generate a hardware time stamp.
- SKBTX_SW_TSTAMP: Generate a software time stamp.
- SKBTX_IN_PROGRESS: Device driver is going to provide a hardware timestamp.
- SKBTX_DEV_ZEROCOPY: Device driver supports Tx zero-copy buffers.
- SKBTX_WIFI_STATUS: Generate WiFi status information.
- SKBTX_SHARED_FRAG: Indication that at least one fragment might be overwritten.
- When working with fragmentation, there are cases when you work with a list of sk_buffs (frag_list), and there are cases when you work with the frags array. It depends mostly on whether the Scatter/Gather mode is set.
- dataref: A reference counter of the skb_shared_info struct. It is set to 1 in the __alloc_skb() method, which allocates the skb and initializes skb_shared_info.
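The placement of skb_shared_info at the end of the data block, and the role of dataref when an SKB is cloned, can be illustrated with the following userspace sketch. All toy_* names are hypothetical and the model is heavily simplified: the counters are plain integers, the shared info holds only two members, and the 8-byte rounding merely stands in for the alignment that the kernel applies to the data size. The increment of dataref on cloning reflects the behavior of __skb_clone() as I understand it and is stated here as an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-in for skb_shared_info. */
struct toy_shinfo {
    int dataref;             /* number of descriptors sharing the data block */
    unsigned char nr_frags;
};

/* Hypothetical, reduced stand-in for sk_buff. */
struct toy_skb {
    unsigned char *head;     /* start of the data block */
    unsigned char *end;      /* end of the data area; shared info starts here */
    int cloned;
};

/* The shared info lives right after the data area, at the end pointer. */
static struct toy_shinfo *toy_shinfo(const struct toy_skb *skb)
{
    return (struct toy_shinfo *)skb->end;
}

/* Allocate the descriptor and, separately, data block plus shared info. */
static struct toy_skb *toy_alloc(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));
    size_t aligned = (size + 7) & ~(size_t)7;  /* keep shared info aligned */

    if (!skb)
        return NULL;
    skb->head = malloc(aligned + sizeof(struct toy_shinfo));
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->end = skb->head + aligned;
    skb->cloned = 0;
    memset(toy_shinfo(skb), 0, sizeof(struct toy_shinfo));
    toy_shinfo(skb)->dataref = 1;
    return skb;
}

/* Clone: private descriptor, shared data block, one more data reference. */
static struct toy_skb *toy_clone(struct toy_skb *skb)
{
    struct toy_skb *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    *n = *skb;
    toy_shinfo(skb)->dataref++;
    skb->cloned = n->cloned = 1;
    return n;
}

/* Free a descriptor; returns 1 if the data block was released as well. */
static int toy_free(struct toy_skb *skb)
{
    int data_freed = 0;

    if (--toy_shinfo(skb)->dataref == 0) {
        free(skb->head);
        data_freed = 1;
    }
    free(skb);
    return data_freed;
}
```

In this model the data block outlives the first descriptor that is freed and is released only together with the last descriptor that refers to it.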