I. Background
While writing the previous article, I noticed that the igb NIC driver calls napi_gro_receive during softirq processing. That function is the entry point of the GRO mechanism. Below we analyze how GRO processes packets.
II. What GRO (Generic Receive Offload) does
GRO operates on the receive path: when the link layer receives packets, it merges multiple small packets into one large packet before handing it up, reducing the number of packets the protocol stack must process.
GRO is a software implementation of LRO (Large Receive Offload, a hardware optimization usually implemented on the NIC), which makes the capability available on every NIC. Most links use an MTU of 1500 bytes (up to 9000 with jumbo frames enabled), so data larger than the MTU must be split into multiple packets. By merging "sufficiently similar" packets, GRO reduces the number of packets handed to the protocol stack, which helps reduce CPU usage: the protocol layers process a single header while the large packet carrying all the data is delivered to the user program. If tcpdump shows the machine receiving implausibly large packets, GRO is probably enabled. GRO is similar in spirit to hardware interrupt coalescing, but at a different stage: interrupt coalescing happens before the interrupt is raised, while GRO runs inside softirq processing.
GRO's main job is to aggregate multiple TCP/UDP segments into one skb and deliver it to the upper protocol stack as a single large packet, cutting the per-skb processing overhead and improving receive performance. It applies the scatter-gather I/O idea: a "super skb" merging many skbs traverses the protocol stack once, reducing CPU load.
GRO was designed with multiple protocols in mind. The core idea: on the receive side, assemble multiple related packets (for example TCP segments) into one large packet before passing it to the stack. Since the stack's work is dominated by header processing, merging several packets under a single header reduces the number of packets the stack must handle and speeds up processing.
GRO changes the receive path, and only NAPI-style drivers can support it: the kernel and the driver must cooperate, the driver by calling the GRO entry points. Use ethtool -K ethX gro on/off to enable/disable GRO; an error means the NIC driver does not support it. GRO improves throughput but adds some latency. It helps for traffic terminating on the host; on a Linux box acting as a forwarding device, GRO is generally unnecessary and can actually reduce forwarding performance.
Why do only NAPI drivers support GRO?
First compare the NAPI and non-NAPI receive paths:
In process_backlog, skbs are dequeued from process_queue and passed one by one straight to __netif_receive_skb(skb), so packets are delivered individually. process_backlog is implemented by the kernel, registered at network-subsystem initialization, and never replaced, so it cannot support GRO.
napi_poll, on the other hand, ends up calling napi_gro_receive, which performs the merging; that function is analyzed in detail below.
Modern NIC drivers generally use the NAPI mechanism.
GRO is enabled with ethtool -K ethX gro on.
When receiving a packet, dev_gro_receive checks whether GRO is enabled, as the following (abridged) code shows:
static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
...
if (netif_elide_gro(skb->dev))
goto normal;
...
pull:
grow = skb_gro_offset(skb) - skb_headlen(skb);
if (grow > 0)
gro_pull_from_frag0(skb, grow);
normal:
ret = GRO_NORMAL;
goto pull;
}
If netif_elide_gro(skb->dev) returns true, GRO is skipped ("elided") and the packet takes the normal path. GRO is performed only when it returns false, i.e. the NETIF_F_GRO feature is on and no XDP program is attached:
static inline bool netif_elide_gro(const struct net_device *dev)
{
if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
return true;
return false;
}
GRO is analogous to TSO, but TSO applies only to the transmit path. A GRO-capable driver reads packets in its NAPI poll callback and feeds them into the stack through napi_gro_receive or napi_gro_frags.
The following walkthrough of the GRO receive path is based on Linux kernel 5.10.*.
III. GRO source walkthrough
The NIC driver calls napi_gro_receive when it receives data.
First, the relevant data structures:
1. struct napi_struct
struct napi_struct {
/* The poll_list must only be managed by the entity which
* changes the state of the NAPI_STATE_SCHED bit. This means
* whoever atomically sets that bit can add this napi_struct
* to the per-CPU poll_list, and whoever clears that bit
* can remove from the list right before clearing the bit.
*/
struct list_head poll_list;
unsigned long state;
int weight;
int defer_hard_irqs_count;
//bitmask marking which gro_hash lists currently hold packets
unsigned long gro_bitmask;
int (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
int poll_owner;
#endif
struct net_device *dev;
//hash table of GRO lists where packets accumulate before merging
struct gro_list gro_hash[GRO_HASH_BUCKETS];
struct sk_buff *skb;
struct list_head rx_list; /* Pending GRO_NORMAL skbs */
int rx_count; /* length of rx_list; flushed once it reaches gro_normal_batch (8) */
struct hrtimer timer;
struct list_head dev_list;
struct hlist_node napi_hash_node;
unsigned int napi_id;
};
The rx_count member above is bounded by gro_normal_batch, defined as:
/* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
int gro_normal_batch __read_mostly = 8;
2. struct napi_gro_cb
GRO keeps its per-packet state in the skb's 48-byte private control buffer cb[48]:
struct napi_gro_cb
{
/* Points to the start of the data held in the page at
skb_shinfo(skb)->frags[0]. NULL if the skb is linear; if the skb
is non-linear and the whole header lives in the non-linear area,
this points at the start of that data in the page.
*/
void *frag0;
/* Length of frag0. Set only when frag0 is non-NULL, otherwise 0. */
unsigned int frag0_len;
/* This indicates where we are processing relative to skb->data,
i.e. the offset from skb->data to the data GRO is currently
working on. skb->data must not move while GRO is merging, so this
offset lets each layer find its data quickly. For example, when
IP-layer GRO runs, skb->data points at the IP header, which is
exactly what that layer processes, so the offset is 0; at the
transport layer skb->data still points at the IP header while TCP
GRO processes the TCP header, so the offset equals the IP header
length.
*/
int data_offset;
/* This is non-zero if the packet may be of the same flow as
packets already held on napi->gro_list. Each layer's gro_receive
updates the flag: at the link layer, the device and MAC header are
compared, and matching skbs on napi->gro_list get same_flow = 1.
The network layer then only needs to match against the skbs the
link layer marked, clearing the flag on mismatch, and the
transport layer likewise only checks skbs still marked
same_flow = 1. This design avoids unnecessary matching work.
*/
int same_flow;
/* This is non-zero if the packet cannot be merged with the new
skb, i.e. it need not wait for further merging and can be pushed
into the protocol stack right away.
*/
int flush;
/* Number of segments aggregated into this skb. */
int count;
/* Free the skb? (how the skb should be released) */
int free;
};
#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
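The cb[48] trick can be demonstrated with a small userspace sketch. demo_sk_buff and demo_gro_cb are simplified stand-ins, not kernel types (the real napi_gro_cb has more fields), but they show the same cast-the-private-buffer pattern and the constraint that the private struct must fit inside cb[]:

```c
#include <string.h>

/* Illustrative stand-in for sk_buff: only the private scratch area. */
struct demo_sk_buff {
    char cb[48];              /* private control buffer, like sk_buff.cb */
};

/* Illustrative subset of GRO's per-packet state. */
struct demo_gro_cb {
    void *frag0;
    unsigned int frag0_len;
    int data_offset;
    int same_flow;
    int flush;
    int count;
};

/* Same casting trick as the kernel's NAPI_GRO_CB(skb) macro. */
#define DEMO_GRO_CB(skb) ((struct demo_gro_cb *)(skb)->cb)
```

The layer that owns the skb at a given moment is free to reinterpret cb[] this way, which is why GRO can stash its state there without enlarging struct sk_buff.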
3. Each protocol defines its own GRO merge and completion callbacks
The merge callback (5.10 signature, taking the list of held skbs):
struct sk_buff *(*gro_receive)(struct list_head *head, struct sk_buff *skb);
Parameters:
head: list of skbs waiting to be merged
skb: the newly received skb.
Return value:
NULL means the packet was merged or held, and nothing needs to go up the stack right now.
Non-NULL means the returned skb must be delivered to the protocol stack immediately.
The completion callback:
int (*gro_complete)(struct sk_buff *skb, int nhoff);
It applies final fixups to a merged packet, such as updating checksums.
4. napi_gro_receive
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
gro_result_t ret;
skb_mark_napi_id(skb, napi);
trace_napi_gro_receive_entry(skb);
skb_gro_reset_offset(skb, 0);
ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
trace_napi_gro_receive_exit(ret);
return ret;
}
EXPORT_SYMBOL(napi_gro_receive);
GRO delivers packets into the protocol stack at two points:
1. in napi_skb_finish, which uses dev_gro_receive's return value to decide whether to pass the packet up;
2. when the NAPI poll loop completes, in napi_complete.
5. skb_gro_reset_offset
As napi_gro_receive shows, skb_gro_reset_offset is applied to the skb before napi_skb_finish is called.
First, note one special case: the skb may contain no data at all in its linear area (not even the headers), with everything stored in skb_shared_info (NICs supporting scatter/gather I/O may do this). To merge such packets, the header must be fetched from skb_shared_info's frags[0]; skb_gro_reset_offset does exactly this, caching the header location in napi_gro_cb's frag0. Such a fragment will not be in high memory: it is either linear or DMA-mapped for S/G I/O. Let's look at skb_gro_reset_offset:
static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
{
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const skb_frag_t *frag0 = &pinfo->frags[0];
NAPI_GRO_CB(skb)->data_offset = 0;
NAPI_GRO_CB(skb)->frag0 = NULL;
NAPI_GRO_CB(skb)->frag0_len = 0;
/* If there is no data in the linear area (and the frag page is not in high memory), the header lives in skb_shinfo, so fetch it from frags[0] */
if (!skb_headlen(skb) && pinfo->nr_frags &&
!PageHighMem(skb_frag_page(frag0)) &&
(!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
// frag0 caches the address of the first frags[] element of the skb
NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
//and frag0_len caches its usable size
NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
skb_frag_size(frag0),
skb->end - skb->tail);
}
}
As the structure layout described above shows, the end pointer marks the start of the skb_shared_info structure.
6. napi_skb_finish
static gro_result_t napi_skb_finish(struct napi_struct *napi,
struct sk_buff *skb,
gro_result_t ret)
{
switch (ret) {
case GRO_NORMAL://deliver the packet to the protocol stack
gro_normal_one(napi, skb, 1);
break;
case GRO_DROP:
kfree_skb(skb);
break;
case GRO_MERGED_FREE://the skb can be freed: GRO has merged its data and keeps it elsewhere
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
napi_skb_free_stolen_head(skb);
else
__kfree_skb(skb);
break;
case GRO_HELD://the skb has been stored by GRO without being merged, so it must be kept
case GRO_MERGED:
case GRO_CONSUMED:
break;
}
return ret;
}
This function branches on the result of dev_gro_receive. For GRO_NORMAL it calls gro_normal_one to deliver the packet to the protocol stack.
7. dev_gro_receive
dev_gro_receive merges skbs and decides whether to push the merged large skb into the network stack:
INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct list_head *,
struct sk_buff *));
INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *,
struct sk_buff *));
static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
struct list_head *gro_head;
struct sk_buff *pp = NULL;
enum gro_result ret;
int same_flow;
int grow;
if (netif_elide_gro(skb->dev))//GRO disabled for this device (feature off or XDP attached): skip GRO
goto normal;
gro_head = gro_list_prepare(napi, skb);//mark which held skbs match this one at the link layer
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {//walk the registered packet_offload handlers
if (ptype->type != type || !ptype->callbacks.gro_receive)//find the network-layer GRO handler for this protocol
continue;
skb_set_network_header(skb, skb_gro_offset(skb));
skb_reset_mac_len(skb);
NAPI_GRO_CB(skb)->same_flow = 0;
//the skb is GSO or carries a frag_list: it is already an aggregate, no need to aggregate again
NAPI_GRO_CB(skb)->flush = skb_is_gso(skb) || skb_has_frag_list(skb);
NAPI_GRO_CB(skb)->free = 0;
NAPI_GRO_CB(skb)->encap_mark = 0;
NAPI_GRO_CB(skb)->recursion_counter = 0;
NAPI_GRO_CB(skb)->is_fou = 0;
NAPI_GRO_CB(skb)->is_atomic = 1;
NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
/* Setup for GRO checksum validation */
switch (skb->ip_summed) {
case CHECKSUM_COMPLETE:
NAPI_GRO_CB(skb)->csum = skb->csum;
NAPI_GRO_CB(skb)->csum_valid = 1;
NAPI_GRO_CB(skb)->csum_cnt = 0;
break;
case CHECKSUM_UNNECESSARY:
NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
NAPI_GRO_CB(skb)->csum_valid = 0;
break;
default:
NAPI_GRO_CB(skb)->csum_cnt = 0;
NAPI_GRO_CB(skb)->csum_valid = 0;
}
pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
ipv6_gro_receive, inet_gro_receive,
gro_head, skb);//invoke inet_gro_receive or ipv6_gro_receive to merge the skb
break;
}
rcu_read_unlock();
if (&ptype->list == head)
goto normal;
if (PTR_ERR(pp) == -EINPROGRESS) {
ret = GRO_CONSUMED;
goto ok;
}
same_flow = NAPI_GRO_CB(skb)->same_flow;
ret = NAPI_GRO_CB(skb)->free ? GRO_MERGED_FREE : GRO_MERGED;
if (pp) {
skb_list_del_init(pp);
napi_gro_complete(napi, pp);//finalize the merged skb's headers and push it into the protocol stack
napi->gro_hash[hash].count--;
}
if (same_flow)//a same-flow skb was found and the current skb has been merged into that flow
goto ok;
if (NAPI_GRO_CB(skb)->flush)//the skb must not be held for aggregation
goto normal;
if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
gro_flush_oldest(napi, gro_head);
} else {
napi->gro_hash[hash].count++;
}
NAPI_GRO_CB(skb)->count = 1;
NAPI_GRO_CB(skb)->age = jiffies;
NAPI_GRO_CB(skb)->last = skb;
skb_shinfo(skb)->gso_size = skb_gro_len(skb);
list_add(&skb->list, gro_head);//the skb starts a new flow: nothing on the GRO list could merge with it, so queue it
ret = GRO_HELD;//the skb is held by GRO and must not be freed
pull:
grow = skb_gro_offset(skb) - skb_headlen(skb);
if (grow > 0)//if part of the header is not in the linear area, copy it into the linear area
gro_pull_from_frag0(skb, grow);
ok:
if (napi->gro_hash[hash].count) {
if (!test_bit(hash, &napi->gro_bitmask))
__set_bit(hash, &napi->gro_bitmask);
} else if (test_bit(hash, &napi->gro_bitmask)) {
__clear_bit(hash, &napi->gro_bitmask);
}
return ret;
normal:
ret = GRO_NORMAL;
goto pull;
}
static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
{
struct skb_shared_info *pinfo = skb_shinfo(skb);
BUG_ON(skb->end - skb->tail < grow);
memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);//copy the header bytes from the first page into the linear area
skb->data_len -= grow;
skb->tail += grow;
skb_frag_off_add(&pinfo->frags[0], grow);
skb_frag_size_sub(&pinfo->frags[0], grow);
if (unlikely(!skb_frag_size(&pinfo->frags[0]))) {//the first page has been fully consumed
skb_frag_unref(skb, 0);
memmove(pinfo->frags, pinfo->frags + 1,
--pinfo->nr_frags * sizeof(pinfo->frags[0]));//shift the frags array forward by one slot
}
}
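The effect of gro_pull_from_frag0 can be modeled with a simplified userspace sketch: copy `grow` header bytes that still live in the first page fragment into the linear area, then advance and shrink the fragment. demo_skb and demo_pull_from_frag0 are illustrative stand-ins, not kernel code:

```c
#include <string.h>

/* Illustrative stand-in for the parts of sk_buff this operation touches. */
struct demo_skb {
    unsigned char linear[64];  /* stands in for the skb->head..tail area */
    unsigned int  tail;        /* bytes already in the linear area */
    unsigned char *frag0;      /* first page fragment's data */
    unsigned int  frag0_off;   /* offset into the fragment */
    unsigned int  frag0_len;   /* remaining bytes in the fragment */
};

/* Pull `grow` bytes from the first fragment into the linear area,
 * like gro_pull_from_frag0() does for header bytes left in frag0. */
static void demo_pull_from_frag0(struct demo_skb *skb, unsigned int grow)
{
    memcpy(skb->linear + skb->tail, skb->frag0 + skb->frag0_off, grow);
    skb->tail      += grow;
    skb->frag0_off += grow;   /* fragment now starts past the pulled bytes */
    skb->frag0_len -= grow;
}
```

The kernel version additionally drops the fragment entirely (skb_frag_unref plus a memmove of the frags array) once frag0_len reaches zero.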
8. Every protocol layer implements its own GRO callbacks, gro_receive and gro_complete
The GRO core dispatches by protocol: gro_receive tries to merge the incoming skb into the held gro_list, while gro_complete is invoked when a merged packet is submitted to the protocol stack.
Below are the IP-layer registrations:
/*
* IP protocol layer initialiser
*/
static struct packet_offload ip_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IP),
.callbacks = {
.gso_segment = inet_gso_segment,
.gro_receive = inet_gro_receive,
.gro_complete = inet_gro_complete,
},
};
static const struct net_offload ipip_offload = {
.callbacks = {
.gso_segment = ipip_gso_segment,
.gro_receive = ipip_gro_receive,
.gro_complete = ipip_gro_complete,
},
};
static int __init ipip_offload_init(void)
{
return inet_add_offload(&ipip_offload, IPPROTO_IPIP);
}
static int __init ipv4_offload_init(void)
{
/*
* Add offloads
*/
if (udpv4_offload_init() < 0)
pr_crit("%s: Cannot add UDP protocol offload\n", __func__);
if (tcpv4_offload_init() < 0)
pr_crit("%s: Cannot add TCP protocol offload\n", __func__);
if (ipip_offload_init() < 0)
pr_crit("%s: Cannot add IPIP protocol offload\n", __func__);
dev_add_offload(&ip_packet_offload);
return 0;
}
fs_initcall(ipv4_offload_init);
These are registered through dev_add_offload and inet_add_offload.
When is gro_receive called?
dev_gro_receive invokes it to merge skbs:
INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct list_head *,
struct sk_buff *));
INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *,
struct sk_buff *));
static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
...
pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
ipv6_gro_receive, inet_gro_receive,
gro_head, skb);
....
}
When is gro_complete called?
In dev_gro_receive, after gro_receive has merged skbs, napi_gro_complete is called for flows that are ready to be flushed; it looks up and invokes the protocol's gro_complete callback. The full function is listed in section 13 below.
9. inet_gro_receive
inet_gro_receive is the network-layer aggregation handler:
struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
{
const struct net_offload *ops;
struct sk_buff *pp = NULL;
const struct iphdr *iph;
struct sk_buff *p;
unsigned int hlen;
unsigned int off;
unsigned int id;
int flush = 1;
int proto;
off = skb_gro_offset(skb);
hlen = off + sizeof(*iph);//need data up to the end of the basic IP header
iph = skb_gro_header_fast(skb, off);
if (skb_gro_header_hard(skb, hlen)) {
iph = skb_gro_header_slow(skb, hlen, off);
if (unlikely(!iph))
goto out;
}
proto = iph->protocol;
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);//look up the transport-layer handler registered for this protocol
if (!ops || !ops->callbacks.gro_receive)
goto out_unlock;
if (*(u8 *)iph != 0x45)//not IPv4, or the header length is not 20 bytes
goto out_unlock;
if (ip_is_fragment(iph))
goto out_unlock;
if (unlikely(ip_fast_csum((u8 *)iph, 5)))//bad header checksum
goto out_unlock;
id = ntohl(*(__be32 *)&iph->id);
flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
id >>= 16;
list_for_each_entry(p, head, list) {
struct iphdr *iph2;
u16 flush_id;
if (!NAPI_GRO_CB(p)->same_flow)//not the same flow as the current packet
continue;
iph2 = (struct iphdr *)(p->data + off);
/* The above works because, with the exception of the top
* (inner most) layer, we only aggregate pkts with the same
* hdr length so all the hdrs we'll need to verify will start
* at the same offset.
*/
if ((iph->protocol ^ iph2->protocol) |
((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {//protocol/saddr/daddr triple mismatch: not the same flow
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
/* All fields must match except length and checksum. */
NAPI_GRO_CB(p)->flush |=
(iph->ttl ^ iph2->ttl) |
(iph->tos ^ iph2->tos) |
((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));//flush if ttl, tos or the DF bit differ
NAPI_GRO_CB(p)->flush |= flush;
/* We need to store of the IP ID check to be included later
* when we can verify that this packet does in fact belong
* to a given flow.
*/
flush_id = (u16)(id - ntohs(iph2->id));
/* This bit of code makes it much easier for us to identify
* the cases where we are doing atomic vs non-atomic IP ID
* checks. Specifically an atomic check can return IP ID
* values 0 - 0xFFFF, while a non-atomic check can only
* return 0 or 0xFFFF.
*/
if (!NAPI_GRO_CB(p)->is_atomic ||
!(iph->frag_off & htons(IP_DF))) {
flush_id ^= NAPI_GRO_CB(p)->count;
flush_id = flush_id ? 0xFFFF : 0;
}
/* If the previous IP ID value was based on an atomic
* datagram we can overwrite the value and ignore it.
*/
if (NAPI_GRO_CB(skb)->is_atomic)
NAPI_GRO_CB(p)->flush_id = flush_id;
else
NAPI_GRO_CB(p)->flush_id |= flush_id;
}
NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
NAPI_GRO_CB(skb)->flush |= flush;
skb_set_network_header(skb, off);
/* The above will be needed by the transport layer if there is one
* immediately following this IP hdr.
*/
/* Note : No need to call skb_gro_postpull_rcsum() here,
* as we already checked checksum over ipv4 header was 0
*/
skb_gro_pull(skb, sizeof(*iph));
skb_set_transport_header(skb, skb_gro_offset(skb));
pp = indirect_call_gro_receive(tcp4_gro_receive, udp4_gro_receive,
ops->callbacks.gro_receive, head, skb);//dispatch to tcp4_gro_receive or udp4_gro_receive
out_unlock:
rcu_read_unlock();
out:
skb_gro_flush_final(skb, pp, flush);
return pp;
}
EXPORT_SYMBOL(inet_gro_receive);
10. tcp4_gro_receive
tcp4_gro_receive is the transport-layer aggregation handler:
INDIRECT_CALLABLE_SCOPE
struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
{
/* Don't bother verifying checksum if we're going to flush anyway. */
if (!NAPI_GRO_CB(skb)->flush &&
skb_gro_checksum_validate(skb, IPPROTO_TCP,
inet_gro_compute_pseudo)) {
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
}
return tcp_gro_receive(head, skb);
}
struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
{
struct sk_buff *pp = NULL;
struct sk_buff *p;
struct tcphdr *th;
struct tcphdr *th2;
unsigned int len;
unsigned int thlen;
__be32 flags;
unsigned int mss = 1;
unsigned int hlen;
unsigned int off;
int flush = 1;
int i;
off = skb_gro_offset(skb);
hlen = off + sizeof(*th);
th = skb_gro_header_fast(skb, off);
if (skb_gro_header_hard(skb, hlen)) {
th = skb_gro_header_slow(skb, hlen, off);
if (unlikely(!th))
goto out;
}
thlen = th->doff * 4;
if (thlen < sizeof(*th))
goto out;
hlen = off + thlen;
if (skb_gro_header_hard(skb, hlen)) {
th = skb_gro_header_slow(skb, hlen, off);
if (unlikely(!th))
goto out;
}
skb_gro_pull(skb, thlen);
len = skb_gro_len(skb);
flags = tcp_flag_word(th);
list_for_each_entry(p, head, list) {
if (!NAPI_GRO_CB(p)->same_flow)//the IP layer decided this is not the same flow
continue;
th2 = tcp_hdr(p);
if (*(u32 *)&th->source ^ *(u32 *)&th2->source) {//port pair (source and dest, read as one u32) mismatch: not the same flow
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
goto found;
}
p = NULL;
goto out_check_final;
found:
/* Include the IP ID check below from the inner most IP hdr */
flush = NAPI_GRO_CB(p)->flush;
flush |= (__force int)(flags & TCP_FLAG_CWR);//congestion was signalled: cached packets must go up at once so TCP congestion control can react
flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));//flush if any control flags other than CWR, FIN, PSH differ
flush |= (__force int)(th->ack_seq ^ th2->ack_seq);//flush on differing ACK numbers so TCP send-buffer space is released promptly
for (i = sizeof(*th); i < thlen; i += 4)
flush |= *(u32 *)((u8 *)th + i) ^
*(u32 *)((u8 *)th2 + i);
/* When we receive our second frame we can made a decision on if we
* continue this flow as an atomic flow with a fixed ID or if we use
* an incrementing ID.
*/
if (NAPI_GRO_CB(p)->flush_id != 1 ||
NAPI_GRO_CB(p)->count != 1 ||
!NAPI_GRO_CB(p)->is_atomic)
flush |= NAPI_GRO_CB(p)->flush_id;
else
NAPI_GRO_CB(p)->is_atomic = false;
mss = skb_shinfo(p)->gso_size;
flush |= (len - 1) >= mss;
flush |= (ntohl(th2->seq) + skb_gro_len(p)) ^ ntohl(th->seq);
#ifdef CONFIG_TLS_DEVICE
flush |= p->decrypted ^ skb->decrypted;
#endif
if (flush || skb_gro_receive(p, skb)) {//non-zero flush means the skb must go up immediately and cannot be merged via skb_gro_receive
mss = 1;
goto out_check_final;
}
tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
out_check_final:
flush = len < mss;
flush |= (__force int)(flags & (TCP_FLAG_URG | TCP_FLAG_PSH |
TCP_FLAG_RST | TCP_FLAG_SYN |
TCP_FLAG_FIN));//a packet carrying URG, PSH, RST, SYN or FIN forces immediate delivery of the merged packet
if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
pp = p;
out:
NAPI_GRO_CB(skb)->flush |= (flush != 0);
return pp;
}
A question arises here: why does the flow match seem to check only the source port?
It actually checks both ports: th->source and th->dest are adjacent 16-bit fields, so the 32-bit read *(u32 *)&th->source compares the source and destination ports in a single operation.
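This adjacent-field trick is easy to verify in userspace. demo_tcphdr and demo_ports_differ are illustrative stand-ins (the real tcphdr fields are __be16, but byte order does not affect an equality test); memcpy is used instead of the kernel's pointer cast to stay within strict-aliasing rules:

```c
#include <stdint.h>
#include <string.h>

/* Only the two leading port fields of the TCP header matter here. */
struct demo_tcphdr {
    uint16_t source;
    uint16_t dest;
};

/* Compare both 16-bit ports with one 32-bit load, like the kernel's
 *   *(u32 *)&th->source ^ *(u32 *)&th2->source
 * which works because source and dest are adjacent in memory. */
static int demo_ports_differ(const struct demo_tcphdr *a,
                             const struct demo_tcphdr *b)
{
    uint32_t pa, pb;
    memcpy(&pa, &a->source, sizeof(pa));  /* covers source AND dest */
    memcpy(&pb, &b->source, sizeof(pb));
    return (pa ^ pb) != 0;
}
```

A difference in either port, source or destination, makes the XOR non-zero, so one comparison covers the whole port pair.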
11. skb_gro_receive
skb_gro_receive merges an skb into the head skb of its flow:
int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
{
struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
unsigned int offset = skb_gro_offset(skb);
unsigned int headlen = skb_headlen(skb);
unsigned int len = skb_gro_len(skb);
unsigned int delta_truesize;
struct sk_buff *lp;
if (unlikely(p->len + len >= 65536 || NAPI_GRO_CB(skb)->flush))
return -E2BIG;
lp = NAPI_GRO_CB(p)->last;
pinfo = skb_shinfo(lp);
if (headlen <= offset) {//the header being processed lies entirely in the page frags
skb_frag_t *frag;
skb_frag_t *frag2;
int i = skbinfo->nr_frags;
int nr_frags = pinfo->nr_frags + i;
if (nr_frags > MAX_SKB_FRAGS)
goto merge;
offset -= headlen;
pinfo->nr_frags = nr_frags;
skbinfo->nr_frags = 0;
frag = pinfo->frags + nr_frags;
frag2 = skbinfo->frags + i;
do {//append skb's frags after pinfo's existing frags (copied back to front)
*--frag = *--frag2;
} while (--i);
skb_frag_off_add(frag, offset);//skip the remaining header bytes, keeping only the payload
skb_frag_size_sub(frag, offset);
/* all fragments truesize : remove (head size + sk_buff) */
delta_truesize = skb->truesize -
SKB_TRUESIZE(skb_end_offset(skb));
skb->truesize -= skb->data_len;
skb->len -= skb->data_len;
skb->data_len = 0;
NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE;
goto done;
} else if (skb->head_frag) {//the linear head is page-backed (scatter-gather I/O)
int nr_frags = pinfo->nr_frags;
skb_frag_t *frag = pinfo->frags + nr_frags;
struct page *page = virt_to_head_page(skb->head);
unsigned int first_size = headlen - offset;
unsigned int first_offset;
if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
goto merge;
first_offset = skb->data -
(unsigned char *)page_address(page) +
offset;
pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
__skb_frag_set_page(frag, page);
skb_frag_off_set(frag, first_offset);
skb_frag_size_set(frag, first_size);
memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
/* We dont need to clear skbinfo->nr_frags here */
delta_truesize = skb->truesize - SKB_DATA_ALIGN(sizeof(struct sk_buff));
NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
goto done;
}
merge:
delta_truesize = skb->truesize;
if (offset > headlen) {
unsigned int eat = offset - headlen;
skb_frag_off_add(&skbinfo->frags[0], eat);
skb_frag_size_sub(&skbinfo->frags[0], eat);
skb->data_len -= eat;
skb->len -= eat;
offset = headlen;
}
__skb_pull(skb, offset);
if (NAPI_GRO_CB(p)->last == p)
skb_shinfo(p)->frag_list = skb;//first merge: hang the skb on the head skb's frag_list
else
NAPI_GRO_CB(p)->last->next = skb;//append the skb to the existing chain
NAPI_GRO_CB(p)->last = skb;
__skb_header_release(skb);
lp = p;
done:
NAPI_GRO_CB(p)->count++;
p->data_len += len;
p->truesize += delta_truesize;
p->len += len;
if (lp != p) {
lp->data_len += len;
lp->truesize += delta_truesize;
lp->len += len;
}
NAPI_GRO_CB(skb)->same_flow = 1;//mark that this skb found a same-flow skb and was merged
return 0;
}
So when the NIC uses page-backed buffers (scatter-gather I/O), GRO merges skbs into the head skb's frag page array; otherwise it chains them on the head skb's frag_list.
Even when skbs are held in the GRO lists instead of being delivered immediately, they do not linger long: the receive softirq calls napi_gro_flush to push held packets into the protocol stack.
12. napi_gro_flush
static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
bool flush_old)
{
struct list_head *head = &napi->gro_hash[index].list;
struct sk_buff *skb, *p;
list_for_each_entry_safe_reverse(skb, p, head, list) {
if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
return;
skb_list_del_init(skb);
napi_gro_complete(napi, skb);//deliver the packet to the protocol stack
napi->gro_hash[index].count--;
}
if (!napi->gro_hash[index].count)
__clear_bit(index, &napi->gro_bitmask);
}
/* napi->gro_hash[].list contains packets ordered by age.
* youngest packets at the head of it.
* Complete skbs in reverse order to reduce latencies.
*/
void napi_gro_flush(struct napi_struct *napi, bool flush_old)
{
unsigned long bitmask = napi->gro_bitmask;
unsigned int i, base = ~0U;
while ((i = ffs(bitmask)) != 0) {
bitmask >>= i;
base += i;
__napi_gro_flush_chain(napi, base, flush_old);
}
}
EXPORT_SYMBOL(napi_gro_flush);
This path, too, delivers packets through napi_gro_complete.
A packet queued even one jiffy earlier than the current moment counts as old and is delivered, so if the softirq calls napi_gro_flush every jiffy, enabling GRO adds at most one jiffy (1 ms or 10 ms, depending on HZ) of latency.
13. napi_gro_complete
napi_gro_complete first fixes up the merged packet's per-layer headers and then hands it to the network stack:
INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *, int));
INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *, int));
static int napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
{
struct packet_offload *ptype;
__be16 type = skb->protocol;
struct list_head *head = &offload_base;
int err = -ENOENT;
BUILD_BUG_ON(sizeof(struct napi_gro_cb) > sizeof(skb->cb));
if (NAPI_GRO_CB(skb)->count == 1) {
skb_shinfo(skb)->gso_size = 0;
goto out;
}
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
if (ptype->type != type || !ptype->callbacks.gro_complete)
continue;
err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
ipv6_gro_complete, inet_gro_complete,//dispatch to inet_gro_complete or ipv6_gro_complete
skb, 0);
break;
}
rcu_read_unlock();
if (err) {
WARN_ON(&ptype->list == head);
kfree_skb(skb);
return NET_RX_SUCCESS;
}
out:
gro_normal_one(napi, skb, NAPI_GRO_CB(skb)->count);
return NET_RX_SUCCESS;
}
gro_normal_one adds the skb to napi->rx_list and updates napi->rx_count; once the count reaches gro_normal_batch, gro_normal_list delivers the whole batch to the protocol stack in one go:
/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
 * pass the whole batch up to the stack.
 */
static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb, int segs)
{
list_add_tail(&skb->list, &napi->rx_list);
napi->rx_count += segs;
if (napi->rx_count >= gro_normal_batch)
gro_normal_list(napi);
}
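The batching behavior above can be sketched in userspace C. demo_napi, demo_gro_normal_one and demo_gro_normal_list are made-up stand-ins; the flushes counter stands in for a call to netif_receive_skb_list_internal:

```c
/* Default batch size, mirroring gro_normal_batch. */
#define DEMO_GRO_NORMAL_BATCH 8

struct demo_napi {
    int rx_count;   /* segments queued on rx_list */
    int flushes;    /* how many times a batch was passed up the stack */
};

/* Stand-in for gro_normal_list(): hand the whole batch to the stack. */
static void demo_gro_normal_list(struct demo_napi *napi)
{
    if (!napi->rx_count)
        return;
    napi->flushes++;          /* would be netif_receive_skb_list_internal() */
    napi->rx_count = 0;
}

/* Stand-in for gro_normal_one(): queue segs, flush when batch is full. */
static void demo_gro_normal_one(struct demo_napi *napi, int segs)
{
    napi->rx_count += segs;
    if (napi->rx_count >= DEMO_GRO_NORMAL_BATCH)
        demo_gro_normal_list(napi);
}
```

Note that segs counts segments, not skbs: a single GRO super-skb that aggregated 9 segments exceeds the batch on its own and is flushed immediately.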
14. inet_gro_complete
inet_gro_complete fixes up the network-layer header:
int inet_gro_complete(struct sk_buff *skb, int nhoff)
{
__be16 newlen = htons(skb->len - nhoff);
struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
const struct net_offload *ops;
int proto = iph->protocol;
int err = -ENOSYS;
if (skb->encapsulation) {
skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IP));
skb_set_inner_network_header(skb, nhoff);
}
csum_replace2(&iph->check, iph->tot_len, newlen);//incrementally update the IP checksum, since the merged packets now share one IP header
iph->tot_len = newlen;//update the total length
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);
if (WARN_ON(!ops || !ops->callbacks.gro_complete))
goto out_unlock;
/* Only need to add sizeof(*iph) to get to the next hdr below
* because any hdr with option will have been flushed in
* inet_gro_receive().
*/
err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
tcp4_gro_complete, udp4_gro_complete,//dispatch to tcp4_gro_complete or udp4_gro_complete
skb, nhoff + sizeof(*iph));
out_unlock:
rcu_read_unlock();
return err;
}
EXPORT_SYMBOL(inet_gro_complete);
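Why csum_replace2 is safe can be checked in userspace: updating the Internet checksum for a single changed 16-bit field (the idea behind RFC 1624 incremental updates) yields the same result as re-summing the whole header. demo_csum and demo_csum_replace2 are simplified stand-ins operating on host-order 16-bit words:

```c
#include <stdint.h>
#include <stddef.h>

/* Full one's-complement checksum over n 16-bit words (RFC 1071 style). */
static uint16_t demo_csum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Incrementally update an existing checksum when one 16-bit field
 * changes from old_val to new_val, like csum_replace2() does for
 * iph->tot_len: subtracting old_val in one's complement is the same
 * as adding its complement. */
static uint16_t demo_csum_replace2(uint16_t check, uint16_t old_val,
                                   uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;        /* recover the folded sum */
    sum += (uint16_t)~old_val;              /* remove the old field */
    sum += new_val;                         /* add the new field */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This is why GRO can grow tot_len without touching the other header words: the checksum is patched in O(1) instead of re-summing the header.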
15. tcp4_gro_complete
tcp4_gro_complete fixes up the TCP header:
INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff)
{
const struct iphdr *iph = ip_hdr(skb);
struct tcphdr *th = tcp_hdr(skb);
th->check = ~tcp_v4_check(skb->len - thoff, iph->saddr,
iph->daddr, 0);//recompute the merged packet's TCP pseudo-header checksum
skb_shinfo(skb)->gso_type |= SKB_GSO_TCPV4;
if (NAPI_GRO_CB(skb)->is_atomic)
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_FIXEDID;
return tcp_gro_complete(skb);
}
int tcp_gro_complete(struct sk_buff *skb)
{
struct tcphdr *th = tcp_hdr(skb);
skb->csum_start = (unsigned char *)th - skb->head;
skb->csum_offset = offsetof(struct tcphdr, check);
skb->ip_summed = CHECKSUM_PARTIAL;
skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
if (th->cwr)
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
if (skb->encapsulation)
skb->inner_transport_header = skb->transport_header;
return 0;
}
EXPORT_SYMBOL(tcp_gro_complete);
In short, GRO works by keeping a single copy of the MAC, IP and TCP headers for all packets that can be merged, storing the payloads in the frags array or on frag_list, which greatly raises the amount of data carried per header processed.
After GRO completes, the skb is handed to the entry point of the Linux network stack. Once there, the network-layer and TCP handlers call pskb_may_pull to pull the headers the merged GRO skb needs into its linear area.
16. tcp_v4_rcv
/*
* From tcp_input.c
*/
int tcp_v4_rcv(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
struct sk_buff *skb_to_free;
int sdif = inet_sdif(skb);
int dif = inet_iif(skb);
const struct iphdr *iph;
const struct tcphdr *th;
bool refcounted;
struct sock *sk;
int ret;
if (skb->pkt_type != PACKET_HOST)
goto discard_it;
/* Count it even if it's bad */
__TCP_INC_STATS(net, TCP_MIB_INSEGS);
if (!pskb_may_pull(skb, sizeof(struct tcphdr)))//pull the basic TCP header into the linear area
goto discard_it;
th = (const struct tcphdr *)skb->data;
if (unlikely(th->doff < sizeof(struct tcphdr) / 4))
goto bad_packet;
if (!pskb_may_pull(skb, th->doff * 4))//pull the full TCP header, including options, into the linear area
goto discard_it;
/* An explanation is required here, I think.
* Packet length and doff are validated by header prediction,
* provided case of th->doff==0 is eliminated.
* So, we defer the checks. */
if (skb_checksum_init(skb, IPPROTO_TCP, inet_compute_pseudo))
goto csum_error;
    th = (const struct tcphdr *)skb->data;
    iph = ip_hdr(skb);
lookup:
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source,
                           th->dest, sdif, &refcounted);
    if (!sk)
        goto no_tcp_socket;

process:
    if (sk->sk_state == TCP_TIME_WAIT)
        goto do_time_wait;

    if (sk->sk_state == TCP_NEW_SYN_RECV) {
        struct request_sock *req = inet_reqsk(sk);
        bool req_stolen = false;
        struct sock *nsk;

        sk = req->rsk_listener;
        if (unlikely(tcp_v4_inbound_md5_hash(sk, skb, dif, sdif))) {
            sk_drops_add(sk, skb);
            reqsk_put(req);
            goto discard_it;
        }
        if (tcp_checksum_complete(skb)) {
            reqsk_put(req);
            goto csum_error;
        }
        if (unlikely(sk->sk_state != TCP_LISTEN)) {
            inet_csk_reqsk_queue_drop_and_put(sk, req);
            goto lookup;
        }
        /* We own a reference on the listener, increase it again
         * as we might lose it too soon.
         */
        sock_hold(sk);
        refcounted = true;
        nsk = NULL;
        if (!tcp_filter(sk, skb)) {
            th = (const struct tcphdr *)skb->data;
            iph = ip_hdr(skb);
            tcp_v4_fill_cb(skb, iph, th);
            nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
        }
        if (!nsk) {
            reqsk_put(req);
            if (req_stolen) {
                /* Another cpu got exclusive access to req
                 * and created a full blown socket.
                 * Try to feed this packet to this socket
                 * instead of discarding it.
                 */
                tcp_v4_restore_cb(skb);
                sock_put(sk);
                goto lookup;
            }
            goto discard_and_relse;
        }
        if (nsk == sk) {
            reqsk_put(req);
            tcp_v4_restore_cb(skb);
        } else if (tcp_child_process(sk, nsk, skb)) {
            tcp_v4_send_reset(nsk, skb);
            goto discard_and_relse;
        } else {
            sock_put(sk);
            return 0;
        }
    }
    if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
        __NET_INC_STATS(net, LINUX_MIB_TCPMINTTLDROP);
        goto discard_and_relse;
    }

    if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
        goto discard_and_relse;

    if (tcp_v4_inbound_md5_hash(sk, skb, dif, sdif))
        goto discard_and_relse;

    nf_reset_ct(skb);

    if (tcp_filter(sk, skb))
        goto discard_and_relse;
    th = (const struct tcphdr *)skb->data;
    iph = ip_hdr(skb);
    tcp_v4_fill_cb(skb, iph, th);

    skb->dev = NULL;

    if (sk->sk_state == TCP_LISTEN) {
        ret = tcp_v4_do_rcv(sk, skb);
        goto put_and_return;
    }

    sk_incoming_cpu_update(sk);

    bh_lock_sock_nested(sk);
    tcp_segs_in(tcp_sk(sk), skb);
    ret = 0;
    if (!sock_owned_by_user(sk)) {
        skb_to_free = sk->sk_rx_skb_cache;
        sk->sk_rx_skb_cache = NULL;
        ret = tcp_v4_do_rcv(sk, skb);
    } else {
        if (tcp_add_backlog(sk, skb))
            goto discard_and_relse;
        skb_to_free = NULL;
    }
    bh_unlock_sock(sk);
    if (skb_to_free)
        __kfree_skb(skb_to_free);

put_and_return:
    if (refcounted)
        sock_put(sk);

    return ret;

no_tcp_socket:
    if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
        goto discard_it;

    tcp_v4_fill_cb(skb, iph, th);

    if (tcp_checksum_complete(skb)) {
csum_error:
        __TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
bad_packet:
        __TCP_INC_STATS(net, TCP_MIB_INERRS);
    } else {
        tcp_v4_send_reset(NULL, skb);
    }

discard_it:
    /* Discard frame. */
    kfree_skb(skb);
    return 0;

discard_and_relse:
    sk_drops_add(sk, skb);
    if (refcounted)
        sock_put(sk);
    goto discard_it;

do_time_wait:
    if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
        inet_twsk_put(inet_twsk(sk));
        goto discard_it;
    }

    tcp_v4_fill_cb(skb, iph, th);

    if (tcp_checksum_complete(skb)) {
        inet_twsk_put(inet_twsk(sk));
        goto csum_error;
    }
    switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
    case TCP_TW_SYN: {
        struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
                                                &tcp_hashinfo, skb,
                                                __tcp_hdrlen(th),
                                                iph->saddr, th->source,
                                                iph->daddr, th->dest,
                                                inet_iif(skb),
                                                sdif);
        if (sk2) {
            inet_twsk_deschedule_put(inet_twsk(sk));
            sk = sk2;
            tcp_v4_restore_cb(skb);
            refcounted = false;
            goto process;
        }
    }
        /* to ACK */
        fallthrough;
    case TCP_TW_ACK:
        tcp_v4_timewait_ack(sk, skb);
        break;
    case TCP_TW_RST:
        tcp_v4_send_reset(sk, skb);
        inet_twsk_deschedule_put(inet_twsk(sk));
        goto discard_it;
    case TCP_TW_SUCCESS:;
    }
    goto discard_it;
}
When is tcp_v4_rcv called?
It is reached from napi_gro_complete through the following call chain:
napi_gro_complete->
gro_normal_one->
gro_normal_list-> // called once napi->rx_count >= gro_normal_batch
netif_receive_skb_list_internal->
__netif_receive_skb_list->
__netif_receive_skb_list_core->
__netif_receive_skb_core->
deliver_skb-> // pt_prev->func()
ip_rcv-> // registered via static struct packet_type ip_packet_type
tcp_v4_rcv
The call chain above is covered in my other article: linux之网络子系统-网络协议栈 发包收包详解-CSDN博客.
This completes the full path by which a packet, after GRO processing, is delivered into the network protocol stack.
17、pskb_may_pull
static inline bool pskb_may_pull(struct sk_buff *skb, unsigned int len)
{
    if (likely(len <= skb_headlen(skb))) /* the linear area already holds len bytes */
        return true;
    if (unlikely(len > skb->len)) /* total skb data is shorter than len: malformed packet */
        return false;
    return __pskb_pull_tail(skb, len - skb_headlen(skb)) != NULL; /* pull more data into the linear area */
}
/* Moves tail of skb head forward, copying data from fragmented part,
 * when it is necessary.
 * 1. It may fail due to malloc failure.
 * 2. It may change skb pointers.
 *
 * It is pretty complicated. Luckily, it is called only in exceptional cases.
 */
void *__pskb_pull_tail(struct sk_buff *skb, int delta)
{
    /* If skb has not enough free space at tail, get new one
     * plus 128 bytes for future expansions. If we have enough
     * room at tail, reallocate without expansion only if skb is cloned.
     */
    int i, k, eat = (skb->tail + delta) - skb->end;

    if (eat > 0 || skb_cloned(skb)) { /* not enough tail room, or the data area is shared */
        if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
                             GFP_ATOMIC))
            return NULL;
    }

    BUG_ON(skb_copy_bits(skb, skb_headlen(skb), /* copy data from the non-linear area into the linear area */
                         skb_tail_pointer(skb), delta));

    /* Optimization: no fragments, no reasons to preestimate
     * size of pulled pages. Superb.
     */
    if (!skb_has_frag_list(skb)) /* no skbs on frag_list */
        goto pull_pages;

    /* Estimate size of pulled pages. */
    eat = delta;
    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        int size = skb_frag_size(&skb_shinfo(skb)->frags[i]);

        if (size >= eat) /* the frags array alone covers the pulled length */
            goto pull_pages;
        eat -= size;
    }

    /* If we need update frag list, we are in troubles.
     * Certainly, it is possible to add an offset to skb data,
     * but taking into account that pulling is expected to
     * be very rare operation, it is worth to fight against
     * further bloating skb head and crucify ourselves here instead.
     * Pure masohism, indeed. 8)8)
     */
    if (eat) { /* trim the frag_list queue */
        struct sk_buff *list = skb_shinfo(skb)->frag_list;
        struct sk_buff *clone = NULL;
        struct sk_buff *insp = NULL;

        do {
            if (list->len <= eat) {
                /* Eaten as whole. */
                eat -= list->len;
                list = list->next;
                insp = list;
            } else {
                /* Eaten partially. */
                if (skb_shared(list)) {
                    /* Sucks! We need to fork list. :-( */
                    clone = skb_clone(list, GFP_ATOMIC);
                    if (!clone)
                        return NULL;
                    insp = list->next;
                    list = clone;
                } else {
                    /* This may be pulled without
                     * problems. */
                    insp = list;
                }
                if (!pskb_pull(list, eat)) {
                    kfree_skb(clone);
                    return NULL;
                }
                break;
            }
        } while (eat);

        /* Free pulled out fragments. */
        while ((list = skb_shinfo(skb)->frag_list) != insp) { /* free the skbs whose data has been pulled */
            skb_shinfo(skb)->frag_list = list->next;
            kfree_skb(list);
        }
        /* And insert new clone at head. */
        if (clone) {
            clone->next = list;
            skb_shinfo(skb)->frag_list = clone;
        }
    }
    /* Success! Now we may commit changes to skb data. */
pull_pages:
    eat = delta;
    k = 0;
    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        int size = skb_frag_size(&skb_shinfo(skb)->frags[i]);

        if (size <= eat) { /* frag fully pulled: release its page */
            skb_frag_unref(skb, i);
            eat -= size;
        } else {
            skb_frag_t *frag = &skb_shinfo(skb)->frags[k];

            *frag = skb_shinfo(skb)->frags[i];
            if (eat) {
                skb_frag_off_add(frag, eat);
                skb_frag_size_sub(frag, eat);
                if (!i)
                    goto end;
                eat = 0;
            }
            k++;
        }
    }
    skb_shinfo(skb)->nr_frags = k;

end:
    skb->tail += delta;
    skb->data_len -= delta;
    if (!skb->data_len)
        skb_zcopy_clear(skb, false);
    return skb_tail_pointer(skb);
}
EXPORT_SYMBOL(__pskb_pull_tail);
/**
 * skb_copy_bits - copy bits from skb to kernel buffer
 * @skb: source skb
 * @offset: offset in source
 * @to: destination buffer
 * @len: number of bytes to copy
 *
 * Copy the specified number of bytes from the source skb to the
 * destination buffer.
 *
 * CAUTION ! :
 * If its prototype is ever changed,
 * check arch/{*}/net/{*}.S files,
 * since it is called from BPF assembly code.
 */
int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
{
    int start = skb_headlen(skb);
    struct sk_buff *frag_iter;
    int i, copy;

    if (offset > (int)skb->len - len)
        goto fault;

    /* Copy header. */
    if ((copy = start - offset) > 0) {
        if (copy > len)
            copy = len;
        skb_copy_from_linear_data_offset(skb, offset, to, copy);
        if ((len -= copy) == 0)
            return 0;
        offset += copy;
        to += copy;
    }

    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { /* copy from the pages in the frags array */
        int end;
        skb_frag_t *f = &skb_shinfo(skb)->frags[i];

        WARN_ON(start > offset + len);

        end = start + skb_frag_size(f);
        if ((copy = end - offset) > 0) {
            u32 p_off, p_len, copied;
            struct page *p;
            u8 *vaddr;

            if (copy > len)
                copy = len;

            skb_frag_foreach_page(f,
                                  skb_frag_off(f) + offset - start,
                                  copy, p, p_off, p_len, copied) {
                vaddr = kmap_atomic(p);
                memcpy(to + copied, vaddr + p_off, p_len); /* copy the page data into the destination */
                kunmap_atomic(vaddr);
            }

            if ((len -= copy) == 0) /* done once the requested length has been copied */
                return 0;
            offset += copy;
            to += copy;
        }
        start = end;
    }

    skb_walk_frags(skb, frag_iter) { /* the frags pages did not cover the requested length; continue with the frag_list queue */
        int end;

        WARN_ON(start > offset + len);

        end = start + frag_iter->len;
        if ((copy = end - offset) > 0) {
            if (copy > len)
                copy = len;
            if (skb_copy_bits(frag_iter, offset - start, to, copy))
                goto fault;
            if ((len -= copy) == 0) /* done once the requested length has been copied */
                return 0;
            offset += copy;
            to += copy;
        }
        start = end;
    }

    if (!len)
        return 0;

fault:
    return -EFAULT;
}
EXPORT_SYMBOL(skb_copy_bits);
The pull performed by pskb_may_pull guarantees that the entire TCP header ends up in the linear area, so GRO does not interfere with TCP header processing. The payload that still sits in non-contiguous buffers is only copied out when the application reads it via a system call, at which point it is copied into the application's buffer. The application receives data through tcp_recvmsg:
18、tcp_recvmsg
int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
                int nonblock, int flags, int *addr_len)
{
    ...
    if (!(flags & MSG_TRUNC)) {
        err = skb_copy_datagram_msg(skb, offset, msg, used); /* copy data to the user buffer */
        if (err) {
            /* Exception. Bailout! */
            if (!copied)
                copied = -EFAULT;
            break;
        }
    }
    ...
skb_copy_datagram_msg copies the data in both the linear and the non-linear areas of the skb into the user buffer, so this is where data received via GRO is handed over to the application in full.