TCP receive processing involves three queues: the receive queue, the backlog queue, and the prequeue.
Normally the receive queue accepts incoming segments and hands their data to the socket. If the socket is currently locked by a user process, the segment is instead appended to the backlog queue.
A segment is diverted to the prequeue only when header prediction succeeds (i.e., the segment is in-order) and the connection is in the ESTABLISHED state.
The receive/backlog path is referred to as the slow path, and the prequeue as the fast path.
The TCP receiver (tcp_v4_rcv) delivers each TCP segment to its destination socket for receive processing.
If the socket is locked by a user, the segment is temporarily placed on that socket's backlog queue (sk_add_backlog).
If a user thread then tries to lock the socket (lock_sock),
that thread is queued on the socket's lock wait queue (sk->sk_lock.wq).
When the user releases the locked socket (release_sock),
the segments on the backlog queue are immediately fed into the TCP handler (tcp_v4_do_rcv),
after which the first user on the wait queue is woken to take ownership of the lock. If the socket is not locked
and a user is currently reading from it, the segment is placed on the socket's prequeue (tcp_prequeue)
and handed over for processing in that user thread's context.
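For concreteness, sk_add_backlog in kernels of this vintage is just a tail insertion onto a singly-linked list hanging off the sock (include/net/sock.h, simplified sketch):

static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
	/* Append to the backlog list; release_sock() will drain it later. */
	if (!sk->sk_backlog.tail) {
		sk->sk_backlog.head = sk->sk_backlog.tail = skb;
	} else {
		sk->sk_backlog.tail->next = skb;
		sk->sk_backlog.tail = skb;
	}
	skb->next = NULL;
}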
Let's walk through the code, starting with tcp_v4_rcv() in net/ipv4/tcp_ipv4.c:
	bh_lock_sock_nested(sk);
	ret = 0;
	if (!sock_owned_by_user(sk)) {
#ifdef CONFIG_NET_DMA
		struct tcp_sock *tp = tcp_sk(sk);
		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
			tp->ucopy.dma_chan = get_softnet_dma();
		if (tp->ucopy.dma_chan)
			ret = tcp_v4_do_rcv(sk, skb);
		else
#endif
		{
			if (!tcp_prequeue(sk, skb))
				ret = tcp_v4_do_rcv(sk, skb);
		}
	} else
		sk_add_backlog(sk, skb);
	bh_unlock_sock(sk);
Note:
Locking in the Linux network stack:
"lock_sock and release_sock do not hold a normal spinlock directly
but instead hold the owner field and do other housework as well."
In other words, lock_sock and release_sock do not simply hold a spinlock; they take ownership by setting sk->sk_lock.owner.
/* This is the per-socket lock. The spinlock provides a synchronization
 * between user contexts and software interrupt processing, whereas the
 * mini-semaphore synchronizes multiple users amongst themselves.
 */
struct sock_iocb;
typedef struct {
	spinlock_t		slock;
	struct sock_iocb	*owner;
	wait_queue_head_t	wq;
	/*
	 * We express the mutex-alike socket_lock semantics
	 * to the lock validator by explicitly managing
	 * the slock as a lock variant (in addition to
	 * the slock itself):
	 */
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
} socket_lock_t;
lock_sock acquires sk->sk_lock.slock, disables local bottom halves, and examines the owner field. If owner is already set, the caller sleeps until the lock is released, then sets owner and drops sk->sk_lock.slock. This means bh_lock_sock can still succeed even while the socket is "locked" by lock_sock, although another lock_sock cannot proceed. [Reading lock_sock and __lock_sock: lock_sock takes sk_lock.slock, but when owner is set, __lock_sock releases sk_lock.slock while it waits, so lock_sock does not block bh_lock_sock. For example, in the code above, the call to bh_lock_sock_nested(sk) does not interfere with the subsequent sock_owned_by_user(sk) check.]
release_sock acquires sk_lock.slock, processes any packets queued on the backlog, clears the owner field, wakes up the waiters on sk_lock.wq, then releases sk_lock.slock and re-enables bottom halves.
bh_lock_sock and bh_unlock_sock simply acquire and release the sk->sk_lock.slock spinlock.
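For reference, this is roughly what these primitives look like in a 2.6-era kernel (include/net/sock.h and net/core/sock.c); a simplified sketch, not an exact copy of any particular release:

#define sock_owned_by_user(sk)	((sk)->sk_lock.owner)

static void __lock_sock(struct sock *sk)
{
	DEFINE_WAIT(wait);

	for (;;) {
		/* Sleep (not spin) until the current owner lets go. */
		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
					  TASK_UNINTERRUPTIBLE);
		spin_unlock_bh(&sk->sk_lock.slock);
		schedule();
		spin_lock_bh(&sk->sk_lock.slock);
		if (!sock_owned_by_user(sk))
			break;
	}
	finish_wait(&sk->sk_lock.wq, &wait);
}

void release_sock(struct sock *sk)
{
	spin_lock_bh(&sk->sk_lock.slock);
	/* Drain packets queued by softirq while we owned the socket. */
	if (sk->sk_backlog.tail)
		__release_sock(sk);	/* feeds each skb to sk_backlog_rcv */
	sk->sk_lock.owner = NULL;
	if (waitqueue_active(&sk->sk_lock.wq))
		wake_up(&sk->sk_lock.wq);
	spin_unlock_bh(&sk->sk_lock.slock);
}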
Now trace the tcp_prequeue() function:
/* Packet is added to VJ-style prequeue for processing in process
 * context, if a reader task is waiting. Apparently, this exciting
 * idea (VJ's mail "Re: query about TCP header on tcp-ip" of 07 Sep 93)
 * failed somewhere. Latency? Burstiness? Well, at least now we will
 * see, why it failed. 8)8) --ANK
 *
 * NOTE: is this not too big to inline?
 */
static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (!sysctl_tcp_low_latency && tp->ucopy.task) {
		__skb_queue_tail(&tp->ucopy.prequeue, skb);
		tp->ucopy.memory += skb->truesize;
		if (tp->ucopy.memory > sk->sk_rcvbuf) {
			struct sk_buff *skb1;

			BUG_ON(sock_owned_by_user(sk));

			while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) {
				sk->sk_backlog_rcv(sk, skb1);
				NET_INC_STATS_BH(LINUX_MIB_TCPPREQUEUEDROPPED);
			}

			tp->ucopy.memory = 0;
		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
			wake_up_interruptible(sk->sk_sleep);
			if (!inet_csk_ack_scheduled(sk))
				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
							  (3 * TCP_RTO_MIN) / 4,
							  TCP_RTO_MAX);
		}
		return 1;
	}
	return 0;
}
Note:
/* Data for direct copy to user */
struct {
	struct sk_buff_head	prequeue;
	struct task_struct	*task;
	struct iovec		*iov;
	int			memory;
	int			len;
#ifdef CONFIG_NET_DMA
	/* members for async copy */
	struct dma_chan		*dma_chan;
	int			wakeup;
	struct dma_pinned_list	*pinned_list;
	dma_cookie_t		dma_cookie;
#endif
} ucopy;
ucopy is a member of the TCP socket structure (struct tcp_sock). Once segments are put on the prequeue, they are processed in the application task's context rather than in softirq context. This improves the efficiency of TCP by minimizing context switches between kernel and user. If tcp_prequeue returns zero, it means that there was no current user task associated with the socket, so tcp_v4_do_rcv is called to continue with normal "slow path" receive processing.
The prequeue field holds the list of socket buffers waiting for processing. task is the user-level task that is to receive the data. The iov field points to the user's receive data array, memory accumulates the truesize of all the socket buffers on the prequeue (as the code above shows), and len is the number of bytes the user task has asked to receive.
In tcp_prequeue, if a reader task is waiting (tp->ucopy.task is set), the skb is appended to tp->ucopy.prequeue. If the accumulated truesize of the queued buffers exceeds the socket's sk_rcvbuf, sk_backlog_rcv (tcp_v4_do_rcv for TCP/IPv4) is called on the spot to process everything on the prequeue; the BUG_ON(sock_owned_by_user(sk)) holds because tcp_v4_rcv only enters tcp_prequeue when the socket is not owned by a user. If the skb just added is the only one on the prequeue, the wait queue sk->sk_sleep is woken, and, unless an ACK is already scheduled, the delayed-ACK timer is armed at (3 * TCP_RTO_MIN) / 4 so that an ACK still goes out promptly even if the reader task is slow to run. Note that this whole fast path is gated by sysctl_tcp_low_latency (net.ipv4.tcp_low_latency): setting it to 1 bypasses the prequeue entirely.
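A side note on sk_backlog_rcv: TCP never assigns it directly. Assuming the 2.6 layout, the handler is declared in tcp_prot and copied onto the socket at initialization time; an abbreviated sketch of the wiring (most initializers omitted):

/* net/ipv4/tcp_ipv4.c, abbreviated */
struct proto tcp_prot = {
	.name		= "TCP",
	.backlog_rcv	= tcp_v4_do_rcv,
};

/* net/core/sock.c, in sock_init_data() */
sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;

So both the backlog drain in release_sock and the prequeue overflow path above end up in tcp_v4_do_rcv.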
tcp_recvmsg() drains the prequeue via tcp_prequeue_process():
static void tcp_prequeue_process(struct sock *sk)
{
	struct sk_buff *skb;
	struct tcp_sock *tp = tcp_sk(sk);

	NET_INC_STATS_USER(LINUX_MIB_TCPPREQUEUED);

	/* RX process wants to run with disabled BHs, though it is not
	 * necessary */
	local_bh_disable();
	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
		sk->sk_backlog_rcv(sk, skb);
	local_bh_enable();

	/* Clear memory counter. */
	tp->ucopy.memory = 0;
}
So tcp_prequeue_process also ends up calling tcp_v4_do_rcv, but in process context, invoked from tcp_recvmsg, rather than in softirq context.
tcp_recvmsg():
	.......
	if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
		/* Install new reader */
		if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
			user_recv = current;
			tp->ucopy.task = user_recv;
			tp->ucopy.iov = msg->msg_iov;
		}

		tp->ucopy.len = len;

		BUG_TRAP(tp->copied_seq == tp->rcv_nxt ||
			 (flags & (MSG_PEEK | MSG_TRUNC)));

		/* Ugly... If prequeue is not empty, we have to
		 * process it before releasing socket, otherwise
		 * order will be broken at second iteration.
		 * More elegant solution is required!!!
		 *
		 * Look: we have the following (pseudo)queues:
		 *
		 * 1. packets in flight
		 * 2. backlog
		 * 3. prequeue
		 * 4. receive_queue
		 *
		 * Each queue can be processed only if the next ones
		 * are empty. At this point we have empty receive_queue.
		 * But prequeue _can_ be not empty after 2nd iteration,
		 * when we jumped to start of loop because backlog
		 * processing added something to receive_queue.
		 * We cannot release_sock(), because backlog contains
		 * packets arrived _after_ prequeued ones.
		 *
		 * Shortly, algorithm is clear --- to process all
		 * the queues in order. We could make it more directly,
		 * requeueing packets from backlog to prequeue, if
		 * is not empty. It is more elegant, but eats cycles,
		 * unfortunately.
		 */
		if (!skb_queue_empty(&tp->ucopy.prequeue))
			goto do_prequeue;

		/* __ Set realtime policy in scheduler __ */
	}
	........
	do_prequeue:
		tcp_prequeue_process(sk);
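To see when this path actually engages, consider a minimal user-space reader; this is only an illustrative sketch (the server address and port are made up). While the blocking recv() below sleeps inside tcp_recvmsg(), the calling task has been installed as tp->ucopy.task, so in-order segments arriving in softirq context are diverted to the prequeue and processed later in this process's context:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	char buf[4096];
	struct sockaddr_in peer;
	ssize_t n;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(5555);	/* hypothetical server port */
	inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

	if (fd < 0 || connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		perror("connect");
		return 1;
	}

	/* While recv() blocks here, tcp_recvmsg() has set tp->ucopy.task,
	 * so the kernel's prequeue fast path is active for this socket. */
	while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
		write(STDOUT_FILENO, buf, (size_t)n);

	close(fd);
	return 0;
}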
References:
http://www.linuxforum.net/forum/showflat.php?Cat=&Board=linuxK&Number=138240&page=129&view=collapsed&sb=5&o=all
http://www.linux-foundation.org/en/Net:Socket_Locks