Under normal circumstances, closing a TCP connection takes an exchange of four segments between the two ends. Specifically, the end performing the active close (end A) first sends a FIN to its peer (end B); on receiving the FIN, B replies to A with an ACK. When B performs its own close, it sends a FIN to A, and A replies with an ACK to B, after which the connection is fully closed. The segment exchange and state transitions are shown in the figure below:
Normally only the end performing the active close enters the TIME_WAIT state, but there is one scenario in which both ends end up in TIME_WAIT. When the two ends send FIN to each other simultaneously, both move from ESTABLISHED to FIN_WAIT_1. Receiving the peer's FIN while in FIN_WAIT_1 moves each side from FIN_WAIT_1 to CLOSING, and the final ACK is sent; once that last ACK is received, the state changes to TIME_WAIT, as shown in the figure below:
These two scenarios are only part of how TCP connections get closed. In other situations the kernel may send the peer a RST instead of a FIN. That may be caused by a bug in the application, or by a shortage of kernel resources; finding and confirming the cause of such anomalies requires a fairly complete understanding of the kernel implementation.
Closing a connection at the application layer is very simple: you only specify the socket descriptor of the connection to close. Inside the kernel, the system call handling the close of a TCP connection is sys_close(). It does little itself; the main work of closing is done by tcp_close(). Compared with sys_close(), tcp_close() takes one extra parameter, timeout. This is evidently some kind of time limit, which is puzzling at first sight: close() takes no timeout argument, so where does the value come from? It is computed in inet_release(), the function above tcp_close() in the call chain, as follows:
```c
int inet_release(struct socket *sock)
{
	......
	timeout = 0;
	if (sock_flag(sk, SOCK_LINGER) &&
	    !(current->flags & PF_EXITING))
		timeout = sk->sk_lingertime;
	......
}
```
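So the timeout is nonzero only when the SOCK_LINGER flag is set on the sock and the process is not already exiting; in that case it is the sock's sk_lingertime. Both values come from the SO_LINGER socket option: sock_setsockopt() sets the SOCK_LINGER flag and stores l_linger (converted to jiffies) into sk_lingertime. A minimal user-space sketch; the function name and parameters are our own:

```c
#include <sys/socket.h>

/* Enable a lingering close on 'fd': close() may now block for up to
 * 'seconds' while queued data drains. This is exactly the timeout that
 * inet_release() passes down to tcp_close() above.
 */
static int enable_linger(int fd, int seconds)
{
	struct linger lg = {
		.l_onoff  = 1,		/* sets the SOCK_LINGER flag */
		.l_linger = seconds,	/* stored as sk->sk_lingertime (jiffies) */
	};

	return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```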
Starting from tcp_close(), let us see how the kernel carries out a close request issued by the application.
tcp_close() first calls lock_sock() to take the lock that serializes access to the sock instance, then sets sk_shutdown to SHUTDOWN_MASK. sk_shutdown can take the values RCV_SHUTDOWN (1), SEND_SHUTDOWN (2), and SHUTDOWN_MASK (3), which stand for shutting down the receive path, the send path, and both, respectively; so here both directions are shut down at once. The code then checks whether the socket is in the LISTEN state; since we are examining the close of a socket in the ESTABLISHED state, we skip that part.
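Incidentally, the same bits back the shutdown(2) system call: inet_shutdown() maps the user-space how argument onto sk_shutdown by adding one, so SHUT_RD (0) becomes RCV_SHUTDOWN (1), SHUT_WR (1) becomes SEND_SHUTDOWN (2), and SHUT_RDWR (2) becomes SHUTDOWN_MASK (3). A user-space sketch with our own function name:

```c
#include <sys/socket.h>

/* Half-close only the send direction: the kernel sets SEND_SHUTDOWN and
 * sends a FIN, but the receive path stays open. close(), by contrast,
 * shuts down both directions at once with SHUTDOWN_MASK, as seen above.
 */
static int half_close(int fd)
{
	return shutdown(fd, SHUT_WR);
}
```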
If the receive queue is not empty at this point, that is, data has been received but never read by the application, all SKBs on the receive queue are freed, and the length of the discarded data is accumulated in the local variable data_was_unread (initialized to 0). If anything was thrown away, the kernel moves the socket straight to CLOSE and sends the peer a RST instead of a FIN, as shown below:
```c
	/* Throw away everything still sitting on the receive queue. A FIN
	 * occupies one sequence number but carries no data, so it is
	 * subtracted when computing the amount of unread data.
	 */
	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
		u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
			  tcp_hdr(skb)->fin;
		data_was_unread += len;
		__kfree_skb(skb);
	}

	......

	if (data_was_unread) {
		/* Unread data was tossed, zap the connection. */
		NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
		tcp_set_state(sk, TCP_CLOSE);
		tcp_send_active_reset(sk, sk->sk_allocation);
	}
```
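This branch is easy to trigger from user space: close a connection while received data is still sitting unread in the receive queue, and the kernel sends a RST instead of a FIN (the event shows up in the TCPAbortOnClose counter). A hedged sketch of such a receiver; all names here are our own:

```c
#include <unistd.h>
#include <sys/socket.h>

/* Accept a connection, let the peer's data pile up unread in
 * sk_receive_queue, then close. tcp_close() will see
 * data_was_unread > 0 and take the tcp_send_active_reset() path above.
 */
static void abortive_close(int listen_fd)
{
	int c = accept(listen_fd, NULL, NULL);

	if (c < 0)
		return;
	sleep(1);	/* give the peer time to send something */
	close(c);	/* unread data ==> RST, not FIN */
}
```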
Now assume the socket's receive queue is empty. tcp_close() then calls tcp_close_state(), which looks up the state to enter next when a close is performed from the sock's current state. Its return value is either 0 or TCP_ACTION_FIN: TCP_ACTION_FIN means a FIN must be sent, 0 means no FIN is needed. When the close happens in the ESTABLISHED state, the next state is FIN_WAIT_1 and tcp_close_state() returns TCP_ACTION_FIN (a sketch of this lookup follows the listing below), so tcp_send_fin() is called to send a FIN to the peer. Inside tcp_send_fin(), if the sock's send queue is not empty, the FIN flag is simply added to the last SKB waiting to be sent; if the queue is empty, a new SKB is allocated, marked with FIN, and appended to the send queue. Either way the FIN consumes one sequence number, so write_seq is advanced by one. Finally, __tcp_push_pending_frames() is called to push the queued SKBs out. The code of tcp_send_fin() is as follows:
```c
/* Send a fin. The caller locks the socket for us. This cannot be
 * allowed to fail queueing a FIN frame under any circumstances.
 */
void tcp_send_fin(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb = tcp_write_queue_tail(sk);
	int mss_now;

	/* Optimization, tack on the FIN if we have a queue of
	 * unsent frames. But be careful about outgoing SACKS
	 * and IP options.
	 */
	mss_now = tcp_current_mss(sk);

	if (tcp_send_head(sk) != NULL) {
		TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_FIN;
		TCP_SKB_CB(skb)->end_seq++;
		tp->write_seq++;
	} else {
		/* Socket is locked, keep trying until memory is available. */
		for (;;) {
			skb = alloc_skb_fclone(MAX_TCP_HEADER,
					       sk->sk_allocation);
			if (skb)
				break;
			yield();
		}

		/* Reserve space for headers and prepare control bits. */
		skb_reserve(skb, MAX_TCP_HEADER);
		/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
		tcp_init_nondata_skb(skb, tp->write_seq,
				     TCPCB_FLAG_ACK | TCPCB_FLAG_FIN);
		tcp_queue_skb(sk, skb);
	}
	__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
}
```
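As mentioned above, tcp_close_state() drives the state change from a small static table. The following is a sketch of the mechanism, not a verbatim quote: new_state[] maps each state to its successor, and entries additionally carrying the TCP_ACTION_FIN flag (a bit outside TCP_STATE_MASK) tell the caller to transmit a FIN; for example, ESTABLISHED maps to FIN_WAIT1 plus TCP_ACTION_FIN, and CLOSE_WAIT maps to LAST_ACK plus TCP_ACTION_FIN.

```c
static int tcp_close_state(struct sock *sk)
{
	int next = (int)new_state[sk->sk_state];	/* table lookup */
	int ns = next & TCP_STATE_MASK;			/* strip the action flag */

	tcp_set_state(sk, ns);				/* enter the new state */

	return next & TCP_ACTION_FIN;			/* nonzero: send a FIN */
}
```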
After the FIN has been sent, the sock's current state is saved in the local variable state and an extra reference to the sock is taken. sock_orphan() is then called to detach the sock from its socket structure and to set the SOCK_DEAD flag, marking the socket as about to die. Part of this detaching is dropping the wait queue, which the kernel does simply by setting the sk_sleep member to NULL. That raises a question: sk_sleep is a pointer, so if it is just set to NULL, how is the memory it points to freed, and what about processes still waiting on it? The answer is that sk_sleep points at the wait member embedded in the socket structure (it is assigned in sock_init_data()), so the wait queue is released together with the socket structure itself. Next, release_sock() is called, which processes the SKBs queued on the sock's backlog; if a packet there does not acknowledge exactly up to the sequence number of our FIN, the tcp_rcv_state_process() handling may likewise send a RST to the peer, much as tcp_close() treats unread data on the receive queue. We will see this below when we look at how an ACK is handled in FIN_WAIT_1. The sock is now waiting to be released, so the count of orphaned socks pending destruction is incremented by one.
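For reference, sock_orphan() in kernels of this era (include/net/sock.h) looks roughly as follows; note that sk_sleep is indeed simply cleared, as discussed:

```c
static inline void sock_orphan(struct sock *sk)
{
	write_lock_bh(&sk->sk_callback_lock);
	sock_set_flag(sk, SOCK_DEAD);	/* mark the sock as dying */
	sk_set_socket(sk, NULL);	/* detach from the socket structure */
	sk->sk_sleep = NULL;		/* wait queue lives in the socket */
	write_unlock_bh(&sk->sk_callback_lock);
}
```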
To follow the full lifetime of a TCP connection, our discussion assumes that tcp_close() leaves the sock in FIN_WAIT_1. Under that assumption we reach the following code:
```c
	if (sk->sk_state != TCP_CLOSE) {
		int orphan_count = percpu_counter_read_positive(
						sk->sk_prot->orphan_count);

		sk_mem_reclaim(sk);
		if (tcp_too_many_orphans(sk, orphan_count)) {
			if (net_ratelimit())
				printk(KERN_INFO "TCP: too many of orphaned "
				       "sockets\n");
			tcp_set_state(sk, TCP_CLOSE);
			tcp_send_active_reset(sk, GFP_ATOMIC);
			NET_INC_STATS_BH(sock_net(sk),
					 LINUX_MIB_TCPABORTONMEMORY);
		}
	}
```
Here, provided the sock has not already been put in the CLOSE state, tcp_too_many_orphans() checks whether orphaned sockets are consuming too many resources; if so, the connection is reset on the spot (counted as TCPAbortOnMemory), as the code just above shows. That happens when either of the following holds (see the sketch after this list):

1. The number of sockets waiting to be destroyed exceeds the sysctl_tcp_max_orphans setting.
2. The total length of the data queued in the sock's send buffer exceeds SOCK_MIN_SNDBUF (2048 bytes), and the memory allocated by the TCP layer has risen past the hard limit sysctl_tcp_mem[2], i.e. TCP is under severe memory pressure.
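These two conditions correspond directly to tcp_too_many_orphans(), which in kernels of this era (include/net/tcp.h) reads roughly:

```c
static inline int tcp_too_many_orphans(struct sock *sk, int num)
{
	return (num > sysctl_tcp_max_orphans) ||
	       (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
		atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]);
}
```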
With that, the work of tcp_close() is essentially done. As stated above, we assume it leaves the sock instance in the FIN_WAIT_1 state.
In FIN_WAIT_1, an incoming SKB is handled by tcp_rcv_state_process(). The parts of that function relevant to FIN_WAIT_1 are shown below:
```c
/*
 * This function implements the receiving procedure of RFC 793 for
 * all states except ESTABLISHED and TIME_WAIT.
 * It's called from both tcp_v4_rcv and tcp_v6_rcv and should be
 * address independent.
 */
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
			  struct tcphdr *th, unsigned len)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	int queued = 0;
	int res;

	tp->rx_opt.saw_tstamp = 0;

	......

	res = tcp_validate_incoming(sk, skb, th, 0);
	if (res <= 0)
		return -res;

	/* step 5: check the ACK field */
	if (th->ack) {
		int acceptable = tcp_ack(sk, skb, FLAG_SLOWPATH) > 0;

		switch (sk->sk_state) {
		......
		case TCP_FIN_WAIT1:
			if (tp->snd_una == tp->write_seq) {
				tcp_set_state(sk, TCP_FIN_WAIT2);
				sk->sk_shutdown |= SEND_SHUTDOWN;
				dst_confirm(sk->sk_dst_cache);

				if (!sock_flag(sk, SOCK_DEAD))
					/* Wake up lingering close() */
					sk->sk_state_change(sk);
				else {
					int tmo;

					if (tp->linger2 < 0 ||
					    (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
					     after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt))) {
						tcp_done(sk);
						NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
						return 1;
					}

					tmo = tcp_fin_time(sk);
					if (tmo > TCP_TIMEWAIT_LEN) {
						inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
					} else if (th->fin || sock_owned_by_user(sk)) {
						/* Bad case. We could lose such FIN otherwise.
						 * It is not a big problem, but it looks confusing
						 * and not so rare event. We still can lose it now,
						 * if it spins in bh_lock_sock(), but it is really
						 * marginal case.
						 */
						inet_csk_reset_keepalive_timer(sk, tmo);
					} else {
						tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
						goto discard;
					}
				}
			}
			break;
		......
		}
	} else
		goto discard;

	/* step 6: check the URG bit */
	tcp_urg(sk, skb, th);

	/* step 7: process the segment text */
	switch (sk->sk_state) {
	......
	case TCP_FIN_WAIT1:
	case TCP_FIN_WAIT2:
		/* RFC 793 says to queue data in these states,
		 * RFC 1122 says we MUST send a reset.
		 * BSD 4.4 also does reset.
		 */
		if (sk->sk_shutdown & RCV_SHUTDOWN) {
			if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
			    after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
				tcp_reset(sk);
				return 1;
			}
		}
		/* Fall through */
	case TCP_ESTABLISHED:
		tcp_data_queue(sk, skb);
		queued = 1;
		break;
	}

	/* tcp_data could move socket to TIME-WAIT */
	if (sk->sk_state != TCP_CLOSE) {
		tcp_data_snd_check(sk);
		tcp_ack_snd_check(sk);
	}

	if (!queued) {
discard:
		__kfree_skb(skb);
	}
	return 0;
}
```
Look first at the FIN_WAIT_1 branch of step 5, the ACK processing. The check `if (tp->snd_una == tp->write_seq)` asks whether the incoming ACK acknowledges everything we have sent, including our FIN; if it does, the state moves to FIN_WAIT_2. Because tcp_close() orphaned the sock, SOCK_DEAD is set, so the else branch then decides how long the orphan may linger in FIN_WAIT_2: if tp->linger2 is negative or the segment carries unexpected new data, the connection is aborted via tcp_done(); otherwise, depending on tcp_fin_time() and on whether the segment itself carries a FIN, the kernel either rearms the keepalive timer or enters the TIME_WAIT machinery early with tcp_time_wait(sk, TCP_FIN_WAIT2, tmo).
Next, turn to the FIN_WAIT_1/FIN_WAIT_2 branch of step 7, which processes the segment text. Here the kernel checks whether the receive direction has been shut down in sk_shutdown; as we saw, tcp_close() set SHUTDOWN_MASK, so this test succeeds. Let's look at the inner condition:
```c
	if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
	    after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt))
```

The first half requires that the segment occupies sequence space at all; the second half requires that, after discounting the one sequence number a FIN consumes, the segment ends beyond rcv_nxt, i.e. it carries data we have not yet received. New data arriving after we have shut down the receive path means the peer ignored our close, so the kernel bumps the TCPAbortOnData counter, calls tcp_reset(), and returns 1.
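For completeness, after() is the kernel's wrap-safe sequence-number comparison, defined in include/net/tcp.h in terms of before():

```c
/* Sequence numbers wrap around, so ordering is decided by a signed
 * 32-bit subtraction instead of a plain '<' comparison.
 */
static inline int before(__u32 seq1, __u32 seq2)
{
	return (__s32)(seq1 - seq2) < 0;
}
#define after(seq2, seq1)	before(seq1, seq2)
```

When tcp_rcv_state_process() returns nonzero like this, its caller tcp_v4_do_rcv() responds by sending a RST to the peer: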
```c
/* The socket must have it's spinlock held when we get
 * here.
 *
 * We have a potential double-lock case here, so even when
 * doing backlog processing we use the BH locking scheme.
 * This is because we cannot sleep with the original spinlock
 * held.
 */
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
	......
	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
		rsk = sk;
		goto reset;
	}
	TCP_CHECK_TIMER(sk);
	return 0;

reset:
	tcp_v4_send_reset(rsk, skb);
discard:
	kfree_skb(skb);
	/* Be careful here. If this function gets more complicated and
	 * gcc suffers from register pressure on the x86, sk (in %ebx)
	 * might be destroyed here. This current version compiles correctly,
	 * but you have been warned.
	 */
	return 0;
	......
}
```
If the segment is acceptable, step 7 hands it to tcp_data_queue(). When the peer's FIN arrives in sequence, tcp_data_queue() invokes tcp_fin(); for a sock in FIN_WAIT_2 this sends the final ACK and moves the connection into TIME_WAIT:

```c
/*
 * Process the FIN bit. This now behaves as it is supposed to work
 * and the FIN takes effect when it is validly part of sequence
 * space. Not before when we get holes.
 *
 * If we are ESTABLISHED, a received fin moves us to CLOSE-WAIT
 * (and thence onto LAST-ACK and finally, CLOSE, we never enter
 * TIME-WAIT)
 *
 * If we are in FINWAIT-1, a received FIN indicates simultaneous
 * close and we go into CLOSING (and later onto TIME-WAIT)
 *
 * If we are in FINWAIT-2, a received FIN moves us to TIME-WAIT.
 */
static void tcp_fin(struct sk_buff *skb, struct sock *sk, struct tcphdr *th)
{
	struct tcp_sock *tp = tcp_sk(sk);

	sk->sk_shutdown |= RCV_SHUTDOWN;
	sock_set_flag(sk, SOCK_DONE);

	switch (sk->sk_state) {
	......
	case TCP_FIN_WAIT2:
		/* Received a FIN -- send ACK and enter TIME_WAIT. */
		tcp_send_ack(sk);
		tcp_time_wait(sk, TCP_TIME_WAIT, 0);
		break;
	......
	}

	/* It _is_ possible, that we have something out-of-order _after_ FIN.
	 * Probably, we should reset in this case. For now drop them.
	 */
	__skb_queue_purge(&tp->out_of_order_queue);
	if (tcp_is_sack(tp))
		tcp_sack_reset(&tp->rx_opt);
	sk_mem_reclaim(sk);
	......
}
```
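A final note on the timeouts seen above: TCP_TIMEWAIT_LEN is a compile-time constant, and tcp_fin_time() derives the FIN_WAIT_2 lifetime from tp->linger2 (settable via the TCP_LINGER2 socket option) or the tcp_fin_timeout sysctl, clamped from below by the retransmission timeout. Roughly, for kernels of this era:

```c
/* How long a socket stays in TIME_WAIT (include/net/tcp.h): */
#define TCP_TIMEWAIT_LEN (60*HZ)	/* about 60 seconds */

/* FIN_WAIT_2 lifetime: linger2 if set, else the tcp_fin_timeout sysctl,
 * but never shorter than 3.5 * RTO.
 */
static inline int tcp_fin_time(const struct sock *sk)
{
	int fin_timeout = tcp_sk(sk)->linger2 ? : sysctl_tcp_fin_timeout;
	const int rto = inet_csk(sk)->icsk_rto;

	if (fin_timeout < (rto << 2) - (rto >> 1))
		fin_timeout = (rto << 2) - (rto >> 1);

	return fin_timeout;
}
```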
At this point, the TCP connection is finally closed.
Packets that arrive from the peer while the connection sits in TIME_WAIT still receive some processing, of course, but that is beyond the scope of this article; a follow-up article will cover the kernel's handling of the TIME_WAIT state.