A Walkthrough of the Linux TCP Stack Timers

Introduction: The TCP stack is event-driven. If the congestion control algorithm is the heart of the TCP stack, then its timers are the heartbeat.

1. Data Structures

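The TCP stack has many timers and they are unusually complex; this article is a brief walkthrough of its timer mechanism. N. Wirth said that programs = data structures + algorithms, and an expert once said that a good data structure design is half of a successful program. So we start with the timer data structure: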

struct timer_list    //include/linux/timer.h

struct timer_list {
    /*
     * All fields that change during normal runtime grouped to the
     * same cacheline
     */
    struct list_head entry;
    unsigned long expires;
    struct tvec_base *base;

    void (*function)(unsigned long);
    unsigned long data;

    int slack;

#ifdef CONFIG_TIMER_STATS
    int start_pid;
    void *start_site;
    char start_comm[16];
#endif
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};

Every timer is based on the jiffies variable. Timers are organized as lists: a doubly linked list connects the registered timers to one another.

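  • entry: the list element.
  • function: a pointer to the callback that is invoked when the timer expires.
  • data: an argument passed to the callback.
  • expires: the expiry time of the timer, in jiffies.
  • base: a pointer to a base element (a tvec_base) in which timers are grouped by expiry time.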

Access to a timer_list is serialized by the lock of its timer base, which __mod_timer and del_timer take through lock_timer_base().

Each TCP connection has nine timers:

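  • SYNACK timer

The SYNACK timer covers the transition from LISTEN to SYN_RECV. The initial TCP RTO is 1 s: if the server does not receive the client's ACK within 1 s during the three-way handshake, it treats the SYN-ACK as timed out and retransmits it immediately.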

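  • Retransmit timer

TCP uses ACKs to confirm that data has reached the peer. If a segment stays unacknowledged, the retransmit timer fires and the segment is retransmitted. The timer runs while data is in flight, and its timeout is derived from the measured RTT, bounded by TCP_RTO_MIN and TCP_RTO_MAX.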

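  • Delayed ACK timer

The delayed ACK timer addresses the problem of a network flooded with small packets, which lowers transmission efficiency: it postpones the sending of ACK packets. Linux uses a delay of 40 ms, while Windows defaults to 200 ms.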

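  • Keepalive timer

The keepalive timer checks whether an idle connection is still alive. With the default settings it fires after the connection has been idle for 2 hours and then sends a probe every 75 s, up to 9 probes; if none of them is answered, the connection is reset.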

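  • Persist timer (zero-window probe timer)

When the sender receives an advertised window of 0, it must stop sending until the receiver sends a window update. That update can be lost, and a sender that merely waits would stall until the connection is eventually dropped. This is where the persist timer comes in: the sender periodically sends a one-segment probe, the peer answers it with an ACK, and the sender thus learns about the window update in time. Once the window is non-zero, normal data transfer resumes.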

  • FIN_WAIT2 timer

The FIN_WAIT2 timer covers the transition from FIN_WAIT2 to CLOSED: if the peer never sends its FIN, the timer fires at the timeout and moves the connection from FIN_WAIT2 to CLOSED.

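  • TIME_WAIT timer

The TIME_WAIT timer starts when the socket enters the TIME_WAIT state. Until it expires, the tw sock that replaces the socket handles packets from the old connection so that they cannot harm a new one. When the timer expires, the tw sock is deleted and its port number is released.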

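  • ER (Early Retransmit) timer

ER comes in two flavors, byte-based and segment-based; Linux implements the segment-based one. With a small window (flight count below 4), ER lowers the duplicate-ACK threshold that triggers fast retransmit, for example to 1 or 2. To guard against spurious retransmissions, the Linux implementation adds a delay before retransmitting, and the ER timer provides that delay.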

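  • Tail Loss Probe timer

When the congestion window is small and the last segments of a transfer are lost, fast retransmit cannot recover the loss promptly because too few ACKs arrive. The Tail Loss Probe timer was designed to solve this problem.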
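Going back to the keepalive timer in the list above, a small sketch makes the numbers concrete. It adds up how long an unresponsive peer can go unnoticed with the defaults quoted there (2 hours idle, 75 s between probes, 9 probes). The function name keepalive_detect_secs is hypothetical, and the model assumes that every probe goes unanswered.

```c
/* Worst-case time, in seconds, before an unresponsive peer is detected.
 * Hypothetical helper: one idle period, then `probes` probes that are
 * `intvl_secs` apart and all go unanswered. */
unsigned long keepalive_detect_secs(unsigned long idle_secs,
                                    unsigned long intvl_secs,
                                    unsigned int probes)
{
    return idle_secs + intvl_secs * probes;
}
```

With the defaults this gives 7200 + 9 * 75 = 7875 s, a little over 2 hours and 11 minutes.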

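These timers are backed by timer_list structures attached to the socket. Because some of the timers can never be armed at the same time, TCP implements all nine timers with just five timer_list structures. The relevant data structures are: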

/include/net/sock.h

struct sock {
    ...
    struct timer_list   sk_timer; // keepalive, SYN-ACK and FIN_WAIT2 timers
    ...
};

/include/net/inet_connection_sock.h  

struct inet_connection_sock {
    struct timer_list     icsk_retransmit_timer; // retransmit, loss probe, ER and persist timers
    struct timer_list     icsk_delack_timer;     // delayed ACK timer
};

/include/net/inet_timewait_sock.h  

struct inet_timewait_death_row {  
    struct timer_list   twcal_timer; // TIME_WAIT timer (short-time calendar)
    struct timer_list   tw_timer;    // TIME_WAIT timer
}

2. Linux Timers

The usual operations on a Linux timer are init_timer, add_timer, mod_timer and del_timer.

[Figure: timer_function]

kernel/timer.c  init_timers
void __init init_timers(void)
{
    int err;

    /* ensure there are enough low bits for flags in timer->base pointer */
    BUILD_BUG_ON(__alignof__(struct tvec_base) & TIMER_FLAG_MASK);

    err = timer_cpu_notify(&timers_nb, (unsigned long)CPU_UP_PREPARE,
                   (void *)(long)smp_processor_id());
    init_timer_stats();

    BUG_ON(err != NOTIFY_OK);
    register_cpu_notifier(&timers_nb);
    open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
}

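init_timer() itself binds a kernel timer to the timer base of a CPU and sets the next pointer of its list entry to NULL. The init_timers() listing above is the boot-time counterpart: it prepares the per-CPU timer bases and registers run_timer_softirq as the TIMER_SOFTIRQ handler.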

kernel/timer.c add_timer
void add_timer(struct timer_list *timer)
{
    BUG_ON(timer_pending(timer));
    mod_timer(timer, timer->expires);
}

add_timer is essentially a call to mod_timer.

kernel/timer.c mod_timer
int mod_timer(struct timer_list *timer, unsigned long expires)
{
    expires = apply_slack(timer, expires);

    /*
     * This is a common optimization triggered by the
     * networking code - if the timer is re-modified
     * to be the same thing then just return:
     */
    if (timer_pending(timer) && timer->expires == expires)
        return 1;

    return __mod_timer(timer, expires, false, TIMER_NOT_PINNED);
}

mod_timer works as follows: it calls __mod_timer, which first detaches the timer from its list if it is pending, then updates expires, and finally adds the timer back.

static inline int
__mod_timer(struct timer_list *timer, unsigned long expires,
                        bool pending_only, int pinned)
{
    struct tvec_base *base, *new_base;
    unsigned long flags;
    int ret = 0 , cpu;

    timer_stats_timer_set_start_info(timer);
    BUG_ON(!timer->function);

    base = lock_timer_base(timer, &flags); // take the lock of the timer's base

    ret = detach_if_pending(timer, base, false); // detach the timer if it is pending (via detach_timer)
    if (!ret && pending_only)
        goto out_unlock;

    debug_activate(timer, expires);

    cpu = smp_processor_id(); // id of the current CPU

#if defined(CONFIG_NO_HZ_COMMON) && defined(CONFIG_SMP)
    if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
        cpu = get_nohz_timer_target();
#endif
    new_base = per_cpu(tvec_bases, cpu);

    if (base != new_base) {
        /*
         * We are trying to schedule the timer on the local CPU.
         * However we can't change timer's base while it is running,
         * otherwise del_timer_sync() can't detect that the timer's
         * handler yet has not finished. This also guarantees that
         * the timer is serialized wrt itself.
         */
        if (likely(base->running_timer != timer)) {  // safe to move the timer to the new base
            /* See the comment in lock_timer_base() */
            timer_set_base(timer, NULL);
            spin_unlock(&base->lock);
            base = new_base;
            spin_lock(&base->lock);
            timer_set_base(timer, base);
        }
    }

    timer->expires = expires;  // update expires
    internal_add_timer(base, timer);  // add the timer back

out_unlock:
    spin_unlock_irqrestore(&base->lock, flags);  // unlock

    return ret;
}

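Now for deleting a timer: del_timer calls detach_if_pending, which in turn calls detach_timer to do the removal. del_timer can be used on both active and inactive timers; it returns 0 for an inactive timer and 1 for an active one.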

int del_timer(struct timer_list *timer)
{
    struct tvec_base *base;
    unsigned long flags;
    int ret = 0;

    debug_assert_init(timer);

    timer_stats_timer_clear_start_info(timer);
    if (timer_pending(timer)) {
        base = lock_timer_base(timer, &flags);  // lock
        ret = detach_if_pending(timer, base, true);  // remove the timer
        spin_unlock_irqrestore(&base->lock, flags); // unlock
    }

    return ret;
}
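The return values of mod_timer and del_timer are easy to mix up, so here is a minimal user-space sketch of just that contract. struct mini_timer and both functions are hypothetical stand-ins, not kernel code: locking, per-CPU bases and the real list handling are left out, and "pending" is reduced to a flag.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct timer_list. */
struct mini_timer {
    bool pending;            /* is the timer currently queued? */
    unsigned long expires;   /* expiry time in ticks */
};

/* Like mod_timer(): (re)queue the timer and report whether it was
 * already pending (1) or inactive (0) before the call. */
int mini_mod_timer(struct mini_timer *t, unsigned long expires)
{
    int was_pending = t->pending ? 1 : 0;

    t->expires = expires;    /* detach + update + re-add, collapsed into one step */
    t->pending = true;
    return was_pending;
}

/* Like del_timer(): dequeue the timer and report whether it was pending. */
int mini_del_timer(struct mini_timer *t)
{
    int was_pending = t->pending ? 1 : 0;

    t->pending = false;
    return was_pending;
}
```

Calling mini_mod_timer twice on a fresh timer returns 0 and then 1; deleting it twice returns 1 and then 0.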

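Finally, the implementation. Timers are based on jiffies, a counter that is incremented on every timer interrupt, and every timer interrupt has to find the timers that have expired. To make this cheap, the kernel places timers into buckets according to their expiry time, so that after a timer interrupt the expired timers can be found quickly, as the figure below shows: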

[Figure: timer_bucket]
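The bucket idea can be sketched as a hierarchical timer wheel: the distance to the expiry time selects a level, and bits of the expiry time select a slot inside that level. The code below is a simplified model, not the kernel's code. The names are hypothetical and the sizes (8 bits for the first level, 6 bits for each of the four higher levels) are assumptions chosen to resemble a common kernel configuration.

```c
#define WHEEL_ROOT_BITS 8                           /* level 0: 256 slots, 1 tick each */
#define WHEEL_NODE_BITS 6                           /* levels 1..4: 64 slots each */
#define WHEEL_ROOT_SIZE (1UL << WHEEL_ROOT_BITS)
#define WHEEL_ROOT_MASK (WHEEL_ROOT_SIZE - 1)
#define WHEEL_NODE_MASK ((1UL << WHEEL_NODE_BITS) - 1)

struct wheel_pos {
    int level;           /* 0 is the finest-grained level */
    unsigned long slot;  /* bucket index inside that level */
};

/* Pick the bucket for a timer that expires at `expires` when the wheel's
 * current time is `now` (both in ticks). */
struct wheel_pos wheel_position(unsigned long now, unsigned long expires)
{
    unsigned long delta = expires - now;
    struct wheel_pos pos;
    int level;

    if ((long)delta < 0) {              /* already expired: run on the next tick */
        pos.level = 0;
        pos.slot = now & WHEEL_ROOT_MASK;
        return pos;
    }
    if (delta < WHEEL_ROOT_SIZE) {      /* near future: one slot per tick */
        pos.level = 0;
        pos.slot = expires & WHEEL_ROOT_MASK;
        return pos;
    }
    /* Farther away: each level covers 64 times the range of the previous one. */
    for (level = 1; level < 4; level++) {
        if (delta < (1UL << (WHEEL_ROOT_BITS + level * WHEEL_NODE_BITS)))
            break;
    }
    pos.level = level;
    pos.slot = (expires >> (WHEEL_ROOT_BITS + (level - 1) * WHEEL_NODE_BITS))
               & WHEEL_NODE_MASK;
    return pos;
}
```

Only the level-0 bucket for the current tick has to be scanned on each timer interrupt; timers parked in higher levels are moved down as their expiry time approaches.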

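All timer processing is initiated by update_process_times, which calls run_local_timers. That function uses raise_softirq(TIMER_SOFTIRQ) to activate the timer softirq, which runs at the next possible opportunity. run_timer_softirq is the handler of this softirq; it selects the tvec_base instance of the current CPU and calls __run_timers.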

[Figure: run_timer.jpg]

With this background on timers we can move on to the next part, starting with the initialization of the TCP stack timers.

3. TCP Timer Initialization

sk_timer, icsk_retransmit_timer and icsk_delack_timer are initialized in tcp_init_xmit_timers:

[Figure: inet_csk_init_xmit_timers.png]

net/ipv4/tcp_timer.c

void tcp_init_xmit_timers(struct sock *sk)
{
    inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
                  &tcp_keepalive_timer);
}

The inet_csk_init_xmit_timers function:

net/ipv4/inet_connection_sock.c

void inet_csk_init_xmit_timers(struct sock *sk,
                   void (*retransmit_handler)(unsigned long),
                   void (*delack_handler)(unsigned long),
                   void (*keepalive_handler)(unsigned long))
{
    struct inet_connection_sock *icsk = inet_csk(sk);

    setup_timer(&icsk->icsk_retransmit_timer, retransmit_handler, // retransmit timer handler
            (unsigned long)sk);
    setup_timer(&icsk->icsk_delack_timer, delack_handler,  // delayed ACK timer handler
            (unsigned long)sk);
    setup_timer(&sk->sk_timer, keepalive_handler, (unsigned long)sk);  // keepalive timer handler
    icsk->icsk_pending = icsk->icsk_ack.pending = 0;
}

In other words, the timeout function of sk_timer is tcp_keepalive_timer, that of icsk_retransmit_timer is tcp_write_timer, and that of icsk_delack_timer is tcp_delack_timer. tcp_init_xmit_timers is called whenever a socket is initialized.

The TIME_WAIT timers are initialized where the global variable tcp_death_row is defined:

/net/ipv4/tcp_minisocks.c
struct inet_timewait_death_row tcp_death_row = {
    .sysctl_max_tw_buckets = NR_FILE * 2,
    .period     = TCP_TIMEWAIT_LEN / INET_TWDR_TWKILL_SLOTS,
    .death_lock = __SPIN_LOCK_UNLOCKED(tcp_death_row.death_lock),
    .hashinfo   = &tcp_hashinfo,
    .tw_timer   = TIMER_INITIALIZER(inet_twdr_hangman, 0,
                        (unsigned long)&tcp_death_row),
    .twkill_work    = __WORK_INITIALIZER(tcp_death_row.twkill_work,
                         inet_twdr_twkill_work),
/* Short-time timewait calendar */

    .twcal_hand = -1,
    .twcal_timer    = TIMER_INITIALIZER(inet_twdr_twcal_tick, 0,
                        (unsigned long)&tcp_death_row),
};

In other words, the timeout function of tw_timer is inet_twdr_hangman, and that of twcal_timer is inet_twdr_twcal_tick.

All TCP timer timeout functions run in softirq context.

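The operations used most often on TCP timers are just two: resetting a timer and cancelling it. When the condition a timer was set for still holds, the timer is restarted; when it no longer holds, the timer is cancelled.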

4. TCP RETRANSMIT TIMER

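The retransmit timer mainly serves to detect network congestion and packet loss. TCP is a transport protocol with a feedback mechanism: ACKs pace the sender and confirm that data arrived safely. To guarantee reliable delivery, TCP first puts SYN, FIN and data segments on the send queue and then sends a copy onto the network; as soon as it detects a loss (for example, several consecutive ACKs carrying the same ack_seq), it retransmits the skb from the send queue. But what if loss detection itself fails, for example because ACKs are lost? That is when the retransmit timer is needed to retransmit the data after a set time; otherwise the transfer could stall.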

4.1 RTO: the retransmission timeout

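This brings in the concept of the RTO, the retransmission timeout. Choosing the RTO is a classic problem: if it is too long, data transfer becomes inefficient; if it is too short, data is retransmitted needlessly, which adds traffic and can even cause congestion. RTO timeouts come in two kinds: the initial RTO of the three-way handshake and the RTO for data segments. The initial RTO belongs to the SYN_ACK timer and is left aside for now; we first look at the RTO for data segments.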

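The RTO for data segments adapts to the measured RTT. The Jacobson/Karels algorithm of 1988, documented in RFC 6298, defines it as follows:

SRTT = SRTT + α * (RTT - SRTT)   (the smoothed RTT)

DevRTT = (1 - β) * DevRTT + β * |RTT - SRTT|   (the smoothed gap between the sample and SRTT, a weighted moving average)

RTO = μ * SRTT + δ * DevRTT   (the "divine" formula)

On Linux, α = 0.125, β = 0.25, μ = 1 and δ = 4 (this is the art of "well-tuned parameters": nobody knows why, it just works). This algorithm is the one used in TCP today. The Linux implementation is as follows: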

/net/ipv4/tcp_input.c
static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
{
    struct tcp_sock *tp = tcp_sk(sk);
    long m = mrtt_us; /* RTT */
    u32 srtt = tp->srtt_us;

    /*  The following amusing code comes from Jacobson's
     *  article in SIGCOMM '88.  Note that rtt and mdev
     *  are scaled versions of rtt and mean deviation.
     *  This is designed to be as fast as possible
     *  m stands for "measurement".
     *
     *  On a 1990 paper the rto value is changed to:
     *  RTO = rtt + 4 * mdev
     *
     * Funny. This algorithm seems to be very broken.
     * These formulae increase RTO, when it should be decreased, increase
     * too slowly, when it should be increased quickly, decrease too quickly
     * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
     * does not matter how to _calculate_ it. Seems, it was trap
     * that VJ failed to avoid. 8)
     */
    if (srtt != 0) {
        m -= (srtt >> 3);   /* m is now error in rtt est */
        srtt += m;      /* rtt = 7/8 rtt + 1/8 new */
        if (m < 0) {
            m = -m;     /* m is now abs(error) */
            m -= (tp->mdev_us >> 2);   /* similar update on mdev */
            /* This is similar to one of Eifel findings.
             * Eifel blocks mdev updates when rtt decreases.
             * This solution is a bit different: we use finer gain
             * for mdev in this case (alpha*beta).
             * Like Eifel it also prevents growth of rto,
             * but also it limits too fast rto decreases,
             * happening in pure Eifel.
             */
            if (m > 0)
                m >>= 3;
        } else {
            m -= (tp->mdev_us >> 2);   /* similar update on mdev */
        }
        tp->mdev_us += m;       /* mdev = 3/4 mdev + 1/4 new */
        if (tp->mdev_us > tp->mdev_max_us) {
            tp->mdev_max_us = tp->mdev_us;
            if (tp->mdev_max_us > tp->rttvar_us)
                tp->rttvar_us = tp->mdev_max_us;
        }
        if (after(tp->snd_una, tp->rtt_seq)) {
            if (tp->mdev_max_us < tp->rttvar_us)
                tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
            tp->rtt_seq = tp->snd_nxt;
            tp->mdev_max_us = tcp_rto_min_us(sk);
        }
    } else {
        /* no previous measure. */
        srtt = m << 3;      /* take the measured time to be rtt */
        tp->mdev_us = m << 1;   /* make sure rto = 3*rtt */
        tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
        tp->mdev_max_us = tp->rttvar_us;
        tp->rtt_seq = tp->snd_nxt;
    }
    tp->srtt_us = max(1U, srtt);
}
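For comparison with the scaled integer arithmetic above, the same update rules can be written in floating point. This is a sketch under stated assumptions, not the kernel's code: struct rtt_est and rtt_sample are hypothetical names, the deviation is computed against the old SRTT (as the kernel does with its error term m), the first sample is initialized so that RTO = 3 * RTT, and the kernel's extra handling (finer gain when the RTT decreases, the rttvar window, the RTO_MIN floor) is left out.

```c
struct rtt_est {
    double srtt;    /* smoothed RTT */
    double devrtt;  /* smoothed deviation */
    int    valid;   /* 0 until the first sample has been seen */
};

/* Feed one RTT sample and return the resulting RTO (same unit as rtt). */
double rtt_sample(struct rtt_est *e, double rtt)
{
    const double alpha = 0.125, beta = 0.25, mu = 1.0, delta = 4.0;

    if (!e->valid) {
        e->srtt = rtt;               /* take the measurement as the RTT */
        e->devrtt = rtt / 2.0;       /* so that RTO = rtt + 4 * rtt/2 = 3 * rtt */
        e->valid = 1;
    } else {
        double err = rtt - e->srtt;              /* error against the old SRTT */
        double abs_err = err < 0 ? -err : err;

        e->srtt += alpha * err;                  /* SRTT = 7/8 SRTT + 1/8 RTT */
        e->devrtt = (1.0 - beta) * e->devrtt + beta * abs_err;
    }
    return mu * e->srtt + delta * e->devrtt;
}
```

Feeding a first sample of 100 ms yields an RTO of 300 ms; a second sample of 200 ms moves SRTT to 112.5 ms and DevRTT to 62.5 ms, for an RTO of 362.5 ms.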

Next, the retransmission mechanism of the retransmit timer:

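  • When data is sent (including during retransmission), check whether the timer is already running and start it if it is not. When the segments the timer guards have been ACKed, delete the timer.

  • If the timer expires, it is time to retransmit: retransmit the earliest segment that has not been ACKed yet, back off the timer by doubling the RTO, and finally restart the timer.

  • If the timer expires during the three-way handshake, reset the RTO to 1 second.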

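Reliability requires retransmitting several times when the network misbehaves; the timeout doubles each time, up to an upper limit. Against this background, the lower and upper bounds of the timeout can be changed to suit different needs in congestion handling: lowering both bounds suits high-speed networks, while raising them makes the stack more conservative. Linux defines them as follows: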

#define TCP_RTO_MAX ((unsigned)(120*HZ))  /* 120 s; the RFC figure is 60 s */
#define TCP_RTO_MIN ((unsigned)(HZ/5))   /* 200 ms; the RFC figure is 200 ms...1 s */

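In the Linux implementation, the smaller RTO_MIN is the aggressive choice, while the larger RTO_MAX is the more conservative one. Taken together, the two cater both to the high speed of today's networks and to the reliability of the transfer.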
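How the two bounds interact with the doubling can be sketched as follows. The function rto_after_backoff and the *_ASSUMED/_TICKS names are hypothetical; HZ is assumed to be 1000 so that one tick equals one millisecond.

```c
#define HZ_ASSUMED    1000U                  /* assumption: 1 tick = 1 ms */
#define RTO_MIN_TICKS (HZ_ASSUMED / 5)       /* mirrors TCP_RTO_MIN: 200 ms */
#define RTO_MAX_TICKS (120U * HZ_ASSUMED)    /* mirrors TCP_RTO_MAX: 120 s */

/* RTO after `timeouts` consecutive expirations, starting from `rto` ticks.
 * Each expiration doubles the value; the result is clamped at RTO_MAX_TICKS. */
unsigned int rto_after_backoff(unsigned int rto, unsigned int timeouts)
{
    while (timeouts-- > 0) {
        rto <<= 1;
        if (rto > RTO_MAX_TICKS)
            rto = RTO_MAX_TICKS;
    }
    return rto;
}
```

Starting from the 200 ms floor, nine timeouts give 102.4 s, and the tenth is clamped to 120 s.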

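In summary, the RTO is adjusted to the network environment. It is like a veteran delivery driver hauling goods: resend too quickly and goods are resent needlessly; resend too slowly and the transport channel is underused. Since the RTO is derived mainly from the RTT, estimating the RTT becomes crucial.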

4.2 When the retransmit timer is armed

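The retransmission timeout timer (ICSK_TIME_RETRANS) is activated in the following cases:

  1. When the peer is found to have discarded SACKed segments that it held in its receive buffer.
  2. When a data segment is sent and no sent-but-unacknowledged segment was in the network before it.
    After that, the timer is reset whenever an ACK that acknowledges new data arrives.
  3. After a SYN packet is sent.
  4. In some special cases.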

We start with the simplest case, the retransmission mechanism when data is sent.

[Figure: tcp_event_new_data_sent.png]

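As we know, the TCP stack copies user data from user space into the kernel, cuts it into segments, makes another copy of each segment and sends that copy out through the NIC. Sending the data does not mean it was delivered, however; the sender has to wait for the ACK. This is when the retransmit timer must be set: if the data is not ACKed within a certain time, it is considered lost and is retransmitted.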

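Layer 4 hands data to layer 3 through tcp_write_xmit. Because TCP is a byte stream, it always passes one segment of data at a time to layer 3, and after each segment is sent the retransmit timer is re-armed where necessary.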

static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
               int push_one, gfp_t gfp)
{
...
    while ((skb = tcp_send_head(sk))) {

....// execution continues only if the transmit succeeded
        if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
            break;
...
        /* Advance the send_head.  This one is sent out.
         * This call will increment packets_out.
         */
        // the retransmit timer is eventually started inside this function
        tcp_event_new_data_sent(sk, skb);
        tcp_minshall_update(tp, mss_now, skb);
        sent_pkts += tcp_skb_pcount(skb);

        if (push_one)
            break;
    }
...
}

Now let us look at how tcp_event_new_data_sent starts the timer.

static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    unsigned int prior_packets = tp->packets_out;

    tcp_advance_send_head(sk, skb);
    tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
    // the key step: when prior_packets is 0 the condition below holds and tcp_rearm_rto re-arms the retransmit timer
    tp->packets_out += tcp_skb_pcount(skb);
    if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
        icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
        tcp_rearm_rto(sk);
    }

    NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPORIGDATASENT,
              tcp_skb_pcount(skb));
}

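prior_packets is the number of segments that were sent but not yet acknowledged before this send. In other words, when many segments are sent and the earlier ones are still unacknowledged, later sends do not restart the timer (unless an ER or loss probe timer is pending, as the condition shows).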

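Once the retransmit timer is running, we wait for the ACK. If the timer expires before the ACK arrives, the timer's callback is invoked; otherwise tcp_rearm_rto is eventually called to delete or restart the timer. It is called from tcp_ack()->tcp_clean_rtx_queue(), tcp_ack being the function dedicated to processing ACKs. tcp_rearm_rto is simple: it checks packets_out, the number of segments that are currently unacknowledged, and acts accordingly.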

void tcp_rearm_rto(struct sock *sk)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);

    /* If the retrans timer is currently being used by Fast Open
     * for SYN-ACK retrans purpose, stay put.
     */
    if (tp->fastopen_rsk)
        return;
    // packets_out == 0 means every sent segment has been ACKed, so remove the timer; otherwise restart it
    if (!tp->packets_out) {
        inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
    } else {
        u32 rto = inet_csk(sk)->icsk_rto;
        /* Offset the time elapsed after installing regular RTO */
        if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
            icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
            struct sk_buff *skb = tcp_write_queue_head(sk);
            const u32 rto_time_stamp =
                tcp_skb_timestamp(skb) + rto;
            s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
            /* delta may not be positive if the socket is locked
             * when the retrans timer fires and is rescheduled.
             */
            if (delta > 0)
                rto = delta;
        }
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
                      TCP_RTO_MAX);
    }
}
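The delta computation in the else branch is the subtle part: when an ER or loss probe timer is converted back into a regular RTO, the new timeout is measured from the time the head segment was sent, not from now. A minimal sketch of that arithmetic, using the hypothetical name rearm_timeout and plain 32-bit tick counters in place of tcp_skb_timestamp() and tcp_time_stamp:

```c
#include <stdint.h>

/* Timeout to arm when switching back to a regular RTO.
 * skb_sent: tick at which the head segment was sent
 * rto:      current RTO in ticks
 * now:      current tick
 * All three are 32-bit counters that may wrap around. */
uint32_t rearm_timeout(uint32_t skb_sent, uint32_t rto, uint32_t now)
{
    uint32_t deadline = skb_sent + rto;          /* wraps modulo 2^32 */
    int32_t delta = (int32_t)(deadline - now);   /* signed distance to the deadline */

    /* If the deadline has already passed, fall back to a full RTO. */
    return delta > 0 ? (uint32_t)delta : rto;
}
```

The cast to a signed 32-bit value is what keeps the comparison correct across a wrap of the tick counter.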

4.3 What happens when the retransmit timer expires?

As the timer initialization showed, the timeout function of the retransmit timer is tcp_write_timer, which calls tcp_write_timer_handler to do the real work.

static void tcp_write_timer(unsigned long data)
{
    struct sock *sk = (struct sock *)data;
    // take the socket lock first
    bh_lock_sock(sk);
    if (!sock_owned_by_user(sk)) {
        // do the real work
        tcp_write_timer_handler(sk);
    } else {
        // the socket is owned by a process: do nothing now and defer the work
        /* deleguate our work to tcp_release_cb() */
        if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
            sock_hold(sk);
    }
    bh_unlock_sock(sk);
    sock_put(sk);
}

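tcp_write_timer_handler dispatches the timers that share icsk_retransmit_timer. Here we focus on the retransmit timer, ICSK_TIME_RETRANS, whose handler tcp_retransmit_timer retransmits data segments.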

void tcp_write_timer_handler(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    int event;

    if (sk->sk_state == TCP_CLOSE || !icsk->icsk_pending)
        goto out;

    if (time_after(icsk->icsk_timeout, jiffies)) {
        sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
        goto out;
    }

    event = icsk->icsk_pending;

    switch (event) {
    case ICSK_TIME_EARLY_RETRANS:
        tcp_resume_early_retransmit(sk);
        break;
    case ICSK_TIME_LOSS_PROBE:
        tcp_send_loss_probe(sk);
        break;
    case ICSK_TIME_RETRANS:
        icsk->icsk_pending = 0;
        tcp_retransmit_timer(sk); // retransmit the data segment
        break;
    case ICSK_TIME_PROBE0:
        icsk->icsk_pending = 0;
        tcp_probe_timer(sk);
        break;
    }

out:
    sk_mem_reclaim(sk);
}
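Two ideas in this handler are worth isolating: the wraparound-safe comparison behind time_after(), and the fact that one timer_list serves several logical timers selected by icsk_pending. The sketch below models both in user space. All names (ticks_after, pending_event, timer_action, on_timer_fired) are hypothetical, and the enum values are illustrative rather than the kernel's ICSK_TIME_* constants.

```c
/* Wraparound-safe "a is later than b": the idea behind time_after(). */
static int ticks_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

enum pending_event { EV_NONE = 0, EV_RETRANS, EV_PROBE0,
                     EV_EARLY_RETRANS, EV_LOSS_PROBE };
enum timer_action  { ACT_IGNORE, ACT_REARM, ACT_RUN };

/* Decide what to do when the shared timer fires.
 * pending: which logical timer is armed
 * timeout: tick at which that logical timer is due
 * now:     current tick */
enum timer_action on_timer_fired(enum pending_event pending,
                                 unsigned long timeout, unsigned long now)
{
    if (pending == EV_NONE)
        return ACT_IGNORE;          /* nothing armed: spurious wakeup */
    if (ticks_after(timeout, now))
        return ACT_REARM;           /* fired too early: re-arm for `timeout` */
    return ACT_RUN;                 /* due: run the handler chosen by `pending` */
}
```

A caller would switch on pending only when the result is ACT_RUN, which mirrors the switch statement in the listing above.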

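To avoid acknowledgment ambiguity on retransmission, the Karn algorithm is used, that is, a timer backoff strategy. The last part of the code below changes the timer value, doubling it.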

void tcp_retransmit_timer(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    if (tp->fastopen_rsk) {
        WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
                 sk->sk_state != TCP_FIN_WAIT1);
        tcp_fastopen_synack_timer(sk);
        /* Before we receive ACK to our SYN-ACK don't retransmit
         * anything else (e.g., data or FIN segments).
         */
        return;
    }
    // nothing is waiting for an ACK, so there is nothing to do
    if (!tp->packets_out)
        goto out;

    WARN_ON(tcp_write_queue_empty(sk));

    tp->tlp_high_seq = 0;
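   /*
   Check three conditions first:
   the send window (zero),
   the sock state (not dead),
   and finally that the connection is not in SYN_SENT or SYN_RECV, i.e. not still being established.
   */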
    if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
        !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
        /* Receiver dastardly shrinks window. Our retransmits
         * become zero probes, but we should not timeout this
         * connection. If the socket is an orphan, time it out,
         * we cannot allow such beasts to hang infinitely.
         */
        struct inet_sock *inet = inet_sk(sk);
        if (sk->sk_family == AF_INET) {
            LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n"),
                       &inet->inet_daddr,
                       ntohs(inet->inet_dport), inet->inet_num,
                       tp->snd_una, tp->snd_nxt);
        }
#if IS_ENABLED(CONFIG_IPV6)
        else if (sk->sk_family == AF_INET6) {
            LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n"),
                       &sk->sk_v6_daddr,
                       ntohs(inet->inet_dport), inet->inet_num,
                       tp->snd_una, tp->snd_nxt);
        }
#endif
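        /*
        tcp_time_stamp is jiffies,
        and rcv_tstamp is the time the last ACK was received,
        i.e. the last time the peer acknowledged anything.
        The difference between the two must not exceed TCP_RTO_MAX,
        because TCP_RTO_MAX is the largest interval of the retransmit timer.
        */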
        if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
            tcp_write_err(sk);
            goto out;
        }
        // enter the loss state, i.e. apply congestion and flow control
        tcp_enter_loss(sk);
        // now retransmit the skb
        tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
        __sk_dst_reset(sk);
        // then restart the timer and keep waiting for the ACK
        goto out_reset_timer;
    }
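    /*
    Reaching this point means the zero-window check above did not apply.
    The function below checks the retransmission count;
    if the limit has been exceeded, jump straight to out.
    */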
    if (tcp_write_timeout(sk))
        goto out;
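    // reaching this point means the retransmission limit has not been hit; icsk->icsk_retransmits is the number of retransmissions so far
    if (icsk->icsk_retransmits == 0) {
    // this block only collects statistics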
        int mib_idx;

        if (icsk->icsk_ca_state == TCP_CA_Recovery) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKRECOVERYFAIL;
            else
                mib_idx = LINUX_MIB_TCPRENORECOVERYFAIL;
        } else if (icsk->icsk_ca_state == TCP_CA_Loss) {
            mib_idx = LINUX_MIB_TCPLOSSFAILURES;
        } else if ((icsk->icsk_ca_state == TCP_CA_Disorder) ||
               tp->sacked_out) {
            if (tcp_is_sack(tp))
                mib_idx = LINUX_MIB_TCPSACKFAILURES;
            else
                mib_idx = LINUX_MIB_TCPRENOFAILURES;
        } else {
            mib_idx = LINUX_MIB_TCPTIMEOUTS;
        }
        NET_INC_STATS_BH(sock_net(sk), mib_idx);
    }
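    // enter the loss state again: congestion handling, including SACK state
    tcp_enter_loss(sk);
    // try to retransmit the first segment of the queue
    if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
    // the retransmission failed because of local congestion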
        /* Retransmission failed because of local congestion,
         * do not backoff.
         */
        if (!icsk->icsk_retransmits)
            icsk->icsk_retransmits = 1;
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                      min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
                      TCP_RTO_MAX);
        goto out;
    }

    /* Increase the timeout each time we retransmit.  Note that
     * we do not increase the rtt estimate.  rto is initialized
     * from rtt, but increases here.  Jacobson (SIGCOMM 88) suggests
     * that doubling rto each time is the least we can get away with.
     * In KA9Q, Karn uses this for the first few times, and then
     * goes to quadratic.  netBSD doubles, but only goes up to *64,
     * and clamps at 1 to 64 sec afterwards.  Note that 120 sec is
     * defined in the protocol as the maximum possible RTT.  I guess
     * we'll have to use something other than TCP to talk to the
     * University of Mars.
     *
     * PAWS allows us longer timeouts and large windows, so once
     * implemented ftp to mars will work nicely. We will have to fix
     * the 120 second clamps though!
     */
     // icsk->icsk_backoff is the backoff count; it is also used by the zero-window probe timer
    icsk->icsk_backoff++;
    // icsk_retransmits is the number of retransmission attempts
    icsk->icsk_retransmits++;

out_reset_timer:
    /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
     * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
     * might be increased if the stream oscillates between thin and thick,
     * thus the old value might already be too high compared to the value
     * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
     * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
     * exponential backoff behaviour to avoid continue hammering
     * linear-timeout retransmissions into a black hole
     */
    if (sk->sk_state == TCP_ESTABLISHED &&
        (tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
        tcp_stream_is_thin(tp) &&
        icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
        // linear timeouts without exponential backoff, for latency-sensitive thin streams
        icsk->icsk_backoff = 0;
        icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
    } else {
        /* Use normal (exponential) backoff */
        // compute the RTO: Karn-style backoff doubles the next timeout
        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
    }
    // restart the timer; the timeout is the icsk_rto computed above
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
    if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
        __sk_dst_reset(sk);

out:;
}

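To sum up, the TCP retransmit timer does the following: if there is a TFO socket, it retransmits the SYN|ACK and returns; otherwise it checks whether retransmission has been going on for too long and, if so, closes the connection and reports an error; otherwise it retransmits the first packet of the send queue and re-arms the retransmit timer with a longer timeout.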

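Whether the maximum number of retransmissions has been reached is decided by tcp_write_timeout. The system-wide limits are configured through the following four parameters.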

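  • sysctl_tcp_retries1: the first retry limit; once it is exceeded, the route needs to be re-checked.
  • sysctl_tcp_retries2: the maximum number of retries, normally larger than the previous value. Unlike retries1, once this limit is exceeded the connection must give up.
  • sysctl_tcp_syn_retries: the number of times a SYN segment is retransmitted.
  • sysctl_tcp_orphan_retries: the maximum number of retries for an orphaned socket, that is, one that is already detached from its process context but still has cleanup work pending.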
/* A write timeout has occurred. Process the after effects. */
static int tcp_write_timeout(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    // retry_until is the maximum number of retransmissions
    int retry_until;
    bool do_reset, syn_set = false;
    // check the state of the socket
    if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
        if (icsk->icsk_retransmits) {
            dst_negative_advice(sk);
            if (tp->syn_fastopen || tp->syn_data)
                tcp_fastopen_cache_set(sk, 0, NULL, true, 0);
            if (tp->syn_data)
                NET_INC_STATS_BH(sock_net(sk),
                         LINUX_MIB_TCPFASTOPENACTIVEFAIL);
        }
        // set the retransmission limit
        retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
        syn_set = true;
    } else {
        // does the route need to be checked?
        if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
            /* Black hole detection */
            tcp_mtu_probing(icsk, sk);
            dst_negative_advice(sk);
        }
        // set the maximum number of retransmissions to sysctl_tcp_retries2
        retry_until = sysctl_tcp_retries2;
        if (sock_flag(sk, SOCK_DEAD)) {
           // the socket is an orphan
            const int alive = icsk->icsk_rto < TCP_RTO_MAX;
            /*
            Get the maximum retransmission count from tcp_orphan_retries
            (which derives it from sysctl_tcp_orphan_retries).
            */
            retry_until = tcp_orphan_retries(sk, alive);
            do_reset = alive ||
                !retransmits_timed_out(sk, retry_until, 0, 0);

            if (tcp_out_of_resources(sk, do_reset))
                return 1;
        }
    }
   // the final check: return 1 if the retransmission limit has been reached, 0 otherwise
    if (retransmits_timed_out(sk, retry_until,
                  syn_set ? 0 : icsk->icsk_user_timeout, syn_set)) {
        /* Has it gone just too far? */
        tcp_write_err(sk);
        return 1;
    }
    return 0;
}
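A retry limit such as sysctl_tcp_retries2 is easier to reason about as elapsed time. The sketch below sums the timeouts of an exponentially backed-off sender. It is an illustration under assumptions, not the kernel's retransmits_timed_out(), which is not shown in this article: the function name retrans_budget_ms is hypothetical, every timeout is assumed to double from the given starting RTO, and the clamp is passed in as a parameter.

```c
/* Total time, in ms, spent waiting before giving up after `retries`
 * retransmissions: the original transmission and each retransmission
 * wait one RTO, and the RTO doubles after each timeout up to rto_max_ms. */
unsigned long retrans_budget_ms(unsigned long rto_ms, unsigned long rto_max_ms,
                                unsigned int retries)
{
    unsigned long total = 0;
    unsigned int i;

    for (i = 0; i <= retries; i++) {
        total += rto_ms;
        rto_ms = (rto_ms * 2 > rto_max_ms) ? rto_max_ms : rto_ms * 2;
    }
    return total;
}
```

With a starting RTO of 200 ms, a 120 s clamp and 15 retries, the budget is 924600 ms, about 15.4 minutes.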

4.4 Congestion control during retransmission

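As shown above, when the retransmit timer expires the timeout function performs the retransmission, and retransmission usually involves congestion handling. A retransmission means that data may have been lost, since no ACK arrived in time. Data can be lost for many reasons: congestion in the local network, drops at an intermediate router, or a receiver that gets more data than it can process in time. Judging when to retransmit and how congested the network is therefore matters a great deal. Yet network congestion is sometimes like stock market swings: no person or system can predict the next moment precisely and without omission. Normally the sender's TCP stack can only adapt its congestion strategy to the feedback carried by the receiver's ACKs. That feedback lags, however, so the sender only sees the network as it was a moment ago. What the network will be like in the next moment, God knows.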

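During retransmission handling, tcp_enter_loss deals with the congestion. It mainly marks the lost segments (those that have not been ACKed) and then lowers the sending rate by performing slow start.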

For a description of the slow start and congestion avoidance algorithms, see RFC 2001:

http://www.faqs.org/rfcs/rfc2001.html

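The four algorithms below are the ones used for congestion control, and they are all connected to one another. Slow start and congestion avoidance share the same mechanism: both work on the congestion window, which limits the amount of data in flight and grows or shrinks with the degree of congestion.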

Slow start
Congestion avoidance
Fast re-transmit
Fast recovery

The following notes cover some implementation details of slow start and congestion avoidance.

CWND - Sender side limit
RWND - Receiver side limit
Slow start threshold ( SSTHRESH ) - Used to determine whether slow start is used or congestion avoidance
When starting, probe slowly - IW <= 2 * SMSS
Initial size of SSTHRESH can be arbitrarily high, as high as the RWND
Use slow start when SSTHRESH > CWND. Else, use Congestion avoidance
Slow start - CWND is increased by an amount less than or equal to the SMSS for every ACK
Congestion avoidance - CWND += SMSS*SMSS/CWND
When loss is detected - SSTHRESH = max( FlightSize/2, 2*SMSS )

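Note that during slow start the window grows exponentially. While cwnd (the congestion window) is below ssthresh the connection is in slow start; otherwise it performs congestion avoidance (when the two are equal, either may be used).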
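The window rules in the list above can be sketched in a few lines, following its rule "use slow start when SSTHRESH > CWND". struct cc_state and the two functions are hypothetical names, windows are counted in bytes, and slow start adds exactly one SMSS per ACK (the list allows "less than or equal to"). This is a textbook model, not the kernel's per-packet accounting.

```c
struct cc_state {
    unsigned long cwnd;      /* congestion window, bytes */
    unsigned long ssthresh;  /* slow start threshold, bytes */
    unsigned long smss;      /* sender maximum segment size, bytes */
};

/* Called for every ACK that acknowledges new data. */
void cc_on_ack(struct cc_state *s)
{
    if (s->cwnd < s->ssthresh)
        s->cwnd += s->smss;                       /* slow start */
    else
        s->cwnd += s->smss * s->smss / s->cwnd;   /* congestion avoidance */
}

/* Called when loss is detected with `flight` bytes outstanding:
 * SSTHRESH = max(FlightSize / 2, 2 * SMSS). */
void cc_on_loss(struct cc_state *s, unsigned long flight)
{
    unsigned long half = flight / 2;
    unsigned long min_thresh = 2 * s->smss;

    s->ssthresh = half > min_thresh ? half : min_thresh;
}
```

One SMSS per ACK doubles cwnd every round trip, which is the exponential growth mentioned above; in congestion avoidance the same stream of ACKs adds roughly one SMSS per round trip.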

Now for the implementation of tcp_enter_loss.

First, the meaning of a few key fields that are used below.

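  • icsk->icsk_ca_state: the congestion control state.
  • tp->snd_una: the sequence number of the first sent-but-unacknowledged byte in the TCP sliding window.
  • tp->prior_ssthresh: the previous value of snd_ssthresh; the old snd_ssthresh is saved here each time before snd_ssthresh changes.
  • tp->snd_ssthresh: the slow start threshold.
  • tp->snd_cwnd: the size of the congestion window; tp->snd_cwnd_cnt is the counter used to grow it linearly.
  • TCP_SKB_CB(skb)->sacked: the SACK tag bits of a TCP segment.
  • tp->high_seq: the value of snd_nxt when congestion began.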
/* Enter Loss state. If we detect SACK reneging, forget all SACK information
 * and reset tags completely, otherwise preserve SACKs. If receiver
 * dropped its ofo queue, we will know this due to reneging detection.
 */
void tcp_enter_loss(struct sock *sk)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    bool new_recovery = false;
    bool is_reneg;          /* is receiver reneging on SACKs? */
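    /* 1. the congestion control state is at most TCP_CA_Disorder
     * 2. snd_una has reached high_seq (snd_nxt when congestion began)
     * 3. the state is TCP_CA_Loss and nothing has been retransmitted yet
     * If any of these holds, data was lost, so lower the threshold.
     */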
    /* Reduce ssthresh if it has not yet been made inside this window. */
    if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
        !after(tp->high_seq, tp->snd_una) ||
        (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
        new_recovery = true;
        // save the old snd_ssthresh
        tp->prior_ssthresh = tcp_current_ssthresh(sk);
        tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
        // notify the congestion control module of the loss event
        tcp_ca_event(sk, CA_EVENT_LOSS);
        tcp_init_undo(tp);
    }
    // set the size of the congestion window
    tp->snd_cwnd       = 1;
    tp->snd_cwnd_cnt   = 0;
    // record the time of the change
    tp->snd_cwnd_stamp = tcp_time_stamp;

    tp->retrans_out = 0;
    tp->lost_out = 0;

    if (tcp_is_reno(tp))
        tcp_reset_reno_sack(tp);

    skb = tcp_write_queue_head(sk);
    is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
    if (is_reneg) {
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
        tp->sacked_out = 0;
        tp->fackets_out = 0;
    }
    // clear all retransmit hints
    tcp_clear_all_retrans_hints(tp);
    // walk the write queue of the sock
    tcp_for_write_queue(skb, sk) {
        if (skb == tcp_send_head(sk))
            break;
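        // clear the RETRANS and LOST tags, keep only the SACKED_ACKED tag
        TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
        // if is_reneg is set, SACK information is ignored and every segment is marked lost
        if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED) || is_reneg) {
            // clear the SACKed tag and mark the segment lost
            TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
            TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
            // update the related fields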
            tp->lost_out += tcp_skb_pcount(skb);
            tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
        }
    }
    tcp_verify_left_out(tp);

    /* Timeout in disordered state after receiving substantial DUPACKs
     * suggests that the degree of reordering is over-estimated.
     */
     // clamp the current reordering metric
    if (icsk->icsk_ca_state <= TCP_CA_Disorder &&
        tp->sacked_out >= sysctl_tcp_reordering)
        tp->reordering = min_t(unsigned int, tp->reordering,
                       sysctl_tcp_reordering);
    // set the congestion state
    tcp_set_ca_state(sk, TCP_CA_Loss);
    tp->high_seq = tp->snd_nxt;
    // the congestion window was changed, so set the ECN state
    TCP_ECN_queue_cwr(tp);

    /* F-RTO RFC5682 sec 3.1 step 1: retransmit SND.UNA if no previous
     * loss recovery is underway except recurring timeout(s) on
     * the same SND.UNA (sec 3.2). Disable F-RTO on path MTU probing
     */
    tp->frto = sysctl_tcp_frto &&
           (new_recovery || icsk->icsk_retransmits) &&
           !inet_csk(sk)->icsk_mtup.probe_size;
}
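The queue walk in tcp_enter_loss can be reduced to a small bit-manipulation exercise. In the sketch below each array element stands for the sacked byte of one segment on the write queue. The function name mark_queue_lost and the TAG_* names are hypothetical, the bit values are assumptions chosen for illustration, and each segment counts as one packet.

```c
#define TAG_SACKED_ACKED   0x01  /* segment was SACKed by the receiver */
#define TAG_SACKED_RETRANS 0x02  /* segment was retransmitted */
#define TAG_LOST           0x04  /* segment is considered lost */
#define TAG_BITS           0x07  /* all three tags */

/* Mark the segments of a queue as lost on a retransmission timeout.
 * SACKed segments are spared unless the receiver is reneging.
 * Returns the number of segments marked lost. */
unsigned int mark_queue_lost(unsigned char *sacked, unsigned int n, int is_reneg)
{
    unsigned int i, lost = 0;

    for (i = 0; i < n; i++) {
        /* drop the RETRANS and LOST tags, keep SACKED_ACKED */
        sacked[i] &= (unsigned char)(~TAG_BITS | TAG_SACKED_ACKED);
        if (!(sacked[i] & TAG_SACKED_ACKED) || is_reneg) {
            sacked[i] &= (unsigned char)~TAG_SACKED_ACKED;
            sacked[i] |= TAG_LOST;
            lost++;
        }
    }
    return lost;
}
```

Without reneging, a queue of {untagged, SACKed, retransmitted} ends up as {lost, SACKed, lost}; with reneging, every segment is marked lost.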

4.5 Data flow of the retransmit timer

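After a timeout, the timeout function performs the retransmission; this section looks at how a single data segment is retransmitted, which is the job of tcp_retransmit_skb. Retransmission has to pick packets from the retransmit queue, so which ones are chosen? First the packets marked lost; after that there is a choice between packets that were sent but not yet acknowledged (forward retransmission) and new data.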

[Figure: tcp_retransmit_skb.png]

int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int err = __tcp_retransmit_skb(sk, skb);

    if (err == 0) {
#if FASTRETRANS_DEBUG > 0
        if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) {
            net_dbg_ratelimited("retrans_out leaked\n");
        }
#endif
        if (!tp->retrans_out)
            tp->lost_retrans_low = tp->snd_nxt;
        TCP_SKB_CB(skb)->sacked |= TCPCB_RETRANS; // record that the skb has been retransmitted
        tp->retrans_out += tcp_skb_pcount(skb);

        /* Save stamp of the first retransmit. */
        if (!tp->retrans_stamp)
            tp->retrans_stamp = tcp_skb_timestamp(skb);  // record the time of the first retransmission

        /* snd_nxt is stored to detect loss of retransmitted segment,
         * see tcp_input.c tcp_sacktag_write_queue().
         */
        TCP_SKB_CB(skb)->ack_seq = tp->snd_nxt;
    } else if (err != -EBUSY) {
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
    }

    if (tp->undo_retrans < 0)
        tp->undo_retrans = 0;
    tp->undo_retrans += tcp_skb_pcount(skb);
    return err;
}

/* This retransmits one SKB.  Policy decisions and retransmit queue
 * state updates are done by the caller.  Returns non-zero if an
 * error occurred which prevented the send.
 */
int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);
    unsigned int cur_mss;
    int err;

    /* Inconclusive MTU probe */
    if (icsk->icsk_mtup.probe_size) {
        icsk->icsk_mtup.probe_size = 0;
    }

    /* Do not sent more than we queued. 1/4 is reserved for possible
     * copying overhead: fragmentation, tunneling, mangling etc.
     */
    //sk_wmem_alloc: bytes committed to the transmit queue
    //sk_wmem_queued: persistent queue size (memory held by the write queue)
    if (atomic_read(&sk->sk_wmem_alloc) >
        min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf))
        return -EAGAIN;

    if (skb_still_in_host_queue(sk, skb))
        return -EBUSY;
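    // If so, only part of the data needs retransmitting: seq---snd_una---end_seq, the front part is already ACKed
    if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
        // If so, everything is already ACKed and nothing needs retransmitting: BUG
        if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
            BUG();
        // Trim off the part that does not need retransmitting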
        if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
            return -ENOMEM;
    }
    // 根据目的地址等条件获取路由,如果获取路由失败就不能发送  
    if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
        return -EHOSTUNREACH; /* Routing failure or similar. */

    cur_mss = tcp_current_mss(sk);

    /* If receiver has shrunk his window, and skb is out of
     * new window, do not retransmit it. The exception is the
     * case, when window is shrunk to zero. In this case
     * our retransmit serves as a zero window probe.
     */
     // 如果数据在窗口后面,不会发送
    if (!before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(tp)) &&
        TCP_SKB_CB(skb)->seq != tp->snd_una)
        return -EAGAIN;
    //数据段大于mss值,进行分片处理,并调整packet_out等统计值。再传送
    if (skb->len > cur_mss) {
        if (tcp_fragment(sk, skb, cur_mss, cur_mss))
            return -ENOMEM; /* We'll try again later. */
    } else {
        int oldpcount = tcp_skb_pcount(skb);

        if (unlikely(oldpcount > 1)) {
            if (skb_unclone(skb, GFP_ATOMIC))
                return -ENOMEM;
            // 按当前mss重置skb->gso_XXX
            tcp_init_tso_segs(sk, skb, cur_mss);
            // 调整packet_out等统计值
            tcp_adjust_pcount(sk, skb, oldpcount - tcp_skb_pcount(skb));
        }
    }
    // 尝试和后几个包合并后一起重传出去,加快速度,主要针对小包的场景
    tcp_retrans_try_collapse(sk, skb, cur_mss);

    /* Make a copy, if the first transmission SKB clone we made
     * is still in somebody's hands, else make a clone.
     */

    /* make sure skb->data is aligned on arches that require it
     * and check if ack-trimming & collapsing extended the headroom
     * beyond what csum_start can cover.
     */
    if (unlikely((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
             skb_headroom(skb) >= 0xFFFF)) {
        struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
                           GFP_ATOMIC);
        err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
                 -ENOBUFS;
    } else {
        //Retransmit via tcp_transmit_skb (clone_it = 1), which builds the TCP header and hands a clone to the IP layer
        err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
    }
    //传输成功则更新全局的tcp统计值
    if (likely(!err)) {
        TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
        /* Update global TCP statistics. */
        TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
        if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
            NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
        tp->total_retrans++;   // 整体重传数量++
    }
    return err;
}
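
The before() checks in __tcp_retransmit_skb above depend on comparing 32-bit sequence numbers in a way that survives wraparound. The following is a minimal user-space sketch of that idea and of the "trim the already ACKed head" step. It is not the kernel's code: seq_before and bytes_to_trim are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if seq1 precedes seq2 in 32-bit sequence space (wraparound-safe). */
bool seq_before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

/*
 * Number of leading bytes of the segment [seq, end_seq) that lie below
 * snd_una, i.e. are already acknowledged and need not be retransmitted.
 */
uint32_t bytes_to_trim(uint32_t seq, uint32_t end_seq, uint32_t snd_una)
{
    /* A fully acknowledged segment should never reach this point. */
    assert(!seq_before(end_seq, snd_una));
    return seq_before(seq, snd_una) ? snd_una - seq : 0;
}
```

For a segment covering 100..200 with snd_una at 150, bytes_to_trim returns 50, so only the second half would be retransmitted. The signed cast is what keeps the comparison correct when the sequence space wraps past 2^32.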

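tcp_retrans_try_collapse tries to merge the segment being retransmitted with the segments that follow it before sending. This is controlled by the system parameter sysctl_tcp_retrans_collapse; setting it to 0 disables merging on retransmission.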

/* Collapse packets in the retransmit queue to make to create
 * less packets on the wire. This is only done on retransmission.
 */
static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
                     int space)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb = to, *tmp;
    bool first = true;

    if (!sysctl_tcp_retrans_collapse) //collapsing is disabled
        return;
    if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN) //never collapse a SYN segment
        return;
    //walk the write queue starting at skb
    tcp_for_write_queue_from_safe(skb, tmp, sk) {
        if (!tcp_can_collapse(sk, skb))  //判断skb是否适合合并
            break;

        space -= skb->len;

        if (first) {
            first = false;
            continue;
        }

        if (space < 0)
            break;
        /* Punt if not enough space exists in the first SKB for
         * the data in the second
         */
        if (skb->len > skb_availroom(to))
            break;

        if (after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp)))
            break;

        tcp_collapse_retrans(sk, to); //合并重传
    }
}

/* Check if coalescing SKBs is legal. */
static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
{
    if (tcp_skb_pcount(skb) > 1)  // skb只包含一个数据包,没有TSO分包
        return false;
    /* TODO: SACK collapsing could be used to remove this condition */
    if (skb_shinfo(skb)->nr_frags != 0)  // 数据都在线性空间,非线性空间中没有数据
        return false;
    if (skb_cloned(skb))   // 不是clone
        return false;
    if (skb == tcp_send_head(sk))
        return false;
    /* Some heuristics for collapsing over SACK'd could be invented */
    if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)  // 已经被sack的当然不用重传
        return false;

    return true;
}


/* Collapses two adjacent SKB's during retransmission. */
static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
    int skb_size, next_skb_size;

    skb_size = skb->len;
    next_skb_size = next_skb->len;

    BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);

    tcp_highest_sack_combine(sk, next_skb, skb);
    // 将要合并的包从队列中删掉
    tcp_unlink_write_queue(next_skb, sk);
    // 将数据copy到前一个包上,调整前一个的len,tail
    skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
                  next_skb_size);

    if (next_skb->ip_summed == CHECKSUM_PARTIAL)
        skb->ip_summed = CHECKSUM_PARTIAL;

    if (skb->ip_summed != CHECKSUM_PARTIAL)
        skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
    // end_seq 等于后一个包的end_seq,
    //所以如果skb->end_seq > next_skb->seq,就会合并出一个len>end_seq-seq的异常数据
    //(内核保证了sk_write_queue不会出现这情况)
    /* Update sequence range on original skb. */
    TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;

    /* Merge over control information. This moves PSH/FIN etc. over */
    TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;

    /* All done, get rid of second SKB and account for it so
     * packet counting does not break.
     */
    TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS;

    /* changed transmit queue under us so clear hints */
    tcp_clear_retrans_hints_partial(tp);
    if (next_skb == tp->retransmit_skb_hint)
        tp->retransmit_skb_hint = skb;
    // 调整pcount
    tcp_adjust_pcount(sk, next_skb, tcp_skb_pcount(next_skb));
   // 合并到了前一个包上,所以释放这个包
    sk_wmem_free_skb(sk, next_skb);
}

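4.4 Copying of retransmitted packets

Many people complain that the Linux network stack is heavyweight, and the most common complaints are the cost of checksum computation and of repeated data copies. For checksums, the answer was to let the NIC compute them in hardware, offloading that cost and speeding up transmission. For the repeated copies, dpdk, netmap, pf_ring and similar projects implement so-called zero copy: they bypass the protocol stack and pull data from the NIC straight into user space. That probably only pays off for UDP or for routing and forwarding; rebuilding TCP that way is likely wishful thinking.

From a software design point of view, the Linux network stack is already quite elegant. It respects layering, decouples the protocol layers from one another, and decouples them from the hardware implementation. The way the TCP stack copies packets for retransmission alone shows how smart the implementation is.

When the TCP stack needs to retransmit a segment, it clones it with skb_clone() and retransmits the clone. For performance, cloning copies only the sk_buff header, while the data area is shared. skb_clone() in turn calls __skb_clone().

When a clone is made, skb->cloned is set to 1 and skb_shinfo(skb)->dataref is atomically incremented. After the clone has been transmitted, skb->cloned stays 1, while skb_release_data() atomically decrements skb_shinfo(skb)->dataref when the clone is freed. The TCP stack can therefore use these markers to tell which segments have already been retransmitted, and does not add them to the retransmit queue again.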

struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
{
    struct sk_buff_fclones *fclones = container_of(skb,
                               struct sk_buff_fclones,
                               skb1);
    struct sk_buff *n = &fclones->skb2;

    if (skb_orphan_frags(skb, gfp_mask))
        return NULL;

    if (skb->fclone == SKB_FCLONE_ORIG &&
        n->fclone == SKB_FCLONE_UNAVAILABLE) {
        n->fclone = SKB_FCLONE_CLONE;
        atomic_inc(&fclones->fclone_ref);
    } else {
        if (skb_pfmemalloc(skb))
            gfp_mask |= __GFP_MEMALLOC;

        n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
        if (!n)
            return NULL;

        kmemcheck_annotate_bitfield(n, flags1);
        kmemcheck_annotate_bitfield(n, flags2);
        n->fclone = SKB_FCLONE_UNAVAILABLE;
    }

    return __skb_clone(n, skb);
}

static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
{
#define C(x) n->x = skb->x

    n->next = n->prev = NULL;
    n->sk = NULL;
    //只拷贝头部
    __copy_skb_header(n, skb);

    C(len);
    C(data_len);
    C(mac_len);
    n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
    //cloned位置1
    n->cloned = 1;
    n->nohdr = 0;
    n->destructor = NULL;
    C(tail);
    C(end);
    C(head);
    C(head_frag);
    C(data);
    C(truesize);
    atomic_set(&n->users, 1);
    //原子加操作
    atomic_inc(&(skb_shinfo(skb)->dataref));
    skb->cloned = 1;

    return n;
#undef C
}
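
To make the "copy the header, share the data" idea concrete, here is a small user-space model, offered only as a sketch. The types struct payload and struct pkt and the functions pkt_alloc, pkt_clone and pkt_free are hypothetical stand-ins for sk_buff, skb_shared_info and their helpers, and the reference count is a plain int rather than an atomic, so the model is single-threaded only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Shared payload with a reference count (plays the role of dataref). */
struct payload {
    int refcnt;
    size_t len;
    unsigned char bytes[];
};

/* Small per-packet header that points at the shared payload. */
struct pkt {
    int cloned;              /* set once the payload is shared */
    struct payload *data;
};

struct pkt *pkt_alloc(const void *src, size_t len)
{
    struct pkt *p = malloc(sizeof(*p));
    struct payload *d = malloc(sizeof(*d) + len);

    if (!p || !d) {
        free(p);
        free(d);
        return NULL;
    }
    d->refcnt = 1;
    d->len = len;
    memcpy(d->bytes, src, len);
    p->cloned = 0;
    p->data = d;
    return p;
}

/* Copy only the header; the payload is shared and its refcount bumped. */
struct pkt *pkt_clone(struct pkt *orig)
{
    struct pkt *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    n->data = orig->data;
    n->cloned = 1;
    orig->cloned = 1;
    orig->data->refcnt++;
    return n;
}

/* Drop one header; the payload is freed with its last reference. */
void pkt_free(struct pkt *p)
{
    if (--p->data->refcnt == 0)
        free(p->data);
    free(p);
}
```

Cloning costs one small allocation regardless of payload size, which is why a retransmission does not need to copy the segment data.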

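5 Zero-window probe timer

rwnd is the receiver's window, and the amount of data the sender may have outstanding is at most min(rwnd, cwnd). Once the receiver's receive buffer is full it cannot accept more data, so it sends an ACK advertising a zero window, telling the sender: stop sending, I really cannot take any more. This typically happens when the receiving side is slow and cannot allocate more socket buffers to store TCP data. Each time the user process reads data from the receive buffer, the stack checks whether enough buffer space has been freed to announce a new window to the sender. If so, it sends an ACK segment advertising the new window.

An example: it is a bit like courting someone you admire. Eager for quick results, you shower her with gifts (many segments) all at once. She cannot stand the pestering and hints that you should stop sending gifts (an ACK advertising a zero window). Nothing good comes from forcing things and haste makes waste, so you pause and wait until she has calmed down and noticed your good side, at which point she hints that she no longer minds (an ACK advertising a new window). If that hint never reaches you (for example the packet is lost to network congestion), you are out of luck and the courtship stalls. To avoid this you cannot simply wait forever. There is a probing mechanism: you periodically check her status through her social feed and her friends, probing on a timer and waiting for the right moment to try again. That is the zero-window probe mechanism.

5.1 When the probe timer is started and cleared

The zero-window probe timer is started in two main situations.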

zero_win.png

The first is when TCP sends data with __tcp_push_pending_frames:

//net/ipv4/tcp_output.c
void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
                   int nonagle)
{
...
    /* If we are closed, the bytes will have to remain here.
     * In time closedown will finish, we empty the write queue and
     * all will be happy.
     */
    if (unlikely(sk->sk_state == TCP_CLOSE))
        return;

    if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
               sk_gfp_atomic(sk, GFP_ATOMIC)))
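    /*
     * A true return value means that all data sent so far has been
     * acknowledged while unsent data remains in the send queue,
     * possibly because the window is too small to send it.
     */
        tcp_check_probe_timer(sk);  //arm the zero-window probe timer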
...
}

static inline void tcp_check_probe_timer(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    const struct inet_connection_sock *icsk = inet_csk(sk);
    //所有发送出去的数据都已经被确认
    //且未设置坚持定时器、重传定时器、ER定时器和TLP定时器  
    if (!tp->packets_out && !icsk->icsk_pending)
    //启动零窗口探测定时器
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                      icsk->icsk_rto, TCP_RTO_MAX);
}

The other is when an ACK is received. First look at how tcp_input.c handles an incoming ACK.

//net/ipv4/tcp_input.c
/* This routine deals with incoming acks, but not outgoing ones. */
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
    int prior_packets = tp->packets_out;
...//所有发送出去的数据都已经被确认  
    if (!prior_packets)
        goto no_queue;
...
no_queue:
    /* If data was DSACKed, see if we can undo a cwnd reduction. */
    if (flag & FLAG_DSACKING_ACK)
        tcp_fastretrans_alert(sk, acked, prior_unsacked,
                      is_dupack, flag);
    /* If this ack opens up a zero window, clear backoff.  It was
     * being used to time the probes, and is probably far higher than
     * it needs to be for normal retransmission.
     */
     //发送队列中还有数据未发送  
    if (tcp_send_head(sk))
        tcp_ack_probe(sk); //设置零窗口探测定时器
}

static void tcp_ack_probe(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    /* Was it a usable window open? */
    //当前窗口可以容纳下一个要发送的包  
    if (!after(TCP_SKB_CB(tcp_send_head(sk))->end_seq, tcp_wnd_end(tp))) {
        icsk->icsk_backoff = 0;
        //清除零窗口探测定时器
        inet_csk_clear_xmit_timer(sk, ICSK_TIME_PROBE0);
        /* Socket must be waked up by subsequent tcp_data_snd_check().
         * This function is not for random using!
         */
    } else {
        unsigned long when = inet_csk_rto_backoff(icsk, TCP_RTO_MAX);
        //启动零窗口定时器
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                      when, TCP_RTO_MAX);
    }
}

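In essence the two cases are the same. The zero-window probe timer is armed when all data sent so far has been acknowledged and unsent data remains in the send queue (in that state the retransmission, ER and TLP timers are not set). The likely reason the data has not been sent is that the send window is too small.

As tcp_ack_probe shows, the timeout is derived from the RTO, with backoff applied. The zero-window probe timer is cleared in the following situations:

  • The send window grows enough to allow at least one segment to be sent

  • A retransmission, ER or TLP timer is installed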

5.2 What happens when the zero-window timer expires

tcp_xmit_probe_skb.png

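As we know, the zero-window timer and the retransmission timer are the same underlying timer; the callback checks the pending event and dispatches to different handlers. For the probe event it calls tcp_probe_timer.
The main job of this function is to send probe packets, and the code below will look familiar. The maximum number of probes equals sysctl_tcp_retries2.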

static void tcp_probe_timer(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    int max_probes;
    u32 start_ts;
   //有发送出去的数据未确认或发送队列为空  
    if (tp->packets_out || !tcp_send_head(sk)) {
        icsk->icsk_probes_out = 0;
        return;
    }

    /* RFC 1122 4.2.2.17 requires the sender to stay open indefinitely as
     * long as the receiver continues to respond probes. We support this by
     * default and reset icsk_probes_out with incoming ACKs. But if the
     * socket is orphaned or the user specifies TCP_USER_TIMEOUT, we
     * kill the socket when the retry count and the time exceeds the
     * corresponding system limit. We also implement similar policy when
     * we use RTO to probe window in tcp_retransmit_timer().
     */
    start_ts = tcp_skb_timestamp(tcp_send_head(sk));
    if (!start_ts)
        skb_mstamp_get(&tcp_send_head(sk)->skb_mstamp);
    else if (icsk->icsk_user_timeout &&
         (s32)(tcp_time_stamp - start_ts) > icsk->icsk_user_timeout)
        goto abort;

    max_probes = sysctl_tcp_retries2;
    //当前socket是孤立socket  
    if (sock_flag(sk, SOCK_DEAD)) {
        const int alive = inet_csk_rto_backoff(icsk, TCP_RTO_MAX) < TCP_RTO_MAX;

        max_probes = tcp_orphan_retries(sk, alive);
        if (!alive && icsk->icsk_backoff >= max_probes)
            goto abort;
        //孤立socket占用资源过多  
        if (tcp_out_of_resources(sk, true))
            return;
    }
     //探测次数超出上限  
    if (icsk->icsk_probes_out > max_probes) {
abort:      tcp_write_err(sk);
    } else {
        //发送探测报文  
        /* Only send another probe if we didn't close things up. */
        tcp_send_probe0(sk);
    }
}

tcp_send_probe0 sends the probe packet:

//net/ipv4/tcp_output.c
/* A window probe timeout has occurred.  If window is not closed send
 * a partial packet else a zero probe.
 */
void tcp_send_probe0(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    unsigned long probe_max;
    int err;

    err = tcp_write_wakeup(sk);  //发送探测报文   
     //有发送出去的数据未确认或发送队列为空  
    if (tp->packets_out || !tcp_send_head(sk)) {
        /* Cancel probe timer, if it is not required. */
        icsk->icsk_probes_out = 0;
        icsk->icsk_backoff = 0;
        return;
    }

    if (err <= 0) {
        if (icsk->icsk_backoff < sysctl_tcp_retries2)
            icsk->icsk_backoff++;
        icsk->icsk_probes_out++;
        probe_max = TCP_RTO_MAX;
    } else {  //包在底层由于队列拥塞没有发送出去  
        /* If packet was not sent due to local congestion,
         * do not backoff and do not remember icsk_probes_out.
         * Let local senders to fight for local resources.
         *
         * Use accumulated backoff yet.
         */
        if (!icsk->icsk_probes_out)
            icsk->icsk_probes_out = 1;
        probe_max = TCP_RESOURCE_PROBE_INTERVAL;
    }
    //重启probe定时器,并设置指数退避
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                  inet_csk_rto_backoff(icsk, probe_max),
                  TCP_RTO_MAX);
}

/* Initiate keepalive or window probe from timer. */
int tcp_write_wakeup(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;

    if (sk->sk_state == TCP_CLOSE)
        return -1;
    //有数据尚未发送 &&当前窗口允许发送至少1字节的新数据  
    if ((skb = tcp_send_head(sk)) != NULL &&
        before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(tp))) {
        int err;
        unsigned int mss = tcp_current_mss(sk);
        unsigned int seg_size = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;

        if (before(tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
            tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;

        /* We are probing the opening of a window
         * but the window size is != 0
         * must have been a result SWS avoidance ( sender )
         */
        if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
            skb->len > mss) {
              //数据段过大  
            seg_size = min(seg_size, mss);
            TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
            if (tcp_fragment(sk, skb, seg_size, mss))
                return -1;
        } else if (!tcp_skb_pcount(skb))
            tcp_set_skb_tso_segs(sk, skb, mss);

        TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
        //发送新数据作为探测报文  
        err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
        //发送成功  
        if (!err)
            //处理发送了新数据的事件  
            tcp_event_new_data_sent(sk, skb);
        return err;
    } else {
        //没有数据要发送或当前发送窗口不允许发送新数据  
        if (between(tp->snd_up, tp->snd_una + 1, tp->snd_una + 0xFFFF))
        //有紧急数据未确认且在窗口之内,这时一定有数据要发送  
            tcp_xmit_probe_skb(sk, 1);  //发送一个使用重复序列号的ACK  
        return tcp_xmit_probe_skb(sk, 0);   //发送一个使用旧序列号的ACK  
    }
}

tcp_xmit_probe_skb sends a segment that carries no data:

//net/ipv4/tcp_output.c
/* This routine sends a packet with an out of date sequence
 * number. It assumes the other end will try to ack it.
 *
 * Question: what should we make while urgent mode?
 * 4.4BSD forces sending single byte of data. We cannot send
 * out of window data, because we have SND.NXT==SND.MAX...
 *
 * Current solution: to send TWO zero-length segments in urgent mode:
 * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
 * out-of-date with SND.UNA-1 to probe window.
 */
static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;

    /* We don't queue it, tcp_transmit_skb() sets ownership. */
    skb = alloc_skb(MAX_TCP_HEADER, sk_gfp_atomic(sk, GFP_ATOMIC));
    if (skb == NULL)
        return -1;

    /* Reserve space for headers and set control bits. */
    skb_reserve(skb, MAX_TCP_HEADER);
    /* Use a previous sequence.  This should cause the other
     * end to send an ack.  Don't queue or clone SKB, just
     * send it.
     */
    tcp_init_nondata_skb(skb, tp->snd_una - !urgent, TCPHDR_ACK);
    skb_mstamp_get(&skb->skb_mstamp);
    return tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC); //发送数据
}

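If the number of probes exceeds the limit or memory is tight, the zero-window probe timer tears down the connection; otherwise it sends a probe and re-arms itself. When sending a probe, if the send window allows at least one byte, a new data segment is sent. Otherwise a zero-length ACK with a stale sequence number (SND.UNA - 1) is sent; the peer discards it as out of window and answers with an ACK. If urgent data is still unacknowledged, an additional zero-length segment with the valid sequence number SND.UNA is sent first to deliver the urgent pointer.

In short, the zero-window probe timer sends probes and expects the peer to ACK them, so that TCP learns the latest window. Once the window has grown enough to send data, normal data exchange can resume as soon as possible.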
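
tcp_send_probe0 re-arms the timer with an interval that grows with icsk_backoff. The sketch below models that exponential backoff in user space, assuming the interval is the RTO doubled once per backoff step and capped at a maximum. probe_interval is a hypothetical name, and the values are in whatever time unit the caller uses (milliseconds in the example that follows).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Interval before the next probe: rto doubled once per backoff step,
 * capped at max_when. All values share the same time unit.
 */
uint64_t probe_interval(uint32_t rto, unsigned int backoff, uint64_t max_when)
{
    uint64_t when;

    assert(backoff < 32);    /* keep the shift well defined in this sketch */
    when = (uint64_t)rto << backoff;
    return when < max_when ? when : max_when;
}
```

With an RTO of 200 ms and a cap of 120000 ms, three backoff steps give 1600 ms, while ten steps would give 204800 ms and are therefore capped at 120000 ms. Widening to 64 bits before shifting avoids overflow for large backoff values.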

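6 Delayed ACK timer

With delayed ACK, when a data segment arrives the receiver does not send the ACK immediately. It waits for a while so that the ACK can ride along with outgoing data, which saves one packet. The timer bounds that wait: if it expires and there is still no data to send to the peer, the ACK is sent on its own; if data becomes available before it expires, the ACK is sent together with that data. Delayed ACK trades a little latency for better throughput between the two ends.

6.1 The ato of delayed ACK

A precondition for delayed ACK is that the two ends are exchanging data in a ping-pong pattern. When the sender transmits a packet that carries payload, it checks whether the connection has been idle for longer than the RTO, in which case the congestion window is considered stale. If so, the congestion window is invalidated and recomputed. If the time since the last packet was received is short enough, the two sides are in a back-and-forth exchange and the connection enters delayed ACK (pingpong) mode.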

/* Congestion state accounting after a packet has been sent. */
static void tcp_event_data_sent(struct tcp_sock *tp,
                struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    const u32 now = tcp_time_stamp;

    if (sysctl_tcp_slow_start_after_idle &&
        (!tp->packets_out && (s32)(now - tp->lsndtime) > icsk->icsk_rto))
        tcp_cwnd_restart(sk, __sk_dst_get(sk)); //重置cwnd

    tp->lsndtime = now; //更新最近发送数据包的时间

    /* If it is a reply for ato after last received
     * packet, enter pingpong mode.
     */
     //如果距离上次接收到数据包的时间在ato内,则进入延迟确认模式
    if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
        icsk->icsk_ack.pingpong = 1;
}

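Q: icsk->icsk_ack.ato plays an important role when ACKs are sent. What exactly is it for?

A: ato stands for ACK Timeout. The expiry time of the delayed ACK timer, however, is icsk->icsk_ack.timeout; ato is only an intermediate value used to compute that timeout, and it is adjusted dynamically from the interval between received packets. In general, ato shrinks when packets arrive closer together and grows when they arrive further apart. The minimum ato is 40ms, and the maximum is usually 200ms or one RTT. So in practice the ACK delay we observe lies between 40ms and min(200ms, RTT).

ato is updated in tcp_event_data_recv(), where delta is the time since the previous data packet was received:

  1. If delta <= TCP_ATO_MIN / 2, then ato = ato / 2 + TCP_ATO_MIN / 2.

  2. If TCP_ATO_MIN / 2 < delta <= ato, then ato = min(ato / 2 + delta, rto).

  3. If delta > ato, ato is unchanged.

tcp_send_delayed_ack() bounds ato and stores jiffies + ato in icsk->icsk_ack.timeout, which is the expiry time of the delayed ACK timer.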
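
The three rules above can be sketched as a small function. This is a user-space illustration that follows the rules exactly as listed here, not a copy of tcp_event_data_recv(); update_ato and ATO_MIN_MS are hypothetical names, and all times are assumed to be in milliseconds with TCP_ATO_MIN taken as 40 ms.

```c
#include <assert.h>

#define ATO_MIN_MS 40   /* assumed to mirror TCP_ATO_MIN at 40 ms */

/* Return the new ato (ms) given the gap since the last data packet. */
unsigned int update_ato(unsigned int ato, unsigned int delta, unsigned int rto)
{
    if (delta <= ATO_MIN_MS / 2) {
        /* Rule 1: packets arrive back to back, decay towards the minimum. */
        return ato / 2 + ATO_MIN_MS / 2;
    }
    if (delta <= ato) {
        /* Rule 2: blend in the observed gap, but never exceed rto. */
        unsigned int next = ato / 2 + delta;

        return next < rto ? next : rto;
    }
    /* Rule 3: the gap is larger than ato, leave ato unchanged. */
    return ato;
}
```

Starting from ato = 100 ms, a 10 ms gap pulls ato down to 70 ms, a 60 ms gap raises it to 110 ms, and a 150 ms gap leaves it at 100 ms. Repeated short gaps converge on the 40 ms minimum.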

6.2 When the delayed ACK timer is armed

The delayed ACK timer is set in the following situations:

  • When a SYN/ACK is received from the peer after sending a SYN
//net/ipv4/tcp_input.c
static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
                     const struct tcphdr *th, unsigned int len)
{
...
    if (th->ack) {
...
            if (sk->sk_write_pending ||
            icsk->icsk_accept_queue.rskq_defer_accept ||
            icsk->icsk_ack.pingpong) {  //进入pingpong场景
        /* Save one ACK. Data will be ready after
             * several ticks, if write_pending is set.
             *
             * It may be deleted, but with this feature tcpdumps
             * look so _wonderfully_ clever, that I was not able
             * to stand against the temptation 8)     --ANK
             */
            inet_csk_schedule_ack(sk);
            icsk->icsk_ack.lrcvtime = tcp_time_stamp;
            tcp_enter_quickack_mode(sk);
            inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
                          TCP_DELACK_MAX, TCP_RTO_MAX);  //设置delay ack定时器
            }
}
  • When no skb can be allocated for sending an ACK
//net/ipv4/tcp_output.c
/* This routine sends an ack and also updates the window. */
void tcp_send_ack(struct sock *sk)
{
    struct sk_buff *buff;

    /* If we have been reset, we may not send again. */
    if (sk->sk_state == TCP_CLOSE)
        return;

    tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);

    /* We are not putting this on the write queue, so
     * tcp_transmit_skb() will set the ownership to this
     * sock.
     */
    //发送ack时申请内存
    buff = alloc_skb(MAX_TCP_HEADER, sk_gfp_atomic(sk, GFP_ATOMIC));
    if (buff == NULL) {
        inet_csk_schedule_ack(sk);
        inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
        //申请失败,启动delay ack定时器
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
                      TCP_DELACK_MAX, TCP_RTO_MAX);
        return;
    }

    /* Reserve space for headers and prepare control bits. */
    skb_reserve(buff, MAX_TCP_HEADER);
    tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);

    /* We do not want pure acks influencing TCP Small Queues or fq/pacing
     * too much.
     * SKB_TRUESIZE(max(1 .. 66, MAX_TCP_HEADER)) is unfortunately ~784
     * We also avoid tcp_wfree() overhead (cache line miss accessing
     * tp->tsq_flags) by using regular sock_wfree()
     */
    skb_set_tcp_pure_ack(buff);

    /* Send it off, this clears delayed acks for us. */
    skb_mstamp_get(&buff->skb_mstamp);
    tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
}
  • When data is placed on the prequeue:
//net/ipv4/tcp_ipv4.c
bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
{
...
    if (tp->ucopy.memory > sk->sk_rcvbuf) {
        struct sk_buff *skb1;

        BUG_ON(sock_owned_by_user(sk));

        while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) {
            sk_backlog_rcv(sk, skb1);
            NET_INC_STATS_BH(sock_net(sk),
                     LINUX_MIB_TCPPREQUEUEDROPPED);
        }

        tp->ucopy.memory = 0;
    } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
        //有数据进入prequeue队列
        wake_up_interruptible_sync_poll(sk_sleep(sk),
                       POLLIN | POLLRDNORM | POLLRDBAND);
        if (!inet_csk_ack_scheduled(sk))
            //启动delay ack定时器
            inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
                          (3 * tcp_rto_min(sk)) / 4,
                          TCP_RTO_MAX);
    }
    return true;
}
  • When __tcp_ack_snd_check is called to send an ACK
//net/ipv4/tcp_input.c
/*
 * Check if sending an ack is needed.
 */
static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
    struct tcp_sock *tp = tcp_sk(sk);

        /* More than one full frame received... */
    if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
         /* ... and right edge of window advances far enough.
          * (tcp_recvmsg() will send ACK otherwise). Or...
          */
         __tcp_select_window(sk) >= tp->rcv_wnd) ||
        /* We ACK each frame or... */
        tcp_in_quickack_mode(sk) ||
        /* We have out of order data. */
        (ofo_possible && skb_peek(&tp->out_of_order_queue))) {
        /* Then ack it now */
        tcp_send_ack(sk);
    } else {
        /* Else, send delayed ack. */
        tcp_send_delayed_ack(sk);
    }
}

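The last case is the one that most commonly triggers a delayed ACK. The ACK is delayed when:

  1. No more than one MSS of new data has been received, or the window to be advertised would be smaller than the current one

  2. The connection is not in quick ACK mode

  3. There is no out-of-order data

When all of these hold, tcp_send_delayed_ack is called and sets the delayed ACK timer: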

void tcp_send_delayed_ack(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    int ato = icsk->icsk_ack.ato;
    unsigned long timeout;

    tcp_ca_event(sk, CA_EVENT_DELAYED_ACK);
    /*
     * Possible upper bounds for ato:
     * 1. 500ms
     * 2. 200ms, in delayed ACK (pingpong) mode or when ICSK_ACK_PUSHED is set
     * 3. the RTT, if an RTT sample exists
     */

    if (ato > TCP_DELACK_MIN) {
        const struct tcp_sock *tp = tcp_sk(sk);
        int max_ato = HZ / 2; //500ms
        /* in delayed ACK (pingpong) mode, or ICSK_ACK_PUSHED is set */
        if (icsk->icsk_ack.pingpong ||
            (icsk->icsk_ack.pending & ICSK_ACK_PUSHED))
            max_ato = TCP_DELACK_MAX; /**200ms**/

        /* Slow path, intersegment interval is "high". */

        /* If some rtt estimate is known, use it to bound delayed ack.
         * Do not use inet_csk(sk)->icsk_rto here, use results of rtt measurements
         * directly.
         */
        /*如果有RTT采样,使用RTT来作为ato的最大值*/
        if (tp->srtt_us) {
            int rtt = max_t(int, usecs_to_jiffies(tp->srtt_us >> 3),
                    TCP_DELACK_MIN);

            if (rtt < max_ato)
                max_ato = rtt;
        }
        //ato 不能超过最大值
        ato = min(ato, max_ato);
    }

    /* Stay within the limit we were given */
    timeout = jiffies + ato; //延迟ACK的超时时刻

    /* Use new timeout only if there wasn't a older one earlier. */
     /* 如果之前已经启动了延迟确认定时器了 */
    if (icsk->icsk_ack.pending & ICSK_ACK_TIMER) {
        /* If delack timer was blocked or is about to expire,
         * send ACK now.
         * 如果之前延迟确认定时器触发时,因为socket被用户进程锁住而无法发送ACK,那么现在马上发送。
         * 如果接收到数据报时,延迟确认定时器已经快要超时了(离现在不到1/4 * ato),那么马上发送ACK。
         */
        if (icsk->icsk_ack.blocked ||
            time_before_eq(icsk->icsk_ack.timeout, jiffies + (ato >> 2))) {
            tcp_send_ack(sk); /* 发送ACK */
            return;
        }
 /* 如果新的超时时间,比之前设定的超时时间晚,那么使用之前设定的超时时间 */
        if (!time_before(timeout, icsk->icsk_ack.timeout))
            timeout = icsk->icsk_ack.timeout;
    }
     /* 如果还没有启动延迟确认定时器 */
    icsk->icsk_ack.pending |= ICSK_ACK_SCHED | ICSK_ACK_TIMER; /* 设置ACK需要发送标志、定时器启动标志 */
    icsk->icsk_ack.timeout = timeout; /* 超时时间 */
    sk_reset_timer(sk, &icsk->icsk_delack_timer, timeout);/* 启动延迟确认定时器 */
}

/* minimal time to delay before sending an ACK. */
# define TCP_DELACK_MIN ((unsigned) (HZ/25))
/* maximal time to delay before sending an ACK */
# define TCP_DELACK_MAX ((unsigned) (HZ/5))
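
The clamping of ato in tcp_send_delayed_ack can be sketched in user space as follows. This is an illustration under stated assumptions rather than the kernel code: bound_ato is a hypothetical name, times are in milliseconds instead of jiffies, and srtt_ms stands for the already smoothed RTT (the kernel derives it from tp->srtt_us >> 3).

```c
#include <assert.h>
#include <stdbool.h>

#define DELACK_MIN_MS 40    /* HZ/25 expressed in ms */
#define DELACK_MAX_MS 200   /* HZ/5 expressed in ms */

/*
 * Clamp ato (ms) following the listing above. srtt_ms == 0 means that
 * no RTT sample is available yet.
 */
unsigned int bound_ato(unsigned int ato, bool pingpong_or_pushed,
                       unsigned int srtt_ms)
{
    unsigned int max_ato = 500;          /* HZ/2 */

    if (ato <= DELACK_MIN_MS)
        return ato;                      /* small enough, no bound applied */
    if (pingpong_or_pushed)
        max_ato = DELACK_MAX_MS;
    if (srtt_ms) {
        /* The RTT bound is never allowed below the delayed ACK minimum. */
        unsigned int rtt = srtt_ms > DELACK_MIN_MS ? srtt_ms : DELACK_MIN_MS;

        if (rtt < max_ato)
            max_ato = rtt;
    }
    return ato < max_ato ? ato : max_ato;
}
```

For example, an ato of 150 ms in pingpong mode with a 100 ms smoothed RTT is bounded to 100 ms, and with a 10 ms RTT it is bounded to the 40 ms floor.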

The delayed ACK timer is cleared when an ACK is sent:

//net/ipv4/tcp_output.c
static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
{
    tcp_dec_quickack_mode(sk, pkts);
    inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
}
...
 static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
                  gfp_t gfp_mask)
{
...
    if (likely(tcb->tcp_flags & TCPHDR_ACK))
        tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
}

6.3 What happens when the delayed ACK timer expires?

The expiry handler of the delayed ACK timer is tcp_delack_timer:

//net/ipv4/tcp_timer.c
static void tcp_delack_timer(unsigned long data)
{
    struct sock *sk = (struct sock *)data;

    bh_lock_sock(sk);
    if (!sock_owned_by_user(sk)) {
        //进行超时处理
        tcp_delack_timer_handler(sk);
    } else {
        //标识延迟ACK被锁定,以后安装延迟ACK定时器时要立即发送ACK  
        inet_csk(sk)->icsk_ack.blocked = 1;
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
        /* delegate our work to tcp_release_cb() */
        if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
            sock_hold(sk);
    }
    bh_unlock_sock(sk);
    sock_put(sk);
}

void tcp_delack_timer_handler(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    sk_mem_reclaim_partial(sk);

    if (sk->sk_state == TCP_CLOSE || !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
        goto out;

    if (time_after(icsk->icsk_ack.timeout, jiffies)) {   //未到超时时间  
        sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
        goto out;
    }
    icsk->icsk_ack.pending &= ~ICSK_ACK_TIMER;

    if (!skb_queue_empty(&tp->ucopy.prequeue)) {  //处理prequeue队列  
        struct sk_buff *skb;

        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSCHEDULERFAILED);

        while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
            sk_backlog_rcv(sk, skb);

        tp->ucopy.memory = 0;
    }

    if (inet_csk_ack_scheduled(sk)) { //an ACK is pending
        if (!icsk->icsk_ack.pingpong) { //not in pingpong mode: double ato, capped at rto
            /* Delayed ACK missed: inflate ATO. */
            icsk->icsk_ack.ato = min(icsk->icsk_ack.ato << 1, icsk->icsk_rto);
        } else {   //pingpong mode: leave it and reset ato to the minimum
            /* Delayed ACK missed: leave pingpong mode and
             * deflate ATO.
             */
            icsk->icsk_ack.pingpong = 0;
            icsk->icsk_ack.ato      = TCP_ATO_MIN;
        }
        tcp_send_ack(sk);  //发送ACK 
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKS);
    }

out:
    if (sk_under_memory_pressure(sk))
        sk_mem_reclaim(sk);
}

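If the socket is locked by the application process when the delayed ACK timer expires, the TCP_DELACK_TIMER_DEFERRED flag is set, so that tcp_release_cb is called when the application releases the socket:

//net/ipv4/tcp_output.c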
void tcp_release_cb(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    unsigned long flags, nflags;

    /* perform an atomic operation only if at least one flag is set */
    do {
        flags = tp->tsq_flags;
        if (!(flags & TCP_DEFERRED_ALL))
            return;
        nflags = flags & ~TCP_DEFERRED_ALL;
    } while (cmpxchg(&tp->tsq_flags, flags, nflags) != flags);

    if (flags & (1UL << TCP_TSQ_DEFERRED))
        tcp_tsq_handler(sk);

    /* Here begins the tricky part :
     * We are called from release_sock() with :
     * 1) BH disabled
     * 2) sk_lock.slock spinlock held
     * 3) socket owned by us (sk->sk_lock.owned == 1)
     *
     * But following code is meant to be called from BH handlers,
     * so we should keep BH disabled, but early release socket ownership
     */
    sock_release_ownership(sk);

    if (flags & (1UL << TCP_WRITE_TIMER_DEFERRED)) {
        tcp_write_timer_handler(sk);
        __sock_put(sk);
    }
    //tcp_delack_timer_handler函数最终也会获得运行机会。
    if (flags & (1UL << TCP_DELACK_TIMER_DEFERRED)) {
        tcp_delack_timer_handler(sk);
        __sock_put(sk);
    }
    if (flags & (1UL << TCP_MTU_REDUCED_DEFERRED)) {
        inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
        __sock_put(sk);
    }
}

tcp_delack_timer_handler函数最终也会获得运行机会。

6.4 Delayed ACK problems caused by write-write-read

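TCP/IP Illustrated describes how delayed ACK and the Nagle algorithm can interact to stall a data transfer. In practice the TCP stack has been tuned many times, mainly by refining the Nagle algorithm and by adding more conditions that trigger an immediate ACK, so the problem has largely disappeared and user-space programs rarely lose performance to delayed ACK. It can still bite a write-write-read program, for example write(head), write(body), read(data). When client and server exchange data back and forth, the connection enters pingpong mode; the second small write is held back by Nagle until the first one is acknowledged, and that ACK is delayed. On Linux the cost is modest, because the delayed ACK timeout is typically 40ms by default; on Windows it is 200ms. To improve client throughput, Windows by default acknowledges every second segment and otherwise waits up to 200ms before sending the ACK. This does not hurt one-way downloads: data keeps arriving, so ACKs keep flowing, and even if the final segment is an odd one, the HTTP header carries Content-Length, so the application knows it has received everything and returns without waiting for the delayed ACK. The behavior can matter for highly interactive workloads, but in practice the impact is limited, since many such applications (games, for example) use UDP anyway.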
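
One common way to avoid the write-write-read pattern is to hand the header and the body to the kernel in a single system call. The sketch below uses writev for that; send_head_and_body is a hypothetical helper name, and error handling for short writes is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send header and body with one system call instead of two write() calls,
 * so the stack sees a single write and Nagle has nothing to hold back. */
ssize_t send_head_and_body(int fd, const void *head, size_t head_len,
                           const void *body, size_t body_len)
{
    struct iovec iov[2];

    assert(head != NULL && body != NULL);
    iov[0].iov_base = (void *)head;   /* writev() does not modify the buffers */
    iov[0].iov_len  = head_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;
    return writev(fd, iov, 2);
}
```

Copying both parts into one buffer before a single write has the same effect. Setting TCP_NODELAY on the socket is another option, at the cost of sending more small segments.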

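7 Keepalive timer

When a TCP sender has sent some data and then goes quiet, and the receiver has nothing to send either, the connection sits idle (a Telnet session, for example). With keepalive enabled, the keepalive timer sends probe segments to check whether the peer is still reachable: if the peer answers, the connection is kept; otherwise it is closed and its resources are released. Keepalive is enabled with the SO_KEEPALIVE socket option.

TCP keepalive resembles the heartbeat mechanism used by Android clients, but the two behave differently. An Android heartbeat is sent actively by the client: mobile networks are complex and drop frequently, and without heartbeat packets the client would learn that it is offline only when it next tries to send data, which delays or even loses data pushed by the server.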

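(1) TCP parameters

tcp_keepalive_time
Time from the last data exchange until TCP sends the first keepalive probe, i.e. how long the connection may stay idle. Default: 7200s.

tcp_keepalive_intvl
Interval between successive keepalive probes. Default: 75s.

tcp_keepalive_probes
Number of keepalive probes sent before the connection is given up. Default: 9.

Q: How long does a complete keepalive detection take?

A: tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes, which is 7200 + 75 * 9 = 7875s with the defaults.
If a little over two hours is too long, tune the parameters above.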
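
The formula in the answer above is easy to check for other settings. A tiny sketch, with keepalive_detect_seconds as a hypothetical helper name:

```c
#include <assert.h>
#include <limits.h>

/* Worst-case time (seconds) from the last data exchange until a dead peer
 * is detected: idle time plus one interval per probe. */
unsigned long keepalive_detect_seconds(unsigned long idle,
                                       unsigned long intvl,
                                       unsigned long probes)
{
    /* guard against overflow of intvl * probes + idle */
    assert(probes == 0 || intvl <= (ULONG_MAX - idle) / probes);
    return idle + intvl * probes;
}
```

With the defaults, keepalive_detect_seconds(7200, 75, 9) gives 7875; with 60, 10 and 5 the result is 110 seconds.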

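(2) TCP-level socket options

TCP_KEEPIDLE: same meaning as tcp_keepalive_time.

TCP_KEEPINTVL: same meaning as tcp_keepalive_intvl.

TCP_KEEPCNT: same meaning as tcp_keepalive_probes.

Q: The TCP parameters can already be tuned, so why add the TCP-level options above?

A: The TCP parameters apply to every TCP connection on the host; changing one affects all connections.
The TCP-level options apply to a single connection; changing one affects only that connection.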
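
A per-connection setup with these options might look like the sketch below. enable_keepalive is a hypothetical helper; the option names are the Linux ones described above, and all values are in seconds except the probe count.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one TCP socket and override the system-wide defaults
 * for that socket only. Returns 0 on success, -1 on the first failure. */
int enable_keepalive(int fd, int idle_s, int intvl_s, int probes)
{
    int on = 1;

    assert(idle_s > 0 && intvl_s > 0 && probes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

For example, enable_keepalive(fd, 60, 10, 5) would detect a dead peer after roughly 60 + 10 * 5 = 110 seconds of silence, without touching other connections on the host.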

7.1 When the keepalive timer is set

The keepalive timer is set in three situations:

  • The client has sent a SYN and receives the SYN|ACK, and tcp_finish_connect completes the connection:
/net/ipv4/tcp_input.c
void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    tcp_set_state(sk, TCP_ESTABLISHED);

    if (skb != NULL) {
        icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
        security_inet_conn_established(sk, skb);
    }

    /* Make sure socket is routed, for correct metrics.  */
    icsk->icsk_af_ops->rebuild_header(sk);

    tcp_init_metrics(sk);

    tcp_init_congestion_control(sk);

    /* Prevent spurious tcp_cwnd_restart() on first data
     * packet.
     */
    tp->lsndtime = tcp_time_stamp;

    tcp_init_buffer_space(sk);
    // start the keepalive timer if the application has set SOCK_KEEPOPEN
    if (sock_flag(sk, SOCK_KEEPOPEN))
        inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));

    if (!tp->rx_opt.snd_wscale)
        __tcp_fast_path_on(tp, tp->snd_wnd);
    else
        tp->pred_flags = 0;

    if (!sock_flag(sk, SOCK_DEAD)) {
        sk->sk_state_change(sk);
        sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
    }
}

static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
                     const struct tcphdr *th, unsigned int len)
{
...
tcp_finish_connect(sk, skb);
...
}
  • The server has sent a SYN|ACK and receives a valid ACK, and tcp_create_openreq_child creates the child socket:
struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
                                      struct sk_buff *skb)
{
    struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);

    if (newsk != NULL) {
...
        if (sock_flag(newsk, SOCK_KEEPOPEN))    // the application enabled keepalive
            inet_csk_reset_keepalive_timer(newsk,
                                           keepalive_time_when(newtp));
...
    }
    return newsk;
}
  • Keepalive is enabled with the SO_KEEPALIVE socket option:
int sock_setsockopt(struct socket *sock, int level, int optname,
            char __user *optval, unsigned int optlen)
{
    case SO_KEEPALIVE:
#ifdef CONFIG_INET
        if (sk->sk_protocol == IPPROTO_TCP &&
            sk->sk_type == SOCK_STREAM)
            tcp_set_keepalive(sk, valbool);
#endif
        sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
        break;
}

tcp_set_keepalive enables or disables keepalive:

void tcp_set_keepalive(struct sock *sk, int val)
{
    if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
        return;
    // enabling keepalive: start the keepalive timer
    if (val && !sock_flag(sk, SOCK_KEEPOPEN))
        inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tcp_sk(sk)));
    // disabling keepalive: delete the timer
    else if (!val)
        inet_csk_delete_keepalive_timer(sk);
}

The keepalive timer can only be torn down through the SO_KEEPALIVE socket option.
Its timeout is determined by keepalive_time_when:

static inline int keepalive_time_when(const struct tcp_sock *tp)
{
    return tp->keepalive_time ? : sysctl_tcp_keepalive_time;
}

Here sysctl_tcp_keepalive_time is set by the net.ipv4.tcp_keepalive_time kernel parameter, and tp->keepalive_time by the TCP_KEEPIDLE socket option:

static int do_tcp_setsockopt(struct sock *sk, int level,
        int optname, char __user *optval, unsigned int optlen)
{
...
    case TCP_KEEPIDLE:
        if (val < 1 || val > MAX_TCP_KEEPIDLE)
            err = -EINVAL;
        else {
            tp->keepalive_time = val * HZ;
            if (sock_flag(sk, SOCK_KEEPOPEN) &&
                !((1 << sk->sk_state) &
                  (TCPF_CLOSE | TCPF_LISTEN))) {
                u32 elapsed = keepalive_time_elapsed(tp);
                if (tp->keepalive_time > elapsed)
                    elapsed = tp->keepalive_time - elapsed;
                else
                    elapsed = 0;
                inet_csk_reset_keepalive_timer(sk, elapsed); // re-arm the timer with the remaining idle time
            }
        }
...
}

The default keepalive timeout is TCP_KEEPALIVE_TIME (2 hours).

7.2 What happens when the keepalive timer expires?

The keepalive timeout handler is tcp_keepalive_timer:

//net/ipv4/tcp_timer.c
static void tcp_keepalive_timer (unsigned long data)
{
    struct sock *sk = (struct sock *) data;
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    u32 elapsed;

    /* Only process if socket is not in use. */
    bh_lock_sock(sk);
    if (sock_owned_by_user(sk)) {
        /* Try again later. */
        inet_csk_reset_keepalive_timer (sk, HZ/20);
        goto out;
    }
    // SYN-ACK timer expiry (listening socket)
    if (sk->sk_state == TCP_LISTEN) {
        tcp_synack_timer(sk);
        goto out;
    }
    // FIN_WAIT2 timer expiry
    if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
        if (tp->linger2 >= 0) {
            const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;

            if (tmo > 0) {
                tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
                goto out;
            }
        }
        tcp_send_active_reset(sk, GFP_ATOMIC);
        goto death;
    }
    // keepalive is disabled or the socket is already closed
    if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
        goto out;

    elapsed = keepalive_time_when(tp);
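    // packets are in flight or unsent data is queued:
    // the connection is active and needs no keepalive probing
    /* It is alive without keepalive 8) */
    if (tp->packets_out || tcp_send_head(sk))
        goto resched;
    // time elapsed since the last packet was received
    elapsed = keepalive_time_elapsed(tp);
    // has the idle time reached the keepalive threshold?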
    if (elapsed >= keepalive_time_when(tp)) {
        /* If the TCP_USER_TIMEOUT option is enabled, use that
         * to determine when to timeout instead.
         */
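         // the application set a limit with the TCP_USER_TIMEOUT socket option
        if ((icsk->icsk_user_timeout != 0 &&
           // idle time exceeds the application's limit
            elapsed >= icsk->icsk_user_timeout &&
             // at least one probe has been sent
            icsk->icsk_probes_out > 0) ||
            // the application set no user timeout
            (icsk->icsk_user_timeout == 0 &&
            // the probe count has reached the configured limit
            icsk->icsk_probes_out >= keepalive_probes(tp))) {
            // send a RST to reset the connection
            tcp_send_active_reset(sk, GFP_ATOMIC);
            // report the error and close the local end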
            tcp_write_err(sk);
            goto out;
        }
        // send a probe
        if (tcp_write_wakeup(sk) <= 0) {
            // this counter is cleared when an ACK arrives
            icsk->icsk_probes_out++;
            // the next timeout is the probe interval
            elapsed = keepalive_intvl_when(tp);
        } else {  // the probe was not sent because the local queue is congested
            /* If keepalive was lost due to local congestion,
             * try harder.
             */
            // retry sooner, after 0.5s
            elapsed = TCP_RESOURCE_PROBE_INTERVAL;
        }
    } else {
        // fire again one full keepalive period after the last received packet
        /* It is tp->rcv_tstamp + keepalive_time_when(tp) */
        elapsed = keepalive_time_when(tp) - elapsed;
    }
    // reclaim memory
    sk_mem_reclaim(sk);

resched:
    // re-arm the keepalive timer
    inet_csk_reset_keepalive_timer (sk, elapsed);
    goto out;

death:
    tcp_done(sk);

out:
    bh_unlock_sock(sk);
    sock_put(sk);
}

keepalive_probes returns the maximum number of probes:

static inline int keepalive_probes(const struct tcp_sock *tp)
{
    return tp->keepalive_probes ? : sysctl_tcp_keepalive_probes;
}

Here tp->keepalive_probes is set by the TCP_KEEPCNT socket option, and sysctl_tcp_keepalive_probes (default TCP_KEEPALIVE_PROBES, i.e. 9) by the net.ipv4.tcp_keepalive_probes kernel parameter.

keepalive_intvl_when returns the probe interval:

static inline int keepalive_intvl_when(const struct tcp_sock *tp)
{
    return tp->keepalive_intvl ? : sysctl_tcp_keepalive_intvl;
}

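Here tp->keepalive_intvl is set by the TCP_KEEPINTVL socket option, and sysctl_tcp_keepalive_intvl (default TCP_KEEPALIVE_INTVL, i.e. 75s) by the net.ipv4.tcp_keepalive_intvl kernel parameter.

To summarize the keepalive timer: it is armed as soon as the connection is established (provided SOCK_KEEPOPEN is set), and the application can enable or disable it with socket options. Each time it fires, if the time since the last received packet has reached the keepalive threshold, TCP either sends a RST to the peer and closes the connection (when the application's user timeout or the probe limit has been exceeded), or sends a probe, increments the probe counter, and re-arms the timer with keepalive_intvl. When an ACK arrives, the probe counter is cleared and the receive timestamp is refreshed, so the whole cycle starts over.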
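
The summary above can be modeled as a small state machine. This is a simplified sketch only: it ignores TCP_USER_TIMEOUT, socket locking and local congestion, works in whole seconds, and every name in it (ka_conn, ka_cfg, ka_timer_fired) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum ka_action { KA_WAIT, KA_PROBE, KA_RESET };

struct ka_conn {
    unsigned int idle;        /* seconds since the last packet was received */
    unsigned int probes_out;  /* unanswered probes sent so far */
};

struct ka_cfg {
    unsigned int time;    /* idle threshold, like tcp_keepalive_time */
    unsigned int intvl;   /* probe interval, like tcp_keepalive_intvl */
    unsigned int probes;  /* probe limit, like tcp_keepalive_probes */
};

/* One timer expiry. Returns the action taken and stores the next timeout
 * (seconds) in *next; *next is 0 when the connection is reset. */
enum ka_action ka_timer_fired(struct ka_conn *c, const struct ka_cfg *cfg,
                              unsigned int *next)
{
    assert(c != NULL && cfg != NULL && next != NULL);
    if (c->idle < cfg->time) {            /* not idle long enough yet */
        *next = cfg->time - c->idle;
        return KA_WAIT;
    }
    if (c->probes_out >= cfg->probes) {   /* probe limit reached: give up */
        *next = 0;
        return KA_RESET;
    }
    c->probes_out++;                      /* send one more probe */
    *next = cfg->intvl;
    return KA_PROBE;
}
```

Driving this model with the defaults (7200, 75, 9) on a silent peer yields nine probes and a reset 7875 seconds after the last packet, matching the figure given earlier. An incoming ACK would correspond to setting idle and probes_out back to 0.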

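8 SYN-ACK timer

After receiving the client's SYN, the server sends a SYN-ACK and waits for the ACK that completes the three-way handshake; the SYN-ACK timer runs while it waits. Because the SYN|ACK is not placed on the send queue, retransmitting it means building a new SYN|ACK segment. The timeout is init_rto, which is currently 1s following Google's paper.

RFC 6298 (2011) lowered the initial RTO from 3s to 1s to reflect current network conditions. The main reasons are:

  • The initial RTO of 3s was defined in RFC 1122 (published in 1989); today's networks are far faster.

  • 97.5% of connections have an RTT below 1 second.

  • The retransmission rate during the three-way handshake is low, only about 2%.

  • About 2.5% of connections have an RTT above 1 second, so they retransmit during the handshake; afterwards their RTO is reset to 3s (the conservative fallback).

  • RFC 5681 is unaffected: the congestion window is still 1 after a retransmission during the handshake, so the network carries just one extra SYN segment. The cost of the retransmission is kept to a minimum.

  • If the timestamp option is supported, the RTO does not need to be reset to 3s. In other words, lowering the initial RTO to 1 second does not affect connections that use timestamps.

  • A smaller initial RTO speeds up handshakes that hit a loss (the case where congestion has to be detected), improving performance by 10% to 50%.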

8.1 When the SYN-ACK timer is set

The call path is:

tcp_v4_do_rcv
    |--> tcp_rcv_state_process
               |--> tcp_v4_conn_request
                          |--> tcp_v4_send_synack
                          |--> inet_csk_reqsk_queue_hash_add
                                     |--> inet_csk_reqsk_queue_added

TCP arms the SYN-ACK timer after sending the SYN|ACK:

//net/ipv4/tcp_ipv4.c
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
    /* Never answer to SYNs send to broadcast or multicast */
    if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
        goto drop;
    // handle the TCP connection request
    return tcp_conn_request(&tcp_request_sock_ops,
                &tcp_request_sock_ipv4_ops, sk, skb);
drop:
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
    return 0;
}

//net/ipv4/tcp_input.c
int tcp_conn_request(struct request_sock_ops *rsk_ops,
             const struct tcp_request_sock_ops *af_ops,
             struct sock *sk, struct sk_buff *skb)
{
...
    tcp_rsk(req)->snt_isn = isn;  // initial sequence number
    tcp_openreq_init_rwin(req, sk, dst);  // initial receive window
    fastopen = !want_cookie &&
           tcp_try_fastopen(sk, skb, req, &foc, dst); // try TCP Fast Open
    err = af_ops->send_synack(sk, dst, &fl, req,
                  skb_get_queue_mapping(skb), &foc);  // send the SYN|ACK
    if (!fastopen) {
        if (err || want_cookie)
            goto drop_and_free;

        tcp_rsk(req)->listener = NULL;
        af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT); // add the request sock to the SYN table and arm the SYN-ACK timer
    }

    return 0;

drop_and_release:
    dst_release(dst);
drop_and_free:
    reqsk_free(req);
drop:
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
    return 0;
}

queue_hash_add points to inet_csk_reqsk_queue_hash_add:

void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
                   unsigned long timeout)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct listen_sock *lopt = icsk->icsk_accept_queue.listen_opt;
    const u32 h = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
                     inet_rsk(req)->ir_rmt_port,
                     lopt->hash_rnd, lopt->nr_table_entries);
    // put the request_sock into syn_table and record its expiry time
    reqsk_queue_hash_req(&icsk->icsk_accept_queue, h, req, timeout);
    // arm the SYN-ACK timer (ends up in inet_csk_reset_keepalive_timer)
    inet_csk_reqsk_queue_added(sk, timeout);
}

reqsk_queue_hash_req records the expiry time of the request_sock:

static inline void reqsk_queue_hash_req(struct request_sock_queue *queue,
                    u32 hash, struct request_sock *req,
                    unsigned long timeout)
{
    struct listen_sock *lopt = queue->listen_opt;
    // expiry time
    req->expires = jiffies + timeout;
    req->num_retrans = 0;
    req->num_timeout = 0;
    req->sk = NULL;
    req->dl_next = lopt->syn_table[hash];

    write_lock(&queue->syn_wait_lock);
    lopt->syn_table[hash] = req;  // insert into the hash table
    write_unlock(&queue->syn_wait_lock);
}
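
Expiry times such as req->expires are tick counts that eventually wrap around, which is why the kernel compares them with helpers like time_after_eq instead of a plain >=. The sketch below shows the idea on a 32-bit counter; tick_after_eq and expiry_from are hypothetical names, and the signed cast relies on two's complement conversion, as the kernel macros do.

```c
#include <assert.h>
#include <stdint.h>

/* True if tick value a is at or after b, even across counter wrap-around.
 * Valid as long as the two values are less than half the range apart. */
int tick_after_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Expiry tick for a timeout starting now. */
uint32_t expiry_from(uint32_t now, uint32_t timeout)
{
    assert(timeout < UINT32_C(0x80000000));  /* must fit in half the range */
    return now + timeout;                    /* unsigned wrap is well defined */
}
```

With now = 0xFFFFFFF0 and a timeout of 0x20 ticks, the expiry wraps to 0x10; a plain comparison would report it as already expired, while tick_after_eq gives the right answer.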

inet_csk_reqsk_queue_added accounts for the new request in the half-open (SYN) queue and arms the timer if the queue was empty before:

static inline void inet_csk_reqsk_queue_added(struct sock *sk,
                          const unsigned long timeout)
{
    // syn_table was empty before this request sock was added
    if (reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue) == 0)
    // arm the SYN-ACK timer
        inet_csk_reset_keepalive_timer(sk, timeout);
}

The function that actually arms the timer:

void inet_csk_reset_keepalive_timer(struct sock *sk, unsigned long len)
{
    sk_reset_timer(sk, &sk->sk_timer, jiffies + len);
}

The SYN-ACK timer's initial timeout is TCP_TIMEOUT_INIT (1 second).

8.2 What happens when the SYN-ACK timer expires?

The SYN-ACK timer lives in sk->sk_timer, and its timeout handler is tcp_keepalive_timer:

static void tcp_keepalive_timer (unsigned long data)
{
...
    if (sk->sk_state == TCP_LISTEN) {
        tcp_synack_timer(sk); // handle the expiry
        goto out;
    }
...
}

tcp_synack_timer:

static void tcp_synack_timer(struct sock *sk)
{
    inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
                   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
}

inet_csk_reqsk_queue_prune:

void inet_csk_reqsk_queue_prune(struct sock *parent,
                const unsigned long interval,
                const unsigned long timeout,
                const unsigned long max_rto)
{
    struct inet_connection_sock *icsk = inet_csk(parent);
    struct request_sock_queue *queue = &icsk->icsk_accept_queue;
    struct listen_sock *lopt = queue->listen_opt; /* half-open (SYN) queue */
    /* without the TCP_SYNCNT option, at most 5 SYNACK retransmissions are allowed by default */
    int max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
    int thresh = max_retries;
    unsigned long now = jiffies;
    struct request_sock **reqp, *req;
    int i, budget;
    /* the half-open queue must exist and hold at least one request */
    if (lopt == NULL || lopt->qlen == 0)
        return;

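    /* If the half-open queue is more than half full, lower the maximum
     * number of SYNACK retransmissions.
     */
    if (lopt->qlen>>(lopt->max_qlen_log-1)) {
        int young = (lopt->qlen_young<<1);
        /* The smaller the share of requests that have never been
         * retransmitted ("young" requests), the fewer retransmissions
         * are allowed. Otherwise old requests would live for a long time
         * and leave no room for new ones.
         * With the default thresh of 5, when the share of young requests is:
         * at most 1/2, thresh = 4
         * at most 1/4, thresh = 3
         * at most 1/8, thresh = 2
         */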
        while (thresh > 2) {
            if (lopt->qlen < young)
                break;
            thresh--;
            young <<= 1;
        }
    }
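    /* with TCP_DEFER_ACCEPT set, use its value as the maximum number of SYNACK retransmissions */
    if (queue->rskq_defer_accept)
        max_retries = queue->rskq_defer_accept;
    /* The initial timeout of a request is 1s; after its first expiry the
     * SYNACK timer fires every 200ms.
     * Q: Each request still times out after 1s, so why fire more often?
     * A: It improves precision (the error drops from 1s to 200ms) and
     *    prunes stale requests sooner.
     * By default the whole table is scanned twice per second.
     */
    budget = 2 * (lopt->nr_table_entries / (timeout / interval));
    /* resume from the bucket after the last one scanned instead of starting over */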
    i = lopt->clock_hand;

    do { // walk the SYN table
        reqp=&lopt->syn_table[i]; /* the i-th bucket of the table */
        while ((req = *reqp) != NULL) { /* walk the chain */
            if (time_after_eq(now, req->expires)) { /* this request's SYNACK timed out */
                int expire = 0, resend = 0; /* expire: drop this request; resend: retransmit the SYNACK */
                // decide whether the request has expired and whether to resend the SYN|ACK
                syn_ack_recalc(req, thresh, max_retries,
                           queue->rskq_defer_accept,
                           &expire, &resend);
                /* bump the timeout statistics */
                req->rsk_ops->syn_ack_timeout(parent, req);
                /* Note the order: expire is checked first, then resend.
                 * If the condition holds, the request is not too old yet
                 * and is kept.
                 */
                if (!expire &&
                    (!resend ||
                     !inet_rtx_syn_ack(parent, req) ||
                     inet_rsk(req)->acked)) {
                    unsigned long timeo;
                    /* first timeout: the request is no longer young */
                    if (req->num_timeout++ == 0)
                        lopt->qlen_young--;
                    timeo = min(timeout << req->num_timeout,
                            max_rto); /* exponential back-off */
                    // update the request_sock expiry time
                    reqp = &req->dl_next;
                    continue;
                }

                /* Drop this request */
                inet_csk_reqsk_queue_unlink(parent, req, reqp); /* unlink the request from the half-open queue */
                reqsk_queue_removed(queue, req); /* update the queue length counters */
                reqsk_free(req); /* free the request */
                continue;
            }
            reqp = &req->dl_next;
        }

        i = (i + 1) & (lopt->nr_table_entries - 1);

    } while (--budget > 0);
    /* buckets up to i - 1 have been scanned; resume from bucket i next time */
    lopt->clock_hand = i;
    // requests remain in syn_table
    if (lopt->qlen)
        inet_csk_reset_keepalive_timer(parent, interval); /* re-arm the SYNACK timer to fire in 200ms */
}
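  • When syn_table is running out of space, the maximum retry count is reduced so that old request_socks die sooner and more new ones can be accepted.
  • A request_sock that has expired is removed from syn_table and freed, i.e. its connection attempt is dropped.
  • A request_sock is kept, and only its expiry time is updated, when both of the following hold:

    1. The request_sock has not expired (expire is 0).
    2. One of these three is true: the SYN|ACK does not need to be retransmitted; the SYN|ACK was retransmitted successfully; or the application used the TCP_DEFER_ACCEPT socket option so that the listening socket wakes the process only when data arrives, and an ACK without data has been received.
  • The TCP_DEFER_ACCEPT option

It applies to the three-way handshake. With this option, a pure ACK from the client is simply dropped.
No new sock is therefore created and initialized right away, the request is not moved from the half-open queue to the accept queue,
and the listening process is not woken up to accept. As the name says, TCP_DEFER_ACCEPT defers the accept of the connection.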
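
Two pieces of arithmetic in inet_csk_reqsk_queue_prune are worth trying with concrete numbers: the retry limit under queue pressure and the exponential back-off. The sketch below reimplements both as plain functions; synack_thresh and synack_backoff are hypothetical names, and time is counted in whole seconds rather than jiffies.

```c
#include <assert.h>

/* Retry limit after accounting for queue pressure (same loop as above). */
int synack_thresh(int max_retries, unsigned int qlen, unsigned int qlen_young,
                  unsigned int max_qlen_log)
{
    int thresh = max_retries;

    assert(max_qlen_log >= 1 && max_qlen_log < 31);
    if (qlen >> (max_qlen_log - 1)) {      /* queue at least half full */
        unsigned int young = qlen_young << 1;

        while (thresh > 2) {
            if (qlen < young)
                break;
            thresh--;
            young <<= 1;
        }
    }
    return thresh;
}

/* Timeout that follows the n-th expiry: timeout << n, capped at max_rto. */
unsigned long synack_backoff(unsigned long timeout, unsigned int num_timeout,
                             unsigned long max_rto)
{
    unsigned long t;

    assert(num_timeout < 16);              /* keep the shift well in range */
    t = timeout << num_timeout;
    return t < max_rto ? t : max_rto;
}
```

With a 1s initial timeout and a limit of 5, the waits are 1, 2, 4, 8, 16 and 32 seconds, so an unanswered request is dropped about 63 seconds after the first SYN|ACK. For a queue of 256 slots holding 200 requests, 60 young requests lower the limit to 4 and 10 young requests lower it to 2.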

syn_ack_recalc decides whether the request_sock has expired and whether the SYN|ACK must be retransmitted:

static inline void syn_ack_recalc(struct request_sock *req, const int thresh,
                  const int max_retries,
                  const u8 rskq_defer_accept,
                  int *expire, int *resend)
{
    /* TCP_DEFER_ACCEPT is not in use */
    if (!rskq_defer_accept) {
        *expire = req->num_timeout >= thresh;  /* give up once the dynamically adjusted retry limit is reached */
        *resend = 1;  /* always 1, but the SYNACK is only resent when expire is 0 */
        return;
    }
    /* With TCP_DEFER_ACCEPT the request is dropped less readily: besides
     * reaching thresh, one of the following must also hold:
     * 1. No pure ACK has been received.
     * 2. The configured maximum deferral has been exceeded.
     */
    *expire = req->num_timeout >= thresh &&  // timeout count reached thresh
          (!inet_rsk(req)->acked || req->num_timeout >= max_retries); // no ACK yet, or the hard limit has been reached
    /*
     * Do not resend while waiting for data after ACK,
     * start to resend on end of deferring period to give
     * last chance for data or ACK to create established socket.
     */
    /* The SYNACK is resent in two cases:
     * 1. No pure ACK has been received.
     * 2. A pure ACK has been received and this is the last chance to resend.
     */
    *resend = !inet_rsk(req)->acked || // no ACK yet
          req->num_timeout >= rskq_defer_accept - 1;
    // the deferral period is about to end: resend the SYN|ACK to give the peer a last chance to establish the connection
}

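To sum up, when the SYN|ACK timer fires, a SYN|ACK is retransmitted only if all of the following hold:

  1. The request_sock's timer has expired.

  2. The request_sock's timeout count has not yet reached the limit (expire is 0); once the limit is reached the request is dropped instead.

  3. One of the following holds: the application did not use the TCP_DEFER_ACCEPT socket option to defer accepting the request_sock; or TCP_DEFER_ACCEPT is in use and either no ACK has arrived yet, or the timeout count has reached, or is about to reach, the application's limit.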

The SYN|ACK is retransmitted by inet_rtx_syn_ack:

int inet_rtx_syn_ack(struct sock *parent, struct request_sock *req)
{
    int err = req->rsk_ops->rtx_syn_ack(parent, req); /* calls tcp_v4_rtx_synack() to retransmit the SYNACK */
    // which ends up in tcp_v4_send_synack
    if (!err)
        req->num_retrans++; /* count the SYNACK retransmission */
    return err;
}

static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
                  struct flowi *fl,
                  struct request_sock *req,
                  u16 queue_mapping,
                  struct tcp_fastopen_cookie *foc)
{
    const struct inet_request_sock *ireq = inet_rsk(req);
    struct flowi4 fl4;
    int err = -1;
    struct sk_buff * skb;

    /* First, grab a route. */
    if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
        return -1;

    skb = tcp_make_synack(sk, dst, req, foc);  // build the SYN-ACK
    // send the SYN-ACK
    if (skb) {
        __tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);

        skb_set_queue_mapping(skb, queue_mapping);
        err = ip_build_and_send_pkt(skb, sk, ireq->ir_loc_addr,
                        ireq->ir_rmt_addr,
                        ireq->opt);
        err = net_xmit_eval(err);
    }

    return err;
}

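The first step is always the hardest, and the three-way handshake shows it: init rto is a full 1s. Still, the value has dropped from 3s to 1s, which says that networks keep improving and timeouts keep shrinking. During the handshake the network may be flooded with bogus SYN packets, the so-called SYN flood attack. TCP keeps two connection queues: the half-open (SYN) queue, tuned with tcp_max_syn_backlog, and the accept queue, tuned with net.core.somaxconn. Enlarging them helps absorb sudden load spikes and makes the listener more resilient to SYN floods.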

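9 ER (Early Retransmit) timer

The ER timer was developed mainly by Google in 2013. Fast retransmit, as we know, resends the missing segment immediately after three duplicate ACKs. If the sender cannot receive enough duplicate ACKs during a transfer, fast retransmit never triggers and the sender can only wait for the RTO. The ER algorithm targets this problem.

The sender cannot receive enough duplicate ACKs when:

  • the congestion window is small;
  • a large number of segments in the window are lost, or the loss occurs at the tail of the transfer.

Take this scenario: with a congestion window of 3 and the first segment lost, the sender can receive at most 2 duplicate ACKs. Fast retransmit needs 3, so the first segment is retransmitted only after the RTO expires.

Of the two cases above, ER addresses the first (a small congestion window). The second needs the TLP (Tail Loss Probe) algorithm.

ER comes in two variants, byte-based and segment-based; Linux implements the segment-based one. When the window is small (fewer than 4 segments in flight), ER lowers the duplicate-ACK threshold that triggers fast retransmit, for example to 1 or 2. To avoid spurious retransmissions, the Linux implementation adds a short delay before retransmitting, and that delay is implemented by the ER timer.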
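
The lowered threshold can be written down directly. The sketch below follows the segment-based rule as described above (threshold = outstanding segments minus one when fewer than 4 are outstanding and nothing new can be sent); er_dupack_threshold is a hypothetical name, and the lower bound of 2 outstanding segments is an assumption made here so that the threshold never drops to 0.

```c
#include <assert.h>

#define DUPTHRESH 3   /* standard fast retransmit threshold */

/* Duplicate-ACK threshold to use, given the number of outstanding segments
 * and whether unsent data is available to elicit more ACKs. */
int er_dupack_threshold(int outstanding, int has_unsent_data)
{
    assert(outstanding >= 0);
    if (outstanding >= 2 && outstanding < 4 && !has_unsent_data)
        return outstanding - 1;   /* small window: lower the threshold */
    return DUPTHRESH;             /* otherwise keep the classic rule */
}
```

For the scenario in the text (3 segments outstanding, first one lost, nothing more to send) the threshold becomes 2, which is exactly the number of duplicate ACKs the sender can still receive.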

9.1 Where is the ER timer armed?

When an ACK arrives, TCP calls tcp_fastretrans_alert to decide whether fast retransmit is needed:

static void tcp_fastretrans_alert(struct sock *sk, const int acked,
                  const int prior_unsacked,
                  bool is_dupack, int flag)
{
...
        if (!tcp_time_to_recover(sk, flag)) {
            tcp_try_to_open(sk, flag, prior_unsacked);
            return;
        }
...

tcp_time_to_recover decides when to enter the Recovery state:

static bool tcp_time_to_recover(struct sock *sk, int flag)
{
    struct tcp_sock *tp = tcp_sk(sk);
    __u32 packets_out;

    /* Trick#1: The loss is proven. */
    if (tp->lost_out)  // packets have already been marked lost
        return true;

    /* Not-A-Trick#2 : Classic rule... */
    if (tcp_dupack_heuristics(tp) > tp->reordering)
        return true;  // more duplicate ACKs than the reordering threshold

    /* Trick#4: It is still not OK... But will it be useful to delay
     * recovery more?
     */
    packets_out = tp->packets_out;
    if (packets_out <= tp->reordering &&  // few packets in flight, enough of them SACKed,
        tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
        !tcp_may_send_now(sk)) { // and nothing can be sent now (Nagle, cwnd and window checks)
        /* We have nothing to send. This connection is limited
         * either by receiver window or by application.
         */
        return true;
    }

    /* If a thin stream is detected, retransmit after first
     * received dupack. Employ only if SACK is supported in order
     * to avoid possible corner-case series of spurious retransmissions
     * Use only if there are no unsent data.
     */
    if ((tp->thin_dupack || sysctl_tcp_thin_dupack) && // thin_dupack is on, the stream is "thin", more than one dupack, SACK is on, no unsent data
        tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
        tcp_is_sack(tp) && !tcp_send_head(sk))
        return true;

    /* Trick#6: TCP early retransmit, per RFC5827.  To avoid spurious
     * retransmissions due to small network reorderings, we implement
     * Mitigation A.3 in the RFC and delay the retransmission for a short
     * interval if appropriate.
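     */
    /* do_early_retrans is on, no retransmitted segment is still unacked,
     * some segments are SACKed, packets in flight exceed SACKed ones by at
     * least 1, fewer than 4 packets are in flight, and nothing can be sent now
     */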
    if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
        (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
        !tcp_may_send_now(sk))
        return !tcp_pause_early_retransmit(sk, flag);  // try to arm the ER timer

    return false;
}

tcp_pause_early_retransmit is the only function that arms the ER timer:

static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
{
    struct tcp_sock *tp = tcp_sk(sk);
    unsigned long delay;

    /* Delay early retransmit and entering fast recovery for
     * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
     * available, or RTO is scheduled to fire first.
     */
    /* proceed only if sysctl_tcp_early_retrans is 2 or 3, the ACK carries
     * no ECE mark, and tp->srtt_us (smoothed round trip time) is non-zero
     */
    if (sysctl_tcp_early_retrans < 2 || sysctl_tcp_early_retrans > 3 ||
        (flag & FLAG_ECE) || !tp->srtt_us)
        return false;

    delay = max(usecs_to_jiffies(tp->srtt_us >> 5),
            msecs_to_jiffies(2));
    /* bail out if the retransmission timer would fire first; proceed only
     * if icsk_timeout lies after now + delay
     */
    if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
        return false;

    inet_csk_reset_xmit_timer(sk, ICSK_TIME_EARLY_RETRANS, delay,
                  TCP_RTO_MAX);
    return true;
}

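As the code shows, Linux puts many hurdles in front of the ER timer. Counting them:

  1. No packet has been marked lost yet (tp->lost_out is 0)
  2. The number of duplicate ACKs does not exceed the reordering threshold
  3. FACK is off, or no packet is unacknowledged, or the head of the queue has been sent but has not timed out (this check does not appear in the listing above)
  4. Packets sent > reordering threshold, or the number of SACKed segments is below the threshold, or sending an skb is allowed
  5. thin_dupack is off, or the connection is not "thin", or there is at most one duplicate ACK, or SACK is off, or there is data waiting to be sent
  6. do_early_retrans is on
  7. No segment has been retransmitted and is still unacknowledged
  8. Some segments have been SACKed
  9. Packets in flight exceed SACKed packets by at least 1
  10. Fewer than 4 packets are in flight
  11. Sending data is not allowed right now
  12. sysctl_tcp_early_retrans is 2 or 3
  13. The ACK carries no ECE mark
  14. tp->srtt_us (smoothed round trip time) is greater than 0
  15. The icsk->icsk_retransmit_timer expiry lies after the ER delay

Once all these conditions hold, the ER timer is armed with a timeout shorter than the retransmission timer's. The delay is max(RTT/4, 2ms), which comes to 25ms for an RTT of 100ms.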
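
The delay computation in tcp_pause_early_retransmit can be checked with a few numbers. The sketch below works in microseconds and skips the conversion to jiffies; er_delay_us is a hypothetical name. It relies on the fact that tp->srtt_us holds the smoothed RTT multiplied by 8, so shifting right by 5 yields RTT/4.

```c
#include <assert.h>
#include <stdint.h>

/* ER delay in microseconds: max(RTT/4, 2ms).
 * srtt_us_x8 is the smoothed RTT in microseconds times 8, as in tp->srtt_us. */
uint32_t er_delay_us(uint32_t srtt_us_x8)
{
    uint32_t quarter_rtt;

    assert(srtt_us_x8 != 0);          /* the kernel bails out without an RTT sample */
    quarter_rtt = srtt_us_x8 >> 5;    /* (srtt / 8) / 4 */
    return quarter_rtt > 2000 ? quarter_rtt : 2000;
}
```

An RTT of 100ms gives a delay of 25ms, while an RTT of 4ms would give 1ms and is raised to the 2ms floor.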

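The ER timer is cleared whenever the loss probe timer, the retransmission timer, or the persist timer is armed.

9.2 What happens when the ER timer expires?

The ER timeout handler is tcp_resume_early_retransmit: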

void tcp_resume_early_retransmit(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);

    tcp_rearm_rto(sk);  // re-arm the retransmission timer

    /* Stop if ER is disabled after the delayed ER timer is scheduled */
    if (!tp->do_early_retrans) // ER has been disabled
        return;

    tcp_enter_recovery(sk, false);  // enter the Recovery state
    tcp_update_scoreboard(sk, 1); // update the scoreboard, marking an skb lost
    tcp_xmit_retransmit_queue(sk); // retransmit
}

tcp_xmit_retransmit_queue performs the retransmission:

void tcp_xmit_retransmit_queue(struct sock *sk)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    struct sk_buff *hole = NULL;
    u32 last_lost;
    int mib_idx;
    int fwd_rexmitting = 0;

    if (!tp->packets_out) // nothing in flight
        return;

    if (!tp->lost_out) // nothing marked lost: reset retransmit_high to snd_una
        tp->retransmit_high = tp->snd_una;

    if (tp->retransmit_skb_hint) {
        skb = tp->retransmit_skb_hint;
        last_lost = TCP_SKB_CB(skb)->end_seq;
        if (after(last_lost, tp->retransmit_high))
            last_lost = tp->retransmit_high;
    } else {
        skb = tcp_write_queue_head(sk);
        last_lost = tp->snd_una;
    }

    tcp_for_write_queue_from(skb, sk) { // walk the write queue starting at skb
        __u8 sacked = TCP_SKB_CB(skb)->sacked;

        if (skb == tcp_send_head(sk))
            break;
        /* we could do better than to assign each time */
        if (hole == NULL)
            tp->retransmit_skb_hint = skb;

        /* Assume this retransmit will generate
         * only one packet for congestion window
         * calculation purposes.  This works because
         * tcp_retransmit_skb() will chop up the
         * packet to be MSS sized and all the
         * packet counting works out.
         */
        if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
            return;

        if (fwd_rexmitting) {
begin_fwd:
            if (!before(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp)))
                break;
            mib_idx = LINUX_MIB_TCPFORWARDRETRANS;

        } else if (!before(TCP_SKB_CB(skb)->seq, tp->retransmit_high)) {
            tp->retransmit_high = last_lost;
            if (!tcp_can_forward_retransmit(sk))
                break;
            /* Backtrack if necessary to non-L'ed skb */
            if (hole != NULL) {
                skb = hole;
                hole = NULL;
            }
            fwd_rexmitting = 1;
            goto begin_fwd;

        } else if (!(sacked & TCPCB_LOST)) { // skb is not marked lost
            if (hole == NULL && !(sacked & (TCPCB_SACKED_RETRANS|TCPCB_SACKED_ACKED)))
                hole = skb;
            continue;

        } else {
            last_lost = TCP_SKB_CB(skb)->end_seq;
            if (icsk->icsk_ca_state != TCP_CA_Loss)
                mib_idx = LINUX_MIB_TCPFASTRETRANS;
            else
                mib_idx = LINUX_MIB_TCPSLOWSTARTRETRANS;
        }

        if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
            continue;

        if (tcp_retransmit_skb(sk, skb)) // retransmit the skb
            return;

        NET_INC_STATS_BH(sock_net(sk), mib_idx);

        if (tcp_in_cwnd_reduction(sk))
            tp->prr_out += tcp_skb_pcount(skb);

        if (skb == tcp_write_queue_head(sk))
            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                          inet_csk(sk)->icsk_rto,
                          TCP_RTO_MAX);
    }
}

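Note that fast retransmit does not resend skbs that have already been SACKed or retransmitted: they may still arrive, and skipping them reduces network congestion.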

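10 Tail Loss Probe (TLP) timer

When the congestion window is small and the last segment of the data is lost, fast retransmit cannot resend the lost segment promptly because not enough ACKs come back. The Tail Loss Probe (TLP) timer was designed to solve this.

10.1 When is the TLP timer armed?

The TLP timer is armed in tcp_schedule_loss_probe. The function listed below, tcp_process_tlp_ack, handles the ACKs that end a TLP episode: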

static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
{
    struct tcp_sock *tp = tcp_sk(sk);
    bool is_tlp_dupack = (ack == tp->tlp_high_seq) &&
                 !(flag & (FLAG_SND_UNA_ADVANCED |
                       FLAG_NOT_DUP | FLAG_DATA_SACKED));

    /* Mark the end of TLP episode on receiving TLP dupack or when
     * ack is after tlp_high_seq.
     */
    if (is_tlp_dupack) {
        tp->tlp_high_seq = 0;
        return;
    }

    if (after(ack, tp->tlp_high_seq)) {
        tp->tlp_high_seq = 0;
        /* Don't reduce cwnd if DSACK arrives for TLP retrans. */
        if (!(flag & FLAG_DSACKING_ACK)) {
            tcp_init_cwnd_reduction(sk);
            tcp_set_ca_state(sk, TCP_CA_CWR);
            tcp_end_cwnd_reduction(sk);
            tcp_try_keep_open(sk);
            NET_INC_STATS_BH(sock_net(sk),
                     LINUX_MIB_TCPLOSSPROBERECOVERY);
        }
    }
}

TCP calls tcp_schedule_loss_probe when it receives an ACK:

//net/ipv4/tcp_input.c
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
    if (icsk->icsk_pending == ICSK_TIME_RETRANS)
        tcp_schedule_loss_probe(sk);
}

tcp_schedule_loss_probe is also called when the last data is sent, to try to arm TLP:

//net/ipv4/tcp_output.c
static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
               int push_one, gfp_t gfp)
{
...
    if (likely(sent_pkts)) {
        if (tcp_in_cwnd_reduction(sk))
            tp->prr_out += sent_pkts;

        /* Send one loss probe per tail loss episode. */
        if (push_one != 2)
            tcp_schedule_loss_probe(sk); // this send did not come from the TLP timeout handler
        tcp_cwnd_validate(sk, is_cwnd_limited);
        return false;
    }
...
}

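The conditions for arming TLP are:

  1. The retransmission timer is armed; it may then be replaced by TLP. If it is not armed, no data needs retransmitting and TLP is unnecessary.

  2. The ER timer is not armed. ER retransmits lost data in the middle of the window, and the tail can only be probed once the middle has been repaired.

  3. The TLP timer is not already armed; re-arming it would just push its expiry further out.

  4. The three-way handshake is complete. In TFO mode the server arms the retransmission timer after sending the SYN|ACK so that it can resend it, but no loss probing may happen before the handshake finishes.

  5. The net.ipv4.tcp_early_retrans kernel parameter is >= 3 (TLP is skipped when it is < 3)

  6. An RTT estimate is available (tp->srtt >= 8)

  7. Packets are in flight

  8. SACK is enabled

  9. The congestion state is TCP_CA_Open

  10. The number of packets in flight >= the congestion window, or no data is waiting to be sent

TCP tears TLP down whenever it arms the retransmission timer, the ER timer, or the persist timer.

The TLP timeout is computed dynamically from the RTT.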
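
How the timeout depends on the RTT is not shown in this walkthrough. The sketch below follows the rule described in the Tail Loss Probe proposal (about two RTTs, more when a single packet is in flight so that a delayed ACK can arrive, with a 10ms floor, and never later than the pending RTO). It is an assumption-laden illustration, not the kernel function: tlp_timeout_ms, DELACK_MAX_MS and PTO_MIN_MS are hypothetical names, and time is in milliseconds.

```c
#include <assert.h>
#include <stdint.h>

#define DELACK_MAX_MS 200u  /* assumed worst-case delayed ACK time */
#define PTO_MIN_MS     10u  /* assumed lower bound for the probe timeout */

/* Probe timeout in milliseconds for a given smoothed RTT. */
uint32_t tlp_timeout_ms(uint32_t rtt_ms, uint32_t packets_out,
                        uint32_t rto_left_ms)
{
    uint32_t pto = rtt_ms * 2;

    assert(packets_out > 0);              /* TLP needs packets in flight */
    if (packets_out == 1) {
        /* a lone packet may be held back by the receiver's delayed ACK */
        uint32_t alt = rtt_ms + rtt_ms / 2 + DELACK_MAX_MS;

        if (alt > pto)
            pto = alt;
    }
    if (pto < PTO_MIN_MS)
        pto = PTO_MIN_MS;
    if (pto > rto_left_ms)
        pto = rto_left_ms;                /* never fire after the pending RTO */
    return pto;
}
```

With an RTT of 100ms the probe fires after 200ms when several packets are in flight and after 350ms when only one is; for very small RTTs the 10ms floor applies.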

10.2 What happens when the TLP timer expires?

The TLP timeout handler is tcp_send_loss_probe:

/* When probe timeout (PTO) fires, send a new segment if one exists, else
 * retransmit the last segment.
 */
void tcp_send_loss_probe(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    int pcount;
    int mss = tcp_current_mss(sk);
    int err = -1;

    if (tcp_send_head(sk) != NULL) { // new data is waiting: send it first
        err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
        goto rearm_timer;  // re-arm the timer
    }

    /* At most one outstanding TLP retransmission. */
    if (tp->tlp_high_seq) // a TLP retransmission is already outstanding
        goto rearm_timer;
    // take the last skb of the write queue, i.e. the last one sent
    /* Retransmit last segment. */
    skb = tcp_write_queue_tail(sk);
    if (WARN_ON(!skb))
        goto rearm_timer;

    if (skb_still_in_host_queue(sk, skb))
        goto rearm_timer;

    pcount = tcp_skb_pcount(skb);
    if (WARN_ON(!pcount))
        goto rearm_timer;

    if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
        if (unlikely(tcp_fragment(sk, skb, (pcount - 1) * mss, mss)))
            goto rearm_timer;
        skb = tcp_write_queue_tail(sk);
    }

    if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
        goto rearm_timer;
    // retransmit the last segment
    err = __tcp_retransmit_skb(sk, skb);

    /* Record snd_nxt for loss detection. */
    if (likely(!err))
        tp->tlp_high_seq = tp->snd_nxt;
    // arm the retransmit timer
rearm_timer:
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                  inet_csk(sk)->icsk_rto,
                  TCP_RTO_MAX);

    if (likely(!err))
        NET_INC_STATS_BH(sock_net(sk),
                 LINUX_MIB_TCPLOSSPROBES);
    return;
}

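If the skb uses GSO, so that it consists of several segments and its data is longer than (pcount - 1) * mss, it is split first and the last skb is then taken.

In short, when the TLP timer expires TCP retransmits the last segment (or sends new data if any is queued) and then arms the retransmit timer.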
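
The split arithmetic is easy to check in isolation. The sketch below is a stand-alone model of the `(pcount - 1) * mss` test in tcp_send_loss_probe. It computes how many bytes the probe would carry; it does not touch real skbs, and the function name tlp_probe_len is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * skb_len: bytes held by the last skb in the write queue
 * pcount:  number of MSS-sized segments that skb represents (GSO)
 * mss:     current MSS
 * Returns the number of bytes the loss probe would retransmit.
 */
size_t tlp_probe_len(size_t skb_len, unsigned int pcount, size_t mss)
{
    size_t head;

    assert(pcount > 0 && mss > 0);

    head = (size_t)(pcount - 1) * mss;  /* bytes that stay in the first part */
    if (pcount > 1 && skb_len > head)
        return skb_len - head;          /* split: probe with the tail only */
    return skb_len;                     /* no split: probe with the whole skb */
}
```

For example, a 4000-byte GSO skb made of three segments with an MSS of 1460 is split at byte 2920, and only the trailing 1080 bytes are retransmitted.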

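11 TIME_WAIT timer

When a socket enters the TIME_WAIT state, the TIME_WAIT timer starts. Until it expires, a tw sock stands in for the socket and handles packets from the old connection so that they cannot harm a new one. When the timer expires, the tw sock is deleted and the port it occupied is released.

11.1 When the TIME_WAIT timer is set

The TIME_WAIT timer is installed by tcp_time_wait, which is called in the following situations:

  • The socket is closed in the TCP_FIN_WAIT2 state, tp->linger2 has not been set below 0 with the TCP_LINGER2 option, and tcp_fin_time is less than or equal to TCP_TIMEWAIT_LEN: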
/net/ipv4/tcp.c
void tcp_close(struct sock *sk, long timeout)
{
...
    if (sk->sk_state == TCP_FIN_WAIT2) {
        struct tcp_sock *tp = tcp_sk(sk);
        if (tp->linger2 < 0) {
            tcp_set_state(sk, TCP_CLOSE);   // set the TCP state
            tcp_send_active_reset(sk, GFP_ATOMIC);    // send a RST to reset the connection
            NET_INC_STATS_BH(sock_net(sk),
                    LINUX_MIB_TCPABORTONLINGER);
        } else {   
            const int tmo = tcp_fin_time(sk);

            if (tmo > TCP_TIMEWAIT_LEN) {
                inet_csk_reset_keepalive_timer(sk,
                        tmo - TCP_TIMEWAIT_LEN);
            } else {
                tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);  // hand the connection over to time-wait handling
                goto out;
            }
        }
    }
...
}
  • A FIN is received in the TCP_FIN_WAIT2 state and the ACK has been sent:
static void tcp_fin(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);

    inet_csk_schedule_ack(sk);

    sk->sk_shutdown |= RCV_SHUTDOWN;
    sock_set_flag(sk, SOCK_DONE);

    switch (sk->sk_state) {
    case TCP_SYN_RECV:
    case TCP_ESTABLISHED:
        /* Move to CLOSE_WAIT */
        tcp_set_state(sk, TCP_CLOSE_WAIT);
        inet_csk(sk)->icsk_ack.pingpong = 1;
        break;

    case TCP_CLOSE_WAIT:
    case TCP_CLOSING:
        /* Received a retransmission of the FIN, do
         * nothing.
         */
        break;
    case TCP_LAST_ACK:
        /* RFC793: Remain in the LAST-ACK state. */
        break;

    case TCP_FIN_WAIT1:
        /* This case occurs when a simultaneous close
         * happens, we must ack the received FIN and
         * enter the CLOSING state.
         */
        tcp_send_ack(sk);
        tcp_set_state(sk, TCP_CLOSING);
        break;
    case TCP_FIN_WAIT2:
        // after sending the ACK, call tcp_time_wait to start the timer
        /* Received a FIN -- send ACK and enter TIME_WAIT. */
        tcp_send_ack(sk);
        tcp_time_wait(sk, TCP_TIME_WAIT, 0);  // enter TIME_WAIT
        break;
    default:
        /* Only TCP_LISTEN and TCP_CLOSE are left, in these
         * cases we should never reach this piece of code.
         */
        pr_err("%s: Impossible, sk->sk_state=%d\n",
               __func__, sk->sk_state);
        break;
    }

    /* It _is_ possible, that we have something out-of-order _after_ FIN.
     * Probably, we should reset in this case. For now drop them.
     */
    __skb_queue_purge(&tp->out_of_order_queue);
    if (tcp_is_sack(tp))
        tcp_sack_reset(&tp->rx_opt);
    sk_mem_reclaim(sk);

    if (!sock_flag(sk, SOCK_DEAD)) {
        sk->sk_state_change(sk);

        /* Do not send POLL_HUP for half duplex close. */
        if (sk->sk_shutdown == SHUTDOWN_MASK ||
            sk->sk_state == TCP_CLOSE)
            sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_HUP);
        else
            sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
    }
}
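  • An orphaned socket in the TCP_FIN_WAIT1 state receives an ACK, and all of the following hold:

    1. tp->linger2 has not been set below 0 with the TCP_LINGER2 option

    2. tcp_fin_time is less than or equal to TCP_TIMEWAIT_LEN

    3. the ACK carries no data, or only old data

    4. the ACK carries no FIN flag and the socket is not locked by the application process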

/net/ipv4/tcp_input.c
    case TCP_FIN_WAIT1: {
        struct dst_entry *dst;
        int tmo;

        /* If we enter the TCP_FIN_WAIT1 state and we are a
         * Fast Open socket and this is the first acceptable
         * ACK we have received, this would have acknowledged
         * our SYNACK so stop the SYNACK timer.
         */
        if (req != NULL) {
            /* Return RST if ack_seq is invalid.
             * Note that RFC793 only says to generate a
             * DUPACK for it but for TCP Fast Open it seems
             * better to treat this case like TCP_SYN_RECV
             * above.
             */
            if (!acceptable)
                return 1;
            /* We no longer need the request sock. */
            reqsk_fastopen_remove(sk, req, false);
            tcp_rearm_rto(sk);
        }
        if (tp->snd_una != tp->write_seq)
            break;

        tcp_set_state(sk, TCP_FIN_WAIT2);
        sk->sk_shutdown |= SEND_SHUTDOWN;

        dst = __sk_dst_get(sk);
        if (dst)
            dst_confirm(dst);

        if (!sock_flag(sk, SOCK_DEAD)) {
            /* Wake up lingering close() */
            sk->sk_state_change(sk);
            break;
        }

        if (tp->linger2 < 0 ||
            (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
             after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt))) {
            tcp_done(sk);
            NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
            return 1;
        }

        tmo = tcp_fin_time(sk);
        if (tmo > TCP_TIMEWAIT_LEN) {
            inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
        } else if (th->fin || sock_owned_by_user(sk)) {
            /* Bad case. We could lose such FIN otherwise.
             * It is not a big problem, but it looks confusing
             * and not so rare event. We still can lose it now,
             * if it spins in bh_lock_sock(), but it is really
             * marginal case.
             */
            inet_csk_reset_keepalive_timer(sk, tmo);
        } else {
            tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
            goto discard;
        }
  • TCP receives an ACK in the TCP_CLOSING state:
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
              const struct tcphdr *th, unsigned int len)
{
...
    case TCP_CLOSING:
        if (tp->snd_una == tp->write_seq) {
            tcp_time_wait(sk, TCP_TIME_WAIT, 0); // enter TIME_WAIT and wait for any remaining segments
            goto discard;
        }
        break;
...
}
tcp_time_wait in turn calls inet_twsk_schedule to install the TIME_WAIT timer:
/net/ipv4/tcp_minisocks.c
/*
 * Move a socket to time-wait or dead fin-wait-2 state.
 */
void tcp_time_wait(struct sock *sk, int state, int timeo)
{
...
    /* Linkage updates. */
        __inet_twsk_hashdance(tw, sk, &tcp_hashinfo);
        // the call above puts the tw sock into the ESTABLISHED and bind hash tables and removes sk from the ESTABLISHED hash table
        /* Get the TIME_WAIT timeout firing. */
        if (timeo < rto)
            timeo = rto;

        if (recycle_ok) {
            tw->tw_timeout = rto;
        } else {
            tw->tw_timeout = TCP_TIMEWAIT_LEN;
            if (state == TCP_TIME_WAIT)
                timeo = TCP_TIMEWAIT_LEN;
        }

        inet_twsk_schedule(tw, &tcp_death_row, timeo,
                   TCP_TIMEWAIT_LEN);
        inet_twsk_put(tw);
...
}
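
The timeout selection at the end of tcp_time_wait can be modelled on its own. The sketch below mirrors the branches in the excerpt above, taking `rto` and `recycle_ok` as plain inputs because their computation is elided there. Times are in milliseconds, and the type and function names (tw_timeouts, tw_pick_timeouts) are hypothetical.

```c
#include <assert.h>

#define TIMEWAIT_LEN_MS 60000  /* stands in for TCP_TIMEWAIT_LEN (60 s) */

enum tw_state { TW_FIN_WAIT2, TW_TIME_WAIT };  /* the `state` argument */

struct tw_timeouts {
    int timeo;       /* delay passed on to inet_twsk_schedule */
    int tw_timeout;  /* value stored in tw->tw_timeout */
};

struct tw_timeouts tw_pick_timeouts(enum tw_state state, int timeo,
                                    int rto, int recycle_ok)
{
    struct tw_timeouts out;

    assert(timeo >= 0 && rto > 0);

    if (timeo < rto)                 /* never shorter than the RTO-based floor */
        timeo = rto;

    if (recycle_ok) {
        out.tw_timeout = rto;
    } else {
        out.tw_timeout = TIMEWAIT_LEN_MS;
        if (state == TW_TIME_WAIT)   /* a true TIME_WAIT lasts the full 60 s */
            timeo = TIMEWAIT_LEN_MS;
    }

    out.timeo = timeo;
    return out;
}
```

Note how a caller that passes TCP_FIN_WAIT2 keeps its own timeout (for example the tmo computed in tcp_close), while a caller that passes TCP_TIME_WAIT gets the full TCP_TIMEWAIT_LEN unless recycling is allowed.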

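__inet_twsk_hashdance adds the tw_sock to the bind hash table and to the ESTABLISHED hash table, so that until the tw_sock is deleted the corresponding IP|port can neither be bound nor be used to establish a connection: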

/net/ipv4/inet_timewait_sock.c
/*
 * Enter the time wait state. This is called with locally disabled BH.
 * Essentially we whip up a timewait bucket, copy the relevant info into it
 * from the SK, and mess with hash chains and list linkage.
 */
void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
               struct inet_hashinfo *hashinfo)
{
    const struct inet_sock *inet = inet_sk(sk);
    const struct inet_connection_sock *icsk = inet_csk(sk);
    struct inet_ehash_bucket *ehead = inet_ehash_bucket(hashinfo, sk->sk_hash);
    spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
    struct inet_bind_hashbucket *bhead;
    /* Step 1: Put TW into bind hash. Original socket stays there too.
       Note, that any socket with inet->num != 0 MUST be bound in
       binding cache, even if it is closed.
     */
    bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), inet->inet_num,
            hashinfo->bhash_size)];
    spin_lock(&bhead->lock);
    tw->tw_tb = icsk->icsk_bind_hash;
    WARN_ON(!icsk->icsk_bind_hash);
    // add to the bind hash table
    inet_twsk_add_bind_node(tw, &tw->tw_tb->owners);
    spin_unlock(&bhead->lock);

    spin_lock(lock);

    /*
     * Step 2: Hash TW into tcp ehash chain.
     * Notes :
     * - tw_refcnt is set to 3 because :
     * - We have one reference from bhash chain.
     * - We have one reference from ehash chain.
     * We can use atomic_set() because prior spin_lock()/spin_unlock()
     * committed into memory all tw fields.
     */
    atomic_set(&tw->tw_refcnt, 1 + 1 + 1);
    // add to the ESTABLISHED hash table
    inet_twsk_add_node_rcu(tw, &ehead->chain);

    /* Step 3: Remove SK from hash chain */
    if (__sk_nulls_del_node_init_rcu(sk))
        sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);

    spin_unlock(lock);
}

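As a result, when an application calls bind on the same IP|port pair as the tw_sock, the kernel runs inet_csk_bind_conflict, which matches the tw_sock in the bind hash table, reports a conflict, and the bind fails. When a connection is being established, inet_hash_connect calls __inet_check_established to check whether the new connection conflicts with an existing one: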

/* called with local bh disabled */
static int __inet_check_established(struct inet_timewait_death_row *death_row,
                    struct sock *sk, __u16 lport,
                    struct inet_timewait_sock **twp)
{
    sk_nulls_for_each(sk2, node, &head->chain) { // walk the chain
        if (sk2->sk_hash != hash)
            continue;
        if (likely(INET_MATCH(sk2, net, acookie,
                     saddr, daddr, ports, dif))) {   // address|port match
            if (sk2->sk_state == TCP_TIME_WAIT) {
                tw = inet_twsk(sk2);
                if (twsk_unique(sk, sk2, twp)) // ask twsk_unique whether this is a real conflict
                    break;
            }
            goto not_unique; // conflict
        }
    }
    ...
    if (twp) {
        *twp = tw;   // hand the tw sock to the caller
    } else if (tw) {
        /* Silly. Should hash-dance instead... */
        inet_twsk_deschedule(tw, death_row);

        inet_twsk_put(tw);
    }
    return 0;

not_unique:
    spin_unlock(lock);
    return -EADDRNOTAVAIL;  // report the conflict
}

The twsk_unique function:

static inline int twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
{
    if (sk->sk_prot->twsk_prot->twsk_unique != NULL)
        return sk->sk_prot->twsk_prot->twsk_unique(sk, sktw, twp);
    return 0;
}
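
twsk_unique only dispatches to a per-protocol hook. For TCP that hook is tcp_twsk_unique, which is not shown in this walkthrough. The sketch below models its decision as recalled from kernels of this era, so treat the exact rule as an assumption: a TIME_WAIT entry may be taken over only if the old connection used timestamps, and, when the caller wants the tw sock handed back, only if net.ipv4.tcp_tw_reuse is on and more than one second has passed since the last timestamp was recorded. The struct and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the timestamp state kept in the tw sock. */
struct tw_stamp {
    long ts_recent_stamp;  /* seconds; 0 if the old connection never used timestamps */
};

/*
 * caller_takes_tw:  non-zero when the caller passed a non-NULL twp
 * tw_reuse_enabled: value of net.ipv4.tcp_tw_reuse (treated as a boolean)
 * now_sec:          current time in seconds
 * Returns 1 if the TIME_WAIT entry may be reused for the new connection.
 */
int tw_may_reuse(const struct tw_stamp *tw, int caller_takes_tw,
                 int tw_reuse_enabled, long now_sec)
{
    assert(tw != NULL);

    if (tw->ts_recent_stamp == 0)
        return 0;              /* no timestamps: reuse is never safe */
    if (!caller_takes_tw)
        return 1;
    return tw_reuse_enabled && (now_sec - tw->ts_recent_stamp > 1);
}
```

A return value of 1 corresponds to the `break` in __inet_check_established: the four-tuple is treated as free even though a tw sock still occupies it.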

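12 FIN_WAIT2 timer

If an application closes a socket with the close system call while the exchange with the peer has not finished, the socket becomes an "orphan socket". When an orphan socket enters the FIN_WAIT2 state (or a socket already in FIN_WAIT2 becomes an orphan), it waits for the peer's FIN to end the connection for good. If the peer never sends a FIN, however, the orphan socket would live forever and keep consuming system resources. To prevent this, a FIN_WAIT2 timer is set when an orphan socket enters FIN_WAIT2; if no FIN has arrived by the time it expires, the socket is closed.

12.1 When is the FIN_WAIT2 timer started?

The FIN_WAIT2 timer is also started with inet_csk_reset_keepalive_timer. It is set on two occasions:

  • The application calls close while the socket is in the TCP_FIN_WAIT2 state: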
void tcp_close(struct sock *sk, long timeout)
{
    struct sk_buff *skb;
    int data_was_unread = 0;
    int state;

    lock_sock(sk);
    sk->sk_shutdown = SHUTDOWN_MASK;

    if (sk->sk_state == TCP_LISTEN) {
        tcp_set_state(sk, TCP_CLOSE);

        /* Special case. */
        inet_csk_listen_stop(sk);

        goto adjudge_to_death;
    }

    /*  We need to flush the recv. buffs.  We do this only on the
     *  descriptor close, not protocol-sourced closes, because the
     *  reader process may not have drained the data yet!
     */
    while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
        u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq;

        if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
            len--;
        data_was_unread += len;
        __kfree_skb(skb);
    }

    sk_mem_reclaim(sk);

    /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
    if (sk->sk_state == TCP_CLOSE)
        goto adjudge_to_death;

    /* As outlined in RFC 2525, section 2.17, we send a RST here because
     * data was lost. To witness the awful effects of the old behavior of
     * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
     * GET in an FTP client, suspend the process, wait for the client to
     * advertise a zero window, then kill -9 the FTP client, wheee...
     * Note: timeout is always zero in such a case.
     */
    if (unlikely(tcp_sk(sk)->repair)) {
        sk->sk_prot->disconnect(sk, 0);
    } else if (data_was_unread) {
        /* Unread data was tossed, zap the connection. */
        NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
        tcp_set_state(sk, TCP_CLOSE);
        tcp_send_active_reset(sk, sk->sk_allocation);
    } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
        /* Check zero linger _after_ checking for unread data. */
        sk->sk_prot->disconnect(sk, 0);
        NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
    } else if (tcp_close_state(sk)) {
        /* We FIN if the application ate all the data before
         * zapping the connection.
         */

        /* RED-PEN. Formally speaking, we have broken TCP state
         * machine. State transitions:
         *
         * TCP_ESTABLISHED -> TCP_FIN_WAIT1
         * TCP_SYN_RECV -> TCP_FIN_WAIT1 (forget it, it's impossible)
         * TCP_CLOSE_WAIT -> TCP_LAST_ACK
         *
         * are legal only when FIN has been sent (i.e. in window),
         * rather than queued out of window. Purists blame.
         *
         * F.e. "RFC state" is ESTABLISHED,
         * if Linux state is FIN-WAIT-1, but FIN is still not sent.
         *
         * The visible declinations are that sometimes
         * we enter time-wait state, when it is not required really
         * (harmless), do not send active resets, when they are
         * required by specs (TCP_ESTABLISHED, TCP_CLOSE_WAIT, when
         * they look as CLOSING or LAST_ACK for Linux)
         * Probably, I missed some more holelets.
         *                      --ANK
         * XXX (TFO) - To start off we don't support SYN+ACK+FIN
         * in a single packet! (May consider it later but will
         * probably need API support or TCP_CORK SYN-ACK until
         * data is written and socket is closed.)
         */
        tcp_send_fin(sk);
    }

    sk_stream_wait_close(sk, timeout);

adjudge_to_death:
    state = sk->sk_state;
    sock_hold(sk);
    sock_orphan(sk);

    /* It is the last release_sock in its life. It will remove backlog. */
    release_sock(sk);


    /* Now socket is owned by kernel and we acquire BH lock
       to finish close. No need to check for user refs.
     */
    local_bh_disable();
    bh_lock_sock(sk);
    WARN_ON(sock_owned_by_user(sk));

    percpu_counter_inc(sk->sk_prot->orphan_count);

    /* Have we already been destroyed by a softirq or backlog? */
    if (state != TCP_CLOSE && sk->sk_state == TCP_CLOSE)
        goto out;

    /*  This is a (useful) BSD violating of the RFC. There is a
     *  problem with TCP as specified in that the other end could
     *  keep a socket open forever with no application left this end.
     *  We use a 3 minute timeout (about the same as BSD) then kill
     *  our end. If they send after that then tough - BUT: long enough
     *  that we won't make the old 4*rto = almost no time - whoops
     *  reset mistake.
     *
     *  Nope, it was not mistake. It is really desired behaviour
     *  f.e. on http servers, when such sockets are useless, but
     *  consume significant resources. Let's do it with special
     *  linger2 option.                 --ANK
     */

    if (sk->sk_state == TCP_FIN_WAIT2) { // the socket is in FIN_WAIT2
        struct tcp_sock *tp = tcp_sk(sk);
        if (tp->linger2 < 0) {
            tcp_set_state(sk, TCP_CLOSE);
            tcp_send_active_reset(sk, GFP_ATOMIC);
            NET_INC_STATS_BH(sock_net(sk),
                    LINUX_MIB_TCPABORTONLINGER);
        } else {
            const int tmo = tcp_fin_time(sk);

            if (tmo > TCP_TIMEWAIT_LEN) {  // the wait for the peer's FIN is longer than the TIME_WAIT length
                inet_csk_reset_keepalive_timer(sk,
                        tmo - TCP_TIMEWAIT_LEN);  // start the FIN_WAIT2 timer
            } else {
                tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
                goto out;
            }
        }
    }
    if (sk->sk_state != TCP_CLOSE) {
        sk_mem_reclaim(sk);
        if (tcp_check_oom(sk, 0)) {
            tcp_set_state(sk, TCP_CLOSE);
            tcp_send_active_reset(sk, GFP_ATOMIC);
            NET_INC_STATS_BH(sock_net(sk),
                    LINUX_MIB_TCPABORTONMEMORY);
        }
    }

    if (sk->sk_state == TCP_CLOSE) {
        struct request_sock *req = tcp_sk(sk)->fastopen_rsk;
        /* We could get here with a non-NULL req if the socket is
         * aborted (e.g., closed with unread data) before 3WHS
         * finishes.
         */
        if (req != NULL)
            reqsk_fastopen_remove(sk, req, false);
        inet_csk_destroy_sock(sk);
    }
    /* Otherwise, socket is reprieved until protocol close. */

out:
    bh_unlock_sock(sk);
    local_bh_enable();
    sock_put(sk);
}

The socket then becomes an orphan socket, but the FIN_WAIT2 timer keeps watch over it.

  • An orphan socket enters the FIN_WAIT2 state:
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
              const struct tcphdr *th, unsigned int len)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct request_sock *req;
    int queued = 0;
    bool acceptable;
    u32 synack_stamp;

    tp->rx_opt.saw_tstamp = 0;

    switch (sk->sk_state) {
    case TCP_CLOSE:
        goto discard;

    case TCP_LISTEN:
        if (th->ack)
            return 1;

        if (th->rst)
            goto discard;

        if (th->syn) {
            if (th->fin)
                goto discard;
            if (icsk->icsk_af_ops->conn_request(sk, skb) < 0)
                return 1;

            /* Now we have several options: In theory there is
             * nothing else in the frame. KA9Q has an option to
             * send data with the syn, BSD accepts data with the
             * syn up to the [to be] advertised window and
             * Solaris 2.1 gives you a protocol error. For now
             * we just ignore it, that fits the spec precisely
             * and avoids incompatibilities. It would be nice in
             * future to drop through and process the data.
             *
             * Now that TTCP is starting to be used we ought to
             * queue this data.
             * But, this leaves one open to an easy denial of
             * service attack, and SYN cookies can't defend
             * against this problem. So, we drop the data
             * in the interest of security over speed unless
             * it's still in use.
             */
            kfree_skb(skb);
            return 0;
        }
        goto discard;

    case TCP_SYN_SENT:
        queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
        if (queued >= 0)
            return queued;

        /* Do step6 onward by hand. */
        tcp_urg(sk, skb, th);
        __kfree_skb(skb);
        tcp_data_snd_check(sk);
        return 0;
    }

    req = tp->fastopen_rsk;
    if (req != NULL) {
        WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
            sk->sk_state != TCP_FIN_WAIT1);

        if (tcp_check_req(sk, skb, req, NULL, true) == NULL)
            goto discard;
    }

    if (!th->ack && !th->rst && !th->syn)
        goto discard;

    if (!tcp_validate_incoming(sk, skb, th, 0))
        return 0;

    /* step 5: check the ACK field */
    acceptable = tcp_ack(sk, skb, FLAG_SLOWPATH |
                      FLAG_UPDATE_TS_RECENT) > 0;

    switch (sk->sk_state) {
    case TCP_SYN_RECV:
        if (!acceptable)
            return 1;

        /* Once we leave TCP_SYN_RECV, we no longer need req
         * so release it.
         */
        if (req) {
            synack_stamp = tcp_rsk(req)->snt_synack;
            tp->total_retrans = req->num_retrans;
            reqsk_fastopen_remove(sk, req, false);
        } else {
            synack_stamp = tp->lsndtime;
            /* Make sure socket is routed, for correct metrics. */
            icsk->icsk_af_ops->rebuild_header(sk);
            tcp_init_congestion_control(sk);

            tcp_mtup_init(sk);
            tp->copied_seq = tp->rcv_nxt;
            tcp_init_buffer_space(sk);
        }
        smp_mb();
        tcp_set_state(sk, TCP_ESTABLISHED);
        sk->sk_state_change(sk);

        /* Note, that this wakeup is only for marginal crossed SYN case.
         * Passively open sockets are not waked up, because
         * sk->sk_sleep == NULL and sk->sk_socket == NULL.
         */
        if (sk->sk_socket)
            sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);

        tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
        tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
        tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
        tcp_synack_rtt_meas(sk, synack_stamp);

        if (tp->rx_opt.tstamp_ok)
            tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;

        if (req) {
            /* Re-arm the timer because data may have been sent out.
             * This is similar to the regular data transmission case
             * when new data has just been ack'ed.
             *
             * (TFO) - we could try to be more aggressive and
             * retransmitting any data sooner based on when they
             * are sent out.
             */
            tcp_rearm_rto(sk);
        } else
            tcp_init_metrics(sk);

        tcp_update_pacing_rate(sk);

        /* Prevent spurious tcp_cwnd_restart() on first data packet */
        tp->lsndtime = tcp_time_stamp;

        tcp_initialize_rcv_mss(sk);
        tcp_fast_path_on(tp);
        break;

    case TCP_FIN_WAIT1: {
        struct dst_entry *dst;
        int tmo;

        /* If we enter the TCP_FIN_WAIT1 state and we are a
         * Fast Open socket and this is the first acceptable
         * ACK we have received, this would have acknowledged
         * our SYNACK so stop the SYNACK timer.
         */
        if (req != NULL) {
            /* Return RST if ack_seq is invalid.
             * Note that RFC793 only says to generate a
             * DUPACK for it but for TCP Fast Open it seems
             * better to treat this case like TCP_SYN_RECV
             * above.
             */
            if (!acceptable)
                return 1;
            /* We no longer need the request sock. */
            reqsk_fastopen_remove(sk, req, false);
            tcp_rearm_rto(sk);
        }
        if (tp->snd_una != tp->write_seq)
            break;

        tcp_set_state(sk, TCP_FIN_WAIT2);
        sk->sk_shutdown |= SEND_SHUTDOWN;

        dst = __sk_dst_get(sk);
        if (dst)
            dst_confirm(dst);

        if (!sock_flag(sk, SOCK_DEAD)) {
            /* Wake up lingering close() */
            sk->sk_state_change(sk);
            break;
        }

        if (tp->linger2 < 0 ||
            (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
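             after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt))) { // linger2 < 0, or the segment carries new data (the FIN's sequence number is excluded)
            tcp_done(sk);  // an orphan socket has no use for data, so close TCP at once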
            NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
            return 1;
        }

        tmo = tcp_fin_time(sk);
        if (tmo > TCP_TIMEWAIT_LEN) {
            inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
        } else if (th->fin || sock_owned_by_user(sk)) {
            /* Bad case. We could lose such FIN otherwise.
             * It is not a big problem, but it looks confusing
             * and not so rare event. We still can lose it now,
             * if it spins in bh_lock_sock(), but it is really
             * marginal case.
             */
            inet_csk_reset_keepalive_timer(sk, tmo);
        } else {
            tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
            goto discard;
        }
        break;
    }

    case TCP_CLOSING:
        if (tp->snd_una == tp->write_seq) {
            tcp_time_wait(sk, TCP_TIME_WAIT, 0);
            goto discard;
        }
        break;

    case TCP_LAST_ACK:
        if (tp->snd_una == tp->write_seq) {
            tcp_update_metrics(sk);
            tcp_done(sk);
            goto discard;
        }
        break;
    }

    /* step 6: check the URG bit */
    tcp_urg(sk, skb, th);

    /* step 7: process the segment text */
    switch (sk->sk_state) {
    case TCP_CLOSE_WAIT:
    case TCP_CLOSING:
    case TCP_LAST_ACK:
        if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
            break;
    case TCP_FIN_WAIT1:
    case TCP_FIN_WAIT2:
        /* RFC 793 says to queue data in these states,
         * RFC 1122 says we MUST send a reset.
         * BSD 4.4 also does reset.
         */
        if (sk->sk_shutdown & RCV_SHUTDOWN) {
            if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
                after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
                tcp_reset(sk);
                return 1;
            }
        }
        /* Fall through */
    case TCP_ESTABLISHED:
        tcp_data_queue(sk, skb);
        queued = 1;
        break;
    }

    /* tcp_data could move socket to TIME-WAIT */
    if (sk->sk_state != TCP_CLOSE) {
        tcp_data_snd_check(sk);
        tcp_ack_snd_check(sk);
    }

    if (!queued) {
discard:
        __kfree_skb(skb);
    }
    return 0;
}

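So a necessary condition for the FIN_WAIT2 timer is that the application has not set tp->linger2 below 0 with the TCP_LINGER2 socket option.
Like the Keepalive timer, the FIN_WAIT2 timer can only be torn down with the SO_KEEPALIVE socket option.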

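The FIN_WAIT2 timer's timeout is tcp_fin_time minus TCP_TIMEWAIT_LEN (or tcp_fin_time itself in the th->fin / sock_owned_by_user branch shown above). tcp_fin_time is: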

static inline int tcp_fin_time(const struct sock *sk)
{
    int fin_timeout = tcp_sk(sk)->linger2 ? : sysctl_tcp_fin_timeout;
    const int rto = inet_csk(sk)->icsk_rto;

    if (fin_timeout < (rto << 2) - (rto >> 1))
        fin_timeout = (rto << 2) - (rto >> 1);

    return fin_timeout;
}

Here tcp_sk(sk)->linger2 is set by the TCP_LINGER2 socket option, and sysctl_tcp_fin_timeout is set by the net.ipv4.tcp_fin_timeout kernel parameter, whose default is the same as TCP_TIMEWAIT_LEN (60s).
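
To make the arithmetic concrete, here is a user-space replica of tcp_fin_time together with the tcp_close rule for how long the FIN_WAIT2 timer is armed. It is only a sketch: times are in milliseconds rather than jiffies, and fin_time_ms and fin_wait2_timer_ms are names chosen for the example.

```c
#include <assert.h>

#define TIMEWAIT_LEN_MS 60000  /* stands in for TCP_TIMEWAIT_LEN (60 s) */

/* Same logic as tcp_fin_time: take linger2 if set, otherwise the sysctl,
 * but never less than 3.5 * RTO. */
int fin_time_ms(int linger2_ms, int sysctl_fin_timeout_ms, int rto_ms)
{
    int fin_timeout;
    int floor_ms;

    assert(linger2_ms >= 0 && rto_ms >= 0);  /* linger2 < 0 is handled by the callers */

    fin_timeout = linger2_ms ? linger2_ms : sysctl_fin_timeout_ms;
    floor_ms = (rto_ms << 2) - (rto_ms >> 1);  /* 4*RTO - RTO/2 = 3.5*RTO */

    return fin_timeout < floor_ms ? floor_ms : fin_timeout;
}

/* tcp_close path: delay for which the FIN_WAIT2 timer is armed, or 0 when
 * no timer is armed and tcp_time_wait(sk, TCP_FIN_WAIT2, tmo) is used instead. */
int fin_wait2_timer_ms(int tmo_ms)
{
    return tmo_ms > TIMEWAIT_LEN_MS ? tmo_ms - TIMEWAIT_LEN_MS : 0;
}
```

With the default settings tcp_fin_time is 60 s, which is not greater than TCP_TIMEWAIT_LEN, so no FIN_WAIT2 timer is armed and the connection is handed to a tw sock straight away. The timer only comes into play when the effective timeout exceeds 60 s, for example when net.ipv4.tcp_fin_timeout is raised.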

12.2 What does the FIN_WAIT2 timer do when it expires?

static void tcp_keepalive_timer (unsigned long data)
{
    struct sock *sk = (struct sock *) data;
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    u32 elapsed;

    /* Only process if socket is not in use. */
    bh_lock_sock(sk);
    if (sock_owned_by_user(sk)) {
        /* Try again later. */
        inet_csk_reset_keepalive_timer (sk, HZ/20);
        goto out;
    }

    if (sk->sk_state == TCP_LISTEN) {
        tcp_synack_timer(sk);
        goto out;
    }
    // the FIN_WAIT2 logic lives here
    if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
        if (tp->linger2 >= 0) { // linger2 is not negative, so the FIN_WAIT2 wait is enabled
            const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;

            if (tmo > 0) {
                tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);  // hand over to time-wait handling
                goto out;
            }
        }
        // otherwise send a RST and terminate the connection
        tcp_send_active_reset(sk, GFP_ATOMIC);
        goto death;
    }

    if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
        goto out;

    elapsed = keepalive_time_when(tp);

    /* It is alive without keepalive 8) */
    if (tp->packets_out || tcp_send_head(sk))
        goto resched;

    elapsed = keepalive_time_elapsed(tp);

    if (elapsed >= keepalive_time_when(tp)) {
        /* If the TCP_USER_TIMEOUT option is enabled, use that
         * to determine when to timeout instead.
         */
        if ((icsk->icsk_user_timeout != 0 &&
            elapsed >= icsk->icsk_user_timeout &&
            icsk->icsk_probes_out > 0) ||
            (icsk->icsk_user_timeout == 0 &&
            icsk->icsk_probes_out >= keepalive_probes(tp))) {
            tcp_send_active_reset(sk, GFP_ATOMIC);
            tcp_write_err(sk);
            goto out;
        }
        if (tcp_write_wakeup(sk) <= 0) {
            icsk->icsk_probes_out++;
            elapsed = keepalive_intvl_when(tp);
        } else {
            /* If keepalive was lost due to local congestion,
             * try harder.
             */
            elapsed = TCP_RESOURCE_PROBE_INTERVAL;
        }
    } else {
        /* It is tp->rcv_tstamp + keepalive_time_when(tp) */
        elapsed = keepalive_time_when(tp) - elapsed;
    }

    sk_mem_reclaim(sk);

resched:
    inet_csk_reset_keepalive_timer (sk, elapsed);
    goto out;

death:
    tcp_done(sk);

out:
    bh_unlock_sock(sk);
    sock_put(sk);
}

The FIN_WAIT2 timer has two possible timeout actions: enter the TIME_WAIT state, or send a RST and then close the TCP connection.
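
The choice between the two actions can be written as a small pure function. This is a model of the FIN_WAIT2 branch of tcp_keepalive_timer above, not kernel code: times are in milliseconds, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TIMEWAIT_LEN_MS 60000  /* stands in for TCP_TIMEWAIT_LEN (60 s) */

enum fw2_action {
    FW2_TIME_WAIT,  /* tcp_time_wait(sk, TCP_FIN_WAIT2, tmo) */
    FW2_RESET       /* tcp_send_active_reset() followed by tcp_done() */
};

/*
 * linger2_ms:  tp->linger2 (negative means "do not wait in FIN_WAIT2")
 * fin_time_ms: result of tcp_fin_time()
 * tw_ms:       out parameter, time-wait duration when FW2_TIME_WAIT is returned
 */
enum fw2_action fin_wait2_on_expiry(int linger2_ms, int fin_time_ms, int *tw_ms)
{
    assert(tw_ms != NULL);

    if (linger2_ms >= 0) {
        int tmo = fin_time_ms - TIMEWAIT_LEN_MS;

        if (tmo > 0) {
            *tw_ms = tmo;
            return FW2_TIME_WAIT;
        }
    }

    *tw_ms = 0;
    return FW2_RESET;
}
```

So a socket whose tcp_fin_time is 120 s moves to time-wait handling for a further 60 s, while one with linger2 < 0, or with tcp_fin_time at or below 60 s, is reset immediately.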

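13. Afterthoughts on the TCP timers

The biggest takeaway from walking through the TCP timers is how restrained their behavior is: it is precisely TCP's willingness to sacrifice itself that keeps today's networks flowing smoothly.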

