The CUBIC algorithm

头铁的伦

Last modified: 2023-03-30 14:41:51


First published: 2023-03-30 14:41:34

Article link: https://blog.csdn.net/shanbl_linux_android/article/details/129857127


Basic concepts

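Window (wnd): split into tx and rx. tx is the send buffer and rx is the receive buffer.

Congestion window (cwnd): the larger it is, the faster the sender can transmit, so it is adjusted upward from a low starting value.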

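Slow start: the phase used at the start of a connection and after a retransmission timeout. The sender begins with a small cwnd and grows it by the amount of data each ACK confirms, so it can probe the path without flooding it.

Slow start threshold (ssthresh): while cwnd is below this threshold, data is sent using slow start.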

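Send congestion window (snd_cwnd): the sender's congestion window; it reflects the current sending rate.

Send congestion window clamp (snd_cwnd_clamp): the maximum value the send congestion window may take.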

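Sliding window protocol: controls the send queue and ensures reliable data transfer.

Nagle: combines several small pieces of data and sends them to the peer together.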

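Fast retransmit: retransmit right after a loss is detected instead of waiting for a timeout. The receiver repeats the ACK for the last in-order data (duplicate ACKs) to request a quick retransmission.

Fast recovery: halve both ssthresh and cwnd, then enter congestion avoidance. It is usually triggered by several duplicate ACKs in a row.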

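CUBIC: a congestion control algorithm that probes for the maximum send window using a mathematical (cubic) function of time.

Reno: a congestion control algorithm that probes for the maximum send window by growing it about one segment per RTT.

BBR: a congestion control algorithm that computes the maximum send window from measured path information.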

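SACK: lets the receiver tell the sender which segments have already arrived ahead of a gap (and, with D-SACK, which were received more than once), so the sender can work out which segments are missing. With this information TCP can retransmit only the segments that were actually lost.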

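Slow start: the congestion window grows exponentially: 1 -> 2 -> 4 -> 8 -> 16. The code is as follows.

During slow start, the new send congestion window is the smaller of the slow start threshold and (send congestion window + newly ACKed segments). The function returns the ACKed segments that were left over once cwnd reached ssthresh; the caller then spends them in congestion avoidance.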

u32 tcp_slow_start(struct tcp_sock *tp, u32 acked)
{
    u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);
    acked -= cwnd - tp->snd_cwnd;
    tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);   /* update the TCP send congestion window */
    return acked;
}
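To see the doubling and the leftover count in action, the same arithmetic can be run in user space. This is only a sketch: `struct ss_state`, `ss_slow_start` and `ss_one_rtt` are hypothetical names standing in for the kernel's `tcp_sock` fields and `tcp_slow_start()`, and one "round trip" is assumed to ACK the whole window.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few tcp_sock fields used by slow start. */
struct ss_state {
    uint32_t snd_cwnd;       /* congestion window, in segments */
    uint32_t snd_ssthresh;   /* slow start threshold, in segments */
    uint32_t snd_cwnd_clamp; /* hard upper bound on snd_cwnd */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Same arithmetic as tcp_slow_start(): grow cwnd by the ACKed count, but
 * not past ssthresh, and return the ACKs that were not consumed.
 */
uint32_t ss_slow_start(struct ss_state *s, uint32_t acked)
{
    uint32_t cwnd;

    /* The kernel caller only gets here while cwnd is below ssthresh. */
    assert(s->snd_cwnd < s->snd_ssthresh);

    cwnd = min_u32(s->snd_cwnd + acked, s->snd_ssthresh);
    acked -= cwnd - s->snd_cwnd;
    s->snd_cwnd = min_u32(cwnd, s->snd_cwnd_clamp);
    return acked;
}

/* One round trip in which every segment of the current window is ACKed. */
uint32_t ss_one_rtt(struct ss_state *s)
{
    return ss_slow_start(s, s->snd_cwnd);
}
```

Starting from cwnd = 1 with ssthresh = 10, successive calls to `ss_one_rtt` give 2, 4, 8 and then 10, and the last call returns 6 leftover ACKs for congestion avoidance.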

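Congestion avoidance phase: the congestion window grows linearly, by one MSS per round trip: +1, +1, +1.

Fast retransmit: handles loss that is signalled by duplicate ACKs rather than by a timeout, which means the network is still delivering packets. ssthresh is lowered to half of the current congestion window, and the congestion window becomes that half plus the number of duplicate ACKs received.

Timeout retransmission: the congestion window is reset to its initial value and slow start begins again from 1.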
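The two reactions can be compared side by side with a small model. This is a textbook Reno-style sketch, not the Linux code path: `struct reno_state`, `reno_on_dup_acks` and `reno_on_timeout` are hypothetical names, the floor of 2 segments for ssthresh follows the usual Reno rule, and halving ssthresh on timeout is an assumption taken from that rule rather than from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal sender state, in segments. */
struct reno_state {
    uint32_t cwnd;
    uint32_t ssthresh;
};

static uint32_t half_window(uint32_t cwnd)
{
    uint32_t half = cwnd / 2;

    return half < 2 ? 2 : half; /* never drop ssthresh below 2 segments */
}

/* Loss signalled by duplicate ACKs: halve, then inflate by the dup ACKs. */
void reno_on_dup_acks(struct reno_state *s, uint32_t dup_acks)
{
    assert(dup_acks >= 3); /* the usual fast retransmit trigger */
    s->ssthresh = half_window(s->cwnd);
    s->cwnd = s->ssthresh + dup_acks;
}

/* Loss signalled by a retransmission timeout: restart slow start at 1. */
void reno_on_timeout(struct reno_state *s)
{
    s->ssthresh = half_window(s->cwnd);
    s->cwnd = 1;
}
```

With cwnd = 20, three duplicate ACKs leave the sender at cwnd = 13, while a timeout drops it to cwnd = 1; both set ssthresh to 10.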

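The inevitability of network congestion

Network congestion follows from the way TCP's algorithms work. The maximum amount of data a TCP path can hold, C, has two parts: the physical-layer capacity C1 and the size of the buffers along the path C2:

C = C1 + C2. C2 is a constant, and C is a fixed value for a given path.

Here C1 = P * rtt, where P is the current sending rate. TCP does not know the maximum of P, so it keeps increasing it; rtt is one round-trip time. This gives:

C = P * rtt + C2. As the original figure shows, the ideal growth of P follows a rectangular pattern.

After n round trips, the amount of data TCP is trying to keep in the network becomes

Cn = Pn * rttn + C2, with Cn > C.

This raises a problem: P has a maximum Pmax and cannot grow forever. Once Pn > Pmax, Cn > Cmax. Congestion has now started, but TCP cannot sense it yet and keeps filling C2, which makes rtt grow; TCP always tends to fill C2 completely. Only after 1/2 rtt (at the earliest) does TCP notice the rising rtt and lower P to keep the network stable.

The issue is that, in theory, TCP should slow down as soon as C1 -> C1max. Because C2 acts as a buffer, cwnd keeps growing and TCP cannot rein in its rate until C2 is full. Packet loss is therefore unavoidable somewhere between 1/2 rtt and rtt. The larger C2 is configured, the larger rtt becomes and the worse the consequences.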
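The argument above can be made concrete with a toy single-bottleneck model. Everything in it is an assumption for illustration: `struct path_model`, `path_eval` and the integer units (packets per millisecond) are hypothetical, and the model ignores pacing and ACK timing. It only shows how a window larger than C1max first inflates rtt and then causes drops once C2 is full.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a path with one bottleneck. */
struct path_model {
    uint32_t pmax_pkts_per_ms; /* Pmax: bottleneck rate */
    uint32_t base_rtt_ms;      /* rtt while the buffer is empty */
    uint32_t buf_pkts;         /* C2: bottleneck buffer size */
};

struct path_result {
    uint32_t queued_pkts;  /* packets sitting in the buffer */
    uint32_t dropped_pkts; /* packets that did not fit in the buffer */
    uint32_t rtt_ms;       /* base rtt plus queueing delay */
};

/* Evaluate what happens when cwnd packets are kept in the network. */
struct path_result path_eval(const struct path_model *m, uint32_t cwnd)
{
    struct path_result r = { 0, 0, m->base_rtt_ms };
    uint32_t c1max;

    assert(m->pmax_pkts_per_ms > 0);
    c1max = m->pmax_pkts_per_ms * m->base_rtt_ms; /* C1max = Pmax * rtt */

    if (cwnd > c1max) {
        uint32_t excess = cwnd - c1max;

        r.queued_pkts = excess < m->buf_pkts ? excess : m->buf_pkts;
        r.dropped_pkts = excess - r.queued_pkts;
        r.rtt_ms += r.queued_pkts / m->pmax_pkts_per_ms;
    }
    return r;
}
```

With Pmax = 10 packets/ms, a base rtt of 20 ms and C2 = 100 packets, a window of 250 raises rtt to 25 ms without loss, while a window of 350 fills the buffer, raises rtt to 30 ms and drops 50 packets.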

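CUBIC implementation

Because the transport layer does not know the state of the underlying network or its maximum bandwidth, it raises the sending rate step by step until it reaches the maximum. This lets it adapt to physical layers with different capabilities.

In BIC-TCP, the predecessor of CUBIC, the whole window growth function is simply a logarithmic concave function. A concave function keeps the congestion window near the saturation point (the equilibrium) for longer than a convex or linear function does. Convex and linear functions have their largest window increment right at the saturation point, so they fluctuate the most when packets are lost. These properties make BIC-TCP very stable while remaining highly scalable.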

CUBIC growth curve

Window calculation formula:

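CUBIC window growth function: W(t) = C * (t - K)^3 + Wmax.

C: a scaling constant. In the current Linux/Android tcp_cubic implementation it is derived from bic_scale = 41 (cube_rtt_scale = bic_scale * 10), which corresponds to a C of roughly 0.4. The expression 8*(1024+717)/3/(1024-717) is beta_scale, which is used by the TCP-friendliness estimate rather than by the cubic function itself. The larger C is, the less time it takes to probe up to the maximum window.

t: the time elapsed since the most recent loss, i.e. tcp_jiffies32 - ca->epoch_start.

K: the time the window needs to grow from its value right after the reduction back to Wmax; in the kernel implementation this is bic_K.

beta: chosen by tcp_cubic itself. In the current kernel beta = 717, meaning 717/1024 (about 0.7), the fraction of the window that is kept after a loss. beta determines how far the window drops and therefore the height of the region enclosed by the symmetric parts of the curve.

As long as no further loss occurs, K = ((Wmax - W(0)) / C)^(1/3), obtained by solving W(0) = Wmax - C * K^3 for K. With W(0) = beta * Wmax right after a reduction, this gives K = ((1 - beta) * Wmax / C)^(1/3).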
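A floating-point sketch makes the shape of W(t) easier to check by hand. It is not the kernel's fixed-point code: `cubic_k`, `cubic_w` and the Newton-iteration `cube_root` helper are hypothetical names, time is assumed to be in seconds, and `w0` is the window right after the reduction.

```c
#include <assert.h>

/* Cube root by Newton iteration, to avoid depending on libm. */
static double cube_root(double v)
{
    double x;
    int i;

    if (v <= 0.0)
        return 0.0;
    x = v > 1.0 ? v : 1.0; /* start at or above the root */
    for (i = 0; i < 100; i++)
        x = (2.0 * x + v / (x * x)) / 3.0;
    return x;
}

/* K = ((Wmax - W(0)) / C)^(1/3): time needed to climb back to Wmax. */
double cubic_k(double wmax, double w0, double c)
{
    assert(c > 0.0 && wmax >= w0);
    return cube_root((wmax - w0) / c);
}

/* W(t) = C * (t - K)^3 + Wmax */
double cubic_w(double t, double k, double wmax, double c)
{
    double d = t - k;

    return c * d * d * d + wmax;
}
```

For Wmax = 100, w0 = 70 and C = 0.4, K is about 4.2 seconds. The window starts at 70, flattens out at 100 when t = K (the concave part), and reaches about 130 at t = 2K (the convex part), which shows the symmetry around the origin point.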

Window calculation implementation:

Structure analysis: CUBIC is the successor of BIC-TCP, so the cubic code still uses the structure of the BIC algorithm:

struct bictcp {
        u32     cnt;            /* increase cwnd by 1 after ACKs */
        u32     last_max_cwnd;  /* last maximum snd_cwnd */
        u32     last_cwnd;      /* the last snd_cwnd */
        u32     last_time;      /* time when updated last_cwnd */
        u32     bic_origin_point;/* origin point of the cubic function; can be read as the window growth target */
        u32     bic_K;          /* time to origin point
                                   from the beginning of the current epoch */
        u32     delay_min;      /* min delay (msec << 3) */
        u32     epoch_start;    /* beginning of an epoch; 0 means a loss occurred (reset when the state enters CA_LOSS) */
        u32     ack_cnt;        /* number of acks */
        u32     tcp_cwnd;       /* estimated tcp cwnd */
        u16     unused;
        u8      sample_cnt;     /* number of samples to decide curr_rtt */
        u8      found;          /* the exit point is found? */
        u32     round_start;    /* beginning of each round */
        u32     end_seq;        /* end_seq of the round */
        u32     last_ack;       /* last time when the ACK spacing is close */
        u32     curr_rtt;       /* the minimum rtt of current round */
};

static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
{
    u32 delta, bic_target, max_cnt;
    u64 offs, t;
    ca->ack_cnt += acked;/* count the number of ACKed packets */
    if (ca->last_cwnd == cwnd &&
        (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
        return;
    /* The CUBIC function can update ca->cnt at most once per jiffy.
     * On all cwnd reduction events, ca->epoch_start is set to 0,
     * which will force a recalculation of ca->cnt.
     */
    if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
        goto tcp_friendliness;
    ca->last_cwnd = cwnd;
    ca->last_time = tcp_jiffies32;
    if (ca->epoch_start == 0) {
        ca->epoch_start = tcp_jiffies32;/* record beginning */
        ca->ack_cnt = acked;/* start counting */
        ca->tcp_cwnd = cwnd;/* syn with cubic */
        if (ca->last_max_cwnd <= cwnd) {
            ca->bic_K = 0;
            ca->bic_origin_point = cwnd;
        } else {
            /* Compute new K based on
             * (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ)
             */
            ca->bic_K = cubic_root(cube_factor
                                   * (ca->last_max_cwnd - cwnd));
            ca->bic_origin_point = ca->last_max_cwnd;
        }
    }
    t = (s32)(tcp_jiffies32 - ca->epoch_start);
    t += msecs_to_jiffies(ca->delay_min >> 3);
    /* change the unit from HZ to bictcp_HZ */
    t <<= BICTCP_HZ;
    do_div(t, HZ);
    if (t < ca->bic_K)                            /* t - K */
        offs = ca->bic_K - t;
    else
        offs = t - ca->bic_K;
    /* c/rtt * (t-K)^3 */
    delta = (cube_rtt_scale * offs * offs * offs) >> (10+3*BICTCP_HZ);
    /* When the computed target differs from the current window, the size of
     * that gap decides how much the window grows in this round.
     * The two branches below correspond to the concave and convex regions
     * of the growth curve.
     */
    if (t < ca->bic_K)                            /* below origin*/
        bic_target = ca->bic_origin_point - delta;
    else                                          /* above origin*/
        bic_target = ca->bic_origin_point + delta;
    /* cubic function - calc bictcp_cnt*/
    if (bic_target > cwnd) {
        ca->cnt = cwnd / (bic_target - cwnd);
    } else {
        ca->cnt = 100 * cwnd;              /* very small increment*/
    }
    /*
     * The initial growth of cubic function may be too conservative
     * when the available bandwidth is still unknown.
     */
    if (ca->last_max_cwnd == 0 && ca->cnt > 20)
        ca->cnt = 20;/* increase cwnd 5% per RTT */
tcp_friendliness:
    /* TCP Friendly */
    if (tcp_friendliness) {
        u32 scale = beta_scale;

        delta = (cwnd * scale) >> 3;
        while (ca->ack_cnt > delta) {/* update tcp cwnd */
            ca->ack_cnt -= delta;
            ca->tcp_cwnd++;
        }
        if (ca->tcp_cwnd > cwnd) {/* if bic is slower than tcp */
            delta = ca->tcp_cwnd - cwnd;
            max_cnt = cwnd / delta;
            if (ca->cnt > max_cnt)
                ca->cnt = max_cnt;
        }
    }
    /* The maximum rate of cwnd increase CUBIC allows is 1 packet per
     * 2 packets ACKed, meaning cwnd grows at 1.5x per RTT.
     */
    ca->cnt = max(ca->cnt, 2U);
}
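The key output of bictcp_update() is ca->cnt, the number of ACKed packets needed before cwnd grows by one. The sketch below isolates just that step so the numbers can be tried by hand. `cubic_cnt` is a hypothetical helper; it leaves out the initial-growth cap and the TCP-friendliness adjustment shown above.

```c
#include <assert.h>
#include <stdint.h>

/* ACKs needed per +1 of cwnd, given the current window and the target
 * that the cubic function produced for one RTT from now.
 */
uint32_t cubic_cnt(uint32_t cwnd, uint32_t bic_target)
{
    uint32_t cnt;

    assert(cwnd > 0);
    if (bic_target > cwnd)
        cnt = cwnd / (bic_target - cwnd); /* close the gap in about one RTT */
    else
        cnt = 100 * cwnd;                 /* very small increment */

    /* Never grow faster than 1 packet per 2 ACKed packets. */
    return cnt < 2 ? 2 : cnt;
}
```

With cwnd = 100 and a target of 110, cnt is 10: one extra segment per 10 ACKs, so roughly 10 segments per round trip. A target at or below cwnd yields cnt = 10000 (almost no growth), and a far-away target of 400 is limited to cnt = 2.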

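Next comes TCP's congestion avoidance implementation. This helper is shared by the window-based algorithms such as Reno and CUBIC.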

/* In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd (or alternative w),
 * for every packet that was ACKed.
 */
void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked)
{
        /* If credits accumulated at a higher w, apply them gently now. */
        if (tp->snd_cwnd_cnt >= w) {
                tp->snd_cwnd_cnt = 0;
                tp->snd_cwnd++;
        }
        tp->snd_cwnd_cnt += acked;
        if (tp->snd_cwnd_cnt >= w) {
                u32 delta = tp->snd_cwnd_cnt / w;
                tp->snd_cwnd_cnt -= delta * w;
                tp->snd_cwnd += delta;
        }
        tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp);
}
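The credit counter is easier to follow with concrete numbers. The following user-space sketch repeats the same arithmetic; `struct ai_state` and `ai_cong_avoid` are hypothetical names standing in for `tcp_sock` and `tcp_cong_avoid_ai()`, and `w` plays the role of ca->cnt for CUBIC or of snd_cwnd for Reno.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the tcp_sock fields used here. */
struct ai_state {
    uint32_t snd_cwnd;       /* congestion window, in segments */
    uint32_t snd_cwnd_cnt;   /* ACK credits collected so far */
    uint32_t snd_cwnd_clamp; /* hard upper bound on snd_cwnd */
};

/* Grow cwnd by one segment for every w ACKed segments. */
void ai_cong_avoid(struct ai_state *s, uint32_t w, uint32_t acked)
{
    assert(w > 0);

    /* Credits collected under a larger w are applied gently. */
    if (s->snd_cwnd_cnt >= w) {
        s->snd_cwnd_cnt = 0;
        s->snd_cwnd++;
    }

    s->snd_cwnd_cnt += acked;
    if (s->snd_cwnd_cnt >= w) {
        uint32_t delta = s->snd_cwnd_cnt / w;

        s->snd_cwnd_cnt -= delta * w;
        s->snd_cwnd += delta;
    }
    if (s->snd_cwnd > s->snd_cwnd_clamp)
        s->snd_cwnd = s->snd_cwnd_clamp;
}
```

With w = 10, ten single ACKs raise cwnd from 10 to 11. With w = 2, one ACK covering 11 segments raises cwnd by 5 and leaves one credit for the next call.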

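Questions to think about:

1. From the analysis of the cubic algorithm, nothing on the rising part of the curve appears to limit the growth of the TCP window. Can the congestion window then grow without bound?

The bandwidth is taken as the smaller of the computed data send rate and the ACK receive rate.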

 send_rate = #pkts_delivered/(last_snd_time - first_snd_time)
 ack_rate  = #pkts_delivered/(last_ack_time - first_ack_time)
 bw = min(send_rate, ack_rate)
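Since both rates share the same numerator, taking the smaller rate is the same as dividing by the longer of the two intervals. A minimal sketch of that calculation, assuming microsecond timestamps and a result in packets per second (`bw_sample` and its parameters are hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

/* bw = min(send_rate, ack_rate), in packets per second. */
uint64_t bw_sample(uint32_t pkts_delivered,
                   uint64_t first_snd_us, uint64_t last_snd_us,
                   uint64_t first_ack_us, uint64_t last_ack_us)
{
    uint64_t snd_us = last_snd_us - first_snd_us;
    uint64_t ack_us = last_ack_us - first_ack_us;
    /* The slower of the two rates corresponds to the longer interval. */
    uint64_t interval_us = snd_us > ack_us ? snd_us : ack_us;

    assert(interval_us > 0);
    return (uint64_t)pkts_delivered * 1000000ULL / interval_us;
}
```

For example, 100 packets sent over 10 ms but ACKed over 20 ms give 5000 packets per second, because the ACK rate is the limiting one.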

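Reading the code, the send window is brought back down while TCP is in recovery mode or in window reduction mode.

snd_cwnd = packets sent but not yet ACKed + sndcnt (see the code analysis below)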

void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
{
        struct tcp_sock *tp = tcp_sk(sk);
        int sndcnt = 0;
        int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);

        if (newly_acked_sacked <= 0 || WARN_ON_ONCE(!tp->prior_cwnd))
                return;

        tp->prr_delivered += newly_acked_sacked;
        if (delta < 0) {
                u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
                               tp->prior_cwnd - 1;
                sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
        } else if ((flag & (FLAG_RETRANS_DATA_ACKED | FLAG_LOST_RETRANS)) ==
                   FLAG_RETRANS_DATA_ACKED) {
                sndcnt = min_t(int, delta,
                               max_t(int, tp->prr_delivered - tp->prr_out,
                                     newly_acked_sacked) + 1);
        } else {
                sndcnt = min(delta, newly_acked_sacked);
        }
        /* Force a fast retransmit upon entering fast recovery */
        sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
        tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
}
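The first branch (packets in flight above ssthresh, so delta < 0) is the proportional part: for every prior_cwnd packets delivered, only ssthresh packets may be sent. The sketch below isolates that branch. `prr_sndcnt` is a hypothetical helper, and the other two branches of tcp_cwnd_reduction() are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* How many packets may be sent now during proportional rate reduction. */
int prr_sndcnt(uint32_t ssthresh, uint32_t prior_cwnd,
               uint32_t prr_delivered, uint32_t prr_out)
{
    uint64_t dividend;
    int sndcnt;
    int floor_cnt = prr_out ? 0 : 1; /* force one fast retransmit at entry */

    assert(prior_cwnd > 0);

    /* ceil(ssthresh * prr_delivered / prior_cwnd) - prr_out */
    dividend = (uint64_t)ssthresh * prr_delivered + prior_cwnd - 1;
    sndcnt = (int)(dividend / prior_cwnd) - (int)prr_out;

    return sndcnt > floor_cnt ? sndcnt : floor_cnt;
}
```

With ssthresh = 7 and prior_cwnd = 10, the first three delivered packets each allow one transmission and the fourth allows none, so about 7 packets go out for every 10 delivered.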

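2. What happens when the application sends data far more slowly than the available bandwidth?

The app-limited state is triggered. Only BBR makes use of it at the moment; it will be analysed later.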

/* If a gap is detected between sends, mark the socket application-limited. */
void tcp_rate_check_app_limited(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (/* We have less than one packet to send. */
            tp->write_seq - tp->snd_nxt < tp->mss_cache &&
            /* Nothing in sending host's qdisc queues or NIC tx queue. */
            sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
            /* We are not limited by CWND. */
            tcp_packets_in_flight(tp) < tp->snd_cwnd &&
            /* All lost packets have been retransmitted. */
            tp->lost_out <= tp->retrans_out)
                tp->app_limited =
                        (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
}

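3. How TCP reordering arises

TCP reordering is not caused only by the peer sending out of order. The most likely cause is retransmission at a reliable link layer. For example, the Wi-Fi MAC layer may retransmit one frame out of a sequence of in-order TCP ACKs, which reorders them.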