The long-awaited TCP BBR v2.0 is almost out of the oven!

Copyright notice: this is an original post by the author, released with no copyright claim. Feel free to repost without permission or attribution, and to modify it or keep it as your own original! https://blog.csdn.net/dog250/article/details/80629551

This is the fourth rainy night in a row. With the storms these past few days I have been pulling half all-nighters: asleep around 10:30 pm, awake again a little past 1 am, writing to the sound of the rain while ignoring and mocking the mosquitoes and the airplanes.

The rainy season came late this year, but it came hard. I know this may be the last rainy night of the week, so I had to write or build something tonight. As it happens, I read a worthwhile topic on the bbr-dev list today and wanted to write it up, and that became this article.


Main text

In September 2016, Google released the BBR congestion control algorithm, which it had researched and tested for years and had already deployed in full on its B4 SDN backbone and on YouTube. It immediately drew wide attention.

As one of the first people to study BBR, I did plenty of analysis and testing myself and discussed it with a wide circle of enthusiasts. Not long after, the buzz around BBR faded, while many companies and individuals quietly kept tweaking, or rather blindly hacking on, BBR in private. I won't say more about that; it's tiresome.

Unlike the blind tweak-and-test approach of most domestic companies, Google proceeds methodically: they first run exploratory experiments on the QUIC protocol, and only after thorough debugging do they land the confirmed conclusions in TCP. In that sense, QUIC BBR is effectively a debug build of BBR carrying rich instrumentation, and that is indeed how it works: the 30-year-old TCP protocol carries too little information in its feedback path, so rather than experimenting on TCP directly, it is clearly better to build a debug version first, and QUIC is that debug version. It is like writing a program: the early debug builds carry lots of instrumentation so the program can be optimized thoroughly, and stripping the debug information out yields the release build.


Back in April of this year, a friend sent me Google's latest BBR slides:
https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00
which gave me some new ideas. The slides raise a very important problem: the TCP BBR stall problem, especially in wireless environments such as WiFi, which I had also mentioned before. The slides lay out BBR's own approach to solving it, and close with an optimistic outlook for BBR.

Does that mean BBR v2.0 is coming?

It still hasn't arrived... but now there is something to look forward to.


On April 5 of this year, two months ago, Neal Cardwell, one of BBR's authors, posted a new topic on the bbr-dev list:
* RFC: Linux TCP BBR patches for higher wifi throughput and lower queuing delays *
Follow the link below:
https://groups.google.com/forum/#!topic/bbr-dev/8pgyOyUavvY
and you will see what a feat this is. In short, the topic introduces a new patch set.

I am writing this article because many readers cannot open the links above, and even those who can probably do not know about them. So I thought I could act as a hub and write an article introducing their contents; or rather, I would prefer to act as a router, pointing you down a path to the destination.


First, the main content of the bbr-dev topic. I excerpt two passages below that describe what the patch set does.
The first passage introduces the BBR stall problem:

1: Higher throughput for wifi and other paths with aggregation

Aggregation effects are extremely common with wifi, cellular, and cable modem link technologies, ACK decimation in middleboxes, and LRO and GRO in receiving hosts. The aggregation can happen in either direction, data or ACKs, but in either case the aggregation effect is visible to the sender in the ACK stream.
.
Previously, BBR’s sending was often limited by cwnd under severe ACK aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets were ACKed in bursts after long delays then BBR stopped sending after sending 2*BDP, leaving the bottleneck idle for potentially long periods. Note that loss-based congestion control does not have this issue because when facing aggregation it continues increasing cwnd after bursts of ACKs, growing cwnd until the buffer is full.
.
To achieve good throughput in the presence of aggregation effects, this new algorithm allows the BBR sender to put extra data in flight to keep the bottleneck utilized during silences in the ACK stream that it has evidence to suggest were caused by aggregation.

The second passage introduces BBR's slow-convergence problem:

2: Lower queuing delays by frequently draining excess in-flight data
.
In BBR v1.0 the “drain” phase of the pacing gain cycle holds the pacing_gain to 0.75 for essentially 1*min_rtt (or less if inflight falls below the BDP).
.
This patch modifies the behavior of this “drain” phase to attempt to “drain to target”, adaptively holding this “drain” phase until inflight reaches the target level that matches the estimated BDP (bandwidth-delay product).
.
This can significantly reduce the amount of data queued at the bottleneck, and hence reduce queuing delay and packet loss, in cases where there are multiple flows sharing a bottleneck.

Now for my take.


Of these two problems, I had been watching BBR's slow convergence since late 2016, but never found a particularly good solution. At first I could only attack it in the following crude way:

    /* A pacing_gain < 1.0 tries to drain extra queue we added if bw
     * probing didn't find more bw. If inflight falls to match BDP then we
     * estimate queue is drained; persisting would underutilize the pipe.
     */
    return is_full_length && // just changed || to && to force a one-shot hard convergence
    //return is_full_length ||
        inflight <= bbr_target_cwnd(sk, bw, BBR_UNIT);

This reckless fix of course did not meet expectations. A few friends tested it and reported that packet loss did indeed drop, but I figured that must depend on many constrained scenarios, so I remained skeptical of the test conclusions. I have never believed that this kind of off-the-cuff, code-level micro-tweaking can move performance in the right direction, and I would sneer at it even coming from my own hands. But maybe I was truly wrong, and wrong at the critical moment: that change of mine (surely alongside other changes, of which it may have been only one) may have been correct!

Later I actually let that code go and kept it in its original form. But even if I could not drain to target in one shot, I did not want to then walk through six steady gain-1.0 cycles afterwards; that felt too long. So I made another optimization: change the number of steady gain-1.0 cycles from a fixed 6 to a random value between 2 and 10:

static void bbr_advance_cycle_phase(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);

    bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
    if (bbr->cycle_idx == 2) {
        /* randomize the steady gain-1.0 run by picking a start index in
         * [2, 12]; note this assumes cycle_idx is widened beyond 3 bits
         * and bbr_pacing_gain is padded with BBR_UNIT entries to index 12 */
        bbr->cycle_idx = 2 + next_pseudo_random32(jiffies) % 11;
    }
    bbr->cycle_mstamp = tp->delivered_mstamp;
    bbr->pacing_gain = bbr_pacing_gain[bbr->cycle_idx];
}

This time the results were decent. I chose 2 to 10 rather than 2 to 7 because randomization can shorten the original six steady cycles by up to four, so with equal probability it should also be able to lengthen them by four; and I believe in probability. That is how I arrived at 6−4 to 6+4.

I did not dare change the constants themselves; at most I let the data balance around the constant. I originally wanted to use a normal distribution, but it was not easy to implement...

In a network whose traffic changes from moment to moment, keeping a fixed heartbeat is not a wise choice. The fixed gearbox of 1.25–0.75–1–1–1–1–1–1 cannot respond correctly to a changing environment. This intuition comes from everyday driving: on a highway or a national road you are constantly on and off the throttle and the brake. BBR should work the same way, choosing 1.25, 0.75, or 1 according to the situation.

After randomizing the cycle as needed, I went on to randomize the other fixed constants in the BBR code, turning them into probability distributions, in order to eliminate the so-called global synchronization phenomenon. It did disappear, and fairness improved considerably.

What troubled me throughout was how hard it is to evaluate BBR's behavior mathematically. Papers on algorithms like CUBIC and Vegas give a mathematical model of the algorithm, but BBR has no such precise model to be found; it is more an empirical, engineering-driven algorithm than one derived from a mathematical model. Consequently, its curve has no fixed shape either.

We know the Reno/CUBIC curve is a sawtooth. I asked the boss of the Wenzhou leather-shoe factory what BBR's curve looks like, and he had no answer. In fact, the BBR v1.0 curve looks a lot like a human electrocardiogram, like a fixed clock; that fixed ebb and flow is not essentially different from CUBIC! So the peaks and troughs on the curve must be scattered according to actual conditions.

Let's see how the BBR v1.5 patch set shown today solves this. First, it retires the fixed cycle-advancing pattern:

static void bbr_advance_cycle_phase(struct sock *sk,
                                    const struct rate_sample *rs)
{
    /* rs is threaded through from the caller so that the
     * drain-to-target logic can use the latest rate sample */
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);

    if (bbr_drain_to_target) {
        // the scattered cycle-advancing logic lives here!
        bbr_drain_to_target_cycling(sk, rs);
        return;
    }
    // the fixed cycle-advancing logic is destined to become history
    bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
    bbr->cycle_mstamp = tp->delivered_mstamp;
    bbr->pacing_gain = bbr_pacing_gain[bbr->cycle_idx];
}

Next, let's look at what lies beneath the iceberg. Yes, curiosity drives us forward (and envy drives us underground?):

static void bbr_drain_to_target_cycling(struct sock *sk,
                                        const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    // elapsed_us is the time from the start of this cycle until now.
    u32 elapsed_us =
            tcp_stamp_us_delta(tp->delivered_mstamp, bbr->cycle_mstamp);
    u32 inflight, bw;
    if (bbr->mode != BBR_PROBE_BW)
        return;

    /* Always need to probe for bw before we forget good bw estimate. */
    // If more than a full cycle's worth of time has elapsed, start a new cycle.
    // Note: the cycle length is randomized; it is no longer fixed at 8
    // phases but a random value between 2 and 8!
    // Why didn't I think of randomizing it globally???
    if (elapsed_us > bbr->cycle_len * bbr->min_rtt_us) {
        /* Start a new PROBE_BW probing cycle of [2 to 8] x min_rtt. */
        bbr->cycle_mstamp = tp->delivered_mstamp;
        bbr->cycle_len = CYCLE_LEN - prandom_u32_max(bbr_cycle_rand);
        // every cycle starts from the ProbeMore (PROBE_UP) phase!
        bbr_set_cycle_idx(sk, BBR_BW_PROBE_UP);  /* probe bandwidth */
        return;
    }
    /* The pacing_gain of 1.0 paces at the estimated bw to try to fully
     * use the pipe without increasing the queue.
     */
     // Since every cycle starts from ProbeMore, a steady cruise phase can
     // only be entered from a Drain phase. Once in the steady cruise
     // phase, nothing changes until the cycle ends. Think about it: this
     // is exactly consistent with my cycle-randomization change!!
    if (bbr->pacing_gain == BBR_UNIT)
        return;
    inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
    bw = bbr_max_bw(sk);
    /* A pacing_gain < 1.0 tries to drain extra queue we added if bw
     * probing didn't find more bw. If inflight falls to match BDP then we
     * estimate queue is drained; persisting would underutilize the pipe.
     */
     // If we are in a Drain phase, keep draining until inflight reaches the target!
    if (bbr->pacing_gain < BBR_UNIT) {
        if (inflight <= bbr_inflight(sk, bw, BBR_UNIT))
            // once the drain succeeds, we enter the steady cruise phase.
            bbr_set_cycle_idx(sk, BBR_BW_PROBE_CRUISE); /* cruise */
        return;
    }
    /* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
     * least pacing_gain*BDP; this may take more than min_rtt if min_rtt is
     * small (e.g. on a LAN). We do not persist if packets are lost, since
     * a path with small buffers may not hold that much. Similarly we exit
     * if we were prevented by app/recv-win from reaching the target.
     */
     // Every cycle starts from ProbeMore; we move to Drain only on packet
     // loss, RTT inflation, and similar signals.
     // Note: elapsed_us counts from the start of this cycle; dividing it
     // by the RTT gives exactly how many RTTs have passed. The condition
     // for leaving ProbeMore is that at least one RTT has elapsed,
     // exactly as in BBR v1.0.
    if (elapsed_us > bbr->min_rtt_us &&
            (inflight >= bbr_inflight(sk, bw, bbr->pacing_gain) ||
            rs->losses ||         /* perhaps pacing_gain*BDP won't fit */
            rs->is_app_limited || /* previously app-limited */
            !tcp_send_head(sk) || /* currently app/rwin-limited */
            !tcp_snd_wnd_test(tp, tcp_send_head(sk), tp->mss_cache))) {
            bbr_set_cycle_idx(sk, BBR_BW_PROBE_DOWN);  /* drain queue */
            return;
    }
}

This patch shows my earlier line of thinking was right: changing that "||" to "&&" was correct, and adding my randomized steady-cruise phase on top yields essentially this code. The "||"-to-"&&" switch is just a small code-level trick; what really matters, in my view, is randomizing the cycle and then draining to target in one shot.

That is all on BBR's slow convergence for now. I believe this patch is only the beginning of solving the problem, and we would all much rather face a beginning than an ending; no matter how good an ending is, it is never as good as a stumbling, naive beginning.

Let's continue to the patch's other theme: the BBR stall problem.


On this problem I had a solution of my own as well:
A small trick and a small patch for TCP BBR stall control: https://blog.csdn.net/dog250/article/details/80203520
At that time I had already read the slides at https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00, but I had not yet seen the code in today's patch set, so I wrote something completely different: instead of a windowed max I used an exponentially weighted moving average. My implementation actually has a flaw: the measurement period is too short, because I recompute the rate on every ACK arrival. Doesn't that sacrifice precision?

Seeing the worthy, I aspire to equal them: I still prefer Google's approach. This is not a lack of self-confidence; on this particular point I genuinely fell short. On the slow-convergence fix above, though, I remain fairly confident.

Today I will briefly describe how Google uses a windowed max to record the extra ACKed data and from it compute an extra cwnd.

No amount of preaching beats showing the code. First, bbr_set_cwnd:

static void bbr_set_cwnd(struct sock *sk, const struct rate_sample *rs,
             u32 acked, u32 bw, int gain)
{
    ...
    target_cwnd = bbr_bdp(sk, bw, gain);
    /* Increment the cwnd to account for excess ACKed data that seems
     * due to aggregation (of data and/or ACKs) visible in the ACK stream.
     */
    // This is the part we care about: it adds an extra cwnd on top of cwnd
    target_cwnd += bbr_ack_aggregation_cwnd(sk);
    ...
}

Follow bbr_ack_aggregation_cwnd to see what is going on:

static u32 bbr_ack_aggregation_cwnd(struct sock *sk)
{
    u32 max_aggr_cwnd, aggr_cwnd = 0;
    if (bbr_extra_acked_gain && bbr_full_bw_reached(sk)) {
        // compute an upper bound
        max_aggr_cwnd = ((u64)bbr_bw(sk) * bbr_extra_acked_max_us)
            / BW_UNIT;
        // bbr_extra_acked is the key part of this line; as for yet another gain, treat it as an empirical constant!
        aggr_cwnd = (bbr_extra_acked_gain * bbr_extra_acked(sk))
            >> BBR_SCALE;
        aggr_cwnd = min(aggr_cwnd, max_aggr_cwnd);
    }
    return aggr_cwnd;
}

I would say that although this may become a built-in feature of BBR 2.0, it is really a full-fledged stall-free BBR v1.0. It, too, is a beginning, the beginning of solving the stall problem; do not demand precision.

Next, let's see where bbr_extra_acked gets its value:

/* Return maximum extra acked in past k-2k round trips,
 * where k = bbr_extra_acked_win_rtts.
 */
static u16 bbr_extra_acked(const struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);
    return max(bbr->extra_acked[0], bbr->extra_acked[1]);
}

By now it should be clear that this takes the maximum over the previous two windows, where a window is 10 RTTs. It cleverly uses a pair of swap buffers: two containers, new and old, used alternately. When new fills up, it is dumped and becomes old, and old becomes new. The same trick appears in the O(1) scheduler and in the RCU implementation; it is a common technique.

Finally, let's look at the function that computes the extra ACKed value. Reading the comments alone would probably suffice, but then you might as well just read the slides. To show the implementation tricks, nothing beats pasting the code:

/* Estimates the windowed max degree of ack aggregation.
 * This is used to provision extra in-flight data to keep sending during
 * inter-ACK silences.
 *
 * Degree of ack aggregation is estimated as extra data acked beyond expected.
 *
 * max_extra_acked = "maximum recent excess data ACKed beyond max_bw * interval"
 * cwnd += max_extra_acked
 *
 * Max extra_acked is clamped by cwnd and bw * bbr_extra_acked_max_us (100 ms).
 * Max filter is an approximate sliding window of 10-20 (packet timed) round
 * trips.
 */
static void bbr_update_ack_aggregation(struct sock *sk,
                                       const struct rate_sample *rs)
{
    u32 epoch_us, expected_acked, extra_acked;
    struct bbr *bbr = inet_csk_ca(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    if (!bbr_extra_acked_gain || rs->acked_sacked <= 0 ||
        rs->delivered < 0 || rs->interval_us <= 0)
        return;
    if (bbr->round_start) {
        bbr->extra_acked_win_rtts = min(0x1F,
                                        bbr->extra_acked_win_rtts + 1);
        // every 10 RTTs form one window of the computation.
        if (bbr->extra_acked_win_rtts >= bbr_extra_acked_win_rtts) {
            bbr->extra_acked_win_rtts = 0;
            // alternate between the two extra_acked swap buffers
            bbr->extra_acked_win_idx = bbr->extra_acked_win_idx ? 0 : 1;
            bbr->extra_acked[bbr->extra_acked_win_idx] = 0;
        }   
    }
    /* Compute how many packets we expected to be delivered over epoch. */
    // epoch_us is the interval from the start of the ACK sampling epoch
    // (when ACKs began to exceed expectations) until now!
    epoch_us = tcp_stamp_us_delta(tp->delivered_mstamp,
                                    bbr->ack_epoch_mstamp);
    // the obvious expectation formula: bw * t = acked.
    expected_acked = ((u64)bbr_bw(sk) * epoch_us) / BW_UNIT;
    /* Reset the aggregation epoch if ACK rate is below expected rate or
     * significantly large no. of ack received since epoch (potentially
     * quite old epoch).
     */
    // Timing starts only once we receive more ACKs/SACKs than expected;
    // accumulating too many resets the epoch as well.
    if (bbr->ack_epoch_acked <= expected_acked ||
        (bbr->ack_epoch_acked + rs->acked_sacked >=
        bbr_ack_epoch_acked_reset_thresh)) {
        bbr->ack_epoch_acked = 0;
        bbr->ack_epoch_mstamp = tp->delivered_mstamp;
        expected_acked = 0;
    }
    /* Compute excess data delivered, beyond what was expected. */
    bbr->ack_epoch_acked = min(0xFFFFFU,
                                bbr->ack_epoch_acked + rs->acked_sacked);
    // actual ACKed/SACKed count minus the expected count gives the extra.
    extra_acked = bbr->ack_epoch_acked - expected_acked;
    extra_acked = min(extra_acked, tp->snd_cwnd);
    // put it into the container for the set-cwnd logic to pick up.
    if (extra_acked > bbr->extra_acked[bbr->extra_acked_win_idx])
        bbr->extra_acked[bbr->extra_acked_win_idx] = extra_acked;
}

This logic is not hard to understand once you think it through; the key is to implement it, and to implement it correctly.

Often, while the idea matters, implementing the idea correctly matters more, because that shapes our understanding. An idea is an approach to solving a problem, and nine times out of ten it is correct; believe me, it really is correct nine times out of ten. Unless someone is deliberately sabotaging, everyone trying to solve a problem pushes in the same direction; the difference lies only in how far each gets. Once the problem is understood, can the solution be far behind?

When a series of methods has been implemented and deployed and still misses expectations, it is often your implementation that is wrong, not the idea. This is where people's caliber diverges: the strong first audit the implementation, and only once it is beyond doubt do they re-examine the idea, whereas the rabble attack the idea directly, which pushes nothing forward.

The great hermit hides in the WeChat group; the lesser hermit hides in the meeting room. Before anything is implemented: do more, talk less. Anyone can theorize, and one person's reasoning is not necessarily smarter than another's.

Google's open-source projects communicate almost entirely on public mailing-list groups. Any successful project works this way: before something is implemented, every proposal remains open to debate.


That is it for now. If I have time during the day this weekend, I will walk through another BBR detail; if not, so be it. By the way, there is an appendix at the end: the complete patched tcp_bbr.c for the 4.14 kernel. We can test it together; a US or Japan VPS would be the conventional choice.


Appendix

Here is the patched tcp_bbr.c in full, based on the 4.14 kernel:

/* Bottleneck Bandwidth and RTT (BBR) congestion control
 *
 * BBR congestion control computes the sending rate based on the delivery
 * rate (throughput) estimated from ACKs. In a nutshell:
 *
 *   On each ACK, update our model of the network path:
 *      bottleneck_bandwidth = windowed_max(delivered / elapsed, 10 round trips)
 *      min_rtt = windowed_min(rtt, 10 seconds)
 *   pacing_rate = pacing_gain * bottleneck_bandwidth
 *   cwnd = max(cwnd_gain * bottleneck_bandwidth * min_rtt, 4)
 *
 * The core algorithm does not react directly to packet losses or delays,
 * although BBR may adjust the size of next send per ACK when loss is
 * observed, or adjust the sending rate if it estimates there is a
 * traffic policer, in order to keep the drop rate reasonable.
 *
 * Here is a state transition diagram for BBR:
 *
 *             |
 *             V
 *    +---> STARTUP  ----+
 *    |        |         |
 *    |        V         |
 *    |      DRAIN   ----+
 *    |        |         |
 *    |        V         |
 *    +---> PROBE_BW ----+
 *    |      ^    |      |
 *    |      |    |      |
 *    |      +----+      |
 *    |                  |
 *    +---- PROBE_RTT <--+
 *
 * A BBR flow starts in STARTUP, and ramps up its sending rate quickly.
 * When it estimates the pipe is full, it enters DRAIN to drain the queue.
 * In steady state a BBR flow only uses PROBE_BW and PROBE_RTT.
 * A long-lived BBR flow spends the vast majority of its time remaining
 * (repeatedly) in PROBE_BW, fully probing and utilizing the pipe's bandwidth
 * in a fair manner, with a small, bounded queue. *If* a flow has been
 * continuously sending for the entire min_rtt window, and hasn't seen an RTT
 * sample that matches or decreases its min_rtt estimate for 10 seconds, then
 * it briefly enters PROBE_RTT to cut inflight to a minimum value to re-probe
 * the path's two-way propagation delay (min_rtt). When exiting PROBE_RTT, if
 * we estimated that we reached the full bw of the pipe then we enter PROBE_BW;
 * otherwise we enter STARTUP to try to fill the pipe.
 *
 * BBR is described in detail in:
 *   "BBR: Congestion-Based Congestion Control",
 *   Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh,
 *   Van Jacobson. ACM Queue, Vol. 14 No. 5, September-October 2016.
 *
 * There is a public e-mail list for discussing BBR development and testing:
 *   https://groups.google.com/forum/#!forum/bbr-dev
 *
 * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled,
 * otherwise TCP stack falls back to an internal pacing using one high
 * resolution timer per TCP socket and may use more resources.
 */
#include <linux/module.h>
#include <net/tcp.h>
#include <linux/inet_diag.h>
#include <linux/inet.h>
#include <linux/random.h>
#include <linux/win_minmax.h>

/* Scale factor for rate in pkt/uSec unit to avoid truncation in bandwidth
 * estimation. The rate unit ~= (1500 bytes / 1 usec / 2^24) ~= 715 bps.
 * This handles bandwidths from 0.06pps (715bps) to 256Mpps (3Tbps) in a u32.
 * Since the minimum window is >=4 packets, the lower bound isn't
 * an issue. The upper bound isn't an issue with existing technologies.
 */
#define BW_SCALE 24
#define BW_UNIT (1 << BW_SCALE)

#define BBR_SCALE 8 /* scaling factor for fractions in BBR (e.g. gains) */
#define BBR_UNIT (1 << BBR_SCALE)

/* BBR has the following modes for deciding how fast to send: */
enum bbr_mode {
    BBR_STARTUP,    /* ramp up sending rate rapidly to fill pipe */
    BBR_DRAIN,  /* drain any queue created during startup */
    BBR_PROBE_BW,   /* discover, share bw: pace around estimated bw */
    BBR_PROBE_RTT,  /* cut inflight to min to probe min_rtt */
};

/* BBR congestion control block */
struct bbr {
    u32 min_rtt_us;         /* min RTT in min_rtt_win_sec window */
    u32 min_rtt_stamp;          /* timestamp of min_rtt_us */
    u32 probe_rtt_done_stamp;   /* end time for BBR_PROBE_RTT mode */
    struct minmax bw;   /* Max recent delivery rate in pkts/uS << 24 */
    u32 rtt_cnt;        /* count of packet-timed rounds elapsed */
    u32     next_rtt_delivered; /* scb->tx.delivered at end of round */
    u64 cycle_mstamp;        /* time of this cycle phase start */
    u32     mode:3,          /* current bbr_mode in state machine */
        prev_ca_state:3,     /* CA state on previous ACK */
        packet_conservation:1,  /* use packet conservation? */
        restore_cwnd:1,      /* decided to revert cwnd to old value */
        round_start:1,       /* start of packet-timed tx->ack round? */
        cycle_len:4,         /* phases in this PROBE_BW gain cycle */
        tso_segs_goal:7,     /* segments we want in each skb we send */
        idle_restart:1,      /* restarting after idle? */
        probe_rtt_round_done:1,  /* a BBR_PROBE_RTT round at 4 pkts? */
        unused:8,
        lt_is_sampling:1,    /* taking long-term ("LT") samples now? */
        lt_rtt_cnt:7,        /* round trips in long-term interval */
        lt_use_bw:1;         /* use lt_bw as our bw estimate? */
    u32 lt_bw;           /* LT est delivery rate in pkts/uS << 24 */
    u32 lt_last_delivered;   /* LT intvl start: tp->delivered */
    u32 lt_last_stamp;       /* LT intvl start: tp->delivered_mstamp */
    u32 lt_last_lost;        /* LT intvl start: tp->lost */
    u32 pacing_gain:10, /* current gain for setting pacing rate */
        cwnd_gain:10,   /* current gain for setting cwnd */
        full_bw_cnt:3,  /* number of rounds without large bw gains */
        cycle_idx:3,    /* current index in pacing_gain cycle array */
        has_seen_rtt:1, /* have we seen an RTT sample yet? */
        unused_b:5;
    u32 prior_cwnd; /* prior cwnd upon entering loss recovery */
    u32 full_bw;    /* recent bw, to estimate if pipe is full */
    /* For tracking ACK aggregation: */
    u64 ack_epoch_mstamp;   /* start of ACK sampling epoch */
    u16 extra_acked[2];     /* max excess data ACKed in epoch */
    u32 ack_epoch_acked:20, /* packets (S)ACKed in sampling epoch */
        extra_acked_win_rtts:5, /* age of extra_acked, in round trips */
        extra_acked_win_idx:1,  /* current index in extra_acked array */
        unused1:6;
};

#define CYCLE_LEN   8   /* number of phases in a pacing gain cycle */

/* Window length of bw filter (in rounds): */
static const int bbr_bw_rtts = CYCLE_LEN + 2;
/* Window length of min_rtt filter (in sec): */
static const u32 bbr_min_rtt_win_sec = 10;
/* Minimum time (in ms) spent at bbr_cwnd_min_target in BBR_PROBE_RTT mode: */
static const u32 bbr_probe_rtt_mode_ms = 200;
/* Skip TSO below the following bandwidth (bits/sec): */
static const int bbr_min_tso_rate = 1200000;

/* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain
 * that will allow a smoothly increasing pacing rate that will double each RTT
 * and send the same number of packets per RTT that an un-paced, slow-starting
 * Reno or CUBIC flow would:
 */
static const int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;
/* The pacing gain of 1/high_gain in BBR_DRAIN is calculated to typically drain
 * the queue created in BBR_STARTUP in a single round:
 */
static const int bbr_drain_gain = BBR_UNIT * 1000 / 2885;
/* The gain for deriving steady-state cwnd tolerates delayed/stretched ACKs: */
static const int bbr_cwnd_gain  = BBR_UNIT * 2;

enum bbr_pacing_gain_phase {
    BBR_BW_PROBE_UP     = 0,
    BBR_BW_PROBE_DOWN   = 1,
    BBR_BW_PROBE_CRUISE = 2,
};


/* The pacing_gain values for the PROBE_BW gain cycle, to discover/share bw: */
static const int bbr_pacing_gain[] = {
    BBR_UNIT * 5 / 4,   /* probe for more available bw */
    BBR_UNIT * 3 / 4,   /* drain queue and/or yield bw to other flows */
    BBR_UNIT, BBR_UNIT, BBR_UNIT,   /* cruise at 1.0*bw to utilize pipe, */
    BBR_UNIT, BBR_UNIT, BBR_UNIT    /* without creating excess queue... */
};
/* Randomize the starting gain cycling phase over N phases: */
static const u32 bbr_cycle_rand = 7;

/* Try to keep at least this many packets in flight, if things go smoothly. For
 * smooth functioning, a sliding window protocol ACKing every other packet
 * needs at least 4 packets in flight:
 */
static const u32 bbr_cwnd_min_target = 4;

/* To estimate if BBR_STARTUP mode (i.e. high_gain) has filled pipe... */
/* If bw has increased significantly (1.25x), there may be more bw available: */
static const u32 bbr_full_bw_thresh = BBR_UNIT * 5 / 4;
/* But after 3 rounds w/o significant bw growth, estimate pipe is full: */
static const u32 bbr_full_bw_cnt = 3;

/* "long-term" ("LT") bandwidth estimator parameters... */
/* The minimum number of rounds in an LT bw sampling interval: */
static const u32 bbr_lt_intvl_min_rtts = 4;
/* If lost/delivered ratio > 20%, interval is "lossy" and we may be policed: */
static const u32 bbr_lt_loss_thresh = 50;
/* If 2 intervals have a bw ratio <= 1/8, their bw is "consistent": */
static const u32 bbr_lt_bw_ratio = BBR_UNIT / 8;
/* If 2 intervals have a bw diff <= 4 Kbit/sec their bw is "consistent": */
static const u32 bbr_lt_bw_diff = 4000 / 8;
/* If we estimate we're policed, use lt_bw for this many round trips: */
static const u32 bbr_lt_bw_max_rtts = 48;

/* Gain factor for adding extra_acked to target cwnd: */
static const int bbr_extra_acked_gain = BBR_UNIT;
/* Window length of extra_acked window. Max allowed val is 31. */
static const u32 bbr_extra_acked_win_rtts = 10;
/* Max allowed val for ack_epoch_acked, after which sampling epoch is reset */
static const u32 bbr_ack_epoch_acked_reset_thresh = 1U << 20;
/* Time period for clamping cwnd increment due to ack aggregation */
static const u32 bbr_extra_acked_max_us = 100 * 1000;

/* Each cycle, try to hold sub-unity gain until inflight <= BDP. */
static const bool bbr_drain_to_target = true;   /* default: enabled */


/* Do we estimate that STARTUP filled the pipe? */
static bool bbr_full_bw_reached(const struct sock *sk)
{
    const struct bbr *bbr = inet_csk_ca(sk);

    return bbr->full_bw_cnt >= bbr_full_bw_cnt;
}

static void bbr_set_cycle_idx(struct sock *sk, int cycle_idx)
{
    struct bbr *bbr = inet_csk_ca(sk);
    bbr->cycle_idx = cycle_idx;
    bbr->pacing_gain = bbr->lt_use_bw ?
                            BBR_UNIT : bbr_pacing_gain[bbr->cycle_idx];
}

u32 bbr_max_bw(const struct sock *sk);
u32 bbr_inflight(struct sock *sk, u32 bw, int gain);

static void bbr_drain_to_target_cycling(struct sock *sk,
                                                    const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u32 elapsed_us =
                tcp_stamp_us_delta(tp->delivered_mstamp, bbr->cycle_mstamp);
    u32 inflight, bw;
    if (bbr->mode != BBR_PROBE_BW)
        return;

    /* Always need to probe for bw before we forget good bw estimate. */
    if (elapsed_us > bbr->cycle_len * bbr->min_rtt_us) {
        /* Start a new PROBE_BW probing cycle of [2 to 8] x min_rtt. */
        bbr->cycle_mstamp = tp->delivered_mstamp;
        bbr->cycle_len = CYCLE_LEN - prandom_u32_max(bbr_cycle_rand);
        bbr_set_cycle_idx(sk, BBR_BW_PROBE_UP);  /* probe bandwidth */
        return;
    }
    /* The pacing_gain of 1.0 paces at the estimated bw to try to fully
     * use the pipe without increasing the queue.
     */
    if (bbr->pacing_gain == BBR_UNIT)
        return;
    inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
    bw = bbr_max_bw(sk);
    /* A pacing_gain < 1.0 tries to drain extra queue we added if bw
     * probing didn't find more bw. If inflight falls to match BDP then we
     * estimate queue is drained; persisting would underutilize the pipe.
     */
    if (bbr->pacing_gain < BBR_UNIT) {
        if (inflight <= bbr_inflight(sk, bw, BBR_UNIT))
            bbr_set_cycle_idx(sk, BBR_BW_PROBE_CRUISE); /* cruise */
        return;
    }
    /* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
     * least pacing_gain*BDP; this may take more than min_rtt if min_rtt is
     * small (e.g. on a LAN). We do not persist if packets are lost, since
     * a path with small buffers may not hold that much. Similarly we exit
     * if we were prevented by app/recv-win from reaching the target.
     */
    if (elapsed_us > bbr->min_rtt_us &&
            (inflight >= bbr_inflight(sk, bw, bbr->pacing_gain) ||
            rs->losses ||         /* perhaps pacing_gain*BDP won't fit */
            rs->is_app_limited || /* previously app-limited */
            !tcp_send_head(sk) || /* currently app/rwin-limited */
            !tcp_snd_wnd_test(tp, tcp_send_head(sk), tp->mss_cache))) {
            bbr_set_cycle_idx(sk, BBR_BW_PROBE_DOWN);  /* drain queue */
            return;
    }
}


/* Return maximum extra acked in past k-2k round trips,
 * where k = bbr_extra_acked_win_rtts.
 */
static u16 bbr_extra_acked(const struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);
    return max(bbr->extra_acked[0], bbr->extra_acked[1]);
}


/* Return the windowed max recent bandwidth sample, in pkts/uS << BW_SCALE. */
u32 bbr_max_bw(const struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);

    return minmax_get(&bbr->bw);
}

/* Return the estimated bandwidth of the path, in pkts/uS << BW_SCALE. */
static u32 bbr_bw(const struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);

    return bbr->lt_use_bw ? bbr->lt_bw : bbr_max_bw(sk);
}

/* Return rate in bytes per second, optionally with a gain.
 * The order here is chosen carefully to avoid overflow of u64. This should
 * work for input rates of up to 2.9Tbit/sec and gain of 2.89x.
 */
static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
{
    rate *= tcp_mss_to_mtu(sk, tcp_sk(sk)->mss_cache);
    rate *= gain;
    rate >>= BBR_SCALE;
    rate *= USEC_PER_SEC;
    return rate >> BW_SCALE;
}

/* Convert a BBR bw and gain factor to a pacing rate in bytes per second. */
static u32 bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
{
    u64 rate = bw;

    rate = bbr_rate_bytes_per_sec(sk, rate, gain);
    rate = min_t(u64, rate, sk->sk_max_pacing_rate);
    return rate;
}

/* Initialize pacing rate to: high_gain * init_cwnd / RTT. */
static void bbr_init_pacing_rate_from_rtt(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u64 bw;
    u32 rtt_us;

    if (tp->srtt_us) {      /* any RTT sample yet? */
        rtt_us = max(tp->srtt_us >> 3, 1U);
        bbr->has_seen_rtt = 1;
    } else {             /* no RTT sample yet */
        rtt_us = USEC_PER_MSEC;  /* use nominal default RTT */
    }
    bw = (u64)tp->snd_cwnd * BW_UNIT;
    do_div(bw, rtt_us);
    sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain);
}

/* Pace using current bw estimate and a gain factor. In order to help drive the
 * network toward lower queues while maintaining high utilization and low
 * latency, the average pacing rate aims to be slightly (~1%) lower than the
 * estimated bandwidth. This is an important aspect of the design. In this
 * implementation this slightly lower pacing rate is achieved implicitly by not
 * including link-layer headers in the packet size used for the pacing rate.
 */
static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u32 rate = bbr_bw_to_pacing_rate(sk, bw, gain);

    if (unlikely(!bbr->has_seen_rtt && tp->srtt_us))
        bbr_init_pacing_rate_from_rtt(sk);
    if (bbr_full_bw_reached(sk) || rate > sk->sk_pacing_rate)
        sk->sk_pacing_rate = rate;
}

/* Return count of segments we want in the skbs we send, or 0 for default. */
static u32 bbr_tso_segs_goal(struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);

    return bbr->tso_segs_goal;
}

static void bbr_set_tso_segs_goal(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u32 min_segs;

    min_segs = sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
    bbr->tso_segs_goal = min(tcp_tso_autosize(sk, tp->mss_cache, min_segs),
                 0x7FU);
}

/* Save "last known good" cwnd so we can restore it after losses or PROBE_RTT */
static void bbr_save_cwnd(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);

    if (bbr->prev_ca_state < TCP_CA_Recovery && bbr->mode != BBR_PROBE_RTT)
        bbr->prior_cwnd = tp->snd_cwnd;  /* this cwnd is good enough */
    else  /* loss recovery or BBR_PROBE_RTT have temporarily cut cwnd */
        bbr->prior_cwnd = max(bbr->prior_cwnd, tp->snd_cwnd);
}

static void bbr_cwnd_event(struct sock *sk, enum tcp_ca_event event)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);

    if (event == CA_EVENT_TX_START && tp->app_limited) {
        bbr->idle_restart = 1;
        bbr->ack_epoch_mstamp = tp->tcp_mstamp;
        bbr->ack_epoch_acked = 0;

        /* Avoid pointless buffer overflows: pace at est. bw if we don't
         * need more speed (we're restarting from idle and app-limited).
         */
        if (bbr->mode == BBR_PROBE_BW)
            bbr_set_pacing_rate(sk, bbr_bw(sk), BBR_UNIT);
    }
}

/* Find target cwnd. Right-size the cwnd based on min RTT and the
 * estimated bottleneck bandwidth:
 *
 * cwnd = bw * min_rtt * gain = BDP * gain
 *
 * The key factor, gain, controls the amount of queue. While a small gain
 * builds a smaller queue, it becomes more vulnerable to noise in RTT
 * measurements (e.g., delayed ACKs or other ACK compression effects). This
 * noise may cause BBR to under-estimate the rate.
 *
 * To achieve full performance in high-speed paths, we budget enough cwnd to
 * fit full-sized skbs in-flight on both end hosts to fully utilize the path:
 *   - one skb in sending host Qdisc,
 *   - one skb in sending host TSO/GSO engine
 *   - one skb being received by receiver host LRO/GRO/delayed-ACK engine
 * Don't worry, at low rates (bbr_min_tso_rate) this won't bloat cwnd because
 * in such cases tso_segs_goal is 1. The minimum cwnd is 4 packets,
 * which allows 2 outstanding 2-packet sequences, to try to keep pipe
 * full even with ACK-every-other-packet delayed ACKs.
 */
static u32 bbr_bdp(struct sock *sk, u32 bw, int gain)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 bdp;
    u64 w;

    /* If we've never had a valid RTT sample, cap cwnd at the initial
     * default. This should only happen when the connection is not using TCP
     * timestamps and has retransmitted all of the SYN/SYNACK/data packets
     * ACKed so far. In this case, an RTO can cut cwnd to 1, in which
     * case we need to slow-start up toward something safe: TCP_INIT_CWND.
     */
    if (unlikely(bbr->min_rtt_us == ~0U))   /* no valid RTT samples yet? */
        return TCP_INIT_CWND;  /* be safe: cap at default initial cwnd */

    w = (u64)bw * bbr->min_rtt_us;

    /* Apply a gain to the given value, then remove the BW_SCALE shift. */
    bdp = (((w * gain) >> BBR_SCALE) + BW_UNIT - 1) / BW_UNIT;

    return bdp;
}

static u32 bbr_quantization_budget(struct sock *sk, u32 cwnd, int gain)
{
    /* Allow enough full-sized skbs in flight to utilize end systems. */
    cwnd += 3 * bbr_tso_segs_goal(sk);

    return cwnd;
}

/* Find inflight based on min RTT and the estimated bottleneck bandwidth. */
static u32 bbr_inflight(struct sock *sk, u32 bw, int gain)
{
    u32 inflight;

    inflight = bbr_bdp(sk, bw, gain);
    inflight = bbr_quantization_budget(sk, inflight, gain);

    return inflight;
}

/* Find the cwnd increment based on the estimate of ACK aggregation. */
static u32 bbr_ack_aggregation_cwnd(struct sock *sk)
{
    u32 max_aggr_cwnd, aggr_cwnd = 0;

    if (bbr_extra_acked_gain && bbr_full_bw_reached(sk)) {
        max_aggr_cwnd = ((u64)bbr_bw(sk) * bbr_extra_acked_max_us)
                / BW_UNIT;
        aggr_cwnd = (bbr_extra_acked_gain * bbr_extra_acked(sk))
                >> BBR_SCALE;
        aggr_cwnd = min(aggr_cwnd, max_aggr_cwnd);
    }

    return aggr_cwnd;
}


/* An optimization in BBR to reduce losses: On the first round of recovery, we
 * follow the packet conservation principle: send P packets per P packets acked.
 * After that, we slow-start and send at most 2*P packets per P packets acked.
 * After recovery finishes, or upon undo, we restore the cwnd we had when
 * recovery started (capped by the target cwnd based on estimated BDP).
 *
 * TODO(ycheng/ncardwell): implement a rate-based approach.
 */
static bool bbr_set_cwnd_to_recover_or_restore(
    struct sock *sk, const struct rate_sample *rs, u32 acked, u32 *new_cwnd)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u8 prev_state = bbr->prev_ca_state, state = inet_csk(sk)->icsk_ca_state;
    u32 cwnd = tp->snd_cwnd;

    /* An ACK for P pkts should release at most 2*P packets. We do this
     * in two steps. First, here we deduct the number of lost packets.
     * Then, in bbr_set_cwnd() we slow start up toward the target cwnd.
     */
    if (rs->losses > 0)
        cwnd = max_t(s32, cwnd - rs->losses, 1);

    if (state == TCP_CA_Recovery && prev_state != TCP_CA_Recovery) {
        /* Starting 1st round of Recovery, so do packet conservation. */
        bbr->packet_conservation = 1;
        bbr->next_rtt_delivered = tp->delivered;  /* start round now */
        /* Cut unused cwnd from app behavior, TSQ, or TSO deferral: */
        cwnd = tcp_packets_in_flight(tp) + acked;
    } else if (prev_state >= TCP_CA_Recovery && state < TCP_CA_Recovery) {
        /* Exiting loss recovery; restore cwnd saved before recovery. */
        bbr->restore_cwnd = 1;
        bbr->packet_conservation = 0;
    }
    bbr->prev_ca_state = state;

    if (bbr->restore_cwnd) {
        /* Restore cwnd after exiting loss recovery or PROBE_RTT. */
        cwnd = max(cwnd, bbr->prior_cwnd);
        bbr->restore_cwnd = 0;
    }

    if (bbr->packet_conservation) {
        *new_cwnd = max(cwnd, tcp_packets_in_flight(tp) + acked);
        return true;    /* yes, using packet conservation */
    }
    *new_cwnd = cwnd;
    return false;
}

/* Slow-start up toward target cwnd (if bw estimate is growing, or packet loss
 * has drawn us down below target), or snap down to target if we're above it.
 */
static void bbr_set_cwnd(struct sock *sk, const struct rate_sample *rs,
             u32 acked, u32 bw, int gain)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u32 cwnd = 0, target_cwnd = 0;

    if (!acked)
        return;

    if (bbr_set_cwnd_to_recover_or_restore(sk, rs, acked, &cwnd))
        goto done;

    /* If we're below target cwnd, slow start cwnd toward target cwnd. */
    target_cwnd = bbr_bdp(sk, bw, gain);
    /* New in this patch set: increment the cwnd to account for excess
     * ACKed data that appears due to aggregation (of data and/or ACKs)
     * visible in the ACK stream.
     */
    target_cwnd += bbr_ack_aggregation_cwnd(sk);
    target_cwnd = bbr_quantization_budget(sk, target_cwnd, gain);
    if (bbr_full_bw_reached(sk))  /* only cut cwnd if we filled the pipe */
        cwnd = min(cwnd + acked, target_cwnd);
    else if (cwnd < target_cwnd || tp->delivered < TCP_INIT_CWND)
        cwnd = cwnd + acked;
    cwnd = max(cwnd, bbr_cwnd_min_target);

done:
    tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);   /* apply global cap */
    if (bbr->mode == BBR_PROBE_RTT)  /* drain queue, refresh min_rtt */
        tp->snd_cwnd = min(tp->snd_cwnd, bbr_cwnd_min_target);
}

/* End cycle phase if it's time and/or we hit the phase's in-flight target. */
static bool bbr_is_next_cycle_phase(struct sock *sk,
                    const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    bool is_full_length =
        tcp_stamp_us_delta(tp->delivered_mstamp, bbr->cycle_mstamp) >
        bbr->min_rtt_us;
    u32 inflight, bw;

    /* The pacing_gain of 1.0 paces at the estimated bw to try to fully
     * use the pipe without increasing the queue.
     */
    if (bbr->pacing_gain == BBR_UNIT)
        return is_full_length;      /* just use wall clock time */

    inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
    bw = bbr_max_bw(sk);

    /* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
     * least pacing_gain*BDP; this may take more than min_rtt if min_rtt is
     * small (e.g. on a LAN). We do not persist if packets are lost, since
     * a path with small buffers may not hold that much.
     */
    if (bbr->pacing_gain > BBR_UNIT)
        return is_full_length &&
            (rs->losses ||  /* perhaps pacing_gain*BDP won't fit */
             inflight >= bbr_inflight(sk, bw, bbr->pacing_gain));

    /* A pacing_gain < 1.0 tries to drain extra queue we added if bw
     * probing didn't find more bw. If inflight falls to match BDP then we
     * estimate queue is drained; persisting would underutilize the pipe.
     */
    return is_full_length ||
        inflight <= bbr_inflight(sk, bw, BBR_UNIT);
}

static void bbr_advance_cycle_phase(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);

    bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
    bbr->cycle_mstamp = tp->delivered_mstamp;
    bbr->pacing_gain = bbr_pacing_gain[bbr->cycle_idx];
}

/* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
static void bbr_update_cycle_phase(struct sock *sk,
                   const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);

    if (bbr_drain_to_target) {
        bbr_drain_to_target_cycling(sk, rs);
        return;
    }

    if ((bbr->mode == BBR_PROBE_BW) && !bbr->lt_use_bw &&
        bbr_is_next_cycle_phase(sk, rs))
        bbr_advance_cycle_phase(sk);
}

static void bbr_reset_startup_mode(struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);

    bbr->mode = BBR_STARTUP;
    bbr->pacing_gain = bbr_high_gain;
    bbr->cwnd_gain   = bbr_high_gain;
}

static void bbr_reset_probe_bw_mode(struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);

    bbr->mode = BBR_PROBE_BW;
    bbr->pacing_gain = BBR_UNIT;
    bbr->cwnd_gain = bbr_cwnd_gain;
    bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand);
    bbr_advance_cycle_phase(sk);    /* flip to next phase of gain cycle */
}

static void bbr_reset_mode(struct sock *sk)
{
    if (!bbr_full_bw_reached(sk))
        bbr_reset_startup_mode(sk);
    else
        bbr_reset_probe_bw_mode(sk);
}

/* Start a new long-term sampling interval. */
static void bbr_reset_lt_bw_sampling_interval(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);

    bbr->lt_last_stamp = div_u64(tp->delivered_mstamp, USEC_PER_MSEC);
    bbr->lt_last_delivered = tp->delivered;
    bbr->lt_last_lost = tp->lost;
    bbr->lt_rtt_cnt = 0;
}

/* Completely reset long-term bandwidth sampling. */
static void bbr_reset_lt_bw_sampling(struct sock *sk)
{
    struct bbr *bbr = inet_csk_ca(sk);

    bbr->lt_bw = 0;
    bbr->lt_use_bw = 0;
    bbr->lt_is_sampling = false;
    bbr_reset_lt_bw_sampling_interval(sk);
}

/* Long-term bw sampling interval is done. Estimate whether we're policed. */
static void bbr_lt_bw_interval_done(struct sock *sk, u32 bw)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 diff;

    if (bbr->lt_bw) {  /* do we have bw from a previous interval? */
        /* Is new bw close to the lt_bw from the previous interval? */
        diff = abs(bw - bbr->lt_bw);
        if ((diff * BBR_UNIT <= bbr_lt_bw_ratio * bbr->lt_bw) ||
            (bbr_rate_bytes_per_sec(sk, diff, BBR_UNIT) <=
             bbr_lt_bw_diff)) {
            /* All criteria are met; estimate we're policed. */
            bbr->lt_bw = (bw + bbr->lt_bw) >> 1;  /* avg 2 intvls */
            bbr->lt_use_bw = 1;
            bbr->pacing_gain = BBR_UNIT;  /* try to avoid drops */
            bbr->lt_rtt_cnt = 0;
            return;
        }
    }
    bbr->lt_bw = bw;
    bbr_reset_lt_bw_sampling_interval(sk);
}

/* Token-bucket traffic policers are common (see "An Internet-Wide Analysis of
 * Traffic Policing", SIGCOMM 2016). BBR detects token-bucket policers and
 * explicitly models their policed rate, to reduce unnecessary losses. We
 * estimate that we're policed if we see 2 consecutive sampling intervals with
 * consistent throughput and high packet loss. If we think we're being policed,
 * set lt_bw to the "long-term" average delivery rate from those 2 intervals.
 */
static void bbr_lt_bw_sampling(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u32 lost, delivered;
    u64 bw;
    u32 t;

    if (bbr->lt_use_bw) {   /* already using long-term rate, lt_bw? */
        if (bbr->mode == BBR_PROBE_BW && bbr->round_start &&
            ++bbr->lt_rtt_cnt >= bbr_lt_bw_max_rtts) {
            bbr_reset_lt_bw_sampling(sk);    /* stop using lt_bw */
            bbr_reset_probe_bw_mode(sk);  /* restart gain cycling */
        }
        return;
    }

    /* Wait for the first loss before sampling, to let the policer exhaust
     * its tokens and estimate the steady-state rate allowed by the policer.
     * Starting samples earlier includes bursts that over-estimate the bw.
     */
    if (!bbr->lt_is_sampling) {
        if (!rs->losses)
            return;
        bbr_reset_lt_bw_sampling_interval(sk);
        bbr->lt_is_sampling = true;
    }

    /* To avoid underestimates, reset sampling if we run out of data. */
    if (rs->is_app_limited) {
        bbr_reset_lt_bw_sampling(sk);
        return;
    }

    if (bbr->round_start)
        bbr->lt_rtt_cnt++;  /* count round trips in this interval */
    if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
        return;     /* sampling interval needs to be longer */
    if (bbr->lt_rtt_cnt > 4 * bbr_lt_intvl_min_rtts) {
        bbr_reset_lt_bw_sampling(sk);  /* interval is too long */
        return;
    }

    /* End sampling interval when a packet is lost, so we estimate the
     * policer tokens were exhausted. Stopping the sampling before the
     * tokens are exhausted under-estimates the policed rate.
     */
    if (!rs->losses)
        return;

    /* Calculate packets lost and delivered in sampling interval. */
    lost = tp->lost - bbr->lt_last_lost;
    delivered = tp->delivered - bbr->lt_last_delivered;
    /* Is loss rate (lost/delivered) >= lt_loss_thresh? If not, wait. */
    if (!delivered || (lost << BBR_SCALE) < bbr_lt_loss_thresh * delivered)
        return;

    /* Find average delivery rate in this sampling interval. */
    t = div_u64(tp->delivered_mstamp, USEC_PER_MSEC) - bbr->lt_last_stamp;
    if ((s32)t < 1)
        return;     /* interval is less than one ms, so wait */
    /* Check if can multiply without overflow */
    if (t >= ~0U / USEC_PER_MSEC) {
        bbr_reset_lt_bw_sampling(sk);  /* interval too long; reset */
        return;
    }
    t *= USEC_PER_MSEC;
    bw = (u64)delivered * BW_UNIT;
    do_div(bw, t);
    bbr_lt_bw_interval_done(sk, bw);
}

/* Estimate the bandwidth based on how fast packets are delivered */
static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u64 bw;

    bbr->round_start = 0;
    if (rs->delivered < 0 || rs->interval_us <= 0)
        return; /* Not a valid observation */

    /* See if we've reached the next RTT */
    if (!before(rs->prior_delivered, bbr->next_rtt_delivered)) {
        bbr->next_rtt_delivered = tp->delivered;
        bbr->rtt_cnt++;
        bbr->round_start = 1;
        bbr->packet_conservation = 0;
    }

    bbr_lt_bw_sampling(sk, rs);

    /* Divide delivered by the interval to find a (lower bound) bottleneck
     * bandwidth sample. Delivered is in packets and interval_us in uS and
     * ratio will be <<1 for most connections. So delivered is first scaled.
     */
    bw = (u64)rs->delivered * BW_UNIT;
    do_div(bw, rs->interval_us);

    /* If this sample is application-limited, it is likely to have a very
     * low delivered count that represents application behavior rather than
     * the available network rate. Such a sample could drag down estimated
     * bw, causing needless slow-down. Thus, to continue to send at the
     * last measured network rate, we filter out app-limited samples unless
     * they describe the path bw at least as well as our bw model.
     *
     * So the goal during app-limited phase is to proceed with the best
     * network rate no matter how long. We automatically leave this
     * phase when app writes faster than the network can deliver :)
     */
    if (!rs->is_app_limited || bw >= bbr_max_bw(sk)) {
        /* Incorporate new sample into our max bw filter. */
        minmax_running_max(&bbr->bw, bbr_bw_rtts, bbr->rtt_cnt, bw);
    }
}

/* Estimate when the pipe is full, using the change in delivery rate: BBR
 * estimates that STARTUP filled the pipe if the estimated bw hasn't changed by
 * at least bbr_full_bw_thresh (25%) after bbr_full_bw_cnt (3) non-app-limited
 * rounds. Why 3 rounds: 1: rwin autotuning grows the rwin, 2: we fill the
 * higher rwin, 3: we get higher delivery rate samples. Or transient
 * cross-traffic or radio noise can go away. CUBIC Hystart shares a similar
 * design goal, but uses delay and inter-ACK spacing instead of bandwidth.
 */
static void bbr_check_full_bw_reached(struct sock *sk,
                      const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 bw_thresh;

    if (bbr_full_bw_reached(sk) || !bbr->round_start || rs->is_app_limited)
        return;

    bw_thresh = (u64)bbr->full_bw * bbr_full_bw_thresh >> BBR_SCALE;
    if (bbr_max_bw(sk) >= bw_thresh) {
        bbr->full_bw = bbr_max_bw(sk);
        bbr->full_bw_cnt = 0;
        return;
    }
    ++bbr->full_bw_cnt;
}

/* If pipe is probably full, drain the queue and then enter steady-state. */
static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);

    if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
        bbr->mode = BBR_DRAIN;  /* drain queue we created */
        bbr->pacing_gain = bbr_drain_gain;  /* pace slow to drain */
        bbr->cwnd_gain = bbr_high_gain; /* maintain cwnd */
    }   /* fall through to check if in-flight is already small: */
    if (bbr->mode == BBR_DRAIN &&
        tcp_packets_in_flight(tcp_sk(sk)) <= bbr_inflight(sk, bbr_max_bw(sk), BBR_UNIT))
        bbr_reset_probe_bw_mode(sk);  /* we estimate queue is drained */
}


/* Estimates the windowed max degree of ack aggregation.
 * This is used to provision extra in-flight data to keep sending during
 * inter-ACK silences.
 *
 * Degree of ack aggregation is estimated as extra data acked beyond expected.
 *
 * max_extra_acked = "maximum recent excess data ACKed beyond max_bw * interval"
 * cwnd += max_extra_acked
 *
 * Max extra_acked is clamped by cwnd and bw * bbr_extra_acked_max_us (100 ms).
 * Max filter is an approximate sliding window of 10-20 (packet timed) round
 * trips.
 */
static void bbr_update_ack_aggregation(struct sock *sk,
                       const struct rate_sample *rs)
{
    u32 epoch_us, expected_acked, extra_acked;
    struct bbr *bbr = inet_csk_ca(sk);
    struct tcp_sock *tp = tcp_sk(sk);

    if (!bbr_extra_acked_gain || rs->acked_sacked <= 0 ||
        rs->delivered < 0 || rs->interval_us <= 0)
        return;

    if (bbr->round_start) {
        bbr->extra_acked_win_rtts = min(0x1F,
                        bbr->extra_acked_win_rtts + 1);
        if (bbr->extra_acked_win_rtts >= bbr_extra_acked_win_rtts) {
            bbr->extra_acked_win_rtts = 0;
            bbr->extra_acked_win_idx = bbr->extra_acked_win_idx ?
                               0 : 1;
            bbr->extra_acked[bbr->extra_acked_win_idx] = 0;
        }
    }

    /* Compute how many packets we expected to be delivered over epoch. */
    epoch_us = tcp_stamp_us_delta(tp->delivered_mstamp,
                      bbr->ack_epoch_mstamp);
    expected_acked = ((u64)bbr_bw(sk) * epoch_us) / BW_UNIT;

    /* Reset the aggregation epoch if ACK rate is below expected rate or
     * significantly large no. of ack received since epoch (potentially
     * quite old epoch).
     */
    if (bbr->ack_epoch_acked <= expected_acked ||
        (bbr->ack_epoch_acked + rs->acked_sacked >=
         bbr_ack_epoch_acked_reset_thresh)) {
        bbr->ack_epoch_acked = 0;
        bbr->ack_epoch_mstamp = tp->delivered_mstamp;
        expected_acked = 0;
    }

    /* Compute excess data delivered, beyond what was expected. */
    bbr->ack_epoch_acked = min(0xFFFFFU,
                   bbr->ack_epoch_acked + rs->acked_sacked);
    extra_acked = bbr->ack_epoch_acked - expected_acked;
    extra_acked = min(extra_acked, tp->snd_cwnd);
    if (extra_acked > bbr->extra_acked[bbr->extra_acked_win_idx])
        bbr->extra_acked[bbr->extra_acked_win_idx] = extra_acked;
}

/* The goal of PROBE_RTT mode is to have BBR flows cooperatively and
 * periodically drain the bottleneck queue, to converge to measure the true
 * min_rtt (unloaded propagation delay). This allows the flows to keep queues
 * small (reducing queuing delay and packet loss) and achieve fairness among
 * BBR flows.
 *
 * The min_rtt filter window is 10 seconds. When the min_rtt estimate expires,
 * we enter PROBE_RTT mode and cap the cwnd at bbr_cwnd_min_target=4 packets.
 * After at least bbr_probe_rtt_mode_ms=200ms and at least one packet-timed
 * round trip elapsed with that flight size <= 4, we leave PROBE_RTT mode and
 * re-enter the previous mode. BBR uses 200ms to approximately bound the
 * performance penalty of PROBE_RTT's cwnd capping to roughly 2% (200ms/10s).
 *
 * Note that flows need only pay 2% if they are busy sending over the last 10
 * seconds. Interactive applications (e.g., Web, RPCs, video chunks) often have
 * natural silences or low-rate periods within 10 seconds where the rate is low
 * enough for long enough to drain its queue in the bottleneck. We pick up
 * these min RTT measurements opportunistically with our min_rtt filter. :-)
 */
static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    bool filter_expired;

    /* Track min RTT seen in the min_rtt_win_sec filter window: */
    filter_expired = after(tcp_jiffies32,
                   bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
    if (rs->rtt_us >= 0 &&
        (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
        bbr->min_rtt_us = rs->rtt_us;
        bbr->min_rtt_stamp = tcp_jiffies32;
    }

    if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
        !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
        bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
        bbr->pacing_gain = BBR_UNIT;
        bbr->cwnd_gain = BBR_UNIT;
        bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
        bbr->probe_rtt_done_stamp = 0;
    }

    if (bbr->mode == BBR_PROBE_RTT) {
        /* Ignore low rate samples during this mode. */
        tp->app_limited =
            (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
        /* Maintain min packets in flight for max(200 ms, 1 round). */
        if (!bbr->probe_rtt_done_stamp &&
            tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
            bbr->probe_rtt_done_stamp = tcp_jiffies32 +
                msecs_to_jiffies(bbr_probe_rtt_mode_ms);
            bbr->probe_rtt_round_done = 0;
            bbr->next_rtt_delivered = tp->delivered;
        } else if (bbr->probe_rtt_done_stamp) {
            if (bbr->round_start)
                bbr->probe_rtt_round_done = 1;
            if (bbr->probe_rtt_round_done &&
                after(tcp_jiffies32, bbr->probe_rtt_done_stamp)) {
                bbr->min_rtt_stamp = tcp_jiffies32;
                bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
                bbr_reset_mode(sk);
            }
        }
    }
    bbr->idle_restart = 0;
}

static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
{
    bbr_update_bw(sk, rs);
    bbr_update_ack_aggregation(sk, rs);
    bbr_update_cycle_phase(sk, rs);
    bbr_check_full_bw_reached(sk, rs);
    bbr_check_drain(sk, rs);
    bbr_update_min_rtt(sk, rs);
}

static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 bw;

    bbr_update_model(sk, rs);

    bw = bbr_bw(sk);
    bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
    bbr_set_tso_segs_goal(sk);
    bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}

static void bbr_init(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);

    bbr->prior_cwnd = 0;
    bbr->tso_segs_goal = 0;  /* default segs per skb until first ACK */
    bbr->rtt_cnt = 0;
    bbr->next_rtt_delivered = 0;
    bbr->prev_ca_state = TCP_CA_Open;
    bbr->packet_conservation = 0;

    bbr->probe_rtt_done_stamp = 0;
    bbr->probe_rtt_round_done = 0;
    bbr->min_rtt_us = tcp_min_rtt(tp);
    bbr->min_rtt_stamp = tcp_jiffies32;

    minmax_reset(&bbr->bw, bbr->rtt_cnt, 0);  /* init max bw to 0 */

    bbr->has_seen_rtt = 0;
    bbr_init_pacing_rate_from_rtt(sk);

    bbr->restore_cwnd = 0;
    bbr->round_start = 0;
    bbr->idle_restart = 0;
    bbr->full_bw = 0;
    bbr->full_bw_cnt = 0;
    bbr->cycle_mstamp = 0;
    bbr->cycle_idx = 0;
    bbr->cycle_len = 0;
    bbr_reset_lt_bw_sampling(sk);
    bbr_reset_startup_mode(sk);
    bbr->ack_epoch_mstamp = tp->tcp_mstamp;
    bbr->ack_epoch_acked = 0;
    bbr->extra_acked_win_rtts = 0;
    bbr->extra_acked_win_idx = 0;
    bbr->extra_acked[0] = 0;
    bbr->extra_acked[1] = 0;

    cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED);
}

static u32 bbr_sndbuf_expand(struct sock *sk)
{
    /* Provision 3 * cwnd since BBR may slow-start even during recovery. */
    return 3;
}

/* In theory BBR does not need to undo the cwnd since it does not
 * always reduce cwnd on losses (see bbr_main()). Keep it for now.
 */
static u32 bbr_undo_cwnd(struct sock *sk)
{
    return tcp_sk(sk)->snd_cwnd;
}

/* Entering loss recovery, so save cwnd for when we exit or undo recovery. */
static u32 bbr_ssthresh(struct sock *sk)
{
    bbr_save_cwnd(sk);
    return TCP_INFINITE_SSTHRESH;    /* BBR does not use ssthresh */
}

static size_t bbr_get_info(struct sock *sk, u32 ext, int *attr,
               union tcp_cc_info *info)
{
    if (ext & (1 << (INET_DIAG_BBRINFO - 1)) ||
        ext & (1 << (INET_DIAG_VEGASINFO - 1))) {
        struct tcp_sock *tp = tcp_sk(sk);
        struct bbr *bbr = inet_csk_ca(sk);
        u64 bw = bbr_bw(sk);

        bw = bw * tp->mss_cache * USEC_PER_SEC >> BW_SCALE;
        memset(&info->bbr, 0, sizeof(info->bbr));
        info->bbr.bbr_bw_lo     = (u32)bw;
        info->bbr.bbr_bw_hi     = (u32)(bw >> 32);
        info->bbr.bbr_min_rtt       = bbr->min_rtt_us;
        info->bbr.bbr_pacing_gain   = bbr->pacing_gain;
        info->bbr.bbr_cwnd_gain     = bbr->cwnd_gain;
        *attr = INET_DIAG_BBRINFO;
        return sizeof(info->bbr);
    }
    return 0;
}

static void bbr_set_state(struct sock *sk, u8 new_state)
{
    struct bbr *bbr = inet_csk_ca(sk);

    if (new_state == TCP_CA_Loss) {
        struct rate_sample rs = { .losses = 1 };

        bbr->prev_ca_state = TCP_CA_Loss;
        bbr->full_bw = 0;
        bbr->round_start = 1;   /* treat RTO like end of a round */
        bbr_lt_bw_sampling(sk, &rs);
    }
}

static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
    .flags      = TCP_CONG_NON_RESTRICTED,
    .name       = "bbr",
    .owner      = THIS_MODULE,
    .init       = bbr_init,
    .cong_control   = bbr_main,
    .sndbuf_expand  = bbr_sndbuf_expand,
    .undo_cwnd  = bbr_undo_cwnd,
    .cwnd_event = bbr_cwnd_event,
    .ssthresh   = bbr_ssthresh,
    .tso_segs_goal  = bbr_tso_segs_goal,
    .get_info   = bbr_get_info,
    .set_state  = bbr_set_state,
};

static int __init bbr_register(void)
{
    BUILD_BUG_ON(sizeof(struct bbr) > ICSK_CA_PRIV_SIZE);
    return tcp_register_congestion_control(&tcp_bbr_cong_ops);
}

static void __exit bbr_unregister(void)
{
    tcp_unregister_congestion_control(&tcp_bbr_cong_ops);
}

module_init(bbr_register);
module_exit(bbr_unregister);

MODULE_AUTHOR("Van Jacobson <vanj@google.com>");
MODULE_AUTHOR("Neal Cardwell <ncardwell@google.com>");
MODULE_AUTHOR("Yuchung Cheng <ycheng@google.com>");
MODULE_AUTHOR("Soheil Hassas Yeganeh <soheil@google.com>");
MODULE_LICENSE("Dual BSD/GPL");
MODULE_DESCRIPTION("TCP BBR (Bottleneck Bandwidth and RTT)");

// The black clouds capped the heavy rain,
// Shoes in Wenzhou, Zhejiang, are soaked.

// Leather-shoe boss, leather-shoe manager.
