The previous post described how a bandwidth sample taken on a delayed-ACK packet happened, at just the wrong moment, to update the bandwidth estimate and cause a steep drop in the send rate. The customer reported that upgrading the Linux kernel also fixed the problem, but rolling a new kernel out to every production machine takes too long. So why does a kernel upgrade fix it? The newer kernel neither disables delayed ACKs nor skips the bandwidth update when a delayed ACK arrives, so we still need to work out which change in the newer version sidesteps the delayed-ACK effect.
Looking at the throughput curve again more carefully, the rate actually bottomed out more than four times within those four minutes: there were six clearly visible drops, plus several dips where throughput recovered quickly. Why do some drops recover quickly and others slowly? So I went back to one of the problem points to see how it recovered. The odd part: in the first few bandwidth samples after the sudden drop, rs->interval_us was close to 40ms, and in the samples after that it fell to around 20ms. Yet from the Wireshark capture, the actual interval should have been about 9ms, and in the later samples the on-the-wire interval looked like a microsecond-scale value, while the printed interval_us stayed near 20ms.
So we need to understand how interval_us is computed. rs->interval_us is produced in tcp_rate_gen(), which is called on receipt of an ACK: snd_us is the send-phase interval, ack_us is the ack-phase interval, and tp->tcp_mstamp is the current time (a cached copy of tcp_time_stamp).
void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost...
{
	....
	snd_us = rs->interval_us;	/* send phase */
	ack_us = tcp_stamp_us_delta(tp->tcp_mstamp,
				    rs->prior_mstamp); /* ack phase */
	rs->interval_us = max(snd_us, ack_us);
}
The send interval is filled in by tcp_rate_skb_delivered(). If the computation is unclear, one of my earlier posts covers it in detail. Roughly: when packet A is sent, its control block records the send time and delivery time of the most recently ACKed packet; when A itself is ACKed, the intervals are computed from those recorded timestamps.
void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
			    struct rate_sample *rs)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);

	if (!scb->tx.delivered_mstamp)
		return;

	if (!rs->prior_delivered ||
	    after(scb->tx.delivered, rs->prior_delivered)) {
		rs->prior_delivered = scb->tx.delivered;
		rs->prior_mstamp = scb->tx.delivered_mstamp; /* delivery time of the most recently ACKed packet, recorded when this skb was sent */
		rs->is_app_limited = scb->tx.is_app_limited;
		rs->is_retrans = scb->sacked & TCPCB_RETRANS;

		/* Find the duration of the "send phase" of this window: */
		rs->interval_us = tcp_stamp_us_delta(
			skb->skb_mstamp,
			scb->tx.first_tx_mstamp); /* send-phase interval computed here */

		/* Record send time of most recently ACKed packet: */
		tp->first_tx_mstamp = skb->skb_mstamp;
	}
	/* Mark off the skb delivered once it's sacked to avoid being
	 * used again when it's cumulatively acked. For acked packets
	 * we don't need to reset since it'll be freed soon.
	 */
	if (scb->sacked & TCPCB_SACKED_ACKED)
		scb->tx.delivered_mstamp = 0;
}
But there is a special case: if the connection has been idle for a long time and then resumes sending, the most recently ACKed packet dates from long ago. In that case the kernel resets both the most-recent-send timestamp and the most-recent-delivery timestamp to the current time.
void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* In general we need to start delivery rate samples from the
	 * time we received the most recent ACK, to ensure we include
	 * the full time the network needs to deliver all in-flight
	 * packets. If there are no packets in flight yet, then we
	 * know that any ACKs after now indicate that the network was
	 * able to deliver those packets completely in the sampling
	 * interval between now and the next ACK.
	 *
	 * Note that we use packets_out instead of tcp_packets_in_flight(tp)
	 * because the latter is a guess based on RTO and loss-marking
	 * heuristics. We don't want spurious RTOs or loss markings to cause
	 * a spuriously small time interval, causing a spuriously high
	 * bandwidth estimate.
	 */
	if (!tp->packets_out) { /* sending is (re)starting after idle */
		tp->first_tx_mstamp  = skb->skb_mstamp;
		tp->delivered_mstamp = skb->skb_mstamp;
	}

	TCP_SKB_CB(skb)->tx.first_tx_mstamp	= tp->first_tx_mstamp;
	TCP_SKB_CB(skb)->tx.delivered_mstamp	= tp->delivered_mstamp;
	TCP_SKB_CB(skb)->tx.delivered		= tp->delivered;
	TCP_SKB_CB(skb)->tx.is_app_limited	= tp->app_limited ? 1 : 0;
}
Since tp->first_tx_mstamp and tp->delivered_mstamp are reset when sending restarts, where could the problem come from? Investigation showed the issue is in the qdisc layer. Early BBR implemented pacing through the qdisc layer: the qdisc computes a quota from the socket's pacing_rate, and if the quota is sufficient the packet is transmitted immediately; if not, the qdisc delays handing the packet to the NIC until enough quota has accumulated. Our tcpdump capture sits between the qdisc layer and the NIC, so the send times we see are the qdisc's transmit times, not the send times the TCP layer actually recorded.
So which patch fixed this? It was BBR's third patch (kernel/git/torvalds/linux.git - Linux kernel source tree), which changed BBR's pacing to an hrtimer-based implementation inside the TCP stack, no longer relying on the qdisc layer.