TCP Performance problems caused by interaction between Nagle's Algorithm and Delayed ACK

Original post, 27 August 2011, 21:01:07


Stuart Cheshire
20th May 2005

This page describes a TCP performance problem resulting from a little-known interaction between Nagle's Algorithm and Delayed ACK. At least, I believe it's not well known: I haven't seen it documented elsewhere, yet in the course of my career at Apple I have run into the performance problem it causes over and over again (the first time being in the PPCToolbox over TCP code I wrote myself back in 1999), so I think it's about time it was documented.


The overall summary is that many of the mechanisms that make TCP so great, like fast retransmit, work best when there's a continuous flow of data for the TCP state machine to work on. If you send a block of data and then stop and wait for an application-layer acknowledgment from the other end, the state machine can falter. It's a bit like a water pump losing its prime: as long as the whole pipeline is full of water the pump works well and the water flows, but if you have a bit of water followed by a bunch of air, the pump flails because the impeller has nothing substantial to push against.


If you're writing a request/response application-layer protocol over TCP, the implication is that you will want to make your implementation at least double-buffered: send your first request, and while you're still waiting for the response, generate and send your second. Then when you get the response for the first, generate and send your third request. This way you always have two requests outstanding. While you're waiting for the response for request n, request n+1 is behind it in the return pipeline, conceptually pushing the data along.
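The ordering described above can be sketched as a small scheduling function. This is only an illustration, not code from the test program: the names `event` and `pipeline_order` are hypothetical, and instead of touching a socket the function just records the order in which sends and receives would be issued for a given pipeline depth.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EV_SEND, EV_RECV } ev_kind;
typedef struct { ev_kind kind; int id; } event;

/* Records the order of operations for `total` requests with at most
   `depth` requests outstanding (depth 2 is double-buffering).
   Returns the number of events written, or 0 if `out` is too small. */
size_t pipeline_order(int total, int depth, event *out, size_t cap)
{
    size_t n = 0;
    int next_send = 0, next_recv = 0;

    assert(total >= 0 && depth >= 1);
    if ((size_t)total * 2 > cap)
        return 0;

    while (next_recv < total) {
        /* Top up: keep sending until `depth` requests are outstanding. */
        while (next_send < total && next_send - next_recv < depth)
            out[n++] = (event){ EV_SEND, next_send++ };
        /* Then wait for the oldest outstanding response. */
        out[n++] = (event){ EV_RECV, next_recv++ };
    }
    return n;
}
```

With three requests and a depth of two, the recorded order is send 0, send 1, receive 0, send 2, receive 1, receive 2, so a request is always queued behind the response being waited for until the final one.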


Interestingly, the performance benefit you get from going from single-buffering to double, triple, quadruple, etc., is not a linear slope. Implementing double-buffering usually yields almost all of the performance improvement that's there to be had; going to triple, quadruple, or n-way buffered usually yields little more. This is because what matters is not the sheer quantity of data in the pipeline, but the fact that there is something in the pipeline following the packet you're currently waiting for. As long as there's at least a packet's worth of data following the response you're waiting for, that's enough to avoid the pathological slowdown described here. As long as there's at least four or five packets' worth of data, that's enough to trigger a fast retransmit should the packet you're waiting for be lost.


Case Study: WiFi Performance Testing Program

In this document I describe my latest encounter with this bad interaction between Nagle's Algorithm and Delayed ACK: a testing program used in WiFi conformance testing. This program tests the speed of a WiFi implementation by repeatedly sending 100,000 bytes of data over TCP and then waiting for an application-layer ack from the other end to confirm its reception. Windows achieved the 3.5Mb/s required to pass the test; Mac OS X got just 2.7Mb/s and failed. The naive (and wrong) conclusion would be that Windows is fast, and Mac OS X is slow, and that's that. The truth is not so simple. The true explanation was that the test was flawed, and Mac OS X happened to expose the problem, while Windows, basically through luck, did not.




Engineers found that reducing the buffer size from 100,000 bytes to 99,912 bytes made the measured speed jump to 5.2Mb/s, easily passing the test. At 99,913 bytes the test got 2.7Mb/s and failed. Clearly what's going on here is more interesting than just a slow wireless card and/or driver.


The diagram below shows a TCP packet trace of the failing transfer, which achieves only 2.7Mb/s, generated using tcptrace and jPlot:


This diagram shows a TCP packet trace of the transfer using 99912-byte blocks, which achieves 5.2Mb/s and passes:


In the failing case the code is clearly sending data successfully from 0-200ms, then doing nothing from 200-400ms, then sending again from 400-600ms, then nothing again from 600-800ms. Why does it keep pausing? To understand that we need to understand Delayed ACK and Nagle's Algorithm:


Delayed ACK

Delayed ACK means TCP doesn't immediately ack every single received TCP segment. (When reading the following, think about an interactive ssh session, not bulk transfer.) If you receive a single lone TCP segment, you wait 100-200ms, on the assumption that the receiving application will probably generate a response of some kind. (E.g., every time sshd receives a keystroke, it typically generates a character echo in response.) You don't want the TCP stack to send an empty ACK followed by a TCP data packet 1ms later, every time, so you delay a little, so you can combine the ACK and data packet into one. So far so good. But what if the application doesn't generate any response data? Well, in that case, what difference can a little delay make? If there's no response data, then the client can't be waiting for anything, can it? Well, the application-layer client can't be waiting for anything, but the TCP stack at the sending end can be waiting. This is where Nagle's Algorithm enters the story.


Nagle's Algorithm

For efficiency you want to send full-sized TCP data packets. Nagle's Algorithm says that if you have a few bytes to send, but not a full packet's worth, and you already have some unacknowledged data in flight, then you wait, until either the application gives you more data, enough to make another full-sized TCP data packet, or the other end acknowledges all your outstanding data, so you no longer have any data in flight.
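The rule just stated can be written as a single decision function. This is a simplified sketch, assuming only the three quantities named in the paragraph; the function name `nagle_may_send` is hypothetical, and a real TCP stack also has to consider the receive window, congestion window, and timers, which are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* pending:   bytes the application has written but TCP has not yet sent
   mss:       size of a full-sized segment
   in_flight: bytes sent but not yet acknowledged
   Returns true if Nagle's Algorithm would let a segment go out now. */
bool nagle_may_send(size_t pending, size_t mss, size_t in_flight)
{
    assert(mss > 0);
    if (pending == 0)
        return false;          /* nothing to send */
    if (pending >= mss)
        return true;           /* a full-sized segment is ready */
    return in_flight == 0;     /* a partial segment waits for all ACKs */
}
```

For example, 88 leftover bytes with an unacknowledged 1448-byte segment still in flight are held back, and are released as soon as `in_flight` drops to zero.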


Usually this is a good idea. Nagle's Algorithm is to protect the network from stupid apps that do things like this, where a naive TCP stack might end up sending 100,000 one-byte packets.


    for (i=0; i<100000; i++) write(socket, &buffer[i], 1);

The bad interaction is that now there is something at the sending end waiting for that response from the server. That something is Nagle's Algorithm, waiting for its in-flight data to be acknowledged before sending more.


The next thing to know is that Delayed ACK applies to a single packet. If a second packet arrives, an ACK is generated immediately. So TCP will ACK every second packet immediately. Send two packets, and you get an immediate ACK. Send three packets, and you'll get an immediate ACK covering the first two, then a 200ms pause before the ACK for the third.
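The every-second-packet behaviour can be modelled with a one-field state machine. This is a toy model under the assumptions of this article, not any particular stack's implementation: the names `ack_state` and `on_segment` are hypothetical, and the delayed ACK timer itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of received segments that have not been acknowledged yet. */
typedef struct { unsigned unacked_segments; } ack_state;

/* Called for each arriving data segment.
   Returns true if an ACK goes out immediately, false if the ACK is
   delayed (it would then wait for response data, another segment,
   or the delayed ACK timer). */
bool on_segment(ack_state *s)
{
    assert(s != NULL);
    if (++s->unacked_segments >= 2) {
        s->unacked_segments = 0;   /* second segment: ACK both at once */
        return true;
    }
    return false;                  /* lone segment: hold the ACK */
}
```

Feeding three segments through this model yields false, true, false: the third segment is the one left waiting.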


Armed with this information, we can now understand what's going on. Let's look at the numbers:

 99,900 bytes = 68 full-sized 1448-byte packets, plus 1436 bytes extra
100,000 bytes = 69 full-sized 1448-byte packets, plus   88 bytes extra

With 99,900 bytes, you send 68 full-sized packets. Nagle holds onto the last 1436 bytes. Then:

  • Your 68th packet (an even number) is immediately acknowledged by the receiver without delay
  • Nagle gets the ACK and releases the remaining 1436 bytes
  • The receiver gets the remaining 1436 bytes, and delays its ACK
  • the server application generates its one-byte application-layer ack for the transfer,
  • Delayed ACK combines that one byte with its pending ACK packet, sends the combined TCP ACK+data packet promptly,

...and everything flows (relatively) smoothly.









Now consider the 100,000-byte case. You send a stream of 69 full-sized packets. Nagle holds onto the last 88 bytes. Then:

  • The receiver ACKs every second packet, up to packet 68.
  • One more data packet arrives, packet 69.
  • Delayed ACK means that the receiver won't ACK this packet until it gets
    (a) some response data from the local process, or
    (b) another packet from the sender.
  • The local process won't generate any response data (a) because it hasn't got the full 100,000 bytes yet.
  • The sender won't send the last packet (b) because Nagle won't let it until it gets an ACK from the receiver.










Now we have a brief deadlock, with performance-killing results:

  • Nagle won't send the last bit of data until it gets an ACK
  • Delayed ACK won't send that ACK until it gets some response data
  • Server process won't generate any response until it gets all the data
  • but, Nagle won't send the last bit of data... and so on.

So, at the end of each 100,000-byte transfer we get this little awkward pause. Finally the delayed ack timer goes off and the deadlock un-wedges, until next time. On a gigabit network, all these huge 200ms pauses can be devastating to an application protocol that runs into this problem. These pauses can limit a request/response application-layer protocol to at most five transactions per second, on a network link where it should be capable of a thousand transactions per second or more. In the case of this specific attempt at a performance test, it should have been able to, in principle, transfer each 100,000-byte chunk over a local gigabit Ethernet link in as little as 1ms. But because it stops and waits after each chunk, rather than double-buffering as recommended above, sending each 100,000-byte chunk takes 1ms + 200ms pause = 201ms, making the test run roughly two hundred times slower than it should.
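The arithmetic in the paragraph above can be checked with two one-line helpers. The function names are hypothetical, and the inputs (1ms per transfer, 200ms per stall) are simply the round figures used in the text.

```c
#include <assert.h>

/* Transactions per second when every transfer is followed by a stall. */
double transactions_per_second(double transfer_s, double stall_s)
{
    assert(transfer_s > 0.0 && stall_s >= 0.0);
    return 1.0 / (transfer_s + stall_s);
}

/* How many times longer a stalled transaction takes than an unstalled one. */
double slowdown_factor(double transfer_s, double stall_s)
{
    assert(transfer_s > 0.0 && stall_s >= 0.0);
    return (transfer_s + stall_s) / transfer_s;
}
```

With `transfer_s = 0.001` and `stall_s = 0.2`, the first helper gives just under five transactions per second and the second gives a factor of about 201, matching the figures quoted above.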


Why didn't Windows suffer this problem?

On Windows the TCP segment size is 1460 bytes. On Mac OS X and other operating systems that add a TCP time-stamp option, the TCP segment size is twelve bytes smaller: 1448 bytes.

What this means is that on Windows, 100,000 bytes is 68 full-sized 1460-byte packets plus 720 extra bytes. Because 68 is an even number, by pure luck the application avoids the Nagle/Delayed ACK interaction.

On Mac OS X, 100,000 bytes is 69 full-sized 1448-byte packets, plus 88 bytes extra. Because 69 is an odd number, Mac OS X exposes the application problem.
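The odd/even reasoning for the two segment sizes can be condensed into a small predictor. This is a sketch of the simplified model used in this article, not a general statement about TCP: it assumes the receiver's ACK count starts fresh at the beginning of each message, and the names `segment_split`, `split_message`, and `hits_delayed_ack_stall` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t full_segments;  /* number of full-sized segments */
    size_t remainder;      /* leftover bytes that Nagle holds back */
} segment_split;

segment_split split_message(size_t message_bytes, size_t mss)
{
    segment_split s;
    assert(mss > 0);
    s.full_segments = message_bytes / mss;
    s.remainder = message_bytes % mss;
    return s;
}

/* In this model the stall occurs when an odd number of full segments
   leaves the last one waiting on a delayed ACK while Nagle is still
   holding leftover bytes. */
bool hits_delayed_ack_stall(size_t message_bytes, size_t mss)
{
    segment_split s = split_message(message_bytes, mss);
    return s.remainder > 0 && s.full_segments % 2 == 1;
}
```

Under these assumptions the predictor reports a stall for 100,000 bytes at 1448 bytes per segment and none at 1460, and it also reproduces the 99,912 versus 99,913 byte boundary that the engineers observed.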

A crude way to solve the problem, though still not as efficient as the double-buffering approach, is to make sure the application sends each semantic message using a single large write (either by copying the message data into a contiguous buffer, or by using a scatter/gather-type write operation like sendmsg) and set the TCP_NODELAY socket option, thereby disabling Nagle's algorithm. This avoids this particular problem, though it can still suffer other problems inherent in not using double buffering: for example, if the last packet of a response gets lost, there are no packets following it to trigger a fast retransmit.
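A minimal sketch of that workaround on a POSIX system might look like the following. The helper names `disable_nagle` and `send_message` are hypothetical, and `writev` is used here as the scatter/gather write in place of `sendmsg`; the header/body split is only an illustrative message layout.

```c
#include <assert.h>
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_NODELAY */
#include <sys/socket.h>   /* setsockopt */
#include <sys/types.h>    /* ssize_t */
#include <sys/uio.h>      /* writev, struct iovec */

/* Disables Nagle's Algorithm on a TCP socket.
   Returns 0 on success, -1 on error (errno is set by setsockopt). */
int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}

/* Hands a header and a body to the kernel in one gathered write, so the
   message is not split across several small write calls.
   Returns the number of bytes written, or -1 on error. A short write is
   possible; production code would loop until everything is sent. */
ssize_t send_message(int fd, const void *header, size_t header_len,
                     const void *body, size_t body_len)
{
    struct iovec iov[2];

    assert(fd >= 0);
    iov[0].iov_base = (void *)header;
    iov[0].iov_len = header_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = body_len;
    return writev(fd, iov, 2);
}
```

A caller would typically invoke `disable_nagle` once after the connection is established and then use `send_message` for every application-layer message.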

