TCP_CORK: More than you ever wanted to know

Reposted on December 18, 2015, 10:25:02



April 6, 2005

I previously mentioned the leakiness of Unix's file metaphor. The leak often becomes a gushing torrent when trying to bump up performance. TCP_CORK is yet another example.

Before I get into the details of TCP_CORK and the problem it addresses, I want to point out that this is a Linux-only option, although variants exist on other *nix flavors, for instance TCP_NOPUSH on FreeBSD and Mac OS X (although from what I read the OS X implementation is buggy). This is one of the unfortunate aspects of modern Unix programming. While most of the APIs are identical between Unix-like OSes, if the functionality isn't specified by POSIX, none of the major *nixes can seem to agree on an implementation.

What are "physical" socket writes?

The root of the abstraction leak derives from the semantics of the write() function when applied to TCP/IP. Historically (and any Unix experts in the crowd feel free to correct me here if this is not accurate) the write() function resulted in a physical, non-buffered write to the device. With TCP/IP the device is a network packet, but the implementors were forced to define a physical write given Unix's file semantics, so a TCP/IP write() was defined as follows:

Any data that has been sent to the kernel with write() is placed into one or more packets and immediately sent onto the wire.

The resulting behavior is what application programmers expected. When they called write() the data would be sent and available to the host on the other side of the wire. But it didn't take long to realize that this resulted in some interesting performance problems, which were addressed by Nagle's algorithm.

Nagle's algorithm

In the early 1980s John Nagle found that the networks at Ford Aerospace were becoming congested with packets containing only a single character's worth of data. Basically, every time a user struck a key in a telnet-like console app an entire packet was put onto the network. As Nagle pointed out, this resulted in about 4000% overhead (the total amount of data sent vs. the actual application data: roughly 40 bytes of TCP/IP headers for every byte typed). Nagle's solution was simple: wait for the peer to acknowledge the previously sent packet before sending any partial packets. This gives the OS time to coalesce multiple calls to write() from the application into larger packets before forwarding the data to the peer.
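
The rule can be sketched as a small predicate. This is only an illustration of the decision as described above, not the code of any real TCP stack, which considers more conditions; the function name and its parameters are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): decide whether a sender that
 * follows Nagle's rule may transmit right now.
 *
 *   queued   bytes the application has written that are not yet sent
 *   mss      maximum segment size, i.e. the payload of one full packet
 *   unacked  bytes already sent that the peer has not acknowledged
 */
static bool nagle_may_send(size_t queued, size_t mss, size_t unacked)
{
    assert(mss > 0);

    if (queued == 0)
        return false;   /* nothing to send */
    if (queued >= mss)
        return true;    /* a full packet is always allowed out */

    /* Partial packet: send only if nothing is waiting for an ACK. */
    return unacked == 0;
}
```

With a 1460-byte segment size, a single keystroke goes out at once on an idle connection, but is held back while earlier data is still unacknowledged.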

Nagle's algorithm is transparent to application developers, and it effectively sticks a fat finger in the abstraction leak. Calls to write() still ensure that data is sent to the peer without any further action from the application. Nagle also has the side benefit of providing additional rudimentary flow control.

Nagle not optimal for streams

While Nagle's algorithm is an excellent compromise for many applications, and it is the default behavior for most TCP/IP implementations including Linux's, it isn't without drawbacks. The Nagle algorithm is most effective if TCP/IP traffic is generated sporadically by user input, not by applications using stream-oriented protocols. It works great for Telnet, but it is less than optimal for HTTP. For example, if an application needs to send 1 1/2 packets of data to complete a message, the second packet is delayed until an ACK is received for the previous packet, thereby needlessly increasing latency when the application doesn't expect to send more data.

It also requires the peer to process more packets when network latency is low. This can affect the responsiveness of the peer by causing it to needlessly consume resources.

Unfortunately, as is often the case, the file abstraction must be violated to improve performance. The application must instruct the OS not to send any packets unless they are full, or until the application signals the OS to send all pending data. This is the effect of TCP_CORK.

The application must tell the OS where the boundaries of the application-layer messages are. For instance, multiple HTTP messages can be passed on one connection using HTTP pipelining. When a message is complete the application should signal the OS to send any outstanding data. If the application fails to signal the end of a completed message, the peer will hang waiting for the remainder of the message.

In my HTTP implementation, I use the flush metaphor, which is common with streams but not usually associated with calls to write(), which are supposed to be physical. I set the TCP_CORK option when the socket is created, and then "flush" the socket at message boundaries.

Prefer the gather function writev()

If you need to write multiple buffers that are currently in memory, you should prefer the gather function writev() before considering TCP_CORK with multiple calls to write(). This function allows multiple non-contiguous buffers to be written with one system call. The kernel can then coalesce the buffers efficiently into packet structures before writing them to the network. It also reduces the number of system calls required to send the data, and hence improves performance.

writev() should be combined with either the TCP_NODELAY or the TCP_CORK option. TCP_NODELAY disables the Nagle algorithm and ensures that the data will be written immediately. Using TCP_CORK with writev() will allow the kernel to buffer and align packets between multiple calls to write() or writev(), but you must remember to remove the cork option to write the data, as described in the next section.
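
As a minimal sketch, assuming a message made of a header and a body held in two separate NUL-terminated strings, a single writev() call could look like this. The helper name send_message() is made up for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: send a header and a body that live in separate
 * buffers with one writev() call. Returns what writev() returns: the
 * number of bytes written (possibly fewer than requested) or -1 on error.
 */
static ssize_t send_message(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    assert(header != NULL && body != NULL);

    /* iov_base is not const-qualified, so the casts are required. */
    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    return writev(fd, iov, 2);
}
```

On a blocking descriptor the return value is normally the combined length of both buffers; the caller should still compare it against the expected total.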

TCP_NODELAY is set on a socket as follows:

/* Needs <sys/socket.h>, <netinet/in.h> and <netinet/tcp.h>. */
int state = 1;  /* non-zero disables Nagle's algorithm */
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &state, sizeof(state));

The drawback of writev() is that it is difficult to use with non-blocking I/O, where the function may return before all the data is written. A post-call operation must be performed to determine how much data was written, and to realign the buffers for subsequent calls. This is an area where auxiliary library functionality would help. Also, the behavior of writev() with non-blocking I/O isn't well documented.
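
The realignment step might be sketched as follows. iov_advance() is a hypothetical helper, not a library function; it assumes the caller passes in the byte count that writev() just returned.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: after writev() reported `written` bytes, adjust the
 * iovec array in place so that it describes only the unsent data.
 * Returns the index of the first entry that still has data to send;
 * a return value equal to iovcnt means everything was written.
 */
static int iov_advance(struct iovec *iov, int iovcnt, size_t written)
{
    int i = 0;

    assert(iov != NULL || iovcnt == 0);

    /* Skip every buffer that was written completely. */
    while (i < iovcnt && written >= iov[i].iov_len) {
        written -= iov[i].iov_len;
        iov[i].iov_len = 0;
        i++;
    }

    /* Trim the front of the buffer that was written only in part. */
    if (i < iovcnt && written > 0) {
        iov[i].iov_base = (char *)iov[i].iov_base + written;
        iov[i].iov_len -= written;
    }
    return i;
}
```

Once the descriptor becomes writable again, the caller would retry with writev(fd, iov + i, iovcnt - i), where i is the returned index.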

A quick look at the TCP_CORK API

If you need the kernel to align and buffer packet data over the lifespan of buffers (hence the inability to use writev()), then TCP_CORK should be considered. TCP_CORK is set on a socket file descriptor using the setsockopt() function. When the TCP_CORK option is set, only full packets are sent, until the TCP_CORK option is removed. This is important. To ensure all waiting data is sent promptly, the TCP_CORK option MUST be removed. (Linux does push corked data out on its own once a 200 millisecond ceiling is reached, but relying on that timeout just adds latency.) Herein lies the beauty of the Nagle algorithm. It doesn't require any intervention from the application programmer. But once you set TCP_CORK, you have to be prepared to remove it when there is no more data to send. I can't stress this enough, as it is possible that TCP_CORK could cause subtle bugs if the cork isn't pulled at the appropriate times.

To set TCP_CORK use the following:

int state = 1;  /* non-zero inserts the cork */
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));

The cork can be removed, and any partial packet data sent, with:

int state = 0;  /* zero pulls the cork and sends partial packets */
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));

As I mentioned, I use the flush paradigm, which awkwardly involves removing and then reapplying the TCP_CORK option. This can be done as follows:

int state = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));  /* uncork: send partial packets */
state = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));  /* cork again for the next message */
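
The snippets above ignore the return value of setsockopt(). A slightly more defensive sketch, with the calls wrapped in two helpers, could look like this. The names tcp_cork_set() and tcp_cork_flush() are made up for the example, and the code assumes a Linux system where <netinet/tcp.h> defines TCP_CORK.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: insert (on != 0) or pull (on == 0) the cork.
 * Returns 0 on success and -1 on error, with errno set by setsockopt().
 */
static int tcp_cork_set(int fd, int on)
{
    int state = on ? 1 : 0;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
}

/* "Flush": pull the cork to push out partial packets, then re-cork. */
static int tcp_cork_flush(int fd)
{
    if (tcp_cork_set(fd, 0) == -1)
        return -1;
    return tcp_cork_set(fd, 1);
}
```

An application would call tcp_cork_set(fd, 1) once after creating the socket, tcp_cork_flush(fd) at each message boundary, and tcp_cork_set(fd, 0) when it has nothing more to send.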

Other solutions

User-mode buffered streams are another solution to the problem. User-mode buffering is implemented as follows: instead of calling write() directly, the application stores data in a write buffer. When the write buffer is full, all of its data is then sent with a call to write().

Even with buffered streams, the application must be able to instruct the OS to forward all pending data when the stream has been flushed, for optimal performance. The application does not know where packet boundaries reside, hence buffer flushes might not align on packet boundaries. TCP_CORK can pack data more effectively, because it has direct access to the TCP/IP layer.

Also, application buffering requires gratuitous memory copies, which many high-performance servers attempt to minimize. Memory bus contention and latency often limit a server's throughput.

If you do use an application buffering and streaming mechanism (as does Apache), I highly recommend applying the TCP_NODELAY socket option, which disables Nagle's algorithm. All calls to write() will then result in immediate transfer of data.
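
A rough sketch of such a buffer, assuming a blocking descriptor and with only minimal error handling, is shown below. The struct wbuf type, its two functions and the 4096-byte size are all invented for this illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WBUF_SIZE 4096  /* arbitrary size chosen for this sketch */

/* Hypothetical user-mode write buffer for a blocking descriptor. */
struct wbuf {
    int    fd;
    size_t used;
    char   data[WBUF_SIZE];
};

/* Send everything that is buffered. Returns 0 on success, -1 on error. */
static int wbuf_flush(struct wbuf *w)
{
    size_t off = 0;

    assert(w != NULL);
    while (off < w->used) {
        ssize_t n = write(w->fd, w->data + off, w->used - off);
        if (n <= 0)
            return -1;  /* treated as fatal here; EINTR is not retried */
        off += (size_t)n;
    }
    w->used = 0;
    return 0;
}

/* Copy data into the buffer, calling write() only when it fills up. */
static int wbuf_write(struct wbuf *w, const void *src, size_t len)
{
    const char *p = src;

    assert(w != NULL);
    while (len > 0) {
        size_t room = WBUF_SIZE - w->used;
        size_t n = len < room ? len : room;

        memcpy(w->data + w->used, p, n);
        w->used += n;
        p += n;
        len -= n;
        if (w->used == WBUF_SIZE && wbuf_flush(w) == -1)
            return -1;
    }
    return 0;
}
```

The memcpy() in wbuf_write() is the extra copy that this approach costs, and wbuf_flush() is what the application would call at each message boundary.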
