Zero Copy I: User-Mode Perspective


Excerpted from: http://www.linuxjournal.com/article/6345

Jan 01, 2003 By Dragan Stancevic in SysAdmin

Explaining what zero-copy functionality is for Linux, why it's useful and where it needs work.

By now almost everyone has heard of so-called zero-copy functionality under Linux, but I often run into people who don't have a full understanding of the subject. Because of this, I decided to write a few articles that dig into the matter a bit deeper, in the hope of unraveling this useful feature. In this article, we take a look at zero copy from a user-mode application point of view, so gory kernel-level details are omitted intentionally.

What Is Zero-Copy?

To better understand the solution to a problem, we first need to understand the problem itself. Let's look at what is involved in the simple procedure of a network server dæmon serving data stored in a file to a client over the network. Here's some sample code:

read(file, tmp_buf, len);
write(socket, tmp_buf, len);

Looks simple enough; you would think there is not much overhead with only those two system calls. In reality, this couldn't be further from the truth. Behind those two calls, the data has been copied at least four times, and almost as many user/kernel context switches have been performed. (Actually this process is much more complicated, but I wanted to keep it simple.) To get a better idea of the process involved, take a look at Figure 1. The top side shows context switches, and the bottom side shows copy operations.

Figure 1. Copying in Two Sample System Calls

Step one: the read system call causes a context switch from user mode to kernel mode. The first copy is performed by the DMA engine, which reads file contents from the disk and stores them into a kernel address space buffer.

Step two: data is copied from the kernel buffer into the user buffer, and the read system call returns. The return from the call causes a context switch from kernel back to user mode. Now the data is stored in the user address space buffer, and it can begin its way down again.

Step three: the write system call causes a context switch from user mode to kernel mode. A third copy is performed to put the data into a kernel address space buffer again. This time, though, the data is put into a different buffer, a buffer that is associated with sockets specifically.

Step four: the write system call returns, creating our fourth context switch. Independently and asynchronously, a fourth copy happens as the DMA engine passes the data from the kernel buffer to the protocol engine. You are probably asking yourself, “What do you mean independently and asynchronously? Wasn't the data transmitted before the call returned?” Call return, in fact, doesn't guarantee transmission; it doesn't even guarantee the start of the transmission. It simply means the Ethernet driver had free descriptors in its queue and has accepted our data for transmission. There could be numerous packets queued before ours. Unless the driver/hardware implements priority rings or queues, data is transmitted on a first-in-first-out basis. (The forked DMA copy in Figure 1 illustrates the fact that the last copy can be delayed.)
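The two-line fragment above ignores short reads and short writes. Here is a minimal sketch of the same copy path written as a loop, assuming blocking descriptors. The helper name copy_read_write and the 4096-byte buffer size are illustrative choices, not part of the original program:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd through a
 * user-space buffer. Every byte crosses the user/kernel boundary twice. */
ssize_t copy_read_write(int in_fd, int out_fd)
{
    char tmp_buf[4096];           /* user-space bounce buffer (size assumed) */
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, tmp_buf, sizeof(tmp_buf));
        if (n == 0)
            return total;         /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted before any data: retry */
            return -1;
        }
        /* write may accept fewer bytes than requested, so loop */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(out_fd, tmp_buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += w;
        }
        total += n;
    }
}
```

Each pass through the outer loop performs the kernel-to-user copy of step two and the user-to-kernel copy of step three.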

As you can see, much of this data duplication is not really necessary. Some of the duplication could be eliminated to decrease overhead and increase performance. As a driver developer, I work with hardware that has some pretty advanced features. Some hardware can bypass the main memory altogether and transmit data directly to another device. This feature eliminates a copy in the system memory and is a nice thing to have, but not all hardware supports it. There is also the issue of the data from the disk having to be repackaged for the network, which introduces some complications. To eliminate overhead, we could start by eliminating some of the copying between the kernel and user buffers.

One way to eliminate a copy is to skip calling read and instead call mmap. For example:

tmp_buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, file, 0);
write(socket, tmp_buf, len);

To get a better idea of the process involved, take a look at Figure 2. Context switches remain the same.

Figure 2. Calling mmap

Step one: the mmap system call causes the file contents to be copied into a kernel buffer by the DMA engine. The buffer is then shared with the user process, without any copy being performed between the kernel and user memory spaces.

Step two: the write system call causes the kernel to copy the data from the original kernel buffers into the kernel buffers associated with sockets.

Step three: the third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.
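The three steps above can be sketched as a small function. This is only an illustration under a few assumptions: the input is a regular file, the whole file is mapped at once, and the name send_mmap_write is made up for this example. For brevity it treats any write error, including an interrupted call, as fatal:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: send a whole regular file with mmap + write. */
ssize_t send_mmap_write(int file_fd, int out_fd)
{
    struct stat st;

    if (fstat(file_fd, &st) < 0)
        return -1;
    if (st.st_size == 0)
        return 0;                 /* mmap rejects zero-length mappings */

    size_t len = (size_t)st.st_size;
    /* Map the file read-only; no copy into a separate user buffer. */
    char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, file_fd, 0);
    if (map == MAP_FAILED)
        return -1;

    size_t done = 0;
    while (done < len) {
        /* The kernel copies from the mapped pages into the socket buffer. */
        ssize_t w = write(out_fd, map + done, len - done);
        if (w < 0) {
            munmap(map, len);
            return -1;
        }
        done += (size_t)w;
    }
    munmap(map, len);
    return (ssize_t)done;
}
```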

By using mmap instead of read, we've cut in half the amount of data the kernel has to copy. This yields reasonably good results when a lot of data is being transmitted. However, this improvement doesn't come without a price; there are hidden pitfalls when using the mmap+write method. You will fall into one of them when you memory map a file and then call write while another process truncates the same file. Your write system call will be interrupted by the bus error signal SIGBUS, because you performed a bad memory access. The default behavior for that signal is to kill the process and dump core, which is not the most desirable operation for a network server. There are two ways to get around this problem.

The first way is to install a signal handler for the SIGBUS signal, and then simply call return in the handler. By doing this the write system call returns with the number of bytes it wrote before it got interrupted and the errno set to success. Let me point out that this would be a bad solution, one that treats the symptoms and not the cause of the problem. Because SIGBUS signals that something has gone seriously wrong with the process, I would discourage using this as a solution.

The second solution involves file leasing (which is called “opportunistic locking” in Microsoft Windows) from the kernel. This is the correct way to fix this problem. By using leasing on the file descriptor, you take a lease with the kernel on a particular file. You then can request a read/write lease from the kernel. When another process tries to truncate the file you are transmitting, the kernel sends you a real-time signal, the RT_SIGNAL_LEASE signal. It tells you the kernel is breaking your write or read lease on that file. Your write call is interrupted before your program accesses an invalid address and gets killed by the SIGBUS signal. The return value of the write call is the number of bytes written before the interruption, and the errno will be set to success. Here is some sample code that shows how to get a lease from the kernel:

if(fcntl(fd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
    perror("kernel lease set signal");
    return -1;
}
/* l_type can be F_RDLCK F_WRLCK */
if(fcntl(fd, F_SETLEASE, l_type)){
    perror("kernel lease set type");
    return -1;
}

You should get your lease before mmapping the file, and break your lease after you are done. This is achieved by calling fcntl F_SETLEASE with the lease type of F_UNLCK.
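The take-then-release ordering described above might look like the following sketch. Several things here are assumptions rather than the article's code: the value chosen for RT_SIGNAL_LEASE (the article never defines it), the helper names take_read_lease and drop_lease, and the flag-setting handler. A handler of some kind is needed because the default action for a real-time signal is to terminate the process. Per fcntl(2), a read lease can be placed only on a descriptor opened read-only, on a file the caller owns (or with CAP_LEASE):

```c
#define _GNU_SOURCE               /* F_SETSIG and F_SETLEASE are Linux-specific */
#include <fcntl.h>
#include <signal.h>
#include <string.h>

/* Assumption: the article never defines RT_SIGNAL_LEASE, so this sketch
 * picks a real-time signal for it. */
#define RT_SIGNAL_LEASE (SIGRTMIN + 1)

static volatile sig_atomic_t lease_broken;  /* set when the lease is broken */

static void on_lease_break(int signo)
{
    (void)signo;
    lease_broken = 1;
}

/* Hypothetical helper: call before mmap. Returns 0 on success, -1 on error. */
int take_read_lease(int fd)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_lease_break;  /* no SA_RESTART, so write() can return early */
    sigemptyset(&sa.sa_mask);
    if (sigaction(RT_SIGNAL_LEASE, &sa, NULL) == -1)
        return -1;
    if (fcntl(fd, F_SETSIG, RT_SIGNAL_LEASE) == -1)
        return -1;
    return fcntl(fd, F_SETLEASE, F_RDLCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: call after munmap, once the transfer is done. */
int drop_lease(int fd)
{
    return fcntl(fd, F_SETLEASE, F_UNLCK) == -1 ? -1 : 0;
}
```

A sender would call take_read_lease, map and transmit the file, check lease_broken after a short write, and finish with drop_lease.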

Sendfile

In kernel version 2.1, the sendfile system call was introduced to simplify the transmission of data over the network and between two local files. Introduction of sendfile not only reduces data copying, it also reduces context switches. Use it like this:

sendfile(socket, file, NULL, len);

To get a better idea of the process involved, take a look at Figure 3.

Figure 3. Replacing Read and Write with Sendfile

Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine. Then the data is copied by the kernel into the kernel buffer associated with sockets.

Step two: the third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.

You are probably wondering what happens if another process truncates the file we are transmitting with the sendfile system call. If we don't register any signal handlers, the sendfile call simply returns with the number of bytes it transferred before it got interrupted, and the errno will be set to success.

If we get a lease from the kernel on the file before we call sendfile, however, the behavior and the return status are exactly the same. We also get the RT_SIGNAL_LEASE signal before the sendfile call returns.

So far, we have been able to avoid having the kernel make several copies, but we are still left with one copy. Can that be avoided too? Absolutely, with a little help from the hardware. To eliminate all the data duplication done by the kernel, we need a network interface that supports gather operations. This simply means that data awaiting transmission doesn't need to be in consecutive memory; it can be scattered through various memory locations. In kernel version 2.4, the socket buffer descriptor was modified to accommodate those requirements. This is what is known as zero copy under Linux. This approach not only reduces multiple context switches, it also eliminates data duplication done by the processor. For user-level applications nothing has changed, so the code still looks like this:

sendfile(socket, file, NULL, len);

To get a better idea of the process involved, take a look at Figure 4.

Figure 4. Hardware that supports gather can assemble data from multiple memory locations, eliminating another copy.

Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine.

Step two: no data is copied into the socket buffer. Instead, only descriptors with information about the whereabouts and length of the data are appended to the socket buffer. The DMA engine passes data directly from the kernel buffer to the protocol engine, thus eliminating the remaining final copy.

Because data still is actually copied from the disk to the memory and from the memory to the wire, some might argue this is not a true zero copy. This is zero copy from the operating system standpoint, though, because the data is not duplicated between kernel buffers. When using zero copy, other performance benefits can be had besides copy avoidance, such as fewer context switches, less CPU data cache pollution and no CPU checksum calculations.

Now that we know what zero copy is, let's put theory into practice and write some code. You can download the full source code from www.xalien.org/articles/source/sfl-src.tgz. To unpack the source code, type tar -zxvf sfl-src.tgz at the prompt. To compile the code and create the random data file data.bin, run make.

Let's look at the code, starting with the header files:

/* sfl.c sendfile example program
Dragan Stancevic <
header name                 function / variable
-------------------------------------------------*/
#include <stdio.h>          /* printf, perror */
#include <fcntl.h>          /* open */
#include <unistd.h>         /* close */
#include <errno.h>          /* errno */
#include <string.h>         /* memset */
#include <stdlib.h>         /* exit */
#include <sys/socket.h>     /* socket */
#include <netinet/in.h>     /* sockaddr_in */
#include <sys/sendfile.h>   /* sendfile */
#include <arpa/inet.h>      /* inet_addr */
#define BUFF_SIZE (10*1024) /* size of the tmp
                               buffer */

Besides the regular <sys/socket.h> and <netinet/in.h> required for basic socket operation, we need a prototype definition of the sendfile system call. This can be found in the <sys/sendfile.h> header. Next we set the server flag and open the descriptors:

/* are we sending or receiving */
if(argv[1][0] == 's') is_server++;
/* open descriptors */
sd = socket(PF_INET, SOCK_STREAM, 0);
if(is_server) fd = open("data.bin", O_RDONLY);

The same program can act as either a server/sender or a client/receiver. We have to check one of the command-prompt parameters, and then set the flag is_server to run in sender mode. We also open a stream socket of the INET protocol family. As part of running in server mode we need some type of data to transmit to a client, so we open our data file. We are using the system call sendfile to transmit data, so we don't have to read the actual contents of the file and store it in our program memory buffer. Here's the server address:

/* clear the memory */
memset(&sa, 0, sizeof(struct sockaddr_in));
/* initialize structure */
sa.sin_family = PF_INET;
sa.sin_port = htons(1033);
sa.sin_addr.s_addr = inet_addr(argv[2]);

We clear the server address structure and assign the protocol family, port and IP address of the server. The address of the server is passed as a command-line parameter. The port number is hard coded to unassigned port 1033. This port number was chosen because it is above the port range requiring root access to the system.
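The address setup above trusts argv[2] to be a valid address, and inet_addr reports failure with a value that is also the legitimate broadcast address. As an alternative sketch, not the article's code, the same initialization can be done with inet_pton, which returns a distinct status. The helper name make_addr is hypothetical:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: fill sa for an IPv4 address string and port.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad address. */
int make_addr(struct sockaddr_in *sa, const char *ip, unsigned short port)
{
    memset(sa, 0, sizeof(*sa));   /* clear the structure first */
    sa->sin_family = AF_INET;     /* same value as PF_INET on Linux */
    sa->sin_port = htons(port);   /* port in network byte order */
    return inet_pton(AF_INET, ip, &sa->sin_addr) == 1 ? 0 : -1;
}
```

With this helper, the three assignments in the listing would collapse to a single checked call such as make_addr(&sa, argv[2], 1033).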

Here is the server execution branch:

if(is_server){
    int client; /* new client socket */
    printf("Server binding to [%s]\n", argv[2]);
    if(bind(sd, (struct sockaddr *)&sa,
                      sizeof(sa)) < 0){
        perror("bind");
        exit(errno);
    }

As a server, we need to assign an address to our socket descriptor. This is achieved by the system call bind, which assigns the socket descriptor (sd) a server address (sa):

if(listen(sd,1) < 0){
    perror("listen");
    exit(errno);
}

Because we are using a stream socket, we have to advertise our willingness to accept incoming connections and set the connection queue size. I've set the backlog queue to 1, but it is common to set the backlog a bit higher for established connections waiting to be accepted. In older versions of the kernel, the backlog queue was used to prevent SYN flood attacks. Because the system call listen changed to set parameters for only established connections, the backlog queue feature has been deprecated for this call. The kernel parameter tcp_max_syn_backlog has taken over the role of protecting the system from SYN flood attacks:

if((client = accept(sd, NULL, NULL)) < 0){
    perror("accept");
    exit(errno);
}

The system call accept creates a new connected socket from the first connection request on the pending connections queue. The return value from the call is a descriptor for a newly created connection; the socket is now ready for read, write or poll/select system calls:

if((cnt = sendfile(client,fd,&off,
                          BUFF_SIZE)) < 0){
    perror("sendfile");
    exit(errno);
}
printf("Server sent %d bytes.\n", cnt);
close(client);

A connection is established on the client socket descriptor, so we can start transmitting data to the remote system. We do this by calling the sendfile system call, which is prototyped under Linux in the following manner:

extern ssize_t
sendfile (int __out_fd, int __in_fd, off_t *offset,
          size_t __count) __THROW;

The first two parameters are file descriptors. The third parameter points to an offset from which sendfile should start sending data. The fourth parameter is the number of bytes we want to transmit. In order for the sendfile transmit to use zero-copy functionality, you need memory gather operation support from your networking card. You also need checksum capabilities for protocols that implement checksums, such as TCP or UDP. If your NIC is outdated and doesn't support those features, you still can use sendfile to transmit files. The difference is the kernel will merge the buffers before transmitting them.

Portability Issues

One of the problems with the sendfile system call, in general, is the lack of a standard interface such as the one the open system call has. Sendfile implementations in Linux, Solaris or HP-UX are quite different. This poses a problem for developers who wish to use zero copy in their network data transmission code.

One of the implementation differences is that Linux provides a sendfile that defines an interface for transmitting data between two file descriptors (file-to-file) and (file-to-socket). The HP-UX and Solaris versions, on the other hand, can be used only for file-to-socket transfers.

The second difference is that Linux doesn't implement vectored transfers. Solaris sendfile and HP-UX sendfile have extra parameters that eliminate overhead associated with prepending headers to the data being transmitted.
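To illustrate the file-to-file case mentioned above, here is a rough sketch of a file copy built on sendfile alone. It assumes a kernel whose sendfile accepts a regular file as the output descriptor, which has not been true of every Linux release. The helper name copy_file and the 0644 mode are illustrative:

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy src_path to dst_path using sendfile only.
 * Returns 0 if the whole file was copied, -1 otherwise. */
int copy_file(const char *src_path, const char *dst_path)
{
    int rc = -1;
    int in = open(src_path, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    struct stat st;
    if (fstat(in, &st) == 0) {
        off_t left = st.st_size;
        while (left > 0) {
            /* NULL offset: read from, and advance, in's file position */
            ssize_t n = sendfile(out, in, NULL, (size_t)left);
            if (n <= 0)
                break;            /* error or unexpected end of file */
            left -= n;
        }
        if (left == 0)
            rc = 0;
    }
    close(in);
    close(out);
    return rc;
}
```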

Looking Ahead

The implementation of zero copy under Linux is far from finished and is likely to change in the near future. More functionality should be added. For example, the sendfile call doesn't support vectored transfers, and servers such as Samba and Apache have to use multiple sendfile calls with the TCP_CORK flag set. This flag tells the system more data is coming through in the next sendfile calls. TCP_CORK also is incompatible with TCP_NODELAY and is used when we want to prepend or append headers to the data. This is a perfect example of where a vectored call would eliminate the need for multiple sendfile calls and delays mandated by the current implementation.
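The TCP_CORK pattern described above can be sketched as follows. This is an illustration of the general idea, not code from Samba or Apache. The helper name send_with_header is hypothetical, the socket is assumed to be a connected, blocking TCP socket, and a production version would loop on short counts from sendfile:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: send a header followed by file data while the
 * socket is corked, so the two can share segments. Returns the number
 * of file bytes sent, or -1 on error. */
ssize_t send_with_header(int sock, const void *hdr, size_t hdr_len,
                         int file_fd, size_t file_len)
{
    int on = 1, off = 0;
    off_t pos = 0;                /* start of the file */
    ssize_t sent = -1;

    /* Cork: hold back partial frames until we uncork. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)) < 0)
        return -1;
    if (write(sock, hdr, hdr_len) == (ssize_t)hdr_len)
        sent = sendfile(sock, file_fd, &pos, file_len);
    /* Uncork: flush whatever is still queued. */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
    return sent;
}
```

The cork and uncork calls are the extra system calls that a vectored sendfile would make unnecessary.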

One rather unpleasant limitation in the current sendfile is it cannot be used when transferring files greater than 2GB. Files of such size are not all that uncommon today, and it's rather disappointing having to duplicate all that data on its way out. Because both sendfile and mmap methods are unusable in this case, a sendfile64 would be really handy in a future kernel version.

Conclusion

Despite some drawbacks, zero-copy sendfile is a useful feature, and I hope you have found this article informative enough to start using it in your programs. If you have a more in-depth interest in the subject, keep an eye out for my second article, titled “Zero Copy II: Kernel Perspective”, where I will dig a bit more into the kernel internals of zero copy.

Further Information

email: visitor@xalien.org

Dragan Stancevic is a kernel and hardware bring-up engineer in his late twenties. He is a software engineer by profession but has a deep interest in applied physics and has been known to play with extremely high voltages in his free time.
