How to use epoll? A complete example in C

Thursday, 2 June 2011 @ 1238 GMT by Mukund Sivaraman

Network servers are traditionally implemented using a separate process or thread per connection. For high performance applications that need to handle a very large number of clients simultaneously, this approach won't work well, because factors such as resource usage and context-switching time influence the ability to handle many clients at a time. An alternate method is to perform non-blocking I/O in a single thread, along with some readiness notification method which tells you when you can read or write more data on a socket.

This article is an introduction to Linux's epoll(7) facility, which is the best readiness notification facility in Linux. We will write sample code for a complete TCP server implementation in C. I assume you have C programming experience, know how to compile and run programs on Linux, and can read man pages of the various C functions that are used.

epoll was introduced in Linux 2.6, and is not available in other UNIX-like operating systems. It provides a facility similar to the select(2) and poll(2) functions:

  • select(2) can monitor up to FD_SETSIZE descriptors at a time, typically a small number determined at libc's compile time (1024 with glibc).
  • poll(2) doesn't have a fixed limit on the number of descriptors it can monitor at a time, but apart from other things, we have to perform a linear scan of all the passed descriptors every time to check which ones are ready, which is O(n) and slow.

epoll has no such fixed limits, and does not perform any linear scans. Hence it is able to perform better and handle a larger number of events.
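For contrast, here is a small sketch of what that linear scan looks like with poll(2). It is not part of the server we are about to write. The function name poll_scan_demo() and the use of pipes in place of sockets are assumptions made only to keep the example short.

```c
#include <poll.h>
#include <unistd.h>

#define NPIPES 4

/* Creates NPIPES pipes, makes only pipe number ready_index readable,
   and uses poll(2) to find it.  Returns the index of the first ready
   pipe, or -1 on failure. */
int
poll_scan_demo (int ready_index)
{
  int pipes[NPIPES][2];
  struct pollfd fds[NPIPES];
  int i, created = 0, found = -1;

  if (ready_index < 0 || ready_index >= NPIPES)
    return -1;

  for (i = 0; i < NPIPES; i++)
    {
      if (pipe (pipes[i]) == -1)
        goto out;
      created++;
      fds[i].fd = pipes[i][0];
      fds[i].events = POLLIN;
      fds[i].revents = 0;
    }

  if (write (pipes[ready_index][1], "x", 1) != 1)
    goto out;

  if (poll (fds, NPIPES, 1000) > 0)
    {
      /* poll() only says how many entries are ready; the caller has
         to walk the whole array to find out which ones. */
      for (i = 0; i < NPIPES; i++)
        if (fds[i].revents & POLLIN)
          {
            found = i;
            break;
          }
    }

out:
  for (i = 0; i < created; i++)
    {
      close (pipes[i][0]);
      close (pipes[i][1]);
    }
  return found;
}
```

Calling poll_scan_demo (2) from a main() should report index 2. With thousands of descriptors, that walk over the array happens on every call, whereas epoll_wait(2) hands back only the ready entries.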

An epoll instance is created by epoll_create(2) or epoll_create1(2) (they take different arguments), which return a file descriptor referring to the new epoll instance. epoll_ctl(2) is used to add/remove descriptors to be watched on the epoll instance. To wait for events on the watched set, epoll_wait(2) is used, which blocks until events are available. Please see their manpages for more info.
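To make the three calls concrete before we get to sockets, here is a minimal sketch that watches the read end of a pipe. The function name epoll_basic_demo() is made up for this illustration, and the one second timeout is an arbitrary choice so that the call cannot hang.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Watches the read end of a pipe with epoll and returns what
   epoll_wait() reports once one byte has been written to the pipe:
   the number of ready descriptors, or -1 on failure. */
int
epoll_basic_demo (void)
{
  int pipefd[2], efd, n = -1;
  struct epoll_event event, ready;

  if (pipe (pipefd) == -1)
    return -1;

  efd = epoll_create1 (0);        /* create the epoll instance */
  if (efd == -1)
    goto out_pipe;

  event.events = EPOLLIN;         /* we care about readability */
  event.data.fd = pipefd[0];      /* handed back with the event */
  if (epoll_ctl (efd, EPOLL_CTL_ADD, pipefd[0], &event) == -1)
    goto out;

  if (write (pipefd[1], "x", 1) != 1)
    goto out;

  /* Wait at most one second for the pipe to become readable. */
  n = epoll_wait (efd, &ready, 1, 1000);
  if (n == 1 && ready.data.fd != pipefd[0])
    n = -1;

out:
  close (efd);
out_pipe:
  close (pipefd[0]);
  close (pipefd[1]);
  return n;
}
```

The server later in this article follows the same create, add, wait sequence, only with a listening socket and a loop around epoll_wait(2).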

When descriptors are added to an epoll instance, they can be added in two modes: level triggered and edge triggered. When you use level triggered mode, and data is available for reading, epoll_wait(2) will always return with ready events. If you don't read the data completely, and call epoll_wait(2) on the epoll instance watching the descriptor again, it will return again with a ready event because data is available. In edge triggered mode, you will only get a readiness notification once. If you don't read the data fully, and call epoll_wait(2) on the epoll instance watching the descriptor again, it will block because the readiness event was already delivered.
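The difference is easy to observe with a pipe. The sketch below is only an illustration. The name trigger_mode_demo() and the 100 ms timeout are assumptions, and the timeout exists only so that the edge triggered case returns instead of blocking forever.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe with EPOLLIN plus extra_flags
   (0 for level triggered, EPOLLET for edge triggered), writes one
   byte, and calls epoll_wait() twice without ever reading the byte.
   Stores both return values.  Returns 0 on success, -1 on failure. */
int
trigger_mode_demo (uint32_t extra_flags, int *first, int *second)
{
  int pipefd[2], efd, rc = -1;
  struct epoll_event event, ready;

  if (pipe (pipefd) == -1)
    return -1;

  efd = epoll_create1 (0);
  if (efd == -1)
    goto out_pipe;

  event.events = EPOLLIN | extra_flags;
  event.data.fd = pipefd[0];
  if (epoll_ctl (efd, EPOLL_CTL_ADD, pipefd[0], &event) == -1)
    goto out;

  if (write (pipefd[1], "x", 1) != 1)
    goto out;

  /* The byte stays unread between the two calls. */
  *first = epoll_wait (efd, &ready, 1, 100);
  *second = epoll_wait (efd, &ready, 1, 100);
  rc = 0;

out:
  close (efd);
out_pipe:
  close (pipefd[0]);
  close (pipefd[1]);
  return rc;
}
```

With extra_flags set to 0, both calls should report one ready descriptor. With EPOLLET, the first call reports it and the second times out with 0, which is why edge triggered code must keep reading until it sees EAGAIN.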

The epoll event structure that you pass to epoll_ctl(2) is shown below. With every descriptor being watched, you can associate an integer or a pointer as user data.

typedef union epoll_data
{
  void        *ptr;
  int          fd;
  __uint32_t   u32;
  __uint64_t   u64;
} epoll_data_t;

struct epoll_event
{
  __uint32_t   events; /* Epoll events */
  epoll_data_t data;   /* User data variable */
};
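The server in this article only stores the descriptor in data.fd. As a sketch of the pointer variant, the example below attaches a small record through data.ptr instead. The struct conn_state and the function user_data_demo() are hypothetical names invented for this illustration.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection record; not part of the article's code. */
struct conn_state
{
  int fd;       /* data is a union, so the fd has to be kept here */
  long tag;     /* stands in for buffers, counters, and so on */
};

/* Registers the read end of a pipe with a conn_state attached through
   data.ptr, makes it readable, and returns the tag recovered from the
   event that epoll_wait() hands back, or -1 on failure. */
long
user_data_demo (long tag)
{
  int pipefd[2], efd;
  long result = -1;
  struct epoll_event event, ready;
  struct conn_state *state;

  if (pipe (pipefd) == -1)
    return -1;

  state = malloc (sizeof *state);
  efd = epoll_create1 (0);
  if (state == NULL || efd == -1)
    goto out;

  state->fd = pipefd[0];
  state->tag = tag;

  event.events = EPOLLIN;
  event.data.ptr = state;       /* epoll stores this value untouched */
  if (epoll_ctl (efd, EPOLL_CTL_ADD, pipefd[0], &event) == -1)
    goto out;

  if (write (pipefd[1], "x", 1) != 1)
    goto out;

  if (epoll_wait (efd, &ready, 1, 1000) == 1)
    {
      struct conn_state *got = ready.data.ptr;
      if (got->fd == pipefd[0])
        result = got->tag;
    }

out:
  if (efd != -1)
    close (efd);
  free (state);
  close (pipefd[0]);
  close (pipefd[1]);
  return result;
}
```

Because data is a union, setting data.ptr overwrites data.fd, so a record used this way needs to carry its own copy of the descriptor.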

Let's write code now. We'll implement a tiny TCP server that prints everything sent to the socket on standard output. We'll begin by writing a function create_and_bind() which creates and binds a TCP socket:

/* Headers needed by the code in this article */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <errno.h>

static int
create_and_bind (char *port)
{
  struct addrinfo hints;
  struct addrinfo *result, *rp;
  int s, sfd;

  memset (&hints, 0, sizeof (struct addrinfo));
  hints.ai_family = AF_UNSPEC;     /* Return IPv4 and IPv6 choices */
  hints.ai_socktype = SOCK_STREAM; /* We want a TCP socket */
  hints.ai_flags = AI_PASSIVE;     /* All interfaces */

  s = getaddrinfo (NULL, port, &hints, &result);
  if (s != 0)
    {
      fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (s));
      return -1;
    }

  for (rp = result; rp != NULL; rp = rp->ai_next)
    {
      sfd = socket (rp->ai_family, rp->ai_socktype, rp->ai_protocol);
      if (sfd == -1)
        continue;

      s = bind (sfd, rp->ai_addr, rp->ai_addrlen);
      if (s == 0)
        {
          /* We managed to bind successfully! */
          break;
        }

      close (sfd);
    }

  if (rp == NULL)
    {
      fprintf (stderr, "Could not bind\n");
      freeaddrinfo (result);
      return -1;
    }

  freeaddrinfo (result);

  return sfd;
}

create_and_bind() contains a standard code block for a portable way of getting an IPv4 or IPv6 socket. It accepts a port argument as a string, where argv[1] can be passed. The getaddrinfo(3) function returns a bunch of addrinfo structures in result, which are compatible with the hints passed in the hints argument. The addrinfo struct looks like this:

struct addrinfo
{
  int              ai_flags;
  int              ai_family;
  int              ai_socktype;
  int              ai_protocol;
  socklen_t        ai_addrlen;
  struct sockaddr *ai_addr;
  char            *ai_canonname;
  struct addrinfo *ai_next;
};

We walk through the structures one by one and try creating sockets using them, until we are able to both create and bind a socket. If we were successful, create_and_bind() returns the socket descriptor. If unsuccessful, it returns -1.
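If you are curious what the list looks like on your machine, the following sketch walks it without creating any sockets. The function name list_candidates() is an assumption made for this illustration. It uses the same hints as create_and_bind() and prints one line per entry.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Prints the address family of every addrinfo entry that
   getaddrinfo() returns for a passive TCP socket on the given port.
   Returns the number of entries, or -1 if the lookup fails. */
int
list_candidates (const char *port)
{
  struct addrinfo hints, *result, *rp;
  int s, count = 0;

  memset (&hints, 0, sizeof hints);
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;
  hints.ai_flags = AI_PASSIVE;

  s = getaddrinfo (NULL, port, &hints, &result);
  if (s != 0)
    {
      fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (s));
      return -1;
    }

  for (rp = result; rp != NULL; rp = rp->ai_next)
    {
      const char *family = "other";

      if (rp->ai_family == AF_INET)
        family = "IPv4";
      else if (rp->ai_family == AF_INET6)
        family = "IPv6";

      printf ("candidate %d: %s, addrlen=%u\n",
              count, family, (unsigned) rp->ai_addrlen);
      count++;
    }

  freeaddrinfo (result);
  return count;
}
```

Calling list_candidates ("8080") typically lists one IPv4 and one IPv6 wildcard address, though the exact entries and their order depend on the system configuration.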

Next, let's write a function to make a socket non-blocking. make_socket_non_blocking() sets the O_NONBLOCK flag on the descriptor passed in the sfd argument:

static int
make_socket_non_blocking (int sfd)
{
  int flags, s;

  flags = fcntl (sfd, F_GETFL, 0);
  if (flags == -1)
    {
      perror ("fcntl");
      return -1;
    }

  flags |= O_NONBLOCK;
  s = fcntl (sfd, F_SETFL, flags);
  if (s == -1)
    {
      perror ("fcntl");
      return -1;
    }

  return 0;
}
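To see what the flag changes, here is a small sketch that applies the same fcntl(2) calls to the read end of an empty pipe and then tries to read from it. The function name empty_read_would_block() is invented for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Makes the read end of an empty pipe non-blocking and reads from it.
   Returns 1 if read() fails with EAGAIN/EWOULDBLOCK as expected,
   0 if it behaves differently, and -1 if the setup fails. */
int
empty_read_would_block (void)
{
  int pipefd[2], flags, result = -1;
  char byte;
  ssize_t count;

  if (pipe (pipefd) == -1)
    return -1;

  flags = fcntl (pipefd[0], F_GETFL, 0);
  if (flags == -1
      || fcntl (pipefd[0], F_SETFL, flags | O_NONBLOCK) == -1)
    goto out;

  /* Nothing has been written, so a blocking read would hang here. */
  count = read (pipefd[0], &byte, 1);
  result = (count == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

out:
  close (pipefd[0]);
  close (pipefd[1]);
  return result;
}
```

The event loop relies on exactly this behaviour: a read(2) or accept(2) that fails with EAGAIN tells us there is nothing more to do on that descriptor for now.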

Now, on to the main() function of the program which contains the event loop. This is the bulk of the program:

#define MAXEVENTS 64

int
main (int argc, char *argv[])
{
  int sfd, s;
  int efd;
  struct epoll_event event;
  struct epoll_event *events;

  if (argc != 2)
    {
      fprintf (stderr, "Usage: %s [port]\n", argv[0]);
      exit (EXIT_FAILURE);
    }

  sfd = create_and_bind (argv[1]);
  if (sfd == -1)
    abort ();

  s = make_socket_non_blocking (sfd);
  if (s == -1)
    abort ();

  s = listen (sfd, SOMAXCONN);
  if (s == -1)
    {
      perror ("listen");
      abort ();
    }

  efd = epoll_create1 (0);
  if (efd == -1)
    {
      perror ("epoll_create1");
      abort ();
    }

  event.data.fd = sfd;
  event.events = EPOLLIN | EPOLLET;
  s = epoll_ctl (efd, EPOLL_CTL_ADD, sfd, &event);
  if (s == -1)
    {
      perror ("epoll_ctl");
      abort ();
    }

  /* Buffer where events are returned */
  events = calloc (MAXEVENTS, sizeof event);
  if (events == NULL)
    {
      perror ("calloc");
      abort ();
    }

  /* The event loop */
  while (1)
    {
      int n, i;

      n = epoll_wait (efd, events, MAXEVENTS, -1);
      if (n == -1)
        {
          /* A signal interrupted the wait; just try again. */
          if (errno == EINTR)
            continue;
          perror ("epoll_wait");
          abort ();
        }

      for (i = 0; i < n; i++)
        {
          if ((events[i].events & EPOLLERR) ||
              (events[i].events & EPOLLHUP) ||
              (!(events[i].events & EPOLLIN)))
            {
              /* An error has occurred on this fd, or the socket is not
                 ready for reading (why were we notified then?) */
              fprintf (stderr, "epoll error\n");
              close (events[i].data.fd);
              continue;
            }

          else if (sfd == events[i].data.fd)
            {
              /* We have a notification on the listening socket, which
                 means one or more incoming connections. */
              while (1)
                {
                  /* sockaddr_storage is large enough for both IPv4
                     and IPv6 peer addresses. */
                  struct sockaddr_storage in_addr;
                  socklen_t in_len;
                  int infd;
                  char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];

                  in_len = sizeof in_addr;
                  infd = accept (sfd, (struct sockaddr *) &in_addr,
                                 &in_len);
                  if (infd == -1)
                    {
                      if ((errno == EAGAIN) ||
                          (errno == EWOULDBLOCK))
                        {
                          /* We have processed all incoming
                             connections. */
                          break;
                        }
                      else
                        {
                          perror ("accept");
                          break;
                        }
                    }

                  s = getnameinfo ((struct sockaddr *) &in_addr, in_len,
                                   hbuf, sizeof hbuf,
                                   sbuf, sizeof sbuf,
                                   NI_NUMERICHOST | NI_NUMERICSERV);
                  if (s == 0)
                    {
                      printf("Accepted connection on descriptor %d "
                             "(host=%s, port=%s)\n", infd, hbuf, sbuf);
                    }

                  /* Make the incoming socket non-blocking and add it to the
                     list of fds to monitor. */
                  s = make_socket_non_blocking (infd);
                  if (s == -1)
                    abort ();

                  event.data.fd = infd;
                  event.events = EPOLLIN | EPOLLET;
                  s = epoll_ctl (efd, EPOLL_CTL_ADD, infd, &event);
                  if (s == -1)
                    {
                      perror ("epoll_ctl");
                      abort ();
                    }
                }
              continue;
            }
          else
            {
              /* We have data on the fd waiting to be read. Read and
                 display it. We must read whatever data is available
                 completely, as we are running in edge-triggered mode
                 and won't get a notification again for the same
                 data. */
              int done = 0;

              while (1)
                {
                  ssize_t count;
                  char buf[512];

                  count = read (events[i].data.fd, buf, sizeof buf);
                  if (count == -1)
                    {
                      /* If errno == EAGAIN, that means we have read all
                         data. So go back to the main loop. */
                      if (errno != EAGAIN)
                        {
                          perror ("read");
                          done = 1;
                        }
                      break;
                    }
                  else if (count == 0)
                    {
                      /* End of file. The remote has closed the
                         connection. */
                      done = 1;
                      break;
                    }

                  /* Write the buffer to standard output */
                  s = write (1, buf, count);
                  if (s == -1)
                    {
                      perror ("write");
                      abort ();
                    }
                }

              if (done)
                {
                  printf ("Closed connection on descriptor %d\n",
                          events[i].data.fd);

                  /* Closing the descriptor will make epoll remove it
                     from the set of descriptors which are monitored. */
                  close (events[i].data.fd);
                }
            }
        }
    }

  free (events);

  close (sfd);

  return EXIT_SUCCESS;
}

main() first calls create_and_bind() which sets up the socket. It then makes the socket non-blocking, and then calls listen(2). It then creates an epoll instance in efd, to which it adds the listening socket sfd to watch for input events in edge-triggered mode.

The outer while loop is the main event loop. It calls epoll_wait(2), where the thread remains blocked waiting for events. When events are available, epoll_wait(2) returns the events in the events argument, which is an array of epoll_event structures.

The epoll instance in efd is continuously updated in the event loop when we add new incoming connections to watch, and remove existing connections when they die.
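The removal part relies on close(2), and there is one catch documented in epoll(7): a descriptor leaves the watched set only when every descriptor referring to the same open file description has been closed. The sketch below illustrates this with dup(2). The function name close_removal_demo() is an assumption made for this example, and the server in this article never duplicates descriptors, so it is not affected.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe that also has a dup()ed copy,
   makes it readable, and polls epoll (timeout 0) after closing the
   registered descriptor and again after closing the copy.
   Returns 0 on success, -1 on failure. */
int
close_removal_demo (int *with_dup_open, int *all_closed)
{
  int pipefd[2], copy, efd, rc = -1;
  struct epoll_event event, ready;

  if (pipe (pipefd) == -1)
    return -1;

  copy = dup (pipefd[0]);
  efd = epoll_create1 (0);
  if (copy == -1 || efd == -1)
    goto out;

  event.events = EPOLLIN;       /* level triggered */
  event.data.fd = pipefd[0];
  if (epoll_ctl (efd, EPOLL_CTL_ADD, pipefd[0], &event) == -1)
    goto out;

  if (write (pipefd[1], "x", 1) != 1)
    goto out;

  /* The copy keeps the file description alive, so the entry stays. */
  close (pipefd[0]);
  pipefd[0] = -1;
  *with_dup_open = epoll_wait (efd, &ready, 1, 0);

  /* Last reference gone: the kernel drops the entry. */
  close (copy);
  copy = -1;
  *all_closed = epoll_wait (efd, &ready, 1, 0);
  rc = 0;

out:
  if (efd != -1)
    close (efd);
  if (copy != -1)
    close (copy);
  if (pipefd[0] != -1)
    close (pipefd[0]);
  close (pipefd[1]);
  return rc;
}
```

If a program does share descriptors, for example across fork(2), removing the entry explicitly with epoll_ctl(2) and EPOLL_CTL_DEL before closing is the safer route.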

When events are available, they can be of three types:

  • Errors: When an error condition occurs, or the event is not a notification about data available for reading, we simply close the associated descriptor. Closing the descriptor automatically removes it from the watched set of epoll instance efd.
  • New connections: When the listening descriptor sfd is ready for reading, it means one or more new connections have arrived. While there are new connections, we accept(2) the connections, print a message about it, make the incoming socket non-blocking and add it to the watched set of epoll instance efd.
  • Client data: When data is available for reading on any of the client descriptors, we use read(2) to read the data in pieces of 512 bytes in an inner while loop. This is because we have to read all the data that is available now, as we won't get further events about it as the descriptor is watched in edge-triggered mode. The data which is read is written to stdout (fd=1) using write(2). If read(2) returns 0, it means an EOF and we can close the client's connection. If -1 is returned, and errno is set to EAGAIN, it means that all data for this event was read, and we can go back to the main loop.
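The read-until-EAGAIN pattern from the client data case can also be pulled out into a helper. This is a sketch rather than the article's code: drain_fd() and drain_demo() are hypothetical names, the helper counts bytes instead of writing them to stdout, and it additionally retries on EINTR. A socketpair(2) stands in for a real client connection.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Reads from a non-blocking descriptor until it would block.  Adds
   the number of bytes read to *total.  Returns 1 if the caller should
   close the descriptor (EOF or a real error), 0 otherwise. */
int
drain_fd (int fd, size_t *total)
{
  char buf[512];

  for (;;)
    {
      ssize_t count = read (fd, buf, sizeof buf);
      if (count == -1)
        {
          if (errno == EINTR)
            continue;
          if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;           /* everything available was read */
          return 1;
        }
      if (count == 0)
        return 1;               /* the peer closed the connection */
      *total += (size_t) count;
    }
}

/* Sends 1500 bytes over a socket pair, drains them, then closes the
   sending side and drains again.  Returns the number of bytes
   drained, or -1 if anything behaved unexpectedly. */
long
drain_demo (void)
{
  int sv[2], flags;
  char payload[1500];
  size_t total = 0;
  long result = -1;

  if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) == -1)
    return -1;

  flags = fcntl (sv[0], F_GETFL, 0);
  if (flags == -1 || fcntl (sv[0], F_SETFL, flags | O_NONBLOCK) == -1)
    goto out;

  memset (payload, 'a', sizeof payload);
  if (write (sv[1], payload, sizeof payload) != (ssize_t) sizeof payload)
    goto out;

  if (drain_fd (sv[0], &total) != 0)    /* data read, peer still open */
    goto out;

  close (sv[1]);
  sv[1] = -1;
  if (drain_fd (sv[0], &total) != 1)    /* read() now returns 0: EOF */
    goto out;

  result = (long) total;

out:
  close (sv[0]);
  if (sv[1] != -1)
    close (sv[1]);
  return result;
}
```

The 1500 byte payload is deliberately larger than the 512 byte buffer, so the loop has to go around several times before read(2) reports EAGAIN.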

That's that. It goes around and around in a loop, adding and removing descriptors in the watched set.

Download the epoll-example.c program.

Update 1: Level and edge triggered definitions were erroneously reversed (though the code was correct). It was noticed by Reddit user bodski. The article has been corrected now. I should have proof-read it before posting. Apologies, and thank you for pointing out the mistake. :)

Update 2: The code has been modified to run accept(2) until it says it would block, so that if multiple connections have arrived, we accept all of them. It was noticed by Reddit user cpitchford. Thank you for the comments. :)






