[Knowledge Notes] Epoll

Darren Gong

Last modified: 2023-05-10 21:38:29

First published: 2022-10-16 13:22:42

Article link: https://blog.csdn.net/axin1240101543/article/details/127346150

EPOLL(7) Linux Programmer's Manual EPOLL(7)

NAME

epoll - I/O event notification facility

SYNOPSIS

#include <sys/epoll.h>

DESCRIPTION

The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to
see if I/O is possible on any of them. The epoll API can be used either as an
edge-triggered or a level-triggered interface and scales well to large numbers of watched
file descriptors. The following system calls are provided to create and manage an epoll
instance:

* epoll_create(2) creates an epoll instance and returns a file descriptor referring to
  that instance. (The more recent epoll_create1(2) extends the functionality of
  epoll_create(2).)

* Interest in particular file descriptors is then registered via epoll_ctl(2). The set
  of file descriptors currently registered on an epoll instance is sometimes called an
  epoll set.

* epoll_wait(2) waits for I/O events, blocking the calling thread if no events are
  currently available.
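
The three calls above can be combined as in the following minimal sketch. The helper name wait_readable() and its return convention are assumptions made for illustration; they are not part of the epoll API.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if 'fd' is reported readable within
   'timeout_ms' milliseconds, 0 on timeout, and -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev, out;
    int epfd, n;

    epfd = epoll_create1(0);                      /* 1. create the instance */
    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN;                          /* interested in input */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {   /* 2. register */
        close(epfd);
        return -1;
    }

    n = epoll_wait(epfd, &out, 1, timeout_ms);    /* 3. wait for an event */
    close(epfd);
    return n;
}
```

A real event loop would create the epoll instance once and reuse it for many descriptors; creating and closing it on every call, as done here, only keeps the sketch short.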

Level-triggered and edge-triggered

The epoll event distribution interface is able to behave both as edge-triggered (ET) and
as level-triggered (LT). The difference between the two mechanisms can be described as
follows. Suppose that this scenario happens:

1. The file descriptor that represents the read side of a pipe (rfd, short for "read file
   descriptor") is registered on the epoll instance.

2. A pipe writer writes 2 kB of data on the write side of the pipe. (The write completes.)

3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor. (This
   consumes the event.)

4. The pipe reader reads 1 kB of data from rfd. (This is a partial read; it does not
   consume the whole buffer.)

5. A call to epoll_wait(2) is done. (In edge-triggered mode no new event is pending, so
   this call may block indefinitely.)

If the rfd file descriptor has been added to the epoll interface using the EPOLLET
(edge-triggered) flag, the call to epoll_wait(2) done in step 5 will probably hang despite
the available data still present in the file input buffer; meanwhile the remote peer might
be expecting a response based on the data it already sent. The reason for this is that
edge-triggered mode delivers events only when changes occur on the monitored file
descriptor.

So, in step 5 the caller might end up waiting for some data that is already present inside
the input buffer. In the above example, an event on rfd will be generated because of the
write done in 2 and the event is consumed in 3. Since the read operation done in 4 does not
consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block
indefinitely.
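
The five steps can be replayed on a real pipe. The sketch below is an illustration, not part of the man page: the function name second_wait_result() is hypothetical, and step 5 uses a timeout of 0 ms instead of -1 so that the edge-triggered case returns 0 rather than hanging.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Replays steps 1-5 on a pipe. Pass 0 for level-triggered or EPOLLET for
   edge-triggered. Returns the number of ready descriptors reported by the
   second epoll_wait(), or -1 on error. */
int second_wait_result(uint32_t extra_flags)
{
    char buf[2048];
    struct epoll_event ev, out;
    int p[2], epfd, n = -1;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN | extra_flags;                 /* step 1: register rfd */
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        goto done;

    memset(buf, 'x', sizeof(buf));
    if (write(p[1], buf, sizeof(buf)) != (ssize_t) sizeof(buf))   /* step 2: 2 kB */
        goto done;
    if (epoll_wait(epfd, &out, 1, 0) != 1)             /* step 3: event consumed */
        goto done;
    if (read(p[0], buf, 1024) != 1024)                 /* step 4: read only 1 kB */
        goto done;
    n = epoll_wait(epfd, &out, 1, 0);                  /* step 5: poll, do not block */
done:
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return n;
}
```

With level-triggered registration the second wait is expected to report rfd again because 1 kB is still unread; with EPOLLET it is expected to report nothing until the writer writes again.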

An application that employs the EPOLLET flag should use nonblocking file descriptors to
avoid having a blocking read or write starve a task that is handling multiple file
descriptors. The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as
follows:

i  with nonblocking file descriptors; and

ii by waiting for an event only after read(2) or write(2) return EAGAIN (that is, once
   the reading or writing is complete).
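
Rules i and ii can be sketched as two small helpers. The names set_nonblocking() and drain_fd() are assumptions made for this illustration, and a real program would process each chunk instead of only counting bytes.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Rule i: put 'fd' into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Rule ii: read from nonblocking 'fd' until read(2) fails with EAGAIN.
   Returns the number of bytes consumed, or -1 on any other error. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t r = read(fd, buf, sizeof(buf));

        if (r > 0) {
            total += r;               /* a real program would process buf here */
        } else if (r == 0) {
            return total;             /* end of file: the peer closed */
        } else if (errno == EINTR) {
            continue;                 /* interrupted by a signal: retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;             /* drained: safe to call epoll_wait() again */
        } else {
            return -1;
        }
    }
}
```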

By contrast, when used as a level-triggered interface (the default, when EPOLLET is not
specified), epoll is simply a faster poll(2), and can be used wherever the latter is used
since it shares the same semantics.

Since even with edge-triggered epoll, multiple events can be generated upon receipt of
multiple chunks of data, the caller has the option to specify the EPOLLONESHOT flag, to
tell epoll to disable the associated file descriptor after the receipt of an event with
epoll_wait(2). When the EPOLLONESHOT flag is specified, it is the caller's responsibility
to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.
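
A minimal sketch of the one-shot cycle on a pipe follows. The function name oneshot_demo() and the results array are hypothetical, and a timeout of 0 ms is used so that no call blocks.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Stores the return values of three epoll_wait() calls in 'results':
   [0] after data arrives, [1] after the one-shot event was delivered,
   [2] after rearming with EPOLL_CTL_MOD. Returns 0 on success, -1 on error. */
int oneshot_demo(int results[3])
{
    struct epoll_event ev, out;
    int p[2], epfd, rc = -1;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        goto done;
    if (write(p[1], "x", 1) != 1)
        goto done;

    results[0] = epoll_wait(epfd, &out, 1, 0);   /* event delivered once */
    results[1] = epoll_wait(epfd, &out, 1, 0);   /* descriptor is now disabled */

    /* Rearm: the full event mask has to be supplied again. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev) == -1)
        goto done;
    results[2] = epoll_wait(epfd, &out, 1, 0);   /* reported again: data is unread */
    rc = 0;
done:
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

Note that the byte is never read, yet the second wait reports nothing: with EPOLLONESHOT the descriptor stays registered but silent until it is rearmed.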

/proc interfaces

The following interfaces can be used to limit the amount of kernel memory consumed by
epoll:

/proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)
       This specifies a limit on the total number of file descriptors that a user can
       register across all epoll instances on the system. The limit is per real user ID.
       Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, and
       roughly 160 bytes on a 64-bit kernel. Currently, the default value for
       max_user_watches is 1/25 (4%) of the available low memory, divided by the
       registration cost in bytes.
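
As a small illustration, the limit can be read from /proc, and the default described above can be estimated by hand. Both function names are hypothetical, and estimate_default_watches() only mirrors the arithmetic in the text (4% of low memory divided by the per-registration cost); it is not claimed to match the kernel's exact computation.

```c
#include <stdio.h>

/* Returns the current max_user_watches limit, or -1 if it cannot be read
   (for example, on a system without this /proc file). */
long read_max_user_watches(void)
{
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Arithmetic from the description: (low memory / 25) / cost per registration. */
long long estimate_default_watches(long long low_mem_bytes, int cost_bytes)
{
    return (low_mem_bytes / 25) / cost_bytes;
}
```

For example, with an assumed 1 GiB of low memory and a cost of 160 bytes, the estimate is 1073741824 / 25 / 160, which is about 268,000 registrations.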

Example for suggested usage

While the usage of epoll when employed as a level-triggered interface does have the same
semantics as poll(2), the edge-triggered usage requires more clarification to avoid stalls
in the application event loop. In this example, listener (listen_sock in the code) is a
nonblocking socket on which listen(2) has been called. The function do_use_fd() uses the
new ready file descriptor until EAGAIN is returned by either read(2) or write(2). An
event-driven state machine application should, after having received EAGAIN, record its
current state so that at the next call to do_use_fd() it will continue to read(2) or
write(2) from where it stopped before.

#define MAX_EVENTS 10
           /* Needs <sys/epoll.h>, <sys/socket.h>, <netinet/in.h>,
              <stdio.h> and <stdlib.h>. setnonblocking() and do_use_fd()
              are supplied by the application. */
           struct epoll_event ev, events[MAX_EVENTS];
           int listen_sock, conn_sock, nfds, epollfd, n;
           struct sockaddr_in local;
           socklen_t addrlen;

           /* Set up listening socket, 'listen_sock' (socket(),
              bind(), listen()) */

           epollfd = epoll_create(10);
           if (epollfd == -1) {
               perror("epoll_create");
               exit(EXIT_FAILURE);
           }

           /* The listening socket is watched level-triggered. */
           ev.events = EPOLLIN;
           ev.data.fd = listen_sock;
           if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
               perror("epoll_ctl: listen_sock");
               exit(EXIT_FAILURE);
           }

           for (;;) {
               nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
               if (nfds == -1) {
                   perror("epoll_wait");
                   exit(EXIT_FAILURE);
               }

               for (n = 0; n < nfds; ++n) {
                   if (events[n].data.fd == listen_sock) {
                       /* accept() expects the buffer size on input */
                       addrlen = sizeof(local);
                       conn_sock = accept(listen_sock,
                                       (struct sockaddr *) &local, &addrlen);
                       if (conn_sock == -1) {
                           perror("accept");
                           exit(EXIT_FAILURE);
                       }
                       /* New connections are watched edge-triggered. */
                       setnonblocking(conn_sock);
                       ev.events = EPOLLIN | EPOLLET;
                       ev.data.fd = conn_sock;
                       if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                                   &ev) == -1) {
                           perror("epoll_ctl: conn_sock");
                           exit(EXIT_FAILURE);
                       }
                   } else {
                       do_use_fd(events[n].data.fd);
                   }
               }
           }

When used as an edge-triggered interface, for performance reasons, it is possible to add
the file descriptor inside the epoll interface (EPOLL_CTL_ADD) once by specifying
(EPOLLIN|EPOLLOUT). This allows you to avoid continuously switching between EPOLLIN and
EPOLLOUT calling epoll_ctl(2) with EPOLL_CTL_MOD.
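
One way to combine this single registration with the "record its current state" advice is sketched below. The struct out_state and both function names are assumptions made for this illustration; the idea is that flush_pending() is called again from the event loop whenever EPOLLOUT is reported for the descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection output state. */
struct out_state {
    const char *data;   /* bytes queued for sending */
    size_t len;         /* total number of queued bytes */
    size_t off;         /* how many bytes have been written so far */
};

/* Make 'fd' nonblocking and register it once for both directions. */
int register_in_out(int epfd, int fd)
{
    struct epoll_event ev;
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Write as much queued data as possible. Returns 1 when everything has been
   sent, 0 when write(2) hit EAGAIN (resume on the next EPOLLOUT event),
   and -1 on any other error. The offset in 'st' records where to resume. */
int flush_pending(int fd, struct out_state *st)
{
    while (st->off < st->len) {
        ssize_t w = write(fd, st->data + st->off, st->len - st->off);

        if (w > 0)
            st->off += (size_t) w;
        else if (w == -1 && errno == EINTR)
            continue;
        else if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;
        else
            return -1;
    }
    return 1;
}
```

Because the descriptor is registered for EPOLLOUT from the start, no EPOLL_CTL_MOD call is needed when output backs up; the next edge-triggered EPOLLOUT event simply signals that flush_pending() can continue from st->off.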

Questions and answers

Q0 What is the key used to distinguish the file descriptors registered in an epoll set?

A0 The key is the combination of the file descriptor number and the open file description
   (also known as an "open file handle", the kernel's internal representation of an open
   file).

Q1 What happens if you register the same file descriptor on an epoll instance twice?

A1 You will probably get EEXIST. However, it is possible to add a duplicate (dup(2),
   dup2(2), fcntl(2) F_DUPFD) descriptor to the same epoll instance. This can be a useful
   technique for filtering events, if the duplicate file descriptors are registered with
   different events masks.
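
The behavior described in A1 can be observed with a short sketch. The function name and its out-parameter are hypothetical, and the event masks (EPOLLIN for the original, EPOLLOUT for the duplicate) are arbitrary choices for illustration.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers 'fd', tries to register it a second time, then registers a
   dup(2) of it with a different event mask. Stores the errno of the second
   attempt in *second_add_errno (0 if it unexpectedly succeeded). Returns the
   duplicate descriptor on success, or -1 on error. */
int duplicate_registration_demo(int epfd, int fd, int *second_add_errno)
{
    struct epoll_event ev;
    int dupfd;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        return -1;

    /* Same descriptor number and same open file description: same key. */
    *second_add_errno = 0;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        *second_add_errno = errno;

    /* The duplicate has a different descriptor number, so it is a new key. */
    dupfd = dup(fd);
    if (dupfd == -1)
        return -1;
    ev.events = EPOLLOUT;
    ev.data.fd = dupfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev) == -1) {
        close(dupfd);
        return -1;
    }
    return dupfd;
}
```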

Q2 Can two epoll instances wait for the same file descriptor? If so, are events reported
   to both epoll file descriptors?

A2 Yes, and events would be reported to both. However, careful programming may be needed
   to do this correctly.

Q3 Is the epoll file descriptor itself poll/epoll/selectable?

A3 Yes. If an epoll file descriptor has events waiting then it will indicate as being
   readable.

Q4 What happens if one attempts to put an epoll file descriptor into its own file
   descriptor set?

A4 The epoll_ctl(2) call will fail (EINVAL). However, you can add an epoll file descriptor
   inside another epoll file descriptor set.
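
A3 and A4 can be checked together with the following sketch. Both helper names are assumptions made for illustration; nested_ready() uses a timeout of 0 ms so that it never blocks.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Tries to add 'epfd' to its own set. Returns the resulting errno
   (EINVAL is expected), or 0 if the call unexpectedly succeeded. */
int add_to_self_errno(int epfd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = epfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, epfd, &ev) == -1)
        return errno;
    return 0;
}

/* Registers the epoll descriptor 'inner' in the epoll instance 'outer' and
   returns how many ready descriptors 'outer' reports right away. */
int nested_ready(int outer, int inner)
{
    struct epoll_event ev, out;

    ev.events = EPOLLIN;
    ev.data.fd = inner;
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) == -1)
        return -1;
    return epoll_wait(outer, &out, 1, 0);
}
```

If 'inner' already has a pending event, for instance a readable pipe in its set, 'outer' is expected to report 'inner' as readable.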

Q5 Can I send an epoll file descriptor over a UNIX domain socket to another process?

A5 Yes, but it does not make sense to do this, since the receiving process would not have
   copies of the file descriptors in the epoll set.

Q6 Will closing a file descriptor cause it to be removed from all epoll sets
   automatically?

A6 Yes, but be aware of the following point. A file descriptor is a reference to an open
   file description (see open(2)). Whenever a descriptor is duplicated via dup(2), dup2(2),
   fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file
   description is created. An open file description continues to exist until all file
   descriptors referring to it have been closed. A file descriptor is removed from an
   epoll set only after all the file descriptors referring to the underlying open file
   description have been closed (or before if the descriptor is explicitly removed using
   epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part
   of an epoll set has been closed, events may be reported for that file descriptor if
   other file descriptors referring to the same underlying file description remain open.
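
The pitfall in A6 can be reproduced on a pipe. This is a sketch under stated assumptions: the function name close_demo() and the results array are hypothetical, and both waits use a timeout of 0 ms.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe, duplicates it, and closes the original.
   results[0] is what epoll_wait() reports while the duplicate is still open;
   results[1] is what it reports after the duplicate is closed as well.
   Returns 0 on success, -1 on error. */
int close_demo(int results[2])
{
    struct epoll_event ev, out;
    int p[2], epfd, dupfd;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    dupfd = dup(p[0]);
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epfd == -1 || dupfd == -1 ||
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1 ||
        write(p[1], "x", 1) != 1) {
        if (epfd != -1)
            close(epfd);
        if (dupfd != -1)
            close(dupfd);
        close(p[0]);
        close(p[1]);
        return -1;
    }

    close(p[0]);    /* the open file description survives through dupfd */
    results[0] = epoll_wait(epfd, &out, 1, 0);   /* still reports the closed fd */

    close(dupfd);   /* last reference gone: the registration is removed */
    results[1] = epoll_wait(epfd, &out, 1, 0);

    close(p[1]);
    close(epfd);
    return 0;
}
```

To avoid events for a descriptor number that is no longer valid, a program can call epoll_ctl(2) with EPOLL_CTL_DEL before close(2) whenever duplicates may exist.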

Q7 If more than one event occurs between epoll_wait(2) calls, are they combined or
   reported separately?

A7 They will be combined.

Q8 Does an operation on a file descriptor affect the already collected but not yet
   reported events?

A8 You can do two operations on an existing file descriptor. Remove would be meaningless
   for this case. Modify will reread available I/O.

Q9 Do I need to continuously read/write a file descriptor until EAGAIN when using the
   EPOLLET flag (edge-triggered behavior)?

A9 Receiving an event from epoll_wait(2) should suggest to you that such file descriptor
   is ready for the requested I/O operation. You must consider it ready until the next
   (nonblocking) read/write yields EAGAIN. When and how you will use the file descriptor
   is entirely up to you.

   For packet/token-oriented files (e.g., datagram socket, terminal in canonical mode),
   the only way to detect the end of the read/write I/O space is to continue to
   read/write until EAGAIN.

   For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the
   read/write I/O space is exhausted can also be detected by checking the amount of data
   read from / written to the target file descriptor. For example, if you call read(2)
   by asking to read a certain amount of data and read(2) returns a lower number of
   bytes, you can be sure of having exhausted the read I/O space for the file descriptor.
   The same is true when writing using write(2). (Avoid this latter technique if you
   cannot guarantee that the monitored file descriptor always refers to a stream-oriented
   file.)

Possible pitfalls and ways to avoid them

o Starvation (edge-triggered)

If there is a large amount of I/O space, it is possible that by trying to drain it the
other files will not get processed causing starvation. (This problem is not specific to
epoll.)

The solution is to maintain a ready list and mark the file descriptor as ready in its
associated data structure, thereby allowing the application to remember which files need
to be processed but still round robin amongst all the ready files. This also supports
ignoring subsequent events you receive for file descriptors that are already ready.
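
One possible shape for such a ready list is a fixed-size ring of pointers plus a flag in each connection. Everything in this sketch (struct conn, struct ready_list, MAX_CONNS, and both function names) is hypothetical and only illustrates the idea.

```c
#include <stddef.h>

#define MAX_CONNS 64   /* hypothetical capacity of the ready list */

struct conn {
    int fd;
    int ready;         /* nonzero while the connection sits in the ready list */
};

struct ready_list {
    struct conn *items[MAX_CONNS];
    size_t head;       /* index of the next connection to serve */
    size_t count;      /* number of queued connections */
};

/* Called for each event from epoll_wait(). Returns 1 if the connection was
   queued, 0 if it was already ready (the event is ignored), -1 if full. */
int mark_ready(struct ready_list *rl, struct conn *c)
{
    if (c->ready)
        return 0;
    if (rl->count == MAX_CONNS)
        return -1;
    c->ready = 1;
    rl->items[(rl->head + rl->count) % MAX_CONNS] = c;
    rl->count++;
    return 1;
}

/* Takes the next connection in round-robin order, or NULL if none is ready.
   The caller does a bounded amount of I/O on it and calls mark_ready()
   again if EAGAIN has not been reached yet. */
struct conn *next_ready(struct ready_list *rl)
{
    struct conn *c;

    if (rl->count == 0)
        return NULL;
    c = rl->items[rl->head];
    rl->head = (rl->head + 1) % MAX_CONNS;
    rl->count--;
    c->ready = 0;
    return c;
}
```

Bounding the work done per turn and re-queuing unfinished connections at the tail is what keeps one busy descriptor from starving the others.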

o If using an event cache...

If you use an event cache or store all the file descriptors returned from epoll_wait(2),
then make sure to provide a way to mark its closure dynamically (i.e., caused by a
previous event's processing). Suppose you receive 100 events from epoll_wait(2), and in
event #47 a condition causes event #13 to be closed. If you remove the structure and
close(2) the file descriptor for event #13, then your event cache might still say there
are events waiting for that file descriptor causing confusion.

One solution for this is to call, during the processing of event 47,
epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2), then mark its
associated data structure as removed and link it to a cleanup list. If you find another
event for file descriptor 13 in your batch processing, you will discover the file
descriptor had been previously removed and there will be no confusion.
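
The cleanup-list approach can be sketched as follows. The struct conn layout and the three function names are assumptions made for illustration; the sketch also assumes each registration stores a pointer to its heap-allocated struct conn in ev.data.ptr.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record, stored in epoll_event.data.ptr. */
struct conn {
    int fd;
    int removed;                 /* set once the fd was deleted and closed */
    struct conn *next_cleanup;   /* link in the cleanup list */
};

static struct conn *cleanup_list;   /* records to free after the batch */

/* Called while processing one event when another connection must go away. */
void retire_conn(int epfd, struct conn *c)
{
    if (c->removed)
        return;
    /* Kernels before 2.6.9 required a non-NULL event pointer here. */
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = 1;
    c->next_cleanup = cleanup_list;   /* keep the record alive for now */
    cleanup_list = c;
}

/* Checked before handling each event of the current batch. */
int conn_is_live(const struct epoll_event *ev)
{
    const struct conn *c = ev->data.ptr;

    return !c->removed;
}

/* Called once the whole batch is processed. Returns the number freed. */
int free_retired(void)
{
    int n = 0;

    while (cleanup_list != NULL) {
        struct conn *c = cleanup_list;

        cleanup_list = c->next_cleanup;
        free(c);
        n++;
    }
    return n;
}
```

Deferring free() until the batch is finished matters because later events in the same batch may still carry a pointer to the retired record.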

VERSIONS

The epoll API was introduced in Linux kernel 2.5.44. Support was added to glibc in
version 2.3.2.

CONFORMING TO
The epoll API is Linux-specific. Some other systems provide similar mechanisms, for
example, FreeBSD has kqueue, and Solaris has /dev/poll.

COLOPHON

This page is part of release 3.53 of the Linux man-pages project. A description of the
project, and information about reporting bugs, can be found at
http://www.kernel.org/doc/man-pages/.

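What is the difference between epoll's edge-triggered and level-triggered modes?

epoll is an efficient event-driven I/O notification mechanism that can be used to implement highly concurrent network communication. It supports two ways of triggering events, edge-triggered and level-triggered, which differ as follows:

1. Edge-triggered: an event is delivered only when the state of the descriptor changes, for example when it goes from not readable to readable or from not writable to writable. This avoids repeated notifications for the same condition, but it demands more careful event handling: the application should use nonblocking descriptors and keep reading or writing until EAGAIN.

2. Level-triggered: an event is reported for as long as the condition holds. While there is data to read or room to write, every epoll_wait() call reports the descriptor again, until the data has been read or the write has completed. This mode is simpler to program, at the cost of handling more repeated notifications. (Level-triggered is the default. The mode is selected per descriptor through the events field of the struct epoll_event passed to epoll_ctl(): including the EPOLLET flag selects edge-triggered mode, and omitting it selects level-triggered mode. EPOLLET is a bit flag, not an option that is set to 1 or 0.)

In summary, edge-triggered mode suits highly concurrent workloads, where it can reduce repeated notifications and the resulting wakeups. Level-triggered mode suits ordinary network communication, where its simplicity matters more than the extra notifications. The choice should be made by weighing the concrete requirements and characteristics of the system.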