select, poll, epoll

metabit

Last modified: 2023-12-23 19:09:33

First published: 2023-02-17 20:37:56

Article link: https://blog.csdn.net/dawnto/article/details/129090172

select

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

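int nfds: the highest-numbered file descriptor in any of the three sets, plus 1. Despite the name, it is an upper bound on the descriptor numbers to check, not a count of descriptors.
fd_set *readfds: the set of file descriptors watched for readability
fd_set *writefds: the set of file descriptors watched for writability
fd_set *exceptfds: the set of file descriptors watched for exceptional conditions
struct timeval *timeout: the timeout. NULL: wait indefinitely; a positive value: wait at most that long; zero (both fields 0): return immediately.
The timeout is passed as a struct, which has no natural negative value, so NULL, a positive value, and zero express the three timeout states.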
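The three timeout states can be sketched as follows. This is only an illustration: `wait_readable` and its `timeout_ms` convention (a negative value meaning "no timeout") are hypothetical names introduced here, not part of the select API, and the descriptor is assumed to be smaller than FD_SETSIZE.

```c
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: wait until fd is readable.
 *   timeout_ms <  0 : pass NULL, block indefinitely
 *   timeout_ms == 0 : zeroed timeval, return immediately
 *   timeout_ms >  0 : wait at most that many milliseconds
 * Returns select()'s result: 1 if readable, 0 on timeout, -1 on error.
 * Assumes 0 <= fd < FD_SETSIZE. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    struct timeval *tvp = NULL;   /* NULL means "no timeout" */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    if (timeout_ms >= 0) {
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }

    /* nfds is the highest descriptor number plus 1, not a count. */
    return select(fd + 1, &rfds, NULL, NULL, tvp);
}
```

Calling `wait_readable(fd, 0)` acts as a non-blocking check, while `wait_readable(fd, -1)` blocks until the descriptor becomes readable or a signal interrupts the call.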

Summary: with glibc, fd_set is a 1024-bit bitmap (FD_SETSIZE is 1024) in which each bit represents one file descriptor.

void FD_CLR(int fd, fd_set *set);

Removes fd from set.

int FD_ISSET(int fd, fd_set *set);

Tests whether fd is present in set.

void FD_SET(int fd, fd_set *set);

Adds fd to set.

void FD_ZERO(fd_set *set);

Clears set.
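A small sketch of how the four macros behave on the bitmap. The function names `count_members` and `macro_walkthrough` are made up for this example; only the FD_* macros come from `<sys/select.h>`.

```c
#include <sys/select.h>

/* Count how many descriptors in [0, nfds) are members of set. */
int count_members(fd_set *set, int nfds)
{
    int n = 0;

    for (int fd = 0; fd < nfds && fd < FD_SETSIZE; fd++)
        if (FD_ISSET(fd, set))
            n++;
    return n;
}

/* Walk through the four macros and return the final member count. */
int macro_walkthrough(void)
{
    fd_set set;

    FD_ZERO(&set);      /* set is now empty                  */
    FD_SET(3, &set);    /* {3}                               */
    FD_SET(7, &set);    /* {3, 7}                            */
    FD_SET(7, &set);    /* adding an existing fd is a no-op  */
    FD_CLR(3, &set);    /* {7}                               */
    FD_CLR(5, &set);    /* removing an absent fd is a no-op  */

    return count_members(&set, 8);   /* only fd 7 remains */
}
```

The descriptor numbers 3 and 7 are arbitrary: the macros only flip bits, so the descriptors do not have to be open for this walkthrough.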

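On each select call, the descriptor bitmaps are copied from user space into the kernel. When the kernel has checked the descriptors, it writes the result back into the same bitmaps, and user-space code then tests which descriptors are ready and handles them. Because the result is only a bitmap, user code has to scan all the descriptors it is watching. With glibc, select can only handle descriptors numbered below FD_SETSIZE, which is 1024. As the BUGS section of the manual page below notes, the Linux kernel itself imposes no fixed limit; the cap comes from glibc's fixed-size fd_set, so rebuilding the kernel does not raise it. To watch descriptors above 1023, use poll or epoll instead.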
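The copy-in, copy-out, then scan pattern described above might look roughly like this. `collect_readable` is a hypothetical helper written for this article, not a library function; it uses a zero timeout so it never blocks.

```c
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: report which of the n descriptors in fds[] are
 * readable right now. Ready descriptors are stored in ready[] (the caller
 * provides room for n entries). Returns how many were stored, or -1 on
 * error or if a descriptor is outside 0..FD_SETSIZE-1. */
int collect_readable(const int *fds, int n, int *ready)
{
    fd_set rfds;
    struct timeval tv = {0, 0};   /* zero timeout: return immediately */
    int maxfd = -1;
    int count = 0;

    /* The set has to be rebuilt before every call,
     * because select() overwrites it with the result. */
    FD_ZERO(&rfds);
    for (int i = 0; i < n; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1;
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* The bitmap goes into the kernel and the result comes back in place. */
    if (select(maxfd + 1, &rfds, NULL, NULL, &tv) == -1)
        return -1;

    /* User space scans every watched descriptor to find the ready ones. */
    for (int i = 0; i < n; i++)
        if (FD_ISSET(fds[i], &rfds))
            ready[count++] = fds[i];

    return count;
}
```

In a real event loop the same rebuild and scan would run on every iteration, which is the per-call cost that grows with the number of watched descriptors.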

man select

SELECT(2)                                                          Linux Programmer's Manual                                                         SELECT(2)

NAME
       select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O multiplexing

SYNOPSIS
       #include <sys/select.h>

       int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

       void FD_CLR(int fd, fd_set *set);
       int  FD_ISSET(int fd, fd_set *set);
       void FD_SET(int fd, fd_set *set);
       void FD_ZERO(fd_set *set);

       int pselect(int nfds, fd_set *readfds, fd_set *writefds,
                   fd_set *exceptfds, const struct timespec *timeout,
                   const sigset_t *sigmask);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       pselect(): _POSIX_C_SOURCE >= 200112L

DESCRIPTION
       select()  allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O
       operation (e.g., input possible).  A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2),  or
       a sufficiently small write(2)) without blocking.

       select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) and epoll(7) do not have this limitation.  See BUGS.

   File descriptor sets
       The  principal  arguments  of  select()  are three "sets" of file descriptors (declared with the type fd_set), which allow the caller to wait for three
       classes of events on the specified set of file descriptors.  Each of the fd_set arguments may be specified as NULL if no file  descriptors  are  to  be
       watched for the corresponding class of events.

       Note well: Upon return, each of the file descriptor sets is modified in place to indicate which file descriptors are currently "ready".  Thus, if using
       select() within a loop, the sets must be reinitialized before each call.  The implementation of the fd_set arguments as value-result arguments is a de‐
       sign error that is avoided in poll(2) and epoll(7).

       The contents of a file descriptor set can be manipulated using the following macros:

       FD_ZERO()
              This macro clears (removes all file descriptors from) set.  It should be employed as the first step in initializing a file descriptor set.

       FD_SET()
              This  macro adds the file descriptor fd to set.  Adding a file descriptor that is already present in the set is a no-op, and does not produce an
              error.

       FD_CLR()
              This macro removes the file descriptor fd from set.  Removing a file descriptor that is not present in the set is a no-op, and does not  produce
              an error.

       FD_ISSET()
              select()  modifies the contents of the sets according to the rules described below.  After calling select(), the FD_ISSET() macro can be used to
              test if a file descriptor is still present in a set.  FD_ISSET() returns nonzero if the file descriptor fd is present in set, and zero if it  is
              not.

   Arguments
       The arguments of select() are as follows:

       readfds
              The  file  descriptors in this set are watched to see if they are ready for reading.  A file descriptor is ready for reading if a read operation
              will not block; in particular, a file descriptor is also ready on end-of-file.

              After select() has returned, readfds will be cleared of all file descriptors except for those that are ready for reading.

       writefds
              The file descriptors in this set are watched to see if they are ready for writing.  A file descriptor is ready for writing if a write  operation
              will not block.  However, even if a file descriptor indicates as writable, a large write may still block.

              After select() has returned, writefds will be cleared of all file descriptors except for those that are ready for writing.

       exceptfds
              The  file  descriptors in this set are watched for "exceptional conditions".  For examples of some exceptional conditions, see the discussion of
              POLLPRI in poll(2).

              After select() has returned, exceptfds will be cleared of all file descriptors except for those for which an exceptional condition has occurred.

       nfds   This argument should be set to the highest-numbered file descriptor in any of the three sets, plus 1.  The indicated file  descriptors  in  each
              set are checked, up to this limit (but see BUGS).

       timeout
              The  timeout  argument is a timeval structure (shown below) that specifies the interval that select() should block waiting for a file descriptor
              to become ready.  The call will block until either:

              • a file descriptor becomes ready;

              • the call is interrupted by a signal handler; or

              • the timeout expires.

              Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking  interval
              may overrun by a small amount.

              If both fields of the timeval structure are zero, then select() returns immediately.  (This is useful for polling.)

              If timeout is specified as NULL, select() blocks indefinitely waiting for a file descriptor to become ready.

   pselect()
       The pselect() system call allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.

       The operation of select() and pselect() is identical, other than these three differences:

       • select()  uses  a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanosec‐
         onds).

       • select() may update the timeout argument to indicate how much time was left.  pselect() does not change this argument.

       • select() has no sigmask argument, and behaves as pselect() called with NULL sigmask.

       sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed
       to  by sigmask, then does the "select" function, and then restores the original signal mask.  (If sigmask is NULL, the signal mask is not modified dur‐
       ing the pselect() call.)

       Other than the difference in the precision of the timeout argument, the following pselect() call:

           ready = pselect(nfds, &readfds, &writefds, &exceptfds,
                           timeout, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;

           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic  test  is
       needed  to prevent race conditions.  (Suppose the signal handler sets a global flag and returns.  Then a test of this global flag followed by a call of
       select() could hang indefinitely if the signal arrived just after the test but just before the call.  By contrast, pselect() allows one to first  block
       signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)

   The timeout
       The timeout argument for select() is a structure of the following type:

           struct timeval {
               time_t      tv_sec;         /* seconds */
               suseconds_t tv_usec;        /* microseconds */
           };

       The corresponding argument for pselect() has the following type:

           struct timespec {
               time_t      tv_sec;         /* seconds */
               long        tv_nsec;        /* nanoseconds */
           };

       On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this.  (POSIX.1 permits either behav‐
       ior.)  This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses
       a struct timeval for multiple select()s in a loop without reinitializing it.  Consider timeout to be undefined after select() returns.

RETURN VALUE
       On  success, select() and pselect() return the number of file descriptors contained in the three returned descriptor sets (that is, the total number of
       bits that are set in readfds, writefds, exceptfds).  The return value may be zero if the timeout expired before any file descriptors became ready.

       On error, -1 is returned, and errno is set to indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined.

ERRORS
       EBADF  An invalid file descriptor was given in one of the sets.  (Perhaps a file descriptor that was already closed, or one on which an error  has  oc‐
              curred.)  However, see BUGS.

       EINTR  A signal was caught; see signal(7).

       EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).

       EINVAL The value contained within timeout is invalid.

       ENOMEM Unable to allocate memory for internal tables.

VERSIONS
       pselect() was added to Linux in kernel 2.6.16.  Prior to this, pselect() was emulated in glibc (but see BUGS).

CONFORMING TO
       select() conforms to POSIX.1-2001, POSIX.1-2008, and 4.4BSD (select() first appeared in 4.2BSD).  Generally portable to/from non-BSD systems supporting
       clones of the BSD socket layer (including System V variants).  However, note that the System V variant typically sets the timeout variable  before  re‐
       turning, but the BSD variant does not.

       pselect() is defined in POSIX.1g, and in POSIX.1-2001 and POSIX.1-2008.

NOTES
       An fd_set is a fixed size buffer.  Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will re‐
       sult in undefined behavior.  Moreover, POSIX requires fd to be a valid file descriptor.

       The operation of select() and pselect() is not affected by the O_NONBLOCK flag.

       On some other UNIX systems, select() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather  than  ENOMEM  as
       Linux  does.   POSIX  specifies  this  error  for poll(2), but not for select().  Portable programs may wish to check for EAGAIN and loop, just as with
       EINTR.

   The self-pipe trick
       On systems that lack pselect(), reliable (and more portable) signal trapping can be achieved using the self-pipe trick.  In this  technique,  a  signal
       handler  writes a byte to a pipe whose other end is monitored by select() in the main program.  (To avoid possibly blocking when writing to a pipe that
       may be full or reading from a pipe that may be empty, nonblocking I/O is used when reading from and writing to the pipe.)

   Emulating usleep(3)
       Before the advent of usleep(3), some code employed a call to select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable
       way to sleep with subsecond precision.

   Correspondence between select() and poll() notifications
       Within  the Linux kernel source, we find the following definitions which show the correspondence between the readable, writable, and exceptional condi‐
       tion notifications of select() and the event notifications provided by poll(2) and epoll(7):

           #define POLLIN_SET  (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
                                EPOLLHUP | EPOLLERR)
                              /* Ready for reading */
           #define POLLOUT_SET (EPOLLWRBAND | EPOLLWRNORM | EPOLLOUT |
                                EPOLLERR)
                              /* Ready for writing */
           #define POLLEX_SET  (EPOLLPRI)
                              /* Exceptional condition */

   Multithreaded applications
       If a file descriptor being monitored by select() is closed in another thread, the result is unspecified.  On some UNIX systems, select()  unblocks  and
       returns,  with  an  indication that the file descriptor is ready (a subsequent I/O operation will likely fail with an error, unless another process re‐
       opens file descriptor between the time select() returned and the I/O operation is performed).  On Linux (and some other systems), closing the file  de‐
       scriptor  in  another thread has no effect on select().  In summary, any application that relies on a particular behavior in this scenario must be con‐
       sidered buggy.

   C library/kernel differences
       The Linux kernel allows file descriptor sets of arbitrary size, determining the length of the sets to be checked from the value of nfds.   However,  in
       the glibc implementation, the fd_set type is fixed in size.  See also BUGS.

       The  pselect()  interface  described in this page is implemented by glibc.  The underlying Linux system call is named pselect6().  This system call has
       somewhat different behavior from the glibc wrapper function.

       The Linux pselect6() system call modifies its timeout argument.  However, the glibc wrapper function hides this behavior by using a local variable  for
       the  timeout argument that is passed to the system call.  Thus, the glibc pselect() function does not modify its timeout argument; this is the behavior
       required by POSIX.1-2001.

       The final argument of the pselect6() system call is not a sigset_t * pointer, but is instead a structure of the form:

           struct {
               const kernel_sigset_t *ss;   /* Pointer to signal set */
               size_t ss_len;               /* Size (in bytes) of object
                                               pointed to by 'ss' */
           };

       This allows the system call to obtain both a pointer to the signal set and its size, while allowing for the fact that most architectures support a max‐
       imum of 6 arguments to a system call.  See sigprocmask(2) for a discussion of the difference between the kernel and libc notion of the signal set.

   Historical glibc details
       Glibc 2.0 provided an incorrect version of pselect() that did not take a sigmask argument.

       In glibc versions 2.1 to 2.2.1, one must define _GNU_SOURCE in order to obtain the declaration of pselect() from <sys/select.h>.

BUGS
       POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified
       in a file descriptor set.  The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type,  with  FD_SETSIZE  de‐
       fined  as  1024,  and  the FD_*() macros operating according to that limit.  To monitor file descriptors greater than 1023, use poll(2) or epoll(7) in‐
       stead.

       According to POSIX, select() should check all specified file descriptors in the three file descriptor sets, up to the limit nfds-1.  However, the  cur‐
       rent  implementation  ignores  any file descriptor in these sets that is greater than the maximum file descriptor number that the process currently has
       open.  According to POSIX, any such file descriptor that is specified in one of the sets should result in the error EBADF.

       Starting with version 2.1, glibc provided an emulation of pselect() that was implemented using sigprocmask(2) and select().   This  implementation  re‐
       mained  vulnerable  to  the  very race condition that pselect() was designed to prevent.  Modern versions of glibc use the (race-free) pselect() system
       call on kernels where it is provided.

       On Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks.   This  could  for  example
       happen when data has arrived but upon examination has the wrong checksum and is discarded.  There may be other circumstances in which a file descriptor
       is spuriously reported as ready.  Thus it may be safer to use O_NONBLOCK on sockets that should not block.

       On Linux, select() also modifies timeout if the call is interrupted by a signal handler (i.e., the EINTR error  return).   This  is  not  permitted  by
       POSIX.1.  The Linux pselect() system call has the same behavior, but the glibc wrapper hides this behavior by internally copying the timeout to a local
       variable and passing that variable to the system call.

EXAMPLES
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/select.h>

       int
       main(void)
       {
           fd_set rfds;
           struct timeval tv;
           int retval;

           /* Watch stdin (fd 0) to see when it has input. */

           FD_ZERO(&rfds);
           FD_SET(0, &rfds);

           /* Wait up to five seconds. */

           tv.tv_sec = 5;
           tv.tv_usec = 0;

           retval = select(1, &rfds, NULL, NULL, &tv);
           /* Don't rely on the value of tv now! */

           if (retval == -1)
               perror("select()");
           else if (retval)
               printf("Data is available now.\n");
               /* FD_ISSET(0, &rfds) will be true. */
           else
               printf("No data within five seconds.\n");

           exit(EXIT_SUCCESS);
       }

SEE ALSO
       accept(2), connect(2), poll(2), read(2), recv(2), restart_syscall(2), send(2), sigprocmask(2), write(2), epoll(7), time(7)

       For a tutorial with discussion and examples, see select_tut(2).

COLOPHON
       This page is part of release 5.10 of the Linux man-pages project.  A description of the project, information about reporting bugs, and the latest  ver‐
       sion of this page, can be found at https://www.kernel.org/doc/man-pages/.

Linux                                                                     2020-11-01                                                                 SELECT(2)
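The self-pipe trick mentioned in the NOTES section of the manual page above can be sketched as follows. This is a minimal illustration under several assumptions: the names `self_pipe_init`, `self_pipe_wait`, and `on_signal` are invented for this example, only one signal is handled, and the signal number is sent through the pipe as a single byte.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static int self_pipe[2] = {-1, -1};   /* [0] = read end, [1] = write end */

/* Handler: uses only async-signal-safe calls and preserves errno. */
static void on_signal(int sig)
{
    int saved = errno;
    unsigned char b = (unsigned char)sig;

    if (write(self_pipe[1], &b, 1) == -1) {
        /* Pipe full: a wakeup is already pending, nothing more to do. */
    }
    errno = saved;
}

/* Create the pipe, make both ends nonblocking, install the handler. */
int self_pipe_init(int sig)
{
    struct sigaction sa;

    if (pipe(self_pipe) == -1)
        return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(self_pipe[i], F_GETFL);
        if (flags == -1 ||
            fcntl(self_pipe[i], F_SETFL, flags | O_NONBLOCK) == -1)
            return -1;
    }
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Wait up to timeout_ms (>= 0) for a signal. Returns the signal number,
 * 0 on timeout, or -1 on error. The timeout restarts after EINTR, so the
 * total wait can be longer than requested. */
int self_pipe_wait(int timeout_ms)
{
    for (;;) {
        fd_set rfds;
        struct timeval tv;
        unsigned char b;
        int r;

        FD_ZERO(&rfds);
        FD_SET(self_pipe[0], &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        r = select(self_pipe[0] + 1, &rfds, NULL, NULL, &tv);
        if (r == -1 && errno == EINTR)
            continue;            /* handler ran; retry and find the byte */
        if (r <= 0)
            return r;
        return read(self_pipe[0], &b, 1) == 1 ? b : -1;
    }
}
```

A real program would add the other descriptors it cares about to the same `select()` call, so that signals and I/O readiness are handled in one place.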

poll

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

struct pollfd *fds: each pollfd wraps one fd; fds is the address of the first element of an array of pollfd structures.
The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:

struct pollfd {
   int   fd;         /* file descriptor */
   short events;     /* requested events: the events to watch for */
   short revents;    /* returned events: the events that actually occurred */
};

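There are four input events (POLLIN, POLLRDNORM, POLLRDBAND, POLLPRI), three output events (POLLOUT, POLLWRNORM, POLLWRBAND), and three error events (POLLERR, POLLHUP, POLLNVAL).

nfds_t nfds: the number of entries in the fds array
int timeout: the timeout in milliseconds. Negative: wait indefinitely; positive: wait at most that long; 0: return immediately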
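A minimal sketch of a single poll call on one descriptor. `poll_readable` is a hypothetical wrapper introduced for illustration; its `timeout_ms` argument is handed to poll unchanged, so it follows the three cases listed above.

```c
#include <poll.h>

/* Hypothetical helper: wait for fd to become readable.
 * timeout_ms is passed straight to poll(): negative blocks indefinitely,
 * 0 returns immediately, positive waits that many milliseconds.
 * Returns the revents bits (nonzero) if something happened,
 * 0 on timeout, or -1 on error. */
int poll_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int r;

    pfd.fd = fd;
    pfd.events = POLLIN;   /* what we ask for            */
    pfd.revents = 0;       /* what the kernel reports    */

    r = poll(&pfd, 1, timeout_ms);
    if (r <= 0)
        return r;

    /* May contain POLLIN, but also POLLHUP, POLLERR or POLLNVAL,
     * which are reported even though they were not requested. */
    return pfd.revents;
}
```

Unlike select, the request (`events`) and the result (`revents`) live in separate fields, so the same struct can be reused for the next call without being rebuilt.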

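On each poll call, the array of pollfd structures is copied from user space into the kernel. When the kernel has checked the descriptors, it copies the results back to user space, and user-space code then tests which descriptors are ready and handles them. As with select, user code has to scan all the descriptors. There is no fixed FD_SETSIZE-style cap on the number of descriptors: the array can be grown as needed, although nfds may not exceed the RLIMIT_NOFILE resource limit (see EINVAL in the manual page below). Note: in user space poll keeps the descriptors in an array; inside the kernel they are converted into a linked list, and they are converted back into an array when copied to user space.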
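The growable array and the full scan after each call could be sketched like this. The `poll_set` type and its two functions are hypothetical helpers invented for this example; the doubling growth policy is just one reasonable choice.

```c
#include <poll.h>
#include <stdlib.h>

/* Hypothetical growable container for pollfd entries. */
struct poll_set {
    struct pollfd *items;
    nfds_t len;   /* entries in use     */
    nfds_t cap;   /* entries allocated  */
};

/* Append fd with the given event mask, doubling the capacity when full.
 * Returns 0 on success, -1 if memory allocation fails. */
int poll_set_add(struct poll_set *ps, int fd, short events)
{
    if (ps->len == ps->cap) {
        nfds_t cap = ps->cap ? ps->cap * 2 : 4;
        struct pollfd *p = realloc(ps->items, cap * sizeof *p);

        if (p == NULL)
            return -1;
        ps->items = p;
        ps->cap = cap;
    }
    ps->items[ps->len].fd = fd;
    ps->items[ps->len].events = events;
    ps->items[ps->len].revents = 0;
    ps->len++;
    return 0;
}

/* Poll every entry once, then scan the array for nonzero revents.
 * Ready descriptors are stored in ready[] (room for ps->len entries).
 * Returns how many were stored, 0 on timeout, or -1 on error. */
int poll_set_ready(struct poll_set *ps, int timeout_ms, int *ready)
{
    int n = poll(ps->items, ps->len, timeout_ms);
    int count = 0;

    if (n <= 0)
        return n;

    /* poll() says how many entries are ready, not which ones. */
    for (nfds_t i = 0; i < ps->len && count < n; i++)
        if (ps->items[i].revents != 0)
            ready[count++] = ps->items[i].fd;

    return count;
}
```

Start from a zero-initialized `struct poll_set`, add descriptors with `poll_set_add`, and release `items` with `free` when finished.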

man poll

POLL(2)                                                            Linux Programmer's Manual                                                           POLL(2)

NAME
       poll, ppoll - wait for some event on a file descriptor

SYNOPSIS
       #include <poll.h>

       int poll(struct pollfd *fds, nfds_t nfds, int timeout);

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <signal.h>
       #include <poll.h>

       int ppoll(struct pollfd *fds, nfds_t nfds,
               const struct timespec *tmo_p, const sigset_t *sigmask);

DESCRIPTION
       poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O.  The Linux-specific epoll(7)
       API performs a similar task, but offers features beyond those found in poll().

       The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:

           struct pollfd {
               int   fd;         /* file descriptor */
               short events;     /* requested events */
               short revents;    /* returned events */
           };

       The caller should specify the number of items in the fds array in nfds.

       The field fd contains a file descriptor for an open file.  If this field is negative, then the corresponding events field is ignored  and  the  revents
       field  returns  zero.   (This  provides an easy way of ignoring a file descriptor for a single poll() call: simply negate the fd field.  Note, however,
       that this technique can't be used to ignore file descriptor 0.)

       The field events is an input parameter, a bit mask specifying the events the application is interested in for the file descriptor fd.  This  field  may
       be specified as zero, in which case the only events that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL (see below).

       The field revents is an output parameter, filled by the kernel with the events that actually occurred.  The bits returned in revents can include any of
       those specified in events, or one of the values POLLERR, POLLHUP, or POLLNVAL.  (These three bits are meaningless in the events field, and will be  set
       in the revents field whenever the corresponding condition is true.)

       If none of the events requested (and no error) has occurred for any of the file descriptors, then poll() blocks until one of the events occurs.

       The  timeout argument specifies the number of milliseconds that poll() should block waiting for a file descriptor to become ready.  The call will block
       until either:

       • a file descriptor becomes ready;

       • the call is interrupted by a signal handler; or

       • the timeout expires.

       Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that  the  blocking  interval  may
       overrun by a small amount.  Specifying a negative value in timeout means an infinite timeout.  Specifying a timeout of zero causes poll() to return im‐
       mediately, even if no file descriptors are ready.

       The bits that may be set/returned in events and revents are defined in <poll.h>:

       POLLIN There is data to read.

       POLLPRI
              There is some exceptional condition on the file descriptor.  Possibilities include:

              • There is out-of-band data on a TCP socket (see tcp(7)).

              • A pseudoterminal master in packet mode has seen a state change on the slave (see ioctl_tty(2)).

              • A cgroup.events file has been modified (see cgroups(7)).

       POLLOUT
              Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set).

       POLLRDHUP (since Linux 2.6.17)
              Stream socket peer closed connection, or shut down writing half of connection.  The _GNU_SOURCE feature test macro must be defined  (before  in‐
              cluding any header files) in order to obtain this definition.

       POLLERR
              Error  condition (only returned in revents; ignored in events).  This bit is also set for a file descriptor referring to the write end of a pipe
              when the read end has been closed.

       POLLHUP
              Hang up (only returned in revents; ignored in events).  Note that when reading from a channel such as a pipe or  a  stream  socket,  this  event
              merely indicates that the peer closed its end of the channel.  Subsequent reads from the channel will return 0 (end of file) only after all out‐
              standing data in the channel has been consumed.

       POLLNVAL
              Invalid request: fd not open (only returned in revents; ignored in events).

       When compiling with _XOPEN_SOURCE defined, one also has the following, which convey no further information beyond the bits listed above:

       POLLRDNORM
              Equivalent to POLLIN.

       POLLRDBAND
              Priority band data can be read (generally unused on Linux).

       POLLWRNORM
              Equivalent to POLLOUT.

       POLLWRBAND
              Priority data may be written.

       Linux also knows about, but does not use POLLMSG.

   ppoll()
       The relationship between poll() and ppoll() is analogous to the relationship between select(2) and pselect(2): like pselect(2), ppoll() allows  an  ap‐
       plication to safely wait until either a file descriptor becomes ready or until a signal is caught.

       Other than the difference in the precision of the timeout argument, the following ppoll() call:

           ready = ppoll(&fds, nfds, tmo_p, &sigmask);

       is nearly equivalent to atomically executing the following calls:

           sigset_t origmask;
           int timeout;

           timeout = (tmo_p == NULL) ? -1 :
                     (tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = poll(&fds, nfds, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       The  above  code segment is described as nearly equivalent because whereas a negative timeout value for poll() is interpreted as an infinite timeout, a
       negative value expressed in *tmo_p results in an error from ppoll().

       See the description of pselect(2) for an explanation of why ppoll() is necessary.

       If the sigmask argument is specified as NULL, then no signal mask manipulation is performed (and thus ppoll() differs from poll() only in the precision
       of the timeout argument).

       The  tmo_p argument specifies an upper limit on the amount of time that ppoll() will block.  This argument is a pointer to a structure of the following
       form:

           struct timespec {
               long    tv_sec;         /* seconds */
               long    tv_nsec;        /* nanoseconds */
           };

       If tmo_p is specified as NULL, then ppoll() can block indefinitely.

RETURN VALUE
       On success, poll() returns a nonnegative value which is the number of elements in the pollfds whose revents fields have been set  to  a  nonzero  value
       (indicating an event or an error).  A return value of zero indicates that the system call timed out before any file descriptors became ready.

       On error, -1 is returned, and errno is set to indicate the cause of the error.

ERRORS
       EFAULT fds  points  outside  the  process's  accessible  address space.  The array given as argument was not contained in the calling program's address
              space.

       EINTR  A signal occurred before any requested event; see signal(7).

       EINVAL The nfds value exceeds the RLIMIT_NOFILE value.

       EINVAL (ppoll()) The timeout value expressed in *tmo_p is invalid (negative).

       ENOMEM Unable to allocate memory for kernel data structures.

VERSIONS
       The poll() system call was introduced in Linux 2.1.23.  On older kernels that lack this system call, the glibc poll() wrapper function provides  emula‐
       tion using select(2).

       The ppoll() system call was added to Linux in kernel 2.6.16.  The ppoll() library call was added in glibc 2.4.

CONFORMING TO
       poll() conforms to POSIX.1-2001 and POSIX.1-2008.  ppoll() is Linux-specific.

NOTES
       The operation of poll() and ppoll() is not affected by the O_NONBLOCK flag.

       On  some  other  UNIX  systems,  poll() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as
       Linux does.  POSIX permits this behavior.  Portable programs may wish to check for EAGAIN and loop, just as with EINTR.

       Some implementations define the nonstandard constant INFTIM with the value -1 for use as a timeout for poll().  This constant is not provided in glibc.

       For a discussion of what may happen if a file descriptor being monitored by poll() is closed in another thread, see select(2).

   C library/kernel differences
       The Linux ppoll() system call modifies its tmo_p argument.  However, the glibc wrapper function hides this behavior by using a local variable  for  the
       timeout argument that is passed to the system call.  Thus, the glibc ppoll() function does not modify its tmo_p argument.

       The  raw  ppoll()  system  call  has a fifth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument.  The glibc ppoll()
       wrapper function specifies this argument as a fixed value (equal to sizeof(kernel_sigset_t)).  See sigprocmask(2) for a discussion on  the  differences
       between the kernel and the libc notion of the sigset.

BUGS
       See the discussion of spurious readiness notifications under the BUGS section of select(2).

EXAMPLES
       The  program  below  opens  each  of  the  files  named in its command-line arguments and monitors the resulting file descriptors for readiness to read
       (POLLIN).  The program loops, repeatedly using poll() to monitor the file descriptors, printing the number of ready file descriptors  on  return.   For
       each ready file descriptor, the program:

       • displays the returned revents field in a human-readable form;

       • if the file descriptor is readable, reads some data from it, and displays that data on standard output; and

       • if the file descriptor was not readable, but some other event occurred (presumably POLLHUP), closes the file descriptor.

       Suppose we run the program in one terminal, asking it to open a FIFO:

           $ mkfifo myfifo
           $ ./poll_input myfifo

       In a second terminal window, we then open the FIFO for writing, write some data to it, and close the FIFO:

           $ echo aaaaabbbbbccccc > myfifo

       In the terminal where we are running the program, we would then see:

           Opened "myfifo" on fd 3
           About to poll()
           Ready: 1
             fd=3; events: POLLIN POLLHUP
               read 10 bytes: aaaaabbbbb
           About to poll()
           Ready: 1
             fd=3; events: POLLIN POLLHUP
               read 6 bytes: ccccc

           About to poll()
           Ready: 1
             fd=3; events: POLLHUP
               closing fd 3
           All file descriptors closed; bye

       In the above output, we see that poll() returned three times:

       • On  the  first  return,  the bits returned in the revents field were POLLIN, indicating that the file descriptor is readable, and POLLHUP, indicating
         that the other end of the FIFO has been closed.  The program then consumed some of the available input.

       • The second return from poll() also indicated POLLIN and POLLHUP; the program then consumed the last of the available input.

       • On the final return, poll() indicated only POLLHUP on the FIFO, at which point the file descriptor was closed and the program terminated.

   Program source

       /* poll_input.c

          Licensed under GNU General Public License v2 or later.
       */
       #include <poll.h>
       #include <fcntl.h>
       #include <sys/types.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       int
       main(int argc, char *argv[])
       {
           int nfds, num_open_fds;
           struct pollfd *pfds;

           if (argc < 2) {
              fprintf(stderr, "Usage: %s file...\n", argv[0]);
              exit(EXIT_FAILURE);
           }

           num_open_fds = nfds = argc - 1;
           pfds = calloc(nfds, sizeof(struct pollfd));
           if (pfds == NULL)
               errExit("calloc");

           /* Open each file on command line, and add it to 'pfds' array */

           for (int j = 0; j < nfds; j++) {
               pfds[j].fd = open(argv[j + 1], O_RDONLY);
               if (pfds[j].fd == -1)
                   errExit("open");

               printf("Opened \"%s\" on fd %d\n", argv[j + 1], pfds[j].fd);

               pfds[j].events = POLLIN;
           }

           /* Keep calling poll() as long as at least one file descriptor is
              open */

           while (num_open_fds > 0) {
               int ready;

               printf("About to poll()\n");
               ready = poll(pfds, nfds, -1);
               if (ready == -1)
                   errExit("poll");

               printf("Ready: %d\n", ready);

               /* Deal with array returned by poll() */

               for (int j = 0; j < nfds; j++) {
                   char buf[10];

                   if (pfds[j].revents != 0) {
                       printf("  fd=%d; events: %s%s%s\n", pfds[j].fd,
                               (pfds[j].revents & POLLIN)  ? "POLLIN "  : "",
                               (pfds[j].revents & POLLHUP) ? "POLLHUP " : "",
                               (pfds[j].revents & POLLERR) ? "POLLERR " : "");

                       if (pfds[j].revents & POLLIN) {
                           ssize_t s = read(pfds[j].fd, buf, sizeof(buf));
                           if (s == -1)
                               errExit("read");
                           printf("    read %zd bytes: %.*s\n",
                                   s, (int) s, buf);
                       } else {                /* POLLERR | POLLHUP */
                           printf("    closing fd %d\n", pfds[j].fd);
                           if (close(pfds[j].fd) == -1)
                               errExit("close");
                           num_open_fds--;
                       }
                   }
               }
           }

           printf("All file descriptors closed; bye\n");
           exit(EXIT_SUCCESS);
       }

SEE ALSO
       restart_syscall(2), select(2), select_tut(2), epoll(7), time(7)

COLOPHON
       This page is part of release 5.10 of the Linux man-pages project.  A description of the project, information about reporting bugs, and the latest  ver‐
       sion of this page, can be found at https://www.kernel.org/doc/man-pages/.

Linux                                                                     2020-04-11                                                                   POLL(2)

epoll

int epoll_create(int size);

int size: ignored since Linux 2.6.8, but it must be greater than 0; any positive value will do.

NAME
       epoll_create, epoll_create1 - open an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_create(int size);
       int epoll_create1(int flags);

DESCRIPTION
       epoll_create()  creates a new epoll(7) instance.  Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES.
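
As a minimal sketch of creating an epoll instance: the helper names `make_epoll` and `has_cloexec` below are hypothetical and only for illustration. The sketch uses epoll_create1(), which takes a flags argument instead of the ignored size, and passes EPOLL_CLOEXEC so the descriptor is not leaked across execve().

```c
#include <fcntl.h>
#include <sys/epoll.h>

/* Hypothetical helper: create an epoll instance whose descriptor is
   closed automatically on execve(). Returns the new fd, or -1 on error. */
int make_epoll(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}

/* Hypothetical helper: returns 1 if FD_CLOEXEC is set on fd,
   0 if it is not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

A descriptor obtained from plain epoll_create(1) does not have the close-on-exec flag set, which is one practical reason to prefer epoll_create1() in new code.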

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

int epfd: the file descriptor returned by epoll_create()
int op: EPOLL_CTL_ADD to add, EPOLL_CTL_MOD to update, EPOLL_CTL_DEL to remove
int fd: the file descriptor to monitor
struct epoll_event *event: the events to monitor on fd

NAME
       epoll_ctl - control interface for an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

DESCRIPTION
       This  system  call is used to add, modify, or remove entries in the interest list of the epoll(7) instance re‐
       ferred to by the file descriptor epfd.  It requests that the operation op be performed for the target file de‐
       scriptor, fd.

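int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

int epfd: the file descriptor returned by epoll_create()
struct epoll_event *events: array that receives the ready events; the number of ready events is the return value of epoll_wait()
int maxevents: maximum number of events to return; the kernel uses this value as the length of the events array
int timeout: timeout in milliseconds; -1 blocks indefinitely, 0 returns immediately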

NAME
       epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
       int epoll_pwait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout,
                      const sigset_t *sigmask);

DESCRIPTION
       The  epoll_wait()  system  call  waits  for events on the epoll(7) instance referred to by the file descriptor
       epfd.  The buffer pointed to by events is used to return information from the ready list about  file  descrip‐
       tors in the interest list that have some events available.  Up to maxevents are returned by epoll_wait().  The
       maxevents argument must be greater than zero.
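
A small sketch of consuming the result of epoll_wait(): the return value says how many entries of the events array were filled in. The helper name `count_readable` and the batch size MAX_EVENTS are assumptions made for illustration.

```c
#include <sys/epoll.h>

#define MAX_EVENTS 8   /* illustrative batch size, not a required value */

/* Hypothetical helper: wait up to timeout_ms milliseconds for events on
   epfd and return how many of the reported descriptors are readable.
   Returns -1 if epoll_wait() fails. */
int count_readable(int epfd, int timeout_ms)
{
    struct epoll_event events[MAX_EVENTS];
    int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    if (n == -1)
        return -1;

    int readable = 0;
    for (int i = 0; i < n; i++) {
        /* Only the first n entries are valid. */
        if (events[i].events & EPOLLIN)
            readable++;
    }
    return readable;
}
```

Only ready descriptors are returned, so the loop runs over n entries rather than over every registered descriptor as with select() or poll().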

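Inside the kernel, epoll keeps the monitored file descriptors in a red-black tree and the ready events in a linked list. epoll has two built-in trigger modes: ET (edge-triggered) and LT (level-triggered).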

man epoll

EPOLL(7)                                                           Linux Programmer's Manual                                                          EPOLL(7)

NAME
       epoll - I/O event notification facility

SYNOPSIS
       #include <sys/epoll.h>

DESCRIPTION
       The  epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them.  The epoll API can be
       used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.

       The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be  considered  as  a
       container for two lists:

       • The interest list (sometimes also called the epoll set): the set of file descriptors that the process has registered an interest in monitoring.

       • The ready list: the set of file descriptors that are "ready" for I/O.  The ready list is a subset of (or, more precisely, a set of references to) the
         file descriptors in the interest list.  The ready list is dynamically populated by the kernel as a result of I/O activity on those file descriptors.

       The following system calls are provided to create and manage an epoll instance:

       • epoll_create(2) creates a new epoll instance and returns a file descriptor referring to that instance.  (The more recent epoll_create1(2) extends the
         functionality of epoll_create(2).)

       • Interest in particular file descriptors is then registered via epoll_ctl(2), which adds items to the interest list of the epoll instance.

       • epoll_wait(2)  waits for I/O events, blocking the calling thread if no events are currently available.  (This system call can be thought of as fetch‐
         ing items from the ready list of the epoll instance.)

   Level-triggered and edge-triggered
       The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT).  The difference between the two mech‐
       anisms can be described as follows.  Suppose that this scenario happens:

       1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.

       2. A pipe writer writes 2 kB of data on the write side of the pipe.

       3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.

       4. The pipe reader reads 1 kB of data from rfd.

       5. A call to epoll_wait(2) is done.

       If the rfd file descriptor has been added to the epoll interface using the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in step 5 will
       probably hang despite the available data still present in the file input buffer; meanwhile the remote peer might be expecting a response based  on  the
       data  it  already sent.  The reason for this is that edge-triggered mode delivers events only when changes occur on the monitored file descriptor.  So,
       in step 5 the caller might end up waiting for some data that is already present inside the input buffer.  In the above example, an event on rfd will be
       generated  because  of  the write done in 2 and the event is consumed in 3.  Since the read operation done in 4 does not consume the whole buffer data,
       the call to epoll_wait(2) done in step 5 might block indefinitely.
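
The five steps above can be replayed on a pipe. This is a sketch under stated assumptions: the function name `step5_result` is hypothetical, and the final epoll_wait() uses a zero timeout so that the edge-triggered case reports "no events" instead of hanging.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Replays steps 1-5 on a fresh pipe. 'extra_flags' is 0 for
   level-triggered or EPOLLET for edge-triggered registration.
   Returns the result of the step-5 epoll_wait(), or -1 on setup error. */
int step5_result(unsigned int extra_flags)
{
    int p[2], result = -1;
    char buf[2048];
    struct epoll_event ev, out;

    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN | extra_flags;          /* step 1: register rfd */
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        goto done;

    memset(buf, 'x', sizeof(buf));
    if (write(p[1], buf, 2048) != 2048)         /* step 2: write 2 kB */
        goto done;
    if (epoll_wait(epfd, &out, 1, 0) != 1)      /* step 3: rfd is ready */
        goto done;
    if (read(p[0], buf, 1024) != 1024)          /* step 4: read only 1 kB */
        goto done;
    result = epoll_wait(epfd, &out, 1, 0);      /* step 5 */

done:
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return result;
}
```

Calling step5_result(0) should report one ready descriptor, because level-triggered mode keeps reporting while data remains. Calling step5_result(EPOLLET) should report zero, because no new write happened after step 3 even though 1 kB is still buffered.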

       An application that employs the EPOLLET flag should use nonblocking file descriptors to avoid having a blocking read or write starve  a  task  that  is
       handling multiple file descriptors.  The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:

       a) with nonblocking file descriptors; and

       b) by waiting for an event only after read(2) or write(2) return EAGAIN.

       By  contrast,  when used as a level-triggered interface (the default, when EPOLLET is not specified), epoll is simply a faster poll(2), and can be used
       wherever the latter is used since it shares the same semantics.

       Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data, the caller has the  option  to  specify
       the EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with epoll_wait(2).  When the EPOLLONESHOT
       flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.
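
A sketch of the one-shot pattern, with hypothetical helper names `add_oneshot` and `rearm_oneshot`: the descriptor is registered with EPOLLONESHOT, and after each delivered event it stays silent until it is re-armed with EPOLL_CTL_MOD.

```c
#include <sys/epoll.h>

/* Hypothetical helper: register fd for a single read notification. */
int add_oneshot(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: re-enable notifications for fd after its event
   has been handled. The event mask must be supplied again. */
int rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```

This is mainly useful when several threads share one epoll instance: the thread that received the event owns the descriptor until it calls rearm_oneshot(), so no other thread is woken for the same descriptor in the meantime.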

       If multiple threads (or processes, if child processes have inherited the epoll file descriptor across fork(2)) are blocked in epoll_wait(2) waiting  on
       the  same epoll file descriptor and a file descriptor in the interest list that is marked for edge-triggered (EPOLLET) notification becomes ready, just
       one of the threads (or processes) is awoken from epoll_wait(2).  This provides a useful optimization for avoiding "thundering herd"  wake-ups  in  some
       scenarios.

   Interaction with autosleep
       If the system is in autosleep mode via /sys/power/autosleep and an event happens which wakes the device from sleep, the device driver will keep the de‐
       vice awake only until that event is queued.  To keep the device awake until the event has been processed, it  is  necessary  to  use  the  epoll_ctl(2)
       EPOLLWAKEUP flag.

       When  the  EPOLLWAKEUP  flag  is  set  in the events field for a struct epoll_event, the system will be kept awake from the moment the event is queued,
       through the epoll_wait(2) call which returns the event until the subsequent epoll_wait(2) call.  If the event should keep the system awake beyond  that
       time, then a separate wake_lock should be taken before the second epoll_wait(2) call.

   /proc interfaces
       The following interfaces can be used to limit the amount of kernel memory consumed by epoll:

       /proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)
              This  specifies a limit on the total number of file descriptors that a user can register across all epoll instances on the system.  The limit is
              per real user ID.  Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a  64-bit  kernel.   Cur‐
              rently, the default value for max_user_watches is 1/25 (4%) of the available low memory, divided by the registration cost in bytes.

   Example for suggested usage
       While  the  usage of epoll when employed as a level-triggered interface does have the same semantics as poll(2), the edge-triggered usage requires more
       clarification to avoid stalls in the application event loop.  In this example, listener is a nonblocking socket on which  listen(2)  has  been  called.
       The  function do_use_fd() uses the new ready file descriptor until EAGAIN is returned by either read(2) or write(2).  An event-driven state machine ap‐
       plication should, after having received EAGAIN, record its current state so that at the next call  to  do_use_fd()  it  will  continue  to  read(2)  or
       write(2) from where it stopped before.

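           #define MAX_EVENTS 10
           struct epoll_event ev, events[MAX_EVENTS];
           int listen_sock, conn_sock, nfds, epollfd;
           int n;                                 /* index into events[] */
           struct sockaddr_storage addr;          /* peer address from accept() */
           socklen_t addrlen = sizeof(addr);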

           /* Code to set up listening socket, 'listen_sock',
              (socket(), bind(), listen()) omitted */

           epollfd = epoll_create1(0);
           if (epollfd == -1) {
               perror("epoll_create1");
               exit(EXIT_FAILURE);
           }

           ev.events = EPOLLIN;
           ev.data.fd = listen_sock;
           if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
               perror("epoll_ctl: listen_sock");
               exit(EXIT_FAILURE);
           }

           for (;;) {
               nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
               if (nfds == -1) {
                   perror("epoll_wait");
                   exit(EXIT_FAILURE);
               }

               for (n = 0; n < nfds; ++n) {
                   if (events[n].data.fd == listen_sock) {
                       addrlen = sizeof(addr);   /* accept() overwrites addrlen */
                       conn_sock = accept(listen_sock,
                                          (struct sockaddr *) &addr, &addrlen);
                       if (conn_sock == -1) {
                           perror("accept");
                           exit(EXIT_FAILURE);
                       }
                       setnonblocking(conn_sock);
                       ev.events = EPOLLIN | EPOLLET;
                       ev.data.fd = conn_sock;
                       if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                                   &ev) == -1) {
                           perror("epoll_ctl: conn_sock");
                           exit(EXIT_FAILURE);
                       }
                   } else {
                       do_use_fd(events[n].data.fd);
                   }
               }
           }

       When used as an edge-triggered interface, for performance reasons, it is possible to add the file descriptor inside the epoll interface (EPOLL_CTL_ADD)
       once by specifying (EPOLLIN|EPOLLOUT).  This allows you to avoid  continuously  switching  between  EPOLLIN  and  EPOLLOUT  calling  epoll_ctl(2)  with
       EPOLL_CTL_MOD.

   Questions and answers
       0.  What is the key used to distinguish the file descriptors registered in an interest list?

           The  key is the combination of the file descriptor number and the open file description (also known as an "open file handle", the kernel's internal
           representation of an open file).

       1.  What happens if you register the same file descriptor on an epoll instance twice?

           You will probably get EEXIST.  However, it is possible to add a duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) file descriptor to the same epoll in‐
           stance.  This can be a useful technique for filtering events, if the duplicate file descriptors are registered with different events masks.

       2.  Can two epoll instances wait for the same file descriptor?  If so, are events reported to both epoll file descriptors?

           Yes, and events would be reported to both.  However, careful programming may be needed to do this correctly.

       3.  Is the epoll file descriptor itself poll/epoll/selectable?

           Yes.  If an epoll file descriptor has events waiting, then it will indicate as being readable.

       4.  What happens if one attempts to put an epoll file descriptor into its own file descriptor set?

           The epoll_ctl(2) call fails (EINVAL).  However, you can add an epoll file descriptor inside another epoll file descriptor set.

       5.  Can I send an epoll file descriptor over a UNIX domain socket to another process?

           Yes, but it does not make sense to do this, since the receiving process would not have copies of the file descriptors in the interest list.

       6.  Will closing a file descriptor cause it to be removed from all epoll interest lists?

           Yes,  but be aware of the following point.  A file descriptor is a reference to an open file description (see open(2)).  Whenever a file descriptor
           is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created.   An
           open file description continues to exist until all file descriptors referring to it have been closed.

           A  file descriptor is removed from an interest list only after all the file descriptors referring to the underlying open file description have been
           closed.  This means that even after a file descriptor that is part of an interest list has been closed, events may be reported for  that  file  de‐
           scriptor  if  other file descriptors referring to the same underlying file description remain open.  To prevent this happening, the file descriptor
           must be explicitly removed from the interest list (using epoll_ctl(2) EPOLL_CTL_DEL) before it is duplicated.  Alternatively, the application  must
           ensure that all file descriptors are closed (which may be difficult if file descriptors were duplicated behind the scenes by library functions that
           used dup(2) or fork(2)).

       7.  If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately?

           They will be combined.

       8.  Does an operation on a file descriptor affect the already collected but not yet reported events?

           You can do two operations on an existing file descriptor.  Remove would be meaningless for this case.  Modify will reread available I/O.

       9.  Do I need to continuously read/write a file descriptor until EAGAIN when using the EPOLLET flag (edge-triggered behavior)?

           Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is ready for the requested I/O operation.  You must  consider
           it ready until the next (nonblocking) read/write yields EAGAIN.  When and how you will use the file descriptor is entirely up to you.

           For  packet/token-oriented files (e.g., datagram socket, terminal in canonical mode), the only way to detect the end of the read/write I/O space is
           to continue to read/write until EAGAIN.

           For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is  exhausted  can  also  be  detected  by
           checking the amount of data read from / written to the target file descriptor.  For example, if you call read(2) by asking to read a certain amount
           of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor.  The  same  is
           true  when  writing  using  write(2).   (Avoid  this latter technique if you cannot guarantee that the monitored file descriptor always refers to a
           stream-oriented file.)
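
Several of the answers above (questions 1, 3, 4, and 6) can be checked with a few lines of code. The following is a sketch with hypothetical helper names (`try_add`, `epfd_readable`, `events_after_close`); all waits use a zero timeout so nothing blocks.

```c
#include <errno.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Q1/Q4: try to add fd to epfd for EPOLLIN.
   Returns 0 on success, otherwise the errno from epoll_ctl(). */
int try_add(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1 ? errno : 0;
}

/* Q3: returns 1 if poll() reports the epoll descriptor itself as readable. */
int epfd_readable(int epfd)
{
    struct pollfd pfd = { .fd = epfd, .events = POLLIN };
    return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN);
}

/* Q6: register the read end of a pipe, duplicate it, close the original,
   then write to the pipe. Returns the number of events epoll_wait()
   reports, or -1 on setup failure. */
int events_after_close(void)
{
    int p[2], result = -1;
    struct epoll_event out;

    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    int copy = dup(p[0]);   /* second fd for the same open file description */
    if (epfd != -1 && copy != -1 && try_add(epfd, p[0]) == 0) {
        close(p[0]);        /* not removed: 'copy' keeps the description open */
        p[0] = -1;
        if (write(p[1], "x", 1) == 1)
            result = epoll_wait(epfd, &out, 1, 0);
    }
    if (p[0] != -1)
        close(p[0]);
    if (copy != -1)
        close(copy);
    if (epfd != -1)
        close(epfd);
    close(p[1]);
    return result;
}
```

Adding the same descriptor twice should yield EEXIST while a dup(2) copy is accepted, adding an epoll descriptor to itself should yield EINVAL, and events_after_close() should still report one event even though the registered descriptor has been closed.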

   Possible pitfalls and ways to avoid them
       o Starvation (edge-triggered)

       If there is a large amount of I/O space, it is possible that by trying to drain it the other files will not get processed  causing  starvation.   (This
       problem is not specific to epoll.)

       The  solution  is  to maintain a ready list and mark the file descriptor as ready in its associated data structure, thereby allowing the application to
       remember which files need to be processed but still round robin amongst all the ready files.  This also supports ignoring subsequent events you receive
       for file descriptors that are already ready.

       o If using an event cache...

       If  you  use  an event cache or store all the file descriptors returned from epoll_wait(2), then make sure to provide a way to mark its closure dynami‐
       cally (i.e., caused by a previous event's processing).  Suppose you receive 100 events from epoll_wait(2), and in event #47 a  condition  causes  event
       #13  to  be closed.  If you remove the structure and close(2) the file descriptor for event #13, then your event cache might still say there are events
       waiting for that file descriptor causing confusion.

       One solution for this is to call, during the processing of event 47, epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2), then mark  its
       associated  data  structure  as  removed and link it to a cleanup list.  If you find another event for file descriptor 13 in your batch processing, you
       will discover the file descriptor had been previously removed and there will be no confusion.

VERSIONS
       The epoll API was introduced in Linux kernel 2.5.44.  Support was added to glibc in version 2.3.2.

CONFORMING TO
       The epoll API is Linux-specific.  Some other systems provide similar mechanisms, for example, FreeBSD has kqueue, and Solaris has /dev/poll.

NOTES
       The set of file descriptors that is being monitored via an epoll file descriptor can be viewed via the entry for  the  epoll  file  descriptor  in  the
       process's /proc/[pid]/fdinfo directory.  See proc(5) for further details.

       The kcmp(2) KCMP_EPOLL_TFD operation can be used to test whether a file descriptor is present in an epoll instance.

SEE ALSO
       epoll_create(2), epoll_create1(2), epoll_ctl(2), epoll_wait(2), poll(2), select(2)

COLOPHON
       This  page is part of release 5.10 of the Linux man-pages project.  A description of the project, information about reporting bugs, and the latest ver‐
       sion of this page, can be found at https://www.kernel.org/doc/man-pages/.

Linux                                                                     2019-03-06                                                                  EPOLL(7)

Reference
Linux Programmer’s Manual