select, poll, epoll
select
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
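- int nfds: the highest-numbered file descriptor in any of the three sets, plus 1 (an upper bound on the descriptor numbers to check, not a count of descriptors)
- fd_set *readfds: the set of file descriptors watched for reading
- fd_set *writefds: the set of file descriptors watched for writing
- fd_set *exceptfds: the set of file descriptors watched for exceptional conditions
- struct timeval *timeout: the timeout. NULL: wait indefinitely; a positive value: wait at most that long; zero: return immediately
The timeout is passed as a struct, which has no negative form, so NULL, a positive value, and zero express the three timeout states.
Summary: fd_set is a 1024-bit bitmap (FD_SETSIZE bits in glibc) in which each bit stands for one file descriptor.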
void FD_CLR(int fd, fd_set *set);
Remove fd from set.
int FD_ISSET(int fd, fd_set *set);
Test whether fd is present in set.
void FD_SET(int fd, fd_set *set);
Add fd to set.
void FD_ZERO(fd_set *set);
Clear set.
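On every select() call the descriptor bitmaps are copied from user space into the kernel. Once the kernel has checked the descriptors for events, the updated bitmaps are copied back to user space. User code then has to walk every descriptor it registered and test it with FD_ISSET() to find the ready ones. select() can handle only descriptors numbered below 1024. That limit is FD_SETSIZE, a compile-time constant of the glibc fd_set type rather than a limit in the kernel, so a program that needs more descriptors should use poll or epoll instead.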
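The macros and the call sequence described above can be sketched as follows. This is a minimal illustration rather than a canonical pattern; wait_readable is a hypothetical helper name, and the millisecond timeout parameter is an assumption made for convenience.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* wait_readable is a hypothetical helper, not a library function.
   Returns 1 if fd became readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    /* FD_SET() on a descriptor outside [0, FD_SETSIZE) is undefined. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);        /* start from an empty set */
    FD_SET(fd, &rfds);     /* watch exactly one descriptor */

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* nfds is the highest descriptor number plus one. */
    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;

    /* select() rewrote rfds in place; only ready descriptors remain. */
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

With several descriptors the caller would FD_SET() each one, pass the largest number plus one as nfds, and loop over all of them with FD_ISSET() afterwards.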
man select
SELECT(2) Linux Programmer's Manual SELECT(2)
NAME
select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O multiplexing
SYNOPSIS
#include <sys/select.h>
int select(int nfds, fd_set *readfds, fd_set *writefds,
fd_set *exceptfds, struct timeval *timeout);
void FD_CLR(int fd, fd_set *set);
int FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
int pselect(int nfds, fd_set *readfds, fd_set *writefds,
fd_set *exceptfds, const struct timespec *timeout,
const sigset_t *sigmask);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pselect(): _POSIX_C_SOURCE >= 200112L
DESCRIPTION
select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O
operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or
a sufficiently small write(2)) without blocking.
select() can monitor only file descriptor numbers that are less than FD_SETSIZE; poll(2) and epoll(7) do not have this limitation. See BUGS.
File descriptor sets
The principal arguments of select() are three "sets" of file descriptors (declared with the type fd_set), which allow the caller to wait for three
classes of events on the specified set of file descriptors. Each of the fd_set arguments may be specified as NULL if no file descriptors are to be
watched for the corresponding class of events.
Note well: Upon return, each of the file descriptor sets is modified in place to indicate which file descriptors are currently "ready". Thus, if using
select() within a loop, the sets must be reinitialized before each call. The implementation of the fd_set arguments as value-result arguments is a
design error that is avoided in poll(2) and epoll(7).
The contents of a file descriptor set can be manipulated using the following macros:
FD_ZERO()
This macro clears (removes all file descriptors from) set. It should be employed as the first step in initializing a file descriptor set.
FD_SET()
This macro adds the file descriptor fd to set. Adding a file descriptor that is already present in the set is a no-op, and does not produce an
error.
FD_CLR()
This macro removes the file descriptor fd from set. Removing a file descriptor that is not present in the set is a no-op, and does not produce
an error.
FD_ISSET()
select() modifies the contents of the sets according to the rules described below. After calling select(), the FD_ISSET() macro can be used to
test if a file descriptor is still present in a set. FD_ISSET() returns nonzero if the file descriptor fd is present in set, and zero if it is
not.
Arguments
The arguments of select() are as follows:
readfds
The file descriptors in this set are watched to see if they are ready for reading. A file descriptor is ready for reading if a read operation
will not block; in particular, a file descriptor is also ready on end-of-file.
After select() has returned, readfds will be cleared of all file descriptors except for those that are ready for reading.
writefds
The file descriptors in this set are watched to see if they are ready for writing. A file descriptor is ready for writing if a write operation
will not block. However, even if a file descriptor indicates as writable, a large write may still block.
After select() has returned, writefds will be cleared of all file descriptors except for those that are ready for writing.
exceptfds
The file descriptors in this set are watched for "exceptional conditions". For examples of some exceptional conditions, see the discussion of
POLLPRI in poll(2).
After select() has returned, exceptfds will be cleared of all file descriptors except for those for which an exceptional condition has occurred.
nfds This argument should be set to the highest-numbered file descriptor in any of the three sets, plus 1. The indicated file descriptors in each
set are checked, up to this limit (but see BUGS).
timeout
The timeout argument is a timeval structure (shown below) that specifies the interval that select() should block waiting for a file descriptor
to become ready. The call will block until either:
• a file descriptor becomes ready;
• the call is interrupted by a signal handler; or
• the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval
may overrun by a small amount.
If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.)
If timeout is specified as NULL, select() blocks indefinitely waiting for a file descriptor to become ready.
pselect()
The pselect() system call allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
The operation of select() and pselect() is identical, other than these three differences:
• select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and
nanoseconds).
• select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument.
• select() has no sigmask argument, and behaves as pselect() called with NULL sigmask.
sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed
to by sigmask, then does the "select" function, and then restores the original signal mask. (If sigmask is NULL, the signal mask is not modified
during the pselect() call.)
Other than the difference in the precision of the timeout argument, the following pselect() call:
ready = pselect(nfds, &readfds, &writefds, &exceptfds,
timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is
needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of
select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block
signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)
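The block-then-pselect() sequence described above might look like the following sketch. It assumes a single-threaded program and uses SIGUSR1 purely as an example; got_signal, block_sigusr1 and wait_signal_or_timeout are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t got_signal = 0;

static void on_sigusr1(int signo)
{
    (void) signo;
    got_signal = 1;
}

/* Step 1: install the handler and block SIGUSR1, saving the old mask. */
static int block_sigusr1(sigset_t *orig)
{
    struct sigaction sa;
    sigset_t block;

    assert(orig != NULL);

    sa.sa_handler = on_sigusr1;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    return sigprocmask(SIG_BLOCK, &block, orig);
}

/* Step 2: test the flag, then wait with SIGUSR1 unblocked only for the
   duration of pselect(). Because the signal is blocked between the test
   and the call, it cannot slip through the gap.
   Returns 1 if the signal was seen, 0 on timeout, -1 on error. */
static int wait_signal_or_timeout(const sigset_t *orig, long timeout_sec)
{
    sigset_t waitmask = *orig;
    struct timespec ts;
    int ready;

    sigdelset(&waitmask, SIGUSR1);
    ts.tv_sec = timeout_sec;
    ts.tv_nsec = 0;

    if (got_signal)
        return 1;

    ready = pselect(0, NULL, NULL, NULL, &ts, &waitmask);
    if (ready == -1 && errno == EINTR && got_signal)
        return 1;
    return ready == 0 ? 0 : -1;
}
```

A real program would pass its descriptor sets to pselect() instead of NULL; they are omitted here to keep the signal handling visible.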
The timeout
The timeout argument for select() is a structure of the following type:
struct timeval {
time_t tv_sec; /* seconds */
suseconds_t tv_usec; /* microseconds */
};
The corresponding argument for pselect() has the following type:
struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits either
behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses
a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns.
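Since both the descriptor sets and the timeout are value-result arguments, a loop around select() has to rebuild them on every pass. A minimal sketch, where count_timeouts is a hypothetical helper used only to make the loop observable:

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* count_timeouts is a hypothetical helper: it calls select() `rounds`
   times on fd and returns how many calls timed out, or -1 on error. */
static int count_timeouts(int fd, int rounds, long timeout_usec)
{
    int timeouts = 0;

    assert(fd >= 0 && fd < FD_SETSIZE);
    assert(timeout_usec >= 0 && timeout_usec < 1000000);

    for (int i = 0; i < rounds; i++) {
        fd_set rfds;
        struct timeval tv;
        int ready;

        /* Rebuild both value-result arguments on every iteration. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = 0;
        tv.tv_usec = timeout_usec;

        ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready == -1)
            return -1;
        if (ready == 0)
            timeouts++;
    }
    return timeouts;
}
```

If tv were initialized once outside the loop, Linux would leave it at zero after the first timeout and later iterations would stop waiting.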
RETURN VALUE
On success, select() and pselect() return the number of file descriptors contained in the three returned descriptor sets (that is, the total number of
bits that are set in readfds, writefds, exceptfds). The return value may be zero if the timeout expired before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined.
ERRORS
EBADF An invalid file descriptor was given in one of the sets. (Perhaps a file descriptor that was already closed, or one on which an error has
occurred.) However, see BUGS.
EINTR A signal was caught; see signal(7).
EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).
EINVAL The value contained within timeout is invalid.
ENOMEM Unable to allocate memory for internal tables.
VERSIONS
pselect() was added to Linux in kernel 2.6.16. Prior to this, pselect() was emulated in glibc (but see BUGS).
CONFORMING TO
select() conforms to POSIX.1-2001, POSIX.1-2008, and 4.4BSD (select() first appeared in 4.2BSD). Generally portable to/from non-BSD systems supporting
clones of the BSD socket layer (including System V variants). However, note that the System V variant typically sets the timeout variable before
returning, but the BSD variant does not.
pselect() is defined in POSIX.1g, and in POSIX.1-2001 and POSIX.1-2008.
NOTES
An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will
result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.
The operation of select() and pselect() is not affected by the O_NONBLOCK flag.
On some other UNIX systems, select() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as
Linux does. POSIX specifies this error for poll(2), but not for select(). Portable programs may wish to check for EAGAIN and loop, just as with
EINTR.
The self-pipe trick
On systems that lack pselect(), reliable (and more portable) signal trapping can be achieved using the self-pipe trick. In this technique, a signal
handler writes a byte to a pipe whose other end is monitored by select() in the main program. (To avoid possibly blocking when writing to a pipe that
may be full or reading from a pipe that may be empty, nonblocking I/O is used when reading from and writing to the pipe.)
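The self-pipe trick can be sketched as below. This is one possible arrangement, not a reference implementation; sig_pipe, self_pipe_init and self_pipe_wait are hypothetical names, and encoding the signal number in the byte written is a choice made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

static int sig_pipe[2] = { -1, -1 };

static void on_signal(int signo)
{
    int saved_errno = errno;
    unsigned char b = (unsigned char) signo;

    /* write() is async-signal-safe. If the pipe is full (EAGAIN), a
       wakeup is already pending, so the failure can be ignored. */
    if (write(sig_pipe[1], &b, 1) == -1) {
        /* nothing to do */
    }
    errno = saved_errno;
}

/* Creates the nonblocking pipe and installs the handler for signo. */
static int self_pipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(sig_pipe) == -1)
        return -1;
    for (int i = 0; i < 2; i++) {
        int fl = fcntl(sig_pipe[i], F_GETFL);
        if (fl == -1 || fcntl(sig_pipe[i], F_SETFL, fl | O_NONBLOCK) == -1)
            return -1;
    }
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Returns the signal number read from the pipe, 0 on timeout, -1 on error. */
static int self_pipe_wait(int timeout_sec)
{
    assert(sig_pipe[0] >= 0 && sig_pipe[0] < FD_SETSIZE);

    for (;;) {
        fd_set rfds;
        struct timeval tv = { timeout_sec, 0 };
        unsigned char b;
        int r;

        FD_ZERO(&rfds);
        FD_SET(sig_pipe[0], &rfds);
        r = select(sig_pipe[0] + 1, &rfds, NULL, NULL, &tv);
        if (r == -1 && errno == EINTR)
            continue;           /* the handler ran; retry to see its byte */
        if (r <= 0)
            return r;
        if (read(sig_pipe[0], &b, 1) == 1)
            return b;
        if (errno != EAGAIN)
            return -1;
    }
}
```

In a full program the read end of the pipe would be one of several descriptors in the read set, alongside the sockets or files being served.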
Emulating usleep(3)
Before the advent of usleep(3), some code employed a call to select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable
way to sleep with subsecond precision.
Correspondence between select() and poll() notifications
Within the Linux kernel source, we find the following definitions which show the correspondence between the readable, writable, and exceptional
condition notifications of select() and the event notifications provided by poll(2) and epoll(7):
#define POLLIN_SET (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
EPOLLHUP | EPOLLERR)
/* Ready for reading */
#define POLLOUT_SET (EPOLLWRBAND | EPOLLWRNORM | EPOLLOUT |
EPOLLERR)
/* Ready for writing */
#define POLLEX_SET (EPOLLPRI)
/* Exceptional condition */
Multithreaded applications
If a file descriptor being monitored by select() is closed in another thread, the result is unspecified. On some UNIX systems, select() unblocks and
returns, with an indication that the file descriptor is ready (a subsequent I/O operation will likely fail with an error, unless another process
reopens the file descriptor between the time select() returned and the I/O operation is performed). On Linux (and some other systems), closing the file
descriptor in another thread has no effect on select(). In summary, any application that relies on a particular behavior in this scenario must be
considered buggy.
C library/kernel differences
The Linux kernel allows file descriptor sets of arbitrary size, determining the length of the sets to be checked from the value of nfds. However, in
the glibc implementation, the fd_set type is fixed in size. See also BUGS.
The pselect() interface described in this page is implemented by glibc. The underlying Linux system call is named pselect6(). This system call has
somewhat different behavior from the glibc wrapper function.
The Linux pselect6() system call modifies its timeout argument. However, the glibc wrapper function hides this behavior by using a local variable for
the timeout argument that is passed to the system call. Thus, the glibc pselect() function does not modify its timeout argument; this is the behavior
required by POSIX.1-2001.
The final argument of the pselect6() system call is not a sigset_t * pointer, but is instead a structure of the form:
struct {
const kernel_sigset_t *ss; /* Pointer to signal set */
size_t ss_len; /* Size (in bytes) of object
pointed to by 'ss' */
};
This allows the system call to obtain both a pointer to the signal set and its size, while allowing for the fact that most architectures support a
maximum of 6 arguments to a system call. See sigprocmask(2) for a discussion of the difference between the kernel and libc notion of the signal set.
Historical glibc details
Glibc 2.0 provided an incorrect version of pselect() that did not take a sigmask argument.
In glibc versions 2.1 to 2.2.1, one must define _GNU_SOURCE in order to obtain the declaration of pselect() from <sys/select.h>.
BUGS
POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified
in a file descriptor set. The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE
defined as 1024, and the FD_*() macros operating according to that limit. To monitor file descriptors greater than 1023, use poll(2) or epoll(7)
instead.
According to POSIX, select() should check all specified file descriptors in the three file descriptor sets, up to the limit nfds-1. However, the
current implementation ignores any file descriptor in these sets that is greater than the maximum file descriptor number that the process currently has
open. According to POSIX, any such file descriptor that is specified in one of the sets should result in the error EBADF.
Starting with version 2.1, glibc provided an emulation of pselect() that was implemented using sigprocmask(2) and select(). This implementation
remained vulnerable to the very race condition that pselect() was designed to prevent. Modern versions of glibc use the (race-free) pselect() system
call on kernels where it is provided.
On Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example
happen when data has arrived but upon examination has the wrong checksum and is discarded. There may be other circumstances in which a file descriptor
is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.
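One way to guard against such spurious readiness reports is to put the descriptor in nonblocking mode and treat EAGAIN as "not ready after all". A sketch, where set_nonblocking and read_if_available are hypothetical helper names:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Adds O_NONBLOCK to the file status flags of fd. Returns 0 or -1. */
static int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Reads from a nonblocking fd. *would_block is set to 1 when the
   readiness report turned out to be spurious and no data was available. */
static ssize_t read_if_available(int fd, void *buf, size_t len,
                                 int *would_block)
{
    ssize_t n = read(fd, buf, len);

    *would_block = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return n;
}
```

When would_block is set, the caller simply goes back to select() instead of stalling inside read().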
On Linux, select() also modifies timeout if the call is interrupted by a signal handler (i.e., the EINTR error return). This is not permitted by
POSIX.1. The Linux pselect() system call has the same behavior, but the glibc wrapper hides this behavior by internally copying the timeout to a local
variable and passing that variable to the system call.
EXAMPLES
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

int
main(void)
{
    fd_set rfds;
    struct timeval tv;
    int retval;

    /* Watch stdin (fd 0) to see when it has input. */

    FD_ZERO(&rfds);
    FD_SET(0, &rfds);

    /* Wait up to five seconds. */

    tv.tv_sec = 5;
    tv.tv_usec = 0;

    retval = select(1, &rfds, NULL, NULL, &tv);
    /* Don't rely on the value of tv now! */

    if (retval == -1)
        perror("select()");
    else if (retval)
        printf("Data is available now.\n");
        /* FD_ISSET(0, &rfds) will be true. */
    else
        printf("No data within five seconds.\n");

    exit(EXIT_SUCCESS);
}
SEE ALSO
accept(2), connect(2), poll(2), read(2), recv(2), restart_syscall(2), send(2), sigprocmask(2), write(2), epoll(7), time(7)
For a tutorial with discussion and examples, see select_tut(2).
COLOPHON
This page is part of release 5.10 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest
version of this page, can be found at https://www.kernel.org/doc/man-pages/.
Linux 2020-11-01 SELECT(2)
poll
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
struct pollfd *fds: a wrapper around each fd; fds is the address of the first element of the pollfd array.
The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:
struct pollfd {
    int   fd;         /* file descriptor */
    short events;     /* requested events: the events to monitor */
    short revents;    /* returned events: the events that actually occurred */
};
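There are four kinds of input events (POLLIN, POLLRDNORM, POLLRDBAND, POLLPRI), three kinds of output events (POLLOUT, POLLWRNORM, POLLWRBAND), and three kinds of error events (POLLERR, POLLHUP, POLLNVAL).
nfds_t nfds: the number of entries in the fds array
int timeout: the timeout in milliseconds. Negative: wait indefinitely; positive: wait at most that long; 0: return immediately
On every call the pollfd array is copied from user space into the kernel. Once the kernel has checked the descriptors for events, the results are copied back to user space. User code then has to walk the whole array and inspect each revents field to find the ready descriptors. There is no fixed limit on the number of descriptors, because the array can be grown as needed. Note: poll keeps the descriptors in an array in user space; inside the kernel they are converted to a linked list, and they are converted back to an array when the results are copied out.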
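The array-in, scan-afterwards pattern described above can be sketched as follows. count_readable is a hypothetical helper, and MAX_FDS is an arbitrary cap chosen to keep the example free of dynamic allocation.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_FDS 16   /* hypothetical cap for this sketch */

/* Polls n descriptors for input and returns how many reported POLLIN,
   0 on timeout, or -1 on error. */
static int count_readable(const int *fds, int n, int timeout_ms)
{
    struct pollfd pfds[MAX_FDS];
    int ready, readable = 0;

    assert(n > 0 && n <= MAX_FDS);

    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;   /* input parameter */
        pfds[i].revents = 0;       /* output parameter, filled by the kernel */
    }

    ready = poll(pfds, (nfds_t) n, timeout_ms);
    if (ready <= 0)
        return ready;

    /* User code has to walk the whole array to find the ready entries. */
    for (int i = 0; i < n; i++)
        if (pfds[i].revents & POLLIN)
            readable++;
    return readable;
}
```

Unlike select(), the events field survives the call, so the same array can be passed again without being rebuilt.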
man poll
POLL(2) Linux Programmer's Manual POLL(2)
NAME
poll, ppoll - wait for some event on a file descriptor
SYNOPSIS
#include <poll.h>
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <signal.h>
#include <poll.h>
int ppoll(struct pollfd *fds, nfds_t nfds,
const struct timespec *tmo_p, const sigset_t *sigmask);
DESCRIPTION
poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O. The Linux-specific epoll(7)
API performs a similar task, but offers features beyond those found in poll().
The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:
struct pollfd {
int fd; /* file descriptor */
short events; /* requested events */
short revents; /* returned events */
};
The caller should specify the number of items in the fds array in nfds.
The field fd contains a file descriptor for an open file. If this field is negative, then the corresponding events field is ignored and the revents
field returns zero. (This provides an easy way of ignoring a file descriptor for a single poll() call: simply negate the fd field. Note, however,
that this technique can't be used to ignore file descriptor 0.)
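The negation technique for temporarily ignoring an entry can be sketched like this. pollfd_toggle_ignore is a hypothetical helper; the assertion reflects the caveat that descriptor 0 cannot be negated.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Flips the sign of p->fd. A negative fd makes poll() skip the entry and
   report revents == 0; calling the helper again restores the descriptor. */
static void pollfd_toggle_ignore(struct pollfd *p)
{
    assert(p != NULL && p->fd != 0);   /* -0 == 0, so fd 0 cannot be hidden */
    p->fd = -p->fd;
}
```

The events field is left untouched, so the entry resumes monitoring with its original settings once the sign is flipped back.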
The field events is an input parameter, a bit mask specifying the events the application is interested in for the file descriptor fd. This field may
be specified as zero, in which case the only events that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL (see below).
The field revents is an output parameter, filled by the kernel with the events that actually occurred. The bits returned in revents can include any of
those specified in events, or one of the values POLLERR, POLLHUP, or POLLNVAL. (These three bits are meaningless in the events field, and will be set
in the revents field whenever the corresponding condition is true.)
If none of the events requested (and no error) has occurred for any of the file descriptors, then poll() blocks until one of the events occurs.
The timeout argument specifies the number of milliseconds that poll() should block waiting for a file descriptor to become ready. The call will block
until either:
• a file descriptor becomes ready;
• the call is interrupted by a signal handler; or
• the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may
overrun by a small amount. Specifying a negative value in timeout means an infinite timeout. Specifying a timeout of zero causes poll() to return
immediately, even if no file descriptors are ready.
The bits that may be set/returned in events and revents are defined in <poll.h>:
POLLIN There is data to read.
POLLPRI
There is some exceptional condition on the file descriptor. Possibilities include:
• There is out-of-band data on a TCP socket (see tcp(7)).
• A pseudoterminal master in packet mode has seen a state change on the slave (see ioctl_tty(2)).
• A cgroup.events file has been modified (see cgroups(7)).
POLLOUT
Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set).
POLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing half of connection. The _GNU_SOURCE feature test macro must be defined (before
including any header files) in order to obtain this definition.
POLLERR
Error condition (only returned in revents; ignored in events). This bit is also set for a file descriptor referring to the write end of a pipe
when the read end has been closed.
POLLHUP
Hang up (only returned in revents; ignored in events). Note that when reading from a channel such as a pipe or a stream socket, this event
merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all
outstanding data in the channel has been consumed.
POLLNVAL
Invalid request: fd not open (only returned in revents; ignored in events).
When compiling with _XOPEN_SOURCE defined, one also has the following, which convey no further information beyond the bits listed above:
POLLRDNORM
Equivalent to POLLIN.
POLLRDBAND
Priority band data can be read (generally unused on Linux).
POLLWRNORM
Equivalent to POLLOUT.
POLLWRBAND
Priority data may be written.
Linux also knows about, but does not use POLLMSG.
ppoll()
The relationship between poll() and ppoll() is analogous to the relationship between select(2) and pselect(2): like pselect(2), ppoll() allows an
application to safely wait until either a file descriptor becomes ready or until a signal is caught.
Other than the difference in the precision of the timeout argument, the following ppoll() call:
ready = ppoll(&fds, nfds, tmo_p, &sigmask);
is nearly equivalent to atomically executing the following calls:
sigset_t origmask;
int timeout;
timeout = (tmo_p == NULL) ? -1 :
(tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = poll(&fds, nfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The above code segment is described as nearly equivalent because whereas a negative timeout value for poll() is interpreted as an infinite timeout, a
negative value expressed in *tmo_p results in an error from ppoll().
See the description of pselect(2) for an explanation of why ppoll() is necessary.
If the sigmask argument is specified as NULL, then no signal mask manipulation is performed (and thus ppoll() differs from poll() only in the precision
of the timeout argument).
The tmo_p argument specifies an upper limit on the amount of time that ppoll() will block. This argument is a pointer to a structure of the following
form:
struct timespec {
long tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
If tmo_p is specified as NULL, then ppoll() can block indefinitely.
RETURN VALUE
On success, poll() returns a nonnegative value which is the number of elements in the pollfds whose revents fields have been set to a nonzero value
(indicating an event or an error). A return value of zero indicates that the system call timed out before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the cause of the error.
ERRORS
EFAULT fds points outside the process's accessible address space. The array given as argument was not contained in the calling program's address
space.
EINTR A signal occurred before any requested event; see signal(7).
EINVAL The nfds value exceeds the RLIMIT_NOFILE value.
EINVAL (ppoll()) The timeout value expressed in *tmo_p is invalid (negative).
ENOMEM Unable to allocate memory for kernel data structures.
VERSIONS
The poll() system call was introduced in Linux 2.1.23. On older kernels that lack this system call, the glibc poll() wrapper function provides
emulation using select(2).
The ppoll() system call was added to Linux in kernel 2.6.16. The ppoll() library call was added in glibc 2.4.
CONFORMING TO
poll() conforms to POSIX.1-2001 and POSIX.1-2008. ppoll() is Linux-specific.
NOTES
The operation of poll() and ppoll() is not affected by the O_NONBLOCK flag.
On some other UNIX systems, poll() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as
Linux does. POSIX permits this behavior. Portable programs may wish to check for EAGAIN and loop, just as with EINTR.
Some implementations define the nonstandard constant INFTIM with the value -1 for use as a timeout for poll(). This constant is not provided in glibc.
For a discussion of what may happen if a file descriptor being monitored by poll() is closed in another thread, see select(2).
C library/kernel differences
The Linux ppoll() system call modifies its tmo_p argument. However, the glibc wrapper function hides this behavior by using a local variable for the
timeout argument that is passed to the system call. Thus, the glibc ppoll() function does not modify its tmo_p argument.
The raw ppoll() system call has a fifth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument. The glibc ppoll()
wrapper function specifies this argument as a fixed value (equal to sizeof(kernel_sigset_t)). See sigprocmask(2) for a discussion on the differences
between the kernel and the libc notion of the sigset.
BUGS
See the discussion of spurious readiness notifications under the BUGS section of select(2).
EXAMPLES
The program below opens each of the files named in its command-line arguments and monitors the resulting file descriptors for readiness to read
(POLLIN). The program loops, repeatedly using poll() to monitor the file descriptors, printing the number of ready file descriptors on return. For
each ready file descriptor, the program:
• displays the returned revents field in a human-readable form;
• if the file descriptor is readable, reads some data from it, and displays that data on standard output; and
• if the file descriptor was not readable, but some other event occurred (presumably POLLHUP), closes the file descriptor.
Suppose we run the program in one terminal, asking it to open a FIFO:
$ mkfifo myfifo
$ ./poll_input myfifo
In a second terminal window, we then open the FIFO for writing, write some data to it, and close the FIFO:
$ echo aaaaabbbbbccccc > myfifo
In the terminal where we are running the program, we would then see:
Opened "myfifo" on fd 3
About to poll()
Ready: 1
fd=3; events: POLLIN POLLHUP
read 10 bytes: aaaaabbbbb
About to poll()
Ready: 1
fd=3; events: POLLIN POLLHUP
read 6 bytes: ccccc
About to poll()
Ready: 1
fd=3; events: POLLHUP
closing fd 3
All file descriptors closed; bye
In the above output, we see that poll() returned three times:
• On the first return, the bits returned in the revents field were POLLIN, indicating that the file descriptor is readable, and POLLHUP, indicating
that the other end of the FIFO has been closed. The program then consumed some of the available input.
• The second return from poll() also indicated POLLIN and POLLHUP; the program then consumed the last of the available input.
• On the final return, poll() indicated only POLLHUP on the FIFO, at which point the file descriptor was closed and the program terminated.
Program source
/* poll_input.c

   Licensed under GNU General Public License v2 or later.
*/
#include <poll.h>
#include <fcntl.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int
main(int argc, char *argv[])
{
    int nfds, num_open_fds;
    struct pollfd *pfds;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s file...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    num_open_fds = nfds = argc - 1;
    pfds = calloc(nfds, sizeof(struct pollfd));
    if (pfds == NULL)
        errExit("calloc");

    /* Open each file on command line, and add it to 'pfds' array */

    for (int j = 0; j < nfds; j++) {
        pfds[j].fd = open(argv[j + 1], O_RDONLY);
        if (pfds[j].fd == -1)
            errExit("open");

        printf("Opened \"%s\" on fd %d\n", argv[j + 1], pfds[j].fd);

        pfds[j].events = POLLIN;
    }

    /* Keep calling poll() as long as at least one file descriptor is
       open */

    while (num_open_fds > 0) {
        int ready;

        printf("About to poll()\n");
        ready = poll(pfds, nfds, -1);
        if (ready == -1)
            errExit("poll");

        printf("Ready: %d\n", ready);

        /* Deal with array returned by poll() */

        for (int j = 0; j < nfds; j++) {
            char buf[10];

            if (pfds[j].revents != 0) {
                printf("  fd=%d; events: %s%s%s\n", pfds[j].fd,
                       (pfds[j].revents & POLLIN)  ? "POLLIN "  : "",
                       (pfds[j].revents & POLLHUP) ? "POLLHUP " : "",
                       (pfds[j].revents & POLLERR) ? "POLLERR " : "");

                if (pfds[j].revents & POLLIN) {
                    ssize_t s = read(pfds[j].fd, buf, sizeof(buf));
                    if (s == -1)
                        errExit("read");
                    printf("    read %zd bytes: %.*s\n",
                           s, (int) s, buf);
                } else {                /* POLLERR | POLLHUP */
                    printf("    closing fd %d\n", pfds[j].fd);
                    if (close(pfds[j].fd) == -1)
                        errExit("close");
                    pfds[j].fd = -1;    /* poll() ignores negative fds */
                    num_open_fds--;
                }
            }
        }
    }

    printf("All file descriptors closed; bye\n");
    exit(EXIT_SUCCESS);
}
SEE ALSO
restart_syscall(2), select(2), select_tut(2), epoll(7), time(7)
Linux 2020-04-11 POLL(2)
epoll
int epoll_create(int size);
int size: ignored by the kernel; any value greater than 0 will do.
NAME
epoll_create, epoll_create1 - open an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>
int epoll_create(int size);
int epoll_create1(int flags);
DESCRIPTION
epoll_create() creates a new epoll(7) instance. Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES.
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epfd: the file descriptor returned by epoll_create()
int op: EPOLL_CTL_ADD: add; EPOLL_CTL_MOD: modify; EPOLL_CTL_DEL: remove
int fd: the file descriptor to monitor
struct epoll_event *event: the events to monitor on fd
NAME
epoll_ctl - control interface for an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
DESCRIPTION
This system call is used to add, modify, or remove entries in the interest list of the epoll(7) instance
referred to by the file descriptor epfd. It requests that the operation op be performed for the target file
descriptor, fd.
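int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
int epfd: the file descriptor returned by epoll_create()
struct epoll_event *events: the buffer that receives the ready events; the return value of epoll_wait() is the number of entries filled in
int maxevents: the maximum number of events to return; it tells the kernel how many entries the events array can hold
int timeout: the timeout in milliseconds (-1 blocks indefinitely, 0 returns immediately)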
NAME
epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *events,
int maxevents, int timeout);
int epoll_pwait(int epfd, struct epoll_event *events,
int maxevents, int timeout,
const sigset_t *sigmask);
DESCRIPTION
The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor
epfd. The buffer pointed to by events is used to return information from the ready list about file
descriptors in the interest list that have some events available. Up to maxevents are returned by epoll_wait(). The
maxevents argument must be greater than zero.
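The three calls introduced so far fit together roughly as in the sketch below. epoll_wait_readable is a hypothetical helper that watches a single descriptor; a real server would keep the epoll instance open across calls instead of creating it each time.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers fd for EPOLLIN on a fresh epoll instance and waits once.
   Returns the epoll_wait() result (1 = ready, 0 = timeout, -1 = error)
   and stores the ready descriptor in *ready_fd when the result is 1. */
static int epoll_wait_readable(int fd, int timeout_ms, int *ready_fd)
{
    struct epoll_event ev, out;
    int epfd, n;

    assert(ready_fd != NULL);

    epfd = epoll_create(1);           /* size is ignored but must be > 0 */
    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN;              /* events to monitor */
    ev.data.fd = fd;                  /* user data handed back on readiness */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    n = epoll_wait(epfd, &out, 1, timeout_ms);   /* maxevents = 1 */
    if (n == 1)
        *ready_fd = out.data.fd;

    close(epfd);
    return n;
}
```

Only ready entries are written to the output buffer, so the caller never scans descriptors that have nothing to report.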
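To track the file descriptors registered with an epoll instance, the kernel uses data structures such as a red-black tree and a linked list of ready events. epoll has two built-in trigger modes: ET (edge-triggered) and LT (level-triggered).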
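The difference between the two trigger modes can be observed with a pipe that is only partly drained. The sketch below is an illustration under that setup; second_wait_result is a hypothetical helper, and passing 0 or EPOLLET as its argument selects LT or ET.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Writes 2 bytes to a pipe, waits, reads 1 byte, then waits again with a
   zero timeout. Returns the result of that second epoll_wait(), or -1 if
   any step fails. */
static int second_wait_result(uint32_t extra_flags)
{
    struct epoll_event ev, out;
    int p[2], epfd, n = -1;
    char c;

    assert((extra_flags & ~(uint32_t) EPOLLET) == 0);

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1)
        goto close_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        goto close_all;
    if (write(p[1], "ab", 2) != 2)
        goto close_all;
    if (epoll_wait(epfd, &out, 1, 0) != 1)    /* first notification */
        goto close_all;
    if (read(p[0], &c, 1) != 1)               /* leave one byte unread */
        goto close_all;

    /* LT reports the leftover byte again; ET stays silent until new
       data arrives. */
    n = epoll_wait(epfd, &out, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(p[0]);
    close(p[1]);
    return n;
}
```

This is why edge-triggered code normally uses nonblocking descriptors and reads until EAGAIN before waiting again.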
man epoll
EPOLL(7) Linux Programmer's Manual EPOLL(7)
NAME
epoll - I/O event notification facility
SYNOPSIS
#include <sys/epoll.h>
DESCRIPTION
The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be
used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.
The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a
container for two lists:
• The interest list (sometimes also called the epoll set): the set of file descriptors that the process has registered an interest in monitoring.
• The ready list: the set of file descriptors that are "ready" for I/O. The ready list is a subset of (or, more precisely, a set of references to) the
file descriptors in the interest list. The ready list is dynamically populated by the kernel as a result of I/O activity on those file descriptors.
The following system calls are provided to create and manage an epoll instance:
• epoll_create(2) creates a new epoll instance and returns a file descriptor referring to that instance. (The more recent epoll_create1(2) extends the
functionality of epoll_create(2).)
• Interest in particular file descriptors is then registered via epoll_ctl(2), which adds items to the interest list of the epoll instance.
• epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available. (This system call can be thought of as
fetching items from the ready list of the epoll instance.)
Level-triggered and edge-triggered
The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT). The difference between the two
mechanisms can be described as follows. Suppose that this scenario happens:
1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.
2. A pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.
4. The pipe reader reads 1 kB of data from rfd.
5. A call to epoll_wait(2) is done.
If the rfd file descriptor has been added to the epoll interface using the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in step 5 will
probably hang despite the available data still present in the file input buffer; meanwhile the remote peer might be expecting a response based on the
data it already sent. The reason for this is that edge-triggered mode delivers events only when changes occur on the monitored file descriptor. So,
in step 5 the caller might end up waiting for some data that is already present inside the input buffer. In the above example, an event on rfd will be
generated because of the write done in 2 and the event is consumed in 3. Since the read operation done in 4 does not consume the whole buffer data,
the call to epoll_wait(2) done in step 5 might block indefinitely.
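The five steps can be replayed in a few lines. The sketch below is an illustration under stated assumptions, not part of the man page: second_wait_result() is a hypothetical name, and both epoll_wait() calls use a zero timeout so that the edge-triggered case shows up as "no event" instead of a hang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Replays steps 1-5 with extra_flags added to EPOLLIN: pass 0 for
 * level-triggered or EPOLLET for edge-triggered. Returns what the second
 * epoll_wait() reports (number of ready descriptors), or -1 on setup error. */
int second_wait_result(uint32_t extra_flags)
{
    int pipefd[2], result = -1;
    char buf[2048];
    struct epoll_event ev, got;

    /* Only the trigger mode is meant to vary in this sketch. */
    assert((extra_flags & ~(uint32_t)EPOLLET) == 0);

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;
    if (pipe(pipefd) == -1) {
        close(epfd);
        return -1;
    }

    /* 1. Register the read side of the pipe (rfd). */
    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto done;

    /* 2. The writer writes 2 kB. */
    memset(buf, 'x', sizeof buf);
    if (write(pipefd[1], buf, sizeof buf) != (ssize_t)sizeof buf)
        goto done;

    /* 3. The first wait returns rfd as ready. */
    if (epoll_wait(epfd, &got, 1, 0) != 1)
        goto done;

    /* 4. The reader consumes only 1 kB. */
    if (read(pipefd[0], buf, 1024) != 1024)
        goto done;

    /* 5. The second wait: data is still buffered, but was there a new edge? */
    result = epoll_wait(epfd, &got, 1, 0);
done:
    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return result;
}
```

With level-triggered registration the second wait reports rfd again because 1 kB is still unread. With EPOLLET it typically reports nothing, which is the situation the man page describes as a probable hang when the timeout is -1.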
An application that employs the EPOLLET flag should use nonblocking file descriptors to avoid having a blocking read or write starve a task that is
handling multiple file descriptors. The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:
a) with nonblocking file descriptors; and
b) by waiting for an event only after read(2) or write(2) return EAGAIN.
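Points a) and b) amount to a drain loop on a nonblocking descriptor. The following is a minimal sketch of that loop; drain_fd() and drain_demo() are hypothetical names, and the data is simply counted and discarded where a real handler would process it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: reads from a nonblocking fd until read(2) reports
 * EAGAIN (or end-of-file). Returns bytes consumed, or -1 on a real error. */
ssize_t drain_fd(int fd)
{
    char buf[512];
    ssize_t total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;         /* a real handler would process buf here */
            continue;
        }
        if (n == 0)
            return total;       /* end-of-file: the peer closed */
        if (errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;       /* drained: now safe to call epoll_wait() */
        return -1;
    }
}

/* Demo driver: writes msg into a pipe whose read end is nonblocking,
 * then drains it. Returns the number of bytes drained, or -1 on error. */
ssize_t drain_demo(const char *msg)
{
    int pipefd[2];
    ssize_t result = -1;
    size_t len = strlen(msg);

    if (pipe(pipefd) == -1)
        return -1;

    int flags = fcntl(pipefd[0], F_GETFL, 0);
    if (flags != -1 &&
        fcntl(pipefd[0], F_SETFL, flags | O_NONBLOCK) != -1 &&
        write(pipefd[1], msg, len) == (ssize_t)len)
        result = drain_fd(pipefd[0]);

    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

The loop must only be used on descriptors that have O_NONBLOCK set; on a blocking descriptor the final read(2) would block instead of returning EAGAIN.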
By contrast, when used as a level-triggered interface (the default, when EPOLLET is not specified), epoll is simply a faster poll(2), and can be used
wherever the latter is used since it shares the same semantics.
Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data, the caller has the option to specify
the EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with epoll_wait(2). When the EPOLLONESHOT
flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.
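A short sketch of the EPOLLONESHOT cycle may help here. It is an illustration only: rearm_oneshot() and oneshot_demo() are hypothetical names, and a pipe with one unread byte is assumed as the event source.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: re-enables a one-shot registration for `events`. */
int rearm_oneshot(int epfd, int fd, uint32_t events)
{
    assert(epfd >= 0 && fd >= 0);
    struct epoll_event ev = { .events = events | EPOLLONESHOT, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Demo driver: returns 0 if the descriptor is reported once, stays silent
 * while disabled, and is reported again after rearming; -1 otherwise. */
int oneshot_demo(void)
{
    int pipefd[2], rc = -1;
    struct epoll_event ev, got;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;
    if (pipe(pipefd) == -1) {
        close(epfd);
        return -1;
    }

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto done;
    if (write(pipefd[1], "x", 1) != 1)
        goto done;

    if (epoll_wait(epfd, &got, 1, 0) != 1)      /* delivered once */
        goto done;
    if (epoll_wait(epfd, &got, 1, 0) != 0)      /* now disabled, data unread */
        goto done;
    if (rearm_oneshot(epfd, pipefd[0], EPOLLIN) == -1)
        goto done;
    if (epoll_wait(epfd, &got, 1, 0) != 1)      /* reported again */
        goto done;
    rc = 0;
done:
    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return rc;
}
```

In a multithreaded server, the thread that received the event would typically call rearm_oneshot() only after it has finished handling the descriptor, so that no second thread is woken for the same descriptor in the meantime.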
If multiple threads (or processes, if child processes have inherited the epoll file descriptor across fork(2)) are blocked in epoll_wait(2) waiting on
the same epoll file descriptor and a file descriptor in the interest list that is marked for edge-triggered (EPOLLET) notification becomes ready, just
one of the threads (or processes) is awoken from epoll_wait(2). This provides a useful optimization for avoiding "thundering herd" wake-ups in some
scenarios.
Interaction with autosleep
If the system is in autosleep mode via /sys/power/autosleep and an event happens which wakes the device from sleep, the device driver will keep the
device awake only until that event is queued. To keep the device awake until the event has been processed, it is necessary to use the epoll_ctl(2)
EPOLLWAKEUP flag.
When the EPOLLWAKEUP flag is set in the events field for a struct epoll_event, the system will be kept awake from the moment the event is queued,
through the epoll_wait(2) call which returns the event until the subsequent epoll_wait(2) call. If the event should keep the system awake beyond that
time, then a separate wake_lock should be taken before the second epoll_wait(2) call.
/proc interfaces
The following interfaces can be used to limit the amount of kernel memory consumed by epoll:
/proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)
This specifies a limit on the total number of file descriptors that a user can register across all epoll instances on the system. The limit is
per real user ID. Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit kernel.
Currently, the default value for max_user_watches is 1/25 (4%) of the available low memory, divided by the registration cost in bytes.
Example for suggested usage
While the usage of epoll when employed as a level-triggered interface does have the same semantics as poll(2), the edge-triggered usage requires more
clarification to avoid stalls in the application event loop. In this example, listener is a nonblocking socket on which listen(2) has been called.
The function do_use_fd() uses the new ready file descriptor until EAGAIN is returned by either read(2) or write(2). An event-driven state machine
application should, after having received EAGAIN, record its current state so that at the next call to do_use_fd() it will continue to read(2) or
write(2) from where it stopped before.
    #define MAX_EVENTS 10
    struct epoll_event ev, events[MAX_EVENTS];
    int listen_sock, conn_sock, nfds, epollfd, n;
    struct sockaddr_storage addr;   /* peer address filled in by accept() */
    socklen_t addrlen;

    /* Code to set up listening socket, 'listen_sock',
       (socket(), bind(), listen()) omitted */

    epollfd = epoll_create1(0);
    if (epollfd == -1) {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }

    /* The listening socket is registered level-triggered. */
    ev.events = EPOLLIN;
    ev.data.fd = listen_sock;
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
        perror("epoll_ctl: listen_sock");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
        if (nfds == -1) {
            perror("epoll_wait");
            exit(EXIT_FAILURE);
        }

        for (n = 0; n < nfds; ++n) {
            if (events[n].data.fd == listen_sock) {
                addrlen = sizeof(addr);   /* must be reset before each accept() */
                conn_sock = accept(listen_sock,
                                   (struct sockaddr *) &addr, &addrlen);
                if (conn_sock == -1) {
                    perror("accept");
                    exit(EXIT_FAILURE);
                }
                /* Connections are nonblocking and edge-triggered. */
                setnonblocking(conn_sock);
                ev.events = EPOLLIN | EPOLLET;
                ev.data.fd = conn_sock;
                if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                              &ev) == -1) {
                    perror("epoll_ctl: conn_sock");
                    exit(EXIT_FAILURE);
                }
            } else {
                do_use_fd(events[n].data.fd);
            }
        }
    }
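The example calls setnonblocking() without defining it. One plausible definition, assuming the usual fcntl(2) approach, is sketched below; setnonblocking_demo() is a hypothetical driver added only to show the effect.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Sets O_NONBLOCK on fd while preserving the other file status flags.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int setnonblocking(int fd)
{
    assert(fd >= 0);

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Demo driver: after setnonblocking(), reading an empty pipe must fail
 * with EAGAIN instead of blocking. Returns 0 if that is what happens. */
int setnonblocking_demo(void)
{
    int pipefd[2], rc = -1;
    char c;

    if (pipe(pipefd) == -1)
        return -1;
    if (setnonblocking(pipefd[0]) == 0 &&
        read(pipefd[0], &c, 1) == -1 &&
        (errno == EAGAIN || errno == EWOULDBLOCK))
        rc = 0;

    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

For sockets returned by accept(), Linux also offers accept4(2) with SOCK_NONBLOCK, which sets the flag without a separate fcntl(2) call.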
When used as an edge-triggered interface, for performance reasons, it is possible to add the file descriptor inside the epoll interface (EPOLL_CTL_ADD)
once by specifying (EPOLLIN|EPOLLOUT). This allows you to avoid continuously switching between EPOLLIN and EPOLLOUT calling epoll_ctl(2) with
EPOLL_CTL_MOD.
Questions and answers
0. What is the key used to distinguish the file descriptors registered in an interest list?
The key is the combination of the file descriptor number and the open file description (also known as an "open file handle", the kernel's internal
representation of an open file).
1. What happens if you register the same file descriptor on an epoll instance twice?
You will probably get EEXIST. However, it is possible to add a duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) file descriptor to the same epoll
instance. This can be a useful technique for filtering events, if the duplicate file descriptors are registered with different events masks.
2. Can two epoll instances wait for the same file descriptor? If so, are events reported to both epoll file descriptors?
Yes, and events would be reported to both. However, careful programming may be needed to do this correctly.
3. Is the epoll file descriptor itself poll/epoll/selectable?
Yes. If an epoll file descriptor has events waiting, then it will indicate as being readable.
4. What happens if one attempts to put an epoll file descriptor into its own file descriptor set?
The epoll_ctl(2) call fails (EINVAL). However, you can add an epoll file descriptor inside another epoll file descriptor set.
5. Can I send an epoll file descriptor over a UNIX domain socket to another process?
Yes, but it does not make sense to do this, since the receiving process would not have copies of the file descriptors in the interest list.
6. Will closing a file descriptor cause it to be removed from all epoll interest lists?
Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor
is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An
open file description continues to exist until all file descriptors referring to it have been closed.
A file descriptor is removed from an interest list only after all the file descriptors referring to the underlying open file description have been
closed. This means that even after a file descriptor that is part of an interest list has been closed, events may be reported for that file
descriptor if other file descriptors referring to the same underlying file description remain open. To prevent this happening, the file descriptor
must be explicitly removed from the interest list (using epoll_ctl(2) EPOLL_CTL_DEL) before it is duplicated. Alternatively, the application must
ensure that all file descriptors are closed (which may be difficult if file descriptors were duplicated behind the scenes by library functions that
used dup(2) or fork(2)).
7. If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately?
They will be combined.
8. Does an operation on a file descriptor affect the already collected but not yet reported events?
You can do two operations on an existing file descriptor. Remove would be meaningless for this case. Modify will reread available I/O.
9. Do I need to continuously read/write a file descriptor until EAGAIN when using the EPOLLET flag (edge-triggered behavior)?
Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is ready for the requested I/O operation. You must consider
it ready until the next (nonblocking) read/write yields EAGAIN. When and how you will use the file descriptor is entirely up to you.
For packet/token-oriented files (e.g., datagram socket, terminal in canonical mode), the only way to detect the end of the read/write I/O space is
to continue to read/write until EAGAIN.
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by
checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount
of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is
true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a
stream-oriented file.)
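The answers to questions 1 and 4 above are easy to check empirically. The sketch below is only an illustration: try_add() and qa_demo() are hypothetical names, and the results array simply records the errno (or 0) of each EPOLL_CTL_ADD attempt.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: tries to add fd to epfd for EPOLLIN.
 * Returns 0 on success, otherwise the errno reported by epoll_ctl(). */
int try_add(int epfd, int fd)
{
    assert(epfd >= 0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1 ? errno : 0;
}

/* Demo driver: fills results[0..4] with the outcome of five registrations.
 * Returns 0 if the setup succeeded, -1 otherwise. */
int qa_demo(int results[5])
{
    int pipefd[2] = { -1, -1 };
    int dupfd = -1, rc = -1;
    int inner = epoll_create1(0);
    int outer = epoll_create1(0);

    if (inner == -1 || outer == -1 || pipe(pipefd) == -1)
        goto done;
    dupfd = dup(pipefd[0]);
    if (dupfd == -1)
        goto done;

    results[0] = try_add(inner, pipefd[0]);  /* first registration */
    results[1] = try_add(inner, pipefd[0]);  /* Q1: same descriptor again */
    results[2] = try_add(inner, dupfd);      /* Q1: a dup(2) of it */
    results[3] = try_add(inner, inner);      /* Q4: instance into itself */
    results[4] = try_add(outer, inner);      /* Q4: instance into another */
    rc = 0;
done:
    if (dupfd != -1) close(dupfd);
    if (pipefd[0] != -1) close(pipefd[0]);
    if (pipefd[1] != -1) close(pipefd[1]);
    if (inner != -1) close(inner);
    if (outer != -1) close(outer);
    return rc;
}
```

On Linux the five entries are expected to be 0, EEXIST, 0, EINVAL, and 0, matching the answers given above.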
Possible pitfalls and ways to avoid them
o Starvation (edge-triggered)
If there is a large amount of I/O space, it is possible that by trying to drain it the other files will not get processed causing starvation. (This
problem is not specific to epoll.)
The solution is to maintain a ready list and mark the file descriptor as ready in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst all the ready files. This also supports ignoring subsequent events you receive
for file descriptors that are already ready.
o If using an event cache...
If you use an event cache or store all the file descriptors returned from epoll_wait(2), then make sure to provide a way to mark its closure
dynamically (i.e., caused by a previous event's processing). Suppose you receive 100 events from epoll_wait(2), and in event #47 a condition causes event
#13 to be closed. If you remove the structure and close(2) the file descriptor for event #13, then your event cache might still say there are events
waiting for that file descriptor causing confusion.
One solution for this is to call, during the processing of event 47, epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2), then mark its
associated data structure as removed and link it to a cleanup list. If you find another event for file descriptor 13 in your batch processing, you
will discover the file descriptor had been previously removed and there will be no confusion.
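The "mark as removed and link to a cleanup list" idea can be sketched as follows. Everything here is an assumption made for illustration: struct conn, conn_register(), conn_retire(), process_batch(), free_cleanup(), and event_cache_demo() are hypothetical names, and the per-descriptor record is reached through ev.data.ptr.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record, reachable through ev.data.ptr. */
struct conn {
    int fd;
    int removed;                /* set once the fd has been closed */
    struct conn *next_cleanup;  /* link in the deferred-free list */
};

struct conn *conn_register(int epfd, int fd)
{
    struct conn *c = calloc(1, sizeof *c);
    if (c == NULL)
        return NULL;
    c->fd = fd;
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = c };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        free(c);
        return NULL;
    }
    return c;
}

/* Deregister and close now, but defer free() until the batch is finished. */
void conn_retire(int epfd, struct conn *c, struct conn **cleanup)
{
    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = 1;
    c->next_cleanup = *cleanup;
    *cleanup = c;
}

/* Handles one batch. While handling any other connection's event, victim
 * is retired (standing in for "event #47 closes #13"). Returns the number
 * of events actually handled; stale events for removed records are skipped. */
int process_batch(int epfd, struct conn *victim, struct conn **cleanup)
{
    struct epoll_event events[16];
    int handled = 0;

    assert(victim != NULL && cleanup != NULL);
    int n = epoll_wait(epfd, events, 16, 0);
    if (n == -1)
        return -1;
    for (int i = 0; i < n; i++) {
        struct conn *c = events[i].data.ptr;
        if (c->removed)
            continue;           /* closed earlier in this same batch */
        handled++;
        if (c != victim)
            conn_retire(epfd, victim, cleanup);
    }
    return handled;
}

void free_cleanup(struct conn *list)
{
    while (list != NULL) {
        struct conn *next = list->next_cleanup;
        free(list);
        list = next;
    }
}

/* Demo driver: two readable pipes; handling the first retires the second. */
int event_cache_demo(void)
{
    int a[2], b[2], handled = -1;
    struct conn *cleanup = NULL;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;
    if (pipe(a) == -1) { close(epfd); return -1; }
    if (pipe(b) == -1) { close(a[0]); close(a[1]); close(epfd); return -1; }

    struct conn *ca = conn_register(epfd, a[0]);
    struct conn *cb = conn_register(epfd, b[0]);
    if (ca != NULL && cb != NULL &&
        write(a[1], "x", 1) == 1 && write(b[1], "x", 1) == 1)
        handled = process_batch(epfd, cb, &cleanup);

    if (cb == NULL || !cb->removed)
        close(b[0]);            /* not retired, so still open */
    if (cb != NULL && !cb->removed)
        free(cb);
    free_cleanup(cleanup);      /* safe: the batch is over */
    free(ca);
    close(a[0]); close(a[1]); close(b[1]); close(epfd);
    return handled;
}
```

Because the record is only freed after the batch, a stale event that still points at it can be recognized through the removed flag instead of dereferencing freed memory or acting on a reused descriptor number.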
VERSIONS
The epoll API was introduced in Linux kernel 2.5.44. Support was added to glibc in version 2.3.2.
CONFORMING TO
The epoll API is Linux-specific. Some other systems provide similar mechanisms, for example, FreeBSD has kqueue, and Solaris has /dev/poll.
NOTES
The set of file descriptors that is being monitored via an epoll file descriptor can be viewed via the entry for the epoll file descriptor in the
process's /proc/[pid]/fdinfo directory. See proc(5) for further details.
The kcmp(2) KCMP_EPOLL_TFD operation can be used to test whether a file descriptor is present in an epoll instance.
SEE ALSO
epoll_create(2), epoll_create1(2), epoll_ctl(2), epoll_wait(2), poll(2), select(2)
COLOPHON
This page is part of release 5.10 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest
version of this page, can be found at https://www.kernel.org/doc/man-pages/.
Linux 2019-03-06 EPOLL(7)
Reference
Linux Programmer’s Manual