I ran into a lot of questions while learning epoll, so I translated the man page with Youdao Translate and combined it with my own understanding. If you spot mistakes, please point them out in a reply so I can fix them, to help both myself and future learners.
The later parts will be translated bit by bit; the earlier parts should be roughly done. I cannot analyze the underlying kernel mechanisms in depth, so in places I have taken shortcuts and made educated guesses.
If you have no background yet, before studying the epoll server model you can read Bozh's technical blog to learn about blocking vs. non-blocking I/O and the common server models.
Keywords: udp, tcp, select, epoll, libevent
EPOLL(7) Linux Programmer's Manual EPOLL(7)
NAME
epoll - I/O event notification facility
SYNOPSIS
#include <sys/epoll.h>
DESCRIPTION
(Note: the descriptors mentioned below generally refer to network socket descriptors, but there is no such restriction; pipe descriptors, or anything else that is a file descriptor, can be monitored, since in Linux everything is a file. Terms flagged as search keywords are prerequisite knowledge to look up on your own; this document will not explain them one by one.)
epoll is a variant of poll(2) that can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors. The following system calls are provided to create and manage an epoll instance:
(Notes: epoll greatly raises the number of file descriptors that can be monitored. select is limited to 1024 watched descriptors, which severely limits how many connections a server can handle; the limit can be raised, but reportedly at a cost in efficiency. With epoll the number of monitored descriptors is limited only by memory; it is commonly claimed that about 1 GB of memory supports watching on the order of 100,000 descriptors. I/O events on a descriptor can be triggered in two ways, level-triggered (LT) and edge-triggered (ET), and epoll supports monitoring in both modes. Search keywords: why ET mode requires non-blocking I/O; blocking vs. non-blocking read. The sections that follow describe the system calls for creating and managing an epoll instance.)
* An epoll instance is created by epoll_create(2), which returns a file descriptor referring to the epoll instance. (The more recent epoll_create1(2) extends the functionality of epoll_create(2).)
(Notes: an "instance" here is a bit like an object in C++, but since in Linux everything can be represented by a file, what epoll_create(2) returns is a file descriptor referring to the epoll instance. epoll_create1(2) simply extends epoll_create(2); beginners can ignore it for now.)
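As a quick illustration, here is a minimal, hedged sketch of creating (and later closing) an epoll instance; the error handling mirrors the usage example further down this page:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    /* the size argument is only a hint and has been ignored since Linux 2.6.8,
       but it must be greater than zero */
    int epollfd = epoll_create(10);
    if (epollfd == -1) {
        perror("epoll_create");
        exit(EXIT_FAILURE);
    }

    /* ... register descriptors with epoll_ctl() and wait with epoll_wait() ... */

    close(epollfd);   /* the epoll instance itself is just a file descriptor */
    return 0;
}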
* Interest in particular file descriptors is then registered via epoll_ctl(2). The set of file descriptors currently registered on an epoll instance is sometimes called an epoll set.
(Notes: epoll_ctl(2) registers the file descriptors you are interested in onto the monitoring list; in other words, whichever fd you want to monitor, you use this function to insert it into epoll's monitoring queue. I have not read the kernel source, so this is inferred from the behavior. The man page only says the registered descriptors are "called an epoll set"; in select the registered descriptors form an fd set, while in epoll you add one with EPOLL_CTL_ADD.)
The epoll_ctl function
Declaration: int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
This function controls the events monitored on an epoll file descriptor: it can register, modify, and delete events.
epfd: the epoll file descriptor returned by epoll_create;
op: the operation to perform, one of EPOLL_CTL_ADD (register), EPOLL_CTL_MOD (modify), EPOLL_CTL_DEL (delete);
fd: the file descriptor to operate on;
event: pointer to a struct epoll_event;
Returns 0 on success and -1 on failure.
Looking up an example online makes this clear. For instance:
struct epoll_event ev;
// the file descriptor associated with the event to handle
ev.data.fd = listenfd;
// the event types to monitor
ev.events = EPOLLIN | EPOLLET;
// register the epoll event
epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);
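To round out the op values listed above, here is a hedged sketch of modifying and then removing the same registration; ev, epfd and listenfd are the same hypothetical variables as in the snippet above:
/* requires <sys/epoll.h> and <stdio.h> */
// switch the registration to plain level-triggered writable notification
ev.events = EPOLLOUT;
ev.data.fd = listenfd;
if (epoll_ctl(epfd, EPOLL_CTL_MOD, listenfd, &ev) == -1)
    perror("epoll_ctl: EPOLL_CTL_MOD");

// stop monitoring the descriptor entirely (kernels before 2.6.9 insist on a
// non-NULL event argument even for EPOLL_CTL_DEL, so pass &ev there instead)
if (epoll_ctl(epfd, EPOLL_CTL_DEL, listenfd, NULL) == -1)
    perror("epoll_ctl: EPOLL_CTL_DEL");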
* Finally, the actual wait is started by epoll_wait(2).
(Notes: the last step is to call epoll_wait(2) and wait for events to occur; I am tempted to translate it as "entering the event loop", as in Qt.)
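For symmetry with the epoll_ctl notes above, here is the declaration of epoll_wait plus a minimal, hedged sketch of the wait step; MAX_EVENTS and epollfd are assumed to be set up as in the usage example near the end of this page:
/* requires <sys/epoll.h> and <stdio.h> */
// int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
struct epoll_event events[MAX_EVENTS];
// block until at least one monitored descriptor is ready (timeout of -1 means wait forever)
int nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
if (nfds == -1)
    perror("epoll_wait");
for (int n = 0; n < nfds; ++n) {
    // events[n].data.fd tells you which descriptor is ready, events[n].events tells you why
}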
Level-Triggered and Edge-Triggered
A very vivid description of the difference (taken from a blog post; I have quoted too many blogs to list every original link):
Level-triggered --> something happened and you did not handle it? It keeps nagging you until you do.
Edge-triggered --> something happened, you are told once; you did not handle it? Too bad!
// Edge triggering raises a problem. The principle of edge triggering is that a signal is emitted only when the buffer changes state (new data is written). If the client sends no more data, the buffer stays stable, no signal is emitted, and no event is triggered. If every triggered event read all the buffered data there would be no problem, but in practice a single read often does not drain the buffer and some data remains; if that happened to be the client's last send, no more data will ever arrive and the buffer stays stable. A stable state triggers nothing (it notifies once on change; if you did not finish handling it, it will not notify a second time, because it does not rescan the buffer for leftover data). With no trigger, no event fires to read the leftover data, so the program never gets the complete message even though the data is sitting right there in its buffer (a problem that event-driven designs are prone to). A read-until-EAGAIN drain loop, sketched just below, is the usual cure.
Level triggering is easier to explain: it keeps checking whether the buffer holds data, and as soon as it does it signals the program's read event so the program drains it; nothing gets left behind. In the example later on, accept is handled in LT mode to avoid losing connections.
A blog post that ran into exactly this problem while programming: http://blog.csdn.net/sparkliang/article/details/4770655#reply
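Here is a minimal sketch, under the assumption of a non-blocking socket fd, of the read-until-EAGAIN loop that avoids the leftover-data problem in ET mode; the buffer size and the process_data() helper are made up for illustration:
#include <errno.h>
#include <unistd.h>

// drain everything currently sitting in the kernel receive buffer of a non-blocking fd
for (;;) {
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
        process_data(buf, n);        /* hypothetical handler for the bytes just read */
    } else if (n == 0) {
        /* peer closed the connection normally */
        break;
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        /* buffer drained; safe to go back to epoll_wait() in ET mode */
        break;
    } else {
        perror("read");
        break;
    }
}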
The epoll event distribution interface is able to behave both as edge-
triggered (ET) and as level-triggered (LT). The difference between the
two mechanisms can be described as follows. Suppose that this scenario
happens:
(Notes: the epoll event distribution interface can behave either as edge-triggered (ET) or as level-triggered (LT); the difference between the two is described next, assuming the following scenario.)
(Reference: http://blog.sina.com.cn/u/544465b0010000bp)
1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.
(Notes: we add a file handle (rfd) used to read from a pipe to epoll's monitoring "queue". Internally it may well be a red-black tree rather than a queue, but the kernel implementation does not affect user-level programming; I say queue because it is easier to picture.)
2. A pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.
(Notes: that translation is correct, but the man page skips the concrete steps. What epoll_wait actually returns is nfds, the number of ready events; you then check whether the returned m_events array contains rfd, which would mean input arrived on rfd. The man page shortens this to "returns rfd", which is hard to follow if you have never seen the code.
How it actually works: epoll adds all the fds to be monitored to a monitoring queue in the kernel and keeps a separate ready queue into which fds with pending events are copied; that ready queue is the m_events array we receive, and if rfd is in it, its state has changed. This is far more efficient than select, where we must loop over the entire monitored set to find which fds changed, whereas epoll hands back only the changed fds in the m_events array.
For example: int nfds = epoll_wait(epfd, m_events, MAX_EVENTS, EPOLL_TIME_OUT); // wait for epoll events to occur, which amounts to listening; note the first argument is the epoll descriptor epfd, not rfd.)
4. The pipe reader reads 1 kB of data from rfd.
5. A call to epoll_wait(2) is done.
Edge Triggered mode:
If we added rfd to the epoll descriptor using the EPOLLET flag in step 1, then the call to epoll_wait(2) in step 5 will probably hang, even though the remaining data is still present in the file's input buffer and the sender may be waiting for a response to the data it already sent. ET mode only reports events when something happens on the monitored file handle, so in step 5 the caller may end up waiting for data that is already sitting in the input buffer. In the example above, an event is generated on rfd because of the write in step 2, and that event is consumed in step 3. Since the read in step 4 does not drain the input buffer, the call to epoll_wait(2) in step 5 may block indefinitely. When epoll is used in ET mode it must be used with non-blocking sockets, to avoid having a blocking read or write on one file handle starve the task that is handling many descriptors. The recommended way to call the epoll interface in ET mode is given below; the possible pitfalls are covered later. (This circulated translation is painful to read; further down I re-explain it in my own words.)
i  use non-blocking file handles;
ii  suspend and wait only after read(2) or write(2) has returned EAGAIN.
Level Triggered mode:
By contrast, when epoll is used in LT mode it is simply a faster poll(2), and since it has the same semantics it can be used wherever poll(2) is used. Because even ET-mode epoll can generate multiple events when several chunks of data arrive, the caller may set the EPOLLONESHOT flag to tell epoll to disable the associated file handle after an event has been received via epoll_wait(2). When EPOLLONESHOT is set, it becomes the caller's responsibility to re-arm the file handle with epoll_ctl(2) and EPOLL_CTL_MOD.
The above is translated from man epoll.
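Since the EPOLLONESHOT re-arm step trips people up, here is a hedged sketch of what it might look like; epollfd and fd are assumed to come from the surrounding context:
/* requires <sys/epoll.h> and <stdio.h> */
struct epoll_event ev;

// initial registration: one-shot, edge-triggered readable notification
ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
ev.data.fd = fd;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    perror("epoll_ctl: add");

/* ... epoll_wait() reports fd once, we handle the I/O, and the fd is now
   disabled inside the epoll set ... */

// re-arm the descriptor so epoll_wait() can report it again
ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
ev.data.fd = fd;
if (epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &ev) == -1)
    perror("epoll_ctl: rearm");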
If the rfd file descriptor has been added to the epoll interface using the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in step 5 will probably hang despite the available data still present in the file input buffer; meanwhile the remote peer might be expecting a response based on the data it already sent.
(Notes: suppose we used the EPOLLET flag, i.e. registered rfd on the epoll monitoring queue in edge-triggered mode,
// the event types to monitor
ev.events = EPOLLIN | EPOLLET;
then when the scenario above reaches step 5, epoll_wait(2) hangs: we are no longer writing to the rfd pipe, its state does not change, so no further event is triggered, yet 1 kB of leftover data is still sitting unread in the buffer and we need some way to deal with it. Note that this example is special because epoll is monitoring only one descriptor, rfd; if rfd's buffer state never changes, epoll_wait(2) blocks forever. In a typical network server epoll monitors thousands of fds, and any one of them firing wakes epoll_wait(2), so a permanently blocked epoll_wait is unlikely in a normal design.
** "remote"? The example was a pipe a moment ago and suddenly there is a remote peer waiting for a reply; let us just follow the man page and think of the pipe as a network connection. Suppose the protocol works like this:
1. The remote client sends the server a 2 kB request. The server monitors the socket in ET mode; it receives the 2 kB request, but a single recv takes only 1 kB out of the buffer, leaving 1 kB behind. Instead of reading again, the server ends up stuck at epoll_wait(2) in step 5 of the scenario above.
2. The server has processed 1 kB, but its protocol says it must process the full 2 kB before it replies "I got your data, send me more". Now the problem appears: the client has already sent its 2 kB and waits forever for the server's reply; the server received 2 kB but, because of the programming bug, processed only 1 kB and is blocked at step 5 waiting for the remaining 1 kB. Each side waits for the other to send, and both are stuck.)
The reason for this is that edge-triggered mode only delivers events when changes occur on the monitored file descriptor.
(Notes: the edge-triggered principle: an event is only activated, and work only gets driven, when the descriptor's state changes.)
So, in step 5 the caller might end up waiting for some data that is already present inside the input buffer. In the above example, an event on rfd will be generated because of the write done in 2 and the event is consumed in 3. Since the read operation done in 4 does not consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block indefinitely.
(Notes: by step 5 the leftover data may already be in the buffer, but the single event it produced has been used up, so nothing will go and read it unless new data arrives to trigger a fresh event. In the network version of the example the client is itself waiting, so no new data can arrive, no event fires, and everything blocks. In step 2 of the scenario the write changed the buffer state and generated one event, but the handling in step 4 did not consume the whole 2 kB: you spent the event while doing only half the work, so by step 5, under the protocol assumed above, you block; the only event is gone and nothing new drives the program onward. Not every case blocks, of course: if the protocol keeps sending data, new events keep arriving and the buffer keeps getting read. Even then the final chunk may leave a residue (too much data to finish in one event) or may be received completely (the last piece happened to be small enough to handle in one event).)
An application that employs the EPOLLET flag should use nonblocking file descriptors to avoid having a blocking read or write starve a task that is handling multiple file descriptors.
(Notes: if a file descriptor is going to be monitored in edge-triggered ET mode, set it to non-blocking, as in:
setnonblocking(conn_sock);
ev.events = EPOLLIN | EPOLLET;
This prevents a blocking read or write from stalling the whole task. It is the same as the Qt GUI main thread's event loop being blocked: one event handler takes too long and the program appears frozen, because the event mechanism is a single-threaded loop and no single event may hog it, let alone in a network server handling thousands of connections. This blocking is different from the earlier example: there, 1 kB of data could not be retrieved and caused an indefinite wait, but that was a logical wait at the protocol level; its read was non-blocking, the event handler read once and returned, yielding to other events, and only the logical mistake left it with no further events to drive it, so the event loop itself was unaffected. In blocking mode, by contrast, a read with no data simply does not return, the event handler does not return, the entire single-threaded event loop stalls, and events on all other fds can no longer be processed.)
The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:
i with nonblocking file descriptors; and
ii by waiting for an event only after read(2) or write(2)
return EAGAIN.
epoll以ET模式监控应该遵循以下:(一个epoll可以同时针对不同的fd分别采用ET和LT两种不同模式监听)
i 文件描述符是非阻塞的(如上讨论ET模式下事件只触发一次,缓冲区残留,解决办法是循环读直到实际读出数据比要求读的数据小,就表明缓冲空了,但特殊情况是刚好读出数据实际与要求的相同,而缓冲正好空了,你再继续循环读,为空,如果是阻塞模式,就一直阻塞了,所以宁愿用非阻塞模式读到出错停止也比阻塞好)
ii 继续读/写直到read/write返回了EAGAIN标志,(这里没有讨论写,自行网络去)不然接收情况下缓冲会残留数据。(EAGAIN表示缓冲空,无数据,不是返回0,read时返回0表示网络连接正常断开了)
EAGAIN在read返回中表示无数据可读,不是错误。again是又的意思,它提醒你可以又读一次,这次读为空但下一次读就不一定了,可能数据刚好第二次时就发送过来了。当然当read为阻塞状态时,无数据就阻塞,都阻塞了又怎么可能返回EAGAIN呢?事实是它有时的确能返回,因为一些信号的缘故,阻塞被唤醒,但依然是无数据读,所以依然返回EAGAIN,你可以判断后决定继续read阻塞还是做其他的。
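The setnonblocking(conn_sock) helper that appears above and in the usage example below is not defined anywhere in the man page; here is a hedged sketch of one common way to write it with fcntl (the function name itself is just the convention used on this page):
#include <fcntl.h>

// put a descriptor into non-blocking mode; returns -1 on failure
static int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}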
By contrast, when used as a level-triggered interface (the default,
when EPOLLET is not specified), epoll is simply a faster poll(2), and
can be used wherever the latter is used since it shares the same seman‐
tics.
Since even with edge-triggered epoll, multiple events can be generated
upon receipt of multiple chunks of data, the caller has the option to
specify the EPOLLONESHOT flag, to tell epoll to disable the associated
file descriptor after the receipt of an event with epoll_wait(2). When
the EPOLLONESHOT flag is specified, it is the caller's responsibility
to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.
/proc interfaces
The following interfaces can be used to limit the amount of kernel mem‐
ory consumed by epoll:
/proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)
This specifies a limit on the total number of file descriptors
that a user can register across all epoll instances on the sys‐
tem. The limit is per real user ID. Each registered file
descriptor costs roughly 90 bytes on a 32-bit kernel, and
roughly 160 bytes on a 64-bit kernel. Currently, the default
value for max_user_watches is 1/25 (4%) of the available low
memory, divided by the registration cost in bytes.
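As a quick way to see the limit on your own system, here is a small, hedged C sketch that simply prints the current value (running cat /proc/sys/fs/epoll/max_user_watches in a shell does the same thing):
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    if (f == NULL) {
        perror("fopen");          /* this file exists only since Linux 2.6.28 */
        return 1;
    }
    long max_watches;
    if (fscanf(f, "%ld", &max_watches) == 1)
        printf("max_user_watches = %ld\n", max_watches);
    fclose(f);
    return 0;
}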
Example for Suggested Usage
While the usage of epoll when employed as a level-triggered interface
does have the same semantics as poll(2), the edge-triggered usage
requires more clarification to avoid stalls in the application event
loop. In this example, listener is a nonblocking socket on which lis‐
ten(2) has been called. The function do_use_fd() uses the new ready
file descriptor until EAGAIN is returned by either read(2) or write(2).
An event-driven state machine application should, after having received
EAGAIN, record its current state so that at the next call to
do_use_fd() it will continue to read(2) or write(2) from where it
stopped before.
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd, n;
struct sockaddr_storage local;      /* peer address filled in by accept() */
socklen_t addrlen = sizeof(local);
/* Set up listening socket, 'listen_sock' (socket(),
bind(), listen()) */
epollfd = epoll_create(10);
if (epollfd == -1) {
perror("epoll_create");
exit(EXIT_FAILURE);
}
ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
perror("epoll_ctl: listen_sock");
exit(EXIT_FAILURE);
}
for (;;) {
nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
if (nfds == -1) {
perror("epoll_pwait");
exit(EXIT_FAILURE);
}
for (n = 0; n < nfds; ++n) {
if (events[n].data.fd == listen_sock) {
conn_sock = accept(listen_sock,
(struct sockaddr *) &local, &addrlen);
if (conn_sock == -1) {
perror("accept");
exit(EXIT_FAILURE);
}
setnonblocking(conn_sock);
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = conn_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
&ev) == -1) {
perror("epoll_ctl: conn_sock");
exit(EXIT_FAILURE);
}
} else {
do_use_fd(events[n].data.fd);
}
}
}
When used as an edge-triggered interface, for performance reasons, it
is possible to add the file descriptor inside the epoll interface
(EPOLL_CTL_ADD) once by specifying (EPOLLIN|EPOLLOUT). This allows you
to avoid continuously switching between EPOLLIN and EPOLLOUT calling
epoll_ctl(2) with EPOLL_CTL_MOD.
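A hedged sketch of that one-time registration; epollfd and conn_sock are assumed to be set up as in the example above:
/* requires <sys/epoll.h>, <stdio.h> and <stdlib.h> */
struct epoll_event ev;

// register once for both readability and writability in edge-triggered mode,
// instead of toggling between EPOLLIN and EPOLLOUT later with EPOLL_CTL_MOD
ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
ev.data.fd = conn_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, &ev) == -1) {
    perror("epoll_ctl: conn_sock");
    exit(EXIT_FAILURE);
}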
Questions and Answers
Q0 What is the key used to distinguish the file descriptors registered
in an epoll set?
A0 The key is the combination of the file descriptor number and the
open file description (also known as an "open file handle", the
kernel's internal representation of an open file).
Q1 What happens if you register the same file descriptor on an epoll
instance twice?
A1 You will probably get EEXIST. However, it is possible to add a
duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) descriptor to the
same epoll instance. This can be a useful technique for filtering
events, if the duplicate file descriptors are registered with dif‐
ferent events masks.
Q2 Can two epoll instances wait for the same file descriptor? If so,
are events reported to both epoll file descriptors?
A2 Yes, and events would be reported to both. However, careful pro‐
gramming may be needed to do this correctly.
Q3 Is the epoll file descriptor itself poll/epoll/selectable?
A3 Yes. If an epoll file descriptor has events waiting then it will
indicate as being readable.
Q4 What happens if one attempts to put an epoll file descriptor into
its own file descriptor set?
A4 The epoll_ctl(2) call will fail (EINVAL). However, you can add an
epoll file descriptor inside another epoll file descriptor set.
Q5 Can I send an epoll file descriptor over a UNIX domain socket to
another process?
A5 Yes, but it does not make sense to do this, since the receiving
process would not have copies of the file descriptors in the epoll
set.
Q6 Will closing a file descriptor cause it to be removed from all
epoll sets automatically?
A6 Yes, but be aware of the following point. A file descriptor is a
reference to an open file description (see open(2)). Whenever a
descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or
fork(2), a new file descriptor referring to the same open file
description is created. An open file description continues to
exist until all file descriptors referring to it have been closed.
A file descriptor is removed from an epoll set only after all the
file descriptors referring to the underlying open file description
have been closed (or before if the descriptor is explicitly removed
using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a
file descriptor that is part of an epoll set has been closed,
events may be reported for that file descriptor if other file
descriptors referring to the same underlying file description
remain open.
Q7 If more than one event occurs between epoll_wait(2) calls, are they
combined or reported separately?
A7 They will be combined.
Q8 Does an operation on a file descriptor affect the already collected
but not yet reported events?
A8 You can do two operations on an existing file descriptor. Remove
would be meaningless for this case. Modify will reread available
I/O.
Q9 Do I need to continuously read/write a file descriptor until EAGAIN
when using the EPOLLET flag (edge-triggered behavior) ?
A9 Receiving an event from epoll_wait(2) should suggest to you that
such file descriptor is ready for the requested I/O operation. You
must consider it ready until the next (nonblocking) read/write
yields EAGAIN. When and how you will use the file descriptor is
entirely up to you.
For packet/token-oriented files (e.g., datagram socket, terminal in
canonical mode), the only way to detect the end of the read/write
I/O space is to continue to read/write until EAGAIN.
For stream-oriented files (e.g., pipe, FIFO, stream socket), the
condition that the read/write I/O space is exhausted can also be
detected by checking the amount of data read from / written to the
target file descriptor. For example, if you call read(2) by asking
to read a certain amount of data and read(2) returns a lower number
of bytes, you can be sure of having exhausted the read I/O space
for the file descriptor. The same is true when writing using
write(2). (Avoid this latter technique if you cannot guarantee
that the monitored file descriptor always refers to a stream-ori‐
ented file.)
Possible Pitfalls and Ways to Avoid Them
o Starvation (edge-triggered)
If there is a large amount of I/O space, it is possible that by trying
to drain it the other files will not get processed causing starvation.
(This problem is not specific to epoll.)
The solution is to maintain a ready list and mark the file descriptor
as ready in its associated data structure, thereby allowing the appli‐
cation to remember which files need to be processed but still round
robin amongst all the ready files. This also supports ignoring subse‐
quent events you receive for file descriptors that are already ready.
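The "ready list" idea above can be sketched roughly as follows; the conn structure, array size and per-turn read budget are made-up details just to show the round-robin shape:
#include <errno.h>
#include <unistd.h>

#define MAX_CONNS   1024
#define READ_BUDGET 4096              /* bytes to read per connection per turn */

struct conn {
    int fd;
    int ready;                        /* set when epoll reports the fd, cleared on EAGAIN */
};

static struct conn conns[MAX_CONNS];

/* after epoll_wait(): just mark the descriptor ready instead of draining it at once;
   a further event for an fd that is already marked ready can simply be ignored */
static void mark_ready(struct conn *c) { c->ready = 1; }

/* one round-robin pass: every ready connection gets a bounded slice of work */
static void service_ready(void)
{
    char buf[READ_BUDGET];
    for (int i = 0; i < MAX_CONNS; i++) {
        if (!conns[i].ready)
            continue;
        ssize_t n = read(conns[i].fd, buf, sizeof(buf));
        if (n > 0) {
            /* hand the n bytes to the application; leave ready set so this fd
               gets another turn on the next pass */
        } else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            conns[i].ready = 0;       /* drained; wait for the next edge */
        } else {
            conns[i].ready = 0;       /* EOF or error; close/clean up elsewhere */
        }
    }
}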
o If using an event cache...
If you use an event cache or store all the file descriptors returned
from epoll_wait(2), then make sure to provide a way to mark its closure
dynamically (i.e., caused by a previous event's processing). Suppose
you receive 100 events from epoll_wait(2), and in event #47 a condition
causes event #13 to be closed. If you remove the structure and
close(2) the file descriptor for event #13, then your event cache might
still say there are events waiting for that file descriptor causing
confusion.
One solution for this is to call, during the processing of event 47,
epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2),
then mark its associated data structure as removed and link it to a
cleanup list. If you find another event for file descriptor 13 in your
batch processing, you will discover the file descriptor had been previ‐
ously removed and there will be no confusion.
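A hedged sketch of the deferred-cleanup approach just described, assuming each epoll event's data pointer refers to a per-descriptor structure; the fdinfo struct and helper names are invented for illustration:
#include <unistd.h>
#include <sys/epoll.h>

struct fdinfo {
    int fd;
    int removed;                      /* set once the fd has been deleted and closed */
};

/* called while processing the event that decides this descriptor must go away */
static void remove_fd(int epollfd, struct fdinfo *info)
{
    epoll_ctl(epollfd, EPOLL_CTL_DEL, info->fd, NULL);   /* drop it from the epoll set */
    close(info->fd);
    info->removed = 1;                /* later events in the same batch will see this flag */
}

/* called for every event in the batch returned by epoll_wait() */
static void handle_event(struct fdinfo *info)
{
    if (info->removed)
        return;                       /* stale event for a descriptor closed earlier in the batch */
    /* ... normal processing ... */
}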
VERSIONS
The epoll API was introduced in Linux kernel 2.5.44. Support was added
to glibc in version 2.3.2.
CONFORMING TO
The epoll API is Linux-specific. Some other systems provide similar
mechanisms, for example, FreeBSD has kqueue, and Solaris has /dev/poll.
SEE ALSO
epoll_create(2), epoll_create1(2), epoll_ctl(2), epoll_wait(2)
COLOPHON
This page is part of release 3.35 of the Linux man-pages project. A
description of the project, and information about reporting bugs, can
be found at http://man7.org/linux/man-pages/.