I/O Multiplexing with epoll

I have been learning socket programming recently and found that the epoll model is a pretty nice thing, so I studied it and wrote up what I learned below.

Introduction

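epoll is an enhanced version of the select/poll I/O multiplexing interfaces on Linux. It significantly improves CPU efficiency for programs that hold a large number of concurrent connections of which only a few are active. One reason is that the set of monitored file descriptors is registered with the kernel once and results are returned separately, so the developer does not have to rebuild the set before every wait. The other is that, when fetching events, it does not need to traverse the whole monitored set; it only walks the descriptors that kernel I/O events have asynchronously placed on the ready queue. Besides the level-triggered (Level Triggered) notification that select/poll provide, epoll also offers edge-triggered (Edge Triggered) notification, which lets a user-space program cache I/O readiness state, make fewer epoll_wait/epoll_pwait calls, and run more efficiently.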
 
Below is an abridged excerpt of the output of man epoll.

NAME
       epoll - I/O event notification facility

SYNOPSIS
       #include <sys/epoll.h>

DESCRIPTION
       epoll  is  a variant of poll(2) that can be used either as Edge or Level Triggered
       interface and scales well to large numbers of watched fds. Three system calls  are
       provided  to  set  up  and  control  an  epoll set: epoll_create(2), epoll_ctl(2),
       epoll_wait(2).

       An epoll set is connected to a file descriptor created by epoll_create(2).  Inter-
       est  for  certain  file descriptors is then registered via epoll_ctl(2).  Finally,
       the actual wait is started by epoll_wait(2).

NOTES
       The epoll event distribution interface is able to behave both as Edge Triggered  (
       ET  ) and Level Triggered ( LT ). The difference between ET and LT event distribu-
       tion mechanism can be described as follows. Suppose that this scenario happens :

       1      The file descriptor that represents the read side of a  pipe  (  RFD  )  is
              added inside the epoll device.

       2      Pipe writer writes 2Kb of data on the write side of the pipe.

       3      A call to epoll_wait(2) is done that will return RFD as ready file descrip-
              tor.

       4      The pipe reader reads 1Kb of data from RFD.

       5      A call to epoll_wait(2) is done.

       If the RFD file descriptor has been added to the epoll interface using the EPOLLET
       flag,  the  call to epoll_wait(2) done in step 5 will probably hang because of the
       available data still present in the file input buffers and the remote  peer  might
       be  expecting a response based on the data it already sent. The reason for this is
       that Edge Triggered event distribution delivers events only when events happens on
       the  monitored  file.  So, in step 5 the caller might end up waiting for some data
       that is already present inside the input buffer. In the above example, an event on
       RFD  will be generated because of the write done in 2 and the event is consumed in
       3.  Since the read operation done in 4 does not consume the whole buffer data, the
       call to epoll_wait(2) done in step 5 might lock indefinitely. The epoll interface,
       when used with the EPOLLET flag ( Edge Triggered ) should  use  non-blocking  file
       descriptors  to avoid having a blocking read or write starve the task that is han-
       dling multiple file descriptors.  The suggested way to use epoll as an Edge  Trig-
       gered (EPOLLET) interface is below, and possible pitfalls to avoid follow.

              i      with non-blocking file descriptors

              ii     by  going to wait for an event only after read(2) or write(2) return
                     EAGAIN

       On the contrary, when used as a Level Triggered interface, epoll is by all means a
       faster  poll(2),  and  can be used wherever the latter is used since it shares the
       same semantics. Since even with the Edge Triggered epoll multiple  events  can  be
       generated  up  on receipt of multiple chunks of data, the caller has the option to
       specify the EPOLLONESHOT flag, to  tell  epoll  to  disable  the  associated  file
       descriptor  after  the  receipt  of  an  event with epoll_wait(2).  When the EPOL-
       LONESHOT flag is specified, it is caller responsibility to rearm the file descrip-
       tor using epoll_ctl(2) with EPOLL_CTL_MOD.

EXAMPLE FOR SUGGESTED USAGE
       While  the usage of epoll when employed like a Level Triggered interface does have
       the same semantics of poll(2), an Edge Triggered usage requires more clarification
       to avoid stalls in the application event loop. In this example, listener is a non-
       blocking socket on which listen(2) has been called. The function do_use_fd()  uses
       the  new  ready  file  descriptor  until  EAGAIN  is returned by either read(2) or
       write(2).  An event driven state machine application should, after having received
       EAGAIN,  record  its current state so that at the next call to do_use_fd() it will
       continue to read(2) or write(2) from where it stopped before.

       struct epoll_event ev, *events;

       for(;;) {
           nfds = epoll_wait(kdpfd, events, maxevents, -1);

           for(n = 0; n < nfds; ++n) {
               if(events[n].data.fd == listener) {
                   client = accept(listener, (struct sockaddr *) &local,
                                   &addrlen);
                   if(client < 0){
                       perror("accept");
                       continue;
                   }
                   setnonblocking(client);
                   ev.events = EPOLLIN | EPOLLET;
                   ev.data.fd = client;
                   if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
                       fprintf(stderr, "epoll set insertion error: fd=%d\n",
                               client);
                       return -1;
                   }
               }
               else
                   do_use_fd(events[n].data.fd);
           }
       }

       When used as an Edge triggered interface, for performance reasons, it is  possible
       to  add  the  file descriptor inside the epoll interface ( EPOLL_CTL_ADD ) once by
       specifying ( EPOLLIN|EPOLLOUT ). This allows you to avoid  continuously  switching
       between EPOLLIN and EPOLLOUT calling epoll_ctl(2) with EPOLL_CTL_MOD
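
The man page example calls setnonblocking() and do_use_fd() without defining them. Below is a minimal sketch of what they might look like, assuming the only job is to drain a readable descriptor. The names set_nonblocking and drain_fd, and the closed out-parameter, are illustrative and not part of the epoll API.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to fd's file status flags.
   Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
        int flags = fcntl(fd, F_GETFL, 0);

        if (flags == -1)
                return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical stand-in for do_use_fd(): read from a non-blocking fd until
   read() reports EAGAIN, which is the point where an edge-triggered caller
   can safely go back to epoll_wait().
   Returns the number of bytes consumed, or -1 on a real error.
   *closed is set to 1 if the stream ended (read() returned 0). */
long drain_fd(int fd, int *closed)
{
        char buf[1024];
        long total = 0;

        *closed = 0;
        for (;;) {
                ssize_t n = read(fd, buf, sizeof buf);

                if (n > 0) {
                        total += n;      /* a real program would process buf here */
                } else if (n == 0) {
                        *closed = 1;     /* end of stream */
                        return total;
                } else if (errno == EINTR) {
                        continue;        /* interrupted by a signal: retry */
                } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                        return total;    /* drained: nothing left for now */
                } else {
                        return -1;
                }
        }
}
```

In an event loop, drain_fd() would be called once per EPOLLIN event on an EPOLLET descriptor; stopping after a single read() is what leads to the stall described in the NOTES section.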

Advantages

Supports a large number of open socket descriptors per process

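The most intolerable thing about select is that the number of FDs a process can monitor is limited by FD_SETSIZE, which defaults to 1024. That is clearly too few for an IM server that needs to support tens of thousands of connections. One option is to change this macro and recompile the kernel, although sources also note that this reduces network efficiency. Another is a multi-process design (the traditional Apache approach); creating a process on Linux is relatively cheap but still not negligible, and synchronizing data between processes is far less efficient than between threads, so this is not a perfect solution either. epoll has no such limit. The upper bound on FDs it supports is the maximum number of files that can be opened, which is generally far greater than 2048; for example, it is roughly 100,000 on a machine with 1 GB of memory. The exact figure can be seen with cat /proc/sys/fs/file-max, and in general it depends heavily on system memory. A single process is additionally bounded by its RLIMIT_NOFILE limit, which is why the example server below calls setrlimit first.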
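
To make the two limits concrete, here is a small sketch that reports the per-process descriptor limit next to the compile-time ceiling of select. The helper names get_fd_limits and select_fd_ceiling are made up for this illustration.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: report the calling process's soft and hard limits on
   open file descriptors. Returns 0 on success, -1 on error. */
int get_fd_limits(rlim_t *soft, rlim_t *hard)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
                return -1;
        *soft = rl.rlim_cur;
        *hard = rl.rlim_max;
        return 0;
}

/* Hypothetical helper: the number of descriptors an fd_set can represent.
   select() cannot watch a descriptor numbered at or above this value. */
int select_fd_ceiling(void)
{
        return FD_SETSIZE;
}
```

Printing the values (cast to unsigned long long for printf) shows how far the soft limit can be raised toward the hard limit with setrlimit, while the select ceiling stays fixed at compile time.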

I/O efficiency does not degrade linearly as the number of FDs grows

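Another fatal weakness of traditional select/poll shows up when you hold a very large set of sockets of which, because of network latency, only some are "active" at any moment. Every select/poll call scans the entire set linearly, so efficiency drops linearly with its size. epoll does not have this problem: it only operates on the "active" sockets, because in the kernel implementation epoll is built on a callback attached to each fd. Only "active" sockets invoke their callback; idle sockets do not. In this respect epoll implements a "pseudo" AIO, because the driving force is inside the OS kernel. In some benchmarks, when essentially all sockets are active (for example in a high-speed LAN environment), epoll is no more efficient than select/poll, and if epoll_ctl is called too often it is even slightly slower. But once idle connections are used to simulate a WAN environment, epoll is far more efficient than select/poll.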
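
The effect can be observed with a small experiment: register many idle descriptors, make one of them readable, and look at what epoll_wait hands back. This is only a sketch; the function name ready_among_idle and the use of pipes instead of sockets are choices made for the illustration.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read ends of npipes pipes, makes exactly one of them
   readable, and returns how many events epoll_wait() reports.
   Returns -1 on bad arguments, on a setup error, or if the single reported
   fd is not the active one. */
int ready_among_idle(int npipes, int active)
{
        int (*p)[2];
        struct epoll_event ev, *out;
        int epfd, made = 0, n = -1, i;

        if (npipes <= 0 || active < 0 || active >= npipes)
                return -1;
        p = calloc((size_t)npipes, sizeof *p);
        out = calloc((size_t)npipes, sizeof *out);
        epfd = epoll_create(1);
        if (!p || !out || epfd == -1)
                goto done;
        for (i = 0; i < npipes; i++) {
                if (pipe(p[i]) == -1)
                        goto done;
                made++;
                ev.events = EPOLLIN;          /* level-triggered read interest */
                ev.data.fd = p[i][0];
                if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[i][0], &ev) == -1)
                        goto done;
        }
        if (write(p[active][1], "x", 1) != 1) /* only this pipe becomes ready */
                goto done;
        n = epoll_wait(epfd, out, npipes, 0); /* timeout 0: do not block */
        if (n == 1 && out[0].data.fd != p[active][0])
                n = -1;
done:
        for (i = 0; i < made; i++) {
                close(p[i][0]);
                close(p[i][1]);
        }
        if (epfd != -1)
                close(epfd);
        free(p);
        free(out);
        return n;
}
```

With 64 pipes and one writer the caller receives a single event, so the loop over results touches one entry; a select-based loop would have to test all 64 descriptors after each call.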

Memory sharing

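This point concerns how epoll is implemented. Whether you use select, poll, or epoll, the kernel has to pass FD events to user space, so avoiding unnecessary memory copies matters. It is often said that epoll achieves this by having the kernel and user space mmap the same block of memory, but the mainline Linux implementation does not do that: epoll_wait copies the ready events into the events array supplied by the caller. The saving comes from keeping the monitored set inside the kernel, so that only the ready events cross the boundary instead of the whole set being copied in on every call as with select/poll.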

Kernel tuning

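Kernel tuning is arguably an advantage of the Linux platform as a whole. You may have doubts about Linux, but you cannot deny that it gives you the ability to fine-tune the kernel. For example, the kernel TCP/IP stack manages sk_buff structures with a memory pool, and the size of that pool (skb_head_pool) can be adjusted at run time with echo XXXX>/proc/sys/net/core/hot_list_length. Likewise, the second argument of listen (the length of the queue of connections that have completed the TCP three-way handshake) can be adjusted according to the memory available on your platform.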
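
Tunables exposed under /proc/sys can also be read from a program rather than with cat. The sketch below assumes the file holds a single integer; read_tunable is a made-up helper name.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a /proc/sys file.
   Returns the value, or -1 if the file cannot be opened or parsed. */
long long read_tunable(const char *path)
{
        FILE *f = fopen(path, "r");
        long long value;

        if (f == NULL)
                return -1;
        if (fscanf(f, "%lld", &value) != 1)
                value = -1;
        fclose(f);
        return value;
}
```

For example, read_tunable("/proc/sys/fs/file-max") returns the system-wide open file limit mentioned earlier, and read_tunable("/proc/sys/net/core/somaxconn") returns the cap that the kernel applies to the backlog passed to listen.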

Working modes

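LT (level-triggered) is the default mode and supports both blocking and non-blocking sockets. In this mode the kernel tells you whether a file descriptor is ready, and you can then perform I/O on that ready fd. If you do nothing, the kernel keeps notifying you, so this mode is somewhat less error-prone to program against. Traditional select/poll are representatives of this model.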

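ET (edge-triggered) is the high-speed mode and supports only non-blocking sockets. In this mode the kernel notifies you through epoll when a descriptor changes from not ready to ready. It then assumes you know the descriptor is ready and sends no further readiness notifications for it until you do something that makes it not ready again (for example, a send, a receive, or an accept, or a transfer of less than the requested amount, ends in an EWOULDBLOCK error). Note that if you never perform I/O on the fd (so that it never becomes not ready again), the kernel sends no more notifications (only once). Whether ET mode actually speeds things up for TCP still needs more benchmarks to confirm.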

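This is where ET and LT differ. LT events are never dropped: as long as the read buffer holds data for the user to read, you keep being notified. ET notifies only at the moment the event occurs. Put simply, LT is level-triggered and ET is edge-triggered: LT fires as long as an event remains unhandled, while ET fires only on a transition between the high and low levels (that is, when the state changes from 1 to 0 or from 0 to 1).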
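
The difference can be reproduced with the pipe scenario from the man page: write 2 KB, wait, read only 1 KB, then wait again. The sketch below replays those five steps; the function name ready_after_partial_read is made up, and a timeout of 0 is used so that the second wait reports "nothing ready" instead of hanging.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Replays steps 1-5 of the man page scenario on a pipe.
   extra_flags is 0 for level-triggered or EPOLLET for edge-triggered.
   Returns what the second epoll_wait() reports, or -1 if any earlier
   step fails. */
int ready_after_partial_read(uint32_t extra_flags)
{
        int pipefd[2], epfd, n = -1;
        char buf[2048] = {0};
        struct epoll_event ev, out;

        if (pipe(pipefd) == -1)
                return -1;
        epfd = epoll_create(1);
        if (epfd == -1) {
                close(pipefd[0]);
                close(pipefd[1]);
                return -1;
        }

        ev.events = EPOLLIN | extra_flags;
        ev.data.fd = pipefd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0  /* step 1 */
            && write(pipefd[1], buf, 2048) == 2048               /* step 2 */
            && epoll_wait(epfd, &out, 1, 0) == 1                 /* step 3 */
            && read(pipefd[0], buf, 1024) == 1024)               /* step 4 */
                n = epoll_wait(epfd, &out, 1, 0);                /* step 5 */

        close(epfd);
        close(pipefd[0]);
        close(pipefd[1]);
        return n;
}
```

Called with 0 the function returns 1, because 1 KB is still unread and LT keeps reporting it. Called with EPOLLET it returns 0, because no new write has happened since the event consumed in step 3.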

System calls

 epoll_create

NAME
       epoll_create - open an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_create(int size)

DESCRIPTION
       Open an epoll file descriptor by requesting the kernel allocate an event
       backing store dimensioned for size descriptors. The size is not the maximum
       size of the backing store but just a hint to the kernel about how to dimension
       internal structures. The returned file descriptor will be used for all the
       subsequent calls to the epoll interface. The file descriptor returned by
       epoll_create(2) must be closed by using close(2).

RETURN VALUE
       When successful, epoll_create(2) returns a non-negative integer identifying the
       descriptor.   When an error occurs, epoll_create(2) returns -1 and errno is set
       appropriately.

ERRORS
       EINVAL size is not positive.

       ENFILE The system limit on the total  number  of  open  files  has  been
              reached.

       ENOMEM There was insufficient memory to create the kernel object.
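
A quick way to see the return values and the EINVAL case is to wrap the call as below. This is a sketch; try_epoll_create is a made-up name.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance with the given size hint and
   close it again. Returns 0 on success or the errno value on failure. */
int try_epoll_create(int size)
{
        int epfd = epoll_create(size);

        if (epfd == -1)
                return errno;
        close(epfd);   /* the descriptor must be released with close(2) */
        return 0;
}
```

try_epoll_create(16) returns 0, while a size of 0 or a negative size returns EINVAL, matching the ERRORS list above.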

epoll_wait

NAME
       epoll_wait - wait for an I/O event on an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event * events,
                      int maxevents, int timeout);

DESCRIPTION
       Wait for events on the epoll file descriptor epfd for a maximum time of timeout
       milliseconds. The memory area pointed to by events will contain the events that
       will be available for the caller. Up to maxevents are returned by
       epoll_wait(2). The maxevents parameter must be greater than zero. Specifying a
       timeout of -1 makes epoll_wait(2) wait indefinitely, while specifying a timeout
       equal to zero makes epoll_wait(2) return immediately even if no events are
       available (return code equal to zero). The struct epoll_event is defined as:

           typedef union epoll_data {
               void *ptr;
               int fd;
               __uint32_t u32;
               __uint64_t u64;
           } epoll_data_t;

           struct epoll_event {
               __uint32_t events;      /* Epoll events */
               epoll_data_t data;      /* User data variable */
           };

       The  data  of  each  returned structure will contain the same data the user set
       with a epoll_ctl(2) (EPOLL_CTL_ADD,EPOLL_CTL_MOD) while the events member  will
       contain the returned event bit field.

RETURN VALUE
       When successful, epoll_wait(2) returns the number of file descriptors ready for
       the requested I/O, or zero if no file descriptor became ready during the
       requested timeout milliseconds. When an error occurs, epoll_wait(2) returns -1
       and errno is set appropriately.

ERRORS
       EBADF  epfd is not a valid file descriptor.

       EFAULT The memory area pointed to by events is not accessible with  write  per-
              missions.

       EINTR  The call was interrupted by a signal handler before any of the requested
              events occurred or the timeout expired.

       EINVAL epfd is not an epoll file descriptor, or maxevents is less than or equal
              to zero.
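
Of the errors listed, EINTR is the one that is usually not fatal. A common pattern, sketched here under the made-up name epoll_wait_retry, is to repeat the call when a signal handler interrupted it.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical wrapper: call epoll_wait() again if a signal handler
   interrupted it. Any other error is returned to the caller with errno set. */
int epoll_wait_retry(int epfd, struct epoll_event *events,
                     int maxevents, int timeout)
{
        int n;

        do {
                n = epoll_wait(epfd, events, maxevents, timeout);
        } while (n == -1 && errno == EINTR);
        return n;
}
```

Note that each retry starts the timeout from the beginning, so with a positive timeout the total wait can exceed the requested number of milliseconds. Code that needs a strict deadline would have to recompute the remaining time before retrying.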

epoll_ctl

NAME
       epoll_ctl - control interface for an epoll descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)

DESCRIPTION
       Control  an  epoll  descriptor,  epfd, by requesting that the operation op be per-
       formed on the target file descriptor, fd.  The event describes the  object  linked
       to the file descriptor fd.  The struct epoll_event is defined as :          

           typedef union epoll_data {
               void *ptr;
               int fd;
               __uint32_t u32;
               __uint64_t u64;
           } epoll_data_t;

           struct epoll_event {
               __uint32_t events;      /* Epoll events */
               epoll_data_t data;      /* User data variable */
           };

       The  events member is a bit set composed using the following available event types
       :

       EPOLLIN
              The associated file is available for read(2) operations.

       EPOLLOUT
              The associated file is available for write(2) operations.

       EPOLLRDHUP
              Stream socket peer closed connection, or shut down writing half of  connec-
              tion.   (This  flag  is especially useful for writing simple code to detect
              peer shutdown when using Edge Triggered monitoring.)

       EPOLLPRI
              There is urgent data available for read(2) operations.

       EPOLLERR
              Error condition happened on the associated file descriptor.   epoll_wait(2)
              will always wait for this event; it is not necessary to set it in events.

       EPOLLHUP
              Hang  up  happened  on  the associated file descriptor.  epoll_wait(2) will
              always wait for this event; it is not necessary to set it in events.

       EPOLLET
              Sets the Edge Triggered behaviour for the associated file descriptor.   The
              default  behaviour  for  epoll  is  Level  Triggered. See epoll(7) for more
              detailed information about Edge  and  Level  Triggered  event  distribution
              architectures.

       EPOLLONESHOT (since kernel 2.6.2)
              Sets the one-shot behaviour for the associated file descriptor.  This means
              that after an event is pulled out with epoll_wait(2)  the  associated  file
              descriptor  is  internally disabled and no other events will be reported by
              the epoll interface. The user must call epoll_ctl(2) with EPOLL_CTL_MOD  to
              re-enable the file descriptor with a new event mask.

       The  epoll  interface  supports  all file descriptors that support poll(2).  Valid
       values for the op parameter are :

              EPOLL_CTL_ADD
                     Add the target file descriptor fd to the epoll descriptor  epfd  and
                     associate the event event with the internal file linked to fd.

              EPOLL_CTL_MOD
                     Change  the  event  event associated with the target file descriptor
                     fd.

              EPOLL_CTL_DEL
                     Remove the target file descriptor fd from the epoll file descriptor,
                     epfd.  The event is ignored and can be NULL (but see BUGS below).

RETURN VALUE
       When  successful,  epoll_ctl(2)  returns  zero. When an error occurs, epoll_ctl(2)
       returns -1 and errno is set appropriately.

ERRORS
       EBADF  epfd or fd is not a valid file descriptor.

       EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor  fd  is  already  in
              epfd.

       EINVAL epfd  is  not  an  epoll file descriptor, or fd is the same as epfd, or the
              requested operation op is not supported by this interface.

       ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not in epfd.

       ENOMEM There was insufficient memory to handle the requested op control operation.

       EPERM  The target file fd does not support epoll.
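
The three operations and several of the errors above can be explored with a thin wrapper that returns the errno value instead of -1. This is a sketch; ctl_result is a made-up name.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical wrapper around epoll_ctl(): returns 0 on success or the errno
   value on failure, so each outcome in the ERRORS list can be inspected. */
int ctl_result(int epfd, int op, int fd, uint32_t events)
{
        struct epoll_event ev;

        ev.events = events;
        ev.data.fd = fd;
        return epoll_ctl(epfd, op, fd, &ev) == -1 ? errno : 0;
}
```

For a descriptor fd that is not yet in the set, EPOLL_CTL_ADD returns 0 and a second EPOLL_CTL_ADD returns EEXIST. EPOLL_CTL_MOD with EPOLLIN|EPOLLOUT then returns 0, EPOLL_CTL_DEL returns 0, and a second EPOLL_CTL_DEL returns ENOENT. Passing the epoll descriptor itself as fd returns EINVAL.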

 

Example

 server.c

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <strings.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/time.h>
#include <sys/resource.h>


#define MAXBUF 1024  
#define MAXEPOLLSIZE 10000  

#define MYPORT 5000
#define LISTENQ 10
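/*
   setnonblocking - put a descriptor into non-blocking mode
 */
int setnonblocking(int sockfd)
{
        /* O_NONBLOCK is a file status flag, so read the current flags with
           F_GETFL (F_GETFD returns the descriptor flags instead) */
        int flags = fcntl(sockfd, F_GETFL, 0);
        if (flags == -1 || fcntl(sockfd, F_SETFL, flags | O_NONBLOCK) == -1)
        {
                return -1;
        }
        return 0;
}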

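/*
   handle_message - receive and print the messages pending on one socket.
   Returns the number of bytes read (0 if nothing was pending), or -1 after
   closing the socket on EOF or on a real error.
 */
int handle_message(int new_fd)
{
        char buf[MAXBUF + 1];
        int len, total = 0;
        /* the socket is edge-triggered, so keep reading until it is drained */
        for (;;)
        {
                bzero(buf, MAXBUF + 1);
                /* receive the client's message */
                len = recv(new_fd, buf, MAXBUF, 0);
                if (len > 0)
                {
                        printf("%d receive msg succeed:%s,total %d Byte\n",new_fd, buf, len);
                        total += len;
                        continue;
                }
                /* nothing left to read for now: wait for the next event */
                if (len < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                        return total;
                if (len < 0)
                        printf("receive msg failed %d,error msg is %s\n", errno, strerror(errno));
                /* len == 0 means the peer closed the connection */
                close(new_fd);
                return -1;
        }
}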
/************About this file****************************************
 *filename: epoll-server.c
 *purpose: demonstrates handling a large number of socket connections with epoll
 *******************************************************************
**/
int main(int argc, char **argv)
{
        int listener, new_fd, kdpfd, nfds, n, ret, curfds;
        socklen_t len;
        struct sockaddr_in my_addr, their_addr;
        struct epoll_event ev;
        struct epoll_event events[MAXEPOLLSIZE];
        struct rlimit rt;

        /* set the maximum number of files this process may open */
        rt.rlim_max = rt.rlim_cur = MAXEPOLLSIZE;
        if (setrlimit(RLIMIT_NOFILE, &rt) == -1)
        {
                perror("setrlimit");
                exit(1);
        }
        else
        {
                printf("set system resource succeed. \n");
        }
        /* create the listening socket */
        if ((listener = socket(AF_INET, SOCK_STREAM, 0)) == -1)
        {
                perror("socket");
                exit(1);
        }
        else
        {
                printf("socket create succeed.\n");
        }

        setnonblocking(listener);
        bzero(&my_addr, sizeof(my_addr));
        my_addr.sin_family = AF_INET;
        my_addr.sin_port = htons(MYPORT);
        my_addr.sin_addr.s_addr = INADDR_ANY;

        if (bind(listener, (struct sockaddr *) &my_addr, sizeof(struct sockaddr)) == -1)
        {
                perror("bind");
                exit(1);
        }
        else
        {
                printf("IP addr and port bind succeed\n");
        }
        if (listen(listener, LISTENQ) == -1)
        {
                perror("listen");
                exit(1);
        }
        else
        {
                printf("start service succeed. \n");
        }

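        /* create the epoll instance before registering any descriptor */
        kdpfd = epoll_create(MAXEPOLLSIZE);
        if (kdpfd == -1)
        {
                perror("epoll_create");
                exit(1);
        }
        len = sizeof(struct sockaddr_in);
        /* the listener stays level-triggered: the loop below accepts only one
           connection per event, which under EPOLLET could strand connections
           that arrive together */
        ev.events = EPOLLIN;
        ev.data.fd = listener;
        if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, listener, &ev) < 0)
        {
                fprintf(stderr, "epoll set insertion error: fd=%d\n", listener);
                return -1;
        }
        else
        {
                printf("listen fd socket put into epoll succeed.\n");
        }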
        curfds = 1;
        while (1)
        {
                /* wait for events; maxevents is the capacity of the events array */
                nfds = epoll_wait(kdpfd, events, MAXEPOLLSIZE, -1);
                if (nfds == -1)
                {
                        perror("epoll_wait");
                        break;
                }
                /* handle every reported event */
                for (n = 0; n < nfds; ++n)
                {
                        if (events[n].data.fd == listener)
                        {
                                new_fd = accept(listener, (struct sockaddr *) &their_addr,&len);
                                if (new_fd < 0)
                                {
                                        perror("accept");
                                        continue;
                                }
                                else
                                {
                                        printf("conn come from  %s:%d, allocated socket is:%d\n",
                                                        inet_ntoa(their_addr.sin_addr), ntohs(their_addr.sin_port), new_fd);
                                }
                                setnonblocking(new_fd);
                                ev.events = EPOLLIN | EPOLLET;
                                ev.data.fd = new_fd;
                                if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, new_fd, &ev) < 0)
                                {
                                        fprintf(stderr, "put socket '%d' into epoll failed .%s\n",
                                                        new_fd, strerror(errno));
                                        return -1;
                                }
                                curfds++;
                        }
                        else
                        {
                                ret = handle_message(events[n].data.fd);
                                if (ret < 0)
                                {
                                        /* handle_message() has already closed the fd, and
                                           closing a descriptor with no duplicates removes
                                           it from the epoll set, so EPOLL_CTL_DEL is not
                                           needed here */
                                        curfds--;
                                }
                        }
                }
        }
        close(listener);
        return 0;
}
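
The server above stores only the descriptor in ev.data.fd. The man page excerpt notes that an event-driven application should record its state between EAGAINs; one way to do that is to hang a per-connection record on data.ptr instead. The sketch below assumes a made-up struct conn and helper names (conn_register, conn_from_event, conn_release); none of them are part of the epoll API.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection record; a real server would add buffers,
   parser state, and so on. */
struct conn {
        int fd;
        unsigned long bytes_seen;
};

/* Allocate a record for fd and register fd for edge-triggered reads, storing
   the record's address in the event's user data.
   Returns the record, or NULL on failure. */
struct conn *conn_register(int epfd, int fd)
{
        struct epoll_event ev;
        struct conn *c = calloc(1, sizeof *c);

        if (c == NULL)
                return NULL;
        c->fd = fd;
        ev.events = EPOLLIN | EPOLLET;
        ev.data.ptr = c;      /* data is a union: use ptr or fd, not both */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
                free(c);
                return NULL;
        }
        return c;
}

/* Recover the record from an event returned by epoll_wait(). */
struct conn *conn_from_event(const struct epoll_event *ev)
{
        return ev->data.ptr;
}

/* Remove the fd from the set, close it, and free the record. */
void conn_release(int epfd, struct conn *c)
{
        epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
        close(c->fd);
        free(c);
}
```

Because data is a union, the listening socket needs its own record as well (or a sentinel pointer) once data.ptr is used, since events[n].data.fd can no longer be compared against listener.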

 client.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BUFLEN 1024

int main(int argc,char *argv[])
{
        int connect_fd;
        int ret;
        char snd_buf[BUFLEN];
        int i;
        int port;
        int len;
        static struct sockaddr_in srv_addr;
        if(argc!=3){
                printf("Usage: %s server_ip_address port\n",argv[0]);
                return 1;
        }
        port=atoi(argv[2]);
        connect_fd=socket(AF_INET,SOCK_STREAM,0);
        if(connect_fd<0){
                perror("cannot create communication socket");
                return 1;
        }
        memset(&srv_addr,0,sizeof(srv_addr));
        srv_addr.sin_family=AF_INET;
        srv_addr.sin_addr.s_addr=inet_addr(argv[1]);
        srv_addr.sin_port=htons(port);

        ret=connect(connect_fd,(struct sockaddr*)&srv_addr,sizeof(srv_addr));
        if(ret==-1){
                perror("cannot connect to the server");
                close(connect_fd);
                return 1;
        }
        memset(snd_buf,0,BUFLEN);
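        while(1){
                write(STDOUT_FILENO,"input message:",14);
                bzero(snd_buf, BUFLEN);
                /* leave room for the terminating NUL */
                len=read(STDIN_FILENO,snd_buf,BUFLEN-1);
                /* stop on EOF, on a read error, or when the line starts with '@' */
                if(len<=0 || snd_buf[0]=='@')
                        break;
                send(connect_fd, snd_buf, len, 0);
                bzero(snd_buf, BUFLEN);
                /* the server above only prints what it receives, so check
                   for a reply without blocking */
                len=recv(connect_fd,snd_buf,BUFLEN-1,MSG_DONTWAIT);
                if(len>0)
                        printf("Message from server: %s\n",snd_buf);
        }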
        close(connect_fd);
        return 0;
}


 

References:

1.http://baike.baidu.com/link?url=e_ZNqAmDO2pTaMeG0n3ROAT3rckHikX7G-zsqYuz-nzlNNrS59CPPH6O03dOh1Q8Y3tnKtw4-qWhs76lrx1iJa

2.http://blog.csdn.net/haoahua/article/details/2037704

3.Linux man pages
