epoll source analysis: the sys_epoll_create() function http://blog.chinaunix.net/uid-28443939-id-3470593.html

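epoll source analysis: the sys_epoll_create() function 2013-01-10 21:25:44

Category: LINUX

The advantages of eventpoll need no introduction: there is plenty of material online, and it is widely used, especially in web servers. I needed epoll recently, so I read through its implementation carefully. These notes organize and record what I learned.

1. sys_epoll_create()

The source is as follows: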

 
    SYSCALL_DEFINE1(epoll_create, int, size)
    {
        if (size <= 0)
            return -EINVAL;

        return sys_epoll_create1(0);
    }

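After preprocessing, SYSCALL_DEFINE1(epoll_create, int, size) becomes long sys_epoll_create(int size). As the code shows, the size argument passed in from user space is only checked to be positive and is otherwise unused. Once the argument has been validated, sys_epoll_create() calls sys_epoll_create1() to do the real work, so let us look at how sys_epoll_create1() is implemented.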
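The size check can be observed from user space. The following is a minimal sketch, assuming a Linux system with glibc; try_epoll_create() and check_epoll_create_size() are hypothetical helpers written for this note, not kernel code. It shows that a non-positive size is rejected with EINVAL while any positive size, small or large, is accepted.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the kernel source: call epoll_create()
 * with the given size and return the new descriptor, or -errno on failure.
 */
static int try_epoll_create(int size)
{
    int fd = epoll_create(size);

    return fd < 0 ? -errno : fd;
}

/* Exercise the size check: non-positive sizes fail, any positive size works. */
static void check_epoll_create_size(void)
{
    int small = try_epoll_create(1);
    int large = try_epoll_create(1 << 20);

    assert(try_epoll_create(0) == -EINVAL);
    assert(try_epoll_create(-1) == -EINVAL);
    assert(small >= 0);
    assert(large >= 0);

    close(small);
    close(large);
}
```

Calling check_epoll_create_size() from a main() function is enough to run the check. Since the value is ignored once it passes validation, passing 1 is as good as any other positive number.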

2. The sys_epoll_create1() function

 
    SYSCALL_DEFINE1(epoll_create1, int, flags)
    {
        int error;
        struct eventpoll *ep = NULL;

        /*
         * The build fails if (EPOLL_CLOEXEC != O_CLOEXEC) holds, so the
         * two constants are checked for consistency at compile time.
         */
        BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

        /* flags must be either 0 or EPOLL_CLOEXEC; otherwise return -EINVAL */
        if (flags & ~EPOLL_CLOEXEC)
            return -EINVAL;

        /*
         * Allocate and initialize an eventpoll instance, which ends up in
         * the private_data member of the file structure. private_data
         * holds the object that the file descriptor really refers to. For
         * example, if the descriptor is a socket, the private_data member
         * of its file instance holds a socket instance.
         */
        error = ep_alloc(&ep);
        if (error < 0)
            return error;

        /*
         * Create the eventpoll file. Its file_operations is
         * eventpoll_fops and its private data is the eventpoll instance.
         */
        error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
                                 flags & O_CLOEXEC);
        if (error < 0)
            ep_free(ep);

        return error;
    }
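The flags handling above can also be checked from user space. This is a small sketch under the assumption of a Linux system with glibc; has_cloexec(), cloexec_after_create1() and check_epoll_create1_flags() are hypothetical names introduced only for illustration. It creates an epoll instance with the given flags and reports whether the close-on-exec bit ended up set on the descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
static int has_cloexec(int fd)
{
    int fdflags = fcntl(fd, F_GETFD);

    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) != 0;
}

/*
 * Hypothetical helper: create an epoll instance with the given flags and
 * return its close-on-exec state (0 or 1), or -errno if creation failed.
 */
static int cloexec_after_create1(int flags)
{
    int fd = epoll_create1(flags);
    int state;

    if (fd < 0)
        return -errno;
    state = has_cloexec(fd);
    close(fd);
    return state;
}

static void check_epoll_create1_flags(void)
{
    assert(cloexec_after_create1(0) == 0);
    assert(cloexec_after_create1(EPOLL_CLOEXEC) == 1);
    /* Bit 0 is not EPOLL_CLOEXEC, so flags & ~EPOLL_CLOEXEC is nonzero. */
    assert(cloexec_after_create1(EPOLL_CLOEXEC | 1) == -EINVAL);
}
```

The last assertion corresponds to the `flags & ~EPOLL_CLOEXEC` test in the kernel code: any bit other than EPOLL_CLOEXEC makes the call fail before anything is allocated.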

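Start with the BUILD_BUG_ON macro. It checks at compile time whether condition is true, and if it is, compilation fails:

    #define BUILD_BUG_ON(condition) ((void)BUILD_BUG_ON_ZERO(condition))

The BUILD_BUG_ON_ZERO macro is defined as follows:

    #define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))

At first glance this definition looks messy. Is it even valid syntax? Look more closely, though, and the trick becomes clear.

First consider !!(e), which is easy to understand: it converts e to a boolean value. If e is 10, the first negation yields 0 and the second yields 1; if e is 0, two negations still yield 0. Next consider -!!(e) (note the leading "-", which negates the value). When e is nonzero, !!(e) is 1 and -!!(e) is -1, so BUILD_BUG_ON_ZERO expands to (sizeof(struct { int:-1; })). When defining a structure, a member can be declared as a bit-field with an explicit width. The width may range from 0 up to the number of bits in the member's type, and a negative width is a compile error. (A width of 0 is legal only for an unnamed bit-field, as here; a named member cannot have width 0.) The kernel exploits the lower bound. The upper bound can be used as well, by requesting a width larger than the member's type can hold. In the following example a nonzero e requests a 512-bit char field, which does not compile:

    #define BUILD_BUG_ON_ZERO1(e) (sizeof(struct { char:(!!(e) << 9); }))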
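To try the trick outside the kernel, here is a minimal sketch, assuming GCC or Clang compiling as C (a structure with no named members is a compiler extension, which is why this is not strictly portable). The MY_ prefixed macros are local copies of the definitions quoted above, renamed so they cannot clash with kernel headers, and checked_sizes() is a hypothetical function added for the demonstration.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the macros quoted above, renamed for user-space use. */
#define MY_BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
#define MY_BUILD_BUG_ON(condition) ((void)MY_BUILD_BUG_ON_ZERO(condition))

static size_t checked_sizes(void)
{
    /* Both conditions are false, so each check expands to int:0 and compiles. */
    MY_BUILD_BUG_ON(sizeof(char) != 1);
    MY_BUILD_BUG_ON(sizeof(long) < sizeof(int));

    /*
     * Uncommenting the next line makes the condition true, the bit-field
     * width becomes -1, and the compiler rejects the translation unit.
     */
    /* MY_BUILD_BUG_ON(sizeof(char) == 1); */

    /* With a false condition the macro evaluates to zero, hence its name. */
    return MY_BUILD_BUG_ON_ZERO(sizeof(char) != 1);
}

static void check_build_bug_on(void)
{
    assert(checked_sizes() == 0);
}
```

Because everything happens inside sizeof, no object is ever created and the checks cost nothing at run time.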

Next comes the ep_alloc() function. Its source, with comments, is as follows:

 
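    /*
     * Allocate and initialize an eventpoll instance.
     */
    static int ep_alloc(struct eventpoll **pep)
    {
        int error;
        struct user_struct *user;
        struct eventpoll *ep;

        user = get_current_user();
        error = -ENOMEM;
        ep = kzalloc(sizeof(*ep), GFP_KERNEL);
        if (unlikely(!ep))
            goto free_uid;

        /*
         * Initialize the spinlock that protects access to this structure.
         */
        spin_lock_init(&ep->lock);

        /*
         * Initialize the mutex that serializes delivering events to
         * user space against removing files from the epoll instance.
         */
        mutex_init(&ep->mtx);

        /*
         * Initialize the wait queue of the epoll file. Processes that
         * call epoll_wait may sleep on this queue until they are woken
         * by ep_poll_callback() or their timeout expires.
         */
        init_waitqueue_head(&ep->wq);

        /*
         * poll_wait is the wakeup queue of the eventpoll file itself.
         * Processes sleeping on it are waiting for events on the
         * eventpoll file itself.
         */
        init_waitqueue_head(&ep->poll_wait);

        /*
         * Initialize the ready list. When an event a file was registered
         * for occurs, the file is placed on this list.
         */
        INIT_LIST_HEAD(&ep->rdllist);

        /*
         * Root of the red-black tree that stores the file descriptors.
         */
        ep->rbr = RB_ROOT;

        /*
         * While events are being delivered to user space, the structures
         * of descriptors that become ready are parked on this list for
         * the time being; otherwise they are added directly to the ready
         * list rdllist.
         */
        ep->ovflist = EP_UNACTIVE_PTR;
        ep->user = user;

        *pep = ep;

        return 0;

    free_uid:
        free_uid(user);
        return error;
    }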

The comments above are fairly detailed, so nothing more needs to be said.
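The difference between wq and poll_wait is easier to see with a user-space experiment. As I read the comments above, wq serves callers sleeping in epoll_wait() on the instance, while poll_wait serves whoever is polling the epoll descriptor itself, for example a second epoll instance that has the first one registered. The sketch below assumes Linux with glibc; outer_ready_count() and check_nested_epoll() are hypothetical helpers. An inner epoll instance watches a pipe, an outer instance watches the inner descriptor, and the function returns what the outer instance reports.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the number of ready events the outer epoll
 * instance reports (without blocking), or -1 if any setup step fails.
 */
static int outer_ready_count(int write_first)
{
    int pipefd[2];
    int inner, outer;
    int ready = -1;
    struct epoll_event ev = { 0 };
    struct epoll_event got = { 0 };

    if (pipe(pipefd) < 0)
        return -1;

    inner = epoll_create1(0);
    outer = epoll_create1(0);
    if (inner < 0 || outer < 0)
        goto cleanup;

    /* The inner instance watches the read end of the pipe. */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto cleanup;

    /* The outer instance watches the inner epoll descriptor itself. */
    ev.events = EPOLLIN;
    ev.data.fd = inner;
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0)
        goto cleanup;

    if (write_first && write(pipefd[1], "x", 1) != 1)
        goto cleanup;

    /* Timeout 0: report what is ready right now without sleeping. */
    ready = epoll_wait(outer, &got, 1, 0);
    if (ready == 1 && got.data.fd != inner)
        ready = -1;

cleanup:
    if (inner >= 0)
        close(inner);
    if (outer >= 0)
        close(outer);
    close(pipefd[0]);
    close(pipefd[1]);
    return ready;
}

static void check_nested_epoll(void)
{
    assert(outer_ready_count(0) == 0);  /* nothing written, nothing ready */
    assert(outer_ready_count(1) == 1);  /* inner became readable */
}
```

When the pipe becomes readable, the inner instance has a pending event, which makes the inner descriptor itself readable from the outer instance's point of view.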

The last function of interest is anon_inode_getfd(). It plays a role similar to sock_map_fd(): it maps the eventpoll instance onto a file.

 
    int anon_inode_getfd(const char *name, const struct file_operations *fops,
                         void *priv, int flags)
    {
        int error, fd;
        struct file *file;

        /*
         * Allocate a free file descriptor.
         */
        error = get_unused_fd_flags(flags);
        if (error < 0)
            return error;
        fd = error;

        /* Create the anonymous file; priv becomes its private_data. */
        file = anon_inode_getfile(name, fops, priv, flags);
        if (IS_ERR(file)) {
            error = PTR_ERR(file);
            goto err_put_unused_fd;
        }
        /* Bind the descriptor to the file. */
        fd_install(fd, file);

        return fd;

    err_put_unused_fd:
        put_unused_fd(fd);
        return error;
    }

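This function first calls get_unused_fd_flags() to allocate a free file descriptor, then creates an anonymous file and attaches it to that descriptor. Since this involves file-system internals, I will not analyze it further.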
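The name passed to anon_inode_getfd() is visible from user space, which gives a quick way to confirm that an epoll descriptor is backed by an anonymous inode. The following is a sketch that assumes Linux with /proc mounted; epoll_fd_link_name() and check_epoll_link_name() are hypothetical helpers. It reads the symbolic link /proc/self/fd/<fd> for a freshly created epoll descriptor.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an epoll instance and copy the target of
 * /proc/self/fd/<fd> into buf. Returns 0 on success, -1 on failure.
 */
static int epoll_fd_link_name(char *buf, size_t len)
{
    char path[64];
    ssize_t n;
    int epfd;

    if (len == 0)
        return -1;

    epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", epfd);
    n = readlink(path, buf, len - 1);   /* readlink does not NUL-terminate */
    close(epfd);
    if (n < 0)
        return -1;

    buf[n] = '\0';
    return 0;
}

static void check_epoll_link_name(void)
{
    char name[64];

    assert(epoll_fd_link_name(name, sizeof(name)) == 0);
    /* The "[eventpoll]" part is the name given to anon_inode_getfd(). */
    assert(strcmp(name, "anon_inode:[eventpoll]") == 0);
}
```

The same link target shows up in the output of `ls -l /proc/<pid>/fd` for any process that holds an epoll descriptor.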

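3. The find_next_zero_bit() function

get_unused_fd_flags() is a macro; the function behind it is alloc_fd(). alloc_fd() calls find_next_zero_bit() to look for a free bit in the file-descriptor bitmap, and the index of that free bit is the file descriptor that gets allocated. I found this function interesting and studied it closely, so I am sharing it here.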

 
    /*
     * On success find_next_zero_bit returns a value in the range
     * 0..(size-1), i.e. an index into the bit array; it returns size
     * when no zero bit is found.
     * @addr: address of the bitmap
     * @size: number of bits in the bitmap
     * @offset: index into the bit array at which the search starts
     */
    unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
                                     unsigned long offset)
    {
        /*
         * BITOP_WORD computes the index in the addr array of the
         * unsigned long element that contains bit offset, so p is the
         * address of the unsigned long element containing offset.
         */
        const unsigned long *p = addr + BITOP_WORD(offset);
        /*
         * Equivalent to offset - (offset % BITS_PER_LONG): the number of
         * bits in all unsigned long elements before the one containing
         * offset.
         */
        unsigned long result = offset & ~(BITS_PER_LONG-1);
        unsigned long tmp;

        /*
         * If the offset is greater than or equal to the size of the
         * bitmap, return size right away.
         */
        if (offset >= size)
            return size;
        /*
         * Number of bits in the unsigned long containing offset and in
         * all unsigned long elements after it.
         */
        size -= result;
        /*
         * Position of the bit inside its unsigned long, which can also
         * be seen as an index. If offset is 67 before this statement, it
         * is 3 afterwards, i.e. the 4th bit of that unsigned long.
         */
        offset %= BITS_PER_LONG;
        if (offset) {
            /*
             * tmp is the value of the unsigned long containing offset.
             */
            tmp = *(p++);
            /*
             * BITS_PER_LONG - offset is the number of bits at and after
             * offset in that unsigned long. The bit at offset and all
             * bits after it are kept, while all bits before offset are
             * set to 1 so that they are skipped.
             */
            tmp |= ~0UL >> (BITS_PER_LONG - offset);
            /*
             * If size is less than BITS_PER_LONG, offset lies in the
             * last unsigned long element.
             */
            if (size < BITS_PER_LONG)
                goto found_first;
            /*
             * If ~tmp is nonzero, tmp contains a zero bit, so look for
             * the free bit in tmp.
             */
            if (~tmp)
                goto found_middle;
            /*
             * The unsigned long containing offset has no free bit, so
             * continue with the elements after it: reduce the number of
             * remaining bits and advance the number of bits searched.
             */
            size -= BITS_PER_LONG;
            result += BITS_PER_LONG;
        }
        /*
         * The loop exits once size is less than BITS_PER_LONG.
         */
        while (size & ~(BITS_PER_LONG-1)) {
            /*
             * Load the next element into tmp. If ~tmp is nonzero, tmp
             * contains a zero bit, so look for the free bit in tmp.
             */
            if (~(tmp = *(p++)))
                goto found_middle;
            /*
             * Reduce the number of remaining bits and advance the number
             * of bits searched.
             */
            result += BITS_PER_LONG;
            size -= BITS_PER_LONG;
        }
        /*
         * If everything has been searched and no free bit was found,
         * return result, which at this point equals the number of bits
         * in the bitmap.
         */
        if (!size)
            return result;
        /*
         * size is nonzero: search the remaining trailing bits (fewer
         * than BITS_PER_LONG of them).
         */
        tmp = *p;

    found_first:
        /*
         * The number of remaining bits may be less than BITS_PER_LONG,
         * so set the unused bits of the unsigned long to 1 to keep them
         * from disturbing the search.
         */
        tmp |= ~0UL << size;
        /*
         * If all bits are 1 there is no free bit, so return the total
         * number of bits.
         */
        if (tmp == ~0UL)    /* Are any bits zero? */
            return result + size;    /* Nope. */
    found_middle:
        /*
         * ffz(tmp) returns the index of the first zero bit in tmp.
         */
        return result + ffz(tmp);
    }
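To experiment with the search outside the kernel, here is a simplified user-space restatement. It is not the kernel routine: it is a word-by-word loop that I believe honours the same contract (return the index of the first zero bit at or after offset, or size if there is none) and reuses the same masking idea for the bits below offset. All names starting with sketch_ are hypothetical, and sketch_ffz() is a plain loop standing in for the kernel's ffz(). As a worked example of the index arithmetic, with 64-bit longs an offset of 67 falls in word 67 / 64 = 1 at bit 67 % 64 = 3.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Index of the lowest zero bit in word; the caller ensures word != ~0UL. */
static unsigned long sketch_ffz(unsigned long word)
{
    unsigned long bit = 0;

    while (word & 1UL) {
        word >>= 1;
        bit++;
    }
    return bit;
}

/* Simplified search with the same contract as find_next_zero_bit(). */
static unsigned long sketch_find_next_zero_bit(const unsigned long *addr,
                                               unsigned long size,
                                               unsigned long offset)
{
    while (offset < size) {
        unsigned long word_idx = offset / SKETCH_BITS_PER_LONG;
        unsigned long bit = offset % SKETCH_BITS_PER_LONG;
        unsigned long tmp = addr[word_idx];
        unsigned long pos;

        /* Treat the bits below the starting position as already taken. */
        if (bit)
            tmp |= ~0UL >> (SKETCH_BITS_PER_LONG - bit);

        if (tmp != ~0UL) {
            pos = word_idx * SKETCH_BITS_PER_LONG + sketch_ffz(tmp);
            /* A zero bit past the end of the bitmap does not count. */
            return pos < size ? pos : size;
        }
        /* No free bit in this word: continue at the start of the next. */
        offset = (word_idx + 1) * SKETCH_BITS_PER_LONG;
    }
    return size;
}

static void check_sketch_find_next_zero_bit(void)
{
    unsigned long bpl = SKETCH_BITS_PER_LONG;
    unsigned long bits[2] = { ~0UL, 0x0FUL };  /* word 0 full, word 1 bits 0-3 set */

    assert(sketch_find_next_zero_bit(bits, 2 * bpl, 0) == bpl + 4);
    assert(sketch_find_next_zero_bit(bits, 2 * bpl, bpl + 5) == bpl + 5);
    assert(sketch_find_next_zero_bit(bits, bpl + 3, 0) == bpl + 3);   /* none in range */
    assert(sketch_find_next_zero_bit(bits, 2 * bpl, 2 * bpl) == 2 * bpl);
}
```

The kernel version avoids the per-word division and handles the first and last partial words separately, which is where the found_first and found_middle labels come from; the sketch trades that efficiency for brevity.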
