[Repost] How epoll Monitors Multiple Descriptors and How It Gets Notified

Latest recommended article published on 2024-08-22 15:01:59

luoyue625

Category: linux. Tags: monitoring multiple file descriptors with poll

Original source: http://blog.chinaunix.net/uid-23629988-id-3569332.html

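We started out discussing the bottom-up path of a packet through the TCP/IP stack. I mentioned that once the kernel has selected the right socket, it wakes up the processes waiting on that socket to tell them a new packet has arrived. My friend remarked that this is the synchronous model, and asked how epoll is implemented in that case. A quick aside: I think that framing is off. The former can be called synchronous, though I believe what he really meant to stress was blocking. Either way, whether you call it blocking or synchronous, epoll is not its opposite. At the time I knew too little about epoll to say anything about its implementation, so we only talked briefly about some of its characteristics and how it compares with select.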

Over the past few days I found some spare time and, with this question in mind, read up on epoll.

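There are already plenty of articles and other material about epoll. Most cover how to use it or compare it with poll and select. Some analyze the epoll source code, but only the kernel implementation of the epoll API itself. For my simple question there seemed to be no direct answer online. After reading the code for a couple of days, I now roughly understand how epoll monitors multiple descriptors at once and how it gets notified.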

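1. Both select and epoll are built on the poll mechanism. poll is a member function that the VFS requires, and every concrete file system provides its own poll implementation (sockets are a virtual file system too).
2. Both select and epoll are still blocking calls. The difference is that a single blocking call to select or epoll can monitor multiple file descriptors, and it can also take a timeout.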
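Seen from user space, point 2 can be illustrated with a minimal sketch. It assumes a Linux system; the function name demo_epoll_two_pipes and the use of two pipes are illustrative choices, not something from the original article. One blocking epoll_wait call watches both pipes and gives up after a timeout.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical demo: watch the read ends of two pipes with one epoll
 * instance, make only the second pipe readable, then block in
 * epoll_wait() with a 1000 ms timeout.
 * Returns 1 if exactly the second pipe is reported, 0 if something else
 * is reported or the call times out, -1 on a system call error.
 */
int demo_epoll_two_pipes(void)
{
    int a[2], b[2];
    int epfd, n;
    int result = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event ready[2];

    if (pipe(a) < 0)
        return -1;
    if (pipe(b) < 0)
        goto out_a;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_b;

    /* Register both read ends with the same epoll instance. */
    ev.data.fd = a[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, a[0], &ev) < 0)
        goto out_ep;
    ev.data.fd = b[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, b[0], &ev) < 0)
        goto out_ep;

    /* Only the second pipe gets data. */
    if (write(b[1], "x", 1) != 1)
        goto out_ep;

    /* One blocking call covers both descriptors. */
    n = epoll_wait(epfd, ready, 2, 1000);
    if (n >= 0)
        result = (n == 1 && ready[0].data.fd == b[0]) ? 1 : 0;

out_ep:
    close(epfd);
out_b:
    close(b[0]);
    close(b[1]);
out_a:
    close(a[0]);
    close(a[1]);
    return result;
}
```

If neither pipe had data, the same call would simply block until the timeout expired and return 0 events.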

The implementation of select is much simpler than that of epoll: it queries each descriptor in turn by polling. So how does epoll do it?

Start with the ep_insert function, which adds a new descriptor to be monitored.

/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
/*
 * Attach the item to the poll hooks and get current event bits.
 * We can safely use the file* here because its usage count has
 * been increased by the caller of this function. Note that after
 * this operation completes, the poll callback can start hitting
 * the new item.
 */
revents = tfile->f_op->poll(tfile, &epq.pt);

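These few lines are the key. Here epq acts as glue that binds epoll to the kernel's own poll mechanism. The assignment epq.epi = epi stores epi, the object that represents one monitored descriptor inside epoll. init_poll_funcptr then sets the callback of epq's poll table, which completes the initialization of epq. Finally, the poll implementation of the target file descriptor is called.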
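The glue role of epq can be sketched in user space. This is a simplified model under stated assumptions, not kernel code: every type and function carrying a _sketch suffix is hypothetical, and an int counter stands in for a real wait queue. The idea it illustrates is that the poll table is embedded in a larger structure, so a callback that only receives the poll table pointer can still recover the epoll item.

```c
#include <assert.h>
#include <stddef.h>

struct poll_table_sketch;
typedef void (*queue_proc_sketch)(struct poll_table_sketch *pt, int *wait_head);

/* Stand-in for poll_table: it only carries the registration callback. */
struct poll_table_sketch {
    queue_proc_sketch qproc;
};

/* Stand-in for struct epitem: one monitored descriptor. */
struct epitem_sketch {
    int fd;
    int nwait;
};

/* Stand-in for struct ep_pqueue: the glue that pairs pt with epi. */
struct ep_pqueue_sketch {
    struct poll_table_sketch pt;
    struct epitem_sketch *epi;
};

/* User-space equivalent of the kernel's container_of idiom. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Plays the role of ep_ptable_queue_proc: recover epi from pt. */
static void queue_proc(struct poll_table_sketch *pt, int *wait_head)
{
    struct ep_pqueue_sketch *epq =
        container_of_sketch(pt, struct ep_pqueue_sketch, pt);

    assert(epq->epi != NULL);
    epq->epi->nwait++;
    (*wait_head)++;    /* stand-in for add_wait_queue() */
}

/* Plays the role of a file's poll method: register, then report events. */
static unsigned int file_poll_sketch(int *wait_head,
                                     struct poll_table_sketch *pt)
{
    if (pt && pt->qproc)
        pt->qproc(pt, wait_head);
    return 0;    /* no events are ready in this sketch */
}

/* Returns the number of wait queues the item was registered on. */
int demo_register(void)
{
    struct epitem_sketch epi = { .fd = 5, .nwait = 0 };
    struct ep_pqueue_sketch epq;
    int wait_head = 0;

    epq.epi = &epi;
    epq.pt.qproc = queue_proc;
    file_poll_sketch(&wait_head, &epq.pt);
    return epi.nwait;
}
```

The file's poll method never needs to know about epoll: it just invokes whatever callback the poll table carries.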

Now we jump into a concrete poll function. Taking a socket file descriptor as the example, when the socket is UDP the poll implementation is udp_poll.

unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
    unsigned int mask = datagram_poll(file, sock, wait);
    struct sock *sk = sock->sk;

    /* Check for false positives due to checksum errors */
    if ((mask & POLLRDNORM) && !(file->f_flags & O_NONBLOCK) &&
        !(sk->sk_shutdown & RCV_SHUTDOWN) && !first_packet_length(sk))
        mask &= ~(POLLIN | POLLRDNORM);
    return mask;
}

This code is simple. The key function is datagram_poll.

unsigned int datagram_poll(struct file *file, struct socket *sock,
                           poll_table *wait)
{
    struct sock *sk = sock->sk;
    unsigned int mask;

    sock_poll_wait(file, sk_sleep(sk), wait);
    mask = 0;
    /* handle events */
    ...... ......

    return mask;
}

sk_sleep(sk) is the wait queue of sk. Next we need to look at sock_poll_wait.

static inline void sock_poll_wait(struct file *filp,
        wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && wait_address) {
        poll_wait(filp, wait_address, p);
        /*
         * We need to be sure we are in sync with the
         * socket flags modification.
         *
         * This memory barrier is paired in the wq_has_sleeper.
         */
        smp_mb();
    }
}

static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
    if (p && wait_address)
        p->qproc(filp, wait_address, p);
}

After going around in a circle, we are back where we started: we need to look at the implementation of ep_ptable_queue_proc.

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

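The code is clear: it obtains epi, the epoll instance of the monitored descriptor, creates an epoll wait node, and adds that node to the queue passed in as whead. In the example above, that is the sock's wait queue. This ties epoll to the specific descriptor. When that descriptor performs a wakeup, the link set up here is used to wake epoll.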

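The previous part covered how epoll is added to the wait queue of each monitored descriptor, which is only the first step. As also mentioned before, epoll is in fact a blocking operation too; it can just monitor multiple file descriptors at once. Now let's look at the implementation of epoll_wait->ep_poll.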

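Since epoll blocks, it must have a wait queue. It cannot use the wait queues of the monitored file descriptors, though. epoll is itself a virtual file system, and the return value of epoll_create is also a file descriptor. On Unix, everything is a file, after all.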

So the epoll implementation looks like this:

init_waitqueue_entry(&wait, current);
__add_wait_queue_exclusive(&ep->wq, &wait);
for (;;) {
    /*
     * We don't want to sleep if the ep_poll_callback() sends us
     * a wakeup in between. That's why we set the task state
     * to TASK_INTERRUPTIBLE before doing the checks.
     */
    set_current_state(TASK_INTERRUPTIBLE);
    if (ep_events_available(ep) || timed_out)
        break;
    if (signal_pending(current)) {
        res = -EINTR;
        break;
    }
    spin_unlock_irqrestore(&ep->lock, flags);
    if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
        timed_out = 1;
    spin_lock_irqsave(&ep->lock, flags);
}
__remove_wait_queue(&ep->wq, &wait);

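Here epoll_wait adds the current process to epoll's own wait queue. That raises a question: as described earlier, epoll has already added a wait node of its own to the wait queue of each monitored descriptor. Now there is also epoll's own wait queue. Why?
To answer this we need to jump back to ep_ptable_queue_proc (if you do not remember this function, see the earlier part). It calls init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);, which sets the callback of the wait queue node that epoll places on the monitored descriptor to ep_poll_callback. By contrast, the init_waitqueue_entry function called in ep_poll sets the callback of its wait queue node to default_wake_function.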

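So when a monitored file descriptor performs a wakeup, for example when a socket receives data, the call chain sk_data_ready->sock_def_readable->wake_up_interruptible_sync_poll->.... eventually runs the callback of each wait queue node. For epoll, that is ep_poll_callback.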
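The difference between the two kinds of wait queue node can be sketched as follows. This is a simplified user-space model, not the kernel's wait queue API: all names ending in _sketch, and the helpers wq_add and wq_wake_all, are hypothetical. Each node carries its own callback, so one wakeup on the same queue can mark a sleeping task runnable for one node and record a ready event for another.

```c
#include <assert.h>
#include <stddef.h>

struct wq_entry_sketch;
typedef int (*wq_func_sketch)(struct wq_entry_sketch *entry, void *key);

/* A wait queue node: a callback plus private data for that callback. */
struct wq_entry_sketch {
    wq_func_sketch func;
    void *private_data;
    struct wq_entry_sketch *next;
};

struct wq_head_sketch {
    struct wq_entry_sketch *first;
};

static void wq_add(struct wq_head_sketch *head, struct wq_entry_sketch *e)
{
    e->next = head->first;
    head->first = e;
}

/* Waking the queue just means running every node's callback. */
static int wq_wake_all(struct wq_head_sketch *head, void *key)
{
    int called = 0;
    struct wq_entry_sketch *e;

    for (e = head->first; e != NULL; e = e->next) {
        assert(e->func != NULL);
        e->func(e, key);
        called++;
    }
    return called;
}

struct task_sketch { int runnable; };
struct ready_list_sketch { int count; };

/* Models default_wake_function: make the sleeping task runnable. */
static int default_wake_sketch(struct wq_entry_sketch *e, void *key)
{
    (void)key;
    ((struct task_sketch *)e->private_data)->runnable = 1;
    return 1;
}

/* Models ep_poll_callback: record that a descriptor became ready. */
static int epoll_callback_sketch(struct wq_entry_sketch *e, void *key)
{
    (void)key;
    ((struct ready_list_sketch *)e->private_data)->count++;
    return 1;
}

/* Returns 1 if one wakeup ran both callbacks with their own effects. */
int demo_wake(void)
{
    struct wq_head_sketch sock_wq = { NULL };
    struct task_sketch task = { 0 };
    struct ready_list_sketch ready = { 0 };
    struct wq_entry_sketch blocking_reader = { default_wake_sketch, &task, NULL };
    struct wq_entry_sketch epoll_node = { epoll_callback_sketch, &ready, NULL };

    wq_add(&sock_wq, &blocking_reader);
    wq_add(&sock_wq, &epoll_node);

    return wq_wake_all(&sock_wq, NULL) == 2 &&
           task.runnable == 1 && ready.count == 1;
}
```

In this model the socket's queue does not care who is waiting; the behaviour on wakeup is decided entirely by the callback stored in each node.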

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;

    spin_lock_irqsave(&ep->lock, flags);
    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto out_unlock;
    /*
     * Check the events coming with the callback. At this stage, not
     * every device reports the events in the "key" parameter of the
     * callback. We need to be able to handle both cases here, hence the
     * test for "key" != NULL before the event match test.
     */
    if (key && !((unsigned long) key & epi->event.events))
        goto out_unlock;
    /*
     * If we are transferring events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happen during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        if (epi->next == EP_UNACTIVE_PTR) {
            epi->next = ep->ovflist;
            ep->ovflist = epi;
        }
        goto out_unlock;
    }
    /* If this file is already in the ready list we exit soon */
    if (!ep_is_linked(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdllist);
    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    if (waitqueue_active(&ep->poll_wait))
        pwake++;
out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return 1;
}

The comments in this function are quite clear and make the purpose of each line easy to follow. In particular:

if (waitqueue_active(&ep->wq))
    wake_up_locked(&ep->wq);

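These two lines check whether any node is waiting on epoll's own wait queue and, if so, perform a wakeup. From the epoll user's point of view, if user space is blocked in epoll_wait, ep->wq is certainly not empty, so the process is woken up at this point and moved to the run queue.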
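The last two steps of the callback, adding the item to the ready list only once and then waking whoever sleeps in epoll_wait, can be modelled with a short sketch. It is a user-space simplification under stated assumptions: the _sketch types are hypothetical, a boolean stands in for waitqueue_active(&ep->wq), and clearing that boolean on wakeup is a shortcut (in the kernel the woken task removes itself from the queue later).

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct epitem with its rdllink membership. */
struct item_sketch {
    int fd;
    bool linked;    /* models ep_is_linked(&epi->rdllink) */
    struct item_sketch *next;
};

/* Stand-in for struct eventpoll: a ready list plus one sleeping waiter. */
struct eventpoll_sketch {
    struct item_sketch *ready_head;
    struct item_sketch *ready_tail;
    int ready_count;
    bool waiter_sleeping;    /* models waitqueue_active(&ep->wq) */
    int wakeups;
};

/* Models the tail of ep_poll_callback for one event on one item. */
static void callback_sketch(struct eventpoll_sketch *ep,
                            struct item_sketch *it)
{
    /* An item already on the ready list is not queued a second time. */
    if (!it->linked) {
        it->linked = true;
        it->next = NULL;
        if (ep->ready_tail)
            ep->ready_tail->next = it;
        else
            ep->ready_head = it;
        ep->ready_tail = it;
        ep->ready_count++;
    }
    /* Wake the task blocked in epoll_wait, if there is one. */
    if (ep->waiter_sleeping) {
        ep->waiter_sleeping = false;
        ep->wakeups++;
    }
}

/* Returns ready_count * 10 + wakeups after three events on two items. */
int demo_ready_list(void)
{
    struct eventpoll_sketch ep = { 0 };
    struct item_sketch s1 = { .fd = 3 };
    struct item_sketch s2 = { .fd = 4 };

    ep.waiter_sleeping = true;
    callback_sketch(&ep, &s1);    /* queued, waiter woken */
    callback_sketch(&ep, &s1);    /* already queued, nothing new */
    callback_sketch(&ep, &s2);    /* queued, waiter already awake */
    return ep.ready_count * 10 + ep.wakeups;
}
```

Three events on two descriptors leave two entries on the ready list and cause a single wakeup, which is why the woken task can collect several ready descriptors in one return from epoll_wait.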

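These two articles have more or less sorted out how epoll monitors multiple descriptors and how it gets notified. On the monitoring side, what is still missing is epoll's internal structure: how it stores the descriptors, how it maintains its bookkeeping, and so on. There are already many articles online about that, though. Perhaps I will write a couple more articles on the topic later.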