Understanding the implementation of Linux epoll
1. epoll_create
epoll_create allocates an eventpoll object; essentially all of epoll's work is done through this object.
The object is then bound to a pseudo file instance, and an anonymous fd connects user space with the kernel.
Note that at this point we only have the eventpoll object; it has not started doing any work yet.
2. epoll_ctl
Detailed walkthrough of the process: https://editor.csdn.net/md/?articleId=121450854
Redis network I/O: the Redis I/O path serves as a source-level example of this process.
epoll_ctl is passed an operation op, the target fd to operate on, and the epfd returned by epoll_create (locating the pseudo file object is the only use of the epfd descriptor); the event argument tells the kernel which events to monitor.
Note
epoll can monitor many files at once. The op argument describes a change to the set of monitored files (adding a file to monitor, modifying its registration, and so on), while event describes which file events to watch for, such as readable or writable. They are two different concepts.
This step involves two data structures, and again everything centers on our main object, eventpoll.
Difference #1 between epoll and select/poll: the data structure used for storage
Recall that select is backed by a fixed-size bitmap (fd_set): on a 64-bit system, an array of 16 longs, representing 1024 bits.
This array is the data structure passed between user space and the kernel, and the user must scan it to find ready events. As the number of monitored files grows, the scan takes longer and longer.
The obvious improvement is to become event-driven, and to replace the array with a better data structure.
epoll uses a red-black tree.
Red-black tree
struct rb_root
{
    struct rb_node *rb_node;
};

struct rb_node
{
    unsigned long rb_parent_color;  // packs the parent pointer and this node's color
#define RB_RED   0
#define RB_BLACK 1
    struct rb_node *rb_right;       // right child
    struct rb_node *rb_left;        // left child
} __attribute__((aligned(sizeof(long))));
epoll_ctl red-black tree insertion: ep_rbtree_insert
- Initialize an epitem, packaging the relevant information, then link its embedded rb_node into the tree.
When inserting a new node, the search for its position in the red-black tree is a while loop; once the loop exits we have found where the node belongs. rb_link_node is then called to attach the node to the tree. Its implementation is trivial: it simply hangs a new leaf onto a binary tree.
https://zhuanlan.zhihu.com/p/527192934
rb_link_node
rb_insert_color is then called to rebalance the whole red-black tree after the insertion.
3. epoll_wait: the key to event-driven operation
Once the monitored file handles have been added to epoll, calling epoll_wait() waits for one of them to change state. epoll_wait() blocks the current process, and the call returns when a monitored file's state changes.
Key data structure: rdllist (the ready list)
Check whether any file in the monitored set is ready; if so, return immediately.
If not, add the current process to the eventpoll wait queue (ep->wq) and go to sleep.
When is the process woken up?
The mechanism closely resembles wait and notify.
The process sleeps until one of the following happens:
a file in the monitored set becomes ready
a timeout was set and it expires
a signal is received
If there are ready files, ep_send_events() is called to copy them into the events parameter.
The number of ready files is returned.
4. A socket delivers an event and wakes the process sleeping in epoll_wait
1. The epoll_ctl stage
ep_insert
revents = ep_item_poll(epi, &epq.pt, 1);
TCP implements the poll interface with tcp_poll
The key part is here, in tcp_poll:
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
    struct sock *sk = sock->sk;
    ...
    poll_wait(file, sk->sk_sleep, wait);
    ...
    return mask;
}
Every socket object has a wait queue (waitqueue; see the article 等待队列原理与实现 for how wait queues work), used to hold processes waiting for the socket's state to change.
The code above shows that tcp_poll() calls poll_wait(), and poll_wait() ultimately invokes ep_ptable_queue_proc(), which is implemented as follows:
ep_ptable_queue_proc
/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct ep_pqueue *epq = container_of(pt, struct ep_pqueue, pt);
    struct epitem *epi = epq->epi;
    struct eppoll_entry *pwq;

    if (unlikely(!epi))    // an earlier allocation has failed
        return;

    pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);
    if (unlikely(!pwq)) {
        epq->epi = NULL;
        return;
    }

    init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
    pwq->whead = whead;
    pwq->base = epi;
    if (epi->event.events & EPOLLEXCLUSIVE)
        add_wait_queue_exclusive(whead, &pwq->wait);
    else
        add_wait_queue(whead, &pwq->wait);
    pwq->next = epi->pwqlist;
    epi->pwqlist = pwq;
}
The main job of ep_ptable_queue_proc() is to register a wait entry for the current epitem on the socket object's wait queue, with the wake-up callback set to ep_poll_callback(). In other words, when the socket's state changes, ep_poll_callback() is invoked. Its implementation is shown below.
ep_poll_callback
The main job of ep_poll_callback() is to add the file that just became ready to the eventpoll object's ready list (rdllist), and then wake up the process blocked in epoll_wait().
/*
 * This is the callback that is passed to the wait queue wakeup
 * mechanism. It is called by the stored file descriptors when they
 * have events to report.
 *
 * This callback takes a read lock in order not to contend with concurrent
 * events from another file descriptor, thus all modifications to ->rdllist
 * or ->ovflist are lockless. Read lock is paired with the write lock from
 * ep_scan_ready_list(), which stops all list modifications and guarantees
 * that lists state is seen correctly.
 *
 * Another thing worth to mention is that ep_poll_callback() can be called
 * concurrently for the same @epi from different CPUs if poll table was inited
 * with several wait queues entries. Plural wakeup from different CPUs of a
 * single wait queue is serialized by wq.lock, but the case when multiple wait
 * queues are used should be detected accordingly. This is detected using
 * cmpxchg() operation.
 */
static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;
    __poll_t pollflags = key_to_poll(key);
    unsigned long flags;
    int ewake = 0;

    read_lock_irqsave(&ep->lock, flags);

    ep_set_busy_poll_napi_id(epi);

    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto out_unlock;

    /*
     * Check the events coming with the callback. At this stage, not
     * every device reports the events in the "key" parameter of the
     * callback. We need to be able to handle both cases here, hence the
     * test for "key" != NULL before the event match test.
     */
    if (pollflags && !(pollflags & epi->event.events))
        goto out_unlock;

    /*
     * If we are transferring events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happen during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
        if (chain_epi_lockless(epi))
            ep_pm_stay_awake_rcu(epi);
    } else if (!ep_is_linked(epi)) {
        /* In the usual case, add event to ready list. */
        if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
            ep_pm_stay_awake_rcu(epi);
    }

    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    if (waitqueue_active(&ep->wq)) {
        if ((epi->event.events & EPOLLEXCLUSIVE) &&
            !(pollflags & POLLFREE)) {
            switch (pollflags & EPOLLINOUT_BITS) {
            case EPOLLIN:
                if (epi->event.events & EPOLLIN)
                    ewake = 1;
                break;
            case EPOLLOUT:
                if (epi->event.events & EPOLLOUT)
                    ewake = 1;
                break;
            case 0:
                ewake = 1;
                break;
            }
        }
        wake_up(&ep->wq);
    }
    if (waitqueue_active(&ep->poll_wait))
        pwake++;

out_unlock:
    read_unlock_irqrestore(&ep->lock, flags);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(ep, epi);

    if (!(epi->event.events & EPOLLEXCLUSIVE))
        ewake = 1;

    if (pollflags & POLLFREE) {
        /*
         * If we race with ep_remove_wait_queue() it can miss
         * ->whead = NULL and do another remove_wait_queue() after
         * us, so we can't use __remove_wait_queue().
         */
        list_del_init(&wait->entry);
        /*
         * ->whead != NULL protects us from the race with ep_free()
         * or ep_remove(), ep_remove_wait_queue() takes whead->lock
         * held by the caller. Once we nullify it, nothing protects
         * ep/epi or even wait.
         */
        smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
    }

    return ewake;
}