qemu2 event handling mechanism

Latest recommended article published on 2022-03-13 00:22:07

TangGeeA

Article link: https://blog.csdn.net/woai110120130/article/details/99693067

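The previous article analyzed glib's main event loop. Before analyzing QEMU's event dispatch mechanism, let's review the main APIs and the flow of the glib event loop.
The main event loop provides two kinds of mechanism:
1 Watching file descriptor events, such as a file becoming readable or writable
2 Timed tasks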

The main event loop involves three concepts:
1 GMainContext
2 GMainLoop
3 GSource

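GMainLoop represents an event loop and is tied to a thread; its member is_running controls the state of the loop (running/stopped). GMainContext represents the context of the main event loop and is bound to a GMainLoop. In the usual setup GMainContext and GMainLoop are one-to-one, and so are GMainLoop and thread (glib itself also allows several loops to share one context, for example nested loops).
GSource represents an event source. Multiple event sources can be attached to one GMainContext.
What is an event source?
As mentioned above, the main event loop mainly watches file descriptors for becoming readable or writable. Those are event sources. The expiry of a timed task is also a special kind of event source.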

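How does the glib main event loop watch file descriptors for readability and writability? Readers familiar with Linux programming will already have guessed: with functions such as poll/select/epoll.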

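All of these functions let you put file descriptors into a set, mark the events you are interested in, and then call poll/select/epoll and go to sleep. When the kernel sees an event of interest (read, write, etc.), it wakes the process sleeping in poll/select/epoll and tells the caller which events fired. Compared with blocking calls such as read/write, the advantage is that many file descriptors can be watched at once, and read or write is only called once the descriptor is actually ready, so it does not block. glib's watching of file descriptor read/write events is built on this.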

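poll/select/epoll also provide a timeout mechanism. A parameter sets the maximum time to sleep; if no event arrives, the kernel wakes the sleeping process when the timeout expires. Timed tasks are built on this.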
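
The two behaviours just described (wake on an fd event, or wake on timeout) can be seen with a few lines of plain poll(2). This is only a minimal sketch; wait_readable is a hypothetical helper name, not part of glib or QEMU.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Wait until fd is readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout (or no POLLIN), -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n;

    assert(fd >= 0);
    n = poll(&pfd, 1, timeout_ms);      /* the caller sleeps in the kernel here */
    if (n < 0) {
        return -1;                      /* error, e.g. interrupted by a signal */
    }
    if (n == 0) {
        return 0;                       /* woken by the timeout */
    }
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

On an empty pipe the call returns 0 once the timeout expires; after one byte is written to the other end it returns 1 immediately.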

The pseudocode of the main event loop is roughly as follows

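/* Pseudocode of the loop glib runs in g_main_loop_run(); arguments are
 * omitted and the calls do not show the real signatures */
g_main_loop_run() {
   while (GMainLoop.is_running) {
     int timeout = g_main_context_prepare();  /* 1: run prepare callbacks, work out the timeout */
     g_main_context_query();                  /* 2: collect the fds to watch */
     g_main_context_poll();                   /* 3: sleep in poll with that timeout */
     g_main_context_check();                  /* 4: find the sources that are ready */
     g_main_context_dispatch();               /* 5: dispatch the ready sources */
   }
}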

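1 g_main_context_prepare walks all GSources registered on the GMainContext and finds the timed task that expires soonest (this value later becomes the timeout for poll/select/epoll, so the timed task can be dispatched when it expires).
2 g_main_context_query collects the file descriptors the GSources care about.
3 g_main_context_poll calls poll/select/epoll to watch for file descriptor events. Note that the timeout is the one computed in the prepare phase.
4 g_main_context_check() runs after poll/select/epoll has returned. There are two reasons for returning: a watched file descriptor event arrived, or the timeout expired; both may happen together. (There is actually a third case: a new GSource was registered, or a GSource added a new file descriptor or a new timeout.) Either way poll/select/epoll has returned, and g_main_context_check calls each source's check method to confirm whether an event it cares about really arrived, collecting the sources that are ready.
5 g_main_context_dispatch() dispatches events, i.e. it dispatches the sources collected in step 4.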

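These five steps are backed by the callback functions of each event source.
The callbacks are supplied by whoever creates the source, as follows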

typedef struct {
  gboolean (*prepare)(GSource* source, gint* timeout);
  gboolean (*check)(GSource* source);
  gboolean (*dispatch)(GSource* source,
                       GSourceFunc callback,
                       gpointer user_data);
  void (*finalize)(GSource* source);
} GSourceFuncs;

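prepare, check and dispatch correspond by name to the phases above, so they are not explained further; finalize is called when the source is destroyed.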
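
To make the relationship between the loop phases and the callbacks concrete, here is a minimal sketch of one loop iteration over a single source, written with plain poll(2). MiniSource, mini_iterate and the fd_* callbacks are hypothetical names for illustration; they are not the glib types and glib's real implementation handles many sources, priorities and more.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical miniature of GSource + GSourceFuncs (not the glib types). */
typedef struct MiniSource MiniSource;
struct MiniSource {
    int  (*prepare)(MiniSource *s, int *timeout_ms); /* nonzero: ready now */
    int  (*check)(MiniSource *s, short revents);     /* nonzero: ready */
    void (*dispatch)(MiniSource *s);
    int fd;            /* fd watched for POLLIN */
    int dispatched;    /* how many times dispatch consumed a byte */
};

/* One iteration: prepare, query, poll, check, dispatch.
 * Returns 1 if the source was dispatched, 0 otherwise. */
int mini_iterate(MiniSource *s)
{
    int timeout_ms = -1;
    int ready;
    struct pollfd pfd;

    assert(s && s->prepare && s->check && s->dispatch);
    ready = s->prepare(s, &timeout_ms);            /* 1. prepare */
    if (ready) {
        timeout_ms = 0;                            /* already ready: do not sleep */
    }
    pfd.fd = s->fd;                                /* 2. query: collect the fd */
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, timeout_ms) < 0) {           /* 3. poll */
        return 0;
    }
    ready = ready || s->check(s, pfd.revents);     /* 4. check */
    if (ready) {
        s->dispatch(s);                            /* 5. dispatch */
    }
    return ready;
}

/* Example callbacks: an fd source that sleeps at most 50 ms. */
int fd_prepare(MiniSource *s, int *timeout_ms)
{
    (void)s;
    *timeout_ms = 50;
    return 0;
}

int fd_check(MiniSource *s, short revents)
{
    (void)s;
    return (revents & POLLIN) != 0;
}

void fd_dispatch(MiniSource *s)
{
    char c;
    if (read(s->fd, &c, 1) == 1) {
        s->dispatched++;
    }
}
```

With a pipe's read end as fd, mini_iterate returns 0 after the 50 ms timeout while the pipe is empty, and returns 1 (after running fd_dispatch) once a byte has been written.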

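The loop sketched above is what glib runs by default (via g_main_loop_run). QEMU does not use glib's loop driver, but its event handling still has to go through the same five steps to fit into the glib main event loop framework. Let's analyze QEMU's main event loop, which is an important part of understanding the QEMU architecture.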

Now let's look at the API that the glib main event loop provides

// Create a GSource
GSource *g_source_new(GSourceFuncs *source_funcs, guint struct_size);

// Set whether the source may be dispatched recursively
void g_source_set_can_recurse(GSource *source, gboolean can_recurse);

// Set the name of the GSource
void g_source_set_name(GSource *source, const char *name);

// Attach the GSource to a GMainContext; if the second argument is NULL
// the default GMainContext is used
guint g_source_attach(GSource *source, GMainContext *context);

// Get the default GMainContext
GMainContext *g_main_context_default(void);

// Run the loop in the current thread
void g_main_loop_run(GMainLoop *loop);

// Add a file descriptor to watch
void g_source_add_poll(GSource *source, GPollFD *fd);

// Add a timed task: set the time at which the source becomes ready
void g_source_set_ready_time(GSource *source, gint64 ready_time);

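Those are the commonly used APIs; there are many variants that are not covered here. To understand QEMU's main event loop, g_source_new deserves one more note. Its second parameter is a guint, so why does creating a GSource require a size? The reason is that the gmain API wants to let users attach their own data: the user defines a structure that embeds GSource (it must be the first member), and g_source_new allocates struct_size bytes but initializes only the GSource part. If you think of a C struct as a block of memory this is easy to follow: a cast simply reinterprets that memory, which is why C++ calls it reinterpret_cast. It is also one way of implementing polymorphism in C. Note that the first parameter of g_source_new is the GSourceFuncs described earlier.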
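
The embedding trick can be shown without glib. The following is a minimal sketch in which BaseSource stands in for GSource and base_source_new for g_source_new; both names and the fields are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for GSource: the "base class". */
typedef struct {
    int ref_count;
    const char *name;
} BaseSource;

/* User-defined source: BaseSource MUST be the first member so that a
 * pointer to FdSource is also a valid pointer to BaseSource. */
typedef struct {
    BaseSource base;
    int fd;              /* user data living after the base part */
} FdSource;

/* In the spirit of g_source_new: allocate struct_size bytes, but
 * initialize only the base part. Returns NULL on allocation failure. */
BaseSource *base_source_new(size_t struct_size)
{
    BaseSource *src;

    assert(struct_size >= sizeof(BaseSource));
    src = calloc(1, struct_size);
    if (src) {
        src->ref_count = 1;
    }
    return src;
}
```

A caller would write `FdSource *fs = (FdSource *)base_source_new(sizeof(FdSource));` and then fill in fs->fd itself; generic code keeps handling the object through a BaseSource pointer to the same address.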

With this background we can now analyze QEMU's main event loop.

First, let's look at the AioContext data structure and use it to explain the design of QEMU's aio.

struct AioContext {
    GSource source;  // the GSource that gets registered with a GMainContext
    RFifoLock lock;  // a nestable first-in-first-out lock

    /* The list of registered AIO handlers */
    QLIST_HEAD(, AioHandler) aio_handlers;  // describes the file descriptor events being watched

    /* This is a simple lock used to protect the aio_handlers list.
     * Specifically, it's used to ensure that no callbacks are removed while
     * we're walking and dispatching callbacks.
     */
    int walking_handlers;  // nonzero while aio_handlers is being walked; guards removal of its elements

    /* Used to avoid unnecessary event_notifier_set calls in aio_notify;
     * accessed with atomic primitives.  If this field is 0, everything
     * (file descriptors, bottom halves, timers) will be re-evaluated
     * before the next blocking poll(), thus the event_notifier_set call
     * can be skipped.  If it is non-zero, you may need to wake up a
     * concurrent aio_poll or the glib main event loop, making
     * event_notifier_set necessary.
     *
     * Bit 0 is reserved for GSource usage of the AioContext, and is 1
     * between a call to aio_ctx_prepare and the next call to aio_ctx_check.
     * Bits 1-31 simply count the number of active calls to aio_poll
     * that are in the prepare or poll phase.
     *
     * The GSource and aio_poll must use a different mechanism because
     * there is no certainty that a call to GSource's prepare callback
     * (via g_main_context_prepare) is indeed followed by check and
     * dispatch.  It's not clear whether this would be a bug, but let's
     * play safe and allow it---it will just cause extra calls to
     * event_notifier_set until the next call to dispatch.
     *
     * Instead, the aio_poll calls include both the prepare and the
     * dispatch phase, hence a simple counter is enough for them.
     */
    uint32_t notify_me;  // used to avoid redundant notification calls

    /* lock to protect between bh's adders and deleter */
    QemuMutex bh_lock;  // lock protecting the bh list

    /* Anchor of the list of Bottom Halves belonging to the context */
    struct QEMUBH *first_bh;  // the bh list

    /* A simple lock used to protect the first_bh list, and ensure that
     * no callbacks are removed while we're walking and dispatching callbacks.
     */
    int walking_bh;  // prevents bhs from being deleted during traversal

    /* Used by aio_notify.
     *
     * "notified" is used to avoid expensive event_notifier_test_and_clear
     * calls.  When it is clear, the EventNotifier is clear, or one thread
     * is going to clear "notified" before processing more events.  False
     * positives are possible, i.e. "notified" could be set even though the
     * EventNotifier is clear.
     *
     * Note that event_notifier_set *cannot* be optimized the same way.  For
     * more information on the problem that would result, see "#ifdef BUG2"
     * in the docs/aio_notify_accept.promela formal model.
     */
    bool notified;  // whether a notification has been sent
    EventNotifier notifier;  // the event notifier

    /* Scheduling this BH forces the event loop to iterate */
    QEMUBH *notify_dummy_bh;  // keeps the aio thread from sleeping while another thread is waiting for lock

    /* Thread pool for performing work and receiving completion callbacks */
    struct ThreadPool *thread_pool;  // thread pool

#ifdef CONFIG_LINUX_AIO
    /* State for native Linux AIO.  Uses aio_context_acquire/release for
     * locking.
     */
    struct LinuxAioState *linux_aio;
#endif

    /* TimerLists for calling timers - one per clock type */
    QEMUTimerListGroup tlg;  // timer lists

    int external_disable_cnt;  // counter for disabling external events

    /* epoll(7) state used when built with CONFIG_EPOLL */
    int epollfd;
    bool epoll_enabled;
    bool epoll_available;
};

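With the data structure in front of us, here is an overview of the aio design.
First, aio supports file descriptor events, which are described by aio_handlers. The file descriptor event API of aio may only be used from the aio thread, so walking_handlers exists to prevent elements of aio_handlers from being deleted while events are being dispatched. It is therefore a very simple lock (a counter) that is only used inside the aio thread; we will see it used during dispatch later.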
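
The idea behind walking_handlers, deferring removals while the list is being traversed, can be sketched as follows. This is a simplified illustration under assumed names (Handler, HandlerList, handler_walk and so on are hypothetical), not QEMU's actual code.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Handler {
    struct Handler *next;
    int fd;
    int deleted;          /* marked for removal during a walk */
} Handler;

typedef struct {
    Handler *head;
    int walking;          /* > 0 while the list is being traversed */
} HandlerList;

void handler_add(HandlerList *l, int fd)
{
    Handler *h = calloc(1, sizeof(*h));
    assert(h);
    h->fd = fd;
    h->next = l->head;
    l->head = h;
}

/* Unlink and free every handler flagged as deleted. */
static void handler_reap(HandlerList *l)
{
    Handler **pp = &l->head;
    while (*pp) {
        Handler *h = *pp;
        if (h->deleted) {
            *pp = h->next;
            free(h);
        } else {
            pp = &h->next;
        }
    }
}

/* Remove the handler for fd. During a walk it is only marked, so the
 * walker never follows a pointer into freed memory. */
void handler_remove(HandlerList *l, int fd)
{
    for (Handler *h = l->head; h; h = h->next) {
        if (h->fd == fd) {
            h->deleted = 1;
        }
    }
    if (!l->walking) {
        handler_reap(l);
    }
}

/* Call cb on each live handler; cb may call handler_remove.
 * Returns the number of handlers cb was called on. */
int handler_walk(HandlerList *l, void (*cb)(HandlerList *, Handler *))
{
    int n = 0;
    l->walking++;
    for (Handler *h = l->head; h; h = h->next) {
        if (!h->deleted) {
            cb(l, h);
            n++;
        }
    }
    if (--l->walking == 0) {
        handler_reap(l);    /* the outermost walk finished: really free now */
    }
    return n;
}

/* Example callback: each handler removes itself while being dispatched. */
void remove_self_cb(HandlerList *l, Handler *h)
{
    handler_remove(l, h->fd);
}
```

Because the counter is only touched by the thread doing the walk, no atomic operations or mutexes are needed, which matches the description of it as a simple lock used only within the aio thread.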
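notify_me, notified and notifier are used as a group. Their main purpose is to let other threads wake the aio context out of its sleep. For example, when we add a file descriptor we care about while the aio thread is waiting in poll, that poll might sleep for a long time and miss the newly added event, so notifier is used to wake it. The notifier mechanism is simple: it opens a pipe internally (on Linux an eventfd is used when available), and the readable event of the read end is registered as an event of interest, so writing to the write end wakes the aio thread from its sleep in poll. notify_me is also easy to understand: if the aio thread has not yet entered poll, or has not even begun collecting the file descriptors it cares about, waking it is pointless, so notify_me marks whether the aio thread needs to be woken. notified is used after poll wakes up to decide whether the readable event on the read end of the notification pipe needs to be consumed; it is only a small optimization.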
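
The pipe-based wake-up described for notifier is the classic self-pipe technique. Below is a minimal sketch of it; PipeNotifier and the notifier_* functions are hypothetical names for illustration, and QEMU's real EventNotifier differs in detail.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

typedef struct {
    int rfd;   /* read end: added to the poll set of the sleeping thread */
    int wfd;   /* write end: written by the thread that wants to wake it */
} PipeNotifier;

int notifier_init(PipeNotifier *n)
{
    int fds[2];

    if (pipe(fds) < 0) {
        return -1;
    }
    /* Non-blocking, so that draining or a full pipe never stalls anyone. */
    fcntl(fds[0], F_SETFL, O_NONBLOCK);
    fcntl(fds[1], F_SETFL, O_NONBLOCK);
    n->rfd = fds[0];
    n->wfd = fds[1];
    return 0;
}

/* Wake up whoever is polling rfd. */
void notifier_set(PipeNotifier *n)
{
    ssize_t r = write(n->wfd, "x", 1);
    (void)r;   /* a full pipe already guarantees a pending wake-up */
}

/* Drain the pipe; returns 1 if a notification was pending. */
int notifier_test_and_clear(PipeNotifier *n)
{
    char buf[64];
    int seen = 0;

    while (read(n->rfd, buf, sizeof(buf)) > 0) {
        seen = 1;
    }
    return seen;
}

/* Sleep in poll until notified or until timeout_ms elapses.
 * Returns 1 if woken by a notification, 0 otherwise. */
int notifier_wait(PipeNotifier *n, int timeout_ms)
{
    struct pollfd pfd = { .fd = n->rfd, .events = POLLIN, .revents = 0 };

    assert(n->rfd >= 0);
    if (poll(&pfd, 1, timeout_ms) <= 0) {
        return 0;
    }
    return notifier_test_and_clear(n);
}
```

Several notifier_set calls before the sleeper wakes collapse into a single wake-up, because the drain loop consumes every pending byte at once.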
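first_bh implements bottom halves (bh). Its main role is to have bh callbacks invoked inside the aio context. There are two kinds of bh: an idle bh has a default timeout of 10 ms, while a non-idle bh has a timeout of 0 ms. For more on bh, see the article on the qemu aio api.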
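
How the two kinds of bh feed into the poll timeout can be sketched like this. MiniBH and bh_compute_timeout_ns are hypothetical names; the 10 ms and 0 ms values come from the description above, and the rest is an assumption about how they would be combined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct MiniBH {
    struct MiniBH *next;
    int scheduled;   /* nonzero if the bh is waiting to run */
    int idle;        /* nonzero for an idle bh */
} MiniBH;

/* Poll timeout in nanoseconds implied by a bh list:
 *   0        if any non-idle bh is scheduled (do not sleep at all),
 *   10 ms    if only idle bhs are scheduled,
 *   -1       (block indefinitely) if nothing is scheduled. */
int64_t bh_compute_timeout_ns(const MiniBH *first_bh)
{
    int64_t timeout = -1;

    for (const MiniBH *bh = first_bh; bh; bh = bh->next) {
        if (!bh->scheduled) {
            continue;
        }
        if (!bh->idle) {
            return 0;            /* a non-idle bh must run right away */
        }
        timeout = 10000000;      /* idle bh: wake up within 10 ms */
    }
    return timeout;
}
```

The result would then be merged with the timer deadline, taking whichever is sooner, before the thread goes to sleep in poll.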

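QEMUBH *notify_dummy_bh and RFifoLock lock work together: when a thread wants to take lock and the lock is held by the aio thread, notify_dummy_bh is used to wake the aio thread so that it releases the lock as soon as possible.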

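tlg is the set of timer lists. It takes part in computing the timeout, and expired timer callbacks are invoked during dispatch.
thread_pool implements a thread pool; see the article on the qemu AIO thread pool.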

That is the design idea behind QEMU's aio. Now let's go through the code.

int qemu_init_main_loop(Error **errp)
{
    int ret;
    GSource *src;
    Error *local_error = NULL;

    init_clocks();

    ret = qemu_signal_init();
    if (ret) {
        return ret;
    }

    qemu_aio_context = aio_context_new(&local_error);
    if (!qemu_aio_context) {
        error_propagate(errp, local_error);
        return -EMFILE;
    }
    qemu_notify_bh = qemu_bh_new(notify_event_cb, NULL);
    gpollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
    src = aio_get_g_source(qemu_aio_context);
    g_source_attach(src, NULL);
    g_source_unref(src);
    src = iohandler_get_g_source();
    g_source_attach(src, NULL);
    g_source_unref(src);
    return 0;
}

aio_context_new creates the main thread's AioContext, and g_source_attach attaches its GSource to the default GMainContext.

main_loop_wait is the main thread's loop function

int main_loop_wait(int nonblocking)
{
    int ret;
    uint32_t timeout = UINT32_MAX;
    int64_t timeout_ns;

    if (nonblocking) {
        timeout = 0;
    }

    /* poll any events */
    g_array_set_size(gpollfds, 0); /* reset for new iteration */
    /* XXX: separate device handlers from system ones */
#ifdef CONFIG_SLIRP
    slirp_pollfds_fill(gpollfds, &timeout);
#endif

    if (timeout == UINT32_MAX) {
        timeout_ns = -1;
    } else {
        timeout_ns = (uint64_t)timeout * (int64_t)(SCALE_MS);
    }

    timeout_ns = qemu_soonest_timeout(timeout_ns,
                                      timerlistgroup_deadline_ns(
                                          &main_loop_tlg));

    ret = os_host_main_loop_wait(timeout_ns);

#if DEBUG_MAIN_LOOP_PERF
    static int iter_count = 0;
    static struct timespec lastreport;
    if (++iter_count == 3000) {
        iter_count = 0;
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (lastreport.tv_sec != 0) {
            long long diff = (now.tv_sec - lastreport.tv_sec)*1000000000ll +
                             now.tv_nsec - lastreport.tv_nsec;
            printf("%s: 3k iterations in %06.04f ms\n", __func__, diff / 1000000.0);
        }
        lastreport = now;
    }
#endif

#ifdef CONFIG_SLIRP
    slirp_pollfds_poll(gpollfds, (ret < 0));
#endif

    if (main_loop_poll_callback)
        (*main_loop_poll_callback)();

    /* CPU thread can infinitely wait for event after
       missing the warp */
    qemu_start_warp_timer();
    qemu_clock_run_all_timers();

    return ret;
}

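This function takes the smaller of the earliest deadline in the main thread's timer lists and the timeout given by the caller, then calls os_host_main_loop_wait to enter the overall flow. After os_host_main_loop_wait returns, the main thread's timer events are dispatched.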
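
Picking the smaller of two timeouts needs care because -1 means "no timeout, block indefinitely". A minimal sketch of one way to do this, using an unsigned comparison, is shown below; soonest_timeout_ns is a hypothetical name standing in for the role qemu_soonest_timeout plays in the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Return the sooner of two timeouts in nanoseconds, where -1 means
 * "block indefinitely". Cast to unsigned, -1 becomes the largest
 * possible value, so any real (non-negative) timeout compares smaller. */
int64_t soonest_timeout_ns(int64_t t1, int64_t t2)
{
    return ((uint64_t)t1 < (uint64_t)t2) ? t1 : t2;
}
```

For example, combining -1 with 5 gives 5, and combining -1 with -1 stays -1, so the loop only blocks indefinitely when neither side has a deadline.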

Let's look at os_host_main_loop_wait, which contains the main flow

static int os_host_main_loop_wait(int64_t timeout)
{
    int ret;
    static int spin_counter;

    glib_pollfds_fill(&timeout);

    /* If the I/O thread is very busy or we are incorrectly busy waiting in
     * the I/O thread, this can lead to starvation of the BQL such that the
     * VCPU threads never run.  To make sure we can detect the later case,
     * print a message to the screen.  If we run into this condition, create
     * a fake timeout in order to give the VCPU threads a chance to run.
     */
    if (!timeout && (spin_counter > MAX_MAIN_LOOP_SPIN)) {
        static bool notified;

        if (!notified && !qtest_driver()) {
            fprintf(stderr,
                    "main-loop: WARNING: I/O thread spun for %d iterations\n",
                    MAX_MAIN_LOOP_SPIN);
            notified = true;
        }

        timeout = SCALE_MS;
    }

    if (timeout) {
        spin_counter = 0;
        qemu_mutex_unlock_iothread();
    } else {
        spin_counter++;
    }

    ret = qemu_poll_ns((GPollFD *)gpollfds->data, gpollfds->len, timeout);

    if (timeout) {
        qemu_mutex_lock_iothread();
    }

    glib_pollfds_poll();
    return ret;
}

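glib_pollfds_fill calls the GSource prepare functions to bring in the smallest timeout and collects the file descriptors into the global gpollfds. qemu_poll_ns is then called to enter poll. After poll wakes up, glib_pollfds_poll runs the GSource check step to determine whether the events really occurred, and then runs GSource dispatch to dispatch them.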

/* qemu implementation of g_poll which uses a nanosecond timeout but is
 * otherwise identical to g_poll
 */
int qemu_poll_ns(GPollFD *fds, guint nfds, int64_t timeout)
{
#ifdef CONFIG_PPOLL
    if (timeout < 0) {
        return ppoll((struct pollfd *)fds, nfds, NULL, NULL);
    } else {
        struct timespec ts;
        int64_t tvsec = timeout / 1000000000LL;
        /* Avoid possibly overflowing and specifying a negative number of
         * seconds, which would turn a very long timeout into a busy-wait.
         */
        if (tvsec > (int64_t)INT32_MAX) {
            tvsec = INT32_MAX;
        }
        ts.tv_sec = tvsec;
        ts.tv_nsec = timeout % 1000000000LL;
        return ppoll((struct pollfd *)fds, nfds, &ts, NULL);
    }
#else
    return g_poll(fds, nfds, qemu_timeout_ns_to_ms(timeout));
#endif
}

static void glib_pollfds_poll(void)
{
    GMainContext *context = g_main_context_default();
    GPollFD *pfds = &g_array_index(gpollfds, GPollFD, glib_pollfds_idx);

    if (g_main_context_check(context, max_priority, pfds, glib_n_poll_fds)) {
        g_main_context_dispatch(context);
    }
}
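
In the non-ppoll branch of qemu_poll_ns above, the nanosecond timeout has to be converted to the millisecond timeout that g_poll takes. A minimal sketch of such a conversion follows; timeout_ns_to_ms is a hypothetical name, and rounding up plus clamping are assumptions about sensible behaviour rather than a copy of QEMU's helper.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a nanosecond timeout to milliseconds for poll()-style APIs.
 * -1 (block indefinitely) and 0 (do not block) are preserved. Other
 * values are rounded UP so that a short timeout never turns into a
 * busy-wait, and the result is clamped to INT32_MAX. */
int timeout_ns_to_ms(int64_t ns)
{
    int64_t ms;

    if (ns < 0) {
        return -1;
    }
    if (ns == 0) {
        return 0;
    }
    ms = ns / 1000000 + (ns % 1000000 != 0);   /* ceiling division */
    if (ms > INT32_MAX) {
        ms = INT32_MAX;
    }
    return (int)ms;
}
```

Rounding down would be the wrong choice here: a 1 ns timeout would become 0 ms and the loop would spin instead of sleeping.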

That completes the analysis of the overall flow. The remaining details are fairly fiddly, so they are not covered in depth here.

The environments in which the newer API may be called have changed somewhat; see the article on the qemu AIO thread model.