This article is intended for readers who already have some familiarity with QEMU live migration.
What is Multifd
Multifd is a new QEMU live-migration feature. It was not fully merged upstream until the end of June 2018 and is not yet entirely stable. In essence, it uses multiple fds during live migration, turning the originally serial sending and receiving of RAM into parallel sending and receiving.
What problems does Multifd solve
By default, QEMU performs live migration over a single fd. This causes three problems:
- On 10 Gigabit or faster networks, the receiving side's CPU becomes the bottleneck
- Although the pages being migrated could be sent directly, in the single-fd case they are first copied once before being sent
- Because sending and receiving are cumbersome, transparent huge pages are harder to use
To address these problems, Multifd migrates over multiple fds: the main fd carries control information, while the other fds carry the pages, avoiding the unnecessary copy.
How Multifd is designed
When migrating over a single fd, the migration stream looks like this:
- migration stream
[page header1][4k page 1][page header2][4k page 2]...
When Multifd adds extra fds to the migration, additional streams are created alongside the original one. The streams then look like this:
- migration stream
[page header1][page header 2]...
- additional fd
[4k page 1][4k page 2]...
This design means that:
- No copies are needed on send or receive; pages can be sent and received directly
- The migration stream itself needs no modification; pages are simply carried over side channels
- Huge pages can be sent and received directly like normal pages, making transparent huge pages easier to use
Using Multifd
In QEMU, the Multifd feature has two parameters:
- x-multifd-channels: the number of Multifd channels to use, 2 by default (not counting the main fd)
- x-multifd-page-count: the number of pages each send/receive thread handles per batch, 16 by default
Both parameters can be set before migration in the QEMU monitor via migrate_set_parameter parameter value. Note that each parameter must have the same value on the source and the destination.
In addition, before starting the migration, the x-multifd capability must be enabled on both the source and the destination via migrate_set_capability x-multifd on. In the current version, migration fails if either side leaves x-multifd disabled.
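As a concrete illustration, a source-side monitor session might look like the sketch below. The channel count, page count, and destination address are arbitrary placeholders, and the same capability and parameter settings must also be applied on the destination side before it starts listening:
(qemu) migrate_set_capability x-multifd on
(qemu) migrate_set_parameter x-multifd-channels 4
(qemu) migrate_set_parameter x-multifd-page-count 32
(qemu) migrate -d tcp:192.168.0.2:4444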
A brief look at the Multifd implementation
The code below is from QEMU 3.1.0-rc5. Since the destination side works on the same principles as the source, this section only walks through the key source-side code.
QIOChannel
Multifd's data transfer is built on QIOChannel, which is inspired by GIOChannel but differs from it in several ways. As the comment below notes, QIOChannel supports vectored I/O (iovecs).
/**
* QIOChannel:
*
* The QIOChannel defines the core API for a generic I/O channel
* class hierarchy. It is inspired by GIOChannel, but has the
* following differences
*
* - Use QOM to properly support arbitrary subclassing
* - Support use of iovecs for efficient I/O with multiple blocks
* - None of the character set translation, binary data exclusively
* - Direct support for QEMU Error object reporting
* - File descriptor passing
*
* This base class is abstract so cannot be instantiated. There
* will be subclasses for dealing with sockets, files, and higher
* level protocols such as TLS, WebSocket, etc.
*/
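To make the vectored-I/O point concrete, here is a minimal sketch (not taken from the QEMU source) that sends two non-contiguous pages in a single call. It assumes ioc is an already-connected QIOChannel and page1/page2 are valid host pointers:
struct iovec iov[2] = {
    { .iov_base = page1, .iov_len = TARGET_PAGE_SIZE },
    { .iov_base = page2, .iov_len = TARGET_PAGE_SIZE },
};
Error *local_err = NULL;
/* qio_channel_writev_all loops internally until every byte is written */
if (qio_channel_writev_all(ioc, iov, 2, &local_err) != 0) {
    error_report_err(local_err);
}
This is the same mechanism multifd_send_thread uses below to push a whole batch of pages without first packing them into a contiguous buffer.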
The live migration flow
The live migration flow can be abstracted into the following stages:
migrate_fd_connect
- migration_thread
  - qemu_savevm_state_setup: initializes every subsystem
    - ram_save_setup: the RAM setup function
  - migration_iteration_run: runs the migration
    - qemu_savevm_state_pending: computes how much data remains to migrate, used to decide whether to keep iterating
    - qemu_savevm_state_iterate: performs one iteration of the migration
      - ram_save_iterate: the RAM iteration function
    - migration_completion: performs the final stop-and-copy once the remaining data can be transferred within the expected downtime
      - qemu_savevm_state_complete_precopy: the path taken for precopy
        - ram_save_complete: the final RAM migration function
  - migration_iteration_finish: final cleanup
Key Multifd data structures
The following data structures are listed here for reference while reading the code in the rest of this section:
typedef struct {
uint32_t magic;
uint32_t version;
unsigned char uuid[16]; /* QemuUUID */
uint8_t id;
} __attribute__((packed)) MultiFDInit_t;
typedef struct {
uint32_t magic;
uint32_t version;
uint32_t flags;
uint32_t size;
uint32_t used;
uint64_t packet_num;
char ramblock[256];
uint64_t offset[];
} __attribute__((packed)) MultiFDPacket_t;
typedef struct {
/* number of used pages */
uint32_t used;
/* number of allocated pages */
uint32_t allocated;
/* global number of generated multifd packets */
uint64_t packet_num;
/* offset of each page */
ram_addr_t *offset;
/* pointer to each page */
struct iovec *iov;
RAMBlock *block;
} MultiFDPages_t;
typedef struct {
/* this fields are not changed once the thread is created */
/* channel number */
uint8_t id;
/* channel thread name */
char *name;
/* channel thread id */
QemuThread thread;
/* communication channel */
QIOChannel *c;
/* sem where to wait for more work */
QemuSemaphore sem;
/* this mutex protects the following parameters */
QemuMutex mutex;
/* is this channel thread running */
bool running;
/* should this thread finish */
bool quit;
/* thread has work to do */
int pending_job;
/* array of pages to sent */
MultiFDPages_t *pages;
/* packet allocated len */
uint32_t packet_len;
/* pointer to the packet */
MultiFDPacket_t *packet;
/* multifd flags for each packet */
uint32_t flags;
/* global number of generated multifd packets */
uint64_t packet_num;
/* thread local variables */
/* packets sent through this channel */
uint64_t num_packets;
/* pages sent through this channel */
uint64_t num_pages;
/* syncs main thread and channels */
QemuSemaphore sem_sync;
} MultiFDSendParams;
struct {
MultiFDSendParams *params;
/* number of created threads */
int count;
/* array of pages to sent */
MultiFDPages_t *pages;
/* syncs main thread and channels */
QemuSemaphore sem_sync;
/* global number of generated multifd packets */
uint64_t packet_num;
/* send channels ready */
QemuSemaphore channels_ready;
} *multifd_send_state;
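Taken together, the wire format follows directly from these structures: each send thread first writes a MultiFDPacket_t describing a batch (the ramblock name plus one offset[] entry per page), then writes the raw page bodies with vectored I/O. Because offset[] is a flexible array member, the packet's on-wire size is sizeof(MultiFDPacket_t) plus sizeof(ram_addr_t) per page, which is exactly the packet_len computed in multifd_save_setup below.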
A walkthrough of the Multifd code
Below, we trace Multifd's execution in the order of the migration flow:
migrate_fd_connect
- multifd_save_setup
ram_save_setup
- multifd_send_sync_main
ram_save_iterate
- ram_find_and_save_block
- multifd_send_sync_main
ram_save_complete
- ram_find_and_save_block
- multifd_send_sync_main
Within these:
ram_find_and_save_block: finds a dirty page and sends it to f
- ram_save_host_page: save a whole host page
  - ram_save_target_page: save one target page
    - ram_save_multifd_page
      - multifd_queue_page
        - multifd_send_pages
The individual functions
Readers who want the big picture first can read the final section now and come back to the individual functions afterwards.
multifd_save_setup
int multifd_save_setup(void)
{
int thread_count;
uint32_t page_count = migrate_multifd_page_count();
uint8_t i;
if (!migrate_use_multifd()) {
return 0;
}
thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
atomic_set(&multifd_send_state->count, 0);
multifd_send_state->pages = multifd_pages_init(page_count);
qemu_sem_init(&multifd_send_state->sem_sync, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
for (i = 0; i < thread_count; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
qemu_mutex_init(&p->mutex);
qemu_sem_init(&p->sem, 0);
qemu_sem_init(&p->sem_sync, 0);
p->quit = false;
p->pending_job = 0;
p->id = i;
p->pages = multifd_pages_init(page_count);
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(ram_addr_t) * page_count;
p->packet = g_malloc0(p->packet_len);
p->name = g_strdup_printf("multifdsend_%d", i);
socket_send_channel_create(multifd_new_send_channel_async, p);
// multifd_new_send_channel_async creates the source-side send thread multifd_send_thread
}
return 0;
}
This function initializes the Multifd data structures and creates the send threads.
multifd_send_thread
static void *multifd_send_thread(void *opaque)
{
MultiFDSendParams *p = opaque;
Error *local_err = NULL;
int ret;
trace_multifd_send_thread_start(p->id);
rcu_register_thread();
if (multifd_send_initial_packet(p, &local_err) < 0) {
// send the initial packet before any data; the receiving side validates it
goto out;
}
/* initial packet */
p->num_packets = 1;
while (true) {
qemu_sem_wait(&p->sem); // wait for multifd_send_sync_main or multifd_send_pages to post
qemu_mutex_lock(&p->mutex);
if (p->pending_job) { // there is pending work
uint32_t used = p->pages->used;
uint64_t packet_num = p->packet_num;
uint32_t flags = p->flags;
multifd_send_fill_packet(p);
p->flags = 0;
p->num_packets++;
p->num_pages += used;
p->pages->used = 0;
qemu_mutex_unlock(&p->mutex);
trace_multifd_send(p->id, packet_num, used, flags);
ret = qio_channel_write_all(p->c, (void *)p->packet,
p->packet_len, &local_err);
// send the packet, which carries the data size, flags, and other info the peer needs
if (ret != 0) {
break;
}
ret = qio_channel_writev_all(p->c, p->pages->iov, used, &local_err); // send the pages
if (ret != 0) {
break;
}
qemu_mutex_lock(&p->mutex);
p->pending_job--;
// this job is done
qemu_mutex_unlock(&p->mutex);
if (flags & MULTIFD_FLAG_SYNC) {
qemu_sem_post(&multifd_send_state->sem_sync);
// synchronize with multifd_send_sync_main
}
qemu_sem_post(&multifd_send_state->channels_ready);
// post channels_ready to tell multifd_send_pages that a send channel is available
} else if (p->quit) {
qemu_mutex_unlock(&p->mutex);
break;
} else {
qemu_mutex_unlock(&p->mutex);
/* sometimes there are spurious wakeups */
}
}
out:
if (local_err) {
multifd_send_terminate_threads(local_err);
}
qemu_mutex_lock(&p->mutex);
p->running = false;
qemu_mutex_unlock(&p->mutex);
rcu_unregister_thread();
trace_multifd_send_thread_end(p->id, p->num_packets, p->num_pages);
return NULL;
}
This function runs as a send thread (the side channel mentioned earlier) and performs the actual data transmission.
multifd_send_sync_main
static void multifd_send_sync_main(void)
{
int i;
if (!migrate_use_multifd()) {
return;
}
if (multifd_send_state->pages->used) {
multifd_send_pages();
// if there are still pages queued for sending, send them
}
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
// p is this channel's send descriptor
trace_multifd_send_sync_main_signal(p->id);
qemu_mutex_lock(&p->mutex);
p->packet_num = multifd_send_state->packet_num++;
p->flags |= MULTIFD_FLAG_SYNC;
p->pending_job++;
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem);
// with pending_job set, multifd_send_thread can be woken.
// this packet tells the receiving multifd_recv_thread that this batch of data is complete
}
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
trace_multifd_send_sync_main_wait(p->id);
qemu_sem_wait(&multifd_send_state->sem_sync);
// wait for each multifd_send_thread above to finish its send, synchronizing with it
}
trace_multifd_send_sync_main(multifd_send_state->packet_num);
}
This function synchronizes the multifd_send_threads with the main migration thread, migration_thread.
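To summarize the handshake visible in the code: migration_thread posts p->sem to hand a channel its work; after each send, the channel posts multifd_send_state->channels_ready so that multifd_send_pages knows a free channel exists; and when MULTIFD_FLAG_SYNC is set, the channel additionally posts multifd_send_state->sem_sync, which multifd_send_sync_main waits on once per channel.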
multifd_queue_page
static void multifd_queue_page(RAMBlock *block, ram_addr_t offset)
{
MultiFDPages_t *pages = multifd_send_state->pages;
if (!pages->block) {
pages->block = block;
}
// fill pages->iov and pages->offset for multifd_send_thread to use
if (pages->block == block) {
pages->offset[pages->used] = offset;
pages->iov[pages->used].iov_base = block->host + offset;
pages->iov[pages->used].iov_len = TARGET_PAGE_SIZE;
pages->used++;
if (pages->used < pages->allocated) {
return;
// while used < allocated (x-multifd-page-count), just return; only once used == allocated does multifd_send_pages run
}
}
multifd_send_pages();
if (pages->block != block) {
multifd_queue_page(block, offset);
}
}
This function queues the pages to be sent into multifd_send_state; once the count reaches page-count, multifd_send_pages hands the work over to a multifd_send_thread.
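As a rough illustration: with 4 KiB target pages and the default x-multifd-page-count of 16, each handoff to a channel covers 16 pages, i.e. 64 KiB of page data plus one header packet.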
multifd_send_pages
/*
* How we use multifd_send_state->pages and channel->pages?
*
* We create a pages for each channel, and a main one. Each time that
* we need to send a batch of pages we interchange the ones between
* multifd_send_state and the channel that is sending it. There are
* two reasons for that:
* - to not have to do so many mallocs during migration
* - to make easier to know what to free at the end of migration
*
* This way we always know who is the owner of each "pages" struct,
* and we don't need any loocking. It belongs to the migration thread
* or to the channel thread. Switching is safe because the migration
* thread is using the channel mutex when changing it, and the channel
* have to had finish with its own, otherwise pending_job can't be
* false.
*/
static void multifd_send_pages(void)
{
int i;
static int next_channel;
MultiFDSendParams *p = NULL; /* make happy gcc */
MultiFDPages_t *pages = multifd_send_state->pages;
uint64_t transferred;
// wait until some multifd_send_thread is ready
qemu_sem_wait(&multifd_send_state->channels_ready);
// find a channel with no pending_job and assign it the work
for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
p = &multifd_send_state->params[i];
qemu_mutex_lock(&p->mutex);
if (!p->pending_job) {
p->pending_job++;
next_channel = (i + 1) % migrate_multifd_channels();
break;
}
qemu_mutex_unlock(&p->mutex);
}
p->pages->used = 0;
p->packet_num = multifd_send_state->packet_num++;
p->pages->block = NULL;
// swap multifd_send_state's pages with p's pages
multifd_send_state->pages = p->pages;
p->pages = pages;
transferred = ((uint64_t) pages->used) * TARGET_PAGE_SIZE + p->packet_len;
ram_counters.multifd_bytes += transferred;
ram_counters.transferred += transferred;
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem); // wake multifd_send_thread
}
This function dispatches work to a channel when one becomes available and then wakes it up.
Note that on every send, the channel's send descriptor p->pages is swapped with multifd_send_state->pages.
The comment above the function explains why this is done:
- it avoids a large number of mallocs during migration
- it makes it easy to know exactly what to free at the end of migration
The Multifd source-side send flow
With the key functions covered, the complete source-side Multifd send flow becomes clear:
- Before migration_thread runs, multifd_save_setup performs initialization and creates the multifd_send_thread send threads. At this point, every multifd_send_thread is waiting on p->sem.
- migration_thread then enters the RAM setup path, ram_save_setup, which initializes RAM migration and performs the first synchronization between the main migration thread and the send threads (multifd_send_sync_main). Because multifd_pages_init initializes multifd_send_state->pages->used to 0, this first sync does not call multifd_send_pages. Once each multifd_send_thread has completed its first send, channels_ready and sem_sync are posted and the first sync finishes.
- The iteration and completion phases follow. ram_save_iterate and ram_save_complete invoke the key Multifd send-side functions in the same pattern: ram_find_and_save_block searches for dirty pages, calling multifd_queue_page to fill multifd_send_state->pages and then multifd_send_pages to hand multifd_send_state->pages to a channel, waking its multifd_send_thread to do the sending. After ram_find_and_save_block, the sync function multifd_send_sync_main is called again to synchronize the main migration thread with the multifd_send_threads.