This article is intended for readers who already have some familiarity with QEMU live migration.
What is Multifd
Multifd is a new QEMU live-migration feature. It was not fully merged upstream until the end of June 2018 and is not yet entirely stable. In essence, it uses multiple fds during live migration, turning the originally serial sending and receiving of RAM into parallel sending and receiving.
What problems does Multifd solve
By default, QEMU performs live migration over a single fd. This causes three problems:
- On 10 Gigabit or faster networks, the receiving side's CPU becomes the bottleneck
- Although the pages being migrated could be sent directly, in the single-fd case they are first copied once before being sent
- Because sending and receiving are cumbersome, transparent huge pages are harder to use
To address these problems, Multifd migrates over multiple fds: the main fd carries control information, while the other fds carry the pages, avoiding the unnecessary copy.
How Multifd is designed
When migrating over a single fd, the migration stream looks like this:
- migration stream
[page header1][4k page 1][page header2][4k page 2]...
When Multifd adds extra fds to the migration, additional streams are created alongside the original one. The streams then look like this:
- migration stream
[page header1][page header 2]...
- additional fd
[4k page 1][4k page 2]...
This design means that:
- No copies are needed on send or receive; pages can be sent and received directly
- The migration stream itself needs no modification; pages are simply carried over side channels
- Huge pages can be sent and received directly like normal pages, making transparent huge pages easier to use
Using Multifd
In QEMU, the Multifd feature has two parameters:
- x-multifd-channels: the number of Multifd channels to use, 2 by default (not counting the main fd)
- x-multifd-page-count: the number of pages each send/receive thread handles per batch, 16 by default
Both parameters can be set before migration in the QEMU monitor via migrate_set_parameter parameter value. Note that each parameter must have the same value on the source and the destination.
In addition, before starting the migration, the x-multifd capability must be enabled on both the source and the destination via migrate_set_capability x-multifd on. In the current version, migration fails if either side leaves x-multifd disabled.
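As a concrete illustration, a source-side monitor session might look like the sketch below. The channel count, page count, and destination address are arbitrary placeholders, and the same capability and parameter settings must also be applied on the destination side before it starts listening:
(qemu) migrate_set_capability x-multifd on
(qemu) migrate_set_parameter x-multifd-channels 4
(qemu) migrate_set_parameter x-multifd-page-count 32
(qemu) migrate -d tcp:192.168.0.2:4444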
A brief look at the Multifd implementation
The code below is from QEMU 3.1.0-rc5. Since the destination side works on the same principles as the source, this section only walks through the key source-side code.
QIOChannel
Multifd's data transfer is built on QIOChannel, which is inspired by GIOChannel but differs from it in several ways. As the comment below notes, QIOChannel supports vectored I/O (iovecs).
/**
* QIOChannel:
*
* The QIOChannel defines the core API for a generic I/O channel
* class hierarchy. It is inspired by GIOChannel, but has the
* following differences
*
* - Use QOM to properly support arbitrary subclassing
* - Support use of iovecs for efficient I/O with multiple blocks
* - None of the character set translation, binary data exclusively
* - Direct support for QEMU Error object reporting
* - File descriptor passing
*
* This base class is abstract so cannot be instantiated. There
* will be subclasses for dealing with sockets, files, and higher
* level protocols such as TLS, WebSocket, etc.
*/
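To make the vectored-I/O point concrete, here is a minimal sketch (not taken from the QEMU source) that sends two non-contiguous pages in a single call. It assumes ioc is an already-connected QIOChannel and page1/page2 are valid host pointers:
struct iovec iov[2] = {
    { .iov_base = page1, .iov_len = TARGET_PAGE_SIZE },
    { .iov_base = page2, .iov_len = TARGET_PAGE_SIZE },
};
Error *local_err = NULL;
/* qio_channel_writev_all loops internally until every byte is written */
if (qio_channel_writev_all(ioc, iov, 2, &local_err) != 0) {
    error_report_err(local_err);
}
This is the same mechanism multifd_send_thread uses below to push a whole batch of pages without first packing them into a contiguous buffer.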
The live migration flow
The live migration flow can be abstracted into the following stages:
migrate_fd_connect
- migration_thread
  - qemu_savevm_state_setup: initializes every subsystem
    - ram_save_setup: the RAM setup function
  - migration_iteration_run: runs the migration
    - qemu_savevm_state_pending: computes how much data remains to migrate, used to decide whether to keep iterating
    - qemu_savevm_state_iterate: performs one iteration of the migration
      - ram_save_iterate: the RAM iteration function
    - migration_completion: performs the final stop-and-copy once the remaining data can be transferred within the expected downtime
      - qemu_savevm_state_complete_precopy: the path taken for precopy
        - ram_save_complete: the final RAM migration function
  - migration_iteration_finish: final cleanup
Key Multifd data structures
The following data structures are listed here for reference while reading the code in the rest of this section:
typedef struct {
uint32_t magic;
uint32_t version;
unsigned char uuid[16]; /* QemuUUID */
uint8_t id;
} __attribute__((packed)) MultiFDInit_t;
typedef struct {
uint32_t magic;
uint32_t version;
uint32_t flags;
uint32_t size;
uint32_t used;
uint64_t packet_num;
char ramblock[256];
uint64_t offset[];
} __attribute__((packed)) MultiFDPacket_t;
typedef struct {
/* number of used pages */
uint32_t used;
/* number of allocated pages */
uint32_t allocated;
/* global number of generated multifd packets */
uint64_t packet_num;
/* offset of each page */
ram_addr_t *offset;
/* pointer to each page */
struct iovec *iov;
RAMBlock *block;
} MultiFDPages_t;
typedef struct {
/* this fields are not changed once the thread is created */
/* channel number */
uint8_t id;
/* channel thread name */
char *name;
/* channel thread id */
QemuThread thread;
/* communication channel */
QIOChannel *c;
/* sem where to wait for more work */
QemuSemaphore sem;
/* this mutex protects the following parameters */
QemuMutex mutex;
/* is this channel thread running */
bool running;
/* should this thread finish */
bool quit;
/* thread has work to do */
int pending_job;
/* array of pages to sent */
MultiFDPages_t *pages;
/* packet allocated len */
uint32_t packet_len;
/* pointer to the packet */
MultiFDPacket_t *packet;
/* multifd flags for each packet */
uint32_t flags;
/* global number of generated multifd packets */
uint64_t packet_num;
/* thread local variables */
/* packets sent through this channel */
uint64_t num_packets;
/* pages sent through this channel */
uint64_t num_pages;
/* syncs main thread and channels */
QemuSemaphore sem_sync;
} MultiFDSendParams;
struct {
MultiFDSendParams *params;
/* number of created threads */
int count;
/* array of pages to sent */
MultiFDPages_t *pages;
/* syncs main thread and channels */
QemuSemaphore sem_sync;
/* global number of generated multifd packets */
uint64_t packet_num;
/* send channels ready */
QemuSemaphore channels_ready;
} *multifd_send_state;
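Taken together, the wire format follows directly from these structures: each send thread first writes a MultiFDPacket_t describing a batch (the ramblock name plus one offset[] entry per page), then writes the raw page bodies with vectored I/O. Because offset[] is a flexible array member, the packet's on-wire size is sizeof(MultiFDPacket_t) plus sizeof(ram_addr_t) per page, which is exactly the packet_len computed in multifd_save_setup below.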
A walkthrough of the Multifd code
Below, we trace Multifd's execution in the order of the migration flow:
migrate_fd_connect
- multifd_save_setup
ram_save_setup
- multifd_send_sync_main
ram_save_iterate
- ram_find_and_save_block
- multifd_send_sync_main
ram_save_complete
- ram_find_and_save_block
- multifd_send_sync_main
Within these:
ram_find_and_save_block: finds a dirty page and sends it to f
- ram_save_host_page: save a whole host page
  - ram_save_target_page: save one target page
    - ram_save_multifd_page
      - multifd_queue_page
        - multifd_send_pages
The individual functions
Readers who want the big picture first can read the final section now and come back to the individual functions afterwards.
multifd_save_setup
int multifd_save_setup(void)
{
int thread_count;
uint32_t page_count = migrate_multifd_page_count();
uint8_t i;
if (!migrate_use_multifd()) {
return 0;
}
thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
atomic_set(&multifd_send_state->count, 0);
multifd_send_state->pages = multifd_pages_init(page_count);
qemu_sem_init(&multifd_send_state->sem_sync, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
for (i = 0; i < thread_count; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
qemu_mutex_init(&p->mutex);
qemu_sem_init(&p->sem, 0);
qemu_sem_init(&p->sem_sync, 0);
p->quit = false;
p->pending_job = 0;
p->id = i;
p->pages = multifd_pages_init(page_count);
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(ram_addr_t) * page_count;
p->packet = g_malloc0(p->packet_len);
p->name = g_strdup_printf("multifdsend_%d", i);
socket_send_channel_create(multifd_new_send_channel_async, p);
// multifd_new_send_channel_async creates the source-side send thread multifd_send_thread
}
return 0;
}
This function initializes the Multifd data structures and creates the send threads.
multifd_send_thread
static void *multifd_send_thread(void *opaque)
{
MultiFDSendParams *p = opaque;
Error *local_err = NULL;
int ret;
trace_multifd_send_thread_start(p->id);
rcu_register_thread();
if (multifd_send_initial_packet(p, &local_err) < 0) {
// send the initial packet before any data; the receiving side validates it
goto out;
}
/* initial packet */
p->num_packets = 1;
while (true) {
qemu_sem_wait(&p->sem); // wait for multifd_send_sync_main or multifd_send_pages to post
qemu_mutex_lock(&p->mutex);
if (p->pending_job) { // there is pending work
uint32_t used = p->pages->used;
uint64_t packet_num = p->packet_num;
uint32_t flags = p->flags;
multifd_send_fill_packet(p);
p->flags = 0;
p->num_packets++;
p->num_pages += used;
p->pages->used = 0;
qemu_mutex_unlock(&p->mutex);
trace_multifd_send(p->id, packet_num, used, flags);
ret = qio_channel_write_all(p->c, (void *)p->packet,
p->packet_len, &local_err);
// send the packet, which carries the data size, flags, and other info the peer needs
if (ret != 0) {
break;
}
ret = qio_channel_writev_all(p->c, p->pages->iov, used, &local_err); // send the pages
if (ret != 0) {
break;
}
qemu_mutex_lock(&p->mutex);
p->pending_job--;
// this job is done
qemu_mutex_unlock(&p->mutex);
if (flags & MULTIFD_FLAG_SYNC) {
qemu_sem_post(&multifd_send_state->sem_sync);
// synchronize with multifd_send_sync_main
}
qemu_sem_post(&multifd_send_state->channels_ready);
// post channels_ready to tell multifd_send_pages that a send channel is available
} else if (p->quit) {
qemu_mutex_unlock(&p->mutex);
break;
} else {
qemu_mutex_unlock(&p->mutex);
/* sometimes there are spurious wakeups */
}
}
out:
if (local_err) {
multifd_send_terminate_threads(local_err);
}
qemu_mutex_lock(&p->mutex);
p->running = false;
qemu_mutex_unlock(&p->mutex);
rcu_unregister_thread();
trace_multifd_send_thread_end(p->id, p->num_packets, p->num_pages);
return NULL;
}
This function runs as a send thread (the side channel mentioned earlier) and performs the actual data transmission.
multifd_send_sync_main
static void multifd_send_sync_main(void)
{
int i;
if (!migrate_use_multifd()) {
return;
}
if (multifd_send_state->pages->used) {
multifd_send_pages();
// if there are still pages queued for sending, send them
}
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
// p is this channel's send descriptor
trace_multifd_send_sync_main_signal(p->id);
qemu_mutex_lock(&p->mutex);
p->packet_num = multifd_send_state->packet_num++;
p->flags |= MULTIFD_FLAG_SYNC;
p->pending_job++;
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem);
// with pending_job set, multifd_send_thread can be woken.
// this packet tells the receiving multifd_recv_thread that this batch of data is complete
}
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
trace_multifd_send_sync_main_wait(p->id);
qemu_sem_wait(&multifd_send_state->sem_sync);
// wait for each multifd_send_thread above to finish its send, synchronizing with it
}
trace_multifd_send_sync_main(multifd_send_state->packet_num);
}
This function synchronizes the multifd_send_threads with the main migration thread, migration_thread.
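To summarize the handshake visible in the code: migration_thread posts p->sem to hand a channel its work; after each send, the channel posts multifd_send_state->channels_ready so that multifd_send_pages knows a free channel exists; and when MULTIFD_FLAG_SYNC is set, the channel additionally posts multifd_send_state->sem_sync, which multifd_send_sync_main waits on once per channel.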
multifd_queue_page
static void multifd_queue_page(RAMBlock *block, ram_addr_t offset)
{
MultiFDPages_t *pages = multifd_send_state->pages;
if (!pages->block) {
pages->block = block;
}
// fill pages->iov and pages->offset for multifd_send_thread to use
if (pages->block == block) {
pages->offset[pages->used] = offset;
pages->iov[pages->used].iov_base = block->host + offset;
pages->iov[pages->used].iov_len = TARGET_PAGE_SIZE;
pages->used++;
if (pages->used < pages->allocated) {
return;
// while used < allocated (x-multifd-page-count), just return; only once used == allocated does multifd_send_pages run
}
}
multifd_send_pages();
if (pages->block != block) {
multifd_queue_page(block, offset);
}
}
This function queues the pages to be sent into multifd_send_state; once the count reaches page-count, multifd_send_pages hands the work over to a multifd_send_thread.
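As a rough illustration: with 4 KiB target pages and the default x-multifd-page-count of 16, each handoff to a channel covers 16 pages, i.e. 64 KiB of page data plus one header packet.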
multifd_send_pages
/*
* How we use multifd_send_state->pages and channel->pages?
*
* We create a pages for each channel, and a main one. Each time that
* we need to send a batch of pages we interchange the ones between
* multifd_send_state and the channel that is sending it. There are
* two reasons for that:
* - to not have to do so many mallocs during migration
* - to make easier to know what to free at the end of migration
*
* This way we always know who is the owner of each "pages" struct,
* and we don't need any loocking. It belongs to the migration thread
* or to the channel thread. Switching is safe because the migration
* thread is using the channel mutex when changing it, and the channel
* have to had finish with its own, otherwise pending_job can't be
* false.
*/
static void multifd_send_pages(void)
{
int i;
static int next_channel;
MultiFDSendParams *p = NULL; /* make happy gcc */
MultiFDPages_t *pages = multifd_send_state->pages;
uint64_t transferred;
// wait until some multifd_send_thread is ready
qemu_sem_wait(&multifd_send_state->channels_ready);
// find a channel with no pending_job and assign it the work
for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
p = &multifd_send_state->params[i];
qemu_mutex_lock(&p->mutex);
if (!p->pending_job) {
p->pending_job++;
next_channel = (i + 1) % migrate_multifd_channels();
break;
}
qemu_mutex_unlock(&p->mutex);
}
p->pages->used = 0;
p->packet_num = multifd_send_state->packet_num++;
p->pages->block = NULL;
// swap multifd_send_state's pages with p's pages
multifd_send_state->pages = p->pages;
p->pages = pages;
transferred = ((uint64_t) pages->used) * TARGET_PAGE_SIZE + p->packet_len;
ram_counters.multifd_bytes += transferred;
ram_counters.transferred += transferred;
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem); // wake multifd_send_thread
}
This function dispatches work to a channel when one becomes available and then wakes it up.
Note that on every send, the channel's send descriptor p->pages is swapped with multifd_send_state->pages.
The comment above the function explains why this is done:
- it avoids a large number of mallocs during migration
- it makes it easy to know exactly what to free at the end of migration
The Multifd source-side send flow
With the key functions covered, the complete source-side Multifd send flow becomes clear:
- Before migration_thread runs, multifd_save_setup performs initialization and creates the multifd_send_thread send threads. At this point, every multifd_send_thread is waiting on p->sem.
- migration_thread then enters the RAM setup path, ram_save_setup, which initializes RAM migration and performs the first synchronization between the main migration thread and the send threads (multifd_send_sync_main). Because multifd_pages_init initializes multifd_send_state->pages->used to 0, this first sync does not call multifd_send_pages. Once each multifd_send_thread has completed its first send, channels_ready and sem_sync are posted and the first sync finishes.
- The iteration and completion phases follow. ram_save_iterate and ram_save_complete invoke the key Multifd send-side functions in the same pattern: ram_find_and_save_block searches for dirty pages, calling multifd_queue_page to fill multifd_send_state->pages and then multifd_send_pages to hand multifd_send_state->pages to a channel, waking its multifd_send_thread to do the sending. After ram_find_and_save_block, the sync function multifd_send_sync_main is called again to synchronize the main migration thread with the multifd_send_threads.