CPU Virtualization in QEMU

Overview

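KVM (Kernel-based Virtual Machine) is a kernel-based virtual machine monitor developed by the Israeli startup Qumranet after CPU vendors introduced hardware virtualization support.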

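KVM is a general virtualization framework rather than an x86-only one: besides x86, architectures such as ARM have their own implementations. The core KVM code therefore lives under virt/kvm in the kernel tree and holds the code common to all CPU architectures; it is the source behind the kernel module kvm.ko.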

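Architecture-specific code lives under arch/; for x86 it is in arch/x86/kvm. A single architecture may have several implementations: x86 has one for Intel CPUs and one for AMD CPUs, so the x86 directory contains vmx.c (Intel VT-x) and svm.c (AMD-V), which are the sources of kvm-intel.ko and kvm-amd.ko. ioapic.c and lapic.c in the same directory implement the interrupt controllers. This source layout is common in other Linux kernel subsystems as well.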

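Each vendor implementation (Intel and AMD) registers a kvm_x86_ops structure with the KVM module. Many functions in the generic KVM code are therefore thin shells: a generic function first calls a kvm_arch_xxx function for the architecture-specific part, and when kvm_arch_xxx needs vendor-specific behavior it calls the corresponding callback in kvm_x86_ops.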
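The layering described above can be sketched in plain C. This is a minimal illustration, not kernel code: struct demo_x86_ops, the demo_* functions, and the string results are hypothetical stand-ins for kvm_x86_ops, the kvm_arch_xxx layer, and the vendor callbacks.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct kvm_x86_ops: a table of vendor callbacks. */
struct demo_x86_ops {
    const char *(*vm_init)(void);
};

/* Hypothetical vendor implementations (think vmx.c and svm.c). */
static const char *demo_vmx_vm_init(void) { return "vmx"; }
static const char *demo_svm_vm_init(void) { return "svm"; }

static const struct demo_x86_ops demo_vmx_ops = { .vm_init = demo_vmx_vm_init };
static const struct demo_x86_ops demo_svm_ops = { .vm_init = demo_svm_vm_init };

/* The table that the vendor module registered, or NULL if none did. */
static const struct demo_x86_ops *demo_registered_ops;

/* Called by the vendor module when it is loaded. */
static void demo_register_ops(const struct demo_x86_ops *ops)
{
    demo_registered_ops = ops;
}

/* Architecture layer (a kvm_arch_xxx-style function): forwards to the vendor. */
static const char *demo_arch_init_vm(void)
{
    if (demo_registered_ops == NULL || demo_registered_ops->vm_init == NULL)
        return "none";
    return demo_registered_ops->vm_init();
}

/* Generic layer: a thin shell that calls into the architecture layer. */
static const char *demo_create_vm(void)
{
    return demo_arch_init_vm();
}
```

Calling demo_register_ops(&demo_vmx_ops) before demo_create_vm() makes the generic call end up in the Intel-style callback; registering demo_svm_ops instead redirects the same call without touching the generic layer.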

Relationship between kvm_intel.ko and kvm.ko:

[Figure: diagram with the labels user space, /dev/kvm, kvm.ko, kvm_intel.ko, kvm_init/kvm_exit, and vmx_x86_ops]

VM Creation

VM creation on the QEMU side

The entry points for QEMU's KVM support are mostly in kvm-all.c; the initialization function is kvm_init().

qemu
  accel
    kvm
      kvm-all.c
    xen
      xen-all.c
    ...

When QEMU is started with --enable-kvm on the command line, qemu_init() handles the option as follows:

case QEMU_OPTION_enable_kvm:
    olist = qemu_find_opts("machine");
    qemu_opts_parse_noisily(olist, "accel=kvm", false);
    break;

This adds accel=kvm to the machine option list. Later, main() calls configure_accelerator(current_machine), which reads the accel value from the machine options, looks up the matching accelerator type, and calls accel_init_machine().

int accel_init_machine(AccelState *accel, MachineState *ms)
{
    AccelClass *acc = ACCEL_GET_CLASS(accel);        /* get the accel class of the requested type (kvm here) */
    int ret;
    ms->accelerator = accel;
    *(acc->allowed) = true;
    ret = acc->init_machine(ms);    /* run that class's init_machine hook */
    if (ret < 0) {
        ms->accelerator = NULL;
        *(acc->allowed) = false;
        object_unref(OBJECT(accel));
    } else {
        object_set_accelerator_compat_props(acc->compat_props);
    }
    return ret;
}

So which function is init_machine for accel=kvm?

#define TYPE_KVM_ACCEL ACCEL_CLASS_NAME("kvm")    /* TYPE_KVM_ACCEL expands to "kvm-accel" */

kvm-all.c sets the init_machine hook in the class_init function referenced by the kvm_accel_type structure:

static void kvm_accel_class_init(ObjectClass *oc, void *data)
{
    AccelClass *ac = ACCEL_CLASS(oc);
    ac->name = "KVM";
    ac->init_machine = kvm_init;        /* the init_machine hook of the kvm accel is kvm_init() */
    ac->has_memory = kvm_accel_has_memory;
    ac->allowed = &kvm_allowed;

    ...
}

/* definition of the kvm_accel_type structure */
static const TypeInfo kvm_accel_type = {
    .name = TYPE_KVM_ACCEL,
    .parent = TYPE_ACCEL,
    .instance_init = kvm_accel_instance_init,
    .class_init = kvm_accel_class_init,
    .instance_size = sizeof(KVMState),
};

static void kvm_type_init(void)
{
    type_register_static(&kvm_accel_type);    /* register kvm_accel_type */
}

type_init(kvm_type_init);
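
How the class_init hook and accel_init_machine() fit together can be sketched without QEMU's object model. The following is a simplified illustration under stated assumptions: struct demo_accel_class and every demo_* name are hypothetical, only one accelerator class exists, and the real QOM type registry is replaced by a single static object.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-simplified stand-in for QEMU's AccelClass. */
struct demo_accel_class {
    const char *type_name;        /* for example "kvm-accel" */
    int (*init_machine)(void);    /* hook installed by class_init */
    bool *allowed;
};

static bool demo_kvm_allowed;
static int demo_kvm_init_calls;

/* Stand-in for kvm_init(): it only counts how often it was called. */
static int demo_kvm_init(void)
{
    demo_kvm_init_calls++;
    return 0;
}

/* Stand-in for kvm_accel_class_init(): installs the hooks. */
static void demo_kvm_class_init(struct demo_accel_class *ac)
{
    ac->init_machine = demo_kvm_init;
    ac->allowed = &demo_kvm_allowed;
}

static struct demo_accel_class demo_kvm_class = { .type_name = "kvm-accel" };

/* Stand-in for type registration followed by class initialization. */
static void demo_register_types(void)
{
    demo_kvm_class_init(&demo_kvm_class);
}

/* Look up a class by its full type name; only one class exists here. */
static struct demo_accel_class *demo_find_accel(const char *type_name)
{
    if (strcmp(type_name, demo_kvm_class.type_name) == 0)
        return &demo_kvm_class;
    return NULL;
}

/* Same shape as accel_init_machine(): mark allowed, run the hook, undo on failure. */
static int demo_accel_init_machine(struct demo_accel_class *ac)
{
    int ret;

    *(ac->allowed) = true;
    ret = ac->init_machine();
    if (ret < 0)
        *(ac->allowed) = false;
    return ret;
}
```

After demo_register_types(), looking up "kvm-accel" and passing the result to demo_accel_init_machine() runs demo_kvm_init() once, which mirrors how accel=kvm ends up in kvm_init().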

The kvm_init() function in kvm-all.c:

static int kvm_init(MachineState *ms)
{
    /* code omitted... */

    s = KVM_STATE(ms->accelerator);

    /* code omitted... */

    s->fd = qemu_open("/dev/kvm", O_RDWR);    /* open /dev/kvm and keep the file descriptor */

    /* code omitted... */

    do {
        ret = kvm_ioctl(s, KVM_CREATE_VM, type);    /* KVM_CREATE_VM on the /dev/kvm fd asks kvm.ko to create a VM */
    } while (ret == -EINTR);

    /* code omitted... */

    ret = kvm_arch_init(ms, s);        /* architecture-specific initialization */

    /* code omitted... */

    return ret;
}

kvm_init
  qemu_open (opens /dev/kvm)
  kvm_ioctl(KVM_CREATE_VM)
  kvm_arch_init

The main job of kvm_init() is to call the ioctl interfaces exposed by /dev/kvm and create a virtual machine in the kernel KVM module. One QEMU process corresponds to one VM.
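
The same sequence can be tried from a small user-space program. The sketch below assumes a Linux host with the kernel headers installed; demo_kvm_create_vm is a hypothetical helper name, and the function simply reports failure when /dev/kvm is missing or inaccessible. It is not how QEMU structures its code, only the bare ioctl sequence.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Open /dev/kvm, check the API version and create one VM.
 * Returns 0 on success and -1 if KVM is unavailable or any step fails.
 */
static int demo_kvm_create_vm(void)
{
    int kvm_fd, vm_fd, version;

    kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0)
        return -1;                /* no KVM support or no permission */

    version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
    if (version != KVM_API_VERSION) {
        close(kvm_fd);
        return -1;
    }

    /* The argument is the machine type; 0 selects the default. */
    vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
    if (vm_fd < 0) {
        close(kvm_fd);
        return -1;
    }

    /* Closing the VM fd releases the reference user space holds on the VM. */
    close(vm_fd);
    close(kvm_fd);
    return 0;
}
```

The file descriptor returned by KVM_CREATE_VM is what later VM-level ioctls such as KVM_CREATE_VCPU are issued on.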

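VM creation on the KVM side

The main entry points of the kernel KVM module are in kvm_main.c. The analysis below uses the KVM plus Intel combination as the example, so all architecture-specific discussion refers to Intel: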

Linux
  virt
    kvm
      kvm_main.c, eventfd.c, etc.
  arch
    x86
      kvm
        x86.c, vmx, svm, etc.

Data structures

In the kernel KVM module, struct kvm represents one virtual machine.

Initializing /dev/kvm

kvm_init() sets up the /dev/kvm device for QEMU to access and installs the corresponding ops callbacks.

vmx_init
  kvm_init
    misc_register(&kvm_dev)

On x86, KVM's ops object is kvm_x86_ops.

arch/x86/kvm/x86.c defines the global variable kvm_x86_ops:

struct kvm_x86_ops kvm_x86_ops __read_mostly;
EXPORT_SYMBOL_GPL(kvm_x86_ops);

kvm_x86_ops is a structure of function pointers; on Intel the pointers are filled in from vmx_x86_ops.

struct kvm_x86_ops {
    int (*hardware_enable)(void);
    void (*hardware_disable)(void);
    void (*hardware_unsetup)(void);
    bool (*cpu_has_accelerated_tpr)(void);
    bool (*has_emulated_msr)(u32 index);
    void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

    unsigned int vm_size;
    int (*vm_init)(struct kvm *kvm);
    void (*vm_destroy)(struct kvm *kvm);

    /* many more function pointers omitted */
};

In vmx.c, vmx_init() passes vmx_init_ops when it calls kvm_init():

    r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
             __alignof__(struct vcpu_vmx), THIS_MODULE);

vmx_init_ops is defined in arch/x86/kvm/vmx/vmx.c; the member that matters at run time is runtime_ops, which points to vmx_x86_ops:

static struct kvm_x86_init_ops vmx_init_ops __initdata = {
    .cpu_has_kvm_support = cpu_has_kvm_support,
    .disabled_by_bios = vmx_disabled_by_bios,
    .check_processor_compatibility = vmx_check_processor_compat,
    .hardware_setup = hardware_setup,

    .runtime_ops = &vmx_x86_ops,
};

vmx_x86_ops is likewise a file-scope static object, defined as:

static struct kvm_x86_ops vmx_x86_ops __initdata = {
    .hardware_unsetup = hardware_unsetup,

    .hardware_enable = hardware_enable,
    .hardware_disable = hardware_disable,
    .cpu_has_accelerated_tpr = report_flexpriority,
    .has_emulated_msr = vmx_has_emulated_msr,

    .vm_size = sizeof(struct kvm_vmx),
    .vm_init = vmx_vm_init,

    /* omitted... */
};

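kvm_main.c in the kernel defines the KVM device together with the file operations for the character device, the VM, and the vCPU, so that KVM can respond to requests from user space.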

static struct file_operations kvm_vcpu_fops = {
    .release        = kvm_vcpu_release,
    .unlocked_ioctl = kvm_vcpu_ioctl,
    .mmap           = kvm_vcpu_mmap,
    .llseek        = noop_llseek,
    KVM_COMPAT(kvm_vcpu_compat_ioctl),
};

static struct file_operations kvm_vm_fops = {
    .release        = kvm_vm_release,
    .unlocked_ioctl = kvm_vm_ioctl,
    .llseek        = noop_llseek,
    KVM_COMPAT(kvm_vm_compat_ioctl),
};

static struct file_operations kvm_chardev_ops = {
    .unlocked_ioctl = kvm_dev_ioctl,
    .llseek        = noop_llseek,
    KVM_COMPAT(kvm_dev_ioctl),
};

static struct miscdevice kvm_dev = {
    KVM_MINOR,
    "kvm",
    &kvm_chardev_ops,
};

kvm_preempt_ops.sched_in = kvm_sched_in;
kvm_preempt_ops.sched_out = kvm_sched_out;

kvm_dev_ioctl:

ioctl                     Handler / purpose
KVM_GET_API_VERSION       -
KVM_CREATE_VM             creates a VM: kvm_dev_ioctl_create_vm() --> kvm_create_vm()
KVM_CHECK_EXTENSION       checks for an extension: kvm_vm_ioctl_check_extension_generic()
KVM_GET_VCPU_MMAP_SIZE    returns the size of the memory shared between QEMU and KVM

kvm_vm_ioctl:

ioctl                          Handler / purpose
KVM_CREATE_VCPU                creates a vCPU: kvm_vm_ioctl_create_vcpu
KVM_ENABLE_CAP                 kvm_vm_ioctl_enable_cap_generic
KVM_SET_USER_MEMORY_REGION     kvm_vm_ioctl_set_memory_region
KVM_GET_DIRTY_LOG              kvm_vm_ioctl_get_dirty_log
KVM_REGISTER_COALESCED_MMIO    -
KVM_IRQFD                      kvm_irqfd
KVM_IOEVENTFD                  kvm_ioeventfd
KVM_CREATE_DEVICE              kvm_ioctl_create_device
KVM_CHECK_EXTENSION            kvm_vm_ioctl_check_extension_generic

kvm_vcpu_ioctl:

ioctl           Handler / purpose
KVM_RUN         runs the vCPU: kvm_arch_vcpu_ioctl_run()
KVM_GET_REGS    -
KVM_SET_REGS    -

[Figure: relationship among kvm_dev_ioctl, kvm_vm_ioctl, and kvm_vcpu_ioctl]

Creating a CPU in QEMU

[Figure: inheritance hierarchy of the CPU models in QEMU]

The x86 CPUs that QEMU supports are defined in target/i386/cpu.c, in the builtin_x86_defs array of type X86CPUDefinition:

/* Base definition for a CPU model */
typedef struct X86CPUDefinition {
    const char *name;
    uint32_t level;
    uint32_t xlevel;
    /* vendor is zero-terminated, 12 character ASCII string */
    char vendor[CPUID_VENDOR_SZ + 1];
    int family;
    int model;
    int stepping;
    FeatureWordArray features;
    const char *model_id;
    CPUCaches *cache_info;

    /* Use AMD EPYC encoding for apic id */
    bool use_epyc_apic_id_encoding;

    /*
     * Definitions for alternative versions of CPU model.
     * List is terminated by item with version == 0.
     * If NULL, version 1 will be registered automatically.
     */
    const X86CPUVersionDefinition *versions;
} X86CPUDefinition;

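The fields are:

X86CPUDefinition member             Purpose
name                                name of the CPU
level                               largest function number supported by the CPUID instruction
xlevel                              largest function number supported by the extended CPUID leaves
vendor, family, model, stepping     basic identification of the CPU
features                            array recording the CPU features
model_id                            full name of the CPU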

The builtin_x86_defs array:

static X86CPUDefinition builtin_x86_defs[] = {
    {
        .name = "qemu64",
        .level = 0xd,
        .vendor = CPUID_VENDOR_AMD,
        .family = 6,
        .model = 6,
        .stepping = 3,
        .features[FEAT_1_EDX] =
            PPRO_FEATURES |
            CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
            CPUID_PSE36,
        .features[FEAT_1_ECX] =
            CPUID_EXT_SSE3 | CPUID_EXT_CX16,
        .features[FEAT_8000_0001_EDX] =
            CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
        .features[FEAT_8000_0001_ECX] =
            CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM,
        .xlevel = 0x8000000A,
        .model_id = "QEMU Virtual CPU version " QEMU_HW_VERSION,
    },

    ... /* more than 2000 lines of definitions */
};
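
To make the role of such a definition table concrete, here is a minimal sketch of a lookup by model name followed by a feature test. The struct demo_cpu_def type, the demo_* functions, and the "demo-minimal" entry are hypothetical and much smaller than the real X86CPUDefinition; the two feature bits use the positions that CPUID leaf 1 reports in ECX (SSE3 is bit 0, CMPXCHG16B is bit 13).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Feature bits of CPUID leaf 1, register ECX. */
#define DEMO_EXT_SSE3 (1u << 0)
#define DEMO_EXT_CX16 (1u << 13)

/* Hypothetical, trimmed-down counterpart of X86CPUDefinition. */
struct demo_cpu_def {
    const char *name;
    int family;
    int model;
    int stepping;
    uint32_t feat_1_ecx;
};

static const struct demo_cpu_def demo_defs[] = {
    { .name = "qemu64", .family = 6, .model = 6, .stepping = 3,
      .feat_1_ecx = DEMO_EXT_SSE3 | DEMO_EXT_CX16 },
    { .name = "demo-minimal", .family = 6, .model = 1, .stepping = 0,
      .feat_1_ecx = 0 },    /* made-up entry, for illustration only */
};

/* Return the definition with the given name, or NULL if it is unknown. */
static const struct demo_cpu_def *demo_find_cpu(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(demo_defs) / sizeof(demo_defs[0]); i++) {
        if (strcmp(demo_defs[i].name, name) == 0)
            return &demo_defs[i];
    }
    return NULL;
}

/* Return 1 if the definition advertises the given leaf-1 ECX feature bit. */
static int demo_cpu_has(const struct demo_cpu_def *def, uint32_t bit)
{
    return (def->feat_1_ecx & bit) != 0;
}
```

A command-line option such as -cpu qemu64 is conceptually resolved in the same way: the name selects an entry, and the entry's feature words decide what CPUID reports to the guest.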

QEMU instantiates a virtual x86 CPU through struct X86CPU:

[Figure: struct X86CPU]

[Figure: call path for vCPU creation in QEMU]

Within that path, kvm_init_vcpu() in QEMU looks like this:

int kvm_init_vcpu(CPUState *cpu)
{
    /* parts of the code are omitted below; only the relevant lines are kept */
    ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu));        /* create the vCPU with KVM_CREATE_VCPU */

    mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);    /* query the size of the shared memory region */

    cpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        cpu->kvm_fd, 0);    /* map the region through the vCPU fd; KVM handles this in kvm_vcpu_mmap() */

    ret = kvm_arch_init_vcpu(cpu);

    return ret;
}

Creating a CPU in KVM

[Figure: vCPU creation in KVM]

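Data shared between QEMU and KVM

QEMU and KVM frequently need to share data. For example, KVM places VM exit information in shared memory, and QEMU reads it from there. The shared region is set up when QEMU creates a vCPU.

In kvm_init_vcpu(), QEMU calls kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0), which returns the size of the memory shared between QEMU and KVM.

KVM handles this ioctl in: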

static long kvm_dev_ioctl(struct file *filp,
              unsigned int ioctl, unsigned long arg)
{
    /* some code omitted */
    case KVM_GET_VCPU_MMAP_SIZE:
        if (arg)
            goto out;
        r = PAGE_SIZE;     /* struct kvm_run */
#ifdef CONFIG_X86
        r += PAGE_SIZE;    /* pio data page */
#endif
#ifdef CONFIG_KVM_MMIO
        r += PAGE_SIZE;    /* coalesced mmio ring page */
#endif
        break;

    return r;
}

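ioctl(KVM_GET_VCPU_MMAP_SIZE) can return one, two, or three pages. The first page holds kvm_run, the structure used for basic data exchange between QEMU and KVM; the second page stores the data when the guest accesses I/O ports; the last page is used for coalesced MMIO.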
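
The size computation can be restated as a small pure function. This is only a sketch that mirrors the handler shown above: demo_vcpu_mmap_size is a hypothetical name, and the two boolean parameters stand in for the CONFIG_X86 and CONFIG_KVM_MMIO build options.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Size of the per-vCPU shared region, following the logic of the
 * KVM_GET_VCPU_MMAP_SIZE handler. has_pio_page and has_coalesced_mmio
 * stand in for CONFIG_X86 and CONFIG_KVM_MMIO.
 */
static size_t demo_vcpu_mmap_size(size_t page_size, bool has_pio_page,
                                  bool has_coalesced_mmio)
{
    size_t size = page_size;        /* struct kvm_run */

    if (has_pio_page)
        size += page_size;          /* pio data page */
    if (has_coalesced_mmio)
        size += page_size;          /* coalesced mmio ring page */
    return size;
}
```

With 4096-byte pages and both options enabled, the function yields 12288 bytes, that is, the three pages described above.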

QEMU then mmaps the shared memory. On the kernel side:

static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
{
    struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
    struct page *page;

    if (vmf->pgoff == 0)
        page = virt_to_page(vcpu->run);
#ifdef CONFIG_X86
    else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
        page = virt_to_page(vcpu->arch.pio_data);
#endif
#ifdef CONFIG_KVM_MMIO
    else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
        page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
#endif
    else
        return kvm_arch_vcpu_fault(vcpu, vmf);
    get_page(page);
    vmf->page = page;
    return 0;
}

static const struct vm_operations_struct kvm_vcpu_vm_ops = {
    .fault = kvm_vcpu_fault,
};

static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
{
    vma->vm_ops = &kvm_vcpu_vm_ops;
    return 0;
}

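When QEMU calls mmap on the vCPU fd, which is an anonymous file, the kernel only reserves virtual address space and sets the operations of that range to kvm_vcpu_vm_ops, which has a single callback, the fault handler kvm_vcpu_fault. kvm_vcpu_fault is called when QEMU touches the shared memory and triggers a page fault; as the code shows, the kernel then ties the corresponding data page to QEMU's virtual address space.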

Shared memory page    Actual data accessed
page1                 kvm_vcpu->run
page2                 kvm_vcpu->arch.pio_data
page3                 kvm->coalesced_mmio_ring
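
The mapping in the table can be written as a lookup from page offset to backing buffer. The sketch below is a user-space model of what kvm_vcpu_fault decides, not kernel code: struct demo_vcpu_pages and demo_backing_page are hypothetical, and the offsets 1 and 2 are the values assumed here for KVM_PIO_PAGE_OFFSET and KVM_COALESCED_MMIO_PAGE_OFFSET on x86.

```c
#include <stddef.h>

/* Assumed x86 values of KVM_PIO_PAGE_OFFSET and KVM_COALESCED_MMIO_PAGE_OFFSET. */
#define DEMO_PIO_PAGE_OFFSET            1UL
#define DEMO_COALESCED_MMIO_PAGE_OFFSET 2UL

/* Hypothetical model of the three kernel buffers behind the mapping. */
struct demo_vcpu_pages {
    void *run;                    /* vcpu->run */
    void *pio_data;               /* vcpu->arch.pio_data */
    void *coalesced_mmio_ring;    /* vcpu->kvm->coalesced_mmio_ring */
};

/* Return the buffer that backs page pgoff of the mapping, or NULL. */
static void *demo_backing_page(const struct demo_vcpu_pages *vcpu,
                               unsigned long pgoff)
{
    if (pgoff == 0)
        return vcpu->run;
    if (pgoff == DEMO_PIO_PAGE_OFFSET)
        return vcpu->pio_data;
    if (pgoff == DEMO_COALESCED_MMIO_PAGE_OFFSET)
        return vcpu->coalesced_mmio_ring;
    return NULL;    /* the kernel falls back to kvm_arch_vcpu_fault() here */
}
```

Because the pages are resolved lazily in the fault handler, mmap itself stays cheap, and a page that QEMU never touches is never mapped.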

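Running a vCPU

Running a vCPU in QEMU

Every vCPU has a corresponding VMCS (Virtual Machine Control Structure), the key data structure with which Intel x86 processors record vCPU state for CPU virtualization. The physical address of the VMCS is passed as an operand to the VMX instructions. A VMCS can be in one of four states:

  • Inactive: the VMCS has only been allocated and initialized, or VMCLEAR has been executed on it.
  • Working: the CPU has executed VMPTRLD on the VMCS, or a VM exit has occurred; the CPU is still in VMX root mode.
  • Active: VMPTRLD was executed on this VMCS and the same CPU has since executed VMPTRLD for another vCPU; this is the state of the earlier VMCS.
  • Controlling: the CPU has executed VMLAUNCH on the VMCS and is in VMX non-root mode.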

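Section 31.6 of the Intel SDM describes the steps needed to get a virtual machine running:

  1. Allocate a 4 KB-aligned VMCS region in non-pageable memory. Its size is obtained from the IA32_VMX_BASIC MSR. In KVM this is done mainly by vmx_create_vcpu calling alloc_vmcs.
  2. Initialize the version identifier of the VMCS region (the first 31 bits), which is also obtained from the IA32_VMX_BASIC MSR, and clear bit 31 of the first 4 bytes of the region. In KVM this happens in alloc_vmcs_cpu.
  3. Execute VMCLEAR with the physical address of the VMCS as the operand. This sets the CPU's working-VMCS pointer to FFFFFFFF_FFFFFFFFH. After the instruction completes, check that RFLAGS.CF = 0 and RFLAGS.ZF = 0. In KVM this is done by loaded_vmcs_clear, which ends up calling vmcs_clear.
  4. Execute VMPTRLD with the physical address of the VMCS. The CPU's working-VMCS pointer now points to the physical address of the VMCS region. In KVM this is done by vmx_vcpu_load calling vmcs_load.
  5. Execute VMWRITE to initialize the host-state area of the VMCS. On a VM exit, this area is used to restore the host's CPU state and context. It covers the control registers (CR0, CR3, and CR4), the segment registers (CS, SS, DS, ES, FS, GS, TR), RSP, RIP, and some MSRs. In KVM this is done mainly in vmx_vcpu_setup.
  6. Execute VMWRITE to initialize the VM-exit control, VM-entry control, and VM-execution control areas of the VMCS. Some fields must follow what the VMX capability MSRs report; for example, an MSR may report that certain bits can only be 0 on the current CPU. In KVM this is done mainly in vmx_vcpu_setup.
  7. Execute VMWRITE to initialize the guest-state area. When the CPU enters VMX non-root mode, it builds its context from this data. In KVM this is done mainly in vmx_vcpu_reset.
  8. The guest-state settings must satisfy the following conditions:
     ① If the VM is to emulate a complete OS booting from the BIOS, the guest state must be set to the power-on state of a physical CPU.
     ② Guest state that the VMM cannot intercept must be set correctly, such as the general-purpose registers, the CR2 control register, the debug registers, and the floating-point registers.
  9. Execute VMLAUNCH to put the CPU in VMX non-root mode. If this fails, RFLAGS.CF or RFLAGS.ZF is set. In KVM this is done in vmx_vcpu_run.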
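
Steps 1 and 2 depend on two fields of the IA32_VMX_BASIC MSR: bits 30:0 hold the VMCS revision identifier and bits 44:32 hold the size of the VMCS region in bytes. The sketch below decodes both from a value passed in as a parameter, since reading the MSR itself requires ring 0. The demo_* names are hypothetical, and any sample value fed to these functions is made up for illustration.

```c
#include <stdint.h>

/* Bits 30:0 of IA32_VMX_BASIC: VMCS revision identifier. */
static uint32_t demo_vmcs_revision_id(uint64_t vmx_basic)
{
    return (uint32_t)(vmx_basic & 0x7fffffffu);
}

/* Bits 44:32 of IA32_VMX_BASIC: size in bytes of the VMCS region. */
static uint32_t demo_vmcs_region_size(uint64_t vmx_basic)
{
    return (uint32_t)((vmx_basic >> 32) & 0x1fffu);
}

/*
 * Step 2 of the list: store the revision identifier in the first 4 bytes
 * of the region. Bit 31 ends up clear because the identifier is 31 bits wide.
 */
static void demo_init_vmcs_header(uint32_t *vmcs_region, uint64_t vmx_basic)
{
    vmcs_region[0] = demo_vmcs_revision_id(vmx_basic);
}
```

For a made-up MSR value of 0x00DA040000000004, the revision identifier decodes to 4 and the region size to 0x400 bytes.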

The thread routine of a vCPU thread in QEMU is:

static void *qemu_kvm_cpu_thread_fn(void *arg)
{
    /* omitted */
    r = kvm_init_vcpu(cpu);

    kvm_init_cpu_signals(cpu);

    /* signal CPU creation */
    cpu->created = true;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_guest_random_seed_thread_part2(cpu->random_seed);

    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);    /* core of vCPU execution */
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
        qemu_wait_io_event(cpu);    /* when the vCPU cannot run, wait on the cpu->halt_cond condition */
    } while (!cpu->unplug || cpu_can_run(cpu));

    /* omitted */
    return NULL;
}

The core vCPU execution function in QEMU is kvm_cpu_exec(), whose heart is also a do {} while () loop.

int kvm_cpu_exec(CPUState *cpu)
{
    /* omitted */

    do {
        /* omitted */
        kvm_arch_pre_run(cpu, run);

        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);

        attrs = kvm_arch_post_run(cpu, run);

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            DPRINTF("handle_io\n");
            /* Called outside BQL */
            kvm_handle_io(run->io.port, attrs,
                          (uint8_t *)run + run->io.data_offset,
                          run->io.direction,
                          run->io.size,
                          run->io.count);
            ret = 0;
            break;
        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            /* Called outside BQL */
            address_space_rw(&address_space_memory,
                             run->mmio.phys_addr, attrs,
                             run->mmio.data,
                             run->mmio.len,
                             run->mmio.is_write);
            ret = 0;
            break;

        /* omitted */

        case KVM_EXIT_SYSTEM_EVENT:

        default:
            DPRINTF("kvm_arch_handle_exit\n");
            ret = kvm_arch_handle_exit(cpu, run);
            break;
        }
    } while (ret == 0);
    /* omitted */
    return ret;
}

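kvm_arch_pre_run first does some preparation before running, such as injecting NMIs and SMIs. QEMU then issues ioctl(KVM_RUN) on the vCPU to start it. While handling this ioctl, the KVM module executes the appropriate VMX instruction to switch the physical CPU running the vCPU from VMX root mode to VMX non-root mode, and the guest code starts executing. When an event inside the guest causes a VM exit, control returns to KVM; if KVM cannot handle the exit, it is passed on to QEMU. That is, when ioctl(KVM_RUN) returns, QEMU calls kvm_arch_post_run for some initial processing, then determines the exit reason from the kvm_run data shared between QEMU and KVM and handles it accordingly. For an I/O exit it calls kvm_handle_io, which dispatches to the callback of the device that registered the I/O port. As the code shows, much of the data used here comes from kvm_run. If the exit was caused by an MMIO access, QEMU calls address_space_rw, which finds the device that registered the MMIO region and calls its callback.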
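
The shape of this loop can be reproduced with a scripted stand-in for the guest. The sketch below is a simulation under stated assumptions, not QEMU code: demo_kvm_run replaces ioctl(KVM_RUN) and just replays a fixed list of exit reasons, struct demo_run stands in for the shared kvm_run page, and all demo_* names and the return values are hypothetical.

```c
#include <stddef.h>

enum demo_exit { DEMO_EXIT_IO, DEMO_EXIT_MMIO, DEMO_EXIT_SHUTDOWN };

/* Stands in for the shared kvm_run page. */
struct demo_run {
    enum demo_exit exit_reason;
};

struct demo_vcpu {
    const enum demo_exit *script;    /* scripted sequence of exits */
    size_t len;
    size_t pos;
    struct demo_run run;
    int io_handled;
    int mmio_handled;
};

/* Stand-in for ioctl(KVM_RUN): "runs the guest" until the next scripted exit. */
static int demo_kvm_run(struct demo_vcpu *vcpu)
{
    if (vcpu->pos >= vcpu->len)
        return -1;
    vcpu->run.exit_reason = vcpu->script[vcpu->pos++];
    return 0;
}

/* Shape of kvm_cpu_exec(): keep looping while each exit is handled (ret == 0). */
static int demo_cpu_exec(struct demo_vcpu *vcpu)
{
    int ret;

    do {
        if (demo_kvm_run(vcpu) < 0)
            return -1;

        switch (vcpu->run.exit_reason) {
        case DEMO_EXIT_IO:
            vcpu->io_handled++;      /* a device callback would run here */
            ret = 0;
            break;
        case DEMO_EXIT_MMIO:
            vcpu->mmio_handled++;    /* address_space_rw() would run here */
            ret = 0;
            break;
        default:
            ret = 1;                 /* leave the loop and return to the caller */
            break;
        }
    } while (ret == 0);

    return ret;
}
```

The point of the sketch is the control flow: the exit reason is read from the shared structure after every return from the run call, and only exits that cannot be handled in the loop end it.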

[Figure: relationship among QEMU, KVM, and the VM]

Running a vCPU in KVM

kvm_vcpu_ioctl

The request is handled in kvm_vcpu_ioctl, and the main work is finally done by vcpu_run() in arch/x86/kvm/x86.c:

kvm_vcpu_ioctl
  kvm_arch_vcpu_ioctl_run
    vcpu_run

static struct file_operations kvm_vcpu_fops = {
    .release        = kvm_vcpu_release,
    .unlocked_ioctl = kvm_vcpu_ioctl,
    .mmap           = kvm_vcpu_mmap,
    .llseek        = noop_llseek,
    KVM_COMPAT(kvm_vcpu_compat_ioctl),
};

How does kvm_vcpu_ioctl() make sure the request is served on behalf of the right vCPU thread? The function starts with the following checks:

if (vcpu->kvm->mm != current->mm)
    return -EIO;
switch (ioctl) {
    case KVM_RUN: {
        struct pid *oldpid;
        r = -EINVAL;
        if (arg)
            goto out;
        oldpid = rcu_access_pointer(vcpu->pid);        //the thread running this vCPU may have changed
        if (unlikely(oldpid != task_pid(current))) {
            /* The thread running this VCPU changed. */
            struct pid *newpid;

            r = kvm_arch_vcpu_run_pid_change(vcpu);
            if (r)
                break;

            newpid = get_task_pid(current, PIDTYPE_PID);
            rcu_assign_pointer(vcpu->pid, newpid);        //the thread changed: update vcpu->pid to the pid of current
            if (oldpid)
                synchronize_rcu();
            put_pid(oldpid);
        }
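        /* per-vCPU statistics could be collected here and the thread running the vCPU could be tagged; but if per-vCPU statistics are collected, is tagging the thread still needed? */
        r = kvm_arch_vcpu_ioctl_run(vcpu);        //enter the architecture-specific vCPU run code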
        trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
        break;
    }

kvm_arch_vcpu_ioctl_run

Next comes kvm_arch_vcpu_ioctl_run(); the x86 version is analyzed here:

int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
    struct kvm_run *kvm_run = vcpu->run;
    int r;

    vcpu_load(vcpu);
    //code omitted

    if (kvm_run->immediate_exit)
        r = -EINTR;
    else
        r = vcpu_run(vcpu);    //the main work happens in vcpu_run()

out:
    kvm_put_guest_fpu(vcpu);
    if (kvm_run->kvm_valid_regs)
        store_regs(vcpu);
    post_kvm_run_save(vcpu);
    kvm_sigset_deactivate(vcpu);

    vcpu_put(vcpu);
    return r;
}
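
vcpu_load and vcpu_put

vcpu_load loads a vCPU onto the current physical CPU; vcpu_put does the opposite.

KVM defines a per-CPU variable, kvm_running_vcpu, that records which vCPU, if any, is loaded on each physical CPU.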

static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);

vcpu_load() mainly assigns kvm_running_vcpu:

/*
 * Switches to specified vcpu, until a matching vcpu_put()
 */
void vcpu_load(struct kvm_vcpu *vcpu)
{
    int cpu = get_cpu();    //disable preemption and return the CPU id

    __this_cpu_write(kvm_running_vcpu, vcpu);    //set the per-CPU variable kvm_running_vcpu to this vCPU
    preempt_notifier_register(&vcpu->preempt_notifier);
    kvm_arch_vcpu_load(vcpu, cpu);
    put_cpu();        //re-enable preemption
}
EXPORT_SYMBOL_GPL(vcpu_load);

vcpu_put() is the counterpart of vcpu_load():

void vcpu_put(struct kvm_vcpu *vcpu)
{
    preempt_disable();
    kvm_arch_vcpu_put(vcpu);
    preempt_notifier_unregister(&vcpu->preempt_notifier);
    __this_cpu_write(kvm_running_vcpu, NULL);
    preempt_enable();
}
EXPORT_SYMBOL_GPL(vcpu_put);

vcpu_run

static int vcpu_run(struct kvm_vcpu *vcpu)
{
    /* omitted */
    for (;;) {
        if (kvm_vcpu_running(vcpu)) {
            r = vcpu_enter_guest(vcpu);        /* the vCPU can run: enter the guest through vcpu_enter_guest */
        } else {
            r = vcpu_block(kvm, vcpu);        /* the vCPU cannot run: leaving polling aside, call schedule() to yield the CPU */
        }

        if (r <= 0)
            break;

        /* omitted */
    }
    /* omitted */

    return r;
}


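/* Two conditions are checked:
 * 1. whether mp_state in vcpu.arch is KVM_MP_STATE_RUNNABLE
 * 2. apf.halted in vcpu.arch, which indicates whether the guest needs a page that the host has swapped out; a vCPU halted because of an async page fault (apf) cannot run either
 */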
static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
{
    if (is_guest_mode(vcpu))
        kvm_x86_ops.nested_ops->check_events(vcpu);

    return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
        !vcpu->arch.apf.halted);
}

If vcpu_run decides that the vCPU cannot run, it calls vcpu_block, which calls kvm_vcpu_block. Leaving the polling mechanism aside, kvm_vcpu_block calls schedule() to yield the CPU.

vcpu_block
  kvm_vcpu_block
    schedule

void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
    /* omitted */
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);

        if (kvm_vcpu_check_block(vcpu) < 0)
            break;

        waited = true;
        schedule();
    }
    /* omitted */
}

vcpu_enter_guest

If vcpu_enter_guest() returns 1, vcpu_run() stays in its for loop; otherwise the value is returned to user space.

/*
 * Returns 1 to let vcpu_run() continue the guest execution loop without
 * exiting to the userspace.  Otherwise, the value will be returned to the
 * userspace.
 */
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
    /* omitted ... */

    r = kvm_mmu_reload(vcpu);
    if (unlikely(r)) {
        goto cancel_injection;
    }

    preempt_disable();        //disable preemption

    kvm_x86_ops.prepare_guest_switch(vcpu);    //save host state so the host can run normally after the VM exits

    /*
     * Disable IRQs before setting IN_GUEST_MODE.  Posted interrupt
     * IPI are then delayed after guest entry, which ensures that they
     * result in virtual interrupt delivery.
     * External interrupt requests to the CPU are disabled here. */
    local_irq_disable();
    vcpu->mode = IN_GUEST_MODE;        //enter guest mode

    //omitted
    trace_kvm_entry(vcpu->vcpu_id);        //trace the KVM entry here; the KVM exit is traced in vmx_vcpu_run()
    //omitted

    exit_fastpath = kvm_x86_ops.run(vcpu);        //calls vmx_vcpu_run()

    //omitted

    vcpu->arch.last_vmentry_cpu = vcpu->cpu;
    vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

    vcpu->mode = OUTSIDE_GUEST_MODE;    //leave guest mode
    smp_wmb();

    kvm_x86_ops.handle_exit_irqoff(vcpu);    //after the VM exit, handle external interrupts

    /*
     * Consume any pending interrupts, including the possible source of
     * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
     * An instruction is required after local_irq_enable() to fully unblock
     * interrupts on processors that implement an interrupt shadow, the
     * stat.exits increment will do nicely.
     */
    kvm_before_interrupt(vcpu);
    local_irq_enable();
    ++vcpu->stat.exits;                //count the exit in the statistics
    local_irq_disable();
    kvm_after_interrupt(vcpu);

    if (lapic_in_kernel(vcpu)) {
        s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
        if (delta != S64_MIN) {
            trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
            vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
        }
    }

    local_irq_enable();
    preempt_enable();

    //omitted

    r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath);    //dispatch on the exit reason; external interrupts were already handled above, so for them only statistics are updated
    return r;

cancel_injection:
    if (req_immediate_exit)
        kvm_make_request(KVM_REQ_EVENT, vcpu);
    kvm_x86_ops.cancel_injection(vcpu);
    if (unlikely(vcpu->arch.apic_attention))
        kvm_lapic_sync_from_vapic(vcpu);
out:
    return r;
}

This function calls into vmx_vcpu_run for the vCPU. By the time vmx_vcpu_run returns, one full round of VM entry and VM exit has completed.
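
The return-value convention between vcpu_enter_guest() and vcpu_run() can be modeled in a few lines. The sketch below is a simulation, not kernel code: demo_enter_guest replays a scripted list of results in place of real VM entries, and struct demo_guest and the demo_* names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical scripted guest: results[i] is what the i-th guest entry returns. */
struct demo_guest {
    const int *results;
    size_t len;
    size_t pos;
    int entries;    /* how many times the guest was entered */
};

/* Stand-in for vcpu_enter_guest(): one VM entry plus VM exit round trip. */
static int demo_enter_guest(struct demo_guest *g)
{
    g->entries++;
    if (g->pos >= g->len)
        return 0;                 /* script exhausted: go back to user space */
    return g->results[g->pos++];
}

/*
 * Shape of vcpu_run(): a result of 1 means the exit was handled in the
 * kernel and the guest is re-entered; 0 or a negative value ends the loop
 * and is returned toward user space.
 */
static int demo_vcpu_run(struct demo_guest *g)
{
    int r;

    for (;;) {
        r = demo_enter_guest(g);
        if (r <= 0)
            break;
    }
    return r;
}
```

With the script {1, 1, 1, 0} the guest is entered four times before a single return to user space, which is why exits that KVM can handle itself are much cheaper than exits that reach QEMU.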

vcpu->mode takes one of the following values:

enum {
    OUTSIDE_GUEST_MODE,
    IN_GUEST_MODE,
    EXITING_GUEST_MODE,
    READING_SHADOW_PAGE_TABLES,
};

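Host interrupts are disabled when the CPU enters guest mode, so a CPU running guest code does not service external interrupts itself; instead, an external interrupt forces the CPU out of guest mode and back into VMX root mode. The interrupt is serviced before handle_exit runs, so by the time handle_exit sees the external-interrupt exit there is nothing left to do except update statistics.

vmx_vcpu_run
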
static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
    fastpath_t exit_fastpath;
    struct vcpu_vmx *vmx = to_vmx(vcpu);
    unsigned long cr3, cr4;

reenter_guest:
    /* Record the guest's net vcpu time for enforced NMI injections. */
    if (unlikely(!enable_vnmi &&
             vmx->loaded_vmcs->soft_vnmi_blocked))
        vmx->loaded_vmcs->entry_time = ktime_get();

    /* Don't enter VMX if guest state is invalid, let the exit handler
       start emulation until we arrive back to a valid state */
    if (vmx->emulation_required)
        return EXIT_FASTPATH_NONE;

    if (vmx->ple_window_dirty) {
        vmx->ple_window_dirty = false;
        vmcs_write32(PLE_WINDOW, vmx->ple_window);
    }

    /*
     * We did this in prepare_switch_to_guest, because it needs to
     * be within srcu_read_lock.
     */
    WARN_ON_ONCE(vmx->nested.need_vmcs12_to_shadow_sync);

    if (kvm_register_is_dirty(vcpu, VCPU_REGS_RSP))
        vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
    if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
        vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);

    cr3 = __get_current_cr3_fast();
    if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
        vmcs_writel(HOST_CR3, cr3);
        vmx->loaded_vmcs->host_state.cr3 = cr3;
    }

    cr4 = cr4_read_shadow();
    if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
        vmcs_writel(HOST_CR4, cr4);
        vmx->loaded_vmcs->host_state.cr4 = cr4;
    }

    /* When single-stepping over STI and MOV SS, we must clear the
     * corresponding interruptibility bits in the guest state. Otherwise
     * vmentry fails as it then expects bit 14 (BS) in pending debug
     * exceptions being set, but that's not correct for the guest debugging
     * case. */
    if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
        vmx_set_interrupt_shadow(vcpu, 0);

    kvm_load_guest_xsave_state(vcpu);

    pt_guest_enter(vmx);

    atomic_switch_perf_msrs(vmx);

    if (enable_preemption_timer)
        vmx_update_hv_timer(vcpu);

    if (lapic_in_kernel(vcpu) &&
        vcpu->arch.apic->lapic_timer.timer_advance_ns)
        kvm_wait_lapic_expire(vcpu);

    /*
     * If this vCPU has touched SPEC_CTRL, restore the guest's value if
     * it's non-zero. Since vmentry is serialising on affected CPUs, there
     * is no need to worry about the conditional branch over the wrmsr
     * being speculatively taken.
     */
    x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);

    /* The actual VMENTER/EXIT is in the .noinstr.text section. */
    vmx_vcpu_enter_exit(vcpu, vmx);

    /*
     * We do not use IBRS in the kernel. If this vCPU has used the
     * SPEC_CTRL MSR it may have left it on; save the value and
     * turn it off. This is much more efficient than blindly adding
     * it to the atomic save/restore list. Especially as the former
     * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
     *
     * For non-nested case:
     * If the L01 MSR bitmap does not intercept the MSR, then we need to
     * save it.
     *
     * For nested case:
     * If the L02 MSR bitmap does not intercept the MSR, then we need to
     * save it.
     */
    if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
        vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);

    x86_spec_ctrl_restore_host(vmx->spec_ctrl, 0);

    /* All fields are clean at this point */
    if (static_branch_unlikely(&enable_evmcs))
        current_evmcs->hv_clean_fields |=
            HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;

    if (static_branch_unlikely(&enable_evmcs))
        current_evmcs->hv_vp_id = vcpu->arch.hyperv.vp_index;

    /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
    if (vmx->host_debugctlmsr)
        update_debugctlmsr(vmx->host_debugctlmsr);

#ifndef CONFIG_X86_64
    /*
     * The sysexit path does not restore ds/es, so we must set them to
     * a reasonable value ourselves.
     *
     * We can't defer this to vmx_prepare_switch_to_host() since that
     * function may be executed in interrupt context, which saves and
     * restore segments around it, nullifying its effect.
     */
    loadsegment(ds, __USER_DS);
    loadsegment(es, __USER_DS);
#endif

    vmx_register_cache_reset(vcpu);

    pt_guest_exit(vmx);

    kvm_load_host_xsave_state(vcpu);

    vmx->nested.nested_run_pending = 0;
    vmx->idt_vectoring_info = 0;

    if (unlikely(vmx->fail)) {
        vmx->exit_reason = 0xdead;
        return EXIT_FASTPATH_NONE;
    }

    vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
    if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
        kvm_machine_check();

    trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);

    if (unlikely(vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
        return EXIT_FASTPATH_NONE;

    vmx->loaded_vmcs->launched = 1;
    vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);

    vmx_recover_nmi_blocking(vmx);
    vmx_complete_interrupts(vmx);

    if (is_guest_mode(vcpu))
        return EXIT_FASTPATH_NONE;

    exit_fastpath = vmx_exit_handlers_fastpath(vcpu);
    if (exit_fastpath == EXIT_FASTPATH_REENTER_GUEST) {
        if (!kvm_vcpu_exit_request(vcpu)) {
            /*
             * FIXME: this goto should be a loop in vcpu_enter_guest,
             * but it would incur the cost of a retpoline for now.
             * Revisit once static calls are available.
             */
            if (vcpu->arch.apicv_active)
                vmx_sync_pir_to_irr(vcpu);
            goto reenter_guest;
        }
        exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED;
    }

    return exit_fastpath;
}

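vmx_vcpu_run first writes some VMCS fields according to the vCPU state and then calls vmx_vcpu_enter_exit(), which executes VMLAUNCH (VMRESUME once the VMCS has been launched) to put the CPU in guest mode and start running guest code. In older kernels this step was inline assembly using ASM_VMX_VMLAUNCH, and execution resumed at the vmx_return label after a VM exit.

vCPU Exit

x86

vCPU exit events are handled by kvm_x86_ops.handle_exit(), called from arch/x86/kvm/x86.c: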

static int vcpu_enter_guest(struct kvm_vcpu *vcpu){
    //omitted
    r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath);
}

Exit reasons

#define VMX_EXIT_REASONS_FAILED_VMENTRY         0x80000000

#define EXIT_REASON_EXCEPTION_NMI       0
#define EXIT_REASON_EXTERNAL_INTERRUPT  1
#define EXIT_REASON_TRIPLE_FAULT        2
#define EXIT_REASON_INIT_SIGNAL            3

#define EXIT_REASON_INTERRUPT_WINDOW    7
#define EXIT_REASON_NMI_WINDOW          8
#define EXIT_REASON_TASK_SWITCH         9
#define EXIT_REASON_CPUID               10
#define EXIT_REASON_HLT                 12
#define EXIT_REASON_INVD                13
#define EXIT_REASON_INVLPG              14
#define EXIT_REASON_RDPMC               15
#define EXIT_REASON_RDTSC               16
#define EXIT_REASON_VMCALL              18
#define EXIT_REASON_VMCLEAR             19
#define EXIT_REASON_VMLAUNCH            20
#define EXIT_REASON_VMPTRLD             21
#define EXIT_REASON_VMPTRST             22
#define EXIT_REASON_VMREAD              23
#define EXIT_REASON_VMRESUME            24
#define EXIT_REASON_VMWRITE             25
#define EXIT_REASON_VMOFF               26
#define EXIT_REASON_VMON                27
#define EXIT_REASON_CR_ACCESS           28
#define EXIT_REASON_DR_ACCESS           29
#define EXIT_REASON_IO_INSTRUCTION      30
#define EXIT_REASON_MSR_READ            31
#define EXIT_REASON_MSR_WRITE           32
#define EXIT_REASON_INVALID_STATE       33
#define EXIT_REASON_MSR_LOAD_FAIL       34
#define EXIT_REASON_MWAIT_INSTRUCTION   36
#define EXIT_REASON_MONITOR_TRAP_FLAG   37
#define EXIT_REASON_MONITOR_INSTRUCTION 39
#define EXIT_REASON_PAUSE_INSTRUCTION   40
#define EXIT_REASON_MCE_DURING_VMENTRY  41
#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
#define EXIT_REASON_APIC_ACCESS         44
#define EXIT_REASON_EOI_INDUCED         45
#define EXIT_REASON_GDTR_IDTR           46
#define EXIT_REASON_LDTR_TR             47
#define EXIT_REASON_EPT_VIOLATION       48
#define EXIT_REASON_EPT_MISCONFIG       49
#define EXIT_REASON_INVEPT              50
#define EXIT_REASON_RDTSCP              51
#define EXIT_REASON_PREEMPTION_TIMER    52
#define EXIT_REASON_INVVPID             53
#define EXIT_REASON_WBINVD              54
#define EXIT_REASON_XSETBV              55
#define EXIT_REASON_APIC_WRITE          56
#define EXIT_REASON_RDRAND              57
#define EXIT_REASON_INVPCID             58
#define EXIT_REASON_VMFUNC              59
#define EXIT_REASON_ENCLS               60
#define EXIT_REASON_RDSEED              61
#define EXIT_REASON_PML_FULL            62
#define EXIT_REASON_XSAVES              63
#define EXIT_REASON_XRSTORS             64
#define EXIT_REASON_UMWAIT              67
#define EXIT_REASON_TPAUSE              68

vmx_handle_exit()

The exit finally reaches vmx_handle_exit(), which dispatches it to the handler for the exit reason:
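
The dispatch itself is an array of function pointers indexed by exit reason. The sketch below shows that pattern in isolation and is not the kernel's implementation: struct demo_vcpu, the demo_* handlers, and their behavior are hypothetical simplifications (for instance, the real handle_io can complete some port accesses inside the kernel), while the two exit-reason numbers are taken from the list of definitions in this section.

```c
#include <stddef.h>

#define DEMO_EXIT_REASON_CPUID          10
#define DEMO_EXIT_REASON_IO_INSTRUCTION 30

/* Hypothetical vCPU state, reduced to what the sketch needs. */
struct demo_vcpu {
    int userspace_exit;    /* set when the exit must be completed in user space */
    unsigned long rax;
};

/* Handled fully in the "kernel": return 1 so the guest can resume. */
static int demo_handle_cpuid(struct demo_vcpu *vcpu)
{
    vcpu->rax = 0;
    return 1;
}

/* Needs user space in this sketch: record that and return 0. */
static int demo_handle_io(struct demo_vcpu *vcpu)
{
    vcpu->userspace_exit = 1;
    return 0;
}

/* Sparse table built with designated initializers; unset slots are NULL. */
static int (*const demo_exit_handlers[])(struct demo_vcpu *vcpu) = {
    [DEMO_EXIT_REASON_CPUID]          = demo_handle_cpuid,
    [DEMO_EXIT_REASON_IO_INSTRUCTION] = demo_handle_io,
};

/* Look up the handler for the exit reason and call it. */
static int demo_handle_exit(struct demo_vcpu *vcpu, unsigned int reason)
{
    size_t n = sizeof(demo_exit_handlers) / sizeof(demo_exit_handlers[0]);

    if (reason < n && demo_exit_handlers[reason] != NULL)
        return demo_exit_handlers[reason](vcpu);

    vcpu->userspace_exit = 1;    /* unknown reason: report it to user space */
    return 0;
}
```

The return value follows the convention stated in the kernel comment: 1 means the guest may resume, 0 means user space has work to do.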

/*
 * The exit handlers return 1 if the exit was handled fully and guest execution
 * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
 * to be done to userspace and return 0.
 */
static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
    [EXIT_REASON_EXCEPTION_NMI]           = handle_exception_nmi, /* exceptions and non-maskable interrupts (NMI) */
    [EXIT_REASON_EXTERNAL_INTERRUPT]      = handle_external_interrupt, /* always returns 1; no real work here, can be ignored */
    [EXIT_REASON_TRIPLE_FAULT]            = handle_triple_fault, /* always returns 0 with KVM_EXIT_SHUTDOWN */
    [EXIT_REASON_NMI_WINDOW]              = handle_nmi_window, /* always returns 1; can be ignored */
    [EXIT_REASON_IO_INSTRUCTION]          = handle_io, /* I/O port instructions */
    [EXIT_REASON_CR_ACCESS]               = handle_cr, /* control register access */
    [EXIT_REASON_DR_ACCESS]               = handle_dr, /* debug register access */
    [EXIT_REASON_CPUID]                   = kvm_emulate_cpuid, /* emulates CPUID; reads and writes EAX/EBX/ECX/EDX */
    [EXIT_REASON_MSR_READ]                = kvm_emulate_rdmsr, /* emulates RDMSR; the result is returned in EDX:EAX */
    [EXIT_REASON_MSR_WRITE]               = kvm_emulate_wrmsr, /* emulates WRMSR; writes the MSR selected by ECX */
    [EXIT_REASON_INTERRUPT_WINDOW]        = handle_interrupt_window, /* always returns 1; can be ignored */
    [EXIT_REASON_HLT]                     = kvm_emulate_halt, /* HLT instruction; halts the vCPU */
    [EXIT_REASON_INVD]                    = handle_invd, /* calls kvm_emulate_instruction */
    [EXIT_REASON_INVLPG]                  = handle_invlpg, /* calls kvm_skip_emulated_instruction */
    [EXIT_REASON_RDPMC]                   = handle_rdpmc, /* x86 RDPMC instruction; reads a PMU counter */
    [EXIT_REASON_VMCALL]                  = handle_vmcall, /* VMCALL instruction; calls kvm_emulate_hypercall */
    [EXIT_REASON_VMCLEAR]                 = handle_vmx_instruction,
    [EXIT_REASON_VMLAUNCH]                = handle_vmx_instruction,
    [EXIT_REASON_VMPTRLD]                 = handle_vmx_instruction,
    [EXIT_REASON_VMPTRST]                 = handle_vmx_instruction,
    [EXIT_REASON_VMREAD]                  = handle_vmx_instruction,
    [EXIT_REASON_VMRESUME]                = handle_vmx_instruction,
    [EXIT_REASON_VMWRITE]                 = handle_vmx_instruction,
    [EXIT_REASON_VMOFF]                   = handle_vmx_instruction,
    [EXIT_REASON_VMON]                    = handle_vmx_instruction, /* handle_vmx_instruction injects #UD and always returns 1 */
    [EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold, /* touches APIC register state; returns 1 */
    [EXIT_REASON_APIC_ACCESS]             = handle_apic_access, /* access to the APIC page */
    [EXIT_REASON_APIC_WRITE]              = handle_apic_write, /* returns 1 */
    [EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced, /* always returns 1 */
    [EXIT_REASON_WBINVD]                  = handle_wbinvd, /* emulates WBINVD (cache write-back and invalidate) */
    [EXIT_REASON_XSETBV]                  = handle_xsetbv, /* emulates XSETBV (writes an extended control register) */
    [EXIT_REASON_TASK_SWITCH]             = handle_task_switch, /* emulates a hardware task switch */
    [EXIT_REASON_MCE_DURING_VMENTRY]      = handle_machine_check, /* always returns 1; can be ignored */
    [EXIT_REASON_GDTR_IDTR]               = handle_desc,
    [EXIT_REASON_LDTR_TR]                 = handle_desc,
    [EXIT_REASON_EPT_VIOLATION]           = handle_ept_violation, /* guest memory access fault under EPT; also fixes up NMI blocking */
    [EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig, /* handles an EPT misconfiguration */
    [EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause, /* PAUSE instruction */
    [EXIT_REASON_MWAIT_INSTRUCTION]       = handle_mwait, /* MWAIT is emulated as a NOP */
    [EXIT_REASON_MONITOR_TRAP_FLAG]       = handle_monitor_trap, /* returns 1; can be ignored */
    [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor, /* MONITOR is emulated as a NOP */
    [EXIT_REASON_INVEPT]                  = handle_vmx_instruction,
    [EXIT_REASON_INVVPID]                 = handle_vmx_instruction,
    [EXIT_REASON_RDRAND]                  = handle_invalid_op, /* injects #UD and returns 1; can be ignored */
    [EXIT_REASON_RDSEED]                  = handle_invalid_op,
    [EXIT_REASON_PML_FULL]                = handle_pml_full, /* returns 1; can be ignored */
    [EXIT_REASON_INVPCID]                 = handle_invpcid, /* memory-related: invalidates translations tagged with PCIDs */
    [EXIT_REASON_VMFUNC]                  = handle_vmx_instruction, /* returns 1; can be ignored */
    [EXIT_REASON_PREEMPTION_TIMER]        = handle_preemption_timer, /* returns 1; can be ignored */
    [EXIT_REASON_ENCLS]                   = handle_encls, /* returns 1; can be ignored */
};
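
The dispatch pattern used by this table can be tried outside the kernel. The following is a minimal sketch, not the kernel's code: `demo_vcpu`, `demo_handle_exit` and the `DEMO_REASON_*` names are hypothetical stand-ins, and only the return convention (1 resumes the guest, 0 goes back to userspace) is taken from the comment above the table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of exit reasons; the numbers mirror the basic
 * exit reasons for triple fault (2), CPUID (10) and HLT (12). */
enum {
    DEMO_REASON_TRIPLE_FAULT = 2,
    DEMO_REASON_CPUID        = 10,
    DEMO_REASON_HLT          = 12,
    DEMO_REASON_MAX          = 16
};

/* Hypothetical stand-in for struct kvm_vcpu. */
struct demo_vcpu {
    int exit_reason;   /* reason reported for the simulated VM exit */
    int needs_user;    /* set when userspace has to finish the work */
};

typedef int (*demo_exit_handler_t)(struct demo_vcpu *vcpu);

/* Fully handled in the "kernel": the guest may resume. */
static int demo_handle_cpuid(struct demo_vcpu *vcpu)
{
    (void)vcpu;
    return 1;
}

/* Cannot be handled here: record that and go back to userspace. */
static int demo_handle_triple_fault(struct demo_vcpu *vcpu)
{
    vcpu->needs_user = 1;
    return 0;
}

/* Sparse table built with designated initializers, like the real one. */
static const demo_exit_handler_t demo_exit_handlers[DEMO_REASON_MAX] = {
    [DEMO_REASON_TRIPLE_FAULT] = demo_handle_triple_fault,
    [DEMO_REASON_CPUID]        = demo_handle_cpuid,
};

/* Returns 1 to resume the guest, 0 to exit to userspace. */
int demo_handle_exit(struct demo_vcpu *vcpu)
{
    assert(vcpu != NULL);
    int reason = vcpu->exit_reason;

    if (reason >= 0 && reason < DEMO_REASON_MAX && demo_exit_handlers[reason])
        return demo_exit_handlers[reason](vcpu);

    /* No handler registered: let userspace decide what to do. */
    vcpu->needs_user = 1;
    return 0;
}
```

Slots that are not named in the initializer are NULL, which is why the dispatcher checks both the bounds and the pointer before calling through the table.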
VM exit reasons

Many events and instructions can cause a VM exit. Some are always enabled, while others can be switched on or off through the VMCS control fields.

Unconditional reasons for VM exit include:

  • CPUID
  • RDMSR and WRMSR unless MSR bitmap is used
  • most of VMX instructions
  • INIT signal
  • SIPI signal - does not result in exit if the processor is not in wait-for-SIPI state
  • triple fault
  • task switches (hardware, including
  • VM entry failure

There are too many controllable exit reasons to describe each one separately, but most of them can be classified as one of:

  • interrupts or interrupt windows
  • I/O ports access
  • memory access - controlled by EPT
  • HLT/PAUSE and pre-emption timer - useful for multiple VMs running on one physical CPU
  • changes to descriptor tables and control registers
  • APIC access
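
To connect these categories to the numeric exit reasons, here is a small sketch under stated assumptions. It relies on two facts about the 32-bit exit reason field: bits 15:0 hold the basic exit reason, and bit 31 is set when the exit is a VM-entry failure. All `demo_` and `DEMO_` names are hypothetical, and the reason numbers in the switch are the basic exit reason values for the listed events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BASIC_REASON_MASK 0x0000FFFFu  /* bits 15:0 */
#define DEMO_FAILED_VMENTRY    0x80000000u  /* bit 31    */

static_assert((DEMO_BASIC_REASON_MASK & DEMO_FAILED_VMENTRY) == 0,
              "basic reason and failure flag must not overlap");

struct demo_exit_info {
    uint16_t basic;          /* basic exit reason */
    bool     failed_vmentry; /* true if the exit is a VM-entry failure */
};

/* Split the raw exit reason field into its two parts. */
struct demo_exit_info demo_decode_exit_reason(uint32_t raw)
{
    struct demo_exit_info info;

    info.basic = (uint16_t)(raw & DEMO_BASIC_REASON_MASK);
    info.failed_vmentry = (raw & DEMO_FAILED_VMENTRY) != 0;
    return info;
}

/* The categories from the list above. */
enum demo_exit_class {
    DEMO_CLASS_INTERRUPT,
    DEMO_CLASS_IO_PORT,
    DEMO_CLASS_MEMORY,
    DEMO_CLASS_IDLE_OR_TIMER,
    DEMO_CLASS_DESC_OR_CR,
    DEMO_CLASS_APIC,
    DEMO_CLASS_OTHER
};

/* Map a basic exit reason onto one of the categories. */
enum demo_exit_class demo_classify(uint16_t basic)
{
    switch (basic) {
    case 1:  /* external interrupt */
    case 7:  /* interrupt window */
        return DEMO_CLASS_INTERRUPT;
    case 30: /* I/O instruction */
        return DEMO_CLASS_IO_PORT;
    case 48: /* EPT violation */
    case 49: /* EPT misconfiguration */
        return DEMO_CLASS_MEMORY;
    case 12: /* HLT */
    case 40: /* PAUSE */
    case 52: /* preemption timer */
        return DEMO_CLASS_IDLE_OR_TIMER;
    case 28: /* control register access */
    case 46: /* GDTR or IDTR access */
    case 47: /* LDTR or TR access */
        return DEMO_CLASS_DESC_OR_CR;
    case 44: /* APIC access */
        return DEMO_CLASS_APIC;
    default:
        return DEMO_CLASS_OTHER;
    }
}
```

The classification is only an illustration of the grouping in the list; it is not how KVM itself organizes its handlers.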

kvm_userspace_exit

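In virt/kvm/kvm_main.c, kvm_vcpu_ioctl() handles KVM_RUN by calling kvm_arch_vcpu_ioctl_run(), the architecture-specific vcpu run function. When that function returns, the kernel KVM layer can no longer handle the exit on its own, so control has to go back from kernel mode to user mode and qemu takes over.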

        r = kvm_arch_vcpu_ioctl_run(vcpu);
        trace_kvm_userspace_exit(vcpu->run->exit_reason, r);

The return from kvm_arch_vcpu_ioctl_run() is what the kernel calls a userspace exit. The exit reasons reported to userspace are:

#define KVM_EXIT_UNKNOWN          0
#define KVM_EXIT_EXCEPTION        1
#define KVM_EXIT_IO               2
#define KVM_EXIT_HYPERCALL        3
#define KVM_EXIT_DEBUG            4
#define KVM_EXIT_HLT              5
#define KVM_EXIT_MMIO             6
#define KVM_EXIT_IRQ_WINDOW_OPEN  7
#define KVM_EXIT_SHUTDOWN         8
#define KVM_EXIT_FAIL_ENTRY       9
#define KVM_EXIT_INTR             10
#define KVM_EXIT_SET_TPR          11
#define KVM_EXIT_TPR_ACCESS       12
#define KVM_EXIT_S390_SIEIC       13
#define KVM_EXIT_S390_RESET       14
#define KVM_EXIT_DCR              15 /* deprecated */
#define KVM_EXIT_NMI              16
#define KVM_EXIT_INTERNAL_ERROR   17
#define KVM_EXIT_OSI              18
#define KVM_EXIT_PAPR_HCALL       19
#define KVM_EXIT_S390_UCONTROL    20
#define KVM_EXIT_WATCHDOG         21
#define KVM_EXIT_S390_TSCH        22
#define KVM_EXIT_EPR              23
#define KVM_EXIT_SYSTEM_EVENT     24
#define KVM_EXIT_S390_STSI        25
#define KVM_EXIT_IOAPIC_EOI       26
#define KVM_EXIT_HYPERV           27
#define KVM_EXIT_ARM_NISV         28
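
On the userspace side, the usual shape is a loop that issues KVM_RUN, inspects the exit reason, and either re-enters the guest or stops. The sketch below models only that control flow with a pre-recorded sequence of exit reasons instead of a real /dev/kvm file descriptor. The `demo_` names and the policy of which reasons stop the vCPU are assumptions made for illustration; the constant values are copied from the list above.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the list above so the sketch builds
 * without <linux/kvm.h>. */
#define KVM_EXIT_IO             2
#define KVM_EXIT_MMIO           6
#define KVM_EXIT_SHUTDOWN       8
#define KVM_EXIT_FAIL_ENTRY     9
#define KVM_EXIT_INTR           10
#define KVM_EXIT_INTERNAL_ERROR 17

enum demo_action { DEMO_REENTER_GUEST, DEMO_STOP_VCPU };

/* Hypothetical policy: decide what the loop does for one exit. */
enum demo_action demo_handle_userspace_exit(unsigned int exit_reason)
{
    switch (exit_reason) {
    case KVM_EXIT_IO:
    case KVM_EXIT_MMIO:
        /* Userspace emulates the device access, then re-enters. */
        return DEMO_REENTER_GUEST;
    case KVM_EXIT_INTR:
        /* KVM_RUN was interrupted by a signal; simply try again. */
        return DEMO_REENTER_GUEST;
    case KVM_EXIT_SHUTDOWN:
    case KVM_EXIT_FAIL_ENTRY:
    case KVM_EXIT_INTERNAL_ERROR:
    default:
        /* Fatal or unknown in this sketch: leave the run loop. */
        return DEMO_STOP_VCPU;
    }
}

/* Simulated run loop: each array element plays the role of the
 * exit_reason seen after one ioctl(KVM_RUN). Returns how many
 * simulated KVM_RUN calls were made before the loop stopped. */
size_t demo_run_loop(const unsigned int *exits, size_t n)
{
    assert(exits != NULL || n == 0);
    size_t runs = 0;

    for (size_t i = 0; i < n; i++) {
        runs++;
        if (demo_handle_userspace_exit(exits[i]) == DEMO_STOP_VCPU)
            break;
    }
    return runs;
}
```

In a real VMM the array would be replaced by reading `exit_reason` from the shared kvm_run structure after each KVM_RUN ioctl, and the I/O cases would dispatch to device emulation.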

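VCPU scheduling

Modern machines are usually symmetric multiprocessing (SMP) systems, and the operating system is generally free to schedule a VCPU onto any physical CPU. Moving a VCPU between physical CPUs hurts virtual machine performance: as long as the VCPU stays on the same physical CPU, re-entering the guest only takes a VMRESUME instruction, but switching to a different physical CPU requires VMCLEAR, VMPTRLD, and VMLAUNCH.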


Simplified steps for scheduling a VCPU onto a different physical CPU (what kvm actually does is more involved):

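  1. Execute VMCLEAR on the source physical CPU, which ensures that the VMCS data cached by that CPU is flushed to memory.
  2. On the destination physical CPU, execute VMPTRLD with the physical address of the VCPU's VMCS as the operand.
  3. On the destination physical CPU, execute VMLAUNCH.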
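
The three steps above can be modeled in plain C to show why a migrated VCPU needs VMLAUNCH again. This is a simulation under stated assumptions, not kernel code: the VMX instructions are replaced by bookkeeping, and every `demo_` name is hypothetical. The one hardware fact it relies on is that VMCLEAR resets a VMCS's launch state to "clear", so the next entry must use VMLAUNCH rather than VMRESUME.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

struct demo_vmcs { int id; };

struct demo_loaded_vmcs {
    struct demo_vmcs *vmcs;
    int  cpu;       /* physical CPU the VMCS is loaded on, -1 if none */
    bool launched;  /* true once VMLAUNCH has succeeded on that CPU */
};

/* Models a per-CPU "current VMCS" pointer, one slot per physical CPU. */
struct demo_vmcs *demo_current_vmcs[DEMO_NR_CPUS];

/* Step 1: simulated VMCLEAR on the CPU that currently holds the VMCS. */
static void demo_vmclear(struct demo_loaded_vmcs *l)
{
    if (l->cpu >= 0 && demo_current_vmcs[l->cpu] == l->vmcs)
        demo_current_vmcs[l->cpu] = NULL;
    l->cpu = -1;
    l->launched = false;  /* launch state becomes "clear" */
}

/* Associate the VCPU's VMCS with physical CPU `cpu`. */
void demo_vcpu_load(struct demo_loaded_vmcs *l, int cpu)
{
    assert(l != NULL && cpu >= 0 && cpu < DEMO_NR_CPUS);

    if (l->cpu != cpu) {
        if (l->cpu >= 0)
            demo_vmclear(l);          /* step 1, on the source CPU */
        l->cpu = cpu;
    }
    demo_current_vmcs[cpu] = l->vmcs; /* step 2: simulated VMPTRLD */
}

enum demo_entry { DEMO_VMLAUNCH, DEMO_VMRESUME };

/* Step 3: pick the entry instruction for the next guest entry. */
enum demo_entry demo_vcpu_enter(struct demo_loaded_vmcs *l)
{
    assert(l != NULL && l->cpu >= 0);
    assert(demo_current_vmcs[l->cpu] == l->vmcs);

    if (!l->launched) {
        l->launched = true;
        return DEMO_VMLAUNCH;
    }
    return DEMO_VMRESUME;
}
```

Loading a VCPU on CPU 0 and entering twice yields VMLAUNCH followed by VMRESUME. After `demo_vcpu_load(&l, 1)` the next entry is VMLAUNCH again, which is the extra cost of migration described above.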

Each physical CPU has a per-CPU variable, current_vmcs, that points to a VMCS structure. It is defined in vmx.c:

DEFINE_PER_CPU(struct vmcs *, current_vmcs);

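Each VCPU is also allocated a VMCS structure. It is created in vmx_create_vcpu and stored in the vmcs member of loaded_vmcs in struct vcpu_vmx. Scheduling VCPUs essentially means sharing each physical CPU's per-CPU current_vmcs among all VCPUs: at any given moment it points to the VMCS of one of them.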

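  1. The kernel calls vcpu_load to associate VCPU1 with PCPU1. On the first ioctl(KVM_RUN), vcpu_load is called at the start of kvm_vcpu_ioctl. If the VCPU thread is being scheduled back in, kvm_sched_in calls kvm_arch_vcpu_load, which reaches the implementation-specific load function (such as vmx_vcpu_load) to complete the association.
  2. While PCPU1 runs guest code, the current thread cannot be preempted or interrupted in the usual way, but an interrupt can still trigger a VM Exit, which returns control from the guest to the host. After the exit and some necessary work, interrupts and preemption are enabled again, so PCPU1 may schedule another thread or VCPU.
  3. When VCPU1's thread is preempted, kvm_sched_out is called. If the scheduler later runs VCPU1 on physical CPU 2 instead, VCPU1's state has to be associated with PCPU2.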