Analysis of a DPDK virtio NIC hang when accessing the NIC resource BAR via MMIO in a KVM guest on Hygon CPUs
Environment and background
AMD's hardware virtualization extension is called AMD-V (AMD Virtualization); AMD motherboard BIOSes refer to it as SVM, and its KVM implementation lives in arch/x86/kvm/svm.c.
Check whether hardware virtualization is enabled:
[root@localhost ~]# virt-host-validate
QEMU: Checking for hardware virtualization : PASS
QEMU: Checking if device /dev/kvm exists : PASS
QEMU: Checking if device /dev/kvm is accessible : PASS
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'devices' controller support : PASS
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for device assignment IOMMU support : PASS
QEMU: Checking if IOMMU is enabled by kernel : PASS
Hardware virtualization is enabled as expected.
CPU model
processor : 63
vendor_id : HygonGenuine
cpu family : 24
model : 1
model name : Hygon C86 5280 16-core Processor
stepping : 1
microcode : 0x80901047
cpu MHz : 2799.918
cache size : 512 KB
physical id : 1
siblings : 32
core id : 15
cpu cores : 16
apicid : 95
initial apicid : 95
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4968.44
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
The CPU advertises the svm flag, i.e. it supports SVM.
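As a quick cross-check from user space, the svm flag corresponds to bit 2 of ECX in CPUID leaf 0x80000001 (AMD's Fn8000_0001). A minimal sketch using the compiler's cpuid helper (the file name and build details are illustrative):

/* check_svm.c - report whether CPUID advertises AMD SVM. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0x80000001 not supported\n");
        return 1;
    }
    /* ECX bit 2 = SVM per AMD's CPUID documentation */
    printf("SVM supported: %s\n", (ecx & (1u << 2)) ? "yes" : "no");
    return 0;
}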
DPDK code that reads the NIC configuration space
Code in DPDK that opens the virtio config file:
snprintf(cfgname, sizeof(cfgname),
"/sys/class/uio/uio%u/device/config", uio_num);
dev->intr_handle.uio_cfg_fd = open(cfgname, O_RDWR);
Code in DPDK that reads the configuration space:
pread(intr_handle->uio_cfg_fd, buf, len, offset);
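For context, the same access pattern can be reproduced in a few lines of standalone C: open the uio device's PCI config file and pread() from it. The uio0 path and the offset (0, the vendor/device ID words) are illustrative assumptions:

/* read_ids.c - sketch of the DPDK-style config-space read via uio sysfs. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    uint16_t ids[2];
    int fd = open("/sys/class/uio/uio0/device/config", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* offset 0 of PCI config space holds the vendor ID, then the device ID */
    if (pread(fd, ids, sizeof(ids), 0) != (ssize_t)sizeof(ids)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("vendor 0x%04x device 0x%04x\n", ids[0], ids[1]);
    close(fd);
    return 0;
}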
KVM debug information captured on the host with ftrace:
<...>-1999977 [033] .... 159616.781146: kvm_exit: reason io rip 0xffffffff826a0799 info c1d00221 ffffffff826a079b
<...>-1999977 [033] .... 159616.781183: kvm_pio: pio_read at 0xc1d0 size 2 count 1 val 0x80
<...>-1999977 [033] .... 159616.781186: kvm_exit: reason io rip 0xffffffff826a0799 info c1d20221 ffffffff826a079b
<...>-1999977 [033] .... 159616.781190: kvm_pio: pio_read at 0xc1d2 size 2 count 1 val 0x80
<...>-1999977 [033] .... 159616.781192: kvm_exit: reason io rip 0xffffffff826a2e48 info c1d00221 ffffffff826a2e4a
<...>-1999977 [033] .... 159616.781196: kvm_pio: pio_read at 0xc1d0 size 2 count 1 val 0x80
<...>-1999977 [033] .... 159616.781197: kvm_exit: reason io rip 0xffffffff826a2e48 info c1d20221 ffffffff826a2e4a
<...>-1999977 [033] .... 159616.781202: kvm_pio: pio_read at 0xc1d2 size 2 count 1 val 0x80
<...>-1999977 [033] .... 159616.781205: kvm_exit: reason io rip 0xffffffff826a0799 info c1f00221 ffffffff826a079b
<...>-1999977 [033] .... 159616.781209: kvm_pio: pio_read at 0xc1f0 size 2 count 1 val 0x80
<...>-1999977 [033] .... 159616.781211: kvm_exit: reason io rip 0xffffffff826a0799 info c1f20221 ffffffff826a079b
Access via PIO works correctly.
Inspecting the virtio port space information
/sys/bus/pci/devices/0000:00:04.0/uio/uio0/portio/port0
[root@localhost] cat *
BAR0
port_x86
0x20
0xc180
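As a sanity check that PIO really reaches the device, a legacy virtio register can be read directly with port I/O. The base 0xc180 comes from the portio information above; the 0x12 offset (device status in the legacy register layout) and the use of ioperm() are illustrative assumptions, and the program needs root:

/* pio_status.c - sketch: read the legacy virtio device-status byte via PIO. */
#include <stdio.h>
#include <sys/io.h>

#define VIRTIO_PORT_BASE  0xc180   /* from the portio info above */
#define VIRTIO_PCI_STATUS 0x12     /* device status in the legacy layout */

int main(void)
{
    if (ioperm(VIRTIO_PORT_BASE, 0x20, 1) != 0) {
        perror("ioperm");
        return 1;
    }
    printf("device status: 0x%02x\n",
           inb(VIRTIO_PORT_BASE + VIRTIO_PCI_STATUS));
    return 0;
}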
Where DPDK hangs in the KVM guest: modern_set_status writing a virtio register via MMIO
Execution state from gdb:
(gdb) disass
Dump of assembler code for function modern_set_status:
0x00000000004f1180 <+0>: mov 0x40(%rdi),%rax
=> 0x00000000004f1184 <+4>: mov %sil,0x14(%rax)
0x00000000004f1188 <+8>: retq
End of assembler dump.
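The disassembly matches the shape of DPDK's modern_set_status in drivers/net/virtio/virtio_pci.c: load hw->common_cfg (at offset 0x40 of struct virtio_hw in this build) and store the status byte into device_status. Roughly (helper names vary across DPDK versions):

static void
modern_set_status(struct virtio_hw *hw, uint8_t status)
{
        /* mov 0x40(%rdi),%rax : load hw->common_cfg (the BAR mapping)        */
        /* mov %sil,0x14(%rax) : store status into common_cfg->device_status  */
        rte_write8(status, &hw->common_cfg->device_status);
}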
Log output from the KVM trace:
<...>-1999977 [004] .... 159622.334088: kvm_exit: reason npf rip 0x4f1184 info 10000000f fe000014
<...>-1999977 [004] .... 159622.334089: kvm_exit: reason npf rip 0x4f1184 info 10000000f fe000014
<...>-1999977 [004] .... 159622.334091: kvm_exit: reason npf rip 0x4f1184 info 10000000f fe000014
<...>-1999977 [004] .... 159622.334092: kvm_exit: reason npf rip 0x4f1184 info 10000000f fe000014
<...>-1999977 [004] .... 159622.334094: kvm_exit: reason npf rip 0x4f1184 info 10000000f fe000014
0xfe000014 is the MMIO address targeted by the instruction at the hang location, and it lies inside the virtio resource BAR. kvm_exit shows the guest exiting over and over, yet KVM never reaches the kvm_mmio emulation logic and never returns to qemu to invoke the callback registered for this MMIO region.
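The 0x14 offset also lines up with the virtio 1.0 common configuration layout: device_status sits 0x14 bytes into the structure mapped from the BAR, consistent with the faulting GPA 0xfe000014 if common_cfg starts at 0xfe000000. A partial listing of the structure (as in the virtio spec / linux/virtio_pci.h) for orientation:

struct virtio_pci_common_cfg {
        uint32_t device_feature_select; /* 0x00 */
        uint32_t device_feature;        /* 0x04 */
        uint32_t guest_feature_select;  /* 0x08 */
        uint32_t guest_feature;         /* 0x0c */
        uint16_t msix_config;           /* 0x10 */
        uint16_t num_queues;            /* 0x12 */
        uint8_t  device_status;         /* 0x14 <- target of the hanging write */
        uint8_t  config_generation;     /* 0x15 */
        /* ... per-queue fields follow ... */
};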
Example of a normal MMIO access in a KVM guest
<...>-2308041 [052] .... 159622.334111: kvm_exit: reason npf rip 0xffffffff8101c6bb info 10000000f fee00380
<...>-2308041 [052] .... 159622.334111: kvm_emulate_insn: 0:ffffffff8101c6bb:89 b7 00 c0 5f ff (prot64)
<...>-2308041 [052] .... 159622.334111: kvm_mmio: mmio write len 4 gpa 0xfee00380 val 0x6dcb
When the guest accesses MMIO address 0xfee00380 it exits to the host, and KVM emulates the memory-access instruction. The kvm_mmio line shows that KVM correctly recognized the instruction as a sensitive MMIO access and wrote the value 0x6dcb to GPA 0xfee00380; execution then proceeds into qemu to run the backend callback.
Comparison test with a qemu e1000 NIC
With the stock kernel driver
Contents of the e1000 NIC's PCI resource file:
[root@localhost] cat ./resource
0x00000000febc0000 0x00000000febdffff 0x0000000000040200
........................................................
Example of accessing the PCI resource space with the e1000 NIC bound to the kernel driver:
<...>-2640453 [032] .... 219439.831167: kvm_exit: reason npf rip 0xffffffffa0006d20 info 10000000f febc0010
<...>-2640453 [032] .... 219439.831167: kvm_page_fault: address febc0010 error_code f
<...>-2640453 [032] .... 219439.831168: vcpu_match_mmio: gva 0xffffc90003080010 gpa 0xfebc0010 Write GPA
<...>-2640453 [032] .... 219439.831168: kvm_mmio: mmio write len 4 gpa 0xfebc0010 val 0x1c8
<...>-2640453 [032] .... 219439.831227: kvm_exit: reason npf rip 0xffffffffa0006d44 info 10000000f febc0010
<...>-2640453 [032] .... 219439.831228: kvm_page_fault: address febc0010 error_code f
<...>-2640453 [032] .... 219439.831228: vcpu_match_mmio: gva 0xffffc90003080010 gpa 0xfebc0010 Write GPA
<...>-2640453 [032] .... 219439.831228: kvm_mmio: mmio write len 4 gpa 0xfebc0010 val 0x188
Accessing MMIO address 0xfebc0010 triggers a VM exit. Host KVM detects the page fault at 0xfebc0010 with error_code f, runs the kvm_mmio path, finishes emulating the instruction inside the kernel module, and then enters qemu to execute the backend callback.
Accessing the same GPA with the DPDK driver
Trace of accesses to the NIC resource BAR when the device is bound to DPDK:
<...>-2640451 [007] .... 218994.638671: kvm_exit: reason npf rip 0x6068c0 info 10000000d febc0010
<...>-2640451 [007] .... 218994.638672: kvm_page_fault: address febc0010 error_code d
<...>-2640451 [007] d... 218994.638672: kvm_entry: vcpu 0
<...>-2640451 [007] .... 218994.638675: kvm_exit: reason npf rip 0x6068c0 info 10000000d febc0010
<...>-2640451 [007] .... 218994.638675: kvm_page_fault: address febc0010 error_code d
<...>-2640451 [007] d... 218994.638675: kvm_entry: vcpu 0
Accessing MMIO address 0xfebc0010 again triggers a VM exit and host KVM detects a page fault, but this time the error_code is d rather than f. KVM does not classify the access as MMIO and does not emulate the instruction; it simply re-enters the guest, where the DPDK program keeps re-executing the same MMIO write instruction, which shows up as a hang.
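To make the difference concrete: 0xf and 0xd differ only in bit 1, which in the PFERR_* encoding KVM uses for these error codes is the write bit. A tiny sketch that decodes both values (the helper is made up for illustration):

/* pferr.c - decode KVM-style page-fault error codes (PFERR_* bit layout). */
#include <stdio.h>

static void decode_pferr(unsigned long ec)
{
    printf("error_code 0x%lx:%s%s%s%s%s\n", ec,
           (ec & 0x01) ? " present" : "",
           (ec & 0x02) ? " write"   : "",
           (ec & 0x04) ? " user"    : "",
           (ec & 0x08) ? " rsvd"    : "",
           (ec & 0x10) ? " fetch"   : "");
}

int main(void)
{
    decode_pferr(0xf);  /* e1000 bound to the kernel driver */
    decode_pferr(0xd);  /* e1000 bound to DPDK: the write bit is missing */
    return 0;
}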
Other control tests
With the same software versions, everything works on Intel CPUs.
Preliminary conclusion
On Hygon (AMD-based) CPUs, with the KVM guest running a 3.16.35 kernel, when a user-space process in the guest mmaps the NIC PCI BAR and reads or writes it via MMIO, KVM misbehaves: it fails to recognize the access as MMIO, and the program appears to hang.
Core logic of mmapping the resourceX file
    if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                           vma->vm_end - vma->vm_start,
                           vma->vm_page_prot))
        return -EAGAIN;
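For completeness, the guest-side pattern that triggers the hang boils down to a few lines: mmap the resourceX file exported by sysfs and touch a register through the mapping. The device path, BAR index, and offset below are illustrative assumptions:

/* mmio_poke.c - sketch of the user-space MMIO path DPDK relies on. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource1", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* On the affected Hygon host this kind of access never completes KVM's
     * MMIO emulation path and the program appears to hang. */
    printf("reg[0] = 0x%08x\n", bar[0]);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}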
Due to time constraints we have not dug further into this issue; it will be revisited later.
Further reading
https://www.spinics.net/lists/kvm/msg220131.html
The linked thread describes an infinite loop caused by a faulty instruction, which is somewhat similar to the problem here, but it is of limited value as a reference.