虚拟化技术笔记

虚拟化技术理解

Created Wednesday 05 March 2014

  • 虚拟机监控程序 : Virtual Machine Monitor 简称VMM
  • 虚拟化最大优势:
    • 将不同的应用程序或者服务器运行在不同的虚拟机中, 可以避免不同程序间的相互干扰
    • 便于维护, 降低成本
    • migration, 保证正常运行
    • 方便研究

  • Xen的应用范围(from Xen详解.pdf)
    • 服务器整合:在虚拟机范围内,在一台物理主机上安装多个服务器, 用于演示及故障隔绝
    • 无硬件依赖:允许应用程序和操作系统对新硬件的移植测试
    • 多操作系统配置:以开发和测试为目的,同时运行多个操作系统
    • 内核开发:在虚拟机的沙盒中,做内核的测试和调试,无需为了测试而单独架设一台独立的
机器
    • 集群运算: 和单独的管理每个物理主机相比较,在 VM 级管理更加灵活,在负载均衡方面,更易于控制,和隔离
    • 为客户操作系统提供硬件技术支持: 可以开发新的操作系统, 以得益于现存操作系统的广泛硬件支持,比如 Linux

  • ISA: Instruction Set Architecture, 指令集
  • 当虚拟的指令集与物理的指令集相同时, 可以运行没有任何修改的操作系统, 而当两者不完全相同时, 客户机的操作系统就必须在源代码级或二进制级作相应修改 (敏感指令完全属于特权指令)
  • 根据是否需要小修改操作系统源代码, 虚拟化技术又分为
    • Paravirtualization: 泛虚拟化, 超虚拟化, 半虚拟化
    • Full-virtualization: 完全虚拟化
  • 事件通道(Event channel) 是Xen用于虚拟域和VMM之间, 虚拟域之间的一种异步事件通知机制. 共分8组, 每128个通道一组.
  • Hypervisor服务(Hypercall), Hypercall如同操作系统下的系统调用
  • 硬件虚拟化: Intel VT-x (vmx), AMD-V (svm); 设备直通(Pass-through): Intel VT-d, IOMMU
  • Intel VT技术引入了一种新的处理器操作模式, 称为VMX(Virtual Machine Extensions)
  • 每一个X86客户机的内存地址总是从0开始. 也因此监控程序必须把客户机虚拟地址到客户机物理地址进行重新映射. Xen泛虚拟化实现采用修改客户机页表的方式实现这一重新映射.
  • 对CPU特权级的理解: CPU特权级的作用主要体现在2个方面:
    • CPU根据当前代码段的特权级决定代码能执行的指令
    • 特权级为3的代码long jump到特权级为0的代码段或访问特权级为0的数据段时, 会被CPU禁止, 只有通过int等门指令进入是例外
  • 敏感指令
    • 引入虚拟化后, Guest OS就不能运行在Ring 0上. 因此, 原本需要在最高特权级下执行的指令就不能直接执行, 而是交由VMM处理执行, 这部分指令称为敏感指令. 当执行这些指令时, 理论上都要产生trap并被VMM捕获执行.
    • 敏感指令包括:
      • 企图访问或修改虚拟机模式或机器状态指令
      • 企图访问或修改敏感寄存器或存储单元, 如时钟寄存器, 中断寄存器等的指令
      • 企图访问存储保护系统或内存, 地址分配系统的指令
      • 所有的I/O指令
  • 虚拟化举例:
    • 完全虚拟化
      • VMware
      • VirtualBox
      • Virtual PC
      • KVM-x86
    • 半虚拟化, 刚开始为了突破x86架构的全虚拟化限制, 后来主要是为了提高虚拟化的效率
      • Xen, KVM-PowerPC
  • 处理器呈现给软件的接口就是一堆的指令(指令集)和一堆的寄存器. 而I/O设备呈现给软件的接口也是一堆的状态和控制寄存器(有些设备也有内部存储). 其中影响处理器和设备状态和行为的寄存器称为关键资源或特权资源.
  • 可以读写系统关键资源的指令叫做敏感指令. 绝大多数的敏感指令是特权指令. 特权指令只能在处理器的最高特权级(内核态)执行. 对于一般RISC处理器, 敏感指令肯定是特权指令, 唯x86除外.
正是为了这个例外, 造就了后来x86上的虚拟化技术的江湖纷争. 先是以VMware为代表的Full virtualization派对无需修改直接运行理念的偏执, 到后来Xen适当修改Guest OS后获得极佳的性能, 以致让Para virtualization大热, 再到后来Intel和AMD卷入战火, 从硬件上扩展, 一来解决传统x86虚拟化的困难, 二来为了性能的提升; 到最后硬件扩展皆为Full派和Para派所采用. 自此Para派的性能优势不再那么明显, Full派的无需修改直接运行的友好性渐占上风.
  • 经典的虚拟化方法

经典的虚拟化方法主要使用"特权解除"(Privilege deprivileging)和"陷入-模拟"(Trap-and-Emulation)的方法. 即: 将Guest OS运行在非特权级(特权解除), 而将VMM运行于最高特权级(完全控制系统资源). 解除了Guest OS的特权后, Guest OS的大部分指令仍可在硬件上直接运行, 只有当运行到特权指令时, 才会陷入到VMM模拟执行(陷入-模拟).
由此可引入虚拟化对体系结构(ISA)的要求:

    • 须支持多个特权级
    • 非敏感指令的执行结果不依赖于CPU的特权级.
    • CPU需要支持一种保护机制, 如MMU, 可将物理系统和其他VM与当前活动的VM隔离
    • 敏感指令需皆为特权指令
  • x86 ISA中有十多条敏感指令不是特权指令, 因此x86无法使用经典的虚拟化技术完全虚拟化.
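例如SGDT/SMSW就属于这类"敏感但非特权"的指令: 在Ring 3下执行通常不会陷入, 却能读到与机器状态相关的信息, 经典的陷入-模拟因此对它们无能为力. 下面是一个示意性的小程序(假定x86-64 Linux + GCC, 且CPU/内核未开启UMIP; 仅为帮助理解, 并非Xen代码):

/* 演示x86上"敏感但非特权"的指令: SGDT和SMSW在用户态(Ring 3)执行
 * 通常不会触发trap(除非开启了UMIP), 所以陷入-模拟方法捕获不到它们. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    struct __attribute__((packed)) { uint16_t limit; uint64_t base; } gdtr;
    unsigned short msw;

    __asm__ volatile ("sgdt %0" : "=m"(gdtr));  /* 读GDTR: 敏感但非特权 */
    __asm__ volatile ("smsw %0" : "=r"(msw));   /* 读CR0低16位: 同上 */

    printf("GDTR limit=%#x base=%#llx, MSW=%#x\n",
           gdtr.limit, (unsigned long long)gdtr.base, msw);
    return 0;
}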

  • x86虚拟化方法
鉴于x86指令集本身的局限性, 长期以来对x86的虚拟化实现大致分为两派, 即以VMware为代表的Full virtualization派和以Xen为代表的Paravirtualization派. 两派的分歧主要在对非特权敏感指令的处理上. Full派采用的是动态的方法, 即: 运行时监测, 捕获后在VMM中模拟. 而Para派则主动进攻, 将所有用到的非特权敏感指令全部替换, 这样就减少了大量的陷入->上下文切换->模拟->上下文切换的过程, 获得了大幅的性能提升. 但缺点也很明显: 需要修改Guest OS.

  • 完全虚拟化派
秉承无需修改直接运行的理念, 该派一直在对"运行时监测, 捕获后模拟"的过程进行偏执的优化. 该派内部又有些差别, 其中有以VMware为代表的BT(二进制翻译)和以SUN为代表的Scan-and-Patch(扫描与修补)
  • 基于二进制翻译(BT)的完全虚拟化方法
其主要思想是在执行时将VM上运行的Guest OS指令翻译成x86 ISA的一个子集, 其中的敏感指令被替换成陷入指令. 翻译过程与执行过程交替进行, 不含敏感指令的用户态程序可以不经翻译直接执行. 该技术为VMware Workstation, VMware ESX Server的早期版本, Virtual PC以及QEMU所采用
  • 基于扫描与修补的完全虚拟化方法(SUN之VirtualBox)
(1) VMM会在VM运行每块指令之前对其进行扫描, 查找敏感指令.
(2) 补丁指令块在VMM中动态生成, 通常每一个需要修补的指令会对应一块补丁指令
(3) 敏感指令被替换成一个外跳转, 从VM跳到VMM, 在VMM中执行动态生成的补丁指令块
(4) 当补丁指令块执行完后, 执行流再跳转回VM的下一条指令处继续执行
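下面用一段极简的C代码示意上述"扫描-修补"的思路(操作码0xF4与0xCC均为假设的占位值, 真实实现需要完整的x86译码器和动态生成的补丁指令块):

/* 扫描一段"指令流", 把假想的敏感操作码0xF4替换成跳转/断点占位字节0xCC,
 * 控制流由此转入VMM中动态生成的补丁指令块. 纯属示意. */
#include <stdio.h>
#include <stdint.h>

#define OPC_SENSITIVE 0xF4   /* 假设: 单字节敏感指令 */
#define OPC_TRAP      0xCC   /* 假设: 外跳转/断点占位 */

static void scan_and_patch(uint8_t *code, size_t len, size_t *patched)
{
    *patched = 0;
    for (size_t i = 0; i < len; i++) {
        if (code[i] == OPC_SENSITIVE) {
            code[i] = OPC_TRAP;      /* 执行到此处时跳到VMM的补丁块 */
            (*patched)++;
        }
    }
}

int main(void)
{
    uint8_t block[] = { 0x90, 0xF4, 0x90, 0xF4, 0xC3 };
    size_t n;
    scan_and_patch(block, sizeof block, &n);
    printf("patched %zu sensitive instructions\n", n);
    return 0;
}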
  • OS协助的类虚拟化派
其基本思想是通过修改Guest OS的代码, 将含有敏感指令的操作, 替换为对VMM的超调用(Hypercall), 可将控制权转移给VMM. 该技术的优势在于性能能接近于物理机. 缺点在于需要修改Guest OS
  • Operating System level virtualization
The operating system level virtualization probably has the least overhead among the above three approaches and is the fastest solution. However, it has the limitation that you can only run the same operating system as the host. This takes away one of the great benefits of virtualization
HVM: Hardware virtual machine
PV: Paravirtualization machine
VMM 又叫Hypervisor

  • Intel和AMD的虚拟化技术
Intel VT-x: Virtualization Technology for x86
Intel VT-i: Virtualization Technology for Itanium
Intel VT-d: Virtualization Technology for Directed I/O
AMD-V: AMD Virtualization
其基本思想就是引入新的处理器运行模式和新的指令.使得VMM和Guest OS运行在不同的模式下. Guest OS运行于受控模式下, 原来一些敏感指令在受控模式下会全部陷入VMM, 这样就解决了部分非特权敏感指令的陷入-模拟难题, 而且模式切换时上下文的保存恢复由硬件来完成. 这样就大大提高了陷入-模拟时的上下文切换效率.
  • MMU
MMU是Memory Management Unit的缩写,中文名是内存管理单元,它是中央处理器(CPU)中用来管理虚拟存储器、物理存储器的控制线路,同时也负责虚拟地址映射为物理地址,以及提供硬件机制的内存访问授权。
  • 实现从页号到物理块号的地址映射。
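下面是"页号到物理块号"映射的一个最简C示意(单级页表, 4KB页, 页表内容为假设值, 仅为帮助理解):

/* 单级页表的地址转换示意: 虚拟地址 = 页号*4KB + 页内偏移,
 * 页表按页号索引, 返回物理块号(页帧号). */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NR_PAGES   16

static uint32_t page_table[NR_PAGES] = { [1] = 7 };   /* 假设: 页1 -> 帧7 */

static uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_SHIFT;           /* 虚拟页号 */
    uint32_t offset = va & (PAGE_SIZE - 1);       /* 页内偏移保持不变 */
    return (page_table[vpn] << PAGE_SHIFT) | offset;
}

int main(void)
{
    printf("VA 0x%x -> PA 0x%x\n", 0x1008, translate(0x1008)); /* 输出0x7008 */
    return 0;
}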
  • VMM结构
    • 宿主模型: OS-hosted VMMS
    • Hypervisor模型: Hypervisor VMMs
    • 混合模型: Hybrid VMMs
  • 处理器虚拟化原理精要
VMM 对物理资源的虚拟化可以分为三个部分: 处理器虚拟化, 内存虚拟化和I/O设备虚拟化. 其中以处理器虚拟化最为关键.

  • TLB
TLB(Translation Lookaside Buffer)翻译后备缓冲器是一个内存管理单元用于改进虚拟地址到物理地址转换速度的缓存。
TLB是一个小的,虚拟寻址的缓存,其中每一行都保存着一个由单个PTE组成的块。如果没有TLB,则每次取数据都需要两次访问内存,即查页表获得物理地址和取数据。

TLB:Translation lookaside buffer,即旁路转换缓冲,或称为页表缓冲;里面存放的是一些页表文件(虚拟地址到物理地址的转换表)。
又称为快表技术。由于“页表”存储在主存储器中,查询页表所付出的代价很大,由此产生了TLB。
X86保护模式下的寻址方式:段式逻辑地址—〉线性地址—〉页式地址;
页式地址=页面起始地址+页内偏移地址;
对应于虚拟地址:叫page(页面);对应于物理地址:叫frame(页框);
X86体系的系统内存里存放了两级页表,第一级页表称为页目录,第二级称为页表。
TLB和CPU里的一级、二级缓存之间不存在本质的区别,只不过前者缓存页表数据,而后两个缓存实际数据。
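下面用C示意"先查TLB, 未命中再查页表并回填"的流程(直接映射的小TLB, walk_page_table为占位函数, 纯属示意):

/* 直接映射TLB的查找示意: 命中则直接得到页帧号, 未命中则回退到
 * 内存中的页表(此处用一个占位函数代替)并回填TLB. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

struct tlb_entry { uint32_t vpn; uint32_t pfn; bool valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

static uint32_t walk_page_table(uint32_t vpn) { return vpn + 100; } /* 占位 */

static uint32_t tlb_translate(uint32_t vpn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn)        /* TLB命中: 省掉一次访存 */
        return e->pfn;
    uint32_t pfn = walk_page_table(vpn);  /* TLB未命中: 查页表 */
    *e = (struct tlb_entry){ .vpn = vpn, .pfn = pfn, .valid = true };
    return pfn;
}

int main(void)
{
    printf("pfn=%u (miss)\n", tlb_translate(5));
    printf("pfn=%u (hit)\n",  tlb_translate(5));
    return 0;
}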
  • xend服务器进程通过domain0来管理系统, xend负责管理众多的虚拟主机, 并且提供进入这些系统的控制台.命令经一个命令行工具通过一个http的接口被传送到xend
  • xen-3.0.3-142.el5_9.3
Summary: Xen is a virtual machine monitor
Description:
This package contains the Xen tools and management daemons needed to run virtual machines on x86, x86_64, and ia64 systems. Information on how to use Xen can be found at the Xen project pages. The Xen system also requires the Xen hypervisor and domain-0 kernel, which can be found in the kernel-xen* package. Virtualization can be used to run multiple operating systems on one physical system, for purposes of hardware consolidation, hardware abstraction, or to test untrusted applications in a sandboxed environment.

  • Xen的网络架构
    • Xen支持3种网络工作模式
      • Bridge 安装虚拟机时的默认模式
      • Route
      • NAT
    • Bridge模式下, Xend启动时的流程
      1. 创建虚拟网桥
      2. 停止物理网卡eth0
      3. 物理网卡eth0的MAC地址和IP地址被复制到虚拟网卡veth0
      4. 物理网卡eth0重命名为peth0
      5. veth0重命名为eth0
      6. peth0的MAC地址更改, ARP功能关闭
      7. 连接peth0, vif0.0到网桥xenbr0
      8. 启动peth0, vif0.0, xenbr0
  • Xen blktap
blktap 是Xen提供给我们的一套实现虚拟块设备的框架, 它是运行在用户空间的.
  • blktap的工作流程
当Xen启动的时候, 它会先启动blktapctl, 这是一个后台程序; 当我们启动虚拟机的时候, 就会通过xenbus这个通道把需要的虚拟块存储设备注册到blktapctl中. 该注册过程会创建2个命名管道以及一个字符设备, 这2个命名管道将被用于字符设备与tapdisk进程之间的通信.
  • (XEN) 一般来说 xend 启动执行 network-bridge 脚本会先把 eth0 的 IP 和 MAC 地址复制给虚拟网络接口 veth0,然后再把真实的 eth0 重命名为 peth0,把虚拟的 veth0 重命名为 eth0

  • 字符设备
字符设备是指在I/O传输过程中以字符为单位进行传输的设备. 如键盘, 打印机等.
  • Xend is responsible for managing virtual machines and providing access to their consoles
  • /etc/xen/xmexample1 is a simple template configuration file for describing a single VM
  • /etc/xen/xmexample2 file is a template description that is intended to be reused for multiple virtual machines.

  • For xen
    • service xendomains start will start the guests whose configuration files are located in the /etc/xen/auto directory
    • the hypervisor will automatically call xendomains to start the guests whose configuration files are located in the /etc/xen/auto directory
    • In fact xendomains calls xm to start|stop a xen guest
  • For xen
    • network-bridge: This script is called whenever xend is started or stopped to respectively initialize or tear down the xen virtual network.
    • When you use file-backed virtual storage you will get lower I/O performance
      • Migration
        • not live: moves a virtual machine from one host to another by pausing it, copying its memory contents, and then resuming it on the destination
        • ...
  • The continued success of the Xen hypervisor depends substantially on the development of a highly skilled community of developers who can both contribute to the project and use the technology within their own products.
  • Xen provides an abstract interface to devices, built on some core communication systems provided by the hypervisor.
  • Early versions of Xen only supported paravirtualized guests.
  • NAT: Network Address Translation, 是一种将私有地址转化为合法IP地址的转换技术. 它被广泛应用于各种类型的internet接入方式和各种类型的网络中. 原因很简单, NAT不仅完美地解决了IP地址不足的问题, 而且还能够有效地避免来自网络外部的攻击. 隐藏并保护网络内部的计算机.
  • 网桥: 像一个聪明的中继器. 中继器从一根网络电缆里接收信号, 放大它们, 再将其送入另一根电缆. 网桥可以是专门的硬件设备, 也可以由计算机加装网桥软件来实现, 这时计算机会安装多个网络适配器(网卡). 网桥在网络互联中起到数据接收, 地址过滤与数据转发功能, 用来实现多个网络系统之间的数据交换.
  • NUMA:
    • 理解
NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. In modern computers, memory access is far slower than the CPU, and it seems that this trend will continue in the near future. So CPUs, increasingly starved for data, have had to stall while they wait for memory accesses to complete. Multi-processor systems make the problem considerably worse: now a system can starve several processors at the same time, notably because only one processor can access memory at a time. Take SMP for example: all the processors share a common memory controller, which makes it hard to get high performance as the number of CPUs increases. That is, we cannot get scalability because of slow memory access and the shared memory controller. NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. (同一时间只能有一个cpu访问内存)

  • Hardware itself cannot lead to high performance. NUMA-specific issues must be handled to leverage the underlying NUMA hardware. For example, Linux has a special scheduler for NUMA. Like Linux, Xen also needs to support NUMA to get high performance. Unfortunately, Xen in RHEL supports NUMA in a very limited way. At the hypervisor level, it only "knows" that the underlying hardware is NUMA, but has no "policy" to deal with it. To enable "NUMA awareness" in Xen, you must set the "numa=on" parameter (default is off) on the kernel command line. It is better to also set the "loglvl=all" option to get more log information, which makes it easier to confirm that "NUMA awareness" is actually enabled. The command line looks like below:
kernel /xen.gz-2.6.18-164.el5 numa=on loglvl=all

We could check if the underlying machine is NUMA via xm info:

# xm info | grep nr_
nr_cpus : 4
nr_nodes : 1
If nr_nodes >= 2, the underlying machine is NUMA.
  • (XEN) Note:
    1. A guest can not be given more VCPUs than it was initialized with on creation. If we do so, the guest will get the number of VCPUs it was initialized with on creation.
    2. HVM guests are allocated a certain amount at creation time, and that is the number they are stuck with. That is, HVM do not support vcpu hot plug/unplug.
  • (XEN) Balloon
    • Rather than changing the amount of main memory addressed by the guest kernel, the balloon driver takes or gives pages to the guest OS. That is the balloon driver will consume memory in a guest and allow the hypervisor to allocate it elsewhere. The guest kernel still believes the balloon driver has that memory and is not aware that it is probably being used by another guest. If a guest needs more memory, the balloon driver will request memory from hypervisor, then give it back to the kernel. This comes with a consequence: the guest starts with the maximum amount of memory, maxmem, it will be allowed to use.
    • Currently, xend only automatically balloons down Domain0 when there is not enough free memory in the hypervisor to start a new guest. Xend doesn't automatically balloon down Domain0 when there isn't enough free memory to balloon up an existing guest. We could get the current size of free memory via xm info.
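As a rough sketch of the balloon idea (the hypercall names below are hypothetical placeholders, not Xen's real interface): inflating the balloon allocates pages inside the guest and hands the underlying frames to the hypervisor; deflating does the reverse.

/* Conceptual balloon-driver loop. give_frame_to_hypervisor() and
 * take_frame_from_hypervisor() are invented stand-ins for hypercalls. */
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define MAX_BALLOON_PAGES 1024

static void *balloon[MAX_BALLOON_PAGES];
static size_t balloon_pages;

static void give_frame_to_hypervisor(void *page)   { (void)page; } /* stub */
static void take_frame_from_hypervisor(void *page) { (void)page; } /* stub */

/* Inflate: steal pages from the guest so the hypervisor can reuse the frames. */
static void balloon_inflate(size_t pages)
{
    while (pages-- && balloon_pages < MAX_BALLOON_PAGES) {
        void *p = aligned_alloc(PAGE_SIZE, PAGE_SIZE);  /* a guest page */
        if (!p) break;
        give_frame_to_hypervisor(p);
        balloon[balloon_pages++] = p;   /* guest still "owns" the virtual page */
    }
}

/* Deflate: get frames back from the hypervisor and return pages to the guest. */
static void balloon_deflate(size_t pages)
{
    while (pages-- && balloon_pages > 0) {
        void *p = balloon[--balloon_pages];
        take_frame_from_hypervisor(p);
        free(p);
    }
}

int main(void) { balloon_inflate(8); balloon_deflate(8); return 0; }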
  • (XEN)
    • The advantage of this PV PCI passthru method is that it has been available for years in Xen, and it doesn't require any special hardware or chipset (it doesn't require hardware IOMMU (VT-d) support from the hardware).
  • IOMMU: input/output memory management unit.
  • (XEN) 修改桥接网卡
本来宿主机一直使用eth1实体网卡,但有一天eth1网卡坏了,我给eth0插上网线以后,XEN实体机
可以上网了,但虚拟机还无法上网,因为虚拟机通过实体机的eth1口上网,默认会使用xenbr1网桥。
我们可以修改每个虚拟机的配置文件,将bridge从xenbr1改为xenbr0,即修改下面这一行中的网桥名:
vif = [ "mac=00:16:3e:0f:91:9f,bridge=xenbr1" ]
执行完上述步骤以后要关闭虚拟机,重读配置文件后再次启动。
也可以将xenbr1网桥和eth1网口脱钩,转而和eth0网卡挂接。我们通过修改配置文件就可以看到xen
是如何将逻辑桥接网口和物理网口挂接的。
在/etc/xen/xend-config.sxp中,我们可以看到如下注释说明的内容:
#The bridge is named xenbr0, by default. To rename the bridge, use
# (network-script 'network-bridge bridge=<name>')
这句话是说默认的网桥是xenbr0,如果想采用非默认网桥,可以使用括号里的格式;
我们用的eth1做网卡,所以配置文件应该是这样的:
(network-script 'network-bridge netdev=eth1')
为什么使用eth1就是使用xenbr1呢?我们能不能让xenbr1使用eth0网口呢?改成如下配置:
(network-script 'network-bridge netdev=eth0 bridge=xenbr1')
然后重启xend服务即可。

(看起来好像必须要有pethX+xenbrX, 并且在虚拟机的配置文件中使用该xenbrX, 虚拟机才有网络)

  • Remember IOMMU和MMU的区别:
    • On some platforms, it is possible to make use of an Input/Output Memory
Management Unit (IOMMU ). This performs a similar feature to a standard MMU;
it maps between a physical and a virtual address space. The difference is the
application; whereas an MMU performs this mapping for applications running on
the CPU, the IOMMU performs it for devices.
  • When a page fault occurs, the block device driver can only perform DMA transfers into the bottom part of physical memory. If the page fault occurs
elsewhere, it must use the CPU to write the data, one word at a time, to the
correct address, which is very slow.
  • 我的理解: 现在的设备有DMA功能, 但有些设备只能对低4GB的物理内存空间进行DMA, 当目标内存位于4GB之外时就必须借助CPU逐字搬运, 速度很慢; 而通过IOMMU的地址重映射, 设备就可以对大于4GB的内存进行DMA
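这其实就是所谓的bounce buffer(反弹缓冲)做法, 下面是一个极简的C示意(假设设备只能寻址低4GB, device_dma_read为占位函数):

/* bounce buffer示意: 若目标物理页超出设备可DMA的低4GB范围,
 * 先DMA到低端缓冲区, 再由CPU memcpy到真正的目标地址. */
#include <stdint.h>
#include <string.h>

#define DMA_LIMIT (1ULL << 32)          /* 假设: 设备只能寻址低4GB */

/* 占位: 实际应由设备驱动发起DMA */
static void device_dma_read(void *low_buf, size_t len) { memset(low_buf, 0, len); }

static void read_block(void *dst, uint64_t dst_phys, size_t len,
                       void *bounce /* 位于低4GB的缓冲区 */)
{
    if (dst_phys + len <= DMA_LIMIT) {
        device_dma_read(dst, len);     /* 可以直接DMA到目标页 */
    } else {
        device_dma_read(bounce, len);  /* 先DMA到低端bounce buffer */
        memcpy(dst, bounce, len);      /* 再由CPU搬运, 代价较高 */
    }
}

int main(void)
{
    static char bounce[512], dst[512];
    read_block(dst, 0x1A0000000ULL /* 假设的>4GB物理地址 */, sizeof dst, bounce);
    return 0;
}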

  • Binary Rewriting
    • The binary rewriting approach requires that the instruction stream be scanned
by the virtualization environment and privileged instructions identified. These are
then rewritten to point to their emulated versions. Performance from this approach is not ideal, particularly when doing anything
I/O intensive. In implementation, this is actually very similar to how a debugger works. For
a debugger to be useful, it must provide the ability to set breakpoints, which
will cause the running process to be interrupted and allow it to be inspected by
the user. A virtualization environment that uses this technique does something
similar. It inserts breakpoints on any jump and on any unsafe instruction. When
it gets to a jump, the instruction stream reader needs to quickly scan the next part
for unsafe instructions and mark them. When it reaches an unsafe instruction, it
has to emulate it.

Pentium and newer machines include a number of features to make implementing
a debugger easier. These features allow particular addresses to be marked, for
example, and the debugger automatically activated. These can be used when
writing a virtual machine that works in this way. Consider the hypothetical
instruction stream in Figure 1.1. Here, two breakpoint registers would be used, DR0
and DR1, with values set to 4 and 8, respectively. When the first breakpoint is
reached, the system emulates the privileged instruction, sets the program counter
to 5, and resumes. When the second is reached, it scans the jump target and sets
the debug registers accordingly. Ideally, it caches these values, so the next time it
jumps to the same place it can just load the debug register values.

Xen never rewrites the binary.
This is something that VMWare does (as far as I understand). To the best of my understanding (but I have never seen the VMWare source code), the method consists of basically doing runtime patching of code that needs to run differently - typically, this involves replacing an existing op-code with something else - either causing a trap to the hypervisor, or a replacement set of code that "does the right thing". If I understand how this works in VMWare, the hypervisor "learns" the code by single-stepping through a block, and either applies binary patches or marks the section as "clear" (doesn't need changing). The next time this code gets executed, it has already been patched or is clear, so it can run at "full speed".

In Xen, using paravirtualization (ring compression), then the code in the OS has been modified to be aware of the virtualized environment, and as such is "trusted" to understand certain things. But the hypervisor will still trap for example writes to the page-table (otherwise someone could write a malicious kernel module that modifies the page-table to map in another guest's memory, or some such).

The HVM method does intercept CERTAIN instructions - but the rest of the code runs at normal full speed, thanks to the hardware support in modern processors, such as SVM in AMD and VMX in Intel processors. ARM has a similar technology in the latest models of their processors, but I'm not sure what the name of it is.

I'm not sure if I've answered quite all of your questions, if I've missed something, or it's not clear enough, feel free to ask...
  • In paravirtualization, by contrast, the rewriting happens at compile time (or design time), rather than at runtime.

  • Paravirtualization:
    • From the perspective of an operating system, the biggest difference is that it
runs in ring 1 on a Xen system, instead of ring 0. This means that it cannot
perform any privileged instructions. In order to provide similar functionality, the
hypervisor exposes a set of hypercalls that correspond to the instructions. (Use hypercall)
  • A hypercall is conceptually similar to a system call. On UNIX systems, the
convention for invoking a system call is to push the values and then raise an
interrupt, or invoke a system call instruction if one exists. To issue the exit (0)
system call on FreeBSD, for example, you would execute a sequence of instructions
similar to that shown in Listing 1.1.
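Listing 1.1 itself is not reproduced in these notes; as a rough sketch (assuming the conventional 32-bit FreeBSD calling convention: arguments on the stack, syscall number in EAX, then int 0x80), the sequence might look like this when built with gcc -m32:

/* Rough sketch of the idea behind Listing 1.1: issuing exit(0) on 32-bit
 * FreeBSD. Arguments go on the stack, a dummy return-address slot is pushed,
 * the syscall number goes in EAX, and int 0x80 traps into the kernel. */
int main(void)
{
    __asm__ volatile (
        "pushl $0\n\t"         /* argument: exit status 0        */
        "pushl $0\n\t"         /* dummy return-address slot      */
        "movl  $1, %%eax\n\t"  /* SYS_exit is syscall number 1   */
        "int   $0x80\n\t"      /* trap into the kernel           */
        : : : "eax", "memory");
    return 0;                  /* never reached */
}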
  • (MMU): 内存管理单元(英语:memory management unit,缩写为MMU),有时称作分页内存管理单元(英语:paged memory management unit,缩写为PMMU)。它是一种负责处理中央处理器(CPU)的内存访问请求的计算机硬件。它的功能包括虚拟地址到物理地址的转换(即虚拟内存管理)、内存保护、中央处理器高速缓存的控制,在较为简单的计算机体系结构中,負責总线的仲裁以及存储体切换(bank switching,尤其是在8位的系统上)
  • ISA: Instruction Set Architecture
  • IVT adds a new mode to the processor, called VMX. A hypervisor can run in
VMX mode and be invisible to the operating system, running in ring 0. When
the CPU is in VMX mode, it looks normal from the perspective of an unmodified
OS. All instructions do what they would be expected to, from the perspective of
the guest, and there are no unexpected failures as long as the hypervisor correctly
performs the emulation.
A set of extra instructions is added that can be used by a process in VMX root
mode. These instructions do things like allocating a memory page on which to
store a full copy of the CPU state, start, and stop a VM. Finally, a set of bitmaps is
defined indicating whether a particular interrupt, instruction, or exception should
be passed to the virtual machine’s OS running in ring 0 or handled by the hypervisor
running in VMX root mode.
In addition to the features of Intel’s VT, AMD’s Pacifica provides a few extra
things linked to the x86-64 extensions and to the Opteron architecture. Current
Opterons have an on-die memory controller. Because of the tight integration
between the memory controller and the CPU, it is possible for the hypervisor to
delegate some of the partitioning to the memory controller.

Using AMD-V, there are two ways in which the hypervisor can handle memory
partitioning. In fact, two modes are provided. The first, Shadow Page Tables,
allows the hypervisor to trap whenever the guest OS attempts to modify its page
tables and change the mapping itself. This is done, in simple terms, by marking
the page tables as read only, and catching the resulting fault to the hypervisor,
instead of the guest operating system kernel. The second mode is a little more
complicated. Nested Page Tables allow a lot of this to be done in hardware.
Nested page tables do exactly what their name implies; they add another layer
of indirection to virtual memory. The MMU already handles virtual to physical
translations as defined by the OS. Now, these “physical” addresses are translated
to real physical addresses using another set of page tables defined by the hypervisor.
Because the translation is done in hardware, it is almost as fast as normal
virtual memory lookups.
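A toy illustration of the two translation stages that nested paging implies (guest page table: guest-virtual to guest-physical; hypervisor's nested table: guest-physical to machine address). Real hardware walks both sets of tables itself; the single-level 4KB tables below are purely illustrative:

/* Two-stage address translation sketch: the guest's page table maps
 * GVA->GPA, the hypervisor's nested table maps GPA->machine frame. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NR_PAGES   16

static uint32_t guest_pt[NR_PAGES]  = { [2] = 5 };   /* guest: page 2 -> gframe 5 */
static uint32_t nested_pt[NR_PAGES] = { [5] = 9 };   /* host: gframe 5 -> mframe 9 */

static uint32_t gva_to_machine(uint32_t gva)
{
    uint32_t off = gva & ((1u << PAGE_SHIFT) - 1);
    uint32_t gpa = (guest_pt[gva >> PAGE_SHIFT] << PAGE_SHIFT) | off;   /* stage 1 */
    return (nested_pt[gpa >> PAGE_SHIFT] << PAGE_SHIFT) | off;          /* stage 2 */
}

int main(void)
{
    printf("GVA 0x2010 -> machine 0x%x\n", gva_to_machine(0x2010));     /* 0x9010 */
    return 0;
}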

  • 为了让一个体系结构可以被虚拟化, Popek和Goldberg认为所有的敏感指令必须同时是特权指令. 直观地说, 任何以会影响其他进程的方式改变机器状态的指令行为, 都必须遭到hypervisor的阻止
  • 虚拟化内存比较简单: 只需要把内存分割为多个区域, 然后让每一个访问物理内存的特权级指令发生陷入, 并由一个映射到所允许的内存区域的指令所代替. 一个现代的CPU包括一个内存管理单元(MMU), 正是内存管理单元基于操作系统提供的信息实现了上述翻译过程.
  • DMA 直接内存存取(DMA) 改善系统实时效能的一个熟知的方法是,额外提供一个逻辑模块,在事件发生时产生响应,并允许处理器在较方便的时间来处理信息。这个DMA控制器通常将传送到模块的信息复制到内存(RAM),并允许已处理的信息自动从内存移到外部外围装置。所有这些工作皆独立于目前的CPU活动-详见图1。这种方式肯定有所助益,但其效益仅限于延迟必然发生的事件-CPU还是得在某一时间处理信息。S12X采用一个根本的方法,即提供「智能型DMA」控制器,不只移动资料,同时直接执行所有的处理工作。
  • Now, both Intel and AMD have added a set of instructions that makes virtualization
considerably easier for x86. AMD introduced AMD-V, formerly known
as Pacifica, whereas Intel’s extensions are known simply as (Intel) Virtualization
Technology (IVT or VT ). The idea behind these is to extend the x86 ISA to make
up for the shortcomings in the existing instruction set. Conceptually, they can be
thought of as adding a “ring -1” above ring 0, allowing the OS to stay where it
expects to be and catching attempts to access the hardware directly.
  • (XEN)
    • xen: This package contains the Xen tools and management daemons needed to run virtual machines on x86, x86_64, and ia64 systems. Information on how to use Xen can be found at the Xen project pages. The Xen system also requires the Xen hypervisor and domain-0 kernel, which can be found in the kernel-xen* package.
    • kernel-xen:
  • (XEN) Xen为什么会有机制与策略分离的设计, 或者说Xen为什么想尽力减少hypervisor的代码, 原因是: 如果Xen出现bug则会危及到其上运行的所有虚拟机, 影响很大, 所以要尽可能减少其代码量, 降低出错的几率, 增加稳定性
  • (XEN) Early versions of Xen did a lot more in the hypervisor. Network multiplexing,
for example, was part of Xen 1.0, but was later moved into Domain 0. Most
operating systems already include very flexible features for bridging and tunnelling
virtual network interfaces, so it makes more sense to use these than implement
something new.
  • Another advantage of relying on Domain 0 features is ease of administration.
In the case of networks, a tool such as pf or iptables is incredibly complicated, and
a BSD or Linux administrator already has a significant amount of time and effort
invested in learning about it. Such an administrator can use Xen easily, since she
can re-use her existing knowledge.
  • (XEN)
    • model=e1000: Ethernet 82540
    • model=rtl8139: Ethernet 8139
    • model=* : Ethernet 8139
  • (XEN) xm list -> Times: The Time column is deceptive. Virtual IO (network and block devices) used by Domains requires coordination
by Domain0, which means that Domain0 is actually charged for much of the time that a DomainU is doing
IO. Use of this time value to determine relative utilizations by domains is thus very suspect, as a high
IO workload may show as less utilized than a high CPU workload. Consider yourself warned.
  • (XEN) disk
An array of block device stanzas, in the form:
disk = [ "stanza1", "stanza2", ... ]
Each stanza has 3 terms, separated by commas, "backend-dev,frontend-dev,mode".
backend-dev
The device in the backend domain that will be exported to the guest (frontend) domain. Supported formats
include:
phy:device - export the physical device listed. The device can be in symbolic form, as in sda7, or as the
hex major/minor number, as in 0x301 (which is hda1).
file://path/to/file - export the file listed as a loopback device. This will take care of the loopback
setup before exporting the device.
frontend-dev
How the device should appear in the guest domain. The device can be in symbolic form, as in sda7, or as
the hex major/minor number, as in 0x301 (which is hda1).
mode
The access mode for the device. There are currently 2 valid options, r (read-only), w (read/write).
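For example, a hypothetical stanza following the format above, exporting the physical partition sda7 to the guest as hda1 in read/write mode:
disk = [ "phy:sda7,hda1,w" ]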
  • (XEN) The output contains a variety of labels created by a user to define the rights and access privileges of certain domains. This is a part of sHype Xen Access Controls, which is another level of security an administrator may use. This is an optional feature of Xen, and in fact must be compiled into a Xen kernel if it is desired.
  • DomU 通常不被允许执行任何能够直接访问硬件的hypercall, 虽然在某些情况下它被允许访问一个或更多的设备
  • IOMMU/MMU
http://blog.csdn.net/hotsolaris/archive/2007/08/08/1731839.aspx

IOMMU= input/output memory management unit
MMU = memory management unit 内存管理单元

The IOMMU is a MMU that connects a DMA-capable I/O bus to the primary storage memory.
Like the CPU memory management unit, an IOMMU takes care of mapping virtual addresses(设备地址) to physical addresses, and some units guarantee memory protection from misbehaving devices.

IOMMU, IO Memory Management Units, are hardware devices that translate device DMA addresses to machine addresses. An isolation capable IOMMU restricts a device so that it can only access parts of memory it has been explicitly granted access to.

把DMA控制器看成一个cpu, 然后, 想象这个cpu用IOMMU去访问主存, 与通常的cpu访问主存一样的方法。
IOMMU主要解决IO总线的地址宽度不够的问题。例如系统主存有64G, 而外设总线用32 bit PCI, 那么外设的DMA控制器只能访问主存的0-4G。而通常低地址内存都紧张, 那么DMA就用IOMMU把地址重新映射一下。

OS控制IOMMU和MMU
Memory protection from malicious or misbehaving devices: a device cannot read or write to memory that hasn't been explicitly allocated (mapped) for it. The memory protection is based on the fact that OS running on the CPU (see figure) exclusively controls both the MMU and the IOMMU. The devices are physically unable to circumvent or corrupt configured memory management tables.
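A toy model of what an isolation-capable IOMMU does on each device DMA access (a per-device table maps device addresses to machine addresses; unmapped pages are rejected). The table layout below is invented purely for illustration:

/* IOMMU sketch: per-device table maps device DMA page numbers to machine
 * frames; unmapped pages are rejected, so a misbehaving device cannot
 * touch memory it was never granted. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NR_PAGES   16
#define NO_MAPPING 0xFFFFFFFFu

struct iommu_domain { uint32_t map[NR_PAGES]; };    /* one table per device */

static int iommu_translate(const struct iommu_domain *d,
                           uint32_t dev_addr, uint32_t *machine_addr)
{
    uint32_t pfn = d->map[dev_addr >> PAGE_SHIFT];
    if (pfn == NO_MAPPING)
        return -1;                                  /* DMA blocked */
    *machine_addr = (pfn << PAGE_SHIFT) | (dev_addr & ((1u << PAGE_SHIFT) - 1));
    return 0;
}

int main(void)
{
    struct iommu_domain nic;
    for (int i = 0; i < NR_PAGES; i++) nic.map[i] = NO_MAPPING;
    nic.map[3] = 42;                                /* grant the device one page */

    uint32_t ma;
    printf("page 3: %s\n", iommu_translate(&nic, 0x3000, &ma) ? "blocked" : "ok");
    printf("page 4: %s\n", iommu_translate(&nic, 0x4000, &ma) ? "blocked" : "ok");
    return 0;
}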

  • (XEN)
    • When an operating system boots, one of the first things it typically does is query
the firmware to find out a little about its surroundings. This includes things
like the amount of RAM available, what peripherals are connected, and what the
current time is.
A kernel booting in a Xen environment, however, does not have access to the
real firmware. Instead, there must be another mechanism. Much of the required
information is provided by shared memory pages. There are two of these: the
first is mapped into the guest’s address space by the domain builder at guest boot
time; the second must be explicitly mapped by the guest.
  • MMU:
现代操作系统普遍采用虚拟内存管理(Virtual Memory Management)机制,这需要处理器中的MMU(Memory Management Unit,内存管理单元)提供支持,本节简要介绍MMU的作用。
首先引入两个概念,虚拟地址和物理地址。如果处理器没有MMU,或者有MMU但没有启用,CPU执行单元发出的内存地址将直接传到芯片引脚上,被内存芯片(以下称为物理内存,以便与虚拟内存区分)接收,这称为物理地址(Physical Address,以下简称PA),如下图所示。

(图 17.5: 物理地址, 图略)

如果处理器启用了MMU,CPU执行单元发出的内存地址将被MMU截获,从CPU到MMU的地址称为虚拟地址(Virtual Address,以下简称VA),而MMU将这个地址翻译成另一个地址发到CPU芯片的外部地址引脚上,也就是将VA映射成PA,如下图所示。
(图 17.6: 虚拟地址, 图略)

如果是32位处理器,则内地址总线是32位的,与CPU执行单元相连(图中只是示意性地画了4条地址线),而经过MMU转换之后的外地址总线则不一定是32位的。也就是说,虚拟地址空间和物理地址空间是独立的,32位处理器的虚拟地址空间是4GB,而物理地址空间既可以大于也可以小于4GB。

MMU将VA映射到PA是以页(Page)为单位的,32位处理器的页尺寸通常是4KB。例如,MMU可以通过一个映射项将VA的一页0xb7001000~0xb7001fff映射到PA的一页0x2000~0x2fff,如果CPU执行单元要访问虚拟地址0xb7001008,则实际访问到的物理地址是0x2008。物理内存中的页称为物理页面或者页帧(Page Frame)。虚拟内存的哪个页面映射到物理内存的哪个页帧是通过页表(Page Table)来描述的,页表保存在物理内存中,MMU会查找页表来确定一个VA应该映射到什么PA。
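按上面的例子用C验证一下这个换算: 页0xb7001000~0xb7001fff映射到页帧0x2000~0x2fff时, VA 0xb7001008应得到PA 0x2008(纯算术示意):

/* 验证"页基址替换+页内偏移不变"的换算: 0xb7001008 -> 0x2008. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t va        = 0xb7001008u;
    uint32_t page_base = 0xb7001000u;    /* 该VA所在页的起始地址 */
    uint32_t frame     = 0x2000u;        /* 页表给出的物理页帧起始地址 */
    uint32_t pa = frame + (va - page_base);   /* 页内偏移0x008保持不变 */
    printf("VA 0x%x -> PA 0x%x\n", va, pa);   /* 输出 0x2008 */
    return 0;
}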

操作系统和MMU是这样配合的:

操作系统在初始化或分配、释放内存时会执行一些指令在物理内存中填写页表,然后用指令设置MMU,告诉MMU页表在物理内存中的什么位置。

设置好之后,CPU每次执行访问内存的指令都会自动引发MMU做查表和地址转换操作,地址转换操作由硬件自动完成,不需要用指令控制MMU去做。

我们在程序中使用的变量和函数都有各自的地址,程序被编译后,这些地址就成了指令中的地址,指令中的地址被CPU解释执行,就成了CPU执行单元发出的内存地址,所以在启用MMU的情况下,程序中使用的地址都是虚拟地址,都会引发MMU做查表和地址转换操作。那为什么要设计这么复杂的内存管理机制呢?多了一层VA到PA的转换到底换来了什么好处?All problems in computer science can be solved by another level of indirection.还记得这句话吗?多了一层间接必然是为了解决什么问题的,等讲完了必要的预备知识之后,将在第 5 节 “虚拟内存管理”讨论虚拟内存管理机制的作用。

MMU除了做地址转换之外,还提供内存保护机制。各种体系结构都有用户模式(User Mode)和特权模式(Privileged Mode)之分,操作系统可以在页表中设置每个内存页面的访问权限,有些页面不允许访问,有些页面只有在CPU处于特权模式时才允许访问,有些页面在用户模式和特权模式都可以访问,访问权限又分为可读、可写和可执行三种。这样设定好之后,当CPU要访问一个VA时,MMU会检查CPU当前处于用户模式还是特权模式,访问内存的目的是读数据、写数据还是取指令,如果和操作系统设定的页面权限相符,就允许访问,把它转换成PA,否则不允许访问,产生一个异常(Exception)。异常的处理过程和中断类似,不同的是中断由外部设备产生而异常由CPU内部产生,中断产生的原因和CPU当前执行的指令无关,而异常的产生就是由于CPU当前执行的指令出了问题,例如访问内存的指令被MMU检查出权限错误,除法指令的除数为0等都会产生异常。

(图 17.7: 处理器模式, 图略)

通常操作系统把虚拟地址空间划分为用户空间和内核空间,例如x86平台的Linux系统虚拟地址空间是0x00000000~0xffffffff,前3GB(0x00000000~0xbfffffff)是用户空间,后1GB(0xc0000000~0xffffffff)是内核空间。用户程序加载到用户空间,在用户模式下执行,不能访问内核中的数据,也不能跳转到内核代码中执行。这样可以保护内核,如果一个进程访问了非法地址,顶多这一个进程崩溃,而不会影响到内核和整个系统的稳定性。CPU在产生中断或异常时不仅会跳转到中断或异常服务程序,还会自动切换模式,从用户模式切换到特权模式,因此从中断或异常服务程序可以跳转到内核代码中执行。事实上,整个内核就是由各种中断和异常处理程序组成的。总结一下:在正常情况下处理器在用户模式执行用户程序,在中断或异常情况下处理器切换到特权模式执行内核程序,处理完中断或异常之后再返回用户模式继续执行用户程序。

段错误我们已经遇到过很多次了,它是这样产生的:

用户程序要访问的一个VA,经MMU检查无权访问。

MMU产生一个异常,CPU从用户模式切换到特权模式,跳转到内核代码中执行异常服务程序。

内核把这个异常解释为段错误,把引发异常的进程终止掉。
  • 地址范围、虚拟地址映射为物理地址 以及 分页机制
任何时候,计算机上都存在一个程序能够产生的地址集合,我们称之为地址范围。这个范围的大小由CPU的位数决定,例如一个32位的CPU,它的地址范围是0~0xFFFFFFFF (4G),而对于一个64位的CPU,它的地址范围为0~0xFFFFFFFFFFFFFFFF (16E)。这个范围就是我们的程序能够产生的地址范围,我们把这个地址范围称为虚拟地址空间,该空间中的某一个地址我们称之为虚拟地址。与虚拟地址空间和虚拟地址相对应的则是物理地址空间和物理地址,大多数时候我们的系统所具备的物理地址空间只是虚拟地址空间的一个子集。这里举一个最简单的例子直观地说明这两者,对于一台内存为256M的32bit x86主机来说,它的虚拟地址空间范围是0~0xFFFFFFFF(4G),而物理地址空间范围是0x00000000~0x0FFFFFFF(256M)。
在没有使用虚拟存储器的机器上,虚拟地址被直接送到内存总线上,使具有相同地址的物理存储器被读写;而在使用了虚拟存储器的情况下,虚拟地址不是被直接送到内存地址总线上,而是送到存储器管理单元MMU,把虚拟地址映射为物理地址。
大多数使用虚拟存储器的系统都使用一种称为分页(paging)的机制。虚拟地址空间划分成称为页(page)的单位,而相应的物理地址空间也被进行划分,单位是页帧(frame),页和页帧的大小必须相同。在这个例子中我们有一台可以生成32位地址的机器,它的虚拟地址范围从0~0xFFFFFFFF(4G),而这台机器只有256M的物理内存,因此它可以运行4G的程序,但该程序不能一次性调入内存运行。这台机器必须有一个足以存放4G程序的外部存储器(例如磁盘或是FLASH),以保证程序片段在需要时可以被调用。在这个例子中,页的大小为4K,页帧大小与页相同——这点是必须保证的,因为内存和外围存储器之间的传输总是以页为单位的。对应4G的虚拟地址和256M的物理存储器,它们分别包含了1M个页和64K个页帧。
MMU的功能:
将线性地址映射为物理地址
现代的多用户多进程操作系统,需要MMU,才能达到每个用户进程都拥有自己独立的地址空间的目标。使用MMU,操作系统划分出一段地址区域,在这块地址区域中,每个进程看到的内容都不一定一样。例如MICROSOFT WINDOWS操作系统将地址范围4M-2G划分为用户地址空间,进程A在地址0X400000(4M)映射了可执行文件,进程B同样在地址0X400000(4M)映射了可执行文件,如果A进程读地址0X400000,读到的是A的可执行文件映射到RAM的内容,而进程B读取地址0X400000时,则读到的是B的可执行文件映射到RAM的内容。
这就是MMU在当中进行地址转换所起的作用。
提供硬件机制的内存访问授权
多年以来,微处理器一直带有片上存储器管理单元(MMU),MMU能使单个软件线程工作于硬件保护地址空间。但是在许多商用实时操作系统中,即使系统中含有这些硬件也没采用MMU。
当应用程序的所有线程共享同一存储器空间时,任何一个线程将有意或无意地破坏其它线程的代码、数据或堆栈。异常线程甚至可能破坏内核代码或内部数据结构。例如线程中的指针错误就能轻易使整个系统崩溃,或至少导致系统工作异常。
就安全性和可靠性而言,基于进程的实时操作系统(RTOS)的性能更为优越。为生成具有单独地址空间的进程,RTOS只需要生成一些基于RAM的数据结构并使MMU加强对这些数据结构的保护。基本思路是在每次上下文切换时“接入”一组新的逻辑地址映射。MMU利用当前映射,将在指令调用或数据读写过程中使用的逻辑地址映射为存储器物理地址。MMU还标记对非法逻辑地址进行的访问,这些非法逻辑地址并没有映射到任何物理地址。
这些进程虽然增加了利用查询表访问存储器所固有的系统开销,但其实现的效益很高。在进程边界处,疏忽或错误操作将不会出现,用户接口线程中的缺陷并不会导致其它更关键线程的代码或数据遭到破坏。目前在可靠性和安全性要求很高的复杂嵌入式系统中,仍然存在采用无存储器保护的操作系统的情况,这实在有些不可思议。
采用MMU还有利于选择性地将页面映射或解映射到逻辑地址空间。物理存储器页面映射至逻辑空间,以保持当前进程的代码,其余页面则用于数据映射。类似地,物理存储器页面通过映射可保持进程的线程堆栈。RTOS可以在每个线程堆栈解映射之后,很容易地保留逻辑地址所对应的页面内容。这样,如果任何线程分配的堆栈发生溢出,将产生硬件存储器保护故障,内核将挂起该线程,而不使其破坏位于该地址空间中的其它重要存储器区,如另一线程堆栈。这不仅在线程之间,还在同一地址空间之间增加了存储器保护。
存储器保护(包括这类堆栈溢出检测)在应用程序开发中通常非常有效。采用了存储器保护,程序错误将产生异常并能被立即检测,它由源代码进行跟踪。如果没有存储器保护,程序错误将导致一些细微的难以跟踪的故障。实际上,由于在扁平存储器模型中,RAM通常位于物理地址的零页面,因此甚至NULL指针的解引用都无法检测到。
MMU和CPU
X86系列的MMU
INTEL出品的80386CPU或者更新的CPU中都集成有MMU. 可以提供32BIT共4G的地址空间.
X86 MMU提供的寻址模式有4K/2M/4M的PAGE模式(根据不同的CPU,提供不同的能力),此处提供的是目前大部分操作系统使用的4K分页机制的描述,并且不提供ACCESS CHECK的部分。
涉及的寄存器
a) GDT
b) LDT
c) CR0
d) CR3
e) SEGMENT REGISTER
虚拟地址到物理地址的转换步骤
a) SEGMENT REGISTER作为GDT或者LDT的INDEX,取出对应的GDT/LDT ENTRY.
注意: SEGMENT是无法取消的,即使是FLAT模式下也是如此. 说FLAT模式下不使用SEGMENT REGISTER是错误的. 任意的RAM寻址指令中均有DEFAULT的SEGMENT假定. 除非使用SEGMENT OVERRIDE PREFIX来改变当前寻址指令的SEGMENT,否则使用的就是DEFAULT SEGMENT.
ENTRY格式
typedef struct
{
UINT16 limit_0_15;
UINT16 base_0_15;
UINT8 base_16_23;
UINT8 accessed : 1;
UINT8 readable : 1;
UINT8 conforming : 1;
UINT8 code_data : 1;
UINT8 app_system : 1;
UINT8 dpl : 2;
UINT8 present : 1;
UINT8 limit_16_19 : 4;
UINT8 unused : 1;
UINT8 always_0 : 1;
UINT8 seg_16_32 : 1;
UINT8 granularity : 1;
UINT8 base_24_31;
} CODE_SEG_DESCRIPTOR,*PCODE_SEG_DESCRIPTOR;
typedef struct
{
UINT16 limit_0_15;
UINT16 base_0_15;
UINT8 base_16_23;
UINT8 accessed : 1;
UINT8 writeable : 1;
UINT8 expanddown : 1;
UINT8 code_data : 1;
UINT8 app_system : 1;
UINT8 dpl : 2;
UINT8 present : 1;
UINT8 limit_16_19 : 4;
UINT8 unused : 1;
UINT8 always_0 : 1;
UINT8 seg_16_32 : 1;
UINT8 granularity : 1;
UINT8 base_24_31;
} DATA_SEG_DESCRIPTOR,*PDATA_SEG_DESCRIPTOR;
共有4种ENTRY格式,此处提供的是CODE SEGMENT和DATA SEGMENT的ENTRY格式. FLAT模式下的ENTRY在base_0_15,base_16_23处为0,而limit_0_15,limit_16_19处为0xfffff. granularity处为1. 表明SEGMENT地址空间是从0到0XFFFFFFFF的4G的地址空间.
b) 从SEGMENT处取出BASE ADDRESS 和LIMIT. 将要访问的ADDRESS首先进行ACCESS CHECK,是否超出SEGMENT的限制.
c) 将要访问的ADDRESS加上BASE ADDRESS,形成需要访问的32BIT虚拟(线性)地址. 该地址被解释成如下格式:
typedef struct
{
UINT32 offset :12;
UINT32 page_index :10;
UINT32 pdbr_index :10;
} VA,*LPVA;
d) pdbr_index作为CR3所指向的页目录的INDEX,得到一个如下定义的数据结构(PDE)
typedef struct
{
UINT8 present :1;
UINT8 writable :1;
UINT8 supervisor :1;
UINT8 writethrough:1;
UINT8 cachedisable:1;
UINT8 accessed :1;
UINT8 reserved1 :1;
UINT8 pagesize :1;
UINT8 ignoreed :1;
UINT8 avl :3;
UINT8 ptadr_12_15 :4;
UINT16 ptadr_16_31;
}PDE,*LPPDE;
e) 从中取出PAGE TABLE的地址. 并且使用page_index作为INDEX,得到如下数据结构
typedef struct
{
UINT8 present :1;
UINT8 writable :1;
UINT8 supervisor :1;
UINT8 writethrough:1;
UINT8 cachedisable:1;
UINT8 accessed :1;
UINT8 dirty :1;
UINT8 pta :1;
UINT8 global :1;
UINT8 avl :3;
UINT8 ptadr_12_15 :4;
UINT16 ptadr_16_31;
}PTE,*LPPTE;
f) 从PTE中获得PAGE的真正物理地址的BASE ADDRESS. 此BASE ADDRESS表明了物理地址的高20位, 加上虚拟地址的offset就是物理地址所在了.
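把上面a)~f)的两级查表过程写成简化的C代码(用uint32_t和位运算代替前面的位域结构, 忽略段基址、P位与权限检查, 只示意4KB分页; read_phys_u32用数组模拟物理内存, 纯属示意):

/* x86 4KB两级分页的地址转换示意: 线性地址 = 页目录索引(10bit) +
 * 页表索引(10bit) + 页内偏移(12bit). */
#include <stdio.h>
#include <stdint.h>

static uint32_t read_phys_u32(uint32_t pa);   /* 假想的"物理内存读取" */

static uint32_t linear_to_phys(uint32_t cr3, uint32_t lin)
{
    uint32_t pd_index = (lin >> 22) & 0x3FF;  /* 即上文的pdbr_index */
    uint32_t pt_index = (lin >> 12) & 0x3FF;  /* 即上文的page_index */
    uint32_t offset   =  lin        & 0xFFF;

    uint32_t pde = read_phys_u32((cr3 & 0xFFFFF000u) + pd_index * 4);
    uint32_t pte = read_phys_u32((pde & 0xFFFFF000u) + pt_index * 4);
    return (pte & 0xFFFFF000u) | offset;      /* PTE高20位 + 页内偏移 */
}

/* 用一小块数组模拟"物理内存", 便于独立编译运行. */
static uint32_t fake_mem[0x4000];
static uint32_t read_phys_u32(uint32_t pa) { return fake_mem[pa / 4]; }

int main(void)
{
    /* 构造: 页目录在0x1000, 页目录项0指向0x2000处的页表,
     * 页表项1把线性页0x00001000映射到物理帧0x00005000. */
    fake_mem[0x1000 / 4 + 0] = 0x2000 | 1;
    fake_mem[0x2000 / 4 + 1] = 0x5000 | 1;
    printf("lin 0x1008 -> phys 0x%x\n", linear_to_phys(0x1000, 0x1008)); /* 0x5008 */
    return 0;
}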
ARM系列的MMU
ARM出品的CPU,MMU作为一个协处理器存在。根据不同的系列有不同搭配。需要查询DATASHEET才可知道是否有MMU。如果有的话,一定是编号为15的协处理器。可以提供32BIT共4G的地址空间。
ARM MMU提供的分页机制有1K/4K/64K 3种模式. 本文介绍的是目前操作系统通常使用的4K模式。
涉及的寄存器,全部位于协处理器15.
ARM cpu地址转换涉及三种地址:虚拟地址(VA,Virtual Address),变换后的虚拟地址(MVA,Modified Virtual Address),物理地址(PA,Physical Address)。没有启动MMU时,CPU核心、cache、MMU、外设等所有部件使用的都是物理地址。启动MMU后,CPU核心对外发出的是虚拟地址VA,VA被转换为MVA供cache、MMU使用,并再次被转换为PA,最后使用PA读取实际设备。
ARM没有SEGMENT的寄存器,是真正的FLAT模式的CPU。给定一个ADDRESS,该地址可以被理解为如下数据结构:
typedef struct
{
UINT32 offset :12;
UINT32 page_index :8;
UINT32 pdbr_index :12;
} VA,*LPVA;
从MMU寄存器2中取出BIT14-31,pdbr_index就是这个表的索引,每个入口为4BYTE大小,结构为
typedef struct
{
UINT32 type :2; //always set to 01b
UINT32 writebackcacheable:1;
UINT32 writethroughcacheable:1;
UINT32 ignore :1; //set to 1b always
UINT32 domain :4;
UINT32 reserved :1; //set 0
UINT32 base_addr:22;
} PDE,*LPPDE;
根据PDE中的base_addr获得如下结构的ARRAY(页表),用page_index作为索引,取出内容。
typedef struct
{
UINT32 type :2; //always set to 11b
UINT32 ignore :3; //set to 100b always
UINT32 domain :4;
UINT32 reserved :3; //set 0
UINT32 base_addr:20;
} PTE,*LPPTE;
从PTE中获得的基地址加上offset,组成了物理地址.
PDE/PTE中其他的BIT,用于访问控制。这边讲述的是一切正常,物理地址被正常组合出来的状况。
ARM/X86 MMU使用上的差异
⒈X86始终是有SEGMENT的概念存在. 而ARM则没有此概念(没有SEGMENT REGISTER.).
⒉ARM有个DOMAIN的概念. 用于访问授权. 这是X86所没有的概念. 当通用OS尝试同时适用于此2者的CPU上,一般会抛弃DOMAIN的使用.

  • (XEN) Shared Info page: 该页面内容存放在一个结构体中, 在guest kernel中会声明一个指向该结构体的指针. 在guest启动过程中, hypervisor会先将该结构体的地址存入esi寄存器中, 然后guest在启动过程中会读取该寄存器的值, 从而获得该页的信息.
  • Red Hat Enterprise Virtualization (RHEV) is a complete enterprise virtualization management solution for server and desktop virtualization, based on Kernel-based Virtual Machine (KVM) technology.
  • In computing, hardware-assisted virtualization is a platform virtualization approach that enables efficient full virtualization using help from hardware capabilities, primarily from the host processors. Full virtualization is used to simulate a complete hardware environment, or virtual machine, in which an unmodified guest operating system (using the same instruction set as the host machine) executes in complete isolation. Hardware-assisted virtualization was added to x86 processors (Intel VT-x or AMD-V) in 2006.

Hardware-assisted virtualization is also known as accelerated virtualization; Xen calls it hardware virtual machine (HVM), and Virtual Iron calls it native virtualization.

  • (XEN)将虚拟网桥附加到网卡上:
#/etc/xen/scripts/network-bridge netdev=eth0 bridge=xenbr0 stop
#/etc/xen/scripts/network-bridge netdev=eth0 bridge=xenbr1 start
  • Software virtualization is unsupported by Red Hat Enterprise Linux.
  • Term of RHEV-H, RHEV-M, Hyper-V
Hyper-V: Windows virtualization platform
  • 影子页表和嵌套页表(概念)
  • 内存的硬件虚拟化: Intel EPT, AMD NPT
EPT: Extended Page Tables
NPT: Nested Page Tables

  • Instead, targeted modifications are introduced to make it simpler and faster to support multiple guest operating systems. For example, the guest operating system might be modified to use a special hypercall application binary interface (ABI) instead of using certain architectural features that would normally be used. This means that only small changes are typically required in the guest operating systems, but any such changes make it difficult to support closed-source operating systems that are distributed in binary form only, such as Microsoft Windows. As in full virtualization, applications are typically still run unmodified. Figure 1.4 illustrates paravirtualization.
Major advantages include performance, scalability, and manageability. The two most common examples of this strategy are User-mode Linux (UML) and Xen. The choice of paravirtualization for Xen has been shown to achieve high performance and strong isolation even on typical desktop hardware.
Xen extends this model to device I/O. It exports simplified, generic device interfaces to guest operating systems. This is true of a Xen system even when it uses hardware support for virtualization allowing the guest operating systems to run unmodified. Only device drivers for the generic Xen devices need to be introduced into the system.

  • QEMU— Another example of an emulator, but the ways in which QEMU is unlike Bochs are worth noting. QEMU supports two modes of operation. The first is the Full System Emulation mode, which is similar to Bochs in that it emulates a full personal computer (PC), including peripherals. This mode emulates a number of processor architectures, such as x86, x86_64, ARM, SPARC, PowerPC, and MIPS, with reasonable speed using dynamic translation. Using this mode, you can emulate an environment capable of hosting the Microsoft Windows operating systems (including XP) and Linux guests hosted on Linux, Solaris, and FreeBSD platforms.

Additional operating system combinations are also supported. The second mode of operation is known as User Mode Emulation. This mode is only available when executing on a Linux host and allows binaries for a different architecture to be executed. For instance, binaries compiled for the MIPS architecture can be executed on Linux executing on x86 architectures. Other architectures supported in this mode include SPARC, PowerPC, and ARM, with more in development. Xen relies on the QEMU device model for HVM guests.

  • 模拟(Emulation): 完全将指令进行翻译, 效率极其低下
  • VMware has a bare-metal product, ESX Server. With VMware Workstation, the hypervisor runs in hosted mode as an application installed on top of a base operating system such as Windows or Linux

  • The method of KVM operation is rather interesting. Each guest running on KVM is actually executed in user space of the host system. This approach makes each guest instance (a given guest kernel and its associated guest user space) look like a normal process to the underlying host kernel. Thus KVM has weaker isolation than other approaches we have discussed. With KVM, the well-tuned Linux process scheduler performs the hypervisor task of multiplexing across the virtual machines just as it would multiplex across user processes under normal operation. To accomplish this, KVM has introduced a new mode of execution that is distinct from the typical modes (kernel and user) found on a Linux system. This new mode designed for virtualized guests is aptly called guest mode. Guest mode has its own user and kernel modes. Guest mode is used when performing execution of all non-I/O guest code, and KVM falls back to normal user mode to support I/O operations for virtual guests.

  • The paravirtualization implementation known as User-mode Linux (UML) allows a Linux operating system to execute other Linux operating systems in user space

  • The Xen hypervisor sits above the physical hardware and presents guest domains with a virtual hardware interface

  • Intel-VT作用?

  • With hardware support for virtualization such as Intel's VT-x and AMD's AMD-V extensions, these additional protection rings become less critical. These extensions provide root and nonroot modes that each have rings 0 through 3. The Xen hypervisor can run in root mode while the guest OS runs in nonroot mode in the ring for which it was originally intended.

  • Domain0 runs a device driver specific to each actual physical device and then communicates with other guest domains through an asynchronous shared memory transport.

  • The physical device driver running in Domain0 or a driver domain is called a backend, and each guest with access to the device runs a generic device frontend driver. The backends provide each frontend with the illusion of a generic device that is dedicated to that domain. Backends also implement the sharing and multiple accesses required to give multiple guest domains the illusion of their own dedicated copy of the device.

  • Intel VT
Intel introduced a new set of hardware extensions called Virtualization Technology (VT), designed specifically to aid in virtualization of other operating systems allowing Xen to run unmodified guests. Intel added this technology to the IA-32 Platform and named it VT-x, and to the IA64 platforms and named it VT-i. With these new technologies, Intel introduced two new operation levels to the processor, for use by a hypervisor such as Xen. Intel maintains a list on its Web site of exactly which processors support this feature. This list is available at www.intel.com/products/processor_number/index.htm.

When using Intel VT technology, Xen executes in a new operational state called Virtual Machine Extensions (VMX) root operation mode. The unmodified guest domains execute in the other newly created CPU state, VMX non-root operation mode. Because the DomUs run in non-root operation mode, they are confined to a subset of operations available to the system hardware. Failure to adhere to the restricted subset of instructions causes a VM exit to occur, along with control returning to Xen.

AMD-V
Xen 3.0 also includes support for the AMD-V processor. One of AMD-V's benefits is a tagged translation lookaside buffer (TLB). Using this tagged TLB, guests get mapped to an address space that can be altogether different from what the VMM sets. The reason it is called a tagged TLB is that the TLB contains additional information known as address space identifiers (ASIDs). ASIDs ensure that a TLB flush does not need to occur at every context switch.

AMD also introduced a new technology to control access to I/O called I/O Memory Management Unit (IOMMU), which is analogous to Intel's VT-d technology. IOMMU is in charge of virtual machine I/O, including limiting DMA access to what is valid for the virtual machine, directly assigning real devices to VMs. One way to check if your processor is AMD-V capable from a Linux system is to check the output of /proc/cpuinfo for an svm flag. If the flag is present, you likely have an AMD-V processor.

HVM
Intel VT and AMD's AMD-V architectures are fairly similar and share many things in common conceptually, but their implementations are slightly different. It makes sense to provide a common interface layer to abstract their nuances away. Thus, the HVM interface was born. The original code for the HVM layer was implemented by an IBM Watson Research Center employee named Leendert van Doorn and was contributed to the upstream Xen project. A compatibility listing is located at the Xen Wiki at http://wiki.xensource.com/xenwiki/HVM_Compatible_Processors.

The HVM layer is implemented using a function call table (hvm_function_table), which contains the functions that are common to both hardware virtualization implementations. These methods, including initialize_guest_resources() and store_cpu_guest_regs(), are implemented differently for each backend under this unified interface.

  • Networking Devices
In general, network device support in Xen is based on the drivers found on your Domain0 guest. In fact the only things to worry about with networking devices are ensuring that your Domain0 kernel contains the necessary normal drivers for your hardware and that the Domain0 kernel also includes the Xen backend devices. If a non-Xen installation of the operating system you choose for Domain0 recognizes a particular network device, you should have no problem using that network device in Xen. Additionally, you may want to ensure that your Domain0 kernel includes support for ethernet bridging and loopback, if you want your DomainU kernels to obtain /dev/ethX devices. If your host kernel doesn't support bridging, or bridging does not work for you, you can select the alternate means of using IP routing in Domain0. This approach can also be useful if you want to isolate your DomainU guests from any external networks. Details on advanced network configurations are presented later in Chapter 10, "Network Configuration."

  • Xen device varieties:
physical(a hard drive or partition)
filesystem image or partitioned image: raw, qcow
standard network storage protocols such as: NBD, iSCSI, NFS etc

file:
phy:
tap:aio
tap:qcow

[host-a]#ls /sys/bus/
acpi/ i2c/ pci/ pcmcia/ pnp/ serio/ xen/
bluetooth/ ide/ pci_express/ platform/ scsi/ usb/ xen-backend/
[host-a]#ls /sys/bus/xen/drivers/
pcifront
[host-a]#ls /sys/bus/xen-backend/drivers/
tap vbd vif

xen back driver:
blktap, blkbk
netloop, netbk

  • loop是指拿文件来模拟块设备
  • vfb Xen 底下的 VM 都是透過 VNC 來傳送畫面,所以這裡的 vfb(virtual framebuffer device) 就是設定系統畫面與輸入裝置 Keyboard/Mouse.
  • VMX root operation(根虚拟化操作)和VMX non-root operation(非根虚拟化操作),统称为VMX操作模式

  • VT-x
扩展了传统的x86处理器架构,它引入了两种操作模式:VMX root operation(根虚拟化操作)和VMX non-root operation(非根虚拟化操作),统称为VMX操作模式。VMX root operation是VMM运行所处的模式, 设计给VMM/Hypervisor使用,其行为跟传统的IA32并无特别不同,而VMX non-root operation则是客户机运行所处的模式,在VMM控制之下的IA32/64环境。所有的模式都能支持所有的四个Privileges levels。

由此,GDT、IDT、LDT、TSS等这些指令就能正常地运行于虚拟机内部了,而在以往,这些特权指令需要模拟运行。 而VMM也能从模拟运行特权指令当中解放出来,这样既能解决Ring Aliasing问题(软件运行的实际Ring与设计运行的Ring不相同带来的问题),又能解决Ring Compression问题,从而大大地提升运行效率。Ring Compression问题的解决,也就解决了64bit客户操作系统的运行问题。

为了建立这种两个操作模式的架构,VT-x设计了一个Virtual-Machine Control Structure(VMCS,虚拟机控制结构)的数据结构,包括了Guest-State Area(客户状态区)和Host-State Area(主机状态区),用来保存虚拟机以及主机的各种状态参数,并提供了VM entry和VM exit两种操作在虚拟机与VMM之间切换,用户可以通过在VMCS的VM-execution control fields里面指定在执行何种指令/发生何种事件的时候,VMX non-root operation环境下的虚拟机就执行VM exit,从而让VMM获得控制权,因此VT-x解决了虚拟机的隔离问题,又解决了性能问题

  • IOMMU, VT-d, VT-x, MMU, HVM , PV, DMA, TLB, QEMU

  • AMD-V
整体上跟VT-x相似, 但是有些名字可能不同:
VT-x将用于存放虚拟机状态和控制信息的数据结构称为VMCS, 而AMD叫VMCB
VT-x将TLB记录中用于标记VM地址空间的字段为VPID, 而AMD-V称为ASID
VT-x: root 操作模式, 非root操作模式. AMD-V guest操作模式,host操作模式

VMCS/VMCB包含了启动和控制虚拟机的全部信息

guest/非root模式的意义在于其让客户操作系统处于完全不同的环境, 而不需要改变操作系统的代码, 在该模式下运行的特权指令即便是在Ring 0上也可以被VMM截取. 此外VMM还可以通过VMCB中的各种截取控制字段选择性地对指令和事件进行截取, 或设置有条件的截取, 所有敏感的特权指令和非特权指令都在其控制之中。

  • VT-D
Intel VT-d技术是一种基于North Bridge北桥芯片的硬件辅助虚拟化技术,通过在北桥中内置提供DMA虚拟化和IRQ虚拟化硬件,实现了新型的I/O虚拟化方式,Intel VT-d能够在虚拟环境中大大地提升 I/O 的可靠性、灵活性与性能。

传统的IOMMUs(I/O memory management units,I/O内存管理单元)提供了一种集中的方式管理所有的DMA——除了传统的内部DMA,还包括如AGP GART、TPT、RDMA over TCP/IP等这些特别的DMA,它通过内存地址范围来区别设备,因此容易实现,却不容易实现DMA隔离,因此VT-d通过重新设计的IOMMU架构, 实现了多个DMA保护区域的存在,最终实现了DMA虚拟化。这个技术也叫做DMA Remapping。

VT-d实现 的中断重映射可以支持所有的I/O源,包括IOAPICs,以及所有的中断类型,如通常的MSI以及扩展的MSI-X。

VT-d进行的改动还有很多,如硬件缓冲、地址翻译等,通过这些种种措施,VT-d实现了北桥芯片级别的I/O设备虚拟化。VT-d最终体现到虚拟化模型上的就是新增加了两种设备虚拟化方式:

直接I/O设备分配: 将物理I/O设备直接分配给虚拟机。在这个模型下,虚拟机内部的驱动程序直接和硬件设备通信,只需要经过少量,或者不经过VMM的管理。为了系统的健壮性,需要硬件的虚拟化支持,以隔离和保护硬件资源只给指定的虚拟机使用,硬件同时还需要具备多个I/O容器分区来同时为多个虚拟机服务,这个模型几乎完全消除了在VMM中运行驱动程序的需求。例如CPU,虽然CPU不算是通常意义的I/O设备——不过它确实就是通过这种方式分配给虚拟机,当然CPU的资源还处在VMM的管理之下。

运用VT-d技术,虚拟机得以使用直接I/O设备分配方式或者I/O设备共享方式来代替传统的设备模拟/额外设备接口方式,从而大大提升了虚拟化的I/O性能。

  • The Sun xVM Server network driver uses a similar approach to the disk block driver for handling network packets. On DomU, the pseudo network driver xnf(xen-netfront) gets the I/O requests from the network stack and sends them to xnb(xen-netback) on Dom0. The back-end network driver xnb on Dom0 forwards packets sent by xnf to the native network driver. The buffer management for packet receiving has more impact on network performance than packet transmitting does. On the packet receiving end, the data is transferred via DMA into the native driver receiving buffer on dom0. Then, the packet is copied from the native driver buffer to the VMM buffer. The VMM buffer is then mapped to the DomU kernel address space without another copy of the data.
The sequence of operations for packet receiving is as follows:

    1. Data is transferred via DMA into the native driver, bge, receive buffer ring.
    2. The xnb driver gets a new buffer from the VMM and copies data from the bge receive ring to the new buffer.
    3. The xnb driver sends DomU an event through the event channel.
    4. The xnf driver in DomU receives an interrupt.
    5. The xnf driver maps a mblk(9S)to the VMM buffer and sends the mblk(9S) to the upper stack.
