How VFIO Works

Jessica_1409573408

Last modified: 2024-08-11 18:17:07

Tags: linux

First published: 2024-06-20 19:53:55

Article link: https://blog.csdn.net/qq_41644888/article/details/139841456


VFIO: main ideas and basic principles

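The traditional device passthrough approach, PCI passthrough, requires KVM to do a great deal of work itself, such as interacting with the IOMMU and registering interrupt handlers. KVM ends up dealing with the device too directly and effectively acts as a device driver, which limits flexibility.
VFIO is a user-space driver framework. It uses hardware I/O virtualization (Intel VT-d, AMD-Vi) to pass a physical device on the host through to a virtual machine. Once the device is passed through, QEMU has to take over all of the guest's accesses to device resources (PCI configuration space, BAR space, device interrupts). Passing a device through to a guest raises two main difficulties:

(1) DMA addresses. The guest can program the device with arbitrary DMA addresses, so a mechanism is needed to isolate the device's DMA accesses.

(2) Interrupts. On Intel architectures an MSI is delivered by writing to a particular address, and any DMA initiator can write arbitrary data. An attacker inside the guest could therefore make a device raise interrupts that do not belong to it.

Both problems are solved by the IOMMU, through DMA remapping and interrupt remapping.

DMA remapping: the IOMMU adds another level of translation to the device's DMA addresses, so that the device's DMA can reach only the memory the host has assigned to it.
Interrupt remapping: the IOMMU redirects every interrupt request, so that interrupts from the passthrough device are delivered to the correct virtual machine.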
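The basic VFIO framework looks like this:
+-------------------------------+
|        VFIO Interface         |
+---------------+---------------+
|  vfio_iommu   |   vfio_pci    |
+---------------+---------------+
|     iommu     |      pci      |
+---------------+---------------+
vfio_iommu wraps the underlying IOMMU driver and exposes the IOMMU functions (DMA remapping, interrupt remapping) to the layer above. vfio_pci wraps the PCI device and gives user-space processes access to the device's resources (PCI configuration space, BAR space).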
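Three concepts are central to VFIO device passthrough: container, group and device.
A group is the smallest unit at which the IOMMU can isolate DMA. A group contains one or more devices, and all devices in a group must belong to the same virtual machine; otherwise a device in one virtual machine could use DMA to attack data in another, and DMA isolation could not be enforced in hardware. A multi-function device (for example, one that is both a network card and a printer) is interconnected at the hardware level and its functions can access each other's data, so all functions of a multi-function device must be placed in the same group.
A container is made up of one or more groups. Several groups may share one set of page tables, so collecting them into a single container improves performance. Typically one process, or one virtual machine, corresponds to one container (a virtual machine, that is, a QEMU process, uses a single set of page tables).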

Binding a device to the vfio driver

1. Suppose the device to pass through is the following:
  [root@localhost ~]# lspci | grep 00:03.0
  00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
2. Check which driver the device is currently bound to
  [root@localhost 0000:00:03.0]# readlink driver
  ../../../bus/pci/drivers/virtio-pci
3. Unbind the device from its driver
  echo 0000:00:03.0 > driver/unbind
4. Look up the device's vendor ID and device ID
  [root@localhost 0000:00:03.0]# lspci -ns 00:03.0
  00:03.0 0200: 1af4:1000
5. Bind the device to the vfio-pci driver
  echo vfio-pci > ./driver_override
  echo 1af4 1000 > /sys/bus/pci/drivers/vfio-pci/new_id
  echo 1 > /sys/bus/pci/drivers_autoprobe
6. Assign the passthrough device to the virtual machine

    <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
       </source>
    </hostdev>

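Programming against the VFIO driver framework

The VFIO interface is organized around three kinds of object: container, group and device.
A user-space process obtains a new container by opening "/dev/vfio/vfio". The ioctls used at the container level are:

VFIO_GET_API_VERSION: returns the VFIO API version
VFIO_CHECK_EXTENSION: checks whether a given IOMMU type is supported. The supported IOMMU types can differ between kernel versions and architectures.
VFIO_SET_IOMMU: selects the IOMMU type, which must be one that VFIO_CHECK_EXTENSION reported as supported
VFIO_IOMMU_GET_INFO: returns information about the IOMMU
VFIO_IOMMU_MAP_DMA: establishes the mapping from the I/O addresses seen by the device to virtual addresses in the QEMU process (similar to KVM_SET_USER_MEMORY_REGION in KVM, which maps guest physical addresses to virtual addresses in the user-space process)

A user-space process obtains a group by opening "/dev/vfio/$groupid". The ioctls used at the group level are:

VFIO_GROUP_GET_STATUS: returns the status of the group (for example whether it is viable, VFIO_GROUP_FLAGS_VIABLE, and whether a container has been set, VFIO_GROUP_FLAGS_CONTAINER_SET)
VFIO_GROUP_SET_CONTAINER: associates the group with a container; one container can hold several groups
VFIO_GROUP_GET_DEVICE_FD: returns a file descriptor that represents one specific device.

The device fd returned by the VFIO_GROUP_GET_DEVICE_FD ioctl gives user space a handle on a specific device. The ioctls used at the device level are:

VFIO_DEVICE_GET_INFO: returns information about the device
VFIO_DEVICE_GET_REGION_INFO: returns information about each of the device's regions (BARs, PCI configuration space, ROM)
VFIO_DEVICE_GET_IRQ_INFO: returns information about the device's interrupts
VFIO_DEVICE_RESET: resets the device

For the programming steps, see util/vfio-helpers.c in the QEMU source. The key functions are:

qemu_vfio_init_pci
qemu_vfio_do_mapping: maps size bytes starting at iova to size bytes starting at the address host in the QEMU process, that is, it establishes the mapping from device-visible I/O addresses (GPAs in the passthrough case) to host virtual addresses (HVAs)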
qemu_vfio_pci_init_irq
qemu_vfio_reset
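
The container, group and device ioctls listed above are normally issued in a fixed order. The following is a minimal sketch of that order, not the QEMU implementation: the struct vfio_handles type and the vfio_open_device function are hypothetical names introduced here, and the sketch assumes the Type1 IOMMU model (VFIO_TYPE1_IOMMU) and the <linux/vfio.h> user-space header.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Hypothetical bundle of the three fds; not a QEMU or kernel type. */
struct vfio_handles {
    int container;
    int group;
    int device;
};

/*
 * Hypothetical helper: walk container -> group -> device.
 * group_path is e.g. "/dev/vfio/26", bdf is e.g. "0000:00:03.0".
 * Returns 0 and fills *h on success; returns -1 with all fds set to -1
 * on failure.
 */
int vfio_open_device(const char *group_path, const char *bdf,
                     struct vfio_handles *h)
{
    struct vfio_group_status status = { .argsz = sizeof(status) };

    assert(group_path != NULL && bdf != NULL && h != NULL);
    h->container = h->group = h->device = -1;

    h->container = open("/dev/vfio/vfio", O_RDWR);
    if (h->container < 0)
        goto fail;

    /* Container level: check the API version and the IOMMU model. */
    if (ioctl(h->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION ||
        ioctl(h->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) <= 0)
        goto fail;

    h->group = open(group_path, O_RDWR);
    if (h->group < 0)
        goto fail;

    /* Group level: the group must be viable, i.e. all of its devices
     * are bound to a VFIO driver (or to no driver at all). */
    if (ioctl(h->group, VFIO_GROUP_GET_STATUS, &status) < 0 ||
        !(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        goto fail;

    /* Attach the group first; only then can the IOMMU model be set. */
    if (ioctl(h->group, VFIO_GROUP_SET_CONTAINER, &h->container) < 0 ||
        ioctl(h->container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
        goto fail;

    /* Device level: the returned fd is used for region and IRQ ioctls. */
    h->device = ioctl(h->group, VFIO_GROUP_GET_DEVICE_FD, bdf);
    if (h->device < 0)
        goto fail;
    return 0;

fail:
    if (h->group >= 0)
        close(h->group);
    if (h->container >= 0)
        close(h->container);
    h->container = h->group = h->device = -1;
    return -1;
}
```

The sketch keeps all three descriptors open on success, since the mappings and the device handle are tied to them for as long as the device is in use.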

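VFIO kernel modules

Binding a device to the vfio-pci driver

When a device is passed through to a VM using VFIO, it is unbound from its current driver and bound to the vfio-pci driver. vfio_pci_probe is called during binding.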

static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    ...
    group = vfio_iommu_group_get(&pdev->dev); // get the device's IOMMU group
    ...
    ret = vfio_register_group_dev(&vdev->vdev); // create the VFIO-level device and bind the vfio_device to its vfio_group; a device belongs to exactly one group
}

vfio_group and vfio_device

struct vfio_device {
        struct device *dev;     // the physical device
        const struct vfio_device_ops *ops;
        struct vfio_group *group;
        struct vfio_device_set *dev_set;
        struct list_head dev_set_list;

        /* Members below here are private, not for driver use */
        refcount_t refcount;
        unsigned int open_count;
        struct completion comp;
        struct list_head group_next;
};

int vfio_register_group_dev(struct vfio_device *device)  
{
    ...
	iommu_group = iommu_group_get(device->dev); // get the device's IOMMU group
	if (!iommu_group)
		return -EINVAL;

	// iommu_group is the IOMMU-layer group; look up or create the matching VFIO-layer group (vfio_group)
	group = vfio_group_get_from_iommu(iommu_group);
	if (!group) {
		group = vfio_create_group(iommu_group); // create the vfio group
		if (IS_ERR(group)) {
			iommu_group_put(iommu_group);
			return PTR_ERR(group);
		}
	} else {
		iommu_group_put(iommu_group);
	}

	existing_device = vfio_group_get_device(group, device->dev); // check whether the vfio_group already contains this device
    ...
    device->group = group;  // fill in the vfio_device fields
    refcount_set(&device->refcount, 1);
    ...
	return 0;
}

Creating a vfio_group (when a vfio_group is created, a /dev/vfio/$groupid device node is exposed to user space)

struct vfio_group {
	struct kref			kref;
	int				minor;
	atomic_t			container_users;
	struct iommu_group		*iommu_group;
	struct vfio_container		*container;
	struct list_head		device_list;
	struct mutex			device_lock;
	struct device			*dev;
	struct notifier_block		nb;
	struct list_head		vfio_next;
	struct list_head		container_next;
	struct list_head		unbound_list;
	struct mutex			unbound_lock;
	atomic_t			opened;
	wait_queue_head_t		container_q;
	bool				noiommu;
	unsigned int			dev_counter;
	struct kvm			*kvm;
	struct blocking_notifier_head	notifier;
};

static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
{
    struct vfio_group *group, *tmp;
    ...
    group->iommu_group = iommu_group;
    ...
    minor = vfio_alloc_group_minor(group);
	if (minor < 0) {
		vfio_group_unlock_and_free(group);
		return ERR_PTR(minor);
	}

	dev = device_create(vfio.class, NULL,                 // creates the /dev/vfio/$groupid device, through which the user-space QEMU process controls the vfio_group
			    MKDEV(MAJOR(vfio.group_devt), minor),
			    group, "%s%d", group->noiommu ? "noiommu-" : "",
			    iommu_group_id(iommu_group));
}

When a user-space process issues the VFIO_GROUP_GET_DEVICE_FD ioctl, the kernel calls vfio_group_get_device_fd.

static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
{
    struct vfio_device *device;
    struct file *filep;
    int fdno;
    int ret = 0;
    ...
    mutex_lock(&device->dev_set->lock);
    device->open_count++;
    if (device->open_count == 1 && device->ops->open_device) {  // mainly for mdev devices; calls into vfio_mdev_dev_ops
            ret = device->ops->open_device(device);
            if (ret)
                    goto err_undo_count;
    }
    mutex_unlock(&device->dev_set->lock);

    if (device->ops->open) {   // calls vfio_pci_open from the registered vfio_pci_ops
            ret = device->ops->open(device);
            if (ret)
                    goto err_close_device;
    }
    ...
    fdno = ret = get_unused_fd_flags(O_CLOEXEC); // get an unused file descriptor
    if (ret < 0)
            goto err_release;

    filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,  // create a file associated with the vfio device
                                device, O_RDWR);
    ...
    fd_install(fdno, filep); // bind the file descriptor to the file
    ...
}

vfio-pci interface

vfio_group_get_device_fd calls vfio_pci_open from vfio_pci_ops, and vfio_pci_open in turn calls vfio_pci_enable.

static int vfio_pci_open(struct vfio_device *core_vdev)
{
        struct vfio_pci_device *vdev =
                container_of(core_vdev, struct vfio_pci_device, vdev);
        int ret = 0;

        mutex_lock(&vdev->reflck->lock);

        if (!vdev->refcnt) {
                ret = vfio_pci_enable(vdev);
                if (ret)
                        goto error;

                vfio_spapr_pci_eeh_open(vdev->pdev);
                vfio_pci_vf_token_user_add(vdev, 1);
        }
        vdev->refcnt++;
error:
        mutex_unlock(&vdev->reflck->lock);
        return ret;
}

static int vfio_pci_enable(struct vfio_pci_device *vdev)
{
    ...
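    ret = pci_enable_device(pdev);  // enable the device; every PCI device driver has to call this
    ...
    ret = vfio_config_init(vdev);   // build the vfio_pci_device configuration (pci, bar, ram) from the physical device's configuration; pci_config_map in vfio_pci_device holds the physical device's configuration-space data and the rbar array holds the data for 7 BARs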
    ...
}

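A VFIO device is read and written through vfio_pci_read and vfio_pci_write, both of which end up in vfio_pci_rw. VFIO_PCI_OFFSET_TO_INDEX converts the file offset into a region index, and pdev->config_size holds the size of the physical device's configuration space.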

static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
                           size_t count, loff_t *ppos, bool iswrite)
{
        unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);   // index of the region user space is accessing; the access is dispatched on this index

        if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
                return -EINVAL;

        switch (index) {
        case VFIO_PCI_CONFIG_REGION_INDEX:
                return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);

        case VFIO_PCI_ROM_REGION_INDEX:
                if (iswrite)
                        return -EINVAL;
                return vfio_pci_bar_rw(vdev, buf, count, ppos, false);

        case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
                return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);

        case VFIO_PCI_VGA_REGION_INDEX:
                return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
        default:
                index -= VFIO_PCI_NUM_REGIONS;
                return vdev->region[index].ops->rw(vdev, buf,
                                                   count, ppos, iswrite);
        }

        return -EINVAL;
}
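
The offset-to-index conversion used by vfio_pci_rw can be illustrated with a small sketch. It assumes that the region index occupies the bits above bit 40 of the file offset, which matches VFIO_PCI_OFFSET_SHIFT in recent kernel sources; treat the value as an assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: region index in the bits above bit 40 of the offset. */
#define REGION_OFFSET_SHIFT 40
#define REGION_OFFSET_MASK  ((UINT64_C(1) << REGION_OFFSET_SHIFT) - 1)

/* Region index encoded in a device-fd file position. */
unsigned int region_index_of(uint64_t pos)
{
    return (unsigned int)(pos >> REGION_OFFSET_SHIFT);
}

/* Byte offset inside the region encoded in a file position. */
uint64_t region_inner_offset(uint64_t pos)
{
    return pos & REGION_OFFSET_MASK;
}

/* File position for byte `inner` of region `index`. */
uint64_t region_pos(unsigned int index, uint64_t inner)
{
    assert(inner <= REGION_OFFSET_MASK);
    return ((uint64_t)index << REGION_OFFSET_SHIFT) | inner;
}
```

User-space code should not hard-code this layout. The offset field returned by VFIO_DEVICE_GET_REGION_INFO is the value to pass to pread and pwrite.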

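VFIO driver analysis

The VFIO driver's init function, vfio_init, calls misc_register(&vfio_dev) to register the misc device vfio_dev, which exposes the device node /dev/vfio/vfio to user space. When a user-space process opens "/dev/vfio/vfio", the kernel calls vfio_fops_open, which allocates a vfio_container.
The vfio_dev structure is defined as follows: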

static struct miscdevice vfio_dev = {
        .minor = VFIO_MINOR,
        .name = "vfio",
        .fops = &vfio_fops,
        .nodename = "vfio/vfio",
        .mode = S_IRUGO | S_IWUGO,
};

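VFIO IOMMU driver analysis

The VFIO IOMMU driver is the bridge between the VFIO interface and the underlying IOMMU: it accepts requests from the VFIO interface above and uses the IOMMU driver below to perform DMA remapping and interrupt remapping. Taking the vfio_iommu_type1 driver as an example, here is a brief look at what a VFIO IOMMU driver does.
Each container opens one VFIO IOMMU driver: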

static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
        .name                   = "vfio-iommu-type1",
        .owner                  = THIS_MODULE,
        .open                   = vfio_iommu_type1_open,
        .release                = vfio_iommu_type1_release,
        .ioctl                  = vfio_iommu_type1_ioctl,
        .attach_group           = vfio_iommu_type1_attach_group, // add an iommu_group to the iommu_domain of the vfio_iommu; every device in the group is written into the IOMMU hardware's context table
        .detach_group           = vfio_iommu_type1_detach_group,
        .pin_pages              = vfio_iommu_type1_pin_pages,
        .unpin_pages            = vfio_iommu_type1_unpin_pages,
        .register_notifier      = vfio_iommu_type1_register_notifier,
        .unregister_notifier    = vfio_iommu_type1_unregister_notifier,
        .dma_rw                 = vfio_iommu_type1_dma_rw,
        .group_iommu_domain     = vfio_iommu_type1_group_iommu_domain,
};

static void *vfio_iommu_type1_open(unsigned long arg)
{
        struct vfio_iommu *iommu;
        iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
        ...
        INIT_LIST_HEAD(&iommu->domain_list);    // domain_list links all vfio_domain objects
        INIT_LIST_HEAD(&iommu->iova_list);
        iommu->dma_list = RB_ROOT;      // dma_list is the container's DMA remapping table, i.e. the GPA-to-HVA translations
        iommu->dma_avail = dma_entry_limit;
        ...
}

Calling ioctl(VFIO_SET_IOMMU) on a container reaches the kernel function vfio_ioctl_set_iommu, implemented as follows:

static long vfio_ioctl_set_iommu(struct vfio_container *container,
                                 unsigned long arg)
{
        ...
        list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
                ...
                if (!try_module_get(driver->ops->owner))
                        continue;
                if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) { // find the driver that supports the IOMMU type given in arg
                        module_put(driver->ops->owner);
                        continue;
                }
                data = driver->ops->open(arg);  // returns a vfio_iommu
                ...
                ret = __vfio_container_attach_groups(container, driver, data); // attach every group in the container to the VFIO IOMMU driver
                ...
                container->iommu_driver = driver;
                container->iommu_data = data;
                break;
        }
        ...
}

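All devices in a container are passed through to the virtual machine and share one DMA mapping table. When a user-space process calls ioctl(VFIO_IOMMU_MAP_DMA) to set up DMA remapping, the kernel goes through vfio_iommu_type1_map_dma -> vfio_dma_do_map. The caller passes a struct vfio_iommu_type1_dma_map, which describes the mapping between the process's virtual addresses and the device's I/O addresses: vaddr is the process virtual address, iova is the device I/O address, and size is the length of the mapping.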

/**
 * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_dma_map)
 *
 * Map process virtual addresses to IO virtual addresses using the
 * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
 */
struct vfio_iommu_type1_dma_map {
        __u32   argsz;
        __u32   flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0)         /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)        /* writable from device */
        __u64   vaddr;                          /* Process virtual address */
        __u64   iova;                           /* IO virtual address */
        __u64   size;                           /* Size of mapping (bytes) */
};
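
From user space, filling in this structure and issuing the ioctl might look like the sketch below. The helper names make_dma_map and map_for_dma are hypothetical; only the structure, its flags and VFIO_IOMMU_MAP_DMA come from <linux/vfio.h>. The sketch assumes the caller already holds a container fd on which VFIO_SET_IOMMU has succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical helper: describe a mapping of [iova, iova + size) onto
 * the process memory starting at vaddr. */
struct vfio_iommu_type1_dma_map make_dma_map(void *vaddr, uint64_t iova,
                                             uint64_t size, bool writable)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),               /* the caller sets argsz */
        .flags = VFIO_DMA_MAP_FLAG_READ,    /* device may read */
        .vaddr = (uint64_t)(uintptr_t)vaddr,
        .iova  = iova,
        .size  = size,
    };

    assert(size != 0);
    if (writable)
        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;   /* device may write */
    return map;
}

/* Hypothetical helper: returns 0 on success, -1 with errno set otherwise. */
int map_for_dma(int container_fd, void *vaddr, uint64_t iova,
                uint64_t size, bool writable)
{
    struct vfio_iommu_type1_dma_map map =
        make_dma_map(vaddr, iova, size, writable);

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

In practice vaddr, iova and size are expected to be page aligned; the sketch leaves that check to the kernel.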

static int vfio_dma_do_map(struct vfio_iommu *iommu,
                           struct vfio_iommu_type1_dma_map *map)
{
        struct vfio_dma *dma;
        ...
        if (vfio_find_dma(iommu, iova, size)) {  // check whether the requested range overlaps an existing mapping
                ret = -EEXIST;
                goto out_unlock;
        }
        ...
        dma = kzalloc(sizeof(*dma), GFP_KERNEL); // allocate and initialize a vfio_dma
        if (!dma) {
                ret = -ENOMEM;
                goto out_unlock;
        }

        iommu->dma_avail--;
        dma->iova = iova;
        dma->vaddr = vaddr;
        dma->prot = prot;
        ...
        /* Insert zero-sized and grow as we map chunks of it */
        vfio_link_dma(iommu, dma); // insert the vfio_dma into the dma_list red-black tree of the vfio_iommu

        /* Don't pin and map if container doesn't contain IOMMU capable domain*/
        if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
                dma->size = size;
        else
                ret = vfio_pin_map_dma(iommu, dma, size);       // pin a contiguous range of memory and map it through vfio_iommu_map
        ...

}

static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
                                      dma_addr_t start, size_t size)
{
        struct rb_node *node = iommu->dma_list.rb_node;

        while (node) {
                struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);

                if (start + size <= dma->iova)
                        node = node->rb_left;
                else if (start >= dma->iova + dma->size)
                        node = node->rb_right;
                else
                        return dma;
        }

        return NULL;
}

struct vfio_dma {
        struct rb_node          node;
        dma_addr_t              iova;           /* Device address */
        unsigned long           vaddr;          /* Process virtual addr */
        size_t                  size;           /* Map size (bytes) */
        int                     prot;           /* IOMMU_READ/WRITE permissions */
        bool                    iommu_mapped;
        bool                    lock_cap;       /* capable(CAP_IPC_LOCK) */
        struct task_struct      *task;
        struct rb_root          pfn_list;       /* Ex-user pinned pfn list */
        unsigned long           *bitmap;
};
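
The three-way comparison in vfio_find_dma is an interval-overlap search. The sketch below applies the same comparison to a sorted array instead of a red-black tree, purely to make the logic easy to try out; struct range and find_overlap are hypothetical names, and the array is assumed to be sorted by iova with no overlapping entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the iova and size fields of struct vfio_dma. */
struct range {
    uint64_t iova;
    uint64_t size;
};

/*
 * Binary search over ranges sorted by iova. Returns an entry that
 * overlaps [start, start + size), or NULL if there is none.
 */
const struct range *find_overlap(const struct range *r, size_t n,
                                 uint64_t start, uint64_t size)
{
    size_t lo = 0, hi = n;

    assert(r != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (start + size <= r[mid].iova)
            hi = mid;               /* query ends before this entry */
        else if (start >= r[mid].iova + r[mid].size)
            lo = mid + 1;           /* query starts after this entry */
        else
            return &r[mid];         /* the two ranges overlap */
    }
    return NULL;
}
```

A request is rejected with -EEXIST exactly when such an overlapping entry exists, which is why adjacent ranges that merely touch are still accepted.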

VFIO and device passthrough

This part analyzes the QEMU code.

vfio_realize is the realize function of a VFIO device:

static void vfio_realize(PCIDevice *pdev, Error **errp)
{
        VFIOPCIDevice *vdev = PCI_VFIO(pdev);
        ...
        vdev->vbasedev.sysfsdev =   
            g_strdup_printf("/sys/bus/pci/devices/%04x:%02x:%02x.%01x", // build the device's sysfs path (sysfsdev)
                            vdev->host.domain, vdev->host.bus,
                            vdev->host.slot, vdev->host.function);
        ...
        tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
        len = readlink(tmp, group_path, sizeof(group_path));
        g_free(tmp);
        ...
        group_name = basename(group_path);
        ...
        if (sscanf(group_name, "%d", &groupid) != 1) {  // parse the device's group ID
                error_setg_errno(errp, errno, "failed to read %s", group_path);
                goto error;
        }
        trace_vfio_realize(vdev->vbasedev.name, groupid);

        group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp); // get the VFIOGroup for this group ID (/dev/vfio/$groupid) and add it to a VFIOContainer
        ...
        ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev, errp); // fetch basic device information and link the VFIODevice to its VFIOGroup
        ...
        vfio_populate_device(vdev, &err);   // collect the device's region information (PCI configuration space, BARs, ROM)
        ...
        ret = pread(vdev->vbasedev.fd, vdev->pdev.config, // read a copy of the device's configuration space into pdev.config
                MIN(pci_config_size(&vdev->pdev), vdev->config_size),
                vdev->config_offset);
        if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
                ret = ret < 0 ? -errno : -EFAULT;
                error_setg_errno(errp, -ret, "failed to read device config space");
                goto error;
        }

        /* vfio emulates a lot for us, but some bits need extra love */
        vdev->emulated_config_bits = g_malloc0(vdev->config_size);

        /* QEMU can choose to expose the ROM or not */
        memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
        /* QEMU can also add or extend BARs */
        memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
        if (vdev->vendor_id != PCI_ANY_ID) {
                if (vdev->vendor_id >= 0xffff) {
                        error_setg(errp, "invalid PCI vendor ID provided");
                        goto error;
                }
                vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0);  // write into the emulated configuration space; a series of such adjustments presents a complete PCI device to the guest
                trace_vfio_pci_emulated_vendor_id(vdev->vbasedev.name, vdev->vendor_id);
        } else {
                vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
        }

        ...
        vfio_pci_size_rom(vdev);    // handle the device ROM, mainly by creating a MemoryRegion for passthrough devices that have one

        vfio_bars_prepare(vdev);

        vfio_msix_early_setup(vdev, &err);  // set up MSI-X interrupts
        if (err) {
                error_propagate(errp, err);
                goto error;
        }

        vfio_bars_register(vdev); // initialize each BAR

        ret = vfio_add_capabilities(vdev, errp);   // add capabilities to the virtual device based on the physical device's PCI capabilities
        ...
        if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
                vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
                                                        vfio_intx_mmap_enable, vdev);
                pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_intx_update);
                ret = vfio_intx_enable(vdev, errp); // enable the interrupt
                if (ret) {
                        goto out_teardown;
                }
        }
        ...
}

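Emulating the device I/O address space

vfio_populate_device calls vfio_region_setup for each BAR of the passthrough device. All BAR information for the virtual device is kept in the bars array of the VFIOPCIDevice structure, whose elements are of type VFIOBAR. An important member of VFIOBAR is region, which holds the BAR information of the virtual device.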

static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
{
    VFIODevice *vbasedev = &vdev->vbasedev;
    struct vfio_region_info *reg_info;
    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
    ...
    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) { // [VFIO_PCI_BAR0_REGION_INDEX, VFIO_PCI_ROM_REGION_INDEX) covers the indices of the six BARs, BAR0 through BAR5
        char *name = g_strdup_printf("%s BAR %d", vbasedev->name, i);

        ret = vfio_region_setup(OBJECT(vdev), vbasedev,
                                &vdev->bars[i].region, i, name);
        g_free(name);

        if (ret) {
            error_setg_errno(errp, -ret, "failed to get region %d info", i);
            return;
        }

        QLIST_INIT(&vdev->bars[i].quirks);
    }

    ret = vfio_get_region_info(vbasedev,
                               VFIO_PCI_CONFIG_REGION_INDEX, &reg_info);
	...

}

typedef struct VFIOBAR {
    VFIORegion region;
    MemoryRegion *mr;
    size_t size;
    uint8_t type;
    bool ioport;
    bool mem64;
    QLIST_HEAD(, VFIOQuirk) quirks;
} VFIOBAR;

typedef struct VFIORegion {
    struct VFIODevice *vbasedev; // points to the virtual device
    off_t fd_offset; /* offset of region within device fd, i.e. where this BAR starts in the file behind the passthrough device's fd */
    MemoryRegion *mem; /* slow, read/write access; points to the MemoryRegion for this BAR */
    size_t size;
    uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
    uint32_t nr_mmaps;
    VFIOMmap *mmaps;
    uint8_t nr; /* cache the region number for debug */
} VFIORegion;

int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                      int index, const char *name)
{
    struct vfio_region_info *info;
    int ret;
    ret = vfio_get_region_info(vbasedev, index, &info); // get information about the given region
    ...
    region->vbasedev = vbasedev;
    region->flags = info->flags;
    region->size = info->size;
    region->fd_offset = info->offset;
    region->nr = index;

    if (region->size) {
        region->mem = g_new0(MemoryRegion, 1);
        memory_region_init_io(region->mem, obj, &vfio_region_ops, // create and initialize a MemoryRegion from this information
                              region, name, region->size);
        /* allocate the mmaps array of the VFIORegion and initialize its members */
        if (!vbasedev->no_mmap &&
            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {

            ret = vfio_setup_region_sparse_mmaps(region, info);

            if (ret) {
                region->nr_mmaps = 1;
                region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
                region->mmaps[0].offset = 0;
                region->mmaps[0].size = region->size;
            }
        }
    }
}

vfio_realize calls vfio_bars_register, which initializes each BAR through vfio_bar_register:

static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
{
    VFIOBAR *bar = &vdev->bars[nr];
    char *name;

    if (!bar->size) {
        return;
    }

    bar->mr = g_new0(MemoryRegion, 1);
    name = g_strdup_printf("%s base BAR %d", vdev->vbasedev.name, nr);
    memory_region_init_io(bar->mr, OBJECT(vdev), NULL, NULL, name, bar->size);
    g_free(name);

    if (bar->region.size) {
        memory_region_add_subregion(bar->mr, 0, bar->region.mem);

        if (vfio_region_mmap(&bar->region)) { // map the passthrough device's BAR address space into QEMU
            error_report("Failed to mmap %s BAR %d. Performance may be slow",
                         vdev->vbasedev.name, nr);
        }
    }

    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr); // register the BAR for the virtual device
}

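VFIO interrupt handling

Reference: how eventfd is used in QEMU
Once a device has been passed through to a virtual machine with VFIO, the vfio-pci driver takes over the PCI device's interrupts, so vfio-pci registers an interrupt handler for the device. When the host receives an interrupt from the passthrough device, the interrupt can either be injected into the guest directly in the kernel or be handed to QEMU.
QEMU calls ioctl(VFIO_DEVICE_SET_IRQS) on the VFIO device fd to register an eventfd; when the vfio-pci driver in the kernel receives an interrupt from the passthrough device, it signals that eventfd. During initialization QEMU also calls ioctl(KVM_IRQFD) on the VM fd to tie the eventfd to the VFIO virtual device's interrupt (mainly through kvm_irqchip_assign_irqfd in QEMU and kvm_irqfd in the kernel). When the eventfd is signaled, an interrupt is injected into the guest, which completes the path from the physical device raising an interrupt to the guest receiving it.

In qemu/util/vfio-helpers.c, qemu_vfio_pci_init_irq, which initializes interrupts for a vfio-pci device, calls ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set) to configure the interrupt. The function and the structure behind irq_set (struct vfio_irq_set *irq_set;) are shown below: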

/**
 * Initialize device IRQ with @irq_type and register an event notifier.
 */
int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
                           int irq_type, Error **errp)
{
    ...
    /* Get to a known IRQ state */
    *irq_set = (struct vfio_irq_set) {
        .argsz = irq_set_size,
        .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
        .index = irq_info.index,
        .start = 0,
        .count = 1,
    };

    *(int *)&irq_set->data = event_notifier_get_fd(e);
    r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
    ...
}

struct vfio_irq_set {
	__u32	argsz;
	__u32	flags;
#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
	__u32	index;  // which interrupt; for example, INTx uses VFIO_PCI_INTX_IRQ_INDEX
	__u32	start;  // first vector in the range; used mainly by MSI and MSI-X
	__u32	count;  // number of entries in data
	__u8	data[];
};
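
Because data is a flexible array member, the caller has to allocate the header and the payload together and report the combined size in argsz. A minimal sketch of building such a request for a single eventfd follows; make_irq_set_eventfd is a hypothetical helper, and only the structure and flag names come from <linux/vfio.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: build a request that attaches eventfd `efd` as
 * the trigger for vector 0 of interrupt `index`. Passing efd == -1
 * asks the kernel to drop the trigger. The caller frees the result.
 */
struct vfio_irq_set *make_irq_set_eventfd(uint32_t index, int32_t efd)
{
    size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
    struct vfio_irq_set *set;

    assert(efd >= -1);
    set = calloc(1, argsz);
    if (!set)
        return NULL;

    set->argsz = (uint32_t)argsz;   /* header plus one s32 payload */
    set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    set->index = index;
    set->start = 0;
    set->count = 1;
    memcpy(set->data, &efd, sizeof(efd));
    return set;
}
```

The result would then be passed to ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set), where device_fd is the descriptor obtained from VFIO_GROUP_GET_DEVICE_FD.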

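The flags fall into two groups: one describes the data type (NONE, BOOL, EVENTFD) and the other the action requested by user space (mask the interrupt, unmask it, trigger it).
In the kernel the request is handled by vfio_pci_ioctl_set_irqs, which calls vfio_pci_set_irqs_ioctl. That function selects a handler based on index and flags, and the result is returned to QEMU.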

enum {
	VFIO_PCI_INTX_IRQ_INDEX,
	VFIO_PCI_MSI_IRQ_INDEX,
	VFIO_PCI_MSIX_IRQ_INDEX,
	VFIO_PCI_ERR_IRQ_INDEX,
	VFIO_PCI_REQ_IRQ_INDEX,
	VFIO_PCI_NUM_IRQS
};

int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
			    unsigned index, unsigned start, unsigned count,
			    void *data)
{
	int (*func)(struct vfio_pci_core_device *vdev, unsigned index,
		    unsigned start, unsigned count, uint32_t flags,
		    void *data) = NULL;

	switch (index) {
	case VFIO_PCI_INTX_IRQ_INDEX:
		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
		case VFIO_IRQ_SET_ACTION_MASK:
			func = vfio_pci_set_intx_mask;
			break;
		case VFIO_IRQ_SET_ACTION_UNMASK:
			func = vfio_pci_set_intx_unmask;
			break;
		case VFIO_IRQ_SET_ACTION_TRIGGER:
			func = vfio_pci_set_intx_trigger;
			break;
		}
		break;
	case VFIO_PCI_MSI_IRQ_INDEX:
	case VFIO_PCI_MSIX_IRQ_INDEX:
		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
		case VFIO_IRQ_SET_ACTION_MASK:
		case VFIO_IRQ_SET_ACTION_UNMASK:
			/* XXX Need masking support exported */
			break;
		case VFIO_IRQ_SET_ACTION_TRIGGER:
			func = vfio_pci_set_msi_trigger;
			break;
		}
		break;
	case VFIO_PCI_ERR_IRQ_INDEX:
		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
		case VFIO_IRQ_SET_ACTION_TRIGGER:
			if (pci_is_pcie(vdev->pdev))
				func = vfio_pci_set_err_trigger;
			break;
		}
		break;
	case VFIO_PCI_REQ_IRQ_INDEX:
		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
		case VFIO_IRQ_SET_ACTION_TRIGGER:
			func = vfio_pci_set_req_trigger;
			break;
		}
		break;
	}

	if (!func)
		return -ENOTTY;

	return func(vdev, index, start, count, flags, data);
}

Of these, vfio_pci_set_intx_trigger is the most involved, because it requests interrupt resources again for the passthrough device.

struct eventfd_ctx {
	struct kref kref;  
	wait_queue_head_t wqh;  // wait queue head; readers blocked on the eventfd wait here
	/*
	 * Every time that a write(2) is performed on an eventfd, the
	 * value of the __u64 being written is added to "count" and a
	 * wakeup is performed on "wqh". If EFD_SEMAPHORE flag was not
	 * specified, a read(2) will return the "count" value to userspace,
	 * and will reset "count" to zero. The kernel side eventfd_signal()
	 * also, adds to the "count" counter and issue a wakeup.
	 */
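	__u64 count; // a write to the eventfd from user space adds the written value to count and wakes a waiter on wqh. After a user-space read, the kernel either decrements count by 1 or resets it to zero, depending on EFD_SEMAPHORE. The in-kernel eventfd_signal() also increments count and wakes waiters, so a waiter can be woken in two ways: a user-space write or an in-kernel eventfd_signal()
	unsigned int flags; // eventfd flags: EFD_SEMAPHORE (decides what a user-space read does), EFD_CLOEXEC, EFD_NONBLOCK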
	int id;
};
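
The counter behaviour described in the comments can be observed from user space with the eventfd(2) system call alone, without any VFIO device. The sketch below writes two values and reads once; eventfd_sum is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Write a and then b to a fresh eventfd created with `flags`, read it
 * once and return the value read. Returns 0 if any step fails.
 * Both values must be non-zero and small enough not to overflow.
 */
uint64_t eventfd_sum(uint64_t a, uint64_t b, int flags)
{
    uint64_t value = 0;
    int fd;

    assert(a > 0 && b > 0);
    fd = eventfd(0, flags);
    if (fd < 0)
        return 0;

    /* Each write adds to the kernel-side counter. */
    if (write(fd, &a, sizeof(a)) != (ssize_t)sizeof(a) ||
        write(fd, &b, sizeof(b)) != (ssize_t)sizeof(b) ||
        read(fd, &value, sizeof(value)) != (ssize_t)sizeof(value))
        value = 0;

    close(fd);
    return value;
}
```

Without EFD_SEMAPHORE the single read returns the accumulated count and resets it; with EFD_SEMAPHORE the read returns 1 and decrements the counter by one.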

static int vfio_pci_set_intx_trigger(struct vfio_pci_core_device *vdev,
				     unsigned index, unsigned start,
				     unsigned count, uint32_t flags, void *data)
{
	if (is_intx(vdev) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
		vfio_intx_disable(vdev);
		return 0;
	}

	if (!(is_intx(vdev) || is_irq_none(vdev)) || start != 0 || count != 1)
		return -EINVAL;

	if (flags & VFIO_IRQ_SET_DATA_EVENTFD) { // the data is an eventfd
		struct eventfd_ctx *trigger = NULL;
		int32_t fd = *(int32_t *)data;
		int ret;

		if (fd >= 0) {
			trigger = eventfd_ctx_fdget(fd); // look up the eventfd_ctx for the fd passed in from user space
			if (IS_ERR(trigger))
				return PTR_ERR(trigger);
		}

		if (is_intx(vdev))
			ret = vfio_intx_set_signal(vdev, trigger); 
		else
			ret = vfio_intx_enable(vdev, trigger);  // set up the interrupt for the VFIO device

		if (ret && trigger)
			eventfd_ctx_put(trigger);

		return ret;
	}

	if (!is_intx(vdev))
		return -EINVAL;

	if (flags & VFIO_IRQ_SET_DATA_NONE) {
		vfio_send_intx_eventfd(vdev, vfio_irq_ctx_get(vdev, 0));
	} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
		uint8_t trigger = *(uint8_t *)data;
		if (trigger)
			vfio_send_intx_eventfd(vdev, vfio_irq_ctx_get(vdev, 0));
	}
	return 0;
}

static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
			    struct eventfd_ctx *trigger)
{
	struct pci_dev *pdev = vdev->pdev;
	struct vfio_pci_irq_ctx *ctx;
	unsigned long irqflags;
	char *name;
	int ret;

	if (!is_irq_none(vdev))
		return -EINVAL;

	if (!pdev->irq)
		return -ENODEV;

	name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)", pci_name(pdev));
	if (!name)
		return -ENOMEM;

	ctx = vfio_irq_ctx_alloc(vdev, 0); // allocate a vfio_pci_irq_ctx
	if (!ctx) {
		kfree(name);
		return -ENOMEM;
	}

	ctx->name = name;   // initialize the vfio_pci_irq_ctx
	ctx->trigger = trigger;
	ctx->vdev = vdev;
	ctx->masked = vdev->virq_disabled;
	if (vdev->pci_2_3) {
		pci_intx(pdev, !ctx->masked);
		irqflags = IRQF_SHARED;
	} else {
		irqflags = ctx->masked ? IRQF_NO_AUTOEN : 0;
	}

	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;

	ret = request_irq(pdev->irq, vfio_intx_handler, // request the interrupt
			  irqflags, ctx->name, ctx);
	if (ret) {
		vdev->irq_type = VFIO_PCI_NUM_IRQS;
		kfree(name);
		vfio_irq_ctx_free(vdev, ctx, 0);
		return ret;
	}

	return 0;
}

static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
{
	struct vfio_pci_irq_ctx *ctx = dev_id;
	struct vfio_pci_core_device *vdev = ctx->vdev;
	unsigned long flags;
	int ret = IRQ_NONE;

	spin_lock_irqsave(&vdev->irqlock, flags);

	if (!vdev->pci_2_3) {
		disable_irq_nosync(vdev->pdev->irq);    // mask the physical device's interrupt
		ctx->masked = true;
		ret = IRQ_HANDLED;
	} else if (!ctx->masked &&  /* may be shared */
		   pci_check_and_mask_intx(vdev->pdev)) {       // mask the physical device's interrupt
		ctx->masked = true;
		ret = IRQ_HANDLED;
	}

	spin_unlock_irqrestore(&vdev->irqlock, flags);

	if (ret == IRQ_HANDLED)
		vfio_send_intx_eventfd(vdev, ctx);      // signal the eventfd

	return ret;
}

static void vfio_send_intx_eventfd(void *opaque, void *data)
{
	struct vfio_pci_core_device *vdev = opaque;

	if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
		struct vfio_pci_irq_ctx *ctx = data;
		struct eventfd_ctx *trigger = READ_ONCE(ctx->trigger);

		if (likely(trigger))
			eventfd_signal(trigger); // once the eventfd is signaled, KVM injects an interrupt into the guest, completing interrupt delivery for the passthrough device
	}
}

At the end of QEMU's vfio_realize, vfio_intx_enable is called to set up the device's interrupt.

typedef struct PCIINTxRoute {
    enum {
        PCI_INTX_ENABLED,
        PCI_INTX_INVERTED,
        PCI_INTX_DISABLED,
    } mode; // interrupt state
    int irq;    // the interrupt line on the interrupt controller that this INTx is routed to
} PCIINTxRoute;

static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
{
    uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1); // read the physical device's interrupt pin
    Error *err = NULL;
    int32_t fd;
    int ret;


    if (!pin) {
        return true;
    }

    vfio_disable_interrupts(vdev);

    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
    pci_config_set_interrupt_pin(vdev->pdev.config, pin); // copy the physical device's interrupt pin into the virtual device's configuration space

#ifdef CONFIG_KVM
    /*
     * Only conditional to avoid generating error messages on platforms
     * where we won't actually use the result anyway.
     */
    if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
        vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,    // get the INTx routing information for the passthrough device
                                                        vdev->intx.pin);
    }
#endif

    ret = event_notifier_init(&vdev->intx.interrupt, 0);    // initialize the event notifier
    if (ret) {
        error_setg_errno(errp, -ret, "event_notifier_init failed");
        return false;
    }
    fd = event_notifier_get_fd(&vdev->intx.interrupt);
    qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);   // register vfio_intx_interrupt as the handler: when the kernel signals the eventfd, QEMU is notified and runs vfio_intx_interrupt for this fd

    if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
        qemu_set_fd_handler(fd, NULL, NULL, vdev);
        event_notifier_cleanup(&vdev->intx.interrupt);
        return false;
    }

    if (!vfio_intx_enable_kvm(vdev, &err)) {
        warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
    }

    vdev->interrupt = VFIO_INT_INTx;

    trace_vfio_intx_enable(vdev->vbasedev.name);
    return true;
}

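DMA remapping

DMA remapping means setting up the mapping from the device's view of memory to virtual addresses in the QEMU process. This is done by vfio_listener_region_add. The function works out the starting iova (the guest physical address, GPA) and the starting vaddr (the QEMU process address, HVA), then calls vfio_dma_map to establish the mapping between guest GPAs and QEMU HVAs.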

static void vfio_listener_region_add(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
    llend = int128_make64(section->offset_within_address_space);
    ...
    /* Here we assume that memory_region_is_ram(section->mr)==true */

    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region +
            (iova - section->offset_within_address_space);
    ...
    ret = vfio_dma_map(container, iova, int128_get64(llsize),
                       vaddr, section->readonly);    
    ...
}

static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
                        ram_addr_t size, void *vaddr, bool readonly)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ,
        .vaddr = (__u64)(uintptr_t)vaddr,
        .iova = iova,
        .size = size,
    };

    if (!readonly) {
        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
    }

    /*
     * Try the mapping, if it fails with EBUSY, unmap the region and try
     * again.  This shouldn't be necessary, but we sometimes see it in
     * the VGA ROM space.
     */
    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
        return 0;
    }

    error_report("VFIO_MAP_DMA: %d", -errno);
    return -errno;
}
