Qemu中PCI设备透传(PCI-Assign)源码分析

在网上看到很多人说 Qemu PCI-Assign 透传不支持 IOMMU ,而 VFIO 透传却可以(还被当做一种优势进行推荐)。而 VFIO SRIOV 并非有必然联系,那就是说 VFIO PCI-Assign 进本都是靠软件实现的了?既然都是软件实现的,为啥 PCI-Assign 不可以,而 VFIO 可以呢?这不科学啊!从来也没人说清楚这件事,到源码里看一下吧!(最后发现他们说的都是错的,网上无根据的文章不要乱信)

PCI-Assign透传的基本使用步骤

随便在网上找一个 PCI-Assign 透传的使用方法:

How to use 'pci pass-through' to run Linux in Qemu accessing real Ath9k adapter
===============================================================================

# Boot kernel with 'intel_iommu=on'

# Unbind driver from the device and bind 'pci-stub' to it
$ echo "168c 0030" > /sys/bus/pci/drivers/pci-stub/new_id
$ echo 0000:0b:00.0 > /sys/bus/pci/devices/0000:0b:00.0/driver/unbind
$ echo 0000:0b:00.0 > /sys/bus/pci/drivers/pci-stub/bind

# Then just run
$ sudo qemu-system-i386 -m 1024 \
	-device pci-assign,host=0b:00.0,rombar=0 \
	-enable-kvm \
	-kernel $KERNEL \
	-hda $DISK \
	-boot c \
	-append "root=/dev/sda rw"

# In case of
qemu-system-i386: -device pci-assign,host=0b:00.0: Failed to assign device "(null)" : Operation not permitted
qemu-system-i386: -device pci-assign,host=0b:00.0: Device initialization failed.
qemu-system-i386: -device pci-assign,host=0b:00.0: Device 'kvm-pci-assign' could not be initialized

$ dmesg | tail
[  112.129138] kvm_iommu_map_guest: No interrupt remapping support, disallowing device assignment. Re-enble with "allow_unsafe_assigned_interrupts=1" module option.

# run
$ echo 1 > /sys/module/kvm/parameters/allow_unsafe_assigned_interrupts

总结一下基本步骤:

  1. 在内核启动参数中配置 intel_iommu=on ,开启 IOMMU (从这里就可以看出, PCI-Assign 不支持 IOMMU 的话,要配置 IOMMU 干什么呢?);
  2. pci-stub 驱动的 new_id 文件写入要绑定设备的 VendorID DeviceID
  3. PCI 设备与原有驱动解绑;
  4. PCI 设备绑定到 pci-stub 驱动上。
  5. Qemu 上以 PCI 设备的 PCI 地址为形参启动虚拟机。

疑问:我们知道用户态程序使用一个设备,必须要使用它的用户态接口,通常就是一个设备文件,而这里指定一个地址是什么意思? Qemu 是如何使用这个 PCI 设备的?

PCI设备与pci-stub驱动绑定

接下来我们就看看这个 pci-stub 驱动到底是怎么绑定到 PCI 设备上的。

驱动内核模块加载时绑定

  • 这是内核中 pci-stub 驱动的源码,如果在模块加载时指定了 ids 参数,将会在加载时为这个 pci-stub 驱动动态增加匹配的 VendorID DeviceID 。我们知道, Linux 内核的 Device Driver Bus 驱动模型中, PCI 总线是通过 VendorID DeviceID 进行匹配的,因此这时就可对应的 PCI 设备进行了匹配绑定。同时,这个驱动啥都没干,没有提供任何上层的应用接口,比如说字符设备或者块设备等(drivers/pci/pci-stub.c):
// SPDX-License-Identifier: GPL-2.0
/*
 * Simple stub driver to reserve a PCI device
 *
 * Copyright (C) 2008 Red Hat, Inc.
 * Author:
 *	Chris Wright
 *
 * Usage is simple, allocate a new id to the stub driver and bind the
 * device to it.  For example:
 *
 * # echo "8086 10f5" > /sys/bus/pci/drivers/pci-stub/new_id
 * # echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
 * # echo -n 0000:00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
 * # ls -l /sys/bus/pci/devices/0000:00:19.0/driver
 * .../0000:00:19.0/driver -> ../../../bus/pci/drivers/pci-stub
 */

#include <linux/module.h>
#include <linux/pci.h>

static char ids[1024] __initdata;

module_param_string(ids, ids, sizeof(ids), 0);
MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the stub driver, format is "
		 "\"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\""
		 " and multiple comma separated entries can be specified");

static int pci_stub_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
	pci_info(dev, "claimed by stub\n");
	return 0;
}

static struct pci_driver stub_driver = {
	.name		= "pci-stub",
	.id_table	= NULL,	/* only dynamic id's */
	.probe		= pci_stub_probe,
};

static int __init pci_stub_init(void)
{
	char *p, *id;
	int rc;

	rc = pci_register_driver(&stub_driver);
	if (rc)
		return rc;

	/* no ids passed actually */
	if (ids[0] == '\0')
		return 0;

	/* add ids specified in the module parameter */
	p = ids;
	while ((id = strsep(&p, ","))) {
		unsigned int vendor, device, subvendor = PCI_ANY_ID,
			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
		int fields;

		if (!strlen(id))
			continue;

		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
				&vendor, &device, &subvendor, &subdevice,
				&class, &class_mask);

		if (fields < 2) {
			printk(KERN_WARNING
			       "pci-stub: invalid id string \"%s\"\n", id);
			continue;
		}

		printk(KERN_INFO
		       "pci-stub: add %04X:%04X sub=%04X:%04X cls=%08X/%08X\n",
		       vendor, device, subvendor, subdevice, class, class_mask);

		rc = pci_add_dynid(&stub_driver, vendor, device,
				   subvendor, subdevice, class, class_mask, 0);
		if (rc)
			printk(KERN_WARNING
			       "pci-stub: failed to add dynamic id (%d)\n", rc);
	}

	return 0;
}

static void __exit pci_stub_exit(void)
{
	pci_unregister_driver(&stub_driver);
}

module_init(pci_stub_init);
module_exit(pci_stub_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Chris Wright <chrisw@sous-sol.org>");
  • 我们再看看 pci_add_dynid 这个函数的实现,他申请并初始化了一个 pci_dynid 结构,增加到 Driver 的一个链表中,最后调用了 driver_attach 函数触发 PCI Bus Device Driver 匹配(drivers/pci/pci-driver.c):
/**
 * pci_add_dynid - add a new PCI device ID to this driver and re-probe devices
 * @drv: target pci driver
 * @vendor: PCI vendor ID
 * @device: PCI device ID
 * @subvendor: PCI subvendor ID
 * @subdevice: PCI subdevice ID
 * @class: PCI class
 * @class_mask: PCI class mask
 * @driver_data: private driver data
 *
 * Adds a new dynamic pci device ID to this driver and causes the
 * driver to probe for all devices again.  @drv must have been
 * registered prior to calling this function.
 *
 * CONTEXT:
 * Does GFP_KERNEL allocation.
 *
 * RETURNS:
 * 0 on success, -errno on failure.
 */
int pci_add_dynid(struct pci_driver *drv,
		  unsigned int vendor, unsigned int device,
		  unsigned int subvendor, unsigned int subdevice,
		  unsigned int class, unsigned int class_mask,
		  unsigned long driver_data)
{
	struct pci_dynid *dynid;

	dynid = kzalloc(sizeof(*dynid), GFP_KERNEL);
	if (!dynid)
		return -ENOMEM;

	dynid->id.vendor = vendor;
	dynid->id.device = device;
	dynid->id.subvendor = subvendor;
	dynid->id.subdevice = subdevice;
	dynid->id.class = class;
	dynid->id.class_mask = class_mask;
	dynid->id.driver_data = driver_data;

	spin_lock(&drv->dynids.lock);
	list_add_tail(&dynid->node, &drv->dynids.list);
	spin_unlock(&drv->dynids.lock);

	return driver_attach(&drv->driver);
}
EXPORT_SYMBOL_GPL(pci_add_dynid);
  • driver_attach 遍历 PCI Bus 上的所有 Device ,使用 VendorID DeviceID Driver 进行匹配(匹配方法由 Bus 驱动提供)和绑定(drivers/base/dd.c):
static int __driver_attach(struct device *dev, void *data)
{
	struct device_driver *drv = data;
	int ret;

	/*
	 * Lock device and try to bind to it. We drop the error
	 * here and always return 0, because we need to keep trying
	 * to bind to devices and some drivers will return an error
	 * simply if it didn't support the device.
	 *
	 * driver_probe_device() will spit a warning if there
	 * is an error.
	 */

	ret = driver_match_device(drv, dev);
	if (ret == 0) {
		/* no match */
		return 0;
	} else if (ret == -EPROBE_DEFER) {
		dev_dbg(dev, "Device match requests probe deferral\n");
		driver_deferred_probe_add(dev);
	} else if (ret < 0) {
		dev_dbg(dev, "Bus failed to match device: %d", ret);
		return ret;
	} /* ret > 0 means positive match */

	if (dev->parent)	/* Needed for USB */
		device_lock(dev->parent);
	device_lock(dev);
	if (!dev->driver)
		driver_probe_device(drv, dev);
	device_unlock(dev);
	if (dev->parent)
		device_unlock(dev->parent);

	return 0;
}

/**
 * driver_attach - try to bind driver to devices.
 * @drv: driver.
 *
 * Walk the list of devices that the bus has on it and try to
 * match the driver with each one.  If driver_probe_device()
 * returns 0 and the @dev->driver is set, we've found a
 * compatible pair.
 */
int driver_attach(struct device_driver *drv)
{
	return bus_for_each_dev(drv->bus, NULL, drv, __driver_attach);
}
EXPORT_SYMBOL_GPL(driver_attach);

通过设备的sysfs文件绑定

  • PCI Bus 驱动在注册时分别注册了三个( Bus Device Driver )的 SysFS 属性组,而 Driver SysFS 属性组包含 new_id 这个 SysFS 属性文件。在对这个文件的进行设置的函数 new_id_store 中,写入数据中解析出 VendorID DeviceID ,最后调用 pci_add_dynid 进行与设备的绑定(drivers/pci/pci-driver.c):
/**
 * store_new_id - sysfs frontend to pci_add_dynid()
 * @driver: target device driver
 * @buf: buffer for scanning device ID data
 * @count: input size
 *
 * Allow PCI IDs to be added to an existing driver via sysfs.
 */
static ssize_t new_id_store(struct device_driver *driver, const char *buf,
			    size_t count)
{
	struct pci_driver *pdrv = to_pci_driver(driver);
	const struct pci_device_id *ids = pdrv->id_table;
	__u32 vendor, device, subvendor = PCI_ANY_ID,
		subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
	unsigned long driver_data = 0;
	int fields = 0;
	int retval = 0;

	fields = sscanf(buf, "%x %x %x %x %x %x %lx",
			&vendor, &device, &subvendor, &subdevice,
			&class, &class_mask, &driver_data);
	if (fields < 2)
		return -EINVAL;

	if (fields != 7) {
		struct pci_dev *pdev = kzalloc(sizeof(*pdev), GFP_KERNEL);
		if (!pdev)
			return -ENOMEM;

		pdev->vendor = vendor;
		pdev->device = device;
		pdev->subsystem_vendor = subvendor;
		pdev->subsystem_device = subdevice;
		pdev->class = class;

		if (pci_match_id(pdrv->id_table, pdev))
			retval = -EEXIST;

		kfree(pdev);

		if (retval)
			return retval;
	}

	/* Only accept driver_data values that match an existing id_table
	   entry */
	if (ids) {
		retval = -EINVAL;
		while (ids->vendor || ids->subvendor || ids->class_mask) {
			if (driver_data == ids->driver_data) {
				retval = 0;
				break;
			}
			ids++;
		}
		if (retval)	/* No match */
			return retval;
	}

	retval = pci_add_dynid(pdrv, vendor, device, subvendor, subdevice,
			       class, class_mask, driver_data);
	if (retval)
		return retval;
	return count;
}
static DRIVER_ATTR_WO(new_id);


static struct attribute *pci_drv_attrs[] = {
	&driver_attr_new_id.attr,
	&driver_attr_remove_id.attr,
	NULL,
};
ATTRIBUTE_GROUPS(pci_drv);


struct bus_type pci_bus_type = {
	.name		= "pci",
	.match		= pci_bus_match,
	.uevent		= pci_uevent,
	.probe		= pci_device_probe,
	.remove		= pci_device_remove,
	.shutdown	= pci_device_shutdown,
	.dev_groups	= pci_dev_groups,
	.bus_groups	= pci_bus_groups,
	.drv_groups	= pci_drv_groups,
	.pm		= PCI_PM_OPS_PTR,
	.num_vf		= pci_bus_num_vf,
	.force_dma	= true,
};
EXPORT_SYMBOL(pci_bus_type);

static int __init pci_driver_init(void)
{
	return bus_register(&pci_bus_type);
}
postcore_initcall(pci_driver_init);

Qemu中的pci-assign设备

  • 快速查一下 Qemu 源码中引用 pci-assign 的位置,我们要看的源码应该就在 hw/i386/kvm/pci-assign.c 这个位置(至于为什么?自己猜):
$ grep -rHn pci-assign
hw/i386/kvm/pci-assign.c:868:    error_report("pci-assign: error: requires KVM with in-kernel irqchip "
hw/i386/kvm/pci-assign.c:1058: * because this is what worked for pci-assign for a long time.  This makes
hw/i386/kvm/pci-assign.c:1677:    .name = "pci-assign",
hw/i386/kvm/pci-assign.c:1745:        error_report("pci-assign: error: requires KVM support");
hw/i386/kvm/pci-assign.c:1754:        error_report("pci-assign: Maximum supported assigned devices (%d) "
hw/i386/kvm/pci-assign.c:1761:        error_report("pci-assign: error: no host device specified");
hw/i386/kvm/pci-assign.c:1787:        error_report("pci-assign: Error: Couldn't get real device (%s)!",
hw/i386/kvm/pci-assign.c:1879:    .name               = "kvm-pci-assign",
hw/i386/kvm/pci-assign.c:1920:        error_report("pci-assign: Insufficient privileges for %s", rom_file);
hw/i386/kvm/pci-assign.c:1943:        error_report("pci-assign: Cannot read from host %s", rom_file);
hw/i386/kvm/Makefile.objs:1:obj-y += clock.o apic.o i8259.o ioapic.o i8254.o pci-assign.o
qdev-monitor.c:50:    { "kvm-pci-assign", "pci-assign" },
docs/qdev-device-use.txt:379:    -device pci-assign,host=ADDR,iommu=IOMMU,id=ID

注册pci-assign设备

  • 查看 pci-assign 设备的注册,从其类型定义中可以看出,使用 assigned_realize 函数进行设备的初始化(hw/i386/kvm/pci-assign.c):
static void assign_class_init(ObjectClass *klass, void *data)
{
    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
    DeviceClass *dc = DEVICE_CLASS(klass);

    k->realize      = assigned_realize;
    k->exit         = assigned_exitfn;
    k->config_read  = assigned_dev_pci_read_config;
    k->config_write = assigned_dev_pci_write_config;
    dc->props       = assigned_dev_properties;
    dc->vmsd        = &vmstate_assigned_device;
    dc->reset       = reset_assigned_device;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
    dc->desc        = "KVM-based PCI passthrough";
}

static const TypeInfo assign_info = {
    .name               = TYPE_PCI_ASSIGN,
    .parent             = TYPE_PCI_DEVICE,
    .instance_size      = sizeof(AssignedDevice),
    .class_init         = assign_class_init,
    .instance_init      = assigned_dev_instance_init,
};

static void assign_register_types(void)
{
    type_register_static(&assign_info);
}

实例化pci-assign设备

  • 实例化设备函数 assigned_realize 如下(hw/i386/kvm/pci-assign.c):
static void assigned_realize(struct PCIDevice *pci_dev, Error **errp)
{
    AssignedDevice *dev = PCI_ASSIGN(pci_dev);
    uint8_t e_intx;
    int r;
    Error *local_err = NULL;

// 如果不支持KVM,直接报错:
    if (!kvm_enabled()) {
        error_setg(&local_err, "pci-assign requires KVM support");
        goto exit_with_error;
    }

    if (!dev->host.domain && !dev->host.bus && !dev->host.slot &&
        !dev->host.function) {
        error_setg(&local_err, "no host device specified");
        goto exit_with_error;
    }

// 初始化模拟设备的配置数据,Qemu在Host内存中申请了一块区域用于模拟PCI的配置空间数据,同时保持与真实设备的同步更新。
    /*
     * Set up basic config space access control. Will be further refined during
     * device initialization.
     */
    assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE);
    assigned_dev_direct_config_read(dev, PCI_STATUS, 2);
    assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1);
    assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3);
    assigned_dev_direct_config_read(dev, PCI_CACHE_LINE_SIZE, 1);
    assigned_dev_direct_config_read(dev, PCI_LATENCY_TIMER, 1);
    assigned_dev_direct_config_read(dev, PCI_BIST, 1);
    assigned_dev_direct_config_read(dev, PCI_CARDBUS_CIS, 4);
    assigned_dev_direct_config_read(dev, PCI_SUBSYSTEM_VENDOR_ID, 2);
    assigned_dev_direct_config_read(dev, PCI_SUBSYSTEM_ID, 2);
    assigned_dev_direct_config_read(dev, PCI_CAPABILITY_LIST + 1, 7);
    assigned_dev_direct_config_read(dev, PCI_MIN_GNT, 1);
    assigned_dev_direct_config_read(dev, PCI_MAX_LAT, 1);
    memcpy(dev->emulate_config_write, dev->emulate_config_read,
           sizeof(dev->emulate_config_read));

// 从PCI设备的SysFS属性文件中读PCI设备的配置数据,并保存起来:
    get_real_device(dev, &local_err);
    if (local_err) {
        goto out;
    }

// 从配置数据中解析设备属性,并根据需要进行修改:
    if (assigned_device_pci_cap_init(pci_dev, &local_err) < 0) {
        goto out;
    }

// 如果支持MSIX,对MSIX内存进行MMIO内存映射:
    /* intercept MSI-X entry page in the MMIO */
    if (dev->cap.available & ASSIGNED_DEVICE_CAP_MSIX) {
        assigned_dev_register_msix_mmio(dev, &local_err);
        if (local_err) {
            goto out;
        }
    }

// 为PCI设备的MMIO/PIO BARs注册对应的虚拟机内存空间:
    /* handle real device's MMIO/PIO BARs */
    assigned_dev_register_regions(dev->real_device.regions,
                                  dev->real_device.region_number, dev,
                                  &local_err);
    if (local_err) {
        goto out;
    }

    /* handle interrupt routing */
    e_intx = dev->dev.config[PCI_INTERRUPT_PIN] - 1;
    dev->intpin = e_intx;
    dev->intx_route.mode = PCI_INTX_DISABLED;
    dev->intx_route.irq = -1;

// 通过ioctl调用KVM内核模块,进行IOMMU等的配置和初始化:
    /* assign device to guest */
    assign_device(dev, &local_err);
    if (local_err) {
        goto out;
    }

// 新型传统的INTX型中断模拟初始化:
    /* assign legacy INTx to the device */
    r = assign_intx(dev, &local_err);
    if (r < 0) {
        goto assigned_out;
    }

    assigned_dev_load_option_rom(dev);

    return;

assigned_out:
    deassign_device(dev);

out:
    free_assigned_device(dev);

exit_with_error:
    assert(local_err);
    error_propagate(errp, local_err);
}

初始化模拟配置数据

  • PCI-Assign 设备的私有数据结构,集成自基类 PCIDevice ,可以看到模拟设备的配置空间数据都保存在这里,对其进行读写只不过是修改那个数组里的内存而已,然后会在设备 Reset 时(这个就不分析了)写入真实的物理设备(hw/i386/kvm/pci-assign.c):
typedef struct AssignedDevice {
    PCIDevice dev;
    PCIHostDeviceAddress host;
    uint32_t dev_id;
    uint32_t features;
    int intpin;
    AssignedDevRegion v_addrs[PCI_NUM_REGIONS - 1];
    PCIDevRegions real_device;
    PCIINTxRoute intx_route;
    AssignedIRQType assigned_irq_type;
    struct {
#define ASSIGNED_DEVICE_CAP_MSI (1 << 0)
#define ASSIGNED_DEVICE_CAP_MSIX (1 << 1)
        uint32_t available;
#define ASSIGNED_DEVICE_MSI_ENABLED (1 << 0)
#define ASSIGNED_DEVICE_MSIX_ENABLED (1 << 1)
#define ASSIGNED_DEVICE_MSIX_MASKED (1 << 2)
        uint32_t state;
    } cap;
    uint8_t emulate_config_read[PCI_CONFIG_SPACE_SIZE];
    uint8_t emulate_config_write[PCI_CONFIG_SPACE_SIZE];
    int msi_virq_nr;
    int *msi_virq;
    MSIXTableEntry *msix_table;
    hwaddr msix_table_addr;
    uint16_t msix_table_size;
    uint16_t msix_max;
    MemoryRegion mmio;
    char *configfd_name;
    int32_t bootindex;
} AssignedDevice;

static void assigned_dev_emulate_config_read(AssignedDevice *dev,
                                             uint32_t offset, uint32_t len)
{
    memset(dev->emulate_config_read + offset, 0xff, len);
}

static void assigned_dev_direct_config_read(AssignedDevice *dev,
                                            uint32_t offset, uint32_t len)
{
    memset(dev->emulate_config_read + offset, 0, len);
}

static void assigned_dev_direct_config_write(AssignedDevice *dev,
                                             uint32_t offset, uint32_t len)
{
    memset(dev->emulate_config_write + offset, 0, len);
}

读取设备真实配置数据

  • get_real_device 在读取 PCI 物理设备的配置数据时,实际上读取的是 Device SysFS 属性文件 config ,并保存在私有数据结构的 config 变量中,然后再通过读取 resource 和遍历 resource%d ,获取这个 PCI 设备的全部资源,每个 resource 其实都是一个 PCI BAR 资源,同时打开这些资源的 SysFS 属性文件,保存其 fd
static void get_real_device(AssignedDevice *pci_dev, Error **errp)
{
    char dir[128], name[128];
    int fd, r = 0;
    FILE *f;
    uint64_t start, end, size, flags;
    uint16_t id;
    PCIRegion *rp;
    PCIDevRegions *dev = &pci_dev->real_device;
    Error *local_err = NULL;

    dev->region_number = 0;

    snprintf(dir, sizeof(dir), "/sys/bus/pci/devices/%04x:%02x:%02x.%x/",
             pci_dev->host.domain, pci_dev->host.bus,
             pci_dev->host.slot, pci_dev->host.function);

    snprintf(name, sizeof(name), "%sconfig", dir);

    if (pci_dev->configfd_name && *pci_dev->configfd_name) {
        dev->config_fd = monitor_fd_param(cur_mon, pci_dev->configfd_name,
                                          &local_err);
        if (local_err) {
            error_propagate(errp, local_err);
            return;
        }
    } else {
        dev->config_fd = open(name, O_RDWR);

        if (dev->config_fd == -1) {
            error_setg_file_open(errp, errno, name);
            return;
        }
    }
again:
    r = read(dev->config_fd, pci_dev->dev.config,
             pci_config_size(&pci_dev->dev));
    if (r < 0) {
        if (errno == EINTR || errno == EAGAIN) {
            goto again;
        }
        error_setg_errno(errp, errno, "read(\"%s\")",
                         (pci_dev->configfd_name && *pci_dev->configfd_name) ?
                         pci_dev->configfd_name : name);
        return;
    }

    /* Restore or clear multifunction, this is always controlled by qemu */
    if (pci_dev->dev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
        pci_dev->dev.config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
    } else {
        pci_dev->dev.config[PCI_HEADER_TYPE] &= ~PCI_HEADER_TYPE_MULTI_FUNCTION;
    }

    /* Clear host resource mapping info.  If we choose not to register a
     * BAR, such as might be the case with the option ROM, we can get
     * confusing, unwritable, residual addresses from the host here. */
    memset(&pci_dev->dev.config[PCI_BASE_ADDRESS_0], 0, 24);
    memset(&pci_dev->dev.config[PCI_ROM_ADDRESS], 0, 4);

    snprintf(name, sizeof(name), "%sresource", dir);

    f = fopen(name, "r");
    if (f == NULL) {
        error_setg_file_open(errp, errno, name);
        return;
    }

    for (r = 0; r < PCI_ROM_SLOT; r++) {
        if (fscanf(f, "%" SCNi64 " %" SCNi64 " %" SCNi64 "\n",
                   &start, &end, &flags) != 3) {
            break;
        }

        rp = dev->regions + r;
        rp->valid = 0;
        rp->resource_fd = -1;
        size = end - start + 1;
        flags &= IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH
                 | IORESOURCE_MEM_64;
        if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0) {
            continue;
        }
        if (flags & IORESOURCE_MEM) {
            flags &= ~IORESOURCE_IO;
        } else {
            flags &= ~IORESOURCE_PREFETCH;
        }
        snprintf(name, sizeof(name), "%sresource%d", dir, r);
        fd = open(name, O_RDWR);
        if (fd == -1) {
            continue;
        }
        rp->resource_fd = fd;

        rp->type = flags;
        rp->valid = 1;
        rp->base_addr = start;
        rp->size = size;
        pci_dev->v_addrs[r].region = rp;
        DEBUG("region %d size %" PRIu64 " start 0x%" PRIx64
              " type %d resource_fd %d\n",
              r, rp->size, start, rp->type, rp->resource_fd);
    }

    fclose(f);

    /* read and fill vendor ID */
    get_real_vendor_id(dir, &id, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        return;
    }
    pci_dev->dev.config[0] = id & 0xff;
    pci_dev->dev.config[1] = (id & 0xff00) >> 8;

    /* read and fill device ID */
    get_real_device_id(dir, &id, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        return;
    }
    pci_dev->dev.config[2] = id & 0xff;
    pci_dev->dev.config[3] = (id & 0xff00) >> 8;

    pci_word_test_and_clear_mask(pci_dev->emulate_config_write + PCI_COMMAND,
                                 PCI_COMMAND_MASTER | PCI_COMMAND_INTX_DISABLE);

    dev->region_number = r;
}

解析和设置设备配置

  • 从刚才读取的二进制 PCI 配置数据中分析设备的属性,并进行一些初始设定:
static int assigned_device_pci_cap_init(PCIDevice *pci_dev, Error **errp)
{
    AssignedDevice *dev = PCI_ASSIGN(pci_dev);
    PCIRegion *pci_region = dev->real_device.regions;
    int ret, pos;

    /* Clear initial capabilities pointer and status copied from hw */
    pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0);
    pci_set_word(pci_dev->config + PCI_STATUS,
                 pci_get_word(pci_dev->config + PCI_STATUS) &
                 ~PCI_STATUS_CAP_LIST);

    /* Expose MSI capability
     * MSI capability is the 1st capability in capability config */
    pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI, 0);
    if (pos != 0 && kvm_check_extension(kvm_state, KVM_CAP_ASSIGN_DEV_IRQ)) {
        if (verify_irqchip_in_kernel(errp) < 0) {
            return -ENOTSUP;
        }
        dev->dev.cap_present |= QEMU_PCI_CAP_MSI;
        dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
        /* Only 32-bit/no-mask currently supported */
        ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, 10,
                                  errp);
        if (ret < 0) {
            return ret;
        }
        pci_dev->msi_cap = pos;

        pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS,
                     pci_get_word(pci_dev->config + pos + PCI_MSI_FLAGS) &
                     PCI_MSI_FLAGS_QMASK);
        pci_set_long(pci_dev->config + pos + PCI_MSI_ADDRESS_LO, 0);
        pci_set_word(pci_dev->config + pos + PCI_MSI_DATA_32, 0);

        /* Set writable fields */
        pci_set_word(pci_dev->wmask + pos + PCI_MSI_FLAGS,
                     PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE);
        pci_set_long(pci_dev->wmask + pos + PCI_MSI_ADDRESS_LO, 0xfffffffc);
        pci_set_word(pci_dev->wmask + pos + PCI_MSI_DATA_32, 0xffff);
    }
    /* Expose MSI-X capability */
    pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX, 0);
    if (pos != 0 && kvm_device_msix_supported(kvm_state)) {
        int bar_nr;
        uint32_t msix_table_entry;
        uint16_t msix_max;

        if (verify_irqchip_in_kernel(errp) < 0) {
            return -ENOTSUP;
        }
        dev->dev.cap_present |= QEMU_PCI_CAP_MSIX;
        dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
        ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, 12,
                                  errp);
        if (ret < 0) {
            return ret;
        }
        pci_dev->msix_cap = pos;

        msix_max = (pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) &
                    PCI_MSIX_FLAGS_QSIZE) + 1;
        msix_max = MIN(msix_max, KVM_MAX_MSIX_PER_DEV);
        pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS, msix_max - 1);

        /* Only enable and function mask bits are writable */
        pci_set_word(pci_dev->wmask + pos + PCI_MSIX_FLAGS,
                     PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);

        msix_table_entry = pci_get_long(pci_dev->config + pos + PCI_MSIX_TABLE);
        bar_nr = msix_table_entry & PCI_MSIX_FLAGS_BIRMASK;
        msix_table_entry &= ~PCI_MSIX_FLAGS_BIRMASK;
        dev->msix_table_addr = pci_region[bar_nr].base_addr + msix_table_entry;
        dev->msix_table_size = msix_max * sizeof(MSIXTableEntry);
        dev->msix_max = msix_max;
    }

    /* Minimal PM support, nothing writable, device appears to NAK changes */
    pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_PM, 0);
    if (pos) {
        uint16_t pmc;

        ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, pos, PCI_PM_SIZEOF,
                                  errp);
        if (ret < 0) {
            return ret;
        }

        assigned_dev_setup_cap_read(dev, pos, PCI_PM_SIZEOF);

        pmc = pci_get_word(pci_dev->config + pos + PCI_CAP_FLAGS);
        pmc &= (PCI_PM_CAP_VER_MASK | PCI_PM_CAP_DSI);
        pci_set_word(pci_dev->config + pos + PCI_CAP_FLAGS, pmc);

        /* assign_device will bring the device up to D0, so we don't need
         * to worry about doing that ourselves here. */
        pci_set_word(pci_dev->config + pos + PCI_PM_CTRL,
                     PCI_PM_CTRL_NO_SOFT_RESET);

        pci_set_byte(pci_dev->config + pos + PCI_PM_PPB_EXTENSIONS, 0);
        pci_set_byte(pci_dev->config + pos + PCI_PM_DATA_REGISTER, 0);
    }

    pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0);
    if (pos) {
        uint8_t version, size = 0;
        uint16_t type, devctl, lnksta;
        uint32_t devcap, lnkcap;

        version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
        version &= PCI_EXP_FLAGS_VERS;
        if (version == 1) {
            size = 0x14;
        } else if (version == 2) {
            /*
             * Check for non-std size, accept reduced size to 0x34,
             * which is what bcm5761 implemented, violating the
             * PCIe v3.0 spec that regs should exist and be read as 0,
             * not optionally provided and shorten the struct size.
             */
            size = MIN(0x3c, PCI_CONFIG_SPACE_SIZE - pos);
            if (size < 0x34) {
                error_setg(errp, "Invalid size PCIe cap-id 0x%x",
                           PCI_CAP_ID_EXP);
                return -EINVAL;
            } else if (size != 0x3c) {
                warn_report("%s: PCIe cap-id 0x%x has "
                            "non-standard size 0x%x; std size should be 0x3c",
                            __func__, PCI_CAP_ID_EXP, size);
            }
        } else if (version == 0) {
            uint16_t vid, did;
            vid = pci_get_word(pci_dev->config + PCI_VENDOR_ID);
            did = pci_get_word(pci_dev->config + PCI_DEVICE_ID);
            if (vid == PCI_VENDOR_ID_INTEL && did == 0x10ed) {
                /*
                 * quirk for Intel 82599 VF with invalid PCIe capability
                 * version, should really be version 2 (same as PF)
                 */
                size = 0x3c;
            }
        }

        if (size == 0) {
            error_setg(errp, "Unsupported PCI express capability version %d",
                       version);
            return -EINVAL;
        }

        ret = pci_add_capability(pci_dev, PCI_CAP_ID_EXP, pos, size,
                                  errp);
        if (ret < 0) {
            return ret;
        }

        assigned_dev_setup_cap_read(dev, pos, size);

        type = pci_get_word(pci_dev->config + pos + PCI_EXP_FLAGS);
        type = (type & PCI_EXP_FLAGS_TYPE) >> 4;
        if (type != PCI_EXP_TYPE_ENDPOINT &&
            type != PCI_EXP_TYPE_LEG_END && type != PCI_EXP_TYPE_RC_END) {
            error_setg(errp, "Device assignment only supports endpoint "
                       "assignment, device type %d", type);
            return -EINVAL;
        }

        /* capabilities, pass existing read-only copy
         * PCI_EXP_FLAGS_IRQ: updated by hardware, should be direct read */

        /* device capabilities: hide FLR */
        devcap = pci_get_long(pci_dev->config + pos + PCI_EXP_DEVCAP);
        devcap &= ~PCI_EXP_DEVCAP_FLR;
        pci_set_long(pci_dev->config + pos + PCI_EXP_DEVCAP, devcap);

        /* device control: clear all error reporting enable bits, leaving
         *                 only a few host values.  Note, these are
         *                 all writable, but not passed to hw.
         */
        devctl = pci_get_word(pci_dev->config + pos + PCI_EXP_DEVCTL);
        devctl = (devctl & (PCI_EXP_DEVCTL_READRQ | PCI_EXP_DEVCTL_PAYLOAD)) |
                  PCI_EXP_DEVCTL_RELAX_EN | PCI_EXP_DEVCTL_NOSNOOP_EN;
        pci_set_word(pci_dev->config + pos + PCI_EXP_DEVCTL, devctl);
        devctl = PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_AUX_PME;
        pci_set_word(pci_dev->wmask + pos + PCI_EXP_DEVCTL, ~devctl);

        /* Clear device status */
        pci_set_word(pci_dev->config + pos + PCI_EXP_DEVSTA, 0);

        /* Link capabilities, expose links and latencues, clear reporting */
        lnkcap = pci_get_long(pci_dev->config + pos + PCI_EXP_LNKCAP);
        lnkcap &= (PCI_EXP_LNKCAP_SLS | PCI_EXP_LNKCAP_MLW |
                   PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL |
                   PCI_EXP_LNKCAP_L1EL);
        pci_set_long(pci_dev->config + pos + PCI_EXP_LNKCAP, lnkcap);

        /* Link control, pass existing read-only copy.  Should be writable? */

        /* Link status, only expose current speed and width */
        lnksta = pci_get_word(pci_dev->config + pos + PCI_EXP_LNKSTA);
        lnksta &= (PCI_EXP_LNKSTA_CLS | PCI_EXP_LNKSTA_NLW);
        pci_set_word(pci_dev->config + pos + PCI_EXP_LNKSTA, lnksta);

        if (version >= 2) {
            /* Slot capabilities, control, status - not needed for endpoints */
            pci_set_long(pci_dev->config + pos + PCI_EXP_SLTCAP, 0);
            pci_set_word(pci_dev->config + pos + PCI_EXP_SLTCTL, 0);
            pci_set_word(pci_dev->config + pos + PCI_EXP_SLTSTA, 0);

            /* Root control, capabilities, status - not needed for endpoints */
            pci_set_word(pci_dev->config + pos + PCI_EXP_RTCTL, 0);
            pci_set_word(pci_dev->config + pos + PCI_EXP_RTCAP, 0);
            pci_set_long(pci_dev->config + pos + PCI_EXP_RTSTA, 0);

            /* Device capabilities/control 2, pass existing read-only copy */
            /* Link control 2, pass existing read-only copy */
        }
    }

    pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_PCIX, 0);
    if (pos) {
        uint16_t cmd;
        uint32_t status;

        /* Only expose the minimum, 8 byte capability */
        ret = pci_add_capability(pci_dev, PCI_CAP_ID_PCIX, pos, 8,
                                  errp);
        if (ret < 0) {
            return ret;
        }

        assigned_dev_setup_cap_read(dev, pos, 8);

        /* Command register, clear upper bits, including extended modes */
        cmd = pci_get_word(pci_dev->config + pos + PCI_X_CMD);
        cmd &= (PCI_X_CMD_DPERR_E | PCI_X_CMD_ERO | PCI_X_CMD_MAX_READ |
                PCI_X_CMD_MAX_SPLIT);
        pci_set_word(pci_dev->config + pos + PCI_X_CMD, cmd);

        /* Status register, update with emulated PCI bus location, clear
         * error bits, leave the rest. */
        status = pci_get_long(pci_dev->config + pos + PCI_X_STATUS);
        status &= ~(PCI_X_STATUS_BUS | PCI_X_STATUS_DEVFN);
        status |= pci_get_bdf(pci_dev);
        status &= ~(PCI_X_STATUS_SPL_DISC | PCI_X_STATUS_UNX_SPL |
                    PCI_X_STATUS_SPL_ERR);
        pci_set_long(pci_dev->config + pos + PCI_X_STATUS, status);
    }

    pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_VPD, 0);
    if (pos) {
        /* Direct R/W passthrough */
        ret = pci_add_capability(pci_dev, PCI_CAP_ID_VPD, pos, 8,
                                  errp);
        if (ret < 0) {
            return ret;
        }

        assigned_dev_setup_cap_read(dev, pos, 8);

        /* direct write for cap content */
        assigned_dev_direct_config_write(dev, pos + 2, 6);
    }

    /* Devices can have multiple vendor capabilities, get them all */
    for (pos = 0; (pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_VNDR, pos));
        pos += PCI_CAP_LIST_NEXT) {
        uint8_t len = pci_get_byte(pci_dev->config + pos + PCI_CAP_FLAGS);
        /* Direct R/W passthrough */
        ret = pci_add_capability(pci_dev, PCI_CAP_ID_VNDR, pos, len,
                                  errp);
        if (ret < 0) {
            return ret;
        }

        assigned_dev_setup_cap_read(dev, pos, len);

        /* direct write for cap content */
        assigned_dev_direct_config_write(dev, pos + 2, len - 2);
    }

    /* If real and virtual capability list status bits differ, virtualize the
     * access. */
    if ((pci_get_word(pci_dev->config + PCI_STATUS) & PCI_STATUS_CAP_LIST) !=
        (assigned_dev_pci_read_byte(pci_dev, PCI_STATUS) &
         PCI_STATUS_CAP_LIST)) {
        dev->emulate_config_read[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
    }

    return 0;
}

注册PCI设备的MSIX资源

  • PCI 设备 MSIX 中断的模拟,其实是申请一块内存,对这块内存的读写操作进行捕捉,并使用 KVM ioctl 进行中断注入( KVM 目前已经提供了 irqfd 的中断注入方法):
static uint64_t
assigned_dev_msix_mmio_read(void *opaque, hwaddr addr,
                            unsigned size)
{
    AssignedDevice *adev = opaque;
    uint64_t val;

    memcpy(&val, (void *)((uint8_t *)adev->msix_table + addr), size);

    return val;
}

static void assigned_dev_msix_mmio_write(void *opaque, hwaddr addr,
                                         uint64_t val, unsigned size)
{
    AssignedDevice *adev = opaque;
    PCIDevice *pdev = &adev->dev;
    uint16_t ctrl;
    MSIXTableEntry orig;
    int i = addr >> 4;

    if (i >= adev->msix_max) {
        return; /* Drop write */
    }

    ctrl = pci_get_word(pdev->config + pdev->msix_cap + PCI_MSIX_FLAGS);

    DEBUG("write to MSI-X table offset 0x%lx, val 0x%lx\n", addr, val);

    if (ctrl & PCI_MSIX_FLAGS_ENABLE) {
        orig = adev->msix_table[i];
    }

    memcpy((uint8_t *)adev->msix_table + addr, &val, size);

    if (ctrl & PCI_MSIX_FLAGS_ENABLE) {
        MSIXTableEntry *entry = &adev->msix_table[i];

        if (!assigned_dev_msix_masked(&orig) &&
            assigned_dev_msix_masked(entry)) {
            /*
             * Vector masked, disable it
             *
             * XXX It's not clear if we can or should actually attempt
             * to mask or disable the interrupt.  KVM doesn't have
             * support for pending bits and kvm_assign_set_msix_entry
             * doesn't modify the device hardware mask.  Interrupts
             * while masked are simply not injected to the guest, so
             * are lost.  Can we get away with always injecting an
             * interrupt on unmask?
             */
        } else if (assigned_dev_msix_masked(&orig) &&
                   !assigned_dev_msix_masked(entry)) {
            /* Vector unmasked */
            if (i >= adev->msi_virq_nr || adev->msi_virq[i] < 0) {
                /* Previously unassigned vector, start from scratch */
                assigned_dev_update_msix(pdev);
                return;
            } else {
                /* Update an existing, previously masked vector */
                MSIMessage msg;
                int ret;

                msg.address = entry->addr_lo |
                    ((uint64_t)entry->addr_hi << 32);
                msg.data = entry->data;

                ret = kvm_irqchip_update_msi_route(kvm_state,
                                                   adev->msi_virq[i], msg,
                                                   pdev);
                if (ret) {
                    error_report("Error updating irq routing entry (%d)", ret);
                }
                kvm_irqchip_commit_routes(kvm_state);
            }
        }
    }
}

static const MemoryRegionOps assigned_dev_msix_mmio_ops = {
    .read = assigned_dev_msix_mmio_read,
    .write = assigned_dev_msix_mmio_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 8,
    },
    .impl = {
        .min_access_size = 4,
        .max_access_size = 8,
    },
};

static void assigned_dev_register_msix_mmio(AssignedDevice *dev, Error **errp)
{
    dev->msix_table = mmap(NULL, dev->msix_table_size, PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
    if (dev->msix_table == MAP_FAILED) {
        error_setg_errno(errp, errno, "failed to allocate msix_table");
        dev->msix_table = NULL;
        return;
    }
    dev->dev.msix_table = (uint8_t *)dev->msix_table;

    assigned_dev_msix_reset(dev);

    memory_region_init_io(&dev->mmio, OBJECT(dev), &assigned_dev_msix_mmio_ops,
                          dev, "assigned-dev-msix", dev->msix_table_size);
}
  • kvm_irqchip_commit_routes 函数通过 ioctl 通知 KVM 进行中断注入(这是一种目前不被推荐的方法,最新的 Qemu KVM 推荐使用 irqfd 来实现,accel/kvm/kvm-all.c):
void kvm_irqchip_commit_routes(KVMState *s)
{
    int ret;

    if (kvm_gsi_direct_mapping()) {
        return;
    }

    if (!kvm_gsi_routing_enabled()) {
        return;
    }

    s->irq_routes->flags = 0;
    trace_kvm_irqchip_commit_routes();
    ret = kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, s->irq_routes);
    assert(ret == 0);
}

注册PCI设备的内存资源

  • 注册 PCI 设备的内存资源,这里直接映射 PCI 设备的 resource%d 文件,然后添加到 KVM 的内存域中,期间不需要 Copy ,这应该是 PCI 透传之所以性能高的地方吧:
static void assigned_dev_register_regions(PCIRegion *io_regions,
                                          unsigned long regions_num,
                                          AssignedDevice *pci_dev,
                                          Error **errp)
{
    uint32_t i;
    PCIRegion *cur_region = io_regions;

    for (i = 0; i < regions_num; i++, cur_region++) {
        if (!cur_region->valid) {
            continue;
        }

        /* handle memory io regions */
        if (cur_region->type & IORESOURCE_MEM) {
            int t = PCI_BASE_ADDRESS_SPACE_MEMORY;
            if (cur_region->type & IORESOURCE_PREFETCH) {
                t |= PCI_BASE_ADDRESS_MEM_PREFETCH;
            }
            if (cur_region->type & IORESOURCE_MEM_64) {
                t |= PCI_BASE_ADDRESS_MEM_TYPE_64;
            }

            /* map physical memory */
            pci_dev->v_addrs[i].u.r_virtbase = mmap(NULL, cur_region->size,
                                                    PROT_WRITE | PROT_READ,
                                                    MAP_SHARED,
                                                    cur_region->resource_fd,
                                                    (off_t)0);

            if (pci_dev->v_addrs[i].u.r_virtbase == MAP_FAILED) {
                pci_dev->v_addrs[i].u.r_virtbase = NULL;
                error_setg_errno(errp, errno, "Couldn't mmap 0x%" PRIx64 "!",
                                 cur_region->base_addr);
                return;
            }

            pci_dev->v_addrs[i].r_size = cur_region->size;
            pci_dev->v_addrs[i].e_size = 0;

            /* add offset */
            pci_dev->v_addrs[i].u.r_virtbase +=
                (cur_region->base_addr & 0xFFF);

            if (cur_region->size & 0xFFF) {
                error_report("PCI region %d at address 0x%" PRIx64 " has "
                             "size 0x%" PRIx64 ", which is not a multiple of "
                             "4K.  You might experience some performance hit "
                             "due to that.",
                             i, cur_region->base_addr, cur_region->size);
                memory_region_init_io(&pci_dev->v_addrs[i].real_iomem,
                                      OBJECT(pci_dev), &slow_bar_ops,
                                      &pci_dev->v_addrs[i],
                                      "assigned-dev-slow-bar",
                                      cur_region->size);
            } else {
                void *virtbase = pci_dev->v_addrs[i].u.r_virtbase;
                char name[32];
                snprintf(name, sizeof(name), "%s.bar%d",
                         object_get_typename(OBJECT(pci_dev)), i);
                memory_region_init_ram_ptr(&pci_dev->v_addrs[i].real_iomem,
                                           OBJECT(pci_dev), name,
                                           cur_region->size, virtbase);
                vmstate_register_ram(&pci_dev->v_addrs[i].real_iomem,
                                     &pci_dev->dev.qdev);
            }

            assigned_dev_iomem_setup(&pci_dev->dev, i, cur_region->size);
            pci_register_bar((PCIDevice *) pci_dev, i, t,
                             &pci_dev->v_addrs[i].container);
            continue;
        } else {
            /* handle port io regions */
            uint32_t val;
            int ret;

            /* Test kernel support for ioport resource read/write.  Old
             * kernels return EIO.  New kernels only allow 1/2/4 byte reads
             * so should return EINVAL for a 3 byte read */
            ret = pread(pci_dev->v_addrs[i].region->resource_fd, &val, 3, 0);
            if (ret >= 0) {
                error_report("Unexpected return from I/O port read: %d", ret);
                abort();
            } else if (errno != EINVAL) {
                error_report("Kernel doesn't support ioport resource "
                             "access, hiding this region.");
                close(pci_dev->v_addrs[i].region->resource_fd);
                cur_region->valid = 0;
                continue;
            }

            pci_dev->v_addrs[i].u.r_baseport = cur_region->base_addr;
            pci_dev->v_addrs[i].r_size = cur_region->size;
            pci_dev->v_addrs[i].e_size = 0;

            assigned_dev_ioport_setup(&pci_dev->dev, i, cur_region->size);
            pci_register_bar((PCIDevice *) pci_dev, i,
                             PCI_BASE_ADDRESS_SPACE_IO,
                             &pci_dev->v_addrs[i].container);
        }
    }

    /* success */
}

通过KVM分配PCI设备

  • 终于调用 KVM PCI 设备分配接口了,前面只是做一些简单检查,必须支持 IOMMU ,后面就进入 kvm_device_pci_assign 函数了:
static void assign_device(AssignedDevice *dev, Error **errp)
{
    uint32_t flags = KVM_DEV_ASSIGN_ENABLE_IOMMU;
    int r;

    /* Only pass non-zero PCI segment to capable module */
    if (!kvm_check_extension(kvm_state, KVM_CAP_PCI_SEGMENT) &&
        dev->host.domain) {
        error_setg(errp, "Can't assign device inside non-zero PCI segment "
                   "as this KVM module doesn't support it.");
        return;
    }

    if (!kvm_check_extension(kvm_state, KVM_CAP_IOMMU)) {
        error_setg(errp, "No IOMMU found.  Unable to assign device \"%s\"",
                   dev->dev.qdev.id);
        return;
    }

    if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK &&
        kvm_has_intx_set_mask()) {
        flags |= KVM_DEV_ASSIGN_PCI_2_3;
    }

    r = kvm_device_pci_assign(kvm_state, &dev->host, flags, &dev->dev_id);
    if (r < 0) {
        switch (r) {
        case -EBUSY: {
            char *cause;

            cause = assign_failed_examine(dev);
            error_setg_errno(errp, -r, "Failed to assign device \"%s\"",
                             dev->dev.qdev.id);
            error_append_hint(errp, "%s", cause);
            g_free(cause);
            break;
        }
        default:
            error_setg_errno(errp, -r, "Failed to assign device \"%s\"",
                             dev->dev.qdev.id);
            break;
        }
    }
}
  • 最后通过 KVM 提供的 KVM_ASSIGN_PCI_DEVICE 功能进行 PCI 设备分配(target/i386/kvm.c):
/* Classic KVM device assignment interface. Will remain x86 only. */
int kvm_device_pci_assign(KVMState *s, PCIHostDeviceAddress *dev_addr,
                          uint32_t flags, uint32_t *dev_id)
{
    struct kvm_assigned_pci_dev dev_data = {
        .segnr = dev_addr->domain,
        .busnr = dev_addr->bus,
        .devfn = PCI_DEVFN(dev_addr->slot, dev_addr->function),
        .flags = flags,
    };
    int ret;

    dev_data.assigned_dev_id =
        (dev_addr->domain << 16) | (dev_addr->bus << 8) | dev_data.devfn;

    ret = kvm_vm_ioctl(s, KVM_ASSIGN_PCI_DEVICE, &dev_data);
    if (ret < 0) {
        return ret;
    }

    *dev_id = dev_data.assigned_dev_id;

    return 0;
}

分配PCI传统中断INTX

  • 最后在对传统的 INTX 中断模拟部分进行初始化,最终还是使用 KVM 实现的这部分捕捉与注入功能:
static int assign_intx(AssignedDevice *dev, Error **errp)
{
    AssignedIRQType new_type;
    PCIINTxRoute intx_route;
    bool intx_host_msi;
    int r;

    /* Interrupt PIN 0 means don't use INTx */
    if (assigned_dev_pci_read_byte(&dev->dev, PCI_INTERRUPT_PIN) == 0) {
        pci_device_set_intx_routing_notifier(&dev->dev, NULL);
        return 0;
    }

    if (verify_irqchip_in_kernel(errp) < 0) {
        return -ENOTSUP;
    }

    pci_device_set_intx_routing_notifier(&dev->dev,
                                         assigned_dev_update_irq_routing);

    intx_route = pci_device_route_intx_to_irq(&dev->dev, dev->intpin);
    assert(intx_route.mode != PCI_INTX_INVERTED);

    if (!pci_intx_route_changed(&dev->intx_route, &intx_route)) {
        return 0;
    }

    switch (dev->assigned_irq_type) {
    case ASSIGNED_IRQ_INTX_HOST_INTX:
    case ASSIGNED_IRQ_INTX_HOST_MSI:
        intx_host_msi = dev->assigned_irq_type == ASSIGNED_IRQ_INTX_HOST_MSI;
        r = kvm_device_intx_deassign(kvm_state, dev->dev_id, intx_host_msi);
        break;
    case ASSIGNED_IRQ_MSI:
        r = kvm_device_msi_deassign(kvm_state, dev->dev_id);
        break;
    case ASSIGNED_IRQ_MSIX:
        r = kvm_device_msix_deassign(kvm_state, dev->dev_id);
        break;
    default:
        r = 0;
        break;
    }
    if (r) {
        perror("assign_intx: deassignment of previous interrupt failed");
    }
    dev->assigned_irq_type = ASSIGNED_IRQ_NONE;

    if (intx_route.mode == PCI_INTX_DISABLED) {
        dev->intx_route = intx_route;
        return 0;
    }

retry:
    if (dev->features & ASSIGNED_DEVICE_PREFER_MSI_MASK &&
        dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) {
        intx_host_msi = true;
        new_type = ASSIGNED_IRQ_INTX_HOST_MSI;
    } else {
        intx_host_msi = false;
        new_type = ASSIGNED_IRQ_INTX_HOST_INTX;
    }

    r = kvm_device_intx_assign(kvm_state, dev->dev_id, intx_host_msi,
                               intx_route.irq);
    if (r < 0) {
        if (r == -EIO && !(dev->features & ASSIGNED_DEVICE_PREFER_MSI_MASK) &&
            dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) {
            /* Retry with host-side MSI. There might be an IRQ conflict and
             * either the kernel or the device doesn't support sharing. */
            error_report("Host-side INTx sharing not supported, "
                         "using MSI instead");
            error_printf("Some devices do not work properly in this mode.\n");
            dev->features |= ASSIGNED_DEVICE_PREFER_MSI_MASK;
            goto retry;
        }
        error_setg_errno(errp, -r, "Failed to assign irq for \"%s\"",
                         dev->dev.qdev.id);
        error_append_hint(errp, "Perhaps you are assigning a device "
                          "that shares an IRQ with another device?\n");
        return r;
    }

    dev->intx_route = intx_route;
    dev->assigned_irq_type = new_type;
    return r;
}


/* The pci config space got updated. Check if irq numbers have changed
 * for our devices
 */
static void assigned_dev_update_irq_routing(PCIDevice *dev)
{
    AssignedDevice *assigned_dev = PCI_ASSIGN(dev);
    Error *err = NULL;
    int r;

    r = assign_intx(assigned_dev, &err);
    if (r < 0) {
        error_report_err(err);
        err = NULL;
        qdev_unplug(&dev->qdev, &err);
        assert(!err);
    }
}

int kvm_device_intx_assign(KVMState *s, uint32_t dev_id, bool use_host_msi,
                           uint32_t guest_irq)
{
    uint32_t irq_type = KVM_DEV_IRQ_GUEST_INTX |
        (use_host_msi ? KVM_DEV_IRQ_HOST_MSI : KVM_DEV_IRQ_HOST_INTX);

    return kvm_assign_irq_internal(s, dev_id, irq_type, guest_irq);
}

static int kvm_assign_irq_internal(KVMState *s, uint32_t dev_id,
                                   uint32_t irq_type, uint32_t guest_irq)
{
    struct kvm_assigned_irq assigned_irq = {
        .assigned_dev_id = dev_id,
        .guest_irq = guest_irq,
        .flags = irq_type,
    };

    if (kvm_check_extension(s, KVM_CAP_ASSIGN_DEV_IRQ)) {
        return kvm_vm_ioctl(s, KVM_ASSIGN_DEV_IRQ, &assigned_irq);
    } else {
        return kvm_vm_ioctl(s, KVM_ASSIGN_IRQ, &assigned_irq);
    }
}

内核中的KVM ASSIGN实现

  • KVM 模块的初始化与退出(arch/x86/kvm/svm.c):
static int __init svm_init(void)
{
	return kvm_init(&svm_x86_ops, sizeof(struct vcpu_svm),
			__alignof__(struct vcpu_svm), THIS_MODULE);
}

static void __exit svm_exit(void)
{
	kvm_exit();
}

module_init(svm_init)
module_exit(svm_exit)
  • KVM 字符设备文件的注册与实现(virt/kvm/kvm_main.c):
static long kvm_dev_ioctl(struct file *filp,
			  unsigned int ioctl, unsigned long arg)
{
	long r = -EINVAL;

	switch (ioctl) {
	case KVM_GET_API_VERSION:
		if (arg)
			goto out;
		r = KVM_API_VERSION;
		break;
	case KVM_CREATE_VM:
		r = kvm_dev_ioctl_create_vm(arg);
		break;
	case KVM_CHECK_EXTENSION:
		r = kvm_vm_ioctl_check_extension_generic(NULL, arg);
		break;
	case KVM_GET_VCPU_MMAP_SIZE:
		if (arg)
			goto out;
		r = PAGE_SIZE;     /* struct kvm_run */
#ifdef CONFIG_X86
		r += PAGE_SIZE;    /* pio data page */
#endif
#ifdef CONFIG_KVM_MMIO
		r += PAGE_SIZE;    /* coalesced mmio ring page */
#endif
		break;
	case KVM_TRACE_ENABLE:
	case KVM_TRACE_PAUSE:
	case KVM_TRACE_DISABLE:
		r = -EOPNOTSUPP;
		break;
	default:
		return kvm_arch_dev_ioctl(filp, ioctl, arg);
	}
out:
	return r;
}

static struct file_operations kvm_chardev_ops = {
	.unlocked_ioctl = kvm_dev_ioctl,
	.compat_ioctl   = kvm_dev_ioctl,
	.llseek		= noop_llseek,
};

static struct miscdevice kvm_dev = {
	KVM_MINOR,
	"kvm",
	&kvm_chardev_ops,
};


int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
		  struct module *module)
{
	int r;
	int cpu;

...
    
	r = misc_register(&kvm_dev);
	if (r) {
		pr_err("kvm: misc device register failed\n");
		goto out_unreg;
	}

...
}
EXPORT_SYMBOL_GPL(kvm_init);


KVM_ASSIGN_PCI_DEVICE
  • KVM 字符设备 ioctl 中调用进行设备分配的地方(arch/x86/kvm/x86.c):
long kvm_arch_vm_ioctl(struct file *filp,
		       unsigned int ioctl, unsigned long arg)
{

...

	switch (ioctl) {

...

	default:
		r = kvm_vm_ioctl_assigned_device(kvm, ioctl, arg);
	}
out:
	return r;
}
  • 获取 Qemu 或其他用户态程序传来的设备地址信息(arch/x86/kvm/assigned-dev.c):
long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
				  unsigned long arg)
{
	void __user *argp = (void __user *)arg;
	int r;

	switch (ioctl) {
	case KVM_ASSIGN_PCI_DEVICE: {
		struct kvm_assigned_pci_dev assigned_dev;

		r = -EFAULT;
		if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
			goto out;
		r = kvm_vm_ioctl_assign_device(kvm, &assigned_dev);
		if (r)
			goto out;
		break;
	}

...

	default:
		r = -ENOTTY;
		break;
	}
out:
	return r;
}
  • 根据设备地址找到设备的 Device 设备结构,为其分配内核态的数据结构,并注册对应的 KVM 设备 Slot ,最后调用 kvm_assign_device 函数:
static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
				      struct kvm_assigned_pci_dev *assigned_dev)
{
	int r = 0, idx;
	struct kvm_assigned_dev_kernel *match;
	struct pci_dev *dev;

	if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
		return -EINVAL;

	mutex_lock(&kvm->lock);
	idx = srcu_read_lock(&kvm->srcu);

	match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
				      assigned_dev->assigned_dev_id);
	if (match) {
		/* device already assigned */
		r = -EEXIST;
		goto out;
	}

	match = kzalloc(sizeof(struct kvm_assigned_dev_kernel), GFP_KERNEL);
	if (match == NULL) {
		printk(KERN_INFO "%s: Couldn't allocate memory\n",
		       __func__);
		r = -ENOMEM;
		goto out;
	}
	dev = pci_get_domain_bus_and_slot(assigned_dev->segnr,
				   assigned_dev->busnr,
				   assigned_dev->devfn);
	if (!dev) {
		printk(KERN_INFO "%s: host device not found\n", __func__);
		r = -EINVAL;
		goto out_free;
	}

	/* Don't allow bridges to be assigned */
	if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL) {
		r = -EPERM;
		goto out_put;
	}

	r = probe_sysfs_permissions(dev);
	if (r)
		goto out_put;

	if (pci_enable_device(dev)) {
		printk(KERN_INFO "%s: Could not enable PCI device\n", __func__);
		r = -EBUSY;
		goto out_put;
	}
	r = pci_request_regions(dev, "kvm_assigned_device");
	if (r) {
		printk(KERN_INFO "%s: Could not get access to device regions\n",
		       __func__);
		goto out_disable;
	}

	pci_reset_function(dev);
	pci_save_state(dev);
	match->pci_saved_state = pci_store_saved_state(dev);
	if (!match->pci_saved_state)
		printk(KERN_DEBUG "%s: Couldn't store %s saved state\n",
		       __func__, dev_name(&dev->dev));

	if (!pci_intx_mask_supported(dev))
		assigned_dev->flags &= ~KVM_DEV_ASSIGN_PCI_2_3;

	match->assigned_dev_id = assigned_dev->assigned_dev_id;
	match->host_segnr = assigned_dev->segnr;
	match->host_busnr = assigned_dev->busnr;
	match->host_devfn = assigned_dev->devfn;
	match->flags = assigned_dev->flags;
	match->dev = dev;
	spin_lock_init(&match->intx_lock);
	spin_lock_init(&match->intx_mask_lock);
	match->irq_source_id = -1;
	match->kvm = kvm;
	match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;

	list_add(&match->list, &kvm->arch.assigned_dev_head);

	if (!kvm->arch.iommu_domain) {
		r = kvm_iommu_map_guest(kvm);
		if (r)
			goto out_list_del;
	}
	r = kvm_assign_device(kvm, match->dev);
	if (r)
		goto out_list_del;

out:
	srcu_read_unlock(&kvm->srcu, idx);
	mutex_unlock(&kvm->lock);
	return r;
out_list_del:
	if (pci_load_and_free_saved_state(dev, &match->pci_saved_state))
		printk(KERN_INFO "%s: Couldn't reload %s saved state\n",
		       __func__, dev_name(&dev->dev));
	list_del(&match->list);
	pci_release_regions(dev);
out_disable:
	pci_disable_device(dev);
out_put:
	pci_dev_put(dev);
out_free:
	kfree(match);
	srcu_read_unlock(&kvm->srcu, idx);
	mutex_unlock(&kvm->lock);
	return r;
}
  • kvm_assign_device 其实只是配置设备的 IOMMU (arch/x86/kvm/iommu.c):
int kvm_assign_device(struct kvm *kvm, struct pci_dev *pdev)
{
	struct iommu_domain *domain = kvm->arch.iommu_domain;
	int r;
	bool noncoherent;

	/* check if iommu exists and in use */
	if (!domain)
		return 0;

	if (pdev == NULL)
		return -ENODEV;

	r = iommu_attach_device(domain, &pdev->dev);
	if (r) {
		dev_err(&pdev->dev, "kvm assign device failed ret %d", r);
		return r;
	}

	noncoherent = !iommu_capable(&pci_bus_type, IOMMU_CAP_CACHE_COHERENCY);

	/* Check if need to update IOMMU page table for guest memory */
	if (noncoherent != kvm->arch.iommu_noncoherent) {
		kvm_iommu_unmap_memslots(kvm);
		kvm->arch.iommu_noncoherent = noncoherent;
		r = kvm_iommu_map_memslots(kvm);
		if (r)
			goto out_unmap;
	}

	kvm_arch_start_assignment(kvm);
	pci_set_dev_assigned(pdev);

	dev_info(&pdev->dev, "kvm assign device\n");

	return 0;
out_unmap:
	kvm_iommu_unmap_memslots(kvm);
	return r;
}
  • kvm_arch_start_assignment 函数只是增加设备的分配计数而已(arch/x86/kvm/x86.c):
void kvm_arch_start_assignment(struct kvm *kvm)
{
	atomic_inc(&kvm->arch.assigned_device_count);
}
  • pci_set_dev_assigned 函数设置设备已被分配的标志(include/linux/pci.h):
/* Helper functions for operation of device flag */
static inline void pci_set_dev_assigned(struct pci_dev *pdev)
{
	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
}

分析总结

  1. Linux 内核中的 pci-stub 驱动生成的 Driver 结构只是与 PCI Bus 产生的 Device 结构进行绑定,没有提供任何用户调用接口,主要作用是防止其他驱动占用这个 Device
  2. 用户态的 Qemu 访问这个 PCI 设备主要通过这个 PCI BUS 驱动为这个 PCI Device 提供的 SysFS 属性文件进行,通过 config 文件读取和配置 PCI 配置空间,通过 resource 文件获取 PCI BAR 的数量级地址和长度,通过 resource%d 文件进内存映射和中断读写;
  3. Qemu 调用 KVM ioctl 主要是进行这个设备的 IOMMU 配置和 KVM 数据结构初始化,以及中断的注入。这应该是为了弥补 pci-stub 驱动和 SysFS 的不足,感觉这是一种修修补补的写法,原本的框架不支持相关功能,而 VFIO irqfd 在设计之初就是为了解决这个问题,且更加优雅和强大( VFIO 还可以提供 MDev 功能),这应该是其备受推崇的根本原因;
  4. 无论是 PCI-Assign 还是 VFIO ,或者 MDev ,其实他们真正透传的只有内存区域,对于配置区域和中断的处理,其实都是需要宿主机的软件参与,他们能降低的主要是大量数据的传输,对于中断和配置的性能并没有本质改善。由于设备配置部分原本在设备的访问中占比不高,所以不是性能瓶颈。但对于中断,在高性能场景,应尽可能减少中断的使用,使用查的方式处理状态变化,可以获取更好的性能。

上述分析,虽尽力求证,但由于水平有限,难免出现谬误,欢迎批评指正!!!

  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值