NIC passthrough with VFIO

VFIO is a new method of doing PCI device assignment ("PCI passthrough"
aka "<hostdev>") available in newish kernels (3.6?; it's in Fedora 18 at
any rate) and via the "vfio-pci" device in qemu-1.4+. In contrast to the
traditional KVM PCI device assignment (available via the "pci-assign"
device in qemu), VFIO works properly on systems using UEFI "Secure
Boot"; it also offers other advantages, such as grouping of related
devices that must all be assigned to the same guest (or not at all).
Here's some useful reading on the subject.


  http://lwn.net/Articles/474088/
  http://lwn.net/Articles/509153/


Short description (from Alex Williamson's KVM Forum Presentation)


1) Assume this is the device you want to assign:
01:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
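
If you don't already know the device's PCI address, a quick way to find candidates is an lspci search along these lines (the grep pattern here is only an example):

# lspci -nn | grep -i ethernet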


2) Find the vfio group of this device:
# readlink /sys/bus/pci/devices/0000:01:10.0/iommu_group
../../../../kernel/iommu_groups/15


(so the IOMMU group is 15)


3) Check the devices in the group:
# ls /sys/bus/pci/devices/0000:01:10.0/iommu_group/devices/
0000:01:10.0


(so this group has only 1 device)


4) Unbind from device driver
# echo 0000:01:10.0 >/sys/bus/pci/devices/0000:01:10.0/driver/unbind


5) Find vendor & device ID
$ lspci -n -s 01:10.0
01:10.0 0200: 8086:10ca (rev 01)


6) Bind to vfio-pci
# echo 8086 10ca > /sys/bus/pci/drivers/vfio-pci/new_id


(this will result in a new device node "/dev/vfio/15", which is what qemu will use to set up the device for passthrough)
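
Putting steps 2-6 together, a small helper script along these lines should work. This is only a sketch: DEV is the example device from above, it assumes the vfio-pci module is available, and error handling is omitted.

#!/bin/sh
DEV=0000:01:10.0

# make sure the vfio-pci driver is loaded
modprobe vfio-pci

# step 2: show which IOMMU group the device belongs to
readlink /sys/bus/pci/devices/$DEV/iommu_group

# step 4: unbind from the current host driver, if any
if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
    echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
fi

# steps 5 and 6: read the vendor/device IDs and register them with vfio-pci
VENDOR=$(cat /sys/bus/pci/devices/$DEV/vendor)
DEVID=$(cat /sys/bus/pci/devices/$DEV/device)
echo $VENDOR $DEVID > /sys/bus/pci/drivers/vfio-pci/new_id

# a /dev/vfio/<group> node should now exist
ls -l /dev/vfio/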


7) chown the device node so it is accessible by qemu user:
# chown qemu /dev/vfio/15; chgrp qemu /dev/vfio/15


(note that /dev/vfio/vfio, which is installed as 0600 root:root, must also be made mode 0666, still owned by root - this is supposedly not dangerous)
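
That is, in addition to the chown/chgrp in step 7, something like:

# chmod 0666 /dev/vfio/vfio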


I'll look into this, the intention has always been that /dev/vfio/vfio
is a safe interface that's only empowered when connected to
a /dev/vfio/$GROUP, which implies some privileges.


8) set the limit for locked memory equal to all of guest memory size + [some amount large enough to encompass all of io space]
# ulimit -l 2621440   # ((2048 + 512) * 1024)


9) pass to qemu using -device vfio-pci:


 sudo qemu-system-x86_64 -m 2048 -hda rhel6vm \
              -vga std -vnc :0 -net none \
              -enable-kvm \
              -device vfio-pci,host=01:10.0,id=net0


(qemu will then use something like step (2) to figure out which device node it needs to use)


Why the "ulimit -l"?
--------------------


Any qemu guest that is using the old pci-assign must have *all* guest
memory and IO space locked in memory. Normally the maximum amount of
locked memory allowed for a process is controlled by "ulimit -l", but
in the case of pci-assign, the kvm kernel module has always just
ignored the -l limit and locked it all anyway.


With vfio-pci, all guest memory and IO space must still be locked in
memory, but the vfio module *doesn't* ignore the process limits, so
libvirt will need to set ulimit -l for any guest that wants to do
vfio-based pci passthrough. Since (due to the possibility of hotplug)
we don't know at the time the qemu process is started whether or not
it might need to do a pci passthrough, we will need to use prlimit(2)
to modify the limit of the already-running qemu.
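
For reference, the prlimit(1) utility from util-linux wraps the same syscall and can raise the limit on an already-running qemu by hand. This is just a sketch; the PID lookup and the value (the 2.5 GiB from step 8, expressed in bytes) are only examples:

# prlimit --pid $(pidof qemu-system-x86_64) --memlock=2684354560:2684354560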




Proposed XML Changes
--------------------


To support vfio pci device assignment in libvirt, I'm thinking something
like this (note that the <driver> subelement is already used for
<interface> and <disk> to choose which backend to use for a particular
device):


   <hostdev managed='yes'>
     <driver name='vfio'/>
     ...
   </hostdev>


   <interface type='hostdev' managed='yes'>
     <driver name='vfio'/>


vfio is the overall userspace driver framework while vfio-pci is the
specific qemu driver we're using here.  Does it make more sense to call
this 'vfio-pci'?  It's possible that we could later have a device tree
qemu driver which would need to be invoked with -device vfio-dt (or
something) and have different options.


     ...
   </interface>


(this new use of <driver> inside <interface> wouldn't conflict with
the existing <driver name='qemu|vhost'>, since neither of those could
ever possibly be a valid choice for <interface type='hostdev'>. The
one possible problem would be if someone had an <interface
type='network'> which might possibly point to a hostdev or standard
bridged network, and wanted to make sure that in the case of a bridged
network, <driver name='qemu'/> was used. I suppose in this case,
the driver name in the network definition would override any driver
name in the interface?)


Speaking of <network>, here's how vfio would be specified in a hostdev <network> definition:


   <network>
     <name>vfio-net</name>
     <forward mode='hostdev' managed='yes'>
       <driver name='vfio'/>
       <pf dev='eth3'/>   <!-- or a list of VFs -->
     </forward>
     ...
   </network>


Another possibility for the <network> XML would be to add a
"driver='vfio'" to each individual <interface> line, in case someone
wanted some devices in a pool to be assigned using vfio and some using
the old style, but that seems highly unlikely (and could create
problems in the future if we ever needed to add a 2nd attribute to the
<driver> element).


Actually, at one point I considered that vfio should be turned on
globally in libvirtd.conf (or qemu.conf), but that would make
switchover a tedious process, as all existing guests using PCI
passthrough would need to be shutdown prior to the change. As long as
there are no technical problems with allowing both types on the same
host, it's more flexible to choose on a device-by-device basis.

Now some questions:


1) Is there any reason that we shouldn't/can't allow both pci-assign
and vfio-pci at the same time on the same host (and even guest)?


vfio-pci and pci-assign can be mixed, but don't intermix devices within
a group.  Sometimes this will work (if the grouping is for isolation
reasons), but sometimes it won't (when the grouping is for visibility).
Best to just avoid that scenario.
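
A quick way to check for that situation is to list which driver each device in a group is currently bound to, e.g. for group 15 from the example above (just a sketch):

for d in /sys/kernel/iommu_groups/15/devices/*; do
    if [ -e "$d/driver" ]; then
        echo "$(basename $d): $(basename $(readlink $d/driver))"
    else
        echo "$(basename $d): no driver bound"
    fi
done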


2) Does it make any sense to support a "managed='no'" mode for vfio,
which skips steps 2-6 above? (This would be parallel to the existing
pci-assign managed='no', where no unbinding/binding of the device to
the host's pci-stub driver is done, but the device name is simply
passed to qemu assuming that all that work was already done.) Or
should <driver name='vfio'/> automatically mean that all
unbinding/binding is done for each device?


I don't think it hurts to have it, but I can't think of a use case.
Even with pci-assign, I can only think of cases where customers have
used it to try to work around things they shouldn't be doing with it.


3) Is it at all bothersome that qemu must be the one opening the
device node, and that there is apparently no way to have libvirt open
it and send the fd to qemu?


I have the same question.  The architecture of vfio is that the user
will open /dev/vfio/vfio (vfiofd) and add a group to it (groupfd).
Multiple groupfds can be added to a single vfiofd, allowing groups to
share IOMMU domains.  However, it's not guaranteed that the IOMMU driver
will allow this (the domains may be incompatible).  Qemu will therefore
attempt to add any new group to an existing vfiofd before re-opening a
new one.  There's also the problem that a group has multiple devices, so
if device A from group X gets added with vfiofd and groupXfd and libvirt
then passes a new vfiofd' and groupXfd' for attaching device B, also
from group X... what's qemu to do?


So in order to pass file descriptors libvirt has to either know exactly
how things are working or just always pass a vfiofd and groupfd, which
qemu will discard if it doesn't need.  The latter implies that fds could
live on and be required past the point where the device that added them
has been removed (in the example above, add A and qemu uses vfiofd and
groupXfd, hot add B and qemu discards vfiofd' and groupXfd', remove A
and qemu continues to use vfiofd and groupXfd for B). 
 
*********************************************************************
-device pci-assign is no longer supported in current qemu; using it now fails with "invalid argument".
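
For comparison with the vfio-pci line in step 9, the legacy invocation that now fails looked something like:

 -device pci-assign,host=01:10.0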
 

In the latest kernels, the legacy KVM_ASSIGN mechanism is deprecated in favor of vfio. If you still want to use legacy KVM assignment, you have to edit the kernel .config by hand and set CONFIG_KVM_DEVICE_ASSIGNMENT=y before kvm assign will work. Note that the option no longer appears in make menuconfig, so it really does have to be added manually (e.g. with vim).


Looking at the code, assigned-dev.c is the implementation of kvm_assign, and it is only compiled when CONFIG_KVM_DEVICE_ASSIGNMENT is selected:

arch/x86/kvm/Makefile:

kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += assigned-dev.o iommu.o
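
To check whether a given kernel build has the option (and to add it by hand if it doesn't), something along these lines:

$ grep KVM_DEVICE_ASSIGNMENT .config
(if nothing is printed, add the line "CONFIG_KVM_DEVICE_ASSIGNMENT=y" manually and rebuild)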


Here is an article on how to assign devices with the kvm-pci-assign mechanism:

http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM


References:

http://www.spinics.net/lists/kvm/msg120779.html

http://nanxiao.me/en/why-does-qemu-complain-no-iommu-found/

Likewise, using kvm-assign with the latest QEMU runs into problems:

"qemu-system-x86_64: pci_get_msi_message: unknown interrupt type"

This, too, is tied to VFIO.

If you want to use kvm-pci-assign, use a QEMU release older than 2.6.0.

Reference:

http://qemu.11.n7.nabble.com/PATCH-v9-00-25-IOMMU-Enable-interrupt-remapping-for-Intel-IOMMU-td412217.html



There is also a repository where many useful KVM scripts can be downloaded:

https://github.com/smilejay/kvm-book.git

