Xen project Mailing List

Re: [Xen-devel] [PATCH] libxl: Don't insert PCI device into xenstore for HVM guests

To: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>

From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Wed, 10 Jun 2015 16:10:56 -0400

Cc: Wei Liu <wei.liu2@xxxxxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>, Ian Jackson <ian.jackson@xxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx, Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>

Delivery-date: Wed, 10 Jun 2015 20:11:08 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Jun 02, 2015 at 04:48:25PM +0100, Malcolm Crossley wrote:
> On 02/06/15 15:34, Konrad Rzeszutek Wilk wrote:
> > On Tue, Jun 02, 2015 at 11:06:26AM +0100, Malcolm Crossley wrote:
> >> On 01/06/15 18:55, Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jun 01, 2015 at 05:03:14PM +0100, Malcolm Crossley wrote:
> >>>> On 01/06/15 16:43, Ross Lagerwall wrote:
> >>>>> On 06/01/2015 04:26 PM, Konrad Rzeszutek Wilk wrote:
> >>>>>> On Fri, May 29, 2015 at 08:59:45AM +0100, Ross Lagerwall wrote:
> >>>>>>> When doing passthrough of a PCI device for an HVM guest, don't insert
> >>>>>>> the device into xenstore, otherwise pciback attempts to use it which
> >>>>>>> conflicts with QEMU.
> >>>>>>
> >>>>>> How does it conflict?
> >>>>>
> >>>>> It doesn't work with repeated use. See below.
> >>>>>
> >>>>>>>
> >>>>>>> This manifests itself such that the first time a device is passed to a
> >>>>>>> domain, it succeeds. Subsequent attempts fail unless the device is
> >>>>>>> unbound from pciback or the machine rebooted.
> >>>>>>
> >>>>>> Can you be more specific please? What are the issues? Why does it
> >>>>>> fail?
> >>>>>
> >>>>> Without this patch, if a device (e.g. a GPU) is bound to pciback and
> >>>>> then passed through to a guest using xl pci-attach, it appears in the
> >>>>> guest and works fine. If the guest is rebooted, and the device is again
> >>>>> passed through with xl pci-attach, it appears in the guest as before but
> >>>>> does not work. In Windows, it gets something like Error Code 43 and on
> >>>>> Linux, the Nouveau driver fails to initialize the device (with error -22
> >>>>> or something). The only way to get the device to work again is to reboot
> >>>>> the host or unbind and rebind it to pciback.
> >>>>>
> >>>>> With this patch, it works as expected. The device is bound to pciback
> >>>>> and works after being passed through, even after the VM is rebooted.
> >>>>>
> >>>>>>
> >>>>>> There are certain things that pciback does to "prepare" a PCI device
> >>>>>> which QEMU also does. Some of them - such as saving the configuration
> >>>>>> registers (And then restoring them after the device has been detached)
> >>>>>> -
> >>>>>> is something that QEMU does not do.
> >>>>>>
> >>>>>
> >>>>> I really have no idea what the correct thing to do is, but the current
> >>>>> code with qemu-trad doesn't seem to work (for me).
> >
> > I think I know what the problem is. Do you by any chance have the
> > XSA133-addenum
> > patch in? If not could you apply it and tell me if it works?

Ping? It was XSA120-addendum.

> >
> >>>>
> >>>> The pciback pci_stub.c implements the pciback.hide and the device reset
> >>>> logic.
> >>>>
> >>>> The rest of pciback implements the pciback xenbus device which PV guests
> >>>> need in order to map/unmap MSI interrupts and access PCI config space.
> >>>>
> >>>> QEMU emulates and handles the MSI interrupt capabilities and PCI config
> >>>> space directly.
> >>>
> >>> Right..
> >>>>
> >>>> This is why a pciback xenbus device should not be created for
> >>>> passthrough PCI device being handled by QEMU.
> >>>
> >>> To me that sounds that we should not have PV drivers because QEMU
> >>> emulates IDE or network devices.
> >>
> >> That is different. We first boot with QEMU handling the devices and then
> >> we explicitly unplug QEMU's handling of IDE and network devices.
> >>
> >> That handover protocol does not currently exist for PCI passthrough
> >> devices so we have to choose one mechanism or the other to manage the
> >> passed through PCI device at boot time. Otherwise a HVM guest could load
> >> pcifront and cause all kinds of chaos with interrupt management or
> >> outbound MMIO window management.
> >
> > Which would be fun! :-)
> >>
> >>>
> >>> The crux here is that none of the operations that pciback performs
> >>> should affect QEMU or guests. But it does - so there is a bug.
> >>
> >> I agree there is a bug but should we try to fix it based upon my
> >> comments above?
> >
> > I am still thinking about it. I do like certain things that pciback
> > does as part of it being notified that a device is to be used by
> > a guest and performing the configuration save/reset (see
> > pcistub_put_pci_dev in the pciback).
> >
> > If somehow that can still be done by libxl (or QEMU) via SysFS
> > that would be good.
> >
> > Just to clarify:
> > - I concur with you that having xen-pcifront loaded in HVM
> >   guest and doing odd things behind QEMU is not good.
> > - I like the fact that xen-pciback does a bunch of safety
> >   things with the PCI device to prepare it for a guest.
> > - Currently these 'safety things' are done when you
> >   'unbind' or 'bind' the device to pciback.
> > - Or when the guest is shutdown and via XenBus we are told
> >   and can do the 'safety things'. This is the crux - if there
> >   is a way to do this via SysFS this would be super.
> >
> > Or perhaps xenpciback can figure out that the guest is HVM
> > and ignore any XenBus actions?
> >
>
> Xenserver toolstack currently bind/unbinds the device to pciback and
> manually triggers the reset on the device. It then uses xenstore keys to
> communicate with the QEMU hotplug mechanism.
>
> A complete description is here:
>
> "Xenopsd performs the following steps for HVM PCI passthrough (for each
> PCI device):
>
> write /local/domain/0/backend/pci/<domid>/0/msitranslate “0” or “1”
> write /local/domain/0/backend/pci/<domid>/0/power_mgmt “0” or “1”
> bind device to pciback (write device id to the new_slot and bind
> nodes in sysfs)
> write /local/domain/0/backend/device-model/<domid>/command “pci-ins”
> write /local/domain/0/backend/device-model/<domid>/parameter
> “xxxx:xx:xx.x”
> wait for “pci-inserted” to appear in
> /local/domain/0/backend/<domid>/device-model/state
> write /local/domain/0/backend/pci/<domid>/0/dev-<x> “xxxx:xx:xx.x”
> if /local/domain/<domid>/device/pci/0 does not yet exist, then
> create /local/domain/<domid>/device/pci/0, give ownership to the guest,
> and write backend=/local/domain/0/backend/pci/<domid>/0, backend-id=0,
> state=1
> xc_domain_assign_device
>
>
> To unplug:
>
> write /local/domain/0/backend/device-model/<domid>/command “pci-rem”
> write /local/domain/0/backend/device-model/<domid>/parameter
> “xxxx:xx:xx.x”
> wait for “pci-removed” to appear in
> /local/domain/0/backend/<domid>/device-model/state
> remove /local/domain/0/backend/pci/<domid>/0/dev-<x>
> call /usr/lib/xcp/lib/pci-flr flr-pre xxxx:xx:xx.x

'pci-flr' ? User-space program to poke the configuration registers?

> if the file /sys/bus/pci/devices/xxxx:xx:xx.x/reset exists, then
> write "1" to it; otherwise write "xxxx:xx:xx.x" to
> /sys/bus/pci/drivers/pciback/do_flr

Do you have a patch to expose this? This parameter does not exist in the
upstream Linux kernel.

> call /usr/lib/xcp/lib/pci-flr flr-post xxxx:xx:xx.x
> xc_domain_deassign_device"
>
>
> I would recommend that libxl performs similar operations (except for the
> QEMU hotplug part and the flr script parts).
>
> We should probably split pciback into parts:
>
> 1. A part to capture device at boot and prevent other drivers loading
> (bind/unbind)
>
> 2. A part to manage reset on devices which don't have device specific
> resets.

Not following you. I think you mean if the standard PCI 'reset' is not
exposed because the device has some quirks or it truly does not expose
FLR, D3/D0 switch, or any other ways to reset it?

>
> 3. A part to manage the pciback PV device for communicating to pcifront.
>
>
> I don't think we can have pciback work out if it's a HVM guest and I
> think we should instead use the toolstack to prep the device correctly
> before passthrough. The toolstack is donating its PCI device to the new

I do not know what is the right way. The kernel has a lot of knowledge
about these devices and can do proper locking and deal with quirks.
QEMU also has some idea of what to do and how to reset a device.

> domain after all :)
>
>
> >>>
> >>> I would like to understand which ones do it so I can fix in
> >>> pciback - as it might be also be a problem with PV.
> >>>
> >>> Unless... are you by any chance using extra patches on top of the
> >>> native pciback?
> >>
> >> We do have extra patches but they only allow us to do a SBR on PCI
> >> devices which require it. The failure listed above occurs on devices
> >> with device specific resets (e.g. FLR,D3) as well so those extra patches
> >> aren't being used.
> >>
> >>>
> >>>>
> >>>> Malcolm
> >>>>
> >>>>>
> >>>>> Regards
> >>>>
> >>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
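
For readers following the reset step in the quoted unplug sequence, here is a rough Python sketch of that single decision: write "1" to the device's sysfs reset node if it exists, otherwise write the device address to pciback's do_flr node. The function names and the sysfs_root parameter are made up for illustration and are not taken from xenopsd or libxl. As noted in the reply, do_flr is not exposed by the upstream Linux pciback driver, so the fallback branch should be read as specific to the XenServer kernel.

```python
import os


def pick_reset_write(bdf, sysfs_root="/sys"):
    """Return the (path, value) pair the quoted unplug step would write.

    bdf is a PCI address such as "0000:01:00.0". sysfs_root is a
    hypothetical parameter so the logic can be tried on a fake tree.
    """
    reset_node = os.path.join(sysfs_root, "bus/pci/devices", bdf, "reset")
    if os.path.exists(reset_node):
        # Standard per-device reset node: the kernel picks the reset method.
        return reset_node, "1"
    # Fallback from the quoted description; assumed XenServer-only,
    # since upstream pciback does not provide do_flr.
    do_flr = os.path.join(sysfs_root, "bus/pci/drivers/pciback/do_flr")
    return do_flr, bdf


def reset_device(bdf, sysfs_root="/sys"):
    """Perform the write chosen by pick_reset_write and return the path used."""
    path, value = pick_reset_write(bdf, sysfs_root)
    with open(path, "w") as node:
        node.write(value)
    return path
```

On a real host this needs root and a device that is not in use by a guest; pointing sysfs_root at a scratch directory is enough to try out the branch selection.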
