Re: xen/arm: Virtio-PCI for dom0less on ARM
On Wed, May 21, 2025 at 01:22:11PM +0100, Julien Grall wrote:
> Hi Edgar,
>
> Thanks for the write-up.

Thanks for commenting!

> On 20/05/2025 09:04, Edgar E. Iglesias wrote:
> > Hi all,
> >
> > Following up on the ARM virtio-pci series I posted a while back.
> >
> > There have been some concerns around the delayed and silent
> > appearance of devices on the ECAM area. The spec is not super clear
> > on whether this is OK or not, but I'm providing some references to
> > the PCI specs and to some real cases where this is used for FPGAs.
> >
> > There are two options for how to implement virtio-pci that we've
> > discussed:
> > 1. VPCI + IOREQ
> > 2. IOREQ only
> >
> > There are pros and cons with both. For example, #1 has the benefit
> > that we would only have a single PCIe RC (in Xen) and we could
> > emulate a hotplug-capable expansion port with a standard way to
> > notify when PCI devices plug in.
> > Approach #2 has the benefit that there is (almost) no additional
> > complexity or code added to Xen; almost everything lives outside.
> > IMO, both options have merit and both could co-exist.
> >
> > For dynamic xl flows, option #2 already works without modifications
> > to Xen. Users need to pass the correct command-line options to QEMU
> > and a device-tree fragment with the pci-generic-ecam-host device.
>
> IIUC, in approach #2, QEMU will emulate the host controller. In Xen,
> we also support multiple IOREQ servers. For instance, IOREQ A may
> emulate a GPU device, whereas IOREQ B could emulate a disk. This is
> useful in cases where one may want a separate domain to handle GPUs.
>
> With approach #2, it sounds like you will end up having one host
> controller per IOREQ server. The user will also need to know them in
> advance. Is my understanding correct? If so, then it feels like this
> is defeating the purpose of IOREQ.

I don't think that is necessarily the case. Option #2 would have the
controller outside of Xen, and an implementation is free to split it up
into multiple processes, each with an IOREQ server. It is also free to
do a monolithic implementation or some kind of mix.

Today, QEMU supports 3 ways:
1. A monolithic PCI host bridge + PCI endpoints, all in the same
   process.
2. A PCI host bridge in one process and distributed PCI endpoints in
   separate processes (vfio-user).
3. A PCI host bridge with partial virtio-PCI transports in a single
   process + distributed virtio backends (over vhost-user) in separate
   processes.

If you refer to guests knowing about PCIe controllers in advance, yes,
but I don't see why we couldn't use something like DT overlays and
kernel modules to enable them at runtime. For PCI endpoints behind the
controller, an expansion port with hotplug can be emulated, allowing
devices to be hot-plugged at runtime.

With regards to isolation, I don't really see a benefit with option #1.
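For what it's worth, whichever of the above layouts is used, the
component emulating the host bridge ends up doing the same ECAM decode
on config-space accesses. A minimal sketch of that decode (the field
layout is from the PCIe base spec; the struct and function names below
are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoded view of one ECAM config-space access. */
struct ecam_cfg_access {
    uint8_t  bus;    /* bits 27:20 of the ECAM offset */
    uint8_t  dev;    /* bits 19:15 */
    uint8_t  fn;     /* bits 14:12 */
    uint16_t reg;    /* bits 11:0, register offset in config space */
};

/* Decode an offset into the ECAM window (per the PCIe ECAM layout). */
static struct ecam_cfg_access ecam_decode(uint64_t offset)
{
    struct ecam_cfg_access a = {
        .bus = (offset >> 20) & 0xff,
        .dev = (offset >> 15) & 0x1f,
        .fn  = (offset >> 12) & 0x7,
        .reg = offset & 0xfff,
    };
    return a;
}

int main(void)
{
    /* Example: ECAM offset 0x100000 -> bus 1, dev 0, fn 0, reg 0. */
    struct ecam_cfg_access a = ecam_decode(0x100000);
    printf("bus %u dev %u fn %u reg 0x%x\n",
           (unsigned)a.bus, (unsigned)a.dev, (unsigned)a.fn,
           (unsigned)a.reg);
    return 0;
}

Nothing here ties the decode to a single process; whether the result is
handled by a local device model or forwarded to a vfio-user/vhost-user
backend is a deployment choice.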
>
> This is the first reason why I feel approach #1 is more suitable.
>
> >
> > For static dom0less flows, we can do the same as for xl flows, but
> > we have the additional problem of the domU's PCI bus enumeration
> > racing with QEMU.
> > On x86, when domUs access a memory range that has not yet got IOREQs
> > connected to it, the accesses succeed, with reads returning
> > 0xFFFFFFFF and writes ignored. This makes it easy for guests to wait
> > for IOREQ devices to pop up, since guests will find an empty bus and
> > can initiate a rescan later when QEMU attaches. On ARM, we trap on
> > these accesses.
> >
> > If we add support on ARM for MMIO background regions with a default
> > read value, i.e. MMIO handlers that have lower priority than IOREQs
> > and that are read-const + writes-ignored, we could support the same
> > flow on ARM.
> > This may be generally useful for other devices as well (e.g.
> > virtio-mmio or something else). We could also use this to defer PCI
> > enumeration.
>
> Regardless of what I wrote above, if we are going down the route of
> returning 0xFFFFFFFF, I would just do it for all I/O rather than only
> the ranges specified in the device-tree. This could still be behind a
> per-domain option, but it would at least make it simpler to set up the
> system (AFAIU, in your current proposal, we would need to specify the
> range in multiple places).

Yes, the range would go into Xen's static domain config node and into
the passthrough fragment describing the PCI host bridge. Typically only
one range is needed (ECAM).

In my prototype, I've got this for the static domain config:

domU1 {
    compatible = "xen,domain";
    ...
    mmio-background-regions = <
        0xf1 0x0              // Base
        0x0 0x10000000        // Size
        0xffffffff 0xffffffff // Read-value
        // Additional ranges may follow
    >;
};
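To make the intended semantics concrete, here is a small, standalone
model of the lookup order I have in mind. This is not Xen code; the
struct and function names are invented, and the IOREQ side is reduced
to a stub:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Made-up description of one background region, mirroring the DT
 * triplets above: base, size, read value. */
struct mmio_background_region {
    uint64_t base;
    uint64_t size;
    uint64_t read_value;
};

static const struct mmio_background_region regions[] = {
    /* ECAM window from the example config above. */
    { 0xf100000000ULL, 0x10000000ULL, 0xffffffffffffffffULL },
};

/* Stub standing in for the existing IOREQ dispatch: returns true if an
 * IOREQ server has claimed the address and handled the access. */
static bool ioreq_try_handle(uint64_t addr, bool is_write, uint64_t *val)
{
    (void)addr; (void)is_write; (void)val;
    return false; /* nothing attached yet, e.g. QEMU hasn't started */
}

static const struct mmio_background_region *background_find(uint64_t addr)
{
    for (size_t i = 0; i < sizeof(regions) / sizeof(regions[0]); i++)
        if (addr >= regions[i].base &&
            addr - regions[i].base < regions[i].size)
            return &regions[i];
    return NULL;
}

/* Model of the proposed dispatch order: IOREQ first, background region
 * second, otherwise fault (today's ARM behaviour). */
static bool handle_mmio(uint64_t addr, bool is_write, uint64_t *val)
{
    const struct mmio_background_region *r;

    if (ioreq_try_handle(addr, is_write, val))
        return true;

    r = background_find(addr);
    if (r) {
        if (!is_write)
            *val = r->read_value; /* reads return the configured value */
        return true;              /* writes are silently ignored */
    }

    return false; /* no handler: inject a data abort as today */
}

int main(void)
{
    uint64_t val = 0;

    /* A config read of a not-yet-attached device: guest sees ~0. */
    if (handle_mmio(0xf100000000ULL, false, &val))
        printf("read 0x%llx\n", (unsigned long long)val);

    /* An access outside any region still faults. */
    printf("outside region handled: %d\n",
           handle_mmio(0x12345678ULL, false, &val));
    return 0;
}

The point is just that the background region only ever answers when no
IOREQ server has claimed the range, so once QEMU attaches its ECAM
pages the guest starts seeing real config space, and a rescan finds the
devices.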
>
> >
> > So for the next versions of this series, I was thinking to remove
> > the PCI specifics and instead add FDT bindings to ARM dom0less,
> > enabling setup of user-configurable (by address range and read
> > value) background MMIO regions.
> > Xen would then support option #2 without any PCI specifics added.
> >
> > Thoughts?
> >
> > Cheers,
> > Edgar
> >
> > # References to spec
> >
> > PCI Express base specification:
> > 7.5.1.1.1 Vendor ID Register (Offset 00h)
> > The Vendor ID register is HwInit and the value in this register
> > identifies the manufacturer of the Function. In keeping with PCI-SIG
> > procedures, valid vendor identifiers must be allocated by the
> > PCI-SIG to ensure uniqueness. Each vendor must have at least one
> > Vendor ID. It is recommended that software read the Vendor ID
> > register to determine if a Function is present, where a value of
> > FFFFh indicates that no Function is present.
> >
> > PCI Firmware Specification:
> > 3.5 Device State at Firmware/Operating System Handoff
> > Page 34:
> > The operating system is required to configure PCI subsystems:
> > - During hotplug
> > - For devices that take too long to come out of reset
> > - PCI-to-PCI bridges that are at levels below what firmware is
> >   designed to configure
> >
> > Page 36:
> > Note: The operating system does not have to walk all buses during
> > boot. The kernel can automatically configure devices on request;
> > i.e., an event can cause a scan of I/O on demand.
>
> I am not sure why you quote this. To me it reads like this is up to
> the OS to decide when to access the PCI bus. As we don't control the
> OS, this doesn't seem a behavior Xen can rely on.
>
> >
> > FPGAs can be programmed at runtime and appear on the ECAM bus
> > silently. A PCI rescan needs to be triggered for the OS to discover
> > the device:
> > Intel FPGAs:
> > https://www.intel.com/content/www/us/en/docs/programmable/683190/1-3-1/how-to-rescan-bus-and-re-enable-aer.html
>
> To clarify, you are saying the ECAM bus may be completely empty (e.g.
> everything is reading as ~0) and some part of the ECAM will return a
> non-~0 value when the FPGA runs.
>
> That said, the FPGA behavior is IMHO slightly different. I would
> expect that for an FPGA, one would know when the device is present
> because they would have programmed the FPGA. In our case, we are
> trying to solve a race introduced by Xen (not the user itself). So it
> feels wrong to ask the user to "probe in a loop until it works".
>
> This is the other reason why approach #1 looks more appealing to me.

FPGA loading has many scenarios: it can be done by the user, who then
also triggers the PCI rescan, or it can be done at boot by another
machine or by another VM. There are ways to get the PCI link up quickly
at boot, but they are not always used, introducing a similar problem to
the one with virtio-pci in Xen.

Note that approach #1 does not remove boot dependencies. For example,
if a guest wants to boot from a virtio-blk disk, it will get notified
when the disk gets hot-plugged, but the kernel may have already failed
to mount the disk. We would need a way to configure guests to wait for
a specific device to appear.
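One way to express such a dependency today, without new Xen support,
would be a small guest-side helper run from the initramfs that polls
for the device and pokes the same rescan file the Intel link above
uses. A rough sketch; the BDF 0000:00:01.0 and the timeout are just
placeholders:

#include <stdio.h>
#include <unistd.h>

/* Placeholder BDF of the device the guest wants to wait for. */
#define WANTED_DEV "/sys/bus/pci/devices/0000:00:01.0"

static void pci_rescan(void)
{
    /* Same interface as "echo 1 > /sys/bus/pci/rescan". */
    FILE *f = fopen("/sys/bus/pci/rescan", "w");
    if (f) {
        fputs("1\n", f);
        fclose(f);
    }
}

int main(void)
{
    /* Poll for up to ~30 seconds, rescanning the bus each time. */
    for (int i = 0; i < 30; i++) {
        if (access(WANTED_DEV, F_OK) == 0) {
            printf("device present, continuing boot\n");
            return 0;
        }
        pci_rescan();
        sleep(1);
    }
    fprintf(stderr, "timed out waiting for " WANTED_DEV "\n");
    return 1;
}

Whether this lives in the initramfs or becomes a proper guest
configuration knob is exactly the open question above.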
Cheers,
Edgar