
Re: Mapping non-pinned memory from one Xen domain into another


  • To: Teddy Astie <teddy.astie@xxxxxxxxxx>, Xen developer discussion <xen-devel@xxxxxxxxxxxxxxxxxxxx>, dri-devel@xxxxxxxxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, Jan Beulich <jbeulich@xxxxxxxx>, Val Packett <val@xxxxxxxxxxxxxxxxxxxxxx>, Ariadne Conill <ariadne@ariadne.space>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Demi Marie Obenour <demiobenour@xxxxxxxxx>
  • Date: Fri, 27 Mar 2026 13:18:27 -0400
  • Delivery-date: Fri, 27 Mar 2026 17:18:48 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 3/26/26 14:26, Teddy Astie wrote:
> On 26/03/2026 at 18:18, Demi Marie Obenour wrote:
>> On 3/24/26 14:00, Teddy Astie wrote:
>>>> ## Restrictions on lent memory
>>
>>>> Lent memory is still considered to belong to the lending domain.
>>>> The borrowing domain can only access it via its p2m.  Hypercalls made
>>>> by the borrowing domain act as if the borrowed memory was not present.
>>>> This includes, but is not limited to:
>>>>
>>>> - Using pointers to borrowed memory in hypercall arguments.
>>>> - Granting borrowed memory to other VMs.
>>>> - Any other operation that depends on whether a page is accessible
>>>>     by a domain.
>>>
>>> What about emulated instructions that refer to this memory?
>>
>> This would be allowed if (and only if) it can trigger paging as you
>> wrote above.
>>
>>>> Furthermore:
>>>>
>>>> - Borrowed memory isn't mapped into the IOMMU of any PCIe devices
>>>>     the guest has attached, because IOTLB faults generally are not
>>>>     replayable.
>>>>
>>>
>>> Given that (as written below) borrowed memory is part of some form of
>>> emulated BAR or special region, there is no guarantee that DMA will work
>>> properly anyway (unless P2P DMA support is advertised).
>>>
>>> Splitting the IOMMU side from the P2M is not a good idea as it rules out
>>> the "IOMMU HAP PT Share" optimization.
>>
>> If the pages are mapped in the IOMMU, paging them out requires an
>> IOTLB invalidation.  My understanding is that these are far too slow.
>>
> 
> Yes (aside from specific cases, such as with a paravirtualized IOMMU),
> but only if you have a device in the guest.
> 
> The problem is that this would force us to modify the ABI to have 
> "non-DMA-able" memory in the guest, which doesn't exist yet aside from 
> specific cases like grants in PV.

This would make the mechanism *de facto* incompatible with PCI
passthrough.  That is unfortunate but not a dealbreaker for most
applications.  It's quite annoying, though, because of dual-GPU setups
where one GPU is paravirtualized and the other is passed through.

I don't think it necessarily needs any new guest ABI changes.
As you pointed out, guests are not allowed to assume that P2PDMA
works, so if the guest tries to DMA to these pages it's a guest bug.
This means that whether the pages can be DMA'd to or not is not a
guest-facing ABI.

That said, this should not block getting this feature implemented.

>> How important is sharing the HAP and IOMMU page tables?
>>
>>>> - Foreign mapping hypercalls that reference lent memory will fail.
>>>>     Otherwise, the domain making the foreign mapping hypercall could
>>>>     continue to access the borrowed memory after the lease had been
>>>>     revoked.  This holds even if the domain performing the foreign
>>>>     mapping is an all-powerful dom0; otherwise, an emulated device
>>>>     could access memory whose lease had been revoked.
>>>>
>>>> This also means that live migration of a domain that has borrowed
>>>> memory requires cooperation from the lending domain.  For now, it
>>>> will be considered out of scope.  Live migration is typically used
>>>> with server workloads, and accelerators for server hardware often
>>>> support SR-IOV.
>>>>
>>>> ## Where will lent memory appear in a guest's address space?
>>>>
>>>> Typically, lent memory will be an emulated PCI BAR.  It may be emulated
>>>> by dom0 or an alternate ioreq server.  However, it is not *required*
>>>> to be a PCI BAR.
>>>>
>>>
>>> ---
>>>
>>> While the design could work (albeit with the implied complexity), I'm
>>> not a big fan of it; or at least, it needs to consider some constraints
>>> to have reasonable performance.
>>> One of the big issues is that a performance-sensitive system (a
>>> virtualized GPU) is interlocked with several hard-to-optimize
>>> subsystems, like the P2M, or Dom0 having to process a paging event.
>>>
>>> Modifying the P2M (especially removing entries) is a fairly expensive
>>> operation as it sometimes requires pausing all the vCPUs each time it's
>>> done.
>>
>> Not every GPU supports recoverable page faults.  Even when they
>> are supported, they are extremely expensive.  Each of them involves
>> a round-trip from the GPU to the CPU and back, which means that a
>> potentially very large number of GPU cores are blocked until the
>> CPU can respond.  Therefore, GPU driver developers avoid relying on
>> GPU page faults whenever possible.  Instead, data is moved in large
>> chunks using a dedicated DMA engine in the GPU.
>> As a result, I'm not too concerned with the cost of P2M manipulation.
>> Anything that requires making a GPU buffer temporarily inaccessible
>> is already an expensive process, and driver developers have strong
>> incentives to keep the time the buffer is unmapped as short as
>> possible.
>> If performance turns out to be a problem, something like KVM's
>> asynchronous page faults might be a better solution.
>>
> 
> Asynchronous page faults look like an interesting and potentially 
> easier approach to implement.
> 
> IIUC, the idea is to make the pages disappear on the guest's behalf, 
> and the guest would have to deal with the eventual page fault. 
> Currently in Xen, an unhandled #NPF is fatal, but that could be relaxed 
> for specific regions and transformed into a #PF or another exception 
> for the guest to handle.

Yup!

> We actually have a similar need for SEV-ES MMIO handling, as we need to 
> distinguish "MMIO-related NPF" (to paravirtualize through the GHCB) 
> from other NPFs; this needs to be configured in advance in the page 
> tables (so that the CPU chooses between #VC and a VMEXIT on #NPF).
> 
> It would also need some form of paravirtualization, via virtio or a 
> new Xen PV driver, for the guest to be made aware of this mechanism.
> It also assumes that the guest handles that kind of event properly.

On KVM, asynchronous page faults are purely an optimization.  I have
a few concerns with relying entirely on them:

1. Can guest userspace use this to crash the guest kernel?  What
   happens if the guest kernel takes a fault in copy_{to,from}_user()?

2. Can this be made to work with Windows guests?

3. Could this run into a livelock problem?  Xen could tell the guest
   that the page is ready, but by the time the guest gets around to
   scheduling the userspace program, the page has been paged out again.

>>> If it's done at 4k granularity, it would also lack superpage support,
>>> which wouldn't help either.  (Doing things at the 2M+ scale would
>>> help, but I don't know enough about how the MMU notifier does things.)

As an aside, graphics very much needs huge pages.  On AMD, using 4K
pages means a 30% performance hit.

>>> While I agree that grants are not an adequate mechanism for this (for
>>> multiple reasons), I'm not fully convinced of the proposal.
>>> I would prefer a strategy where we map a fixed amount of RAM+VRAM as a
>>> blob, along with some form of cooperative hotplug mechanism to
>>> dynamically provision the amount.
>>
>> I asked the GPU driver developers about pinning VRAM like this a couple
>> years ago or so.  The response I got was that it isn't supported.
>> I suspect that anyone needing VRAM pinning for graphics workloads is
>> using non-upstreamable hacks, most likely specific to a single driver.
>>
>> More generally, the entire graphics stack receives essentially no
>> testing under Xen.  There have been bugs that have affected Qubes OS
>> users for months or more, and they went unfixed because they couldn't
>> be reproduced outside of Xen.  To the upstream graphics developers,
>> Xen might as well not exist.  This means that any solution that
>> requires changing the graphics stack is not a practical option,
>> and I do not expect this to change in the foreseeable future.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

Attachment: OpenPGP_0xB288B55FFF9C22C1.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature


 

