
Re: Mapping memory into a domain


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Xenia Ragiadakou <Xenia.Ragiadakou@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • From: Demi Marie Obenour <demiobenour@xxxxxxxxx>
  • Date: Fri, 9 May 2025 00:52:28 -0400
  • Cc: Alejandro Vallejo <agarciav@xxxxxxx>, Xen developer discussion <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Xen-devel <xen-devel-bounces@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 09 May 2025 04:52:15 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 5/8/25 3:52 AM, Roger Pau Monné wrote:
> On Wed, May 07, 2025 at 08:36:07PM -0400, Demi Marie Obenour wrote:
>> On 5/7/25 1:39 PM, Roger Pau Monné wrote:
>>> On Tue, May 06, 2025 at 04:56:12PM -0400, Demi Marie Obenour wrote:
>>>> On 5/6/25 9:06 AM, Alejandro Vallejo wrote:
>>>>> On Tue May 6, 2025 at 3:02 AM CEST, Demi Marie Obenour wrote:
>>>>>> On 5/5/25 7:32 AM, Alejandro Vallejo wrote:
>>>>>>> I suppose this is still about multiplexing the GPU driver the way we
>>>>>>> last discussed at Xen Summit?
>>>>>>>
>>>>>>> On Mon May 5, 2025 at 12:51 AM CEST, Demi Marie Obenour wrote:
>>>>>>>> What are the appropriate Xen internal functions for:
>>>>>>>>
>>>>>>>> 1. Turning a PFN into an MFN?
>>>>>>>> 2. Mapping an MFN into a guest?
>>>>>>>> 3. Unmapping that MFN from a guest?
>>>>>>>
>>>>>>> The p2m is the single source of truth about such mappings.
>>>>>>>
>>>>>>> This is all racy business. You want to keep the p2m lock for the full
>>>>>>> duration of whatever operation you wish to do, or you risk another CPU
>>>>>>> taking it and pulling the rug under your feet at the most inconvenient
>>>>>>> time.
>>>>>>>
>>>>>>> In general all this faff is hidden under way too many layers beneath
>>>>>>> copy_{to,from}_guest().  Other high-level p2m manipulation constructs
>>>>>>> that do interesting things and may be worth looking at are
>>>>>>> {map,unmap}_mmio_region().
>>>>>>>
>>>>>>> Note that not every pfn has an associated mfn. Not even every valid pfn
>>>>>>> necessarily has an associated mfn (there's PoD). And all of this is
>>>>>>> volatile business in the presence of a balloon driver or vPCI placing
>>>>>>> mmio windows over guest memory.
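
As an aside for anyone following along: my current understanding is that
the x86 lookup path looks roughly like the sketch below.  This is a sketch
only; the helper name is made up, Arm has a different interface, and the
exact signatures vary between Xen versions.

    /* Sketch only (x86): simplified error handling, not a tested patch. */
    static int demo_gfn_to_mfn(struct domain *d, unsigned long gfn)
    {
        p2m_type_t t;
        mfn_t mfn = get_gfn(d, gfn, &t);   /* takes the per-gfn p2m lock */

        if ( !mfn_valid(mfn) || t != p2m_ram_rw )
        {
            put_gfn(d, gfn);               /* drops the lock/reference */
            return -ESRCH;
        }

        /* ... use the mfn while the reference is held ... */

        put_gfn(d, gfn);
        return 0;
    }
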
>>>>>>
>>>>>> Can I check that POD is not in use?  
>>>>>
>>>>> Maybe, but now you're reaching exponential complexity, taking each
>>>>> individual knob of the p2m into account.
>>>>>
>>>>>>
>>>>>>> In general anything up this alley would need a cohesive pair for
>>>>>>> map/unmap and a credible plan for concurrency and how it's all handled
>>>>>>> in conjunction with other bits that touch the p2m.
>>>>>>
>>>>>> Is taking the p2m lock for the entire operation a reasonable approach
>>>>>> for concurrency?  Will this cause too much lock contention?
>>>>>
>>>>> Maybe. It'd be fine for a page. Likely not so for several GiB if they
>>>>> aren't already superpages.
>>>>>
>>>>>>
>>>>>>>> The first patch I am going to send with this information is a
>>>>>>>> documentation patch so that others do not need to figure this out for
>>>>>>>> themselves.  I remember being unsure even after looking through the
>>>>>>>> source code, which is why I am asking here.
>>>>>>>
>>>>>>> That's not surprising. There's per-arch stuff, per-p2mtype stuff,
>>>>>>> per-guesttype stuff. Plus madness like on-demand memory. It's no wonder
>>>>>>> such helpers don't exist and the general manipulations are hard to
>>>>>>> explain.
>>>>>>
>>>>>> Is this a task that is only suitable for someone who has several years
>>>>>> of experience working on Xen, or is it something that would make sense
>>>>>> for someone who is less experienced?
>>>>>
>>>>> The p2m is a very complex beast that integrates more features than I
>>>>> care to count. It requires a lot of prior knowledge. Whoever does it
>>>>> must know Xen fairly well in many configurations.
>>>>>
>>>>> The real problem is finding the right primitives that do what you want
>>>>> without overcomplicating everything else, preserving system security
>>>>> invariants, and having benign (and ideally clear) edge cases.
>>>>>
>>>>> This was the last email you sent (I think?). Have any of the requirements
>>>>> changed in any direction?
>>>>>
>>>>>   https://lore.kernel.org/xen-devel/Z5794ysNE4KDkFuT@itl-email/
>>>>
>>>> Map and Revoke are still needed, with the same requirements as described
>>>> in this email.  Steal and Return were needed for GPU shared virtual memory,
>>>> but it has been decided to not support this with virtio-GPU, so these
>>>> primitives are no longer needed.
>>>>
>>>>> Something I'm missing there is how everything works without Xen. That
>>>>> might help (me, at least) gauge what could prove enough to support the
>>>>> use case. Are there sequence diagrams anywhere about how this whole thing
>>>>> works without Xen? I vaguely remember you showing something last year at
>>>>> Xen Summit in the design session, but my memory isn't that good :)
>>>
>>> Hello,
>>>
>>> Sorry, possibly replying a bit out of context here.
>>>
>>> Since I will mention this in several places: the p2m is the second-stage
>>> page tables used by Xen for PVH and HVM guests.  A p2m violation is
>>> the equivalent of a page fault for guest p2m accesses.
>>>
>>>> A Linux driver that needs access to userspace memory
>>>> pages can get it in two different ways:
>>>>
>>>> 1. It can pin the pages using the pin_user_pages family of APIs.
>>>>    If these functions succeed, the driver is guaranteed to be able
>>>>    to access the pages until it unpins them.  However, this also
>>>>    means that the pages cannot be paged out or migrated.  Furthermore,
>>>>    file-backed pages cannot be safely pinned, and pinning GPU memory
>>>>    isn’t supported.  (At a minimum, it would prevent the pages from
>>>>    migrating from system RAM to VRAM, so all access by a dGPU would
>>>>    cross the PCIe bus, which would be very slow.)
>>>
>>> From a Xen p2m perspective this is all fine - Xen will never remove pages
>>> from the p2m unless it's requested to.  So the pinning, while needed on the
>>> Linux side, doesn't need to be propagated to Xen I would think.
>>
>> If pinning were enough things would be simple, but sadly it’s not.
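
To make option 1 concrete, the pinning path in a driver looks roughly like
the sketch below (illustrative only: the helper name and page count are
made up, and error handling is trimmed).

    #include <linux/mm.h>

    static struct page *demo_pages[16];

    static int demo_pin(unsigned long uaddr, int nr)
    {
        /*
         * FOLL_LONGTERM: the pin may be held indefinitely, so the pages
         * are migrated out of ZONE_MOVABLE/CMA first.  As noted above,
         * file-backed pages cannot be safely pinned this way.
         */
        int pinned = pin_user_pages_fast(uaddr, nr,
                                         FOLL_WRITE | FOLL_LONGTERM,
                                         demo_pages);
        if (pinned < 0)
            return pinned;

        /* ... hand page_to_pfn(demo_pages[i]) to the hypervisor/device ... */

        unpin_user_pages(demo_pages, pinned);
        return 0;
    }
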
>>
>>>> 2. It can grab the *current* location of the pages and register an
>>>>    MMU notifier.  This works for GPU memory and file-backed memory.
>>>>    However, when the invalidate_range function of this callback is
>>>>    called, the driver *must* stop all further accesses to the pages.
>>>>
>>>>    The invalidate_range callback is not allowed to block for a long
>>>>    period of time.  My understanding is that things like dirty page
>>>>    writeback are blocked while the callback is in progress.  My
>>>>    understanding is also that the callback is not allowed to fail.
>>>>    I believe it can return a retryable error but I don’t think that
>>>>    it is allowed to keep failing forever.
>>>>
>>>>    Linux’s grant table driver actually had a bug in this area, which
>>>>    led to deadlocks.  I fixed that a while back.
>>>>
>>>> KVM implements the second option: it maps pages into the stage-2
>>>> page tables (or shadow page tables, if that is chosen) and unmaps
>>>> them when the invalidate_range callback is called.
>>>
>>> I assume this map and unmap is done by the host as a result of some
>>> guest action?
>>
>> Unmapping can happen at any time for any or no reason.  Semantically,
>> it would be correct to only map the pages in response to a p2m violation,
>> but for performance it might be better to map the pages eagerly instead.
> 
> That's an implementation detail, you can certainly map the pages
> eagerly, or even map multiple contiguous pages as a result of a single
> p2m violation.
> 
> I would focus on making a functioning prototype first, performance
> comes afterwards.

Makes sense.
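
For completeness, the notifier-based flow (option 2 above) would look
roughly like the sketch below on the Linux side.  This is only a sketch:
xen_unmap_from_guest() is a hypothetical stand-in for whatever map/unmap
interface this thread ends up settling on.

    #include <linux/mmu_notifier.h>

    /* Hypothetical revoke call; this is the primitive being discussed. */
    extern void xen_unmap_from_guest(unsigned long start, unsigned long end);

    static int demo_invalidate_start(struct mmu_notifier *mn,
                                     const struct mmu_notifier_range *range)
    {
        /*
         * Must not block for long and cannot keep failing: revoke the
         * guest's access to [range->start, range->end) before returning.
         */
        xen_unmap_from_guest(range->start, range->end);
        return 0;
    }

    static const struct mmu_notifier_ops demo_ops = {
        .invalidate_range_start = demo_invalidate_start,
    };

    static struct mmu_notifier demo_mn = { .ops = &demo_ops };

    static int demo_register(struct mm_struct *mm)
    {
        return mmu_notifier_register(&demo_mn, mm);
    }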

>>>> Furthermore,
>>>> if a page fault happens while the pages are unmapped, KVM will try
>>>> to bring them back into memory so the guest can access them.
>>>
>>> You could likely handle this in Xen in the following way:
>>>
>>>  - A device model will get p2m violations forwarded, as it's the same
>>>    model that's used to handle emulation of device MMIO.  You will
>>>    need to register an ioreq server to request those faults to be
>>>    forwarded; I think the hardware domain kernel will handle those?
>>>
>>>  - Allow ioreqs to signal to Xen that a guest operation must be
>>>    retried.  IOW: resume guest execution without advancing the IP.
>>>
>>> I think this last bit is the one that will require changes to Xen, so
>>> that you can add a type of ioreq reply that implies a retry from the
>>> guest context.
>> I’m not actually sure if this is needed, though it would be nice.  It
>> might be possible for Xen to instead emulate the current instruction and
>> continue, with the ioreq server just returning the current value of the
>> pages.
> 
> You can, indeed, but it's cumbersome?  You might have to map the page
> in the context of the entity that implements the ioreq server to
> access the data.  Allowing retries would be more generic, and reduce
> the code in the ioreq server handler, which would only map the page
> into the guest p2m and request a retry.

Yeah, it is cumbersome indeed.
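
Concretely, the cumbersome path would be something like the sketch below in
the device model: map the current backing page with libxenforeignmemory,
copy the data out, and complete the emulated read.  The backend_domid and
backing_gfn values are placeholders for the backend's own bookkeeping, and
the ioreq plumbing around this is omitted.

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <xenforeignmemory.h>

    static uint32_t demo_emulate_read(xenforeignmemory_handle *fmem,
                                      uint32_t backend_domid,
                                      xen_pfn_t backing_gfn,
                                      unsigned int offset_in_page)
    {
        int err;
        uint32_t val = ~0u;
        /* Map one page of the backend domain into our address space. */
        void *p = xenforeignmemory_map(fmem, backend_domid, PROT_READ,
                                       1, &backing_gfn, &err);

        if (p) {
            memcpy(&val, (char *)p + offset_in_page, sizeof(val));
            xenforeignmemory_unmap(fmem, p, 1);
        }
        return val;
    }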

>> What I’m more concerned about is being able to provide a page
>> into the p2m so that the *next* access doesn’t fault, and being able
>> to remove that page from the p2m so that the next access *does* fault.
> 
> Maybe I'm not getting the question right: all Xen modifications to the
> p2m take immediate effect.  By the time a XEN_DOMCTL_memory_mapping
> hypercall returns, the operation will have taken effect.

Ah, that makes sense.  When revoking access, can XEN_DOMCTL_iomem_permission
and XEN_DOMCTL_memory_mapping fail even if the parameters are correct and
the caller has enough permissions, or will they always succeed?

>> Are there any hypercalls that can be used for these operations right
>> now?
> 
> With some trickery you could likely use XEN_DOMCTL_memory_mapping to
> add and remove those pages.  You will need calls to
> XEN_DOMCTL_iomem_permission beforehand so that you grant the receiving
> domain permissions to access those (and of course the granting domain
> needs to have full access to them).
> 
> This is not ideal if mapping RAM pages.  AFAICT there are no strict
> checks that the added page is not RAM, but you will still need to
> handle RAM pages as IOMEM and grant them using
> XEN_DOMCTL_iomem_permission, which is not great.  Also note that this
> is a domctl, so not stable.  It might however be enough for a
> prototype.

Unfortunately this won’t work if the backend is a PVH domain, as a PVH
domain doesn’t know its own MFNs.  It also won’t work for deprivileged
backends because XEN_DOMCTL_iomem_permission is subject to XSA-77.
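
Still, for a dom0-based prototype the flow you describe maps onto the
existing libxc wrappers roughly as in the sketch below (the gfn/mfn values
are placeholders supplied by the backend, and error handling is minimal).

    #include <xenctrl.h>

    static int demo_map(xc_interface *xch, uint32_t guest_domid,
                        unsigned long gfn, unsigned long mfn)
    {
        /* 1. Grant the guest access to the frame (XEN_DOMCTL_iomem_permission). */
        int rc = xc_domain_iomem_permission(xch, guest_domid, mfn, 1, 1);

        if (rc)
            return rc;

        /* 2. Insert it into the guest p2m at gfn (XEN_DOMCTL_memory_mapping). */
        return xc_domain_memory_mapping(xch, guest_domid, gfn, mfn, 1,
                                        1 /* add; 0 removes the mapping */);
    }

    static int demo_unmap(xc_interface *xch, uint32_t guest_domid,
                          unsigned long gfn, unsigned long mfn)
    {
        int rc = xc_domain_memory_mapping(xch, guest_domid, gfn, mfn, 1, 0);

        if (rc)
            return rc;
        return xc_domain_iomem_permission(xch, guest_domid, mfn, 1, 0);
    }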

> Long term I think we want to expand XENMEM_add_to_physmap{,_batch} to
> handle this use-case.

That would indeed be better.
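
For reference, the existing request structure from
xen/include/public/memory.h is reproduced below; presumably a new
XENMAPSPACE_* source space would be needed for this use case (the name in
the final comment is purely hypothetical).

    struct xen_add_to_physmap {
        domid_t domid;       /* domain whose p2m is changed */
        uint16_t size;       /* number of pages, for the gmfn_range space */
        unsigned int space;  /* XENMAPSPACE_* source space */
        xen_ulong_t idx;     /* index into the source space */
        xen_pfn_t gpfn;      /* gfn where the page should appear */
    };

    /* e.g. a new XENMAPSPACE_revocable_foreign (hypothetical) whose idx
     * identifies a page previously offered by the backend domain. */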

>> If not, which Xen functions would one use to implement them?
>> Some notes:
>>
>> - The p2m might need to be made to point to a PCI BAR or system RAM.
>>   The guest kernel and host userspace don’t know which, and in any
>>   case don’t need to care.  The host kernel knows, but I don’t know
>>   if the information is exposed to the Xen driver.
> 
> Hm, as said above, while you could possibly handle RAM as IOMEM, it
> has the slight inconvenience of having to add such RAM pages to the
> d->iomem_caps rangeset for XEN_DOMCTL_memory_mapping to succeed.
> 
> From a guest PoV, it doesn't matter if the underlying page is RAM or
> MMIO, as long as it's mapped in the p2m.

Understood, thanks!

>> - If the p2m needs to point to system RAM, the RAM will be memory
>>   that belongs to the backend.
>>
>> - If the p2m needs to point to a PCI BAR, it will initially need
>>   to point to a real PCI device that is owned by the backend.
> 
> As long as you give the destination domain access to the page using
> XEN_DOMCTL_iomem_permission prior to the XEN_DOMCTL_memory_mapping
> call it should work.
> 
> How does this work for device DMA accesses?  If the device is assigned
> to the backend domain (and thus using the backend domain IOMMU context
> entry and page-tables) DMA accesses cannot be done against guest
> provided addresses, there needs to be some kind of translation layer
> that filters commands?

Thankfully, this is handled by the backend.

> My initial recommendation would be to look into what you can do with
> the existing XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping
> hypercalls.

I think this would be suitable for a prototype but not for production.

>> - The switch from “emulated MMIO” to “MMIO or real RAM” needs to
>>   be atomic from the guest’s perspective.
> 
> Updates of p2m PTEs are always atomic.

That’s good.

Xenia, would it be possible for AMD to post whatever has been
implemented so far?  I think this would help a lot, even if it
is incomplete.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
