
Re: [Xen-devel] One question about the hypercall to translate gfn to mfn.



> From: Tim Deegan
> Sent: Friday, December 12, 2014 12:47 AM
> 
> Hi,
> 
> At 01:41 +0000 on 11 Dec (1418258504), Tian, Kevin wrote:
> > > From: Tim Deegan [mailto:tim@xxxxxxx]
> > > It is Xen's job to isolate VMs from each other.  As part of that, Xen
> > > uses the MMU, nested paging, and IOMMUs to control access to RAM.  Any
> > > software component that can pass a raw MFN to hardware breaks that
> > > isolation, because Xen has no way of controlling what that component
> > > can do (including taking over the hypervisor).  This is why I am
> > > afraid when developers ask for GFN->MFN translation functions.
> >
> > While I absolutely agree that is Xen's job, isolation is also required at
> > different layers, depending on who controls the resource and where the
> > virtualization happens. For example, in I/O virtualization, Dom0 or a driver
> > domain needs to isolate the backend drivers from one another so that one
> > backend cannot interfere with another. Xen cannot see such a violation,
> > since all it knows is that Dom0 wants to access a VM's page.
> 
> I'm going to write a second reply to this mail in a bit, to talk about
> this kind of system-level design.  In this email I'll just talk about
> the practical aspects of interfaces and address spaces and IOMMUs.

Sure. I replied to the other design mail before seeing this one. My bad
Outlook rule kept this mail out of my sight, and fortunately I dug it out
when I wondered about the "Hi, again" in your other mail. :-)


> 
> > BTW, I'm curious how much worse exposing GFN->MFN translation is compared
> > to allowing the mapping of another VM's GFN. If exposing GFN->MFN were
> > under the same permission control as mapping, would that avoid your worry
> > here?
> 
> I'm afraid not.  There's nothing worrying per se in a backend knowing
> the MFNs of the pages -- the worry is that the backend can pass the
> MFNs to hardware.  If the check happens only at lookup time, then XenGT
> can (either through a bug or a security breach) just pass _any_ MFN to
> the GPU for DMA.
> 
> But even without considering the security aspects, this model has bugs
> that may be impossible for XenGT itself to even detect.  E.g.:
>  1. Guest asks its virtual GPU to DMA to a frame of memory;
>  2. XenGT looks up the GFN->MFN mapping;
>  3. Guest balloons out the page;
>  4. Xen allocates the page to a different guest;
>  5. XenGT passes the MFN to the GPU, which DMAs to it.
> 
> Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
> underlying memory and make sure it doesn't get reallocated until XenGT
> is finished with it.

Yes, I see your point. Today we can't support ballooning in the VM for exactly
that reason, and a refcount is required to close that gap.

But just to confirm one point: from my understanding, whether it's a mapping
operation doesn't really matter. We could invent an interface that looks up
the p2m mapping and then increases the refcount; the key is the refcount.
When XenGT constructs a shadow GPU page table it creates a reference to the
guest memory page, so the refcount must be increased. :-)
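
To make that concrete, here is a minimal sketch of what such a
lookup-with-refcount step could look like inside Xen. The xengt_* wrappers
are invented for illustration and are not an existing interface; only the
generic get_page_from_gfn()/put_page() helpers are real:

/* Illustrative sketch only: take a reference on the page backing a guest
 * GFN before the shadow GTT can point at it, and drop the reference when
 * the shadow entry is removed.  The xengt_* names are made up. */
static struct page_info *xengt_get_guest_page(struct domain *d,
                                              unsigned long gfn)
{
    p2m_type_t t;
    /* get_page_from_gfn() with P2M_ALLOC resolves the entry and returns
     * the page with a general reference held, so ballooning/reallocation
     * cannot free it underneath us. */
    struct page_info *page = get_page_from_gfn(d, gfn, &t, P2M_ALLOC);

    if ( !page )
        return NULL;

    if ( !p2m_is_ram(t) )
    {
        put_page(page);
        return NULL;    /* refuse to shadow anything that isn't plain RAM */
    }

    return page;
}

static void xengt_put_guest_page(struct page_info *page)
{
    put_page(page);     /* drop the reference taken above */
}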

> 
> > > When the backend component gets a GFN from the guest, it wants an
> > > address that it can give to the GPU for DMA that will map the right
> > > memory.  That address must be mapped in the IOMMU tables that the GPU
> > > will be using, which means the IOMMU tables of the backend domain,
> > > IIUC[1].  So the hypercall it needs is not "give me the MFN that matches
> > > this GFN" but "please map this GFN into my IOMMU tables".
> >
> > Here "please map this GFN into my IOMMU tables" actually breaks the
> > IOMMU isolation. IOMMU is designed for serving DMA requests issued
> > by an exclusive VM, so IOMMU page table can restrict that VM's attempts
> > strictly.
> >
> > To map multiple VMs' GFNs into one IOMMU table, the first requirement is
> > to avoid GFN conflicts so that it works at all. We thought about this
> > approach previously, e.g. reserving the highest 3 bits of the GFN as a
> > VMID, so one IOMMU page table can combine multiple VMs' page tables.
> > However, doing so has three limitations:
> >
> > a) it still requires write-protecting the guest GPU page table and
> > maintaining a shadow GPU page table that translates from the real GFN to
> > a pseudo GFN (plus VMID), which doesn't save any engineering effort in
> > the device model part
> 
> Yes -- since there's only one IOMMU context for the whole GPU, the
> XenGT backend still has to audit all GPU commands to maintain
> isolation between clients.
> 
> > b) it breaks the isolation the IOMMU is designed to provide. In that case
> > the IOMMU can't isolate multiple VMs by itself, since a DMA request can
> > target any pseudo GFN that is valid in the page table. We have to rely on
> > auditing in the backend component in Dom0 to ensure isolation.
> 
> Yep.
> 
> > c) it introduces tricky logic in the IOMMU driver to handle such a
> > non-standard, multiplexed page-table style.
> >
> > Without an SR-IOV implementation (where each VF has its own IOMMU page
> > table), I don't see how the IOMMU can help with isolation here.
> 
> If I've understood your argument correctly, it basically comes down
> to "It would be extra work for no benefit, because XenGT still has to
> do all the work of isolating GPU clients from each other".  It's true
> that XenGT still has to isolate its clients, but there are other
> benefits.
> 
> The main one, from my point of view as a Xen maintainer, is that it
> allows Xen to constrain XenGT itself, in the case where bugs or
> security breaches mean that XenGT tries to access memory it shouldn't.
> More about that in my other reply.  I'll talk about the rest below.
> 
> > Yes, this is good feedback that we hadn't thought about before. So far
> > XenGT works only because we use the default IOMMU setting, which sets up
> > a 1:1 r/w mapping for all possible RAM, so when the GPU hits an MFN
> > through the shadow GPU page table the IOMMU is essentially bypassed.
> > However, as you said, if the IOMMU page table is restricted to dom0's
> > memory, or is not a 1:1 identity mapping, XenGT will be broken.
> >
> > However, I don't see a good solution for this other than the multiplexed
> > IOMMU page table mentioned above, which doesn't look like a sane design
> > to me.
> 
> Right.  AIUI you're talking about having a component, maybe in Xen,
> that automatically makes a merged IOMMU table that contains multiple
> VMs' p2m tables all at once.  I think that we can do something simpler
> than that which will have the same effect and also avoid race
> conditions like the one I mentioned at the top of the email.
> 
> [First some hopefully-helpful diagrams to explain my thinking.  I'll
>  borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
>  addresses that devices issue their DMAs in:

what's 'BFN' short for? Bus Frame Number?

> 
>  Here's how the translations work for a HVM guest using HAP:
> 
>    CPU    <- Code supplied by the guest
>     |
>   (VA)
>     |
>    MMU    <- Pagetables supplied by the guest
>     |
>   (GFN)
>     |
>    HAP    <- Guest's P2M, supplied by Xen
>     |
>   (MFN)
>     |
>    RAM
> 
>  Here's how it looks for a GPU operation using XenGT:
> 
>    GPU       <- Code supplied by Guest, audited by XenGT
>     |
>   (GPU VA)
>     |
>   GPU-MMU    <- GTTs supplied by XenGT (by shadowing guest ones)
>     |
>   (GPU BFN)
>     |
>   IOMMU      <- XenGT backend dom's P2M (for PVH/HVM)
>                 or IOMMU tables (for PV)
>     |
>   (MFN)
>     |
>    RAM
> 
>  OK, on we go...]
> 
> Somewhere in the existing XenGT code, XenGT has a guest GFN in its
> hand and makes a lookup hypercall to find the MFN.  It puts that MFN
> into the GTTs that it passes to the GPU.  But an MFN is not actually
> what it needs here -- it needs a GPU BFN, which the IOMMU will then
> turn into an MFN for it.
> 
> If we replace that lookup with a _map_ hypercall, either with Xen
> choosing the BFN (as happens in the PV grant map operation) or with
> the guest choosing an unused address (as happens in the HVM/PVH
> grant map operation), then:
>  - the only extra code in XenGT itself is that you need to unmap
>    when you change the GTT;
>  - Xen can track and control exactly which MFNs XenGT/the GPU can access;
>  - running XenGT in a driver domain or PVH dom0 ought to work; and
>  - we fix the race condition I described above.

OK, I see your point here. It does sound like a better design to meet the Xen
hypervisor's security requirements, and it can also work with a PVH Dom0 or a
driver domain. Previously, even when we said an MFN was required, it was
actually a BFN because of the IOMMU; things only worked because we had a 1:1
identity mapping in place. Obtaining the BFN through an explicit map call
removes that reliance on the identity mapping.

Some follow-up thoughts here:

- the extra unmap call will have some performance impact, especially for
media-processing workloads where GPU page table modifications are hot, but
presumably this can be optimized with batched requests

- is there an existing _map_ call for this purpose, to your knowledge, or is
a new one required? If the latter, what additional logic would need to be
implemented there? (I've sketched what I imagine below, after this list.)

- when you say _map_, do you expect it to be mapped into dom0's virtual
address space, or just into guest physical space?

- how is the BFN or unused address (what do you mean by address here?)
allocated? Does it need to be present in guest physical memory at boot time,
or do we just find some holes?

- graphics memory can be large, and starting from BDW there will be a 64-bit
page table format. Do you see any limitation here on finding a BFN or address?
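
To make the second question above more concrete, here is a purely
hypothetical sketch of the shape of interface I imagine. All the op names,
fields and numbers below are invented; nothing like this exists in Xen today:

#include <stdint.h>

typedef uint16_t domid_t;   /* stand-in for Xen's domid_t */

/* Invented for illustration only - not an existing hypercall.
 * "Map this GFN of domain <domid> into my IOMMU context and tell me the
 * BFN to program into the shadow GTT"; the reverse op undoes it. */
#define XENGT_HYPOTHETICAL_iommu_map    0
#define XENGT_HYPOTHETICAL_iommu_unmap  1

struct xengt_hypothetical_iommu_map {
    /* IN */
    domid_t  domid;      /* guest whose GFN is being mapped              */
    uint64_t gfn;        /* guest frame number to map                    */
    uint64_t bfn_hint;   /* requested bus frame, or ~0 to let Xen choose */
    uint32_t flags;      /* e.g. read-only vs. read-write                */
    /* OUT */
    uint64_t bfn;        /* bus frame to put into the shadow GTT         */
    int16_t  status;     /* 0 on success, negative error otherwise       */
};

With something of that shape, Xen could take the page reference at map time
(closing the ballooning race you described), and a batched variant covering
many entries per hypercall would address the performance concern in the first
bullet.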

> 
> The default policy I'm suggesting is that the XenGT backend domain
> should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
> which will need a small extension in Xen since at the moment struct
> domain has only one "target" field.

Is that connection set up by the toolstack or by the hypervisor today?
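
For my own understanding, here is a standalone sketch of the extension I
think you mean: today a backend can be privileged over only the single domain
in its 'target' field, and it would instead need a set of client VMs. All the
names below are stand-ins, not actual Xen code:

#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the relevant bits of struct domain; only 'target' mirrors a
 * real field, the clients array is the hypothetical extension. */
struct domain_sketch {
    unsigned int          domid;
    struct domain_sketch *target;       /* existing: one privileged target */
    struct domain_sketch *clients[8];   /* invented: XenGT client VMs      */
    size_t                nr_clients;
};

/* Is 'backend' allowed to act on behalf of 'client'? */
static bool backend_is_priv_for(const struct domain_sketch *backend,
                                const struct domain_sketch *client)
{
    if ( backend->target == client )    /* today's single-target behaviour */
        return true;

    for ( size_t i = 0; i < backend->nr_clients; i++ )
        if ( backend->clients[i] == client )
            return true;                /* hypothetical multi-target case  */

    return false;
}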

> 
> BTW, this is the exact analogue of how all other backend and toolstack
> operations work -- they request access from Xen to specific pages and
> they relinquish it when they are done.  In particular:

Agreed.

> 
> > for mapping and accessing another guest's memory, I don't think we need
> > any new interface on top of the existing ones. Just as with other backend
> > drivers, we can leverage the same permission control.
> 
> I don't think that's right -- other backend drivers use the grant
> table mechanism, where the guest explicitly grants access to only the
> memory it needs.  AIUI you're not suggesting that you'll use that for
> XenGT! :)

Yes, we're running the native graphics driver in the VM, not a PV driver.
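
(For contrast, a conventional PV backend maps memory only by grant reference,
roughly as sketched below. Field names follow my recollection of the public
grant-table interface, so treat the details as approximate; the point is that
this path is exactly what we can't rely on when the guest runs a native
driver and never issues grants.)

#include <stdint.h>

typedef uint32_t grant_ref_t;
typedef uint32_t grant_handle_t;
typedef uint16_t domid_t;

#define GNTMAP_device_map  (1u << 0)   /* map for device (DMA) access       */
#define GNTMAP_host_map    (1u << 1)   /* map into the backend's own tables */

/* Argument of the GNTTABOP_map_grant_ref operation: the backend names a
 * grant reference that the guest explicitly created, never a raw GFN/MFN. */
struct gnttab_map_grant_ref {
    /* IN */
    uint64_t host_addr;       /* where the backend wants the mapping      */
    uint32_t flags;           /* GNTMAP_* flags                           */
    grant_ref_t ref;          /* grant reference supplied by the guest    */
    domid_t dom;              /* granting guest's domain id               */
    /* OUT */
    int16_t status;           /* GNTST_okay (0) on success                */
    grant_handle_t handle;    /* token used for the later unmap           */
    uint64_t dev_bus_addr;    /* bus address the device may use for DMA   */
};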

> 
> Right - I hope that made some sense.  I'll go get another cup of
> coffee and start on that other reply...
> 
> Cheers,
> 

I really appreciate your explanation here. It makes a lot of sense to me.

Thanks
Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

