
Re: [Xen-devel] Xen Summit 2019 Design Session - Nested Virtualization



On Thu, Aug 08, 2019 at 08:53:36PM -0400, Rich Persaud wrote:
> Session notes attached in markdown and PDF format, please revise as needed.
> 
> Rich
> 
> 
> 
> # Nested Virtualization Design Session
> Xen Design and Developer Summit, [11 July 
> 2019](https://design-sessions.xenproject.org/uid/discussion/disc_1NVcnOZyDZM1LpQbIsJm/view)
> 
> **Related Presentations**
> 
> - (2019) Jürgen Groß, [Support of PV devices in nested 
> Xen](https://youtube.com/watch?v=HA_teA6hV7c)
> - (2019) Christopher Clark and Kelli Little, [The 
> Xen-Blanket](https://youtube.com/watch?v=i5w9sF9VerE)
> - (2018) Ian Pratt, [Hypervisor Security: Lessons 
> Learned](https://youtube.com/watch?v=bNVe2y34dnM) (uXen)
> - (2018) David Weston, [Windows: Hardening with 
> Hardware](https://youtube.com/watch?v=8V0wcqS22vc) (Credential Guard)
> 
> **Use Cases**
> 
> - Xen on Xen: some work was done for the PV shim (Meltdown mitigation).
> - Xen on another hypervisor: involves teaching Xen how to use enlightenments 
> from other hypervisors (see the CPUID sketch below).
> - Qubes runs Xen on AWS bare-metal instances that use Nitro+KVM, mostly works.

It isn't AWS; it's standard KVM (with QEMU etc.). The use case is testing.
Mostly works.

> - Windows Credential Guard (Hyper-V on Xen)
> - Bromium Type-2 uXen in Windows and Linux guests on Xen
> 
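
On the "enlightenments from other hypervisors" item above: the first step is
just working out which hypervisor interface(s) the platform advertises. A
minimal userspace sketch (not Xen code; assumes GCC/clang on x86 and the
documented CPUID signature strings for Xen, KVM and Hyper-V):

```c
/* Sketch: list the hypervisor interfaces advertised via CPUID.
 * Userspace illustration only, not Xen code.  Build: gcc -O2 detect.c */
#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int a, b, c, d;

    /* CPUID.1:ECX bit 31 is the "hypervisor present" bit. */
    __cpuid(1, a, b, c, d);
    if ( !(c & (1u << 31)) )
    {
        puts("no hypervisor bit - probably bare metal");
        return 0;
    }

    /* Hypervisor vendor leaves start at 0x40000000 and, when several
     * interfaces are exposed (e.g. Viridian + Xen), repeat every 0x100. */
    for ( uint32_t base = 0x40000000; base < 0x40010000; base += 0x100 )
    {
        char sig[13] = { 0 };

        __cpuid(base, a, b, c, d);
        memcpy(&sig[0], &b, 4);
        memcpy(&sig[4], &c, 4);
        memcpy(&sig[8], &d, 4);

        if ( sig[0] == '\0' )
            break;

        printf("%#x: \"%s\" (max leaf %#x)\n", base, sig, a);

        if ( !strcmp(sig, "XenVMMXenVMM") )
            puts("  -> Xen interface");
        else if ( !strcmp(sig, "KVMKVMKVM") )
            puts("  -> KVM interface");
        else if ( !strcmp(sig, "Microsoft Hv") )
            puts("  -> Hyper-V / Viridian interface");
    }

    return 0;
}
```
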
> **Issues**
> 
> 1. Need to be careful with features, e.g. ballooning down memory.
> 2. Dom0 is exposed to things that it should not see.
> 3. Nested virtualization works when both L0 and L1 agree, e.g. Xen on Xen.  
> When replacing Xen with another hypervisor, it all falls apart.

In my experience, running Xen on another hypervisor (KVM, VirtualBox)
mostly works. What is broken is running a non-Xen hypervisor within Xen.

> 4. Need more audit checks for what the VM can read or write, i.e. guest 
> requirements.
> 5. Virtual vmentry and vmexit emulation is "leaky" and doesn't cope well.
> 6. A context-switching bug was fixed a while ago: the code doesn't understand 
> AFR(?) loading, or whether it should do it or leave it alone.
> 7. Instructions to virtualize vmexit are missing.
> 8. Enlightened EPT shootdown is easy once the other features are working.
> 
> **Dependent on CPUID and MSR work**
> 
> 1. Auditing of changes (see the sketch below).  Can then fix virtual vmentry 
> and vmexit, one bit at a time.  Once all features are covered, it should work 
> fine.
> 2. hwcaps: needed to tell the guest about the security state of the hardware.
> 3. Reporting CPU topology representation to guests, which is blocking 
> core-scheduling work (presented by Juergen at Xen Summit)
> 4. Andrew is working on the prerequisites for the policy.
> 
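
On the auditing point (1): a rough, Xen-independent illustration of the "one
bit at a time" idea. The structure and names below are invented for the
example (this is not the real CPUID/MSR policy code); the check itself is just
a whitelist, rejecting any feature bit that the guest's proposed policy
requests but the host's maximum policy does not offer:

```c
/* Sketch only: whitelist-style audit of a proposed guest feature policy
 * against the host's maximum policy.  Names are hypothetical, not Xen's. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_FEATURE_WORDS 4              /* enough for the example */

struct feature_policy {
    uint32_t words[NR_FEATURE_WORDS];   /* e.g. CPUID feature leaves, flattened */
};

/* Return true iff every bit the guest wants is also offered by the host. */
static bool audit_policy(const struct feature_policy *host_max,
                         const struct feature_policy *guest_wants,
                         unsigned int *bad_word, uint32_t *bad_bits)
{
    for ( unsigned int i = 0; i < NR_FEATURE_WORDS; i++ )
    {
        uint32_t excess = guest_wants->words[i] & ~host_max->words[i];

        if ( excess )
        {
            *bad_word = i;
            *bad_bits = excess;
            return false;
        }
    }

    return true;
}

int main(void)
{
    struct feature_policy host  = { .words = { 0xffffffff, 0x00000003, 0, 0 } };
    struct feature_policy guest = { .words = { 0x00000010, 0x00000007, 0, 0 } };
    unsigned int word;
    uint32_t bits;

    if ( !audit_policy(&host, &guest, &word, &bits) )
        printf("rejected: word %u requests unoffered bits %#x\n", word, bits);

    return 0;
}
```
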
> **Validation of Nested Virtualization**
> 
> 1.  First priority is correctness. 
> 2. Second priority is performance.
> 3. There is a unit-testing prototype which exercises the vmxon and vmxoff 
> instructions.
> 4. Depends on regression testing, which in turn depends upon (a) formal 
> security support and (b) approval of the Xen security team.
> 5. Other hypervisors must be tested with Xen.
> 
> **Guest State**
> 
> Nesting requires merging L1 and L0 state.
> 
> 1. The AMD interface is much easier: it has "clean bits"; if a bit is clear, 
> the corresponding state must be resynced (see the sketch below).  Guest state 
> is kept separately.
> 2. Intel guest state is kept in an opaque blob in memory, with special 
> instructions to access it.  The memory layout in RAM is unknown, behavior 
> changes with microcode updates, and there are 150 pages of relevant Intel 
> manuals.
> 3. Bromium does some fun stuff to track guest state in software, poisoning 
> RAM and then inspecting it, which is still faster than Intel's hardware-based 
> VMCS shadowing.  L1 hypervisor (Type-2 uXen): https://github.com/openxt/uxen
> 4. Viridian emulates the AMD way, i.e. Microsoft has the Intel bits formatted 
> in an AMD-like structure, and L0 then translates the AMD structure into 
> Intel's opaque blob.
> 
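
To make the "clean bits" point concrete, here is a toy model of the idea (not
real SVM code; the groups, fields and names are invented and vastly
simplified). A clear bit in the VMCB clean field means the corresponding group
of state may have been modified and must be re-read from the L1-provided VMCB;
a set bit means the cached copy can be reused:

```c
/* Toy model of AMD VMCB "clean bits": which cached guest-state groups can be
 * reused across a nested VMRUN.  Simplified; group and field names invented. */
#include <stdint.h>
#include <stdio.h>

enum clean_group {
    CLEAN_INTERCEPTS = 0,            /* intercept vectors */
    CLEAN_ASID,                      /* ASID */
    CLEAN_SEG,                       /* segment registers */
    CLEAN_CR,                        /* control registers */
    NR_CLEAN_GROUPS,
};

struct vmcb {                        /* vastly simplified stand-in */
    uint32_t clean;                  /* one bit per group above */
    uint64_t state[NR_CLEAN_GROUPS]; /* stand-in for the real fields */
};

/* L0's cached merge of the L1-provided VMCB. */
static struct vmcb cache;

static void sync_for_vmrun(const struct vmcb *l1_vmcb)
{
    for ( unsigned int g = 0; g < NR_CLEAN_GROUPS; g++ )
    {
        /* Bit clear => L1 may have touched this group => must resync. */
        if ( !(l1_vmcb->clean & (1u << g)) )
        {
            cache.state[g] = l1_vmcb->state[g];
            printf("resynced group %u\n", g);
        }
    }
}

int main(void)
{
    /* L1 marks everything clean except the segment registers. */
    struct vmcb l1 = {
        .clean = ~(1u << CLEAN_SEG),
        .state = { 1, 2, 3, 4 },
    };

    sync_for_vmrun(&l1);             /* only CLEAN_SEG gets copied */

    return 0;
}
```
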
> **Secure Variable Storage**
> 
> 1. Need an agreed, sane way for multiple hypervisors to handle it, e.g. a pair 
> of ioports, translated from VMX; the guest handles the interrupts, and the 
> accesses are forwarded via standard ioport interception to a secondary 
> emulator: tiny.
> 2. Easy case: ioports plus a memory page for data (see the sketch below).
> 3. Citrix XenServer has a closed-source implementation (varstored?)
> 
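
For the "ioports + memory page" easy case (2), the emulator side really can be
tiny. Below is a sketch of what a secondary emulator's interception handler
might look like; the ports, commands and layout are all made up for
illustration and do not correspond to varstored or any existing interface:

```c
/* Sketch of the emulator side of a tiny, hypothetical ioport protocol for
 * secure variable storage.  Ports, commands and layout are all invented. */
#include <stdint.h>
#include <stdio.h>

#define VARSTORE_PORT_ADDR 0x100     /* guest writes the GPA of its data page */
#define VARSTORE_PORT_CMD  0x104     /* guest writes a command, reads a status */

enum { CMD_GET_VARIABLE = 1, CMD_SET_VARIABLE = 2 };
enum { STATUS_OK = 0, STATUS_BAD_CMD = 1 };

static uint64_t data_page_gpa;       /* latched from VARSTORE_PORT_ADDR */
static uint32_t last_status = STATUS_OK;

/* Called from the ioport-interception path: one handler, two ports. */
static void varstore_io_write(uint16_t port, uint32_t val)
{
    switch ( port )
    {
    case VARSTORE_PORT_ADDR:
        data_page_gpa = val;         /* 32-bit GPA, for simplicity */
        break;

    case VARSTORE_PORT_CMD:
        /* A real emulator would map data_page_gpa here and (de)serialise
         * the variable request from/to that shared page. */
        last_status = (val == CMD_GET_VARIABLE || val == CMD_SET_VARIABLE)
                      ? STATUS_OK : STATUS_BAD_CMD;
        break;
    }
}

static uint32_t varstore_io_read(uint16_t port)
{
    return port == VARSTORE_PORT_CMD ? last_status : 0;
}

int main(void)
{
    /* Simulate the guest's accesses as the emulator would see them. */
    varstore_io_write(VARSTORE_PORT_ADDR, 0x12340000);
    varstore_io_write(VARSTORE_PORT_CMD, CMD_GET_VARIABLE);

    printf("status = %u (data page at %#llx)\n",
           varstore_io_read(VARSTORE_PORT_CMD),
           (unsigned long long)data_page_gpa);

    return 0;
}
```
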
> **Interface for nested PV devices**
> 
> PV driver support currently involves grants and interrupts.
> 
> Requirements:
> 
> 1. Should Xen's ABI include hypercall nesting level?
> 2. Each layer of nesting must apply access control decisions to the operation 
> invoked by its guest.
> 3. Brownfield: if Xen and other L1 hypervisors must be compatible with 
> existing Xen bare-metal deployments, the L0 hypervisor must continue to 
> support grants, events and xenstore.
> 4. Greenfield: if the L0 hypervisor can be optimized for nesting, then PV 
> driver mechanisms other than grants, events and xenstore could be considered.
> 
> Live migration with PCI graphics (has been implemented on AWS):
> 
> - need to make it look the same, regardless of nesting level
> - 1 or more interrupts
> - 1 or more shared pages of RAM
> - share xenstore
> - virtio guest physical address space DMA: done right
> - _*need to get rid of domid*_ as the endpoint identifier
> 
> Access Control:
> 
> - Marek: use virtio?
> - David: do whatever you like in L1
> - Juergen: a new "nested hypercall", to pass an opaque payload downwards (see 
> the sketch below)
> - David: how does access control work with that approach?
> 
> - Christopher: the xenblanket RFC series implements support for one level of 
> nesting.  Its implementation below the hypercall interface demonstrates the 
> access control logic that is required at each nesting level.
> 
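
To make David's access-control question concrete: a toy model of Juergen's
"nested hypercall" idea, where each level applies its own policy to the
(otherwise opaque) payload before handling it or passing it one level down,
which is roughly what the xenblanket series does below the hypercall
interface. Everything here (op layout, numbering of levels, the policy itself)
is invented for illustration:

```c
/* Toy model of a nested hypercall with per-level access control.
 * Op layout, level numbering and policy are invented for illustration. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct nested_op {
    uint32_t target_level;    /* how many levels further down this should go */
    uint32_t payload_kind;    /* opaque to intermediate levels' handling, but
                               * not to their access-control decision */
    uint64_t payload;
};

/* Each nesting level applies its own policy before handling or forwarding. */
static bool level_permits(unsigned int level, const struct nested_op *op)
{
    (void)level;                      /* a real policy would be per-level */

    /* e.g. a level might refuse anything it cannot classify. */
    return op->payload_kind != 0;
}

static int handle_nested_op(unsigned int level, struct nested_op *op)
{
    if ( !level_permits(level, op) )
    {
        printf("L%u: denied\n", level);
        return -1;                    /* -EPERM in a real implementation */
    }

    if ( op->target_level == 0 || level == 0 )
    {
        printf("L%u: handling payload %#llx locally\n",
               level, (unsigned long long)op->payload);
        return 0;
    }

    /* Forward one level down; in reality, the next hypercall downwards. */
    op->target_level--;
    return handle_nested_op(level - 1, op);
}

int main(void)
{
    struct nested_op op = {
        .target_level = 1,            /* destined for L0 */
        .payload_kind = 2,
        .payload = 0xabcd,
    };

    /* An L2 guest's request: seen first by L1 (level 1), then by L0 (level 0). */
    return handle_nested_op(1, &op);
}
```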

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

