Xen project Mailing List

On Wed, Mar 29, 2023 at 10:29 PM Johnson, Ethan <ejohns48@xxxxxxxxxxxxxxxx> wrote:
>
> On 2023-03-16 02:14:18 +0000, Andrew Cooper wrote:
> > Ok, so there is a lot here. Apologies in advance for the overly long
> > answer.
> >
> > First, while altp2m was developed in parallel with EPTP-switching, we
> > took care to split the vendor neutral parts from the vendor specific
> > bits. So while we do have VMFUNC support, that's considered "just" a
> > hardware optimisation to speed up the HVMOP_altp2m_switch_p2m hypercall.
> >
> > But before you start, it is important to understand your security
> > boundaries. You've found external mode, and this is all about
> > controlling which aspects of altp2m the guest can invoke itself, and
> > modes other than external let the guest issue HVMOP_altp2m ops itself.
> >
> > If you permit the guest to change views itself, either with VMFUNC, or
> > HVMOP_altp2m_switch_p2m, you have to realise that these are just
> > "regular" CPL0 actions, and can be invoked by any kernel code, not just
> > your driver. i.e. the union of all primary and alternative views is one
> > single security domain.
> >
> > For some usecases this is fine, but yours doesn't look like it fits in
> > this category. In particular, no amount of protection on the trampoline
> > pages stops someone writing a VMFUNC instruction elsewhere in kernel
> > space and executing it.
> >
> > (I have seen plenty of research papers try to construct a security
> > boundary around VMFUNC. I have yet to see one that does so robustly, but I
> > do enjoy being surprised on occasion...)
> >
> > The first production use of this technology I'm aware of was Bitdefender's
> > HVMI, where the guest had no control at all, and was subject to the
> > permission restrictions imposed on it by the agent in dom0. The agent
> > trapped everything it considered sensitive, including writes to
> > sensitive areas of memory using reduced EPT permissions, and either
> > permitted execution to continue, or took other preventative action.
> >
> > This highlights another key point. Some entity in the system needs to
> > deal with faults that occur when the guest accidentally (or otherwise)
> > violates the reduced EPT permissions. #VE is, again, an optimisation to
> > let violations be handled in guest context, rather than taking a VMExit,
> > but even with #VE the complicated corner cases are left to the external
> > agent.
> >
> > With HVMI, #VE (but not VMFUNC IIRC) did get used as an optimisation to
> > mitigate the perf hit from Windows' Meltdown mitigation electing to use
> > LOCK'd BTS/BTC operations on pagetables (which were write protected
> > behind the scenes), but I'm reliably informed that the hoops required to
> > jump through to make that work, and in particular avoid the notice of
> > PatchGuard, were substantial.
> >
> > Perhaps a more accessible example is
> > https://github.com/intel/kernel-fuzzer-for-xen-project and the
> > underlying libvmi. There is also a very basic example in
> > tools/misc/xen-access.c in the Xen tree.
> >
> > For your question specifically about mapping other frames, we do have
> > hypercalls to map other frames (it's necessary for e.g. mapping BARs of
> > passed-through PCI devices), but for obvious reasons, it's restricted to
> > control software (Qemu) in dom0. I suspect we don't actually have a
> > hypercall to map MMIO into an alternative view, but it shouldn't be hard
> > to add (if you still decide you want it by the end of this email).
> >
> >
> > But on to the specifics of mapping the xAPIC page. Sorry, but
> > irrespective of altp2m, that is a non-starter, for reasons that date
> > back to ~1997 or thereabouts.
> >
> > It's worth saying that AMD can fully virtualise IPI delivery from one
> > vCPU to another without either taking a VMExit in the common case, since
> > Zen1 (IIRC). Intel has a similar capability since Sapphire Rapids
> > (IIRC). Xen doesn't support either yet, because there are only so many
> > hours in the day...
> >
> > It is technically possible to map the xAPIC window into a guest, and
> > such a guest could interact with the real interrupt controller. But now
> > you've got the problem that two bits of software (Xen, and your magic
> > piece of guest kernel) are trying to drive the same single interrupt
> > controller.
> >
> > Even if you were to say that the guest would only use ICR to send
> > interrupts, that still doesn't work. In xAPIC, ICR is formed of two
> > half registers, as it dates from the days of 32bit processors, with a
> > large stride between the two half registers.
> >
> > Therefore, it is a minimum of two separate instructions (set destination
> > in ICR_HI, set type/mode/etc in ICR_LO) to send an interrupt.
> >
> > A common bug in kernels is to try and send IPIs when interrupts are
> > enabled, or in NMI context, both of which could interrupt an IPI
> > sequence. This results in a sequence of writes (from the LAPIC's point
> > of view) of ICR_HI, ICR_HI, ICR_LO, ICR_LO, which causes the outer IPI
> > to be sent with the wrong destination.
> >
> > Guests always execute with IRQs enabled, but can take a VMExit on any
> > arbitrary instruction boundary for other reasons, so the guest kernel
> > can never be sure that ICR_HI hasn't been modified by Xen in the
> > background, even if it used two adjacent instructions to send the IPI.
> >
> > Now, if you were to swap xAPIC for x2APIC, one of the bigger changes was
> > making ICR a single register, so it could be written atomically. But
> > now you have an MSR based interface, not an MMIO based interface.
> >
> > It's also worth noting that any system with >254 CPUs is necessarily
> > operating in x2APIC mode (so there isn't an xAPIC window to map, even if
> > you wanted to try), and because of the ÆPIC Leak vulnerability, IceLake
> > and later CPUs are locked into x2APIC mode by firmware, with no option
> > to revert back into xAPIC mode even on smaller systems.
> >
> > On top of that, you've still got the problem of determining the
> > destination. Even if the guest could send an IPI, it still has to know
> > the physical APIC ID of the CPU the target vCPU is currently scheduled
> > on. And you'd have to ignore things like the logical mode or
> > destination shorthands, because multi/broadcast IPIs will hit incorrect
> > targets.
> >
> > On top of that, even if you can determine the right destination, how
> > does the target receive the interrupt? There can only be one entity in
> > the system receiving INTR, and that's Xen. So you've got to pick some
> > vector that Xen knows what to do with, but isn't otherwise using.
> >
> > Not to mention there's a(nother) giant security hole... A guest able to
> > issue interrupts could just send INIT-SIPI-SIPI and reset the target CPU
> > back into real mode behind Xen's back. Xen will not take kindly to this.
> >
> >
> > So while I expect there's plenty of room to innovate on the realm switch
> > aspect of EPTP-switching, trying to send IPIs from within guest context
> > is something that I will firmly suggest you avoid. There are good
> > reasons why it is so complicated to get VMExit-less guest IPIs working.
> >
> > ~Andrew
>
> Thank you for the detailed answers and context. I am somewhat encouraged to
> note that most of the roadblocks you mentioned are issues we've specifically
> considered (and think we have solutions for) in our design. :-) We're using
> some rather exotic compiler-based instrumentation on the guest kernel (plus
> some tricks with putting the "secure realm"'s page tables in a nonoverlapping
> guest-physical address range that isn't present in the primary p2m used by
> untrusted code) to prevent the guest from doing things it isn't supposed to
> with VMFUNC and (x2)APIC access, despite running in ring 0 within non-root
> mode.
>
> On a more concrete level, I am looking to do the following from within the
> hypervisor (specifically, from within a new hypercall I've added):
>
> 1) Get some (host-)physical memory frames from the domain heap and "pin" them
> to make sure they won't be swapped out.
>
> 2) Create an altp2m for the calling (current) domain.
>
> 3) Map some of the newly-allocated physical frames into both the domain's
> primary p2m and its altp2m, with R/X permissions.
>
> 4) Map the rest of the physical frames into only the altp2m (as R/W), at a
> guest-physical address higher than the end of the main p2m's mapped range
> (such that when the primary p2m is active, the guest cannot access these
> pages without taking a hard VM-exit fault).
>
> I've been poring through Xen's p2m code (e.g. xen/arch/x86/mm/p2m.c) to try
> to understand how to achieve these goals, but with little success. Comments
> in the p2m code seem to be rather sparse, and mostly unhelpful for
> understanding (without pre-understood context) what many of the functions do
> and what is the intended workflow for using them. For instance,
> similarly-named functions like guest_remove_page() and
> guest_physmap_remove_page() seem to operate at different levels of
> abstraction (in terms of memory management, refcount bookkeeping, etc.) but
> it isn't externally obvious how they're meant to all fit together and be used
> by client code.
>
> Any suggestions on which p2m (or other) APIs I should be focusing on, and how
> they're meant to be used, would be greatly appreciated. I suppose in theory I
> could just bypass p2m entirely, and populate one of the VMCS's EPTP-switching
> array's slots directly with my own manually constructed paging hierarchy
> (since I'm envisioning the memory layout of our "secure realm" as being quite
> simple - it only needs a handful of pages). But I'd rather "color within the
> lines" of the existing APIs if possible, especially since some of the pages
> will need to be mapped into the existing primary p2m (for the "insecure
> realm") as well.
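
The ICR_HI, ICR_HI, ICR_LO, ICR_LO interleaving described in the quoted message can be illustrated with a small toy model. This is only a sketch: struct xapic_model and its helpers are hypothetical names invented for this example, not Xen or hardware interfaces, and the model captures nothing beyond "writing ICR_LO sends to whatever destination ICR_HI currently holds".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the xAPIC ICR: two 32-bit halves written separately. */
struct xapic_model {
    uint32_t icr_hi;       /* destination APIC ID lives in bits 31:24 */
    uint8_t  last_dest;    /* destination of the most recently sent IPI */
    uint8_t  last_vector;  /* vector of the most recently sent IPI */
    unsigned sent;         /* number of IPIs sent so far */
};

/* First write of the sequence: latch the destination. */
static void write_icr_hi(struct xapic_model *a, uint8_t dest)
{
    assert(a != NULL);
    a->icr_hi = (uint32_t)dest << 24;
}

/* Second write: this triggers the send, using whatever ICR_HI holds now. */
static void write_icr_lo(struct xapic_model *a, uint8_t vector)
{
    assert(a != NULL);
    a->last_dest = (uint8_t)(a->icr_hi >> 24);
    a->last_vector = vector;
    a->sent++;
}

/*
 * An "outer" sender is interrupted between its two writes by an "inner"
 * sender. Returns the destination the outer IPI actually went to.
 */
static uint8_t interrupted_send(struct xapic_model *a,
                                uint8_t outer_dest, uint8_t outer_vec,
                                uint8_t inner_dest, uint8_t inner_vec)
{
    write_icr_hi(a, outer_dest);  /* outer: ICR_HI */
    write_icr_hi(a, inner_dest);  /* inner: ICR_HI overwrites the latch */
    write_icr_lo(a, inner_vec);   /* inner: ICR_LO, delivered correctly */
    write_icr_lo(a, outer_vec);   /* outer: ICR_LO, goes to inner_dest */
    return a->last_dest;
}
```

Calling interrupted_send() with an outer destination of 1 and an inner destination of 2 returns 2: the outer vector is delivered to the inner sender's target, which is the misdelivery the quoted text describes.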

You can find an example workflow for creating altp2m views and changing memory permissions in the different views here: https://github.com/xen-project/xen/blob/master/tools/misc/xen-access.c#L517. To add a new page to the VM you can use xc_domain_populate_physmap_exact. If you add the page after the VM has already booted, the main kernel is unaware of the extra pages, but that doesn't mean it can't try to poke them. Similarly, relying on any kind of memory map to keep the kernel from accessing these pages is just wishful thinking: the memory map is, after all, just a hint to the OS about what to look for, not an access-control mechanism.

Also keep in mind that altp2m's get CoW populated from the hostp2m. You can still get your altp2m to be "only a couple pages" by either 1) ensuring no other pages ever get touched while running the vCPU with the altp2m, so as not to trigger the CoW mechanism; or 2) manually changing the memaccess permissions to n (no access) on every page you want to be inaccessible in the altp2m.
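
To make the two options concrete, here is a toy model of a view that is lazily populated from the host view. It is a sketch under loud assumptions: struct view, view_touch(), view_deny() and the two-value enum access are hypothetical names made up for this illustration and are not Xen's p2m code or the libxc interface. The model treats "populated on first touch" as granting full access, and treats option 2 as installing a no-access entry ahead of time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_GFNS 8   /* hypothetical, tiny guest-physical address space */

/* Simplified permissions; the real interface has many more values. */
enum access { ACCESS_N, ACCESS_RWX };

struct view {
    bool        populated[NR_GFNS];  /* entry exists in this view */
    enum access access[NR_GFNS];     /* meaningful only when populated */
};

/* Option 2: install a no-access entry up front so no lazy copy happens. */
static void view_deny(struct view *v, size_t gfn)
{
    assert(gfn < NR_GFNS);
    v->populated[gfn] = true;
    v->access[gfn] = ACCESS_N;
}

/*
 * The guest touches gfn while this view is active. A missing entry is
 * lazily copied from the host view (modelled here as full access).
 * Returns true if the access is allowed to proceed.
 */
static bool view_touch(struct view *v, size_t gfn)
{
    assert(gfn < NR_GFNS);
    if (!v->populated[gfn]) {
        v->populated[gfn] = true;
        v->access[gfn] = ACCESS_RWX;
    }
    return v->access[gfn] != ACCESS_N;
}

/* Number of pages the guest can currently reach through this view. */
static size_t view_accessible(const struct view *v)
{
    size_t n = 0;
    for (size_t gfn = 0; gfn < NR_GFNS; gfn++)
        if (v->populated[gfn] && v->access[gfn] != ACCESS_N)
            n++;
    return n;
}
```

In this model, any gfn that was not denied in advance becomes reachable the first time it is touched, so keeping the view small means either never touching other pages (option 1) or calling view_deny() on every gfn that must stay out of reach (option 2).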

You'll likely want pages such as those holding the IDT and GDT mapped into the altp2m, alongside the pagetable pages. An easy way to check which pages are needed for execution in a given code context is to use the VM forking mechanism: create a fork at the point where the code you want to run in the altp2m starts, singlestep the fork by one instruction, then examine the fork's EPT using xl debug-keys D. Anything that got mapped into the fork's memory would similarly need to be accessible in the altp2m.

Cheers,

Tamas

Re: Best way to use altp2m to support VMFUNC EPT-switching?