Re: Best way to use altp2m to support VMFUNC EPT-switching?
On 15/03/2023 9:41 pm, Johnson, Ethan wrote:
>> On 15/03/2023 2:01 am, Johnson, Ethan wrote:
>>> Hi all,
>>>
>>> I'm looking for some pointers on how Xen's altp2m system works and how it's meant to be used with Intel's VMFUNC EPT-switching for secure isolation within an HVM/PVH guest's kernelspace.
>>>
>>> Specifically, I am attempting to modify Xen to create (on request by an already-booted, cooperative guest with a duly modified Linux kernel) a second set of extended page tables that have access to additional privileged regions of host-physical memory (specifically, a page or two to store some sensitive data that we don't want the guest kernel to be able to overwrite, plus some host-physical MMIO ranges, specifically the xAPIC region). The idea is that the guest kernel will use VMFUNC to switch to the alternate EPTs and call "secure functions" provided (by the hypervisor) as read-only code to be executed in non-root mode on the alternate EPT, allowing certain VM-exit scenarios (namely, sending an IPI to another vCPU of the same domain) to be handled without exiting non-root mode. Hence, these extra privileged pages should only be visible to the alternative p2m that the "secure realm" functions live in. (Transitions between the secure- and insecure-realm EPTs will be through special read-only "trampoline" code pages that ensure the untrusted guest kernel can only enter the secure realm at designated entry points.)
>>>
>>> Looking at Xen's existing altp2m code, I get the sense that Xen is already designed to support something at least vaguely like this. I have not, however, been able to find much in the way of documentation on altp2m, so I am reaching out to see if anyone can offer pointers on how to best use it.
>>>
>>> What is the intended workflow (either in the toolstack or within the hypervisor itself) for creating and configuring an altp2m that should have access to additional host-physical frames that are not present in the guest's main p2m?
>>>
>>> FWIW, once the altp2m has been set up in this fashion, we don't anticipate needing to fiddle with its mappings any further as long as the guest is running (so I'm thinking *maybe* the "external" altp2m mode will suffice for this). In fact, we may not even need to have any "overlap" between the primary and alternative p2m except the trampoline pages themselves (although this aspect of our design is still somewhat in flux).
>>>
>>> I've noticed a function, do_altp2m_op(), in the hypervisor (xen/arch/x86/hvm/hvm.c) that seems to implement a number of altp2m-related hypercalls intended to be called from the dom0. Do these hypercalls already provide a straightforward way to achieve my goals described above entirely via (a potentially modified version of) the dom0 toolstack? Or would I be better off creating and configuring the altp2m from within the hypervisor itself, since I want to map low-level stuff like xAPIC MMIO ranges into the altp2m?
>>>
>>> Thank you in advance for your time and assistance!
>> Hello,
>>
>> There's a lot to unpack here, but before I do so, one question. In your usecase, are you wanting to map any frames with reduced permissions (i.e. such that you'd get a #VE exception), or are you just looking to add new frames with RWX perms into an alternative view?
>>
>> I suspect the latter, but it's not completely clear, and changes the answer.
>>
>> ~Andrew
> Yes, the latter is correct: I am looking to add new frames with RWX perms into an alternative view. I don't currently envision needing #VE in any form for this work.
>
> (We're using a modified PVH Linux guest for this, so rather than needing to intercept and react to EPT faults via #VE, we can expect the guest kernel to explicitly call our secure-realm functions via VMFUNC, replacing what would otherwise be some hypercalls out to Xen in root mode. I suppose supporting unmodified kernels by intercepting #VE could be an interesting enhancement for future work, but for now we're content to work with a cooperative modified PVH guest as a proof of concept. :-))
>
> Basically, the primary p2m will be (largely) as it is normally, and the untrusted guest kernel and userspace will run on it as an HVM/PVH guest normally would. The alternate p2m will have some additional private code and data pages mapped in (RWX in the altp2m, but either read-only or completely unmapped in the primary p2m), as well as the host's xAPIC MMIO range so it can send IPIs to other vCPUs without having to VM-exit. To facilitate safe transitions between these two "realms", we'll be adding a couple of R/X-permissioned "trampoline pages" that will contain the VMFUNC instructions themselves and will be present in both p2ms.
>
> Thanks,

Ok, so there is a lot here. Apologies in advance for the overly long answer.

First, while altp2m was developed in parallel with EPTP-switching, we took care to split the vendor neutral parts from the vendor specific bits. So while we do have VMFUNC support, that's considered "just" a hardware optimisation to speed up the HVMOP_altp2m_switch_p2m hypercall.

But before you start, it is important to understand your security boundaries. You've found external mode, and this is all about controlling which aspects of altp2m the guest can invoke itself, and modes other than external let the guest issue HVMOP_altp2m ops itself.
If you permit the guest to change views itself, either with VMFUNC, or HVMOP_altp2m_switch_p2m, you have to realise that these are just "regular" CPL0 actions, and can be invoked by any kernel code, not just your driver. i.e. the union of all primary and alternative views is one single security domain. For some usecases this is fine, but yours doesn't look like it fits in this category. In particular, no amount of protection on the trampoline pages stops someone writing a VMFUNC instruction elsewhere in kernel space and executing it.

(I have seen plenty of research papers try to construct a security boundary around VMFUNC. I have yet to see one that does so robustly, but I do enjoy being surprised on occasion...)

The first production use of this technology I'm aware of was Bitdefender's HVMI, where the guest had no control at all, and was subject to the permission restrictions imposed on it by the agent in dom0. The agent trapped everything it considered sensitive, including writes to sensitive areas of memory using reduced EPT permissions, and either permitted execution to continue, or took other preventative action.

This highlights another key point. Some entity in the system needs to deal with faults that occur when the guest accidentally (or otherwise) violates the reduced EPT permissions. #VE is, again, an optimisation to let violations be handled in guest context, rather than taking a VMExit, but even with #VE the complicated corner cases are left to the external agent.

With HVMI, #VE (but not VMFUNC IIRC) did get used as an optimisation to mitigate the perf hit from Windows' Meltdown mitigation electing to use LOCK'd BTS/BTC operations on pagetables (which were write protected behind the scenes), but I'm reliably informed that the hoops required to jump through to make that work, and in particular avoid the notice of PatchGuard, were substantial.

Perhaps a more accessible example is https://github.com/intel/kernel-fuzzer-for-xen-project and the underlying libvmi.
There is also a very basic example in tools/misc/xen-access.c in the Xen tree.

For your question specifically about mapping other frames, we do have hypercalls to map other frames (it's necessary for e.g. mapping BARs of passed-through PCI devices), but for obvious reasons, it's restricted to control software (Qemu) in dom0. I suspect we don't actually have a hypercall to map MMIO into an alternative view, but it shouldn't be hard to add (if you still decide you want it by the end of this email).

But on to the specifics of mapping the xAPIC page. Sorry, but irrespective of altp2m, that is a non-starter, for reasons that date back to ~1997 or thereabouts.

It's worth saying that AMD can fully virtualise IPI delivery from one vCPU to another without taking a VMExit in the common case, since Zen1 (IIRC). Intel has a similar capability since Sapphire Rapids (IIRC). Xen doesn't support either yet, because there are only so many hours in the day...

It is technically possible to map the xAPIC window into a guest, and such a guest could interact with the real interrupt controller. But now you've got the problem that two bits of software (Xen, and your magic piece of guest kernel) are trying to drive the same single interrupt controller.

Even if you were to say that the guest would only use ICR to send interrupts, that still doesn't work. In xAPIC, ICR is formed of two half registers, as it dates from the days of 32bit processors, with a large stride between the two half registers. Therefore, it is a minimum of two separate instructions (set destination in ICR_HI, set type/mode/etc in ICR_LO) to send an interrupt.

A common bug in kernels is to try and send IPIs when interrupts are enabled, or in NMI context, both of which could interrupt an IPI sequence. This results in a sequence of writes (from the LAPIC's point of view) of ICR_HI, ICR_HI, ICR_LO, ICR_LO, which causes the outer IPI to be sent with the wrong destination.
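To make the ICR_HI, ICR_HI, ICR_LO, ICR_LO sequence concrete, here is a minimal sketch in Python. It is a toy model and not Xen or kernel code: the XApicModel class, the ipi_writes() and run() helpers, and the vector numbers are all made up for illustration. The only hardware facts it relies on are the xAPIC register offsets (ICR low half at 0x300, high half at 0x310), the physical destination living in bits 31:24 of ICR_HI, and the write to ICR_LO being what actually sends the interrupt.

```python
from dataclasses import dataclass, field

# xAPIC register offsets inside the 4 KiB MMIO window.
ICR_LO = 0x300  # vector, delivery mode, etc.; writing it sends the IPI
ICR_HI = 0x310  # destination field in bits 31:24


@dataclass
class XApicModel:
    """Toy local APIC: latches ICR_HI, sends when ICR_LO is written."""
    icr_hi: int = 0
    sent: list = field(default_factory=list)

    def write(self, offset, value):
        if offset == ICR_HI:
            self.icr_hi = value & 0xFFFFFFFF
        elif offset == ICR_LO:
            dest = (self.icr_hi >> 24) & 0xFF
            vector = value & 0xFF
            self.sent.append((dest, vector))
        else:
            raise ValueError(f"register {offset:#x} not modelled")


def ipi_writes(dest, vector):
    """The two MMIO writes needed to send one IPI in xAPIC mode."""
    return [(ICR_HI, dest << 24), (ICR_LO, vector)]


def run(writes):
    """Replay a list of (offset, value) writes and return what was sent."""
    apic = XApicModel()
    for offset, value in writes:
        apic.write(offset, value)
    return apic.sent


outer = ipi_writes(dest=1, vector=0xF1)  # the interrupted sender
inner = ipi_writes(dest=2, vector=0xF2)  # e.g. an IRQ/NMI handler, or Xen

# Back to back, each IPI reaches its intended destination.
clean = run(outer + inner)

# Interrupted between the two halves: ICR_HI, ICR_HI, ICR_LO, ICR_LO.
raced = run([outer[0]] + inner + [outer[1]])

print("clean:", clean)
print("raced:", raced)
```

In the raced case the model records vector 0xF1 going to destination 2 rather than 1, because the inner sender overwrote the latched ICR_HI before the outer sender wrote ICR_LO.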
Guests always execute with IRQs enabled, but can take a VMExit on any arbitrary instruction boundary for other reasons, so the guest kernel can never be sure that ICR_HI hasn't been modified by Xen in the background, even if it used two adjacent instructions to send the IPI.

Now, if you were to swap xAPIC for x2APIC, one of the bigger changes was making ICR a single register, so it could be written atomically. But now you have an MSR based interface, not an MMIO based interface.

It's also worth noting that any system with >254 CPUs is necessarily operating in x2APIC mode (so there isn't an xAPIC window to map, even if you wanted to try), and because of the ÆPIC Leak vulnerability, IceLake and later CPUs are locked into x2APIC mode by firmware, with no option to revert back into xAPIC mode even on smaller systems.

On top of that, you've still got the problem of determining the destination. Even if the guest could send an IPI, it still has to know the physical APIC ID of the CPU the target vCPU is currently scheduled on. And you'd have to ignore things like the logical mode or destination shorthands, because multi/broadcast IPIs will hit incorrect targets.

On top of that, even if you can determine the right destination, how does the target receive the interrupt? There can only be one entity in the system receiving INTR, and that's Xen. So you've got to pick some vector that Xen knows what to do with, but isn't otherwise using.

Not to mention there's a(nother) giant security hole... A guest able to issue interrupts could just send INIT-SIPI-SIPI and reset the target CPU back into real mode behind Xen's back. Xen will not take kindly to this.

So while I expect there's plenty of room to innovate on the realm switch aspect of EPTP-switching, trying to send IPIs from within guest context is something that I will firmly suggest you avoid. There are good reasons why it is so complicated to get VMExit-less guest IPIs working.

~Andrew