Xen project Mailing List

Re: Best way to use altp2m to support VMFUNC EPT-switching?

To: "Johnson, Ethan" <ejohns48@xxxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Fri, 31 Mar 2023 22:06:15 +0100

Delivery-date: Fri, 31 Mar 2023 21:06:42 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 30/03/2023 3:29 am, Johnson, Ethan wrote:
> On 2023-03-16 02:14:18 +0000, Andrew Cooper wrote:
>> Ok, so there is a lot here. Apologies in advance for the overly long
>> answer.
>>
>> First, while altp2m was developed in parallel with EPTP-switching, we
>> took care to split the vendor neutral parts from the vendor specific
>> bits. So while we do have VMFUNC support, that's considered "just" a
>> hardware optimisation to speed up the HVMOP_altp2m_switch_p2m hypercall.
>>
>> But before you start, it is important to understand your security
>> boundaries. You've found external mode, and this is all about
>> controlling which aspects of altp2m the guest can invoke itself, and
>> modes other than external let the guest issue HVMOP_altp2m ops itself.
>>
>> If you permit the guest to change views itself, either with VMFUNC, or
>> HVMOP_altp2m_switch_p2m, you have to realise that these are just
>> "regular" CPL0 actions, and can be invoked by any kernel code, not just
>> your driver. i.e. the union of all primary and alternative views is one
>> single security domain.
>>
>> For some usecases this is fine, but yours doesn't look like it fits in
>> this category. In particular, no amount of protection on the trampoline
>> pages stops someone writing a VMFUNC instruction elsewhere in kernel
>> space and executing it.
>>
>> (I have seen plenty of research papers try to construct a security
>> boundary around VMFUNC. I have yet to see one that does so robustly,
>> but I do enjoy being surprised on occasion...)
>>
>> The first production use of this technology I'm aware of was
>> Bitdefender's HVMI, where the guest had no control at all, and was
>> subject to the permission restrictions imposed on it by the agent in
>> dom0. The agent trapped everything it considered sensitive, including
>> writes to sensitive areas of memory using reduced EPT permissions, and
>> either permitted execution to continue, or took other preventative
>> action.
>>
>> This highlights another key point.
>> Some entity in the system needs to deal with faults that occur when
>> the guest accidentally (or otherwise) violates the reduced EPT
>> permissions. #VE is, again, an optimisation to let violations be
>> handled in guest context, rather than taking a VMExit, but even with
>> #VE the complicated corner cases are left to the external agent.
>>
>> With HVMI, #VE (but not VMFUNC IIRC) did get used as an optimisation to
>> mitigate the perf hit from Windows' Meltdown mitigation electing to use
>> LOCK'd BTS/BTC operations on pagetables (which were write protected
>> behind the scenes), but I'm reliably informed that the hoops required to
>> jump through to make that work, and in particular avoid the notice of
>> PatchGuard, were substantial.
>>
>> Perhaps a more accessible example is
>> https://github.com/intel/kernel-fuzzer-for-xen-project and the
>> underlying libvmi. There is also a very basic example in
>> tools/misc/xen-access.c in the Xen tree.
>>
>> For your question specifically about mapping other frames, we do have
>> hypercalls to map other frames (it's necessary for e.g. mapping BARs of
>> passed-through PCI devices), but for obvious reasons, it's restricted to
>> control software (Qemu) in dom0. I suspect we don't actually have a
>> hypercall to map MMIO into an alternative view, but it shouldn't be hard
>> to add (if you still decide you want it by the end of this email).
>>
>>
>> But on to the specifics of mapping the xAPIC page. Sorry, but
>> irrespective of altp2m, that is a non-starter, for reasons that date
>> back to ~1997 or thereabouts.
>>
>> It's worth saying that AMD can fully virtualise IPI delivery from one
>> vCPU to another without taking a VMExit in the common case, since
>> Zen1 (IIRC). Intel has a similar capability since Sapphire Rapids
>> (IIRC). Xen doesn't support either yet, because there are only so many
>> hours in the day...
>>
>> It is technically possible to map the xAPIC window into a guest, and
>> such a guest could interact with the real interrupt controller. But now
>> you've got the problem that two bits of software (Xen, and your magic
>> piece of guest kernel) are trying to drive the same single interrupt
>> controller.
>>
>> Even if you were to say that the guest would only use ICR to send
>> interrupts, that still doesn't work. In xAPIC, ICR is formed of two
>> half registers, as it dates from the days of 32bit processors, with a
>> large stride between the two half registers.
>>
>> Therefore, it is a minimum of two separate instructions (set destination
>> in ICR_HI, set type/mode/etc in ICR_LO) to send an interrupt.
>>
>> A common bug in kernels is to try and send IPIs when interrupts are
>> enabled, or in NMI context, both of which could interrupt an IPI
>> sequence. This results in a sequence of writes (from the LAPIC's point
>> of view) of ICR_HI, ICR_HI, ICR_LO, ICR_LO, which causes the outer IPI
>> to be sent with the wrong destination.
>>
>> Guests always execute with IRQs enabled, but can take a VMExit on any
>> arbitrary instruction boundary for other reasons, so the guest kernel
>> can never be sure that ICR_HI hasn't been modified by Xen in the
>> background, even if it used two adjacent instructions to send the IPI.
>>
>> Now, if you were to swap xAPIC for x2APIC, one of the bigger changes was
>> making ICR a single register, so it could be written atomically. But
>> now you have an MSR based interface, not an MMIO based interface.
>>
>> It's also worth noting that any system with >254 CPUs is necessarily
>> operating in x2APIC mode (so there isn't an xAPIC window to map, even if
>> you wanted to try), and because of the ÆPIC Leak vulnerability, IceLake
>> and later CPUs are locked into x2APIC mode by firmware, with no option
>> to revert back into xAPIC mode even on smaller systems.
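
The ICR_HI, ICR_HI, ICR_LO, ICR_LO interleaving described above can be reproduced with a small toy model. This is only a sketch in Python, not real hardware access: the class ToyXapic and the helpers send_ipi() and interleaved_example() are made up for illustration, as are the APIC IDs and vectors. The only facts taken from the xAPIC itself are the register offsets (ICR_LO at 0x300, ICR_HI at 0x310), that the destination sits in bits 31:24 of ICR_HI, and that the write to ICR_LO is what sends the IPI.

```python
ICR_LO = 0x300  # xAPIC offset: vector/delivery mode; writing it sends the IPI
ICR_HI = 0x310  # xAPIC offset: destination APIC ID in bits 31:24


class ToyXapic:
    """Toy model of the xAPIC ICR register pair (hypothetical, for illustration)."""

    def __init__(self):
        self.icr_hi = 0
        self.sent = []  # (destination APIC ID, vector) for each IPI sent

    def write(self, offset, value):
        if offset == ICR_HI:
            self.icr_hi = value
        elif offset == ICR_LO:
            # The IPI goes to whatever destination ICR_HI holds right now.
            self.sent.append((self.icr_hi >> 24, value & 0xFF))
        else:
            raise ValueError(f"unmodelled register {offset:#x}")


def send_ipi(apic, dest, vector, interrupted_by=None):
    """Two-step xAPIC IPI; `interrupted_by` runs between the two writes."""
    apic.write(ICR_HI, dest << 24)
    if interrupted_by is not None:
        interrupted_by()  # e.g. an NMI handler, or Xen after a VMExit
    apic.write(ICR_LO, vector)


def interleaved_example():
    apic = ToyXapic()
    # The outer sender targets APIC ID 1 (made-up value). The interrupting
    # context sends its own IPI to APIC ID 2 between the outer writes.
    send_ipi(apic, dest=1, vector=0x40,
             interrupted_by=lambda: send_ipi(apic, dest=2, vector=0x50))
    return apic.sent
```

In this model interleaved_example() records both IPIs as going to APIC ID 2: the outer sender's ICR_HI value was overwritten before its ICR_LO write landed. A single-register ICR, as in x2APIC, leaves no gap for this to happen in.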
>>
>> On top of that, you've still got the problem of determining the
>> destination. Even if the guest could send an IPI, it still has to know
>> the physical APIC ID of the CPU the target vCPU is currently scheduled
>> on. And you'd have to ignore things like the logical mode or
>> destination shorthands, because multi/broadcast IPIs will hit incorrect
>> targets.
>>
>> On top of that, even if you can determine the right destination, how
>> does the target receive the interrupt? There can only be one entity in
>> the system receiving INTR, and that's Xen. So you've got to pick some
>> vector that Xen knows what to do with, but isn't otherwise using.
>>
>> Not to mention there's a(nother) giant security hole... A guest able to
>> issue interrupts could just send INIT-SIPI-SIPI and reset the target CPU
>> back into real mode behind Xen's back. Xen will not take kindly to this.
>>
>>
>> So while I expect there's plenty of room to innovate on the realm switch
>> aspect of EPTP-switching, trying to send IPIs from within guest context
>> is something that I will firmly suggest you avoid. There are good
>> reasons why it is so complicated to get VMExit-less guest IPIs working.
>>
>> ~Andrew
>
> Thank you for the detailed answers and context. I am somewhat encouraged to
> note that most of the roadblocks you mentioned are issues we've specifically
> considered (and think we have solutions for) in our design. :-) We're using
> some rather exotic compiler-based instrumentation on the guest kernel (plus
> some tricks with putting the "secure realm"'s page tables in a nonoverlapping
> guest-physical address range that isn't present in the primary p2m used by
> untrusted code) to prevent the guest from doing things it isn't supposed to
> with VMFUNC and (x2)APIC access, despite running in ring 0 within non-root
> mode.
>
> On a more concrete level, I am looking to do the following from within the
> hypervisor (specifically, from within a new hypercall I've added):
>
> 1) Get some (host-)physical memory frames from the domain heap and "pin" them
> to make sure they won't be swapped out.

Xen doesn't have paging, owing to not having a disk driver. There is a
paging subsystem which you've probably found already in the code, but
it's a decade old and never got beyond experimental status, so for most
intents and purposes you can pretend that it doesn't exist.

i.e. nothing allocated in Xen moves around unexpectedly behind your back.

However, pages that are allocated to a guest (PGC_allocated) are
reference counted, and can be freed when the refcount drops to zero.
This can include explicit guest actions such as a decrease_reservation()
hypercall. You have to be aware of this if you want to point any other
non-refcounted thing at the memory, but I suspect it won't matter for
your cases here.

> 2) Create an altp2m for the calling (current) domain.
>
> 3) Map some of the newly-allocated physical frames into both the domain's
> primary p2m and its altp2m, with R/X permissions.
>
> 4) Map the rest of the physical frames into only the altp2m (as R/W), at a
> guest-physical address higher than the end of the main p2m's mapped range
> (such that when the primary p2m is active, the guest cannot access these
> pages without taking a hard VM-exit fault).
>
> I've been poring through Xen's p2m code (e.g. xen/arch/x86/mm/p2m.c) to try
> to understand how to achieve these goals, but with little success. Comments
> in the p2m code seem to be rather sparse, and mostly unhelpful for
> understanding (without pre-understood context) what many of the functions do
> and what is the intended workflow for using them.
> For instance, similarly-named functions like guest_remove_page() and
> guest_physmap_remove_page() seem to operate at different levels of
> abstraction (in terms of memory management, refcount bookkeeping, etc.) but
> it isn't externally obvious how they're meant to all fit together and be used
> by client code.

Don't feel too bad... Not even the maintainers can agree on where that
split is.

It's mostly a matter of history. Originally Xen had paravirtual guests
(dom0 still runs in this mode) which were aware they were running under
Xen, and had to manage their own memory, including whatever idea they
had about their layout.

Then HVM guests came along and Xen had to start managing the guest
physical address space on behalf of the guest, and this was (dubiously)
called the physical_to_machine or P2M.

Notice how the guest_physmap_* functions have paging_mode_translate()
checks and do two totally different things. Read
paging_mode_translate() as is_hvm_domain() and it might help.

The guest_physmap_* functions are for doing logically-the-same operation
on PV or HVM guests, where PV is often a no-op, and HVM is quite
involved. The p2m functions are all for HVM guests specifically.

But yes - the APIs are a mess and you're not the only person to have
noticed.

> Any suggestions on which p2m (or other) APIs I should be focusing on, and how
> they're meant to be used, would be greatly appreciated. I suppose in theory I
> could just bypass p2m entirely, and populate one of the VMCS's EPTP-switching
> array's slots directly with my own manually constructed paging hierarchy
> (since I'm envisioning the memory layout of our "secure realm" as being quite
> simple - it only needs a handful of pages). But I'd rather "color within the
> lines" of the existing APIs if possible, especially since some of the pages
> will need to be mapped into the existing primary p2m (for the "insecure
> realm") as well.

Taking your analogy, I'm afraid you're probably going to have to start
with a pencil and draw some more lines.

The altp2m work got as far as minor {i,d}TLB bifurcation (to
stealth-breakpoint code under analysis), but didn't ever get to "I'd
like something totally different in different views".

There has to be an authoritative idea of what the guest physmap
(singular) looks like, and that's the host p2m.

(Not relevant to your case, but to highlight a point. Consider trying
to migrate a VM with a multi-view setup. The logdirty bitmap is
expressed as a bit per gfn, and all those gfn bits had better come from
the same view, not the alternate view which happened to be active.)

I suspect what you might want to do is create the guest with all memory
(but mark the secure realm's memory as either E820_RESERVED, or remove
the entry entirely), and create two altp2m's; one for the insecure
realm and one for the secure realm.

IIRC, views are populated copy-on-write style from the hostp2m as the
vCPU executes in that view, but you can make modifications using
HVMOP_altp2m_set_mem_access{,_multi} to give it specific perms or
HVMOP_altp2m_change_gfn to bifurcate.

I suspect what you want to do is set the default perm to no access (i.e.
disable CoW) and use HVMOP_altp2m_set_mem_access_multi to explicitly
create the subset of mappings you want in each view.

But honestly, you're beyond my experience of using altp2m.

Good luck :)

~Andrew
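
The two-view arrangement suggested above can be sketched as a toy model. This is Python rather than Xen code, and it only illustrates the idea: Access, ToyView and build_realms() are hypothetical names, the gfn/mfn numbers are made up, and the lazy-population behaviour simply follows the "IIRC" description above rather than anything checked against the hypervisor. ToyView.set_mem_access() stands in for the effect of HVMOP_altp2m_set_mem_access on a single gfn.

```python
from enum import Flag


class Access(Flag):
    NONE = 0
    R = 1
    W = 2
    X = 4


class ToyView:
    """Toy altp2m view: explicit entries override a per-view default."""

    def __init__(self, hostp2m, default_access):
        self.hostp2m = hostp2m              # gfn -> mfn, the authoritative map
        self.default_access = default_access
        self.entries = {}                   # gfn -> (mfn, Access)

    def set_mem_access(self, gfn, access):
        # Stand-in for the effect of HVMOP_altp2m_set_mem_access on one gfn.
        self.entries[gfn] = (self.hostp2m[gfn], access)

    def lookup(self, gfn, wanted):
        """Return the mfn if `wanted` access is allowed, else None (a fault)."""
        if gfn not in self.entries:
            if gfn not in self.hostp2m:
                return None
            # Lazy population from the host p2m with the view's default access.
            self.entries[gfn] = (self.hostp2m[gfn], self.default_access)
        mfn, access = self.entries[gfn]
        return mfn if wanted in access else None


def build_realms():
    # Made-up layout: 0x100 ordinary RAM, 0x101 shared code, 0x200 secure data.
    hostp2m = {0x100: 0xA00, 0x101: 0xA01, 0x200: 0xB00}
    shared_gfn, secure_gfn = 0x101, 0x200

    # Insecure realm: everything by default, minus the secure realm's page.
    insecure = ToyView(hostp2m, Access.R | Access.W | Access.X)
    insecure.set_mem_access(shared_gfn, Access.R | Access.X)
    insecure.set_mem_access(secure_gfn, Access.NONE)

    # Secure realm: default of no access, plus an explicit subset of mappings.
    secure = ToyView(hostp2m, Access.NONE)
    secure.set_mem_access(shared_gfn, Access.R | Access.X)
    secure.set_mem_access(secure_gfn, Access.R | Access.W)
    return insecure, secure
```

With this layout, a write to gfn 0x200 resolves in the secure view and faults in the insecure one, while ordinary RAM at gfn 0x100 is only reachable from the insecure view. Both views still draw every mfn from the single host p2m, which matches the point about there being one authoritative guest physmap.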
