Re: Best way to use altp2m to support VMFUNC EPT-switching?


  • To: "Johnson, Ethan" <ejohns48@xxxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Thu, 16 Mar 2023 02:14:18 +0000
  • Delivery-date: Thu, 16 Mar 2023 02:15:10 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15/03/2023 9:41 pm, Johnson, Ethan wrote:
>> On 15/03/2023 2:01 am, Johnson, Ethan wrote:
>>> Hi all,
>>>
>>> I'm looking for some pointers on how Xen's altp2m system works and how it's
>>> meant to be used with Intel's VMFUNC EPT-switching for secure isolation
>>> within an HVM/PVH guest's kernelspace.
>>>
>>> Specifically, I am attempting to modify Xen to create (on request by an
>>> already-booted, cooperative guest with a duly modified Linux kernel) a
>>> second set of extended page tables that have access to additional
>>> privileged regions of host-physical memory (specifically, a page or two to
>>> store some sensitive data that we don't want the guest kernel to be able to
>>> overwrite, plus some host-physical MMIO ranges, specifically the xAPIC
>>> region). The idea is that the guest kernel will use VMFUNC to switch to the
>>> alternate EPTs and call "secure functions" provided (by the hypervisor) as
>>> read-only code to be executed in non-root mode on the alternate EPT,
>>> allowing certain VM-exit scenarios (namely, sending an IPI to another vCPU
>>> of the same domain) to be handled without exiting non-root mode. Hence,
>>> these extra privileged pages should only be visible to the alternative p2m
>>> that the "secure realm" functions live in. (Transitions between the secure-
>>> and insecure-realm EPTs will be through special read-only "trampoline" code
>>> pages that ensure the untrusted guest kernel can only enter the secure
>>> realm at designated entry points.)
>>>
>>> Looking at Xen's existing altp2m code, I get the sense that Xen is already
>>> designed to support something at least vaguely like this. I have not,
>>> however, been able to find much in the way of documentation on altp2m, so I
>>> am reaching out to see if anyone can offer pointers on how to best use it.
>>>
>>> What is the intended workflow (either in the toolstack or within the
>>> hypervisor itself) for creating and configuring an altp2m that should have
>>> access to additional host-physical frames that are not present in the
>>> guest's main p2m?
>>>
>>> FWIW, once the altp2m has been set up in this fashion, we don't anticipate
>>> needing to fiddle with its mappings any further as long as the guest is
>>> running (so I'm thinking *maybe* the "external" altp2m mode will suffice
>>> for this). In fact, we may not even need to have any "overlap" between the
>>> primary and alternative p2m except the trampoline pages themselves
>>> (although this aspect of our design is still somewhat in flux).
>>>
>>> I've noticed a function, do_altp2m_op(), in the hypervisor
>>> (xen/arch/x86/hvm/hvm.c) that seems to implement a number of altp2m-related
>>> hypercalls intended to be called from the dom0. Do these hypercalls already
>>> provide a straightforward way to achieve my goals described above entirely
>>> via (a potentially modified version of) the dom0 toolstack? Or would I be
>>> better off creating and configuring the altp2m from within the hypervisor
>>> itself, since I want to map low-level stuff like xAPIC MMIO ranges into the
>>> altp2m?
>>>
>>> Thank you in advance for your time and assistance!
>> Hello,
>>
>> There's a lot to unpack here, but before I do so, one question.  In your
>> usecase, are you wanting to map any frames with reduced permissions
>> (i.e. such that you'd get a #VE exception), or are you just looking to
>> add new frames with RWX perms into an alternative view?
>>
>> I suspect the latter, but it's not completely clear, and changes the answer.
>>
>> ~Andrew
> Yes, the latter is correct: I am looking to add new frames with RWX perms
> into an alternative view. I don't currently envision needing #VE in any form
> for this work.
>
> (We're using a modified PVH Linux guest for this, so rather than needing to
> intercept and react to EPT faults via #VE, we can expect the guest kernel to
> explicitly call our secure-realm functions via VMFUNC, replacing what would
> otherwise be some hypercalls out to Xen in root mode. I suppose supporting
> unmodified kernels by intercepting #VE could be an interesting enhancement
> for future work, but for now we're content to work with a cooperative
> modified PVH guest as a proof of concept. :-))
>
> Basically, the primary p2m will be (largely) as it is normally, and the
> untrusted guest kernel and userspace will run on it as an HVM/PVH guest
> normally would. The alternate p2m will have some additional private code and
> data pages mapped in (RWX in the altp2m, but either read-only or completely
> unmapped in the primary p2m), as well as the host's xAPIC MMIO range so it
> can send IPIs to other vCPUs without having to VM-exit. To facilitate safe
> transitions between these two "realms", we'll be adding a couple of
> R/X-permissioned "trampoline pages" that will contain the VMFUNC instructions
> themselves and will be present in both p2ms.
>
> Thanks,

Ok, so there is a lot here.  Apologies in advance for the overly long
answer.

First, while altp2m was developed in parallel with EPTP-switching, we
took care to split the vendor neutral parts from the vendor specific
bits.  So while we do have VMFUNC support, that's considered "just" a
hardware optimisation to speed up the HVMOP_altp2m_switch_p2m hypercall.

But before you start, it is important to understand your security
boundaries.  You've found external mode.  The altp2m modes are all about
controlling which aspects of altp2m the guest can invoke itself; modes
other than external let the guest issue HVMOP_altp2m ops itself.

If you permit the guest to change views itself, either with VMFUNC, or
HVMOP_altp2m_switch_p2m, you have to realise that these are just
"regular" CPL0 actions, and can be invoked by any kernel code, not just
your driver.  i.e. the union of all primary and alternative views is one
single security domain.

For some usecases this is fine, but yours doesn't look like it fits in
this category.  In particular, no amount of protection on the trampoline
pages stops someone writing a VMFUNC instruction elsewhere in kernel
space and executing it.
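
As a rough illustration of that point, here is a toy Python scan (not Xen
or Linux code; the buffer layout and the function name are made up for
this sketch).  VMFUNC encodes as the bytes 0F 01 D4, and those bytes are
just as executable when they turn up outside the trampoline, for example
inside the immediate operand of an unrelated instruction:

```python
# Toy scan: find every byte offset where the VMFUNC opcode (0F 01 D4)
# appears in a buffer standing in for executable guest kernel memory.
VMFUNC_OPCODE = bytes([0x0F, 0x01, 0xD4])


def find_vmfunc_offsets(memory):
    """Return all offsets at which the VMFUNC byte sequence starts."""
    offsets = []
    pos = memory.find(VMFUNC_OPCODE)
    while pos != -1:
        offsets.append(pos)
        pos = memory.find(VMFUNC_OPCODE, pos + 1)
    return offsets


# Hypothetical layout: a 16-byte "trampoline" (NOPs, VMFUNC, RET, NOPs)...
trampoline = b"\x90" * 4 + VMFUNC_OPCODE + b"\xc3" + b"\x90" * 8
# ...followed by unrelated code: mov eax, 0xd4010f00 encodes as
# B8 00 0F 01 D4, so the same three bytes sit inside its immediate.
elsewhere = b"\xb8\x00\x0f\x01\xd4" + b"\xc3"
memory = trampoline + elsewhere

hits = find_vmfunc_offsets(memory)
print(hits)  # one hit in the trampoline, one in the unrelated code
```

And a kernel that can write and execute its own code doesn't even need to
find such a sequence; it can simply emit one.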

(I have seen plenty of research papers try to construct a security
boundary around VMFUNC.  I have yet to see one that does so robustly, but I
do enjoy being surprised on occasion...)

The first production use of this technology I'm aware of was Bitdefender's
HVMI, where the guest had no control at all, and was subject to the
permission restrictions imposed on it by the agent in dom0.  The agent
trapped everything it considered sensitive, including writes to
sensitive areas of memory using reduced EPT permissions, and either
permitted execution to continue, or took other preventative action.

This highlights another key point.  Some entity in the system needs to
deal with faults that occur when the guest accidentally (or otherwise)
violates the reduced EPT permissions.  #VE is, again, an optimisation to
let violations be handled in guest context, rather than taking a VMExit,
but even with #VE the complicated corner cases are left to the external
agent.

With HVMI, #VE (but not VMFUNC IIRC) did get used as an optimisation to
mitigate the perf hit from Windows' Meltdown mitigation electing to use
LOCK'd BTS/BTC operations on pagetables (which were write protected
behind the scenes), but I'm reliably informed that the hoops required to
jump through to make that work, and in particular avoid the notice of
PatchGuard, were substantial.

Perhaps a more accessible example is
https://github.com/intel/kernel-fuzzer-for-xen-project and the
underlying libvmi.  There is also a very basic example in
tools/misc/xen-access.c in the Xen tree.

For your question specifically about mapping other frames, we do have
hypercalls to map other frames (it's necessary for e.g. mapping BARs of
passed-through PCI devices), but for obvious reasons, it's restricted to
control software (Qemu) in dom0.  I suspect we don't actually have a
hypercall to map MMIO into an alternative view, but it shouldn't be hard
to add (if you still decide you want it by the end of this email).


But on to the specifics of mapping the xAPIC page.  Sorry, but
irrespective of altp2m, that is a non-starter, for reasons that date
back to ~1997 or thereabouts.

It's worth saying that AMD can fully virtualise IPI delivery from one
vCPU to another without taking a VMExit in the common case, since
Zen1 (IIRC).  Intel has a similar capability since Sapphire Rapids
(IIRC).  Xen doesn't support either yet, because there are only so many
hours in the day...

It is technically possible to map the xAPIC window into a guest, and
such a guest could interact with the real interrupt controller.  But now
you've got the problem that two bits of software (Xen, and your magic
piece of guest kernel) are trying to drive the same single interrupt
controller.

Even if you were to say that the guest would only use ICR to send
interrupts, that still doesn't work.  In xAPIC, ICR is formed of two
half registers, as it dates from the days of 32bit processors, with a
large stride between the two half registers.

Therefore, it is a minimum of two separate instructions (set destination
in ICR_HI, set type/mode/etc in ICR_LO) to send an interrupt.

A common bug in kernels is to try and send IPIs when interrupts are
enabled, or in NMI context, both of which could interrupt an IPI
sequence.  This results in a sequence of writes (from the LAPIC's point
of view) of ICR_HI, ICR_HI, ICR_LO, ICR_LO, which causes the outer IPI
to be sent with the wrong destination.
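
To make the interleaving concrete, here is a toy Python model (purely
illustrative; ToyLapic and send_ipi_steps are invented names, and the
model keeps only the one property that matters here, namely that the
write to ICR_LO sends to whatever ICR_HI holds at that moment):

```python
class ToyLapic:
    """Minimal model of the xAPIC ICR: writing ICR_LO sends the IPI to
    whatever destination ICR_HI holds at that moment."""

    def __init__(self):
        self.icr_hi = 0
        self.sent = []  # (destination, vector) pairs, in send order

    def write_icr_hi(self, dest):
        self.icr_hi = dest

    def write_icr_lo(self, vector):
        self.sent.append((self.icr_hi, vector))


def send_ipi_steps(lapic, dest, vector):
    """Two-write IPI sequence; yields between the writes so that a
    caller can interleave another sequence at that point."""
    lapic.write_icr_hi(dest)
    yield  # an interrupt, NMI or VMExit could land here
    lapic.write_icr_lo(vector)


lapic = ToyLapic()

# Outer context starts sending vector 0x40 to APIC ID 1...
outer = send_ipi_steps(lapic, dest=1, vector=0x40)
next(outer)  # ICR_HI <- 1

# ...and is interrupted by a handler sending vector 0x50 to APIC ID 2.
for _ in send_ipi_steps(lapic, dest=2, vector=0x50):
    pass  # ICR_HI <- 2, then ICR_LO <- 0x50

# Outer context resumes and performs its second write.
for _ in outer:
    pass  # ICR_LO <- 0x40, but ICR_HI still holds 2

sent = lapic.sent
print(sent)
```

Both IPIs end up at APIC ID 2; the outer one, intended for APIC ID 1, is
misdelivered.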

Guests always execute with IRQs enabled, but can take a VMExit on any
arbitrary instruction boundary for other reasons, so the guest kernel
can never be sure that ICR_HI hasn't been modified by Xen in the
background, even if it used two adjacent instructions to send the IPI.

Now, if you were to swap xAPIC for x2APIC, one of the bigger changes was
making ICR a single register, so it could be written atomically.  But
now you have an MSR based interface, not an MMIO based interface.
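
For comparison, a small Python sketch of the two encodings (illustrative
only; the helper names are made up, and every ICR field other than the
destination and vector is left at zero, i.e. a fixed-mode,
physical-destination IPI).  The register locations and bit positions are
the architectural ones: ICR_LO at MMIO offset 0x300 and ICR_HI at 0x310
in xAPIC, and MSR 0x830 in x2APIC:

```python
XAPIC_ICR_LO_OFFSET = 0x300  # MMIO offset within the 4KiB xAPIC page
XAPIC_ICR_HI_OFFSET = 0x310
X2APIC_ICR_MSR = 0x830       # single 64-bit MSR


def xapic_ipi_writes(dest_apic_id, vector):
    """Return the two (offset, value) MMIO writes for a fixed-mode IPI."""
    assert 0 <= dest_apic_id <= 0xFF and 0 <= vector <= 0xFF
    return [
        # Destination lives in bits 31:24 of ICR_HI.
        (XAPIC_ICR_HI_OFFSET, dest_apic_id << 24),
        # Vector lives in bits 7:0 of ICR_LO; this write sends the IPI.
        (XAPIC_ICR_LO_OFFSET, vector),
    ]


def x2apic_ipi_write(dest_apic_id, vector):
    """Return the single (msr, value) WRMSR for the same IPI."""
    assert 0 <= dest_apic_id <= 0xFFFFFFFF and 0 <= vector <= 0xFF
    # Destination lives in bits 63:32, vector in bits 7:0.
    return (X2APIC_ICR_MSR, (dest_apic_id << 32) | vector)


xapic = xapic_ipi_writes(dest_apic_id=3, vector=0x40)
x2apic = x2apic_ipi_write(dest_apic_id=3, vector=0x40)
print([(hex(off), hex(val)) for off, val in xapic])
print(hex(x2apic[0]), hex(x2apic[1]))
```

The xAPIC form is two stores with a window between them; the x2APIC form
is one value, but it can only be delivered with WRMSR, not with a store
to a mapped page.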

It's also worth noting that any system with >254 CPUs is necessarily
operating in x2APIC mode (so there isn't an xAPIC window to map, even if
you wanted to try), and because of the ÆPIC Leak vulnerability, IceLake
and later CPUs are locked into x2APIC mode by firmware, with no option
to revert back into xAPIC mode even on smaller systems.

On top of that, you've still got the problem of determining the
destination.  Even if the guest could send an IPI, it still has to know
the physical APIC ID of the CPU the target vCPU is currently scheduled
on.  And you'd have to ignore things like the logical mode or
destination shorthands, because multi/broadcast IPIs will hit incorrect
targets.
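
A toy Python model of the stale-destination problem (entirely
hypothetical; the placement table, domain names and hypervisor_migrate
are invented for this sketch and do not correspond to Xen data
structures):

```python
# Hypothetical placement table: vCPU id -> physical APIC ID it runs on now.
placement = {0: 4, 1: 6}


def hypervisor_migrate(vcpu, new_apic_id):
    """The hypervisor's scheduler may move a vCPU at any time."""
    placement[vcpu] = new_apic_id


# Guest-side code looks up the physical destination for vCPU 1...
cached_dest = placement[1]

# ...but the scheduler moves vCPU 1 before the IPI is actually sent, and
# a different domain's vCPU now runs on the old physical CPU.
hypervisor_migrate(1, 2)
running_on = {4: "domA vcpu0", 2: "domA vcpu1", 6: "domB vcpu0"}

hit = running_on[cached_dest]
print(hit)  # the IPI lands on another domain's vCPU
```

Nothing the guest does between the lookup and the send can close that
window, because the migration happens outside its control.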

On top of that, even if you can determine the right destination, how
does the target receive the interrupt?  There can only be one entity in
the system receiving INTR, and that's Xen.  So you've got to pick some
vector that Xen knows what to do with, but isn't otherwise using.

Not to mention there's a(nother) giant security hole... A guest able to
issue interrupts could just send INIT-SIPI-SIPI and reset the target CPU
back into real mode behind Xen's back.  Xen will not take kindly to this.


So while I expect there's plenty of room to innovate on the realm switch
aspect of EPTP-switching, trying to send IPIs from within guest context
is something that I will firmly suggest you avoid.  There are good
reasons why it is so complicated to get VMExit-less guest IPIs working.

~Andrew



 

