
Re: [BUG]: Crashing Xen when nestedhvm is enabled


  • To: Patrick <patrick@xxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 11 Nov 2025 11:22:22 +0000
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Tue, 11 Nov 2025 11:23:06 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 25/10/2025 2:44 pm, Patrick wrote:
> Dear all,
>
> I think I have found a way for a guest to crash the hypervisor,
> when hardware nesting is enabled and we are running on Intel using VMX.
>
> This is done by executing the following steps in the non-nested guest:
>
> - Enable VMX and set a vmcs active
> - Overwrite the revision id of the active vmcs with any invalid id, using
> a plain memory write
> - Call `VMWRITE` to write the `MSR_BITMAP`
>
> Basically this:
> ```C
> vmxon();
> vmcs = alloc();
> *(uint32_t*)vmcs = correct_vmcs_revision_id;
> vmptrld(vmcs);
> *(uint32_t*)vmcs = invalid_vmcs_revision_id;
> vmwrite(MSR_BITMAP, NULL);
> ```
>
> The `vmptrld` sets the provided vmcs as the VMCS link pointer, as seen in
> `xen/arch/x86/hvm/vmx/vvmx.c:1834`:
> ```
> if ( cpu_has_vmx_vmcs_shadowing )
>     nvmx_set_vmcs_pointer(v, nvcpu->nv_vvmcx);
> ```
>
> If the guest now calls `VMWRITE`, it will access that vmcs directly,
> except when writing/reading the `IO_BITMAP` or the `MSR_BITMAP`.
>
> `xen/arch/x86/hvm/vmx/vvmx.c:107`
> ```
> /*
>  * For the following 6 encodings, we need to handle them in VMM.
>  * Let them vmexit as usual.
>  */
> set_bit(IO_BITMAP_A, vw);
> set_bit(VMCS_HIGH(IO_BITMAP_A), vw);
> set_bit(IO_BITMAP_B, vw);
> set_bit(VMCS_HIGH(IO_BITMAP_B), vw);
> set_bit(MSR_BITMAP, vw);
> set_bit(VMCS_HIGH(MSR_BITMAP), vw);
> ```
>
> If we now execute `vmwrite(MSR_BITMAP, 0)` in the guest, Xen runs through
> this call stack:
> ```
> nvmx_handle_vmwrite
> set_vvmcs_safe
> set_vvmcs_real_safe
> virtual_vmcs_vmwrite_safe
> virtual_vmcs_enter
> __vmptrld
> ```
> The vmcs pointer loaded in that last step is the guest-supplied one whose
> revision id we overwrote. Because the `vmcs_revision_id` no longer
> matches, the hardware rejects the vmptrld, and Xen hits `BUG()`.

Hello, thanks for reporting this.

The manual very clearly says "Don't Do This", but Xen should not crash
as a result.

I think the bug is letting the shadow VMCS remain in guest memory.  It's
also ridiculous that we intercept writes into control state then emulate
what hardware would have done anyway.

One interesting thing about VMCSes is that VMXON/VMPTRLD may make them
non-coherent with the rest of memory.  This is implementation dependent,
but works in our favour.

Architecturally, the only time revision_id is sampled is during
VMPTRLD.  There are no equivalents to VMX Error 11 for other
instructions, and no mechanism I can see for reporting this specific
failure during VMEntry or non-root operations.

But, with nested virt in the mix, while a vCPU is in (v)non-root
mode, we can de-schedule entirely, run another VM, and come back to this
vCPU.  We need to be able to guarantee that such errors can't occur, or
to be able to forward the errors properly into the guest.  There seems
to be no option for the latter.


>
> ---
> A second, similar bug occurs when calling `VMXOFF` while a shadow vmcs is
> active. This does not clear the shadow vmcs, and crashes the guest if it
> ever writes to the vmcs again, effectively locking the page against
> modification until a new vmcs is made active.
> This should be fixed by applying this patch:
>
> diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
> index 2432af58e0..3895dd158a 100644
> --- a/xen/arch/x86/hvm/vmx/vvmx.c
> +++ b/xen/arch/x86/hvm/vmx/vvmx.c
> @@ -1589,6 +1589,9 @@ static int nvmx_handle_vmxoff(struct cpu_user_regs *regs)
>      struct vcpu *v=current;
>      struct nestedvmx *nvmx = &vcpu_2_nvmx(v);
> +    struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
>  
> +    if ( cpu_has_vmx_vmcs_shadowing )
> +        nvmx_clear_vmcs_pointer(v, nvcpu->nv_vvmcx);
>      nvmx_purge_vvmcs(v);
>      nvmx->vmxon_region_pa = INVALID_PADDR;
>
> As far as I have read, the Intel SDM does not specifically state that a
> VMXOFF clears the active vmcs, but it does not state otherwise either. I
> think it is saner to clear it than to crash the guest with a vmcs error
> when it has vmx disabled.

This is different.  The manual very clearly says the VMCSes may become
corrupt if they're not VMCLEAR'd before VMXOFF.

In fact, there's a long-standing bug/misfeature in Intel CPUs that upon
VMXOFF, non-coherent VMCSes remain non-coherent until the next VMXON, at
which point new VMPTRLDs can cause older VMCSes to be cleared (the CPU
can only hold a certain number of VMCSes in internal registers) and
written back to main memory.

In the case of e.g. kexec, the new kernel has no ability to figure out
that this is going on.

Xen will need to clear all guest VMCSes for safety, but if we were
feeling helpful, we probably ought to poison such VMCSes with 0xc2 or so.

I've opened https://gitlab.com/xen-project/xen/-/issues/218 but it might
be a little while until we get to this.

~Andrew



 

