
Re: [BUG]: Crashing Xen when nestedhvm is enabled


  • To: Patrick <patrick@xxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 11 Nov 2025 11:22:22 +0000
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Tue, 11 Nov 2025 11:23:06 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 25/10/2025 2:44 pm, Patrick wrote:
> Dear all,
>
> I think I have found a way for a guest to crash the hypervisor,
> when hardware nesting is enabled and we are running on Intel using VMX.
>
> This is done by executing the following steps in the non-nested guest:
>
> - Enable VMX and set a vmcs active
> - Overwrite the revision id of the active vmcs with any invalid id, using
> a plain memory write
> - Call `VMWRITE` to write the `MSR_BITMAP`
>
> Basically this:
> ```C
> vmxon();
> vmcs = alloc();
> *(uint32_t*)vmcs = correct_vmcs_revision_id;
> vmptrld(vmcs);
> *(uint32_t*)vmcs = invalid_vmcs_revision_id;
> vmwrite(MSR_BITMAP, NULL);
> ```
>
> The `vmptrld` sets the provided vmcs as the VMCS link pointer, as seen in
> `xen/arch/x86/hvm/vmx/vvmx.c:1834`:
> ```
> if ( cpu_has_vmx_vmcs_shadowing )
>     nvmx_set_vmcs_pointer(v, nvcpu->nv_vvmcx);
> ```
>
> If the guest now calls `VMWRITE`, it will access that vmcs directly,
> except when writing/reading the `IO_BITMAP` or the `MSR_BITMAP`.
>
> `xen/arch/x86/hvm/vmx/vvmx.c:107`
> ```
> /*
>  * For the following 6 encodings, we need to handle them in VMM.
>  * Let them vmexit as usual.
>  */
> set_bit(IO_BITMAP_A, vw);
> set_bit(VMCS_HIGH(IO_BITMAP_A), vw);
> set_bit(IO_BITMAP_B, vw);
> set_bit(VMCS_HIGH(IO_BITMAP_B), vw);
> set_bit(MSR_BITMAP, vw);
> set_bit(VMCS_HIGH(MSR_BITMAP), vw);
> ```
>
> If we now execute `vmwrite(MSR_BITMAP, 0)` in the guest, Xen runs through
> this call stack:
> ```
> nvmx_handle_vmwrite
> set_vvmcs_safe
> set_vvmcs_real_safe
> virtual_vmcs_vmwrite_safe
> virtual_vmcs_enter
> __vmptrld
> ```
> The vmcs pointer loaded in that last step is the guest-supplied one whose
> revision id we overwrote. Because the `vmcs_revision_id` no longer
> matches, the hardware rejects the vmptrld, and Xen hits `BUG()`.

Hello, thanks for reporting this.

The manual very clearly says "Don't Do This", but Xen should not crash
as a result.

I think the bug is letting the shadow VMCS remain in guest memory.  It's
also ridiculous that we intercept writes into control state then emulate
what hardware would have done anyway.

One interesting thing about VMCSes is that VMXON/VMPTRLD may make them
non-coherent with the rest of memory.  This is implementation dependent,
but works in our favour.

Architecturally, the only time revision_id is sampled is during
VMPTRLD.  There are no equivalents to VMX Error 11 for other
instructions, and no mechanism I can see for reporting this specific
failure during VMEntry or non-root operations.

But, with nested virt in the mix, while a vCPU is in (v)non-root
mode, we can de-schedule entirely, run another VM, and come back to this
vCPU.  We need to be able to guarantee that such errors can't occur, or
to be able to forward the errors properly into the guest.  There seems
to be no option for the latter.


>
> ---
> A second, similar bug occurs when calling `VMXOFF` while a shadow vmcs is
> active. This does not clear the shadow vmcs, and crashes the guest if it
> ever writes to the vmcs again, effectively locking the page against
> modification until a new vmcs is made active.
> This should be fixed by applying this patch:
>
> diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
> index 2432af58e0..3895dd158a 100644
> --- a/xen/arch/x86/hvm/vmx/vvmx.c
> +++ b/xen/arch/x86/hvm/vmx/vvmx.c
> @@ -1589,6 +1589,9 @@ static int nvmx_handle_vmxoff(struct cpu_user_regs *regs)
>      struct vcpu *v=current;
>      struct nestedvmx *nvmx = &vcpu_2_nvmx(v);
> +    struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
>  
> +    if ( cpu_has_vmx_vmcs_shadowing )
> +        nvmx_clear_vmcs_pointer(v, nvcpu->nv_vvmcx);
>      nvmx_purge_vvmcs(v);
>      nvmx->vmxon_region_pa = INVALID_PADDR;
>
> As far as I have read, the Intel SDM does not specifically state that a
> VMXOFF clears the active vmcs, but it does not state otherwise either. I
> think it is saner to clear it than to crash the guest with a vmcs error
> when it has vmx disabled.

This is different.  The manual very clearly says the VMCSes may become
corrupt if they're not VMCLEAR'd before VMXOFF.

In fact, there's a long-standing bug/misfeature in Intel CPUs that upon
VMXOFF, non-coherent VMCSes remain non-coherent until the next VMXON, at
which point new VMPTRLDs can cause older VMCSes to be cleared (the CPU
can only hold a certain number of VMCSes in internal registers) and
written back to main memory.

In the case of e.g. kexec, the new kernel has no ability to figure out
that this is going on.

Xen will need to clear all guest VMCSes for safety, but if we were
feeling helpful, we probably ought to poison such VMCSes with 0xc2 or so.

I've opened https://gitlab.com/xen-project/xen/-/issues/218 but it might
be a little while until we get to this.

~Andrew



 

