Re: [PATCH 2/2] x86/vmx: implement Notify VM Exit


  • To: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Thu, 19 May 2022 16:45:20 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>
  • Delivery-date: Thu, 19 May 2022 14:45:48 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, May 19, 2022 at 12:10:24AM +0000, Andrew Cooper wrote:
> On 17/05/2022 14:21, Roger Pau Monne wrote:
> > Under certain conditions guests can get the CPU stuck in an infinite
> > loop without the possibility of an interrupt window to occur.
> 
> instruction boundary.
> 
> It's trivial to create an infinite loop without an interrupt window :)
> 
> Also, I'd probably phrase that as an unbounded loop, because not all
> problem cases are truly infinite.
> 
> >   This
> > was the case with the scenarios described in XSA-156.
> 
> Case in point, both of these can be broken by something else (another
> vCPU, or coherent DMA write) editing the IDT and e.g. making the #AC/#DB
> vectors not present, which will yield #NP instead.
> 
> >
> > Make use of the Notify VM Exit mechanism, that will trigger a VM Exit
> > if no interrupt window occurs for a specified amount of time.  Note
> > that using the Notify VM Exit avoids having to trap #AC and #DB
> > exceptions, as Xen is guaranteed to get a VM Exit even if the guest
> > puts the CPU in a loop without an interrupt window, as such disable
> > the intercepts if the feature is available and enabled.
> >
> > Setting the notify VM exit window to 0 is safe because there's a
> > threshold added by the hardware in order to have a sane window value.
> >
> > Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > ---
> > This change enables the notify VM exit by default, KVM however doesn't
> > seem to enable it by default, and there's the following note in the
> > commit message:
> >
> > "- There's a possibility, however small, that a notify VM exit happens
> >    with VM_CONTEXT_INVALID set in exit qualification. In this case, the
> >    vcpu can no longer run. To avoid killing a well-behaved guest, set
> >    notify window as -1 to disable this feature by default."
> >
> > It's not obviously clear to me whether the comment was meant to be:
> > "There's a possibility, however small, that a notify VM exit _wrongly_
> > happens with VM_CONTEXT_INVALID".
> 
> TBH, I read that as a get-out clause for "we have no idea what to set
> for a default window", and it's not a decision reasonable to defer to
> users, because they have even less of an idea than us.
> 
> All CPUs with Notify VM Exit have the TSC crystal information in CPUID,
> so I'd suggest that we trust CPUID to be accurate, and program for maybe
> 10us?  That's 1/3 of a default timeslice.
> 
> 
> 
> > It's also not clear whether such wrong hardware behavior only affects
> > a specific set of hardware, in a way that we could avoid enabling
> > notify VM exit there.
> >
> > There's a discussion in one of the Linux patches that 128K might be
> > the safer value in order to prevent false positives, but I have no
> > formal confirmation about this.  Maybe our Intel maintainers can
> > provide some more feedback on a suitable notify VM exit window
> > value.
> >
> > I've tested with 0 (the proposed default in the patch) and I don't
> > seem to be able to trigger notify VM exits under normal guest
> > operation.  Note that even in that case the guest won't be destroyed
> > unless the context is corrupt.
> 
> Huh... There's nothing in the manual about that, but obviously hardware
> has some minimum safe value if 0 appears to work in practice.

I got the tip from looking at the KVM patch submission implementing notify VM
exit.

Given your suggestion above to use 10us, and your reply here, I'm unsure
whether you are fine with using 0.

> > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> > index d388e6729c..5685a5523e 100644
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -67,6 +67,9 @@ integer_param("ple_gap", ple_gap);
> >  static unsigned int __read_mostly ple_window = 4096;
> >  integer_param("ple_window", ple_window);
> >  
> > +static int __read_mostly vm_notify_window;
> > +integer_param("vm-notify-window", vm_notify_window);
> 
> Part of me is loath to keep on adding new top-level options for this.
> 
> I was about to suggest having a vmx= option, but I've just noticed that
> ple_{window,gap} are wired up to cmdline options on Intel, and fixed
> constants on AMD.

Do we want to make this VMX specific?  The problem affects both Intel
and AMD, so AMD could plausibly introduce a similar solution in the
future, in which case we might want to share "vm-notify-window".  That
however raises the question of whether AMD would also allow specifying
a window, and whether it would be in crystal clock units.

> Thoughts on a suitable name?

vmx and svm would seem fine to me.

> > @@ -1333,6 +1338,19 @@ static int construct_vmcs(struct vcpu *v)
> >          rc = vmx_add_msr(v, MSR_FLUSH_CMD, FLUSH_CMD_L1D,
> >                           VMX_MSR_GUEST_LOADONLY);
> >  
> > +    if ( cpu_has_vmx_notify_vm_exiting && vm_notify_window >= 0 )
> > +    {
> > +        __vmwrite(NOTIFY_WINDOW, vm_notify_window);
> > +        /*
> > +         * Disable #AC and #DB interception: by using VM Notify Xen is
> > +         * guaranteed to get a VM exit even if the guest manages to lock the
> > +         * CPU.
> > +         */
> > +        v->arch.hvm.vmx.exception_bitmap &= ~((1U << TRAP_debug) |
> > +                                              (1U << TRAP_alignment_check));
> > +        vmx_update_exception_bitmap(v);
> 
> IIRC, it's not quite this easy.  There are conditions, e.g. attaching
> gdbsx, where #DB interception wants turning on/off dynamically, and the
> logic got simplified to nothing following XSA-156, so will need
> reintroducing.
> 
> AMD Milan (Zen3) actually has NoNestedDataBp in CPUID.80000021.eax[0]
> which allows us to not intercept #DB, so perhaps that might offer an
> easier way of adjusting the interception logic.  (Or maybe not.  I can't
> remember).

OK, will look into it.

> > +    }
> > +
> >   out:
> >      vmx_vmcs_exit(v);
> >  
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 02cc7a2023..9c37790c36 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -4567,6 +4567,30 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
> >           */
> >          break;
> >  
> > +    case EXIT_REASON_NOTIFY:
> > +        __vmread(EXIT_QUALIFICATION, &exit_qualification);
> > +
> > +        if ( exit_qualification & NOTIFY_VM_CONTEXT_INVALID )
> > +        {
> > +            perfc_incr(vmnotify_crash);
> > +            gprintk(XENLOG_ERR, "invalid VM context after notify vmexit\n");
> > +            domain_crash(v->domain);
> > +            break;
> > +        }
> > +
> > +        if ( cpu_has_vmx_vnmi &&
> > +             (exit_qualification & INTR_INFO_NMI_UNBLOCKED_BY_IRET) )
> > +        {
> > +            unsigned long guest_info;
> > +
> > +            /* Exit was incident to an execution of IRET that unblocked NMIs. */
> > +            __vmread(GUEST_INTERRUPTIBILITY_INFO, &guest_info);
> > +            __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
> > +                      guest_info | VMX_INTR_SHADOW_NMI);
> 
> I am saddened by how irritating it is having the UNBLOCKED_BY_IRET (in
> the first place...) but moving between the exit qualification and the
> vmexit intr info fields.  The constant probably ought to be renamed to
> lose the INTR_INFO prefix.
> 
> I'd suggest a prereq patch to also break
> 
> static void undo_nmis_unblocked_by_iret(void)
> {
>     ...
> }
> 
> out to avoid opencoding it in several places.  There's one other
> instance in our code (general fault intercept), but we're buggy on
> PML-full, APIC-access and EPT violation all of which Xen handles.

I don't think we are buggy for APIC-accesses; the SDM says:

"Executions of IRET may also incur VM exits due to APIC accesses and
EPT misconfigurations. These VM exits do not report information about
NMI unblocking due to IRET."

> I don't think you need the vnmi check, because the bit is 0 otherwise.

Right, in Xen's case the bit is only defined if vNMI is enabled, because
Xen requires PIN_BASED_NMI_EXITING to be set (the bit can also be valid
when NMI exiting is 0).

Thanks, Roger.
