[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] x86/hotplug: Do not put offline vCPUs in mwait idle state



On Fri, Jan 20, 2023, Igor Mammedov wrote:
> On Fri, 20 Jan 2023 05:55:11 -0800
> "Srivatsa S. Bhat" <srivatsa@xxxxxxxxxxxxx> wrote:
> 
> > Hi Igor and Thomas,
> > 
> > Thank you for your review!
> > 
> > On 1/19/23 1:12 PM, Thomas Gleixner wrote:
> > > On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:  
> > >> "Srivatsa S. Bhat" <srivatsa@xxxxxxxxxxxxx> wrote:  
> > >>> Fix this by preventing the use of mwait idle state in the vCPU offline
> > >>> play_dead() path for any hypervisor, even if mwait support is
> > >>> available.  
> > >>
> > >> if mwait is enabled, it's very likely guest to have cpuidle
> > >> enabled and using the same mwait as well. So exiting early from
> > >>  mwait_play_dead(), might just punt workflow down:
> > >>   native_play_dead()
> > >>         ...
> > >>         mwait_play_dead();
> > >>         if (cpuidle_play_dead())   <- possible mwait here                
> > >>                               
> > >>                 hlt_play_dead(); 
> > >>
> > >> and it will end up in mwait again and only if that fails
> > >> it will go HLT route and maybe transition to VMM.  
> > > 
> > > Good point.
> > >   
> > >> Instead of workaround on guest side,
> > >> shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
> > >> actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
> > >> context when it is removing vCPU, among other things)  
> > > 
> > > For a pure guest side CPU unplug operation:
> > > 
> > >     guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
> > > 
> > > the hypervisor is not involved at all. The vCPU is not removed in that
> > > case.
> > >   
> > 
> > Agreed, and this is indeed the scenario I was targeting with this patch,
> > as opposed to vCPU removal from the host side. I'll add this clarification
> > to the commit message.

Forcing HLT doesn't solve anything, it's perfectly legal to passthrough HLT.  I
guarantee there are use cases that passthrough HLT but _not_ MONITOR/MWAIT, and
that passthrough all of them.

> commit message explicitly said:
> "which prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU."
> 
> and that implies unplug on hypervisor side as well.
> Why? That's because when hypervisor exposes mwait to guest, it has to 
> reserve/pin
> a pCPU for each of present vCPUs. And you can safely run other VMs/workloads
> on that pCPU only after it's not possible for it to be reused by VM where
> it was used originally.

Pinning isn't strictly required from a safety perspective.  The latency of 
context
switching may suffer due to wake times, but preempting a vCPU that it's C1 (or
deeper) won't cause functional problems.   Passing through an entire socket
(or whatever scope triggers extra fun) might be a different story, but pinning
isn't strictly required.

That said, I 100% agree that this is expected behavior and not a bug.  Letting 
the
guest execute MWAIT or HLT means the host won't have perfect visibility into 
guest
activity state.

Oversubscribing a pCPU and exposing MWAIT and/or HLT to vCPUs is generally not 
done
precisely because the guest will always appear busy without extra effort on the
host.  E.g. KVM requires an explicit opt-in from userspace to expose MWAIT 
and/or
HLT.

If someone really wants to effeciently oversubscribe pCPUs and passthrough 
MWAIT,
then their best option is probably to have a paravirt interface so that the 
guest
can tell the host its offlining a vCPU.  Barring that the host could inspect the
guest when preempting a vCPU to try and guesstimate how much work the vCPU is
actually doing in order to make better scheduling decisions.

> Now consider following worst (and most likely) case without unplug
> on hypervisor side:
> 
>  1. vm1mwait: pin pCPU2 to vCPU2
>  2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
>         -> HLT -> VMEXIT
>  --
>  3. vm2mwait: pin pCPU2 to vCPUx and start VM
>  4. vm2mwait: guest OS onlines Vcpu and starts using it incl.
>        going into idle=>mwait state
>  --
>  5. vm1mwait: it still thinks that vCPU is present it can rightfully do:
>        guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
>  --              
>  6.1 best case vm1mwait online fails after timeout
>  6.2 worse case: vm2mwait does VMEXIT on vCPUx around time-frame when
>      vm1mwait onlines vCPU2, the online may succeed and then vm2mwait's
>      vCPUx will be stuck (possibly indefinitely) until for some reason
>      VMEXIT happens on vm1mwait's vCPU2 _and_ host decides to schedule
>      vCPUx on pCPU2 which would make vm1mwait stuck on vCPU2.
> So either way it's expected behavior.
> 
> And if there is no intention to unplug vCPU on hypervisor side,
> then VMEXIT on play_dead is not really necessary (mwait is better
> then HLT), since hypervisor can't safely reuse pCPU elsewhere and
> VCPU goes into deep sleep within guest context.
> 
> PS:
> The only case where making HLT/VMEXIT on play_dead might work out,
> would be if new workload weren't pinned to the same pCPU nor
> used mwait (i.e. host can migrate it elsewhere and schedule
> vCPU2 back on pCPU2).



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.