Re: [Xen-devel] [PATCH RFC] x86/amd: Avoid cpu_has_hypervisor evaluating true on native hardware
On 29.01.2020 14:08, Andrew Cooper wrote:
> On 29/01/2020 08:17, Jan Beulich wrote:
>> On 28.01.2020 18:14, Andrew Cooper wrote:
>>> On 28/01/2020 13:59, Jan Beulich wrote:
>>>> On 27.01.2020 21:21, Andrew Cooper wrote:
>>>>> Without this fix, there is apparently a problem with Roger's "[PATCH v3 7/7]
>>>>> x86/tlb: use Xen L0 assisted TLB flush when available" on native AMD hardware.
>>>>> I haven't investigated the issue with that patch specifically, because
>>>>> cpu_has_hypervisor being wrong is obviously a bug.
>>>>>
>>>>> This is one of two possible approaches, and both have their downsides. This
>>>>> one takes an extra hit on context switches between PV vcpus and idle/hvm, as
>>>>> they will usually differ in HYPERVISOR bit.
>>>> Why would they differ in the HYPERVISOR bit? Maybe for idle (albeit
>>>> off the top of my head I can't recall us special casing idle wrt
>>>> CPUID handling), but why for PV vs HVM? The idle case, if there is
>>>> an issue with this, could be taken care of by actually setting the
>>>> bit there, as no-one should care about what it's set to?
>>> d->arch.pv.cpuidmasks is only allocated for PV domains (and starts by
>>> dup()ing the default).
>> Ah, that's the piece I was missing. My next question then is: Why
>> do we do *_init_levelling() from early_init_*() in the first place?
>> It looks conceptually wrong to me to set up leveling before having
>> obtained CPUID data. Wouldn't there better be a separate hook in
>> struct cpu_dev, to be invoked e.g. from identify_cpu() after
>> generic_identify()?
>
> cpuid_mask_cpu= literally means "pretend you're this older CPU instead",
> and was implemented in a way which affected Xen. This is why levelling
> is configured that early.

It's indeed written like this in the cmdline doc, but I vaguely
recall questioning this behavior at least once before.

> Now that this isn't the only way to make heterogeneous migration safe,
> perhaps we don't care so much, but it would still be a behavioural
> change to the cpuid_mask_* parameters.

And indeed we have cpuid= now to control the features Xen is to
use. I don't fancy looking at bug reports where I first need to
sort out the interaction between these two cmdline options.

>>> When context switching levelling MSRs, any non-PV guest uses
>>> cpumask_default. This captures idle and HVM vcpus.
>>>
>>> This is necessary because, at least at the time it was introduced,
>>> {pv,hvm}_cpuid() issued native CPUID instructions to then feed data back
>>> into guest context. It's probably less relevant now that guest_cpuid()
>>> doesn't issue native instructions in the general case.
>>>
>>> Either way, HVM gained the default like idle, to cause the lazy
>>> switching logic to switch less often.
>>>
>>> The problem we have after this patch is that default differs from PV in
>>> the HYPERVISOR bit, which basically guarantees that we rewrite the leaf
>>> 1 levelling on each context switch.
>>>
>>> However, having looked at the other feature bits which differ for PV,
>>> VME and PSE36 being hidden means we're always switching leaf 1 anyway,
>>> so this change for HYPERVISOR doesn't make the situation any worse.
>>>
>>>>> The other approach is to order things more carefully so levelling is
>>>>> configured after scanning for cpuid bits, but that has the downside that
>>>>> it is very easy to regress.
>>>>>
>>>>> Thoughts on which is the least-bad approach to take? Having written this
>>>>> patch, I'm now erring on the side of doing it the other way.
>>>> Besides the need for me to understand the aspect above, I'm afraid
>>>> to judge I'd need to have at least a sketch of what the alternative
>>>> would look like, in particular to figure how fragile it really is.
>>> It would be a small bit of careful reordering in cpu/amd.c
>>>
>>> The tipping factor is that, even if we arrange for idle context not to
>>> have HYPERVISOR set (and therefore cpu_has_hypervisor ending up clear
>>> when scanned), a regular CPUID instruction in PV context would see
>>> HYPERVISOR as a property of virtualising things sensibly for guests.
>>>
>>> As we need to cope with HYPERVISOR being visible in some contexts, it's
>>> better to consider it uniformly visible and break any kind of notional
>>> link between cpu_has_hypervisor matching what CPUID would see as the bit.
>> After having set up leveling I think we shouldn't use CPUID anymore
>> for leaves which may be leveled. As a result cpu_has_* / cpu_has()
>> would then be the only information source in this regard.
>>
>> Such a general re-arrangement would then also appear to address your
>> "easy to regress" concern.
>
> I've just had another thought, and it rules out other approaches.
>
> We use ctxt_switch_levelling(NULL) on the crash path to reset things for
> the next kernel, and that needs to hide the HYPERVISOR bit on AMD.
>
> Therefore, the approach in this patch is the only sensible action (and
> I'm now not concerned about the performance hit, as it won't actually
> increase the number of MSR writes we make).

I'm not fully convinced: Why is setting the MSRs back to
cpuidmask_defaults the correct thing to do in the first place?
Shouldn't we reset them to what we found on boot?

Jan
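For reference, the lazy levelling switch the thread keeps referring to looks
roughly like the sketch below. It is simplified and written from memory of the
logic in xen/arch/x86/cpu/amd.c around this time, so identifier and helper
names should be treated as approximate rather than a verbatim quote of the tree:

/*
 * Illustrative sketch only -- not verbatim Xen source.
 *
 * Each CPU caches the levelling-MSR values it last wrote.  PV vcpus use
 * their domain's masks (d->arch.pv.cpuidmasks); idle and HVM vcpus fall
 * back to cpuidmask_defaults.  The MSR is only rewritten when the wanted
 * value differs from the cached one -- the "lazy switching" above.  A
 * PV-vs-default difference in the HYPERVISOR bit therefore forces a
 * leaf 1 rewrite on most PV <-> idle/HVM switches.
 */
static void amd_ctxt_switch_masking(const struct vcpu *next)
{
    struct cpuidmasks *these_masks = &this_cpu(cpuidmasks);
    const struct domain *nextd = next ? next->domain : NULL;

    /*
     * Non-PV (idle, HVM) context uses the default masks.  A NULL 'next',
     * as passed on the crash path, lands here too.
     */
    const struct cpuidmasks *masks =
        (nextd && is_pv_domain(nextd) && nextd->arch.pv.cpuidmasks)
        ? nextd->arch.pv.cpuidmasks : &cpuidmask_defaults;

    if ( (levelling_caps & LCAP_1cd) == LCAP_1cd )
    {
        uint64_t val = masks->_1cd;

        /* Lazy: skip the WRMSR if the cached value is already current. */
        if ( unlikely(these_masks->_1cd != val) )
        {
            wrmsr_amd(MSR_K8_FEATURE_MASK, val);
            these_masks->_1cd = val;
        }
    }

    /* The remaining levelling leaves are handled in the same pattern. */
}

Note that a NULL 'next', as used on the crash path, selects the same
cpuidmask_defaults path, which is exactly the reset-to-defaults behaviour
questioned at the end of the mail.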