[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] Ping: Re: [PATCH] x86/AMD: correct certain Fam17 checks

>>> On 25.03.19 at 09:41,  wrote:
>>>> On 22.03.19 at 20:56, <andrew.cooper3@xxxxxxxxxx> wrote:
> > On 19/03/2019 09:16, Jan Beulich wrote:
> >> Commit 3157bb4e13 ("Add MSR support for various feature AMD processor
> >> families") converted certain checks for Fam11 to include families all
> >> the way up to Fam17. The commit having no description, it is hard to
> >> tell whether this was a mechanical dec->hex conversion mistake, or
> >> indeed intended. In any event the NB_CFG handling needs to be restricted
> >> to Fam16 and below: Fam17 doesn't have such an MSR anymore.
> >>
> >> A non-MMCFG extended config space access mechanism still appears to
> >> exist, but code to deal with it will need to be written down the road,
> >> when it can actually be tested.
> >>
> >> Reported-by: Pu Wen <puwen@xxxxxxxx>
> >> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> > 
> > Having looked through various spec documents, this is a chronic mess.
> > 
> > First, to NB_CTL MSR itself.  It certainly is documented in Fam15, and
> > is absence in the documentation of Fam17.
> > 
> > In Fam15, it is explicitly documented as an alias of
> > 00:18.3[NB_CFG_LOW/HIGH] which are registers at offset 0x88 and 0x8c in
> > config space.
> > 
> > Fam17 documents that the extended cfc/cf8 mechanism does still exist,
> > and the new controls for that found in 00:18.4[CoreMasterAccessCtrl]
> > with a different bit layout.
> > 
> > Experimentation on a Rome system indicates that NB_CTL is fully
> > read0/write discard, so this patch is probably an improvement.
> Hmm, if it's read-zero / write-discard, then I guess we would better
> emulate that behavior. I simply didn't expect it to be that way, as
> I would generally assume undocumented _and_ unused MSRs to
> raise #GP. In which case the simplest thing to do would be to drop
> the one respective hunk changing emul-priv-op.c:write_msr().
> > Therefore, in principle, Acked-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Well, even without the point above I wouldn't be sure whether
> using an "in principle" ack as justification for committing is
> appropriate. So perhaps (see above and below) we'd better
> settle on something that you could agree to with less
> reservations (by directly adjusting this patch, or by adding a
> 2nd one on top).

I continue to be hesitant to commit with the above kind of
"restricted" ack. Could you clarify if we want to work out a better
solution, or whether I indeed should put in what's there right now?

Thanks, Jan

> > However, the actual code touched is completely insane.
> > 
> > HVM guests have it automatically read0/write discard, even for Intel
> > (citing cross vendor migration).
> > 
> > PV guest handling is complicated.  For CPUs without the MSR, it is read
> > #GP, write discard.  For CPUs which do have the MSR, it is read0/write
> > discard for domU or nonpinned dom0, which is 100% of usecases.  The PV
> > vs HVM differences cause an asymetry for the hardware domain.
> So how about we make it uniformly read-zero / write-discard, as long
> as the emulated CPUID has vendor AMD and family 0x10 and higher?
> Or even independent of vendor and family, to fully match current
> HVM behavior?
> > A pinned PV dom0 is permitted to change just the IO_ECS bit, eliciting a
> > warning but no #GP for modifying other bits.  As NB_CTL is a per-node
> > control (not a per-core control), unless dom0 vcpus == host pcpus, this
> > creates an asymmetry across the system as to whether IO_ECS is enabled
> > or not.
> Correct, but there's no inconsistency from Dom0's POV. Furthermore,
> as you explain further down, the PCI config space method of changing
> the bit was specifically added to Linux to avoid this inconsistency.
> > The HVM IOREQ path, and PV cfg_ok() path, when seeing an IO_ECS access
> > from the guest, reads the MSR every time, which is results in behaviour
> > which doesn't match the settings a guest can see, and comes with a
> > massive perf hit.
> Indeed it should be the guest view of the MSR which ought to be used
> there.
> >  It also means the behaviour of the guest IO depends
> > on which node it is currently scheduled on.
> If Dom0 enabled things in an asymmetric way.
> > The IO_ECS setting should be chosen once at boot, made consistent across
> > the entire system, and never touched at runtime.
> I don't fully agree (but it somewhat depends on what "at boot" is
> supposed to mean in your reply): We should leave the choice to Dom0,
> but it would probably not hurt to enforce a consistent setting. Doing
> this when Dom0 uses the MSR method would be relatively easy, but
> then again Dom0 shouldnt use that approach anyway. Doing this
> when Dom0 uses PCI config space writes would be less straightforward,
> as we'd have to write protect the north bridge devices' config spaces.
> Which wouldn't be very useful imo, as Dom0 - if it already uses the
> PCI config space approach - would update all north bridges anyway.
> > In all cases for guests, we can offer MMCFG even on a system which
> > doesn't have IO_ECS, and they will prefer that.  The behaviour of the
> > extra 4 bits is reserved, so we could have IO_ECS working in practice
> > with no signal.  However, we could equally drop IO_ECS entirely.  Guests
> > can't find its setting to begin with, so can't reliably use it, and also
> > wouldn't in the presence of MMCFG anyway.
> All this assuming that platforms correctly surface MMCFG. This not
> having been the case in at least the Fam10 days was one of the
> reasons why Linux (and the Xen) gained all this code. Furthermore
> to date I didn't think we would surface MMCFG to any of our guests.
> Jan

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.