
Re: [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest



On Fri, Jul 24, 2015 at 5:09 PM, Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
> On Fri, Jul 24, 2015 at 05:58:29PM +0200, Dario Faggioli wrote:
>> On Fri, 2015-07-24 at 17:24 +0200, Juergen Gross wrote:
>> > On 07/24/2015 05:14 PM, Juergen Gross wrote:
>> > > On 07/24/2015 04:44 PM, Dario Faggioli wrote:
>>
>> > >> In fact, I think that it is the topology, i.e., what comes from MSRs,
>> > >> that needs to adapt, and follow vNUMA, as much as possible. Do we agree
>> > >> on this?
>> > >
>> > > I think we have to be very careful here. I see two possible scenarios:
>> > >
>> > > 1) The vcpus are not pinned 1:1 on physical cpus. The hypervisor will
>> > >     try to schedule the vcpus according to their numa affinity. So they
>> > >     can change pcpus at any time in case of very busy guests. I don't
>> > >     think the linux kernel should treat the cpus differently in this
>> > >     case as it will be in vain regarding the Xen scheduler's activity.
>> > >     So we should use the "null" topology in this case.
>> >
>> > Sorry, the topology should reflect the vcpu<->numa-node relations, of
>> > course, but nothing else (so flat topology in each numa node).
>> >
>> Yeah, I was replying to this point saying something like this right
>> now... Luckily, I've seen this email! :-P
>>
>> With this semantic, I fully agree with this.
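
(Purely as an illustration of one possible reading of the "flat topology per
vNUMA node" semantics agreed on above: every vcpu shows up as its own core
with no SMT, and the only grouping the guest sees is the vnode. The
vcpu-to-vnode mapping below is a made-up example, not anything queried from
Xen.)

    /* sketch: sibling masks that reflect only the vNUMA node grouping */
    #include <stdio.h>

    #define NR_VCPUS 8

    /* assumed example vNUMA layout: vcpus 0-3 on vnode 0, 4-7 on vnode 1 */
    static const int vcpu_to_vnode[NR_VCPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

    int main(void)
    {
        for (int cpu = 0; cpu < NR_VCPUS; cpu++) {
            unsigned long thread_siblings = 1UL << cpu; /* no SMT exposed */
            unsigned long core_siblings = 0;            /* grouping == vnode */

            for (int other = 0; other < NR_VCPUS; other++)
                if (vcpu_to_vnode[other] == vcpu_to_vnode[cpu])
                    core_siblings |= 1UL << other;

            printf("vcpu%d: threads=%#lx cores=%#lx vnode=%d\n",
                   cpu, thread_siblings, core_siblings, vcpu_to_vnode[cpu]);
        }
        return 0;
    }
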
>>
>> > > 2) The vcpus of the guest are all pinned 1:1 to physical cpus. The Xen
>> > >     scheduler can't move vcpus between pcpus, so the linux kernel should
>> > >     see the real topology of the used pcpus in order to optimize for this
>> > >     picture.
>> > >
>> >
>> Mmm... I did think about this too, but I'm not sure. I see the value of
>> this of course, and the reason why it makes sense. However, pinning can
>> change on-line, via `xl vcpu-pin' and stuff. Also migration could make
>> things less certain, I think. What happens if we build on top of the
>> initial pinning, and then things change?
>>
>> To be fair, there is stuff building on top of the initial pinning
>> already, e.g., which physical NUMA node we allocate the memory from
>> depends exactly on that. That being said, I'm not sure I'm
>> comfortable with adding more of this...
>>
>> Perhaps introduce an 'immutable_pinning' flag, which would prevent the
>> affinity from being changed, and then bind the topology to the pinning
>> only if that flag is set?
>>
>> > >> Maybe, there is room for "fixing" this at this level, hooking up inside
>> > >> the scheduler code... but I'm shooting in the dark, without having checked
>> > >> whether and how this could really be feasible.
>> > >
>> > > Uuh, I don't think a change of the scheduler on behalf of Xen would be
>> > > really appreciated. :-)
>> > >
>> I'm sure it would be (would have been! :-)) a true and giant nightmare!! :-D
>>
>> > >> One thing I don't like about this approach is that it would potentially
>> > >> solve vNUMA and other scheduling anomalies, but...
>> > >>
>> > >>> cpuid instruction is available for user mode as well.
>> > >>>
>> > >> ...it would not do any good for other subsystems, and user level code
>> > >> and apps.
>> > >
>> > > Indeed. I think the optimal solution would be two-fold: give the
>> > > scheduler the information it needs to react correctly via a kernel
>> > > patch not relying on cpuid values, and fiddle with the cpuid values
>> > > from the xen tools according to the needs of other subsystems and/or
>> > > user code (e.g. licensing).
>> >
>> So, just to check whether my understanding is correct: you'd like to add
>> an abstraction layer in Linux, say in generic (or, perhaps, scheduling)
>> code, to hide the direct interaction with CPUID.
>> Such a layer, on bare metal, would just read CPUID while, on pv-ops, it'd
>> check with Xen/match vNUMA/whatever... Is this what you are saying?
>>
>> If yes, I think I like it...
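
(If I understand the idea correctly, something shaped roughly like the sketch
below; the names and structure are purely illustrative and not an existing
Linux interface. Generic code would ask an ops structure for topology, the
native backend would derive it from CPUID, and a Xen PV backend would derive
it from the vNUMA layout instead.)

    #include <stdio.h>

    /* illustrative only: per-cpu topology as seen by generic/scheduler code */
    struct cpu_topo {
        int thread_id;
        int core_id;
        int node_id;
    };

    struct topo_ops {
        void (*fill)(int cpu, struct cpu_topo *t);
    };

    /* bare metal: would derive the IDs from CPUID/APIC IDs (stubbed here) */
    static void native_fill(int cpu, struct cpu_topo *t)
    {
        t->thread_id = cpu % 2;     /* placeholder values */
        t->core_id   = cpu / 2;
        t->node_id   = 0;
    }

    /* Xen PV: would derive the IDs from the vNUMA layout, not from CPUID */
    static void xen_pv_fill(int cpu, struct cpu_topo *t)
    {
        t->thread_id = 0;           /* flat: no SMT exposed */
        t->core_id   = cpu;
        t->node_id   = cpu / 4;     /* placeholder vnode mapping */
    }

    int main(void)
    {
        int running_on_xen_pv = 1;  /* assumption, for the example only */
        const struct topo_ops ops = {
            .fill = running_on_xen_pv ? xen_pv_fill : native_fill,
        };

        for (int cpu = 0; cpu < 8; cpu++) {
            struct cpu_topo t;
            ops.fill(cpu, &t);
            printf("cpu%d: thread=%d core=%d node=%d\n",
                   cpu, t.thread_id, t.core_id, t.node_id);
        }
        return 0;
    }
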
>
> I don't think this is workable. For example there are applications
> which use 'cpuid' and figure out the core/thread and use it for its own
> scheduling purposes.
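
(For reference, a minimal user-space sketch of the kind of query being
referred to. It uses CPUID leaf 0xB, the x2APIC topology leaf, via GCC/Clang's
cpuid.h; real applications may use other leaves or OS interfaces instead, and
in a PV guest the values come straight from the underlying host CPU.)

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* leaf 0xB, subleaf 0 (SMT level): eax[4:0] = shift from x2APIC id
         * to the next topology level, edx = this cpu's x2APIC id */
        if (!__get_cpuid_count(0xb, 0, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0xB not supported\n");
            return 1;
        }

        unsigned int smt_shift = eax & 0x1f;
        unsigned int x2apic_id = edx;

        printf("x2APIC id %u: thread %u within core, core/package id %u\n",
               x2apic_id,
               x2apic_id & ((1u << smt_shift) - 1),
               x2apic_id >> smt_shift);
        return 0;
    }
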

Can you expand a little on this?  I'm having trouble figuring out
exactly what user-space applications are reading and how they're using
it -- and, how they work currently in virtual environments, given that
they (typically) will be moved between physical processors even if
they stay on the same virtual processor.

 -George
