
Re: [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest

On 07/24/2015 06:29 PM, Konrad Rzeszutek Wilk wrote:
On Fri, Jul 24, 2015 at 06:18:56PM +0200, Juergen Gross wrote:
On 07/24/2015 06:09 PM, Konrad Rzeszutek Wilk wrote:
On Fri, Jul 24, 2015 at 05:58:29PM +0200, Dario Faggioli wrote:
On Fri, 2015-07-24 at 17:24 +0200, Juergen Gross wrote:
On 07/24/2015 05:14 PM, Juergen Gross wrote:
On 07/24/2015 04:44 PM, Dario Faggioli wrote:

In fact, I think that it is the topology, i.e., what comes from MSRs,
that needs to adapt, and follow vNUMA, as much as possible. Do we agree
on this?

I think we have to be very careful here. I see two possible scenarios:

1) The vcpus are not pinned 1:1 on physical cpus. The hypervisor will
     try to schedule the vcpus according to their numa affinity. So they
     can change pcpus at any time in case of very busy guests. I don't
     think the linux kernel should treat the cpus differently in this
     case as it will be in vain regarding the Xen scheduler's activity.
     So we should use the "null" topology in this case.

Sorry, the topology should reflect the vcpu<->numa-node relations, of
course, but nothing else (so flat topology in each numa node).
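
To make "flat topology in each numa node" concrete, here is a minimal sketch. It is purely illustrative: the function name, the dictionary layout, and the example vcpu-to-node map are assumptions made for this example, not anything Xen or Linux defines. Each vcpu is presented as its own single-threaded core, and the only grouping that survives is the virtual NUMA node.

```python
from collections import defaultdict

def flat_topology(vcpu_to_vnode):
    """Group vcpus by virtual NUMA node (hypothetical helper).

    Inside a node every vcpu is its own core with no SMT siblings, so the
    guest sees the vcpu<->node relation and nothing else.
    """
    nodes = defaultdict(list)
    for vcpu, vnode in sorted(vcpu_to_vnode.items()):
        nodes[vnode].append({
            "vcpu": vcpu,
            "core_id": len(nodes[vnode]),   # index of this vcpu within its node
            "thread_siblings": [vcpu],      # only itself: no hyperthread pairs
        })
    return dict(nodes)

# Example map (made up): 4 vcpus spread over 2 virtual nodes.
topo = flat_topology({0: 0, 1: 0, 2: 1, 3: 1})
```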

Yeah, I was replying to this point saying something like this right
now... Luckily, I've seen this email! :-P

With this semantic, I fully agree with this.

2) The vcpus of the guest are all pinned 1:1 to physical cpus. The Xen
     scheduler can't move vcpus between pcpus, so the linux kernel should
     see the real topology of the used pcpus in order to optimize for this

Mmm... I did think about this too, but I'm not sure. I see the value of
this of course, and the reason why it makes sense. However, pinning can
change on-line, via `xl vcpu-pin' and stuff. Also migration could make
things less certain, I think. What happens if we build on top of the
initial pinning, and then things change?

To be fair, there is stuff building on top of the initial pinning
already, e.g., from which physical NUMA node we allocate the memory
depends exactly on that. That being said, I'm not sure I'm
comfortable with adding more of this...

Perhaps introduce an 'immutable_pinning' flag, which will prevent
affinity from being changed, and then bind the topology to pinning only if
that one is set?
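
The rule proposed here could be sketched roughly as follows. This is only an illustration of the decision, assuming the 'immutable_pinning' flag exists as suggested above; the function name, the "host"/"flat" return values, and the pinning representation are hypothetical and do not correspond to any existing Xen interface.

```python
def choose_topology(pinning, immutable_pinning):
    """Decide which topology to expose to the guest (hypothetical helper).

    pinning maps each vcpu to the set of pcpus it is allowed to run on.
    Returns "host" when the guest may see the real topology of its pcpus,
    and "flat" (per-vNUMA-node only) otherwise.
    """
    # Every vcpu must be restricted to exactly one pcpu...
    pinned_to_one = all(len(pcpus) == 1 for pcpus in pinning.values())
    # ...and no two vcpus may share that pcpu (a true 1:1 mapping).
    used = {next(iter(pcpus)) for pcpus in pinning.values() if len(pcpus) == 1}
    one_to_one = pinned_to_one and len(used) == len(pinning)
    # Only trust the pinning if it is guaranteed not to change later.
    if immutable_pinning and one_to_one:
        return "host"
    return "flat"
```

With this rule, a guest whose pinning can still be changed via `xl vcpu-pin' always gets the flat topology, even if it happens to be pinned 1:1 at creation time.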

Maybe, there is room for "fixing" this at this level, hooking up inside
the scheduler code... but I'm shooting in the dark, without having checked
whether and how this could be really feasible, should I?

Uuh, I don't think a change of the scheduler on behalf of Xen is really
appreciated. :-)

I'm sure it would (have been! :-)) a true and giant nightmare!! :-D

One thing I don't like about this approach is that it would potentially
solve vNUMA and other scheduling anomalies, but...

cpuid instruction is available for user mode as well.

...it would not do any good for other subsystems, and user level code
and apps.

Indeed. I think the optimal solution would be two-fold: give the
scheduler the information it needs to react correctly via a
kernel patch not relying on cpuid values and fiddle with the cpuid
values from xen tools according to any needs of other subsystems and/or
user code (e.g. licensing).

So, just to check if my understanding is correct: you'd like to add an
abstraction layer, in Linux, like in generic (or, perhaps, scheduling)
code, to hide the direct interaction with CPUID.
Such a layer, on baremetal, would just read CPUID while, on PV-ops, it'd
check with Xen/match vNUMA/whatever... Is this what you are saying?

If yes, I think I like it...
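
The shape of such an abstraction layer might look like the sketch below. It only shows the indirection, not real kernel code: `make_topology_source`, `read_cpuid`, `query_hypervisor`, and the stub data are all hypothetical names and values invented for this example.

```python
from typing import Callable, Dict

# cpu number -> scheduling-domain id (illustrative representation only)
Topology = Dict[int, int]

def make_topology_source(is_pv_guest: bool,
                         read_cpuid: Callable[[], Topology],
                         query_hypervisor: Callable[[], Topology]
                         ) -> Callable[[], Topology]:
    """Pick the backend the scheduler asks for topology (hypothetical).

    On baremetal the layer simply forwards to the CPUID-based reader; in a
    PV guest it asks the hypervisor instead, leaving the CPUID values free
    to be set for other purposes.
    """
    return query_hypervisor if is_pv_guest else read_cpuid

# Stub backends standing in for the real sources of information.
def cpuid_stub() -> Topology:
    return {0: 0, 1: 0, 2: 1, 3: 1}   # e.g. two cores with two threads each

def xen_stub() -> Topology:
    return {0: 0, 1: 1, 2: 2, 3: 3}   # flat: every vcpu in its own domain

scheduler_topology = make_topology_source(True, cpuid_stub, xen_stub)()
```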

I don't think this is workable. For example there are applications
which use 'cpuid' to figure out the core/thread layout and use it for their own
scheduling purposes.
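
For reference, this is roughly how such an application derives thread/core/package numbers from an x2APIC ID. The sketch assumes the two shift widths have already been read from CPUID leaf 0xB (EAX[4:0] of the SMT and core sub-leaves); they are passed in as plain parameters here because Python cannot execute the instruction itself, and the function name and sample values are made up for the example.

```python
def decode_x2apic_id(apic_id, smt_shift, core_shift):
    """Split an x2APIC ID into (package_id, core_id, smt_id).

    smt_shift:  number of low bits that identify the thread within a core
    core_shift: number of low bits that identify the thread within a package
    Both are assumed to come from CPUID leaf 0xB, i.e. from whatever values
    the hypervisor chooses to present to the guest.
    """
    smt_id = apic_id & ((1 << smt_shift) - 1)
    core_id = (apic_id >> smt_shift) & ((1 << (core_shift - smt_shift)) - 1)
    package_id = apic_id >> core_shift
    return package_id, core_id, smt_id

# Sample values (made up): 1 SMT bit, 4 bits per package, APIC ID 0b10110.
ids = decode_x2apic_id(0b10110, smt_shift=1, core_shift=4)
```

Any change to the CPUID values therefore changes what such code believes about sockets, cores and threads, whether or not the kernel scheduler relies on them.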

Might be, yes.

There are <cough>databases</cough> that do this.

Really? How do you know? ;-)

The pure cpuid solution won't work for all license related issues.

Doing it via an abstraction layer in the kernel would work in more than
90% of all cases AND would still enable a user to fiddle with cpuids
according to his needs (either topology or license).

I'd rather have an out-of-the-box kernel solution with special user
requirements handling than a complex solution making some user
requirements impossible to meet.

I think there are two issues here - the solution you are trying to come
up with is for PV scenarios.

I'd say my solution is a paravirtualized one.

But the issue I described is for PVH and HVM - where the cpuid is intercepted
by the hypervisor and we can mangle it as we see fit.

I think even PVH and HVM are able to use paravirtualized interfaces.

I don't say mangling cpuids can't solve the scheduling problem. It
surely can. But it can't solve the scheduling problem without hiding
information like number of sockets or cores which might be required
for license purposes. If we don't care, fine.

