
Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

To: Juergen Gross <jgross@xxxxxxxx>, Dario Faggioli <dario.faggioli@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
From: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
Date: Wed, 02 Sep 2015 10:08:39 -0400
Cc: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>, linux-kernel <linux-kernel@xxxxxxxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxx>, David Vrabel <david.vrabel@xxxxxxxxxx>, "Luis R. Rodriguez" <mcgrof@xxxxxxxxxxxxxxxx>
Delivery-date: Wed, 02 Sep 2015 14:10:02 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 09/02/2015 07:58 AM, Juergen Gross wrote:

On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:



On 08/20/2015 02:16 PM, Juergen Groß wrote:

On 08/18/2015 05:55 PM, Dario Faggioli wrote:

Hey everyone,

So, as a followup of what we were discussing in this thread:

  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest

http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html



I started looking in more detail at scheduling domains in the Linux
kernel. Now, that thread was about CPUID and vNUMA, and their weird way
of interacting, while this thing I'm proposing here is completely
independent from them both.

In fact, no matter whether vNUMA is supported and enabled, and no matter
whether CPUID is reporting accurate, random, meaningful or completely
misleading information, I think that we should do something about how
scheduling domains are built.

Fact is, unless we use 1:1, and immutable (across all the guest
lifetime) pinning, scheduling domains should not be constructed, in
Linux, by looking at *any* topology information, because that just does
not make any sense, when vcpus move around.

Let me state this again (hoping to make myself as clear as possible): no
matter how good a shape we put CPUID support in, no matter how
beautifully and consistently that will interact with vNUMA,
licensing requirements and whatever else, it will always be possible for
vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
on two different NUMA nodes at time t2. Hence, the Linux scheduler
should really not skew its load balancing logic toward either of those
two situations, as neither of them could be considered correct (since
nothing is!).

For now, this only covers the PV case. HVM case shouldn't be any
different, but I haven't looked at how to make the same thing happen in
there as well.

OVERALL DESCRIPTION
===================
What this RFC patch does is, in the Xen PV case, configure scheduling
domains in such a way that there is only one of them, spanning all the
vCPUs of the guest.

Note that the patch deals directly with scheduling domains, and there is
no need to alter the masks that will then be used for building and
reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
the main difference between it and the patch proposed by Juergen here:

http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html

This means that when, in future, we will fix CPUID handling and make it
comply with whatever logic or requirements we want, that won't have any
unexpected side effects on scheduling domains.

Information about how the scheduling domains are being constructed
during boot is available in `dmesg', if the kernel is booted with the
'sched_debug' parameter. It is also possible to look
at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.

With the patch applied, only one scheduling domain is created, called
the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
tell that from the fact that every cpu* folder
in /proc/sys/kernel/sched_domain/ has only one subdirectory
('domain0'), with all the tweaks and the tunables for our scheduling
domain.
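
For illustration only (this is not part of the patch): a minimal sketch
of that check, assuming the sysctl layout described above, i.e. one
cpuN directory per CPU, each containing one domainN subdirectory per
scheduling domain level. The helper names list_sched_domains and
is_flat are hypothetical, and kernels other than the one discussed in
this thread may expose this information at a different path.

```python
import os
import re


def list_sched_domains(root="/proc/sys/kernel/sched_domain"):
    """Map each cpuN directory under root to its sorted domainN subdirectories."""
    result = {}
    for cpu in os.listdir(root):
        cpu_path = os.path.join(root, cpu)
        if not re.fullmatch(r"cpu\d+", cpu) or not os.path.isdir(cpu_path):
            continue
        result[cpu] = sorted(
            d for d in os.listdir(cpu_path)
            if re.fullmatch(r"domain\d+", d)
            and os.path.isdir(os.path.join(cpu_path, d))
        )
    return result


def is_flat(domains_by_cpu):
    """True if at least one CPU was found and every CPU has only 'domain0'."""
    return bool(domains_by_cpu) and all(
        domains == ["domain0"] for domains in domains_by_cpu.values()
    )


if __name__ == "__main__":
    root = "/proc/sys/kernel/sched_domain"
    if os.path.isdir(root):
        domains = list_sched_domains(root)
        for cpu, levels in sorted(domains.items()):
            print(cpu, levels)
        print("flat hierarchy:", is_flat(domains))
    else:
        # The directory is an assumption about the running kernel's layout.
        print(root, "not found on this kernel")
```

On a guest where the hierarchy has been flattened, every CPU would be
expected to list a single domain0 entry; more than one entry per CPU
means additional topology levels are still in use.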

EVALUATION
==========

I've tested this with UnixBench, and by looking at Xen build time, on
16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0 only, for
now, but I plan to re-run them in DomUs soon (Juergen may be doing
something similar to this in DomU already, AFAUI).

I've run the benchmarks with and without the patch applied ('patched'
and 'vanilla', respectively, in the tables below), and with different
numbers of build jobs (in case of the Xen build) or of parallel copies
of the benchmarks (in the case of UnixBench).

What I get from the numbers is that the patch almost always brings
benefits, in some cases even huge ones. There are a couple of cases
where we regress, but always only slightly so, especially if comparing
that to the magnitude of some of the improvements that we get.

Bear also in mind that these results are gathered from Dom0, and without
any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
we move things in DomU and do overcommit at the Xen scheduler level, I
am expecting even better results.

...

REQUEST FOR COMMENTS
====================
Basically, the kind of feedback I'd be really glad to hear is:
  - what you guys think of the approach,


Yesterday at the end of the developer meeting we (Andrew, Elena and
myself) discussed this topic again.

Regarding a possible future scenario with credit2 eventually supporting
gang scheduling on hyperthreads (which is desirable due to security
reasons [side channel attack] and fairness) my patch seems to be more
suited for that direction than yours. Correct me if I'm wrong, but I
think scheduling domains won't enable the guest kernel's scheduler to
migrate threads more easily between hyperthreads as opposed to other
vcpus, while my approach can easily be extended to do so.

  - whether you think, looking at this preliminary set of numbers, that
    this is something worth continuing investigating,


I believe as both approaches lead to the same topology information used
by the scheduler (all vcpus are regarded as being equal) your numbers
should apply to my patch as well. Would you mind verifying this?


If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
both of your patches?


Hmm, sort of.

OTOH this would make it hard to make use of some of the topology
information in case of e.g. pinned vcpus (as George pointed out).

I didn't mean to just set has_mp to zero unconditionally (for Xen, or
any other, guest). We'd need to have some logic as to when to set it to
false.


-boris

Also, it seems to me that Xen guests would not be the only ones having
to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
guests, for example, have the same problem? And if yes, perhaps we
should try solving it in a non-Xen-specific way (especially given that
both of those patches look pretty simple and thus are presumably easy to
integrate into common code).


Indeed. I'll have a try.

And, as George already pointed out, this should be an optional feature
--- if a guest spans physical nodes and VCPUs are pinned then we don't
always want flat topology/domains.


Yes, it might be a good idea to be able to keep some of the topology
levels. I'll modify my patch to make this command line selectable.


Juergen


