Xen project Mailing List

Re: [Xen-devel] [PATCH] xen: make sure the node-affinity is always updated

To: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>

From: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Date: Fri, 13 Sep 2013 14:43:56 +0200

Cc: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Fri, 13 Sep 2013 14:50:29 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On ven, 2013-09-13 at 11:45 +0100, George Dunlap wrote: > On Thu, Sep 12, 2013 at 4:57 PM, Dario Faggioli > > > diff --git a/xen/common/domain.c b/xen/common/domain.c > > index 5999779..f708b3c 100644 > > --- a/xen/common/domain.c > > +++ b/xen/common/domain.c > > @@ -374,28 +373,35 @@ void domain_update_node_affinity(struct domain *d) > > cpumask_or(cpumask, cpumask, online_affinity); > > > + if ( !d->auto_node_affinity ) > > { > > /* Node-affinity is provided by someone else, just filter out cpus > > * that are either offline or not in the affinity of any vcpus. */ > > - nodemask = d->node_affinity; > > for_each_node_mask ( node, d->node_affinity ) > > if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) ) > > - node_clear(node, nodemask);//d->node_affinity); > > + node_clear(node, d->node_affinity); > > + > > + /* > > + * If we end up with an empty node-affinity, e.g., because the user > > + * specified an incompatible vcpu-affinity, or moved the domain to > > + * a different cpupool, reset the node-affinity and re-calculate it > > + * (in the body of the if() below). > > + * > > + * This is necessary to prevent the schedulers from attempting > > + * node-affinity load balancing on empty cpumasks, with > > (potentially) > > + * nasty effects (increased overhead or even crash!). > > + */ > > + if ( nodes_empty(d->node_affinity) ) > > + d->auto_node_affinity = 1; > > Hmm -- silently erasing one set of parameters and changing it to > "auto" doesn't seem that great from a UI perspective -- "principle of > least surprise", and all that. > I know, and I agree. That is actually why I had quite an hard time trying to figure out what it would be best done to fix the issue. :-( > Also, consider someone who has already set both a node and a vcpu > affinity, and decides they now want to change it such that the new > vcpu affinity doesn't intersect with the old node affinity, and does > the following: > > 1. Set new node affinity. The update notices that the node & vcpu > affinity clash, and effectively removes the node affinity > 2. Set new vcpu affinity. > Well, that can't happen in that exact way, since it is _NOT_ possible to change node-affinity on line. It is decided at domain creation time and remains unchanged for the entire life of the domain. I did it like this because, since node-affinity controls where the memory is allocated, I think it would be rather confusing to allow for it to change, and then leave the memory where it is. I've got patches to enable that already, and I'll push them out together with the memory migration work. However, it is true that, before this patch, if someone sets a vcpu-affinity which conflicts with the node-affinity nothing happens to the node-affinity. After this patch that change and the node-affinity will be reset to auto, and as I said, I don't like that either. > Since node affinity is just advisory, wouldn't it make more sense to > make the scheduling system tolerate a disjoint set of affinities? If > we could make cpu_cycle indicate somehow that the mask is empty, > instead of crashing, then we can detect that and skip the "look for an > idle processor within the NUMA affinity" step. > That was the route I took for the first approach. It is indeed possible to make the scheduler deal with that, especially since that is what it happens already (and that was by design!) in all other places: if node-affinity and vcpu-affinity mismatch, what you say is exactly what happens. However, when cpupools are involved (or, in general, where there is the need to filter things against the online cpus, as in _cshced_cpu_pick() ), we have a third mask coming into play, which I thought I was handling correctly, but was not. The outcome of that being that taking care of that properly in the scheduler will add some overhead (a bit more of cpumask_* manipulation) in a quite hot path, and that is going to be payed by everyone, just for the sake of this quite uncommon situation. Anyway, let me try to look harder into that direction, perhaps I could rearrange the stuff in such a way that the cost is not that higher. > For cpupool_numa split, it's probably a good idea to do one of the following: > 1. Detect the numa affinity and place it in the appropriate pool > 2. Detect the conflict and break the affinity from xl (giving a > warning that you're doing so) > 3. Detect the conflict and just give a warning. > cpupools already (potentially) breaks vcpu-affinity, and ISTR, the warning is given in the form of a DEBUG_PRINTK() in Xen (or, anyway, I've seen it in the logs, not on xl's stdout or stderr). The reason why we can't do something too fancy for cpupool-numa-split only is that it is entirely implemented in xl, by using the basic building blocks of the cpupool interface (add-a-cpu-to-pool, remove-a-cpu-to-pool, etc.). The tricky part is _when_ to move the domain in the new pool, and whether that will work or not really depends on how we deal with the issue above (i.e., what do we do when the masks ends up in an empty intersection state). Let's see... > But that's a UI polish thing, rather than a bug fix, so is slightly > lower priority. > Sure, but I was really concerned about making the interface a mess, even if that happens by fixing the bug. I'll think about it a bit more, and see if I can come up with a different, less "surprising" solution from the UI standpoint in the next hours, ok? Thanks for your comments, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.