Xen project Mailing List

Re: [Xen-devel] [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes

From: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>

Date: Mon, 23 Jul 2012 12:38:13 +0100

Cc: Andre Przywara <andre.przywara@xxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Mon, 23 Jul 2012 11:38:34 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Dario Faggioli writes ("[PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > If a domain does not have a VCPU affinity, try to pin it automatically > to some PCPUs. This is done taking into account the NUMA characteristics > of the host. In fact, we look for a combination of host's NUMA nodes > with enough free memory and number of PCPUs for the new domain, and pin > it to the VCPUs of those nodes. I'm afraid your new algorithm appears to rely on an inconsistent comparison function: > +static int numa_cmpf(const libxl__numa_candidate *c1, > + const libxl__numa_candidate *c2) > +{ > + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); > + double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus); > + > + return SIGN(freememkb_diff + 3*nrvcpus_diff); > +} I don't think this comparison function necessarily defines a partial order. Can you provide a proof that it does ? I think you probably can't. If you think it _is_ a partial order please let me know and I will try to construct a counterexample. I think your comparison function should be of the form: static int numa_cmpf(const libxl__numa_candidate *a, const libxl__numa_candidate *b) { scorea = score(a); scoreb = score(b); return SIGN(scorea - scoreb); /* if scores are smalllish ints, can just return scorea-scoreb */ } or more generally: static int numa_cmpf(const libxl__numa_candidate *a, const libxl__numa_candidate *b) { score1a = score1(a); score1b = score1(b); if (score1a != score1b) return SIGN(score1a - score1b); /* if scores are smalllish ints, can just return score1a-score1b */ score2a = score2(a); score2b = score2(b); if (score2a != score2b) return SIGN(score2a - score2b); /* etc. */ return 0; } That clearly defines a partial order. The important point is that the functions score() must each be calculated only from the node in question. score() must not take the other candidate as an argument. Your normalized_diff function violates this. If you like you could calculate the scores once for each node and store them in libxl__numa_candidate rather than the ingredients for the score (max_memkb) etc., in which case the comparison function is a simple lexical comparison of the scores. > +/* The actual automatic NUMA placement routine */ > +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) > +{ ... > + /* Not even a suitable placement candidate! Let's just don't touch the > + * domain's info->cpumap. It will have affinity with all nodes/cpus. */ > + if (found == 0) > + goto out; This probably deserves a log message. > @@ -107,7 +203,22 @@ int libxl__build_pre(libxl__gc *gc, uint > uint32_t rtc_timeoffset; > > xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); > + > + /* > + * Check if the domain has any CPU affinity. If not, try to build up one. > + * In case numa_place_domain() find at least a suitable candidate, it > will This line is 78 characters, leading to wrap damage in my email reply to you. The official coding style says you may use up to 80 which is tolerable in the odd line of code but it would be nice if wordwrapped comments were wrapped to a shorter distance. > + * affect info->cpumap accordingly; if it does not, it just leaves it > + * as it is. This means (unless some weird error manifests) the > subsequent > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > + * whatever that turns out to be. > + */ > + if (libxl_bitmap_is_full(&info->cpumap)) { > + int rc = numa_place_domain(gc, info); > + if (rc) > + return rc; > + } > libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); > + As discussed, this is wrong; the fix is in your 2/4 but needs to be folded into this patch I think. Patches should not introduce new bugs even if they're about to be fixed in the next commit. > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2216,6 +2216,125 @@ static inline void libxl__ctx_unlock(lib ... > + * The intended usage is as follows: > + * 1. fist of all, call libxl__get_numa_candidates(), and specify the proper first > +/* > + * Looks for the placement candidates that satisfyies some specific > + * conditions and return the best one according to the provided > + * comparison function. > + */ > +int libxl__get_numa_candidate(libxl__gc *gc, > + uint32_t min_free_memkb, int min_cpus, > + int min_nodes, int max_nodes, > + libxl__numa_candidate_cmpf numa_cmpf, > + libxl__numa_candidate *cndt_out, > + int *cndt_found) > +{ ... > + /* > + * The good thing about this solution is that it is based on heuristics > + * (implemented in numa_cmpf() ), but we at least can evaluate it on > + * all the possible placement candidates. That can happen because the > + * number of nodes present in current NUMA systems is quite small. > + * In fact, even if a sum of binomials is involved, if the system has > + * up to 8 nodes it "only" takes 256 steps. At the same time, 8 is a > + * sensible value, as it is exactly the number of nodes the biggest > + * NUMA systems provide at the time of this writing (and will probably > + * continue to do so for a while). However, computanional complexity > + * would explode on systems much bigger than that. 16 nodes system would > + * still be fine (it will be 65536 steps), but it's really importante we important Isn't this an argument that the limit should be 16 ? (Also 65535 steps since placement on 0 nodes is typically not an option.) > + * avoid trying to run this on monsters with 32, 64 or more nodes (if > + * they ever pop into being). Therefore, here it comes a safety catch > + * that disables the algorithm for the cases when it wouldn't work well. > + */ > + if (nr_nodes > 8) { > + /* Log we did nothing and return 0, as no real error occurred */ > + LOG(WARN, "System has %d NUMA nodes, which is too big for the " > + "placement algorithm to work effectively. Skipping it", > + nr_nodes); It would be nice if this message suggested pinning the vcpus. > + if (min_nodes > nr_nodes) > + min_nodes = nr_nodes; > + if (!max_nodes || max_nodes > nr_nodes) > + max_nodes = nr_nodes; > + if (min_nodes > max_nodes) { > + rc = ERROR_INVAL; > + goto out; This should be accompanied by a log message. > + while (min_nodes <= max_nodes && *cndt_found == 0) { > + comb_iter_t comb_iter; > + int comb_ok; > + > + /* > + * And here it is. Each step of this cycle generates a combination of > + * nodes as big as min_nodes mandates. Each of these combinations is > + * checked against the constraints provided by the caller (namely, > + * amount of free memory and number of cpus) and it can concur to > + * become our best placement iff it passes the check. > + */ Indentation error. Thanks, Ian. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.