Xen project Mailing List

Re: [Xen-devel] [PATCH 09 of 11] libxl, xl: enable automatic placement of guests on NUMA nodes

Dario Faggioli writes ("[PATCH 09 of 11] libxl, xl: enable automatic placement of guests on NUMA nodes"): > If a domain does not have a VCPU affinity, try to pin it automatically > to some PCPUs. This is done taking into account the NUMA characteristics > of the host: we look for a combination of host's NUMA nodes that has enough > free memoy for the new domain, and pin it to the VCPUs of those nodes. > Smaller combinations are considered first, to avoid spreading the > domain's memory among too many nodes. Thanks for this. Here are my comments: > +static void libxl_nodemap_rand_init(libxl_nodemap *nodemap) > +{ > + int i; > + nodemap->size = rand() % 16; > + nodemap->map = calloc(nodemap->size, sizeof(*nodemap->map)); > + libxl_for_each_node(i, *nodemap) { > + if (rand() % 2) > + libxl_nodemap_set(nodemap, i); > + else > + libxl_nodemap_reset(nodemap, i); > + } > +} For your random number generation, please use nrand48, with a seed in the libxl ctx. (This means you'll need to take out the ctx lock.) And provide a way to set the seed. > +/* Automatic NUMA placement */ > +libxl_numa_candidate *libxl_domain_numa_candidates(libxl_ctx *ctx, > + libxl_domain_build_info *b_info, > + int min_nodes, int *nr_cndts); > +int libxl_numa_candidate_add_cpus(libxl_ctx *ctx, > + int min_cpus, int max_nodes, > + libxl_numa_candidate *candidate); > +void libxl_numa_candidates_list_free(libxl_numa_candidate *list, int nr); This interface documentation is deficient, I'm afraid. Please explain how these functions are supposed to be used. > diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c > --- a/tools/libxl/libxl_json.c > +++ b/tools/libxl/libxl_json.c > @@ -119,6 +119,26 @@ out: > return s; > } > > +yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand, > + libxl_nodemap *nodemap) > +{ > + yajl_gen_status s; > + int i; > + > + s = yajl_gen_array_open(hand); > + if (s != yajl_gen_status_ok) goto out; > + > + libxl_for_each_node(i, *nodemap) { > + if (libxl_nodemap_test(nodemap, i)) { > + s = yajl_gen_integer(hand, i); > + if (s != yajl_gen_status_ok) goto out; > + } > + } > + s = yajl_gen_array_close(hand); > +out: > + return s; > +} This is a copy of _cpumap_gen_json. Bad enough to have one of these, let along two. > +/* > + * The following two uility functions compute the node maps > + * coresponding to the [ n! / (k! * (n - k)!) ] combinations > + * of @size nodes within a set that is @nr_nodes big, without > + * repetition. The algorithm used for generating them uses > + * a vector containing the indexes in the set of the elements > + * of the i-eth combination to generate the (i+1)-eth, and > + * ensures they come in lexicographic order. Can you please avoid this `@' ? We aren't using doxygen (and never will I hope) and it's just noise. Also I think the "..." here (and later) are wrong since these aren't strings. If you want to talk about them as arrays and want something to indicate the grouping you could use { }. > +static int numa_node_set_next(int nr_nodes, int size, int *idxs, > + libxl_nodemap *nodemap) You should at least write int idxs[] and explain that the caller is expected to provide an array of size `size' which is private to the implementation of numa_node_set_... > +{ > + int i; > + > + /* Check whether we just need to increase the rightmost index */ > + if (idxs[size - 1] < nr_nodes - 1) { > + idxs[size - 1]++; > + goto build_nodemap; Is there something wrong with `else' ? Or maybe you mean `goto out' ? But, I think in fact this special case is unnecessary, isn't it ? > + /* Find the rightmost element from where to start the increasing */ > + for (i = size - 1; idxs[i] == nr_nodes - size + i; i--) { Since if idxs[size-1] == nr_nodes-1 this loop's first test of the condition with i == size-1 becomes idxs[size-1]==nr_nodes-1 and we therefore execute this > + for (idxs[i]++, i += 1; i < size; i++) > + idxs[i] = idxs[i - 1] + 1; with i==size-1 and thus we increment idxs[size-1] and quit the loop right away. Furthermore, that last loop is really rather confusing. The function would benefit from a comment describing the algorith. `i += 1' is un-idiomatic, too. > +libxl_numa_candidate *libxl_domain_numa_candidates(libxl_ctx *ctx, > + libxl_domain_build_info *b_info, > + int min_nodes, int *nr_cndts) I think it would be better if this function returned a libxl error code. That way in would be possible to distinguish the various error cases with different error codes. You should define new error codes if you need them. For the libxl internal subcalls, do this: rc = libxl_get_some_information(CTX, ...); if (rc) goto out; > + suitable_nodes_idx = malloc(sizeof(int) * nr_nodes); Use GCNEW_ARRAY and ditch the error handling. You will need to GC_INIT and GC_FREE. This will give you a gc and then you should write `CTX' rather than `ctx' so that your code can easily be moved into an inner function. > + /* Generate the combinations and check their cumulative free memory > */ > + numa_node_set_init(nr_nodes, min_nodes, suitable_nodes_idx, > + &suitable_nodes); > + do { > + nodes_memkb = 0; > + libxl_for_each_set_node(i, suitable_nodes) > + nodes_memkb += ninfo[i].free / 1024; This should be a for loop. If necessary, numa_node_set_* should be changed so that they can easily be used in a for loop. This would be acceptable, for example: for (rc = numa_node_set_init(nr_nodes, min_nodes, suitable_nodes_idx, &suitable_nodes); !rc; rc = numa_node_set_next(nr_nodes, min_nodes, suitable_nodes_idx, &suitable_nodes)) { Perhaps numa_node_set_ should take a struct for all these arguments. Then you could write: for (rc = numa_node_set_init(&numa_node_set_iter, nr_nodes, min_nodes); !rc; rc = numa_node_set_next(&numa_node_set_iter) { Or even better if you changed the interface a bit: for (numa_node_set_init(gc, &numa_node_set_iter, nr_nodes, min_nodes); numa_node_set_get(&numa_node_set_iter, &suitable_nodes); numa_node_set_next(&numa_node_set_iter) { which would allow you to make numa_node_set_iter entirely opaque and move the allocation of the idx array into the numa_node_set_* functions. > + cndts = realloc(cndts, sizeof(cndts[0]) * (*nr_cndts+1)); > + if (cndts == NULL) { Firstly, use libxl__zrealloc(0, ...) and ditch the error handling. Secondly, this reallocs the array every time we add a node. It would be better to keep a separate note of the array size and reallocate in bigger chunks. > + out_nodemap_idx: > + free(suitable_nodes_idx); > + out_nodemap: > + libxl_nodemap_dispose(&suitable_nodes); > + out_numainfo: > + libxl_numainfo_list_free(ninfo, nr_nodes); > + out: > + return cndts; > +} Please don't use this error handling style. Instead, initialise all the resource variables to 0 right at the top of the function, and unconditionally free them. For the output variable cndts, you can unconditionally free it at the end if you do this on the success exit path: > +int libxl_numa_candidate_add_cpus(libxl_ctx *ctx, > + int min_cpus, int max_nodes, > + libxl_numa_candidate *candidate) > +{ > + int dom_nodes, nodes_cpus; > + libxl_cputopology *tinfo; > + libxl_numainfo *ninfo; > + int nr_nodes, nr_cpus; > + int i, rc; > + > + tinfo = libxl_get_cpu_topology(ctx, &nr_cpus); > + if (tinfo == NULL) { > + LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "libxl_get_topologyinfo failed\n"); > + rc = ERROR_NOMEM; We aren't supposed to be returning ERROR_NOMEM any more because all our memory allocation is now assumed to be infallible. Shouldn't this be ERROR_FAIL ? (And in the next stanza.) > + if (max_nodes == -1 || max_nodes > nr_nodes) > + max_nodes = nr_nodes; I'm afraid I can't really make much sense of this algorithm without thinking very hard. Maybe if the interface doc comment had explained what it was supposed to do it would be obvious... > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -490,6 +490,122 @@ static void split_string_into_string_lis ... > +/* How many PCPUs are there on each node? */ > +static int cpus_per_node(libxl_cputopology *tinfo, int nr_cpus) > +{ > + libxl_numainfo *ninfo; > + int nr_nodes, j, i; > + int cpus_nodes = 0; > + > + ninfo = libxl_get_numainfo(ctx, &nr_nodes); > + if (ninfo == NULL) > + return 0; > + > + /* This makes sense iff # of PCPUs is the same for all nodes */ And what if it isn't ? Later I see: > + if (cpus_nodes != 0) > + dom_min_nodes = (dom_needs_cpus + cpus_nodes - 1) / cpus_nodes; > + else > + dom_min_nodes = 1; which seems to suggest we'll pile the whole thing onto one pcpu. > +/* Try to achieve "optimal" NUMA placement */ > +static int place_domain(libxl_domain_build_info *b_info) > +{ ... + if (nr_pools > 1) { > + fprintf(stderr, "skip numa placement as cpupools are being used\n"); > + err = 0; > + goto out_poolinfo; > + } Perhaps in the case of cpupools the right answer is to consider only those pcpus in the intended pool ? Or is that likely to do more harm than good, with people who are currently doing their numa placement with cpupools ? > + tinfo = libxl_get_cpu_topology(ctx, &nr_cpus); > + if (tinfo == NULL) { > + fprintf(stderr, "libxl_get_topologyinfo failed.\n"); > + err = ENOMEM; libxl functions should return libxl error codes, not errno values. You will probably need to introduce some. > + if (err == ERROR_FAIL) { > + /* Report back we didn't find a candidate with enough cpus */ > + err = ESRCH; I don't think we should ever turn ERROR_FAIL into something else. ERROR_FAIL means `something went wrong in a way that shouldn't happen'. > +out_topologyinfo: > + libxl_cputopology_list_free(tinfo, nr_cpus); > +out_poolinfo: > + for (i = 0; i < nr_pools; i++) > + libxl_cpupoolinfo_dispose(&pinfo[i]); > +out: > + return err; > +} Again, please use an idempotent error handling style. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.