
Re: [Xen-devel] [PATCH 09 of 11] libxl, xl: enable automatic placement of guests on NUMA nodes



On Thu, 2012-05-31 at 16:02 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH 09 of 11] libxl, xl: enable automatic 
> placement of guests on NUMA nodes"):
> > If a domain does not have a VCPU affinity, try to pin it automatically
> > to some PCPUs. This is done taking into account the NUMA characteristics
> > of the host: we look for a combination of host's NUMA nodes that has enough
> > free memory for the new domain, and pin it to the PCPUs of those nodes.
> > Smaller combinations are considered first, to avoid spreading the
> > domain's memory among too many nodes.
> 
> Thanks for this.  Here are my comments:
> 
Thanks to you. :-)

> > +static void libxl_nodemap_rand_init(libxl_nodemap *nodemap)
> > +{
> > +    int i;
> > +    nodemap->size = rand() % 16;
> > +    nodemap->map = calloc(nodemap->size, sizeof(*nodemap->map));
> > +    libxl_for_each_node(i, *nodemap) {
> > +        if (rand() % 2)
> > +            libxl_nodemap_set(nodemap, i);
> > +        else
> > +            libxl_nodemap_reset(nodemap, i);
> > +    }
> > +}
> 
> For your random number generation, please use nrand48, with a seed in
> the libxl ctx.  (This means you'll need to take out the ctx lock.)
> And provide a way to set the seed.
> 
Sounds reasonable, I'll look into this. As a side note, since I
deliberately took inspiration from libxl_cpumap_rand_init() for this,
should I apply what you just said to that function too?
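
In any case, here's roughly what I have in mind, as a standalone
sketch (toy_ctx and the helper are made up: in libxl the state would
live in the ctx and be touched only under the ctx lock):

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy stand-in for the ctx carrying the 48-bit PRNG state. */
    struct toy_ctx {
        unsigned short rand_state[3];
    };

    static void toy_ctx_set_rand_seed(struct toy_ctx *ctx,
                                      unsigned long long seed)
    {
        ctx->rand_state[0] = seed & 0xffff;
        ctx->rand_state[1] = (seed >> 16) & 0xffff;
        ctx->rand_state[2] = (seed >> 32) & 0xffff;
    }

    int main(void)
    {
        struct toy_ctx ctx;
        int i;

        toy_ctx_set_rand_seed(&ctx, 42);
        /* nrand48() returns a long in [0, 2^31) and advances the
         * caller-supplied state: no global PRNG state involved,
         * and runs are reproducible given the seed. */
        for (i = 0; i < 8; i++)
            printf("%ld ", nrand48(ctx.rand_state) % 16);
        printf("\n");
        return 0;
    }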

> > +/* Automatic NUMA placement */
> > +libxl_numa_candidate *libxl_domain_numa_candidates(libxl_ctx *ctx,
> > +                                        libxl_domain_build_info *b_info,
> > +                                        int min_nodes, int *nr_cndts);
> > +int libxl_numa_candidate_add_cpus(libxl_ctx *ctx,
> > +                                  int min_cpus, int max_nodes,
> > +                                  libxl_numa_candidate *candidate);
> > +void libxl_numa_candidates_list_free(libxl_numa_candidate *list, int nr);
> 
> This interface documentation is deficient, I'm afraid.  Please explain
> how these functions are supposed to be used.
> 
Sure, as per the other mail, I'll put the API documentation here.
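
Just to check the level of detail you're after, something along these
lines (wording to be refined):

    /*
     * libxl_domain_numa_candidates() looks for combinations of at
     * least min_nodes host NUMA nodes with enough free memory for the
     * domain described by b_info, and returns them as an array of
     * *nr_cndts candidates, smaller combinations first.
     *
     * libxl_numa_candidate_add_cpus() takes one such candidate and,
     * if it spans fewer than min_cpus pcpus, adds more nodes to it,
     * never exceeding max_nodes.
     *
     * The caller owns the returned array and releases it with
     * libxl_numa_candidates_list_free().
     */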

> > +yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand,
> > +                                       libxl_nodemap *nodemap)
> > +{
> > +    yajl_gen_status s;
> > +    int i;
> > +
> > +    s = yajl_gen_array_open(hand);
> > +    if (s != yajl_gen_status_ok) goto out;
> > +
> > +    libxl_for_each_node(i, *nodemap) {
> > +        if (libxl_nodemap_test(nodemap, i)) {
> > +            s = yajl_gen_integer(hand, i);
> > +            if (s != yajl_gen_status_ok) goto out;
> > +        }
> > +    }
> > +    s = yajl_gen_array_close(hand);
> > +out:
> > +    return s;
> > +}
> 
> This is a copy of _cpumap_gen_json.  Bad enough to have one of these,
> let alone two.
> 
So, perhaps you're suggesting merging the two of them?
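
If yes, I'd go for a single helper with thin wrappers on top, something
like this (untested; both maps are just byte-array bitmaps):

    #include <stdint.h>
    #include <yajl/yajl_gen.h>

    static yajl_gen_status libxl__bitmap_gen_json(yajl_gen hand,
                                                  const uint8_t *map,
                                                  int nr_bits)
    {
        yajl_gen_status s;
        int i;

        s = yajl_gen_array_open(hand);
        if (s != yajl_gen_status_ok) goto out;

        for (i = 0; i < nr_bits; i++) {
            if (map[i / 8] & (1 << (i % 8))) {  /* is bit i set? */
                s = yajl_gen_integer(hand, i);
                if (s != yajl_gen_status_ok) goto out;
            }
        }
        s = yajl_gen_array_close(hand);
    out:
        return s;
    }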

> > +static int numa_node_set_next(int nr_nodes, int size, int *idxs,
> > +                              libxl_nodemap *nodemap)
> 
> You should at least write     int idxs[]    and explain that the
> caller is expected to provide an array of size `size' which is private
> to the implementation of numa_node_set_...
> 
> <snip>
>
I like this and the other suggestions about restructuring the
combination generation and iteration, and I'll also try to improve the
comments and the documentation.
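
For the iteration itself, I will probably restructure things around the
usual `next combination' stepping, something like this standalone
sketch (names made up):

    #include <stdio.h>

    /* Advance idxs[] (a strictly increasing array of `size' node
     * indexes, private to the iterator) to the next size-combination
     * out of nr_nodes nodes.  Returns 1 on success, 0 when all the
     * combinations have been generated.  The caller initialises
     * idxs[] to 0, 1, ..., size-1. */
    static int comb_next(int nr_nodes, int size, int idxs[])
    {
        int i = size - 1;

        /* Find the rightmost index that can still move right. */
        while (i >= 0 && idxs[i] == nr_nodes - size + i)
            i--;
        if (i < 0)
            return 0;

        idxs[i]++;
        for (i++; i < size; i++)
            idxs[i] = idxs[i - 1] + 1;
        return 1;
    }

    int main(void)
    {
        int idxs[2] = { 0, 1 };  /* first 2-combination of 4 nodes */

        do {
            printf("{%d,%d}\n", idxs[0], idxs[1]);
        } while (comb_next(4, 2, idxs));
        return 0;
    }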

> > +libxl_numa_candidate *libxl_domain_numa_candidates(libxl_ctx *ctx,
> > +                                        libxl_domain_build_info *b_info,
> > +                                        int min_nodes, int *nr_cndts)
> 
> I think it would be better if this function returned a libxl error
> code.  That way it would be possible to distinguish the various error
> cases with different error codes.  You should define new error codes
> if you need them.
> 
> For the libxl internal subcalls, do this:
>     rc = libxl_get_some_information(CTX, ...);
>     if (rc) goto out;
> 
Ok, I will move the array of candidates to the output parameters...
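
...and return a libxl error code, i.e., turn the prototype into
something like this (sketch):

    /* Returns 0 or a libxl error code; on success *cndts_r points to
     * a freshly allocated array of *nr_cndts_r candidates. */
    int libxl_domain_numa_candidates(libxl_ctx *ctx,
                                     libxl_domain_build_info *b_info,
                                     int min_nodes,
                                     libxl_numa_candidate **cndts_r,
                                     int *nr_cndts_r);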

> > +                cndts = realloc(cndts, sizeof(cndts[0]) * (*nr_cndts+1));
> > +                if (cndts == NULL) {
> 
> Firstly, use libxl__zrealloc(0, ...) and ditch the error handling.
> Secondly, this reallocs the array every time we add a node.  It would
> be better to keep a separate note of the array size and reallocate in
> bigger chunks.
> 
Yes, that's very bad, I agree.
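
I'll keep the allocated length in its own variable and grow the array
in chunks, something like this (sketch; cndts_allocd and new_candidate
are made-up locals):

    /* Doubling means O(log N) reallocs for N candidates instead of N;
     * with libxl__zrealloc(0, ...) allocation is infallible, so no
     * NULL check is needed. */
    if (*nr_cndts == cndts_allocd) {
        cndts_allocd = cndts_allocd ? cndts_allocd * 2 : 16;
        cndts = libxl__zrealloc(0, cndts,
                                sizeof(cndts[0]) * cndts_allocd);
    }
    cndts[(*nr_cndts)++] = new_candidate;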

> > + out_nodemap_idx:
> > +    free(suitable_nodes_idx);
> > + out_nodemap:
> > +    libxl_nodemap_dispose(&suitable_nodes);
> > + out_numainfo:
> > +    libxl_numainfo_list_free(ninfo, nr_nodes);
> > + out:
> > +    return cndts;
> > +}
> 
> Please don't use this error handling style.  Instead, initialise all
> the resource variables to 0 right at the top of the function, and
> unconditionally free them.
> 
Ok.
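
I.e., I will restructure it along these lines (just a sketch;
libxl_nodemap_init() is made up, the point being that everything starts
out initialised so the single exit path can free unconditionally):

    libxl_numa_candidate *cndts = NULL;
    libxl_numainfo *ninfo = NULL;
    int *suitable_nodes_idx = NULL;
    int nr_nodes = 0;
    libxl_nodemap suitable_nodes;

    libxl_nodemap_init(&suitable_nodes);

    /* ... body, with `if (...) goto out;' wherever needed ... */

 out:
    free(suitable_nodes_idx);                  /* free(NULL) is a no-op */
    libxl_nodemap_dispose(&suitable_nodes);
    libxl_numainfo_list_free(ninfo, nr_nodes); /* (NULL, 0) is harmless */
    return cndts;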

> For the output variable cndts, you can unconditionally free it at the
> end if you do this on the success exit path:
> 
No, that is up to the caller to free...

> > +int libxl_numa_candidate_add_cpus(libxl_ctx *ctx,
> > +                                  int min_cpus, int max_nodes,
> > +                                  libxl_numa_candidate *candidate)
> > +{
> > +    int dom_nodes, nodes_cpus;
> > +    libxl_cputopology *tinfo;
> > +    libxl_numainfo *ninfo;
> > +    int nr_nodes, nr_cpus;
> > +    int i, rc;
> > +
> > +    tinfo = libxl_get_cpu_topology(ctx, &nr_cpus);
> > +    if (tinfo == NULL) {
> > +        LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "libxl_get_topologyinfo failed\n");
> > +        rc = ERROR_NOMEM;
> 
> We aren't supposed to be returning ERROR_NOMEM any more because all
> our memory allocation is now assumed to be infallible.  Shouldn't this
> be ERROR_FAIL ?  (And in the next stanza.)
> 
Right. What I need is to distinguish, in the caller, whether this failed
because it couldn't find and add enough CPUs --which would mean "bad,
but we just reset the affinity and avoid placing the domain"-- or for
some other reason --which would mean "bad, we must bail out". But I
guess I can manage doing this with the existing ERROR_s (or by adding
my own, if that is the case).
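
What I have in mind on the caller's side is something like this
(ERROR_NUMA_NOCANDIDATE is just a name I'm making up for a possible
new error code):

    rc = libxl_numa_candidate_add_cpus(ctx, min_cpus, max_nodes, &cndt);
    if (rc == ERROR_NUMA_NOCANDIDATE) {
        /* Not a hard failure: reset the affinity and just
         * skip automatic placement for this domain. */
    } else if (rc) {
        /* Genuine failure: bail out. */
        goto out;
    }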

> > +/* How many PCPUs are there on each node? */
> > +static int cpus_per_node(libxl_cputopology *tinfo, int nr_cpus)
> > +{
> > +    libxl_numainfo *ninfo;
> > +    int nr_nodes, j, i;
> > +    int cpus_nodes = 0;
> > +
> > +    ninfo = libxl_get_numainfo(ctx, &nr_nodes);
> > +    if (ninfo == NULL)
> > +        return 0;
> > +
> > +    /* This makes sense iff # of PCPUs is the same for all nodes */
> 
> And what if it isn't ?  Later I see:
> 
> > +    if (cpus_nodes != 0)
> > +        dom_min_nodes = (dom_needs_cpus + cpus_nodes - 1) / cpus_nodes;
> > +    else
> > +        dom_min_nodes = 1;
> 
> which seems to suggest we'll pile the whole thing onto one pcpu.
> 
You're right again, I probably should have spent a few more words on
this... It's just that I fear being too verbose (I really tend to talk
and write too much!).

This whole thing comes from a hint from George during his review of my
RFC, where he said we should check how many CPUs we have on each node
so that, if we know we have 6 CPUs on each node and the domain wants 8
CPUs, we can avoid looking for solutions comprising fewer than 2 nodes.
This is all true, but it depends on each node having the same number
of CPUs; otherwise, the math becomes way more complex than just that
division.

Therefore, what this is trying to do is check whether or not we are in
that situation and, if not, just turn this kind of optimization off and
tell the rest of the algorithm to give us _all_ the candidates, starting
from a minimum number of nodes equal to one (which is the meaning of
dom_min_nodes = 1 at this stage).
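
To put some numbers on it:

    int dom_needs_cpus = 8, cpus_nodes = 6;  /* 6 PCPUs per node */
    int dom_min_nodes;

    /* ceil(8 / 6) == 2: 1-node candidates need not be generated */
    if (cpus_nodes != 0)
        dom_min_nodes = (dom_needs_cpus + cpus_nodes - 1) / cpus_nodes;
    else
        dom_min_nodes = 1;  /* heterogeneous nodes: consider everything */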

> > +/* Try to achieve "optimal" NUMA placement */
> > +static int place_domain(libxl_domain_build_info *b_info)
> > +{
> ...
> > +    if (nr_pools > 1) {
> > +        fprintf(stderr, "skip numa placement as cpupools are being used\n");
> > +        err = 0;
> > +        goto out_poolinfo;
> > +    }
> 
> Perhaps in the case of cpupools the right answer is to consider only
> those pcpus in the intended pool ?  Or is that likely to do more harm
> than good, with people who are currently doing their numa placement
> with cpupools ?
> 
I'm not sure. What you are saying is what I originally wanted to have,
but then I thought that, as is the case for VCPU affinity, if the user
is asking for something specific about CPU usage, it may be better to
leave the whole thing to him and turn all the automation off.

However, I think I can make the algorithm work while taking cpupools
into account; it's just a matter of deciding. Maybe hearing from George
and Juergen would be useful...
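
In case we go for pool-awareness, the rough idea would be something
like this (completely untested, and assuming libxl_cpupoolinfo keeps
exposing poolid and cpumap; target_poolid is made up):

    libxl_cpupoolinfo *pinfo;
    int nr_pools, p;

    pinfo = libxl_list_cpupool(ctx, &nr_pools);
    for (p = 0; pinfo && p < nr_pools; p++) {
        if (pinfo[p].poolid != target_poolid)
            continue;
        /* Instead of bailing out when nr_pools > 1, mask the host
         * topology with pinfo[p].cpumap, so that only the pool's
         * pcpus (and hence its nodes) are considered below. */
    }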

Thanks a lot for the quick and detailed review. :-)

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
