[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes



Dario Faggioli writes ("[PATCH 1 of 4 v6/leftover] libxl: enable automatic 
placement of guests on NUMA nodes"):
> If a domain does not have a VCPU affinity, try to pin it automatically
> to some PCPUs. This is done taking into account the NUMA characteristics
> of the host. In fact, we look for a combination of host's NUMA nodes
> with enough free memory and number of PCPUs for the new domain, and pin
> it to the VCPUs of those nodes.

I'm afraid your new algorithm appears to rely on an inconsistent
comparison function:

> +static int numa_cmpf(const libxl__numa_candidate *c1,
> +                     const libxl__numa_candidate *c2)
> +{
> +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> +    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);
> +
> +    return SIGN(freememkb_diff + 3*nrvcpus_diff);
> +}

I don't think this comparison function necessarily defines a partial
order.  Can you provide a proof that it does ?  I think you probably
can't.  If you think it _is_ a partial order please let me know and I
will try to construct a counterexample.

I think your comparison function should be of the form:

   static int numa_cmpf(const libxl__numa_candidate *a,
                        const libxl__numa_candidate *b)
   {
       scorea = score(a);  scoreb = score(b);
       return SIGN(scorea - scoreb);
       /* if scores are smalllish ints, can just return scorea-scoreb */
   }

or more generally:

   static int numa_cmpf(const libxl__numa_candidate *a,
                        const libxl__numa_candidate *b)
   {
       score1a = score1(a);  score1b = score1(b);
       if (score1a != score1b) return SIGN(score1a - score1b);
          /* if scores are smalllish ints, can just return score1a-score1b */

       score2a = score2(a);  score2b = score2(b);
       if (score2a != score2b) return SIGN(score2a - score2b);
       /* etc. */

       return 0;
   }

That clearly defines a partial order.  The important point is that the
functions score() must each be calculated only from the node in
question.  score() must not take the other candidate as an argument.
Your normalized_diff function violates this.

If you like you could calculate the scores once for each node and
store them in libxl__numa_candidate rather than the ingredients for
the score (max_memkb) etc., in which case the comparison function is a
simple lexical comparison of the scores.

> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> +{
...
> +    /* Not even a suitable placement candidate! Let's just don't touch the
> +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> +    if (found == 0)
> +        goto out;

This probably deserves a log message.

> @@ -107,7 +203,22 @@ int libxl__build_pre(libxl__gc *gc, uint
>      uint32_t rtc_timeoffset;
> 
>      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> +
> +    /*
> +     * Check if the domain has any CPU affinity. If not, try to build up one.
> +     * In case numa_place_domain() find at least a suitable candidate, it 
> will

This line is 78 characters, leading to wrap damage in my email reply
to you.  The official coding style says you may use up to 80 which is
tolerable in the odd line of code but it would be nice if wordwrapped
comments were wrapped to a shorter distance.

> +     * affect info->cpumap accordingly; if it does not, it just leaves it
> +     * as it is. This means (unless some weird error manifests) the 
> subsequent
> +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * whatever that turns out to be.
> +     */
> +    if (libxl_bitmap_is_full(&info->cpumap)) {
> +        int rc = numa_place_domain(gc, info);
> +        if (rc)
> +            return rc;
> +    }
>      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> +

As discussed, this is wrong; the fix is in your 2/4 but needs to be
folded into this patch I think.  Patches should not introduce new
bugs even if they're about to be fixed in the next commit.

> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2216,6 +2216,125 @@ static inline void libxl__ctx_unlock(lib
...
> + * The intended usage is as follows:
> + *  1. fist of all, call libxl__get_numa_candidates(), and specify the proper
          first


> +/*
> + * Looks for the placement candidates that satisfyies some specific
> + * conditions and return the best one according to the provided
> + * comparison function.
> + */
> +int libxl__get_numa_candidate(libxl__gc *gc,
> +                              uint32_t min_free_memkb, int min_cpus,
> +                              int min_nodes, int max_nodes,
> +                              libxl__numa_candidate_cmpf numa_cmpf,
> +                              libxl__numa_candidate *cndt_out,
> +                              int *cndt_found)
> +{
...
> +    /*
> +     * The good thing about this solution is that it is based on heuristics
> +     * (implemented in numa_cmpf() ), but we at least can evaluate it on
> +     * all the possible placement candidates. That can happen because the
> +     * number of nodes present in current NUMA systems is quite small.
> +     * In fact, even if a sum of binomials is involved, if the system has
> +     * up to 8 nodes it "only" takes 256 steps. At the same time, 8 is a
> +     * sensible value, as it is exactly the number of nodes the biggest
> +     * NUMA systems provide at the time of this writing (and will probably
> +     * continue to do so for a while). However, computanional complexity
> +     * would explode on systems much bigger than that. 16 nodes system would
> +     * still be fine (it will be 65536 steps), but it's really importante we
                                                                  important

Isn't this an argument that the limit should be 16 ?  (Also 65535
steps since placement on 0 nodes is typically not an option.)

> +     * avoid trying to run this on monsters with 32, 64 or more nodes (if
> +     * they ever pop into being). Therefore, here it comes a safety catch
> +     * that disables the algorithm for the cases when it wouldn't work well.
> +     */
> +    if (nr_nodes > 8) {
> +        /* Log we did nothing and return 0, as no real error occurred */
> +        LOG(WARN, "System has %d NUMA nodes, which is too big for the "
> +                  "placement algorithm to work effectively. Skipping it",
> +                  nr_nodes);

It would be nice if this message suggested pinning the vcpus.


> +    if (min_nodes > nr_nodes)
> +        min_nodes = nr_nodes;
> +    if (!max_nodes || max_nodes > nr_nodes)
> +        max_nodes = nr_nodes;
> +    if (min_nodes > max_nodes) {
> +        rc = ERROR_INVAL;
> +        goto out;

This should be accompanied by a log message.

> +    while (min_nodes <= max_nodes && *cndt_found == 0) {
> +        comb_iter_t comb_iter;
> +        int comb_ok;
> +
> +        /*
> +        * And here it is. Each step of this cycle generates a combination of
> +        * nodes as big as min_nodes mandates.  Each of these combinations is
> +        * checked against the constraints provided by the caller (namely,
> +        * amount of free memory and number of cpus) and it can concur to
> +         * become our best placement iff it passes the check.
> +         */

Indentation error.


Thanks,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.