
Re: [Xen-devel] [PATCH v3 4/7] libxl: vNUMA supporting interface



On Tue, Nov 19, 2013 at 1:37 PM, Dario Faggioli
<dario.faggioli@xxxxxxxxxx> wrote:
> On Mon, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:
>> Signed-off-by: Elena Ufimtseva <ufimtseva@xxxxxxxxx>
>>
>> Changes since v2:
>> *   Added vnuma memory pfn alignment which takes into
>> account e820_host option and non-contiguous e820 memory map
>> in that case;
>>
> Wow, cool that you got this one working too! ;-P
>
>> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
>> index 0de1112..2bd3653 100644
>
>> +int libxl_domain_setvnuma(libxl_ctx *ctx,
>> +                            uint32_t domid,
>> +                            uint16_t nr_vnodes,
>> +                            uint16_t nr_vcpus,
>> +                            vmemrange_t *vmemrange,
>> +                            unsigned int *vdistance,
>> +                            unsigned int *vcpu_to_vnode,
>> +                            unsigned int *vnode_to_pnode)
>> +{
>> +    GC_INIT(ctx);
>>
> Do you need this? I don't think so: you don't need a gc, and ctx appears
> to be enough for you.
>
>> +    int ret;
>> +    ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes,
>> +                                nr_vcpus, vmemrange,
>> +                                vdistance,
>> +                                vcpu_to_vnode,
>> +                                vnode_to_pnode);
>> +    GC_FREE;
>> +    return ret;
>> +}
>> +
> libxl functions should always return libxl error codes, such as
> ERROR_FAIL, ERROR_INVAL, etc. It's nice to show what actually went wrong
> in the xc_* call, but that's usually done by:
>
>  rc = xc_call();
>  if (rc < 0) {
>   LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx");
>   return ERROR_FAIL;
>  }
>
> On the grounds that libxc would properly set errno and return -1 on
> failure.
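
For reference, applying that convention to the wrapper above would look
roughly like this (untested sketch, with the gc dropped as suggested):

int libxl_domain_setvnuma(libxl_ctx *ctx,
                          uint32_t domid,
                          uint16_t nr_vnodes,
                          uint16_t nr_vcpus,
                          vmemrange_t *vmemrange,
                          unsigned int *vdistance,
                          unsigned int *vcpu_to_vnode,
                          unsigned int *vnode_to_pnode)
{
    int ret;

    ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, nr_vcpus,
                             vmemrange, vdistance,
                             vcpu_to_vnode, vnode_to_pnode);
    if (ret < 0) {
        /* report the libxc/errno detail, but return a libxl error code */
        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                         "setting vNUMA topology failed");
        return ERROR_FAIL;
    }
    return 0;
}
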
>
>
>> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
>> index abe6685..b95abab 100644
>
>>  int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
>>                 uint32_t domid);
>>
>> +int libxl__vnuma_align_mem(libxl__gc *gc,
>> +                            uint32_t domid,
>> +                            struct libxl_domain_build_info *b_info,
>> +                            vmemrange_t *memblks);
>> +
>> +unsigned long e820_memory_hole_size(unsigned long start,
>> +                                    unsigned long end,
>> +                                    struct e820entry e820[],
>> +                                    int nr);
>> +
>> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc,
>> +                                libxl_domain_build_info *info);
>> +
>>
> I think we'd better have a libxl maintainer comment on this but, FWIW,
> I don't think I'd put these functions in libxl_arch.h.
>
> We do have a libxl_numa.c file to host their implementation (rather
> than libxl_dom.c) and, as libxl internal functions, their prototypes are
> probably fine in libxl_internal.h (unless there is some specific reason
> for having them here which I'm overlooking).
>
> As for e820_memory_hole_size(), it looks like it's only called from
> inside libxl__vnuma_align_mem(), which means it can just be a static
> (still inside libxl_numa.c).
>
>> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
>> index a1c16b0..378249e 100644
>
>> +/* prepares vnode to pnode map for domain vNUMA memory allocation */
>> +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
>> +                        libxl_domain_build_info *info)
>> +{
>> +    int i, n, nr_nodes, rc;
>>
> Please, initialize nr_nodes to 0. Given how libxl_get_numainfo() works,
> that's the best and easiest way of being sure we won't leak any of the
> memory it deals with.
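
For instance, a sketch of that pattern (assuming the usual
libxl_numainfo_list_free() cleanup at the exit label):

int nr_nodes = 0, rc;
libxl_numainfo *ninfo = NULL;

ninfo = libxl_get_numainfo(CTX, &nr_nodes);
if (ninfo == NULL) {
    rc = ERROR_FAIL;
    goto out;
}
/* ... use ninfo[0 .. nr_nodes - 1] ... */
rc = 0;

 out:
    /* safe even on failure, when ninfo is NULL and nr_nodes stayed 0 */
    libxl_numainfo_list_free(ninfo, nr_nodes);
    return rc;
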
>
>> +    uint64_t *mems;
>> +    unsigned long long *claim = NULL;
>> +    libxl_numainfo *ninfo = NULL;
>> +
>> +    rc = -1;
>>
> As said above, use libxl error codes.
>
>> +    if (info->vnode_to_pnode == NULL) {
>> +        info->vnode_to_pnode = calloc(info->nr_vnodes,
>> +                                      sizeof(*info->vnode_to_pnode));
>> +        if (info->vnode_to_pnode == NULL)
>> +            return rc;
>> +    }
>> +
> Also, in libxl, memory allocation never fails. :-) What this really means
> is that this kind of failure is handled for you by the library itself,
> if you use the proper wrappers (and doing so is not optional; it is the
> way you deal with memory in libxl).
>
> For this specific example, look for libxl__calloc() and some usage
> examples of it (you will find there never is any error handling code).
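
For example (hypothetical variable, just to show the shape):

unsigned int *map;

/* never returns NULL: allocation failure is handled inside libxl,
 * and the memory is tied to (and freed together with) the gc */
map = libxl__calloc(gc, info->nr_vnodes, sizeof(*map));
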
>
>> +    /* default setting */
>> +    for (i = 0; i < info->nr_vnodes; i++)
>> +        info->vnode_to_pnode[i] = VNUMA_NO_NODE;
>> +
>> +    /* Get NUMA info */
>> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
>> +    if (ninfo == NULL || nr_nodes == 0) {
>> +        LOG(DEBUG, "No HW NUMA found.\n");
>> +        rc = 0;
>> +        goto vnmapout;
>> +    }
>> +
> I'm not so sure about this. What do you mean by "No HW NUMA found"?
> IIRC, on a non-NUMA box, libxl_get_numainfo returns something. I just
> checked on my (non-NUMA) laptop, and I do see a node 0 with all the
> memory and 10 as distance to itself.
>
> So, if what you get is ninfo=NULL and nr_nodes=0, it means something
> went wrong, and we shouldn't return success. OTOH, if you're interested
> in detecting the case where the call succeeded and there really is only
> one node, you should probably check for nr_nodes == 1.

If that's the case, and there will always be at least one node present,
then I don't need to worry about this. :)

>
>> +    /*
>> +     * check if we have any hardware NUMA nodes selected,
>> +     * otherwise VNUMA_NO_NODE set and used default allocation
>> +     */
>> +    if (libxl_bitmap_is_empty(&info->nodemap))
>> +        return 0;
>>
> Why _is_empty()? I mean, under what circumstances info->nodemap is
> empty?
>
> I'm asking because info->nodemap is initialized as full in
> libxl__domain_build_info_setdefault() and can only change in
> libxl__build_pre() in a way that makes it really unlikely for it to
> become empty... so...
>
>> +    mems = info->vnuma_memszs;
>> +
>> +    /* check if all vnodes will fit in one node */
>> +    libxl_for_each_set_bit(n, info->nodemap) {
>> +        if(ninfo[n].free/1024 >= info->max_memkb  &&
>> +           libxl_bitmap_test(&info->nodemap, n))
>> +           {
>> +               /*
>> +                * all domain v-nodes will fit one p-node n,
>> +                * p-node n is a best candidate selected by automatic
>> +                * NUMA placement.
>> +                */
>> +               for (i = 0; i < info->nr_vnodes; i++)
>> +                    info->vnode_to_pnode[i] = n;
>> +               /* we can exit, as we are happy with placement */
>> +               return 0;
>> +           }
>> +    }
>>
> Wait, why do we want to do that? If info->nodemap is not fully set
> (a fully set map, BTW, is how "auto", or "do this for me", is expressed,
> rather than an empty one), it already contains some instructions that,
> I think, everyone in the downstream chain --be it other parts of libxl,
> libxc or Xen-- should follow.
>
> What I mean is, if info->nodemap has two bits set, that means the
> domain should be "created on" the (two) nodes corresponding to the bits
> themselves. OTOH, if info->nodemap has only one bit set, then the whole
> domain should stay there, and in that case there isn't much choice, is
> there? (Basically, your "libxl_for_each_set_bit()" will only take one
> step.)
>
> Here, it looks like, no matter what we have in info->nodemap, you
> look for one of the nodes that can contain the whole domain, which
> doesn't look right.


>
> Below...
>> +    /* Determine the best nodes to fit vNUMA nodes */
>> +    /* TODO: change algorithm. The current just fits the nodes
>> +     * Will be nice to have them also sorted by size
>> +     * If no p-node found, will be set to NUMA_NO_NODE
>> +     */
>> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
>> +    if (claim == NULL)
>> +        return rc;
>> +
>> +    libxl_for_each_set_bit(n, info->nodemap)
>> +    {
>> +        for (i = 0; i < info->nr_vnodes; i++)
>> +        {
>> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
>> +                 /*vnode was not set yet */
>> +                 (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
>> +            {
>> +                info->vnode_to_pnode[i] = n;
>> +                claim[n] += (mems[i] << 20);
>> +            }
>> +        }
>> +    }
>> +
> ...you're trying to do something similar, i.e., pack the domain within a
> subset of what you find in info->nodemap instead of just using it.
>
> Anyway, I know that automatic placement code may look tricky... Let me
> see if I can hack something together quickly, to show what I mean and
> what I'd have put here instead of all this. :-)
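
(Purely as an illustration of that point, and ignoring the free-memory
accounting the real code would still need, honouring info->nodemap could
look roughly like this, assuming the nodemap is not empty:)

/* illustrative sketch only: spread the vnodes over all the nodes
 * already selected in info->nodemap, instead of re-packing them */
int i, n, count = 0, *nodes;

nodes = libxl__calloc(gc, libxl_bitmap_count_set(&info->nodemap),
                      sizeof(*nodes));
libxl_for_each_set_bit(n, info->nodemap)
    nodes[count++] = n;

for (i = 0; i < info->nr_vnodes; i++)
    info->vnode_to_pnode[i] = nodes[i % count];
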
>
>>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>>                libxl_domain_config *d_config, libxl__domain_build_state 
>> *state)
>>  {
>> @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>>          if (rc)
>>              return rc;
>>      }
>> +
>> +    if (info->nr_vnodes > 0) {
>> +        /* The memory blocks will be formed here from sizes */
>> +        struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes,
>> +                                                sizeof(*memrange));
>> +        if (memrange == NULL) {
>> +            LOG(DETAIL, "Failed to allocate memory for memory ranges.\n");
>> +            return ERROR_FAIL;
>> +        }
>>
> AhA, here you're using the correct memory allocation wrapper, so you can
> ditch the error handling.
>
>> +        if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) {
>> +            LOG(DETAIL, "Failed to align memory map.\n");
>> +            return ERROR_FAIL;
>> +        }
>> +
>> +        /*
>> +        * If vNUMA vnode_to_pnode map defined, determine if we
>> +        * can disable automatic numa placement and place vnodes
>> +        * on specified pnodes.
>> +        * For now, if vcpu affinity specified, we will use
>> +        * specified vnode to pnode map.
>> +        */
>> +        if (libxl__vnodemap_is_usable(gc, info) == 1) {
>> +
> Extra blank line.
>> +            /* Will use user-defined vnode to pnode mask */
>> +
> Here too, and with spurious spaces.
>
>> +            libxl_defbool_set(&info->numa_placement, false);
>>
> Setting info->numa_placement to false only makes sense if you do it
> _before_ the following block of code:
>
>     if (libxl_defbool_val(info->numa_placement)) {
>
>         if (!libxl_bitmap_is_full(&info->cpumap)) {
>             LOG(ERROR, "Can run NUMA placement only if no vcpu "
>                        "affinity is specified");
>             return ERROR_INVAL;
>         }
>
>         rc = numa_place_domain(gc, domid, info);
>         if (rc)
>             return rc;
>     }
>
> There isn't any other place where that flag is checked and, if you get
> past this, automatic placement either already happened (if
> numa_placement was true and the other condition is met), or it never
> will.
>
> The right thing to do would probably be having
> libxl__vnodemap_is_usable() called *before* (even immediately above is
> fine) we get here, and have _it_ disable info->numa_placement, if that
> turns out to be the case.
>
> At this point, most of the code below (until '***') can just live inside
> the above "if (libxl_defbool_val(info->numa_placement))" above, without
> the need of introducing another identical block.

Yes, I have reworked this piece and it now does the right thing.
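
Roughly, it now goes along these lines (simplified sketch, not the actual
code):

/* decide about the user-provided map before the existing automatic
 * placement block runs */
if (info->nr_vnodes > 0 && libxl__vnodemap_is_usable(gc, info))
    libxl_defbool_set(&info->numa_placement, false);

if (libxl_defbool_val(info->numa_placement)) {
    /* ... unchanged automatic placement block ... */
}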

>
>> +        }
>> +        else {
>> +            LOG(ERROR, "The allocation mask for vnuma nodes cannot be 
>> used.\n");
>> +            if (libxl_defbool_val(info->vnuma_placement)) {
>> +
>> +                LOG(DETAIL, "Switching to automatic vnuma to pnuma 
>> placement\n.");
>> +                /* Construct the vnode to pnode mapping if possible */
>> +                if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
>> +                    LOG(DETAIL, "Failed to call init_vnodemap\n");
>> +                    /* vnuma_nodemap will not be used if nr_vnodes == 0 */
>> +                    return ERROR_FAIL;
>> +                }
>> +            }
>> +            else {
>> +                LOG(ERROR, "The vnodes cannot be mapped to pnodes this 
>> way\n.");
>> +                info->nr_vnodes = 0;
>> +                return ERROR_FAIL;
>> +            }
>> +        }
>> +        /* plumb domain with vNUMA topology */
>> +        if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes,
>> +                                info->max_vcpus, memrange,
>> +                                info->vdistance, info->vcpu_to_vnode,
>> +                                info->vnode_to_pnode) < 0) {
>> +
>> +           LOG(DETAIL, "Failed to set vnuma topology for domain from\n.");
>> +           info->nr_vnodes = 0;
>> +           return ERROR_FAIL;
>> +        }
>> +    }
>> +    else
>> +        LOG(DEBUG, "Will not construct vNUMA topology.\n");
>> +
> *** <-- until here
>
>> +/* Checks if vnuma_nodemap defined in info can be used
>> + * for allocation of vnodes on physical NUMA nodes by
>> + * verifying that there is enough memory on corresponding
>> + * NUMA nodes.
>> + */
>> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, 
>> libxl_domain_build_info *info)
>> +{
>> +    unsigned int i;
>> +    libxl_numainfo *ninfo = NULL;
>> +    unsigned long long *claim;
>> +    unsigned int node;
>> +    uint64_t *mems;
>> +    int rc, nr_nodes;
>> +
>> +    rc = nr_nodes = 0;
>> +    if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL)
>> +        return rc;
>> +    /*
>> +     * Cannot use specified mapping if not NUMA machine
>> +     */
>> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
>> +    if (ninfo == NULL) {
>> +        return rc;
>> +    }
>> +    mems = info->vnuma_memszs;
>> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
>> +    if (claim == NULL)
>> +        return rc;
>> +    /* Sum memory request on per pnode basis */
>> +    for (i = 0; i < info->nr_vnodes; i++)
>> +    {
>> +        node = info->vnode_to_pnode[i];
>> +        /* Correct pnode number? */
>> +        if (node < nr_nodes)
>> +            claim[node] += (mems[i] << 20);
>> +        else
>> +            goto vmapu;
>> +   }
>> +   for (i = 0; i < nr_nodes; i++) {
>> +       if (claim[i] > ninfo[i].free)
>> +          /* Cannot complete user request, falling to default */
>> +          goto vmapu;
>> +   }
>> +   rc = 1;
>> +
>> + vmapu:
>> +   if(claim) free(claim);
>> +   return rc;
>> +}
>>
> This function looks conceptually right. Despite that, it comes with
> coding style issues, trailing whitespace, and inconsistent (wrt libxl
> conventions) error and memory allocation/failure handling. :-D :-D
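
For reference, a cleaned-up sketch along those lines (gc allocation, the
numainfo list freed on exit, claim sized by the number of physical nodes)
could look like:

unsigned int libxl__vnodemap_is_usable(libxl__gc *gc,
                                       libxl_domain_build_info *info)
{
    unsigned int i, node, usable = 0;
    int nr_nodes = 0;
    uint64_t *claim;
    libxl_numainfo *ninfo = NULL;

    if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL)
        return 0;

    /* cannot use the specified mapping on a non-NUMA machine */
    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
    if (ninfo == NULL)
        return 0;

    claim = libxl__calloc(gc, nr_nodes, sizeof(*claim));

    /* sum the memory requested from each pnode */
    for (i = 0; i < info->nr_vnodes; i++) {
        node = info->vnode_to_pnode[i];
        if (node >= nr_nodes)
            goto out;
        claim[node] += info->vnuma_memszs[i] << 20;
    }
    for (i = 0; i < nr_nodes; i++)
        if (claim[i] > ninfo[i].free)
            goto out;

    usable = 1;
 out:
    libxl_numainfo_list_free(ninfo, nr_nodes);
    return usable;
}
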
>
>> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
>> index 87a8110..206f5be 100644
>
>> +
>> +/*
>> + * Checks for the beginning and end of RAM in the e820 map for the
>> + * domain and aligns the start of the first and the end of the last
>> + * vNUMA memory block to that map. vnode memory sizes are passed here
>> + * in Megabytes. For a PV guest the e820 map has fixed hole sizes.
>> + */
>> +int libxl__vnuma_align_mem(libxl__gc *gc,
>> +                            uint32_t domid,
>> +                            libxl_domain_build_info *b_info, /* IN: mem 
>> sizes */
>> +                            vmemrange_t *memblks)        /* OUT: linux numa 
>> blocks in pfn */
>> +{
>> +    int i, j, rc;
>> +    uint64_t next_start_pfn, end_max = 0, size, mem_hole;
>> +    uint32_t nr;
>> +    struct e820entry map[E820MAX];
>> +
>> +    if (b_info->nr_vnodes == 0)
>> +        return -EINVAL;
>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>> +
>> +    /* retrieve e820 map for this host */
>> +    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
>> +
>> +    if (rc < 0) {
>> +        errno = rc;
>> +        return -EINVAL;
>> +    }
>> +    nr = rc;
>> +    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
>> +                       (b_info->max_memkb - b_info->target_memkb) +
>> +                       b_info->u.pv.slack_memkb);
>> +    if (rc)
>> +    {
>> +        errno = rc;
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* max pfn for this host */
>> +    for (j = nr - 1; j >= 0; j--)
>> +        if (map[j].type == E820_RAM) {
>> +            end_max = map[j].addr + map[j].size;
>> +            break;
>> +        }
>> +
>> +    memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes);
>> +    next_start_pfn = 0;
>> +
>> +    memblks[0].start = map[0].addr;
>> +
>> +    for(i = 0; i < b_info->nr_vnodes; i++) {
>> +        /* start can be not zero */
>> +        memblks[i].start += next_start_pfn;
>> +        memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20);
>> +
>> +        size = memblks[i].end - memblks[i].start;
>> +        /*
>> +         * For pv host with e820_host option turned on we need
>> +         * to take into account memory holes. For pv host with
>> +         * e820_host disabled or unset, the map is a contiguous
>> +         * RAM region.
>> +         */
>> +        if (libxl_defbool_val(b_info->u.pv.e820_host)) {
>> +            while (mem_hole = e820_memory_hole_size(memblks[i].start,
>> +                                                 memblks[i].end, map, nr),
>> +                    memblks[i].end - memblks[i].start - mem_hole < size)
>> +            {
>> +                memblks[i].end += mem_hole;
>> +
>> +                if (memblks[i].end > end_max) {
>> +                    memblks[i].end = end_max;
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +        next_start_pfn = memblks[i].end;
>> +    }
>> +
>> +    if (memblks[i-1].end > end_max)
>> +        memblks[i-1].end = end_max;
>> +
>> +    return 0;
>> +}
>> +
> All this code here would require someone who knows more about e820
> to chime in (or for me to learn something more about it, which I
> definitely want to do, but that will take a bit).
>
> Perhaps you can help me at least a bit. Can you comment on/describe what
> it is that you do when you find a hole?
>
> It looks from the above that you just need to move the end of the
> node(s) memory range(s)... Is that the case? For some reason, I was
> assuming that dealing with holes would have required adding support for
> more than one memory block per node, but apparently that's not the case,
> is it?


The code now looks a bit different, as I found some problems there.

But here is the general idea.

When e820_host is not set, the PV guest has a fixed e820 map and I don't
need to worry much about the holes, since it always looks like this:

[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable

This means the first 4KB (0x0000 - 0x0fff) and the 384KB gap between
0xa0000 and 0xfffff will be reserved. Since these pfns will never appear
in the page allocations, and the beginning and end of the memory blocks I
construct match the guest map exactly, in this case I form the memory
blocks based only on the vnode sizes.

When e820_host is set to 1, I have to account for the memory holes and
build correct memory ranges.
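
(For reference, conceptually e820_memory_hole_size() just returns how much
of [start, end) is not usable RAM in the host map; this is only a sketch of
the idea, not the actual patch code:)

/* sketch: bytes of [start, end) that are not E820_RAM in the host map */
static unsigned long e820_memory_hole_size(unsigned long start,
                                           unsigned long end,
                                           struct e820entry e820[],
                                           int nr)
{
    int i;
    unsigned long ram = 0;

    for (i = 0; i < nr; i++) {
        unsigned long s, e;

        if (e820[i].type != E820_RAM)
            continue;
        /* overlap of this RAM entry with [start, end) */
        s = e820[i].addr > start ? e820[i].addr : start;
        e = e820[i].addr + e820[i].size < end ?
            e820[i].addr + e820[i].size : end;
        if (e > s)
            ram += e - s;
    }
    return (end - start) - ram;
}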

As for memory blocks: a second memory block of a given type (memory, in
this case) is only needed if a node spans non-contiguous memory ranges.
For example, if node 0 has 0x000000 - 0x100000 and 0x400000 - 0x800000,
it will look like this (hypothetical addresses, of course):

node 0, memblock 0: 0x000000 - 0x0fffff
node 1, memblock 0: 0x100000 - 0x3fffff
node 0, memblock 1: 0x400000 - 0x7fffff

If the memory blocks can be merged into one contiguous block, that will be
done and there will be one block per node, so at this point I don't see a
need for two memory blocks per node. The gaps in the e820 are accounted
for by Linux in the memory-ranges list of the corresponding memory block.

Elena


>
> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Elena

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

