[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common


  • To: Wei Chen <wei.chen@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Tue, 27 Sep 2022 10:19:33 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=yAzkf1MsGwcWu9aN2ZBcnYe91RZZv8KzWyU2D+Ok9aU=; b=ZOprF2vt9J67FFI4iRXPaOghBKmebM20zttb+nQl4ySm3ABQtYBvtz5ZZIR6CUQP0OYPEUZlgRvHOxK9bAEqQljzM3TI9kC9WJ6HizndmLXS8oNrzndMkleNXW3EbAwLRklRcnFXMy0odkWqix3twtSvlXajh1SWUc+tDIdf9BzIA9OOnYSZGsqEJYPDoSD8pZoUJSWEspziFc9gnr3JuTzUU1D0l9Z0GhPcj/vHEW09oTnFsrK4mk3AmjeLBYadttAuWiysddU7Psar0z3qlxz3cvygfOf3Wej32sYsdxmjS0dsbbulz44QpdBkXZhc6lTJZ1UJ8QH3RZzyxg8mPw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=PeoXYerb3GnX++VCOmuPLmbwjOxBZLy/prI0844fc034yGTC+zeEE9iFwY8h6TuJrwXe5wUxpgEsM8eUB+n5s/dQdKl+pYBk2MTbys4PIoF0Za+8M6uSWU2IkyVAaNLFNHP6RrKLR334P/JOYmEEA86rOxAdtTogOGfOjkarV0e0KxusJ6oG8+EFyLCgabdFQ/9hOLp9pUstgDnhqJLCaEqOmf8liWv7cxF4ttr70wanpByBnOpqHCpbrO8/DMhM+gfCTaJvNtMwWk6M4c58qGpGvBchGLKuJyGolY0MziEPtGowg8GIbLdX8Fy0b/u6fHDIWvR+pyyUogcvYtlgUg==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: nd@xxxxxxx, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Tue, 27 Sep 2022 08:19:45 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 20.09.2022 11:12, Wei Chen wrote:
> There are some codes in x86/numa.c can be shared by common
> architectures to implememnt NUMA support. Just like some
> variables and functions to check and store NUMA memory map.
> And some variables and functions to do NUMA initialization.
> 
> In this patch, we move them to common/numa.c and xen/numa.h
> and use the CONFIG_NUMA to gate them for non-NUMA supported
> architectures. As the target header file is Xen-style, so
> we trim some spaces and replace tabs for the codes that has
> been moved to xen/numa.h at the same time.
> 
> As acpi_scan_nodes has been used in a common function, it
> doesn't make sense to use acpi_xxx in common code, so we
> rename it to numa_scan_nodes in this patch too. After that

numa_process_nodes() now?

> if we still use CONFIG_ACPI_NUMA in to gate numa_scan_nodes
> in numa_initmem_init, that doesn't make sense. As CONFIG_NUMA
> will be selected by CONFIG_ACPI_NUMA for x86. So, we replace
> CONFIG_ACPI_NUMA by CONFIG_NUMA to gate numa_scan_nodes.
> 
> As arch_numa_disabled has been implememnted for ACPI NUMA,
> we can rename srat_disabled to numa_disabled and move it
> to common code as well.

Please update the description: arch_numa_disabled() appears in patch 5
only. Of course if you follow the comments to patch 2, the wording here
would be correct again.

> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
> +                                                  nodeid_t numnodes)
> +{
> +    unsigned int i, nodes_used = 0;
> +    unsigned long spdx, epdx;
> +    unsigned long bitfield = 0, memtop = 0;
> +
> +    for ( i = 0; i < numnodes; i++ )
> +    {
> +        spdx = paddr_to_pdx(nodes[i].start);
> +        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
> +
> +        if ( spdx >= epdx )
> +            continue;
> +
> +        bitfield |= spdx;

Perhaps to be taken care of in a separate patch: We accumulate only
the bits from the start addresses here, contrary to what the comment
ahead of the function says (and I think it is the comment which is
putting things correctly).

> +        nodes_used++;
> +        if ( epdx > memtop )
> +            memtop = epdx;
> +    }
> +
> +    if ( nodes_used <= 1 )
> +        i = BITS_PER_LONG - 1;

Is this actually going to be correct for all architectures? Aiui
Arm64 has only up to 48 physical address bits, but what about an
architecture allowing the use of all 64 bits? I think at the very
least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here.

> +    else
> +        i = find_first_bit(&bitfield, sizeof(unsigned long) * 8);
> +
> +    memnodemapsize = (memtop >> i) + 1;

Again perhaps the subject of a separate patch: Isn't there an off-by-1
mistake here? memtop is the maximum of all epdx-es, which are
calculated to be the first PDX following the region. Hence I'd expect

    memnodemapsize = ((memtop - 1) >> i) + 1;

here. I guess I'll make patches for both issues, which you may then
need to re-base over.

> +static int __init cf_check numa_setup(const char *opt)
> +{
> +    if ( !strncmp(opt, "off", 3) )
> +        numa_off = true;
> +    else if ( !strncmp(opt, "on", 2) )
> +        numa_off = false;
> +#ifdef CONFIG_NUMA_EMU
> +    else if ( !strncmp(opt, "fake=", 5) )
> +    {
> +        numa_off = false;
> +        numa_fake = simple_strtoul(opt + 5, NULL, 0);
> +        if ( numa_fake >= MAX_NUMNODES )
> +            numa_fake = MAX_NUMNODES;
> +    }
> +#endif
> +    else
> +        return arch_numa_setup(opt);
> +
> +    return 0;
> +}
> +custom_param("numa", numa_setup);

Note that with this moved here at some point during your work (when
allowing NUMA=y for Arm) you'll need to update the command line doc.

> +static void cf_check dump_numa(unsigned char key)
> +{
> +    s_time_t now = NOW();
> +    unsigned int i, j, n;
> +    struct domain *d;
> +    const struct page_info *page;
> +    unsigned int page_num_node[MAX_NUMNODES];
> +    const struct vnuma_info *vnuma;
> +
> +    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
> +           now);
> +
> +    for_each_online_node ( i )
> +    {
> +        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
> +
> +        printk("NODE%u start->%lu size->%lu free->%lu\n",
> +               i, node_start_pfn(i), node_spanned_pages(i),
> +               avail_node_heap_pages(i));
> +        /* Sanity check phys_to_nid() */
> +        if ( phys_to_nid(pa) != i )
> +            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
> +                   pa, phys_to_nid(pa), i);
> +    }
> +
> +    j = cpumask_first(&cpu_online_map);
> +    n = 0;
> +    for_each_online_cpu ( i )
> +    {
> +        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
> +        {
> +            if ( n > 1 )
> +                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, 
> cpu_to_node[j]);
> +            else
> +                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
> +            j = i;
> +            n = 1;
> +        }
> +        else
> +            ++n;
> +    }
> +    if ( n > 1 )
> +        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
> +    else
> +        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
> +
> +    rcu_read_lock(&domlist_read_lock);
> +
> +    printk("Memory location of each domain:\n");
> +    for_each_domain ( d )
> +    {
> +        process_pending_softirqs();
> +
> +        printk("Domain %u (total: %u):\n", d->domain_id, 
> domain_tot_pages(d));
> +
> +        for_each_online_node ( i )
> +            page_num_node[i] = 0;

I'd be inclined to suggest to use memset() here, but I won't insist
on you doing this "on the fly". Along with this would likely go the
request to limit the scope of page_num_node[] (and then perhaps also
vnuma and page).

> +        spin_lock(&d->page_alloc_lock);
> +        page_list_for_each ( page, &d->page_list )
> +        {
> +            i = phys_to_nid(page_to_maddr(page));
> +            page_num_node[i]++;
> +        }
> +        spin_unlock(&d->page_alloc_lock);
> +
> +        for_each_online_node ( i )
> +            printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +        if ( !read_trylock(&d->vnuma_rwlock) )
> +            continue;
> +
> +        if ( !d->vnuma )
> +        {
> +            read_unlock(&d->vnuma_rwlock);
> +            continue;
> +        }
> +
> +        vnuma = d->vnuma;
> +        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
> +               vnuma->nr_vnodes, d->max_vcpus);
> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
> +        {
> +            unsigned int start_cpu = ~0U;
> +
> +            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
> +                printk("       %3u: pnode ???,", i);
> +            else
> +                printk("       %3u: pnode %3u,", i, 
> vnuma->vnode_to_pnode[i]);
> +
> +            printk(" vcpus ");
> +
> +            for ( j = 0; j < d->max_vcpus; j++ )
> +            {
> +                if ( !(j & 0x3f) )
> +                    process_pending_softirqs();
> +
> +                if ( vnuma->vcpu_to_vnode[j] == i )
> +                {
> +                    if ( start_cpu == ~0U )
> +                    {
> +                        printk("%d", j);

j being "unsigned int", would you mind switching to %u here and below?

> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -18,4 +18,71 @@
>    (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
>     ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
>  
> +/* The following content can be used when NUMA feature is enabled */
> +#ifdef CONFIG_NUMA
> +
> +extern nodeid_t      cpu_to_node[NR_CPUS];
> +extern cpumask_t     node_to_cpumask[];
> +
> +#define cpu_to_node(cpu)        cpu_to_node[cpu]
> +#define parent_node(node)       (node)
> +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node])

I can't spot a use of this - perhaps better drop than generalize (if
done right here then along with mentioning this in the description)?

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.