Xen project Mailing List

Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common

Date: Tue, 27 Sep 2022 10:19:33 +0200

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=yAzkf1MsGwcWu9aN2ZBcnYe91RZZv8KzWyU2D+Ok9aU=; b=ZOprF2vt9J67FFI4iRXPaOghBKmebM20zttb+nQl4ySm3ABQtYBvtz5ZZIR6CUQP0OYPEUZlgRvHOxK9bAEqQljzM3TI9kC9WJ6HizndmLXS8oNrzndMkleNXW3EbAwLRklRcnFXMy0odkWqix3twtSvlXajh1SWUc+tDIdf9BzIA9OOnYSZGsqEJYPDoSD8pZoUJSWEspziFc9gnr3JuTzUU1D0l9Z0GhPcj/vHEW09oTnFsrK4mk3AmjeLBYadttAuWiysddU7Psar0z3qlxz3cvygfOf3Wej32sYsdxmjS0dsbbulz44QpdBkXZhc6lTJZ1UJ8QH3RZzyxg8mPw==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=PeoXYerb3GnX++VCOmuPLmbwjOxBZLy/prI0844fc034yGTC+zeEE9iFwY8h6TuJrwXe5wUxpgEsM8eUB+n5s/dQdKl+pYBk2MTbys4PIoF0Za+8M6uSWU2IkyVAaNLFNHP6RrKLR334P/JOYmEEA86rOxAdtTogOGfOjkarV0e0KxusJ6oG8+EFyLCgabdFQ/9hOLp9pUstgDnhqJLCaEqOmf8liWv7cxF4ttr70wanpByBnOpqHCpbrO8/DMhM+gfCTaJvNtMwWk6M4c58qGpGvBchGLKuJyGolY0MziEPtGowg8GIbLdX8Fy0b/u6fHDIWvR+pyyUogcvYtlgUg==

Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;

Cc: nd@xxxxxxx, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Tue, 27 Sep 2022 08:19:45 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 20.09.2022 11:12, Wei Chen wrote: > There are some codes in x86/numa.c can be shared by common > architectures to implememnt NUMA support. Just like some > variables and functions to check and store NUMA memory map. > And some variables and functions to do NUMA initialization. > > In this patch, we move them to common/numa.c and xen/numa.h > and use the CONFIG_NUMA to gate them for non-NUMA supported > architectures. As the target header file is Xen-style, so > we trim some spaces and replace tabs for the codes that has > been moved to xen/numa.h at the same time. > > As acpi_scan_nodes has been used in a common function, it > doesn't make sense to use acpi_xxx in common code, so we > rename it to numa_scan_nodes in this patch too. After that numa_process_nodes() now? > if we still use CONFIG_ACPI_NUMA in to gate numa_scan_nodes > in numa_initmem_init, that doesn't make sense. As CONFIG_NUMA > will be selected by CONFIG_ACPI_NUMA for x86. So, we replace > CONFIG_ACPI_NUMA by CONFIG_NUMA to gate numa_scan_nodes. > > As arch_numa_disabled has been implememnted for ACPI NUMA, > we can rename srat_disabled to numa_disabled and move it > to common code as well. Please update the description: arch_numa_disabled() appears in patch 5 only. Of course if you follow the comments to patch 2, the wording here would be correct again. > +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes, > + nodeid_t numnodes) > +{ > + unsigned int i, nodes_used = 0; > + unsigned long spdx, epdx; > + unsigned long bitfield = 0, memtop = 0; > + > + for ( i = 0; i < numnodes; i++ ) > + { > + spdx = paddr_to_pdx(nodes[i].start); > + epdx = paddr_to_pdx(nodes[i].end - 1) + 1; > + > + if ( spdx >= epdx ) > + continue; > + > + bitfield |= spdx; Perhaps to be taken care of in a separate patch: We accumulate only the bits from the start addresses here, contrary to what the comment ahead of the function says (and I think it is the comment which is putting things correctly). > + nodes_used++; > + if ( epdx > memtop ) > + memtop = epdx; > + } > + > + if ( nodes_used <= 1 ) > + i = BITS_PER_LONG - 1; Is this actually going to be correct for all architectures? Aiui Arm64 has only up to 48 physical address bits, but what about an architecture allowing the use of all 64 bits? I think at the very least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here. > + else > + i = find_first_bit(&bitfield, sizeof(unsigned long) * 8); > + > + memnodemapsize = (memtop >> i) + 1; Again perhaps the subject of a separate patch: Isn't there an off-by-1 mistake here? memtop is the maximum of all epdx-es, which are calculated to be the first PDX following the region. Hence I'd expect memnodemapsize = ((memtop - 1) >> i) + 1; here. I guess I'll make patches for both issues, which you may then need to re-base over. > +static int __init cf_check numa_setup(const char *opt) > +{ > + if ( !strncmp(opt, "off", 3) ) > + numa_off = true; > + else if ( !strncmp(opt, "on", 2) ) > + numa_off = false; > +#ifdef CONFIG_NUMA_EMU > + else if ( !strncmp(opt, "fake=", 5) ) > + { > + numa_off = false; > + numa_fake = simple_strtoul(opt + 5, NULL, 0); > + if ( numa_fake >= MAX_NUMNODES ) > + numa_fake = MAX_NUMNODES; > + } > +#endif > + else > + return arch_numa_setup(opt); > + > + return 0; > +} > +custom_param("numa", numa_setup); Note that with this moved here at some point during your work (when allowing NUMA=y for Arm) you'll need to update the command line doc. > +static void cf_check dump_numa(unsigned char key) > +{ > + s_time_t now = NOW(); > + unsigned int i, j, n; > + struct domain *d; > + const struct page_info *page; > + unsigned int page_num_node[MAX_NUMNODES]; > + const struct vnuma_info *vnuma; > + > + printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key, > + now); > + > + for_each_online_node ( i ) > + { > + paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1); > + > + printk("NODE%u start->%lu size->%lu free->%lu\n", > + i, node_start_pfn(i), node_spanned_pages(i), > + avail_node_heap_pages(i)); > + /* Sanity check phys_to_nid() */ > + if ( phys_to_nid(pa) != i ) > + printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n", > + pa, phys_to_nid(pa), i); > + } > + > + j = cpumask_first(&cpu_online_map); > + n = 0; > + for_each_online_cpu ( i ) > + { > + if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] ) > + { > + if ( n > 1 ) > + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, > cpu_to_node[j]); > + else > + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); > + j = i; > + n = 1; > + } > + else > + ++n; > + } > + if ( n > 1 ) > + printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]); > + else > + printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]); > + > + rcu_read_lock(&domlist_read_lock); > + > + printk("Memory location of each domain:\n"); > + for_each_domain ( d ) > + { > + process_pending_softirqs(); > + > + printk("Domain %u (total: %u):\n", d->domain_id, > domain_tot_pages(d)); > + > + for_each_online_node ( i ) > + page_num_node[i] = 0; I'd be inclined to suggest to use memset() here, but I won't insist on you doing this "on the fly". Along with this would likely go the request to limit the scope of page_num_node[] (and then perhaps also vnuma and page). > + spin_lock(&d->page_alloc_lock); > + page_list_for_each ( page, &d->page_list ) > + { > + i = phys_to_nid(page_to_maddr(page)); > + page_num_node[i]++; > + } > + spin_unlock(&d->page_alloc_lock); > + > + for_each_online_node ( i ) > + printk(" Node %u: %u\n", i, page_num_node[i]); > + > + if ( !read_trylock(&d->vnuma_rwlock) ) > + continue; > + > + if ( !d->vnuma ) > + { > + read_unlock(&d->vnuma_rwlock); > + continue; > + } > + > + vnuma = d->vnuma; > + printk(" %u vnodes, %u vcpus, guest physical layout:\n", > + vnuma->nr_vnodes, d->max_vcpus); > + for ( i = 0; i < vnuma->nr_vnodes; i++ ) > + { > + unsigned int start_cpu = ~0U; > + > + if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE ) > + printk(" %3u: pnode ???,", i); > + else > + printk(" %3u: pnode %3u,", i, > vnuma->vnode_to_pnode[i]); > + > + printk(" vcpus "); > + > + for ( j = 0; j < d->max_vcpus; j++ ) > + { > + if ( !(j & 0x3f) ) > + process_pending_softirqs(); > + > + if ( vnuma->vcpu_to_vnode[j] == i ) > + { > + if ( start_cpu == ~0U ) > + { > + printk("%d", j); j being "unsigned int", would you mind switching to %u here and below? > --- a/xen/include/xen/numa.h > +++ b/xen/include/xen/numa.h > @@ -18,4 +18,71 @@ > (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \ > ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE) > > +/* The following content can be used when NUMA feature is enabled */ > +#ifdef CONFIG_NUMA > + > +extern nodeid_t cpu_to_node[NR_CPUS]; > +extern cpumask_t node_to_cpumask[]; > + > +#define cpu_to_node(cpu) cpu_to_node[cpu] > +#define parent_node(node) (node) > +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node]) I can't spot a use of this - perhaps better drop than generalize (if done right here then along with mentioning this in the description)? Jan

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.