
RE: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse device tree memory node



On Wed, 8 Sep 2021, Wei Chen wrote:
> > > >>> @@ -55,6 +57,79 @@ static int __init dtb_numa_processor_affinity_init(nodeid_t node)
> > > >>>       return 0;
> > > >>>   }
> > > >>>
> > > >>> +/* Callback for parsing of the memory regions affinity */
> > > >>> +static int __init dtb_numa_memory_affinity_init(nodeid_t node,
> > > >>> +                                paddr_t start, paddr_t size)
> > > >>> +{
> > > >>> +    struct node *nd;
> > > >>> +    paddr_t end;
> > > >>> +    int i;
> > > >>> +
> > > >>> +    if ( srat_disabled() )
> > > >>> +        return -EINVAL;
> > > >>> +
> > > >>> +    end = start + size;
> > > >>> +    if ( num_node_memblks >= NR_NODE_MEMBLKS )
> > > >>> +    {
> > > >>> +        dprintk(XENLOG_WARNING,
> > > >>> +                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
> > > >>> +        bad_srat();
> > > >>> +        return -EINVAL;
> > > >>> +    }
> > > >>> +
> > > >>> +    /* It is fine to add this area to the nodes data it will be used later */
> > > >>> +    i = conflicting_memblks(start, end);
> > > >>> +    /* No conflicting memory block, we can save it for later usage */;
> > > >>> +    if ( i < 0 )
> > > >>> +        goto save_memblk;
> > > >>> +
> > > >>> +    if ( memblk_nodeid[i] == node ) {
> > > >>> +        /*
> > > >>> +         * Overlaps with other memblk in the same node, warning here.
> > > >>> +         * This memblk will be merged with conflicted memblk later.
> > > >>> +         */
> > > >>> +        printk(XENLOG_WARNING
> > > >>> +               "DT: NUMA NODE %u (%"PRIx64
> > > >>> +               "-%"PRIx64") overlaps with itself (%"PRIx64"-%"PRIx64")\n",
> > > >>> +               node, start, end,
> > > >>> +               node_memblk_range[i].start, node_memblk_range[i].end);
> > > >>> +    } else {
> > > >>> +        /*
> > > >>> +         * Conflict with memblk in other node, this is an error.
> > > >>> +         * The NUMA information is invalid, NUMA will be turn off.
> > > >>> +         */
> > > >>> +        printk(XENLOG_ERR
> > > >>> +               "DT: NUMA NODE %u (%"PRIx64"-%"
> > > >>> +               PRIx64") overlaps with NODE %u (%"PRIx64"-%"PRIx64")\n",
> > > >>> +               node, start, end, memblk_nodeid[i],
> > > >>> +               node_memblk_range[i].start, node_memblk_range[i].end);
> > > >>> +        bad_srat();
> > > >>> +        return -EINVAL;
> > > >>> +    }
> > > >>> +
> > > >>> +save_memblk:
> > > >>> +    nd = &nodes[node];
> > > >>> +    if ( !node_test_and_set(node, memory_nodes_parsed) ) {
> > > >>> +        nd->start = start;
> > > >>> +        nd->end = end;
> > > >>> +    } else {
> > > >>> +        if ( start < nd->start )
> > > >>> +            nd->start = start;
> > > >>> +        if ( nd->end < end )
> > > >>> +            nd->end = end;
> > > >>> +    }
> > > >>> +
> > > >>> +    printk(XENLOG_INFO "DT: NUMA node %u %"PRIx64"-%"PRIx64"\n",
> > > >>> +           node, start, end);
> > > >>> +
> > > >>> +    node_memblk_range[num_node_memblks].start = start;
> > > >>> +    node_memblk_range[num_node_memblks].end = end;
> > > >>> +    memblk_nodeid[num_node_memblks] = node;
> > > >>> +    num_node_memblks++;
> > > >>
> > > >>
> > > >> Is it possible to have non-contiguous ranges of memory for a single
> > > >> NUMA node?
> > > >>
> > > >> Looking at the DT bindings and Linux implementation, it seems
> > > >> possible. Here, it seems that node_memblk_range/memblk_nodeid could
> > > >> handle it, but nodes couldn't.
> > > >
> > > > Yes, you're right. I copied this code from x86 ACPI NUMA. Does ACPI
> > > > allow non-contiguous ranges of memory for a single NUMA node too?
> > >
> > > I couldn't find any restriction for ACPI, although I only briefly
> > > looked at the spec.
> > >
> > > > If yes, I think this will affect x86 ACPI NUMA too. In the next
> > > > version, we plan to merge dtb_numa_memory_affinity_init and
> > > > acpi_numa_memory_affinity_init into a neutral function, so we can
> > > > fix them at the same time.
> > > >
> > > > If not, maybe we have to keep separate implementations for dtb and
> > > > ACPI here.
> > >
> > > I am not entirely sure what you mean. Are you saying that if ACPI
> > > doesn't allow non-contiguous ranges of memory, then we should keep
> > > the implementations separate?
> > >
> > > If so, then I disagree with that. It is fine to have code that supports
> > > more than what a firmware table supports. The main benefit is less code
> > > and therefore less long-term maintenance (with the current solution we
> > > would need to check both the ACPI and DT implementations if there is a
> > > bug in one).
> > >
> > 
> > Yes, I agree.
> > 
> 
> I am looking for some methods to address this comment. The current
> "nodes" array does not consider the situation where memory addresses
> of different NUMA nodes can be interleaved.
> 
> This code exists in the x86 NUMA implementation. I think it may be
> based on an early version of the Linux x86 NUMA implementation. In
> recent Linux code, neither ACPI/numa/srat.c [1] nor the x86 NUMA
> code [2] uses "nodes" to record NUMA memory address boundaries; they
> don't depend on "nodes" to do sanity checks.
> 
> To fix it, we'd better upgrade the x86 NUMA driver. That would have a
> significant impact on Xen-x86, and I think it might be out of this
> series' scope. Can we create another thread to discuss it?
> 
> Or could you suggest some simpler way to fix it?
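
To make the interleaving problem concrete (addresses made up purely for
illustration): suppose node 0 owns 0x0000-0x4000 and 0x8000-0xc000, and
node 1 owns 0x4000-0x8000. The memblk arrays in the patch represent this
exactly, but folding node 0 into a single nodes[] span is lossy:

  node_memblk_range[]/memblk_nodeid[]:
      0x0000-0x4000 -> node 0
      0x4000-0x8000 -> node 1
      0x8000-0xc000 -> node 0

  nodes[] spans:
      nodes[0] = 0x0000-0xc000   <- now appears to contain node 1
      nodes[1] = 0x4000-0x8000

Any check based on the nodes[] spans would report a bogus conflict here,
even though the memblk arrays are unambiguous.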

It looks like we would have to replace all the node->start / node->end
checks with node_memblk_range checks. There are a few of them in
valid_numa_range, conflicting_memblks, cutoff_node, and
nodes_cover_memory. It wouldn't be trivial.
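
As a rough, untested sketch of the direction (it reuses the existing
node_memblk_range[], memblk_nodeid[] and num_node_memblks globals; the
function name is made up):

static bool __init node_covers_range(nodeid_t node, paddr_t start,
                                     paddr_t end)
{
    unsigned int i;

    for ( i = 0; i < num_node_memblks; i++ )
    {
        /* Only blocks belonging to the node of interest count. */
        if ( memblk_nodeid[i] != node )
            continue;

        /* Is [start, end) fully contained in one of the node's blocks? */
        if ( node_memblk_range[i].start <= start &&
             node_memblk_range[i].end >= end )
            return true;
    }

    return false;
}

A valid_numa_range()-style caller could then use this instead of
comparing against nodes[node].start/end.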

Although I do think that non-contiguous memory for NUMA nodes is
important to support, the patch series is already 40 patches. I don't
think it is a good idea to add other significant changes to it, and I
wouldn't upgrade the x86 NUMA driver now. If we can't find a better
way, we can proceed as you are doing in this version, with the known
gap that we can't deal with non-contiguous memory for NUMA nodes, and
fix it with a follow-up series later. In that case we would want to
have an explicit check for non-contiguous memory for NUMA nodes in
dtb_numa_memory_affinity_init and error out if found.
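
Something along these lines in the save_memblk path could implement that
explicit check (untested sketch, keeping the patch's end = start + size
convention):

    nd = &nodes[node];
    if ( !node_test_and_set(node, memory_nodes_parsed) )
    {
        /* First block for this node: start a fresh span. */
        nd->start = start;
        nd->end = end;
    }
    else if ( end < nd->start || start > nd->end )
    {
        /*
         * The new block neither overlaps nor touches the node's current
         * span, so merging the two would swallow a hole that may belong
         * to another node. Reject it until non-contiguous nodes are
         * properly supported.
         */
        printk(XENLOG_ERR
               "DT: NUMA node %u: non-contiguous memory not supported\n",
               node);
        bad_srat();
        return -EINVAL;
    }
    else
    {
        /* Extend the existing span to cover the new block. */
        if ( start < nd->start )
            nd->start = start;
        if ( end > nd->end )
            nd->end = end;
    }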


> Also, on Linux, the NUMA implementation for x86 is different from the
> Arm64 and RISC-V implementations. [3]
> 
> [1] https://github.com/torvalds/linux/blob/master/drivers/acpi/numa/srat.c
> [2] https://github.com/torvalds/linux/blob/master/arch/x86/mm/numa.c
> [3] https://github.com/torvalds/linux/blob/master/drivers/base/arch_numa.c


In general, I like the idea of sharing code as much as possible between
architectures (x86, ARM, etc.) and between DT/ACPI because it makes the
code maintenance easier and one might even gain certain features for
free.

However, the exercise of sharing code shouldn't take significant
additional effort. In fact, it should decrease the overall effort:
instead of writing new code, one just takes existing code and moves it
to common code.

In this instance, I think it would be good to be able to share the NUMA
initialization code between x86/ARM and ACPI/DT if it doesn't cause
extra effort.

Here, the extra effort that my previous comment might cause doesn't
derive from x86/ARM or DT/ACPI code sharing. It derives from the fact
that our existing code unfortunately doesn't deal with non-contiguous
memory for NUMA nodes. That is something we need to find a way to cope
with anyway.



 

