
Re: [Xen-devel] arm64: Approach for DT based NUMA and issues



Hi Vijay,

On 26/11/16 06:59, Vijay Kilari wrote:
> Hi,
> 
>    Below is a basic write-up on DT based NUMA feature support for the
> arm64 platform. I have attempted to get NUMA support working, but I face
> the issues below and would like to discuss them. Please let me know your
> comments on this. I have yet to look at ACPI support.
> 
> DT based NUMA support for arm64 platform
> ========================================
> For Xen to boot on a NUMA arm64 platform, Xen needs to parse the
> CPU and memory nodes when booting via DT. Here I would like to
> discuss the DT based booting mechanism and the issues related to it.
> 
> 1) Parsing CPU and Memory nodes:
> ---------------------------------------------------
> 
> The NUMA information associated with CPU and memory nodes is passed in
> the DT using the numa-node-id u32 integer value. More information about
> the NUMA binding is available in the Linux kernel at
> Documentation/devicetree/bindings/numa.txt
> 
> Similar to the Linux kernel, the cpu and memory nodes of the DT are
> parsed and the numa-node-id information is populated in the cpu_parsed
> and memory_parsed node_t masks.
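> 
> A minimal sketch of the property lookup, assuming libfdt (the helper
> name is illustrative, not existing code):
> 
>     /* Read the numa-node-id property of a cpu or memory node. */
>     static nodeid_t __init fdt_parse_numa_node_id(const void *fdt, int node)
>     {
>         const fdt32_t *prop;
> 
>         prop = fdt_getprop(fdt, node, "numa-node-id", NULL);
>         if ( !prop )
>             return 0; /* treat a missing property as node 0 */
> 
>         return (nodeid_t)fdt32_to_cpu(*prop);
>     }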
> 
> When booting in UEFI mode, UEFI passes memory information to Dom0
> using the EFI memory descriptor table and deletes the memory nodes
> from the host DT. However, to fetch the memory numa-node-id, the memory
> DT nodes should not be deleted by the EFI stub.

So is this what the Cavium UEFI firmware actually does today?
I have been told that removing the DT memory nodes was the original idea
when UEFI was architected for ARM, but it's not clear whether this is
actually implemented. Also this may differ from platform to platform, I
guess.
I don't have easy access to a box, so can't check atm.

> ISSUE: When the memory nodes are _NOT_ deleted from the host DT by the
> EFI stub, Xen identifies them [xen/arch/arm/bootfdt.c, early_scan_node()]
> and adds their memory ranges to the bootinfo.mem structure, thereby
> creating duplicate entries, and initialization eventually fails.
> 
> Possible Solution: While adding a new memory region to bootinfo.mem,
> check for duplicate entries and back off if the entry is already
> available from the UEFI memory info table, as sketched below.
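> 
> A minimal sketch of that check, assuming the struct meminfo layout from
> xen/include/asm-arm/setup.h; the helper name is illustrative, and a real
> implementation would probably also have to handle partial overlaps
> rather than exact matches:
> 
>     /* Skip a range the UEFI memory map already contributed. */
>     static bool __init meminfo_contains(const struct meminfo *mem,
>                                         paddr_t start, paddr_t size)
>     {
>         unsigned int i;
> 
>         for ( i = 0; i < mem->nr_banks; i++ )
>             if ( mem->bank[i].start == start && mem->bank[i].size == size )
>                 return true;
> 
>         return false;
>     }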

So why do we iterate over the DT memory nodes if we have already populated
bootinfo.mem via the UEFI memmap? Can't we just have an order:
1) if the UEFI memmap is available: parse that, populate bootinfo.mem, ignore DT
2) if UEFI is not available: parse the DT memory nodes, populate bootinfo.mem

So to make this work with NUMA, we would add another chain for NUMA parsing:
1) if ACPI is available, use the SRAT table
2) if ACPI is not available, check the DT memory nodes

This should work in all cases: pure DT, UEFI with DT, UEFI with ACPI.
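
Roughly, in pseudo-C (all of these function names are made up):

        /* bootinfo.mem gets populated exactly once */
        if (efi_memmap_available())
                efi_populate_bootinfo_mem();    /* UEFI memory map */
        else
                dt_populate_bootinfo_mem();     /* DT memory nodes */

        /* NUMA information is a separate chain */
        if (acpi_available())
                acpi_numa_init();               /* SRAT */
        else
                dt_numa_init();                 /* DT numa-node-id */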

> 
> 2) Parsing CPU nodes:
> ---------------------------------
> The CPU nodes are parsed to extract numa-node-id info for each cpu and
> cpu_nodemask is populated.
> 
> The MPIDR register value is read for each CPU and cpu_to_node[] is populated.
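> 
> A sketch of that step, assuming a lookup table built while parsing the
> DT (mpidr_to_nid() is a hypothetical helper):
> 
>     /* Map each logical CPU to its NUMA node via its MPIDR value. */
>     static void __init numa_set_cpu_nodes(void)
>     {
>         unsigned int cpu;
> 
>         for_each_possible_cpu ( cpu )
>             cpu_to_node[cpu] = mpidr_to_nid(cpu_logical_map(cpu));
>     }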

So there is no issue here and that works as expected?

> 3) Parsing Memory nodes:
> --------------------------------------
> For all the DT memory nodes in the flattened DT, the start address, size
> and numa-node-id value are extracted and stored in node_memblk_range[],
> which is of type struct node.
> 
> Each bootinfo.mem entry from UEFI is verified against node_memblk_range[]
> and NODE_DATA is populated with the start PFN, end PFN and node id, as
> sketched below.
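> 
> A sketch of that verification, assuming the x86-style arrays
> (node_memblk_range[], memblk_nodeid[]) are reused on arm; the helper
> name is illustrative:
> 
>     /* Find the node id of a bootinfo.mem bank by scanning the
>      * parsed memblk ranges. */
>     static nodeid_t __init bank_to_nid(paddr_t start, paddr_t size)
>     {
>         unsigned int i;
> 
>         for ( i = 0; i < num_node_memblks; i++ )
>             if ( start >= node_memblk_range[i].start &&
>                  start + size <= node_memblk_range[i].end )
>                 return memblk_nodeid[i];
> 
>         return NUMA_NO_NODE;
>     }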
> 
> Populating memnodemap:
> 
> The memnodemap[] is allocated from the heap and, using the NODE_DATA
> structure, populated with the node id for each page index.
> 
> This memnodemap info is used by the memory allocator to fetch the memory
> node id for a given page via phys_to_nid().
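> 
> A sketch of the population loop, again assuming the x86-style
> node_memblk_range[]/memblk_nodeid[] arrays and memnode_shift:
> 
>     /* Record the node id for every pdx chunk covered by a memblk,
>      * so that phys_to_nid() becomes a simple table lookup. */
>     for ( i = 0; i < num_node_memblks; i++ )
>     {
>         unsigned long spdx = paddr_to_pdx(node_memblk_range[i].start);
>         unsigned long epdx = paddr_to_pdx(node_memblk_range[i].end - 1);
>         unsigned long idx;
> 
>         for ( idx = spdx >> memnode_shift;
>               idx <= epdx >> memnode_shift; idx++ )
>             memnodemap[idx] = memblk_nodeid[i];
>     }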
> 
> ISSUE: phys_to_nid() is called by the memory allocator before memnodemap[]
> is initialized.
> 
> Since memnodemap[] is allocated from the heap, the boot allocator must be
> initialized first. But the boot allocator needs phys_to_nid(), which is not
> available until memnodemap[] is initialized, so there is a deadlock
> during initialization. To overcome this, phys_to_nid() should rely on
> node_memblk_range[] to get the node id until memnodemap[] is initialized.

What about having an early boot fallback, like:

nodeid_t phys_to_nid(paddr_t addr)
{
        if (!memnodemap)
                return 0;
        /* regular lookup, as the x86 version does it */
        return memnodemap[paddr_to_pdx(addr) >> memnode_shift];
}

Cheers,
Andre.

> 4) Generating memory nodes for DOM0
> ---------------------------------------------------------
> Linux kernel device drivers that use devm_kzalloc() try to allocate
> memory from the local memory node. So Dom0 needs to have memory allocated
> on all the available nodes of the system.
> 
> E.g. the SMMU driver of a device on node 1 tries to allocate memory
> on node 1.
> 
> ISSUE:
>  - Dom0's memory should be split across all the available memory nodes
>    of the system, and the memory nodes should be generated accordingly.
>  - The memory DT nodes generated by Xen for Dom0 should populate the
>    numa-node-id information, as sketched below.
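> 
> A minimal sketch of such a node, assuming Xen keeps using libfdt's
> sequential-write API as in make_memory_node() (error handling trimmed;
> the reg/nr_cells/nid variables are illustrative):
> 
>     /* One memory node per NUMA node backing the Dom0 allocation. */
>     fdt_begin_node(fdt, "memory");
>     fdt_property_string(fdt, "device_type", "memory");
>     fdt_property(fdt, "reg", reg, nr_cells * sizeof(*reg));
>     fdt_property_cell(fdt, "numa-node-id", nid);
>     fdt_end_node(fdt);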
> 
> Regards
> Vijay
> 
