
Re: [Xen-devel] arm64: Approach for DT based NUMA and issues



On Mon, Nov 28, 2016 at 7:20 PM, Andre Przywara <andre.przywara@xxxxxxx> wrote:
> Hi Vijay,
>
> On 26/11/16 06:59, Vijay Kilari wrote:
>> Hi,
>>
>>    Below is a basic write-up on DT-based NUMA feature support for the
>> arm64 platform. I have attempted to get NUMA support working; however,
>> I face the issues below and would like to discuss them. Please let me
>> know your comments. ACPI support is yet to be looked at.
>>
>> DT based NUMA support for arm64 platform
>> ========================================
>> For Xen to boot on a NUMA arm64 platform, it needs to parse the
>> CPU and memory nodes when booting via DT. Here I would like to
>> discuss the DT-based booting mechanism and the issues related to it.
>>
>> 1) Parsing CPU and Memory nodes:
>> ---------------------------------------------------
>>
>> The NUMA information associated with CPUs and memory is passed in the
>> DT using the numa-node-id u32 integer value. More information about the
>> NUMA binding is available in the Linux kernel at
>> Documentation/devicetree/bindings/numa.txt.
>>
>> Similar to the Linux kernel, the cpu and memory nodes of the DT are
>> parsed and the numa-node-id information is populated in the cpu_parsed
>> and memory_parsed node_t masks (a sketch of the parsing follows below).
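
To illustrate, roughly what the parsing could look like (a sketch only:
the fdt_* calls are standard libfdt, nodeid_t/NUMA_NO_NODE/node_set() are
the existing Xen helpers, and the nodemask names follow the write-up; the
trailing lines are the per-node callback context):

static nodeid_t __init device_tree_get_nid(const void *fdt, int node)
{
    /* numa-node-id is a u32 per the kernel's DT NUMA binding */
    const fdt32_t *prop = fdt_getprop(fdt, node, "numa-node-id", NULL);

    if ( !prop )
        return NUMA_NO_NODE;

    return (nodeid_t)fdt32_to_cpu(*prop);
}

/* While scanning the flattened DT, for each cpu/memory node: */
nodeid_t nid = device_tree_get_nid(fdt, node);

if ( nid != NUMA_NO_NODE )
    node_set(nid, is_memory_node ? memory_parsed : cpu_parsed);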
>>
>> When booting in UEFI mode, UEFI passes the memory information to Dom0
>> using the EFI memory descriptor table and deletes the memory nodes
>> from the host DT. However, to fetch the memory NUMA node ids, the
>> memory DT nodes must not be deleted by the EFI stub.
>
> So is this what the Cavium UEFI firmware actually does today?
> I have been told that removing the DT memory nodes was the original idea
> when UEFI was architected for ARM, but it's not clear whether this is
> actually implemented. Also this may differ from platform to platform, I
> guess.
> I don't have easy access to a box, so can't check atm.

Please see the patch from Ard in the kernel. This change is required in
Xen's EFI code as well:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/drivers/firmware/efi/arm-init.c?id=500899c2cc3e3f06140373b587a69d30650f2d9d

>
>> ISSUE: When the memory nodes are _NOT_ deleted from the host DT by the
>> EFI stub, Xen identifies them [xen/arch/arm/bootfdt.c, early_scan_node()]
>> and adds their memory ranges to the bootinfo.mem structure, thereby
>> creating duplicate entries, and eventually initialization fails.
>>
>> Possible solution: While adding a new memory region to bootinfo.mem,
>> check for duplicate entries and back off if the entry is already
>> available from the UEFI memory info table.
>
> So why do we iterate over DT nodes if we have populated via the UEFI
> memmap already? Can't we just have an order:
> 1) if UEFI memmap available: parse that, populate bootinfo.mem, ignore DT
> 2) if UEFI not available, parse DT memory nodes, populate bootinfo.mem

Yes, that could be done; I will have a look.
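
I.e. something like this in early_scan_node() (a sketch only:
device_tree_node_matches() and process_memory_node() are the existing
bootfdt.c helpers, while the UEFI check and the NUMA helper are
hypothetical names):

    else if ( device_tree_node_matches(fdt, node, "memory") )
    {
        /*
         * When booted via UEFI, bootinfo.mem has already been populated
         * from the EFI memory map: only pick up numa-node-id from the
         * DT node and skip the ranges, avoiding duplicate bootinfo.mem
         * entries.
         */
        if ( !booted_via_uefi() )      /* hypothetical predicate */
            process_memory_node(fdt, node, name,
                                address_cells, size_cells);

        dt_numa_process_memory_node(fdt, node); /* hypothetical helper */
    }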
>
> So to make this work with NUMA, we would add another chain for NUMA parsing:
> 1) if ACPI is available, use the SRAT table
> 2) if ACPI is not available, check the DT memory nodes
>
> This should work with all cases: pure DT, UEFI with DT, UEFI with ACPI
>
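In code that chain could look like the sketch below (the function names
are hypothetical; acpi_disabled is the existing Xen flag):

void __init numa_init(void)
{
    int ret = -ENODEV;

    if ( !acpi_disabled )
        ret = acpi_numa_init();   /* parse the ACPI SRAT table */

    if ( ret )
        ret = dt_numa_init();     /* fall back to DT memory/cpu nodes */

    if ( ret )
        numa_dummy_init();        /* single-node fallback */
}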
>>
>> 2) Parsing CPU nodes:
>> ---------------------------------
>> The CPU nodes are parsed to extract numa-node-id info for each cpu and
>> cpu_nodemask is populated.
>>
>> The MPIDR register value is read for each CPU and cpu_to_node[] is populated.
>
> So there is no issue here and that works as expected?

No issue. The MPIDR is already read on secondary CPU boot, from which the
cpu_to_node[] data is updated.
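
I.e. on the secondary boot path, something like this (a sketch; the
lookup helper is a hypothetical name, cpu_to_node[] as described above):

void __init numa_set_cpu_node(unsigned int cpu, register_t mpidr)
{
    /* Node id recorded for this MPIDR while parsing the DT cpu nodes */
    nodeid_t nid = numa_node_from_mpidr(mpidr);

    cpu_to_node[cpu] = (nid == NUMA_NO_NODE) ? 0 : nid;
}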

>
>> 3) Parsing Memory nodes:
>> --------------------------------------
>> For all the DT memory nodes in the flattened DT, the start address,
>> size and numa-node-id values are extracted and stored in
>> node_memblk_range[], which is of type struct node.
>>
>> Each bootinfo.mem entry from UEFI is verified against node_memblk_range[] and
>> NODE_DATA is populated with start PFN, end PFN and nodeid.
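
For reference, the structures here can mirror what Xen's x86 NUMA code
(xen/arch/x86/srat.c) already uses, roughly:

struct node {
    u64 start, end;   /* physical address range covered by this memblk */
};

static struct node node_memblk_range[NR_NODE_MEMBLKS];
static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
static int num_node_memblks;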
>>
>> Populating memnodemap:
>>
>> The memnodemap[] is allocated from the heap and, using the NODE_DATA
>> structure, populated with the nodeid for each page index.
>>
>> This memnodemap info is used by the memory allocator to fetch the
>> memory node id for a given page by calling phys_to_nid(), as sketched
>> below.
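
That lookup again mirrors the x86 side, where phys_to_nid() is just an
index into memnodemap[] by the physical address shifted by memnode_shift,
roughly:

extern nodeid_t *memnodemap;
extern unsigned int memnode_shift;

static inline nodeid_t phys_to_nid(paddr_t addr)
{
    return memnodemap[paddr_to_pdx(addr) >> memnode_shift];
}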
>>
>> ISSUE: phys_to_nid() is called by the memory allocator before
>> memnodemap[] is initialized.
>>
>> Since memnodemap[] is allocated from the heap, the boot allocator must
>> be initialized first. But the boot allocator needs phys_to_nid(), which
>> is not available until memnodemap[] is initialized. So there is a
>> circular dependency during initialization. To overcome this,
>> phys_to_nid() should rely on node_memblk_range[] to get the nodeid
>> until memnodemap[] is initialized.
>
> What about having an early boot fallback: like:
>
> nodeid_t phys_to_nid(paddr_t addr)
> {
>         if (!memnodemap)
>                 return 0;
>         ....
> }

The memory allocator has all the nodes' memory from bootinfo.mem, so it
fails when phys_to_nid() returns 0 for node 1 memory.
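
So instead of returning 0, the early fallback would have to do a linear
search of the memblks recorded while parsing the DT memory nodes (a
sketch, using the node_memblk_range[]/memblk_nodeid[] arrays from above):

nodeid_t phys_to_nid(paddr_t addr)
{
    int i;

    if ( !memnodemap )
    {
        /* Early boot: memnodemap[] is not set up yet */
        for ( i = 0; i < num_node_memblks; i++ )
            if ( addr >= node_memblk_range[i].start &&
                 addr < node_memblk_range[i].end )
                return memblk_nodeid[i];
        return 0;   /* no match: fall back to node 0 */
    }

    return memnodemap[paddr_to_pdx(addr) >> memnode_shift];
}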

>
> Cheers,
> Andre.
>
>> 4) Generating memory nodes for DOM0
>> ---------------------------------------------------------
>> Linux kernel device drivers that use devm_kzalloc() try to allocate
>> memory from the local memory node, so Dom0 needs to have memory
>> allocated on all the available nodes of the system.
>>
>> E.g. the SMMU driver of a device on node 1 tries to allocate memory
>> on node 1.
>>
>> ISSUE:
>>  - Dom0's memory should be split across all the available memory nodes
>>    of the system, and memory nodes should be generated accordingly.
>>  - The memory DT nodes generated by Xen for Dom0 should populate the
>>    numa-node-id information (see the sketch after this list).
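
For the generated DT, each memory node Xen creates for Dom0 would then
carry a numa-node-id property. A minimal sketch (the fdt_* sequential-
write calls are standard libfdt; the one-bank-per-node layout and the
2/2 address/size cells are assumptions):

static int make_numa_memory_node(void *fdt, nodeid_t nid,
                                 u64 start, u64 size)
{
    fdt32_t reg[4];
    char name[32];
    int res;

    snprintf(name, sizeof(name), "memory@%"PRIx64, start);

    res = fdt_begin_node(fdt, name);
    if ( res )
        return res;

    res = fdt_property_string(fdt, "device_type", "memory");
    if ( res )
        return res;

    /* #address-cells = <2>, #size-cells = <2> */
    reg[0] = cpu_to_fdt32(start >> 32);
    reg[1] = cpu_to_fdt32(start & 0xffffffff);
    reg[2] = cpu_to_fdt32(size >> 32);
    reg[3] = cpu_to_fdt32(size & 0xffffffff);
    res = fdt_property(fdt, "reg", reg, sizeof(reg));
    if ( res )
        return res;

    /* Tell Dom0 which NUMA node this range belongs to */
    res = fdt_property_cell(fdt, "numa-node-id", nid);
    if ( res )
        return res;

    return fdt_end_node(fdt);
}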
>>
>> Regards
>> Vijay
>>
