[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 30 Sep 2022 10:36:20 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=qvE2Hj0t5QUvQsd6fn3XoiQAlMmNPdrwVTOWgCLm0ew=; b=NDPoDbiuIqsUX2PNI0njmS3XEVK3b5ZmhnoxDPBiDLaA50FT2qNazFPHspyQ70z5sav/NNhl1MfHgY3f5j1R7TFLTEhtgcWIPEeODcKr/NnRRDYJefqwS9qmQKzq5UE0RsTuC8/5L/kdkE61ppQe90D+MoySoHKdFuvYtGdJgc1V0q8UmkNydt6KvTM8Iyz1jtNjYxFrOD0UmP3COiBSoM6uvk98+DFuCrxgIdl/MupQU59Fl3xS6YY6a15poz4t5w7brIf9x300BN0egW/FgxIJygnt3MfzLOjCPpfK7ldrEUhBhUBjBpOQ4Tg5670nB8M56+22d2iaUTai/jWPrw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=gt0QJJNnmV9AlTrf6BZnVs/voyMgwzzAWtQSzKPCiFbyZlJO7BQ8Wwq4okSbjW1tMoUZr7fvAg+BPBhxb8kbTNZGWPAaYWVHWro8DrhyEb9GjZfZXO+1Y3gYNBE7naWMykkIXZx9cgTsHxv+3AhJoLnA2BsfXPHHHSEA0b8B8EerR4nP04X0FaYydYRJKTTOZVh88Lz5AYm18l3KPafebJtu7B+hTPLh+PhAAwP2h93YR0YeIAL85D0rWqtsX2uiYOdGDyZ3upfkGIXr/lpdiTXdwp06mIUu/WmmeqSefsohO/xMJh/rUrSMPa4qrbN3RJE7O5WsJES0l4vXiKplsg==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Fri, 30 Sep 2022 08:36:37 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 30.09.2022 10:25, Roger Pau Monné wrote:
> On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
>> SRAT may describe individual nodes using multiple ranges. When they're
>> adjacent (with or without a gap in between), only the start of the first
>> such range actually needs accounting for. Furthermore the very first
>> range doesn't need considering of its start address at all, as it's fine
>> to associate all lower addresses (with no memory) with that same node.
>> For this to work, the array of ranges needs to be sorted by address -
>> adjust logic accordingly in acpi_numa_memory_affinity_init().
> 
> Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
> would benefit from a bit of context on why and how memnode_shift is
> used.

It's used solely to shift PDXes right before indexing memnodemap[].
Hence a larger shift allows for a smaller array (less memory and,
perhaps more importantly, less cache footprint). Personally I don't
think such needs mentioning in a patch, but I know others think
differently (George for example looks to believe that the present
situation should always be described). In the case here a simple
grep for memnode_shift would tell you, and even if I described this
here it wouldn't really help review I think: You'd still want to
verify then that what I claim actually matches reality.

>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>             node, pxm, start, end - 1,
>>             ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>  
>> -    node_memblk_range[num_node_memblks].start = start;
>> -    node_memblk_range[num_node_memblks].end = end;
>> -    memblk_nodeid[num_node_memblks] = node;
>> +    /* Keep node_memblk_range[] sorted by address. */
>> +    for (i = 0; i < num_node_memblks; ++i)
>> +            if (node_memblk_range[i].start > start ||
>> +                (node_memblk_range[i].start == start &&
> 
> Maybe I'm confused, but won't .start == start means we have
> overlapping ranges?

Yes, except when a range is empty.

>> +                 node_memblk_range[i].end > end))
>> +                    break;
>> +
>> +    memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
>> +            (num_node_memblks - i) * sizeof(*node_memblk_range));
>> +    node_memblk_range[i].start = start;
>> +    node_memblk_range[i].end = end;
>> +
>> +    memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
>> +            (num_node_memblks - i) * sizeof(*memblk_nodeid));
>> +    memblk_nodeid[i] = node;
>> +
>>      if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>> -            __set_bit(num_node_memblks, memblk_hotplug);
>> +            next = true;
>>              if (end > mem_hotplug)
>>                      mem_hotplug = end;
>>      }
>> +    for (; i <= num_node_memblks; ++i) {
>> +            bool prev = next;
>> +
>> +            next = test_bit(i, memblk_hotplug);
>> +            if (prev)
>> +                    __set_bit(i, memblk_hotplug);
>> +            else
>> +                    __clear_bit(i, memblk_hotplug);
> 
> Nit: I think you could avoid doing the clear for the last bit, ie:
> else if (i != num_node_memblks) __clear_bit(...);
> 
> But I'm not sure it's worth adding the logic, just makes it more
> complicated to follow.

Indeed. I did consider both this and using a single __change_bit()
wrapped in a suitable if(). Both looked to me like they would hamper
proving the code is doing what it ought to do.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.