[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Fri, 30 Sep 2022 10:25:24 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=Uh2r5wrtrLXWOlC57QQzNkBMW+t6CAM4JjuYa45kB4s=; b=PxCM6fIjGrpU2rnsSYGCWL4XRuvQP1POZpCusEyWBs9weUiLYGkQgO3yKeWBt+zlPg1OxUCWH2wlrvOUB1ao7JE4iMgvK7tGrBeacouHmR4sc7jsKTzGBWEw1iDkEBBKeGWHfjQC3lzhPMnac4DHDrDPBJK4uKAXJfuJ1yF3VFoqywTsfyl9yhKL15m/FPd/W4IjxO1AanowuYg5ySF6HAobX/m2DYWYVTn0SkMTKHKBXHkoBks0FCH9j0Uct+7A2eJAnxSSOgtsArKjMoDsrAvBqSZpqO1iA14b58cbJKJ5C536SabfYieMfmX2yZIavnE4w4NeeQUAY9R2Xi73Xw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=IYLAVYIUWR98079RZFIHkGyGWzrbH3NTkHJ1EFWe4pd4J3xwXW/px96LiU83KnAwjYBLjDZmaxJ+WRU0oXovBD4471ZfPxuF4MzS238FtiQHCydHDjN+Ha7D35Bt/9ZTUBNW6QNg5tByTU1bIuYJpGRjwKrrGOd+WKb+ov1jlewUNOjxe+RuFRvnB6xbJX/NzjpGJgE64lI4Q4vnsImwd4++00dO7xtqdhM4mL2WLblBcSZ+Mxcog4XWJBWg0fVWDNLx8987Zo4qWqE+3k/r4fulfYBYdqaR7JvpspcDnHqsbFW4MImxdjLkknP4SVIPzdvNymGgmakQHqo71nShaw==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Fri, 30 Sep 2022 08:26:03 +0000
  • Ironport-data: A9a23:WFLNCq8LbNO4SzqO9izoDrUDm3+TJUtcMsCJ2f8bNWPcYEJGY0x3m mYcXzrQOfiDMTbxet5xPYzn9EJQusXczdVgSwptrSA8E34SpcT7XtnIdU2Y0wF+jCHgZBk+s 5hBMImowOQcFCK0SsKFa+C5xZVE/fjUAOC6UIYoAwgpLSd8UiAtlBl/rOAwh49skLCRDhiE/ Nj/uKUzAnf8s9JPGj9SuvnrRC9H5qyo4mtJ5QRmPJingXeF/5UrJMNHTU2OByOQrrl8RoaSW +vFxbelyWLVlz9F5gSNy+uTnuUiG9Y+DCDW4pZkc/HKbitq/0Te5p0TJvsEAXq7vh3S9zxHJ HehgrTrIeshFvWkdO3wyHC0GQkmVUFN0OevzXRSLaV/ZqAJGpfh66wGMa04AWEX0sNrK3B2/ P8yFB8Uakmvpv+u37CEa/Y506zPLOGzVG8ekldJ6GiDSNoDH9XESaiM4sJE1jAtgMwIBezZe 8cSdTtoalLHfgFLPVAUTpk5mY9EhFGmK2Ee9A3T+PtxujeIpOBy+OGF3N79YNuFSN8Thk+Fj mnH4374ElcRM9n3JT+toiv32raUwXiTtIQ6EL+9xOFXqn+p/TYONyw3cUPnmvyXoxvrMz5YA wlOksY0loAw/kG2Stj2XzWjvWWJ+BUbXrJ4A+A8rQ2A1KfQywKYHXQfCC5MbsQ8s807TiBs0 UWG9/v2ARR/vbvTTmiSnop4thu3MCkRaGMHOykNSFJf58G5+d5oyBXSUtxkDai5yMXvHi39y CyLqy54gKgPickM1OOw+lWvby+Qm6UlhzUdvm3/Nl9JJCsjDGJ5T+REMWTm0Ms=
  • Ironport-hdrordr: A9a23:9bdxi6/YBmwzn5d1Gbpuk+E9db1zdoMgy1knxilNoENuH/Bwxv rFoB1E73TJYVYqN03IV+rwXZVoZUmsjaKdgLNhRItKOTOLhILGFuFfBOfZsl7d8mjFh5VgPM RbAtRD4b/LfD9HZK/BiWHXcurIguP3lpxA7d2uskuFJjsaD52IgT0JaDpyRSZNNXN77NcCZe 2hz/sCgwDlVWUcb8y9CHVAd+/fp+fTnJajRRIdHRYo5CSHkDvtsdfBYlGl9yZbdwkK7aYp8G DDnQC8zqK/s8ujwhuZ82PI9ZxZlPbo19MGLs2Rjco+LCnql2+TFfJccozHmApwjPCk6V4snt WJixA8P/5r43eURW2xqQuF4XiT7B8er1vZjXOIi3rqpsL0ABggDdBauI5fehzFr2I9odBVys twri+knqsSKSmFsDX25tDOWR0vvFGzu2AenekaiGEaeZcCaYVWsZcU8CpuYd099RrBmc8a+d RVfY/hDK48SyLaU5mZhBgl/DWUZAV+Iv/cKXJy+vB80FBt7QNEJgUjtY8id0w7heMAoql/lp v525tT5c9zp+8tHNdA7bQ6ML+KI12IZy7wG0SvBnmiPJ07Ghv22u7KCfMOlamXRKA=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> SRAT may describe individual nodes using multiple ranges. When they're
> adjacent (with or without a gap in between), only the start of the first
> such range actually needs accounting for. Furthermore the very first
> range doesn't need considering of its start address at all, as it's fine
> to associate all lower addresses (with no memory) with that same node.
> For this to work, the array of ranges needs to be sorted by address -
> adjust logic accordingly in acpi_numa_memory_affinity_init().

Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
would benefit from a bit of context on why and how memnode_shift is
used.

> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> On my Dinar and Rome systems this changes memnodemapsize to a single
> page. Originally they used 9 / 130 pages (with shifts going from 8 / 6
> to 15 / 16) respectively, resulting from lowmem gaps [A0,FF] / [A0,BF].
> 
> This goes on top of "x86/NUMA: correct memnode_shift calculation for
> single node system".
> 
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -127,7 +127,8 @@ static int __init extract_lsb_from_nodes
>          epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>          if ( spdx >= epdx )
>              continue;
> -        bitfield |= spdx;
> +        if ( i && (!nodeids || nodeids[i - 1] != nodeids[i]) )
> +            bitfield |= spdx;
>          if ( !i || !nodeids || nodeids[i - 1] != nodeids[i] )
>              nodes_used++;
>          if ( epdx > memtop )
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -312,6 +312,7 @@ acpi_numa_memory_affinity_init(const str
>       unsigned pxm;
>       nodeid_t node;
>       unsigned int i;
> +     bool next = false;
>  
>       if (srat_disabled())
>               return;
> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>              node, pxm, start, end - 1,
>              ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>  
> -     node_memblk_range[num_node_memblks].start = start;
> -     node_memblk_range[num_node_memblks].end = end;
> -     memblk_nodeid[num_node_memblks] = node;
> +     /* Keep node_memblk_range[] sorted by address. */
> +     for (i = 0; i < num_node_memblks; ++i)
> +             if (node_memblk_range[i].start > start ||
> +                 (node_memblk_range[i].start == start &&

Maybe I'm confused, but won't .start == start means we have
overlapping ranges?

> +                  node_memblk_range[i].end > end))
> +                     break;
> +
> +     memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
> +             (num_node_memblks - i) * sizeof(*node_memblk_range));
> +     node_memblk_range[i].start = start;
> +     node_memblk_range[i].end = end;
> +
> +     memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
> +             (num_node_memblks - i) * sizeof(*memblk_nodeid));
> +     memblk_nodeid[i] = node;
> +
>       if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> -             __set_bit(num_node_memblks, memblk_hotplug);
> +             next = true;
>               if (end > mem_hotplug)
>                       mem_hotplug = end;
>       }
> +     for (; i <= num_node_memblks; ++i) {
> +             bool prev = next;
> +
> +             next = test_bit(i, memblk_hotplug);
> +             if (prev)
> +                     __set_bit(i, memblk_hotplug);
> +             else
> +                     __clear_bit(i, memblk_hotplug);

Nit: I think you could avoid doing the clear for the last bit, ie:
else if (i != num_node_memblks) __clear_bit(...);

But I'm not sure it's worth adding the logic, just makes it more
complicated to follow.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.