
RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range


  • To: Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • From: Wei Chen <Wei.Chen@xxxxxxx>
  • Date: Tue, 28 Sep 2021 04:41:21 +0000
  • Accept-language: en-US
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "julien@xxxxxxx" <julien@xxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, "jbeulich@xxxxxxxx" <jbeulich@xxxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>, "roger.pau@xxxxxxxxxx" <roger.pau@xxxxxxxxxx>, "wl@xxxxxxx" <wl@xxxxxxx>
  • Delivery-date: Tue, 28 Sep 2021 04:41:47 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH 08/37] xen/x86: add detection of discontinous node memory range

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> Sent: September 28, 2021 1:19
> To: Wei Chen <Wei.Chen@xxxxxxx>
> Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; xen-
> devel@xxxxxxxxxxxxxxxxxxxx; julien@xxxxxxx; Bertrand Marquis
> <Bertrand.Marquis@xxxxxxx>; jbeulich@xxxxxxxx; andrew.cooper3@xxxxxxxxxx;
> roger.pau@xxxxxxxxxx; wl@xxxxxxx
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On Mon, 27 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > > Sent: September 27, 2021 13:05
> > > To: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > > Cc: Wei Chen <Wei.Chen@xxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx;
> > > julien@xxxxxxx; Bertrand Marquis <Bertrand.Marquis@xxxxxxx>;
> > > jbeulich@xxxxxxxx; andrew.cooper3@xxxxxxxxxx; roger.pau@xxxxxxxxxx;
> > > wl@xxxxxxx
> > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > >
> > > On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > > > > > Sent: September 25, 2021 3:53
> > > > > > To: Wei Chen <Wei.Chen@xxxxxxx>
> > > > > > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; xen-
> > > > > > devel@xxxxxxxxxxxxxxxxxxxx; julien@xxxxxxx; Bertrand Marquis
> > > > > > <Bertrand.Marquis@xxxxxxx>; jbeulich@xxxxxxxx; andrew.cooper3@xxxxxxxxxx;
> > > > > > roger.pau@xxxxxxxxxx; wl@xxxxxxx
> > > > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > > > memory range
> > > > > >
> > > > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > > > -----Original Message-----
> > > > > > > > From: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > > > > > > > Sent: September 24, 2021 8:26
> > > > > > > > To: Wei Chen <Wei.Chen@xxxxxxx>
> > > > > > > > Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx; sstabellini@xxxxxxxxxx; julien@xxxxxxx;
> > > > > > > > Bertrand Marquis <Bertrand.Marquis@xxxxxxx>; jbeulich@xxxxxxxx;
> > > > > > > > andrew.cooper3@xxxxxxxxxx; roger.pau@xxxxxxxxxx; wl@xxxxxxx
> > > > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > > > > > memory range
> > > > > > > >
> > > > > > > > CC'ing x86 maintainers
> > > > > > > >
> > > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > > > One NUMA node may contain several memory blocks. In current Xen
> > > > > > > > > code, Xen will maintain a node memory range for each node to cover
> > > > > > > > > all its memory blocks. But here comes the problem: if, in the gap
> > > > > > > > > between one node's two memory blocks, there are some memory blocks
> > > > > > > > > that don't belong to this node (remote memory blocks), this node's
> > > > > > > > > memory range will be expanded to cover these remote memory blocks.
> > > > > > > > >
> > > > > > > > > One node's memory range containing other nodes' memory is obviously
> > > > > > > > > not very reasonable. This means the current NUMA code can only
> > > > > > > > > support nodes with contiguous memory blocks. However, on a physical
> > > > > > > > > machine, the addresses of multiple nodes can be interleaved.
> > > > > > > > >
> > > > > > > > > So in this patch, we add code to detect discontinuous memory blocks
> > > > > > > > > for one node. NUMA initialization will fail and error messages
> > > > > > > > > will be printed when Xen detects such a hardware configuration.
> > > > > > > >
> > > > > > > > At least on ARM, it is not just memory that can be interleaved,
> > > > > > > > but also MMIO regions. For instance:
> > > > > > > >
> > > > > > > > node0 bank0 0-0x1000000
> > > > > > > > MMIO 0x1000000-0x1002000
> > > > > > > > Hole 0x1002000-0x2000000
> > > > > > > > node0 bank1 0x2000000-0x3000000
> > > > > > > >
> > > > > > > > So I am not familiar with the SRAT format, but I think on ARM the
> > > > > > > > check would look different: we would just look for multiple memory
> > > > > > > > ranges under a device_type = "memory" node of a NUMA node in device
> > > > > > > > tree.
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > Should I include/refine the above message in the commit log?
> > > > > >
> > > > > > Let me ask you a question first.
> > > > > >
> > > > > > With the NUMA implementation of this patch series, can we deal with
> > > > > > cases where each node has multiple memory banks, not interleaved?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > As an example:
> > > > > >
> > > > > > node0: 0x0        - 0x10000000
> > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > node0: 0x20000000 - 0x30000000
> > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > node1: 0x50000000 - 0x60000000
> > > > > > MMIO : 0x60000000 - 0x80000000
> > > > > > node2: 0x80000000 - 0x90000000
> > > > > >
> > > > > >
> > > > > > I assume we can deal with this case simply by setting node0 memory to
> > > > > > 0x0-0x30000000 even if there is actually something else, a device, that
> > > > > > doesn't belong to node0 in between the two node0 banks?
> > > > >
> > > > > This configuration is rare in SoC design, but it is not impossible.
> > > >
> > > > Definitely, I have seen it before.
> > > >
> > > >
> > > > > > Is it only other nodes' memory interleaved that causes issues? In other
> > > > > > words, is only the following a problematic scenario?
> > > > > >
> > > > > > node0: 0x0        - 0x10000000
> > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > node1: 0x20000000 - 0x30000000
> > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > node0: 0x50000000 - 0x60000000
> > > > > >
> > > > > > Because node1 is in between the two ranges of node0?
> > > > > >
> > > > >
> > > > > But only device_type="memory" ranges can be added for allocation.
> > > > > For MMIO there are two cases:
> > > > > 1. MMIO doesn't have a NUMA id property.
> > > > > 2. MMIO has a NUMA id property, just like some PCIe controllers.
> > > > >    But we don't need to handle these kinds of MMIO devices in
> > > > >    memory block parsing, because we don't need to allocate memory
> > > > >    from these MMIO ranges. And for accessing, we need a NUMA-aware
> > > > >    PCIe controller driver or generic NUMA-aware MMIO accessing APIs.
> > > >
> > > > Yes, I am not too worried about devices with a NUMA id property because
> > > > they are less common and this series doesn't handle them at all, right?
> > > > I imagine they would be treated like any other device without NUMA
> > > > awareness.
> > > >
> > > > I am thinking about the case where the memory of each NUMA node is made
> > > > of multiple banks. I understand that this patch adds an explicit check
> > > > for cases where these banks are interleaving; however, there are many
> > > > other cases where NUMA memory nodes are *not* interleaving but they are
> > > > still made of multiple discontinuous banks, like in the two examples
> > > > above.
> > > >
> > > > My question is whether this patch series in its current form can handle
> > > > the two cases above correctly. If so, I am wondering how it works given
> > > > that we only have a single "start" and "size" parameter per node.
> > > >
> > > > On the other hand, if this series cannot handle the two cases above, my
> > > > question is whether it would fail explicitly or not. The new check
> > > > is_node_memory_continuous doesn't seem to be able to catch them.
> > >
> > >
> > > Looking at numa_update_node_memblks, it is clear that the code is meant
> > > to increase the range of each numa node to cover even MMIO regions in
> > > between memory banks. Also see the comment at the top of the file:
> > >
> > >  * Assumes all memory regions belonging to a single proximity domain
> > >  * are in one chunk. Holes between them will be included in the node.
> > >
> > > So if there are multiple banks for each node, start and end are
> > > stretched to cover the holes between them, and it works as long as
> > > memory banks of different NUMA nodes don't interleave.
> > >
> > > I would appreciate if you could add an in-code comment to explain this
> > > on top of numa_update_node_memblk.
> >
> > Yes, I will do it.
> 
> Thank you
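As a first pass, the in-code comment on top of numa_update_node_memblks
could read roughly as follows (just a draft, the wording still needs
polishing):

    /*
     * Assumes all memory regions belonging to a single proximity domain
     * are in one chunk. Holes between them will be included in the node.
     * In other words, each node's range is stretched to cover all of its
     * memory blocks and any gaps (e.g. MMIO regions) between them, so
     * this only works as long as the memory banks of different NUMA
     * nodes do not interleave.
     */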
> 
> 
> > > Have you had a chance to test this? If not, it would be fantastic if you
> > > could give it a quick test to make sure it works as intended: for
> > > instance by creating multiple memory banks for each NUMA node by
> > > splitting a real bank into two smaller banks with a hole in between in
> > > device tree, just for the sake of testing.
> >
> > Yes, I have created some fake NUMA nodes in FVP device tree to test it.
> > The intertwining of nodes' addresses can be detected.
> >
> > (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> > (XEN) SRAT: Node 1 0000000880000000-00000008c0000000
> > (XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1
> > (0000000880000000-00000008c0000000)
> 
> Great, thanks! And what if there are multiple non-contiguous memory banks
> per node, but *not* intertwined? Does that all work correctly as
> expected?

Yes, I am using a device tree setting like this:

    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x0 0x80000000>;
        numa-node-id = <0>;
    };

    memory@880000000 {
        device_type = "memory";
        reg = <0x8 0x80000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@890000000 {
        device_type = "memory";
        reg = <0x8 0x90000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8A0000000 {
        device_type = "memory";
        reg = <0x8 0xA0000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8B0000000 {
        device_type = "memory";
        reg = <0x8 0xB0000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8C0000000 {
        device_type = "memory";
        reg = <0x8 0xC0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8D0000000 {
        device_type = "memory";
        reg = <0x8 0xD0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8E0000000 {
        device_type = "memory";
        reg = <0x8 0xE0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8F0000000 {
        device_type = "memory";
        reg = <0x8 0xF0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

And in Xen we got the output:

(XEN) DT: NUMA node 0 processor parsed
(XEN) DT: NUMA node 0 processor parsed
(XEN) DT: NUMA node 1 processor parsed
(XEN) DT: NUMA node 1 processor parsed
(XEN) SRAT: Node 0 0000000080000000-00000000ff000000
(XEN) SRAT: Node 0 0000000880000000-0000000888000000
(XEN) SRAT: Node 0 0000000890000000-0000000898000000
(XEN) SRAT: Node 0 00000008a0000000-00000008a8000000
(XEN) SRAT: Node 0 00000008b0000000-00000008b8000000
(XEN) SRAT: Node 1 00000008c0000000-00000008c8000000
(XEN) SRAT: Node 1 00000008d0000000-00000008d8000000
(XEN) SRAT: Node 1 00000008e0000000-00000008e8000000
(XEN) SRAT: Node 1 00000008f0000000-00000008f8000000
(XEN) NUMA: parsing numa-distance-map
(XEN) NUMA: distance: NODE#0->NODE#0:10
(XEN) NUMA: distance: NODE#0->NODE#1:20
(XEN) NUMA: distance: NODE#1->NODE#1:10
(XEN) NUMA: Using 16 for the hash shift.
(XEN) Domain heap initialised
(XEN) Booting using Device Tree

Dom0 boots successfully, and xl info shows:
xl info
host                   : X-Dom0
release                : 5.12.0
version                : #20 SMP PREEMPT Wed Jul 28 13:41:28 CST 2021
machine                : aarch64
nr_cpus                : 4
max_cpu_id             : 3
nr_nodes               : 2
cores_per_socket       : 1
threads_per_core       : 1

Dumping NUMA info from the Xen debug console, we got:

(XEN) 'u' pressed -> dumping numa info (now = 13229372281010)
(XEN) NODE0 start->524288 size->8617984 free->388741
(XEN) NODE1 start->9175040 size->229376 free->106460
(XEN) CPU0...1 -> NODE0
(XEN) CPU2...3 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 262144):
(XEN)     Node 0: 262144
(XEN)     Node 1: 0
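
The start/size values in the dump appear to be in 4KiB pages: NODE0 runs
from 0x80000000 to 0x8b8000000 and NODE1 from 0x8c0000000 to 0x8f8000000,
i.e. each node's range is stretched over its banks and the holes between
them, as expected.

For reference, the intertwine detection that produced the earlier
"NODE 0 ... intertwine with NODE 1 ..." message boils down to the idea
below (a simplified sketch for illustration, not the literal patch code):
once each node's range has been stretched over all of its banks, no two
nodes' ranges may overlap.

    /* Simplified sketch of the check, not the exact Xen implementation. */
    #include <stdbool.h>
    #include <stdint.h>

    struct node_range {
        uint64_t start;
        uint64_t end;   /* exclusive */
    };

    /* Two half-open ranges overlap iff each starts before the other ends. */
    static bool ranges_intertwined(const struct node_range *a,
                                   const struct node_range *b)
    {
        return a->start < b->end && b->start < a->end;
    }

    /* Fail (as the real code does, with an error message) if any two
     * stretched node ranges overlap. */
    static bool nodes_are_separate(const struct node_range *nodes,
                                   unsigned int nr_nodes)
    {
        for ( unsigned int i = 0; i < nr_nodes; i++ )
            for ( unsigned int j = i + 1; j < nr_nodes; j++ )
                if ( ranges_intertwined(&nodes[i], &nodes[j]) )
                    return false;
        return true;
    }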


 

