[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range



On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> On Sun, 26 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > > Sent: 2021年9月25日 3:53
> > > To: Wei Chen <Wei.Chen@xxxxxxx>
> > > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; xen-
> > > devel@xxxxxxxxxxxxxxxxxxxx; julien@xxxxxxx; Bertrand Marquis
> > > <Bertrand.Marquis@xxxxxxx>; jbeulich@xxxxxxxx; andrew.cooper3@xxxxxxxxxx;
> > > roger.pau@xxxxxxxxxx; wl@xxxxxxx
> > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > > 
> > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > > > > Sent: 2021年9月24日 8:26
> > > > > To: Wei Chen <Wei.Chen@xxxxxxx>
> > > > > Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx; sstabellini@xxxxxxxxxx;
> > > julien@xxxxxxx;
> > > > > Bertrand Marquis <Bertrand.Marquis@xxxxxxx>; jbeulich@xxxxxxxx;
> > > > > andrew.cooper3@xxxxxxxxxx; roger.pau@xxxxxxxxxx; wl@xxxxxxx
> > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > > memory range
> > > > >
> > > > > CC'ing x86 maintainers
> > > > >
> > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > One NUMA node may contain several memory blocks. In current Xen
> > > > > > code, Xen will maintain a node memory range for each node to cover
> > > > > > all its memory blocks. But here comes the problem, in the gap of
> > > > > > one node's two memory blocks, if there are some memory blocks don't
> > > > > > belong to this node (remote memory blocks). This node's memory range
> > > > > > will be expanded to cover these remote memory blocks.
> > > > > >
> > > > > > One node's memory range contains othe nodes' memory, this is
> > > obviously
> > > > > > not very reasonable. This means current NUMA code only can support
> > > > > > node has continous memory blocks. However, on a physical machine,
> > > the
> > > > > > addresses of multiple nodes can be interleaved.
> > > > > >
> > > > > > So in this patch, we add code to detect discontinous memory blocks
> > > > > > for one node. NUMA initializtion will be failed and error messages
> > > > > > will be printed when Xen detect such hardware configuration.
> > > > >
> > > > > At least on ARM, it is not just memory that can be interleaved, but
> > > also
> > > > > MMIO regions. For instance:
> > > > >
> > > > > node0 bank0 0-0x1000000
> > > > > MMIO 0x1000000-0x1002000
> > > > > Hole 0x1002000-0x2000000
> > > > > node0 bank1 0x2000000-0x3000000
> > > > >
> > > > > So I am not familiar with the SRAT format, but I think on ARM the
> > > check
> > > > > would look different: we would just look for multiple memory ranges
> > > > > under a device_type = "memory" node of a NUMA node in device tree.
> > > > >
> > > > >
> > > >
> > > > Should I need to include/refine above message to commit log?
> > > 
> > > Let me ask you a question first.
> > > 
> > > With the NUMA implementation of this patch series, can we deal with
> > > cases where each node has multiple memory banks, not interleaved?
> > 
> > Yes.
> > 
> > > An an example:
> > > 
> > > node0: 0x0        - 0x10000000
> > > MMIO : 0x10000000 - 0x20000000
> > > node0: 0x20000000 - 0x30000000
> > > MMIO : 0x30000000 - 0x50000000
> > > node1: 0x50000000 - 0x60000000
> > > MMIO : 0x60000000 - 0x80000000
> > > node2: 0x80000000 - 0x90000000
> > > 
> > > 
> > > I assume we can deal with this case simply by setting node0 memory to
> > > 0x0-0x30000000 even if there is actually something else, a device, that
> > > doesn't belong to node0 in between the two node0 banks?
> > 
> > While this configuration is rare in SoC design, but it is not impossible. 
> 
> Definitely, I have seen it before.
> 
> 
> > > Is it only other nodes' memory interleaved that cause issues? In other
> > > words, only the following is a problematic scenario?
> > > 
> > > node0: 0x0        - 0x10000000
> > > MMIO : 0x10000000 - 0x20000000
> > > node1: 0x20000000 - 0x30000000
> > > MMIO : 0x30000000 - 0x50000000
> > > node0: 0x50000000 - 0x60000000
> > > 
> > > Because node1 is in between the two ranges of node0?
> > > 
> > 
> > But only device_type="memory" can be added to allocation.
> > For mmio there are two cases:
> > 1. mmio doesn't have NUMA id property.
> > 2. mmio has NUMA id property, just like some PCIe controllers.
> >    But we don’t need to handle these kinds of MMIO devices
> >    in memory block parsing. Because we don't need to allocate
> >    memory from these mmio ranges. And for accessing, we need
> >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> >    MMIO accessing APIs.
> 
> Yes, I am not too worried about devices with a NUMA id property because
> they are less common and this series doesn't handle them at all, right?
> I imagine they would be treated like any other device without NUMA
> awareness.
> 
> I am thinking about the case where the memory of each NUMA node is made
> of multiple banks. I understand that this patch adds an explicit check
> for cases where these banks are interleaving, however there are many
> other cases where NUMA memory nodes are *not* interleaving but they are
> still made of multiple discontinuous banks, like in the two example
> above.
> 
> My question is whether this patch series in its current form can handle
> the two cases above correctly. If so, I am wondering how it works given
> that we only have a single "start" and "size" parameter per node.
> 
> On the other hand if this series cannot handle the two cases above, my
> question is whether it would fail explicitly or not. The new
> check is_node_memory_continuous doesn't seem to be able to catch them.


Looking at numa_update_node_memblks, it is clear that the code is meant
to increase the range of each numa node to cover even MMIO regions in
between memory banks. Also see the comment at the top of the file:

 * Assumes all memory regions belonging to a single proximity domain
 * are in one chunk. Holes between them will be included in the node.

So if there are multiple banks for each node, start and end are
stretched to cover the holes between them, and it works as long as
memory banks of different NUMA nodes don't interleave.

I would appreciate if you could add an in-code comment to explain this
on top of numa_update_node_memblk.

Have you had a chance to test this? If not it would be fantastic if you
could give it a quick test to make sure it works as intended: for
instance by creating multiple memory banks for each NUMA node by
splitting an real bank into two smaller banks with a hole in between in
device tree, just for the sake of testing.

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.