
Re: [Xen-devel] _PXM, NUMA, and all that goodnesss



On 02/12/2014 07:50 PM, Konrad Rzeszutek Wilk wrote:
Hey,

I have been looking at figuring out how we can "easily" do PCIe assignment
of devices that are on different sockets. The problem is that
on machines with many sockets (four or more) we might inadvertently assign
a PCIe device from one socket to a guest bound to a different NUMA
node. That means more QPI (inter-socket) traffic, higher latency, etc.

From a Linux kernel perspective we do seem to 'pipe' said information
from the ACPI DSDT (drivers/xen/pci.c):

    unsigned long long pxm;

    /* Evaluate the device's ACPI _PXM (proximity domain) and pass it
     * to Xen as part of the physdev "device add" operation. */
    status = acpi_evaluate_integer(handle, "_PXM",
                                   NULL, &pxm);
    if (ACPI_SUCCESS(status)) {
        add.optarr[0] = pxm;
        add.flags |= XEN_PCI_DEV_PXM;
    }

Which is neat, except that Xen ignores that flag altogether. I Googled
a bit but still did not find anything relevant - though I thought there
were some presentations from past Xen Summits referring to it
(I can't find them now :-()
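
(For illustration only, here is a rough sketch of what consuming the flag
on the Xen side might look like. None of this exists today: it assumes the
PHYSDEVOP_pci_device_add handler has already copied the guest's struct
physdev_pci_device_add, that a 'node' field were added to Xen's struct
pci_dev, and that Xen's SRAT code offers a pxm_to_node()-style helper and
a NUMA_NO_NODE constant.)

    /* Hypothetical sketch - Xen currently discards XEN_PCI_DEV_PXM. */
    static void pci_set_node(struct pci_dev *pdev,
                             const struct physdev_pci_device_add *add)
    {
        unsigned int node = NUMA_NO_NODE;

        if ( add->flags & XEN_PCI_DEV_PXM )
            /* optarr[0] carries the ACPI _PXM value dom0 evaluated. */
            node = pxm_to_node(add->optarr[0]);

        pdev->node = node;   /* assumed new field, for placement checks */
    }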

Anyhow, what I am wondering is whether there are some prototypes from
the past that utilize this. And if we were to use this, how
can we expose it to 'libxl' or any other tools to say:

"Hey! You might want to use this other PCI device assigned
to pciback, which is on the same node." Some form of
'numa-pci' affinity.

A warning that the PCI device is not in the NUMA affinity of the guest might be nice.

Interestingly enough, one can also read this from sysfs:
/sys/bus/pci/devices/<BDF>/{numa_node,local_cpus,local_cpulist}.

Except that we don't expose the NUMA topology to the initial
domain, so 'numa_node' is always -1. And 'local_cpus' depends
on seeing _all_ of the CPUs - and of course it assumes that
vCPU == pCPU.
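
(To illustrate the symptom, a minimal userspace sketch - the BDF below is
just a placeholder - that reads the sysfs attribute; on today's dom0 it
prints -1 because the kernel has no physical node to report:)

    #include <stdio.h>

    int main(void)
    {
        /* Placeholder BDF - substitute the device you care about. */
        const char *path = "/sys/bus/pci/devices/0000:03:00.0/numa_node";
        FILE *f = fopen(path, "r");
        int node = -1;

        if (f) {
            if (fscanf(f, "%d", &node) != 1)
                node = -1;
            fclose(f);
        }

        /* -1 means "no node information" - what dom0 reports today. */
        printf("%s -> %d\n", path, node);
        return 0;
    }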

Anyhow, if this were "tweaked" such that the initial domain
saw the hardware NUMA topology and parsed it (via
Elena's patches), we could potentially have at least the
'numa_node' information present and figure out whether a guest
is using a PCIe device from the right socket.

I don't think we want to go down the path of pretending that dom0 is the hypervisor. This is the same reason I objected to Boris' approach to perf integration last year. I can understand the idea of wanting to use the same tools in the same way; but the fact is dom0 is a guest, and its virtual hardware (including #cpus, topology, &c) isn't (and shouldn't be required to be) in any way related to the host.

On the other hand... just tossing this out there, but how hard would it be for dom0 to report information about the *physical* topology for certain things in sysfs, rather than the *virtual* topology? I.e., no matter what dom0's virtual topology is, report the physical numa_node, local_cpus, &c in sysfs?

I suppose this might cause problems if the scheduler then tried to run a process / tasklet on the node to which the device was attached, only to find out that no such (virtual) node existed.
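
Just to make that concrete (purely a sketch, not a proposed patch): the
Xen PCI notifier in dom0 already evaluates _PXM in the snippet quoted
above, so in principle that same if-block could also stamp the physical
node onto the struct device, which is what sysfs's numa_node reads back.
The catch is the assumption in the comment: dom0 needs a usable
pxm-to-node mapping, which today means parsing the physical SRAT -
exactly the gap being discussed.

    if (ACPI_SUCCESS(status)) {
        add.optarr[0] = pxm;
        add.flags |= XEN_PCI_DEV_PXM;

        /*
         * Sketch only: pxm_to_node() and set_dev_node() are existing
         * Linux helpers, but pxm_to_node() only returns something
         * useful if dom0 has parsed the (physical) SRAT.
         */
        set_dev_node(&pci_dev->dev, pxm_to_node(pxm));
    }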

If that would be a no-go, then I think we need to expose that information via libxl somehow so the toolstack can make reasonable decisions.
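
As a straw man for the libxl route (everything below is hypothetical -
neither libxl_device_pci_get_numa_node() nor the warning helper exists
today; they only illustrate the shape of the information the toolstack
would need):

    #include <stdio.h>
    #include <libxl.h>
    #include <libxl_utils.h>   /* libxl_bitmap_test() */

    /* Hypothetical getter: host NUMA node of a PCI device, or -1 if the
     * hypervisor never learned the device's _PXM. */
    int libxl_device_pci_get_numa_node(libxl_ctx *ctx,
                                       const libxl_device_pci *pcidev,
                                       int *node_r);

    /* Possible use in xl's pci-attach path (also just a sketch): warn if
     * the device's node is not in the guest's NUMA affinity. */
    static void warn_if_remote(libxl_ctx *ctx,
                               const libxl_device_pci *pcidev,
                               const libxl_bitmap *guest_nodemap)
    {
        int node;

        if (!libxl_device_pci_get_numa_node(ctx, pcidev, &node) &&
            node >= 0 && !libxl_bitmap_test(guest_nodemap, node))
            fprintf(stderr, "Warning: device is on node %d, outside the "
                    "guest's NUMA affinity\n", node);
    }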


So what I am wondering is:
  1) Were there any plans for the XEN_PCI_DEV_PXM in the
     hypervisor? Were there some prototypes for exporting the
     PCI device BDF and NUMA information out?

  2) Would it be better to just look at making the initial domain
     able to figure out the NUMA topology and assign the
     correct 'numa_node' in the PCI fields?

  3) If either option is used, would taking that information into
     advisement when launching a guest with either 'cpus' or 'numa-affinity'
     or 'pci', and informing the user of a better choice, be good?
     Or would it be better if there was some diagnostic tool to at
     least tell the user whether their PCI device assignment made
     sense or not? Or perhaps program the 'numa-affinity' based on
     the PCIe socket location?

I think in general, we should:
* Do something reasonable when no NUMA topology has been specified
* Do what the user asks (but help them make good decisions) when they do specify topology.

A couple of things that might mean:
* Having the NUMA placement algorithm take into account the location of assigned PCI devices is probably a good idea.
* Having a warning when a device is outside of a VM's soft cpu affinity or NUMA affinity would also be good. (I think we do something similar when the soft cpu affinity doesn't intersect the NUMA affinity.)
* Exposing the NUMA affinity of a device when doing xl pci-assignable-list might be a good idea as well, just to give people a hint that they should maybe be thinking about this. Maybe have xl pci-assignable-add print what node a device is on as well? (Maybe only on NUMA boxes?) A sketch of what that might look like follows below.
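
(Again purely hypothetical: a sketch of annotating the assignable-device
listing with the host node, reusing the imaginary getter from the earlier
sketch. It assumes an initialized libxl_ctx *ctx, and the field names are
assumed from libxl's libxl_device_pci structure; error handling and
cleanup are omitted.)

    libxl_device_pci *list;
    int num, i, node;

    list = libxl_device_pci_assignable_list(ctx, &num);
    for (i = 0; i < num; i++) {
        printf("%04x:%02x:%02x.%01x", list[i].domain, list[i].bus,
               list[i].dev, list[i].func);
        if (!libxl_device_pci_get_numa_node(ctx, &list[i], &node) &&
            node >= 0)
            printf("  (node %d)", node);   /* maybe only on NUMA boxes */
        printf("\n");
    }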

Just as an aside, can I take it that a lot of your customers have / are expected to have such NUMA boxes? The accepted wisdom (at least in some circles) seems to be that NUMA isn't particularly important for cloud, because cloud providers will generally use a larger number of smaller boxes and use a cloud orchestration layer to tie them all together.

 -George
