
[Xen-devel] [Xen Hackathon] IO NUMA minutes

Minutes from IO NUMA session, written by Konrad (with some cleanup by me)

Problem - PCI passthrough.

Admins can't see the NUMA node for a device in sysfs, and so can't
determine the best PCI passthrough option.

We do pass that information to the hypervisor: dom0 parses the DSDT
and each device's _PXM object, then reports it via
PHYSDEVOP_pci_device_add with the XEN_PCI_DEV_PXM flag set. But
neither dom0 nor Xen currently uses it.
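
As a rough sketch of that reporting path, dom0 packs the parsed _PXM
into the hypercall argument roughly like this (the struct below is a
simplified mirror of physdev_pci_device_add from Xen's public
physdev.h - fixed-size optarr, physfn fields omitted - and the actual
hypercall issue is left out):

```c
#include <stdint.h>
#include <string.h>

#define XEN_PCI_DEV_PXM 0x4U

/* Simplified mirror of Xen's struct physdev_pci_device_add. */
struct physdev_pci_device_add {
    uint16_t seg;
    uint8_t  bus;
    uint8_t  devfn;
    uint32_t flags;
    uint32_t optarr[1];  /* optarr[0] holds the PXM when the flag is set */
};

/* Fill the hypercall argument for a device whose _PXM dom0 parsed
 * from the DSDT.  Issuing the hypercall itself is omitted here. */
static void pci_device_add_with_pxm(struct physdev_pci_device_add *add,
                                    uint16_t seg, uint8_t bus,
                                    uint8_t devfn, uint32_t pxm)
{
    memset(add, 0, sizeof(*add));
    add->seg = seg;
    add->bus = bus;
    add->devfn = devfn;
    add->flags = XEN_PCI_DEV_PXM;
    add->optarr[0] = pxm;
}
```

The point of item 1 in the work items below is that Xen should keep
this PXM around and tie it to the PCI device, rather than ignore it.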

That information needs to be exported to the admin.

Andrew's 'hwloc' v2 patch reports CPUs and memory per NUMA node. It
has extensions for resources dom0 does not know about - it uses
hypercalls to get the CPUs that are outside dom0's view. It could
logically be extended to cover device locations. (Right now PCI
devices hang off an unknown NUMA node.)

'hwloc' provides a library which a toolstack can use. We could also
extend libxl's algorithm to create domains based on PCI locality
first, and then use that to figure out which NUMA node to use.

The hwloc algorithm might not do the right analysis for Xen.

Constraint satisfaction is what hwloc can do (L3 sharing and such).
This placement algorithm could live in 'hwloc' as opposed to libxl.
libxl deals with fixed configurations - the guest config already
specifies _which_ PCI device.

Automatically associating NUMA placement with the configured PCI
devices would not work in general. But if the guest config names a
specific PCI device, we could place the guest on the device's NUMA
node, or one close to it.
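
An illustrative placement rule along those lines (this is not
libxl's actual algorithm, just a sketch of the idea): prefer the node
the assigned device hangs off, and fall back to the node with the
most free memory if the device's node can't satisfy the allocation.

```c
#include <stddef.h>

/* Illustrative only: pick the NUMA node the assigned PCI device hangs
 * off if it has enough free memory, else the node with the most free
 * memory.  free_mem[] is indexed by node. */
static unsigned pick_node(unsigned dev_node,
                          const unsigned long *free_mem, size_t nr_nodes,
                          unsigned long need)
{
    unsigned best;
    size_t i;

    if (dev_node < nr_nodes && free_mem[dev_node] >= need)
        return dev_node;

    for (best = 0, i = 1; i < nr_nodes; i++)
        if (free_mem[i] > free_mem[best])
            best = (unsigned)i;
    return best;
}
```

A real implementation would also weigh node distance ("close to it")
rather than jumping straight to the biggest node.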

Outside libxl - for example OpenStack - something else picks which
host, and where within a host, to place a guest. The topology has to
be exposed to OpenStack, and something new written in the OpenStack
scheduler. What information is missing from libvirt to make this
work?

This information cannot be exposed to dom0 via sysfs unless it is
virtual information - pvNUMA, pvCPU, pvPCI, etc. A hypercall may be
needed to obtain it. But sysfs would be easiest for the tools - just
change the path (e.g. add 'xen' to it).
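
For context, native Linux already exposes a device's node at
/sys/bus/pci/devices/<seg:bus:dev.fn>/numa_node (-1 meaning unknown);
a Xen-specific variant would keep the same format under a different
path. A tool-side helper to read it might look like this - the path
argument is whatever tree the tools end up using:

```c
#include <stdio.h>

/* Read a numa_node-style sysfs attribute: a single decimal integer,
 * -1 meaning "unknown".  Returns -1 on any error as well. */
static int sysfs_read_numa_node(const char *path)
{
    FILE *f = fopen(path, "r");
    int node = -1;

    if (f) {
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
    }
    return node;
}
```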

There is a danger that Linux could change the sysfs layout.

If we have both an xl sysctl and sysfs, they will need to provide the
same information, which is error-prone.

There has to be a compelling reason to do it in sysfs - why not do it
in user-space if it can be done there?

Potential users of the virtual topology are powertop and hwloc. To
give powertop the physical MSRs we could solve this differently - pin
the vCPUs to pCPUs to get the MSR values, or use tasklets to do it.
The problem is how to signal the domain.

We need to understand the amount of work required to implement a
hypercall and modify the tools vs. the sysfs implementation.

Work items:
1) Fix the hypervisor to store the _PXM and associate it with PCI
   devices.
2) Expose this PCI information to userspace (hypercall).
3) Have 'xl pci-assignable-list' include the NUMA node information.
   The affinity information needs to be exposed.
4) xl could also look at the 'pci' config entry and, if no CPU or
   NUMA affinity is given, take the device's node into account.
5) hwloc should not compile against libxl, though it could against
   libxc. However, that pulls in a lot of libraries. Figure out what
   the maintainer wants - a stub, or whether dlopen'ing libxl is OK.
