[Xen-devel] [Xen Hackathon] IO NUMA minutes
Minutes from the IO NUMA session, written by Konrad (with some cleanup by me).

Problem: PCI passthrough. The admin can't see the sysfs value of the NUMA node for a device, and so can't determine the best PCI passthrough option. We do export that information to the hypervisor: dom0 parses the DSDT and the _PXM object, then reports it via PHYSDEVOP_pci_device_add using the XEN_PCI_DEV_PXM flag (a sketch of that flow is appended after the work items). But neither dom0 nor Xen actually uses it. That information needs to be exported to the admin.

Andrew's 'hwloc' v2 patch has CPU and memory reported for NUMA. It has extensions for resources that dom0 does not know about - it uses hypercalls to get the CPUs that are outside of dom0's view. It could be logically extended to cover device location. (Right now PCI devices hang off an unknown NUMA node.) 'hwloc' provides a library which a toolstack can use (a small example is appended below).

We could also extend libxl's algorithm: create domains based on PCI locality first, and then use that to figure out which NUMA node to use. The hwloc algorithm might not have the right analysis for Xen; constraint satisfaction (L3 sharing and such) is what hwloc can do, so this placement algorithm could live in 'hwloc' as opposed to libxl. libxl deals with fixed configurations - the guest config already specifies _which_ PCI device - so deriving the placement automatically from an association between NUMA and the configured PCI devices would not work in general. But if the guest config has a specific PCI device, we could place the guest on the same NUMA node, or one close to it.

Outside libxl - for example OpenStack - something else picks which host, and where within a host, to place the guest. You would have to expose the locality information to OpenStack and write something new in the OpenStack scheduler. What information is missing that would need to be exposed to libvirt to make this work?

This information cannot be exposed to dom0 via sysfs unless it is virtual information - pvNUMA, pvCPU, pvPCI, etc. We may need a hypercall to obtain it. sysfs would be easiest for the tools - just change the path (e.g. add 'xen' to it) - but there is a danger that Linux could change the sysfs layout, and if we have both an xl sysctl and sysfs then they will need to provide the same information, which is error-prone. There has to be a compelling reason to do sysfs - why not do it in user-space if it can be done there? (A minimal reader for the existing sysfs attribute is appended below, to show what the tools are missing today.)

Potential users of the virtual topology are powertop and hwloc. To read physical MSRs (for powertop) we can solve this differently - pin the vCPUs to pCPUs to get the MSR values, or use tasklets to do it. The problem is how it would signal the domain. We need to understand the amount of work needed to implement a hypercall and modify the tools vs. the sysfs implementation.

Work items:
1). Fix the hypervisor to store the _PXM value and associate it with PCI devices.
2). Expose this PCI information to userspace (hypercall).
3). Make 'xl pci-assignable-list' show the NUMA node information. The affinity information needs to be exposed.
4). xl could also look at the 'pci' configuration: if no cpu or NUMA affinity is given, it would take the device locality into account (see the placement sketch at the end).
5). hwloc should not compile against libxl, but it could compile against libxc. However, that pulls in a lot of libraries. Figure out what the maintainer wants - a stub, or whether dlopen'ing libxl is OK.
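For reference, a minimal sketch of the dom0-to-Xen reporting path mentioned in the problem statement, modelled on Xen's public physdev.h and on what the Linux xen PCI code does. The struct layout and constants follow the public interface as I know it; hypervisor_physdev_op() is a stand-in for the real hypercall wrapper, and the main() values are made up. Treat this as illustrative, not authoritative:

/* Illustrative only: how dom0 tells Xen about a PCI device's _PXM. */
#include <stdint.h>

#define PHYSDEVOP_pci_device_add 25    /* from xen/include/public/physdev.h */
#define XEN_PCI_DEV_PXM          0x4   /* optarr[0] carries the PXM value */

struct physdev_pci_device_add {
    uint16_t seg;                      /* PCI segment (domain) */
    uint8_t  bus;
    uint8_t  devfn;
    uint32_t flags;                    /* XEN_PCI_DEV_* */
    struct {
        uint8_t bus;
        uint8_t devfn;
    } physfn;                          /* parent PF, for virtual functions */
    uint32_t optarr[1];                /* [0] = PXM when XEN_PCI_DEV_PXM set */
};

/* Stand-in for the real hypercall wrapper (HYPERVISOR_physdev_op). */
static int hypervisor_physdev_op(int cmd, void *arg)
{
    (void)cmd; (void)arg;
    return 0;
}

/* Report one device; pxm == -1 means no _PXM was found in the DSDT. */
static int report_pci_device(uint16_t seg, uint8_t bus, uint8_t devfn, int pxm)
{
    struct physdev_pci_device_add add = {
        .seg = seg, .bus = bus, .devfn = devfn,
    };

    if (pxm >= 0) {
        add.optarr[0] = (uint32_t)pxm;
        add.flags |= XEN_PCI_DEV_PXM;
    }
    return hypervisor_physdev_op(PHYSDEVOP_pci_device_add, &add);
}

int main(void)
{
    /* 0000:03:00.0 on proximity domain 1 - example values only. */
    return report_pci_device(0x0000, 0x03, 0x00, 1);
}

Work item 1 is essentially about making Xen keep the PXM value this hypercall already delivers, instead of discarding it.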
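The sysfs attribute discussed above already exists on native Linux: /sys/bus/pci/devices/<BDF>/numa_node, which reads -1 when the kernel doesn't know the node - and -1 is what a Xen dom0 shows today. A minimal reader (the BDF in main() is just an example):

#include <stdio.h>

/* Returns the device's NUMA node, or -1 if unknown/unreadable. */
static int pci_numa_node(const char *bdf /* e.g. "0000:03:00.0" */)
{
    char path[128];
    int node = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/numa_node", bdf);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
    }
    return node;
}

int main(void)
{
    printf("node: %d\n", pci_numa_node("0000:03:00.0"));
    return 0;
}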
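On the hwloc side, a small example against the hwloc 2.x public API of how a toolstack could look up device locality: with I/O discovery enabled, each PCI device's locality is carried by its nearest non-I/O ancestor in the topology tree. Under Xen the cpusets printed here would only become meaningful once work items 1 and 2 expose the data; this is a sketch, assuming a stock hwloc built with PCI support:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t dev = NULL;

    hwloc_topology_init(&topo);
    /* Keep PCI/I/O objects in the topology (filtered out by default). */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load(topo);

    /* Walk all PCI devices and print the cpuset of the nearest
     * non-I/O ancestor, which expresses the device's locality. */
    while ((dev = hwloc_get_next_pcidev(topo, dev)) != NULL) {
        hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, dev);
        char buf[256] = "unknown";

        if (anc && anc->cpuset)
            hwloc_bitmap_snprintf(buf, sizeof(buf), anc->cpuset);
        printf("%04x:%02x:%02x.%x -> cpuset %s\n",
               dev->attr->pcidev.domain, dev->attr->pcidev.bus,
               dev->attr->pcidev.dev, dev->attr->pcidev.func, buf);
    }

    hwloc_topology_destroy(topo);
    return 0;
}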
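Finally, one possible shape for work item 4 - entirely hypothetical, since none of it exists yet: once the assigned device's node is known, xl/libxl could default the guest's soft affinity to that node's CPUs. In the sketch below only libxl_cpu_bitmap_alloc(), libxl_node_to_cpumap() and libxl_set_vcpuaffinity() are existing libxl API; prefer_device_node() and the way the node value is obtained are invented for illustration:

/* Hypothetical: bias a freshly created domain towards the NUMA node
 * of its assigned PCI device. */
#include <libxl.h>
#include <libxl_utils.h>

static int prefer_device_node(libxl_ctx *ctx, uint32_t domid,
                              unsigned int nr_vcpus,
                              int node /* device's node, from its _PXM */)
{
    libxl_bitmap cpumap;
    unsigned int v;
    int rc;

    libxl_bitmap_init(&cpumap);
    rc = libxl_cpu_bitmap_alloc(ctx, &cpumap, 0); /* 0 = host max CPUs */
    if (rc) goto out;

    /* All CPUs belonging to the device's NUMA node. */
    rc = libxl_node_to_cpumap(ctx, node, &cpumap);
    if (rc) goto out;

    /* Soft affinity only: a scheduling preference, not a hard pin,
     * so it only kicks in when the config gave no affinity itself. */
    for (v = 0; v < nr_vcpus; v++) {
        rc = libxl_set_vcpuaffinity(ctx, domid, v, NULL, &cpumap);
        if (rc) break;
    }
out:
    libxl_bitmap_dispose(&cpumap);
    return rc;
}

Using soft rather than hard affinity keeps the behaviour safe as a default: the scheduler prefers the device's node but can still run the guest elsewhere under load.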