
[Xen-devel] [PATCH v2 0/7] vNUMA introduction.



vNUMA introduction

This series of patches introduces vNUMA topology awareness and
provides interfaces and data structures to enable vNUMA for
PV guests. There is a plan to extend this support for dom0 and
HVM domains.

vNUMA topology must also be supported by the PV guest kernel, so the
corresponding Linux patches should be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running
workloads on NUMA machines.
The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on
NUMA/UMA machines and to map the vNUMA topology to physical NUMA nodes in an
optimal way.

Xen vNUMA support

The current set of patches introduces a subop hypercall that is available to
enlightened PV guests with the vNUMA patches applied.

The domain structure was modified to hold the per-domain vNUMA topology for use
in other vNUMA-aware subsystems (e.g. ballooning).

libxc

libxc provides interfaces to build PV guests with vNUMA support and, on NUMA
machines, performs the initial memory allocation on physical NUMA nodes. This is
implemented by utilizing the nodemap formed by automatic NUMA placement. Details
are in patch #3.

libxl

libxl provides a way to predefine the vNUMA topology in the VM config: number of
vnodes, memory arrangement, vcpu-to-vnode assignment, and distance map.
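
To make the shape of these options concrete, here is a minimal Python sketch of
the consistency checks such a topology description implies. The function name
check_vnuma and the rules themselves are assumptions for illustration; this is
not the libxl parser from this series, and the rules do not necessarily match
the reasons the toolstack rejects an option. The distance map is assumed to be a
flattened vnodes x vnodes matrix.

```python
def check_vnuma(memory, vcpus, vnodes, vnumamem, vcpumap, vdistance):
    """Return a list of problems found in a vNUMA topology description.

    All rules here are illustrative assumptions, not the libxl logic.
    """
    problems = []
    if len(vnumamem) != vnodes:
        problems.append("vnumamem needs one entry per vnode")
    elif sum(vnumamem) != memory:
        problems.append("vnumamem does not add up to memory")
    if len(vcpumap) != vcpus:
        problems.append("vcpu map needs one entry per vcpu")
    elif any(not 0 <= v < vnodes for v in vcpumap):
        problems.append("vcpu map refers to a vnode that does not exist")
    if len(vdistance) != vnodes * vnodes:
        problems.append("distance map is not a vnodes x vnodes matrix")
    return problems


# Values taken from the second example in this mail.
ok = check_vnuma(4096, 4, 2, [2048, 2048], [1, 0, 1, 0], [10, 40, 40, 10])
# Values taken from the first example in this mail.
bad = check_vnuma(16384, 4, 4, [10, 10], [0, 0, 1, 1], [10, 30, 10, 30])
print(ok)
print(bad)
```

With these assumed rules the second example passes cleanly, while the first one
is flagged for its memory list and its distance map.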

PV guest

As of now, only PV guests can take advantage of vNUMA functionality. The vNUMA
Linux patches should be applied and NUMA support should be compiled into the
kernel.

This patchset can be pulled from https://git.gitorious.org/xenvnuma/xenvnuma.git
The Linux patchset is at https://git.gitorious.org/xenvnuma/linuxvnuma.git

Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

1. Automatic vNUMA placement on a real NUMA machine:

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN)     Node 0: 1416166
(XEN)     Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN)     Node 0: 2097152
(XEN)     Node 1: 2097152
(XEN)     Domain has 4 vnodes
(XEN)         vnode 0 - pnode 0  (4096) MB
(XEN)         vnode 1 - pnode 0  (4096) MB
(XEN)         vnode 2 - pnode 1  (4096) MB
(XEN)         vnode 3 - pnode 1  (4096) MB
(XEN)     Domain vcpu to vnode:
(XEN)     0 1 2 3

dmesg on pv guest:

[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xffffffff]
[    0.000000]   node   1: [mem 0x100000000-0x1ffffffff]
[    0.000000]   node   2: [mem 0x200000000-0x2ffffffff]
[    0.000000]   node   3: [mem 0x300000000-0x3ffffffff]
[    0.000000] On node 0 totalpages: 1048479
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 14280 pages used for memmap
[    0.000000]   DMA32 zone: 1044480 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 3 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] No local APIC present
[    0.000000] APIC: disable apic facility
[    0.000000] APIC: switched to apic NOOP
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: cannot find a gap in the 32bit address range
[    0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[    0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.4-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[    0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3


pv guest: numactl --hardware:

root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

Comments:
None of the configuration options are correct, so default values were used.
Since the machine is a NUMA machine and no vcpu pinning is defined, the
automatic NUMA node selection mechanism is used, and you can see how the vnodes
were split across physical nodes.
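
The values in the output above are consistent with a simple fallback: memory
split evenly across vnodes, vcpus assigned to vnodes round-robin, and a distance
of 10 to the local node and 20 to any other node. The Python sketch below only
reproduces those observed numbers. The function name default_vnuma and the rules
are assumptions; the actual defaulting code lives in the patches and may differ.

```python
def default_vnuma(memory_mb, vcpus, vnodes, local=10, remote=20):
    """Assumed defaults: even memory split, round-robin vcpus, 10/20 distances."""
    mem = [memory_mb // vnodes] * vnodes
    mem[-1] += memory_mb % vnodes  # any remainder goes to the last vnode
    vcpu_to_vnode = [cpu % vnodes for cpu in range(vcpus)]
    distance = [[local if i == j else remote for j in range(vnodes)]
                for i in range(vnodes)]
    return mem, vcpu_to_vnode, distance


# memory = 16384, vcpus = 4, vnodes = 4, as in the config above.
mem, vcpu_to_vnode, distance = default_vnuma(16384, 4, 4)
print(mem)            # [4096, 4096, 4096, 4096]
print(vcpu_to_vnode)  # [0, 1, 2, 3]
for row in distance:
    print(row)
```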

2. vNUMA enabled guest, no default values, real NUMA machine

Config:

memory = 4096
vcpus = 4
name = "rc9"
vnodes = 2
vnumamem = [2048, 2048]
vdistance = [10, 40, 40, 10]
vnuma_vcpumap = [1, 0, 1, 0]
vnuma_vnodemap = [1, 0]


Xen:

(XEN) 'u' pressed -> dumping numa info (now-0xA86:BD6C8829)
(XEN) idx0 -> NODE0 start->0 size->4521984 free->131471
(XEN) phys_to_nid(0000000000001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->4521984 size->4194304 free->341610
(XEN) phys_to_nid(0000000450001000) -> 1 should be 1
(XEN) CPU0 -> NODE0
(XEN) CPU1 -> NODE0
(XEN) CPU2 -> NODE0
(XEN) CPU3 -> NODE0
(XEN) CPU4 -> NODE1
(XEN) CPU5 -> NODE1
(XEN) CPU6 -> NODE1
(XEN) CPU7 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN)     Node 0: 1416166
(XEN)     Node 1: 1153345
(XEN) Domain 6 (total: 1048576):
(XEN)     Node 0: 524288
(XEN)     Node 1: 524288
(XEN)     Domain has 2 vnodes
(XEN)         vnode 0 - pnode 1  (2048) MB
(XEN)         vnode 1 - pnode 0  (2048) MB
(XEN)     Domain vcpu to vnode:
(XEN)     1 0 1 0

pv guest dmesg:

[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff]
[    0.000000]   NODE_DATA [mem 0x7ffd9000-0x7fffffff]
[    0.000000] Initmem setup node 1 [mem 0x80000000-0xffffffff]
[    0.000000]   NODE_DATA [mem 0xff7f8000-0xff81efff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
[    0.000000]   node   1: [mem 0x80000000-0xffffffff]
[    0.000000] On node 0 totalpages: 524191
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 7112 pages used for memmap
[    0.000000]   DMA32 zone: 520192 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 524288
[    0.000000]   DMA32 zone: 7168 pages used for memmap
[    0.000000]   DMA32 zone: 524288 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] No local APIC present
[    0.000000] APIC: disable apic facility
[    0.000000] APIC: switched to apic NOOP
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: cannot find a gap in the 32bit address range
[    0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[    0.000000] e820: [mem 0x100100000-0x1004fffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.4-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:2
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007fc00000 s85376 r8192 d21120 u1048576
[    0.000000] pcpu-alloc: s85376 r8192 d21120 u1048576 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 2 [1] 1 3


pv guest:

root@heatpipe:~# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 1 3
node 0 size: 2011 MB
node 0 free: 1975 MB
node 1 cpus: 0 2
node 1 size: 2003 MB
node 1 free: 1983 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

In this case every config option is correct and the guest sees exactly the vNUMA
topology given in the VM config file.
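
As a cross-check, the numactl output can be derived from the config by inverting
the vcpu-to-vnode map and reshaping the flat distance list into a matrix. A
small Python sketch follows. The helper names are made up for illustration, and
treating vdistance as a row-major vnodes x vnodes matrix is an assumption that
happens to match this example.

```python
def cpus_per_vnode(vcpumap, vnodes):
    """Invert a vcpu -> vnode list into a vnode -> [vcpus] list."""
    nodes = [[] for _ in range(vnodes)]
    for vcpu, vnode in enumerate(vcpumap):
        nodes[vnode].append(vcpu)
    return nodes


def distance_matrix(flat, vnodes):
    """Reshape a row-major flat list into a vnodes x vnodes matrix."""
    return [flat[i * vnodes:(i + 1) * vnodes] for i in range(vnodes)]


PAGE_KB = 4  # 4 KiB pages on x86

print(cpus_per_vnode([1, 0, 1, 0], 2))       # [[1, 3], [0, 2]]
print(distance_matrix([10, 40, 40, 10], 2))  # [[10, 40], [40, 10]]
print(524288 * PAGE_KB // 1024)              # 2048 (MB on each physical node)
```

The first result matches "node 0 cpus: 1 3" and "node 1 cpus: 0 2", and the page
count Xen reports per physical node converts to the 2048 MB given in vnumamem.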

Notes:
*   to enable vNUMA in the Linux kernel the corresponding patch set should be
    applied;
*   the automatic NUMA balancing feature seems to be fixed in the Linux kernel:
    https://lkml.org/lkml/2013/7/31/647


TODO:
*   This version limits the vdistance config option to only two values:
    same-node distance and other-node distance. This prevents oopses on the
    latest (3.13-rc1) Linux kernel with non-symmetric distances;
*   CPU siblings for the Linux machine and the Xen cpu trap should be detected
    and a warning should be given; add a cpuid check if set in the VM config;
*   benchmarking;
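
The two-value restriction in the first TODO item can be stated as a check on the
distance matrix: every same-node entry equals one value and every other-node
entry equals another. A Python sketch of that check (the function name and the
flat row-major layout are assumptions, not code from the series):

```python
def is_two_value_distance(flat, vnodes):
    """True if all same-node distances match and all other-node distances match."""
    if len(flat) != vnodes * vnodes:
        return False
    same = {flat[i * vnodes + i] for i in range(vnodes)}
    other = {flat[i * vnodes + j]
             for i in range(vnodes) for j in range(vnodes) if i != j}
    return len(same) == 1 and len(other) <= 1


print(is_two_value_distance([10, 40, 40, 10], 2))  # True
print(is_two_value_distance([10, 30, 10, 30], 2))  # False
```

Any matrix that passes this check is symmetric by construction, which is the
property the restriction is meant to guarantee.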

Elena Ufimtseva (7):
  xen: vNUMA support for guests.
  libxc: Plumb Xen with vNUMA topology for domain.
  libxc: vnodes allocation on NUMA nodes.
  libxl: vNUMA supporting interface.
  libxl: vNUMA configuration parser
  xen: adds vNUMA info debug-key u
  xl: docs for xl config vnuma options

 docs/man/xl.cfg.pod.5        |   55 +++++++++
 tools/libxc/xc_dom.h         |   10 ++
 tools/libxc/xc_dom_x86.c     |   85 ++++++++++++--
 tools/libxc/xc_domain.c      |   61 ++++++++++
 tools/libxc/xenctrl.h        |    9 ++
 tools/libxc/xg_private.h     |    1 +
 tools/libxl/libxl.c          |   20 ++++
 tools/libxl/libxl.h          |   20 ++++
 tools/libxl/libxl_arch.h     |    8 ++
 tools/libxl/libxl_dom.c      |  189 ++++++++++++++++++++++++++++-
 tools/libxl/libxl_internal.h |    3 +
 tools/libxl/libxl_types.idl  |    5 +-
 tools/libxl/libxl_vnuma.h    |    7 ++
 tools/libxl/libxl_x86.c      |   58 +++++++++
 tools/libxl/xl_cmdimpl.c     |  268 +++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/numa.c          |   19 +++
 xen/common/domain.c          |   10 ++
 xen/common/domctl.c          |   82 +++++++++++++
 xen/common/memory.c          |   36 ++++++
 xen/include/public/domctl.h  |   24 ++++
 xen/include/public/memory.h  |    8 ++
 xen/include/public/vnuma.h   |   44 +++++++
 xen/include/xen/domain.h     |   10 ++
 xen/include/xen/sched.h      |    1 +
 24 files changed, 1020 insertions(+), 13 deletions(-)
 create mode 100644 tools/libxl/libxl_vnuma.h
 create mode 100644 xen/include/public/vnuma.h

-- 
1.7.10.4

