
RE: [Xen-devel] NUMA guest config options (was: Re: [PATCH 00/11] PV NUMA Guests)



While I like the direction this is going, please try to extend
your model to cover the cases of ballooning and live migration.
For example, for "CONFINE", ballooning should probably be
disallowed, as pages surrendered on "this node" via ballooning
may only be recoverable later from a different node.  Similarly,
creating a CONFINE guest is defined to fail if no single node has
sufficient memory... will live migration to a different physical
machine similarly fail, even if an administrator explicitly
requests it?

In general, communicating NUMA topology to a guest is a "performance
thing" and ballooning and live-migration are "flexibility things";
and performance and flexibility mix like oil and water.
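
To make the CONFINE/ballooning interaction concrete, here is the kind of
toolstack-side consistency check that might be needed (a rough Python
sketch only; the "numa_strategy" option name is made up for illustration
and this is not real xend code):

    # Rough sketch: a CONFINE guest should probably be forbidden from
    # ballooning, since pages surrendered on the confined node may
    # later only be reclaimable from a different node.
    def validate_numa_config(cfg):
        strategy = cfg.get("numa_strategy", "AUTOMATIC")  # hypothetical option
        memory = cfg["memory"]
        maxmem = cfg.get("maxmem", memory)
        if strategy == "CONFINE" and maxmem != memory:
            raise ValueError("CONFINE guests must not balloon: "
                             "set maxmem == memory")
        return cfg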

> -----Original Message-----
> From: Andre Przywara [mailto:andre.przywara@xxxxxxx]
> Sent: Friday, April 23, 2010 6:46 AM
> To: Cui, Dexuan; Dulloor; xen-devel
> Cc: Nakajima, Jun
> Subject: [Xen-devel] NUMA guest config options (was: Re: [PATCH 00/11]
> PV NUMA Guests)
> 
> Hi,
> 
> yesterday Dulloor, Jun and I had a discussion about the NUMA guest
> configuration scheme; we came to the following conclusions:
> 1. The configuration would be the same for HVM and PV guests; only the
> internal method of propagation would differ.
> 2. We want to make it as easy as possible, with best performance out of
> the box as the design goal. Another goal is predictable performance.
> 3. We (at least for now) omit more sophisticated tuning options (exact
> user-driven description of the guest's topology), so the guest's
> resources are split equally across the guest nodes.
> 4. We have three basic strategies:
>   - CONFINE: let the guest use only one node. If that does not work,
> fail.
>   - SPLIT: allocate resources from multiple nodes and inject a NUMA
> topology into the guest (for PV guests, queried via hypercall). If the
> guest is paravirtualized and does not know about NUMA (missing ELF
> hint): fail.
>   - STRIPE: allocate the memory in an interleaved way from multiple
> nodes and don't tell the guest about NUMA at all.
> 
> If any one of the above strategies is explicitly specified in the
> config file and cannot be met, guest creation will fail.
> A fourth option would be the default: AUTOMATIC. This tries the three
> strategies one after the other (order: CONFINE, SPLIT, STRIPE). If one
> fails, the next is tried (striping will never be used for HVM guests).
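> 
> To make the fallback order concrete, a rough sketch in Python (the
> strategy helpers and the guest object are made up for illustration,
> this is not meant as the actual implementation):
> 
>     # Placeholder strategy functions; the real ones would talk to the
>     # allocator. Each returns a placement or None on failure.
>     def try_confine(guest, host_nodes): ...
>     def try_split(guest, host_nodes): ...
>     def try_stripe(guest, host_nodes): ...
> 
>     # Try CONFINE, then SPLIT, then STRIPE; the first strategy that
>     # yields a placement wins. Per the above, striping is never used
>     # for HVM guests.
>     def place_guest_automatic(guest, host_nodes):
>         for strategy in (try_confine, try_split, try_stripe):
>             if strategy is try_stripe and guest.is_hvm:
>                 continue
>             placement = strategy(guest, host_nodes)
>             if placement is not None:
>                 return placement
>         raise RuntimeError("no NUMA placement strategy succeeded")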
> 
> 5. The number of guest nodes is internally specified via a min/max
> pair. By default, min is 1 and max is the number of system nodes. The
> algorithm will try to use the smallest possible number of nodes.
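> 
> For illustration, the node-count selection could look roughly like this
> (the free-memory check is simplified; real code would also have to
> consider VCPUs and actual allocator behaviour):
> 
>     # Return the smallest n in [min_nodes, max_nodes] for which the
>     # guest's memory fits when split evenly across n host nodes.
>     def pick_node_count(guest_mem, free_mem_per_node, min_nodes, max_nodes):
>         for n in range(min_nodes, max_nodes + 1):
>             per_node = guest_mem / n
>             nodes_with_room = [m for m in free_mem_per_node if m >= per_node]
>             if len(nodes_with_room) >= n:
>                 return n
>         return None  # no placement possible within the given bounds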
> 
> The question remaining is whether we want to expose this pair to the
> user:
>   - For predictable performance we want to specify an exact number of
> guest nodes, so set min=max=<number of nodes>
>   - For best performance, the number of nodes should be as small as
> possible, so min is always 1. For the explicit CONFINE strategy, max
> would also be one; for AUTOMATIC it should be as few as possible, which
> is already built into the algorithm.
> So it is not clear whether "max nodes" is a useful option. If it only
> served as an upper bound, it is questionable whether failing when that
> bound cannot be met is a useful result.
> 
> So maybe we can get along with just one (optional) value: guestnodes.
> This will be useful in the SPLIT case, where it specifies the number of
> nodes the guest sees (for predictable performance). CONFINE internally
> overrides this value with "1". To impose a limit on the number of
> nodes, one would choose AUTOMATIC and set guestnodes to that number. If
> single-node allocation fails, the algorithm will use as few nodes as
> possible, not exceeding the specified number.
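> 
> As a concrete (hypothetical) example of what a guest config could then
> look like, with "numa" and "guestnodes" as working titles for the
> option names, nothing set in stone yet:
> 
>     # Explicit SPLIT across exactly two guest nodes; creation fails if
>     # this cannot be met.
>     numa = "SPLIT"
>     guestnodes = 2
> 
>     # Alternatively: AUTOMATIC placement, never using more than two nodes.
>     # numa = "AUTOMATIC"
>     # guestnodes = 2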
> 
> Please comment on this.
> 
> Thanks and regards,
> Andre.
> 
> --
> Andre Przywara
> AMD-Operating System Research Center (OSRC), Dresden, Germany
> Tel: +49 351 448-3567-12

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

