
Re: [Xen-devel] [PATCH 00/11] PV NUMA Guests



On Wed, Apr 14, 2010 at 1:18 AM, Cui, Dexuan <dexuan.cui@xxxxxxxxx> wrote:
> Dulloor wrote:
>> On Wed, Apr 7, 2010 at 3:57 AM, Cui, Dexuan <dexuan.cui@xxxxxxxxx>
>> wrote:
>>> Keir Fraser wrote:
>>>> I would like Acks from the people working on HVM NUMA for this patch
>>>> series. At the very least it would be nice to have a single user
>>>> interface for setting this up, regardless of whether for a PV or HVM
>>>> guest. Hopefully code in the toolstack also can be shared. So I'm
>>> Yes, I strongly agree we should share one interface, e.g., the
>>> XENMEM_numa_op hypercalls implemented by Dulloor could be re-used
>>> in the hvm numa case and some parts of the toolstack could be
>>> shared, I think. I also replied in another thread and supplied some
>>> similarities I found in Andre/Dulloor's patches.
>>>
>> IMO PV NUMA guests and HVM NUMA guests could share most of the code
>> from toolstack - for instance, getting the current state of machine,
>> deciding on a strategy for domain memory allocation, selection of
>> nodes, etc. They diverge only at the actual point of domain
>> construction. PV NUMA uses enlightenments, whereas HVM would need
>> working with hvmloader to export SLIT/SRAT ACPI tables. So, I agree
>> that we need to converge.
> Hi Dulloor,
> In your patches, the toolstack tries to figure out the "best fit nodes" for
> a PV guest and invokes a hypercall set_domain_numa_layout to tell the
> hypervisor to remember the info, and later the PV guest invokes a hypercall
> get_domain_numa_layout to retrieve the info from the hypervisor.
> Can this be changed to: the toolstack writes the guest numa info directly
> into a new field in the start_info (or the shared_info), maybe in the
> standard format of the SRAT/SLIT, and later the PV guest reads the info and
> uses acpi_numa_init() to parse it? I think in this way the new hypercalls
> can be avoided and the pv numa enlightenment code in the guest kernel can
> be minimized.
> I'm asking this because this is the way Andre's HVM numa patches do it (the
> toolstack passes the info to hvmloader and the latter builds SRAT/SLIT for
> the guest).
Hi Cui,

In my first version of patches (for making dom0 a numa guest), I had
put this information into start_info
(http://lists.xensource.com/archives/html/xen-devel/2010-02/msg00630.html).
But, after that, I came to think this new approach is better (for pv numa
and maybe even hvm numa) for the following reasons:

- For PV NUMA guests, there are more places where the enlightenment
might be useful. For instance, in the attached (refreshed) patch, I
have used the enlightenment to support ballooning (without changing
node mappings) for PV NUMA guests. Similarly, there are other places
within the hypervisor, as well as in the VM, where I plan to use the
domain_numa_layout. That's the main reason for choosing this approach.
Although I am not sure, I think this could be useful for HVM too
(maybe with PV on HVM).

- Using the hypercall interface is equally simple. Also, with
start_info, I wasn't sure it would be clean to add feature-specific
variables (useful only with PV NUMA guests) to start_info (or even
shared_info), changing the xen-vm interface, adding (unnecessary)
compat changes, etc.

Please let me know your thoughts.
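
Just to make the hypercall side concrete, the guest-kernel retrieval
boils down to a single memory_op call. This is a rough sketch only:
the struct layout and the cmd/subop names below are placeholders, not
the exact definitions from the patch series:

    /* Sketch only: names and layout are illustrative. */
    struct xen_numa_layout {
        uint32_t nr_vnodes;           /* number of virtual nodes      */
        uint32_t nr_vcpus;
        /* per-vnode memory sizes and vcpu-to-vnode maps would follow */
    };

    struct xen_numa_op {
        uint32_t cmd;                 /* e.g. XENMEM_get_numa_layout  */
        union {
            XEN_GUEST_HANDLE(void) layout;
        } u;
    };

    static int xen_get_numa_layout(struct xen_numa_layout *layout)
    {
        struct xen_numa_op op = { .cmd = XENMEM_get_numa_layout };
        set_xen_guest_handle(op.u.layout, layout);
        return HYPERVISOR_memory_op(XENMEM_numa_op, &op);
    }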


>
> xc_select_best_fit_nodes() decides the "min-set" of host nodes that will be
> used for the guest. It only considers the current memory usage of the
> system. Maybe we should also consider the cpu load? And must the number of
> nodes be 2^n? And how to handle the case where #vcpu < #vnode?
> And it looks like your patches only consider the guest's memory
> requirement -- the guest's vcpu requirement is neglected? E.g., a guest may
> not need a very large amount of memory while it needs many vcpus.
> xc_select_best_fit_nodes() should consider this when determining the
> number of vnodes.

I agree with you. I was planning to consider vcpu load as the next
step. Also, I am looking for a good heuristic. I looked at the
nodeload heuristic (currently in xen), but found it too naive. But,
if you/Andre think it is a good heuristic, I will add the support.
Actually, I think in future we should do away with strict
vcpu-affinities and rely more on a scheduler with the necessary NUMA
support to complement our placement strategies.

As of now, we don't SPLIT if #vcpu < #vnode; we use STRIPE in that case.
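
To summarize the decision rule, it is essentially the following
(function and parameter names are illustrative, not lifted from the
patch):

    enum numa_strategy { CONFINE, SPLIT, STRIPE, DEFAULT };

    static enum numa_strategy pick_strategy(uint64_t guest_mem,
                                            unsigned int nr_vcpus,
                                            unsigned int nr_vnodes,
                                            uint64_t max_free_one_node,
                                            int guest_numa_capable)
    {
        if (guest_mem <= max_free_one_node)
            return CONFINE;     /* whole VM fits in a single node   */
        if (guest_numa_capable && nr_vcpus >= nr_vnodes)
            return SPLIT;       /* enlightened guest, enough vcpus  */
        return STRIPE;          /* interleave across a max-set      */
    }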

>
>>>> On 04/04/2010 20:30, "Dulloor" <dulloor@xxxxxxxxx> wrote:
>>>>
>>>>> The set of patches implements virtual NUMA-enlightenment to support
>>>>> NUMA-aware PV guests. In more detail, the patch implements the
>>>>> following :
>>>>>
>>>>> * For NUMA systems, the following memory allocation strategies
>>>>> are implemented :
>>>>> - CONFINE : Confine the VM memory allocation to a single node. As
>>>>> opposed to the current method of doing this in python, the patch
>>>>> implements this in libxc (along with other strategies) and with
>>>>> assurance that the memory actually comes from the selected node.
>>>>> - STRIPE : If the VM memory doesn't fit in a single node and if the
>>>>> VM is not compiled with guest-numa-support, the memory is allocated
>>>>> striped across a selected max-set of nodes.
>>>>> - SPLIT : If the VM memory doesn't fit in a single node and if the
>>>>> VM is compiled with guest-numa-support, the memory is allocated
>>>>> split (equally for now) from the min-set of nodes. The VM is then
>>>>> made aware of this NUMA allocation (virtual NUMA enlightenment).
>>>>> - DEFAULT : This is the existing allocation scheme.
>>>>>
>>>>> * If the numa-guest support is compiled into the PV guest, we add
>>>>> numa-guest-support to the xen features elfnote. The xen tools use
>>>>> this to determine if the SPLIT strategy can be applied.
>>>>>
>>> I think this looks too complex to allow a real user to easily
>>> determine which one to use...
>> I think you misunderstood this. For the first version, I have
>> implemented an automatic global domain memory allocation scheme,
>> which (when enabled) applies to all domains on a NUMA machine. I am
>> of the opinion that users are seldom in a state to determine which
>> strategy to use. They would want the best possible performance for
>> their VM at any point of time, and we can only guarantee the best
>> possible performance given the current state of the system (how the
>> free memory is scattered across nodes, distance between those nodes,
>> etc). In that regard, this solution is the simplest.
> Ok, I see.
> BTW: I think Xen can actually handle the CONFINE case pretty well
> currently, e.g., when no vcpu affinity is explicitly specified, the
> toolstack tries to choose a "best" host node for the guest and pins
> all vcpus of the guest to that host node.
But, currently it is done in python code, and it doesn't use the
exact_node interface. I added this to the libxc toolstack for the
sake of completeness (CONFINE is just a special case of SPLIT). Also,
with libxl catching up, we might anyway want to do these things in
libxc, where they are accessible to both xm and xl.
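
For reference, with the exact_node interface the libxc populate path
is roughly the following. This is a sketch: XENMEMF_exact_node is the
exact-node allocation flag the series relies on (the name may differ
in your tree), and error handling is elided:

    /* Populate nr_pages of domid's physmap strictly from host node
     * `node`; fail rather than fall back to another node. */
    static int populate_from_node(int xc_handle, uint32_t domid,
                                  xen_pfn_t *pfns,
                                  unsigned long nr_pages,
                                  unsigned int node)
    {
        return xc_domain_memory_populate_physmap(xc_handle, domid,
                                                 nr_pages,
                                                 0 /* extent order */,
                                                 XENMEMF_exact_node(node),
                                                 pfns);
    }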

>
>>> About the CONFINE strategy -- this doesn't look like a useful usage
>>> model to me -- do we really think it's a typical usage model to
>>> ensure a VM's memory can only be allocated on a specified node?
>> Not all VMs are large enough not to fit into a single node (note that
>> the user doesn't specify a node). And, if a VM can fit into a single
>> node, that is obviously the best possible option for the VM.
>>
>>> The definitions of STRIPE and SPLIT also don't sound like typical
>>> usage models to me.
>> There are only two possibilities. Either the VM fits in a single node
>> or it doesn't. The mentioned strategies (SPLIT, STRIPE) try to
>> optimize the solution when the VM doesn't fit in a single node. The
>> aim is to reduce the number of inter-node accesses (SPLIT) and/or
>> provide more predictable performance (STRIPE).
>>
>>> Why must tools know if the PV kernel is built with guest numa
>>> support or not?
>> What is the point of arranging the memory to be amenable to the
>> construction of nodes in the guest if the guest itself is not
>> compiled to do so?
> I meant: to simplify the implementation, the toolstack can always
> supply the numa config info to the guest *if necessary*, no matter
> whether the guest kernel is numa-enabled or not (even if the guest
> kernel isn't numa-enabled, the guest performance may be better if the
> toolstack decides to supply a numa config to the guest).
> About the "*if necessary*": Andre and I think the user should supply
> an option "guestnode" in the guest config file, and you think the
> toolstack should be able to automatically determine a "best" value. I
> raised some questions about xc_select_best_fit_nodes() in the above
> paragraph.
> Hi Andre, would you like to comment on this?
How about an "automatic"  global option along with a VM-level
"guestnode" option. These options could be work independently or with
each other ("guestnode" would take
preference over global "automatic" option). We can work out finer details.
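
For example, the two levels could look like this (option names are
purely illustrative, just to make the proposal concrete):

    # Global toolstack default, e.g. in xend-config.sxp:
    #   (numa-placement auto)

    # Per-VM override in the guest config file, analogous to the
    # "guestnodes" option in Andre's HVM series:
    memory     = 4096
    vcpus      = 4
    guestnodes = 2    # request a 2-vnode SPLIT, overriding the default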

>
>>
>>> If a user configures guest numa to "on" for a pv guest, the tools
>>> can supply the numa info to the PV kernel even if the pv kernel is
>>> not built with guest numa support -- the pv kernel will neglect the
>>> info safely;
>>> If a user configures guest numa to "off" for a pv guest and the
>>> tools don't supply the numa info to the PV kernel, and if the pv
>>> kernel is built with guest numa support, the pv kernel can easily
>>> detect this by your new hypercall and will not enable numa.
>> These error checks are done even now. But, by checking if the PV
>> kernel is built with guest numa support, we don't require the user to
>> configure yet another parameter. Wasn't that your concern too in the
>> very first point?
>>
>>>
>>> When a user finds the computing capability of a single node can't
>>> satisfy the actual need and hence wants to use guest numa,
>>> since the user has specified the amount of guest memory and the
>>> number of vcpus in guest config file, I think the user only needs
>>> to specify how many guest nodes (the "guestnodes" option in Andre's
>>> patch) the guest will see, and the tools and the hypervisor
>>> should co-work to distribute guest memory and vcpus uniformly among
>>> the guest nodes (I think we may not want to support non-uniform
>>> nodes as that doesn't look like a typical usage model) -- of
>>> course, maybe a specified node doesn't have the expected
>>> amount of memory -- in this case, the guest can continue to run with
>>> a slower speed (we can print a warning message to the
>>> user); or, if the user does care about predictable guest
>>> performance, the guest creation should fail.
>>
>> Please observe that the patch does all these things plus some more.
>> For one, "guestnodes" option doesn't make sense, since as you observe,
>> it needs the user to carefully read the state of the system when
>> starting the domain and also the user needs to make sure that the
>> guest itself is compiled with numa support. The aim should be to
> I think it's not difficult for a user to specify "guestnodes" and to
> check if a PV/HVM guest kernel is numa-enabled or not (anyway, a user
> needs to ensure that to achieve the optimal performance). "xm
> info/list/vcpu-list" should already supply enough info. I think it's
> reasonable to assume a numa user has more knowledge than a
> preliminary user. :-)
>
> I suppose Andre would argue more for the "guestnodes" option.
>
> A PV guest can use the ELFnote as a hint to the toolstack. This may
> be used as a kind of optimization. An HVM guest can't use this.
As mentioned above, I think we have a good case for both global and
VM-level options. What do you think?

>
>> automate this part and provide the best performance, given the current
>> state. The patch attempts to do that. Secondly, when the guests are
>> not compiled with numa support, they would still want a more
>> predictable (albeit average) performance. And, by striping the memory
>> across the nodes and by pinning the domain vcpus to the union of those
>> nodes' processors, applications (of substantial sizes) could be
>> expected to see more predictable performance.
>>>
>>> How do you like this? My thought is we can make things simple in the
>>> first step. :-)
>> Please let me know if my comments are not clear. I agree that we
>> should shoot for simplicity and also for a common interface. Hope we
>> will get there :)
> Thanks a lot for all the explanation and discussion.
> Yes, we need to agree on a common interface to avoid confusion.
> And I still think the "guestnodes/uniform_nodes" idea is more
> straightforward and the implementation is simpler. :-)
>
> Thanks,
>  -- Dexuan

thanks
dulloor

Attachment: numa-ballooning.patch
Description: Text Data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

