
Re: [Xen-devel] [PATCH 00/11] PV NUMA Guests


  • To: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
  • From: Dulloor <dulloor@xxxxxxxxx>
  • Date: Fri, 9 Apr 2010 00:16:51 -0400
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
  • Delivery-date: Thu, 08 Apr 2010 21:17:26 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On Tue, Apr 6, 2010 at 1:18 PM, Dan Magenheimer
<dan.magenheimer@xxxxxxxxxx> wrote:
> In general, I am of the opinion that in a virtualized world,
> one gets best flexibility or best performance, but not both.
> There may be a couple of reasonable points on this "slider
> selector", but I'm not sure in general if it will be worth
> a huge time investment as real users will not understand the
> subtleties of their workloads well enough to choose from
> a large number of (perhaps more than two) points on the
> performance/flexibility spectrum.
>
> So customers that want highest performance should be prepared
> to pin their guests and not use ballooning.  And those that
> want the flexibility of migration and ballooning etc should
> expect to see a performance hit (including NUMA consequences).
In principle, I agree with you. For the same reason, I have tried to
keep the configurable parameters to a minimum in this first version.
With regard to ballooning, we could work out simple solutions that
work. Migration would be problematic, though.
>
> But since I don't get to make that decision, let's look
> at the combination of NUMA + dynamic memory utilization...
>
>> Please refer to my previously submitted patch for this
>> (http://old.nabble.com/Xen-devel--XEN-PATCH---Linux-PVOPS--ballooning-
>> on-numa-domains-td26262334.html).
>> I intend to send out a refreshed patch once the basic guest numa is
>> checked in.
>
> OK, will wait and take a look at that later.
>
>> We first try to CONFINE a domain and only then proceed to STRIPE or
>> SPLIT(if capable) the domain. So, in this (automatic) global domain
>> memory allocation scheme, there is no possibility of starvation from
>> memory pov. Hope I got your question right.
>
> The example I'm concerned with is:
> 1) Domain A is CONFINE'd to node A and domain B/C/D/etc are not
>   CONFINE'd
> 2) Domain A uses less than the total memory on node A and/or
>   balloons down so it uses even less than when launched.
> 3) Domains B/C/D have an increasing memory need, and semi-randomly
>   absorb memory from all nodes, including node A.
>
> After (3), free memory is somewhat randomly distributed across
> all nodes.  Then:
>
> 4) Domain A suddenly has an increasing memory need... but there's
>   not enough free memory remaining on node A (in fact possibly
>   there is none at all) to serve its need.   But by definition
>   of CONFINE, domain A is not allowed to use memory other than
>   on node A.
>
> What happens now?  It appears to me that other domains have
> (perhaps even maliciously) starved domain A.
>
> I think this is a dynamic bin-packing problem which is unsolvable
> in general form.  So the choice of heuristics is going to be
> important.
>
In the proposed solution, a domain is either CONFINED, SPLIT
(NUMA-aware), or STRIPED. In each case, the domain is aware of how
much memory was allocated from each node at start-up, and the
enlightened ballooning attempts to keep the state similar to that at
start-up. Under extreme memory pressure, though, we might have to
allocate from any node. For that (hopefully less likely) case, we can
implement dynamic mechanisms that converge back to the original
state, by sweeping through the memory and exchanging memory
reservations whenever possible. I already have the means of doing
this as part of the ballooning changes.
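To make "keeping the state similar to that at start-up" concrete, here
is a minimal sketch (illustrative only, not from the actual patch; the
function name and page-count interface are hypothetical) of how a
node-aware balloon driver could split a new overall memory target
across nodes in proportion to the start-up allocation:

```python
def per_node_balloon_targets(startup_alloc, new_total):
    """Split new_total pages across nodes in proportion to the
    domain's start-up per-node allocation, so ballooning up or down
    preserves the original distribution of guest memory over nodes.

    startup_alloc: dict mapping node id -> pages allocated at start-up
    new_total:     overall page target after the balloon operation
    """
    total = sum(startup_alloc.values())
    targets = {n: (pages * new_total) // total
               for n, pages in startup_alloc.items()}
    # Integer division can leave a small shortfall; give the remainder
    # to the node that held the most memory at start-up.
    shortfall = new_total - sum(targets.values())
    if shortfall:
        biggest = max(startup_alloc, key=startup_alloc.get)
        targets[biggest] += shortfall
    return targets
```

A balloon-down from 768 to 384 pages on a domain split 512/256 over
two nodes would then target 256/128, keeping the 2:1 ratio intact.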

For CONFINED/SPLIT domains, I am using the Best-Fit-Decreasing
heuristic, whereas for STRIPED domains I am using a First-Fit-Increasing
strategy (as a means to reduce fragmentation of free node memory).
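For illustration, here is one way the two strategies above could look
(a sketch under my own interpretation of the heuristic names in this
mail, not code from the patch series; the actual implementation in the
toolstack may differ):

```python
def confine_best_fit(free_pages, need):
    """CONFINE: choose the single node whose free memory fits the
    domain most tightly (best fit), so larger free regions are kept
    intact for later domains. Returns a node id, or None if no single
    node can hold the whole domain."""
    candidates = [n for n, f in free_pages.items() if f >= need]
    if not candidates:
        return None
    return min(candidates, key=lambda n: free_pages[n])

def stripe_first_fit_increasing(free_pages, need):
    """STRIPE: walk the nodes from least to most free memory, draining
    the smaller free pools first; fragmentation is then concentrated
    on nodes that could not have hosted a confined domain anyway.
    Returns a {node: pages} placement, or None on failure."""
    placement = {}
    remaining = need
    for node in sorted(free_pages, key=lambda n: free_pages[n]):
        if remaining == 0:
            break
        take = min(free_pages[node], remaining)
        if take:
            placement[node] = take
            remaining -= take
    return placement if remaining == 0 else None
```

With free pools of 100/40/60 pages on nodes 0/1/2, a 50-page domain
confines onto node 2 (tightest fit), while a 120-page striped domain
drains node 1 first, then node 2, then takes the rest from node 0.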

>> For the tmem, I was thinking of the ability to specify a set of nodes
>> from which the tmem-space memory is preferred which could be derived
>> from the domain's numa enlightenment, but as you mentioned the
>> full-page copy overhead is less noticeable (at least on my smaller
>> NUMA machine). But, the rate would determine if we should do this to
>> reduce inter-node traffic. What do you suggest ?  I was looking at the
>> data structures too.
>
> Since tmem allocates individual xmalloc-tlsf memory pools per domain,
> it should be possible to inform tmem of node preferences, but I don't
> know that it will be feasible to truly CONFINE a domain's tmem.
> On the other hand, because of the page copying, affinity by itself
> may be sufficient.
>
Yeah, I guess affinity should suffice for CONFINED domains, but I was
thinking of node preferences for the NUMA (SPLIT) guests.

>> > Also, I will be looking into adding some page-sharing
>> > techniques into tmem in the near future.  This (and the
>> > existing page sharing feature just added to 4.0) may
>> > create some other interesting challenges for NUMA-awareness.
>> I have just started reading up on the memsharing feature of Xen. I
>> would be glad to get your input on NUMA challenges over there.
>
> Note that the tmem patch that does sharing (tmem calls it "page
> deduplication") was just accepted into xen-unstable.  Basically
> some memory may belong to more than one domain, so NUMA affects
> and performance/memory tradeoffs may get very complicated.
>
Thanks for sharing. I will read this very soon.
> Dan
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

