
Re: [Xen-devel] blkback global resources



On Tue, 2012-03-27 at 08:27 +0100, Jan Beulich wrote:
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@xxxxxxxxxxxxxx> 
> >>> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >> 
> >>  K.
> >> 
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@xxxxxxxx> wrote:
> >> 
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> > 
> >> > if (blkif_reqs < BLK_RING_SIZE(order))
> >> > printk(KERN_WARNING "WARNING: "
> >> >       "I/O request space (%d reqs) < ring order %ld, "
> >> >       "consider increasing %s.reqs to >= %ld.",
> >> >       blkif_reqs, order, KBUILD_MODNAME,
> >> >       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> > 
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may be not insignificant when there are
> >> > many devices.
> >> > 
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the apparently
> >> > made observation (by Daniel Stodden, the author of said patch,
> >> > who I don't have any current email of to ask directly), but also
> >> > in the context of multi-page rings, the purpose of which is to
> >> > allow for larger amounts of in-flight I/O?
> >> > 
> >> > Thanks, Jan
> > 
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> > 
> > Yes, just allocating a pfn pool per backend instance means way too much
> > memory ballooned out. Otherwise this stuff would never have looked the
> > way it does now.
> 
> This of course could be accounted for by having an initially non-empty
> (large enough) balloon (not sure how easy it is these days to do this
> for pv-ops, but it has always been trivial with the legacy code). That
> wouldn't help a 32-bit kernel much (where generally the initial balloon
> is all in highmem, yet the vacated pages need to be in lowmem), but
> for 64-bit kernels it should be fine.
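
FWIW, a minimal sketch of what that could look like on the pv-ops side,
using alloc_xenballooned_pages()/free_xenballooned_pages() (signatures as
of roughly Linux 3.3; later kernels drop the highmem argument). The
blkif_{alloc,free}_page_pool() helper names are made up for the sketch:

#include <linux/mm.h>
#include <xen/balloon.h>

/* Take the pool pages out of the (pre-inflated) balloon rather than out
 * of dom0's free memory.  highmem = false asks for lowmem pages, which
 * matters for the 32-bit case mentioned above. */
static int blkif_alloc_page_pool(struct page **pages, int nr)
{
	return alloc_xenballooned_pages(nr, pages, false);
}

static void blkif_free_page_pool(struct page **pages, int nr)
{
	free_xenballooned_pages(nr, pages);
}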
> 
> > Regarding the right balance, note that on the other extreme end, if PFN
> > space were infinite, there's not much expected performance gain from
> > rendering virtual backends fully independent. Beyond controller queue
> > depth, these requests are all just going to pile up, waiting.
> 
> Is there a way to look through the queue stack to find out how many
> distinct ones there are that the backend is running on top of as well
> as - for a particular I/O path - the one with the smallest depth? Or can
> one assume that the top most one (generally loop's or blktap2's) won't
> advertise a queue deeper than what is going to be accepted
> downstream (probably not, I'd guess)?
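
(A rough sketch of reading the topmost queue's depth; top_queue_depth()
is a made-up helper.  For bio-based stacking drivers like loop or dm this
says nothing about the queues underneath, so it is not a safe lower
bound:)

#include <linux/blkdev.h>

/* Report the depth advertised by the top-level queue only; stacked
 * devices keep their own, possibly smaller, nr_requests further down. */
static unsigned long top_queue_depth(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	return q ? q->nr_requests : 0;
}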
> 
> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.
> 

The balance is tricky. What I've observed so far is that multi-page rings
don't necessarily improve performance by themselves (though they are still
nice to have for future use). There are other points of contention that
limit the throughput of a single VIF.
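
To put rough numbers on the pool-size warning quoted above (a sketch based
on the standard ring macros, not a measurement; BLK_RING_SIZE(order) is
presumably defined along these lines in the multi-page-ring patches):

#include <xen/interface/io/ring.h>
#include <xen/interface/io/blkif.h>

#define BLK_RING_SIZE(order) __CONST_RING_SIZE(blkif, PAGE_SIZE << (order))

/* With 4k pages this works out to roughly 32 slots at order 0, 64 at
 * order 1 and 128 at order 2, so a single order-2 ring, or a handful of
 * single-page rings, already covers the whole 64-entry global request
 * pool (xen_blkif_reqs). */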

> > XenServer has some support for decoupling in blktap.ko [1] which worked
> > relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> > sysfs object. Name was arbitrary. Size configurable, even at runtime. 
> > 
> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can be
> > relinked arbitrarily before entering Connected state.
> > 
> > Then let the userland toolstack set things up according to physical I/O
> > topology and properties probed. Basically every physical backend (say, a
> > volume group, or a HBA) would start out by allocating and dimensioning a
> > dedicated pool (named after the backend), and every backend instance
> > fired up gets bound to the pool it belongs to.
> 
> Having userland do all that seems like a fallback solution only to me - I
> would hope that sufficient information is available directly to the drivers.
> 

I'm tempted to make all the information available to the drivers, but
haven't reached a conclusion yet. Maybe we should also allow users to
experiment with various configurations for their specific needs?
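
For a concrete picture of the kind of interface described above, a rough
sketch in the spirit of the show/store_pool handlers in [1] (not the
actual blktap code; struct blkbk_pool and the attribute names are made
up): a named pool object exposed as a kobject, resizable through sysfs at
runtime, which backend instances would bind to before entering Connected.

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/sysfs.h>

/* One of these per physical backend (volume group, HBA, ...), created
 * and named by the toolstack. */
struct blkbk_pool {
	struct kobject kobj;	/* /sys/.../pools/<name>/ */
	mempool_t *pages;	/* preallocated frames for this pool */
	int nr_pages;		/* current pool size */
};

static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
			 char *buf)
{
	struct blkbk_pool *pool = container_of(kobj, struct blkbk_pool, kobj);

	return sprintf(buf, "%d\n", pool->nr_pages);
}

static ssize_t size_store(struct kobject *kobj, struct kobj_attribute *attr,
			  const char *buf, size_t count)
{
	struct blkbk_pool *pool = container_of(kobj, struct blkbk_pool, kobj);
	int nr, err;

	err = kstrtoint(buf, 0, &nr);
	if (err)
		return err;

	/* Grow/shrink at runtime; mempool_resize() still took a gfp
	 * argument in the 3.x kernels this thread is about. */
	err = mempool_resize(pool->pages, nr, GFP_KERNEL);
	if (err)
		return err;

	pool->nr_pages = nr;
	return count;
}

static struct kobj_attribute pool_size_attr =
	__ATTR(size, 0644, size_show, size_store);

Shell or python code could then grow a pool with something like
"echo 256 > .../pools/<name>/size" (path hypothetical), and binding a
backend would amount to writing the pool name before the switch to
Connected.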


Wei.

> Thanks in any case for responding so quickly,
> Jan
> 
> > There's a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> > 
> > HTH,
> > Daniel
> > 
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git 
> >  mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> >  Note that these are based on mempools, not the frame pools blkback
> >  would take.
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

