Xen project Mailing List

Re: [Xen-devel] [Hackathon minutes] PV block improvements

To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Thu, 27 Jun 2013 17:20:19 +0200

Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>, Matt Wilson <msw@xxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>

Delivery-date: Thu, 27 Jun 2013 15:20:34 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 27/06/13 16:21, Ian Campbell wrote:
> On Thu, 2013-06-27 at 14:58 +0100, George Dunlap wrote:
>> On 26/06/13 12:37, Ian Campbell wrote:
>>> On Wed, 2013-06-26 at 10:37 +0100, George Dunlap wrote:
>>>> On Tue, Jun 25, 2013 at 7:04 PM, Stefano Stabellini
>>>> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
>>>>> On Tue, 25 Jun 2013, Ian Campbell wrote:
>>>>>> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote:
>>>>>>> On 21/06/13 20:07, Matt Wilson wrote:
>>>>>>>> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> While working on further block improvements I've found an issue with
>>>>>>>>> persistent grants in blkfront.
>>>>>>>>>
>>>>>>>>> Persistent grants basically allocate grants and then they are never
>>>>>>>>> released, so both blkfront and blkback keep using the same memory
>>>>>>>>> pages for all the transactions.
>>>>>>>>>
>>>>>>>>> This is not a problem in blkback, because we can dynamically choose
>>>>>>>>> how many grants we want to map. On the other hand, blkfront cannot
>>>>>>>>> remove the access to those grants at any point, because blkfront
>>>>>>>>> doesn't know if blkback has this grants mapped persistently or not.
>>>>>>>>>
>>>>>>>>> So if for example we start expanding the number of segments in
>>>>>>>>> indirect requests, to a value like 512 segments per requests,
>>>>>>>>> blkfront will probably try to persistently map 512*32+512 = 16896
>>>>>>>>> grants per device, that's much more grants that the current default,
>>>>>>>>> which is 32*256 = 8192 (if using grant tables v2). This can cause
>>>>>>>>> serious problems to other interfaces inside the DomU, since blkfront
>>>>>>>>> basically starts hoarding all possible grants, leaving other
>>>>>>>>> interfaces completely locked.
>>>>>>>> Yikes.
>>>>>>>>
>>>>>>>>> I've been thinking about different ways to solve this, but so far I
>>>>>>>>> haven't been able to found a nice solution:
>>>>>>>>>
>>>>>>>>> 1. Limit the number of persistent grants a blkfront instance can use,
>>>>>>>>> let's say that only the first X used grants will be persistently
>>>>>>>>> mapped by both blkfront and blkback, and if more grants are needed
>>>>>>>>> the previous map/unmap will be used.
>>>>>>>> I'm not thrilled with this option. It would likely introduce some
>>>>>>>> significant performance variability, wouldn't it?
>>>>>>> Probably, and also it will be hard to distribute the number of available
>>>>>>> grant across the different interfaces in a performance sensible way,
>>>>>>> specially given the fact that once a grant is assigned to a interface it
>>>>>>> cannot be returned back to the pool of grants.
>>>>>>>
>>>>>>> So if we had two interfaces with very different usage (one very busy and
>>>>>>> another one almost idle), and equally distribute the grants amongst
>>>>>>> them, one will have a lot of unused grants while the other will suffer
>>>>>>> from starvation.
>>>>>> I do think we need to implement some sort of reclaim scheme, which
>>>>>> probably does mean a specific request (per your #4). We simply can't
>>>>>> have a device which once upon a time had high throughput but is no
>>>>>> mostly ideal continue to tie up all those grants.
>>>>>>
>>>>>> If you make the reuse of grants use an MRU scheme and reclaim the
>>>>>> currently unused tail fairly infrequently and in large batches then the
>>>>>> perf overhead should be minimal, I think.
>>>>>>
>>>>>> I also don't think I would discount the idea of using ephemeral grants
>>>>>> to cover bursts so easily either, in fact it might fall out quite
>>>>>> naturally from an MRU scheme? In that scheme bursting up is pretty cheap
>>>>>> since grant map is relative inexpensive, and recovering from the burst
>>>>>> shouldn't be too expensive if you batch it. If it turns out to be not a
>>>>>> burst but a sustained level of I/O then the MRU scheme would mean you
>>>>>> wouldn't be recovering them.
>>>>>>
>>>>>> I also think there probably needs to be some tunable per device limit on
>>>>>> the maximum persistent grants, perhaps minimum and maximum pool sizes
>>>>>> ties in with an MRU scheme? If nothing else it gives the admin the
>>>>>> ability to prioritise devices.
>>>>> If we introduce a reclaim call we have to be careful not to fall back
>>>>> to a map/unmap scheme like we had before.
>>>>>
>>>>> The way I see it either these additional grants are useful or not.
>>>>> In the first case we could just limit the maximum amount of persistent
>>>>> grants and be done with it.
>>>>> If they are not useful (they have been allocated for one very large
>>>>> request and not used much after that), could we find a way to identify
>>>>> unusually large requests and avoid using persistent grants for those?
>>>> Isn't it possible that these grants are useful for some periods of
>>>> time, but not for others? You wouldn't say, "Caching the disk data in
>>>> main memory is either useful or not; if it is not useful (if it was
>>>> allocated for one very large request and not used much after that), we
>>>> should find a way to identify unusually large requests and avoid
>>>> caching it." If you're playing a movie, sure; but in most cases, the
>>>> cache was useful for a time, then stopped being useful. Treating the
>>>> persistent grants the same way makes sense to me.
>>> Right, this is what I was trying to suggest with the MRU scheme. If you
>>> are using lots of grants and you keep on reusing them then they remain
>>> persistent and don't get reclaimed. If you are not reusing them for a
>>> while then they get reclaimed. If you make "for a while" big enough then
>>> you should find you aren't unintentionally falling back to a map/unmap
>>> scheme.
>>
>> And I was trying to say that I agreed with you. :-)
>
> Excellent ;-)

I also agree that this is the best solution, I will start looking at
implementing it.

>> BTW, I presume "MRU" stands for "Most Recently Used", and means "Keep
>> the most recently used"; is there a practical difference between that
>> and "LRU" ("Discard the Least Recently Used")?
>
> I started off with LRU and then got my self confused and changed it
> everywhere. Yes I mean keep Most Recently Used == discard Least Recently
> Used.

This will help if the disk is only doing intermittent bursts of data,
but if the disk is under high I/O during a long time we might end up
under the same situation (all grants hoarded by a single disk). We
should make sure that there's always a buffer of unused grants so other
disks or nic interfaces can continue to work as expected.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
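
For illustration only, the reclaim idea discussed in this thread (keep the most recently used grants, release the least recently used ones in batches once they have been idle "for a while", and cap each device) could be sketched roughly as below. This is a toy model in Python, not blkfront or blkback code: the class `PersistentGrantPool`, its methods, and the parameters `max_persistent`, `idle_for` and `batch` are all hypothetical names chosen for the example, and time is an abstract, non-decreasing tick counter.

```python
from collections import OrderedDict


class PersistentGrantPool:
    """Toy model of one device's persistent grants, kept in LRU order.

    All names here are hypothetical; this is not the Xen implementation.
    """

    def __init__(self, max_persistent):
        self.max_persistent = max_persistent  # assumed per-device cap
        self._grants = OrderedDict()          # grant ref -> last-use tick

    def use(self, ref, now):
        """Record a use of `ref`; return True if it is held persistently."""
        if ref in self._grants:
            self._grants[ref] = now
            self._grants.move_to_end(ref)     # most recently used goes last
            return True
        if len(self._grants) < self.max_persistent:
            self._grants[ref] = now           # new entries are also "most recent"
            return True
        return False                          # over the cap: use plain map/unmap

    def reclaim(self, now, idle_for, batch):
        """Release up to `batch` grants that were idle for >= `idle_for` ticks."""
        released = []
        while self._grants and len(released) < batch:
            ref, last_used = next(iter(self._grants.items()))  # oldest entry
            if now - last_used < idle_for:
                break                         # everything after it is more recent
            del self._grants[ref]
            released.append(ref)
        return released

    def __len__(self):
        return len(self._grants)
```

In this sketch a device under sustained I/O keeps refreshing its grants and so never loses them to `reclaim`, which is the case raised at the end of the message. The cap `max_persistent` is the only thing that leaves grants free for other interfaces here, so it would presumably have to be chosen to leave some reserve out of the total.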
