[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
Hey. I tried to figure out some of the barrier stuff. I think it's becoming clearer to me now. But take everything below with a grain of salt. Sorry for having written a book about it; it's just because I'm not sure if I'm always reasoning correctly. There's a summary at the bottom. Maybe skip ahead and just apply your patch, possibly with a more optimistic update to the comment, and with a third option in support of QUEUE_ORDERED_DRAIN.

In between, it seems to make sense to differentiate between blkback and blktap1/2 to understand what our disk model traditionally was/is, because until now frontends don't differentiate much, except maybe for that feature-barrier flag. Below I gather why I think we might want to do something about it.

Blktap1
--------------------------------------------------------------------------

A blktap1 backend ring is quite simple to explain. It thinks it's a cacheless disk with no ordering guarantees. In kernel queueing terms, that's QUEUE_ORDERED_DRAIN.

 - We pass requests directly on to tapdisk.
 - Tapdisk drivers will scatter and reorder at will, especially VHD.
 - The RSP message is driven by iocb completion(s).
 - It fully relies on AIO/O_DIRECT to make sure the data actually made it
   down to the platter by that time. I'm not aware of anything else left to
   a userspace app.

Blktap2
--------------------------------------------------------------------------

The blktap2 datapath was completely derived from blktap1. The difference is that it's now hooked into a bdev, which is quite important because image consistency is now in the hands of the block layer. It presently doesn't declare a barrier mode, which in my understanding evaluates to QUEUE_ORDERED_NONE.

My understanding of QUEUE_ORDERED_NONE semantics is that the block code and filesystems will happily fill the queue and assume in-order completion. This is the mode of choice with either a queue depth of 1 or no reordering.
Or a non-flushable cache, at which point you know you're already screwed, so you don't need to worry.

Blktap2 passes a frontend ring's worth of requests to userland, with the same semantics as blktap1: queue depth of 32, no serialization control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to the block layer. So I guess we'd better fix that.

Blkback
--------------------------------------------------------------------------

I had some trouble with this one. Blkback and -tap1 being somewhat the same kinda guy from the frontend perspective, I've always assumed the semantics should be (are) the same. Now I think that's quite wrong.

Blkback is sitting on a kernel device queue, and that's a slightly different beast, because the blkdev interface works SCSI-style, which means it completely revolves around SCSI barriers. On SCSI you issue a barrier into a request stream, and ideally that maps directly to controller hardware; you're done. For ATA with NCQ that apparently doesn't apply, but you still issue a barrier op. It will cause a cache flush (if given) to complete foregoing requests, then a forced unit access (if given) to complete the barrier'd write itself. Or a post-flush, again. The point is you don't do that on every request separately, or performance somewhat suffers. :)

Let's assume blkback on a bdev of queue depth >1 with barriers enabled, and that we wanted to guarantee responses implying physical completion. We'd have to do at least something like the following: queue any batch we can get from the frontend, set a barrier bit on the last one, then ack every req only after the last one completed. Because in between, bio completion doesn't mean much. That might even be an option; I haven't tried. One could probably also play tricks based on the _FLUSH bit. Anyway, I don't see the blkback code presently doing that.

From the blkfront perspective, that would still be a weird-looking QUEUE_ORDERED_DRAIN.
Because everything queued appears to complete at once. It didn't, of course, so you still have to drain where you want to be safe.

If we don't do that, then the frontend has to submit the barrier position. I find this has some interesting implications. One is that the guest will effectively be utilizing a writeback cache, safely. The reason I find this so exciting is that this cache can potentially be much larger than a ring, and it'd still be safely used by the guest. The other reason is that I'm often made to wonder if blktap should get one. To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.

Blkif
--------------------------------------------------------------------------

I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE, given the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can enlighten me. The latter implies a flush, if one is necessary (otherwise it wouldn't be a barrier). I could imagine drain/flush combinations executed to achieve barriers, or a flush as a slightly relaxed alternative, e.g. for queues which preserve order.

Otoh, Xen normally tries to be about interface idealization where appropriate. There is no point in feature-barrier==0 && feature-cache-flush==1 or vice versa, because a barrier would still have been realizable per drain/flush, and at least in the blkback case there'd be no additional cost in the backend if it just does that on behalf of the frontend. And it saves precious ring space.

Last, I'm not clear about how the meaning of feature-barrier got formulated in the header. To the frontend, it is not a "feature" which "may succeed". Rather, if set, it is a strong recommendation for guests who aim to eventually remount their filesystems r/w after a reboot.

Summary
--------------------------------------------------------------------------

I'm assuming most guest OS filesystem/block layer glue follows an ordering interface based on SCSI? Does anyone know if there are exceptions to that? I'd be really curious.
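(An aside on why the drain mode is the expensive fallback, which is also the point of the patch quoted below: with a drain barrier the submitter must empty the queue around every barrier, while a tagged barrier just rides along in the stream. A toy userspace model of that cost, not block-layer code; the "rounds" notion and the function names are mine:)

```c
#include <assert.h>

/*
 * Toy model: count how many submission "rounds" a request stream costs,
 * where a round ends whenever the submitter must wait for the queue to
 * empty.  reqs[i] != 0 marks a barrier write.
 */
static int rounds_drain(const int *reqs, int n)
{
	int rounds = 1;
	int i;

	for (i = 0; i < n; i++)
		if (reqs[i])	/* QUEUE_ORDERED_DRAIN: drain before and
				 * after the barrier request itself */
			rounds += 2;
	return rounds;
}

static int rounds_tag(const int *reqs, int n)
{
	(void)reqs;
	(void)n;
	return 1;	/* QUEUE_ORDERED_TAG: the barrier is just a tagged
			 * request; ordering is the device's problem */
}
```

So a stream of five writes with one barrier in the middle costs three rounds in drain mode and one in tag mode, and that gap grows with every additional barrier.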
[*] Assuming the former is correct, I think we absolutely want interface idealization. This leaves exactly 2.5 reasonable modes which frontends should prepare for:

 - feature-barrier == 0 -> QUEUE_ORDERED_NONE
   Disk is either a) non-caching/non-reordering or b) broken.

 - feature-barrier == 1 -> QUEUE_ORDERED_TAG
   Disk is reordering, quite possibly caching, but you won't need to know.
   You seriously want to issue ordered writes with BLKIF_OP_WRITE_BARRIER.

 [ - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?)
     This is either blktap1, or an outdated blkback (?). I think one would
     want to assume blktap1 (or parse otherend-id). With blkback there's
     probably not much you can do anyway. With blktap1 there's a chance
     you're doing the right thing. ]

I'd suggest ignoring/phasing out the caching bit, because it's no use wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.

I'd suggest fixing the backends where people see fit. In blktap2, which appears urgent to me, this is a one-liner for now, setting QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want to implement barriers in tapdisk. Which is not going to happen in July.

In good old blktap1, it would take a queue drain. I guess that would best be done in userspace, by forwarding BLKIF_OP_BARRIER. At which point we might just shift the blktap2 tapdev mode to QUEUE_ORDERED_TAG as well.

[*] One could imagine different hardware models for a ring-based Xen backend. As mentioned above, NCQ could be reasonable too. Essentially, one might just go and map queue ordering bits into a xenstore key and let the guest I/O stack figure it out. That doesn't map to bio.h, but it would map to blkdev.h. I'm also the wrong guy to ask if there'd be a benefit :). Definitely not for Linux guests, where it's only overhead in order to get the same full barrier op on the ring. But maybe other OSes?
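The 2.5 modes above boil down to a small frontend decision table. A toy userspace rendering, purely illustrative (the enum constants and `pick_ordering` are stand-ins of my own, not the real Linux or xenbus API):

```c
#include <assert.h>

/* Illustrative stand-ins for Linux's QUEUE_ORDERED_* constants. */
enum ordered_mode { ORD_NONE, ORD_DRAIN, ORD_TAG };

/*
 * key_present: whether "feature-barrier" exists in xenstore at all;
 * value: its contents if present.  Hypothetical helper, not blkfront code.
 */
static enum ordered_mode pick_ordering(int key_present, int value)
{
	if (!key_present)
		return ORD_DRAIN;	/* blktap1 or outdated blkback: drain to be safe */
	return value ? ORD_TAG		/* backend honors BLKIF_OP_WRITE_BARRIER */
		     : ORD_NONE;	/* non-caching/non-reordering (or broken) */
}
```

The bracketed 0.5 mode is the `!key_present` branch: absent any statement from the backend, draining is the only conservative guess.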
On Thu, 2010-07-22 at 17:20 -0400, Jeremy Fitzhardinge wrote:
> When barriers are supported, then use QUEUE_ORDERED_TAG to tell the block
> subsystem that it doesn't need to do anything else with the barriers.
> Previously we used ORDERED_DRAIN which caused the block subsystem to
> drain all pending IO before submitting the barrier, which would be
> very expensive.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 92091a7..b61a021 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -418,10 +418,20 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>  static int xlvbd_barrier(struct blkfront_info *info)
>  {
>  	int err;
> +	unsigned ordered = QUEUE_ORDERED_NONE;
>
> -	err = blk_queue_ordered(info->rq,
> -			info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE,
> -			NULL);
> +	/*
> +	 * If we don't have barrier support, then there's really no
> +	 * way to guarantee write ordering, so we really just have to
> +	 * send writes to the backend and hope for the best.  If
> +	 * barriers are supported, we don't really know what the
> +	 * flushing semantics of the barrier are, so again, tag the
> +	 * requests in order and hope for the best.
> +	 */
> +	if (info->feature_barrier)
> +		ordered = QUEUE_ORDERED_TAG;
> +
> +	err = blk_queue_ordered(info->rq, ordered, NULL);
>
>  	if (err)
>  		return err;
> @@ -508,8 +518,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
>  	info->rq = gd->queue;
>  	info->gd = gd;
>
> -	if (info->feature_barrier)
> -		xlvbd_barrier(info);
> +	xlvbd_barrier(info);
>
>  	if (vdisk_info & VDISK_READONLY)
>  		set_disk_ro(gd, 1);

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel