[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
On 07/28/2010 04:06 AM, Daniel Stodden wrote:

Hey. I tried to figure out some of the barrier stuff. I think it's becoming more clear to me now. But take everything below with a grain of salt. Sorry for having written a book about it. It's just because I'm not sure if I'm always reasoning correctly.

Thanks for the detailed response!

There's a summary at the bottom. Maybe skip ahead and just apply your patch, possibly with a more optimistic update to the comment, and with a third option in support of QUEUE_ORDERED_DRAIN. In between, it seems to make sense to differentiate between blkback and blktap1/2 to understand what our disk model traditionally was/is, because until now frontends don't differentiate much, except for maybe that feature-barrier flag. Below I go into why I think we might want to do something about it.

Blktap1
--------------------------------------------------------------------------

A blktap1 backend ring is quite simple to explain. It thinks it's a cacheless disk with no ordering guarantees. In kernel queueing terms that's a QUEUE_ORDERED_DRAIN.

Yes, if it has no cache then we don't need to worry about delayed writes, and draining should be enough.

- We pass requests directly on to tapdisk.
- Tapdisk drivers will scatter and reorder at will, especially VHD.
- The RSP message is driven by iocb completion(s).
- It fully relies on AIO/O_DIRECT to make sure the data actually made it down to the platter by that time.

I'm not aware of anything else left to a userspace app.

Blktap2
--------------------------------------------------------------------------

The blktap2 datapath was completely derived from blktap1. The difference is that it's now hooked into a bdev, which is quite important because image consistency is now in the hands of the block layer. It presently doesn't declare a barrier mode, which in my understanding evaluates to QUEUE_ORDERED_NONE.

My understanding of QUEUE_ORDERED_NONE semantics is that the blk code and filesystems will happily fill the queue and assume in-order completion. This is the mode of choice with either a queue depth of 1 or no reordering. Or a non-flushable cache, at which point you know you're already screwed, so you don't need to worry.

Blktap2 passes a frontend ring's worth of requests to userland, with the same semantics as blktap1: queue depth of 32, no serialization control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to the block layer.

So blktap2 doesn't pass barriers into userland? I guess userland doesn't really have a way to send barriers into the kernel except with f(data)sync or direct io. So I guess we better fix that.
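As an aside, here is a rough sketch of the "one-liner" blktap2 fix suggested further down: declaring drain ordering on its request queue. blk_queue_ordered() and QUEUE_ORDERED_DRAIN are the real pre-2.6.37 interfaces (they also appear in the patch at the bottom of this mail); the surrounding function is a made-up stand-in for wherever blktap2 configures its queue, and the snippet is untested:

#include <linux/blkdev.h>

/*
 * Hypothetical sketch: blktap2 currently declares no ordered mode,
 * i.e. QUEUE_ORDERED_NONE.  tapdisk reorders requests but keeps no
 * write cache, so a drain-only barrier mode is both necessary and
 * sufficient: the block layer empties the queue around barrier
 * requests and never needs to issue a flush.
 */
static int blktap2_setup_queue(struct request_queue *rq)
{
	return blk_queue_ordered(rq, QUEUE_ORDERED_DRAIN, NULL);
}

If tapdisk ever grows a write cache of its own, that would have to become a tag or drain-plus-flush mode instead.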
Blkback
--------------------------------------------------------------------------

I had some trouble with this one. Blkback and -tap1 being somewhat the same kind of guy from the frontend perspective, I've always assumed the semantics should be (are) the same. Now I think that's quite wrong.

Blkback is sitting on a kernel device queue, and that's a slightly different beast, because the blkdev interface works SCSI-style, which means it revolves completely around SCSI barriers. On SCSI you issue a barrier into a request stream and, ideally, that maps directly to controller hardware; you're done. For ATA with NCQ that apparently doesn't apply, but you still issue a barrier op. It will cause a cache flush (if given) to complete the foregoing requests, then a forced unit access (if given) to complete the barrier'd write itself. Or a post-flush, again.

The point is you don't do that on every request separately, or performance suffers somewhat. :)

Let's assume blkback on a bdev with a queue depth greater than 1 and barriers enabled, and that we wanted to guarantee responses implying physical completion. We'd have to do at least something like the following: queue any batch we can get from the frontend, set a barrier bit on the last one, then ack every req only after the last one completed, because in between, bio completion doesn't mean much. That might even be an option, I haven't tried. One could probably also play tricks based on the _FLUSH bit. Anyway, I don't see the blkback code presently doing that.

Couldn't blkback just send a barrier write down into the host's storage subsystem, and rely on it completing the bios in the right order?

From the blkfront perspective, that would still be a weird-looking QUEUE_ORDERED_DRAIN, because everything queued appears to complete at once. It didn't, of course, so you still have to drain where you want to be safe. If we don't do that, then the frontend has to submit the barrier position.

Not sure I follow this.

I find this has some interesting implications. One is that the guest will effectively be utilizing a writeback cache, safely. The reason I find this so exciting is that the cache can potentially be much larger than a ring, and it'd still be safely used by the guest. The other reason is that I keep wondering whether blktap should get one. To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.

Blkif
--------------------------------------------------------------------------

I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE, given the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can enlighten me. The latter implies a flush, if one is necessary (otherwise it wouldn't be a barrier).

Really? It's perfectly reasonable to define a barrier which prevents reordering but doesn't imply any synchronous write. If you have a filesystem which just wants to make sure the superblock is written after metadata updates, but doesn't really mean to push things out *now*, a non-flushing barrier makes complete sense. I found it odd that all the Linux documentation about barriers seems to imply a flush. Or is that just the conventional meaning of "barrier" in storage-world?

I could imagine drain/flush combinations executed to achieve barriers, or a flush as a slightly relaxed alternative, e.g. for queues which preserve order.

Right.

Otoh, Xen normally tries to be about interface idealization where appropriate. There is no point in feature-barrier==0 && feature-cache-flush==1 or vice versa, because a barrier would still have been realizable via drain/flush, and at least in the blkback case there'd be no additional cost in the backend if it just does that on behalf of the frontend. And it saves precious ring space.

Last, I'm not clear about how the meaning of feature-barrier got formulated in the header. To the frontend, it is not a "feature" which "may succeed". Rather, if set, it is a strong recommendation for guests who aim to eventually remount their filesystems r/w after a reboot.
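For reference, and just as a sketch from memory rather than verbatim backend code: the feature-barrier key discussed above is simply a node the backend writes into its own xenstore directory when it sets up the device. Something along these lines, where the function name and error handling are assumed but xenbus_printf() is the real interface:

#include <xen/xenbus.h>

/*
 * Sketch: advertise barrier support to the frontend.  state == 1 means
 * BLKIF_OP_WRITE_BARRIER requests will be honoured; state == 0 means
 * the frontend should expect them to fail with BLKIF_RSP_EOPNOTSUPP.
 */
static void backend_advertise_barrier(struct xenbus_transaction xbt,
				      struct xenbus_device *dev, int state)
{
	int err;

	err = xenbus_printf(xbt, dev->nodename, "feature-barrier", "%d", state);
	if (err)
		dev_warn(&dev->dev, "writing feature-barrier: %d\n", err);
}

The frontend half of that handshake is the xenbus_gather() call which the patch below touches.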
Summary
--------------------------------------------------------------------------

I'm assuming most guest OS filesystem/block layer glue follows an ordering interface based on SCSI? Does anyone know if there are exceptions to that? I'd be really curious. [*]

Assuming the former is correct, I think we absolutely want interface idealization. This leaves exactly 2.5 reasonable modes which frontends should prepare for:

- feature-barrier == 0 -> QUEUE_ORDERED_NONE
  Disk is either a) non-caching/non-reordering or b) broken.

- feature-barrier == 1 -> QUEUE_ORDERED_TAG
  Disk is reordering, quite possibly caching, but you won't need to know. You seriously want to issue ordered writes with BLKIF_OP_WRITE_BARRIER.

So does this mean that feature-barrier is actually not backwards compatible? If you use an old blkfront which doesn't know what to do with it, you end up with less reliable storage than before feature-barrier? Perhaps the backend should keep writes synchronous until it sees a barrier coming from the guest, then switch to caching/reordering/etc. (and hope the guest sends a barrier quickish).

[ - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?)
  This is either blktap1, or an outdated blkback (?). I think one would want to assume blktap1 (or parse otherend-id). With blkback there's probably not much you can do anyway. With blktap1 there's a chance you're doing the right thing. ]

See patch below (delta against the previous one). Does that look OK?

I'd suggest ignoring/phasing out the caching bit, because it's no use wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE. I'd suggest fixing the backends where people see fit. In blktap2, which appears urgent to me, this is a one-liner for now, setting QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want to implement barriers in tapdisk. Which is not going to happen in July.

OK. Is blkback OK as-is? And I don't care about blktap1, but I guess it's still the current product storage backend...

Thanks,
	J
"enabled" : "disabled"); + info->gd->disk_name, barrier); return 0; } @@ -717,7 +712,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) printk(KERN_WARNING "blkfront: %s: write barrier op failed\n", info->gd->disk_name); error = -EOPNOTSUPP; - info->feature_barrier = 0; + info->feature_barrier = QUEUE_ORDERED_NONE; xlvbd_barrier(info); } /* fall through */ @@ -1057,6 +1052,7 @@ static void blkfront_connect(struct blkfront_info *info) unsigned long sector_size; unsigned int binfo; int err; + int barrier; switch (info->connected) { case BLKIF_STATE_CONNECTED: @@ -1097,10 +1093,27 @@ static void blkfront_connect(struct blkfront_info *info) } err = xenbus_gather(XBT_NIL, info->xbdev->otherend, - "feature-barrier", "%lu",&info->feature_barrier, + "feature-barrier", "%lu",&barrier, NULL); + + /* + * If there's no "feature-barrier" defined, then it means + * we're dealing with a very old backend which probably writes + * in order with no cache; draining will do what needs to get + * done. + * + * If there are barriers, then we can do full queued writes + * with tagged barriers. + * + * If barriers are not supported, then there's no much we can + * do, so just set ordering to NONE. + */ if (err) - info->feature_barrier = 0; + info->feature_barrier = QUEUE_ORDERED_DRAIN; + else if (barrier) + info->feature_barrier = QUEUE_ORDERED_TAG; + else + info->feature_barrier = QUEUE_ORDERED_NONE; err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size); if (err) { _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel