Xen project Mailing List

Re: [Xen-devel] [PATCH 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages

El 05/10/15 a les 19.05, Julien Grall ha escrit: > On 05/10/15 17:01, Roger Pau MonnÃ wrote: >> El 11/09/15 a les 21.32, Julien Grall ha escrit: >>> The minimal size of request in the block framework is always PAGE_SIZE. >>> It means that when 64KB guest is support, the request will at least be >>> 64KB. >>> >>> Although, if the backend doesn't support indirect grant (such as QDISK >>> in QEMU), a ring request is only able to accomodate 11 segments of 4KB >>> (i.e 44KB). >>> >>> The current frontend is assuming that an I/O request will always fit in >>> a ring request. This is not true any more when using 64KB page >>> granularity and will therefore crash during the boot. >> ^ during boot. >>> >>> On ARM64, the ABI is completely neutral to the page granularity used by >>> the domU. The guest has the choice between different page granularity >>> supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB). >>> This can't be enforced by the hypervisor and therefore it's possible to >>> run guests using different page granularity. >>> >>> So we can't mandate the block backend to support non-indirect grant >>> when the frontend is using 64KB page granularity and have to fix it >>> properly in the frontend. >>> >>> The solution exposed below is based on modifying directly the frontend >>> guest rather than asking the block framework to support smaller size >>> (i.e < PAGE_SIZE). This is because the change is the block framework are >>> not trivial as everything seems to relying on a struct *page (see [1]). >>> Although, it may be possible that someone succeed to do it in the future >>> and we would therefore be able to use advantage. >> ^ it. (no advantage IMHO) >>> >>> Given that a block request may not fit in a single ring request, a >>> second request is introduced for the data that cannot fit in the first >>> one. This means that the second request should never be used on Linux >>> configuration using a page granularity < 44KB. >> ^ if the page size is smaller than 44KB. >>> >>> Note that the parameters blk_queue_max_* helpers haven't been updated. >>> The block code will set mimimum size supported and we may be able to >> ^ the minimum extra space ^ >>> support directly any change in the block framework that lower down the >>> mimimal size of a request. >> ^ minimal >> >> I have a concern regarding the splitting done in this patch. >> >> What happens with FUA requests when split? For example the frontend >> receives a FUA requests with 64KB of data, and this is split into two >> different requests on the ring, is this going to cause trouble in the >> higher layers if for example the first request is completed but the >> second is not? Could we leave the disk in a bad state as a consequence >> of this? > > If a block request is split into two ring requests, we will wait the two > responses before reporting the completion to the higher layers (see > blkif_interrupt and blkif_completion). > > Furthermore, the second ring request will always use the same operation > as the first one. Note that you will flush twice which is not nice but > could be improved. > >> >>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html >>> >>> Signed-off-by: Julien Grall <julien.grall@xxxxxxxxxx> >>> >>> --- >>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> >>> Cc: "Roger Pau MonnÃ" <roger.pau@xxxxxxxxxx> >>> Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx> >>> Cc: David Vrabel <david.vrabel@xxxxxxxxxx> >>> --- >>> drivers/block/xen-blkfront.c | 199 >>> +++++++++++++++++++++++++++++++++++++++---- >>> 1 file changed, 183 insertions(+), 16 deletions(-) >>> >>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c >>> index f9d55c3..03772c9 100644 >>> --- a/drivers/block/xen-blkfront.c >>> +++ b/drivers/block/xen-blkfront.c >>> @@ -60,6 +60,20 @@ >>> >>> #include <asm/xen/hypervisor.h> >>> >>> +/* >>> + * The block framework is always working on segment of PAGE_SIZE minimum. >> >> The above sentence needs to be reworded. > > What about: > > "The mininal size of the segment supported by the block framework is > PAGE_SIZE." That sounds fine. > >> >>> + * When Linux is using a different page size than xen, it may not be >>> possible >>> + * to put all the data in a single segment. >>> + * This can happen when the backend doesn't support indirect grant and >> indirect requests ^ >>> + * therefore the maximum amount of data that a request can carry is >>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB >>> + * >>> + * Note that we only support one extra request. So the Linux page size >>> + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) = >>> + * 88KB. >>> + */ >>> +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE) >> >> Since you are already introducing a preprocessor define, I would like >> the code added to be protected by it. > > Most of the code is protected by an if (HAS_REQ_EXTRA && ...) or > variable setting based on HAS_REQ_EXTRA. Furthermore there is extra > protection with some BUG_ON. > > I would rather avoid to use the preprocessor to avoid ending up with: > > #ifdef HAS_EXTRA_REQ > /* Code */ > #else > /* Code */ > #endif > > and potentially miss update on the ARM64 with 64KB pages because > currently most of people are developing PV drivers on x86. Ok, I was also thinking whether a compile time assert would be fine: BUILD_BUG_ON(PAGE_SIZE > 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE); But then if indirect descriptors are available the assert is no longer true. >>> enum blkif_state { >>> BLKIF_STATE_DISCONNECTED, >>> BLKIF_STATE_CONNECTED, >>> @@ -79,6 +93,19 @@ struct blk_shadow { >>> struct grant **indirect_grants; >>> struct scatterlist *sg; >>> unsigned int num_sg; >> >> #if HAS_EXTRA_REQ >> >>> + enum >>> + { >>> + REQ_WAITING, >>> + REQ_DONE, >>> + REQ_FAIL >>> + } status; >>> + >>> + #define NO_ASSOCIATED_ID ~0UL >>> + /* >>> + * Id of the sibling if we ever need 2 requests when handling a >>> + * block I/O request >>> + */ >>> + unsigned long associated_id; >> >> #endif > > See my remark above. > >> >>> }; >>> >>> struct split_bio { >>> @@ -467,6 +494,8 @@ static unsigned long blkif_ring_get_request(struct >>> blkfront_info *info, >>> >>> id = get_id_from_freelist(info); >>> info->shadow[id].request = req; >>> + info->shadow[id].status = REQ_WAITING; >>> + info->shadow[id].associated_id = NO_ASSOCIATED_ID; >>> >>> (*ring_req)->u.rw.id = id; >>> >>> @@ -508,6 +537,9 @@ struct setup_rw_req { >>> bool need_copy; >>> unsigned int bvec_off; >>> char *bvec_data; >>> + >>> + bool require_extra_req; >>> + struct blkif_request *ring_req2; >> >> extra_ring_req? > > Will do. > > >>> }; >>> >>> static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int >>> offset, >>> @@ -521,8 +553,24 @@ static void blkif_setup_rw_req_grant(unsigned long >>> gfn, unsigned int offset, >>> unsigned int grant_idx = setup->grant_idx; >>> struct blkif_request *ring_req = setup->ring_req; >>> struct blkfront_info *info = setup->info; >>> + /* >>> + * We always use the shadow of the first request to store the list >>> + * of grant associated to the block I/O request. This made the >> ^ grants ^ makes >>> + * completion more easy to handle even if the block I/O request is >>> + * split. >>> + */ >>> struct blk_shadow *shadow = &info->shadow[setup->id]; >>> >>> + if (unlikely(setup->require_extra_req && >>> + grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) { >>> + /* >>> + * We are using the second request, setup grant_idx >>> + * to be the index of the segment array >>> + */ >>> + grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST; >>> + ring_req = setup->ring_req2; >>> + } >>> + >>> if ((ring_req->operation == BLKIF_OP_INDIRECT) && >>> (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) { >>> if (setup->segments) >>> @@ -537,7 +585,11 @@ static void blkif_setup_rw_req_grant(unsigned long >>> gfn, unsigned int offset, >>> >>> gnt_list_entry = get_grant(&setup->gref_head, gfn, info); >>> ref = gnt_list_entry->gref; >>> - shadow->grants_used[grant_idx] = gnt_list_entry; >>> + /* >>> + * All the grants are stored in the shadow of the first >>> + * request. Therefore we have to use the global index >>> + */ >>> + shadow->grants_used[setup->grant_idx] = gnt_list_entry; >>> >>> if (setup->need_copy) { >>> void *shared_data; >>> @@ -579,11 +631,31 @@ static void blkif_setup_rw_req_grant(unsigned long >>> gfn, unsigned int offset, >>> (setup->grant_idx)++; >>> } >>> >>> +static void blkif_setup_extra_req(struct blkif_request *first, >>> + struct blkif_request *second) >>> +{ >>> + uint16_t nr_segments = first->u.rw.nr_segments; >>> + >>> + /* >>> + * The second request is only present when the first request uses >>> + * all its segments. It's always the continuity of the first one >>> + */ >>> + first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST; >>> + >>> + second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST; >>> + second->u.rw.sector_number = first->u.rw.sector_number + >>> + (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512; >> >> Does this need to take into account if sectors have been skipped in the >> previous requests due to empty data? > > I'm not sure to understand this question. > > AFAIU, the data is always contiguous in a block segment even though it > can be split accross multiple Linux page. The number of segments is > calculated from the amount of data see "Calculate the number of grant > used" the code. The second request will only be created if this number > is greater than BLKIF_MAX_SEGMENTS_PER_REQUEST. > > So the second request will always be the continuity of the first one. Ack. >> Also I'm not sure how correct it is to hardcode 512 here. > > Well, we are hardcoding it in blkfront and the block framework does the > same (see blk_limits_max_hw_sectors). Right, seems like the whole block system is always working with 512 as the basic unit. >> >>> + >>> + second->u.rw.handle = first->u.rw.handle; >>> + second->operation = first->operation; >>> +} >>> + >>> static int blkif_queue_rw_req(struct request *req) >>> { >>> struct blkfront_info *info = req->rq_disk->private_data; >>> - struct blkif_request *ring_req; >>> - unsigned long id; >>> + struct blkif_request *ring_req, *ring_req2 = NULL; >>> + unsigned long id, id2 = NO_ASSOCIATED_ID; >>> + bool require_extra_req = false; >>> int i; >>> struct setup_rw_req setup = { >>> .grant_idx = 0, >>> @@ -628,19 +700,19 @@ static int blkif_queue_rw_req(struct request *req) >>> /* Fill out a communications ring structure. */ >>> id = blkif_ring_get_request(info, req, &ring_req); >>> >>> - BUG_ON(info->max_indirect_segments == 0 && >>> - GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST); >>> - BUG_ON(info->max_indirect_segments && >>> - GREFS(req->nr_phys_segments) > info->max_indirect_segments); >>> - >>> num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg); >>> num_grant = 0; >>> /* Calculate the number of grant used */ >>> for_each_sg(info->shadow[id].sg, sg, num_sg, i) >>> num_grant += gnttab_count_grant(sg->offset, sg->length); >>> >>> + require_extra_req = info->max_indirect_segments == 0 && >>> + num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST; >>> + BUG_ON(!HAS_EXTRA_REQ && require_extra_req); >>> + >>> info->shadow[id].num_sg = num_sg; >>> - if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) { >>> + if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST && >>> + likely(!require_extra_req)) { >>> /* >>> * The indirect operation can only be a BLKIF_OP_READ or >>> * BLKIF_OP_WRITE >>> @@ -680,10 +752,29 @@ static int blkif_queue_rw_req(struct request *req) >>> } >>> } >>> ring_req->u.rw.nr_segments = num_grant; >>> + if (unlikely(require_extra_req)) { >>> + id2 = blkif_ring_get_request(info, req, &ring_req2); >> >> How can you guarantee that there's always going to be another free >> request? AFAICT blkif_queue_rq checks for RING_FULL, but you don't >> actually know if there's only one slot or more than one available. > > Because the depth of the queue is divided by 2 when the extra request is > used (see xlvbd_init_blk_queue). I see, that's quite restrictive but I guess it's better than introducing a new ring macro in order to figure out if there are at least two free slots. Roger. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.