[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages



El 05/10/15 a les 19.05, Julien Grall ha escrit:
> On 05/10/15 17:01, Roger Pau Monnà wrote:
>> El 11/09/15 a les 21.32, Julien Grall ha escrit:
>>> The minimal size of request in the block framework is always PAGE_SIZE.
>>> It means that when 64KB guest is support, the request will at least be
>>> 64KB.
>>>
>>> Although, if the backend doesn't support indirect grant (such as QDISK
>>> in QEMU), a ring request is only able to accomodate 11 segments of 4KB
>>> (i.e 44KB).
>>>
>>> The current frontend is assuming that an I/O request will always fit in
>>> a ring request. This is not true any more when using 64KB page
>>> granularity and will therefore crash during the boot.
>>                                        ^ during boot.
>>>
>>> On ARM64, the ABI is completely neutral to the page granularity used by
>>> the domU. The guest has the choice between different page granularity
>>> supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
>>> This can't be enforced by the hypervisor and therefore it's possible to
>>> run guests using different page granularity.
>>>
>>> So we can't mandate the block backend to support non-indirect grant
>>> when the frontend is using 64KB page granularity and have to fix it
>>> properly in the frontend.
>>>
>>> The solution exposed below is based on modifying directly the frontend
>>> guest rather than asking the block framework to support smaller size
>>> (i.e < PAGE_SIZE). This is because the change is the block framework are
>>> not trivial as everything seems to relying on a struct *page (see [1]).
>>> Although, it may be possible that someone succeed to do it in the future
>>> and we would therefore be able to use advantage.
>>                                        ^ it. (no advantage IMHO)
>>>
>>> Given that a block request may not fit in a single ring request, a
>>> second request is introduced for the data that cannot fit in the first
>>> one. This means that the second request should never be used on Linux
>>> configuration using a page granularity < 44KB.
>>   ^ if the page size is smaller than 44KB.
>>>
>>> Note that the parameters blk_queue_max_* helpers haven't been updated.
>>> The block code will set mimimum size supported and we may be able  to
>>                           ^ the minimum                extra space ^
>>> support directly any change in the block framework that lower down the
>>> mimimal size of a request.
>>   ^ minimal
>>
>> I have a concern regarding the splitting done in this patch.
>>
>> What happens with FUA requests when split? For example the frontend
>> receives a FUA requests with 64KB of data, and this is split into two
>> different requests on the ring, is this going to cause trouble in the
>> higher layers if for example the first request is completed but the
>> second is not? Could we leave the disk in a bad state as a consequence
>> of this?
> 
> If a block request is split into two ring requests, we will wait the two
> responses before reporting the completion to the higher layers (see
> blkif_interrupt and blkif_completion).
> 
> Furthermore, the second ring request will always use the same operation
> as the first one. Note that you will flush twice which is not nice but
> could be improved.
> 
>>
>>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>>
>>> Signed-off-by: Julien Grall <julien.grall@xxxxxxxxxx>
>>>
>>> ---
>>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>>> Cc: "Roger Pau MonnÃ" <roger.pau@xxxxxxxxxx>
>>> Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
>>> Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
>>> ---
>>>  drivers/block/xen-blkfront.c | 199 
>>> +++++++++++++++++++++++++++++++++++++++----
>>>  1 file changed, 183 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>>> index f9d55c3..03772c9 100644
>>> --- a/drivers/block/xen-blkfront.c
>>> +++ b/drivers/block/xen-blkfront.c
>>> @@ -60,6 +60,20 @@
>>>  
>>>  #include <asm/xen/hypervisor.h>
>>>  
>>> +/*
>>> + * The block framework is always working on segment of PAGE_SIZE minimum.
>>
>> The above sentence needs to be reworded.
> 
> What about:
> 
> "The mininal size of the segment supported by the block framework is
> PAGE_SIZE."

That sounds fine.

> 
>>
>>> + * When Linux is using a different page size than xen, it may not be 
>>> possible
>>> + * to put all the data in a single segment.
>>> + * This can happen when the backend doesn't support indirect grant and
>>                                      indirect requests ^
>>> + * therefore the maximum amount of data that a request can carry is
>>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
>>> + *
>>> + * Note that we only support one extra request. So the Linux page size
>>> + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
>>> + * 88KB.
>>> + */
>>> +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
>>
>> Since you are already introducing a preprocessor define, I would like
>> the code added to be protected by it.
> 
> Most of the code is protected by an if (HAS_REQ_EXTRA && ...) or
> variable setting based on HAS_REQ_EXTRA. Furthermore there is extra
> protection with some BUG_ON.
> 
> I would rather avoid to use the preprocessor to avoid ending up with:
> 
> #ifdef HAS_EXTRA_REQ
> /* Code */
> #else
> /* Code */
> #endif
> 
> and potentially miss update on the ARM64 with 64KB pages because
> currently most of people are developing PV drivers on x86.

Ok, I was also thinking whether a compile time assert would be fine:

BUILD_BUG_ON(PAGE_SIZE > 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST *
XEN_PAGE_SIZE);

But then if indirect descriptors are available the assert is no longer true.

>>>  enum blkif_state {
>>>     BLKIF_STATE_DISCONNECTED,
>>>     BLKIF_STATE_CONNECTED,
>>> @@ -79,6 +93,19 @@ struct blk_shadow {
>>>     struct grant **indirect_grants;
>>>     struct scatterlist *sg;
>>>     unsigned int num_sg;
>>
>> #if HAS_EXTRA_REQ
>>
>>> +   enum
>>> +   {
>>> +           REQ_WAITING,
>>> +           REQ_DONE,
>>> +           REQ_FAIL
>>> +   } status;
>>> +
>>> +   #define NO_ASSOCIATED_ID ~0UL
>>> +   /*
>>> +    * Id of the sibling if we ever need 2 requests when handling a
>>> +    * block I/O request
>>> +    */
>>> +   unsigned long associated_id;
>>
>> #endif
> 
> See my remark above.
> 
>>
>>>  };
>>>  
>>>  struct split_bio {
>>> @@ -467,6 +494,8 @@ static unsigned long blkif_ring_get_request(struct 
>>> blkfront_info *info,
>>>  
>>>     id = get_id_from_freelist(info);
>>>     info->shadow[id].request = req;
>>> +   info->shadow[id].status = REQ_WAITING;
>>> +   info->shadow[id].associated_id = NO_ASSOCIATED_ID;
>>>  
>>>     (*ring_req)->u.rw.id = id;
>>>  
>>> @@ -508,6 +537,9 @@ struct setup_rw_req {
>>>     bool need_copy;
>>>     unsigned int bvec_off;
>>>     char *bvec_data;
>>> +
>>> +   bool require_extra_req;
>>> +   struct blkif_request *ring_req2;
>>
>> extra_ring_req?
> 
> Will do.
> 
> 
>>>  };
>>>  
>>>  static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int 
>>> offset,
>>> @@ -521,8 +553,24 @@ static void blkif_setup_rw_req_grant(unsigned long 
>>> gfn, unsigned int offset,
>>>     unsigned int grant_idx = setup->grant_idx;
>>>     struct blkif_request *ring_req = setup->ring_req;
>>>     struct blkfront_info *info = setup->info;
>>> +   /*
>>> +    * We always use the shadow of the first request to store the list
>>> +    * of grant associated to the block I/O request. This made the
>>               ^ grants                                        ^ makes
>>> +    * completion more easy to handle even if the block I/O request is
>>> +    * split.
>>> +    */
>>>     struct blk_shadow *shadow = &info->shadow[setup->id];
>>>  
>>> +   if (unlikely(setup->require_extra_req &&
>>> +                grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
>>> +           /*
>>> +            * We are using the second request, setup grant_idx
>>> +            * to be the index of the segment array
>>> +            */
>>> +           grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +           ring_req = setup->ring_req2;
>>> +   }
>>> +
>>>     if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
>>>         (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
>>>             if (setup->segments)
>>> @@ -537,7 +585,11 @@ static void blkif_setup_rw_req_grant(unsigned long 
>>> gfn, unsigned int offset,
>>>  
>>>     gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
>>>     ref = gnt_list_entry->gref;
>>> -   shadow->grants_used[grant_idx] = gnt_list_entry;
>>> +   /*
>>> +    * All the grants are stored in the shadow of the first
>>> +    * request. Therefore we have to use the global index
>>> +    */
>>> +   shadow->grants_used[setup->grant_idx] = gnt_list_entry;
>>>  
>>>     if (setup->need_copy) {
>>>             void *shared_data;
>>> @@ -579,11 +631,31 @@ static void blkif_setup_rw_req_grant(unsigned long 
>>> gfn, unsigned int offset,
>>>     (setup->grant_idx)++;
>>>  }
>>>  
>>> +static void blkif_setup_extra_req(struct blkif_request *first,
>>> +                             struct blkif_request *second)
>>> +{
>>> +   uint16_t nr_segments = first->u.rw.nr_segments;
>>> +
>>> +   /*
>>> +    * The second request is only present when the first request uses
>>> +    * all its segments. It's always the continuity of the first one
>>> +    */
>>> +   first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +
>>> +   second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +   second->u.rw.sector_number = first->u.rw.sector_number +
>>> +           (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
>>
>> Does this need to take into account if sectors have been skipped in the
>> previous requests due to empty data?
> 
> I'm not sure to understand this question.
> 
> AFAIU, the data is always contiguous in a block segment even though it
> can be split accross multiple Linux page. The number of segments is
> calculated from the amount of data see "Calculate the number of grant
> used" the code. The second request will only be created if this number
> is greater than BLKIF_MAX_SEGMENTS_PER_REQUEST.
> 
> So the second request will always be the continuity of the first one.

Ack.

>> Also I'm not sure how correct it is to hardcode 512 here.
> 
> Well, we are hardcoding it in blkfront and the block framework does the
> same (see blk_limits_max_hw_sectors).

Right, seems like the whole block system is always working with 512 as
the basic unit.

>>
>>> +
>>> +   second->u.rw.handle = first->u.rw.handle;
>>> +   second->operation = first->operation;
>>> +}
>>> +
>>>  static int blkif_queue_rw_req(struct request *req)
>>>  {
>>>     struct blkfront_info *info = req->rq_disk->private_data;
>>> -   struct blkif_request *ring_req;
>>> -   unsigned long id;
>>> +   struct blkif_request *ring_req, *ring_req2 = NULL;
>>> +   unsigned long id, id2 = NO_ASSOCIATED_ID;
>>> +   bool require_extra_req = false;
>>>     int i;
>>>     struct setup_rw_req setup = {
>>>             .grant_idx = 0,
>>> @@ -628,19 +700,19 @@ static int blkif_queue_rw_req(struct request *req)
>>>     /* Fill out a communications ring structure. */
>>>     id = blkif_ring_get_request(info, req, &ring_req);
>>>  
>>> -   BUG_ON(info->max_indirect_segments == 0 &&
>>> -          GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>> -   BUG_ON(info->max_indirect_segments &&
>>> -          GREFS(req->nr_phys_segments) > info->max_indirect_segments);
>>> -
>>>     num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
>>>     num_grant = 0;
>>>     /* Calculate the number of grant used */
>>>     for_each_sg(info->shadow[id].sg, sg, num_sg, i)
>>>            num_grant += gnttab_count_grant(sg->offset, sg->length);
>>>  
>>> +   require_extra_req = info->max_indirect_segments == 0 &&
>>> +           num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +   BUG_ON(!HAS_EXTRA_REQ && require_extra_req);
>>> +
>>>     info->shadow[id].num_sg = num_sg;
>>> -   if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>>> +   if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
>>> +       likely(!require_extra_req)) {
>>>             /*
>>>              * The indirect operation can only be a BLKIF_OP_READ or
>>>              * BLKIF_OP_WRITE
>>> @@ -680,10 +752,29 @@ static int blkif_queue_rw_req(struct request *req)
>>>                     }
>>>             }
>>>             ring_req->u.rw.nr_segments = num_grant;
>>> +           if (unlikely(require_extra_req)) {
>>> +                   id2 = blkif_ring_get_request(info, req, &ring_req2);
>>
>> How can you guarantee that there's always going to be another free
>> request? AFAICT blkif_queue_rq checks for RING_FULL, but you don't
>> actually know if there's only one slot or more than one available.
> 
> Because the depth of the queue is divided by 2 when the extra request is
> used (see xlvbd_init_blk_queue).

I see, that's quite restrictive but I guess it's better than introducing
a new ring macro in order to figure out if there are at least two free
slots.

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.