
Re: [Xen-devel] [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists



On 30/04/15 16:12, Tim Deegan wrote:
> At 14:28 +0100 on 30 Apr (1430404125), David Vrabel wrote:
>> From: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
>>
>> Performance analysis of aggregate network throughput with many VMs
>> shows that performance is significantly limited by contention on the
>> maptrack lock when obtaining/releasing maptrack handles from the free
>> list.
>>
>> Instead of a single free list use a per-VCPU list. This avoids any
>> contention when obtaining a handle.  Handles must be released back to
>> their original list and since this may occur on a different VCPU there
>> is some contention on the destination VCPU's free list tail pointer
>> (but this is much better than a per-domain lock).
>>
>> A domain's maptrack limit is multiplied by the number of VCPUs.  This
>> ensures that a worst case domain that only performs grant table
>> operations via one VCPU will not see a lower map track limit.
>>
>> Signed-off-by: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
>> Signed-off-by: David Vrabel <david.vrabel@xxxxxxxxxx>
>> ---
>>  xen/common/grant_table.c      |  131 ++++++++++++++++++++++++-----------------
>>  xen/include/xen/grant_table.h |    8 ++-
>>  xen/include/xen/sched.h       |    5 ++
>>  3 files changed, 87 insertions(+), 57 deletions(-)
>>
>> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
>> index 569b3d3..d8e7373 100644
>> --- a/xen/common/grant_table.c
>> +++ b/xen/common/grant_table.c
>> @@ -118,9 +118,9 @@ struct gnttab_unmap_common {
>>      ((t)->maptrack[(e)/MAPTRACK_PER_PAGE][(e)%MAPTRACK_PER_PAGE])
>>  
>>  static inline unsigned int
>> -nr_maptrack_frames(struct grant_table *t)
>> +nr_vcpu_maptrack_frames(struct vcpu *v)
>>  {
>> -    return t->maptrack_limit / MAPTRACK_PER_PAGE;
>> +    return v->maptrack_limit / MAPTRACK_PER_PAGE;
>>  }
>>  
>>  #define MAPTRACK_TAIL (~0u)
>> @@ -255,66 +255,91 @@ static int __get_paged_frame(unsigned long gfn, unsigned long *frame, struct pag
>>  
>>  static inline int
>>  __get_maptrack_handle(
>> -    struct grant_table *t)
>> +    struct grant_table *t,
>> +    struct vcpu *v)
>>  {
>> -    unsigned int h;
>> -    if ( unlikely((h = t->maptrack_head) == MAPTRACK_TAIL) )
>> +    unsigned int head, next;
>> +
>> +    head = v->maptrack_head;
>> +    if ( unlikely(head == MAPTRACK_TAIL) )
>> +        return -1;
>> +
>> +    next = maptrack_entry(t, head).ref;
>> +    if ( unlikely(next == MAPTRACK_TAIL) )
>>          return -1;
> 
> So this is an extra restriction over the old code: the list shall
> never be empty.  Is that to make freeing onto the tail of a remote
> freelist easier?  If so can you please add a comment so that nobody
> drops this later?

Previously, on maptrack table init we allocated a single page; with
per-vCPU maptrack lists we would need to allocate one page per vCPU up
front. As domains can have large numbers of vCPUs, this patch changes the
logic so that each vCPU's maptrack entries are only allocated when they
are first needed (or when its existing pages are full). This behavioural
change hopefully keeps the Xen memory overhead for domains with many
vCPUs down.

We detect a vCPU without any allocated maptrack pages by setting its
head pointer to the MAPTRACK_TAIL define.
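
Roughly, the fall-through cases in the allocation path look like this
(stand-alone illustration only; the struct and function names below are
invented for the example, they are not the actual Xen code):

    #include <stdint.h>

    #define MAPTRACK_TAIL (~0u)

    struct vcpu_list {
        uint32_t head;  /* first free entry, or MAPTRACK_TAIL if no pages yet */
    };

    /* Return 1 if a new maptrack page must be allocated for this vCPU. */
    static int need_new_page(const struct vcpu_list *v, const uint32_t *ref)
    {
        /* Case 1: no maptrack page has ever been allocated to this vCPU. */
        if (v->head == MAPTRACK_TAIL)
            return 1;

        /*
         * Case 2: only one free entry remains.  It is deliberately not
         * handed out, so remote frees always find a tail entry to append
         * to (the "list shall never be empty" restriction you noted).
         */
        if (ref[v->head] == MAPTRACK_TAIL)
            return 1;

        return 0;
    }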

> 
>> -    t->maptrack_head = maptrack_entry(t, h).ref;
>> -    return h;
>> +
>> +    v->maptrack_head = next;
>> +
>> +    return head;
>>  }
>>  
>>  static inline void
>>  put_maptrack_handle(
>>      struct grant_table *t, int handle)
>>  {
>> -    spin_lock(&t->maptrack_lock);
>> -    maptrack_entry(t, handle).ref = t->maptrack_head;
>> -    t->maptrack_head = handle;
>> -    spin_unlock(&t->maptrack_lock);
>> +    struct domain *d = current->domain;
>> +    struct vcpu *v;
>> +    u32 prev_tail, cur_tail;
>> +
>> +    /* Set current hande to have TAIL reference */
>> +    maptrack_entry(t, handle).ref = MAPTRACK_TAIL;
>> +    v = d->vcpu[maptrack_entry(t,handle).vcpu];
>> +    cur_tail = v->maptrack_tail;
>> +    do {
>> +        prev_tail = cur_tail;
>> +        cur_tail = cmpxchg((u32 *)&v->maptrack_tail, prev_tail, handle);
>> +    } while ( cur_tail != prev_tail );
> 
> Thinking out loud here: this is the matching code that needs a tail
> entry to append to.  Because we advance the tail pointer before
> threading the list in the .ref field, there's a potentially large set
> of half-freed entries here that are/were the tail pointer but haven't
> had their .ref fields updated yet.  Eventually all freeing threads
> will execute the line below and stitch things together.  And cmpxchg()
> is a full barrier so we're safe against reordering here: any fragment
> of list will end in MAPTRACK_TAIL.  Good.

That's the theory of how it works, but you are correct that all accesses
to ref must be atomic, otherwise we won't splice the linked lists back
together again.
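
As a stand-alone illustration of that ordering argument (using C11
atomics instead of Xen's cmpxchg(), whose retry loop here is effectively
an unconditional exchange; the struct and function names are invented
for the example):

    #include <stdatomic.h>
    #include <stdint.h>

    #define MAPTRACK_TAIL (~0u)

    struct entry {
        _Atomic uint32_t ref;   /* next free index, or MAPTRACK_TAIL */
    };

    struct freelist {
        _Atomic uint32_t tail;  /* index of the current last free entry */
    };

    /* Free 'handle' back onto the owning vCPU's list. */
    static void free_handle(struct freelist *fl, struct entry *table,
                            uint32_t handle)
    {
        uint32_t prev;

        /* The freed entry terminates the list until it is linked in. */
        atomic_store(&table[handle].ref, MAPTRACK_TAIL);

        /* Swing the tail to the new entry first (full barrier)... */
        prev = atomic_exchange(&fl->tail, handle);

        /*
         * ...then stitch the previous tail onto it.  Between these two
         * stores the list is temporarily split, but every fragment still
         * ends in MAPTRACK_TAIL, so readers never walk off the end.
         */
        atomic_store(&table[prev].ref, handle);
    }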
> 
> But this:
> 
>> +    maptrack_entry(t, prev_tail).ref = handle;
> 
> ought to be a write_atomic(), and the reads in
> __get_maptrack_handle() should be atomic too.  

You are right, we probably get away with it on x86 as it's a 32-bit
write, but we should use the atomic helpers.
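
i.e. something along these lines against the hunks quoted above
(illustrative only, not the final code):

    -    maptrack_entry(t, prev_tail).ref = handle;
    +    write_atomic(&maptrack_entry(t, prev_tail).ref, handle);

and in __get_maptrack_handle():

    -    next = maptrack_entry(t, head).ref;
    +    next = read_atomic(&maptrack_entry(t, head).ref);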
> 
>>  }
>>  
>>  static inline int
>>  get_maptrack_handle(
>>      struct grant_table *lgt)
>>  {
>> +    struct vcpu          *v = current;
>>      int                   i;
>>      grant_handle_t        handle;
>>      struct grant_mapping *new_mt;
>> -    unsigned int          new_mt_limit, nr_frames;
>> -
>> -    spin_lock(&lgt->maptrack_lock);
>> +    unsigned int          nr_frames;
>>  
>> -    while ( unlikely((handle = __get_maptrack_handle(lgt)) == -1) )
>> -    {
>> -        nr_frames = nr_maptrack_frames(lgt);
>> -        if ( nr_frames >= max_maptrack_frames )
>> -            break;
>> -
>> -        new_mt = alloc_xenheap_page();
>> -        if ( !new_mt )
>> -            break;
>> +    handle = __get_maptrack_handle(lgt, v);
>> +    if ( likely(handle != -1) )
>> +        return handle;
>>  
>> -        clear_page(new_mt);
>> +    nr_frames = nr_vcpu_maptrack_frames(v);
>> +    if ( nr_frames >= max_maptrack_frames )
>> +        return -1;
>>  
>> -        new_mt_limit = lgt->maptrack_limit + MAPTRACK_PER_PAGE;
>> +    new_mt = alloc_xenheap_page();
>> +    if ( !new_mt )
>> +        return -1;
>> +    clear_page(new_mt);
>>  
>> -        for ( i = 1; i < MAPTRACK_PER_PAGE; i++ )
>> -            new_mt[i - 1].ref = lgt->maptrack_limit + i;
>> -        new_mt[i - 1].ref = lgt->maptrack_head;
>> -        lgt->maptrack_head = lgt->maptrack_limit;
>> +    spin_lock(&lgt->maptrack_lock);
>>  
>> -        lgt->maptrack[nr_frames] = new_mt;
>> -        smp_wmb();
>> -        lgt->maptrack_limit      = new_mt_limit;
>> +    for ( i = 1; i < MAPTRACK_PER_PAGE; i++ ){
> 
> Style nit: brace on its own line, please. 
> 
>> +        new_mt[i - 1].ref = (lgt->maptrack_pages * MAPTRACK_PER_PAGE) + i;
>> +        new_mt[i - 1].vcpu = v->vcpu_id;
>> +    }
>> +    /* Set last entry vcpu and ref */
>> +    new_mt[i - 1].ref = v->maptrack_head;
>> +    new_mt[i - 1].vcpu = v->vcpu_id;
>>  
>> -        gdprintk(XENLOG_INFO, "Increased maptrack size to %u frames\n",
>> -                 nr_frames + 1);
>> +    v->maptrack_head = lgt->maptrack_pages * MAPTRACK_PER_PAGE;
>> +    if (v->maptrack_tail == MAPTRACK_TAIL)
>> +    {
>> +        v->maptrack_tail = (lgt->maptrack_pages * MAPTRACK_PER_PAGE)
>> +            + MAPTRACK_PER_PAGE - 1;
>> +        new_mt[i - 1].ref = MAPTRACK_TAIL;
> 
> This clause was slightly confusing to me until I realised that it's
> only for the case where no maptrack entries have been allocated to
> this vcpu at all before (because you keep at least one on the
> freelist).  And therefore, there's no need to be careful about this
> update to maptrack_tail.  I think that deserves a comment.

As I detailed above, we have changed the behaviour slightly, so there is
now a case where nothing at all has been allocated on the freelist yet.
We'll add a comment explaining this.
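
Something along these lines (exact wording to be decided):

    /*
     * A maptrack_tail of MAPTRACK_TAIL means this vCPU has never had any
     * maptrack entries allocated, so there is no existing tail entry to
     * link the new page onto: the new page becomes the entire free list
     * and its last entry terminates it.
     */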

> 
> So I think my only concerns about this whole series are some comments
> and using atomic read & writes in the linked-list code.  I'm happy to
> believe that you'll DTRT with all that, so for the next spin,
> 
> Acked-by: Tim Deegan <tim@xxxxxxx> 
> 

Thanks for your review.

Malcolm

> Cheers,
> 
> Tim.
> 

