
Re: [Xen-devel] [PATCH 05/10] net: move destructor_arg to the front of sk_buff.



On 04/11/2012 01:20 AM, Eric Dumazet wrote:
> On Tue, 2012-04-10 at 12:15 -0700, Alexander Duyck wrote:
>
>> Actually now that I think about it my concerns go much further than the
>> memset.  I'm convinced that this is going to cause a pretty significant
>> performance regression on multiple drivers, especially on non x86_64
>> architecture.  What we have right now on most platforms is a
>> skb_shared_info structure in which everything up to and including frag 0
>> is all in one cache line.  This gives us pretty good performance for igb
>> and ixgbe since that is our common case when jumbo frames are not
>> enabled is to split the head and place the data in a page.
> I dont understand this split thing for MTU=1500 frames.
>
> Even using half a page per fragment, each skb :
>
> needs 2 allocations for sk_buff and skb->head, plus one page alloc /
> reference.
>
> skb->truesize = ksize(skb->head) + sizeof(*skb) + PAGE_SIZE/2 = 512 +
> 256 + 2048 = 2816 bytes
The number you provide for head is currently only available for 128-byte
skb allocations.  Anything larger than that will generate a 1K
allocation.  Also, after all of these patches, the smallest size you can
allocate will be 1K for anything under 504 bytes.

The size advantage is actually greater for smaller frames.  In the case of
igb the behaviour is to place anything less than 512 bytes into just the
header and to skip using the page.  As such we get a much better
allocation for small packets, since the truesize is only 1280 in that case.
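
To make the arithmetic behind these figures explicit, here is a minimal
sketch that recomputes the truesize values quoted in this thread.  The
helper name truesize() is hypothetical, the 256-byte sk_buff and 4K page
are the values Eric's numbers imply, and reading the 1280 figure as a 1K
head plus the sk_buff is an assumption rather than something stated
above.

```python
# Sketch only: reproduces the truesize figures quoted in this thread.
# The sizes are the numbers used in the mails, not values read from a
# running kernel.

SKB_STRUCT_SIZE = 256   # sizeof(struct sk_buff) as used in the thread
PAGE_SIZE = 4096        # assumed 4K pages, so half a page is 2048 bytes


def truesize(head_alloc, frag_bytes=0):
    """Hypothetical helper: head allocation + sk_buff + fragment memory."""
    return head_alloc + SKB_STRUCT_SIZE + frag_bytes


# Split receive, Eric's figures: 512-byte head plus half a page.
split = truesize(512, PAGE_SIZE // 2)

# Non-split receive, Eric's figures: one 2048-byte head, no fragment.
non_split = truesize(2048)

# igb small packet (under 512 bytes): assumed 1K head, no page attached.
igb_small = truesize(1024)

if __name__ == "__main__":
    print(split, non_split, igb_small)
```

Under those assumptions the three cases come out to 2816, 2304 and 1280
bytes, so the small-packet case is the only one where the split layout
is smaller than the non-split one.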

In the case of ixgbe the advantage is more of a cache miss advantage.
The driver now receives data only into pages.  I can prefetch the first
two cache lines of the page while allocating the skb to receive it.  As
such we essentially cut the number of cache misses in half versus the
old approach, which had us generating cache misses on the sk_buff during
allocation, and then generating more cache misses again once we had
received the buffer and could fill out the sk_buff fields.  A similar
size advantage exists as well, but only for frames 256 bytes or
smaller.

> With non split you have :
>
> 2 allocations for sk_buff and skb->head.
>
> skb->truesize = ksize(skb->head) + sizeof(*skb) = 2048 + 256 = 2304
> bytes
>
> less overhead and less calls to page allocator...
>
> This only can benefit if GRO is on, since aggregation can use fragments
> and a single sk_buff, instead of a frag_list
There is much more than truesize involved here.  My main argument is
that if we are going to lay out this modified skb_shared_info so that it
is aligned on nr_frags, we should do it on all architectures, not just
x86_64.

Thanks,

Alex

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

