
Re: [Xen-devel] Interesting observation with network event notification and batching



On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> 
> On 2013-6-29 0:15, Wei Liu wrote:
> >Hi all,
> >
> >After collecting more stats and comparing copying / mapping cases, I now
> >have some more interesting finds, which might contradict what I said
> >before.
> >
> >I tuned the runes I used for benchmarking to make sure iperf and netperf
> >generate large packets (~64K). Here are the runes I use:
> >
> >   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> >   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> >
> >                           COPY                    MAP
> >iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> 
> So with default iperf setting, copy is about 7.9G, and map is about
> 2.5G? How about the result of netperf without large packets?
> 

First question, yes.

Second question, 5.8Gb/s. And I believe for the copying scheme without
large packets the throughput is more or less the same.

> >          PPI               2.90                  1.07
> >          SPI               37.75                 13.69
> >          PPN               2.90                  1.07
> >          SPN               37.75                 13.69
> >          tx_count           31808                174769
> 
> Seems interrupt count does not affect the performance at all with -l
> 131072 -w 128k.
> 

Right.

> >          nr_napi_schedule   31805                174697
> >          total_packets      92354                187408
> >          total_reqs         1200793              2392614
> >
> >netperf  Tput:            5.8Gb/s             10.5Gb/s
> >          PPI               2.13                   1.00
> >          SPI               36.70                  16.73
> >          PPN               2.13                   1.31
> >          SPN               36.70                  16.75
> >          tx_count           57635                205599
> >          nr_napi_schedule   57633                205311
> >          total_packets      122800               270254
> >          total_reqs         2115068              3439751
> >
> >   PPI: packets processed per interrupt
> >   SPI: slots processed per interrupt
> >   PPN: packets processed per napi schedule
> >   SPN: slots processed per napi schedule
> >   tx_count: interrupt count
> >   total_reqs: total slots used during test
> >
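(To spell out how those ratios are derived: they are just the counters
divided out, i.e.

    PPI = total_packets / tx_count
    SPI = total_reqs    / tx_count
    PPN = total_packets / nr_napi_schedule
    SPN = total_reqs    / nr_napi_schedule

so for the iperf runs above, COPY gives 92354 / 31808 = 2.90 packets and
1200793 / 31808 = 37.75 slots per interrupt, while MAP gives
187408 / 174769 = 1.07 and 2392614 / 174769 = 13.69.)
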
> >* Notification and batching
> >
> >Is notification and batching really a problem? I'm not so sure now. My
> >first thought, before I had measured PPI / PPN / SPI / SPN in the
> >copying case, was that "in that case netback *must* have better
> >batching", which turned out not to be very true -- copying mode makes
> >netback slower, but the batching gained is not huge.
> >
> >Ideally we still want to batch as much as possible. Possible ways
> >include playing with the 'weight' parameter in NAPI. But as the figures
> >show, batching does not seem to be very important for throughput, at
> >least for now. If the NAPI framework and netfront / netback are doing
> >their jobs as designed, we might not need to worry about this now.
> >
> >Andrew, do you have any thoughts on this? You found out that NAPI didn't
> >scale well with multi-threaded iperf in DomU; do you have any idea how
> >that can happen?
> >
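For reference, the 'weight' I mean above is just the per-poll budget passed
when the NAPI instance is registered. Below is a stripped-down sketch of the
usual pattern -- xenvif_tx_action() and xenvif_napi_schedule_or_enable_events()
are stand-ins for the real TX processing and event re-enabling paths, not the
actual netback code:

#include <linux/netdevice.h>

struct xenvif {                         /* only the fields this sketch uses */
        struct net_device *dev;
        struct napi_struct napi;
};

/* Stand-ins for the real ring processing / event channel handling. */
int xenvif_tx_action(struct xenvif *vif, int budget);
void xenvif_napi_schedule_or_enable_events(struct xenvif *vif);

/* Per-poll budget: NAPI won't let one napi_schedule process more than
 * this many packets, so a larger weight favours batching. */
#define XENVIF_NAPI_WEIGHT 64

static int xenvif_poll(struct napi_struct *napi, int budget)
{
        struct xenvif *vif = container_of(napi, struct xenvif, napi);
        int work_done;

        /* Drain up to 'budget' packets worth of requests from the ring. */
        work_done = xenvif_tx_action(vif, budget);

        if (work_done < budget) {
                /* Ring drained: stop polling, re-enable notifications,
                 * and re-check the ring to close the race with the
                 * frontend posting more requests meanwhile. */
                napi_complete(napi);
                xenvif_napi_schedule_or_enable_events(vif);
        }

        return work_done;
}

/* At interface setup:
 *     netif_napi_add(vif->dev, &vif->napi, xenvif_poll, XENVIF_NAPI_WEIGHT);
 */

Bumping the weight would let each poll drain more slots in one go, at the
cost of holding the softirq for longer.
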
> >* Thoughts on zero-copy TX
> >
> >With this hack we are able to achieve 10Gb/s with a single stream, which
> >is good. But with the classic XenoLinux kernel, which has zero-copy TX,
> >we weren't able to achieve this. I also developed another zero-copy
> >netback prototype one year ago with Ian's out-of-tree skb frag
> >destructor patch series. That prototype couldn't achieve 10Gb/s either
> >(IIRC the performance was more or less the same as copying mode, about
> >6~7Gb/s).
> >
> >My hack maps all necessary pages permanently and never unmaps them, so
> >we skip lots of page table manipulation and TLB flushes. My basic
> >conclusion is that page table manipulation and TLB flushes do incur a
> >heavy performance penalty.
> >
> >There is no way this hack can be upstreamed. If we're to re-introduce
> >zero-copy TX, we would need to implement some sort of lazy flushing
> >mechanism. I haven't thought this through. Presumably this mechanism
> >would also benefit blk somehow? I'm not sure yet.
> >
> >Could persistent mapping (with the to-be-developed reclaim / MRU list
> >mechanism) be useful here? So that we can unify blk and net drivers?
> >
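To make the "lazy flushing / reclaim list" idea above a bit more concrete,
here is a very rough sketch of the kind of thing I have in mind --
map_one_grant() and unmap_one_grant() are placeholders for the real
GNTTABOP map/unmap paths, all names are made up, and locking is omitted:

#include <linux/list.h>
#include <linux/hashtable.h>
#include <linux/slab.h>
#include <xen/grant_table.h>

/* Placeholders for the real grant map / unmap hypercall paths. */
int map_one_grant(grant_ref_t ref, struct page **page);
void unmap_one_grant(grant_ref_t ref, struct page *page);

struct pers_gnt {
        grant_ref_t ref;                /* frontend's granted page */
        struct page *page;              /* our local mapping of it */
        struct hlist_node hash;
        struct list_head lru;
};

#define PERS_GNT_HASH_BITS 10
static DEFINE_HASHTABLE(pers_gnt_hash, PERS_GNT_HASH_BITS);
static LIST_HEAD(pers_gnt_lru);
static unsigned int pers_gnt_count;
static unsigned int pers_gnt_max = 1024;        /* reclaim watermark */

/* Look up a cached mapping, mapping and caching it on a miss.  The
 * common case (frontend reusing the same grants) never touches the
 * page tables at all. */
static struct pers_gnt *pers_gnt_get(grant_ref_t ref)
{
        struct pers_gnt *pg;

        hash_for_each_possible(pers_gnt_hash, pg, hash, ref) {
                if (pg->ref == ref) {
                        list_move(&pg->lru, &pers_gnt_lru);
                        return pg;
                }
        }

        pg = kzalloc(sizeof(*pg), GFP_ATOMIC);
        if (!pg)
                return NULL;
        if (map_one_grant(ref, &pg->page)) {
                kfree(pg);
                return NULL;
        }
        pg->ref = ref;
        hash_add(pers_gnt_hash, &pg->hash, ref);
        list_add(&pg->lru, &pers_gnt_lru);
        pers_gnt_count++;

        return pg;
}

/* Evict least-recently-used mappings once we're over the watermark, so
 * the expensive unmap + TLB flush is paid in occasional batches rather
 * than per packet. */
static void pers_gnt_reclaim(void)
{
        while (pers_gnt_count > pers_gnt_max) {
                struct pers_gnt *pg = list_entry(pers_gnt_lru.prev,
                                                 struct pers_gnt, lru);

                list_del(&pg->lru);
                hash_del(&pg->hash);
                unmap_one_grant(pg->ref, pg->page);
                kfree(pg);
                pers_gnt_count--;
        }
}
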
> >* Changes required to introduce zero-copy TX
> >
> >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> >not yet upstreamed.
> 
> Are you mentioning this one 
> http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> 
> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> 

Yes. But I believe there have been several versions posted. The link you
have is not the latest version.

> >
> >2. Mechanism to negotiate max slots frontend can use: mapping requires
> >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> >
> >3. Lazy flushing mechanism or persistent grants: ???
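
For point 2 above, I imagine a plain xenstore handshake along the lines
of the existing feature-* flags would do. A rough sketch, where
"feature-max-tx-slots" is a made-up key name, not an existing one:

#include <linux/skbuff.h>
#include <xen/xenbus.h>

/* Backend side: advertise how many slots per packet we can cope with. */
static int backend_announce_max_slots(struct xenbus_device *dev,
                                      unsigned int max_slots)
{
        return xenbus_printf(XBT_NIL, dev->nodename,
                             "feature-max-tx-slots", "%u", max_slots);
}

/* Frontend side: read the backend's limit and clamp our usage to it,
 * falling back to MAX_SKB_FRAGS if the key is absent (old backend). */
static unsigned int frontend_read_max_slots(struct xenbus_device *dev)
{
        unsigned int max_slots;

        if (xenbus_scanf(XBT_NIL, dev->otherend,
                         "feature-max-tx-slots", "%u", &max_slots) != 1)
                max_slots = MAX_SKB_FRAGS;

        return max_slots;
}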
> 
> I did some tests with persistent grants before; they did not show
> better performance than grant copy. But I was using the default
> params of netperf and had not tried large packet sizes. Your results
> remind me that maybe persistent grants would get similar results
> with larger packet sizes too.
> 

"No better performance" -- that's because both mechanisms are copying?
However I presume persistent grant can scale better? From an earlier
email last week, I read that copying is done by the guest so that this
mechanism scales much better than hypervisor copying in blk's case.


Wei.

> Thanks
> Annie
> 



 

