
Re: [Xen-devel] Interesting observation with network event notification and batching



On Mon, 1 Jul 2013, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > 
> > On 2013-6-29 0:15, Wei Liu wrote:
> > >Hi all,
> > >
> > >After collecting more stats and comparing the copying / mapping cases, I
> > >now have some more interesting findings, which might contradict what I
> > >said before.
> > >
> > >I tuned the runes I used for benchmarking to make sure iperf and netperf
> > >generate large packets (~64K). Here are the runes I use:
> > >
> > >   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > >   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > >
> > >                           COPY                    MAP
> > >iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> > 
> > So with the default iperf settings, copy is about 7.9G, and map is about
> > 2.5G? How about the result of netperf without large packets?
> > 
> 
> First question, yes.
> 
> Second question, 5.8Gb/s. And I believe for the copying scheme without
> large packets the throughput is more or less the same.
> 
> > >          PPI               2.90                  1.07
> > >          SPI               37.75                 13.69
> > >          PPN               2.90                  1.07
> > >          SPN               37.75                 13.69
> > >          tx_count           31808                174769
> > 
> > It seems the interrupt count does not affect the performance at all
> > with -l 131072 -w 128k.
> > 
> 
> Right.
> 
> > >          nr_napi_schedule   31805                174697
> > >          total_packets      92354                187408
> > >          total_reqs         1200793              2392614
> > >
> > >netperf  Tput:            5.8Gb/s             10.5Gb/s
> > >          PPI               2.13                   1.00
> > >          SPI               36.70                  16.73
> > >          PPN               2.13                   1.31
> > >          SPN               36.70                  16.75
> > >          tx_count           57635                205599
> > >          nr_napi_schedule   57633                205311
> > >          total_packets      122800               270254
> > >          total_reqs         2115068              3439751
> > >
> > >   PPI: packets processed per interrupt
> > >   SPI: slots processed per interrupt
> > >   PPN: packets processed per napi schedule
> > >   SPN: slots processed per napi schedule
> > >   tx_count: interrupt count
> > >   total_reqs: total slots used during test
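> > >
> > >   (For example, from the copying-mode iperf numbers above:
> > >   PPI = total_packets / tx_count = 92354 / 31808 ~= 2.90, and
> > >   SPI = total_reqs / tx_count = 1200793 / 31808 ~= 37.75.)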
> > >
> > >* Notification and batching
> > >
> > >Is notification and batching really a problem? I'm not so sure now. My
> > >first thought, when I hadn't yet measured PPI / PPN / SPI / SPN in the
> > >copying case, was that "in that case netback *must* have better
> > >batching", which turned out not to be very true -- copying mode makes
> > >netback slower, but the batching gained is not huge.
> > >
> > >Ideally we still want to batch as much as possible. Possible ways
> > >include playing with the 'weight' parameter in NAPI. But as the figures
> > >show, batching does not seem to be very important for throughput, at
> > >least for now. If the NAPI framework and netfront / netback are doing
> > >their jobs as designed we might not need to worry about this now.
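> > >
> > >For reference, the 'weight' is just the budget handed to the poll
> > >function when NAPI is registered. Roughly (an illustrative sketch only,
> > >not the actual netback code; xenvif_poll, do_tx_work and the value 64
> > >are placeholders):
> > >
> > >   /* Called by the NAPI core with at most 'budget' packets to process. */
> > >   static int xenvif_poll(struct napi_struct *napi, int budget)
> > >   {
> > >           int work_done = do_tx_work(napi, budget); /* placeholder */
> > >
> > >           if (work_done < budget)
> > >                   napi_complete(napi); /* done: re-enable interrupts */
> > >
> > >           return work_done;
> > >   }
> > >
> > >   /* A larger weight means a larger per-poll budget, i.e. bigger batches. */
> > >   netif_napi_add(vif->dev, &vif->napi, xenvif_poll, 64);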
> > >
> > >Andrew, do you have any thoughts on this? You found out that NAPI didn't
> > >scale well with multi-threaded iperf in DomU; do you have any idea how
> > >that can happen?
> > >
> > >* Thoughts on zero-copy TX
> > >
> > >With this hack we are able to achieve 10Gb/s for a single stream, which
> > >is good. But with the classic XenoLinux kernel, which has zero-copy TX,
> > >we weren't able to achieve this. I also developed another zero-copy
> > >netback prototype one year ago with Ian's out-of-tree skb frag
> > >destructor patch series. That prototype couldn't achieve 10Gb/s either
> > >(IIRC the performance was more or less the same as copying mode, about
> > >6~7Gb/s).
> > >
> > >My hack maps all necessary pages permanently and never unmaps them, so
> > >we skip lots of page table manipulation and TLB flushes. My basic
> > >conclusion is that page table manipulation and TLB flushes incur a
> > >heavy performance penalty.
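> > >
> > >Conceptually the hack is just the ordinary grant map, done once at
> > >setup and never undone. Something like the following (a rough sketch,
> > >error handling abbreviated, names are placeholders, not the actual
> > >patch):
> > >
> > >   struct gnttab_map_grant_ref op;
> > >
> > >   gnttab_set_map_op(&op, vaddr, GNTMAP_host_map, gref, otherend_id);
> > >   if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1) ||
> > >       op.status != GNTST_okay)
> > >           goto err;   /* mapping failed */
> > >
> > >   /* Keep op.handle around for the lifetime of the vif -- we never
> > >      unmap, so there is no page table teardown and no TLB flush on
> > >      the data path. */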
> > >
> > >There is no way this hack can be upstreamed. If we're to re-introduce
> > >zero-copy TX, we would need to implement some sort of lazy flushing
> > >mechanism. I haven't thought this through. Presumably this mechanism
> > >would also benefit blk somehow? I'm not sure yet.
> > >
> > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > >mechanism) be useful here? So that we can unify blk and net drivers?
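> > >
> > >(Purely as a hypothetical sketch of what a shared, persistently mapped
> > >entry might look like -- all names made up, nothing of this exists yet:
> > >
> > >   struct persistent_gnt {
> > >           grant_ref_t      gref;
> > >           grant_handle_t   handle;   /* from the one-off map */
> > >           struct page     *page;
> > >           struct list_head mru;      /* MRU list, for reclaim */
> > >           struct rb_node   node;     /* lookup by gref */
> > >   };
> > >
> > >A small common library keyed like this could in principle be shared by
> > >the blk and net backends.)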
> > >
> > >* Changes required to introduce zero-copy TX
> > >
> > >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> > >not yet upstreamed.
> > 
> > Are you mentioning this one 
> > http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> > 
> > 
> 
> Yes. But I believe there have been several versions posted. The link you
> have is not the latest version.
> 
> > >
> > >2. Mechanism to negotiate the max slots the frontend can use: mapping
> > >requires the backend's MAX_SKB_FRAGS >= the frontend's MAX_SKB_FRAGS.
> > >
> > >3. Lazy flushing mechanism or persistent grants: ???
> > 
> > I did some tests with persistent grants before; they did not show
> > better performance than grant copy. But I was using the default
> > netperf params and did not try large packet sizes. Your results
> > remind me that maybe persistent grants would get similar results
> > with larger packet sizes too.
> > 
> 
> "No better performance" -- that's because both mechanisms are copying?
> However I presume persistent grant can scale better? From an earlier
> email last week, I read that copying is done by the guest so that this
> mechanism scales much better than hypervisor copying in blk's case.

Yes, I always expected persistent grants to be faster than
gnttab_copy but I was very surprised by the difference in performance:

http://marc.info/?l=xen-devel&m=137234605929944

I think it's worth trying persistent grants on PV network, although it's
very unlikely that they are going to improve the throughput by 5 Gb/s.

Also, once we have both PV block and network using persistent grants,
we might hit the grant table limit; see this email:

http://marc.info/?l=xen-devel&m=137183474618974
