Re: [Xen-devel] [RFC] netif: staging grants for requests
Hey! Thanks for writing this detailed document!

On Wed, Dec 14, 2016 at 06:11:12PM +0000, Joao Martins wrote:
> Hey,
>
> Back in the Xen hackathon '16 networking session there were a couple of ideas
> brought up. One of them was about exploring permanently mapped grants between
> xen-netback/xen-netfront.
>
> I started experimenting and came up with sort of a design document (in pandoc)
> on what the proposal would look like. This is meant as a seed for discussion
> and also as a request for input on whether this is a good direction. Of
> course, I am willing to try alternatives that we come up with beyond the
> contents of the spec, or any other suggested changes ;)
>
> Any comments or feedback are welcome!
>
> Cheers,
> Joao
>
> ---
> % Staging grants for network I/O requests
> % Joao Martins <<joao.m.martins@xxxxxxxxxx>>
> % Revision 1
>
> \clearpage
>
> --------------------------------------------------------------------
> Status: **Experimental**
>
> Architecture(s): x86 and ARM
>

Any.

> Component(s): Guest
>
> Hardware: Intel and AMD

No need to specify this.

> --------------------------------------------------------------------
>
> # Background and Motivation
>

I skimmed through the middle -- I think your description of transmissions in
both directions is accurate. The proposal to replace some steps with explicit
memcpy is also sensible.

> \clearpage
>
> ## Performance
>
> Numbers that give a rough idea of the performance benefits of this extension.
> These are Guest <-> Dom0 numbers, which test the communication between
> backend and frontend, excluding other bottlenecks in the datapath (the
> software switch).
>
> ```
> # grant copy
> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
>
> # grant copy + grant map (see next subsection)
> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
>
> # drop at the guest network stack
> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
> ```
>
> With this extension:
> ```
> # memcpy
> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps

This basically means we can almost get line rate for a 10Gb link. It is
already a good result. I'm interested in knowing whether there is a
possibility of approaching 40 or 100 Gb/s? It would be good if we design this
extension with higher goals in mind.

> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
>
> # memcpy + grant map (see next subsection)
> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
>
> # (drop at the guest network stack)
> data-len=256 RX (1vcpu, 64b, UDP in pps):  3 285 362 pps
> data-len=256 RX (4vcpu, 64b, UDP in pps): 11 761 847 pps
>
> # (drop with guest XDP_DROP prog)
> data-len=256 RX (1vcpu, 64b, UDP in pps):  9 466 591 pps
> data-len=256 RX (4vcpu, 64b, UDP in pps): 33 006 157 pps
> ```
>
> Latency measurements (netperf TCP_RR, request size 1 and response size 1):
> ```
> 24 KTps vs 28 KTps
> 39 KTps vs 50 KTps (with kernel busy poll)
> ```
>
> TCP bulk transfer measurements aren't showing a representative increase in
> maximum throughput (sometimes ~10%), but rather fewer retransmissions and
> more stable behaviour. This is probably because of a slight decrease in RTT
> (i.e. the receiver acknowledging data quicker). I am currently exploring
> other data list sizes and will probably have a better idea of the effects
> then.
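Also, to check that I am reading the fast path correctly: below is a rough
standalone sketch (plain C, hypothetical names, not the actual netfront/netback
code) of what I understand the staging idea to be -- a pool of buffers that is
granted and mapped once when the connection is set up, into which packets of up
to data-len bytes are simply memcpy'd, one slot per ring request.

```
/*
 * Standalone sketch with hypothetical names -- not the actual
 * netfront/netback code.  It only illustrates the staging idea: a pool of
 * buffers granted and mapped once at connection time, into which packets
 * up to "data-len" bytes are memcpy'd, one slot per ring request.
 */
#include <stdint.h>
#include <string.h>

#define STAGING_SLOTS      256   /* assumed: one slot per ring slot        */
#define STAGING_SLOT_SIZE  256   /* the "data-len=256" used in the numbers */

struct staging_pool {
    /* granted by the frontend and mapped by the backend at setup time,
     * then left in place for the lifetime of the connection */
    uint8_t slot[STAGING_SLOTS][STAGING_SLOT_SIZE];
};

/*
 * Frontend TX fast path: copy the packet into the shared slot that
 * corresponds to the ring request; the backend then only has to memcpy it
 * out on its side.  Packets larger than a slot return -1 and fall back to
 * the existing grant copy (+ grant map) path.
 */
static int stage_tx_packet(struct staging_pool *pool, unsigned int req_idx,
                           const void *pkt, size_t len)
{
    if (len > STAGING_SLOT_SIZE)
        return -1;                     /* fall back to grant copy/map */

    memcpy(pool->slot[req_idx % STAGING_SLOTS], pkt, len);
    return 0;
}
```

If that matches what you have in mind, the grant operations disappear from the
per-packet path entirely, which would explain the jump from ~1.5M to ~3.7M pps
on a single vcpu.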
> ## Linux grant copy vs map remark
>
> Based on the numbers above there is a sudden 2x performance drop when we
> switch from grant copying to also grant mapping the `gref`: 1 295 001 vs
> 577 782 pps for 256 and 260 byte packets respectively. This is all the more
> visible when the grant copy is replaced with a memcpy in this extension
> (3 248 392 vs 588 428 pps). While there have been discussions about avoiding
> the TLB flush on unmap, one could wonder what the threshold of that
> improvement would be. Chances are that this is the least of our concerns on
> a fully populated host (or an oversubscribed one). Would it be worth
> experimenting with increasing the copy threshold beyond the header?
>

Yes, it would be interesting to see more data points and to provide a sensible
default. But I think this is a secondary goal, because a "sensible default"
can change over time and across environments. (A rough sketch of the kind of
heuristic I mean is at the end of this mail.)

> \clearpage
>
> # History
>
> A table of changes to the document, in chronological order.
>
> ------------------------------------------------------------------------
> Date        Revision  Version   Notes
> ----------  --------  --------  ----------------------------------------
> 2016-12-14  1         Xen 4.9   Initial version.
> ------------------------------------------------------------------------
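Going back to the copy-vs-map remark: the kind of "sensible default" I have in
mind is simply a tunable copy threshold, so the 256 vs 260 byte cliff becomes
a tuning knob rather than something baked into the protocol. A hypothetical
sketch (again made-up names, not actual netback code):

```
/*
 * Hypothetical sketch of a copy-vs-map threshold, not actual netback code.
 * The idea: copy (grant copy, or memcpy with this extension) everything up
 * to copy_threshold bytes and only grant map the remainder of larger
 * packets, with the threshold exposed as a tunable so the default can be
 * revisited without touching the protocol.
 */
#include <stdbool.h>
#include <stddef.h>

/* assumed default: around the sizes where copying clearly wins above */
static size_t copy_threshold = 256;

struct tx_plan {
    size_t copy_len;   /* bytes to copy into the staging/local buffer */
    bool   need_map;   /* whether a grant map is needed for the rest  */
};

static struct tx_plan plan_tx(size_t pkt_len)
{
    struct tx_plan plan;

    plan.copy_len = pkt_len < copy_threshold ? pkt_len : copy_threshold;
    plan.need_map = pkt_len > copy_threshold;

    return plan;
}
```

Whether the default should sit at 256, at the full header budget, or somewhere
higher is exactly the kind of thing more data points would settle.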