
Re: [Xen-devel] [RFC] netif: staging grants for requests



On Thu, 5 Jan 2017, Joao Martins wrote:
> Hey Stefano,
> 
> Thanks a lot for the review and the comments!!
> 
> On 01/04/2017 07:40 PM, Stefano Stabellini wrote:
> > On Wed, 14 Dec 2016, Joao Martins wrote:
> >> # Proposed Extension
> >>
> >> ## Terminology
> >>
> >> `data gref` refers to the reusable/staging grants, whereas `gref` refers to
> >> the ones that are granted/revoked for each packet being sent. `command ring`
> >> refers to the current ring, whereas `data list` refers to the one proposed
> >> below.
> >>
> >> ## Xenstore
> >>
> >> "feature-staging-grants" is a capability to allow the use of a set of
> >> recycled buffers permanently mapped at the device setup. If
> >> advertised, the frontend will grant a region equivalent to ```maximum 
> >> number
> >> of slots per ring * size of buffer``` where:
> >>
> >>  * `maximum number of slots per ring` is the number of request structures
> >>    that can fit within the command ring. By default a request
> >>    structure consumes 16 bytes. The first 64 bytes of a ring are used by
> >>    the producer and consumer indices.
> > 
> > This means that by default the number of slots is (4096-64)/16 = 252,
> > correct?
> That would be correct but I made a few mistakes in the writing, despite a
> following paragraph mentioning the correct ring size for both TX and RX.
> The ring slot size is determined by the size of the request or response
> (being a union of the request and response structures). This means the ring
> slot would be either 8 octets for the RX ring or 12 octets for the TX ring,
> which in effect translates to:
> 
> RX := (4096 - 64) / 8 = 504
> TX := (4096 - 64) / 12 = 336
> 
> And as Wei correctly mentions both values are rounded down to a power of 2.
> Which results in 256 slots for each ring. I will fix this in the spec.
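
Right, that matches my understanding then. Just to write the arithmetic down,
something like the sketch below (the 8/12 octet sizes are the request/response
unions you mention, and the rounding mirrors what the usual ring macros do;
this is purely illustrative):

```
#include <stdio.h>

/* Round down to a power of two, as the standard ring size macros do. */
static unsigned int rounddown_pow2(unsigned int x)
{
    unsigned int r = 1;
    while (r * 2 <= x)
        r *= 2;
    return r;
}

int main(void)
{
    const unsigned int page = 4096;
    const unsigned int idx  = 64;  /* producer/consumer indices at ring start */
    const unsigned int rx   = 8;   /* union of netif_rx_{request,response}    */
    const unsigned int tx   = 12;  /* union of netif_tx_{request,response}    */

    /* (4096 - 64) / 8 = 504 -> 256 slots; (4096 - 64) / 12 = 336 -> 256 slots */
    printf("RX slots: %u\n", rounddown_pow2((page - idx) / rx));
    printf("TX slots: %u\n", rounddown_pow2((page - idx) / tx));
    return 0;
}
```
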
> 
> >>  * `size of buffer` is the maximum portion of the packet the backend (and
> >>    frontend) have negotiated, which will be put in each slot of the
> >>    `data list`.
> >>
> >> Which means that for 256 possible packets in the ring with a chosen size of
> >> 256 bytes, the amount of memory to be permanently granted is a region of
> >> 64KB i.e. 16 grefs.
> >>
> >> The lack of "feature-staging-grants" or a value of 0 means that it's not
> >> supported. If the feature is present then a second entry
> >> "staging-grants-sizes" must exist and it contains the sizes that can be used
> >> per slot. To avoid the frontend clobbering it with various different values,
> >> we limit it to a fixed set of semicolon-delimited ones.
> >>
> >> The allowed values are implementation specific. Examples of good values
> >> include: 256 (protocol/header region), 2048 (fits 1 MSS, which is commonly
> >> in the linear part of Linux packets), 4096 (grant per packet if
> >> feature-sg=0, for DPDK/XDP/netmap buffers) and 65535 (maximum packet size,
> >> i.e. ```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE```, for feature-sg=1).
> > 
> > I am not following. So far we have been talking about two values:
> > - maximum number of slots per ring
> > - size of buffer
> > We have also discussed the number of bytes used for indexes at the
> > beginning of the command ring, which is 64 bytes.
> Right, but these 64 bytes were only mentioned to describe the calculation of
> number of slots.
> 
> > 
> > But now you are introducing a few other values:
> > - protocol/header region
> > - fits 1 MSS which is common to be in the linear part of linux packets
> > - grant per packet
> > - maximum packet size (I take this is the same as size of buffer)
> > 
> > Could you please described what these values represent in more details
> > and how they relate to the two previous values?
> The values I give above are examples/recommendations of valid "size of the
> buffer" values. That, multiplied by the number of slots, gives how much is
> granted by the frontend.
> 
> Though I want to be able to say that the backend will copy up to N bytes to
> these staging grants (or data grefs, in the terminology used in the doc),
> before it resorts to grant ops.

I think I understand.


> Hence the value of 65536 is the same as 4096 in terms of
> number of data grefs provided, but the backend will just copy less from/to the
> staging area leaving the rest to be used with the command ring grefs (grant
> map/unmap/etc). Does this answer your question?

Which value of 65536 is the same as 4096?  Packet size? Because up to
4096 bytes are copied to staging grants, the rest is dealt with as
usual?


> I went with the simple approach to start with, but I also thought of having a
> small descriptor to describe how much of this staging area is used per packet,
> instead of a fixed negotiated value. But it would leave me with only a length
> field. Unless we could have a header with the most commonly used extra info
> slot (GSO) as part of this descriptor too, and save 1 slot per packet for
> GSO/GRO packets.

I think that a fixed size is OK. I'll call "N" the max number of bytes
copied to the staging grants. Are we going to avoid granting the
original packet data if the packet size <= N?

In fact, wouldn't it be better to only do memcpy if packet size <= N,
and only do mapping or grant copies if packet size > N? That way we
could skip all grant operations when packet size <= N and behave
normally in other cases.
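
In other words, something like the sketch below on the frontend TX side (all
names are invented and the staging slot is simplified to a single buffer; this
is only to illustrate the copy-if-small / grant-if-large split I have in mind):

```
#include <stdio.h>
#include <string.h>

#define N 256                      /* negotiated tx-data-len */
static unsigned char staging[N];   /* pre-granted staging slot (one slot shown) */

static void grant_and_push(const void *data, unsigned int len)
{
    /* Placeholder for today's path: grant the page(s), push gref(s). */
    printf("grant ops for %u bytes\n", len);
    (void)data;
}

static void xmit(const void *data, unsigned int len)
{
    if (len <= N) {
        /* Small packet: memcpy into the permanently granted buffer,
         * no per-packet grant/map/unmap needed. */
        memcpy(staging, data, len);
        printf("copied %u bytes into staging slot\n", len);
    } else {
        /* Large packet: fall back to the existing grant-based path. */
        grant_and_push(data, len);
    }
}

int main(void)
{
    unsigned char small[64] = { 0 }, large[1500] = { 0 };
    xmit(small, sizeof(small));
    xmit(large, sizeof(large));
    return 0;
}
```
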


> >> The individual size covered per entry by the frontend is set through
> >> ```tx-data-len``` or ```rx-data-len```, which contain the maximum amount of
> >> data in a single packet. Chosen values of "tx-data-len" and "rx-data-len"
> >> may be asymmetrical (hence can be different between TX and RX) and are
> >> validated against this list of valid sizes. For their use see the
> >> [datapath](#datapath-changes) section further below.
> >>
> >>    /local/domain/1/device/vif/0/feature-staging-grants = "1"
> >>    /local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"
> > 
> > I don't get what the 4 numbers mean. Are they possible supported options
> > for "tx-data-len" and "rx-data-len", which in turn are the "size of
> > buffer" as described above?
> Exactly!
> 
> > If so, shouldn't they be written by the backend to backend paths? Why
> > only 4 options? Could there be more? What is the max number of allowed
> > option? And what is the minimum?
> The path is wrong, it should be the backend path. The backend advertises this
> set of values that can be used, and it can advertise as many values as it
> supports (separated by semicolons). The values I selected were simply picked
> based on what I explain in the previous paragraph. The frontend will then pick
> one of those.

All right, all this needs to be clearly written in the doc.


> /local/domain/0/backend/vif/1/0/feature-staging-grants = "1"
> /local/domain/0/backend/vif/1/0/staging-grants-sizes = "128;256;2048;4096"
> 
> I could be less restrictive and instead turn these sizes into a tuple with
> "min;max" or two separate xenstore entries if folks prefer.
> 
> And the frontend just advertises a feature entry like this:
> 
> /local/domain/1/device/vif/0/feature-staging-grants = "1"
> 
> Other alternatives on the table were to specify this region as part of the
> request/response, and turn the TX/RX rings into multi-page ones. But I found
> that a bit limiting, for example if we wanted to allow a guest to use a fixed
> set of pages (DPDK, netmap, XDP), which would result in around >256
> "{tx,rx}-ring-ref" entries.
> 
> >>    /local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
> >>    /local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
> >>    /local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
> >>    /local/domain/1/device/vif/0/queue-0/rx-data-len = "256"
> >>
> >> The ```tx-data-ref%u``` and ```rx-data-ref%u``` describe the list of grants
> >> to be used for each ring. The example above shows a list composed of a
> >> single
> > 
> > For what ring, the data ring?
> The command ring. There is no data ring, but a list of grefs given to the
> backend, hence here I mention this as a "data list". The "data list" entries I
> also refer to as "staging/data grefs". Plus there's one list for TX and
> another for RX.
> 
> > 
> >> page whereas multiple pages would be as:
> >>
> >>    tx-data-ref0 = <data-ref-tx0>
> >>    tx-data-ref1 = <data-ref-tx1>
> >>    rx-data-ref0 = <data-ref-rx0>
> >>    rx-data-ref1 = <data-ref-rx1>
> >>
> >> Each slot in the `data-ref` list is formatted as following:
> >>
> >>             0     1     2     3     4     5     6     7  octet
> >>          +-----+-----+-----+-----+-----+-----+-----+-----+
> >>          | id        | gref                  | offset    |
> >>          +-----+-----+-----+-----+-----+-----+-----+-----+
> >>
> >>    id: the frame identifier.
> > 
> > What is a "frame identifier"?
> It's the buffer we are sharing, the one we memcpy into for the backend to use.
> I should probably use "buffer" instead. (See further below about its use.)
> 
> >>    gref: grant reference to the frame.
> >>    offset: offset within the frame.
> > 
> > I assumed that each grant used for the data ring would be used in its
> > entirety. Does it really make sense to create a ring using portions of
> > pages? Given that we can only share memory at a page granularity,
> My main argument with having portions of pages is for guests that only want to
> use this region to copy the linear region and hence have a much smaller
> footprint per ring (with just 16 grefs). Otherwise we would be sharing 256
> grants while only using 1/16 of it. In the rest of the spec I take this to be
> the general case, where we memcpy only a small portion of the packet.
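
By the way, just to confirm I am reading the slot format right, each data-ref
entry would map to something like this in C (the struct name is mine, not from
the spec; the packing attribute is only needed because gref sits at byte
offset 2 in the diagram):

```
#include <stdint.h>

/* One tx/rx data-ref list entry as drawn above: 2-byte buffer id,
 * 4-byte grant reference, 2-byte offset, 8 octets total. */
struct netif_staging_slot {
    uint16_t id;      /* buffer ("frame") identifier */
    uint32_t gref;    /* grant reference to the frame */
    uint16_t offset;  /* offset within the frame */
} __attribute__((packed));

_Static_assert(sizeof(struct netif_staging_slot) == 8, "slot is 8 octets");
```
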

Even if 16 gref is your main use case, 16*256 = 4096: you can achieve
that by sharing one single page for all data slots.


> > I don't think it makes sense: the rest of the page would still be
> > accessible from the backend, which could compromise its content. I would
> > share a list of pages that can be mapped contiguously to create a data
> > ring.
> Note that the permissions of the TX grefs are readonly, whereas the rx ring is
> read/write, so I inherit this same property here for the data grefs too. I
> wouldn't mind having a grant per packet;

That's not what I am suggesting.

Let's suppose that the frontend chooses a slot size of 128 and that we
have 256 slots in total. The frontend shares 8 pages to the backend:

 128*256 = 32768 = 8*4096

these 8 pages are going to be used for all data slots. So slot number 34
(if we count from 1) is the second slot of the second page (32 data
slots per page).
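
In code form the lookup is trivial, something like (illustrative only):

```
#include <stdio.h>

/* Illustrative only: locating a data slot inside the shared pages when
 * slots are packed back to back (slot_size = 128 -> 32 slots per page). */
int main(void)
{
    const unsigned int slot_size = 128, page_size = 4096;
    const unsigned int slots_per_page = page_size / slot_size; /* 32 */
    unsigned int slot = 34 - 1;  /* slot number 34, counting from 1 */

    printf("page %u, offset 0x%x\n",
           slot / slots_per_page,                 /* second page -> index 1 */
           (slot % slots_per_page) * slot_size);  /* second slot -> 0x80    */
    return 0;
}
```
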


> is your concern coming from the fact that these RX data grefs are
> constantly mapped, hence the backend can mangle them at any time, as opposed
> to RX grants, which are still revoked when given back to the frontend?

Irrespective of what netfront/netback does today with the grants, it
is certainly better to avoid sharing unnecessary memory to the other
end.


> But then nothing currently stops a backend from writing to these pages
> irrespective of the approach (with or without staging grants) until
> they are copied/revoked; the frontend would always copy from the
> staging region into a guest-only visible region.
>
> >> This list has twice as max slots as would have `tx-ring-ref` or 
> >> `rx-ring-ref`
> >> respectively, and it is set up at device creation and removed at device
> >> teardown, same as the command ring entries. This way we keep up with ring 
> >> size
> >> changes as it it expected to be in the command ring. A hypothetical 
> >> multi-page
> >                 ^ repetition
> OK, will remove it.
> 
> >> command ring would increase the number of slots and thus this data list
> >> would grow accordingly. The list is terminated by an entry whose ```gref```
> >> field is 0; the other fields of that entry are ignored.
> > 
> > Could you please explain a bit more how the command ring increases the
> > number of slot and why we need to allocate twice the number of required
> > slots at the beginning?
> "This list has twice as max slots as" is meant to say that the list has twice 
> as
> capacity as the command ring. The list entry is 8 octets, so one page fits 512
> entries. The data ref is only filled with up to the command ring size which
> means only half of it would be filled currently. I will adjust this sentence
> here to make it more clear.
> 
> Multi-page rings aren't supported on netback/netfront (yet) but as an example
> we would have one `tx-data-ref0` to represent the command ring slots of
> `tx-ring-ref0` and `tx-ring-ref1`. I noticed that, in order not to waste
> grants, I need to assume the ```data-ref``` is full in the absence of a
> terminating entry; otherwise I would end up needing `tx-data-ref1` to have a
> terminating entry. I will add a sentence mentioning this.
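
OK. So a backend walking one mapped tx-data-ref page would stop at the first
gref of 0, or treat the page as full when it never finds one, as in this
sketch (struct and function names are mine):

```
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Sketch of how a backend could walk one mapped data-ref page: stop at a
 * gref of 0, or treat the page as full (512 entries of 8 octets) when no
 * terminating entry is found. */
struct netif_staging_slot {
    uint16_t id;
    uint32_t gref;
    uint16_t offset;
} __attribute__((packed));

#define SLOTS_PER_PAGE (4096 / sizeof(struct netif_staging_slot)) /* 512 */

size_t count_slots(const struct netif_staging_slot *page)
{
    size_t n;

    for (n = 0; n < SLOTS_PER_PAGE; n++)
        if (page[n].gref == 0)
            break;          /* explicit terminating entry */
    return n;               /* == SLOTS_PER_PAGE means "page is full" */
}

int main(void)
{
    static struct netif_staging_slot page[SLOTS_PER_PAGE]; /* zeroed: gref==0 */
    printf("%zu slots in use\n", count_slots(page));        /* prints 0 */
    return 0;
}
```
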
> 
> >> ## Datapath Changes
> >>
> >> The packet is copied to/from the mapped data list grefs of up to 
> >> `tx-data-len`
> >> or `rx-data-len`. This means that the buffer (referenced by `gref` from 
> >> within
> >> the `struct netif_[tx|rx]_request`) has the same data up to `size`.  In 
> >> other
> >> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. 
> >> Hence
> >> netback should ignore up to `size` of the `gref` when processing as the
> >> `data-ring` has the contents of it.
> > 
> > My lack of netfront/netback knowledge might show, but I am not
> > following. What do gref, size and idx represent in this context? Please
> > explain the new workflow in more details.
> [ Please ignore the part saying "replicated on the `data-ring` at `idx`",
> since there is no ring. It should instead be the `data list` buffer identified
> by the `struct netif_[tx|rx]_request` `id` field. ]
> 
> The new workflow is not too different from the old one: (on TX) we *first*
> memcpy the packet to the staging grants region (or `data gref`, as I call it
> in this doc), up to the negotiated `{tx,rx}-data-len` (or the `size` specified
> in the command ring if smaller). The `data gref` to be used is identified by
> the `id` field in the netif_[tx|rx]_request struct. The rest proceeds as
> before, that is, granting the packet (within XEN_PAGE_SIZE boundaries) and
> setting `gref` and `offset` accordingly in the rest of the command ring slots.
> Let me improve this paragraph to make it clearer.
> 
> The other difference (see the Zerocopy section) is that if the frontend sets
> the NETTXF_staging_buffer flag, then the `gref` field in the
> netif_[tx|rx]_request struct will carry the `data gref` id (the id field
> labelled "frame identifier" that you asked about earlier in the doc). This is
> to allow a frontend to specify an RX `data gref` to be used in the TX ring
> without involving any additional copy. I haven't PoC-ed this zerocopy part
> yet, as it only covers two specific scenarios (that is, guest XDP_TX or a DPDK
> frontend).
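
OK, I think I get the workflow now (leaving the NETTXF_staging_buffer case
aside). Restating it as a backend-side TX sketch just to check my
understanding, with invented helper and field names:

```
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Backend-side TX sketch, only to restate my understanding: the first
 * min(size, tx-data-len) bytes live in the permanently mapped staging
 * buffer selected by req.id; anything beyond that is fetched through
 * the usual per-packet grant (req.gref).  All names here are mine. */

#define TX_DATA_LEN 256
#define NR_SLOTS    256

static uint8_t staging[NR_SLOTS][TX_DATA_LEN]; /* mapped once at setup */

struct tx_req {                /* stand-in for netif_tx_request */
    uint16_t id;
    uint16_t size;
    uint32_t gref;
};

static void backend_rx_from_guest(const struct tx_req *req, uint8_t *pkt)
{
    uint16_t head = req->size < TX_DATA_LEN ? req->size : TX_DATA_LEN;

    /* 1. Cheap part: plain memcpy from the staging area. */
    memcpy(pkt, staging[req->id], head);

    /* 2. Remainder (if any): grant copy/map req->gref as today. */
    if (req->size > head)
        printf("grant ops for remaining %u bytes (gref %u)\n",
               req->size - head, req->gref);
}

int main(void)
{
    uint8_t pkt[2048];
    struct tx_req small = { .id = 3, .size = 128,  .gref = 0 };
    struct tx_req big   = { .id = 7, .size = 1500, .gref = 1234 };

    backend_rx_from_guest(&small, pkt); /* fully served by staging copy */
    backend_rx_from_guest(&big, pkt);   /* staging copy + grant ops     */
    return 0;
}
```
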
> 
> >> Values bigger than the 4096 page/grant boundary only have special meaning
> >> for the backend, namely how much needs to be copied/pulled across the whole
> >> packet (which can be composed of multiple slots). Hence (e.g.) a value of
> >> 65536 vs 4096 will have the same data list size; the latter value would lead
> >> to only copying/pulling one gref in the whole packet, whereas the former
> >> will be a copy-only interface for all slots.
> >>
> >> ## Buffer Identification and Flags
> >>
> >> The data list ids must start from 0 and are global and contiguous (across
> >> both lists). A data slot is identified by the ring slot ```id``` field. The
> >> resultant data gref id to be used in the RX data list is computed by
> >> subtracting the `struct netif_[tx|rx]_request` ```id``` from the total
> >> amount of tx data grefs. Example of the lists' layout below:
> > 
> > What is the "resultant data gref id"? What is the "RX data list"?
> > Please explain in more details.
> "Resultant data gref id" corresponds to the case where we set
> NETTXF_staging_buffer flag, in which case we set the command ring `gref` with
> the `id` in the data list entry (see below diagram). "RX data list" is the 
> list
> described with `rx-data-ref`. I should have put that as separate paragrah as
> this is making things more confusing.
> 
> >> ```
> >>  [tx-data-ref-0, tx-data-len=256]
> >>  { .id = 0, gref = 512, .offset = 0x0   }
> >>  { .id = 1, gref = 512, .offset = 0x100 }
> >>  { .id = 2, gref = 512, .offset = 0x200 }
> >>  ...
> >>  { .id = 256, gref = 0, .offset = 0x0   }
> >>
> >>  [rx-data-ref-0, rx-data-len=4096]
> >>  { .id = 256, gref = 529, .offset = 0x0 }
> >>  { .id = 257, gref = 530, .offset = 0x0 }
> >>  { .id = 258, gref = 531, .offset = 0x0 }
> >>  ...
> >>  { .id = 512, gref = 0,   .offset = 0x0 }
> >> ```
> >>
> >> Permissions of RX data grefs are read-write whereas TX data grefs are
> >> read-only.
> >>
> >> ## Zerocopy
> >>
> >> The frontend may wish to provide a bigger RX list than TX, and use RX
> >> buffers for transmission in a zerocopy fashion for guests mainly doing
> >> forwarding. In such cases the backend sets the NETTXF_staging_buffer flag in
> >> the ```netif_tx_request``` flags field such that the `gref` field instead
> >> designates the `id` of a data gref.
> >>
> >> This is only valid when packets are solely described by the staging grants
> >> for the slot packet size being written, or when [tx|rx]-data-len is 4096
> >> (for feature-sg=0) or 65535 (for feature-sg=1) and thus no new `gref` is
> >> needed to describe the packet payload.
> >>
> >> \clearpage
> >>
> >> ## Performance
> >>
> >> Numbers that give a rough idea on the performance benefits of this 
> >> extension.
> >> These are Guest <-> Dom0 which test the communication between backend and
> >> frontend, excluding other bottlenecks in the datapath (the software 
> >> switch).
> >>
> >> ```
> >> # grant copy
> >> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
> >> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
> >> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
> >> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
> >>
> >> # grant copy + grant map (see next subsection)
> >> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
> >> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
> >>
> >> # drop at the guest network stack
> >> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
> >> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
> >> ```
> >>
> >> With this extension:
> >> ```
> >> # memcpy
> >> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
> >> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
> >> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
> >> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
> >>
> >> # memcpy + grant map (see next subsection)
> >> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
> >> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
> >>
> >> # (drop at the guest network stack)
> >> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
> >> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
> >>
> >> # (drop with guest XDP_DROP prog)
> >> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
> >> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
> >> ```
> > 
> > Very nice!
> :D
> 
> >> Latency measurements (netperf TCP_RR request size 1 and response size 1):
> >> ```
> >> 24 KTps vs 28 KTps
> >> 39 KTps vs 50 KTps (with kernel busy poll)
> >> ```
> >>
> >> TCP bulk transfer measurements aren't showing a representative increase in
> >> maximum throughput (sometimes ~10%), but rather fewer retransmissions and
> >> more stability. This is probably because of a slight decrease in RTT
> >> (i.e. the receiver acknowledging data quicker). I am currently exploring
> >> other data list sizes and will probably have a better idea of the effects
> >> of this.
> > 
> > This is strange. By TCP Bulk transfers, do you mean iperf?
> Yeap.
> 
> > From my pvcalls experience, I would expect a great improvement there too.
> Notice that here we are only memcpying a small portion of the packet
> (256 bytes max, not all of it).

Are we? Unless I am mistaken, the protocol doesn't have any limitations
on the amount of bytes we are memcpying. It is up to the backend, which
could theoretically support large amounts such as 64k and larger.


> Past experience (in an old proposal I made) also showed me an improvement like
> yours. I am a bit hesitant about memcpying all the way when other things are
> involved in the workload; but then I currently don't see many other
> alternatives to lessen the grant overhead. In the meantime I'll gather more
> data points and get a clearer idea of the ratio of the improvement vs the
> compromise.

I don't want to confuse you, but another alternative to consider would
be to use a circular data ring like in PVCalls instead of slots. But I
don't know how difficult it would be to implement something like that in
netfront/netback. The command ring could still be used to send packets,
but the data could be transferred over the data ring instead. In this
model we would have to handle the case where a slot is available on the
command ring, but we don't have enough room on the data ring to copy the
data. For example, we could fall back to grant ops.
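
To be concrete about the fallback I mean, the check would have roughly this
shape (everything below is made up, just to illustrate the "copy if there is
room, otherwise grant" idea):

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Shape of the fallback I have in mind if we went the PVCalls-style
 * circular data ring route: copy into the ring when there is room,
 * otherwise the caller falls back to grant operations. */
struct data_ring {
    uint8_t *buf;
    uint32_t size;       /* power of two */
    uint32_t prod, cons; /* free-running indices */
};

static bool ring_write(struct data_ring *r, const void *data, uint32_t len)
{
    if (r->size - (r->prod - r->cons) < len)
        return false;                    /* not enough room: use grant ops */
    for (uint32_t i = 0; i < len; i++)   /* handle wrap-around simply */
        r->buf[(r->prod + i) & (r->size - 1)] = ((const uint8_t *)data)[i];
    r->prod += len;
    return true;
}

int main(void)
{
    static uint8_t storage[4096];
    struct data_ring r = { .buf = storage, .size = sizeof(storage) };
    const char msg[] = "hello";

    printf("copied: %d\n", ring_write(&r, msg, sizeof(msg))); /* 1 */
    return 0;
}
```
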


> >> ## Linux grant copy vs map remark
> >>
> >> Based on the numbers above there's a sudden 2x performance drop when we
> >> switch from grant copy to also grant mapping the `gref`: 1 295 001 vs
> >> 577 782 pps for 256 and 260 byte packets respectively. This is all the more
> >> visible when replacing the grant copy with memcpy in this extension
> >> (3 248 392 vs 588 428). While there have been discussions about avoiding
> >> the TLB flush on unmap, one could wonder what the threshold of that
> >> improvement would be. Chances are that this is the least of our concerns in
> >> a fully populated host (or an oversubscribed one). Would it be worth
> >> experimenting with increasing the threshold of the copy beyond the header?
> > 
> > +1
> Cool!
> 
> >> \clearpage
> >>
> >> # References
> >>
> >> [0] 
> >> http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
> >>
> >> [1] 
> >> https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
> >>
> >> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
> >>
> >> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
> >>
> >> [4]
> >> http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
> >>
> >> [5] 
> >> http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
> >>
> >> [6] 
> >> http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
> >>
> >> # History
> >>
> >> A table of changes to the document, in chronological order.
> >>
> >> ------------------------------------------------------------------------
> >> Date       Revision Version  Notes
> >> ---------- -------- -------- -------------------------------------------
> >> 2016-12-14 1        Xen 4.9  Initial version.
> >> ---------- -------- -------- -------------------------------------------
> >>
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

