
Re: [Xen-devel] [RFC] netif: staging grants for requests



Hey Stefano,

Thanks a lot for the review and the comments!!

On 01/04/2017 07:40 PM, Stefano Stabellini wrote:
> On Wed, 14 Dec 2016, Joao Martins wrote:
>> # Proposed Extension
>>
>> ## Terminology
>>
>> `data gref` refers to the reusable/staging grants, whereas `gref` are the
>> ones that are granted/revoked for each packet being sent. `command ring`
>> refers to the current ring, whereas `data list` refers to the one proposed
>> below.
>>
>> ## Xenstore
>>
>> "feature-staging-grants" is a capability to allow the use of a set of
>> recycled buffers permanently mapped at the device setup. If
>> advertised, the frontend will grant a region equivalent to ```maximum number
>> of slots per ring * size of buffer``` where:
>>
>>  * `maximum number of slots per ring` is the number of request structure
>>    that can be fit within the command ring. By default a request
>>    structure consumes 16 bytes. The first 64 bytes of a ring are used by
>>    producer and consumer indices.
> 
> This means that by default the number of slots is (4096-64)/16 = 252,
> correct?
That would be correct, but I made a few mistakes in the writing, despite a
following paragraph mentioning the correct ring size for both TX and RX.
The ring slot size is determined by the size of the request or response (being a
union of request and response structures). This means the ring slot would be
either 8 octets for the RX ring or 12 octets for the TX ring, which in effect
translates to:

RX := (4096 - 64) / 8 = 504
TX := (4096 - 64) / 12 = 336

And as Wei correctly mentions, both values are rounded down to a power of 2,
which results in 256 slots for each ring. I will fix this in the spec.

>>  * `size of buffer` is the maximum portion of the packet the backend (and
>>    frontend) have negotiated, which will be put in each slot of the
>>    `data list`.
>>
>> Which means that for 256 possible packets in the ring with a chosen size of
>> 256 bytes, the amount of memory to be permanently granted is a 64KB region,
>> i.e. 16 grefs.
>>
>> The lack of "feature-staging-grants" or a value of 0 means that it's not
>> supported. If the feature is present then a second entry "staging-grants-sizes"
>> must exist, containing the sizes that can be used per slot. To avoid the
>> frontend clobbering it with various different values, we limit it to a fixed
>> set of semicolon-delimited ones.
>>
>> The allowed values are implementation specific. Examples of good values
>> include: 256 (protocol/header region), 2048 (fits 1 MSS, which commonly sits
>> in the linear part of linux packets), 4096 (grant per packet if feature-sg=0,
>> for DPDK/XDP/netmap buffers) and 65535 (maximum packet size, i.e.
>> ```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE``` for feature-sg=1).
> 
> I am not following. So far we have been talking about two values:
> - maximum number of slots per ring
> - size of buffer
> We have also discussed the number of bytes used for indexes at the
> beginning of the command ring, which is 64 bytes.
Right, but those 64 bytes were only mentioned to describe the calculation of
the number of slots.

> 
> But now you are introducing a few other values:
> - protocol/header region
> - fits 1 MSS which is common to be in the linear part of linux packets
> - grant per packet
> - maximum packet size (I take this is the same as size of buffer)
> 
> Could you please described what these values represent in more details
> and how they relate to the two previous values?
The values I give above are examples/recommendations of valid "size of buffer"
values. That, multiplied by the number of slots, gives how much memory is
granted by the frontend.

Though I want to be able to say that the backend will copy up to N bytes into
these staging grants (or data grefs, per the terminology used in the doc),
before it resorts to grant ops. Hence a value of 65536 is the same as 4096 in
terms of the number of data grefs provided, but the backend will just copy less
from/to the staging area, leaving the rest to be used with the command ring
grefs (grant map/unmap/etc). Does this answer your question?

I went with the simple approach to start with, but I also thought of having a
small descriptor describing how much of this staging area is used per packet,
instead of a fixed negotiated value. But it would leave me with only a length
field. Unless we could have a header with the most commonly used extra info
slot (GSO) as part of this descriptor too, and save 1 slot per packet for
GSO/GRO packets.

>> The individual size covered per entry by the frontend is set through
>> ```tx-data-len``` or ```rx-data-len```, which contain the maximum amount of
>> data for a single packet. Chosen values of "tx-data-len" and "rx-data-len"
>> are asymmetrical (hence can be different between TX and RX) and are validated
>> against this list of valid sizes. For their use see the
>> [datapath](#datapath-changes) section further below.
>>
>>      /local/domain/1/device/vif/0/feature-staging-grants = "1"
>>      /local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"
> 
> I don't get what the 4 numbers mean. Are they possible supported options
> for "tx-data-len" and "rx-data-len", which in turn are the "size of
> buffer" as described above?
Exactly!

> If so, shouldn't they be written by the backend to backend paths? Why
> only 4 options? Could there be more? What is the max number of allowed
> option? And what is the minimum?
The path is wrong, it should be the backend path. The backend advertises this
set of values that can be used, and it can advertise as many values as it
supports (separated by semicolons). The values I selected were simply picked
based on what I explained in the previous paragraph. The frontend will then
pick one of those.

/local/domain/0/backend/vif/1/0/feature-staging-grants = "1"
/local/domain/0/backend/vif/1/0/staging-grants-sizes = "128;256;2048;4096"

I could be less restrictive and instead turn these sizes into a tuple with
"min;max" or two separate xenstore entries if folks prefer.

And the frontend just advertises a feature entry like this:

/local/domain/1/device/vif/0/feature-staging-grants = "1"

Other alternatives on the table were to specify this region as part of the
request/response, and to turn the TX/RX rings into multi-page ones. But I found
that a bit limiting, for example if we wanted to allow a guest to use a fixed
set of pages (DPDK, netmap, XDP), which would result in more than 256
"{tx,rx}-ring-ref" entries.

>>      /local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
>>      /local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
>>      /local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
>>      /local/domain/1/device/vif/0/queue-0/rx-data-len = "256"
>>
>> The ```tx-data-ref%u``` and ```rx-data-ref%u``` describe the list of grants
>> to be used for each ring. The example above exemplifies a list composed of a
>> single
> 
> For what ring, the data ring?
The command ring. There is no data ring, but a list of grefs given to the
backend, hence I refer to it here as the "data list". I also refer to the
"data list" entries as "staging/data grefs". Plus there's one list for TX and
another for RX.

> 
>> page whereas multiple pages would be as:
>>
>>      tx-data-ref0 = <data-ref-tx0>
>>      tx-data-ref1 = <data-ref-tx1>
>>      rx-data-ref0 = <data-ref-rx0>
>>      rx-data-ref1 = <data-ref-rx1>
>>
>> Each slot in the `data-ref` list is formatted as following:
>>
>>             0     1     2     3     4     5     6     7  octet
>>          +-----+-----+-----+-----+-----+-----+-----+-----+
>>          | id        | gref                  | offset    |
>>          +-----+-----+-----+-----+-----+-----+-----+-----+
>>
>>      id: the frame identifier.
> 
> What is a "frame identifier"?
It's the buffer we are sharing, the one we memcpy into for the backend to use.
I should probably use "buffer" instead. (See further below about its use.)

>>      gref: grant reference to the frame.
>>      offset: offset within the frame.
> 
> I assumed that each grant used for the data ring would be used in its
> entirety. Does it really make sense to create a ring using portions of
> pages? Given that we can only share memory at a page granularity,
My main argument for having portions of pages is for guests that only want to
use this region to copy the linear part of the packet, and hence have a much
smaller footprint per ring (with just 16 grefs). Otherwise we would be sharing
256 grants while only using 1/16 of each. In the rest of the spec I take this
as the general case, where we memcpy only a small portion of the packet.

> I don't think it makes sense: the rest of the page would still be
> accessible from the backend, which could compromise its content. I would
> share a list of pages that can be mapped contiguously to create a data
> ring.
Note that the permissions of the TX grefs are read-only, whereas the RX ring is
read/write, so I inherit the same property here for the data grefs too. I
wouldn't mind having a grant per packet; is your concern coming from the fact
that these RX data grefs are constantly mapped, hence the backend can mangle
them at any time, as opposed to RX grants, which are still revoked when given
back to the frontend? But then nothing currently stops a backend from writing
to these pages, irrespective of the approach (with or without staging grants),
until they are copied/revoked; the frontend would always copy from the staging
region into a guest-only visible region.

>> This list has twice as max slots as would have `tx-ring-ref` or `rx-ring-ref`
>> respectively, and it is set up at device creation and removed at device
>> teardown, same as the command ring entries. This way we keep up with ring 
>> size
>> changes as it it expected to be in the command ring. A hypothetical 
>> multi-page
>                 ^ repetition
OK, will remove it.

>> command ring would increase number of slots and thus this data list would 
>> grow
>> accordingly. The list is terminated by an entry whose ```gref``` field is 0;
>> the other fields of that entry are ignored.
> 
> Could you please explain a bit more how the command ring increases the
> number of slot and why we need to allocate twice the number of required
> slots at the beginning?
"This list has twice as max slots as" is meant to say that the list has twice
the capacity of the command ring. A list entry is 8 octets, so one page fits
512 entries. The data-ref list is only filled up to the command ring size,
which means only half of it would be filled currently. I will adjust this
sentence to make it clearer.

Multi-page rings aren't supported by netback/netfront (yet), but as an example
we would have one `tx-data-ref0` to represent the command ring slots of
`tx-ring-ref0` and `tx-ring-ref1`. I noticed that, to not waste grants, I need
to assume the ```data-ref``` is full in the absence of a terminating entry;
otherwise I would end up needing `tx-data-ref1` just to hold a terminating
entry. I will add a sentence mentioning this.

>> ## Datapath Changes
>>
>> The packet is copied to/from the mapped data list grefs of up to
>> `tx-data-len` or `rx-data-len`. This means that the buffer (referenced by
>> `gref` from within the `struct netif_[tx|rx]_request`) has the same data up
>> to `size`. In other words, *gref[0->size] contents is replicated on the
>> `data-ring` at `idx`. Hence netback should ignore up to `size` of the `gref`
>> when processing, as the `data-ring` has the contents of it.
> 
> My lack of netfront/netback knowledge might show, but I am not
> following. What do gref, size and idx represent in this context? Please
> explain the new workflow in more details.
[ Please ignore the part saying "replicated on the `data-ring` at `idx`", since
there is no ring. It should instead be `data list` buffer identified by `struct
netif_[tx|rx]_request` `id` field. ]

The new workflow is not too different from the old one: (on TX) we *first*
memcpy the packet to the staging grants region (or `data gref`, as I call it in
this doc) of up to the negotiated `{tx,rx}-data-len` (or `size` specified in
the command ring, if smaller). The `data gref` to be used is identified by the
`id` field in the netif_[tx|rx]_request struct. The rest proceeds as before,
that is, granting the packet (within XEN_PAGE_SIZE boundaries) and setting
`gref` and `offset` accordingly in the rest of the command ring slots. Let me
improve this paragraph to make it clearer.

The other difference (see the Zerocopy section) is that if the frontend sets
the NETTXF_staging_buffer flag, then the `gref` field in the
netif_[tx|rx]_request struct will hold the `data gref` id (the `id` field
meaning "frame identifier" that you asked about earlier in the doc). This is to
allow a frontend to specify an RX `data gref` to be used in the TX ring without
involving any additional copy. I haven't PoC-ed this zerocopy part yet, as it
only covers two specific scenarios (that is, guest XDP_TX or a DPDK frontend).

>> Values bigger than the 4096 page/grant boundary only have special meaning
>> for the backend, namely how much is required to be copied/pulled across the
>> whole packet (which can be composed of multiple slots). Hence (e.g.) a value
>> of 65536 vs 4096 will have the same data list size, but the latter value
>> would lead to copying/pulling only one gref in the whole packet, whereas the
>> former will be a copy-only interface for all slots.
>>
>> ## Buffer Identification and Flags
>>
>> The data list ids must start from 0 and are global and contiguous (across
>> both lists). A data slot is identified by the ring slot ```id``` field. The
>> resultant data gref id to be used in the RX data list is computed by
>> subtracting the total number of tx data grefs from the `struct
>> netif_[tx|rx]_request` ```id```. Example of the lists layout below:
> 
> What is the "resultant data gref id"? What is the "RX data list"?
> Please explain in more details.
"Resultant data gref id" corresponds to the case where we set the
NETTXF_staging_buffer flag, in which case we set the command ring `gref` to the
`id` in the data list entry (see the diagram below). "RX data list" is the list
described by `rx-data-ref`. I should have put that in a separate paragraph, as
this is making things more confusing.

>> ```
>>  [tx-data-ref-0, tx-data-len=256]
>>  { .id = 0, gref = 512, .offset = 0x0   }
>>  { .id = 1, gref = 512, .offset = 0x100 }
>>  { .id = 2, gref = 512, .offset = 0x200 }
>>  ...
>>  { .id = 256, gref = 0, .offset = 0x0   }
>>
>>  [rx-data-ref-0, rx-data-len=4096]
>>  { .id = 256, gref = 529, .offset = 0x0 }
>>  { .id = 257, gref = 530, .offset = 0x0 }
>>  { .id = 258, gref = 531, .offset = 0x0 }
>>  ...
>>  { .id = 512, gref = 0,   .offset = 0x0 }
>> ```
>>
>> Permissions of RX data grefs are read-write whereas TX data grefs are
>> read-only.
>>
>> ## Zerocopy
>>
>> The frontend may wish to provide a bigger RX list than TX, and use RX
>> buffers for transmission in a zerocopy fashion for guests mainly doing
>> forwarding. In such cases the backend sets the NETTXF_staging_buffer flag in
>> the ```netif_tx_request``` flags field, such that the `gref` field instead
>> designates the `id` of a data gref.
>>
>> This is only valid when packets are solely described by the staging grants
>> for the slot packet size being written, or when [tx|rx]-data-len is 4096
>> (for feature-sg 0) or 65535 (for feature-sg 1) and thus no new `gref` is
>> needed for describing the packet payload.
>>
>> \clearpage
>>
>> ## Performance
>>
>> Numbers that give a rough idea on the performance benefits of this extension.
>> These are Guest <-> Dom0 which test the communication between backend and
>> frontend, excluding other bottlenecks in the datapath (the software switch).
>>
>> ```
>> # grant copy
>> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
>> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
>> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
>> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
>>
>> # grant copy + grant map (see next subsection)
>> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
>> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
>>
>> # drop at the guest network stack
>> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
>> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
>> ```
>>
>> With this extension:
>> ```
>> # memcpy
>> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
>> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
>> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
>> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
>>
>> # memcpy + grant map (see next subsection)
>> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
>> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
>>
>> # (drop at the guest network stack)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
>>
>> # (drop with guest XDP_DROP prog)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
>> ```
> 
> Very nice!
:D

>> Latency measurements (netperf TCP_RR request size 1 and response size 1):
>> ```
>> 24 KTps vs 28 KTps
>> 39 KTps vs 50 KTps (with kernel busy poll)
>> ```
>>
>> TCP bulk transfer measurements aren't showing a representative increase in
>> maximum throughput (sometimes ~10%), but rather fewer retransmissions and
>> more stability. This is probably because of a slight decrease in RTT (i.e.
>> the receiver acknowledging data quicker). I am currently exploring other
>> data list sizes and will probably have a better idea of the effects of this.
> 
> This is strange. By TCP Bulk transfers, do you mean iperf?
Yeap.

> From my pvcalls experience, I would expect a great improvement there too.
Notice that here we only memcpy a small portion of the packet (256 bytes max,
not all of it). Past experience (in an old proposal I made) also showed me an
improvement like yours. I am a bit hesitant about memcpying all the way when
other things are involved in the workload; but then I currently don't see many
other alternatives for lessening the grants overhead. In the meantime I'll
gather more data points and get a clearer idea of the ratio of improvement vs
the compromise.

>> ## Linux grant copy vs map remark
>>
>> Based on the numbers above there's a sudden 2x performance drop when we
>> switch from grant copy to also grant mapping the `gref`: 1 295 001 vs
>> 577 782 pps for 256 and 260 byte packets respectively. This is all the more
>> visible when replacing the grant copy with memcpy in this extension
>> (3 248 392 vs 588 428). While there have been discussions about avoiding
>> the TLB flush on unmap, one could wonder what the threshold of that
>> improvement would be. Chances are that this is the least of our concerns in
>> a fully populated host (or an oversubscribed one). Would it be worth
>> experimenting with increasing the threshold of the copy beyond the header?
> 
> +1
Cool!

>> \clearpage
>>
>> # References
>>
>> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
>>
>> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
>>
>> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
>>
>> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>>
>> [4] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
>>
>> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
>>
>> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
>>
>> # History
>>
>> A table of changes to the document, in chronological order.
>>
>> ------------------------------------------------------------------------
>> Date       Revision Version  Notes
>> ---------- -------- -------- -------------------------------------------
>> 2016-12-14 1        Xen 4.9  Initial version.
>> ---------- -------- -------- -------------------------------------------
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

