[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [UNIKRAFT PATCH v4 10/12] plat/xen/drivers/net: Add transmit operation

To: Costin Lupu <costin.lupu@xxxxxxxxx>, Sharan Santhanam <sharan.santhanam@xxxxxxxxx>, <minios-devel@xxxxxxxxxxxxx>
From: Simon Kuenzer <simon.kuenzer@xxxxxxxxx>
Date: Fri, 30 Oct 2020 20:11:16 +0100
Delivery-date: Fri, 30 Oct 2020 19:11:27 +0000
List-id: Mini-os development list <minios-devel.lists.xenproject.org>

Hi Costin,

since I worked on the lwip adoption, I found something in your txfunction. Please see it inline.


Thanks,

Simon

On 23.10.20 17:54, Costin Lupu wrote:

Hi Sharan,

Please see inline.

On 10/22/20 4:38 PM, Sharan Santhanam wrote:

Hello Costin,

Please find the review comments inline:

Thanks & Regards

Sharan

On 8/21/20 3:04 PM, Simon Kuenzer wrote:

On 21.08.20 13:24, Costin Lupu wrote:

On 8/21/20 1:49 PM, Simon Kuenzer wrote:

On 21.08.20 11:32, Costin Lupu wrote:

On 8/20/20 6:49 PM, Simon Kuenzer wrote:

On 13.08.20 10:53, Costin Lupu wrote:

Whenever a packet is transmitted, the request describing it is
written
in the
shared ring. The packet itself is written in a page referenced by
the
request
and shared, as well, with the backend. At the end of each transmit
operation, a
cleanup operation is performed free'ing the resources allocated
for the
previously transmitted packets.

Signed-off-by: Costin Lupu <costin.lupu@xxxxxxxxx>
Signed-off-by: Razvan Cojocaru <razvan.cojocaru93@xxxxxxxxx>
---
     plat/xen/drivers/net/netfront.c | 102
++++++++++++++++++++++++++++++++
     1 file changed, 102 insertions(+)

diff --git a/plat/xen/drivers/net/netfront.c
b/plat/xen/drivers/net/netfront.c
index 858d124a..f2b81329 100644
--- a/plat/xen/drivers/net/netfront.c
+++ b/plat/xen/drivers/net/netfront.c
@@ -39,6 +39,7 @@
     #include <uk/alloc.h>
     #include <uk/netdev_driver.h>
     #include <xen-x86/mm.h>
+#include <xen-x86/irq.h>
     #include <xenbus/xenbus.h>
     #include "netfront.h"
     #include "netfront_xb.h"
@@ -78,6 +79,106 @@ static uint16_t get_id_from_freelist(uint16_t
*freelist)
         return id;
     }
     +static int network_tx_buf_gc(struct uk_netdev_tx_queue *txq)
+{
+    RING_IDX prod, cons;
+    netif_tx_response_t *tx_rsp;
+    uint16_t id;
+    bool more_to_do;
+    int count = 0;
+
+    do {
+        prod = txq->ring.sring->rsp_prod;
+        rmb(); /* Ensure we see responses up to 'rp'. */
+
+        for (cons = txq->ring.rsp_cons; cons != prod; cons++) {
+            tx_rsp = RING_GET_RESPONSE(&txq->ring, cons);
+
+            if (tx_rsp->status == NETIF_RSP_NULL)
+                continue;
+
+            if (tx_rsp->status == NETIF_RSP_ERROR)
+                uk_pr_err("packet error\n");


"netdev%u: Transmission error on txq %u\n" ?


Ack.

+
+            id  = tx_rsp->id;
+            UK_ASSERT(id < NET_TX_RING_SIZE);
+
+            gnttab_end_access(txq->gref[id]);
+            txq->gref[id] = GRANT_INVALID_REF;
+
+            add_id_to_freelist(id, txq->freelist);
+            uk_semaphore_up(&txq->sem);

Shouldn't we free up the netbuf we receive from the ring?


Yeah, I wasn't aware of that either. In minios the tx buffers are reused
and therefore never freed.

For the next version I follow the approach in virtio net driver and keep
a reference to the netbuf in order to free it here.

+
+            count++;
+        }
+
+        txq->ring.rsp_cons = prod;
+
+ RING_FINAL_CHECK_FOR_RESPONSES(&txq->ring, more_to_do);
+    } while (more_to_do);
+
+    return count;
+}
+
+static int netfront_xmit(struct uk_netdev *n,
+        struct uk_netdev_tx_queue *txq,
+        struct uk_netbuf *pkt)
+{
+    struct netfront_dev *nfdev;
+    unsigned long flags;
+    uint16_t id;
+    RING_IDX req_prod;
+    netif_tx_request_t *tx_req;
+    int notify;
+    int status = 0, count;
+
+    UK_ASSERT(n != NULL);
+    UK_ASSERT(txq != NULL);
+    UK_ASSERT(pkt != NULL);
+    UK_ASSERT(pkt->len < PAGE_SIZE);
+
+    nfdev = to_netfront_dev(n);
+
+    /* get request id */
+    uk_semaphore_down(&txq->sem);
+    local_irq_save(flags);
+    id = get_id_from_freelist(txq->freelist);


What happens when we are out of space and the ring is full?
The xmit function should be non-blocking so that we can use it in
polling mode. We should return in such a case that the transmission
queue is currently full...


Ack.

+    local_irq_restore(flags);
+
+    /* get request */
+    req_prod = txq->ring.req_prod_pvt;
+    tx_req = RING_GET_REQUEST(&txq->ring, req_prod); > +
+    /* setup grant for buffer data */
+    txq->gref[id] = tx_req->gref =
+ gnttab_grant_access(nfdev->xendev->otherend_id,
+            virt_to_mfn(pkt->data), 1);
+    UK_ASSERT(tx_req->gref != GRANT_INVALID_REF);
+
+    tx_req->offset = 0;

I found that if you set the offset to the following, the driver doessupport sending of packets that are having a headroom.


tx_req->offset = (uint16_t) uk_netbuf_headroom(pkt);

At the function entrance I would also check that this is the onlysegment and that the pkt buffer is page aligned:


UK_ASSERT(!pkt->next);
UK_ASSERT(((unsigned long) pkt->buf & ~PAGE_MASK) == 0);

Instead of building the grant with the data pointer, you could use thebuf address instead. This one should be page aligned and the offset withheadroom to it would be correctly point to data.

Please also, double-check with series 1560:
https://patchwork.unikraft.org/project/unikraft/list/?series=1560

With this you have your assumptions for a zero-copy tx confirmed.

+    tx_req->size = (uint16_t) pkt->len;
+    tx_req->flags = 0;
+    tx_req->id = id;
+
+    txq->ring.req_prod_pvt = req_prod + 1;
+    wmb(); /* Ensure backend sees requests */
+
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&txq->ring, notify);
+    if (notify)
+        notify_remote_via_evtchn(txq->evtchn); > +
+    status |= UK_NETDEV_STATUS_SUCCESS;
+
+    /* some cleanup */
+    local_irq_save(flags);
+    count = network_tx_buf_gc(txq);


The clean-up should happen before enqueuing the given packet to the
transmit queue, it is the first thing of xmit(). Since we anyway
need to
do it and we do not have an advantage by delaying this operation.


Of course there is an advantage. You push the packets in order for
them
to be processed ASAP by the backend and deal with cleaning up
afterwards. It's a kind of parallelization. Otherwise you would delay
the transmissions.

We can also deal much better with full rings. Otherwise we
unnecessarily drop a packets for transmission although there would
have been space when clean-up finished. After calling cleaning you
can check if there is space on the ring and return if we run out of
it (use an `unlikely` to the condition so that we speed up the
successful sending case).


The only scenario when this would solve anything would be when you
have
the ring full and just before you decide to send a new packet the
backend starts processing your previous transmitted packets,
leaving you
at least one free slot. There are 2 unlikely events here: first, the
ring is full, and second, the backend starts to process again when the
ring gets full (for the latter event, it's unlikely that the backend
will wake up given that it couldn't process 256 transmitted packets,
where 256 is the current number of slots). And for this very unlikely
context you would introduce a delay for all the transmissions if we
followed your suggestion.

Anyway, both Linux and FreeBSD do the cleanup after pushing the
requests, so I don't think we have much more to say about this.


Hum... I double-checked with Intel DPDK's ixgbe (10 Gig) driver:

http://git.dpdk.org/dpdk-stable/tree/drivers/net/ixgbe/ixgbe_rxtx.c?h=20.02#n230



They clean the slots before sending. I derived my conclusion from
there.
However I noticed now that there is one minor difference: They do
both -
cleanup and enqueuing - before they notify the card about changes on
the
ring. In that case it does not add any delay if you swap the order of
cleanup and enqueuing. In your case this is probably different because
you can cleanup without notifying. Here, I agree with you that we add
unnecessary xmit delays by swapping the order.


ixgbe is a native driver, so there is no backend driver in that case.
It's a different scenario.


You could consider the card itself as a backend driver, but fine.
There is a ABI difference between ixgbe and netback.

However and in general, having a full ring is not an uncommon and
unlikely case. To my experience, this can already happen with a single
TCP connection if your network stack is feeding next packets fast
enough
and if the current TCP window allows to send them. It also depends on
when the driver domain is scheduled; on which CPU core your netfront
domain and the driver domain sit; and even on the underlying physical
device that might be busy with other traffic.


What I tried to say with the events probabilities explanation is that
you add delay for the common case in order to remove delay for the much
less probable corner case.


I got that and agreed.

Another thing we are having in mind for uknetdev - and this is probably
different to Linux/BSD's implementation - we plan to change the xmit
function to do batching (as DPDK does). Sharan could push virtio-net
performance further with this. You are able to reduce the number of
notifies to the backend because you enqueue multiple packets at once
and
due to locality of batching you can utilize CPU caches ways better.

What about this idea for netfront: We could do first a check for
available space and - if needed - clean-up just as many extra slots
that
you need to complete the transmission (for now this is anyways just one
slot). Then we proceed with enqueuing and do the cleanup of the rest of
the slots after we notified the backend. What do you think?


This will stay just as it is for the current patches. Feel free to
change anything after upstreaming these patches.


No, it won't stay like this. At least, you have to call clean-up when
the ring is full before you leave the non-blocking function. Otherwise
you end up in never getting space back on a full ring, but I think
this is clear.


Btw, I'm waiting for reviews on the other patches as well before I
send
the v5.


Yes, sure.

+    local_irq_restore(flags);
+
+    status |= likely(count > 0) ? UK_NETDEV_STATUS_MORE : 0x0;
+
+    return status;
+}
+
     static int netfront_rxq_enqueue(struct uk_netdev_rx_queue *rxq,
             struct uk_netbuf *netbuf)
     {
@@ -598,6 +699,7 @@ static int netfront_add_dev(struct xenbus_device
*xendev)
         }
           /* register netdev */
+    nfdev->netdev.tx_one = netfront_xmit;
         nfdev->netdev.ops = &netfront_ops;
         rc = uk_netdev_drv_register(&nfdev->netdev, drv_allocator,
DRIVER_NAME);
         if (rc < 0) {


Thanks,

Simon

References:
- Re: [UNIKRAFT PATCH v4 10/12] plat/xen/drivers/net: Add transmit operation
  - From: Sharan Santhanam
- Re: [UNIKRAFT PATCH v4 10/12] plat/xen/drivers/net: Add transmit operation
  - From: Costin Lupu

Prev by Date: [UNIKRAFT/LWIP PATCH v4 5/5] uknetdev: Retrieve `ioalign` from device
Next by Date: Re: [UNIKRAFT PATCH v4 06/12] plat/xen/drivers/net: Configure netfront tx queue
Previous by thread: Re: [UNIKRAFT PATCH v4 10/12] plat/xen/drivers/net: Add transmit operation
Next by thread: Re: [UNIKRAFT PATCH v4 11/12] plat/xen/drivers/net: Add receive operation
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.