Xen project Mailing List

Re: Improving the network-stack performance over Xen

To: Anil Madhavapeddy <anil@xxxxxxxxxx>

From: Dimosthenis Pediaditakis <dimosthenis.pediaditakis@xxxxxxxxxxxx>

Date: Fri, 04 Oct 2013 12:07:57 +0100

Cc: Mirage List <cl-mirage@xxxxxxxxxxxxxxx>, David Scott <scott.dj@xxxxxxxxx>

List-id: MirageOS development <cl-mirage.lists.cam.ac.uk>

As far as network stack performance, domain scheduling and console
operation go, I haven't observed any issues so far.

D.

On 04/10/13 10:57, Anil Madhavapeddy wrote:

Closing the loop on this; is Dave's Activations patch considered
safe-to-merge now?

-anil

On Tue, Sep 24, 2013 at 04:29:54PM +0100, Dimosthenis Pediaditakis wrote:

Hi again,
as promised, here are some data from a DomU-2-DomU iperf flow,
pushing 0.8Gbits in 0.38sec (~2.1Gbps):
http://www.cl.cam.ac.uk/~dp463/files/mirageNetData.tar.gz

Note that I've used a txqueuelen of 500 for each VIF in dom0.
Also, with TCP debug disabled, speeds reach >= 2.5Gbps.

Everything seems smoother. I didn't observe any odd TX delay spikes
caused by netif blocking.

Regards,
D.



On 24/09/13 02:36, Dimosthenis Pediaditakis wrote:

Hi David,
sorry for the absence. I attended a Workshop on SDN last week, and
today was quite a busy day.
I had a look in the interrupt branch of your mirage-platform repo,
cloned it and ran a few experiments.
The speeds I got were consistently between 2.45-2.6Gbps on my
machine (i7-3770, dual ch. DDR3 1600), which is a very good
number.
Unfortunately, I didn't have time to stress test it further
or generate any TCP stack plots. This is the top item on my TODO
list.

I also went briefly through the additions/modifications in
hypervisor.c, main.c, activations.ml, netif.ml and main.ml.
It seems that SCHEDOP_block + no-op handling + evtchn_poll +
VIRQ_TIMER-bind, along with the rewriting of Activations (fix for
the silently dropped events), have done the trick.

A couple of questions:
- In file hypervisor.c, what is the purpose of
"force_evtchn_callback"? I've only seen it being invoked via
"__sti()" and "__restore_flags()".

- In the updated design, when an event is received, then:
    SCHEDOP_block returns,
    event_poll is then invoked, and
    finally Main.aux() is called.
  Activations.run() is invoked in Main.aux() if no threads are
scheduled, and subsequently the domain is blocked again (until a
timer interrupt, or reception of another event).  My question is:
why don't we re-check the state of t right after
"Activations.run()"?  For example, if packets are waiting to be sent
and netif gets unblocked, why do we block the domain again
straight away?

Also, thanks for the credits in your updates :-)

D.



On 19/09/13 22:41, David Scott wrote:

Hi Dimos,

I've created a new patch set based on a mix of your ideas and mine:

https://github.com/mirage/mirage-platform/pull/58

I've proposed switching to SCHEDOP_block with interrupts
enabled. Unlike in regular Mini-OS I don't think we need to do
anything in the hypervisor_callback, because we already have
code to poll the evtchn pending bits in evtchn_poll-- so we're a
bit of a hybrid: interrupts on for wakeups but the whole OS is
still based around a select/poll-style loop. I've left all event
channels unmasked and used the evtchn_upcall_mask to turn on/off
event delivery globally. I've revamped the OCaml Activations
interface to remove one source of missing events.
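
To illustrate the select/poll-style scan of pending bits described above, here is a minimal C sketch. It assumes a simplified, flat pending bitmap; the names `mock_shared_info`, `mock_send_event` and `mock_evtchn_poll` are hypothetical and are not the actual Mini-OS or Mirage code. A real implementation would also need atomic bit operations, which are left out here.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_PORTS      4096   /* 64-bit guest limit quoted in this thread */
#define BITS_PER_WORD 64
#define NR_WORDS      (NR_PORTS / BITS_PER_WORD)

/* Hypothetical stand-in for the shared pending bitmap (not Xen's real layout). */
struct mock_shared_info {
    uint64_t evtchn_pending[NR_WORDS];
};

/* Mark a port as pending, as the hypervisor would when an event is sent.
 * The caller must pass port < NR_PORTS. */
static void mock_send_event(struct mock_shared_info *s, unsigned port)
{
    s->evtchn_pending[port / BITS_PER_WORD] |=
        (uint64_t)1 << (port % BITS_PER_WORD);
}

/* Scan the bitmap, clear each pending bit found and record its port number
 * in out[]. Returns the number of ports recorded (at most max). Bits that do
 * not fit in out[] stay pending for the next call. */
static size_t mock_evtchn_poll(struct mock_shared_info *s,
                               unsigned *out, size_t max)
{
    size_t n = 0;
    for (unsigned w = 0; w < NR_WORDS; w++) {
        uint64_t bits = s->evtchn_pending[w];
        for (unsigned b = 0; bits != 0 && n < max; b++, bits >>= 1) {
            if (bits & 1) {
                s->evtchn_pending[w] &= ~((uint64_t)1 << b);
                out[n++] = w * BITS_PER_WORD + b;
            }
        }
    }
    return n;
}
```

In this sketch the interrupt only has to wake the domain; all the actual work happens when the main loop calls the poll function afterwards.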

So far the code is working ok in my testing. I ran mirage-iperf
and am getting 1642002 KBit/sec on my test hardware -- I don't
know if this is considered good or bad! I ran your instrumented
version (thanks for the exhaustive instructions btw) and it drew
some pretty graphs, but I'm not enough of a TCP expert to
interpret them properly.

Could you give this a go in your test environment and let me
know what you think?

I'm extremely suspicious of the console code -- it shouldn't be
necessary to include a delay in the print loop; that's
definitely worth investigating.

Cheers,
Dave


On Thu, Sep 19, 2013 at 11:02 AM, David Scott
<scott.dj@xxxxxxxxx> wrote:

    Hi Dimos,

    Thanks for looking into this! Thinking about it, I think we have
    several problems.

    1. I think the Activations.wait API is difficult to use / unsafe:

    (* Block waiting for an event to occur on a particular port *)
    let wait evtchn =
      if Eventchn.is_valid evtchn then begin
        let port = Eventchn.to_int evtchn in
        let th, u = Lwt.task () in
        let node = Lwt_sequence.add_l u event_cb.(port) in
        Lwt.on_cancel th (fun _ -> Lwt_sequence.remove node);
        th
      end else Lwt.fail Generation.Invalid

    When you call Activations.wait you are added to a 'sequence'
    (like a list) of people to wake up when the next event occurs. A
    typical driver would call Activations.wait in a loop, block for
    an event, wake up, signal some other thread to do work and then
    block again. However if the thread running the loop blocks
    anywhere else, then the thread will not be added to the sequence
    straight away and any notifications that arrive during the gap
    will be dropped. I noticed this when debugging my block backend
    implementation. I think netif has this problem:

    let listen nf fn =
      (* Listen for the activation to poll the interface *)
      let rec poll_t t =
        lwt () = refill_requests t in
        (* ^^^ blocks here, can miss events *)

        rx_poll t fn;
        tx_poll t;
        (* Evtchn.notify nf.t.evtchn; *)
        lwt new_t =
          try_lwt
            Activations.wait t.evtchn >> return t
          with
          | Generation.Invalid ->
            Console.log_s "Waiting for plug in listen" >>
            wait_for_plug nf >>
            Console.log_s "Done..." >>
            return nf.t
        in poll_t new_t
      in
      poll_t nf.t


    I think we should change the semantics of Activations.wait to be
    more level-triggered rather than edge-triggered (i.e. more like
    the underlying behaviour of xen) like this:

     type event
     (** a particular event *)

     val wait: Evtchn.t -> event option -> event Lwt.t
     (** [wait evtchn None] returns [e] where [e] is the latest event.
         [wait evtchn (Some e)] returns [e'] where [e'] is a later
         event than [e] *)

    In the implementation we could have "type event = int" and
    maintain a counter of "number of times this event has been
    signalled". When you call Activations.wait, you would pass in the
    number of the last event you saw, and the thread would block
    until a new event is available. This way you wouldn't have to be
    registered in the table when the event arrives.
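
A minimal C sketch of the counter idea in the paragraph above, showing only the non-blocking core of the check. The names `event_count`, `signal_port` and `poll_newer` are hypothetical and do not come from the Mirage sources; the real interface would be the OCaml one proposed above, with the "nothing new" case turned into a blocked Lwt thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PORTS 4096

/* Hypothetical per-port counters: "number of times this event has been
 * signalled". Static storage, so every counter starts at zero. */
static uint32_t event_count[NR_PORTS];

/* Called whenever an event arrives on a port (port < NR_PORTS). */
static void signal_port(unsigned port)
{
    event_count[port]++;
}

/* Non-blocking core of the proposed wait.
 * have_last == false corresponds to passing None: the latest event number
 * is returned immediately. Otherwise the call succeeds only if the counter
 * has moved since last_seen. On success *latest is updated and true is
 * returned; on failure *latest is left untouched and the caller would block. */
static bool poll_newer(unsigned port, bool have_last, uint32_t last_seen,
                       uint32_t *latest)
{
    uint32_t now = event_count[port];
    if (have_last && now == last_seen)
        return false;
    *latest = now;
    return true;
}
```

Because the caller hands back the last number it saw, events signalled while the driver was blocked elsewhere still show up as a changed counter, so nothing needs to be registered in a table at the moment the event arrives.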

    2. SCHEDOP_poll has a low (arbitrary) nr_ports limit

    
    http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/common/schedule.c;h=a8398bd9ed4827564bed4346e1fdfbb98ec5907e;hb=c5e9596cd095e3b96a090002d9e6629a980904eb#l712

    static long do_poll(struct sched_poll *sched_poll)
    {
        struct vcpu   *v = current;
        struct domain *d = v->domain;
        evtchn_port_t  port;
        long           rc;
        unsigned int   i;

        /* Fairly arbitrary limit. */
        if ( sched_poll->nr_ports > 128 )
            return -EINVAL;

    The total number of available event channels for a 64-bit guest
    is 4096 using the current ABI (a new interface is under
    development which allows even more). The limit of 128 is probably
    imposed to limit the amount of time the hypercall takes, to avoid
    hitting scalability limits like you do in userspace with select().

    One of the use-cases I'd like to use Mirage for is to run backend
    services (like xenstore or blkback) for all the domains on a
    host. This requires at least one event channel per client domain.
    We routinely run ~300 VMs/host, so the 128 limit is too small.
    Plus a quick grep around Linux shows that it doesn't use
    SCHEDOP_poll very much-- I think we should focus on using the
    hypercalls that other OSes are using, for maximum chance of success.

    So I think we should switch from select()-like behaviour using
    SCHEDOP_poll to interrupt-based delivery using SCHEDOP_block. I
    note that upstream mini-os does this by default too. I'll take a
    look at this.

    Cheers,
    Dave



    On Fri, Sep 13, 2013 at 11:50 PM, Dimosthenis Pediaditakis
    <dimosthenis.pediaditakis@xxxxxxxxxxxx> wrote:

        Hi all,
        Over the last few days I've been trying to pin down the
        performance issues of the Mirage network stack when running
        over Xen.
        When trying to push net-direct to its limits, transmissions
        randomly stall for anywhere between 0.1sec and 4sec
        (especially at the sender).

        After some experimentation, I believe that those time-outs
        occur because netif is not (always) notified (via
        Activations) about freed TX-ring slots.
        It seems that these events (intermittently) don't reach the
        guest domain's front-end driver.

        AFAIK Activations.wait() currently blocks waiting for an
        event on the port belonging to the event channel for the netif.
        This event is delivered to Activations.run via Main.run.aux,
        which is invoked via the callback in app_main() of
        runtime/kernel/main.c.
        The problem I observed was that, when using "SCHEDOP_poll"
        without masking the intended events, the hypervisor didn't
        "wake up" the blocked domain upon new event availability.
        The requirement for event-masking when using "SCHEDOP_poll"
        is also mentioned in the Xen documentation.

        I've produced a patch that seems to fix the above erratic
        behavior.
        Now I am able to consistently achieve higher speeds (up to
        2.75Gbps DomU-to-DomU). Please have a look at my repo:
        https://github.com/dimosped/mirage-platform
        It helps to use large enough txqueuelen values for
        your VIFs, as the current TCP implementation does not cope
        well with losses at high data rates. The default size on my
        system was only 32.

        I have also modified mirage-net-direct by adding per-flow
        TCP debug logging. This has helped me to better understand
        and pin down the problem.
        You can grab the modified sources here:
        https://github.com/dimosped/mirage-net
        Be aware that logging large volumes of data for a TCP flow
        requires a correspondingly large amount of memory.
        Nevertheless, it only barely affects performance.

        The iperf benchmark sources can be found here:
        https://github.com/dimosped/iperf-mirage
        I've included as much info as possible in the README file.
        This should be sufficient to get you started and replicate my
        experiments.

        In the iperf-mirage repo there is also a Python tool, which
        you can use to automatically generate plots based on the
        collected TCP debug info (I also include a sample dataset in
        data/):
        https://github.com/dimosped/iperf-mirage/tree/master/tools/MirageTcpVis
        For really large datasets, the script might be slow. I need
        to switch to using NumPy arrays at some point...

        Please keep in mind that I am a newbie in Xen/Mirage so your
        comments/input are more than welcome.

        Regards,
        Dimos




        ------------------------------------------------
           MORE TECHNICAL DETAILS
        ------------------------------------------------


        -----------------------------------------------------------------
        === How (I think) Mirage and XEN scheduling works ===
        -----------------------------------------------------------------

        - When Netif receives a writev request, it checks if the TX
          ring has enough empty space (for the producer) for the data.
            - If there is not enough space, it block-waits (via
              Activations.wait) for an event on the port mapped to the
              netif (and bound to the backend driver).
            - Otherwise it pushes the request.
        - Activations are notified (via run) from "aux ()" in
          Main.run. Once notified, the waiting netif can proceed:
          it checks the ring again for free space, writes a new
          request, and sends an event to the backend.
        - Main.run.aux is registered as a callback (under the name
          "OS.Main.run") and is invoked in xen/runtime/kernel/main.c
          (in the app_main() loop). As long as the Mirage guest domain
          is scheduled, this loop keeps running.
        - However, in Main.run.aux, the Mirage guest domain is
          blocked via "block_domain timeout" if the main thread has no
          task to perform.
        - In turn, "block_domain" invokes caml_block_domain(), found
          in xen/runtime/kernel/main.c, which issues a
          "HYPERVISOR_sched_op(SCHEDOP_poll, &sched_poll);" hypercall.

        -------------------------------------
        === Polling mode issue ===
        -------------------------------------
        In my opinion, and based on debug information, it seems that
        the problem is that Mirage uses "SCHEDOP_poll" without
        masking the event channels.
        The XEN documentation clearly states that with "SCHEDOP_poll"
        the domain yields until either
          a) an event is pending on the polled channels, or
          b) the timeout (given in nanoseconds, as an absolute system
             time rather than a duration) is reached.
        It also states that SCHEDOP_poll can only be executed
        when the guest has delivery of events disabled.
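
Since the timeout is described above as an absolute system time in nanoseconds, a caller that thinks in terms of "sleep for at most N ns" has to convert first. A minimal C sketch of that conversion follows; `absolute_deadline_ns` is a hypothetical helper, not a Xen or Mirage function, and the saturating behaviour is an assumption made here to avoid wrap-around.

```c
#include <stdint.h>

/* Convert a relative wait (in ns) into an absolute deadline (in ns of
 * system time). Saturates at UINT64_MAX instead of wrapping around. */
static uint64_t absolute_deadline_ns(uint64_t now_ns, uint64_t relative_ns)
{
    if (relative_ns > UINT64_MAX - now_ns)
        return UINT64_MAX;
    return now_ns + relative_ns;
}
```

Passing a plain duration where an absolute time is expected would give a deadline that is already far in the past, so the poll would return immediately instead of sleeping.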

        In Mirage, netif events are not masked and therefore they
        never "wake up" the guest domain.
        The guest only wakes up whenever a thread is scheduled to
        wake up in Time.SleepQueue (e.g. a TCP timer).
        Once the guest is scheduled again, it completes any
        outstanding tasks, sends any pending packets, and sleeps
        again whenever a) the TX ring gets full, or b) the
        hypervisor [...] it.
        To further support the above, whenever I press buttons via
        XEN-console while the mirage-sender is running, the execution
        completes faster.

        ----------------
        === Fix ===
        ----------------
        There are multiple ways to mask events (e.g. at VCPU level,
        event level etc).
        As a quick hack I replaced "Eventchn.unmask h evtchn;" in
        Netif.plug_inner with "Eventchn.mask h evtchn;" (which I had
        to create, both in Eventchn and as a stub in
        xen/runtime/kernel/eventchn_stubs.c).
        See:
        
        https://github.com/dimosped/mirage-platform/commit/6d4d3f0403497f07fde4db6f4cb63665a8bf8e26








