
Re: Improving the network-stack performance over Xen



Closing the loop on this; is Dave's Activations patch considered
safe-to-merge now?

-anil

On Tue, Sep 24, 2013 at 04:29:54PM +0100, Dimosthenis Pediaditakis wrote:
> Hi again,
> as promised, here are some data from a DomU-2-DomU iperf flow,
> pushing 0.8Gbits in 0.38sec (~2.1Gbps):
> http://www.cl.cam.ac.uk/~dp463/files/mirageNetData.tar.gz
> 
> Note that I've used a txqueuelen of 500 for each VIF in dom0.
> Also, with TCP debug disabled, speeds reach >= 2.5Gbps.
> 
> Everything seems smoother. I didn't observe any odd TX delay spikes
> caused by netif blocking.
> 
> Regards,
> D.
> 
> 
> 
> On 24/09/13 02:36, Dimosthenis Pediaditakis wrote:
> >Hi David,
> >sorry for the absence. I attended a workshop on SDN last week, and
> >today was quite a busy day.
> >I had a look at the interrupt branch of your mirage-platform repo,
> >cloned it and ran a few experiments.
> >The speeds I got were consistently between 2.45 and 2.6 Gbps on my
> >machine (i7-3770, dual-channel DDR3-1600), which is a very good
> >number.
> >Unfortunately, I didn't have the time to further stress-test it and
> >generate any TCP stack plots. This is the top item on my TODO list.
> >
> >I also went briefly through the additions/modifications in
> >hypervisor.c, main.c, activations.ml, netif.ml and main.ml.
> >It seems that SCHEDOP_block + no-op handling + evtchn_poll +
> >the VIRQ_TIMER bind, along with the rewrite of Activations (the fix
> >for the silently dropped events), have done the trick.
> >
> >A couple of questions:
> > - In hypervisor.c, what is the purpose of
> >"force_evtchn_callback"? I've only seen it invoked via "__sti()"
> >and "__restore_flags()".
> >
> > - In the updated design, when an event is received:
> >    SCHEDOP_block returns,
> >    evtchn_poll is then invoked, and
> >    finally Main.aux() is called.
> >  Activations.run() is invoked in Main.aux() if no threads are
> >scheduled, and subsequently the domain is blocked again (until a
> >timer interrupt or the reception of another event). My question is:
> >why don't we re-check the state of t right after
> >"Activations.run()"? For example, if packets are waiting to be sent
> >and netif gets unblocked, why do we block the domain again right
> >away?
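> >
> >Something like the following is what I have in mind -- just a rough
> >sketch, not the real Main.run code (the helper names here, e.g.
> >work_pending and Time.select_next, are placeholders):
> >
> >  let rec aux t =
> >    (* wake up any threads waiting on event channels *)
> >    Activations.run ();
> >    (* re-check the main thread BEFORE blocking again:
> >       Activations.run may have just unblocked netif, which now has
> >       packets ready to push *)
> >    match Lwt.poll t with
> >    | Some x -> x                     (* main thread has finished *)
> >    | None ->
> >      if work_pending ()              (* e.g. netif became runnable *)
> >      then aux t                      (* keep running, don't block *)
> >      else begin
> >        block_domain (Time.select_next ());  (* nothing to do: block *)
> >        aux t
> >      end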
> >
> >Also, thanks for the credits in your updates :-)
> >
> >D.
> >
> >
> >
> >On 19/09/13 22:41, David Scott wrote:
> >>Hi Dimos,
> >>
> >>I've created a new patch set based on a mix of your ideas and mine:
> >>
> >>https://github.com/mirage/mirage-platform/pull/58
> >>
> >>I've proposed switching to SCHEDOP_block with interrupts
> >>enabled. Unlike in regular Mini-OS I don't think we need to do
> >>anything in the hypervisor_callback, because we already have
> >>code to poll the evtchn pending bits in evtchn_poll-- so we're a
> >>bit of a hybrid: interrupts on for wakeups but the whole OS is
> >>still based around a select/poll-style loop. I've left all event
> >>channels unmasked and used the event_upcall_mask to turn on/off
> >>event delivery globally. I've revamped the OCaml Activations
> >>interface to remove one source of missing events.
> >>
> >>So far the code is working ok in my testing. I ran mirage-iperf
> >>and am getting 1642002 KBit/sec on my test hardware -- I don't
> >>know if this is considered good or bad! I ran your instrumented
> >>version (thanks for the exhaustive instructions btw) and it drew
> >>some pretty graphs, but I'm not enough of a TCP expert to
> >>interpret them properly.
> >>
> >>Could you give this a go in your test environment and let me
> >>know what you think?
> >>
> >>I'm extremely suspicious of the console code -- it shouldn't be
> >>necessary to include a delay in the print loop; that's
> >>definitely worth investigating.
> >>
> >>Cheers,
> >>Dave
> >>
> >>
> >>On Thu, Sep 19, 2013 at 11:02 AM, David Scott
> >><scott.dj@xxxxxxxxx> wrote:
> >>
> >>    Hi Dimos,
> >>
> >>    Thanks for looking into this! Thinking about it, I think we have
> >>    several problems.
> >>
> >>    1. I think the Activations.wait API is difficult to use / unsafe:
> >>
> >>    (* Block waiting for an event to occur on a particular port *)
> >>    let wait evtchn =
> >>      if Eventchn.is_valid evtchn then begin
> >>              let port = Eventchn.to_int evtchn in
> >>              let th, u = Lwt.task () in
> >>              let node = Lwt_sequence.add_l u event_cb.(port) in
> >>              Lwt.on_cancel th (fun _ -> Lwt_sequence.remove node);
> >>              th
> >>      end else Lwt.fail Generation.Invalid
> >>
> >>    When you call Activations.wait you are added to a 'sequence'
> >>    (like a list) of people to wake up when the next event occurs. A
> >>    typical driver would call Activations.wait in a loop, block for
> >>    an event, wake up, signal some other thread to do work and then
> >>    block again. However if the thread running the loop blocks
> >>    anywhere else, then the thread will not be added to the sequence
> >>    straight away and any notifications that arrive during the gap
> >>    will be dropped. I noticed this when debugging my block backend
> >>    implementation. I think netif has this problem:
> >>
> >>    let listen nf fn =
> >>      (* Listen for the activation to poll the interface *)
> >>      let rec poll_t t =
> >>        lwt () = refill_requests t in
> >>        ^^^ blocks here, can miss events
> >>
> >>        rx_poll t fn;
> >>        tx_poll t;
> >>        (* Evtchn.notify nf.t.evtchn; *)
> >>        lwt new_t =
> >>          try_lwt
> >>            Activations.wait t.evtchn >> return t
> >>          with
> >>          | Generation.Invalid ->
> >>            Console.log_s "Waiting for plug in listen" >>
> >>            wait_for_plug nf >>
> >>            Console.log_s "Done..." >>
> >>            return nf.t
> >>        in poll_t new_t
> >>      in
> >>      poll_t nf.t
> >>
> >>
> >>    I think we should change the semantics of Activations.wait to be
> >>    more level-triggered rather than edge-triggered (i.e. more like
> >>    the underlying behaviour of xen) like this:
> >>
> >>     type event
> >>     (** a particular event *)
> >>
> >>     val wait: Evtchn.t -> event option -> event Lwt.t
> >>     (** [wait evtchn None] returns [e] where [e] is the latest
> >>    event.
> >>         [wait evtchn (Some e)] returns [e'] where [e'] is a later
> >>    event than [e] *)
> >>
> >>    In the implementation we could have "type event = int" and
> >>    maintain a counter of "number of times this event has been
> >>    signalled". When you call Activations.wait, you would pass in the
> >>    number of the last event you saw, and the thread would block
> >>    until a new event is available. This way you wouldn't have to be
> >>    registered in the table when the event arrives.
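> >>
> >>    A very rough sketch of what I mean (illustrative only -- the
> >>    wakeup plumbing and the Generation/Invalid handling are left
> >>    out, nr_events is just a placeholder for the size of the port
> >>    space, and Lwt is assumed to be opened with the usual syntax
> >>    extension, as in the existing code):
> >>
> >>     type event = int
> >>
> >>     let nr_events = 4096  (* current ABI limit for a 64-bit guest *)
> >>
> >>     (* per-port count of how many times the event has fired, plus a
> >>        condition used to wake up waiters when the count increases *)
> >>     let counters = Array.make nr_events 0
> >>     let conditions = Array.init nr_events (fun _ -> Lwt_condition.create ())
> >>
> >>     (* called from the event handler when [port] is signalled *)
> >>     let wake port =
> >>       counters.(port) <- counters.(port) + 1;
> >>       Lwt_condition.broadcast conditions.(port) ()
> >>
> >>     (* [wait evtchn last] returns an event later than [last]; if the
> >>        counter has already moved on we return immediately, so an
> >>        event that fired while the caller was busy is not lost *)
> >>     let rec wait evtchn last =
> >>       let port = Eventchn.to_int evtchn in
> >>       match last with
> >>       | Some e when counters.(port) <= e ->
> >>         Lwt_condition.wait conditions.(port) >> wait evtchn last
> >>       | _ -> return counters.(port)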
> >>
> >>    2. SCHEDOP_poll has a low (arbitrary) nr_ports limit
> >>
> >>    
> >> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/common/schedule.c;h=a8398bd9ed4827564bed4346e1fdfbb98ec5907e;hb=c5e9596cd095e3b96a090002d9e6629a980904eb#l712
> >>
> >>     704 static long do_poll(struct sched_poll *sched_poll)
> >>     705 {
> >>     706     struct vcpu   *v = current;
> >>     707     struct domain *d = v->domain;
> >>     708     evtchn_port_t  port;
> >>     709     long           rc;
> >>     710     unsigned int   i;
> >>     711
> >>     712     /* Fairly arbitrary limit. */
> >>     713     if ( sched_poll->nr_ports > 128 )
> >>     714         return -EINVAL;
> >>
> >>    The total number of available event channels for a 64-bit guest
> >>    is 4096 using the current ABI (a new interface is under
> >>    development which allows even more). The limit of 128 is probably
> >>    imposed to limit the amount of time the hypercall takes, to avoid
> >>    hitting scalability limits like you do in userspace with select().
> >>
> >>    One of the use-cases I'd like to use Mirage for is to run backend
> >>    services (like xenstore or blkback) for all the domains on a
> >>    host. This requires at least one event channel per client domain.
> >>    We routinely run ~300 VMs/host, so the 128 limit is too small.
> >>    Plus a quick grep around Linux shows that it doesn't use
> >>    SCHEDOP_poll very much-- I think we should focus on using the
> >>    hypercalls that other OSes are using, for maximum chance of success.
> >>
> >>    So I think we should switch from select()-like behaviour using
> >>    SCHEDOP_poll to interrupt-based delivery using SCHEDOP_block. I
> >>    note that upstream mini-os does this by default too. I'll take a
> >>    look at this.
> >>
> >>    Cheers,
> >>    Dave
> >>
> >>
> >>
> >>    On Fri, Sep 13, 2013 at 11:50 PM, Dimosthenis Pediaditakis
> >>    <dimosthenis.pediaditakis@xxxxxxxxxxxx> wrote:
> >>
> >>        Hi all,
> >>        The last few days I've been trying to pin down the
> >>        performance issues of the Mirage network stack when running
> >>        over Xen.
> >>        When trying to push net-direct to its limits, transmissions
> >>        randomly stall for anywhere between 0.1 and 4 seconds
> >>        (especially at the sender).
> >>
> >>        After some experimentation, I believe that those time-outs
> >>        occur because netif is not (always) notified (via
> >>        Activations) about freed TX-ring slots.
> >>        It seems that these events (intermittently) don't reach the
> >>        guest domain's front-end driver.
> >>
> >>        AFAIK Activations.wait() currently blocks waiting for an
> >>        event on the port of the event channel belonging to the
> >>        netif.
> >>        This event is delivered to Activations.run via Main.run.aux,
> >>        which is invoked via the callback in app_main() of
> >>        runtime/kernel/main.c.
> >>        The problem I observed is that, when "SCHEDOP_poll" is used
> >>        without masking the intended event channels, the hypervisor
> >>        doesn't wake up the blocked domain when a new event becomes
> >>        available.
> >>        The requirement for event masking when using "SCHEDOP_poll"
> >>        is also mentioned in the Xen documentation.
> >>
> >>        I've produced a patch that seems to fix the above erratic
> >>        behavior.
> >>        Now I am able to consistently achieve higher speeds (up to
> >>        2.75 Gbps DomU-to-DomU). Please have a look at my repo:
> >>        https://github.com/dimosped/mirage-platform
> >>        It helps to use large enough txqueuelen values for your
> >>        VIFs, as the current TCP implementation doesn't cope well
> >>        with losses at high data rates. The default size on my
> >>        system was only 32.
> >>
> >>        I have also modified mirage-net-direct by adding per-flow
> >>        TCP debug logging. This has helped me better understand and
> >>        pin down the problem.
> >>        You can grab the modified sources here:
> >>        https://github.com/dimosped/mirage-net
> >>        Be aware that logging large volumes of data for a TCP flow
> >>        requires a correspondingly large amount of memory.
> >>        Nevertheless, it barely affects performance.
> >>
> >>        The iperf benchmark sources can be found here:
> >>        https://github.com/dimosped/iperf-mirage
> >>        I've included as much info as possible in the README file.
> >>        This should be sufficient to get you started and replicate my
> >>        experiments.
> >>
> >>        In the iperf-mirage repo there is also a Python tool, which
> >>        you can use to automatically generate plots based on the
> >>        collected TCP debug info (I also include a sample dataset in
> >>        data/):
> >>        
> >> https://github.com/dimosped/iperf-mirage/tree/master/tools/MirageTcpVis
> >>        For really large datasets, the script might be slow. I need
> >>        to switch to using NumPy arrays at some point...
> >>
> >>        Please keep in mind that I am a newbie in Xen/Mirage so your
> >>        comments/input are more than welcome.
> >>
> >>        Regards,
> >>        Dimos
> >>
> >>
> >>
> >>
> >>        ------------------------------------------------
> >>           MORE TECHNICAL DETAILS
> >>        ------------------------------------------------
> >>
> >>
> >>        -----------------------------------------------------------------
> >>
> >>        === How (I think) Mirage and XEN scheduling works ===
> >>        -----------------------------------------------------------------
> >>
> >>         - When Netif receives a writev request, it checks whether
> >>        the TX ring has enough empty space (for the producer) for
> >>        the data (see the sketch after this list).
> >>            - If there is not enough space, it block-waits (via
> >>        Activations.wait) for an event on the port mapped to the
> >>        netif (and bound to the backend driver).
> >>            - Otherwise it pushes the request.
> >>        - Activations are notified (via run) from "aux ()" in
> >>        Main.run. Once notified, the waiting netif can proceed,
> >>        check the ring again for free space, write a new request,
> >>        and send an event to the backend.
> >>        - Main.run.aux is registered as a callback (under name
> >>        "OS.Main.run") and is invoked in xen/runtime/kernel/main.c
> >>        (in app_main() loop). As long as the Mirage guest domain is
> >>        scheduled, this loop keeps running.
> >>        - However, in Main.run.aux, the Mirage guest domain is
> >>        blocked via "block_domain timeout" if the main thread has no
> >>        task to perform.
> >>        - In turn, "block_domain" invokes caml_block_domain()  found
> >>        in xen/runtime/kernel/main.c, which issues a
> >>        "HYPERVISOR_sched_op(SCHEDOP_poll, &sched_poll);" hypercall
> >>
> >>        -------------------------------------
> >>        === Polling mode issue ===
> >>        -------------------------------------
> >>        In my opinion, and based on debug information, it seems that
> >>        the problem is that Mirage uses "SCHEDOP_poll" without
> >>        masking the event channels.
> >>        The Xen documentation clearly states that with "SCHEDOP_poll"
> >>        the domain yields until either
> >>          a) an event is pending on the polled channels, or
> >>          b) the timeout (given in nanoseconds, as an absolute
> >>        system time rather than a duration) is reached.
> >>        It also states that SCHEDOP_poll can only be executed when
> >>        the guest has delivery of events disabled.
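> >>
> >>        So any relative sleep has to be turned into an absolute
> >>        wake-up time before it reaches the hypercall. Roughly, and
> >>        only as an illustration:
> >>
> >>          (* SCHEDOP_poll expects an absolute system time in ns;
> >>             [now_ns] would come from the Xen system time *)
> >>          let poll_timeout_ns ~now_ns ~sleep_s =
> >>            Int64.add now_ns (Int64.of_float (sleep_s *. 1e9))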
> >>
> >>        In Mirage, netif events are not masked and therefore they
> >>        never wake up the guest domain.
> >>        The guest only wakes up whenever a thread in Time.SleepQueue
> >>        is scheduled to wake up (e.g. a TCP timer).
> >>        Once the guest is scheduled again, it completes any
> >>        outstanding tasks and sends any pending packets, and it
> >>        sleeps again as soon as either a) the TX ring gets full, or
> >>        b) the hypervisor de-schedules it.
> >>        Further supporting the above: whenever I press keys in the
> >>        Xen console while the mirage sender is running, the execution
> >>        completes faster.
> >>
> >>        ----------------
> >>        === Fix ===
> >>        ----------------
> >>        There are multiple ways to mask events (e.g. at the VCPU
> >>        level, at the event level, etc.).
> >>        As a quick hack I replaced "Eventchn.unmask h evtchn;" in
> >>        Netif.plug_inner with "Eventchn.mask h evtchn;" (which I had
> >>        to create, both in Eventchn and as a stub in
> >>        xen/runtime/kernel/eventchn_stubs.c).
> >>        See:
> >>        
> >> https://github.com/dimosped/mirage-platform/commit/6d4d3f0403497f07fde4db6f4cb63665a8bf8e26
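> >>
> >>        In code the hack is essentially the following (a sketch --
> >>        the stub name here is illustrative; see the commit above for
> >>        the real one):
> >>
> >>          (* Eventchn: the mirror image of the existing unmask *)
> >>          external mask : handle -> t -> unit = "stub_eventchn_mask"
> >>
> >>          (* Netif.plug_inner: keep the port masked, so that it is
> >>             SCHEDOP_poll (with this port in its poll list) that
> >>             notices pending events, rather than an upcall *)
> >>          Eventchn.mask h evtchn;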
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>    --     Dave Scott
> >>
> >>
> >>
> >>
> >>-- 
> >>Dave Scott
> >
> 

-- 
Anil Madhavapeddy                                 http://anil.recoil.org



 

