Re: Improving the network-stack performance over Xen
Closing the loop on this; is Dave's Activations patch considered
safe-to-merge now?

-anil

On Tue, Sep 24, 2013 at 04:29:54PM +0100, Dimosthenis Pediaditakis wrote:
> Hi again,
> as promised, here are some data from a DomU-to-DomU iperf flow,
> pushing 0.8Gbits in 0.38sec (~2.1Gbps):
> http://www.cl.cam.ac.uk/~dp463/files/mirageNetData.tar.gz
>
> Note that I've used a txqueuelen of 500 for each VIF in dom0.
> Also, with TCP debug disabled, speeds reach >= 2.5Gbps.
>
> Everything seems smoother. I didn't observe any odd TX delay spikes
> caused by netif blocking.
>
> Regards,
> D.
>
>
> On 24/09/13 02:36, Dimosthenis Pediaditakis wrote:
> >Hi David,
> >sorry for the absence. I attended a workshop on SDN last week, and
> >today was quite a busy day.
> >I had a look at the interrupt branch of your mirage-platform repo,
> >cloned it and ran a few experiments.
> >The speeds I got were consistently between 2.45-2.6Gbps on my
> >machine (i7-3770, dual-channel DDR3-1600), which is a very good
> >number.
> >Unfortunately, I didn't have time to stress-test it further and
> >generate any TCP stack plots. This is the top item on my TODO list.
> >
> >I also went briefly through the additions/modifications in
> >hypervisor.c, main.c, activations.ml, netif.ml and main.ml.
> >It seems that SCHEDOP_block + no-op handling + evtchn_poll +
> >VIRQ_TIMER-bind, along with the rewriting of Activations (the fix
> >for the silently dropped events), have done the trick.
> >
> >A couple of questions:
> > - In hypervisor.c, what is the purpose of "force_evtchn_callback"?
> >I've only seen it being invoked via "__sti()" and "__restore_flags()".
> >
> > - In the updated design, when an event is received:
> >   SCHEDOP_block returns,
> >   evtchn_poll is then invoked, and
> >   finally Main.aux() is called.
> >   Activations.run() is invoked in Main.aux() if no threads are
> >scheduled, and subsequently the domain is blocked again (until a
> >timer interrupt, or reception of another event). My question is:
> >why don't we re-check the state of t right after "Activations.run()"?
> >For example, if packets are waiting to be sent and netif gets
> >unblocked, why do we block the domain again immediately?
> >
> >Also, thanks for the credits in your updates :-)
> >
> >D.
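As a rough illustration of the re-check asked about above: a minimal sketch of
a simplified main loop that only issues the blocking hypercall when nothing
became runnable after delivering activations. The names activations_run,
nothing_runnable, block_domain and next_wakeup are illustrative stand-ins (for
OS.Activations.run, the scheduler's run-queue check, the blocking-hypercall
wrapper and Time.select_next), not the actual mirage-platform code.

    (* Sketch only: re-check for runnable work after delivering events and,
       if anything woke up, go around the loop again instead of blocking. *)
    let rec aux ~activations_run ~nothing_runnable ~block_domain ~next_wakeup
                main_thread =
      activations_run ();        (* wake threads waiting on event channels *)
      Lwt.wakeup_paused ();      (* let paused Lwt threads run again       *)
      match Lwt.poll main_thread with
      | Some result -> result    (* the unikernel's main thread finished   *)
      | None ->
        (* If an activation just unblocked netif (or anything else), skip
           the hypercall and service it on the next iteration. *)
        if nothing_runnable () then block_domain (next_wakeup ());
        aux ~activations_run ~nothing_runnable ~block_domain ~next_wakeup
            main_thread
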
> >On 19/09/13 22:41, David Scott wrote:
> >>Hi Dimos,
> >>
> >>I've created a new patch set based on a mix of your ideas and mine:
> >>
> >>https://github.com/mirage/mirage-platform/pull/58
> >>
> >>I've proposed switching to SCHEDOP_block with interrupts enabled.
> >>Unlike in regular Mini-OS I don't think we need to do anything in the
> >>hypervisor_callback, because we already have code to poll the evtchn
> >>pending bits in evtchn_poll -- so we're a bit of a hybrid: interrupts
> >>on for wakeups but the whole OS is still based around a
> >>select/poll-style loop. I've left all event channels unmasked and
> >>used the event_upcall_mask to turn on/off event delivery globally.
> >>I've revamped the OCaml Activations interface to remove one source of
> >>missing events.
> >>
> >>So far the code is working ok in my testing. I ran mirage-iperf and
> >>am getting 1642002 KBit/sec on my test hardware -- I don't know if
> >>this is considered good or bad! I ran your instrumented version
> >>(thanks for the exhaustive instructions btw) and it drew some pretty
> >>graphs, but I'm not enough of a TCP expert to interpret them properly.
> >>
> >>Could you give this a go in your test environment and let me know
> >>what you think?
> >>
> >>I'm extremely suspicious of the console code -- it shouldn't be
> >>necessary to include a delay in the print loop; that's definitely
> >>worth investigating.
> >>
> >>Cheers,
> >>Dave
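To make the "missing events" point concrete: below is a sketch of how a driver
loop could consume a level-triggered Activations-style interface of the kind
mentioned above (and spelled out in the older message quoted below). The
ACTIVATIONS signature, the channel type and the Listen functor are assumptions
made for illustration, not the actual mirage-platform modules.

    open Lwt.Infix

    (* Assumed signature, mirroring the proposal discussed in this thread. *)
    module type ACTIVATIONS = sig
      type channel    (* an event channel, e.g. an Eventchn.t *)
      type event      (* a token recording how often the channel has fired *)
      val wait : channel -> event option -> event Lwt.t
    end

    module Listen (A : ACTIVATIONS) = struct
      (* Carry the last event seen through the loop: notifications that fire
         while [handle_rings] is running are returned by the next [wait]
         instead of being dropped. *)
      let rec listen chan handle_rings last =
        A.wait chan last >>= fun ev ->
        handle_rings () >>= fun () ->
        listen chan handle_rings (Some ev)
    end
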
> >>
> >>On Thu, Sep 19, 2013 at 11:02 AM, David Scott <scott.dj@xxxxxxxxx> wrote:
> >>
> >> Hi Dimos,
> >>
> >> Thanks for looking into this! Thinking about it, I think we have
> >> several problems.
> >>
> >> 1. I think the Activations.wait API is difficult to use / unsafe:
> >>
> >>   (* Block waiting for an event to occur on a particular port *)
> >>   let wait evtchn =
> >>     if Eventchn.is_valid evtchn then begin
> >>       let port = Eventchn.to_int evtchn in
> >>       let th, u = Lwt.task () in
> >>       let node = Lwt_sequence.add_l u event_cb.(port) in
> >>       Lwt.on_cancel th (fun _ -> Lwt_sequence.remove node);
> >>       th
> >>     end else Lwt.fail Generation.Invalid
> >>
> >> When you call Activations.wait you are added to a 'sequence' (like a
> >> list) of people to wake up when the next event occurs. A typical
> >> driver would call Activations.wait in a loop, block for an event,
> >> wake up, signal some other thread to do work and then block again.
> >> However if the thread running the loop blocks anywhere else, then
> >> the thread will not be added to the sequence straight away and any
> >> notifications that arrive during the gap will be dropped. I noticed
> >> this when debugging my block backend implementation. I think netif
> >> has this problem:
> >>
> >>   let listen nf fn =
> >>     (* Listen for the activation to poll the interface *)
> >>     let rec poll_t t =
> >>       lwt () = refill_requests t in
> >>       ^^^ blocks here, can miss events
> >>       rx_poll t fn;
> >>       tx_poll t;
> >>       (* Evtchn.notify nf.t.evtchn; *)
> >>       lwt new_t =
> >>         try_lwt
> >>           Activations.wait t.evtchn >> return t
> >>         with
> >>         | Generation.Invalid ->
> >>           Console.log_s "Waiting for plug in listen" >>
> >>           wait_for_plug nf >>
> >>           Console.log_s "Done..." >>
> >>           return nf.t
> >>       in poll_t new_t
> >>     in
> >>     poll_t nf.t
> >>
> >> I think we should change the semantics of Activations.wait to be
> >> more level-triggered rather than edge-triggered (i.e. more like the
> >> underlying behaviour of xen) like this:
> >>
> >>   type event
> >>   (** a particular event *)
> >>
> >>   val wait: Evtchn.t -> event option -> event Lwt.t
> >>   (** [wait evtchn None] returns [Some e] where [e] is the latest
> >>       event.
> >>       [wait evtchn (Some e)] returns [Some e'] where [e'] is a later
> >>       event than [e] *)
> >>
> >> In the implementation we could have "type event = int" and maintain
> >> a counter of "number of times this event has been signalled". When
> >> you call Activations.wait, you would pass in the number of the last
> >> event you saw, and the thread would block until a new event is
> >> available. This way you wouldn't have to be registered in the table
> >> when the event arrives.
> >>
> >> 2. SCHEDOP_poll has a low (arbitrary) nr_ports limit
> >>
> >> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/common/schedule.c;h=a8398bd9ed4827564bed4346e1fdfbb98ec5907e;hb=c5e9596cd095e3b96a090002d9e6629a980904eb#l712
> >>
> >>   704 static long do_poll(struct sched_poll *sched_poll)
> >>   705 {
> >>   706     struct vcpu *v = current;
> >>   707     struct domain *d = v->domain;
> >>   708     evtchn_port_t port;
> >>   709     long rc;
> >>   710     unsigned int i;
> >>   711
> >>   712     /* Fairly arbitrary limit. */
> >>   713     if ( sched_poll->nr_ports > 128 )
> >>   714         return -EINVAL;
> >>
> >> The total number of available event channels for a 64-bit guest is
> >> 4096 using the current ABI (a new interface is under development
> >> which allows even more). The limit of 128 is probably imposed to
> >> limit the amount of time the hypercall takes, to avoid hitting
> >> scalability limits like you do in userspace with select().
> >>
> >> One of the use-cases I'd like to use Mirage for is to run backend
> >> services (like xenstore or blkback) for all the domains on a host.
> >> This requires at least one event channel per client domain. We
> >> routinely run ~300 VMs/host, so the 128 limit is too small. Plus a
> >> quick grep around Linux shows that it doesn't use SCHEDOP_poll very
> >> much -- I think we should focus on using the hypercalls that other
> >> OSes are using, for maximum chance of success.
> >>
> >> So I think we should switch from select()-like behaviour using
> >> SCHEDOP_poll to interrupt-based delivery using SCHEDOP_block. I note
> >> that upstream mini-os does this by default too. I'll take a look at
> >> this.
> >>
> >> Cheers,
> >> Dave
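A minimal sketch of the counter idea described in point 1 above ("type event =
int", a per-port count of how many times the channel has fired). The table
size, the flat waiter list and the function names are illustrative, and edge
cases (e.g. a port that has never fired) are glossed over; the real
activations.ml is organised differently.

    type event = int

    let nr_events = 1024
    let counters : event array = Array.make nr_events 0
    let waiters : (int * event Lwt.u) list ref = ref []

    (* Driver side: return the current count if it is newer than [last],
       otherwise register and sleep until the port is signalled again. *)
    let wait (port : int) (last : event option) : event Lwt.t =
      let current = counters.(port) in
      match last with
      | Some l when l >= current ->
          let th, u = Lwt.task () in
          waiters := (port, u) :: !waiters;
          th
      | _ -> Lwt.return current

    (* Event-poll side: when [port] is found pending, bump the counter and
       wake everybody who was waiting on that port. *)
    let signal (port : int) =
      counters.(port) <- counters.(port) + 1;
      let to_wake, rest = List.partition (fun (p, _) -> p = port) !waiters in
      waiters := rest;
      List.iter (fun (_, u) -> Lwt.wakeup u counters.(port)) to_wake
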
> >>
> >>On Fri, Sep 13, 2013 at 11:50 PM, Dimosthenis Pediaditakis
> >><dimosthenis.pediaditakis@xxxxxxxxxxxx> wrote:
> >>
> >> Hi all,
> >> The last few days I've been trying to pin down the performance
> >> issues of the Mirage network stack when running over Xen.
> >> When trying to push net-direct to its limits, random transmissions
> >> stall for anywhere between 0.1sec-4sec (especially at the sender).
> >>
> >> After some experimentation, I believe that those time-outs occur
> >> because netif is not (always) notified (via Activations) about freed
> >> TX-ring slots. It seems that these events (intermittently) don't
> >> reach the guest domain's front-end driver.
> >>
> >> AFAIK Activations.wait() currently blocks waiting for an event on
> >> the port belonging to the event channel for the netif.
> >> This event is delivered to Activations.run via Main.run.aux, which
> >> is invoked via the callback in app_main() of runtime/kernel/main.c.
> >> The problem I observed was that, using "SCHEDOP_poll" without
> >> masking the intended events, the hypervisor didn't wake up the
> >> blocked domain upon new event availability.
> >> The requirement for event-masking when using "SCHEDOP_poll" is also
> >> mentioned in the Xen documentation.
> >>
> >> I've produced a patch that seems to fix the above erratic behaviour.
> >> Now I am able to consistently achieve higher speeds (up to 2.75Gbps
> >> DomU-to-DomU). Please have a look at my repo:
> >> https://github.com/dimosped/mirage-platform
> >> It will be helpful to use big-enough txqueuelen values for your
> >> VIFs, as the current TCP implementation doesn't cope well with
> >> losses at high data rates. The default size on my system was only 32.
> >>
> >> I have also modified mirage-net-direct by adding per-flow TCP debug
> >> logging. This has helped me to better understand and pin down the
> >> problem. You can grab the modified sources here:
> >> https://github.com/dimosped/mirage-net
> >> Be aware that logging big volumes of data for a TCP flow will
> >> require a correspondingly large amount of memory. Nevertheless, it
> >> only barely affects performance.
> >>
> >> The iperf benchmark sources can be found here:
> >> https://github.com/dimosped/iperf-mirage
> >> I've included as much info as possible in the README file. This
> >> should be sufficient to get you started and replicate my experiments.
> >>
> >> In the iperf-mirage repo there is also a Python tool which you can
> >> use to automatically generate plots based on the collected TCP debug
> >> info (I also include a sample dataset in data/):
> >> https://github.com/dimosped/iperf-mirage/tree/master/tools/MirageTcpVis
> >> For really large datasets the script might be slow. I need to switch
> >> to using NumPy arrays at some point...
> >>
> >> Please keep in mind that I am a newbie in Xen/Mirage, so your
> >> comments/input are more than welcome.
> >>
> >> Regards,
> >> Dimos
> >>
> >>
> >> ------------------------------------------------
> >> MORE TECHNICAL DETAILS
> >> ------------------------------------------------
> >>
> >> -----------------------------------------------------------------
> >> === How (I think) Mirage and XEN scheduling works ===
> >> -----------------------------------------------------------------
> >> - When Netif receives a writev request, it checks if the TX ring has
> >>   enough empty space (for the producer) for the data.
> >> - If there is not enough space, it block-waits (via Activations.wait)
> >>   for an event on the port mapped to the netif (and bound to the
> >>   backend driver).
> >> - Otherwise it pushes the request.
> >> - Activations are notified (via run) from "aux ()" in Main.run. Once
> >>   notified, the waiting netif can proceed, check the ring again for
> >>   free space, write a new request, and send an event to the backend.
> >> - Main.run.aux is registered as a callback (under the name
> >>   "OS.Main.run") and is invoked in xen/runtime/kernel/main.c (in the
> >>   app_main() loop). As long as the Mirage guest domain is scheduled,
> >>   this loop keeps running.
> >> - However, in Main.run.aux the Mirage guest domain is blocked via
> >>   "block_domain timeout" if the main thread has no task to perform.
> >> - In turn, "block_domain" invokes caml_block_domain() found in
> >>   xen/runtime/kernel/main.c, which issues a
> >>   "HYPERVISOR_sched_op(SCHEDOP_poll, &sched_poll);" hypercall.
> >>
> >> -------------------------------------
> >> === Polling mode issue ===
> >> -------------------------------------
> >> In my opinion, and based on debug information, it seems that the
> >> problem is that Mirage uses "SCHEDOP_poll" without masking the event
> >> channels.
> >> The Xen documentation clearly states that with "SCHEDOP_poll" the
> >> domain yields until either a) an event is pending on the polled
> >> channels, or b) the timeout (given in nanoseconds; not a duration
> >> but an absolute system time) is reached.
> >> It also states that SCHEDOP_poll can only be executed when the guest
> >> has delivery of events disabled.
> >>
> >> In Mirage, netif events are not masked and therefore they never wake
> >> up the guest domain. The guest only wakes up whenever a thread is
> >> scheduled to wake up in Time.SleepQueue (e.g. a TCP timer). Once the
> >> guest is scheduled again, it completes any outstanding tasks, sends
> >> any pending packets, and goes back to sleep whenever a) the TX ring
> >> gets full, or b) the hypervisor preempts it.
> >> To further support the above: whenever I press keys in the Xen
> >> console while the Mirage sender is running, the transfer completes
> >> faster.
> >>
> >> ----------------
> >> === Fix ===
> >> ----------------
> >> There are multiple ways to mask events (e.g. at VCPU level, event
> >> level, etc.).
> >> As a quick hack I replaced "Eventchn.unmask h evtchn;" in
> >> Netif.plug_inner with "Eventchn.mask h evtchn" (which I had to
> >> create, both in Eventchn and as a stub in
> >> xen/runtime/kernel/eventchn_stubs.c).
> >> See:
> >> https://github.com/dimosped/mirage-platform/commit/6d4d3f0403497f07fde4db6f4cb63665a8bf8e26
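For illustration, here is a sketch of the shape of that quick hack, under the
assumption that a mask binding simply mirrors the existing unmask one. The
stub name, the handle argument and the plug_inner fragment are guesses made
for exposition; the authoritative change is the commit linked above.

    (* Sketch only: an assumed mask binding alongside the existing unmask. *)
    module Eventchn_sketch = struct
      type t = int        (* stand-in for the real event-channel port type *)
      external stub_evtchn_mask : int -> unit = "stub_evtchn_mask"
      let mask _h (port : t) = stub_evtchn_mask port
    end

    (* In Netif.plug_inner the hack is then a one-line change:
         Eventchn.unmask h evtchn   becomes   Eventchn.mask h evtchn
       i.e. the channel stays masked, matching the event-masking
       requirement for SCHEDOP_poll mentioned in the Xen documentation. *)
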
> >>
> >> --
> >> Dave Scott
> >>
> >>
> >>--
> >>Dave Scott
> >
>

--
Anil Madhavapeddy
http://anil.recoil.org