Re: Improving the network-stack performance over Xen
Hi Dimos,

Thanks for looking into this! Thinking about it, I think we have several problems.

1. I think the Activations.wait API is difficult to use / unsafe:
(* Block waiting for an event to occur on a particular port *)
let wait evtchn =
  if Eventchn.is_valid evtchn then begin
    let port = Eventchn.to_int evtchn in
    let th, u = Lwt.task () in
    let node = Lwt_sequence.add_l u event_cb.(port) in
    Lwt.on_cancel th (fun _ -> Lwt_sequence.remove node);
    th
  end else Lwt.fail Generation.Invalid

When you call Activations.wait you are added to a 'sequence' (like a list) of threads to wake up when the next event occurs. A typical driver would call Activations.wait in a loop, block for an event, wake up, signal some other thread to do work and then block again. However, if the thread running the loop blocks anywhere else, it will not be re-added to the sequence straight away, and any notifications that arrive during the gap will be dropped. I noticed this when debugging my block backend implementation. I think netif has this problem:
let listen nf fn =
  (* Listen for the activation to poll the interface *)
  let rec poll_t t =
    lwt () = refill_requests t in   (* <-- blocks here, can miss events *)
    rx_poll t fn;
    tx_poll t;
    (* Evtchn.notify nf.t.evtchn; *)
    lwt new_t =
      try_lwt
        Activations.wait t.evtchn >> return t
      with
      | Generation.Invalid ->
        Console.log_s "Waiting for plug in listen" >>
        wait_for_plug nf >>
        Console.log_s "Done..." >>
        return nf.t in
    poll_t new_t in
  poll_t nf.t

I think we should change the semantics of Activations.wait to be level-triggered rather than edge-triggered (i.e. closer to the underlying behaviour of Xen), like this:
type event
(** a particular event *)

val wait: Evtchn.t -> event option -> event Lwt.t
(** [wait evtchn None] returns [e] where [e] is the latest event.
    [wait evtchn (Some e)] returns [e'] where [e'] is a later event than [e] *)

In the implementation we could have "type event = int" and maintain a counter of the number of times each event channel has been signalled. When you call Activations.wait, you would pass in the number of the last event you saw, and the thread would block until a newer event is available. This way you wouldn't have to be registered in the table at the moment the event arrives.
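To make the idea concrete, here is a rough sketch of what a counter-based implementation could look like. This is only an illustration of the level-triggered semantics, not the real Activations module: the table size, the generation/unplug handling and error cases are all elided, and the names (nr_events, counters, waiters, signal) are made up for the example.

  type event = int
  (* An [event] is the number of times the port has been signalled so far. *)

  let nr_events = 4096
  let counters = Array.make nr_events 0            (* signals seen per port *)
  let waiters : event Lwt.u list array = Array.make nr_events []
                                                   (* wakeners blocked per port *)

  (* Called from the event-channel upcall when [port] fires. *)
  let signal port =
    counters.(port) <- counters.(port) + 1;
    let to_wake = waiters.(port) in
    waiters.(port) <- [];
    List.iter (fun u -> Lwt.wakeup_later u counters.(port)) to_wake

  (* Level-triggered wait: return immediately if the counter has already
     moved past [last], otherwise block until the next signal. *)
  let wait evtchn last =
    let port = Eventchn.to_int evtchn in
    let seen = match last with None -> 0 | Some e -> e in
    if counters.(port) > seen then Lwt.return counters.(port)
    else begin
      let th, u = Lwt.task () in
      waiters.(port) <- u :: waiters.(port);
      th
    end

A driver loop would then remember the last event it saw, do all of its work (refill_requests, rx_poll, tx_poll), and only afterwards call wait with that value, so an event signalled while it was busy is picked up on the next iteration rather than lost.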
2. SCHEDOP_poll has a low (arbitrary) nr_ports limit:

static long do_poll(struct sched_poll *sched_poll)
{
    struct vcpu *v = current;
    struct domain *d = v->domain;
    evtchn_port_t port;
    long rc;
    unsigned int i;

    /* Fairly arbitrary limit. */
    if ( sched_poll->nr_ports > 128 )
        return -EINVAL;
The total number of available event channels for a 64-bit guest is 4096 with the current ABI (a new interface is under development which allows even more). The limit of 128 is probably there to bound the time the hypercall takes, to avoid the kind of scalability problems you hit in userspace with select().
One of the use-cases I'd like to use Mirage for is running backend services (like xenstore or blkback) for all the domains on a host. This requires at least one event channel per client domain, and we routinely run ~300 VMs/host, so the 128-port limit is too small. Also, a quick grep around Linux shows that it barely uses SCHEDOP_poll; I think we should focus on using the hypercalls that other OSes are using, for the best chance of success.
So I think we should switch from select()-like behaviour using SCHEDOP_poll to interrupt-based delivery using SCHEDOP_block. I note that upstream mini-os does this by default too. I'll take a look at this.
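To sketch where that would take the main loop (illustrative only, not the actual mirage-platform code): instead of gathering a list of ports and issuing SCHEDOP_poll, the scheduler would run all runnable Lwt threads and, when nothing is runnable and nothing is pending, block the whole VCPU with SCHEDOP_block and let the event upcall wake it. The block_domain and events_pending externals below stand for C stubs that would issue the hypercall and test the shared-info pending bits; both names are invented for this example.

  (* Hypothetical C stubs (names invented for illustration):
     - stub_sched_block issues SCHEDOP_block, which re-enables event
       delivery and de-schedules the VCPU until an event arrives;
     - stub_evtchn_pending tests the shared-info pending bitmap. *)
  external block_domain : unit -> unit = "stub_sched_block"
  external events_pending : unit -> bool = "stub_evtchn_pending"

  (* Run the main Lwt thread to completion, blocking the domain whenever
     there is nothing to do instead of SCHEDOP_poll-ing a limited list of
     ports. *)
  let rec run t =
    Lwt.wakeup_paused ();
    match Lwt.poll t with
    | Some x -> x                       (* main thread finished *)
    | None ->
      if not (events_pending ()) then block_domain ();
      (* A full implementation would now walk the pending bits and wake
         the corresponding Activations waiters before looping. *)
      run t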
Cheers,
Dave

On Fri, Sep 13, 2013 at 11:50 PM, Dimosthenis Pediaditakis <dimosthenis.pediaditakis@xxxxxxxxxxxx> wrote: