[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: IRQ latency measurements in hypervisor


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Sat, 16 Jan 2021 14:40:13 +0000
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=YLRRy/cLgLOVa3liVzc21e7BIeX+y983Tyd0LIZx+Xc=; b=LrTNtdTC+jokCqfIGh7ZRDmvW5JjnydBN7Jd1a4eHMXOMq+Ogf3LvIVVmiziNhN7pN7L6WOv98qCazipd2HwO8ITAvGCouuNQ18rsFLigobfVZmHySIO6AM9nukO0JZVie6njWLIq5wjpQnXTh96VPrS3++OcFYP0stXxrscSeFzj/TZ8myDNl4Ff49o26vXRpEiKdTgR4O2p70/WwTiNrQyHEgXwKK6NB/GRcyG+hvX8031SFAMuVkvVCttocibgzAiC+2XGpXtopuJrjeERgMjvNSJcYl7DBC6KVkXaCpI8qFjJKDUbi/gEe1M74c0znZc1ILtkblBQeXm5Fr3oQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=TPmD5De/C4zXh6NDOeMVRDQR9u5k2quBD60hzNK7vwmTbrvqq7BSwNYZEnt5XMehCV/qn+F7qyIlor3jeem7tra/grjj58nrZ8ELxLXxbk5w4y6eRI/lXTCi68YAWJNqDJbroMN4SvN42ja9P1GsDRz11QQCAV0IfUapa0uZmBhgH4X7RWe3B8m+ExvgAg7Wrl7ewZfOJlID7PSNCSAmfV7YOVujtWaVUTp5l208fjPKUfczrjUah5JYhoZS93UXHNKibgyfa+335Sg6BMIVhAkyCf4azcKso+1XokkFdxUTJEkM/UZDaDpueXElgffgLmCf/XHc2EaOUD6PBfqeYw==
  • Authentication-results: esa5.hc3370-68.iphmx.com; dkim=pass (signature verified) header.i=@citrix.onmicrosoft.com
  • Cc: Julien Grall <jgrall@xxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxx>, Dario Faggioli <dario.faggioli@xxxxxxxx>
  • Delivery-date: Sat, 16 Jan 2021 14:40:33 +0000
  • Ironport-sdr: XxuVbzyplCYu8H1Oep1DADzLGTmheL1priIsyzZ/z4hCJbxCLUcwjk2luzm9h12+10wiYQ8J+U 5dmDD19oP6GVTE/N+jTJX/6WAbwDu3LtJHmlcgTLxw4Q63+RpcEZh1D0g5arb6CkKvl8fWeegj zOka+gqZO1/jHgvgMr1lZCdf6BH6+nY/GRGats+dlmRxxnG9ig/WM5DnKm5Wmt41+kuXREy138 C+i2gGEi7YdtfK3UimKhFmYDSkEzoS6idfhic4lIShH/wQQKcxBVOCddqPaAjDWKcc0FFzEggt LkM=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12/01/2021 23:48, Volodymyr Babchuk wrote:
> Hello community,
>
> Recently I was experimenting with context switching latencies in
> Xen. I want to share my findings and discuss some issues I have found.
>
> My target was to ensure that no more 30us passes between interrupt
> generation and handling in a guest VM. Main goal was to determine if
> Xen is ready for soft RT and to identify main issues that prevent
> this.
>
> TLDR: I was able to achieve mean 15us latency with tuned RTDS
> scheduler, disabled serial console and some hacks to the Xen
> toolstack. This was done on a setup with 1 PCPU.
>
> Test setup.
>
> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.
>
> To accurately determine latency, I employed one of timer counter units
> (TMUs) available on the SoC. This is 32-bit timer with auto-reload,
> that can generate interrupt on underflow. I fed it with 33.275MHz
> clock, which gave me resolution of about 30ns. I programmed the timer
> to generate interrupt every 10ms. My ISR then read the current timer
> value and determined how much time passed since last underrun. This
> gave me time interval between IRQ generation and ISR invocation.
> Those values were collected and every 10 seconds statistics was
> calculated. There is an example of output from my Linux driver:
>
> [   83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [   94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [  104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 
> ns)
> [  114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [  124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 
> ns)
>
> This is the baremetal Linux. And there is Dom0:
>
> [  237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 
> ns)
> [  247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 
> ns)
>
> Driver outputs both the raw timer value (eg. 4) and the same value
> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
> much faster. But experiments showed that Linux does not provide
> consistent values, even when running in baremetal mode. You can see
> sporadic spikes in "std dev" values.
>
> So my next step was to use proper RT OS to do the measurements. I
> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
> GitHub([2]). At [3] you can find the source code for application that
> does the latency measurements. It behaves exactly as my linux driver,
> but provides a bit more information:
>
>  *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc  ***
> RT Eval app
>
> Counter freq is 33280000 Hz. Period is 30 ns
> Set alarm in 0 sec (332800 ticks)
> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 
> 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 
> 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 
> 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> ...
>
> So there you can see: mean time, standard deviation, % of interrupts
> that was processed above 30us threshold, minimum and maximum latency
> values for the current 10s run, global minimum and maximum.
>
> Zephyr running as baremetal showed very stable results (this is an
> older build, so no extended statistics there):
>
> ## Starting application at 0x480803C0 ...
>  *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183  ***
> RT Eval app
>
> Counter freq is 33280000 Hz. Period is 30 ns
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> ...
>
> As Zephyr provided stable readouts with no jitter, I used it to do all
> subsequent measurements.
>
>
> IMPORTANT! All subsequent tests was conducted with only 1 CPU core
> enabled. My goal was to ensure that system can timely react to an
> external interrupt even under load.
>
>
> Test results and latency sources
>
> As you can see, baremetal OS provides latency of ~0.9us without any
> deviations. The same code running as DomU on idle system shows mean
> latency of 12us and deviation of about 10us. Range of latencies in a
> 10s batch can vary from 8us to 25us. This fits into required 30us
> threshold, so no big issue there.

This is all very interesting work - thankyou for sharing.

However, there is one set of measurements which I think are missing, and
are needed to describe the full picture.

You've got measurements for a native kernel taking interrupts, with
excellent results for Zephyr, and good results for Linux.

You've also got measurements for putting Xen into the mix, and the
results could do with some work.

Adding Xen totally changes the structure of the system.  Conceptually
(if the Xen model were expressed as native), what you've got now is a
single piece of userspace (pid1, no fork()ing), which is single threaded
(but doing its own private task switching between "thread"), where some
threads are making blocking system calls (all Xen hypercalls are fully
synchronous) while other threads are measuring the latency for a signal
to hit.

I think it would be incredibly valuable to mock something like that up
as a 3rd set of measurements, because it allows for a fairer comparison
of the rough structure of a virtualised system, with and without Xen
being involved.


It would also be very valuable to see where the latency is in the
system.  Would it be possible to set up a shared page, and have the
various moving parts write the counter value they see into it, so the
final stats can do a breakdown?

You probably want counter samples from:
1) When Xen has positively identified this single interrupt
2) When the interrupt has been marked pending for the guest
3) When the guest kernel has positively identified this single interrupt
4) When the interrupt handler has run (this is the datapoint you've got,
AFAICT).

Maybe for point 2, you also want some metadata as to what context was
interrupted (EL0..2).

You ought to see reasonably consistent results when interrupting EL0,
because that should transition immediately to EL1.  For interrupting
EL1, you'll perhaps see variation depending on whether other higher
priority are being serviced.

For interrupting EL2, this is where I'd expect you to see all the
problems with hypercalls, or potentially other work Xen is doing. 
Fundamentally, Xen can't just drop what it's doing and resume guest
context for the interrupt to be delivered.

We get around this with the hypercall continuation mechanism, which is
conceptually similar to voluntary preemption (we periodically check to
see if we should stop doing what we're doing, unwind state safely, and
get back out into guest context).

However, even using hypercall continuations, there will be a minimum
unit of work, necessary to guarantee forward progress, which cannot be
subdivided further.  There is also a cost to having finer grain
continuation checks, but hopefully that is confined to the error margins.


Alternatively, if we still have problems, something to consider might be
the task domain concept which has been discussed before.  This was an
idea to try and bring something more like normal tasks to Xen, by having
a hypervisor context that the control domain could queue work with,
who's running wouldn't be a synchronous hypercall from from the guest's
perspective.

Such a task domain would require a mechanism to ensure it got scheduled
with a high enough priority not to get starved, and would probably
benefit if we restructured Xen to have fewer irq-off areas (which would
be a great thing irrespective - the scheduler lock in particular needs
to become irqs-off).  In other words, the task domain concept is a lot
of work to do, with only a general hope of it resulting in an
improvement - it shouldn't be jumped upon until lightly to solve perf
problems.

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.