Re: IRQ latency measurements in hypervisor

Hi Volodymyr, Stefano,

On 14/01/2021 23:33, Stefano Stabellini wrote:
> + Bertrand, Andrew (see comment on alloc_heap_pages())

Long-running hypercalls are usually considered security issues. In
this case, only the control domain can issue a large memory
allocation (2GB at a time). A guest would only be able to allocate
2MB at a time, so, from the numbers below, it would take 1ms at most.
So I think we are fine here.

Next time you find a large loop, please provide an explanation of why
it is not a security issue (e.g. it cannot be used by guests), or
send an email to the Security Team if in doubt.

>> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.

Which core is it?

In a related topic, I am not entirely sure that all the hypercalls
would be able to fit in the 100us slice. In particular, the ones
which touch the P2M and do memory allocations.

>> To accurately determine latency, I employed one of the timer
>> counter units (TMUs) available on the SoC. This is a 32-bit timer
>> with auto-reload that can generate an interrupt on underflow. I
>> fed it with a 33.275MHz clock, which gave me a resolution of about
>> 30ns. I programmed the timer to generate an interrupt every 10ms.
>> My ISR then read the current timer value and determined how much
>> time had passed since the last underflow. This gave me the time
>> interval between IRQ generation and ISR invocation. Those values
>> were collected, and every 10 seconds statistics were calculated.
>> Here is an example of output from my Linux driver:
>
> It looks like a solid approach to collect results, similar to the
> one we used for the cache coloring work. Just make sure to collect
> very many results.
>
> A few questions: did you use a single physical CPU? Are you using
> RTDS and scheduling 2 vCPUs on 1 pCPU? Is dom0 idle or busy? I
> take it the results were measured in a domU?
>
>> [   83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [   94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [  104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> [  114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [  124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>>
>> This is baremetal Linux. And here is Dom0:
>>
>> [  237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> [  247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>>
>> The driver outputs both the raw timer value (e.g. 44) and the same
>> value scaled to nanoseconds (e.g. 1320 ns). As you can see, the
>> baremetal setup is much faster. But experiments showed that Linux
>> does not provide consistent values, even when running in baremetal
>> mode. You can see sporadic spikes in the "stddev" values.
>
> So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
> 9060-9180ns. I am not surprised that the Linux results are
> inconsistent, but I have a couple of observations:
>
> - 9us is high for Linux. If the system is idle, the latency should
>   be lower, around 2-3us. I imagine you are actually running some
>   sort of interference from dom0? Or using RTDS and descheduling
>   vCPUs?
>
> - the stddev of 3870ns is high for baremetal. In the baremetal
>   case the stddev should be minimal if the system is idle.
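FWIW, here is how I read the measurement scheme, as a minimal
compilable sketch. The counter access is stubbed out and every name
in it is made up; this is only my mental model, not the actual
rt_eval_tmu driver:

#include <stdint.h>
#include <stdio.h>

#define TMU_CLK_HZ  33280000u  /* ~33.28 MHz input clock            */
#define TICK_NS     30u        /* 1e9 / TMU_CLK_HZ, ~30 ns per tick */
#define PERIOD_MS   10u
#define RELOAD      (TMU_CLK_HZ / 1000u * PERIOD_MS)  /* 332800 ticks */

/* Stub for the MMIO read of the TMU down-counter; a real driver
 * would read the counter register here. Pretend the ISR ran 44
 * ticks after the underflow. */
static uint32_t tmu_read_counter(void)
{
    return RELOAD - 44;
}

/* What the ISR computes: the counter auto-reloads to RELOAD on
 * underflow and keeps counting down, so the number of ticks between
 * IRQ generation and ISR entry is RELOAD minus the current value. */
static uint32_t irq_latency_ns(void)
{
    uint32_t ticks = RELOAD - tmu_read_counter();
    return ticks * TICK_NS;
}

int main(void)
{
    printf("latency: %u ns\n", irq_latency_ns()); /* 44 * 30 = 1320 ns */
    return 0;
}

In other words, the latency is simply how far the counter has already
counted down again by the time the ISR runs.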
>> So my next step was to use a proper RTOS to do the measurements. I
>> chose Zephyr. My PR that adds Xen support to Zephyr can be found
>> at [1]. Support for RCAR Gen3 is not upstreamed, but is present on
>> my GitHub ([2]). At [3] you can find the source code for the
>> application that does the latency measurements. It behaves exactly
>> like my Linux driver, but provides a bit more information:
>>
>> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Set alarm in 0 sec (332800 ticks)
>> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> ...
>>
>> So there you can see: mean time, standard deviation, % of
>> interrupts that were processed above the 30us threshold, minimum
>> and maximum latency values for the current 10s run, and the global
>> minimum and maximum.
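For reference, the aggregation I would expect behind those lines
looks roughly like this (my own sketch, working in raw timer ticks;
not the actual code from [3]):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define TICK_NS   30u
#define THR_TICKS (30000u / TICK_NS)  /* 30us threshold, in ticks */

/* Accumulator for one 10s window, plus the global min/max. */
struct lat_stats {
    uint64_t n, sum, sum_sq, above;
    uint32_t min, max, gmin, gmax;
};

#define LAT_STATS_INIT { .min = UINT32_MAX, .gmin = UINT32_MAX }

static void lat_record(struct lat_stats *s, uint32_t ticks)
{
    s->n++;
    s->sum += ticks;
    s->sum_sq += (uint64_t)ticks * ticks;
    if (ticks > THR_TICKS) s->above++;
    if (ticks < s->min) s->min = ticks;
    if (ticks > s->max) s->max = ticks;
    if (ticks < s->gmin) s->gmin = ticks;
    if (ticks > s->gmax) s->gmax = ticks;
}

static void lat_report(const struct lat_stats *s)
{
    double mean = (double)s->sum / (double)s->n;
    double var  = (double)s->sum_sq / (double)s->n - mean * mean;
    double sd   = sqrt(var > 0.0 ? var : 0.0);

    printf("Mean: %.0f (%.0f ns) stddev: %.0f (%.0f ns) "
           "above thr: %.0f%% [%u (%u ns) - %u (%u ns)] "
           "global [%u (%u ns) %u (%u ns)]\n",
           mean, mean * TICK_NS, sd, sd * TICK_NS,
           100.0 * (double)s->above / (double)s->n,
           s->min, s->min * TICK_NS, s->max, s->max * TICK_NS,
           s->gmin, s->gmin * TICK_NS, s->gmax, s->gmax * TICK_NS);
}

int main(void)
{
    struct lat_stats s = LAT_STATS_INIT;
    lat_record(&s, 31);
    lat_record(&s, 44);
    lat_record(&s, 56);
    lat_report(&s);
    return 0;
}

I assume the per-window fields are reset after each 10s report while
the global min/max persist; that is how I read the "global" column,
at least.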
>> Zephyr running as baremetal showed very stable results (this is an
>> older build, so no extended statistics there):
>>
>> ## Starting application at 0x480803C0 ...
>> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> ...
>>
>> As Zephyr provided stable readouts with no jitter, I used it to do
>> all subsequent measurements.
>
> I am a bit confused here. Looking at the numbers above, the stddev
> is 112110 ns in the first instance. That is pretty high. Am I
> looking at the wrong numbers?
>
>> IMPORTANT! All subsequent tests were conducted with only 1 CPU
>> core enabled. My goal was to ensure that the system can react to
>> an external interrupt in a timely manner even under load.
>
> All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPU;
> all my tests were done with 1 vCPU <-> 1 pCPU and the null
> scheduler.
>
> This is very interesting too. Did you get any spikes with the
> period set to 100us? It would be fantastic if there were none.

There are two for loops in alloc_heap_pages() using this syntax.
Which one are you referring to?

Looking at the domain creation code, 2GB will be split into two
extents of 1GB. This means there will be at least a preemption point
between the allocation of the two extents. That said, this would only
halve the time, so there might be more optimization to do...

>> This function is not [...]

When I read "hypercall continuation", I read that we will return to
the guest context so it can process interrupts and potentially switch
to another task. This means that the guest could issue a second
populate_physmap() from the vCPU. Therefore any restart information
should be part of the hypercall parameters. So far, I don't see how
this would be possible.

Even if we overcome that part, this could easily be abused by a
guest, as the memory is not yet accounted to the domain. Imagine a
guest that never requests the continuation of the populate_physmap().
So we would need to block the vCPU until the allocation is finished.

I think the first step is that we need to figure out which part of
the allocation is slow (see my question above). From there, we can
figure out if there is a way to reduce the impact.
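To make the shape of the problem concrete, here is a toy standalone
model of that allocation path. preempt_pending() is a made-up
stand-in for Xen's hypercall_preempt_check(), and the constants just
model 2GB populated as 1GB extents of 4KB pages; this is not the
actual Xen code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EXTENT_ORDER  18u  /* 1GB / 4KB = 2^18 pages per extent */
#define NR_EXTENTS    2u   /* 2GB in total */

/* Stand-in for Xen's hypercall_preempt_check(). */
static bool preempt_pending(void)
{
    return true;  /* pretend a softirq is pending, for the demo */
}

/* Models the per-page work done inside alloc_heap_pages(). */
static void alloc_one_page(void)
{
}

int main(void)
{
    for (uint32_t ext = 0; ext < NR_EXTENTS; ext++) {
        /* The only preemption point: between two extents. Here Xen
         * would create a hypercall continuation, encoding the
         * restart point in the hypercall arguments, and return to
         * the guest. */
        if (ext > 0 && preempt_pending())
            printf("would preempt before extent %u\n", ext);

        /* Within one extent there is no preemption point: this loop
         * runs 1 << EXTENT_ORDER = 262144 iterations back to back,
         * which is where the latency spike comes from. */
        for (uint32_t i = 0; i < (1u << EXTENT_ORDER); i++)
            alloc_one_page();
    }
    return 0;
}

The inner loop is the part without any preemption point, which is why
the split into two extents only bounds a spike to one extent's worth
of iterations.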
Cheers,

--
Julien Grall