
[RFC PATCH 00/10] Preemption in hypervisor (ARM only)


  • To: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Tue, 23 Feb 2021 02:34:55 +0000
  • Cc: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Meng Xu <mengxu@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Delivery-date: Tue, 23 Feb 2021 02:35:17 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hello community,

The subject of this cover letter is quite self-explanatory. This patch
series implements a PoC for preemption in hypervisor mode.

This is a sort of follow-up to the recent discussion about latency
([1]).

Motivation
==========

It is well known that Xen is not preemptable. In other words, it is
impossible to switch vCPU contexts while running in hypervisor
mode. The only place where a scheduling decision can be made and one
vCPU can be replaced with another is the exit path from hypervisor
mode. The one exception is idle vCPUs, which never leave hypervisor
mode for obvious reasons.

This leads to a number of problems. The list below is not
comprehensive; it covers only things that I or my colleagues have
encountered personally.

Long-running hypercalls. Due to the nature of some hypercalls, they can
execute for an arbitrarily long time. Mostly those are calls that deal
with a long list of similar actions, like processing memory pages. To
deal with this issue Xen employs a most horrific technique called
"hypercall continuation". When the code that handles a hypercall decides
that it should be preempted, it basically updates the hypercall
parameters and moves the guest PC one instruction back. This causes the
guest to re-execute the hypercall with altered parameters, which allows
the hypervisor to continue hypercall execution later. This approach has
obvious problems: the code that executes the hypercall is responsible
for preemption, preemption checks are infrequent (because they are
costly by themselves), hypercall execution state is stored in a
guest-controlled area, and we rely on the guest's good will to continue
the hypercall. All this imposes restrictions on which hypercalls can be
preempted, when they can be preempted and how to write hypercall
handlers. Also, it requires very careful coding and has already led to
at least one vulnerability, XSA-318. Some hypercalls cannot be
preempted at all, like the one mentioned in [1].
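
To make the mechanism concrete, here is a minimal, self-contained C
sketch of a continuation. It is a toy model, not Xen code: the struct
guest_regs layout, HVC_INSN_SIZE, PREEMPT_BATCH and all function names
are assumptions made up for illustration. It only shows the idea
described above: the handler saves its progress in guest-visible
arguments and rewinds the guest PC, so the guest ends up re-issuing the
hypercall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical guest-visible state: hypercall arguments live here. */
struct guest_regs {
    unsigned long pc;    /* guest program counter */
    unsigned long start; /* first page still to be processed */
    unsigned long nr;    /* number of pages left to process */
};

#define HVC_INSN_SIZE 4  /* assumed size of the hypercall instruction */
#define PREEMPT_BATCH 64 /* assumed: check for preemption every 64 pages */

static unsigned long pages_done; /* stand-in for the real per-page work */

/* Stand-in for a "should we yield?" check; always true in this sketch. */
static bool preempt_check(void)
{
    return true;
}

/* Returns true when finished, false when a continuation was created. */
static bool do_page_op(struct guest_regs *regs)
{
    unsigned long batch = 0;

    assert(regs != NULL);
    while (regs->nr > 0) {
        pages_done++;   /* "process" one page */
        regs->start++;  /* progress is stored in guest-visible arguments */
        regs->nr--;
        batch++;
        if (regs->nr > 0 && batch % PREEMPT_BATCH == 0 && preempt_check()) {
            /* Rewind the PC so the guest re-executes the hypercall. */
            regs->pc -= HVC_INSN_SIZE;
            return false;
        }
    }
    return true;
}

/* Toy guest: keeps trapping into the "hypervisor" until the work is done.
 * Returns how many times the hypercall was entered. */
static unsigned int guest_issue_hypercall(struct guest_regs *regs)
{
    unsigned int entries = 0;
    bool done;

    do {
        regs->pc += HVC_INSN_SIZE; /* CPU steps over the hypercall insn */
        entries++;
        done = do_page_op(regs);
    } while (!done);

    return entries;
}
```

Note that the remaining work (start, nr) lives entirely in state the
guest can modify between re-entries, which is exactly the property
criticised above.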

Absence of hypervisor threads/vCPUs. The hypervisor owns only idle
vCPUs, which are supposed to run when the system is idle. If the
hypervisor needs to execute its own tasks that must run right now, it
has no other way than to execute them on the current vCPU. But the
scheduler does not know that the hypervisor is executing a hypervisor
task and accounts the time spent to a domain. This can lead to domain
starvation.

Also, the absence of hypervisor threads leads to the absence of
high-level synchronization primitives like mutexes, condition
variables, completions, etc. This leads to two problems: we need to use
spinlocks everywhere, and we have problems when porting device drivers
from the Linux kernel.

Proposed solution
=================

It is quite obvious that to fix the problems above we need to allow
preemption in hypervisor mode. I am not familiar with the x86 side, but
for ARM it was surprisingly easy to implement. Basically, vCPU
context in hypervisor mode is determined by its stack and general
purpose registers. And the __context_switch() function perfectly
switches them when running in hypervisor mode. So there is no hard
restriction that it be called only in the leave_hypervisor() path.

The obvious question is: when should we try to preempt a running
vCPU? And the answer is: when there was an external event. This means
that we should try to preempt only when an interrupt request arrives
while we are running in hypervisor mode. On ARM, the function
do_trap_irq() is called in this case. The problem is that the IRQ
handler can be called when the vCPU is already in atomic state (holding
a spinlock, for example). In this case we should try to preempt right
after leaving atomic state. This is basically the whole idea behind
this PoC.
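
The deferral logic can be sketched in a few lines of C. This is only an
illustration of the idea above and not the code from the series: every
name ending in _sketch or _stub is hypothetical, the counter is a plain
integer rather than the per-CPU atomic_t the series uses, and the
"scheduler" is a stub that just counts how often it was invoked.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned int preempt_count;    /* > 0 means "in atomic state" */
static bool preempt_pending;          /* preemption was requested but deferred */
static unsigned int context_switches; /* how often the stub scheduler ran */

/* Stand-in for calling into the real scheduler. */
static void schedule_stub(void)
{
    context_switches++;
}

/* Preempt now if allowed, otherwise remember the request for later. */
static void try_preempt_sketch(void)
{
    if (preempt_count > 0) {
        preempt_pending = true;
        return;
    }
    preempt_pending = false;
    schedule_stub();
}

/* Enter atomic state (for example, when taking a spinlock). */
static void preempt_disable_sketch(void)
{
    preempt_count++;
}

/* Leave atomic state; honour a deferred request on the outermost exit. */
static void preempt_enable_sketch(void)
{
    assert(preempt_count > 0);
    if (--preempt_count == 0 && preempt_pending)
        try_preempt_sketch();
}

/* What the tail of an IRQ handler would do in this model. */
static void irq_handler_sketch(void)
{
    /* ... handle the interrupt itself ... */
    try_preempt_sketch();
}
```

An interrupt that arrives outside atomic state switches immediately; one
that arrives inside it only sets the pending flag, and the switch happens
when the last preempt_enable_sketch() brings the counter back to zero.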

Now, about the series composition.
Patches

  sched: core: save IRQ state during locking
  sched: rt: save IRQ state during locking
  sched: credit2: save IRQ state during locking
  preempt: use atomic_t to for preempt_count
  arm: setup: disable preemption during startup
  arm: context_switch: allow to run with IRQs already disabled

prepare the groundwork for the rest of the PoC. It appears that not all
code is ready to be executed in IRQ state, and schedule() can now be
called at the end of do_trap_irq(), which technically is considered IRQ
handler state. Also, it is unwise to try to preempt things while we are
still booting, so we need to enable atomic context during the boot
process.
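
As a rough illustration of why the "save IRQ state during locking"
patches are needed, consider the following toy C model. It is my own
simplification, not code from the series: the interrupt mask is a plain
boolean and all function names are hypothetical. The point is only that
an unlock which unconditionally re-enables interrupts is wrong for a
caller that entered with interrupts already disabled, while a
save/restore pair preserves whatever state the caller had.

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled = true; /* simulated CPU interrupt mask */

/* Variant 1: disable on lock, unconditionally enable on unlock. */
static void lock_irq_sketch(void)
{
    irqs_enabled = false;
}

static void unlock_irq_sketch(void)
{
    irqs_enabled = true;
}

/* Variant 2: remember the previous state and restore it on unlock. */
static bool lock_irqsave_sketch(void)
{
    bool flags = irqs_enabled;

    irqs_enabled = false;
    return flags;
}

static void unlock_irqrestore_sketch(bool flags)
{
    irqs_enabled = flags;
}

/* Each helper returns the IRQ state observed after the critical section. */
static bool critical_section_plain(void)
{
    lock_irq_sketch();
    /* ... protected work ... */
    unlock_irq_sketch();
    return irqs_enabled;
}

static bool critical_section_irqsave(void)
{
    bool flags = lock_irqsave_sketch();

    assert(!irqs_enabled); /* interrupts are masked inside the section */
    /* ... protected work ... */
    unlock_irqrestore_sketch(flags);
    return irqs_enabled;
}
```

If the caller arrives with interrupts disabled, as happens when
schedule() is reached from the end of an IRQ handler, the first variant
silently re-enables them, while the second leaves them disabled.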

Patches
  preempt: add try_preempt() function
  sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!]
  arm: traps: try to preempt before leaving IRQ handler

are basically the core of this PoC. The try_preempt() function tries to
preempt the vCPU either when called by the IRQ handler or when leaving
atomic state. The scheduler now enters atomic state to ensure that it
will not preempt itself. do_trap_irq() calls try_preempt() to initiate
preemption.

Patch
  [HACK] alloc pages: enable preemption early

is exactly what it says. I wanted to see if this PoC is capable of
fixing the aforementioned issue with long-running alloc_heap_pages().
So this is just a hack that disables atomic context early. As mentioned
in the patch description, the right solution would be to use mutexes.

Results
=======

I used the same testing setup that I described in [1]. The results are
quite promising:

1. Stefano noted that the very first batch of measurements resulted in
higher than usual latency:

 *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc  ***
RT Eval app

Counter freq is 33280000 Hz. Period is 30 ns
Set alarm in 0 sec (332800 ticks)
Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)]
Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)]

Note that the maximum latency is about 2ms.

With these patches applied, things are much better:

 *** Booting Zephyr OS build zephyr-v2.4.0-3614-g0e2689f8edc3  ***
RT Eval app

Counter freq is 33280000 Hz. Period is 30 ns
Set alarm in 0 sec (332800 ticks)
Mean: 335 (10050 ns) stddev: 52 (1560 ns) above thr: 0% [296 (8880 ns) - 1256 (37680 ns)]
Mean: 332 (9960 ns) stddev: 11 (330 ns) above thr: 0% [293 (8790 ns) - 501 (15030 ns)]

As you can see, the maximum latency is ~38us, which is way lower than 2ms.
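
For anyone who wants to double-check the numbers, the nanosecond values
in these logs are just tick counts multiplied by the counter period.
Here is a small C sketch of that conversion. It assumes the test app
rounds the period down to whole nanoseconds, which is my inference from
the "33280000 Hz" and "Period is 30 ns" lines rather than something I
checked in its source; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define COUNTER_FREQ_HZ 33280000ULL   /* from the "Counter freq" log line */
#define NSEC_PER_SEC    1000000000ULL

/* Counter period truncated to whole nanoseconds (assumed rounding). */
static uint64_t period_ns(void)
{
    return NSEC_PER_SEC / COUNTER_FREQ_HZ;
}

/* Convert a latency measured in counter ticks to nanoseconds. */
static uint64_t ticks_to_ns(uint64_t ticks)
{
    assert(ticks <= UINT64_MAX / period_ns()); /* guard against overflow */
    return ticks * period_ns();
}
```

With these definitions, the worst case of 66955 ticks before the patches
corresponds to 2008650 ns (about 2ms), and the worst case of 1256 ticks
with the patches corresponds to 37680 ns (about 38us).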

The second test is to observe the influence of a call to
alloc_heap_pages() with order 18. Without the last patch:

Mean: 756 (22680 ns) stddev: 7328 (219840 ns) above thr: 4% [326 (9780 ns) - 234405 (7032150 ns)]

A huge spike of 7ms can be observed.

Now, with the HACK patch:

Mean: 488 (14640 ns) stddev: 1656 (49680 ns) above thr: 6% [324 (9720 ns) - 52756 (1582680 ns)]
Mean: 458 (13740 ns) stddev: 227 (6810 ns) above thr: 3% [324 (9720 ns) - 3936 (118080 ns)]
Mean: 333 (9990 ns) stddev: 12 (360 ns) above thr: 0% [320 (9600 ns) - 512 (15360 ns)]

A few things can be observed: mean latency is lower and maximum
latencies are lower too, but overall runtime is higher.

The downside of these patches is that mean latency is a bit
higher. Here are the results for the current Xen master branch:

Mean: 288 (8640 ns) stddev: 20 (600 ns) above thr: 0% [269 (8070 ns) - 766 (22980 ns)]
Mean: 287 (8610 ns) stddev: 20 (600 ns) above thr: 0% [266 (7980 ns) - 793 (23790 ns)]

8.6us versus ~10us with the patches.

Of course, this is a crude approach, and certain things can be done
more optimally.

Known issues
===========

0. Right now it is ARM only. x86 changes vCPU contexts in a different
way, and I don't know how many changes are needed to make this work on
x86.

1. The RTDS scheduler goes crazy when running on an SMP system (i.e.
with more than 1 pCPU) and tries to schedule an already running vCPU on
multiple pCPUs at a time. This leads to some hard-to-debug crashes.

2. As I mentioned, mean latency becomes a bit higher.

Conclusion
==========

My main intention is to begin a discussion of hypervisor preemption. As
I showed, it is doable right away and provides some immediate
benefits. I do understand that a proper implementation requires much
more effort. But we are ready to do this work if the community is
interested in it.

Just to reiterate the main benefits:

1. More controllable latency. On embedded systems customers care about
such things.

2. We can get rid of hypercall continuations, which will result in
simpler and more secure code.

3. We can implement proper hypervisor threads, mutexes, completions
and so on. This will make scheduling more accurate and ease both the
porting of Linux drivers and the implementation of more complex
features in the hypervisor.



[1] https://marc.info/?l=xen-devel&m=161049529916656&w=2

Volodymyr Babchuk (10):
  sched: core: save IRQ state during locking
  sched: rt: save IRQ state during locking
  sched: credit2: save IRQ state during locking
  preempt: use atomic_t to for preempt_count
  preempt: add try_preempt() function
  arm: setup: disable preemption during startup
  sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!]
  arm: context_switch: allow to run with IRQs already disabled
  arm: traps: try to preempt before leaving IRQ handler
  [HACK] alloc pages: enable preemption early

 xen/arch/arm/domain.c      | 18 ++++++++++-----
 xen/arch/arm/setup.c       |  4 ++++
 xen/arch/arm/traps.c       |  7 ++++++
 xen/common/memory.c        |  4 ++--
 xen/common/page_alloc.c    | 21 ++---------------
 xen/common/preempt.c       | 36 ++++++++++++++++++++++++++---
 xen/common/sched/core.c    | 46 +++++++++++++++++++++++---------------
 xen/common/sched/credit2.c |  5 +++--
 xen/common/sched/rt.c      | 10 +++++----
 xen/include/xen/preempt.h  | 17 +++++++++-----
 10 files changed, 109 insertions(+), 59 deletions(-)

-- 
2.29.2



 

