Xen project Mailing List

Re: [Xen-devel] RFC on deprivileged x86 hypervisor device models

To: "Ben Catterall (Intern)" <ben.catterall@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, "JBeulich@xxxxxxxx" <JBeulich@xxxxxxxx>

From: Paul Durrant <Paul.Durrant@xxxxxxxxxx>

Date: Fri, 17 Jul 2015 12:32:32 +0000

Accept-language: en-GB, en-US

Delivery-date: Fri, 17 Jul 2015 12:32:55 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Thread-index: AQHQwHkj5CUxJTCbEEKFLBkgwRRsN53fl+0A

Thread-topic: [Xen-devel] RFC on deprivileged x86 hypervisor device models

> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxx
> [mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of Ben Catterall
> Sent: 17 July 2015 11:10
> To: xen-devel@xxxxxxxxxxxxx; Andrew Cooper; JBeulich@xxxxxxxx
> Subject: [Xen-devel] RFC on deprivileged x86 hypervisor device models
>
> Hi all,
>
> I'm working on an x86 proof-of-concept series to evaluate if it is
> feasible to move device models currently running in the hypervisor and
> x86 emulation code for HVM guests into a deprivileged context.
>

Why is that better than, say, moving the device models into a dedicated monolithic VM (like a global stub domain) and running them there? It gives you the depriv aspect and there's prior art.

  Paul

> I've put together the following document as I have been considering
> several different ways this could be achieved and was hoping to get
> feedback from maintainers before I go ahead.
>
> Many thanks in advance,
> Ben
>
> Context
> -------
> The aim is to run device models, which are already running inside the
> hypervisor (e.g. x86 emulate), in deprivileged user mode for HVM guests,
> using suitably mapped page tables. A simple hypercall convention is
> needed to pass data between these two modes of operation and a mechanism
> to move between them.
>
> This is intended as a proof-of-concept, with the aim of determining if
> this idea is feasible within performance constraints.
>
> Motivation
> ----------
> The motivation for moving the device models and x86 emulation code into
> ring 3 is to mitigate a system compromise due to a bug in any of these
> systems. These systems are currently part of the hypervisor and,
> consequently, a bug in any of these could allow an attacker to gain
> control (or perform a DoS) of Xen and/or guests.
>
>
> Moving between privilege levels
> --------------------------------
> The general process is to determine if we need to run a device model (or
> similar) and then, if so, switch into deprivileged mode.
> The operation is performed by deprivileged code which calls into the
> hypervisor as and when needed. After the operation completes, we return
> to the hypervisor.
>
> If deprivileged mode needs to make any hypervisor requests, it can do
> these using a syscall interface, possibly placing an operation code into
> a register to indicate the operation. This would allow it to get data
> to/from the hypervisor.
>
> I am currently considering three different methods regarding the context
> switch and would be grateful for any feedback.
>
> Method One
> ----------
> This method works by building on top of the QEMU emulation path code.
> This currently operates as a state machine, using flags to determine the
> current emulation state. These states are used to determine which code
> paths to take when calling out of and into the hypervisor before and
> after emulation.
>
> The intention would be to add new states and then follow the same
> process as existing code does except, rather than blocking the vcpu, we
> switch into deprivileged mode and process the request on the current
> vcpu. This is different to QEMU which blocks the current vcpu so
> additional code is needed to support this context switch. There may be
> other code paths which have not been written in this way which would
> require rewriting.
>
> When moving into deprivileged mode, we need to be careful to ensure that
> when we leave, we can redo the call into the hypervisor after the device
> model completes without causing problems. Thus, we need to be _certain_
> that the same call path is followed on the re-entry and that the
> system's state can handle this. This may mean undoing operations such as
> memory allocations.
>
>
> Method Two
> ----------
> At the point of detecting the need to perform a deprivileged operation,
> we take a copy of the current stack from the current stack position up
> to the point where the guest entered Xen and save it. Subsequently, we
> move the stack pointer back.
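
For readers following the thread, the copy-and-rewind step quoted just above can be pictured with a small user-space simulation. This is only a sketch under stated assumptions: the stack is a plain array rather than a real hypervisor stack, and every name here (sim_stack, saved_frames, save_and_rewind, restore) is hypothetical and not taken from Xen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_WORDS 64          /* hypothetical size of the simulated stack */

/* Simulated downward-growing stack: pushing decrements sp. */
struct sim_stack {
    unsigned long words[STACK_WORDS];
    size_t sp;                  /* index of the most recently pushed word */
};

/* Save area; the RFC notes one such copy is needed per vcpu. */
struct saved_frames {
    unsigned long words[STACK_WORDS];
    size_t count;               /* number of words saved */
    size_t sp;                  /* stack position at the time of the save */
};

void stack_init(struct sim_stack *s)
{
    memset(s->words, 0, sizeof s->words);
    s->sp = STACK_WORDS;        /* empty stack: sp is one past the top */
}

void stack_push(struct sim_stack *s, unsigned long v)
{
    assert(s->sp > 0);
    s->words[--s->sp] = v;
}

/* Copy everything between the current position and the entry point,
 * then move the stack pointer back to the entry point. */
void save_and_rewind(struct sim_stack *s, size_t entry,
                     struct saved_frames *out)
{
    assert(s->sp <= entry && entry <= STACK_WORDS);
    out->count = entry - s->sp;
    out->sp = s->sp;
    memcpy(out->words, &s->words[s->sp],
           out->count * sizeof(unsigned long));
    s->sp = entry;
}

/* Put the saved frames back, overwriting whatever used the stack since. */
void restore(struct sim_stack *s, const struct saved_frames *in)
{
    memcpy(&s->words[in->sp], in->words,
           in->count * sizeof(unsigned long));
    s->sp = in->sp;
}
```

A caller would record the entry position, call save_and_rewind() before entering deprivileged mode, and call restore() afterwards. A real implementation would also have to save the resume address and registers, which this simulation leaves out.
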
>
> This effectively gives us a clean stack as though we had just entered
> Xen. We then put the deprivileged context onto this new stack and enter
> deprivileged mode.
>
> Upon returning, we restore the previous stack with the guest's and Xen's
> context then jump to the saved rip and continue execution. Xen will then
> perform the necessary processing, determining if the operation was
> successful or not.
>
> We are effectively context switching out Xen for deprivileged code and
> then bringing Xen back in once we're done.
>
> As Xen is non-preemptive, the Xen stack won't be updated whilst we're in
> deprivileged mode. If it may be updated (I'm speculating here), e.g. an
> interrupt, then we can pause deprivileged mode by hooking the interrupt
> and restoring the Xen stack, then handle the interrupt and finally go
> back to deprivileged mode.
>
> Problem: If the device model or emulator edits the saved guest registers
> and these are touched by Xen on the return path after finishing
> servicing the deprivileged operation, then the guest will use these
> values not those the deprivileged mode provided.
>
> This is not a problem if the code doesn't do this. If it does, we could
> give higher precedence to deprivileged changes. So, deprivileged mode
> pushes the changes into the hypervisor which caches them and then, just
> before guest context is restored, makes those changes, thus discarding
> any Xen made.
>
>
>
> Method Three
> ------------
> A per vcpu stack is maintained for user mode and supervisor mode. We
> then don't need to do any copying, just switch to user mode at the point
> when deprivileged code needs to run.
>
> When deprivileged mode is done, we move back to supervisor mode, restore
> the previous context and continue execution of the code path that
> followed the call to move into deprivileged mode.
>
>
>
> Method Evaluation
> -----------------
> In method one, similarly to the QEMU path, we need to move up and down
> the call stack twice.
> We pay the cost of running the entry and exit code, which all methods
> will. Then we pay the cost of the code paths for moving into
> deprivileged mode from the call site and for moving from deprivileged
> mode back to the call site to handle the result. This means that we
> also destroy and then rebuild the stack. We also pay any allocation and
> deallocation costs twice, unless we can re-write the code paths so that
> these can be avoided. A potential issue would be if any changes are
> made to Xen's state on the first entry which mean that on the second
> entry (returning from deprivileged mode), we take a different call path.
>
> As mentioned, QEMU appears to do something similar so we can reuse much
> of this. The call tree is quite deep and broad so great care will need
> to be taken when making these changes to examine state-changing calls.
> Furthermore, such a change will be needed for each device, although this
> will be simpler after the first device is added.
>
> The second method requires copying the stack and then restoring it. It
> doesn't pay the costs of following a return path into deprivileged mode
> or moving back to the call site as it, effectively, skips all of this.
> Memory accesses on the stack are roughly the same as the first method
> but, we do need enough storage to hold a copy of the stack for each
> vcpu. The edits to intermediate callers are likely to be simpler than
> method one, as we don't need to worry about there being two different
> return paths. Adding a new device model would most likely be easier than
> method one.
>
> Method two appears to require fewer edits to the original source code
> and I suspect would be more efficient computationally than moving up and
> down the stack twice with multiple flag tests breaking code up. However,
> this has already been done for QEMU call paths so this may prove less
> troublesome/expensive than expected.
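
The "multiple flag tests" cost attributed to method one can be pictured with a small state machine that is entered twice: once to record the request and ask for a retry, and once after the deprivileged code has run. This is a rough sketch, not Xen's emulation path: the state names, the vcpu_emul structure and both functions are hypothetical, and the "device model" is a placeholder that just adds one to the request.

```c
#include <assert.h>

/* Hypothetical emulation states; these are not Xen's actual state names. */
enum emul_state { EMUL_NONE, EMUL_DEPRIV_PENDING, EMUL_DEPRIV_DONE };

enum emul_result { EMUL_OKAY, EMUL_RETRY };

struct vcpu_emul {
    enum emul_state state;
    unsigned long request;      /* what the device model was asked to do */
    unsigned long response;     /* filled in by the deprivileged code */
};

/* Stand-in for the work done in deprivileged mode. */
void deprivileged_device_model(struct vcpu_emul *e)
{
    assert(e->state == EMUL_DEPRIV_PENDING);
    e->response = e->request + 1;   /* placeholder for real emulation */
    e->state = EMUL_DEPRIV_DONE;
}

/* Emulation path. The first entry records the request and unwinds;
 * the second entry, after the device model ran, consumes the result. */
enum emul_result handle_io(struct vcpu_emul *e, unsigned long req,
                           unsigned long *out)
{
    switch (e->state) {
    case EMUL_NONE:
        e->request = req;
        e->state = EMUL_DEPRIV_PENDING;
        return EMUL_RETRY;          /* caller unwinds, then re-enters */
    case EMUL_DEPRIV_DONE:
        *out = e->response;
        e->state = EMUL_NONE;
        return EMUL_OKAY;
    default:
        return EMUL_RETRY;          /* device model has not run yet */
    }
}
```

The concern raised in the RFC shows up here as the requirement that the second call to handle_io() reaches the same switch with compatible state. Anything allocated or changed on the first pass has to be undone or tolerated on the second.
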
>
> The third method _may_ require significant code refactoring as
> currently, there is only one stack per pcpu so this may be a large change.
>
>
> Summary
> -------
> Just to reiterate, this is intended as a proof-of-concept to measure how
> feasible such a feature is.
>
> I'm currently on the fence between method one and method two.
>
> Method one will require more attention to existing code paths and is
> less like a context-switch approach.
>
> Method two will require less attention to existing code paths and is
> more like a context-switching approach.
>
> I am unsure of method three as I suspect it would be a significant change.
>
> Are there any potential issues or things which I have overlooked?
> Additionally, which (if any) of the above would you recommend pursuing
> or do you have any ideas regarding alternatives?
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.