
Re: [Xen-devel] RFC on deprivileged x86 hypervisor device models



> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of Ben Catterall
> Sent: 17 July 2015 11:10
> To: xen-devel@xxxxxxxxxxxxx; Andrew Cooper; JBeulich@xxxxxxxx
> Subject: [Xen-devel] RFC on deprivileged x86 hypervisor device models
> 
> Hi all,
> 
> I'm working on an x86 proof-of-concept series to evaluate if it is
> feasible to move device models currently running in the hypervisor and
> x86 emulation code for HVM guests into a deprivileged context.
> 

Why is that better than, say, moving the device models into a dedicated 
monolithic VM (like a global stub domain) and running them there? It gives you 
the depriv aspect and there's prior art.

  Paul

> I've put together the following document as I have been considering
> several different ways this could be achieved and was hoping to get
> feedback from maintainers before I go ahead.
> 
> Many thanks in advance,
> Ben
> 
> Context
> -------
> The aim is to run device models, which are already running inside the
> hypervisor (e.g. x86 emulate), in deprivileged user mode for HVM guests,
> using suitably mapped page tables. A simple hypercall convention is
> needed to pass data between these two modes of operation and a mechanism
> to move between them.
> 
> This is intended as a proof-of-concept, with the aim of determining if
> this idea is feasible within performance constraints.
> 
> Motivation
> ----------
> The motivation for moving the device models and x86 emulation code into
> ring 3 is to mitigate a system compromise due to a bug in any of these
> systems. These systems are currently part of the hypervisor and,
> consequently, a bug in any of these could allow an attacker to gain
> control of (or mount a DoS against) Xen and/or guests.
> 
> 
> Moving between privilege levels
> --------------------------------
> The general process is to determine if we need to run a device model (or
> similar) and then, if so, switch into deprivileged mode. The operation
> is performed by deprivileged code which calls into the hypervisor as and
> when needed. After the operation completes, we return to the hypervisor.
> 
> If deprivileged mode needs to make any hypervisor requests, it can do
> these using a syscall interface, possibly placing an operation code into
> a register to indicate the operation. This would allow it to get data
> to/from the hypervisor.
> 
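
A minimal sketch of what such a syscall convention could look like,
modelled as plain C so it can be compiled outside Xen. Every name here
(depriv_op, depriv_regs, depriv_dispatch, the operation codes) is
hypothetical and does not exist in Xen; the sketch only illustrates
"operation code in a register, data copied to/from the hypervisor".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation codes; none of these names exist in Xen. */
enum depriv_op {
    DEPRIV_OP_GET_REQUEST = 0, /* copy the pending request out to ring 3 */
    DEPRIV_OP_PUT_RESULT  = 1, /* copy the result back into the hypervisor */
    DEPRIV_OP_DONE        = 2, /* operation finished, return to Xen */
    DEPRIV_OP_NR
};

#define DEPRIV_EINVAL   (-22L)
#define DEPRIV_BUF_SIZE 64

/* Stand-in for the register values the syscall entry stub would see. */
struct depriv_regs {
    uint64_t op;   /* operation code (the "register" from the proposal) */
    void    *buf;  /* deprivileged-side buffer */
    size_t   len;  /* number of bytes to transfer */
};

/* Stand-in for hypervisor-owned, per-vcpu state. */
struct depriv_hv_state {
    uint8_t request[DEPRIV_BUF_SIZE];
    uint8_t result[DEPRIV_BUF_SIZE];
    int     done;
};

/* Validate every field: the caller is untrusted by construction. */
long depriv_dispatch(struct depriv_hv_state *hv, const struct depriv_regs *regs)
{
    assert(hv != NULL && regs != NULL);

    if (regs->op >= DEPRIV_OP_NR)
        return DEPRIV_EINVAL;

    if (regs->op == DEPRIV_OP_DONE) {
        hv->done = 1;
        return 0;
    }

    if (regs->buf == NULL || regs->len > DEPRIV_BUF_SIZE)
        return DEPRIV_EINVAL;

    if (regs->op == DEPRIV_OP_GET_REQUEST)
        memcpy(regs->buf, hv->request, regs->len);
    else /* DEPRIV_OP_PUT_RESULT */
        memcpy(hv->result, regs->buf, regs->len);

    return (long)regs->len;
}
```

In a real hypervisor the buffer address itself would also need checking
before any copy; the plain memcpy() here stands in for that step.
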
> I am currently considering three different methods regarding the context
> switch and would be grateful for any feedback.
> 
> Method One
> ----------
> This method works by building on top of the QEMU emulation path code.
> This currently operates as a state machine, using flags to determine the
> current emulation state. These states are used to determine which code
> paths to take when calling out of and into the hypervisor before and
> after emulation.
> 
> The intention would be to add new states and then follow the same
> process as existing code does except, rather than blocking the vcpu, we
> switch into deprivileged mode and process the request on the current
> vcpu. This is different to QEMU which blocks the current vcpu so
> additional code is needed to support this context switch. There may be
> other code paths which have not been written in this way which would
> require rewriting.
> 
> When moving into deprivileged mode, we need to be careful to ensure that
> when we leave, we can redo the call into the hypervisor after the device
> model completes without causing problems. Thus, we need to be _certain_
> that the same call path is followed on the re-entry and that the
> system's state can handle this. This may mean undoing operations such as
> memory allocations.
> 
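
The re-entry pattern described for method one can be sketched as a small
per-vcpu state machine. This is only an illustration under assumptions:
the state names, vcpu_emul, handle_io and run_depriv_model are invented
for the example and are not the existing Xen emulation states or
functions, and the "device model" is a dummy computation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu states added for the deprivileged path. */
enum depriv_state { DEPRIV_NONE, DEPRIV_PENDING, DEPRIV_COMPLETE };
enum emul_rc { EMUL_OKAY, EMUL_RETRY };

struct vcpu_emul {
    enum depriv_state state;
    uint32_t port;    /* request recorded on the first pass */
    uint32_t result;  /* filled in by the deprivileged device model */
};

/*
 * Called once on the way down (records the request and asks the caller
 * to unwind) and once more after deprivileged mode has finished.
 */
enum emul_rc handle_io(struct vcpu_emul *v, uint32_t port, uint32_t *val)
{
    switch (v->state) {
    case DEPRIV_NONE:
        v->port = port;
        v->state = DEPRIV_PENDING;
        return EMUL_RETRY;          /* unwind, run ring 3, then re-enter */
    case DEPRIV_COMPLETE:
        assert(v->port == port);    /* re-entry must follow the same path */
        *val = v->result;
        v->state = DEPRIV_NONE;
        return EMUL_OKAY;
    case DEPRIV_PENDING:
    default:
        return EMUL_RETRY;          /* still waiting for the device model */
    }
}

/* Dummy stand-in for the device model running in deprivileged mode. */
void run_depriv_model(struct vcpu_emul *v)
{
    if (v->state == DEPRIV_PENDING) {
        v->result = v->port ^ 0xffu;
        v->state = DEPRIV_COMPLETE;
    }
}
```

The assert on re-entry is the place where the "same call path" concern
shows up: anything the first pass changed that alters the second pass
has to be undone or made idempotent.
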
> 
> Method Two
> ----------
> At the point of detecting the need to perform a deprivileged operation,
> we take a copy of the current stack from the current stack position up
> to the point where the guest entered Xen and save it. Subsequently, we
> move the stack pointer back.
> 
> This effectively gives us a clean stack as though we had just entered
> Xen. We then put the deprivileged context onto this new stack and enter
> deprivileged mode.
> 
> Upon returning, we restore the previous stack with the guest's and Xen's
> context then jump to the saved rip and continue execution. Xen will then
> perform the necessary processing, determining if the operation was
> successful or not.
> 
> We are effectively context switching out Xen for deprivileged code and
> then bringing Xen back in once we're done.
> 
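
The save/reset/restore sequence of method two can be sketched with a
byte array standing in for the per-pcpu stack and an index standing in
for the stack pointer. This is a user-space model only; cpu_stack,
saved_stack and both functions are hypothetical names, and a real
implementation would be manipulating rsp in assembly rather than an
array index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_SIZE 256

/* Downward-growing stack; sp == STACK_SIZE means "just entered Xen". */
struct cpu_stack {
    uint8_t mem[STACK_SIZE];
    size_t  sp;
};

/* Per-vcpu storage for the copied-out portion of the stack. */
struct saved_stack {
    uint8_t copy[STACK_SIZE];
    size_t  sp;   /* stack position at the time of the save */
    size_t  len;  /* number of bytes saved */
};

/* Copy everything between the current position and the entry point,
 * then move the stack pointer back so the stack looks freshly entered. */
void stack_save_and_reset(struct cpu_stack *s, struct saved_stack *save)
{
    assert(s->sp <= STACK_SIZE);
    save->sp  = s->sp;
    save->len = STACK_SIZE - s->sp;
    memcpy(save->copy, &s->mem[s->sp], save->len);
    s->sp = STACK_SIZE;
}

/* Put Xen's frames back exactly where they were before the switch. */
void stack_restore(struct cpu_stack *s, const struct saved_stack *save)
{
    assert(save->sp + save->len == STACK_SIZE);
    memcpy(&s->mem[save->sp], save->copy, save->len);
    s->sp = save->sp;
}
```

The storage cost mentioned in the evaluation is visible here: one
saved_stack per vcpu, sized for the deepest call path that can reach the
switch point.
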
> As Xen is non-preemptive, the Xen stack won't be updated whilst we're in
> deprivileged mode. If it may be updated (I'm speculating here), e.g. by an
> interrupt, then we can pause deprivileged mode by hooking the interrupt
> and restoring the Xen stack, then handle the interrupt and finally go
> back to deprivileged mode.
> 
> Problem: If the device model or emulator edits the saved guest registers
> and these are also touched by Xen on the return path after it finishes
> servicing the deprivileged operation, then the guest will see Xen's
> values rather than those the deprivileged code provided.
> 
> This is not a problem if the code doesn't do this. If it does, we could
> give higher precedence to deprivileged changes. So, deprivileged mode
> pushes the changes into the hypervisor which caches them and then, just
> before guest context is restored, makes those changes, thus discarding
> any changes Xen made.
> 
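
One way to picture the "deprivileged changes take precedence" idea is a
small per-vcpu cache of register writes with a dirty mask, applied just
before guest context is restored. A minimal sketch, assuming a
four-register guest for brevity; guest_regs, depriv_reg_cache and both
functions are hypothetical and not Xen structures.

```c
#include <assert.h>
#include <stdint.h>

enum { REG_RAX, REG_RBX, REG_RCX, REG_RDX, REG_NR };

/* Simplified stand-in for the saved guest register frame. */
struct guest_regs {
    uint64_t r[REG_NR];
};

/* Register writes requested by deprivileged code, held by the hypervisor. */
struct depriv_reg_cache {
    uint64_t val[REG_NR];
    uint32_t dirty;   /* bit i set => val[i] must override the guest frame */
};

/* Record a register update pushed from deprivileged mode. */
int depriv_cache_reg(struct depriv_reg_cache *c, unsigned int idx, uint64_t v)
{
    if (idx >= REG_NR)
        return -1;            /* reject out-of-range indices */
    c->val[idx] = v;
    c->dirty |= 1u << idx;
    return 0;
}

/* Run last, just before returning to the guest: cached values win over
 * whatever Xen wrote to the frame on the return path. */
void depriv_apply_cache(struct depriv_reg_cache *c, struct guest_regs *g)
{
    for (unsigned int i = 0; i < REG_NR; i++)
        if (c->dirty & (1u << i))
            g->r[i] = c->val[i];
    c->dirty = 0;
}
```

Registers that deprivileged code never touched keep whatever Xen left in
the frame, so only genuinely conflicting writes are overridden.
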
> 
> 
> Method Three
> ------------
> A per vcpu stack is maintained for user mode and supervisor mode. We
> then don't need to do any copying, just switch to user mode at the point
> when deprivileged code needs to run.
> 
> When deprivileged mode is done, we move back to supervisor mode, restore
> the previous context and continue execution of the code path that
> followed the call to move into deprivileged mode.
> 
> 
> 
> Method Evaluation
> -----------------
> In method one, similarly to the QEMU path, we need to move up and down
> the call stack twice. We pay the cost of running the entry and exit
> code, which all methods will. Then we pay the cost of the code paths for
> moving into deprivileged mode from the call site and for moving from
> deprivileged mode back to the call site to handle the result. This means
> that we also destroy and then rebuild the stack. We also pay any
> allocation and deallocation costs twice, unless we can re-write the code
> paths so that these can be avoided. A potential issue would be if any
> changes are made to Xen's state on the first entry which mean that on
> the second entry (returning from deprivileged mode), we take a different
> call path.
> 
> As mentioned, QEMU appears to do something similar so we can reuse much
> of this. The call tree is quite deep and broad so great care will need
> to be taken when making these changes to examine state-changing calls.
> Furthermore, such a change will be needed for each device, although this
> will be simpler after the first device is added.
> 
> The second method requires copying the stack and then restoring it. It
> doesn't pay the costs of following a return path into deprivileged mode
> or moving back to the call site as it, effectively, skips all of this.
> Memory accesses on the stack are roughly the same as in the first method
> but we do need enough storage to hold a copy of the stack for each
> vcpu. The edits to intermediate callers are likely to be simpler than
> method one, as we don't need to worry about there being two different
> return paths. Adding a new device model would most likely be easier than
> method one.
> 
> Method two appears to require fewer edits to the original source code
> and I suspect would be more efficient computationally than moving up and
> down the stack twice with multiple flag tests breaking code up. However,
> this has already been done for QEMU call paths so this may prove less
> troublesome/expensive than expected.
> 
> The third method _may_ require significant code refactoring as there is
> currently only one stack per pcpu, so this may be a large change.
> 
> 
> Summary
> -------
> Just to reiterate, this is intended as a proof-of-concept to measure how
> feasible such a feature is.
> 
> I'm currently on the fence between method one and method two.
> 
> Method one will require more attention to existing code paths and is
> less like a context-switch approach.
> 
> Method two will require less attention to existing code paths and is
> more like a context-switching approach.
> 
> I am unsure of method three as I suspect it would be a significant change.
> 
> Are there any potential issues or things which I have overlooked?
> Additionally, which (if any) of the above would you recommend pursuing
> or do you have any ideas regarding alternatives?
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel



 

