
[Xen-devel] RFC on deprivileged x86 hypervisor device models

Hi all,

I'm working on an x86 proof-of-concept series to evaluate whether it is feasible to move the device models and x86 emulation code that currently run inside the hypervisor for HVM guests into a deprivileged context.

I've put together the following document as I have been considering several different ways this could be achieved and was hoping to get feedback from maintainers before I go ahead.

Many thanks in advance,

The aim is to run device models which already run inside the hypervisor (e.g. x86 emulate) in deprivileged user mode for HVM guests, using suitably mapped page tables. This needs a simple hypercall convention to pass data between the two modes of operation, along with a mechanism to move between them.

This is intended as a proof-of-concept, with the aim of determining if this idea is feasible within performance constraints.

The motivation for moving the device models and x86 emulation code into ring 3 is to mitigate a system compromise due to a bug in any of these systems. These systems are currently part of the hypervisor and, consequently, a bug in any of them could allow an attacker to gain control of (or perform a DoS on) Xen and/or guests.

Moving between privilege levels
The general process is to determine if we need to run a device model (or similar) and then, if so, switch into deprivileged mode. The operation is performed by deprivileged code which calls into the hypervisor as and when needed. After the operation completes, we return to the hypervisor.

If deprivileged mode needs to make any hypervisor requests, it can do these using a syscall interface, possibly placing an operation code into a register to indicate the operation. This would allow it to get data to/from the hypervisor.
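To make the operation-code idea a little more concrete, the hypervisor-side handler could look roughly like the sketch below. Everything in it is hypothetical and only for illustration: the names `depriv_op`, `depriv_regs`, `vcpu_state` and `depriv_syscall`, the register assignments mentioned in the comments and the three operations are my assumptions, not existing Xen interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation codes; not part of any existing Xen ABI. */
enum depriv_op {
    DEPRIV_OP_READ_GUEST_REG  = 0,
    DEPRIV_OP_WRITE_GUEST_REG = 1,
    DEPRIV_OP_DONE            = 2,
};

#define NR_GUEST_REGS 16

/* Values handed over by deprivileged code on syscall (assumed layout). */
struct depriv_regs {
    uint64_t op;    /* operation code, e.g. passed in rax */
    uint64_t arg0;  /* first argument, e.g. passed in rdi */
    uint64_t arg1;  /* second argument, e.g. passed in rsi */
    uint64_t ret;   /* value handed back to deprivileged mode */
};

/* Minimal stand-in for the per-vcpu state the handler operates on. */
struct vcpu_state {
    uint64_t guest_regs[NR_GUEST_REGS];
    int depriv_finished;
};

/*
 * Dispatch one request from deprivileged mode. The caller is untrusted,
 * so every argument is range-checked before use.
 * Returns 0 on success or a negative errno value.
 */
int depriv_syscall(struct vcpu_state *v, struct depriv_regs *r)
{
    assert(v != NULL && r != NULL);

    switch (r->op) {
    case DEPRIV_OP_READ_GUEST_REG:
        if (r->arg0 >= NR_GUEST_REGS)
            return -EINVAL;
        r->ret = v->guest_regs[r->arg0];
        return 0;

    case DEPRIV_OP_WRITE_GUEST_REG:
        if (r->arg0 >= NR_GUEST_REGS)
            return -EINVAL;
        v->guest_regs[r->arg0] = r->arg1;
        return 0;

    case DEPRIV_OP_DONE:
        /* Deprivileged code signals that the operation has completed. */
        v->depriv_finished = 1;
        return 0;

    default:
        return -ENOSYS;
    }
}
```

The point of the sketch is only that a single entry point with a small, validated set of operations keeps the attack surface exposed to ring 3 narrow.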

I am currently considering three different methods for the context switch and would be grateful for any feedback.

Method One
This method works by building on top of the QEMU emulation path code. This currently operates as a state machine, using flags to determine the current emulation state. These states are used to determine which code paths to take when calling out of and into the hypervisor before and after emulation.

The intention would be to add new states and then follow the same process as the existing code does except that, rather than blocking the vcpu, we switch into deprivileged mode and process the request on the current vcpu. This differs from the QEMU path, which blocks the current vcpu, so additional code is needed to support this context switch. There may be other code paths which have not been written in this way and which would require rewriting.

When moving into deprivileged mode, we need to be careful to ensure that when we leave, we can redo the call into the hypervisor after the device model completes without causing problems. Thus, we need to be _certain_ that the same call path is followed on the re-entry and that the system's state can handle this. This may mean undoing operations such as memory allocations.
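A minimal sketch of the re-entry pattern I have in mind for method one is below. The state names, `struct vcpu_emul`, `emulate_access` and `depriv_run` are all hypothetical and do not correspond to the existing emulation state machine; the "device model" is a placeholder that just adds one to the request.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu emulation states for the deprivileged path. */
enum emul_state { EMUL_NONE, EMUL_DEPRIV_PENDING, EMUL_DEPRIV_DONE };
enum emul_rc    { EMUL_OKAY, EMUL_RETRY };

struct vcpu_emul {
    enum emul_state state;
    uint64_t request;  /* what the guest asked for */
    uint64_t result;   /* filled in by the deprivileged device model */
};

/*
 * Called on the first entry and again on re-entry. The first call records
 * the request and unwinds with EMUL_RETRY; the second call must arrive via
 * the same call path and picks up the result.
 */
enum emul_rc emulate_access(struct vcpu_emul *e, uint64_t request,
                            uint64_t *out)
{
    switch (e->state) {
    case EMUL_NONE:
        e->request = request;
        e->state = EMUL_DEPRIV_PENDING;
        return EMUL_RETRY;

    case EMUL_DEPRIV_PENDING:
        /* Deprivileged code has not run yet; keep retrying. */
        return EMUL_RETRY;

    case EMUL_DEPRIV_DONE:
        /* Re-entry must be for the same request as the first entry. */
        assert(e->request == request);
        *out = e->result;
        e->state = EMUL_NONE;
        return EMUL_OKAY;
    }
    return EMUL_RETRY;
}

/* Placeholder for running the device model in deprivileged mode. */
void depriv_run(struct vcpu_emul *e)
{
    assert(e->state == EMUL_DEPRIV_PENDING);
    e->result = e->request + 1;  /* stand-in for real device emulation */
    e->state = EMUL_DEPRIV_DONE;
}
```

Anything done between the top of the call path and `emulate_access` on the first entry (allocations, state updates) is done again on the second, which is exactly where the care mentioned above is needed.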

Method Two
At the point of detecting the need to perform a deprivileged operation, we take a copy of the current stack from the current stack position up to the point where the guest entered Xen and save it. Subsequently, we move the stack pointer back.

This effectively gives us a clean stack as though we had just entered Xen. We then put the deprivileged context onto this new stack and enter deprivileged mode.

Upon returning, we restore the previous stack with the guest's and Xen's context then jump to the saved rip and continue execution. Xen will then perform the necessary processing, determining if the operation was successful or not.

We are effectively context switching out Xen for deprivileged code and then bringing Xen back in once we're done.
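The save and restore steps can be modelled in plain C as below. This is only a sketch under simplifying assumptions: the stack is a byte array indexed by offsets rather than a real hardware stack, and `struct xen_stack`, `struct saved_stack`, `depriv_save_stack` and `depriv_restore_stack` are names I have made up for illustration. A real implementation would also need to save rip and the callee-saved registers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_SIZE 4096

/* Toy model of the hypervisor stack; it grows towards lower offsets. */
struct xen_stack {
    unsigned char mem[STACK_SIZE];
    size_t sp;        /* offset of the current top of stack */
    size_t entry_sp;  /* offset of the stack top when the guest entered Xen */
};

/* Per-vcpu storage for the copied-out portion of the stack. */
struct saved_stack {
    unsigned char copy[STACK_SIZE];
    size_t len;  /* number of bytes saved */
    size_t sp;   /* stack offset to restore on return */
};

/* Copy [sp, entry_sp) aside, then rewind sp so the stack looks fresh. */
void depriv_save_stack(struct xen_stack *s, struct saved_stack *save)
{
    assert(s->sp <= s->entry_sp && s->entry_sp <= STACK_SIZE);
    save->len = s->entry_sp - s->sp;
    save->sp = s->sp;
    memcpy(save->copy, &s->mem[s->sp], save->len);
    s->sp = s->entry_sp;
}

/* Put the saved frames back and resume from the old stack position. */
void depriv_restore_stack(struct xen_stack *s, const struct saved_stack *save)
{
    assert(save->sp + save->len == s->entry_sp);
    memcpy(&s->mem[save->sp], save->copy, save->len);
    s->sp = save->sp;
}
```

The storage cost is one `saved_stack` per vcpu, and the copy cost is proportional to how deep into Xen we are when the deprivileged operation is needed.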

As Xen is non-preemptive, the Xen stack won't be updated whilst we're in deprivileged mode. If it can be updated (I'm speculating here), e.g. by an interrupt, then we can pause deprivileged mode by hooking the interrupt and restoring the Xen stack, then handle the interrupt and finally go back to deprivileged mode.

Problem: If the device model or emulator edits the saved guest registers and Xen then touches these on the return path after it has finished servicing the deprivileged operation, the guest will see Xen's values rather than those the deprivileged mode provided.

This is not a problem if the code doesn't do this. If it does, we could give higher precedence to deprivileged changes. So, deprivileged mode pushes the changes into the hypervisor, which caches them and then, just before guest context is restored, applies them, thus discarding any changes Xen made to the same registers.
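One way the caching could work is sketched below: deprivileged writes are recorded in a small per-vcpu cache with a dirty bitmap, and the cache is flushed over the guest register file as the last step before returning to the guest. The names `depriv_reg_cache`, `depriv_cache_reg` and `depriv_apply_cache` and the flat 16-entry register array are assumptions for the sake of the example.

```c
#include <assert.h>
#include <stdint.h>

#define NR_GUEST_REGS 16

/* Pending register updates requested by deprivileged code. */
struct depriv_reg_cache {
    uint64_t value[NR_GUEST_REGS];
    uint32_t dirty;  /* bit n set means value[n] is pending */
};

/* Record an update; returns 0 on success, -1 for a bad register index. */
int depriv_cache_reg(struct depriv_reg_cache *c, unsigned int idx,
                     uint64_t val)
{
    if (idx >= NR_GUEST_REGS)
        return -1;
    c->value[idx] = val;
    c->dirty |= UINT32_C(1) << idx;
    return 0;
}

/*
 * Run just before guest context is restored: cached values overwrite
 * whatever Xen left in the same registers. Untouched registers keep
 * Xen's values.
 */
void depriv_apply_cache(struct depriv_reg_cache *c,
                        uint64_t guest_regs[NR_GUEST_REGS])
{
    for (unsigned int i = 0; i < NR_GUEST_REGS; i++) {
        if (c->dirty & (UINT32_C(1) << i))
            guest_regs[i] = c->value[i];
    }
    c->dirty = 0;
    assert(c->dirty == 0);
}
```

Only registers that deprivileged mode explicitly wrote take precedence, so any legitimate updates Xen makes to other registers on the return path are preserved.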

Method Three
A per vcpu stack is maintained for user mode and supervisor mode. We then don't need to do any copying, just switch to user mode at the point when deprivileged code needs to run.

When deprivileged mode is done, we move back to supervisor mode, restore the previous context and continue execution of the code path that followed the call to move into deprivileged mode.
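For comparison with the copying approach, a toy model of method three is below. It only tracks which stack is active; a real implementation would switch rsp and privilege level in assembly. `struct vcpu_stacks`, `vcpu_stacks_init`, `enter_depriv` and `leave_depriv` are hypothetical names and the stack size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEPRIV_STACK_WORDS 512

enum cpu_mode { MODE_SUPERVISOR, MODE_USER };

/* Each vcpu owns both stacks, so no copying is needed on a switch. */
struct vcpu_stacks {
    uint64_t supervisor[DEPRIV_STACK_WORDS];
    uint64_t user[DEPRIV_STACK_WORDS];
    uint64_t *saved_supervisor_sp;  /* where Xen resumes afterwards */
    uint64_t *active_sp;            /* models the stack pointer */
    enum cpu_mode mode;
};

/* Start in supervisor mode with an empty, downward-growing stack. */
void vcpu_stacks_init(struct vcpu_stacks *v)
{
    v->active_sp = v->supervisor + DEPRIV_STACK_WORDS;
    v->saved_supervisor_sp = NULL;
    v->mode = MODE_SUPERVISOR;
}

/* Remember where Xen was and continue on a fresh user stack. */
void enter_depriv(struct vcpu_stacks *v)
{
    assert(v->mode == MODE_SUPERVISOR);
    v->saved_supervisor_sp = v->active_sp;
    v->active_sp = v->user + DEPRIV_STACK_WORDS;
    v->mode = MODE_USER;
}

/* Resume Xen exactly where it left off; the supervisor stack is intact. */
void leave_depriv(struct vcpu_stacks *v)
{
    assert(v->mode == MODE_USER);
    v->active_sp = v->saved_supervisor_sp;
    v->mode = MODE_SUPERVISOR;
}
```

The switch itself is cheap, but the memory cost is two stacks per vcpu rather than one per pcpu.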

Method Evaluation
In method one, similarly to the QEMU path, we need to move up and down the call stack twice. We pay the cost of running the entry and exit code, which all methods will. Then we pay the cost of the code paths for moving into deprivileged mode from the call site and for moving from deprivileged mode back to the call site to handle the result. This means that we also destroy and then rebuild the stack. We also pay any allocation and deallocation costs twice, unless we can re-write the code paths so that these can be avoided. A potential issue would be if any changes are made to Xen's state on the first entry which mean that on the second entry (returning from deprivileged mode), we take a different call path.

As mentioned, QEMU appears to do something similar so we can reuse much of this. The call tree is quite deep and broad so great care will need to be taken when making these changes to examine state-changing calls. Furthermore, such a change will be needed for each device, although this will be simpler after the first device is added.

The second method requires copying the stack and then restoring it. It doesn't pay the costs of following a return path into deprivileged mode or moving back to the call site as it effectively skips all of this. Memory accesses on the stack are roughly the same as for the first method, but we do need enough storage to hold a copy of the stack for each vcpu. The edits to intermediate callers are likely to be simpler than in method one, as we don't need to worry about there being two different return paths. Adding a new device model would most likely be easier than with method one.

Method two appears to require fewer edits to the original source code, and I suspect it would be more efficient computationally than moving up and down the stack twice with multiple flag tests breaking the code up. However, the latter has already been done for the QEMU call paths, so method one may prove less troublesome or expensive than expected.

The third method _may_ require significant code refactoring, as there is currently only one stack per pcpu, so this may be a large change.

Just to reiterate, this is intended as a proof-of-concept to measure how feasible such a feature is.

I'm currently on the fence between method one and method two.

Method one will require more attention to existing code paths and is less like a context-switch approach.

Method two will require less attention to existing code paths and is more like a context-switching approach.

I am unsure of method three as I suspect it would be a significant change.

Are there any potential issues or things which I have overlooked? Additionally, which (if any) of the above would you recommend pursuing or do you have any ideas regarding alternatives?
