Xen project Mailing List

Re: [Xen-devel] RFC: HVM de-privileged mode scheduling considerations

To: Ian Campbell <ian.campbell@xxxxxxxxxx>

From: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>

Date: Tue, 4 Aug 2015 14:46:57 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Dario Faggioli <dario.faggioli@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Ben Catterall <Ben.Catterall@xxxxxxxxxx>

Delivery-date: Tue, 04 Aug 2015 13:47:07 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Mon, Aug 3, 2015 at 3:34 PM, Ian Campbell <ian.campbell@xxxxxxxxxx> wrote:
> On Mon, 2015-08-03 at 14:54 +0100, Andrew Cooper wrote:
>> On 03/08/15 14:35, Ben Catterall wrote:
>> > Hi all,
>> >
>> > I am working on an x86 proof-of-concept to evaluate if it is feasible
>> > to move device models and x86 emulation code for HVM guests into a
>> > de-privileged context.
>> >
>> > I was hoping to get feedback from relevant maintainers on scheduling
>> > considerations for this system to mitigate potential DoS attacks.
>> >
>> > Many thanks in advance,
>> > Ben
>> >
>> > This is intended as a proof-of-concept, with the aim of determining if
>> > this idea is feasible within performance constraints.
>> >
>> > Motivation
>> > ----------
>> > The motivation for moving the device models and x86 emulation code
>> > into ring 3 is to mitigate a system compromise due to a bug in any of
>> > these systems. These systems are currently part of the hypervisor and,
>> > consequently, a bug in any of these could allow an attacker to gain
>> > control (or perform a DoS) of Xen and/or guests.
>> >
>> > Migrating between PCPUs
>> > -----------------------
>> > There is a need to support migration between pcpus so that the
>> > scheduler can still perform this operation. However, there is an issue
>> > to resolve. Currently, I have a per-vcpu copy of the Xen ring 0 stack
>> > up to the point of entering the de-privileged mode. This allows us to
>> > restore this stack and then continue from the entry point when we have
>> > finished in de-privileged mode. There will be per-pcpu data on these
>> > per-vcpu stacks such as saved stack frame pointers for the per-pcpu
>> > stack, smp_processor_id() responses etc.
>> >
>> > Therefore, it will be necessary to lock the vcpu to the current pcpu
>> > when it enters this user mode so that it does not wake up on a
>> > different pcpu where such pointers and other data are invalid. We can
>> > do this by setting a hard affinity to the pcpu that the vcpu is
>> > executing on. See common/wait.c which does something similar to what I
>> > am doing.
>> >
>> > However, needing to have hard affinity to a pcpu leads to the
>> > following problem:
>> > - An attacker could lock multiple vcpus to a single pcpu, leading to a
>> >   DoS. This could be achieved by spinning in a loop in Xen
>> >   de-privileged mode (assuming a bug in this mode) and performing this
>> >   operation on multiple vcpus at once. The attacker could wait until all
>> >   of their vcpus were on the same pcpu and then execute this attack.
>> >   This could cause the pcpu to, effectively, lock up, as it will be
>> >   under heavy load, and we would be unable to move work elsewhere.
>> >
>> > A solution to the DoS would be to force migration to another pcpu, if
>> > after, say, 100 quanta have passed where the vcpu has remained in
>> > de-privileged mode. This forcing of migration would require us to
>> > forcibly complete the de-privileged operation, and then, just before
>> > returning into the guest, force a cpu change. We could not just force
>> > a migration at the schedule call point as the Xen stack needs to
>> > unwind to free up resources. We would reset this count each time we
>> > completed a de-privileged mode operation.
>> >
>> > A legitimate long-running de-privileged operation would trigger this
>> > forced migration mechanism. However, it is unlikely that such
>> > operations will be needed and the count can be adjusted appropriately
>> > to mitigate this.
>> >
>> > Any suggestions or feedback would be appreciated!
>>
>> I don't see why any scheduling support is needed.
>>
>> Currently all operations like this are run synchronously in the vmexit
>> context of the vcpu. Any current DoS is already a real issue.
>
> The point is that this work is supposed to mitigate (or eliminate) such
> issues, so we would like to remove this existing real issue.
>
> IOW while it might be expected that an in-Xen DM can DoS the system, an
> in-Xen-ring3 DM should not be able to do so.
>
>> In any reasonable situation, emulation of a device is a small state
>> mutation and occasionally kicking off a further action to perform. (The
>> far bigger risk from this kind of emulation is following bad
>> pointers/etc, rather than long loops.)
>>
>> I think it would be entirely reasonable to have a deadline for a single
>> execution of depriv mode, after which the domain is declared malicious
>> and killed.
>
> I think this could make sense, it's essentially a harsher variant of Ben's
> suggestion to abort an attempt to process the MMIO in order to migrate to
> another pcpu, but it has the benefit of being easier to implement and
> easier to reason about in terms of interactions with other aspects of the
> system (i.e. it seems to remove the need to think of ways an attacker might
> game that other system).
>
>> We already have this for host pcpus - the watchdog defaults to 5
>> seconds. Having a similar cutoff for depriv mode should be fine.
>
> That's a reasonable analogy.
>
> Perhaps we would want the depriv-watchdog to be some 1/N fraction of the
> pcpu-watchdog, for a smallish N, to avoid the risk of any slop in the
> timing allowing the pcpu watchdog to fire. N=3 for example (on the grounds
> that N=2 is probably sufficient, so N=3 must be awesome).

+1

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
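
For illustration only, the depriv-watchdog idea from the thread (a deadline for a single execution of depriv mode, set to 1/N of the host watchdog) might be sketched in C as below. This is not Xen code: `struct depriv_state`, `depriv_enter()`, `depriv_exit()`, `depriv_expired()` and `depriv_deadline_ms()` are hypothetical names, and the sketch assumes the caller supplies a monotonic millisecond timestamp. What happens once the deadline is exceeded (for example, killing the domain) is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host watchdog default mentioned in the thread: 5 seconds. */
#define HOST_WATCHDOG_MS 5000u

/* Hypothetical per-vcpu bookkeeping; not an actual Xen structure. */
struct depriv_state {
    uint64_t entry_ms; /* timestamp at which depriv mode was entered */
    bool     active;   /* true while the vcpu is in depriv mode */
};

/* Depriv deadline as a 1/n fraction of the host watchdog timeout. */
static inline uint64_t depriv_deadline_ms(uint64_t host_watchdog_ms,
                                          unsigned int n)
{
    assert(n > 0);
    return host_watchdog_ms / n;
}

/* Record entry into depriv mode. */
static inline void depriv_enter(struct depriv_state *s, uint64_t now_ms)
{
    s->entry_ms = now_ms;
    s->active = true;
}

/* Record a completed depriv operation. */
static inline void depriv_exit(struct depriv_state *s)
{
    s->active = false;
}

/*
 * Return true if a single depriv execution has run past the deadline.
 * The caller decides what to do next (e.g. treat the domain as malicious).
 */
static inline bool depriv_expired(const struct depriv_state *s,
                                  uint64_t now_ms, uint64_t deadline_ms)
{
    return s->active && (now_ms - s->entry_ms) > deadline_ms;
}
```

With N=3 and the 5 second default, `depriv_deadline_ms(HOST_WATCHDOG_MS, 3)` gives a deadline of 1666 ms, which leaves a wide margin before the host watchdog itself would fire.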
