
Re: [Xen-devel] RFC: HVM de-privileged mode scheduling considerations



On Mon, Aug 3, 2015 at 3:34 PM, Ian Campbell <ian.campbell@xxxxxxxxxx> wrote:
> On Mon, 2015-08-03 at 14:54 +0100, Andrew Cooper wrote:
>> On 03/08/15 14:35, Ben Catterall wrote:
>> > Hi all,
>> >
>> > I am working on an x86 proof-of-concept to evaluate if it is feasible
>> > to move device models and x86 emulation code for HVM guests into a
>> > de-privileged context.
>> >
>> > I was hoping to get feedback from relevant maintainers on scheduling
>> > considerations for this system to mitigate potential DoS attacks.
>> >
>> > Many thanks in advance,
>> > Ben
>> >
>> > This is intended as a proof-of-concept, with the aim of determining if
>> > this idea is feasible within performance constraints.
>> >
>> > Motivation
>> > ----------
>> > The motivation for moving the device models and x86 emulation code
>> > into ring 3 is to mitigate a system compromise caused by a bug in
>> > any of these components. These components are currently part of the
>> > hypervisor and, consequently, a bug in any of them could allow an
>> > attacker to gain control of Xen and/or guests, or to perform a DoS.
>> >
>> > Migrating between PCPUs
>> > -----------------------
>> > We need to support migration between pcpus so that the scheduler
>> > can still perform this operation. However, there is an issue to
>> > resolve. Currently, I keep a per-vcpu copy of the Xen ring 0 stack
>> > up to the point of entering de-privileged mode. This allows us to
>> > restore that stack and continue from the entry point once
>> > de-privileged mode has finished. These per-vcpu stack copies will
>> > contain per-pcpu data, such as saved frame pointers into the
>> > per-pcpu stack, cached smp_processor_id() results, etc.
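>> >
>> > As a rough sketch (the struct and field names here are
>> > hypothetical, not what the patch actually uses), the per-vcpu save
>> > area looks something like:
>> >
>> >     struct depriv_save {
>> >         /* Copy of the ring 0 stack, from the stack base up to
>> >          * the point at which de-privileged mode was entered. */
>> >         uint8_t      stack_copy[STACK_SIZE];
>> >         size_t       stack_bytes; /* bytes actually copied */
>> >         unsigned int saved_pcpu;  /* pcpu the copy was taken on */
>> >     };
>> >
>> > The copy is only valid if we resume on saved_pcpu: the saved frame
>> > pointers in it point into that pcpu's stack, and cached
>> > smp_processor_id() values are baked into the saved frames.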
>> >
>> > Therefore, it will be necessary to lock the vcpu to the current
>> > pcpu when it enters this user mode, so that it does not wake up on
>> > a different pcpu where such pointers and other data are invalid.
>> > We can do this by setting hard affinity to the pcpu that the vcpu
>> > is executing on. See common/wait.c, which does something similar
>> > to what I am doing.
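>> >
>> > Paraphrasing from memory what common/wait.c does (not verbatim;
>> > see __prepare_to_wait() for the real thing):
>> >
>> >     /* Pin the vcpu to the current pcpu, remembering the old
>> >      * affinity so that it can be restored on completion. */
>> >     wqv->wakeup_cpu = smp_processor_id();
>> >     cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity);
>> >     if ( vcpu_set_hard_affinity(curr, cpumask_of(wqv->wakeup_cpu)) )
>> >         /* Cannot pin: bail rather than risk waking elsewhere. */
>> >         domain_crash(curr->domain);
>> >
>> >     /* ... and on completion: */
>> >     vcpu_set_hard_affinity(curr, &wqv->saved_affinity);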
>> >
>> > However, needing hard affinity to a pcpu leads to the following
>> > problem:
>> > - An attacker could lock multiple vcpus to a single pcpu, leading
>> > to a DoS. This could be achieved by spinning in a loop in Xen
>> > de-privileged mode (assuming a bug in this mode allows it) and
>> > performing this operation on multiple vcpus at once. The attacker
>> > could wait until all of their vcpus were on the same pcpu and then
>> > launch the attack. This could effectively lock up the pcpu, as it
>> > would be under heavy load with no way for us to move work
>> > elsewhere.
>> >
>> > A solution to the DoS would be to force migration to another pcpu
>> > if, say, 100 quanta have passed with the vcpu remaining in
>> > de-privileged mode. Forcing this migration would require us to
>> > forcibly complete the de-privileged operation and then, just
>> > before returning to the guest, force a cpu change. We could not
>> > simply force a migration at the schedule call point, as the Xen
>> > stack needs to unwind to free up resources. We would reset this
>> > count each time a de-privileged mode operation completed.
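>> >
>> > In pseudo-C (all the depriv_* names here are hypothetical):
>> >
>> >     #define DEPRIV_QUANTA_LIMIT 100  /* tunable */
>> >
>> >     /* Scheduler tick while a vcpu is in de-privileged mode.
>> >      * We cannot migrate here: the Xen stack must unwind first,
>> >      * so just flag that a migration is wanted. */
>> >     if ( ++v->depriv_quanta >= DEPRIV_QUANTA_LIMIT )
>> >         v->depriv_migrate = 1;
>> >
>> >     /* Just before returning to the guest, after the operation
>> >      * has been forcibly completed and the stack unwound. */
>> >     v->depriv_quanta = 0;
>> >     if ( test_and_clear_bool(v->depriv_migrate) )
>> >         vcpu_set_hard_affinity(v, &v->depriv_saved_affinity);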
>> >
>> > A legitimate long-running de-privileged operation would also
>> > trigger this forced-migration mechanism. However, such operations
>> > are unlikely to be needed, and the threshold can be adjusted
>> > appropriately to mitigate this.
>> >
>> > Any suggestions or feedback would be appreciated!
>>
>> I don't see why any scheduling support is needed.
>>
>> Currently all operations like this are run synchronously in the vmexit
>> context of the vcpu.  Any current DoS is already a real issue.
>
> The point is that this work is supposed to mitigate (or eliminate) such
> issues, so we would like to remove this existing real issue.
>
> IOW while it might be expected that an in-Xen DM can DoS the system, an
> in-Xen-ring3 DM should not be able to do so.
>
>> In any reasonable situation, emulation of a device is a small state
>> mutation, occasionally kicking off a further action to perform.  (The
>> far bigger risk from this kind of emulation is following bad
>> pointers/etc., rather than long loops.)
>>
>> I think it would be entirely reasonable to have a deadline for a single
>> execution of depriv mode, after which the domain is declared malicious
>> and killed.
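>>
>> Roughly (the depriv_* names are hypothetical; set_timer() and
>> domain_crash() are the usual Xen primitives):
>>
>>     /* One-shot deadline armed on entry to depriv mode. */
>>     set_timer(&v->depriv_timer, NOW() + MILLISECS(depriv_deadline));
>>
>>     static void depriv_deadline_fn(void *data)
>>     {
>>         struct vcpu *v = data;
>>
>>         /* Deadline blown: declare the domain malicious. */
>>         printk(XENLOG_ERR "d%d: depriv deadline exceeded\n",
>>                v->domain->domain_id);
>>         domain_crash(v->domain);
>>     }
>>
>> with a stop_timer() on the normal completion path.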
>
> I think this could make sense; it's essentially a harsher variant of Ben's
> suggestion to abort an attempt to process the MMIO in order to migrate to
> another pcpu, but it has the benefit of being easier to implement and
> easier to reason about in terms of interactions with other aspects of the
> system (i.e. it seems to remove the need to think of ways an attacker might
> game that other mechanism).
>
>> We already have this for host pcpus - the watchdog defaults to 5
>> seconds.  Having a similar cutoff for depriv mode should be fine.
>
> That's a reasonable analogy.
>
> Perhaps we would want the depriv-watchdog to be some 1/N fraction of the
> pcpu-watchdog, for a smallish N, to avoid the risk of any slop in the
> timing allowing the pcpu watchdog to fire. N=3 for example (on the grounds
> that N=2 is probably sufficient, so N=3 must be awesome).
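>
> I.e. something like ("watchdog_timeout" standing in for however the
> NMI watchdog period is actually exposed):
>
>     #define DEPRIV_WATCHDOG_FRACTION 3
>     depriv_deadline = SECONDS(watchdog_timeout) /
>                       DEPRIV_WATCHDOG_FRACTION;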

+1

 -George
