Re: [Xen-devel] Ongoing/future speculative mitigation work
On 19/10/18 09:09, Dario Faggioli wrote:
> On Thu, 2018-10-18 at 18:46 +0100, Andrew Cooper wrote:
>> Hello,
>>
> Hey,
>
> This is very accurate and useful... thanks for it. :-)
>
>> 1) A secrets-free hypervisor.
>>
>> Basically every hypercall can be (ab)used by a guest, and used as an
>> arbitrary cache-load gadget.  Logically, this is the first half of a
>> Spectre SP1 gadget, and is usually the first stepping stone to
>> exploiting one of the speculative sidechannels.
>>
>> Short of compiling Xen with LLVM's Speculative Load Hardening (which
>> is still experimental, and comes with a ~30% perf hit in the common
>> case), this is unavoidable.  Furthermore, throwing a few
>> array_index_nospec() into the code isn't a viable solution to the
>> problem.
>>
>> An alternative option is to have less data mapped into Xen's virtual
>> address space - if a piece of memory isn't mapped, it can't be loaded
>> into the cache.
>>
>> [...]
>>
>> 2) Scheduler improvements.
>>
>> (I'm afraid this is rather more sparse because I'm less familiar with
>> the scheduler details.)
>>
>> At the moment, all of Xen's schedulers will happily put two vcpus
>> from different domains on sibling hyperthreads.  There has been a lot
>> of sidechannel research over the past decade demonstrating ways for
>> one thread to infer what is going on in the other, but L1TF is the
>> first vulnerability I'm aware of which allows one thread to directly
>> read data out of the other.
>>
>> Either way, it is now definitely a bad thing to run different guests
>> concurrently on siblings.
>>
> Well, yes.  But, as you say, L1TF, and I'd say TLBleed as well, are
> the first serious issues discovered so far and, for instance, even on
> x86, not all Intel CPUs and none of the AMD ones, AFAIK, are affected.

TLBleed is an excellent paper and associated research, but is still just
inference - a vast quantity of post-processing is required to extract
the key.  There are plenty of other sidechannels which affect all SMT
implementations, such as the effects of executing an mfence instruction,
execution unit contention, and so on.

> Therefore, although I certainly think we _must_ have the proper
> scheduler enhancements in place (and in fact I'm working on that :-D)
> it should IMO still be possible for the user to decide whether or not
> to use them (either by opting-in or opting-out, I don't care much at
> this stage).

I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.
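To make that concrete, here is a minimal sketch of the check such a
scheduler has to make before placing a vcpu (this is not actual Xen
code - the helper names are only approximations of the real scheduler
interfaces):

    /*
     * Illustrative sketch only: before placing vcpu v on a CPU, check
     * that every busy sibling hyperthread of that CPU is already
     * running a vcpu from the same domain.
     */
    static bool siblings_compatible(const struct vcpu *v, unsigned int cpu)
    {
        unsigned int sibling;

        for_each_cpu ( sibling, per_cpu(cpu_sibling_mask, cpu) )
        {
            const struct vcpu *curr = curr_on_cpu(sibling);

            /* The CPU itself, and idle siblings, are always fine. */
            if ( sibling == cpu || is_idle_vcpu(curr) )
                continue;

            /* Never mix vcpus from different domains on one core. */
            if ( curr->domain != v->domain )
                return false;
        }

        return true;
    }

A real implementation obviously also has to cope with migration and
with races against context switches on the siblings; the above is only
meant to show the shape of the constraint.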
>> Fixing this by simply not scheduling vcpus from a different guest on
>> siblings does result in a lower resource utilisation, most notably
>> when there is an odd number of runnable vcpus in a domain, as the
>> other thread is forced to idle.
>>
> Right.
>
>> A step beyond this is core-aware scheduling, where we schedule in
>> units of a virtual core rather than a virtual thread.  This has much
>> better behaviour from the guest's point of view, as the
>> actually-scheduled topology remains consistent, but does potentially
>> come with even lower utilisation if every other thread in the guest
>> is idle.
>>
> Yes, basically, what you describe as 'core-aware scheduling' here can
> be built on top of what you had described above as 'not scheduling
> vcpus from different guests'.
>
> I mean, we can/should put ourselves in a position where the user can
> choose if he/she wants:
> - just 'plain scheduling', as we have now,
> - "just" that only vcpus of the same domain are scheduled on sibling
>   hyperthreads,
> - full 'core-aware scheduling', i.e., only vcpus that the guest
>   actually sees as virtual hyperthread siblings are scheduled on
>   hardware hyperthread siblings.
>
> About the performance impact, indeed it's even higher with core-aware
> scheduling.  Something that we can see about doing is acting on the
> guest scheduler, e.g., telling it to try to "pack the load" and keep
> siblings busy, instead of trying to avoid doing that (which is what
> happens by default in most cases).
>
> In Linux, this can be done by playing with the sched-flags (see, e.g.,
> https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20
> and /proc/sys/kernel/sched_domain/cpu*/domain*/flags ).
>
> The idea would be to avoid, as much as possible, the case when "every
> other thread is idle in the guest".  I'm not sure about being able to
> do something by default, but we can certainly document things (like
> "if you enable core-scheduling, also do `echo 1234 >
> /proc/sys/.../flags' in your Linux guests").
>
> I haven't checked whether other OSs' schedulers have something
> similar.
>
>> A side requirement for core-aware scheduling is for Xen to have an
>> accurate idea of the topology presented to the guest.  I need to
>> dust off my Toolstack CPUID/MSR improvement series and get that
>> upstream.
>>
> Indeed.  Without knowing which of the guest's vcpus are to be
> considered virtual hyperthread siblings, I can only get you as far as
> "only scheduling vcpus of the same domain on sibling hyperthreads". :-)
>
>> One of the most insidious problems with L1TF is that, with
>> hyperthreading enabled, a malicious guest kernel can engineer
>> arbitrary data leakage by having one thread scanning the expected
>> physical address, and the other thread using an arbitrary cache-load
>> gadget in hypervisor context.  This occurs because the L1 data cache
>> is shared by threads.
>>
> Right.  So, sorry if this is a stupid question, but how does this
> relate to the "secret-free hypervisor", and to the "if a piece of
> memory isn't mapped, it can't be loaded into the cache" idea?
>
> So, basically, I'm asking whether I am understanding it correctly
> that secret-free Xen + core-aware scheduling would *not* be enough
> for mitigating L1TF properly (and if the answer is no, why... but
> only if you have 5 mins to explain it to me :-P).
>
> In fact, ISTR that core-scheduling, plus something that looked to me
> similar enough to "secret-free Xen", is how Microsoft claims to be
> mitigating L1TF on Hyper-V...

Correct - that is what Hyper-V appears to be doing.

It's best to consider the secret-free Xen and scheduler improvements as
orthogonal.  In particular, the secret-free Xen is defence in depth
against SP1 and the risk of future issues, but it does have
non-speculative benefits as well.

That said, the only way to use HT and definitely be safe against L1TF
without a secret-free Xen is to have the synchronised entry/exit logic
working.
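To give an idea of the shape of that logic, here is a very rough
conceptual sketch (all names are made up, and the hard parts - deadlock
avoidance, idle siblings, and pairing the entries as well as the exits -
are deliberately left out):

    /*
     * Conceptual sketch only: track whether each CPU is currently
     * executing guest code, and on vmexit wait until every sibling
     * hyperthread has also left the guest.
     */
    static DEFINE_PER_CPU(bool, in_guest);

    void sibling_sync_on_vmexit(void)
    {
        unsigned int cpu = smp_processor_id();
        unsigned int sibling;

        /* We have just left the guest. */
        this_cpu(in_guest) = false;
        smp_wmb();

        for_each_cpu ( sibling, per_cpu(cpu_sibling_mask, cpu) )
        {
            if ( sibling == cpu )
                continue;

            /* Kick the sibling so it takes a vmexit too... */
            smp_send_event_check_cpu(sibling);

            /* ...and wait until it is no longer executing guest code. */
            while ( per_cpu(in_guest, sibling) )
                cpu_relax();
        }
    }

    void sibling_sync_before_vmentry(void)
    {
        /*
         * A real implementation also has to stop the sibling
         * re-entering the guest while this thread still has secrets
         * live in the L1D cache, e.g. by pairing the entries as well
         * as the exits.
         */
        this_cpu(in_guest) = true;
        smp_wmb();
    }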
>> A solution to this issue was proposed, whereby Xen synchronises
>> siblings on vmexit/entry, so we are never executing code in two
>> different privilege levels.  Getting this working would make it safe
>> to continue using hyperthreading even in the presence of L1TF.
>>
> Err... ok, but we still want core-aware scheduling, or at least we
> want to avoid having vcpus from different domains on siblings, don't
> we?  In order to avoid leaks between guests, I mean.

Ideally, we'd want all of these.  I expect the only reasonable way to
develop them is one on top of another.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel