Re: Issue with waitqueues and Intel CET-SS
On 12/08/2021 19:20, Jason Andryuk wrote:
> Hi,
>
> I was reviewing xen/common/wait.c:__prepare_to_wait() and I think I've
> identified an incompatibility with shadow stacks like Intel CET-SS.
>
> The inline asm does:
>
>         "call 1f;"
>         "1: addq $2f-1b,(%%rsp);"
>         "sub %%esp,%%ecx;"
>         "cmp %3,%%ecx;"
>         "ja 3f;"
>         "mov %%rsp,%%rsi;"
>
>         /* check_wakeup_from_wait() longjmp()'s to this point. */
>         "2: rep movsb;"
>         "mov %%rsp,%%rsi;"
>         "3: pop %%rax;"
>
> `call 1f` gets the address of the code, but the address is popped off
> without ret. This will leave the shadow stack out-of-sync which will
> trigger the protection. Is my analysis correct?

There is a shadow stack bug here, but it isn't this.

A CALL with 0 displacement explicitly doesn't push a return address onto
the shadow stack, because CALL; POP is a common technique used in 32bit
PIC logic. Similarly, 0-displacement calls don't enter the RSB/RAS branch
predictor (which is why the safety of retpoline depends on using non-zero
displacements).

However, the fact that we copy the regular stack sideways and totally skip
the shadow stack is a problem. We'll survive the setjmp(), but the
longjmp() will explode because it tries to use the old context's shstk
with the new context's regular stack.

There is no credible way to make the waitqueue infrastructure compatible
with shadow stacks. Furthermore, the infrastructure is an eyesore and
we've discussed how to go about deleting it several times in the past -
it is only needed because of a shortcut taken in the
monitor/sharing/paging ring handling with dom0.

In the short term, we should have logic to perform a boot-time disable of
monitor/sharing/paging if CET is wanted, which works in exactly the same
way as opt_pv32. This will prevent Xen from exploding on a CET system when
you first try to introspect a VM with more than 7(?) vcpus, and will let
the user choose between "working introspection" and "working CET".

As a tangent, if we know that waitqueues aren't in use, then it is safe to
turn off HVM RSB stuffing, which is enough of a perf win to actively
chase. I already have some work in progress for disabling PV RSB stuffing,
and the raw perf numbers are https://paste.debian.net/hidden/0d59b024/ for
anyone interested.

As for monitor/sharing/paging, their dependence on the waitqueue
infrastructure is easy to remove, now that we've got the acquire_resource
infrastructure.

The problem is that, given a 4k ring shared between all vcpus, Xen may
need to wait for space on the ring in the middle of a logical action,
necessitating scheduling the vcpu out while retaining the interim state
which isn't at a clean return-to-guest boundary. The problem is
exacerbated by the fact that monitor in particular allows for sync or
async notifications, meaning that one single vcpu could consume all room
in the ring in very short order.

The fix is also easy. For the sync interface, switch to a slot-per-vcpu
model exactly like the IOREQ interface. This simplifies Xen and the
userspace agent, and improves performance because you don't need to
marshal data in and out of the ring. It also offers an opportunity to
switch to evtchn-per-vcpu rather than having to scan all vcpus on every
interrupt.

For the async ring, we can either delete that functionality (it is unclear
whether it is actually used at all), or we can implement a multi-page ring
similar to the vmtrace buffer (actually easier, because there is no
requirement to be physically contiguous).
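To make the slot-per-vcpu suggestion a little more concrete, here is a
very rough sketch of the kind of shared layout I have in mind, borrowing
the shape of the IOREQ shared_iopage. The names (vm_event_slot,
VM_EVENT_SLOT_*) are made up purely for illustration - this is not an
existing or proposed ABI:

    /* Illustrative only -- not the current vm_event ABI. One slot per
     * vCPU in a page (or pages) mapped by the agent via acquire_resource,
     * mirroring how struct shared_iopage works for IOREQ. */
    enum vm_event_slot_state {
        VM_EVENT_SLOT_IDLE,     /* slot free, owned by Xen            */
        VM_EVENT_SLOT_PENDING,  /* request written, vCPU paused       */
        VM_EVENT_SLOT_ACKED,    /* response written, Xen may unpause  */
    };

    struct vm_event_slot {
        uint32_t state;              /* enum vm_event_slot_state      */
        uint32_t pad;
        vm_event_request_t  req;     /* filled in by Xen              */
        vm_event_response_t rsp;     /* filled in by the agent        */
    };

    struct vm_event_slot_page {
        struct vm_event_slot slot[1];   /* really d->max_vcpus entries */
    };

Because each vCPU owns its slot, there is no contention for ring space
and no marshalling in or out of a ring, and the agent can bind an event
channel per vCPU rather than scanning every vCPU on each notification.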
As long as the ring is at least big enough for one entry per vcpu, we can
rearrange the logic to never need to sleep before putting an entry into
the ring. Specifically, pause the current vCPU after writing the entry if
the remaining space is smaller than $N * d->max_vcpus, after which we can
return to the scheduler like all other operations which pause a vCPU, and
the waitqueue logic loses its only caller.

If anyone has time to address any parts of this, I'll happily provide as
much guidance as I can.

~Andrew
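P.S. For anyone picking this up, the "pause instead of sleep" flow above
could look roughly like this. write_request_to_ring(),
vm_event_ring_space() and the N threshold are placeholders for
illustration, not existing code:

    /* Rough sketch only. Assumes the ring holds at least one entry per
     * vCPU, so writing the request can never fail; the vCPU is paused
     * afterwards if free space is getting low, instead of sleeping on a
     * waitqueue before writing. */
    void vm_event_put_request(struct domain *d, struct vm_event_domain *ved,
                              vm_event_request_t *req)
    {
        spin_lock(&ved->lock);

        write_request_to_ring(ved, req);   /* placeholder helper */

        if ( vm_event_ring_space(ved) < N * d->max_vcpus )
            vm_event_vcpu_pause(current);  /* resumed when the agent responds */

        spin_unlock(&ved->lock);

        /* Return to the scheduler as normal -- no wait_event() needed. */
    }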