Xen project Mailing List

Re: Issue with waitqueues and Intel CET-SS

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 12 Aug 2021 20:18:17 +0100

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=YkmZLF6b+epKsmv6mrPF6MxaDlvZ75ZCJBjXBq06Rew=; b=bA0qpLfoJG33j74MhJBaAOC8PUyUGxfirtseE33w9hh3uhHULh45xz/fxwndcMc3Ugat26U5XwkD7aLEQMA/trtTruOrOb17dguHhpV/cL0NEDJmZtFWbvMSSzC5gvAIwF6Dn5kqAqDXoLBn09VfmTJtiSSY21NmTGhYT0NjVwS4ZoIqKc8Obaej5QO41A4+7ircyt1i2ZrzcFR5NYdvl9A3my4ztECAdE9b00nqjxN4l4yUpFee0dXhQlo/QD+m+JPbVlGugeZnIZu+HGAJo1Lijo05M1a8gIHgdkk52WqZhfpuMznKaCX/VmapstsZEJe2r4oHkZEBjYRTVYBW7Q==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=BZ0HCDpBsiAj9WQ4axrAXaaq2HJd8TQEt1kTNktqpyMjHzKJecs0iNzWB4Z8HntxXMJY/eR01l8UnfGThWCsOjnAlLOjiRPyRZd1B2U8iigJwP/eSSjky+Wos5h3pCv7/CQGtb7FUZULx4J6MPlkHvCBJ+r9IYXoTG43/hhVT7GFY9CWzSHnAtvcou2gu4RX6ujcr02bDLIZ+9i/aLhxCZFt9mX5KFTIwXoeNE6Bz85L5tf8TAHZBYnNeFQ5GKcjtCXncT/24VJi50sI/u6qHQ07GS4aL8y55Z0T+9zHedNRS4+ZkPXQo84j9EudUNJSH00RTOr+oZcs7BT9DtF0hg==

Authentication-results: esa3.hc3370-68.iphmx.com; dkim=pass (signature verified) header.i=@citrix.onmicrosoft.com

Cc: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Tamas K Lengyel <tamas@xxxxxxxxxxxxx>, Alexandru Isaila <aisaila@xxxxxxxxxxxxxxx>, Petre Pircalabu <ppircalabu@xxxxxxxxxxxxxxx>, Andrei LUTAS <vlutas@xxxxxxxxxxxxxxx>, Mihai Donțu <mdontu@xxxxxxxxxxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Jason Andryuk <jandryuk@xxxxxxxxx>

Delivery-date: Thu, 12 Aug 2021 19:18:32 +0000

Ironport-hdrordr: A9a23:JDSUVq+Y6QKHE7jbQdRuk+FWdb1zdoMgy1knxilNoENuGPBwxv rEoB1E73fJYW4qKQkdcdDpAsm9qADnhOVICOgqTP2ftWzd1VdAQ7sSibcKrwePJ8S6zJ8l6U 4CSdkyNDSTNykcsS+S2mDVfOrIguP3lpxA7t2urEuFODsaDp2ImD0JaDpzfHcWeCB2Qb4CUL aM7MtOoDStPV4NaN6gO3UDV+/f4/XWiZPPe3c9dlEawTjLqQntxK/xEhCe0BtbeShI260e/W /MlBG8zrm/ssu81gTX2wbontprcZrau5p+7f63+4sowwbX+0SVjbFaKv2/VX4O0aSSAR0R4a PxSl8bTrlOAjXqDy2ISFLWqnXd+Sdr5Hn4xVCCh3z/5cT/WTIhEsJEwZlUax3D9iMbzZhBOY 9wrhWkXqBsfGX9deXGlqv1fgAvklDxrWspkOYVgXAaWYwCaKVJpYha+E9OCp8PEC/z9YhiSY BVfYrhzecTdUnfY2HSv2FpztDpVnMvHg2eSkxHvsCOyTBZkH1w0kNdzs0CmXUL8o47VvB/lq z5G7UtkKsLQt4dbKp7CutEScyrCnbVSRaJK26WKUSPLtByB5sMke+D3FwR3pDbRHUl9upNpH 3xaiIriYdpQTOQNSSn5uw7zizw

Ironport-sdr: ox13Jfo4lao8veH/ZLISjnxABgbNZ584dH/gSnwYjmNc7Z/U6P1bn/DSOgxmDdpTsbhCA6CUWq +Y9Sp0ZSw02UXiX4jKt+9MzADEC/ChF74Y6PDFuGi8MqqYltf/9yKog8GdO25JuFUS+uwwBxPl fjCV7zIcWWYf3WcpBjexI1DELkdTaji7yyfBgE0QsmdJ2tVm44nYLvV3mF8uYuU84f4ZaCLyTg x0tROonXjVcUhrKAvYqRyz5kihnyA1bK4v0DhcWY616I2GcBnRNwVTHrpmuQy7pBMKeO99XGkQ tm3a2U1RFMCDAh3zQm3H/Dvy

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12/08/2021 19:20, Jason Andryuk wrote: > Hi, > > I was reviewing xen/common/wait.c:__prepare_to_wait() and I think I've > identified an incompatibility with shadow stacks like Intel CET-SS. > > The inline asm does: > > "call 1f;" > "1: addq $2f-1b,(%%rsp);" > "sub %%esp,%%ecx;" > "cmp %3,%%ecx;" > "ja 3f;" > "mov %%rsp,%%rsi;" > > /* check_wakeup_from_wait() longjmp()'s to this point. */ > "2: rep movsb;" > "mov %%rsp,%%rsi;" > "3: pop %%rax;" > > `call 1f` gets the address of the code, but the address is popped off > without ret. This will leave the shadow stack out-of-sync which will > trigger the protection. Is my analysis correct? There is a shadow stack bug here, but it isn't this. A CALL with 0 displacement explicitly doesn't push a return address onto the shadow stack, because CALL; POP is a common technique used in 32bit PIC logic. Similarly, 0-displacement calls don't enter the RSB/RAS branch predictor (and why the safety of retpoline depends on using non-zero displacements). However, the fact we copy the regular stack sideways and totally skip the shadow stack is a problem. We'll survive the setjmp(), but the longjmp() will explode because of trying to use the old context's shstk with the new context's regular stack. There is no credible way to make the waitqueue infrastructure compatible with shadow stacks. Furthermore, the infrastructure is an eyesore and we've discussed how to go about deleting it several times in the past - it is only needed because of an shortcut taken in the monitor/sharing/paging ring handling with dom0. In the short term, we should have logic to perform a boot-time disable of monitor/sharing/paging if CET is wanted, which works in exactly the same way as opt_pv32. This will prevent Xen from exploding on a CET system when you first try to introspect a VM with more than 7(?) vcpus, and will let the user choose between "working introspection" and "working CET". As a tangent, if we know that waitqueues aren't in use, then it is safe to turn off HVM RSB stuffing, which is enough of a perf win to actively chase. I already have some work in progress for disabling PV RSB stuffing, and the raw perf numbers are https://paste.debian.net/hidden/0d59b024/ for anyone interested. As for monitor/sharing/paging, their dependence on the waitqueue infrastructure is easy to remove, now that we've got the acquire_resource infrastructure. The problem is that, given a 4k ring shared between all vcpus, Xen may need to wait for space on the ring in the middle of a logical action, necessitating scheduling the vcpu out while retaining the interim state which isn't at a clean return-to-guest boundary. The problem is exasperated by the fact that monitor in particular allows for sync or async notifications, meaning that one single vcpu could consume all room in the ring in very short order. The fix is also easy. For the sync interface, switch to a slot-per-vcpu model exactly like the IOREQ interface. This simplifies Xen and the userspace agent, and improves performance because you don't need to marshal data in and out of the ring. Also offers an opportunity to switch to evtchn-per-vcpu rather than having to scan all vcpus on every interrupt. For the async ring, we can either delete that functionality (it is unclear whether it is actually used at all) or we can implement a multi-page ring similar to the vmtrace buffer (actually easier, because there is no requirement to be physically contiguous). As long as the ring is at least big enough for one entry per vcpu, we can rearrange the logic to never need to sleep before putting an entry into the ring. Specifically, pause the current vCPU after writing the entry if the remaining space is smaller than $N * d->max_vcpus, after which we can return to the scheduler like all other operations which pause a vCPU, and the waitqueue logic loses its only caller. If anyone has time to address any parts of this, I'll happily provide as much guidance as I can. ~Andrew

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.