Re: Issue with waitqueues and Intel CET-SS

  • To: <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Thu, 12 Aug 2021 20:18:17 +0100
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Tamas K Lengyel <tamas@xxxxxxxxxxxxx>, Alexandru Isaila <aisaila@xxxxxxxxxxxxxxx>, Petre Pircalabu <ppircalabu@xxxxxxxxxxxxxxx>, Andrei LUTAS <vlutas@xxxxxxxxxxxxxxx>, Mihai Donțu <mdontu@xxxxxxxxxxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Jason Andryuk <jandryuk@xxxxxxxxx>
  • Delivery-date: Thu, 12 Aug 2021 19:18:32 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12/08/2021 19:20, Jason Andryuk wrote:
> Hi,
> I was reviewing xen/common/wait.c:__prepare_to_wait() and I think I've
> identified an incompatibility with shadow stacks like Intel CET-SS.
> The inline asm does:
>         "call 1f;"
>         "1: addq $2f-1b,(%%rsp);"
>         "sub %%esp,%%ecx;"
>         "cmp %3,%%ecx;"
>         "ja 3f;"
>         "mov %%rsp,%%rsi;"
>         /* check_wakeup_from_wait() longjmp()'s to this point. */
>         "2: rep movsb;"
>         "mov %%rsp,%%rsi;"
>         "3: pop %%rax;"
> `call 1f` gets the address of the code, but the address is popped off
> without ret.  This will leave the shadow stack out-of-sync which will
> trigger the protection.  Is my analysis correct?

There is a shadow stack bug here, but it isn't this.

A CALL with 0 displacement explicitly doesn't push a return address onto
the shadow stack, because CALL; POP is a common technique used in 32bit
PIC logic.  Similarly, 0-displacement calls don't enter the RSB/RAS
branch predictor (which is why the safety of retpoline depends on using
non-zero displacements).

However, the fact we copy the regular stack sideways and totally skip
the shadow stack is a problem.  We'll survive the setjmp(), but the
longjmp() will explode because of trying to use the old context's shstk
with the new context's regular stack.

There is no credible way to make the waitqueue infrastructure compatible
with shadow stacks.

Furthermore, the infrastructure is an eyesore and we've discussed how to
go about deleting it several times in the past - it is only needed
because of a shortcut taken in the monitor/sharing/paging ring handling
with dom0.

In the short term, we should have logic to perform a boot-time disable
of monitor/sharing/paging if CET is wanted, which works in exactly the
same way as opt_pv32.  This will prevent Xen from exploding on a CET
system when you first try to introspect a VM with more than 7(?) vcpus,
and will let the user choose between "working introspection" and
"working CET".

As a tangent, if we know that waitqueues aren't in use, then it is safe
to turn off HVM RSB stuffing, which is enough of a perf win to actively
chase.  I already have some work in progress for disabling PV RSB
stuffing, and the raw perf numbers are
https://paste.debian.net/hidden/0d59b024/ for anyone interested.

As for monitor/sharing/paging, their dependence on the waitqueue
infrastructure is easy to remove, now that we've got the
acquire_resource infrastructure.

The problem is that, given a 4k ring shared between all vcpus, Xen may
need to wait for space on the ring in the middle of a logical action,
necessitating scheduling the vcpu out while retaining the interim state
which isn't at a clean return-to-guest boundary.  The problem is
exacerbated by the fact that monitor in particular allows for sync or
async notifications, meaning that one single vcpu could consume all room
in the ring in very short order.

The fix is also easy.  For the sync interface, switch to a slot-per-vcpu
model exactly like the IOREQ interface.  This simplifies Xen and the
userspace agent, and improves performance because you don't need to
marshal data in and out of the ring.  It also offers an opportunity to
switch to evtchn-per-vcpu rather than having to scan all vcpus on every
notification.

For the async ring, we can either delete that functionality (it is
unclear whether it is actually used at all) or we can implement a
multi-page ring similar to the vmtrace buffer (actually easier, because
there is no requirement to be physically contiguous).

As long as the ring is at least big enough for one entry per vcpu, we
can rearrange the logic to never need to sleep before putting an entry
into the ring.  Specifically, pause the current vCPU after writing the
entry if the remaining space is smaller than $N * d->max_vcpus, after
which we can return to the scheduler like all other operations which
pause a vCPU, and the waitqueue logic loses its only caller.

If anyone has time to address any parts of this, I'll happily provide as
much guidance as I can.



