Xen project Mailing List

[Xen-devel] Sporadic PV guest malloc.c assertion failures and segfaults unless pv-l1tf=false is set

From: Andy Smith <andy@xxxxxxxxxxxxxx>

Date: Sun, 25 Nov 2018 06:18:49 +0000

Delivery-date: Sun, 25 Nov 2018 06:18:59 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Openpgp: id=BF15490B; url=http://strugglers.net/~andy/pubkey.asc

Hi,

Last weekend I deployed a hypervisor built from the 4.10.1 release plus the most recent XSAs (which were under embargo at that time). Prior to this I had only gone as far as XSA-267, having taken a decision to wait before applying later XSAs. So, this most recent deployment included the fixes for XSA-273 for the first time.

Over the course of this past week, some guests started to experience sporadic assertion failures in libc/malloc.c or strange segmentation violations. In most cases it is not easily reproducible, but I got a report from one guest administrator that their php-fpm process is reliably segfaulting immediately. For example:

    [19-Nov-2018 06:39:56] WARNING: [pool www] child 3682 exited on signal 11 (SIGSEGV) after 18.601413 seconds from start
    [19-Nov-2018 06:39:56] NOTICE: [pool www] child 3683 started
    [19-Nov-2018 06:40:16] WARNING: [pool www] child 3683 exited on signal 11 (SIGSEGV) after 20.364357 seconds from start
    [19-Nov-2018 06:40:16] NOTICE: [pool www] child 3684 started
    [19-Nov-2018 06:43:43] WARNING: [pool www] child 3426 exited on signal 11 (SIGSEGV) after 1327.885798 seconds from start
    [19-Nov-2018 06:43:43] NOTICE: [pool www] child 3739 started
    [19-Nov-2018 06:43:59] WARNING: [pool www] child 3739 exited on signal 11 (SIGSEGV) after 15.922980 seconds from start

The failures that mention malloc.c are happening in multiple different binaries, including grep, perl and shells. They look like this:

    grep: malloc.c:2372: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *(sizeof(size_t))) - 1)) & ~((2 *(sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long) old_end & pagemask) == 0)' failed.

I have not been able to reproduce these problems when I boot the hypervisor with pv-l1tf=false. The php-fpm one was previously reproducible 100% of the time. The other cases are very hard to trigger, but with pv-l1tf=false I am not able to trigger them at all.

I have since checked out staging-4.10 and am experiencing the same thing, so I'm fairly confident it is not something I've introduced when applying XSA patches.

My workload is several hundred PV guests across 9 servers with two different types of Intel CPU. The guests are of many different Linux distributions, probably a 70/30 split between 32- and 64-bit. I have so far only encountered this with 64-bit guests running Debian jessie and stretch. Fewer than 10 guests are affected (as reported so far), and all of them trigger the "d1 L1TF-vulnerable L1e 000000006a6ff960 - Shadowing" warning in dmesg (though there are hundreds of others which trigger it yet seem unaffected). There is an unconfirmed report from 64-bit Gentoo.

In the text for XSA-273 it says:

    "Shadowing comes with a workload-dependent performance hit to the guest. Once the guest kernel software updates have been applied, a well behaved guest will not write vulnerable PTEs, and will therefore avoid the performance penalty (or crash) entirely."

Does anyone have a reference to what is needed in the Linux kernel for that? Perhaps I can see what the status of that is within kernel upstream / Debian and then get past the problem by getting an updated guest kernel onto affected guests.

Also:

    "This behaviour is active by default for guests on affected hardware (controlled by `pv-l1tf=`), but is disabled by default for dom0. Dom0's exemption is because of instabilities when being shadowed, which are under investigation"

I have not had these issues in any of my 9 dom0s, which are all 64-bit Debian jessie. Since these L1TF fixes are not active for dom0, that makes sense. Are the observed dom0 instabilities similar to what I am seeing in some guests?

Any suggestions for further debugging? I have attached an "xl dmesg".

Cheers,
Andy
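One way to see which guests trigger the warning quoted in the message is to scan saved "xl dmesg" output for it and split each logged entry into its Present bit, address bits and flag bits. The following is only a sketch: it assumes the log lines match the single warning quoted above, and the function names and the bits 12..51 address mask are illustrative assumptions rather than anything taken from the Xen source.

```python
import re
from collections import Counter

# Matches the warning format quoted in the message, e.g.
# "d1 L1TF-vulnerable L1e 000000006a6ff960 - Shadowing"
WARNING_RE = re.compile(
    r"d(?P<domid>\d+) L1TF-vulnerable L(?P<level>\d)e "
    r"(?P<pte>[0-9a-fA-F]{16}) - Shadowing"
)

PAGE_PRESENT = 1 << 0                 # x86 page-table entry bit 0: Present
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF  # assumption: address held in bits 12..51


def parse_warnings(text):
    """Return (domid, level, pte) tuples for every warning found in text."""
    return [
        (int(m["domid"]), int(m["level"]), int(m["pte"], 16))
        for m in WARNING_RE.finditer(text)
    ]


def decode_pte(pte):
    """Split a page-table entry into Present bit, address bits and low flags."""
    return {
        "present": bool(pte & PAGE_PRESENT),
        "address": pte & ADDR_MASK,
        "flags": pte & 0xFFF,
    }


def domains_with_warnings(text):
    """Count warnings per domain ID."""
    return Counter(domid for domid, _level, _pte in parse_warnings(text))


if __name__ == "__main__":
    sample = "(XEN) d1 L1TF-vulnerable L1e 000000006a6ff960 - Shadowing\n"
    for domid, level, pte in parse_warnings(sample):
        info = decode_pte(pte)
        print(f"d{domid} L{level}e present={info['present']} "
              f"address={info['address']:#x} flags={info['flags']:#x}")
```

To use it on real data, save the hypervisor log with "xl dmesg > dmesg.txt" and pass the file contents to parse_warnings() or domains_with_warnings(). Domain IDs in the output still have to be mapped to guest names separately, for example with "xl list".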

Attachment: dmesg.txt
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
