Xen project Mailing List

Re: [Xen-devel] [PATCH] xen: only clobber multicall elements without error

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Mon, 26 Nov 2018 07:58:27 -0700

Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, Julien Grall <julien.grall@xxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 26 Nov 2018 14:58:45 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

>>> On 26.11.18 at 15:23, <jgross@xxxxxxxx> wrote:
> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>> On 26.11.18 at 14:52, <jgross@xxxxxxxx> wrote:
>>> I don't think the hypervisor should explicitly try to make it as hard as
>>> possible for the guest to find problems in the code.
>>
>> That's indeed not the hypervisor's goal. Instead it tries to make
>> it as hard as possible for the guest (developer) to make wrong
>> assumptions.
>
> Let's look at the current example why I wrote this patch:
>
> The Linux kernel's use of multicalls should never trigger any single
> call to return an error (return value < 0). A kernel compiled for
> productive use will catch such errors, but has no knowledge which
> single call has failed, as it doesn't keep track of the single entries
> (non-productive kernels have an option available in the respective
> source to copy the entries before doing the multicall in order to have
> some diagnostic data available in case of such an error). Catching an
> error from a multicall right now means a WARN() with a stack backtrace
> (for the multicall itself, not for the entry causing the error).
>
> I have a customer report for a case where such a backtrace was produced
> and a kernel crash some seconds later, obviously due to illegally
> unmapped memory pages resulting from the failed multicall. Unfortunately
> there are multiple possibilities what might have gone wrong and I don't
> know which one was the culprit. The problem can't be a very common one,
> because there is only one such report right now, which might depend on
> a special driver.
>
> Finding this bug without a known reproducer and the current amount of
> diagnostic data is next to impossible. So I'd like to have more data
> available without having to hurt performance for the 99.999999% of the
> cases where nothing bad happens.
>
> In case you have an idea how to solve this problem in another way I'd be
> happy to follow that route. I'd really like to be able to have a better
> clue in case such an error occurs in future.

Since you have a production kernel, I assume you also have a
production hypervisor. This hypervisor doesn't clobber the
arguments if I'm not mistaken. Therefore
- in the debugging scenario you (can) have all data available
  by virtue of the information getting copied in the kernel,
- in the release scenario you have all data available since it's
  left un-clobbered.
Am I missing anything (I don't view mixed debug/release setups
of kernel and hypervisor as overly important here)?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
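
The debugging scenario discussed in this message (copy the entries before issuing the multicall, then use the copy to identify the entry that failed) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual Linux or Xen code: `struct mc_entry`, `fake_multicall()`, `issue_and_diagnose()` and `MC_BATCH` are hypothetical names, the "hypervisor" is simulated in user space, and the rule that op 0 fails with -22 exists only to produce an error to diagnose.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MC_ARGS  6
#define MC_BATCH 32   /* hypothetical upper bound on entries per batch */

/* Hypothetical stand-in for a multicall entry; not the real ABI struct. */
struct mc_entry {
    unsigned long op;
    long result;
    unsigned long args[MC_ARGS];
};

/*
 * Simulated hypervisor side (an assumption for illustration only).
 * Op 0 "fails" with -22; every other op succeeds. With clobber != 0
 * the entry is overwritten after execution, so only 'result' can be
 * relied on afterwards, mimicking the debug behaviour under discussion.
 */
int fake_multicall(struct mc_entry *calls, size_t n, int clobber)
{
    for (size_t i = 0; i < n; i++) {
        long res = (calls[i].op == 0) ? -22 : 0;

        if (clobber)
            memset(&calls[i], 0xAA, sizeof(calls[i]));
        calls[i].result = res;
    }
    return 0;
}

/*
 * Guest side: snapshot the entries, issue the batch, then scan for the
 * first negative result. Returns the index of the failing entry (with
 * its pre-call contents and its result stored in *failed), or -1 if
 * every entry succeeded.
 */
int issue_and_diagnose(struct mc_entry *calls, size_t n, int clobber,
                       struct mc_entry *failed)
{
    struct mc_entry saved[MC_BATCH];

    assert(n <= MC_BATCH);
    memcpy(saved, calls, n * sizeof(*calls));   /* the extra copy costs time */

    fake_multicall(calls, n, clobber);

    for (size_t i = 0; i < n; i++) {
        if (calls[i].result < 0) {
            *failed = saved[i];                 /* op and args as submitted */
            failed->result = calls[i].result;   /* error reported for them */
            return (int)i;
        }
    }
    return -1;
}
```

The `memcpy()` before every batch is the performance cost a production kernel wants to avoid. When the entries are left un-clobbered, the same scan could read `op` and `args` directly from `calls[i]` and the snapshot would not be needed.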
