Xen project Mailing List

Re: [Xen-devel] [PATCH] xen: only clobber multicall elements without error

To: Juergen Gross <jgross@xxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

From: Julien Grall <julien.grall@xxxxxxx>

Date: Mon, 26 Nov 2018 16:01:41 +0000

Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 26 Nov 2018 16:01:49 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 26/11/2018 15:29, Juergen Gross wrote:

On 26/11/2018 15:58, Jan Beulich wrote:

On 26.11.18 at 15:23, <jgross@xxxxxxxx> wrote:

On 26/11/2018 15:01, Jan Beulich wrote:

On 26.11.18 at 14:52, <jgross@xxxxxxxx> wrote:

I don't think the hypervisor should explicitly try to make it as hard as
possible for the guest to find problems in the code.


That's indeed not the hypervisor's goal. Instead it tries to make
it as hard as possible for the guest (developer) to make wrong
assumptions.


Let's look at the current example why I wrote this patch:

The Linux kernel's use of multicalls should never trigger any single
call to return an error (return value < 0). A kernel compiled for
production use will catch such errors, but has no knowledge of which
single call has failed, as it doesn't keep track of the individual entries
(non-production kernels have an option available in the respective
source to copy the entries before doing the multicall in order to have
some diagnostic data available in case of such an error). Catching an
error from a multicall right now means a WARN() with a stack backtrace
(for the multicall itself, not for the entry causing the error).
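
The copy-before-issue idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the kernel's actual multicall code: the names mc_entry, mc_batch, mc_flush and mc_demo_issue are hypothetical, the entry layout (op, result, six args) is merely modeled on the public multicall entry, and the stand-in "hypervisor" always clobbers the arguments, as a debug build might.

```c
#include <stddef.h>
#include <string.h>

#define MC_NARGS     6   /* assumed number of argument slots per entry */
#define MC_BATCH_MAX 32  /* arbitrary batch size for this sketch */

/* Hypothetical entry layout, modeled on a multicall entry. */
struct mc_entry {
    unsigned long op;
    long result;
    unsigned long args[MC_NARGS];
};

struct mc_batch {
    struct mc_entry entries[MC_BATCH_MAX]; /* handed to the hypervisor */
    struct mc_entry saved[MC_BATCH_MAX];   /* diagnostic copy (debug only) */
    size_t count;
    int debug;                             /* non-zero: keep a copy */
};

/* Hypothetical callback standing in for the real multicall hypercall. */
typedef void (*mc_issue_t)(struct mc_entry *entries, size_t count);

/*
 * Issue the batch. Returns the index of the first entry whose result is
 * negative, or -1 if every entry succeeded. With debug set, the original
 * op and args of a failed entry remain readable in b->saved[] even if the
 * hypervisor overwrote b->entries[].
 */
static long mc_flush(struct mc_batch *b, mc_issue_t issue)
{
    if (b->debug)
        memcpy(b->saved, b->entries, b->count * sizeof(b->entries[0]));

    issue(b->entries, b->count);

    for (size_t i = 0; i < b->count; i++)
        if (b->entries[i].result < 0)
            return (long)i;

    return -1;
}

/*
 * Stand-in issuer for experiments: op 0 fails with -22 (an EINVAL-style
 * code), everything else succeeds, and the arguments are always clobbered.
 */
static void mc_demo_issue(struct mc_entry *entries, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        entries[i].result = entries[i].op == 0 ? -22 : 0;
        memset(entries[i].args, 0xAA, sizeof(entries[i].args));
    }
}
```

The cost of the extra memcpy() on every flush is the reason such a copy would normally be limited to debug builds.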

I have a customer report for a case where such a backtrace was produced,
followed by a kernel crash some seconds later, obviously due to illegally
unmapped memory pages resulting from the failed multicall. Unfortunately
there are multiple possibilities for what might have gone wrong and I don't
know which one was the culprit. The problem can't be a very common one,
because there is only one such report right now, and it might depend on
a special driver.

Finding this bug without a known reproducer and the current amount of
diagnostic data is next to impossible. So I'd like to have more data
available without having to hurt performance for the 99.999999% of the
cases where nothing bad happens.

In case you have an idea how to solve this problem in another way I'd be
happy to follow that route. I'd really like to be able to have a better
clue in case such an error occurs in future.


Since you have a production kernel, I assume you also have a
production hypervisor. This hypervisor doesn't clobber the
arguments if I'm not mistaken. Therefore
- in the debugging scenario you (can) have all data available by
   virtue of the information getting copied in the kernel,
- in the release scenario you have all data available since it's
   left un-clobbered.
Am I missing anything (I don't view mixed debug/release setups
of kernel and hypervisor as overly important here)?


No, you are missing nothing here. OTOH a debug hypervisor destroying
debug data is kind of weird, so I posted this patch.

This is quite a common approach if you want to force the other entity not to
rely on some fields. This also follows what we do for hypercalls (at least on Arm).
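
For illustration, the two behaviours being discussed (a debug build clobbering every entry versus clobbering only entries that returned no error) could be sketched like this. It is not the actual Xen implementation: dbg_mcall, dbg_multicall, dbg_demo_handler, the debug_build and keep_failed switches, and the 0xAA fill pattern are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

#define DBG_NARGS 6  /* assumed number of argument slots per entry */

/* Hypothetical entry layout, modeled on a multicall entry. */
struct dbg_mcall {
    unsigned long op;
    long result;
    unsigned long args[DBG_NARGS];
};

/* Hypothetical per-call handler; a negative return value means error. */
typedef long (*dbg_handler_t)(const struct dbg_mcall *call);

/* Overwrite everything except 'result', so callers cannot rely on it. */
static void dbg_clobber(struct dbg_mcall *call)
{
    long result = call->result;

    memset(call, 0xAA, sizeof(*call));
    call->result = result;
}

/*
 * Execute each entry in order.
 *  - debug_build == 0: entries are left untouched apart from 'result'.
 *  - debug_build != 0, keep_failed == 0: every entry is clobbered.
 *  - debug_build != 0, keep_failed != 0: only entries without an error
 *    are clobbered, so failed ones stay available for diagnosis.
 */
static void dbg_multicall(struct dbg_mcall *calls, size_t count,
                          dbg_handler_t handler, int debug_build,
                          int keep_failed)
{
    for (size_t i = 0; i < count; i++) {
        calls[i].result = handler(&calls[i]);

        if (!debug_build)
            continue;
        if (keep_failed && calls[i].result < 0)
            continue;

        dbg_clobber(&calls[i]);
    }
}

/* Demo handler: op 0 fails with -22 (an EINVAL-style code), others pass. */
static long dbg_demo_handler(const struct dbg_mcall *call)
{
    return call->op == 0 ? -22 : 0;
}
```

With keep_failed set, a guest that only ever reads 'result' sees no difference, while one that inspects a failed entry still finds its original op and args.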

Cheers,

-- 
Julien Grall
