  Date: Mon, 26 Nov 2018 16:29:02 +0100
On 26/11/2018 15:58, Jan Beulich wrote:
>>>> On 26.11.18 at 15:23, <jgross@xxxxxxxx> wrote:
>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>> On 26.11.18 at 14:52, <jgross@xxxxxxxx> wrote:
>>>> I don't think the hypervisor should explicitly try to make it as hard as
>>>> possible for the guest to find problems in the code.
>>> That's indeed not the hypervisor's goal. Instead it tries to make
>>> it as hard as possible for the guest (developer) to make wrong
>>> assumptions.
>> Let's look at the concrete example that made me write this patch:
>> The Linux kernel's use of multicalls should never trigger any single
>> call to return an error (return value < 0). A kernel compiled for
>> production use will catch such errors, but has no knowledge of which
>> single call has failed, as it doesn't keep track of the single entries
>> (non-production kernels have an option available in the respective
>> source to copy the entries before doing the multicall in order to have
>> some diagnostic data available in case of such an error). Catching an
>> error from a multicall right now means a WARN() with a stack backtrace
>> (for the multicall itself, not for the entry causing the error).
>> I have a customer report for a case where such a backtrace was produced,
>> followed by a kernel crash some seconds later, obviously due to illegally
>> unmapped memory pages resulting from the failed multicall. Unfortunately
>> there are multiple possibilities for what might have gone wrong and I
>> don't know which one was the culprit. The problem can't be a very common
>> one, because there is only one such report right now, which might depend
>> on a special driver.
>> Finding this bug without a known reproducer and the current amount of
>> diagnostic data is next to impossible. So I'd like to have more data
>> available without having to hurt performance for the 99.999999% of the
>> cases where nothing bad happens.
>> If you have an idea how to solve this problem in another way I'd be
>> happy to follow that route. I'd really like to be able to have a better
>> clue if such an error occurs in the future.
> Since you have a production kernel, I assume you also have a
> production hypervisor. This hypervisor doesn't clobber the
> arguments if I'm not mistaken. Therefore
> - in the debugging scenario you (can) have all data available by
>   virtue of the information getting copied in the kernel,
> - in the release scenario you have all data available since it's
>   left un-clobbered.
> Am I missing anything (I don't view mixed debug/release setups
> of kernel and hypervisor as overly important here)?

No, you are missing nothing here. OTOH a debug hypervisor destroying
debug data is kind of weird, so I posted this patch.

I'll add the related Linux kernel patch (provided it is acked by Boris)
with or without this hypervisor patch, but I thought it would be better
to have the hypervisor patch in place, especially as e.g. a hypervisor
from xen-unstable might have a bug which would be easier to diagnose
with this patch in place.
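
To illustrate the diagnostic problem discussed above, here is a rough,
self-contained sketch of the "copy the entries before doing the multicall"
idea. It is not the actual Linux or Xen code: struct mc_entry,
mock_multicall(), issue_batch() and MAX_BATCH are hypothetical names made up
for this example, and the mock simply fails any entry whose op is 0. The
clobber_args flag stands in for a debug hypervisor overwriting the arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BATCH 32   /* hypothetical limit on entries per batch */
#define NR_ARGS   6    /* hypothetical number of arguments per entry */

/* Hypothetical stand-in for one multicall entry (not the real ABI struct). */
struct mc_entry {
    unsigned long op;
    long result;
    unsigned long args[NR_ARGS];
};

/*
 * Mock hypervisor: every entry with op == 0 fails with -22, all others
 * succeed. With clobber_args set, the arguments are overwritten after
 * processing, mimicking the debug-build behaviour under discussion.
 */
void mock_multicall(struct mc_entry *e, size_t n, int clobber_args)
{
    for (size_t i = 0; i < n; i++) {
        e[i].result = (e[i].op == 0) ? -22 : 0;
        if (clobber_args)
            memset(e[i].args, 0xAA, sizeof(e[i].args));
    }
}

/*
 * Issue a batch while keeping a private copy of the entries. Returns the
 * index of the first failed entry, or -1 if all succeeded. On failure,
 * *failed holds the original op and arguments plus the error result, so
 * the caller can report which call failed even if args were clobbered.
 */
int issue_batch(struct mc_entry *e, size_t n, int clobber_args,
                struct mc_entry *failed)
{
    struct mc_entry saved[MAX_BATCH];

    assert(n <= MAX_BATCH);
    memcpy(saved, e, n * sizeof(*e));      /* the cost paid on every call */

    mock_multicall(e, n, clobber_args);

    for (size_t i = 0; i < n; i++) {
        if (e[i].result < 0) {
            *failed = saved[i];
            failed->result = e[i].result;
            return (int)i;
        }
    }
    return -1;
}
```

The memcpy() runs for every batch, including all the ones that succeed,
which is the overhead a production kernel wants to avoid. If the arguments
are left un-clobbered, the scan for a negative result alone is enough to
print the failing entry, and no copy is needed.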


