
Re: [Xen-devel] [PATCH] xen: only clobber multicall elements without error



>>> On 26.11.18 at 15:23, <jgross@xxxxxxxx> wrote:
> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>> On 26.11.18 at 14:52, <jgross@xxxxxxxx> wrote:
>>> I don't think the hypervisor should explicitly try to make it as hard as
>>> possible for the guest to find problems in the code.
>> 
>> That's indeed not the hypervisor's goal. Instead it tries to make
>> it as hard as possible for the guest (developer) to make wrong
>> assumptions.
> 
> Let's look at the current example why I wrote this patch:
> 
> The Linux kernel's use of multicalls should never cause any single
> call to return an error (return value < 0). A kernel built for
> production use will catch such errors, but has no knowledge of which
> individual call failed, as it doesn't keep track of the single entries
> (non-production kernels have an option available in the respective
> source to copy the entries before doing the multicall, in order to have
> some diagnostic data available in case of such an error). Catching an
> error from a multicall right now means a WARN() with a stack backtrace
> (for the multicall itself, not for the entry causing the error).
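> 
> To make that concrete, the debug aid boils down to keeping a pristine
> copy of the batch before it is handed to the hypervisor. A minimal
> sketch only -- the structure and the names mc_buffer, mc_issue and
> CONFIG_MC_DEBUG are illustrative here, not the actual multicalls.c
> code:
> 
>   #include <linux/bug.h>
>   #include <linux/string.h>
>   #include <xen/interface/xen.h>      /* struct multicall_entry */
>   #include <asm/xen/hypercall.h>      /* HYPERVISOR_multicall() */
> 
>   #define MC_BATCH 32
> 
>   struct mc_buffer {
>           struct multicall_entry entries[MC_BATCH];
>   #ifdef CONFIG_MC_DEBUG                           /* illustrative switch */
>           struct multicall_entry debug[MC_BATCH];  /* pristine copy */
>   #endif
>           unsigned int mcidx;
>   };
> 
>   static void mc_issue(struct mc_buffer *b)
>   {
>   #ifdef CONFIG_MC_DEBUG
>           /* Preserve the entries; a debug hypervisor may clobber them. */
>           memcpy(b->debug, b->entries,
>                  b->mcidx * sizeof(struct multicall_entry));
>   #endif
>           /* The multicall only fails as a whole here; per-entry errors
>            * are reported back in entries[i].result. */
>           if (HYPERVISOR_multicall(b->entries, b->mcidx) != 0)
>                   BUG();
>   }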
> 
> I have a customer report of a case where such a backtrace was produced
> and the kernel crashed some seconds later, obviously due to memory
> pages being wrongly unmapped as a result of the failed multicall.
> Unfortunately there are multiple possibilities for what might have gone
> wrong, and I don't know which one was the culprit. The problem can't be
> a very common one, because there is only one such report right now,
> which might depend on a special driver.
> 
> Finding this bug without a known reproducer and with the current amount
> of diagnostic data is next to impossible. So I'd like to have more data
> available without hurting performance in the 99.999999% of cases where
> nothing bad happens.
> 
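> What I'd like to be able to do when an error does come back is roughly
> the following (again just a sketch building on the illustrative
> mc_buffer above; the dump is only useful if the hypervisor has left the
> arguments of failed elements un-clobbered):
> 
>   #include <linux/printk.h>
> 
>   /* Sketch: report the failing element(s) instead of a bare WARN(). */
>   static void mc_check_results(struct mc_buffer *b)
>   {
>           unsigned int i;
> 
>           for (i = 0; i < b->mcidx; i++) {
>                   struct multicall_entry *e = &b->entries[i];
> 
>                   if ((long)e->result >= 0)
>                           continue;
> 
>                   /* Only meaningful if the hypervisor did not clobber
>                    * the arguments of failed elements. */
>                   pr_warn("multicall %u/%u failed: op %lu result %ld "
>                           "args %lx %lx %lx %lx %lx %lx\n",
>                           i, b->mcidx, (unsigned long)e->op,
>                           (long)e->result,
>                           (unsigned long)e->args[0], (unsigned long)e->args[1],
>                           (unsigned long)e->args[2], (unsigned long)e->args[3],
>                           (unsigned long)e->args[4], (unsigned long)e->args[5]);
>           }
>   }
> 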
> In case you have an idea how to solve this problem in another way, I'd
> be happy to follow that route. I'd really like to have a better clue
> about what happened should such an error occur in the future.

Since you have a production kernel, I assume you also have a
production hypervisor. This hypervisor doesn't clobber the
arguments if I'm not mistaken. Therefore
- in the debugging scenario you (can) have all data available by
  virtue of the information getting copied in the kernel,
- in the release scenario you have all data available since it's
  left un-clobbered.
Am I missing anything (I don't view mixed debug/release setups
of kernel and hypervisor as overly important here)?
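
For reference, the behaviour we're discussing amounts to something like
the following in the hypervisor's per-element multicall handling (a
sketch only, not the literal xen/common/multicall.c code; the field
names and the clobber pattern are illustrative):

#ifndef NDEBUG
    /* Debug-only: scrub the element's arguments after it has been
     * processed, so guests can't come to rely on them being preserved.
     * Release builds compile this out and leave the data intact.
     * The proposed patch would skip the scrubbing when the element
     * returned an error, keeping the failing arguments visible. */
    if ( (long)mcs->call.result >= 0 )   /* patch: clobber only on success */
    {
        unsigned int j;

        for ( j = 0; j < ARRAY_SIZE(mcs->call.args); j++ )
            mcs->call.args[j] = 0xDEADBEEF;   /* pattern is illustrative */
    }
#endif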

Jan


