[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH] xen: only clobber multicall elements without error
On 26/11/2018 16:17, Juergen Gross wrote: On 26/11/2018 17:01, Julien Grall wrote:On 26/11/2018 15:29, Juergen Gross wrote:On 26/11/2018 15:58, Jan Beulich wrote:On 26.11.18 at 15:23, <jgross@xxxxxxxx> wrote:On 26/11/2018 15:01, Jan Beulich wrote:On 26.11.18 at 14:52, <jgross@xxxxxxxx> wrote:I don't think the hypervisor should explicitly try to make it as hard as possible for the guest to find problems in the code.That's indeed not the hypervisor's goal. Instead it tries to make it as hard as possible for the guest (developer) to make wrong assumptions.Let's look at the current example why I wrote this patch: The Linux kernel's use of multicalls should never trigger any single call to return an error (return value < 0). A kernel compiled for productive use will catch such errors, but has no knowledge which single call has failed, as it doesn't keep track of the single entries (non-productive kernels have an option available in the respective source to copy the entries before doing the multicall in order to have some diagnostic data available in case of such an error). Catching an error from a multicall right now means a WARN() with a stack backtrace (for the multicall itself, not for the entry causing the error). I have a customer report for a case where such a backtrace was produced and a kernel crash some seconds later, obviously due to illegally unmapped memory pages resulting from the failed multicall. Unfortunately there are multiple possibilities what might have gone wrong and I don't know which one was the culprit. The problem can't be a very common one, because there is only one such report right now, which might depend on a special driver. Finding this bug without a known reproducer and the current amount of diagnostic data is next to impossible. So I'd like to have more data available without having to hurt performance for the 99.999999% of the cases where nothing bad happens. In case you have an idea how to solve this problem in another way I'd be happy to follow that route. I'd really like to be able to have a better clue in case such an error occurs in future.Since you have a production kernel, I assume you also have a production hypervisor. This hypervisor doesn't clobber the arguments if I'm not mistaken. Therefore - in the debugging scenario you (can) have all data available by virtue of the information getting copied in the kernel, - in the release scenario you have all data available since it's left un-clobbered. Am I missing anything (I don't view mixed debug/release setups of kernel and hypervisor as overly important here)?No, you are missing nothing here. OTOH a debug hypervisor destroying debug data is kind of weird, so I posted this patch.This is a quite common approach if you want to enforce the other entity to not rely on some fields. This also follows what we do for hypercalls (at least on Arm).I don't question that general mechanism. What I question is doing the clobbering in this case where the caller would need to take special measures for copying the needed debug data which will hit performance. So not doing the clobbering would only be in the very rare error case, not in the common case where the guest should really have no need to see the preserved data. I really fail to see why it is so bad to not clobber data in a case which normally should never occur. The only outcome of clobbering the data in the error case is making diagnosis of that error much harder. Its not as if there would be secret data suddenly made available to the guest. Its just avoiding the need for the guest to copy the data for each multicall for the very unlikely chance an error might occur. And we are not speaking of a hypercall issued then and now, but of the path hit for nearly every memory-management action and context switch of PV guests. So doing always the copy would really be visible. For a first, now the behavior will not be the same as when handling a single hypercall. I would much prefer if the behavior is kept the same everywhere. Secondly, you are introducing an ABI change without explicitly telling it. This should be clarified in the public interface and probably the multicall code to avoid re-introducing the clobbering in the future. But I am concerned that the conditional clobbering (i.e depending on the return) will be overlooked by the guest and will probably introduce some more interesting^weird bug. Cheers, -- Julien Grall _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |