
Re: [Xen-devel] XenGT is still regressed on master



>>> On 07.03.19 at 13:46, <igor.druzhinin@xxxxxxxxxx> wrote:
> We've noticed that there is still a regression with p2m_ioreq_server P2M
> type on master. Since the commit 940faf0279 (x86/HVM: split page

Afaics it's 3bdec530a5. I'm also slightly confused by the use of "still":
Iirc so far I've had only an informal indication of the problem, without
any details, from Andrew.

Also Cc-ing Jürgen so he's explicitly aware of the regression, although
I assume he has noticed the report already anyway.

> straddling emulated accesses in more cases) the behavior of write and
> rmw instruction emulation changed (possibly unintentionally) so that it
> might not re-enter hvmemul_do_io() on IOREQ completion, which should be
> done in order to avoid breaking the IOREQ state machine. What we're seeing
> instead is a domain crash here:
> 
> static int hvmemul_do_io(
>     bool_t is_mmio, paddr_t addr, unsigned long *reps, unsigned int
> ...
>     case STATE_IORESP_READY:
>         vio->io_req.state = STATE_IOREQ_NONE;
>         p = vio->io_req;
> 
>         /* Verify the emulation request has been correctly re-issued */
>         if ( (p.type != (is_mmio ? IOREQ_TYPE_COPY : IOREQ_TYPE_PIO)) ||
>              (p.addr != addr) ||
>              (p.size != size) ||
>              (p.count > *reps) ||
>              (p.dir != dir) ||
>              (p.df != df) ||
>              (p.data_is_ptr != data_is_addr) ||
>              (data_is_addr && (p.data != data)) )
>             domain_crash(currd);
> 
> This is happening on processing of the next IOREQ after the one that
> hasn't been completed properly, due to the p2mt being changed in the IOREQ
> handler by the XenGT kernel module. So it hit the HVMTRANS_okay case in
> the linear_write() helper on the way back and didn't re-enter hvmemul_do_io().

Am I to take this to mean that the first time round we take the
HVMTRANS_bad_gfn_to_mfn exit from __hvm_copy() due to finding
p2m_ioreq_server, but in the course of processing the request the
page's type gets changed and hence we don't take that same path
the second time? If so, my first reaction is to blame the kernel
module: Machine state (of the VM) may not change while processing
a write, other than to carry out the _direct_ effects of the write. I
don't think a p2m type change is supposed to be occurring as a side
effect.

> The bug could be mitigated by the following patch, but since it's you who
> introduced this helper you might have better ideas on how to avoid the
> problem in a clean way here.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -1139,13 +1139,11 @@ static int linear_write(unsigned long addr, unsigned int bytes, void *p_data,
>      {
>          unsigned int offset, part1;
> 
> -    case HVMTRANS_okay:
> -        return X86EMUL_OKAY;
> -
>      case HVMTRANS_bad_linear_to_gfn:
>          x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
>          return X86EMUL_EXCEPTION;
> 
> +    case HVMTRANS_okay:
>      case HVMTRANS_bad_gfn_to_mfn:
>          offset = addr & ~PAGE_MASK;
>          if ( offset + bytes <= PAGE_SIZE )

This is (I'm inclined to say "of course") not an appropriate change in
the general case: Getting back HVMTRANS_okay means the write was
carried out, and hence we shouldn't try to carry it out a second time.

I take it that changing the kernel driver would at best be sub-optimal
though, so a hypervisor-only fix would be better.

Assuming the p2m type change arrives via XEN_DMOP_set_mem_type,
I think what we need to do is delay the actual change until no ioreq
is pending anymore, kind of like the VM event subsystem delays
certain CR and MSR writes until VM entry time. In this situation we'd
then further have to restrict the number of such pending changes,
because we need to record the request(s) somewhere. The question is
how many such bufferable requests would be enough: Unless we
wanted to enforce that only ordinary aligned writes (and rmw-s) can
get us into this state, apart from writes crossing page boundaries we
may need to be able to cope with AVX512's scatter insns.
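
To make the buffering variant a little more concrete, below is a very rough
sketch of what I have in mind. All names are invented, the hooking into the
XEN_DMOP_set_mem_type handler and into the ioreq completion path is entirely
hand-waved, and whether such a buffer wants to live per-vCPU or per-domain
is one of the open questions:

#define MAX_DEFERRED_P2M_CHANGES 8 /* "how many is enough" being the question */

struct deferred_p2m_change {
    gfn_t gfn;
    p2m_type_t type;
};

struct deferred_p2m_changes {
    unsigned int nr;
    struct deferred_p2m_change chg[MAX_DEFERRED_P2M_CHANGES];
};

/* Would be called from the XEN_DMOP_set_mem_type handler instead of
 * changing the type right away. */
static int defer_p2m_change(struct deferred_p2m_changes *dc,
                            gfn_t gfn, p2m_type_t nt)
{
    if ( dc->nr >= ARRAY_SIZE(dc->chg) )
        return -EBUSY; /* or maybe -ERESTART plus a continuation */

    dc->chg[dc->nr].gfn = gfn;
    dc->chg[dc->nr].type = nt;
    ++dc->nr;

    return 0;
}

/* Would be called once the vCPU concerned has no ioreq pending anymore,
 * assuming (as in the case at hand) deferred changes are always away from
 * p2m_ioreq_server. */
static void apply_deferred_p2m_changes(struct domain *d,
                                       struct deferred_p2m_changes *dc)
{
    unsigned int i;

    for ( i = 0; i < dc->nr; ++i )
        p2m_change_type_one(d, gfn_x(dc->chg[i].gfn),
                            p2m_ioreq_server, dc->chg[i].type);

    dc->nr = 0;
}

The applying side would presumably want to hang off the ioreq completion
handling done from hvm_do_resume().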

The other alternative I can think of would be to record the specific
failure case (finding p2m_ioreq_server) in the first pass, and act upon
it (instead of your mitigation above) during the retry. At first
glance it would seem more cumbersome to do the recording at that
layer though.
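
For comparison, the recording variant might end up looking something like
the below inside linear_write() itself, i.e. essentially your re-ordering
above, but gated on state recorded during the first pass. The
saw_ioreq_server_page field is invented, the HVMTRANS_bad_linear_to_gfn
case and the page-crossing tail are omitted, and setting the flag precisely
(only when HVMTRANS_bad_gfn_to_mfn really resulted from p2m_ioreq_server)
is exactly the cumbersome part I was referring to:

static int linear_write(unsigned long addr, unsigned int bytes, void *p_data,
                        uint32_t pfec, struct hvm_emulate_ctxt *hvmemul_ctxt)
{
    pagefault_info_t pfinfo;
    int rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);

    switch ( rc )
    {
        unsigned int offset, part1;

    case HVMTRANS_okay:
        /*
         * If the first pass went the MMIO route because of finding
         * p2m_ioreq_server, force the retry down the same path, so that
         * hvmemul_do_io() gets to consume the pending response.
         */
        if ( !hvmemul_ctxt->saw_ioreq_server_page ) /* invented field */
            return X86EMUL_OKAY;
        /* fall through */

    case HVMTRANS_bad_gfn_to_mfn:
        /*
         * Ideally set only when the failure really was due to finding
         * p2m_ioreq_server, which is information __hvm_copy() doesn't
         * currently hand back to us.
         */
        hvmemul_ctxt->saw_ioreq_server_page = true;
        offset = addr & ~PAGE_MASK;
        if ( offset + bytes <= PAGE_SIZE )
            return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec,
                                             hvmemul_ctxt, 0);
        /* ... remainder as today ... */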

Let me know what you think, and also whether you want me to try to
prototype something, or whether you want to give it a shot.

Jan

