
Re: [Xen-devel] XenGT is still regressed on master



>>> On 07.03.19 at 13:46, <igor.druzhinin@xxxxxxxxxx> wrote:
> We've noticed that there is still a regression with p2m_ioreq_server P2M
> type on master. Since the commit 940faf0279 (x86/HVM: split page

Afaics it's 3bdec530a5. I'm also slightly confused by the use of "still":
Iirc so far I've had only an informal indication of the problem, without
any details, from Andrew.

Also Cc-ing Jürgen so he's explicitly aware of the regression, although
I assume he has noticed the report already anyway.

> straddling emulated accesses in more cases) the behavior of write and
> rmw instruction emulation changed (possibly unintentionally) so that it
> might not re-enter hvmemul_do_io() on IOREQ completion, which should be
> done in order to avoid breaking the IOREQ state machine. What we're seeing
> instead is a domain crash here:
> 
> static int hvmemul_do_io(
>     bool_t is_mmio, paddr_t addr, unsigned long *reps, unsigned int
> ...
>     case STATE_IORESP_READY:
>         vio->io_req.state = STATE_IOREQ_NONE;
>         p = vio->io_req;
> 
>         /* Verify the emulation request has been correctly re-issued */
>         if ( (p.type != (is_mmio ? IOREQ_TYPE_COPY : IOREQ_TYPE_PIO)) ||
>              (p.addr != addr) ||
>              (p.size != size) ||
>              (p.count > *reps) ||
>              (p.dir != dir) ||
>              (p.df != df) ||
>              (p.data_is_ptr != data_is_addr) ||
>              (data_is_addr && (p.data != data)) )
>             domain_crash(currd);
> 
> This is happening on processing of the next IOREQ after the one that
> hasn't been completed properly, due to the p2mt being changed in the IOREQ
> handler by the XenGT kernel module. So it hit the HVMTRANS_okay case in
> the linear_write() helper on the way back and didn't re-enter hvmemul_do_io().

Am I to take this to mean that the first time round we take the
HVMTRANS_bad_gfn_to_mfn exit from __hvm_copy() due to finding
p2m_ioreq_server, but in the course of processing the request the
page's type gets changed and hence we don't take that same path
the second time? If so, my first reaction is to blame the kernel
module: Machine state (of the VM) may not change while processing
a write, other than to carry out the _direct_ effects of the write. I
don't think a p2m type change is supposed to be occurring as a side
effect.

> The bug could be mitigated by the following patch, but since it's you who
> introduced this helper you might have better ideas on how to avoid the
> problem in a clean way here.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -1139,13 +1139,11 @@ static int linear_write(unsigned long addr, unsigned int bytes, void *p_data,
>      {
>          unsigned int offset, part1;
> 
> -    case HVMTRANS_okay:
> -        return X86EMUL_OKAY;
> -
>      case HVMTRANS_bad_linear_to_gfn:
>          x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
>          return X86EMUL_EXCEPTION;
> 
> +    case HVMTRANS_okay:
>      case HVMTRANS_bad_gfn_to_mfn:
>          offset = addr & ~PAGE_MASK;
>          if ( offset + bytes <= PAGE_SIZE )

This is (I'm inclined to say "of course") not an appropriate change in
the general case: Getting back HVMTRANS_okay means the write was
carried out, and hence we shouldn't try to carry it out a second time.

I take it that changing the kernel driver would at best be sub-optimal
though, so a hypervisor-only fix would be better.

Assuming the p2m type change arrives via XEN_DMOP_set_mem_type,
I think what we need to do is delay the actual change until no ioreq
is pending anymore, kind of like the VM event subsystem delays
certain CR and MSR writes until VM entry time. In this situation we'd
then further have to restrict the number of such pending changes,
because we need to record the request(s) somewhere. The question is
how many such bufferable requests would be enough: Unless we
wanted to enforce that only ordinary aligned writes (and rmw-s) can
get us into this state, apart from writes crossing page boundaries we
may need to be able to cope with AVX512's scatter insns.
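
To make the buffering variant a little more concrete, below is a very rough
sketch of what I have in mind. All names are invented, the hooking into the
XEN_DMOP_set_mem_type handler and into the ioreq completion path is entirely
hand-waved, and whether such a buffer wants to live per-vCPU or per-domain
is one of the open questions:

#define MAX_DEFERRED_P2M_CHANGES 8 /* "how many is enough" being the question */

struct deferred_p2m_change {
    gfn_t gfn;
    p2m_type_t type;
};

struct deferred_p2m_changes {
    unsigned int nr;
    struct deferred_p2m_change chg[MAX_DEFERRED_P2M_CHANGES];
};

/* Would be called from the XEN_DMOP_set_mem_type handler instead of
 * changing the type right away. */
static int defer_p2m_change(struct deferred_p2m_changes *dc,
                            gfn_t gfn, p2m_type_t nt)
{
    if ( dc->nr >= ARRAY_SIZE(dc->chg) )
        return -EBUSY; /* or maybe -ERESTART plus a continuation */

    dc->chg[dc->nr].gfn = gfn;
    dc->chg[dc->nr].type = nt;
    ++dc->nr;

    return 0;
}

/* Would be called once the vCPU concerned has no ioreq pending anymore,
 * assuming (as in the case at hand) deferred changes are always away from
 * p2m_ioreq_server. */
static void apply_deferred_p2m_changes(struct domain *d,
                                       struct deferred_p2m_changes *dc)
{
    unsigned int i;

    for ( i = 0; i < dc->nr; ++i )
        p2m_change_type_one(d, gfn_x(dc->chg[i].gfn),
                            p2m_ioreq_server, dc->chg[i].type);

    dc->nr = 0;
}

The applying side would presumably want to hang off the ioreq completion
handling done from hvm_do_resume().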

The other alternative I can think of would be to record the specific
failure case (finding p2m_ioreq_server) in the first pass, and act upon
it (instead of your mitigation above) during the retry. At first
glance it would seem more cumbersome to do the recording at that
layer though.
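
For comparison, the recording variant might end up looking something like
the below inside linear_write() itself, i.e. essentially your re-ordering
above, but gated on state recorded during the first pass. The
saw_ioreq_server_page field is invented, the HVMTRANS_bad_linear_to_gfn
case and the page-crossing tail are omitted, and setting the flag precisely
(only when HVMTRANS_bad_gfn_to_mfn really resulted from p2m_ioreq_server)
is exactly the cumbersome part I was referring to:

static int linear_write(unsigned long addr, unsigned int bytes, void *p_data,
                        uint32_t pfec, struct hvm_emulate_ctxt *hvmemul_ctxt)
{
    pagefault_info_t pfinfo;
    int rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);

    switch ( rc )
    {
        unsigned int offset, part1;

    case HVMTRANS_okay:
        /*
         * If the first pass went the MMIO route because of finding
         * p2m_ioreq_server, force the retry down the same path, so that
         * hvmemul_do_io() gets to consume the pending response.
         */
        if ( !hvmemul_ctxt->saw_ioreq_server_page ) /* invented field */
            return X86EMUL_OKAY;
        /* fall through */

    case HVMTRANS_bad_gfn_to_mfn:
        /*
         * Ideally set only when the failure really was due to finding
         * p2m_ioreq_server, which is information __hvm_copy() doesn't
         * currently hand back to us.
         */
        hvmemul_ctxt->saw_ioreq_server_page = true;
        offset = addr & ~PAGE_MASK;
        if ( offset + bytes <= PAGE_SIZE )
            return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec,
                                             hvmemul_ctxt, 0);
        /* ... remainder as today ... */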

Let me know what you think, and also whether you want me to try to
prototype something, or whether you want to give it a shot.

Jan

