
Re: [Xen-devel] Xen 4.6.1 crash with altp2m enabled by default

>>> On 22.09.16 at 17:11, <Kevin.Mayer@xxxxxxxx> wrote:
> Here is a call stack from dmesg.
> Keep in mind that the compiler omits some function names (most importantly
> the vmx_fpu_leave) and also that vmx_vmenter_helper is not actually called.
> The backtrace just thinks it is called because the ud2 which panics the
> hypervisor lies somewhere behind its epilogue.
> (XEN) Xen call trace:
> (XEN)    [<ffff82d0801fd8ec>] vmx_vmenter_helper+0x280/0x30a
> (XEN)    [<ffff82d080174f91>] __context_switch+0xdb/0x3b5
> (XEN)    [<ffff82d080178c19>] __sync_local_execstate+0x5e/0x7a
> (XEN)    [<ffff82d080178c3e>] sync_local_execstate+0x9/0xb
> (XEN)    [<ffff82d080179740>] map_domain_page+0xa0/0x5d4
> (XEN)    [<ffff82d080196152>] destroy_perdomain_mapping+0x8f/0x1e8
> (XEN)    [<ffff82d080244a62>] free_compat_arg_xlat+0x26/0x28
> (XEN)    [<ffff82d0801d4081>] hvm_vcpu_destroy+0x112/0x176
> (XEN)    [<ffff82d080175c2c>] vcpu_destroy+0x5d/0x72
> (XEN)    [<ffff82d080105dd4>] complete_domain_destroy+0x49/0x192
> (XEN)    [<ffff82d0801215fd>] rcu_process_callbacks+0x19a/0x1fb
> (XEN)    [<ffff82d08012caf8>] __do_softirq+0x82/0x8d
> (XEN)    [<ffff82d08012cb3b>] process_pending_softirqs+0x38/0x3a
> (XEN)    [<ffff82d0801c23a8>] mwait_idle+0x10c/0x315
> (XEN)    [<ffff82d080174825>] idle_loop+0x51/0x6b

So one possible solution would be to simply avoid calling
altp2m_vcpu_update_p2m() and altp2m_vcpu_update_vmfunc_ve()
from altp2m_vcpu_destroy() for dying domains. However, it looks
as if this would still only paper over the underlying problem.
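To make the idea concrete, here is a minimal, self-contained sketch
of such a guard. It is a toy model, not Xen code: the structures, the
is_dying flag as modelled here, and the counting stubs that stand in
for the two update functions are all assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration; NOT Xen's real structures. */
struct domain {
    bool is_dying;                  /* assumed: set once teardown has begun */
};

struct vcpu {
    struct domain *domain;
    unsigned int p2m_updates;       /* how often the p2m stub ran */
    unsigned int vmfunc_ve_updates; /* how often the vmfunc/#VE stub ran */
};

/* Stubs: the real functions would touch per-vCPU hardware state. */
void altp2m_vcpu_update_p2m(struct vcpu *v)
{
    v->p2m_updates++;
}

void altp2m_vcpu_update_vmfunc_ve(struct vcpu *v)
{
    v->vmfunc_ve_updates++;
}

/* Sketch of the guard: skip both updates when the domain is dying. */
void altp2m_vcpu_destroy(struct vcpu *v)
{
    assert(v != NULL && v->domain != NULL);

    /* (any other per-vCPU altp2m cleanup is elided in this model) */

    if ( v->domain->is_dying )
        return;                     /* nothing will run this vCPU again */

    altp2m_vcpu_update_p2m(v);
    altp2m_vcpu_update_vmfunc_ve(v);
}
```

In this model a vCPU of a live domain gets both updates, while a vCPU
of a dying domain gets neither. It only avoids the crashing path; it
says nothing about why that path is reachable in the first place.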

Yet I continue to have difficulty seeing how we can end up with the
call stack above without some other, earlier bug: I don't think
un-paused vCPUs are supposed to make it into vcpu_destroy(), and
by the time a vCPU has been paused, sync_vcpu_execstate() would
already have been called for it. And while both
vcpu_check_shutdown() and domain_shutdown() call
vcpu_pause_nosync() (which hence wouldn't result in the needed
call to sync_vcpu_execstate()), domain_kill() calls domain_pause()
first thing, while it drops the domain reference almost last thing.
And only the dropping of the last domain reference can cause
execution to reach complete_domain_destroy().
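The ordering argument above can be sketched with a toy reference
count. Everything below is hypothetical and heavily simplified: the
toy_ names are invented for this example, pausing is reduced to a
counter, and the destroy step runs directly rather than being
deferred the way the RCU callback in the trace is.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model for illustration; NOT Xen's real domain structure. */
struct toy_domain {
    int refcnt;                 /* outstanding references */
    int pause_count;            /* > 0 means "paused" in this model */
    bool destroyed;
    bool paused_when_destroyed; /* recorded at destroy time */
};

/* Runs only once the last reference has gone away. */
void toy_complete_domain_destroy(struct toy_domain *d)
{
    d->destroyed = true;
    d->paused_when_destroyed = (d->pause_count > 0);
}

/* Drop one reference; dropping the last one triggers destruction. */
void toy_put_domain(struct toy_domain *d)
{
    assert(d->refcnt > 0);
    if ( --d->refcnt == 0 )
        toy_complete_domain_destroy(d);
}

void toy_domain_pause(struct toy_domain *d)
{
    d->pause_count++;
}

/* Pause first thing, drop the reference (almost) last thing. */
void toy_domain_kill(struct toy_domain *d)
{
    toy_domain_pause(d);
    /* (resource teardown elided) */
    toy_put_domain(d);
}
```

With a single reference, toy_domain_kill() always reaches the destroy
step with the domain paused. A stray toy_put_domain() on that same
single reference reaches it with pause_count still zero, which is
the kind of situation the question below is about.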

Could you verify this is what is actually happening, i.e. you're not
suffering from a stray put_domain() somewhere? And just to
double check - you don't have any other code changes in your
tree beyond the default enabling of altp2m?
