Re: [Xen-devel] [PATCH v1] x86/hvm: Generic instruction re-execution mechanism for execute faults
On Thu, Nov 22, 2018 at 12:14:59PM +0200, Razvan Cojocaru wrote:
> On 11/22/18 12:05 PM, Roger Pau Monné wrote:
> > On Wed, Nov 21, 2018 at 08:55:48PM +0200, Razvan Cojocaru wrote:
> >> On 11/16/18 7:04 PM, Roger Pau Monné wrote:
> >>>> +            if ( a == v )
> >>>> +                continue;
> >>>> +
> >>>> +            /* Pause, synced. */
> >>>> +            while ( !a->arch.in_host )
> >>> Why not use a->is_running as a way to know whether the vCPU is
> >>> running?
> >>>
> >>> I think the logic of using vcpu_pause and expecting the running vcpu
> >>> to take a vmexit and thus set in_host is wrong, because a vcpu that
> >>> wasn't running when vcpu_pause_nosync is called won't get scheduled
> >>> anymore, thus not taking a vmexit, and this function will lock up.
> >>>
> >>> I don't think you need the in_host boolean at all.
> >>>
> >>>> +                cpu_relax();
> >>> Is this really better than using vcpu_pause?
> >>>
> >>> I assume this is done to avoid waiting on each vcpu, and instead doing
> >>> it here likely means less wait time?
> >>
> >> The problem with plain vcpu_pause() is that we weren't able to use it,
> >> for the same reason (which remains unclear as of yet) that we couldn't
> >> use a->is_running: we get CPU stuck hypervisor crashes that way. Here's
> >> one that uses the same logic, but loops on a->is_running instead of
> >> !a->arch.in_host:
> >> [...]
> >> Some scheduler magic appears to happen here where it is unclear why
> >> is_running doesn't seem to end up being 0 as expected in our case. We'll
> >> keep digging.
> >
> > There seems to be some kind of deadlock between
> > vmx_start_reexecute_instruction and hap_track_dirty_vram/handle_mmio.
> > Are you holding a lock while trying to put the other vcpus to sleep?
>
> d->arch.rexec_lock, but I don't see how that would matter in this case.

The trace from pCPU#0:

(XEN) [ 3668.016989] RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v0)
[...]
(XEN) [ 3668.275417] Xen call trace:
(XEN) [ 3668.278714]    [<ffff82d0801327d2>] vcpu_sleep_sync+0x40/0x71
(XEN) [ 3668.284952]    [<ffff82d08010735b>] domain.c#do_domain_pause+0x33/0x4f
(XEN) [ 3668.291973]    [<ffff82d08010879a>] domain_pause+0x25/0x27
(XEN) [ 3668.297952]    [<ffff82d080245e69>] hap_track_dirty_vram+0x2c1/0x4a7
(XEN) [ 3668.304797]    [<ffff82d0801dd8f5>] do_hvm_op+0x18be/0x2b58
(XEN) [ 3668.310864]    [<ffff82d080172aca>] pv_hypercall+0x1e5/0x402
(XEN) [ 3668.317017]    [<ffff82d080250899>] entry.o#test_all_events+0/0x3d

shows a hypercall executed from Dom0 that's trying to pause the domain,
thus pausing all the vCPUs.

Then pCPU#3:

(XEN) [ 3669.062841] RFLAGS: 0000000000000202   CONTEXT: hypervisor (d1v0)
[...]
(XEN) [ 3669.322832] Xen call trace:
(XEN) [ 3669.326128]    [<ffff82d08021006a>] vmx_start_reexecute_instruction+0x107/0x68a
(XEN) [ 3669.333925]    [<ffff82d080210b3e>] p2m_mem_access_check+0x551/0x64d
(XEN) [ 3669.340774]    [<ffff82d0801dee9e>] hvm_hap_nested_page_fault+0x2f2/0x631
(XEN) [ 3669.348051]    [<ffff82d080202c00>] vmx_vmexit_handler+0x156c/0x1e45
(XEN) [ 3669.354899]    [<ffff82d08020820c>] vmx_asm_vmexit_handler+0xec/0x250

seems to be blocked in vmx_start_reexecute_instruction, and thus never
gets paused, which would trigger the watchdog on pCPU#0?

You should check which vCPU the trace from pCPU#0 is waiting on; if
that's the vCPU running on pCPU#3 (d1v0) you will have to check what's
taking such a long time in vmx_start_reexecute_instruction.

Roger.
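For illustration only, here is a minimal sketch of the synchronous-pause
pattern suggested in the review above, using only the existing Xen
primitives vcpu_pause(), vcpu_unpause() and for_each_vcpu(). The helper
names pause_sibling_vcpus()/unpause_sibling_vcpus() are hypothetical and
are not taken from the patch under discussion:

#include <xen/sched.h>

/*
 * Hypothetical sketch, not from the patch: pause every other vCPU of
 * v's domain with the synchronous vcpu_pause(), which waits (via
 * vcpu_sleep_sync()) until the target has been descheduled, so no
 * manual spinning on in_host/is_running is needed.
 */
static void pause_sibling_vcpus(struct vcpu *v)
{
    struct vcpu *a;

    for_each_vcpu ( v->domain, a )
        if ( a != v )
            vcpu_pause(a);
}

static void unpause_sibling_vcpus(struct vcpu *v)
{
    struct vcpu *a;

    for_each_vcpu ( v->domain, a )
        if ( a != v )
            vcpu_unpause(a);
}

Note that this only restates the suggested pattern; as reported earlier
in the thread, plain vcpu_pause() also led to stuck-CPU crashes in this
setup, so the sketch does not by itself explain the interaction with
domain_pause()/hap_track_dirty_vram seen in the traces.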