
Re: [Xen-devel] PML causing race condition during guest bootstorm and host crash on Broadwell cpu.



Hi,

The code is destroying the multiple vcpus held by the domain: complete_domain_destroy() -> hvm_vcpu_destroy() for each vcpu.

vmx_vcpu_destroy() calls vmx_vcpu_disable_pml(), which leads to a race when foreign vcpus are being destroyed while other domains are rebooting on the same vcpus.

With a single domain reboot, no race is observed.

Commit e18d4274772e52ac81fda1acb246d11ef666e5fe causes this race condition.

Anshul


On 07/02/17 17:58, anshul makkar wrote:
Hi, Sorry, forgot to include you in cc list.

Anshul


On 07/02/17 17:26, anshul makkar wrote:
Hi,

I am facing an issue where a bootstorm of guests leads to a host crash. I debugged it and found that enabling PML introduces a race condition during the guest teardown stage, between disabling PML on one vcpu and a context switch happening for another vcpu.

The crash happens only on Broadwell processors, as PML was introduced in this generation.

Here is my analysis:

Race condition:

vmcs.c: vmx_vcpu_disable_pml(vcpu) { vmx_vmcs_enter(); vm_write(disable_PML); vmx_vmcs_exit(); }

In between, I have a call path from another pcpu executing a context switch -> vmx_fpu_leave(), which crashes on the vmwrite:

    if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) )
    {
        v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS;
        __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); // crash
    }

Debug logs:
(XEN) [221256.749928] VMWRITE VMCS Invalid !!!!!
(XEN) [221256.754870] **[00] { now 0000c93b4341df1d, hw 00000035fffea000, op 00000035fffea000 } vmclear
(XEN) [221256.765052] ** frames [ ffff82d080134652 smp_call_function_interrupt+0x92/0xa0 ]

(XEN) [221256.773969] **[01] { now 0000c93b4341e099, hw ffffffffffffffff, op 00000035fffea000 } vmptrld
(XEN) [221256.784150] ** frames [ ffff82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ]

(XEN) [221256.792197] **[02] { now 0000c93b4341e1f1, hw 00000035fffea000, op 00000035fffea000 } vmclear
(XEN) [221256.802378] ** frames [ ffff82d080134652 smp_call_function_interrupt+0x92/0xa0 ]

(XEN) [221256.811298] **[03] { now 0000c93b5784dd0a, hw ffffffffffffffff, op 00000039d7074000 } vmptrld
(XEN) [221256.821478] ** frames [ ffff82d0801f4c31 vmx_do_resume+0x51/0x150 ]

(XEN) [221256.829139] **[04] { now 0000c93b59d67b5b, hw 00000039d7074000, op 0000002b9a575000 } vmptrld
(XEN) [221256.839320] ** frames [ ffff82d0801f4c31 vmx_do_resume+0x51/0x150 ]

(XEN) [221256.882850] **[07] { now 0000c93b59e71e48, hw 0000002b9a575000, op 00000039d7074000 } vmptrld
(XEN) [221256.893034] ** frames [ ffff82d0801f4d13 vmx_do_resume+0x133/0x150 ]

(XEN) [221256.900790] **[08] { now 0000c93b59e78675, hw 00000039d7074000, op 00000040077ae000 } vmptrld
(XEN) [221256.910968] ** frames [ ffff82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ]

(XEN) [221256.919015] **[09] { now 0000c93b59e78ac8, hw 00000040077ae000, op 00000040077ae000 } vmclear
(XEN) [221256.929196] ** frames [ ffff82d080134652 smp_call_function_interrupt+0x92/0xa0 ]

(XEN) [221256.938117] **[10] { now 0000c93b59e78d72, hw ffffffffffffffff, op 00000040077ae000 } vmptrld
(XEN) [221256.948297] ** frames [ ffff82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ]

(XEN) [221256.956345] **[11] { now 0000c93b59e78ff0, hw 00000040077ae000, op 00000040077ae000 } vmclear
(XEN) [221256.966525] ** frames [ ffff82d080134652 smp_call_function_interrupt+0x92/0xa0 ]

(XEN) [221256.975445] **[12] { now 0000c93b59e7deda, hw ffffffffffffffff, op 00000040077b3000 } vmptrld
(XEN) [221256.985626] ** frames [ ffff82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ]

(XEN) [221256.993672] **[13] { now 0000c93b59e9fe00, hw 00000040077b3000, op 00000040077b3000 } vmclear
(XEN) [221257.003852] ** frames [ ffff82d080134652 smp_call_function_interrupt+0x92/0xa0 ]

(XEN) [221257.012772] **[14] { now 0000c93b59ea007e, hw ffffffffffffffff, op 00000040077b3000 } vmptrld
(XEN) [221257.022952] ** frames [ ffff82d0801f0765 vmx_vmcs_try_enter+0x95/0xb0 ]

(XEN) [221257.031000] **[15] { now 0000c93b59ea02ba, hw 00000040077b3000, op 00000040077b3000 } vmclear
(XEN) [221257.041180] ** frames [ ffff82d080134652 smp_call_function_interrupt+0x92/0xa0 ]

(XEN) [221257.050101]  ....
(XEN) [221257.053008] vmcs_ptr:0xffffffffffffffff, vcpu->vmcs:0x2b9a575000
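
The debug patch that produced this history is not part of this mail. As a rough illustration only, a recorder of this shape could keep the last 16 vmptrld/vmclear operations in a ring buffer. Everything below is hypothetical: the struct, the function names, and the reading of the fields ("hw" as the current-VMCS pointer before the operation, "op" as the operand) are inferred from the output format, not taken from Xen.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TRACE_SLOTS 16  /* assumption: the log prints indices 00..15 */

enum vmcs_op { OP_VMPTRLD, OP_VMCLEAR };

/* One recorded operation; field meanings are inferred from the log. */
struct vmcs_trace_entry {
    uint64_t now;       /* timestamp when the operation was issued */
    uint64_t hw;        /* current-VMCS pointer before the operation */
    uint64_t op;        /* VMCS address the operation was applied to */
    enum vmcs_op kind;
};

struct vmcs_trace {
    struct vmcs_trace_entry slot[TRACE_SLOTS];
    unsigned int count; /* total number of entries ever recorded */
};

/* Record one operation, overwriting the oldest entry once the ring is full. */
static void trace_record(struct vmcs_trace *t, uint64_t now, uint64_t hw,
                         uint64_t op, enum vmcs_op kind)
{
    struct vmcs_trace_entry *e = &t->slot[t->count % TRACE_SLOTS];

    e->now = now;
    e->hw = hw;
    e->op = op;
    e->kind = kind;
    t->count++;
}

/* Return the i-th oldest surviving entry (0 = oldest), or NULL if none. */
static const struct vmcs_trace_entry *trace_get(const struct vmcs_trace *t,
                                                unsigned int i)
{
    unsigned int kept = t->count < TRACE_SLOTS ? t->count : TRACE_SLOTS;

    if (i >= kept)
        return NULL;
    return &t->slot[(t->count - kept + i) % TRACE_SLOTS];
}

/* Print the surviving entries, oldest first, in a format like the log above. */
static void trace_dump(const struct vmcs_trace *t)
{
    const struct vmcs_trace_entry *e;

    for (unsigned int i = 0; (e = trace_get(t, i)) != NULL; i++)
        printf("**[%02u] { now %016" PRIx64 ", hw %016" PRIx64
               ", op %016" PRIx64 " } %s\n", i, e->now, e->hw, e->op,
               e->kind == OP_VMCLEAR ? "vmclear" : "vmptrld");
}
```

Calling trace_record() next to each vmptrld and vmclear, and trace_dump() when a vmwrite fails, would give a history of this kind.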


The vmcs is loaded, but before the subsequent vm_write there is a clear of the vmcs caused by vmx_vcpu_disable_pml.

The log above shows that an IPI is clearing the vmcs between vmptrld and vmwrite, but I also verified that interrupts are disabled during the context switch and during the execution of vm_write in vmx_fpu_leave. This has got me confused.
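
To make the suspected interleaving concrete, here is a minimal standalone C model. It is not Xen code: it tracks only the current-VMCS pointer of one pcpu, and the model_* and replay_interleaving names are made up for illustration. It relies on the architectural behaviour that a VMCLEAR of the current VMCS invalidates the current-VMCS pointer (set to all ones) and that a VMWRITE without a valid current VMCS fails. It does not explain how the clear gets in while interrupts are disabled; it only replays the ordering seen in the log.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones pointer, as in "vmcs_ptr:0xffffffffffffffff" in the log. */
#define VMCS_PTR_INVALID UINT64_MAX

/* Toy model of one physical CPU: only the current-VMCS pointer is tracked. */
struct pcpu_model {
    uint64_t current_vmcs;
};

/* VMPTRLD: make vmcs_pa the current VMCS on this pcpu. */
static void model_vmptrld(struct pcpu_model *p, uint64_t vmcs_pa)
{
    assert(vmcs_pa != VMCS_PTR_INVALID);
    p->current_vmcs = vmcs_pa;
}

/* VMCLEAR: clearing the current VMCS invalidates the current pointer. */
static void model_vmclear(struct pcpu_model *p, uint64_t vmcs_pa)
{
    if (p->current_vmcs == vmcs_pa)
        p->current_vmcs = VMCS_PTR_INVALID;
}

/* VMWRITE: returns 0 on success, -1 if there is no valid current VMCS. */
static int model_vmwrite(const struct pcpu_model *p)
{
    return p->current_vmcs == VMCS_PTR_INVALID ? -1 : 0;
}

/*
 * Replay the ordering from the log: load the vcpu's VMCS during context
 * switch, optionally let a remote clear request run, then do the
 * vmx_fpu_leave-style write. Returns the result of the write.
 */
static int replay_interleaving(int remote_clear)
{
    struct pcpu_model pcpu = { VMCS_PTR_INVALID };
    const uint64_t vmcs_pa = 0x2b9a575000ULL; /* vcpu->vmcs from the log */

    model_vmptrld(&pcpu, vmcs_pa);
    if (remote_clear)
        model_vmclear(&pcpu, vmcs_pa);
    return model_vmwrite(&pcpu);
}
```

In this model replay_interleaving(0) returns 0 and replay_interleaving(1) returns -1, which corresponds to the "VMWRITE VMCS Invalid" message.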

Also, I am not sure if I understand the handling of foreign_vmcs correctly, which can also be the cause of the race.

Please share any suggestions you may have.


Thanks

Anshul Makkar







_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel


