
Re: [Xen-devel] vpmu=1 and running 'perf top' within a PVHVM guest eventually hangs dom0 and hypervisor has stuck vCPUS. Romley-EP (model=45, stepping=2)



On Wed, Mar 13, 2013 at 09:02:24AM +0100, Dietmar Hahn wrote:
> On Wednesday, 13 March 2013, 08:51:30, Dietmar Hahn wrote:
> > On Tuesday, 12 March 2013, 16:54:11, Boris Ostrovsky wrote:
> > > On 03/12/2013 04:31 PM, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, Mar 12, 2013 at 02:50:59PM -0400, Boris Ostrovsky wrote:
> > > >> On 03/12/2013 01:30 PM, Konrad Rzeszutek Wilk wrote:
> > > >>> This issue I am encountering seems to only happen on multi-socket
> > > >>> machines.
> > > >> I believe I was able to reproduce this (once) on my laptop.
> > > >>
> > > >>> It also does not help that the only multi-socket box I have is
> > > >>> a Romley-EP (so two-socket SandyBridge CPUs). The other
> > > >>> SandyBridge boxes I have (one socket) are not showing this. Granted
> > > >>> they are also a different model (42).
> > > >>>
> > > >>> The problem is that when I run 'perf top' within an SMP PVHVM
> > > >>> guest, after a couple of seconds or minutes the guest hangs.
> > > >>> The hypervisor ends up stuck looping too, and then dom0 ends
> > > >>> up hanging as well.
> > > >>>
> > > >>> Dumping the cpu registers (Ctrl-A x3, then 'd')
> > > >>> shows that the guest is pretty firmly stuck in vmx_vmexit_handler:
> > > >>>
> > > >>> (XEN)    [<ffff82c4c01d386f>] vmx_vmexit_handler+0x22f/0x174
> > > >> And in my case this address is the second instruction after STI,
> > > >> i.e. we are right at the point where interrupts got enabled.
> > > >>
> > > >> So I am wondering whether this has something to do with the counter
> > > >> overflow interrupt (which I believe is an NMI).
> > > > Interestingly enough, if I run the PVHVM guest with 'nowatchdog'
> > > > it runs fine!
> > > 
> > > I think by default perf top runs off the timer interrupt so it does
> > > not use HW counters. But the watchdog is implemented on top of the
> > > counters so perhaps it fires the interrupt at a bad time, messing
> > > something up.
> > 
> > This looks like the strange behavior we had on Nehalem CPUs, see
> > http://lists.xen.org/archives/html/xen-devel/2010-11/msg01157.html
> > For this I added a quirk, see check_pmc_quirk() in vpmu_core2.c
> > Model 42 is in the quirk list and it seems to work, but Romley-EP is
> > model 43 I think, which is not in the list.
> 
> Sorry, it should be 45?
> But this isn't on the list either, currently only 47, 46, 42 and 26 - the
> processors we were able to test.

So with this tiny patch:

From ca17d322447e6253a13e896cd828a4d507fedce1 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Wed, 13 Mar 2013 10:22:57 -0400
Subject: [PATCH] add quirk for 45 and 43

---
 xen/arch/x86/hvm/vmx/vpmu_core2.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index 2af8966..658f533 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -59,8 +59,10 @@ static void check_pmc_quirk(void)
     if ( family == 6 )
     {
         if ( cpu_model == 47 || cpu_model == 46 || cpu_model == 42 ||
-             cpu_model == 26 )
+             cpu_model == 26 || cpu_model == 45 || cpu_model == 43 ) {
             is_pmc_quirk = 1;
+            printk("%s enabled (model: %x)\n", __func__, cpu_model);
+        }
     }
 }
 
-- 
1.8.0.2

it blows up:

(XEN) check_pmc_quirk enabled (model: 2d)
(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    13
(XEN) RIP:    e008:[<ffff82c4c01cee72>] vmx_load_vmcs+0x140/0x16e
(XEN) RFLAGS: 0000000000010003   CONTEXT: hypervisor
(XEN) rax: ffff83043ff4fc78   rbx: 00000000bd87c380   rcx: 0000000000000000
(XEN) rdx: 0000000000000282   rsi: ffff8300bb6ca500   rdi: ffff8300bb6ca000
(XEN) rbp: ffff83043ff4fc88   rsp: ffff83043ff4fc70   r8:  ffff83043ff53cf0
(XEN) r9:  0000000000000002   r10: 00000000fffffffc   r11: ffff82c4c0225fc0
(XEN) r12: ffff83043ff53d10   r13: 000000000042fe41   r14: 000000082d33c000
(XEN) r15: 000000000000001f   cr0: 0000000080050033   cr4: 00000000000426f0
(XEN) cr3: 00000002170e9000   cr2: 00007f16872a2169
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83043ff4fc70:
(XEN)    ffff83043ff4fc88 00000000bd87c380 ffff8300bb6ca000 ffff83043ff4fca8
(XEN)    ffff82c4c01cf18b ffff8300bb6ca000 0000000000000000 ffff83043ff4fcd8
(XEN)    ffff82c4c01d30c8 ffff8300bb6ca000 0000000000000000 ffff83042fff8000
(XEN)    0000000000000000 ffff83043ff4fd08 ffff82c4c01b3bc7 ffff8300bb6ca000
(XEN)    0000000000000000 ffff83042fff8000 0000000000000000 ffff83043ff4fd38
(XEN)    ffff82c4c015cf5b ffff8300bb6ca000 ffff83042fff8000 0000000000000000
(XEN)    ffff83042fff8000 ffff83043ff4fd78 ffff82c4c0105e90 ffff83042fff8000
(XEN)    0000000000000080 ffff83042ffc3ee0 000000000000001f ffff83042fff8000
(XEN)    ffff82c4c0303c00 ffff83043ff4fef8 ffff82c4c0103a6e 0000000400000000
(XEN)    00007f1687b35004 0000001f00000001 ffff83043ff48000 ffff82c4c0319820
(XEN)    ffff82c4c031a5b8 ffff82c4c031a5b8 ffff82c4c031a5b8 ffff82c4c031a5b8
(XEN)    ffff83043ff4fdf8 000000803ff4fef8 0000000000000000 ffff83043c993f90
(XEN)    0000000000000000 00000000ffffffff 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000300000000 00000000002171e8 000000021562c025
(XEN)    ffff8300bb73e000 000000090000000f 00007f1687920001 00007f1600000004
(XEN)    0000000000000000 00007fff33256950 00007f1687297e68 00007f1687b376e8
(XEN)    0000000001428050 00007fff33256910 0000000001428c80 00007fff33256df0
(XEN)    00007f16879367c5 0000000001428b40 0000000001428c80 0000000000000004
(XEN)    0000000000000001 0000000001428250 0000000001428090 ffff83043ff4fed8
(XEN)    ffff8300bb73e000 0000000000000005 ffff880069c9db58 0000000000000005
(XEN) Xen call trace:
(XEN)    [<ffff82c4c01cee72>] vmx_load_vmcs+0x140/0x16e
(XEN)    [<ffff82c4c01cf18b>] vmx_vmcs_enter+0xa5/0xba
(XEN)    [<ffff82c4c01d30c8>] vmx_vcpu_initialise+0x107/0x156
(XEN)    [<ffff82c4c01b3bc7>] hvm_vcpu_initialise+0x4e/0x20f
(XEN)    [<ffff82c4c015cf5b>] vcpu_initialise+0x72/0x2a5
(XEN)    [<ffff82c4c0105e90>] alloc_vcpu+0x191/0x271
(XEN)    [<ffff82c4c0103a6e>] do_domctl+0xa52/0x11d4
(XEN)    [<ffff82c4c02223db>] syscall_enter+0xeb/0x145
(XEN)    
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 13:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)

This is based on 65c9792df60051b5f5eaadbc47a118cfba7edd49.
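
A note on the "model: 2d" line in the log above: the patch prints cpu_model
with %x, so 2d is hex for 45, i.e. the Romley-EP model from the subject line.
Below is a minimal sketch of how family and model are derived from CPUID
leaf 1 EAX (standard Intel encoding) and checked against the quirk list as
extended by the patch. The helper names are hypothetical and not Xen's, and
the sample EAX value 0x000206D2 is constructed for illustration from
model=45, stepping=2 rather than read from the machine:

```c
#include <stdint.h>

/* Hypothetical helper (not Xen code): display family from CPUID.1:EAX. */
unsigned int cpuid_family(uint32_t eax)
{
    unsigned int family = (eax >> 8) & 0xf;

    /* The extended family field is only added when the base family is 0xf. */
    if ( family == 0xf )
        family += (eax >> 20) & 0xff;
    return family;
}

/* Hypothetical helper (not Xen code): display model from CPUID.1:EAX. */
unsigned int cpuid_model(uint32_t eax)
{
    unsigned int family = (eax >> 8) & 0xf;
    unsigned int model = (eax >> 4) & 0xf;

    /* For family 6 and 0xf the extended model supplies the high nibble. */
    if ( family == 0x6 || family == 0xf )
        model |= ((eax >> 16) & 0xf) << 4;
    return model;
}

/* Mirrors the model list from the patch above: 47, 46, 42, 26, 45, 43. */
int needs_pmc_quirk(unsigned int family, unsigned int model)
{
    if ( family != 6 )
        return 0;
    return model == 47 || model == 46 || model == 42 ||
           model == 26 || model == 45 || model == 43;
}
```

With the sample value, the base model nibble is 0xd and the extended model
nibble is 0x2, which combine to 0x2d (45), so the quirk check matches.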

FYI, I also tried on a non-debug hypervisor and got this:

(XEN) CPU:    11
(XEN) RIP:    e008:[<ffff82c4c01cee72>] vmx_load_vmcs+0x140/0x16e
(XEN) RFLAGS: 0000000000010003   CONTEXT: hypervisor
(XEN) rax: ffff83043ff67c78   rbx: 00000000bd87c380   rcx: 0000000000000000
(XEN) rdx: 0000000000000282   rsi: ffff8300bb6ca500   rdi: ffff8300bb6ca000
(XEN) rbp: ffff83043ff67c88   rsp: ffff83043ff67c70   r8:  ffff83043ff6fcf0
(XEN) r9:  0000000000000002   r10: 00000000fffffffc   r11: ffff82c4c0225fc0
(XEN) r12: ffff83043ff6fd10   r13: 000000000042fdba   r14: 000000083f2c5000
(XEN) r15: 000000000000001f   cr0: 0000000080050033   cr4: 00000000000426f0
(XEN) cr3: 00000002190a9000   cr2: 00007f7db970c169
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83043ff67c70:
(XEN)    ffff83043ff67c88 00000000bd87c380 ffff8300bb6ca000 ffff83043ff67ca8
(XEN)    ffff82c4c01cf18b ffff8300bb6ca000 0000000000000000 ffff83043ff67cd8
(XEN)    ffff82c4c01d30c8 ffff8300bb6ca000 0000000000000000 ffff83042ff21000
(XEN)    0000000000000000 ffff83043ff67d08 ffff82c4c01b3bc7 ffff8300bb6ca000
(XEN)    0000000000000000 ffff83042ff21000 0000000000000000 ffff83043ff67d38
(XEN)    ffff82c4c015cf5b ffff8300bb6ca000 ffff83042ff21000 0000000000000000
(XEN)    ffff83042ff21000 ffff83043ff67d78 ffff82c4c0105e90 ffff83042ff21000
(XEN)    0000000000000080 ffff83042fefcee0 000000000000001f ffff83042ff21000
(XEN)    ffff82c4c0303c00 ffff83043ff67ef8 ffff82c4c0103a6e 0000000400000000
(XEN)    00007f7db9f9f004 0000001f00000001 ffff83043ff60000 ffff82c4c0319820
(XEN)    ffff82c4c031a5b8 ffff82c4c031a5b8 ffff82c4c031a5b8 ffff82c4c031a5b8
(XEN)    ffff83043ff67df8 000000803ff67ef8 0000000000000000 ffff83043c993f90
(XEN)    0000000000000000 00000000ffffffff 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000300000000 000000000021956d 0000000215636025
(XEN)    ffff8300a97c4000 000000090000000f 00007f7db9d90001 00007f7d00000004
(XEN)    0000000000000000 00007ffffe3b98a0 00007f7db9701e68 00007f7db9fa16e8
(XEN)    0000000000db5050 00007ffffe3b9860 0000000000db5c80 00007ffffe3b9d40
(XEN)    00007f7db9da07c5 0000000000db5b40 0000000000db5c80 0000000000000004
(XEN)    0000000000000001 0000000000db5250 0000000000db5090 ffff83043ff67ed8
(XEN)    ffff8300a97c4000 0000000000000005 ffff880069c9ab58 0000000000000005
(XEN) Xen call trace:
(XEN)    [<ffff82c4c01cee72>] vmx_load_vmcs+0x140/0x16e
(XEN)    [<ffff82c4c01cf18b>] vmx_vmcs_enter+0xa5/0xba
(XEN)    [<ffff82c4c01d30c8>] vmx_vcpu_initialise+0x107/0x156
(XEN)    [<ffff82c4c01b3bc7>] hvm_vcpu_initialise+0x4e/0x20f
(XEN)    [<ffff82c4c015cf5b>] vcpu_initialise+0x72/0x2a5
(XEN)    [<ffff82c4c0105e90>] alloc_vcpu+0x191/0x271
(XEN)    [<ffff82c4c0103a6e>] do_domctl+0xa52/0x11d4
(XEN)    [<ffff82c4c02223db>] syscall_enter+0xeb/0x145
(XEN)    
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 11:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required ('noreboot' specified)




 

