Hi guys
We are using Xen 4.6.1 to manage our virtual machines on x86-64 servers.
We start dozens of VMs and destroy them again after 60 seconds. This works fine as it is, but the next step in our approach requires the altp2m functionality.
Since libvirt does not pass the altp2m-enable flag to the hypervisor, we enabled altp2m unconditionally by patching hvm.c. All of our machines support altp2m, so this seemed to be OK.
d->arch.hvm_domain.params[HVM_PARAM_HPET_ENABLED] = 1;
d->arch.hvm_domain.params[HVM_PARAM_TRIPLE_FAULT_REASON] = SHUTDOWN_reboot;
+ d->arch.hvm_domain.params[HVM_PARAM_ALTP2M] = 1;
+
vpic_init(d);
rc = vioapic_init(d);
Since applying this patch, the hypervisor crashes after several hundred VM restarts (we do not use any altp2m functionality ourselves) with the following dmesg output:
(XEN) ----[ Xen-4.6.1 x86_64 debug=n Not tainted ]----
(XEN) CPU: 7
(XEN) RIP: e008:[<ffff82d0801f5a55>] vmx_vmenter_helper+0x2b5/0x340
(XEN) RFLAGS: 0000000000010003 CONTEXT: hypervisor (d0v3)
(XEN) rax: 000000008005003b rbx: ffff8300e7038000 rcx: 0000000000000008
(XEN) rdx: 0000000000006c00 rsi: ffff83062eb5e000 rdi: ffff8300e7038000
(XEN) rbp: ffff830c17e3f000 rsp: ffff830617fc7d70 r8: 0000000000000000
(XEN) r9: ffff83014f8d7028 r10: 000002700f858000 r11: 00002201be6861f0
(XEN) r12: ffff83062eb5e000 r13: ffff8300e752f000 r14: ffff82d08030ea40
(XEN) r15: 0000000000000007 cr0: 000000008005003b cr4: 00000000000026e0
(XEN) cr3: 00000001bf4da000 cr2: 00000000dd840c00
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff830617fc7d70:
(XEN) ffff8300e7038000 ffff82d080170c04 0000000000000000 0000000780109f6a
(XEN) ffff830617fc7f18 ffff83000000001e 0000000000000000 ffff8300e752f19c
(XEN) 0000000000000286 ffff8300e752f000 ffff8300e72fc000 0000000000000007
(XEN) ffff830c17e3f000 ffff830c14ee1000 ffff82d08030ea40 ffff82d080173d6a
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) ffff82d08030ea40 ffff8300e72fc000 000002700f481091 0000000000000001
(XEN) ffff82d080324560 ffff82d08030ea40 ffff8300e752f000 ffff82d080128004
(XEN) 0000000000000001 0000000001c9c380 ffff830c14ef60e8 0000000017fce600
(XEN) 0000000000000001 ffff82d0801bd18b ffff82d0801d9e88 ffff8300e752f000
(XEN) 0000000001c9c380 ffff82d08012e700 0000006e00000171 ffffffffffffffff
(XEN) ffff830617fc0000 ffff82d0802f8f80 00000000ffffffff ffff83062eb5e000
(XEN) ffff82d08030ea40 ffff82d08012b040 ffff8300e7038000 ffff830617fc0000
(XEN) ffff8300e7038000 00000000ffffffff ffff830c14ee1000 ffff82d080170970
(XEN) ffff8300e72fc000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000080550f50 00000000ffdffc70 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 000000002fcffe19
(XEN) 00000000ffdffc70 0000000000000000 00000000ffdffc50 00000000853b0918
(XEN) 000000fa00000000 00000000f0e48162 0000000000000000 0000000000000246
(XEN) 0000000080550f34 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000007 ffff8300e752f000
(XEN) Xen call trace:
(XEN) [<ffff82d0801f5a55>] vmx_vmenter_helper+0x2b5/0x340
(XEN) [<ffff82d080170c04>] __context_switch+0xb4/0x350
(XEN) [<ffff82d080173d6a>] context_switch+0xca/0xef0
(XEN) [<ffff82d080128004>] schedule+0x264/0x5f0
(XEN) [<ffff82d0801bd18b>] mwait_idle+0x25b/0x3a0
(XEN) [<ffff82d0801d9e88>] hvm_vcpu_has_pending_irq+0x58/0xc0
(XEN) [<ffff82d08012e700>] timer_softirq_action+0x80/0x250
(XEN) [<ffff82d08012b040>] __do_softirq+0x60/0x90
(XEN) [<ffff82d080170970>] idle_loop+0x20/0x50
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 7:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Executing kexec image on cpu7
(XEN) Shot down all CPUs
The RIP points to a ud2 instruction:
0xffff82d0801f5a55: ud2
From the RFLAGS value (0x10003, i.e. CF = 1 and ZF = 0) we concluded that the vmwrite failed because of an invalid VMCS pointer (VMfailInvalid). This is where we are stuck, since we have no idea how the pointer could have been corrupted.
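For reference, this is roughly how we read the flags. It is only a minimal sketch of the VMX status convention as we understand it from the Intel SDM (CF set means VMfailInvalid, ZF set means VMfailValid, both clear means success). The names vmx_status_from_rflags, vmx_status_name and the enum are made up for this mail and are not Xen code.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural RFLAGS bit masks. */
#define RFLAGS_CF (UINT64_C(1) << 0)  /* carry flag */
#define RFLAGS_ZF (UINT64_C(1) << 6)  /* zero flag  */

/* Hypothetical enum for this illustration, not a Xen type. */
enum vmx_status {
    VMX_SUCCEED,       /* CF = 0, ZF = 0 */
    VMX_FAIL_INVALID,  /* CF = 1: no valid current VMCS pointer */
    VMX_FAIL_VALID     /* ZF = 1: error code is in the VM-instruction error field */
};

/* Classify the outcome of a VMX instruction from the RFLAGS it left behind. */
enum vmx_status vmx_status_from_rflags(uint64_t rflags)
{
    if (rflags & RFLAGS_CF)
        return VMX_FAIL_INVALID;
    if (rflags & RFLAGS_ZF)
        return VMX_FAIL_VALID;
    return VMX_SUCCEED;
}

/* Human-readable name for a status value. */
const char *vmx_status_name(enum vmx_status s)
{
    switch (s) {
    case VMX_FAIL_INVALID: return "VMfailInvalid";
    case VMX_FAIL_VALID:   return "VMfailValid";
    default:               return "VMsucceed";
    }
}

/* Check against the value from the register dump above. */
void check_crash_rflags(void)
{
    /* 0x10003 = RF (bit 16) | reserved bit 1 | CF (bit 0); ZF is clear. */
    assert(vmx_status_from_rflags(UINT64_C(0x0000000000010003)) == VMX_FAIL_INVALID);
}
```

Calling check_crash_rflags() from a small main() passes for the RFLAGS value in our dump, which is why we suspect the VMCS pointer rather than a bad field encoding or value.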
crash> vcpu
gives vmcs = 0xffffffff817cbc20 for vcpu_id = 7,
and vcpus gives
VCID PCID VCPU ST T DOMID DOMAIN
0 0 ffff8300e75f2000 RU I 32767 ffff830c14ee1000
1 1 ffff8300e72fe000 RU I 32767 ffff830c14ee1000
2 2 ffff8300e7527000 RU I 32767 ffff830c14ee1000
> 3 3 ffff8300e7526000 RU I 32767 ffff830c14ee1000
4 4 ffff8300e75f1000 RU I 32767 ffff830c14ee1000
> 5 5 ffff8300e75f0000 RU I 32767 ffff830c14ee1000
> 6 6 ffff8300e72fd000 RU I 32767 ffff830c14ee1000
7 7 ffff8300e72fc000 RU I 32767 ffff830c14ee1000
0 0 ffff8300e72fa000 BL 0 0 ffff830c17e3f000
1 6 ffff8300e72f9000 BL 0 0 ffff830c17e3f000
2 3 ffff8300e72f8000 BL 0 0 ffff830c17e3f000
> 3 7 ffff8300e752f000 RU 0 0 ffff830c17e3f000
4 5 ffff8300e752e000 RU 0 0 ffff830c17e3f000
> 5 2 ffff8300e752d000 RU 0 0 ffff830c17e3f000
> 6 1 ffff8300e752c000 BL 0 0 ffff830c17e3f000
>* 7 0 ffff8300e752b000 RU 0 0 ffff830c17e3f000
0 4 ffff8300e7042000 OF U 127 ffff830475bbe000
> 0 4 ffff8300e7040000 RU U 128 ffff83062a7bc000
0 1 ffff8300e7038000 RU U 129 ffff83062eb5e000
0 5 ffff8300e703e000 BL U 130 ffff830475bd1000
Do you have any ideas what could cause this crash or how to proceed?
Cheers
Kevin