
Re: [Xen-devel] HVM domains crash after upgrade from XEN 4.5.1 to 4.5.2



On 14.11.15 at 21:32, Andrew Cooper wrote:
On 14/11/2015 00:16, Atom2 wrote:
On 13.11.15 at 11:09, Andrew Cooper wrote:
On 13/11/15 07:25, Jan Beulich wrote:
On 13.11.15 at 00:00, <ariel.atom2@xxxxxxxxxx> wrote:
On 12.11.15 at 17:43, Andrew Cooper wrote:
On 12/11/15 14:29, Atom2 wrote:
Hi Andrew,
thanks for your reply. Answers are inline further down.

On 12.11.15 at 14:01, Andrew Cooper wrote:
On 12/11/15 12:52, Jan Beulich wrote:
On 12.11.15 at 02:08, <ariel.atom2@xxxxxxxxxx> wrote:
After the upgrade HVM domUs appear to no longer work - regardless of the dom0 kernel (tested with both 3.18.9 and 4.1.7 as the dom0 kernel); PV domUs, however, work just fine as before on both dom0 kernels.

xl dmesg shows the following information after the first crashed HVM
domU which is started as part of the machine booting up:
[...]
(XEN) Failed vm entry (exit reason 0x80000021) caused by invalid guest
state (0).
(XEN) ************* VMCS Area **************
(XEN) *** Guest State ***
(XEN) CR0: actual=0x0000000000000039, shadow=0x0000000000000011,
gh_mask=ffffffffffffffff
(XEN) CR4: actual=0x0000000000002050, shadow=0x0000000000000000,
gh_mask=ffffffffffffffff
(XEN) CR3: actual=0x0000000000800000, target_count=0
(XEN)      target0=0000000000000000, target1=0000000000000000
(XEN)      target2=0000000000000000, target3=0000000000000000
(XEN) RSP = 0x0000000000006fdc (0x0000000000006fdc)  RIP =
0x0000000100000000 (0x0000000100000000)
Other than RIP looking odd for a guest still in non-paged protected mode, I can't seem to spot anything wrong with the guest state.
odd? That will be the source of the failure.

Out of long mode, the upper 32 bits of %rip should all be zero, and it should not be possible to set any of them.

I suspect that the guest has exited for emulation, and there has been a bad update to %rip.  The alternative (which I hope is not the case) is that there is a hardware erratum which allows the guest to accidentally get itself into this condition.

Are you able to rerun with a debug build of the hypervisor?
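For a plain upstream Xen tree, something along these lines should produce one (just a sketch - distro packaging such as the Gentoo ebuild may wire the debug switch up differently):

    # from the top of the xen-4.5.x source tree:
    # build only the hypervisor, with assertions and verbose debug output enabled
    make xen debug=y
    # the resulting hypervisor ends up under xen/ (xen.gz); boot it in place
    # of the non-debug build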
[big snip]
Now _without_ the debug USE flag, but with debug information in the binary (I used splitdebug), all is back to where the problem started off (i.e. the system boots without issues until it starts an HVM domU, which then crashes; PV domUs are working). I have attached the latest "xl dmesg" output with the timing information included.
 

I hope any of this makes sense to you.

Again many thanks and best regards


Right - it would appear that the USE flag is definitely not what you wanted, and causes a bad compilation of Xen. The do_IRQ disassembly you sent is the result of disassembling a whole block of zeroes. Sorry for leading you on a wild goose chase - the double faults will be the product of the bad compilation, rather than anything to do with your specific problem.
Hi Andrew,
there's absolutely no need to apologize, as it is me who asked for help and you who generously stepped in and provided it. I really do appreciate your help, and it is for me, as the one seeking help, to provide all the information you deem necessary and ask for.
However, the final log you sent (dmesg) is using a debug Xen, which is what I was attempting to get you to do originally.
Next time I'll know better how to arrive at a debug Xen. It's all about learning.
We still observe that the VM ends up in 32-bit non-paged mode but with an RIP with bit 32 set, which is an invalid state to be in. However, there was nothing particularly interesting in the extra log information.

Please can you rerun with "hvm_debug=0xc3f", which will cause far more logging to occur to the console while the HVM guest is running. That might show some hints.
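For reference, hvm_debug is a hypervisor command line parameter, so one way to pass it - assuming GRUB2 with the stock 20_linux_xen script; file names and the exact entry differ per distro - would be:

    # /etc/default/grub (keep any existing Xen options on the same line)
    GRUB_CMDLINE_XEN_DEFAULT="hvm_debug=0xc3f"
    # then regenerate grub.cfg, e.g.:
    grub-mkconfig -o /boot/grub/grub.cfg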
I haven't done that yet - but please see my next paragraph. If you are still interested in this, for whatever reason, I am of course more than happy to rerun with your suggested option and provide that information as well.
Also, the fact that this occurs just after starting SeaBIOS is interesting. As you have switched versions of Xen, you have also switched hvmloader, which contains the SeaBIOS binary embedded in it. Would you be able to compile both 4.5.1 and 4.5.2 and switch the hvmloader binaries in use? It would be very interesting to see whether the failure is caused by the hvmloader binary or the hypervisor. (With `xl`, you can use firmware_override="/full/path/to/firmware" to override the default hvmloader.)
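For example, a guest config fragment along these lines should do it - the path below is only illustrative, point it at the hvmloader built from the other Xen version:

    # HVM guest config for xl
    builder = "hvm"
    # use this hvmloader instead of the one matching the running hypervisor
    firmware_override = "/usr/lib/xen-4.5.1/boot/hvmloader"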
Your analysis was absolutely spot on. After re-thinking this for a moment, I thought going down that route first would make a lot of sense, as PV guests still work and one of the differences from HVM domUs is that the former do _not_ require SeaBIOS. Looking at my log files of installed packages confirmed an upgrade from SeaBIOS 1.7.5 to 1.8.2 in the relevant timeframe, which obviously had not made it into the hvmloader of xen-4.5.1 as I did not re-compile Xen after the SeaBIOS upgrade.

So I re-compiled xen-4.5.1 (obviously now picking up the installed SeaBIOS 1.8.2) and the same error as with xen-4.5.2 popped up - which strongly suggested that there indeed might be an issue with SeaBIOS, as this was probably the only variable that had changed since the original install of xen-4.5.1.

My next step was to downgrade SeaBIOS to 1.7.5 and to re-compile xen-4.5.1. Voila, the system was again up and running. While still having SeaBIOS 1.7.5 installed, I also re-compiled xen-4.5.2 and ... you probably guessed it ... the problem was gone: The system boots up with no issues and everything is fine again.
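(In case it is useful to anyone repeating this: upstream Xen's tools configure can apparently also be pointed at a specific SeaBIOS binary rather than embedding whichever one is installed, which would be another way to pin the version during such tests - whether the Gentoo ebuild exposes this directly is an assumption on my part, and the bios.bin path is illustrative:

    # from the xen source tree
    ./configure --with-system-seabios=/usr/share/seabios/bios.bin
    make dist-tools        # rebuilds the tools, including hvmloader, with that SeaBIOS
)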

So in a nutshell: There seems to be a problem with SeaBIOS 1.8.2 preventing HVM domains from successfully starting up. I don't know what triggers it, whether it is specific to my hardware, or whether something else in my environment is to blame.

In any case, I am again more than happy to provide data / run a few tests should you wish to get to the grounds of this.

I do owe you a beer (or any other drink) should you ever be at my location (i.e. Vienna, Austria).

Many thanks again for your analysis and your first class support. Xen and its people absolutely rock!

Atom2
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

