Re: [Xen-devel] HVM domains crash after upgrade from XEN 4.5.1 to 4.5.2

>>>>>>>>>>>>> After the upgrade HVM domUs appear to no longer work - regardless
>>>>>>>>>>>>> of the dom0 kernel (tested with both 3.18.9 and 4.1.7 as the dom0
>>>>>>>>>>>>> kernel); PV domUs, however, work just fine as before on both dom0
>>>>>>>>>>>>> kernels. xl dmesg shows the following information after the first
>>>>>>>>>>>>> crashed HVM domU which is started as part of the machine booting up:
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>> (XEN) Failed vm entry (exit reason 0x80000021) caused by invalid guest state (0).
>>>>>>>>>>>>> (XEN) ************* VMCS Area **************
>>>>>>>>>>>>> (XEN) *** Guest State ***
>>>>>>>>>>>>> (XEN) CR0: actual=0x0000000000000039, shadow=0x0000000000000011, gh_mask=ffffffffffffffff
>>>>>>>>>>>>> (XEN) CR4: actual=0x0000000000002050, shadow=0x0000000000000000, gh_mask=ffffffffffffffff
>>>>>>>>>>>>> (XEN) CR3: actual=0x0000000000800000, target_count=0
>>>>>>>>>>>>> (XEN)      target0=0000000000000000, target1=0000000000000000
>>>>>>>>>>>>> (XEN)      target2=0000000000000000, target3=0000000000000000
>>>>>>>>>>>>> (XEN) RSP = 0x0000000000006fdc (0x0000000000006fdc)  RIP = 0x0000000100000000 (0x0000000100000000)
>>>>>>>>>>>> Other than RIP looking odd for a guest still in non-paged protected
>>>>>>>>>>>> mode I can't seem to spot anything wrong with guest state.
>>>>>>>>>>> odd? That will be the source of the failure.
>>>>>>>>>>> Out of long mode, the upper 32 bits of %rip should all be zero, and
>>>>>>>>>>> it should not be possible to set any of them.
>>>>>>>>>>> I suspect that the guest has exited for emulation, and there has
>>>>>>>>>>> been a bad update to %rip.  The alternative (which I hope is not the
>>>>>>>>>>> case) is that there is a hardware erratum which allows the guest to
>>>>>>>>>>> accidentally get itself into this condition.
>>>>>>>>>>> Are you able to rerun with a debug build of the hypervisor?
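
For reference, the rule described above can be checked against the values in the dump. This is only a minimal Python sketch, not Xen code: the function name is made up for illustration, and it reduces the real VM-entry checks to one simplified assumption, namely that with paging off the guest cannot be in long mode, so bits 63:32 of RIP must be zero.

```python
CR0_PE = 1 << 0    # protected mode enable
CR0_PG = 1 << 31   # paging enable


def rip_plausible_without_paging(cr0: int, rip: int) -> bool:
    """Simplified check (assumption, see lead-in).

    If paging is disabled, long mode is impossible, so any bit set
    above bit 31 of RIP makes the guest state invalid.
    """
    if cr0 & CR0_PG:
        # Paging is on: this simplified check says nothing.
        return True
    return (rip >> 32) == 0


# Values copied from the VMCS dump quoted above.
cr0 = 0x0000000000000039
rip = 0x0000000100000000

print(bool(cr0 & CR0_PE))                      # True: protected mode
print(bool(cr0 & CR0_PG))                      # False: paging off
print(rip_plausible_without_paging(cr0, rip))  # False: bit 32 is set
```

With the dumped values the check fails solely because of bit 32 in RIP, which matches the diagnosis above.
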
>>>>>>>>>> Now _without_ the debug USE flag, but with debug information in the
>>>>>>>>>> binary (I used splitdebug), all is back to where the problem started
>>>>>>>>>> off (i.e. the system boots without issues until such time it starts
>>>>>>>>>> a HVM domU which then crashes; PV domUs are working). I have attached
>>>>>>>>>> the latest "xl dmesg" output with the timing information included.
>>>>> I hope any of this makes sense to you.
>>>>> Again many thanks and best regards
>>>> Right - it would appear that the USE flag is definitely not what you
>>>> wanted, and causes bad compilation for Xen.  The do_IRQ disassembly
>>>> you sent is the result of disassembling a whole block of zeroes.
>>>> Sorry for leading you on a goose chase - the double faults will be the
>>>> product of bad compilation, rather than anything to do with your
>>>> specific problem.
>>> Hi Andrew,
>>> there's absolutely no need to apologize as it is me who asked for help
>>> and you who generously stepped in and provided it. I really do
>>> appreciate your help and it is for me, as the one seeking help, to
>>> provide all the information you deem necessary and you ask for.
>>>> However, the final log you sent (dmesg) is using a debug Xen, which is
>>>> what I was attempting to get you to do originally.
>>> Next time I know better how to arrive at a debug XEN. It's all about
>>> learning.
>>>> We still observe that the VM ends up in 32bit non-paged mode but with
>>>> an RIP with bit 32 set, which is an invalid state to be in.  However,
>>>> there was nothing particularly interesting in the extra log
>>>> information.
>>>> Please can you rerun with "hvm_debug=0xc3f", which will cause far more
>>>> logging to occur to the console while the HVM guest is running.  That
>>>> might show some hints.
>>> I haven't done that yet - but please see my next paragraph. If you are
>>> still interested in this, for whatever reason, I am clearly more than
>>> happy to rerun with your suggested option and provide that information
>>> as well.
>>>> Also, the fact that this occurs just after starting SeaBIOS is
>>>> interesting.  As you have switched versions of Xen, you have also
>>>> switched hvmloader, which contains the SeaBIOS binary embedded in it.
>>>> Would you be able to compile both 4.5.1 and 4.5.2 and switch the
>>>> hvmloader binaries in use?  It would be very interesting to see
>>>> whether the failure is caused by the hvmloader binary or the
>>>> hypervisor.  (With `xl`, you can use
>>>> firmware_override="/full/path/to/firmware" to override the default
>>>> hvmloader).
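
A small sketch of how that hvmloader swap could be scripted, assuming the guest config is a plain-text xl config file. Only firmware_override comes from the thread; the helper name is hypothetical and any paths passed in are placeholders to be replaced with the real locations.

```python
def with_firmware_override(cfg_text: str, hvmloader_path: str) -> str:
    """Return cfg_text with exactly one firmware_override line.

    Any existing firmware_override line is dropped first, so the
    function can be applied repeatedly to the same base config.
    """
    kept = [
        line
        for line in cfg_text.splitlines()
        if not line.strip().startswith("firmware_override")
    ]
    kept.append(f'firmware_override="{hvmloader_path}"')
    return "\n".join(kept) + "\n"
```

Calling it once per hvmloader build gives two configs that differ only in the firmware used, which keeps the comparison between the 4.5.1 and 4.5.2 binaries clean.
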
>>> Your analysis was absolutely spot on. After re-thinking this for a
>>> moment, I thought going down that route first would make a lot of sense
>>> as PV guests still do work and one of the differences to HVM domUs is
>>> that the former do _not_ require SeaBIOS. Looking at my log files of
>>> installed packages confirmed an upgrade from SeaBIOS 1.7.5 to 1.8.2 in
>>> the relevant timeframe which obviously had not made it to the hvmloader
>>> of xen-4.5.1 as I did not re-compile xen after the upgrade of SeaBIOS.
>>> So I re-compiled xen-4.5.1 (obviously now using the installed SeaBIOS
>>> 1.8.2) and the same error as with xen-4.5.2 popped up - and that seemed
>>> to strongly indicate that there indeed might be an issue with SeaBIOS as
>>> this probably was the only variable that had changed from the original
>>> install of xen-4.5.1.
>>> My next step was to downgrade SeaBIOS to 1.7.5 and to re-compile
>>> xen-4.5.1. Voila, the system was again up and running. While still
>>> having SeaBIOS 1.7.5 installed, I also re-compiled xen-4.5.2 and ... you
>>> probably guessed it ... the problem was gone: The system boots up with
>>> no issues and everything is fine again.
>>> So in a nutshell: There seems to be a problem with SeaBIOS 1.8.2
>>> preventing HVM domains from successfully starting up. I don't know what
>>> triggers this, whether it is specific to my hardware or whether
>>> something else in my environment is to blame.
>>> In any case, I am again more than happy to provide data / run a few
>>> tests should you wish to get to the bottom of this.
>>> I do owe you a beer (or any other drink) should you ever be at my
>>> location (i.e. Vienna, Austria).
>>> Many thanks again for your analysis and your first class support. Xen
>>> and their people absolutely rock!
>>> Atom2
>> I'm a little late to the thread but can you send me (you can do it
>> off-list if you'd like) the USE flags you used for xen, xen-tools and
>> seabios? Also emerge --info. You can kill two birds with one stone by
>> using emerge --info xen.
> Hi Doug,
> here you go:

Thanks. I'll use your configuration as a test point to update a few
things with regard to the Gentoo ebuilds. I'm not the maintainer of Xen
and SeaBIOS but I don't think the maintainers will have much issue with
the changes.

> USE flags:
> app-emulation/xen-4.5.2-r1::gentoo  USE="-custom-cflags -debug -efi
> -flask -xsm"
> app-emulation/xen-tools-4.5.2::gentoo  USE="hvm pam pygrub python qemu
> screen system-seabios -api -custom-cflags -debug -doc -flask (-ocaml)
> -ovmf -static-libs -system-qemu" PYTHON_TARGETS="python2_7"
> sys-firmware/seabios-1.7.5::gentoo  USE="binary"

So looking at how SeaBIOS and friends are built I think we have an issue
that needs to be addressed. That being said, you wouldn't have this
issue if you did USE="-system-seabios -system-qemu". I believe you would
also be ok if you had done USE="system-seabios system-qemu". But after a
quick look at everything USE="system-seabios -system-qemu" will
definitely do the wrong thing.
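
To make the combinations concrete, here is a rough Python sketch that reads a USE string as printed by emerge and reports which of the cases above it falls into. The function names and verdict strings are made up for illustration, a flag missing from the string is assumed to be disabled, and the fourth combination (-system-seabios with system-qemu) is deliberately left unclassified because I have not looked at it.

```python
def parse_use(use: str) -> dict:
    """Map each flag name to True (enabled) or False (disabled)."""
    flags = {}
    for token in use.split():
        token = token.strip("()")  # emerge shows masked flags as (-flag)
        flags[token.lstrip("-")] = not token.startswith("-")
    return flags


# Keys are (system-seabios, system-qemu).
VERDICTS = {
    (False, False): "ok",
    (True, True): "believed ok",
    (True, False): "wrong",
}


def check_xen_tools_use(use: str) -> str:
    flags = parse_use(use)
    # Assumption: a flag that is not listed counts as disabled.
    key = (flags.get("system-seabios", False),
           flags.get("system-qemu", False))
    return VERDICTS.get(key, "not covered above")


# The xen-tools USE string from the quoted report.
reported = ("hvm pam pygrub python qemu screen system-seabios -api "
            "-custom-cflags -debug -doc -flask (-ocaml) -ovmf "
            "-static-libs -system-qemu")
print(check_xen_tools_use(reported))  # wrong
```
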

> emerge --info: Please see the attached file
>> I'm not too familiar with the xen ebuilds but I was pretty sure that
>> xen-tools is what builds hvmloader and it downloads a copy of SeaBIOS
>> and builds it so that it remains consistent. But obviously your
>> experience shows otherwise.
> You are right, it's xen-tools that builds hvmloader. If I remember
> correctly, the "system-seabios" USE flag (for xen-tools) specifies
> whether sys-firmware/seabios is used and the latter downloads SeaBIOS in
> its binary form provided its "binary" USE flag is set. At least that's
> my understanding.
>> I'm looking at some ideas to improve SeaBIOS packaging on Gentoo and
>> your info would be helpful.
> Great. Whatever makes gentoo and xen stronger will be awesome. What
> immediately springs to mind is to create a separate hvmloader package
> and slot that (that's just an idea and probably not fully thought
> through, but as far as I understood Andrew, it would then be possible to
> specify the specific firmware version [i.e. hvmloader] to use on xl's
> command line by using firmware_override="full/path/to/firmware").
> I also found out that an upgrade to sys-firmware/seabios obviously does
> not trigger an automatic re-emerge of xen-tools and thus hvmloader.
> Shouldn't this also happen automatically as xen-tools depends on seabios?
> Thanks and best regards Atom2
> P.S. If you prefer to take this off-list, just reply to my mail address.

