
Re: [Xen-devel] [BUG] XEN domU crash when PV grub chainloads 32-bit domU grub



On 21/09/2015 21:03, Andreas Sundstrom wrote:
> This is using Debian Jessie and grub 2.02~beta2-22 (with Debian patches
> applied) and Xen 4.4.1
>
> I originally posted a bug report with Debian but got the suggestion to
> file bugs with upstream as well.
> Debian bug report:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799480
>
> Note that my original thought was that this bug probably is within GRUB.
> But Ian asked me to file a bug with Xen as well, you have to live with the
> fact that it is centered around GRUB though.
>
> Here's the information from my original bug report:
>
> Using 64-bit dom0 and 32-bit domU, PV (para-virtualized) grub sometimes
> fails when chainloading the domU's grub. 64-bit domU seems to work 100%
> of the time.

You say sometimes.  Do you mean that repeated attempts to boot a 32bit
domU cause it to either boot correctly, or die in the manner below?

>
> My understanding of the process:
>
>  * dom0 launches domU with grub that is loaded from dom0's disk.
>  * Grub reads config file from memdisk, and then looks for grub binary in
>     domU filesystem.
>  * If grub is found in domU it then chainloads (multiboot) that grub binary
>     and the domU grub reads grub.cfg and continues booting.
>  * If grub is not found in domU it reads grub.cfg and continues with boot.
>
> It fails at step 3 in my list of the boot process, but sometimes it
> does work so it may be something like a race condition that causes the
> problem?
>
> A workaround is to either not install /boot/xen in domU or rename it,
> so that the first grub, loaded from dom0's disk, will not find the grub
> binary in the domU filesystem and hence continues to read grub.cfg and
> boot. The drawback of this is of course that the two versions can't
> differ too much as there are different setups creating grub.cfg and
> then reading/parsing it at boot time.
>
> I am not sure at this point whether this is a problem in XEN or a
> problem in grub but I compiled the legacy pvgrub that uses some minios
> from XEN (don't really know much more about it) and when that legacy
> pvgrub chainloads the domU grub it seems to work 100% of the time. Now
> the legacy pvgrub is not a real alternative as it's not packaged for
> Debian though.
>
> When it fails "xl create vm -c" outputs this:
> Parsing config from /etc/xen/vm
> libxl: error: libxl_dom.c:35:libxl__domain_type: unable to get domain
> type for domid=16
> Unable to attach console
> libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus: console
> child [0] exited with error status 1

These error messages are just because the domain crashes sufficiently
early that libxl can't find the console information.  Running `xl
create` without '-c' would remove the libxl errors.

>
> And "xl dmesg" shows errors like this:
> (XEN) traps.c:2514:d15 Domain attempted WRMSR 00000000c0010201 from
> 0x0000000000000000 to 0x000000000000ffff.
> (XEN) d16:v0: unhandled page fault (ec=0010)
> (XEN) Pagetable walk from 0000000000000000:
> (XEN) L4[0x000] = 0000000200256027 000000000000049c
> (XEN) L3[0x000] = 0000000200255027 000000000000049d
> (XEN) L2[0x000] = 0000000200251023 00000000000004a1
> (XEN) L1[0x000] = 0000000000000000 ffffffffffffffff
> (XEN) domain_crash_sync called from entry.S: fault at ffff82d08021feb0
> compat_create_bounce_frame+0xc6/0xde
> (XEN) Domain 16 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.4.1 x86_64 debug=n Not tainted ]----
> (XEN) CPU: 0
> (XEN) RIP: e019:[<0000000000000000>]
> (XEN) RFLAGS: 0000000000000246 EM: 1 CONTEXT: pv guest
> (XEN) rax: 0000000000000000 rbx: 0000000000000000 rcx: 0000000000000000
> (XEN) rdx: 0000000000000000 rsi: 0000000000499000 rdi: 0000000000800000
> (XEN) rbp: 000000000000000a rsp: 00000000005a5ff0 r8: 0000000000000000
> (XEN) r9: 0000000000000000 r10: ffff83023e9b9000 r11: ffff83023e9b9000
> (XEN) r12: 0000033f3d335bfb r13: ffff82d080300800 r14: ffff82d0802ea940
> (XEN) r15: ffff83005e819000 cr0: 000000008005003b cr4: 00000000000506f0
> (XEN) cr3: 0000000200b7a000 cr2: 0000000000000000
> (XEN) ds: e021 es: e021 fs: e021 gs: e021 ss: e021 cs: e019
> (XEN) Guest stack trace from esp=005a5ff0:
> (XEN) 00000010 00000000 0001e019 00010046 0016b38b 0016b38a 0016b389
> 0016b388
> (XEN) 0016b387 0016b386 0016b385 0016b384 0016b383 0016b382 0016b381
> 0016b380
> (XEN) 0016b37f 0016b37e 0016b37d 0016b37c 0016b37b 0016b37a 0016b379
> 0016b378
> (XEN) 0016b377 0016b376 0016b375 0016b374 0016b373 0016b372 0016b371
> 0016b370
> (XEN) 0016b36f 0016b36e 0016b36d 0016b36c 0016b36b 0016b36a 0016b369
> 0016b368
> (XEN) 0016b367 0016b366 0016b365 0016b364 0016b363 0016b362 0016b361
> 0016b360
> (XEN) 0016b35f 0016b35e 0016b35d 0016b35c 0016b35b 0016b35a 0016b359
> 0016b358
> (XEN) 0016b357 0016b356 0016b355 0016b354 0016b353 0016b352 0016b351
> 0016b350
> (XEN) 0016b34f 0016b34e 0016b34d 0016b34c 0016b34b 0016b34a 0016b349
> 0016b348
> (XEN) 0016b347 0016b346 0016b345 0016b344 0016b343 0016b342 0016b341
> 0016b340
> (XEN) 0016b33f 0016b33e 0016b33d 0016b33c 0016b33b 0016b33a 0016b339
> 0016b338
> (XEN) 0016b337 0016b336 0016b335 0016b334 0016b333 0016b332 0016b331
> 0016b330
> (XEN) 0016b32f 0016b32e 0016b32d 0016b32c 0016b32b 0016b32a 0016b329
> 0016b328
> (XEN) 0016b327 0016b326 0016b325 0016b324 0016b323 0016b322 0016b321
> 0016b320
> (XEN) 0016b31f 0016b31e 0016b31d 0016b31c 0016b31b 0016b31a 0016b319
> 0016b318
> (XEN) 0016b317 0016b316 0016b315 0016b314 0016b313 0016b312 0016b311
> 0016b310
> (XEN) 0016b30f 0016b30e 0016b30d 0016b30c 0016b30b 0016b30a 0016b309
> 0016b308
> (XEN) 0016b307 0016b306 0016b305 0016b304 0016b303 0016b302 0016b301
> 0016b300
> (XEN) 0016b2ff 0016b2fe 0016b2fd 0016b2fc 0016b2fb 0016b2fa 0016b2f9
> 0016b2f8
> (XEN) 0016b2f7 0016b2f6 0016b2f5 0016b2f4 0016b2f3 0016b2f2 0016b2f1
> 0016b2f0

This is a very concerning stack trace.  You appear to have a spliced
32/64bit domain which, irrespective of your other problems, should not
be able to exist.

The segment registers indicate that the domU is executing in ring1, which
makes it a 32bit guest (also why 32bit words are used for the stack
dump), but r10 through r14 hold 64bit values.
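
To illustrate the reasoning above, here is a minimal Python sketch that
decodes the selectors and checks the register widths from the quoted dump.
The helper names (decode_selector, wider_than_32_bits) are made up for this
example and are not part of Xen or any tool.  The only facts relied on are
the standard x86 selector layout (bits 0-1 RPL, bit 2 table indicator,
remaining bits index) and the values printed by "xl dmesg".

```python
# Values copied from the "xl dmesg" register dump quoted above.
SELECTORS = {"cs": 0xE019, "ss": 0xE021, "ds": 0xE021}
REGISTERS = {
    "rsi": 0x0000000000499000,
    "rdi": 0x0000000000800000,
    "r8": 0x0000000000000000,
    "r9": 0x0000000000000000,
    "r10": 0xFFFF83023E9B9000,
    "r11": 0xFFFF83023E9B9000,
    "r12": 0x0000033F3D335BFB,
    "r13": 0xFFFF82D080300800,
    "r14": 0xFFFF82D0802EA940,
    "r15": 0xFFFF83005E819000,
}


def decode_selector(sel):
    """Split an x86 segment selector into (index, table, rpl)."""
    index = sel >> 3                      # descriptor table index
    table = "LDT" if sel & 0x4 else "GDT"  # table indicator bit
    rpl = sel & 0x3                       # requested privilege level (ring)
    return index, table, rpl


def wider_than_32_bits(regs):
    """Return the names of registers whose value does not fit in 32 bits."""
    return sorted(name for name, value in regs.items() if value > 0xFFFFFFFF)


if __name__ == "__main__":
    for name, sel in SELECTORS.items():
        index, table, rpl = decode_selector(sel)
        print(f"{name}={sel:#06x}: index={index:#x} table={table} ring={rpl}")
    print("registers wider than 32 bits:", wider_than_32_bits(REGISTERS))
```

On this particular dump, both cs and ss decode to ring 1, and the check
flags r10 onward (r15 included) as holding values that do not fit in 32
bits.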

>
> An easy way to find out which grub you are in if the machine boots is
> to hit 'c' and type 'ls', only the grub from dom0 will know about
> (memdisk). So when trying to replicate the issue (and the domU
> actually starts) you can hit 'c', type 'ls' (check for memdisk) and
> then type 'halt' and relaunch the domU. Usually I can't launch more
> than 4-5 times in a row before it fails, often it fails on my first
> try.
>
> For information I have reproduced on two different AMD desktop
> processor machines, not sure if Intel would be any different. I'm
> pretty sure I did tests with grub from unstable with same result at
> some point, but can test again if that is likely to work.
>
> The package that is installed on the domU side is "grub-xen".
>
> I am unable to work out how to debug grub further on my own. I have
> printed out text from grub, which showed me that it is the chainload
> that fails. I see no output from the domU grub (except when
> it works as it should of course). I can help with further testing if
> needed.

It does appear to be an intermittent bug in 32bit grub-xen in the
eventual domU, but I have no help to offer with respect to debugging
grub-xen further.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

