Re: [Xen-devel] [xen-unstable test] 110009: regressions - FAIL
Jan Beulich writes ("Re: [Xen-devel] [xen-unstable test] 110009: regressions - FAIL"):
> So finally we have some output from the debugging code added by
> 933f966bcd ("x86/mm: add temporary debugging code to
> get_page_from_gfn_p2m()"), i.e. the migration heisenbug we hope
> to hunt down:
>
> (XEN) d0v2: d7 dying (looking up 3e000)
> ...
> (XEN) Xen call trace:
> (XEN) [<ffff82d0803150ef>] get_page_from_gfn_p2m+0x7b/0x416
> (XEN) [<ffff82d080268e88>] arch_do_domctl+0x51a/0x2535
> (XEN) [<ffff82d080206cf9>] do_domctl+0x17e4/0x1baf
> (XEN) [<ffff82d080355896>] pv_hypercall+0x1ef/0x42d
> (XEN) [<ffff82d0803594c6>] entry.o#test_all_events+0/0x30
>
> which points at XEN_DOMCTL_getpageframeinfo3 handling code.
> What business would the tool stack have invoking this domctl for
> a dying domain? I'd expect all of these operations to be done
> while the domain is still alive (perhaps paused), but none of them
> to occur once domain death was initiated.
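For reference, the check that produces the `dying' message is roughly of the
following shape (a sketch from memory of what 933f966bcd adds, not a verbatim
quote of the patch):

    /* Sketch of the temporary debugging check in get_page_from_gfn_p2m():
     * warn and dump the Xen call trace when a p2m lookup is performed
     * against a domain that has already started dying, which is what
     * produced the "d0v2: d7 dying (looking up 3e000)" line and the
     * trace quoted above. */
    if ( unlikely(d->is_dying) )
    {
        printk(XENLOG_G_WARNING "%pv: d%d dying (looking up %lx)\n",
               current, d->domain_id, gfn);
        dump_execution_state();
    }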
The toolstack log says:
libxl-save-helper: debug: starting restore: Success
xc: detail: fd 8, dom 8, hvm 0, pae 0, superpages 0, stream_type 0
xc: info: Found x86 HVM domain from Xen 4.10
xc: info: Restoring domain
xc: error: Failed to get types for pfn batch (3 = No such process): Internal error
xc: error: Save failed (3 = No such process): Internal error
This is a mixture of output from the save, and output from the restore.
Domain 7 is the domain which is migrating out; domain 8 is migrating
in.
The `Failed to get types' message is the first thing that seems to go
wrong. It's from tools/libxc/xc_sr_save.c line 136, which is part of
the machinery for constructing a memory batch.
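For reference, that part of the batch machinery looks roughly like this (a
sketch of the in-tree code, not a verbatim quote).  xc_get_pfn_type_batch()
is libxc's wrapper around XEN_DOMCTL_getpageframeinfo3, i.e. the very domctl
Jan's call trace points at, and the `3 = No such process' in the error output
is ESRCH coming back from that hypercall:

    /* Sketch of the pfn-type lookup in tools/libxc/xc_sr_save.c (not
     * verbatim).  On entry 'types' holds the pfns of the current batch;
     * the hypercall overwrites it with their page types. */
    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
    if ( rc )
    {
        PERROR("Failed to get types for pfn batch");
        goto err;
    }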
I tried comparing this test with a successful one. I had to hunt a
bit to find one where the (inherently possibly-out-of-order) toolstack
messages were similar, but found 110010 (a linux-4.9 test) [1].
The first significant difference (excluding some variations of
addresses etc., and some messages about NUMA placement of the new
domain which presumably result from a different host) occurs here:
libxl-save-helper: debug: starting restore: Success
xc: detail: fd 8, dom 8, hvm 0, pae 0, superpages 0, stream_type 0
xc: info: Found x86 HVM domain from Xen 4.9
xc: info: Restoring domain
libxl: debug: libxl_dom_suspend.c:179:domain_suspend_callback_common: Domain 7:Calling xc_domain_shutdown on HVM domain
libxl: debug: libxl_dom_suspend.c:294:domain_suspend_common_wait_guest: Domain 7:wait for the guest to suspend
libxl: debug: libxl_event.c:636:libxl__ev_xswatch_register: watch w=0x2179a40 wpath=@releaseDomain token=3/1: register slotnum=3
libxl: debug: libxl_event.c:573:watchfd_callback: watch w=0x2179a40 wpath=@releaseDomain token=3/1: event epath=@releaseDomain
libxl: debug: libxl_dom_suspend.c:352:suspend_common_wait_guest_check: Domain 7:guest has suspended
Looking at the serial logs for that and comparing them with 110009,
it's not terribly easy to see what's going on because the kernel
versions are different and so produce different messages about xenbr0
(and I think may have a different bridge port management algorithm).
But the messages about promiscuous mode seem the same, and of course
promiscuous mode is controlled by userspace, rather than by the kernel
(so should be the same in both).
However, in the failed test we see extra messages about promiscuous mode:
Jun 5 13:37:08.353656 [ 2191.652079] device vif7.0-emu left promiscuous mode
...
Jun 5 13:37:08.377571 [ 2191.675298] device vif7.0 left promiscuous mode
Also, the qemu log for the guest in the failure case says this:
Log-dirty command enable
Log-dirty: no command yet.
reset requested in cpu_handle_ioreq.
Issued domain 7 reboot
Whereas in the working tests we see something like this:
Log-dirty command enable
Log-dirty: no command yet.
dm-command: pause and save state
device model saving state
In the xl log in the failure case I see this:
libxl: debug: libxl_domain.c:773:domain_death_xswatch_callback: Domain 7:Exists shutdown_reported=0 dominf.flags=10106
libxl: debug: libxl_domain.c:785:domain_death_xswatch_callback: shutdown reporting
libxl: debug: libxl_domain.c:740:domain_death_xswatch_callback: [evg=0] all reported
libxl: debug: libxl_domain.c:802:domain_death_xswatch_callback: domain death search done
Domain 7 has shut down, reason code 1 0x1
Action for shutdown reason code 1 is restart
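(Reason code 1 is SHUTDOWN_reboot.  For reference, the shutdown reason codes
come from xen/include/public/sched.h:)

    /* Shutdown reason codes; "reason code 1" in the xl log above is
     * SHUTDOWN_reboot. */
    #define SHUTDOWN_poweroff   0  /* Domain exited normally. Clean up and kill. */
    #define SHUTDOWN_reboot     1  /* Clean up, kill, and then restart.          */
    #define SHUTDOWN_suspend    2  /* Clean up, save suspend info, kill.         */
    #define SHUTDOWN_crash      3  /* Tell controller we've crashed.             */
    #define SHUTDOWN_watchdog   4  /* Restart because watchdog time expired.     */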
xl then tears down the domain's devices and destroys the domain.
All of this seems to suggest that the domain decided to reboot
mid-migration, which is pretty strange.
Ian.
[1] http://logs.test-lab.xenproject.org/osstest/logs/110010/test-amd64-amd64-xl-qemut-win7-amd64/info.html