
Re: [Xen-devel] [xen-unstable test] 110009: regressions - FAIL



Jan Beulich writes ("Re: [Xen-devel] [xen-unstable test] 110009: regressions - FAIL"):
> So finally we have some output from the debugging code added by
> 933f966bcd ("x86/mm: add temporary debugging code to
> get_page_from_gfn_p2m()"), i.e. the migration heisenbug we hope
> to hunt down:
> 
> (XEN) d0v2: d7 dying (looking up 3e000)
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d0803150ef>] get_page_from_gfn_p2m+0x7b/0x416
> (XEN)    [<ffff82d080268e88>] arch_do_domctl+0x51a/0x2535
> (XEN)    [<ffff82d080206cf9>] do_domctl+0x17e4/0x1baf
> (XEN)    [<ffff82d080355896>] pv_hypercall+0x1ef/0x42d
> (XEN)    [<ffff82d0803594c6>] entry.o#test_all_events+0/0x30
> 
> which points at XEN_DOMCTL_getpageframeinfo3 handling code.
> What business would the tool stack have invoking this domctl for
> a dying domain? I'd expect all of these operations to be done
> while the domain is still alive (perhaps paused), but none of them
> to occur once domain death was initiated.

The toolstack log says:

  libxl-save-helper: debug: starting restore: Success
  xc: detail: fd 8, dom 8, hvm 0, pae 0, superpages 0, stream_type 0
  xc: info: Found x86 HVM domain from Xen 4.10
  xc: info: Restoring domain
  xc: error: Failed to get types for pfn batch (3 = No such process): Internal error
  xc: error: Save failed (3 = No such process): Internal error

This is a mixture of output from the save, and output from the restore.
Domain 7 is the domain which is migrating out; domain 8 is migrating
in.

The `Failed to get types' message is the first thing that seems to go
wrong.  It comes from tools/libxc/xc_sr_save.c line 136, which is part
of the machinery for constructing a memory batch.
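
For reference, the relevant error path in write_batch() looks roughly
like this (paraphrased, not a verbatim copy of the tree under test);
xc_get_pfn_type_batch() is the libxc wrapper which issues the
XEN_DOMCTL_getpageframeinfo3 that Jan's stack trace points at:

  /* Sketch of the failing error path in write_batch() (xc_sr_save.c);
   * approximate, details may differ slightly between versions. */
  rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
  if ( rc )
  {
      PERROR("Failed to get types for pfn batch");
      goto err;
  }

So an ESRCH (3 = No such process) from that domctl, which is what one
would get for a dying domain, surfaces as exactly this message.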


I tried comparing this test with a successful one.  I had to hunt a
bit to find one where the (inherently possibly-out-of-order) toolstack
messages were similar, but found 110010 (a linux-4.9 test) [1].

The first significant difference (excluding some variations of
addresses etc., and some messages about NUMA placement of the new
domain which presumably result from a different host) occurs here:

  libxl-save-helper: debug: starting restore: Success
  xc: detail: fd 8, dom 8, hvm 0, pae 0, superpages 0, stream_type 0
  xc: info: Found x86 HVM domain from Xen 4.9
  xc: info: Restoring domain
  libxl: debug: libxl_dom_suspend.c:179:domain_suspend_callback_common: Domain 7:Calling xc_domain_shutdown on HVM domain
  libxl: debug: libxl_dom_suspend.c:294:domain_suspend_common_wait_guest: Domain 7:wait for the guest to suspend
  libxl: debug: libxl_event.c:636:libxl__ev_xswatch_register: watch w=0x2179a40 wpath=@releaseDomain token=3/1: register slotnum=3
  libxl: debug: libxl_event.c:573:watchfd_callback: watch w=0x2179a40 wpath=@releaseDomain token=3/1: event epath=@releaseDomain
  libxl: debug: libxl_dom_suspend.c:352:suspend_common_wait_guest_check: Domain 7:guest has suspended

Looking at the serial logs for that and comparing them with 110009,
it's not terribly easy to see what's going on because the kernel
versions are different and so produce different messages about xenbr0
(and I think may have a different bridge port management algorithm).

But the messages about promiscuous mode seem the same, and of course
promiscuous mode is controlled by userspace, rather than by the kernel
(so should be the same in both).

However, in the failed test we see extra messages about promiscuous mode:

  Jun  5 13:37:08.353656 [ 2191.652079] device vif7.0-emu left promiscuous mode
  ...
  Jun  5 13:37:08.377571 [ 2191.675298] device vif7.0 left promiscuous mode

Also, the qemu log for the guest in the failure case says this:

  Log-dirty command enable
  Log-dirty: no command yet.
  reset requested in cpu_handle_ioreq.
  Issued domain 7 reboot

Whereas in the working tests we see something like this:

  Log-dirty command enable
  Log-dirty: no command yet.
  dm-command: pause and save state
  device model saving state

In the xl log in the failure case I see this:

  libxl: debug: libxl_domain.c:773:domain_death_xswatch_callback: Domain 7:Exists shutdown_reported=0 dominf.flags=10106
  libxl: debug: libxl_domain.c:785:domain_death_xswatch_callback:  shutdown reporting
  libxl: debug: libxl_domain.c:740:domain_death_xswatch_callback: [evg=0] all reported
  libxl: debug: libxl_domain.c:802:domain_death_xswatch_callback: domain death search done
  Domain 7 has shut down, reason code 1 0x1
  Action for shutdown reason code 1 is restart

xl then tears down the domain's devices and destroys the domain.
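
For reference, the shutdown reason codes are defined in
xen/include/public/sched.h; the values below are paraphrased from that
header:

  /* Shutdown reason codes (xen/include/public/sched.h), paraphrased: */
  #define SHUTDOWN_poweroff  0  /* domain halted cleanly                   */
  #define SHUTDOWN_reboot    1  /* domain requested a clean reboot         */
  #define SHUTDOWN_suspend   2  /* domain suspended; what we would expect
                                   at the end of a live migration          */
  #define SHUTDOWN_crash     3  /* domain crashed                          */
  #define SHUTDOWN_watchdog  4  /* watchdog timeout                        */

So reason code 1 above is SHUTDOWN_reboot, not the SHUTDOWN_suspend a
successful migration would report.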

All of this seems to suggest that the domain decided to reboot
mid-migration, which is pretty strange.

Ian.


[1] http://logs.test-lab.xenproject.org/osstest/logs/110010/test-amd64-amd64-xl-qemut-win7-amd64/info.html

