
Re: [Xen-devel] [xen-unstable test] 110009: regressions - FAIL



Jan Beulich writes ("Re: [Xen-devel] [xen-unstable test] 110009: regressions - FAIL"):
> So finally we have some output from the debugging code added by
> 933f966bcd ("x86/mm: add temporary debugging code to
> get_page_from_gfn_p2m()"), i.e. the migration heisenbug we hope
> to hunt down:
> 
> (XEN) d0v2: d7 dying (looking up 3e000)
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d0803150ef>] get_page_from_gfn_p2m+0x7b/0x416
> (XEN)    [<ffff82d080268e88>] arch_do_domctl+0x51a/0x2535
> (XEN)    [<ffff82d080206cf9>] do_domctl+0x17e4/0x1baf
> (XEN)    [<ffff82d080355896>] pv_hypercall+0x1ef/0x42d
> (XEN)    [<ffff82d0803594c6>] entry.o#test_all_events+0/0x30
> 
> which points at XEN_DOMCTL_getpageframeinfo3 handling code.
> What business would the tool stack have invoking this domctl for
> a dying domain? I'd expect all of these operations to be done
> while the domain is still alive (perhaps paused), but none of them
> to occur once domain death was initiated.

The toolstack log says:

  libxl-save-helper: debug: starting restore: Success
  xc: detail: fd 8, dom 8, hvm 0, pae 0, superpages 0, stream_type 0
  xc: info: Found x86 HVM domain from Xen 4.10
  xc: info: Restoring domain
  xc: error: Failed to get types for pfn batch (3 = No such process): Internal error
  xc: error: Save failed (3 = No such process): Internal error

This is a mixture of output from the save, and output from the restore.
Domain 7 is the domain which is migrating out; domain 8 is migrating
in.

The `Failed to get types' message is the first thing that seems to go
wrong.  It comes from tools/libxc/xc_sr_save.c line 136, which is part
of the machinery for constructing a memory batch.
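
For reference, the relevant error path in write_batch() looks roughly
like this (paraphrased, not a verbatim copy of the tree under test);
xc_get_pfn_type_batch() is the libxc wrapper which issues the
XEN_DOMCTL_getpageframeinfo3 that Jan's stack trace points at:

  /* Sketch of the failing error path in write_batch() (xc_sr_save.c);
   * approximate, details may differ slightly between versions. */
  rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
  if ( rc )
  {
      PERROR("Failed to get types for pfn batch");
      goto err;
  }

So an ESRCH (3 = No such process) from that domctl, which is what one
would get for a dying domain, surfaces as exactly this message.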


I tried comparing this test with a successful one.  I had to hunt a
bit to find one where the (inherently possibly-out-of-order) toolstack
messages were similar, but found 110010 (a linux-4.9 test) [1].

The first significant difference (excluding some variations of
addresses etc., and some messages about NUMA placement of the new
domain which presumably result from a different host) occurs here:

  libxl-save-helper: debug: starting restore: Success
  xc: detail: fd 8, dom 8, hvm 0, pae 0, superpages 0, stream_type 0
  xc: info: Found x86 HVM domain from Xen 4.9
  xc: info: Restoring domain
  libxl: debug: libxl_dom_suspend.c:179:domain_suspend_callback_common: Domain 7:Calling xc_domain_shutdown on HVM domain
  libxl: debug: libxl_dom_suspend.c:294:domain_suspend_common_wait_guest: Domain 7:wait for the guest to suspend
  libxl: debug: libxl_event.c:636:libxl__ev_xswatch_register: watch w=0x2179a40 wpath=@releaseDomain token=3/1: register slotnum=3
  libxl: debug: libxl_event.c:573:watchfd_callback: watch w=0x2179a40 wpath=@releaseDomain token=3/1: event epath=@releaseDomain
  libxl: debug: libxl_dom_suspend.c:352:suspend_common_wait_guest_check: Domain 7:guest has suspended

Looking at the serial logs for that and comparing them with 110009,
it's not terribly easy to see what's going on because the kernel
versions are different and so produce different messages about xenbr0
(and I think may have a different bridge port management algorithm).

But the messages about promiscuous mode seem the same, and of course
promiscuous mode is controlled by userspace, rather than by the kernel
(so should be the same in both).

However, in the failed test we see extra messages about promiscuous mode:

  Jun  5 13:37:08.353656 [ 2191.652079] device vif7.0-emu left promiscuous mode
  ...
  Jun  5 13:37:08.377571 [ 2191.675298] device vif7.0 left promiscuous mode

Also, the qemu log for the guest in the failure case says this:

  Log-dirty command enable
  Log-dirty: no command yet.
  reset requested in cpu_handle_ioreq.
  Issued domain 7 reboot

Whereas in the working tests we see something like this:

  Log-dirty command enable
  Log-dirty: no command yet.
  dm-command: pause and save state
  device model saving state

In the xl log in the failure case I see this:

  libxl: debug: libxl_domain.c:773:domain_death_xswatch_callback: Domain 7:Exists shutdown_reported=0 dominf.flags=10106
  libxl: debug: libxl_domain.c:785:domain_death_xswatch_callback:  shutdown reporting
  libxl: debug: libxl_domain.c:740:domain_death_xswatch_callback: [evg=0] all reported
  libxl: debug: libxl_domain.c:802:domain_death_xswatch_callback: domain death search done
  Domain 7 has shut down, reason code 1 0x1
  Action for shutdown reason code 1 is restart

xl then tears down the domain's devices and destroys the domain.
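
For reference, the shutdown reason codes are defined in
xen/include/public/sched.h; the values below are paraphrased from that
header:

  /* Shutdown reason codes (xen/include/public/sched.h), paraphrased: */
  #define SHUTDOWN_poweroff  0  /* domain halted cleanly                   */
  #define SHUTDOWN_reboot    1  /* domain requested a clean reboot         */
  #define SHUTDOWN_suspend   2  /* domain suspended; what we would expect
                                   at the end of a live migration          */
  #define SHUTDOWN_crash     3  /* domain crashed                          */
  #define SHUTDOWN_watchdog  4  /* watchdog timeout                        */

So reason code 1 above is SHUTDOWN_reboot, not the SHUTDOWN_suspend a
successful migration would report.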

All of this seems to suggest that the domain decided to reboot
mid-migration, which is pretty strange.

Ian.


[1] http://logs.test-lab.xenproject.org/osstest/logs/110010/test-amd64-amd64-xl-qemut-win7-amd64/info.html

