
Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL



Juergen Gross writes ("Re: [Xen-devel] [xen-unstable test] 123379: regressions 
- FAIL"):
> Before that there was:
> 
> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14
> = Bad address): Internal error

This seems to be the only message about the root cause.

> But looking at the messages issued some seconds before that I see some
> xenstore watch related messages in:
> 
> http://logs.test-lab.xenproject.org/osstest/logs/123379/test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm/huxelrebe1---var-log-libvirt-libxl-libxl-driver.log

I think this is all a red herring.

What I see happening is:

2018-05-30 22:12:44.695+0000: libxl: 
libxl_event.c:636:libxl__ev_xswatch_register: watch w=0xb40005e8 
wpath=/local/domain/3/control/shutdown token=2/b: register slotnum=2

libxl starts watching the domain's shutdown control node.  I think
this is done from near libxl_dom_suspend.c:202.

2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:573:watchfd_callback: watch 
w=0xb40005e8 wpath=/local/domain/3/control/shutdown token=2/b: event 
epath=/local/domain/3/control/shutdown

The watch we just set triggers.  This happens with every xenstore
watch, after it is set up - ie, it does not mean that anything had
written the node.

2018-05-30 22:12:44.696+0000: libxl: 
libxl_event.c:673:libxl__ev_xswatch_deregister: watch w=0xb40005e8 
wpath=/local/domain/3/control/shutdown token=2/b: deregister slotnum=2

libxl stops watching the domain's shutdown control node.  This is
done, I think, by domain_suspend_common_pvcontrol_suspending
(libxl_dom_suspend.c:222).

We can conclude that
  if (!rc && !domain_suspend_pvcontrol_acked(state))
was not taken.  It seems unlikely that rc!=0, because the
node is read in xswait_xswatch_callback using libxl__xs_read_checked
which I think would log a message.  So probably
/local/domain/3/control/shutdown was `suspend', meaning the domain had
indeed acked the suspend request.

2018-05-30 22:12:44.696+0000: libxl: 
libxl_event.c:636:libxl__ev_xswatch_register: watch w=0xb40005f8 
wpath=@releaseDomain token=2/c: register slotnum=2

This is the watch registration in domain_suspend_common_wait_guest.

2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:548:watchfd_callback: watch 
w=0xb40005f8 epath=/local/domain/3/control/shutdown token=2/b: counter != c

This is a watch event for the watch we set up at 2018-05-30
22:12:44.696+0000.  You can tell because the token is the same.  But
that watch was cancelled within libxl at 2018-05-30
22:12:44.696+0000.  libxl's watch handling machinery knows this, and
discards the event.  "counter != c", libxl_event.c:547.

It does indeed use the same slot in the libxl xswatch data structure,
but libxl can distinguish it by the counter and the event path.  (In
any case xs watch handlers should tolerate spurious events and be
idempotent, although that does not matter here.)

I think this must be the watch event from the guest actually writing
its acknowledgement to the control node - we would indeed expect two
such events, one generated by the watch setup, and one from the
guest's write.  The timing meant that here we processed the guest's
written value as a result of the first watch event.  This is fine.

2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:573:watchfd_callback: watch 
w=0xb40005f8 wpath=@releaseDomain token=2/c: event epath=@releaseDomain

This is the immediate-auto-firing of the @releaseDomain event set up
at 2018-05-30 22:12:44.696+0000.  libxl's xswatch machinery looks this
up in slot 2 and finds that the counter and paths are right, so it
will dispatch it to suspend_common_wait_guest_watch which is a
frontend for suspend_common_wait_guest_check.

In the absence of log messages from that function we can conclude that
  !(info.flags & XEN_DOMINF_shutdown)
ie the guest has not shut down yet.

2018-05-30 22:12:44.720+0000: libxl: libxl_event.c:573:watchfd_callback: watch 
w=0xb2a26708 wpath=@releaseDomain token=3/0: event epath=@releaseDomain

This is a watch event which was set up much earlier at 2018-05-30
21:58:02.182+0000.  The surrounding context there (references to
domain_death_xswatch_callback) makes it clear that this is pursuant to
libxl_evenable_domain_death - ie, libvirt asked libxl to monitor for
the death of the domain.

2018-05-30 22:12:44.724+0000: libxl: 
libxl_domain.c:816:domain_death_xswatch_callback:  shutdown reporting

The output here is a bit perplexing.  I don't understand how we can
have the message "shutdown reporting" without any previous message
"Exists shutdown_reported=%d" or "[evg=%p] nentries=%d rc=%d %ld..%ld"
both of which seem to precede the "shutdown reporting" message in
domain_death_xswatch_callback.

However, we can conclude that, at this point, libxl finds that
  got->flags & XEN_DOMINF_shutdown
and it decides to inform libvirt that the domain has shut down,
by providing a DOMAIN_SHUTDOWN libxl event.

(This event is not passed to libvirt immediately: it lives on either
(a) a queue on this thread's stack, which will be processed on return
to libvirt, or (b) a queue associated with the CTX, whose lock we
hold.  The callback to libvirt happens later, as we will see.)

2018-05-30 22:12:44.724+0000: libxl: 
libxl_domain.c:771:domain_death_xswatch_callback: [evg=0] all reported
2018-05-30 22:12:44.724+0000: libxl: 
libxl_domain.c:833:domain_death_xswatch_callback: domain death search done

This is the end of the search in domain_death_xswatch_callback for
domains which need to be reported.  libvirt was listening for only one
domain.

2018-05-30 22:12:44.724+0000: libxl: libxl_event.c:573:watchfd_callback: watch 
w=0xb40005f8 wpath=@releaseDomain token=2/c: event epath=@releaseDomain

Another xs watch event for the same domain shutdown, because libxl had
set up two watches for it.

(These will probably have been written very quickly together into the
xs ring and/or socket, and handled within the loop in
libxl_event.c:watchfd_callback.  So we have not yet released the CTX
lock or returned to libvirt: therefore, probably, the domain shutdown
event notification to libvirt is still queued with libxl.  Indeed as
we will see, that occurs a bit later.)

This is another watch event from the registration in
domain_suspend_common_wait_guest and again it will call
suspend_common_wait_guest_check.

2018-05-30 22:12:44.724+0000: libxl: 
libxl_event.c:673:libxl__ev_xswatch_deregister: watch w=0xb40005f8 
wpath=@releaseDomain token=2/c: deregister slotnum=2

This must logically be one of the libxl__ev_xswatch_deregister calls
in domain_suspend_common_guest_suspended / domain_suspend_common_done.

However, looking at suspend_common_wait_guest_check, reaching either
of those should have produced some kind of log message - either "guest
has suspended" (DEBUG) or an ERROR of some sort.

... oh I have just spotted this logfile ...

http://logs.test-lab.xenproject.org/osstest/logs/123379/test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm/huxelrebe1---var-log-libvirt-libxl-debianhvm.guest.osstest.log

... which contains the other half of the messages.

So we see there:

2018-05-30 22:12:44.724+0000: libxl: 
libxl_dom_suspend.c:350:suspend_common_wait_guest_check: Domain 3:guest has 
suspended

OK.

2018-05-30 22:12:44.731+0000: libxl: 
libxl_event.c:686:libxl__ev_xswatch_deregister: watch w=0xb40005f8: deregister 
unregistered

This is the second call to libxl__ev_xswatch_deregister for the same
watch event.  That tells us that the first call must have been in
domain_suspend_common_guest_suspended.  So all is going well, and we
called domain_suspend_common_done, whose idempotent cleanup
deregisters the already-deregistered watch.

2018-05-30 22:12:44.731+0000: libxl: libxl_event.c:1398:egc_run_callbacks: 
event 0xb40068f0 callback type=domain_shutdown

This is libxl calling libvirt to tell libvirt that the domain has shut
down.  libvirt does not seem to respond.

2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = Bad 
address): Internal error
2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal error
2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address

And this is the first thing that goes wrong.  You can see similar
messages in the other logfile:

2018-05-30 22:12:49.650+0000: libxl: 
libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving domain: 
domain responded to suspend request: Bad address

All of these are reports of the same thing: xc_get_pfn_type_batch at
xc_sr_save.c:133 failed with EFAULT.


2018-05-30 22:12:49.650+0000: libxl: libxl.c:746:libxl__fd_flags_restore: fnctl 
F_SETFL of fd 33 to 0x8001
2018-05-30 22:12:49.650+0000: libxl: libxl_event.c:1869:libxl__ao_complete: ao 
0xb4000478: complete, rc=-3
2018-05-30 22:12:49.650+0000: libxl: libxl_event.c:1838:libxl__ao__destroy: ao 
0xb4000478: destroy

> which make me wonder whether the libxl watch handling is really
> correct: e.g. libxl__ev_xswatch_register() first registers the watch
> with xenstore and only then writes the data needed for processing the
> watch in the related structure. Could it be that the real suspend watch
> event was interpreted as a @releaseDomain event?

No.  The code in libxl__ev_xswatch_register all runs with the libxl
CTX lock held so it cannot be interrupted in this way.  As you see
above I have analysed the log and it is all operating correctly,
albeit rather noisily in the debug log.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
