
[Xen-devel] error handling in libxl_domain_suspend



Ian, Wei,

we got a report of a crash in libxl_domain_suspend, triggered by
'virsh migrate --live xen+ssh://host'. The backtrace looks like this:

#1 helper_done (egc=0x7fc0284aa6c0, shs=0x7fc0180256c8) at libxl_save_callout.c:371
 helper_failed
 helper_stop
 libxl__save_helper_abort
#2 check_all_finished (egc=0x7fc0284aa6c0, stream=0x7fc018025698, rc=-3) at libxl_stream_write.c:671
 stream_done
 stream_complete
 write_done
 dc->callback == write_done
 efd->func == datacopier_writable
#3 afterpoll_internal (egc=egc@entry=0x7fc0284aa6c0, poller=poller@entry=0x7fc018003f20, nfds=4, fds=0x7fc018002d00, now=...) at libxl_event.c:1269

I inserted the unnumbered entries into the call trace manually for better
understanding.
The issue is that a failed poll crashes libxl. The actual error was:

libxl_aoutils.c:328:datacopier_writable: unexpected poll event 0x1c on fd 37 
(should be POLLOUT) writing libxc header during copy of save v2 stream
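
For reference, the event mask 0x1c from that message can be decoded with a
small helper like the one below. This is only a sketch: describe_revents is a
made-up name, not a libxl function, and the mapping of 0x1c to flag names
assumes the Linux values of the POLL* constants (POLLOUT=0x04, POLLERR=0x08,
POLLHUP=0x10).

```c
#include <poll.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libxl: write the names of the poll
 * flags set in revents into buf, separated by '|'. len must be > 0.
 */
static const char *describe_revents(short revents, char *buf, size_t len)
{
    static const struct { short bit; const char *name; } flags[] = {
        { POLLIN,   "POLLIN"   },
        { POLLPRI,  "POLLPRI"  },
        { POLLOUT,  "POLLOUT"  },
        { POLLERR,  "POLLERR"  },
        { POLLHUP,  "POLLHUP"  },
        { POLLNVAL, "POLLNVAL" },
    };
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
        if (!(revents & flags[i].bit))
            continue;
        /* strncat always NUL-terminates; leave room for the terminator. */
        if (buf[0])
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, flags[i].name, len - strlen(buf) - 1);
    }
    return buf;
}
```

Calling describe_revents(0x1c, buf, sizeof(buf)) with a 64 byte buffer on
Linux yields the flags in table order, that is POLLOUT first, followed by the
two error conditions.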

In this case revents in datacopier_writable is POLLHUP|POLLERR|POLLOUT, which
triggers datacopier_callback.
In helper_done, shs->completion_callback is still NULL:

(gdb) p stream.shs
$32 = {ao = 0x7f3fa4002d10, domid = 0, callbacks = {
save = {a = {suspend = 0x7f3f99c8e220 <libxl__domain_suspend_callback>, 
postcopy = 0x0, checkpoint = 0x0, wait_checkpoint = 0x0, switch_qemu_logdirty = 
0x7f3f99c8eca0 <libxl__domain_suspend_common_switch_qemu_logdirty>}},
restore = {a = {suspend = 0x7f3f99c8e220 <libxl__domain_suspend_callback>, 
postcopy = 0x0, checkpoint = 0x0, wait_checkpoint = 0x0, restore_results = 
0x7f3f99c8eca0 <libxl__domain_suspend_common_switch_qemu_logdirty>}}},
recv_callback = 0x0, completion_callback = 0x0,
caller_state = 0x0, need_results = 0, rc = 0, completed = 0, retval = 0, 
errnoval = 0, abrt = {ao = 0x0, callback = 0x0, registered = false,
entry = { le_next = 0x0, le_prev = 0x0}}, pipes = {0x0, 0x0}, readable = {fd = 
-1, events = 0, func = 0x0, entry = {le_next = 0x0, le_prev = 0x0}, nexus = 
0x0},
child = {pid = -1, callback = 0x0, entry = {le_next = 0x0, le_prev = 0x0}}, 
stdin_what = 0x0, stdout_what = 0x0, egc = 0x0}


Even if helper_done checked whether shs->completion_callback is valid,
check_all_finished would apparently cycle forever:

(gdb) p stream.completion_callback 
$35 = (void (*)(libxl__egc *, libxl__stream_write_state *, int)) 0x7f3f99c8e890 
<stream_done>

stream_done would call check_all_finished again.

My understanding of the code is that libxl__xc_domain_save fills dss.sws.shs. 
But that function is only called after stream_header_done. Any error before 
that will leave dss partly uninitialized.
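
To illustrate the pattern I mean, here is a toy model, not the actual libxl
code. All names in it (struct helper_state, helper_state_init, helper_start,
helper_abort, record_completion) are hypothetical stand-ins. It only shows one
possible direction: put the state into a defined "never started" condition up
front, so that an abort before the helper is launched has nothing to call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the save helper state; not the real layout. */
struct helper_state {
    void (*completion_callback)(struct helper_state *shs, int rc);
    bool started;   /* set only once the helper has actually been launched */
    int rc;
};

/* Example callback that just counts how often completion was reported. */
static int completions;
static void record_completion(struct helper_state *shs, int rc)
{
    (void)shs;
    (void)rc;
    completions++;
}

/* Early setup: give every field a defined value before any I/O happens. */
static void helper_state_init(struct helper_state *shs)
{
    shs->completion_callback = NULL;
    shs->started = false;
    shs->rc = 0;
}

/* Later step, the analogue of the point where the callback gets filled in. */
static void helper_start(struct helper_state *shs,
                         void (*cb)(struct helper_state *, int))
{
    shs->completion_callback = cb;
    shs->started = true;
}

/*
 * Abort path. Returns true if a helper was in flight and its completion
 * callback was invoked, false if there was nothing to tear down.
 */
static bool helper_abort(struct helper_state *shs, int rc)
{
    if (!shs->started || shs->completion_callback == NULL)
        return false;       /* error happened before the helper existed */
    shs->started = false;   /* a second abort becomes a no-op */
    shs->rc = rc;
    shs->completion_callback(shs, rc);
    return true;
}
```

In this model an abort right after helper_state_init returns false without
touching the callback, and an abort after helper_start(&shs,
record_completion) reports completion exactly once. Whether something along
these lines fits the real state machine is exactly what I am unsure about.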

How is this supposed to be fixed?

Olaf
