
Re: [Xen-devel] [xen-4.9-testing test] 126201: regressions - FAIL



On 9/5/18 3:37 PM, Jim Fehlig wrote:
On 08/24/2018 02:58 AM, Wei Liu wrote:
On Wed, Aug 22, 2018 at 04:52:27PM -0600, Jim Fehlig wrote:
On 08/21/2018 05:14 AM, Jan Beulich wrote:
On 21.08.18 at 03:11, <osstest-admin@xxxxxxxxxxxxxx> wrote:
flight 126201 xen-4.9-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/126201/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
   test-amd64-amd64-libvirt-pair 22 guest-migrate/src_host/dst_host fail REGR. vs. 124328

Something needs to be done about this, as this continued failure is
blocking the 4.9.3 release. I already mailed about this on Aug 2nd
for flight 125710, and got this back from Wei:

This is libvirtd's error message.

The remote host can't obtain the state change lock because it is already
held by another task/thread. It could be a libvirt / libxl bug.

2018-08-01 16:12:13.433+0000: 3491: warning : libxlDomainObjBeginJob:151 :
Cannot start job (modify) for domain debian.guest.osstest; current job is (modify) owned by (24975)
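
The failure mode in that warning (a second job giving up because the first
still owns the job lock) can be illustrated with a small sketch. This is not
libvirt code: try_begin_job and simulate are hypothetical names, and the
durations are illustrative values chosen only so the example runs quickly.

```python
import threading
import time


def try_begin_job(job_lock, wait_s):
    """Try to take the job lock, giving up after wait_s seconds."""
    return job_lock.acquire(timeout=wait_s)


def simulate(hold_s, wait_s):
    """Return True if a second job can start while a first job
    holds the lock for hold_s seconds (hypothetical model)."""
    job_lock = threading.Lock()
    held = threading.Event()

    def first_job():
        # The first job owns the lock for its whole duration.
        with job_lock:
            held.set()
            time.sleep(hold_s)

    worker = threading.Thread(target=first_job)
    worker.start()
    held.wait()  # make sure the first job really owns the lock

    got_lock = try_begin_job(job_lock, wait_s)
    if got_lock:
        job_lock.release()
    worker.join()
    return got_lock


if __name__ == "__main__":
    # The holder outlasts the waiter's patience: the second job cannot start.
    print(simulate(hold_s=0.5, wait_s=0.05))
    # The holder finishes in time: the second job starts.
    print(simulate(hold_s=0.05, wait_s=2.0))
```

Whenever the holder keeps the lock longer than the waiter is willing to wait,
the second job fails to start, which is the shape of the error logged above.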

I took a closer look at the logs and it appears the finish phase of
migration fails to acquire the domain job lock since it is already held by
the perform phase. In the perform phase, after the vm has been transferred
to the dst, the qemu process associated with the vm is started. For whatever
reason that takes a long time on this host:

2018-08-19 17:05:19.182+0000: libxl: libxl_dm.c:2235:libxl__spawn_local_dm:
Domain 1:Spawning device-model /usr/local/lib/xen/bin/qemu-system-i386 with
arguments: ...
2018-08-19 17:05:19.188+0000: libxl: libxl_exec.c:398:spawn_watch_event:
domain 1 device model: spawn watch p=(null)

This is a spurious event after the watch has been set up.

...
2018-08-19 17:05:51.529+0000: libxl: libxl_event.c:573:watchfd_callback:
watch w=0x7f84a0047ee8 wpath=/local/domain/0/device-model/1/state token=2/1:
event epath=/local/domain/0/device-model/1/state
2018-08-19 17:05:51.529+0000: libxl: libxl_exec.c:398:spawn_watch_event:
domain 1 device model: spawn watch p=running

So it has taken 32s for QEMU to write "running" in xenstore. This,
however, is still within the timeout limit set by libxl (60s).

Right, but it is not within libvirt's job wait timeout, which is 30s.
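
As a quick check of the arithmetic, here is a minimal sketch that computes
the delay from the two log timestamps quoted above and compares it with both
limits. The constant and function names are my own; the 60s and 30s values
are the ones stated in this thread.

```python
from datetime import datetime

# Timestamp layout used in the libxl log lines quoted in this thread.
FMT = "%Y-%m-%d %H:%M:%S.%f%z"

LIBXL_DM_TIMEOUT_S = 60   # libxl limit, as stated in the thread
LIBVIRT_JOB_WAIT_S = 30   # libvirt job wait timeout, as stated in the thread


def elapsed_seconds(start, end):
    """Seconds between two log timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


spawned = "2018-08-19 17:05:19.182+0000"  # device model spawned
running = "2018-08-19 17:05:51.529+0000"  # "running" seen in xenstore
delay = elapsed_seconds(spawned, running)

if __name__ == "__main__":
    print(f"device model startup took {delay:.3f}s")
    print("within libxl timeout:", delay < LIBXL_DM_TIMEOUT_S)
    print("within libvirt job wait:", delay < LIBVIRT_JOB_WAIT_S)
```

The delay is a little over 32s, so libxl is satisfied while a libvirt job
that waits on the same lock for 30s gives up first.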

I've sent a series to fix this and other problems I found while
testing/debugging:

https://www.redhat.com/archives/libvir-list/2018-September/msg00178.html

Assuming those patches are committed to libvirt.git master, it's not clear how they will improve this and other tests that use an older, fixed libvirt commit.

FYI, the patches fixing this problem from the libvirt side have been committed to libvirt.git master now. See commits 60b4fd90, e39c66d3, 47da84e0, 0149464a, and 5ea2abb3.

Regards,
Jim

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

