Re: [Xen-devel] [xen-unstable test] 56759: regressions - FAIL
On Tue, 2015-05-26 at 14:29 +0100, Ian Campbell wrote:
> On Wed, 2015-05-20 at 10:56 +0100, Ian Campbell wrote:
> > On Wed, 2015-05-20 at 09:34 +0000, osstest service user wrote:
> > > flight 56759 xen-unstable real [real]
> > > http://logs.test-lab.xenproject.org/osstest/logs/56759/
> > >
> > > Regressions :-(
> > >
> > > Tests which did not succeed and are blocking,
> > > including tests which could not be run:
> > >  test-armhf-armhf-xl-multivcpu 17 leak-check/check fail REGR. vs. 56375
> >
> > I'm pretty hard pressed to explain this from the set of commits
> > currently under test, but it has happened a few times now (e.g. 56700
> > 56576) so it does seem to be real.
> >
> > http://logs.test-lab.xenproject.org/osstest/results/bisect.xen-unstable.test-armhf-armhf-xl-multivcpu.leak-check--check.html
> > is working on it and is currently considering the set of changes from:
> > ianc@cosworth:xen.git$ git log --oneline 9ab42~1...45fcc4
> > 45fcc45 use ticket locks for spin locks
> > e13013d libxc/restore: add checkpointed flag to the restore context
> > ce44b40 libxc/restore: introduce setup() and cleanup() on restore
> > c5c5a04 libxc/restore: split read/handle qemu info
> > 9ab42c9 libxc/restore: introduce process_record()
> >
> > where e13013d is current master which was pushed in by flight 56375.
> >
> > I think it unlikely the libxl stuff is responsible, given we don't
> > migrate on ARM, which would seem to point to the ticket locks...
>
> I've now managed to reproduce using the arndale on my desk.

... and now I've confirmed that reverting the spin lock change causes
the issue to not happen any more.

> I'm just starting to dig in to the issue.
>
> So far the only thing I've concluded is that the message comes from
> netback trying to read the script node for inclusion in the hotplug
> invocation's environment.
>
> I wonder if perhaps the spinlock change has just exposed a pre-existing
> race?

I'm still confirming, but AFAICT libxl does the right thing: it writes
state=Closing and waits for it to hit state=Closed before tearing down
the backend directory. AFAICS it is not timing out while waiting.

Looking at the netback side, though, it seems like netback_remove is
switching to state=Closed _before_ it calls kobject_uevent(...,
KOBJ_OFFLINE), and it is this which generates the call to netback_uevent,
which tries and fails to read script and produces the error message.

Since switching to state=Closed is what prompts libxl to go and delete
the xenstore backend dir, it seems possible that netback_uevent might
not run until the xenstore key is gone, prompting it to write the error
nodes. Is there anything else which might guard against that
possibility?

Handwaving a bit (ok, a lot), it's possible that the change of spinlocks
has caused a commonly won race to become a commonly lost one, at least
under these circumstances. My theory is that this is exacerbated on
arndale because the CPU is relatively slow (even compared to cubietruck,
which is the same core but with faster DRAM etc) and because it is dual
core while the failing test case involves a 4 vcpu guest (which is a bit
dumb but not invalid), which loads things even more.

I'm still slightly concerned that perhaps the new spinlock stuff has
some sort of bad behaviour, either on arndale specifically or more
generally for ARM systems, which has pushed this particular case over
the edge.

Ian.