Xen project Mailing List

Re: [Xen-devel] [XTF PATCH] xtf-runner: fix two synchronisation issues

From: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>

Date: Fri, 29 Jul 2016 15:31:30 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Fri, 29 Jul 2016 14:31:43 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Wei Liu writes ("Re: [XTF PATCH] xtf-runner: fix two synchronisation issues"):
> On Fri, Jul 29, 2016 at 01:43:42PM +0100, Andrew Cooper wrote:
> > The runner exiting before xl has torn down the guest is very
> > deliberate, because some part of hvm guests is terribly slow to tear
> > down; waiting synchronously for teardown tripled the wallclock time to
> > run a load of tests back-to-back.
>
> Then you won't know if a guest is leaked or it is being slowly destroyed
> when a dead guest shows up in the snapshot of 'xl list'.
>
> Also consider that would make back-to-back tests that happen to have a
> guest that has the same name as the one in previous test fail.
>
> I don't think getting blocked for a few more seconds is a big issue.
> It is important to eliminate such race conditions so that osstest can
> work properly.

IMO the biggest reason for waiting for teardown is that that will make
it possible to accurately identify the xtf test which was responsible
for the failure if a test reveals a bug which causes problems for the
whole host.

Suppose there is a test T1 which, in buggy hypervisors, creates an
anomalous data structure, such that the hypervisor crashes when T1's
guest is finally torn down.  If we start to run the next test T2
immediately we see success output from T1, we will observe the host
crashing "due to T2", and T1 would be regarded as having succeeded.

This is why in an in-person conversation with Wei yesterday I
recommended that osstest should after each xtf test (i) wait for
everything to be torn down and (ii) then check that the dom0 is still
up.  (And these two activities are regarded as part of the preceding
test step.)

If this leads to over-consumption of machine resources because this
serialisation is too slow then the right approach would be explicit
parallelisation in osstest.  That would still mean that in the
scenario above, T1 would be regarded as having failed, because T1
wouldn't be regarded as having passed until osstest had seen that all
of T1's cleanup had been done and the host was still up.  (T2 would
_also_ be regarded as failed, and that might look like a heisenbug,
but that would be tolerable.)

Wei: I need to check what happens with multiple failing test steps in
the same job.  Specifically, I need to check which one the bisector is
likely to try to attack.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
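
For illustration only, the per-test sequence recommended in the message
(wait for teardown, then check the host, and count both as part of the
same test step) could be sketched roughly as below.  This is not code
from osstest or xtf-runner.  The names domain_exists, wait_for_teardown
and step_verdict are hypothetical, the host_alive callback stands in
for whatever liveness check the harness actually uses, and the sketch
assumes that `xl list <name>` exits non-zero once the domain is gone.

```python
import subprocess
import time


def domain_exists(name, run=subprocess.run):
    """Assumption: `xl list <name>` exits 0 only while the domain exists."""
    result = run(["xl", "list", name],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def wait_for_teardown(name, exists=domain_exists, timeout=60.0, interval=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until the guest has disappeared; return False on timeout."""
    deadline = clock() + timeout
    while exists(name):
        if clock() >= deadline:
            return False
        sleep(interval)
    return True


def step_verdict(guest_passed, name, host_alive, **wait_kwargs):
    """Fold teardown and host liveness into the verdict for one test step.

    host_alive is a caller-supplied callable (hypothetical) returning True
    if dom0 still responds after the guest has been torn down.
    """
    if not wait_for_teardown(name, **wait_kwargs):
        return "fail: guest not torn down"
    if not host_alive():
        return "fail: host down after teardown"
    return "pass" if guest_passed else "fail: test reported failure"
```

With this shape, a T1 whose guest printed success but whose teardown
takes the host down is reported as failed in its own step, because
step_verdict only returns "pass" after both checks have succeeded.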
