
Re: [Xen-devel] [XTF PATCH] xtf-runner: fix two synchronisation issues



On 29/07/16 15:31, Ian Jackson wrote:
> Wei Liu writes ("Re: [XTF PATCH] xtf-runner: fix two synchronisation issues"):
>> On Fri, Jul 29, 2016 at 01:43:42PM +0100, Andrew Cooper wrote:
>>> The runner exiting before xl has torn down the guest is very
>>> deliberate, because some part of hvm guests is terribly slow to tear
>>> down; waiting synchronously for teardown tripled the wallclock time to
>>> run a load of tests back-to-back.
>> Then, when a dead guest shows up in the snapshot of 'xl list', you
>> won't know whether it has been leaked or is just being destroyed slowly.
>>
>> Also consider that this would make back-to-back tests fail when a test
>> happens to use a guest with the same name as the one in the previous test.
>>
>> I don't think getting blocked for a few more seconds is a big issue.
>> It is important to eliminate such race conditions so that osstest can
>> work properly.
> IMO the biggest reason for waiting for teardown is that it will make
> it possible to accurately identify the xtf test which was responsible
> for the failure if a test reveals a bug which causes problems for the
> whole host.

That is perfectly reasonable.

>
> Suppose there is a test T1 which, in buggy hypervisors, creates an
> anomalous data structure, such that the hypervisor crashes when T1's
> guest is finally torn down.
>
> If we start to run the next test T2 immediately we see success output
> from T1, we will observe the host crashing "due to T2", and T1 would
> be regarded as having succeeded.
>
> This is why in an in-person conversation with Wei yesterday I
> recommended that osstest should after each xtf test (i) wait for
> everything to be torn down and (ii) then check that the dom0 is still
> up.  (And these two activities are regarded as part of the preceding
> test step.)

That is also my understanding of how the intended OSSTest integration is
going to work.

OSSTest asks `./xtf-runner --list` for the full set of tests, then
iterates over them, running one test at a time with suitable liveness
checks in between.  This does not use xtf-runner's ability to run
multiple tests back to back.
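
Roughly speaking, the flow is equivalent to the following (purely
illustrative shell, not osstest code; the `xl list 0` check is just a
stand-in for osstest's real liveness checks):

    for t in $(./xtf-runner --list); do
        ./xtf-runner "$t"
        echo "$t exited with status $?"
        # Liveness check between tests: does dom0 still answer?
        xl list 0 >/dev/null 2>&1 || { echo "host unhealthy after $t"; break; }
    done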


The dev use case, on the other hand, is something like checking a test
case refactoring or a new bit of functionality:

$ ./xtf-runner selftest
<snip>
Combined test results:
test-pv64-selftest                       SUCCESS
test-pv32pae-selftest                    SUCCESS
test-hvm64-selftest                      SUCCESS
test-hvm32pae-selftest                   SUCCESS
test-hvm32pse-selftest                   SUCCESS
test-hvm32-selftest                      SUCCESS


FWIW, I have just put a synchronous wait in to demonstrate.

Without wait:

$ time ./xtf-runner selftest
<snip>

real    0m0.571s
user    0m0.060s
sys    0m0.228s

With wait:
$ time ./xtf-runner selftest
<snip>

real    0m8.870s
user    0m0.048s
sys    0m0.280s


That is more than 8 seconds of wallclock time during which nothing
useful is happening from the point of view of a human using
./xtf-runner.  All of this time is spent between the @releaseDomain
watch firing and `xl create -F` finally exiting.
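
For reference, a synchronous wait needs nothing clever: either wait for
the `xl create -F` child to exit, or simply poll until the guest has
vanished from `xl list`, along these lines (illustrative only; the
guest name is a made-up example and this is not the code xtf-runner
actually uses):

    dom=test-hvm64-selftest       # example guest name
    while xl list "$dom" >/dev/null 2>&1; do
        sleep 1                   # still listed => still being torn down
    done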

>
> If this leads to over-consumption of machine resources because this
> serialisation is too slow then the right approach would be explicit
> parallelisation in osstest.  That would still mean that in the
> scenario above, T1 would be regarded as having failed, because T1
> wouldn't be regarded as having passed until osstest had seen that all
> of T1's cleanup had been done and the host was still up.  (T2 would
> _also_ be regarded as failed, and that might look like a heisenbug,
> but that would be tolerable.)

OSSTest shouldn't run multiple tests at once, and I have taken exactly
the same decision for XenRT.  Easy identification of what went bang is
the most important property in these cases.

We are going to have to get to a vast test library before the wallclock
time of the XTF tests approaches anything like that of installing a VM
from scratch.  I am not worried at the moment.

>
> Wei: I need to check what happens with multiple failing test steps in
> the same job.  Specifically, I need to check which one the bisector
> is likely to try to attack.

For individual XTF tests, it is entirely possible that every failure is
from a different change, so they should be treated individually.

Having said that, it is also quite likely that, given a lot of similar
microkernels, one hypervisor bug would take a large number of them out
at once,
and we really don't want to bisect each individual XTF test.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

