Re: [Xen-devel] osstest commits and Xen releases
Juergen Gross writes ("OSStest commits and Xen releases"):
> I have found an alarming tendency regarding changes in the OSStest
> repository: over the last 2 years (or 3 Xen versions) there has been
> a pattern of OSStest commits being more frequent during the RC phase
> of a Xen release.  On average there were about 4 commits to
> osstest.git per week.  The numbers were significantly higher during
> RC phases:
>
>    Version    RC phase                    OSStest commits per week
>    4.12       2019/01/16 -                          19
>    4.11       2018/04/17 - 2018/07/09               10
>    4.10       2017/10/16 - 2017/12/13                6
>
> I have looked at this as I would have liked to cut 4.12 RC2 this
> Monday, but the OSStest runs for xen-unstable failed over the
> weekend.  Ian suspected a change in OSStest to be to blame (this
> needs to be verified).
>
> As the release manager I don't like RCs being delayed due to changes
> in our infrastructure.  For Xen we have a code freeze, and patches
> wanting to go in need the release manager's ack.  Shouldn't the same
> apply to OSStest?
>
> I like OSStest very much as it helps catch bugs early.  But I
> believe the main development should not be done at the time when we
> need its results to be most reliable.
>
> Thoughts?

Thanks for raising this.  I have three lines of response.


Firstly, in the most general case: I think you have a point.  (I think
this effect is probably due to changes which had been starved of
effort being unblocked by the impending Xen freeze, but I would have
to do a full chart to be sure.)

I suggest we improve this by adopting a release-ack system for pushes
to osstest pretest after the Xen code freeze date.

In practice it will sometimes be necessary to make changes quickly
(e.g. debian-installer kernel updates), so I think I (as osstest
maintainer) would need some discretion to waive the need for a release
ack or to grant one myself; but that would certainly involve informing
you, and asking your opinion if you are available.

Another possibility would be to arrange for xen-unstable to have its
own separate branch of osstest, so that xen-unstable's runs can be
detached from the rest.  While this is technically possible, I think
it is not worth the additional complexity (admin hassle, risk of
confusion, work to reconcile branches, etc. etc.)

Do you think a release ack should be needed for commissioning new
hardware?


Secondly, on this specific set of changes, looking at it from the
point of view of whether such a release ack ought to have been
forthcoming:

We have been having hardware failures.  In particular, we have been
having PDU port failures which I am fairly sure are due to the high
frequency with which we use the PDU relays to hard power cycle the
machines.  We have also had a higher rate of other hardware problems
than I think would be expected, which might be related.

These PDU relay problems themselves lead to osstest unreliability, and
of course the longer the situation goes on the more stuff breaks.

So I think that for these changes a release ack should probably have
been granted, although perhaps additional formal testing (or some
other assurance) would have, or should have, been done - see below.


Thirdly, in this case these recent changes were in fact not anything
to do with the fact that we didn't get a push over the weekend.
Looking at the recent flights, the first of the changes I made at the
end of last week took effect in 132504 (which reported late on
Monday).

The osstest changes were:

 * Substantial changes to host (and L1 host/guest) power
   on/off/reboot machinery.  In particular, hosts are now normally
   soft-rebooted via ssh at the start of a test, rather than hard
   power cycled (see the sketch below).

 * Small changes to reporting functions.

 * One tiny change to improve some error messages.
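For illustration only - this is Python rather than osstest's Perl, and
none of it is osstest's actual code; the host naming, the timeouts and
the pdu_power_cycle() helper are placeholders - the shape of that
reboot policy is roughly:

  #!/usr/bin/env python3
  """Illustrative sketch only: osstest itself is Perl and this is not
  its code.  Prefer a soft reboot over ssh, and fall back to a hard
  power cycle through the PDU relay only if the host cannot be reached
  or does not come back.  Names and timeouts are placeholders."""

  import socket
  import subprocess
  import time


  def ssh_port_open(host, timeout=5):
      """True if the host's ssh port accepts connections."""
      try:
          with socket.create_connection((host, 22), timeout=timeout):
              return True
      except OSError:
          return False


  def soft_reboot(host):
      """Ask the host to reboot itself over ssh."""
      rc = subprocess.call(["ssh", "-o", "ConnectTimeout=10",
                            "root@" + host, "reboot"])
      # ssh often reports failure when the reboot tears the connection
      # down underneath it, so treat "command delivered" loosely.
      return rc in (0, 255)


  def pdu_power_cycle(host):
      """Placeholder for driving the PDU relay; site-specific."""
      raise NotImplementedError


  def wait_ssh_state(host, want_open, deadline_s):
      """Poll until the ssh port is (or is not) reachable, or time out."""
      deadline = time.monotonic() + deadline_s
      while time.monotonic() < deadline:
          if ssh_port_open(host) == want_open:
              return True
          time.sleep(10)
      return False


  def ensure_fresh_boot(host):
      """Soft-reboot if possible; hard power cycle only as a last resort."""
      if (ssh_port_open(host)
              and soft_reboot(host)
              and wait_ssh_state(host, False, 120)   # host actually went down
              and wait_ssh_state(host, True, 600)):  # and came back up
          return "soft"
      pdu_power_cycle(host)
      if not wait_ssh_state(host, True, 600):
          raise RuntimeError(host + " did not come back after power cycle")
      return "hard"

The point of the ordering is exactly the PDU-wear concern above: the
relays only get exercised when ssh cannot do the job.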
These changes *did* cause a regression in 132504:

  test-amd64-amd64-examine  4 memdisk-try-append  fail pass in 132478

This was not considered blocking by osstest because, from the
archaeologist's point of view, it is intermittent (the archaeologist
is right, but for the wrong reason).  But to justify that, osstest had
to look at 132478, which has other failures; so this osstest
regression was part of the reason for not getting a push on Monday
night.

The bug was effectively introduced by dropping, late in development,
the power management changes for the FreeBSD tests.  Those changes
were dropped late because, while writing more comprehensive design
comments, I realised that my intended scheme was not 100% sound.

This problem was not detected by osstest's formal self-test, because
the formal self-test did not encounter the triggering condition.  (The
bug triggers when the FreeBSD test runs on a box which, for some
reason, was left by the previous test in a state where it could not be
rebooted with ssh; the latter is quite rare.)

This risk would have been obvious to me if I had been asked (or had
asked myself) how thoroughly the changes ought to have been tested -
for example, in the context of deciding whether to grant a release
ack.  So I think your implied proposal to apply the freeze to osstest
would have avoided this: probably, I would have done additional
testing and then a better version would have gone into production.

The FreeBSD changes were made in a proper way later; i.e. the bug was
fixed on Friday and the fix is now in production.  The
currently-running xen-unstable flight picked up the fixed version.

As for the problems which actually stopped us getting a push in 132457
and 132478, and contributed to failing to get a push in 132504:


132457

 test-amd64-amd64-examine  memdisk-try-append

This is the single test step in that test which uses FreeBSD, which is
not UEFI-capable, and it ran on one of our few UEFI hosts.  I don't
want to run the test only on non-UEFI hosts, because part of the point
is to check that osstest's host interaction machinery is still working
(after changes to osstest, or indeed to Xen).  This test step ought to
be skipped on UEFI hosts; that it is not is a bug (the missing guard
is sketched below).  The workaround from a Xen point of view is either
to get lucky or, if the test is sticky enough, a force push.
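For illustration (again Python rather than osstest's Perl, and the
SKIP plumbing is hypothetical), the missing guard is roughly this; on
a Linux host a UEFI boot is detectable by the presence of
/sys/firmware/efi:

  #!/usr/bin/env python3
  """Illustrative sketch only, not osstest code: roughly the guard the
  memdisk-try-append step is missing.  memdisk is a BIOS-only boot
  loader, so on a UEFI-booted host the step should be recorded as
  skipped rather than run (and fail)."""

  import subprocess


  def host_booted_uefi(host):
      """True if the (Linux) host was booted via UEFI firmware."""
      rc = subprocess.call(["ssh", "root@" + host,
                            "test", "-d", "/sys/firmware/efi"])
      return rc == 0


  def memdisk_try_append(host):
      if host_booted_uefi(host):
          # Nothing meaningful to test on this host: skip, don't fail.
          print("SKIP: %s is UEFI-booted, memdisk not applicable" % host)
          return "skip"
      # ... the real step would exercise booting the host via memdisk
      # here ...
      return "run"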
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm  guest-localmigrate/x10

  [ 317.522719] Freezing of tasks failed after 20.005 seconds (1 tasks refusing to freeze, wq_busy=0):
  [ 317.540911] jbd2/xvda5-8 D ffffffff8109e380 0 112 2 0x00000000
  libxl: error: libxl_dom_suspend.c:367:suspend_common_wait_guest_timeout: Domain 21:guest did not suspend, timed out

This looks like some kind of Xen-specific bug in the Debian kernel.


132478

 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm  debian-hvm-install fail

I do not understand what goes wrong here.  The host and guest are
apparently working.  The guest is doing a Debian install using
debian-installer.  The guest installer asks to reboot, as is expected.
osstest manages this reboot itself, by detecting the guest's state
change, because it wants to remove the virtual installation media.  So
the first thing it does is to destroy the old domain.  This fails with
some kind of libvirt error.  I think this is a bug in libvirt.
Presumably a race.
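If it is indeed a transient race, a defensive retry around the
teardown would be one way to paper over it on the osstest side while
the real bug is chased.  Purely illustrative (Python, not osstest
code; it assumes the guest is driven through virsh and that the domain
name is known):

  #!/usr/bin/env python3
  """Illustrative sketch only, not osstest code: retry the domain
  teardown a few times before giving up, treating an already-gone
  domain as success."""

  import subprocess
  import time


  def destroy_domain(domain, attempts=5, delay_s=3):
      """Destroy a libvirt domain, retrying transient failures."""
      for _ in range(attempts):
          if subprocess.call(["virsh", "destroy", domain]) == 0:
              return
          # If the domain has already gone away, that is success too.
          state = subprocess.run(["virsh", "domstate", domain],
                                 capture_output=True, text=True)
          if state.returncode != 0 or "shut off" in state.stdout:
              return
          time.sleep(delay_s)
      raise RuntimeError("could not destroy %s after %d attempts"
                         % (domain, attempts))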
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm  guest-localmigrate/x10

  libxl: error: libxl_dom_suspend.c:367:suspend_common_wait_guest_timeout: Domain 37:guest did not suspend, timed out
  [ 365.637795] Freezing of tasks failed after 20.002 seconds (1 tasks refusing to freeze, wq_busy=0):
  [ 365.645857] jbd2/xvda5-8 D ffffffff8109e380 0 115 2 0x00000000

Same as above.


132504

 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm 16 guest-localmigrate/x10  fail REGR. vs. 132422

  libxl: error: libxl_dom_suspend.c:367:suspend_common_wait_guest_timeout: Domain 41:guest did not suspend, timed out
  [ 383.837386] Freezing of tasks failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):
  [ 383.845464] jbd2/xvda5-8 D ffffffff8109e380 0 115 2 0x00000000

Same as above.

 test-amd64-amd64-examine  4 memdisk-try-append  fail pass in 132478

osstest bug, discussed above.

HTH.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel