= Attendees =
Lars Kurth (LK)
Anthony Perard (AP)
Jim Fehlig (JF)
Bob Ball (BB)
Konrad Wilk (KW)
Antony Messerli (AM)
= Status Update =
We have had about 50% red (failed) and 50% green (passed) runs. We discovered two NOVA bugs for which patches have been submitted, but these have not yet gone into KILO.
BB: it is unlikely that we will get any more NOVA changes in (even if we discover them in, say, the next week or so), as KILO is in release candidate mode (having passed the KILO3 milestone)
and master is increasingly tied down.
AM: offered to look into getting the priority of 166184 pushed up and to get a Rackspace employee to review the patches so that they can be applied
ACTION: Antony Messerli (AM) to investigate the above ^^
BB: is also on vacation for the next two weeks, so can't do reviews, raise patches, or do any other work
On the plus side, the churn in NOVA should reduce which means we should have a more stable baseline to investigate issues.
== Scripts to investigate issues ==
BB: A couple of simple bash lines download the full console logs from the jobs and create a histogram of which tempest tests caused the majority of the failures:
# grep -h "\.\.\. FAIL" * | sed -e 's/.*\(tempest[^) ]*\).*/\1/' | sort | uniq -c | sort -n
From the logs (attached) the current prime issue is test_resize_server_confirm_from_stopped (excerpt of log below)
# Failures   Test
[snip]
24           tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks
25           tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_list_get_volume_attachments
25           tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_cryptsetup
58           tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
This should be fixed by 166184. So the priority needs to be to get these changes into NOVA.
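The one-liner above can be wrapped in a small helper so that it is easy to re-run against a fresh set of downloaded console logs. This is only a sketch: the function name failure_histogram and the log directory argument are assumptions for illustration, not part of the scripts BB used, and it assumes the failed tests are marked with "... FAIL" as in the original command.

```shell
#!/usr/bin/env bash
# Sketch only: failure_histogram and its directory argument are
# hypothetical names, not taken from the original scripts.

# Print "count test-name" for every failed tempest test found in the
# console logs under the given directory (default: current directory),
# least frequent first.
failure_histogram() {
    local log_dir="${1:-.}"
    # -r: search all logs below log_dir, -h: do not prefix file names
    grep -rh '\.\.\. FAIL' "$log_dir" \
        | sed -e 's/.*\(tempest[^) ]*\).*/\1/' \
        | sort \
        | uniq -c \
        | sort -n
}
```

For example, `failure_histogram ./console-logs | tail -n 5` would list the five most frequently failing tests, assuming the logs were saved under ./console-logs.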
= Analysis and finding a way forward =
BB: We need to reduce the failing tests step by step. In theory a quick parsing script (attached above) that looks at all of the tests
should help us identify hot spots of test failures and give us some more clarity.
*Unfortunately it turns out that this is not the case (see attached log histogram)*
AP discovered one race condition in NOVA, for which we have a patch now (aka 166184). He is working through issues, but there are a lot of moving pieces that can cause failures.
LK: asks what Xen + Libvirt baseline we are using
BB: at the moment we have a custom-built system which uses Xen 4.4.2 (with some of JF's patches) + Ubuntu Dom0 + an older libvirt version with some patches.
BB: we were also queuing up Daniel B's fix (aka 159106). We can easily add openstack changes that are necessary into the custom config.
LK: wonders whether the randomness of the failures is an indicator that we are suffering from the concurrency issues that JF is investigating
JF: has been looking at running Tempest in parallel. Whether the intermittent failures would be improved is not clear.
BB: Even within serial Tempest, multiple VMs are run and thus it is plausible that the intermittent failures could be caused by this.
JF: if VMs are started/destroyed concurrently we could be hitting the issues that have been fixed.
JF: I've made some progress on the concurrency problems Tempest (and my reproducer) bump into.
JF: sent mail with additional patches for Xen 4.4 & 4.5 and Libvirt (several Libvirt patches were committed yesterday).
[Copy of text from JF's e-mail sent prior to the meeting
On the Xen side, 5 patches are needed on top of current 4.4.x and
4.5.x. Commits 93699882d and f1335f0d from master and 4783c99a,
1c91d6fba, and 188e9c54 from staging.
A lot of work has been done on the libvirt side, much of which has been
committed. I do have one last series of three patches fixing issues
related to domain destroy that are not yet posted upstream. With those
patches on top of current libvirt.git master, and a libxl containing
aforementioned Xen patches, my Tempest reproducer passes.
]
LK: asks how easy it is to change the Xen and Libvirt baseline
BB: Not particularly straightforward, unless we can upgrade for all of the tests.
BB: We can't run a single test with a custom version of libvirt or Xen. We would have to change it for all tests.
LK: given the intermittent failures changing the Xen + Libvirt baseline would actually
AP: Testing a specific version of Xen 4.4.1 + additional patches and Libvirt plus patches
AP: Tried JF's new patches on his local set-up in parallel, and still gets many time-outs.
LK: It appears that moving the baseline for Xen and Libvirt and getting the NOVA changes in are our top next steps
BB: XenServer had an internal CI loop for quite some time before going live and achieving a good level of pass rates.
BB: would be happy to move the baseline
ACTION: JF & AP to decide what the best baseline would be
LK: asks what test environment JF uses
JF: doesn't run Tempest, runs his own Tempest reproducer
JF: the Xen version does not appear to matter so much as long as the 5 patches are in place
(4.4.2 + 4.5 + unstable with those 5 patches listed above). All of these work equally well with reproducer scripts.
JF: Will ask Xen maintainers to backport these so that the next Xen releases have the fixes in them
JF: Libvirt needs to use something very current. The latest release is not sufficient. It needs to be git master.
JF: points out that libvirt 1.2.14 (due for release in 2 weeks) will contain all the recent libvirt work
JF: will be posting the 3 missing pieces in the next few hours, but someone needs to review these before Friday
JF: Libvirt FREEZE is on Friday.
JF: re-iterates that libvirt 1.2.14 should give us a good baseline for Tempest
LK: asks whether we can use Xen Project members to review the 3 outstanding libvirt patches for libvirt 1.2.14
*AP and KW offer to review the libvirt patches*
ACTION: JF will CC xen-devel, AP & KW on these
ACTION: KW and AP to look out for these patches on xen-devel
ACTION: LK will sync with Ian Jackson (who has reviewed Libvirt patches before) to see whether he can have a look also
BB: is fairly confident that the concurrency fixes will improve the situation enough that we can turn on commenting
on all changes for the OpenStack Liberty development cycle
AM: noted that he has also been trying Fedora (vs. Ubuntu)
ACTION: BB to sync with AP and JF on baseline when back from holidays
= Estimating the cost of running *.openstack.xenproject.org =
LK: one of the outstanding actions was to estimate the cost of running Tempest, such that the Xen Project Advisory Board
budget can be revised to be more accurate
BB: notes that we should not do this until we can run Tempest in parallel, as in sequential mode we use more VMs than
really needed
AM: notes that we can pull the bill from the Citrix account; it itemises all the VMs which we then should be able to map back
to *.openstack.xenproject.org
ACTION: BB to investigate ^^ after back from vacation; sync with LK before