Xen project Mailing List

Re: [Xen-devel] [PATCH] libxl: Increase device model startup timeout to 1min.

To: Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>

From: Anthony PERARD <anthony.perard@xxxxxxxxxx>

Date: Thu, 2 Jul 2015 12:11:48 +0100

Cc: Wei Liu <wei.liu2@xxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Thu, 02 Jul 2015 11:11:56 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Wed, Jul 01, 2015 at 04:03:55PM +0100, Stefano Stabellini wrote:
> On Tue, 30 Jun 2015, Ian Jackson wrote:
> > > > * The number and nature of parallel operations done in the stress
> > > > test is unreasonable for the provided hardware:
> > > > => the timeout is fine
> > >
> > > I don't know if it is our place to make this call. Should we really be
> > > deciding what is considered "reasonable"? I think not. Defining what is
> > > reasonable and policies that match it is not a route I think we should
> > > take in libxl.
> >
> > Nevertheless if we are defining timeouts we are implicitly setting
> > some parameters which imply that certain configurations are
> > unreasonable. Hopefully all such configurations are absurd.
> >
> > If what you mean is that our bounds of `reasonable' should be very
> > wide, then I agree. If anyone could reasonably expect it to work,
> > then that is fine. Certainly we should refrain from subjective
> > judgements.
>
> OK. How do you measure reasonable for this case?
>
> What I actually mean to ask is how do you suggest we proceed on this
> problem?
>
> Of course it would be nice if we knew exactly why this is happening, but
> the issue only happens once every 2-3 tempest runs, each of them takes
> about 1 hour. Tempest executes about 1300 tests for each run, some
> of them in parallel. We haven't taken the time to read all the tests run
> by tempest so we don't know exactly what they do.
>
> We don't really know the environment that causes the failure. Reading
> all the tests is not an option. We could try adding more tracing to the
> system, but given the type of error, if we do we are not likely to
> reproduce the error at all, or maybe reproduce something different.
>
>
> Given the state of things, I suggest we make sure that increasing the
> timeout actually fixes/works-around the problem. I would also like to
> see some empirical measurements that tell us by how much we should
> increase the timeout. Is 1 minute actually enough?

I tested an increased timeout overnight. Here are the results.

The machine is an AMD Opteron(tm) Processor 4284, with 8G of RAM and 8
pCPU. It's running Ubuntu 14.04, with Xen 4.4. On top of that, OpenStack
has been deployed via devstack on a single.

The test is to run Tempest with --concurrency=4. Four tests run in
parallel, but they don't necessarily start a VM. When they do, it's a PV
guest with 64MB and 1 vCPU, and sometimes with double the amount of RAM.

The stats:
  Tempest runs: 22
  Tempest run time for each run: ~3000s
  Tempest number of tests: 1143

  After 22 runs of Tempest:
    QEMU starts: 3352
    number of starts that took more than 2s: 20
    number of starts that took more than 9s: 6
    maximum start time: 10.973713s

I gathered the QEMU start times by running strace on each QEMU. I then
looked at the time elapsed from the first syscall, execve('qemu'), until
the syscall where QEMU responds on its QMP socket (libxl has acknowledged
that QEMU is running at that point).

-- 
Anthony PERARD

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
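
For readers who want to repeat the measurement described in the message, here is a minimal sketch of how a start time could be extracted from an strace log. It is not the script used for the numbers above. It assumes the log was captured with absolute timestamps (strace -ttt, optionally with -f so lines carry a PID prefix), and that the caller supplies a substring identifying the syscall in which QEMU answers on its QMP socket. The names startup_seconds, summarise and end_marker are hypothetical.

```python
import re

# Assumed line shape (strace -ttt): "<epoch>.<usec> syscall(args) = ret",
# optionally preceded by "1234 " or "[pid  1234] " when -f is used.
TIMESTAMPED = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\d+\.\d+)\s+(.*)$")


def startup_seconds(lines, end_marker):
    """Seconds from the first execve() to the first later line containing end_marker.

    Returns None if either event is missing from the log.
    """
    start = None
    for line in lines:
        match = TIMESTAMPED.match(line)
        if not match:
            continue  # skip lines without a timestamp (e.g. signal notices)
        stamp, rest = float(match.group(1)), match.group(2)
        if start is None:
            if rest.startswith("execve("):
                start = stamp
        elif end_marker in rest:
            return stamp - start
    return None


def summarise(durations, thresholds=(2.0, 9.0)):
    """Count starts, the slowest start, and how many exceeded each threshold."""
    return {
        "count": len(durations),
        "max": max(durations) if durations else None,
        "over": {t: sum(1 for d in durations if d > t) for t in thresholds},
    }
```

Calling startup_seconds() once per log and passing the collected values to summarise() would yield the same kind of figures as the stats in the message. For scale, 20 of the 3352 starts reported there is roughly 0.6%, and 6 of 3352 is roughly 0.18%.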
