
Re: [Xen-devel] [PATCH] libxl: Increase device model startup timeout to 1min.



On Wed, Jul 01, 2015 at 04:03:55PM +0100, Stefano Stabellini wrote:
> On Tue, 30 Jun 2015, Ian Jackson wrote:
> > > >   * The number and nature of parallel operations done in the stress
> > > >     test is unreasonable for the provided hardware:
> > > >       => the timeout is fine
> > > 
> > > I don't know if it is our place to make this call.  Should we really be
> > > deciding what is considered "reasonable"? I think not. Defining what is
> > > reasonable and policies that match it is not a route I think we should
> > > take in libxl.
> > 
> > Nevertheless if we are defining timeouts we are implicitly setting
> > some parameters which imply that certain configurations are
> > unreasonable.  Hopefully all such configurations are absurd.
> > 
> > If what you mean is that our bounds of `reasonable' should be very
> > wide, then I agree.  If anyone could reasonably expect it to work,
> > then that is fine.  Certainly we should refrain from subjective
> > judgements.
> 
> OK.  How do you measure reasonable for this case?
> 
> What I actually mean to ask is how do you suggest we proceed on this
> problem?
> 
> Of course it would be nice if we knew exactly why this is happening, but
> the issue only happens once every 2-3 tempest runs, each of them takes
> about 1 hour.  Tempest executes about 1300 tests for each run, some
> of them in parallel. We haven't taken the time to read all the tests run
> by tempest so we don't know exactly what they do.
> 
> We don't really know the environment that causes the failure. Reading
> all the tests is not an option. We could try adding more tracing to the
> system, but given the type of error, if we do we are not likely to
> reproduce the error at all, or maybe reproduce something different.
> 
> 
> Given the state of things, I suggest we make sure that increasing the
> timeout actually fixes/works-around the problem. I would also like to
> see some empirical measurements that tell us by how much we should
> increase the timeout. Is 1 minute actually enough?

I tested an increased timeout overnight. Here are the results.

The machine is an AMD Opteron(tm) Processor 4284 with 8G of RAM and 8 pCPUs.
It is running Ubuntu 14.04 with Xen 4.4. On top of that, OpenStack has
been deployed via devstack on a single.

The test is to run Tempest with --concurrency=4. Four tests run in
parallel, but they don't necessarily start a VM. When they do, it's a PV
guest with 64MB of RAM and 1 vCPU, sometimes with double that amount of RAM.

The stats:
  Tempest runs: 22
  Tempest run time for each run: ~3000s
  Number of Tempest tests: 1143
After 22 runs of Tempest:
  QEMU starts: 3352
  number of starts that took more than 2s: 20
  number of starts that took more than 9s: 6
  maximum start time: 10.973713s
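
For anyone who wants to redo the same kind of summary on their own
measurements, here is a minimal sketch. The function name and the sample
values are made up for illustration; they are not the script or the data
behind the numbers above. It only counts how many start durations exceed
each threshold and reports the maximum.

```python
def summarize_start_times(durations, thresholds=(2.0, 9.0)):
    """Count how many start durations (in seconds) exceed each threshold."""
    if not durations:
        raise ValueError("no durations given")
    # For every threshold, count the durations strictly above it.
    over = {t: sum(1 for d in durations if d > t) for t in thresholds}
    return {"count": len(durations), "over": over, "max": max(durations)}


if __name__ == "__main__":
    # Illustrative values only, NOT the measured data.
    sample = [0.4, 0.6, 2.5, 9.8, 1.1]
    print(summarize_start_times(sample))
```

Comparing the maximum against a candidate timeout gives a rough idea of
the remaining margin on a given machine.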

I gathered the QEMU start times by running strace on each of them. I then
looked at the time from the first syscall, execve('qemu'), until the
syscall where QEMU responds on its QMP socket (libxl has acknowledged
that QEMU is running at that point).
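
The measurement described above could be approximated with something like
the sketch below. It assumes the trace was recorded with strace's -ttt
option, so each line starts with a seconds.microseconds timestamp
(optionally preceded by a PID when -f is used). How to recognise the QMP
response in the trace is an assumption here: the caller passes a
hypothetical ready_marker substring, and the first line after execve()
that contains it is taken as the end point.

```python
import re

# With -ttt, strace prefixes each line with an epoch timestamp; with -f a
# PID column ("1234 " or "[pid 1234] ") may come before it.
_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\d+\.\d+)\s+(.*)$")


def start_duration(trace_lines, ready_marker):
    """Seconds from the first execve() to the first later line that
    contains ready_marker, or None if either event is missing."""
    t_exec = None
    for line in trace_lines:
        m = _LINE.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        ts, call = float(m.group(1)), m.group(2)
        if t_exec is None:
            if call.startswith("execve("):
                t_exec = ts
        elif ready_marker in call:
            return ts - t_exec
    return None


if __name__ == "__main__":
    # Hand-written example lines, not taken from a real trace.
    example = [
        '1000.000000 execve("/usr/bin/qemu", ["qemu"], []) = 0',
        '1000.500000 open("/dev/null", O_RDWR) = 3',
        '1002.250000 write(7, "{\\"return\\": {}}", 14) = 14',
    ]
    print(start_duration(example, "return"))
```

The marker-based match is deliberately crude; a stricter version would
also check the file descriptor of the QMP socket.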

-- 
Anthony PERARD

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

