
Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

> On Jul 5, 2018, at 12:05 PM, Ian Jackson <ian.jackson@xxxxxxxxxx> wrote:
> Lars Kurth writes ("Re: [Xen-devel] [Notes for xen summit 2018 design 
> session] Process changes: is the 6 monthly release Cadence too short, 
> Security Process, ..."):
>> * The Gitlab machinery will do build tests (and the discussion
>>  showed that we should be able to do this via cross compilation or
>>  compilation on a real system if a service such as infosifter is
>>  used [...]
>> * This could eventually include a basic set of smoke tests that are
>>  system independent and could run under QEMU - Doug already uses a
>>  basic test where a xen host and/or VM is started
> Firstly, I think this is all an excellent idea.  It should be pursued.
> I don't think it interacts directly with osstest except to reduce the
> rate of test failures.
>> [Juergen:]
>>    A major source of the pain seems to be the hardware: About half of all
>>    cases where I looked into the test reports to find the reason for a
>>    failing test flight were related to hardware failures. Not sure how to
>>    solve that.
>> This is worrying and I would like to get Ian Jackson's viewpoint on it. 
> I haven't worked up a formal analysis of the pattern of failures, but
> (discussing hardware trouble only):
> * We are still waiting for new ARM hardware.  When we get it we will
>   hopefully be able to decommission the arndale dev boards, whose
>   network controllers are unreliable.
>   Sadly, the unreliability of the armhf tests has become so
>   normalised that we all just shrug and hope the next one will be
>   better.  Another option would be to decommission the arndales right
>   away and reduce the armhf test coverage.
> * We have had problems with PDU relays, affecting about three
>   machines (depending how you count things).  My experience with the
>   PDUs in Massachusetts has been much poorer than in Cambridge.  I
> think the underlying cause is probably USAian 110V electricity (!).
>   I have a plan to fix this, involving more use of IPMI in tandem
>   with the PDUs, which I hope will reduce this significantly.
> * As the test lab increases in size, the rate of hardware failure
>   necessarily also rises.  Right now, response to that is manual: a
>   human must notice the problem, inspect test results, decide it is a
>   hardware problem, and take the affected node out of service.  I
>   am working on a plan to do that part automatically.
>   Human intervention will still be required to diagnose and repair
>   the problem of course, but in the meantime, further tests will not
>   be affected.
>>    Another potential problem showed up last week: OSSTEST is using the
>>    Debian servers for doing the basic installation. A change there (e.g.
>>    a new point release) will block tests. I'd prefer to have a local cache
>>    of the last known good set of *.deb files to be used especially for the
>>    branched Xen versions. This would rule out remote problems for releases.
>> This is again something which we should definitely look at.
> This was bad luck.  This kind of update happens about 3-4 times a
> year.  It does break everything, leading to a delay of a day or two,
> but the fix is straightforward.
> Obviously this is not ideal but the solutions are nontrivial.  It is
> not really possible to "have a local cache of the last known good set
> of *.deb files" without knowing what that subset should be; that would
> require an edifice to track what is used, or some manual configuration
> which would probably break.  Alternatively we could run a complete
> mirror but that is a *lot* of space and bandwidth, most of which would
> be unused.
> I think the right approach is probably to switch from using d-i for
> host installs, to something like FAI.  That would be faster as well.
> However that amounts to reengineering the way osstest does host
> installs; it would also leave us maintaining an additional way to do
> host installs, since we would still want to be able to *test* d-i
> operation as a guest.
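On the apt side, a middle ground between a full mirror and no cache at all might be to point the branched flights at snapshot.debian.org, which serves the archive frozen at a point in time.  Something along these lines in sources.list (the date below is a made-up example; snapshot resolves a timestamp to the nearest archived state, and check-valid-until=no is needed because old Release files eventually expire):

```
# Hypothetical sources.list pinning a known-good archive state.
# The timestamp is illustrative only, not an actual tested snapshot.
deb [check-valid-until=no] http://snapshot.debian.org/archive/debian/20180601T000000Z/ stretch main
deb [check-valid-until=no] http://snapshot.debian.org/archive/debian-security/20180601T000000Z/ stretch/updates main
```

That would make the branched versions immune to point releases without us having to track which subset of *.deb files is actually used.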

What I think would be ideal is a way to take ‘snapshots’ of different setup 
states for various hosts and revert to them.  There’s no reason to do a full 
install of a host on every osstest run, when that install happens 1) before we 
even install Xen, and 2) should be nearly identical each time.  We should be 
able to install a host, take a snapshot of the “clean” install, then do the 
build prep, take a snapshot of that, and then simply revert to one or both of 
those (assuming build requirements haven’t changed in the meantime) whenever 
necessary.  Re-generating these snapshots once per week per host should be 
plenty, and it sounds like that would massively improve the current throughput.
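To make the idea concrete, here is a rough sketch of how this could look with plain LVM snapshots on the host’s root volume.  The volume group and LV names (“vg0”, “root”) and the snapshot size are assumptions, not osstest’s actual layout, and by default the script only prints the commands it would run:

```shell
#!/bin/sh
# Sketch only: capture and revert host setup states with LVM snapshots.
# VG/LV names and sizes are invented; DRY_RUN=1 (the default) just
# prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

# Take a named snapshot of the root LV, e.g. once after the clean
# install and again after build prep.
take_snapshot() {    # take_snapshot <name>
    run lvcreate -s -L 10G -n "$1" /dev/vg0/root
}

# Merge a snapshot back into its origin, discarding everything done
# since it was taken; for an in-use root LV the merge completes on
# the next reboot.
revert_to() {        # revert_to <name>
    run lvconvert --merge "/dev/vg0/$1"
    run reboot
}

take_snapshot clean-install
take_snapshot build-prep
revert_to clean-install
```

Reverting would then cost a reboot rather than a full d-i run.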

I’d also like to propose that we try to find a more efficient way of testing 
guest functionality than doing a full guest install.  I understand it’s a 
natural way to exercise a reasonable range of functionality, but particularly 
for Windows guests my impression is that it’s very slow; there must be a way to 
build a test with similar coverage that starts from a pre-installed snapshot 
and completes in only a few minutes.
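For the guest case the toolstack already has the primitives: install once, `xl save` the result, and `xl restore` it at the start of each subsequent test.  A hedged sketch of that decision (the domain name, config file and save path are invented, and the commands are echoed rather than run since this is illustration only):

```shell
#!/bin/sh
# Sketch only: reuse a pre-installed guest image via xl save/restore
# instead of reinstalling it every flight.  Names/paths are made up;
# echo shows the intended xl commands without running them.
plan_guest_boot() {   # plan_guest_boot <savefile>
    if [ -f "$1" ]; then
        # Resuming a saved guest takes seconds, versus the better
        # part of an hour for a fresh Windows install.
        echo "xl restore $1"
    else
        # First run only: do the full install, then save the result
        # for every later test to restore from.
        echo "xl create win-guest.cfg"
        echo "xl save win-guest $1"
    fi
}

plan_guest_boot /var/lib/xen/save/win-guest.save
```

The saved image would be refreshed on the same weekly cadence as the host snapshots above.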


Xen-devel mailing list