
[Wg-test-framework] baroque1 hardware problem



Ian Jackson writes ("Minutes All-Net synchronisation call, 20th May"):
> ACTION: Ian to send technical details, so that All-net can raise
>      with supplier (Intel).

So, the problem is as follows:


Summary:
--------

Sometimes, when powered on, baroque1 does not come up.

Symptoms are that the serial control lines do change (visible in the
sympathy log), but no text is printed on the serial console.  Waiting
a long time (at least ten minutes) has no effect.  Sending
"return" on the serial console elicits no response.

After the problem has occurred, often more than one further attempt to
power cycle the machine is required to get it to work again.  After
that it works normally until the fault recurs.


Repro method:
-------------

 * Write a pxeboot file which refers to a stock Wheezy amd64
   debian-installer netboot image, and specifies a preseed file.

 * Power off (via the PDU).

 * Wait 30s.

 * Power on (via the PDU).

 * Monitor the preseed file HTTP server access log, waiting for the
   preseed file to be downloaded.

 * When the preseed file has been fetched, declare "success".

   Then go round for the next repetition.  (Generally, this means
   that the server is powered off in the middle of one of
   debian-installer's software-fetching steps.)

   Alternatively, after 350s, declare "failure" and stop.
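
As a rough sketch only (this is not the actual test script), the loop
above could look something like the following.  The callables
power_off, power_on and preseed_fetched are hypothetical placeholders
for whatever drives the PDU and watches the HTTP access log;
preseed_fetched is assumed to return True only once the preseed file
has been fetched since the most recent power-on.

```python
import time


def run_power_cycle_test(power_off, power_on, preseed_fetched,
                         off_wait_s=30.0, timeout_s=350.0, poll_s=1.0,
                         max_reps=None,
                         sleep=time.sleep, clock=time.monotonic):
    """Power cycle repeatedly; return the number of successful
    repetitions completed before the first failure.

    power_off, power_on, preseed_fetched are hypothetical callables
    supplied by the caller (PDU control and access-log check).
    """
    successes = 0
    while max_reps is None or successes < max_reps:
        power_off()
        sleep(off_wait_s)              # "Wait 30s"
        power_on()
        deadline = clock() + timeout_s
        # Poll until the preseed file is fetched or 350s have passed.
        while not preseed_fetched():
            if clock() >= deadline:
                return successes       # declare "failure" and stop
            sleep(poll_s)
        successes += 1                 # declare "success", go round again
    return successes
```

The sleep and clock parameters exist only so that the loop can be
exercised with a fake clock instead of waiting in real time.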


Statistical information:
------------------------

* My records show failures after the following number of repetitions:
   96 (not quite sure about this - data collection was affected by an
       unrelated network problem on my workstation)
   29, 25, 26.

* My records show the following number of attempts needed to get the
  machine to work at all, again:
   1, 3, 2, 3

* I have run the same test on baroque0.  It has managed (at least) 400
  consecutive power cycle restarts without problem.
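
To put those numbers in perspective, here is a rough calculation.  It
assumes (and these are assumptions, not measurements) that each power
cycle fails independently with the same probability, and that each
count above is the number of cycles up to and including the failing
one.  The function names are purely illustrative.

```python
def failure_rate(cycles_per_failure):
    """Estimated per-cycle failure probability: one failure per run,
    divided by the total number of cycles across all runs."""
    return len(cycles_per_failure) / sum(cycles_per_failure)


def prob_all_pass(p_fail, n_cycles):
    """Probability of n_cycles consecutive successes if every cycle
    fails independently with probability p_fail."""
    return (1.0 - p_fail) ** n_cycles


baroque1_runs = [96, 29, 25, 26]    # the 96 is the uncertain data point
p = failure_rate(baroque1_runs)     # 4 failures in 176 cycles
print(f"estimated failure rate per cycle: {p:.4f}")
print(f"chance of 400 clean cycles at that rate: {prob_all_pass(p, 400):.1e}")
```

Under those assumptions the rate is roughly 2% per cycle, and 400
clean cycles in a row at that rate would be very unlikely (on the
order of one in ten thousand).  That is consistent with baroque0 not
having the fault, rather than just having been lucky.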


Handover:
---------

I hereby hand both baroque0 and baroque1 over to you.  (It seems most
sensible to give you the working machine too, for comparison.)

The current setup in the colo is the PXE configuration as described
above.

So I think it should be possible to reproduce the problem as follows:

  - power baroque1 off
  - wait 30s
  - power baroque1 on

  - wait for it to show life on the serial console

  - wait for it to show entry into debian-installer
     (eg wait for "Setting up the clock" to be printed on the
      serial console), then declare success and go round again
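
If the serial console output for each boot is captured to a string
(by whatever means; the capture itself is not shown here), the checks
above might be sketched as below.  The function name and the three
result labels are hypothetical; only the "Setting up the clock" marker
comes from the steps above.

```python
def classify_console(output, marker="Setting up the clock"):
    """Classify captured serial console text from one boot attempt.

    Returns "success" if the installer marker was seen, "alive" if
    some text was printed but not the marker, and "silent" if nothing
    was printed at all (the failure symptom described in the summary).
    """
    if marker in output:
        return "success"
    if output.strip():
        return "alive"
    return "silent"
```

A "silent" result after a suitable timeout would correspond to the
fault; "alive" without "success" would be some other kind of problem.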

I have disconnected the serial consoles of both machines from
sympathy, so you should be able to connect to them with picocom or
expect on /dev/ttyRP5 and /dev/ttyRP6.


NB that I am now going to be away until next Wednesday morning.

Ian.
