
Re: [Xen-devel] [xen-unstable test] 13677: regressions - FAIL



>>> On 12.09.12 at 11:48, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote:
> On Wed, 2012-09-12 at 09:14 +0100, Jan Beulich wrote:
>> >>> On 11.09.12 at 19:28, xen.org <ian.jackson@xxxxxxxxxxxxx> wrote:
>> > flight 13677 xen-unstable real [real]
>> > http://www.chiark.greenend.org.uk/~xensrcts/logs/13677/ 
>> > 
>> > Regressions :-(
>> > 
>> > Tests which did not succeed and are blocking,
>> > including tests which could not be run:
>> >  test-amd64-i386-rhel6hvm-amd  8 guest-stop                fail REGR. vs. 13668
>> >  test-amd64-i386-qemuu-rhel6hvm-amd  8 guest-stop          fail REGR. vs. 13668
>> >  test-amd64-i386-qemuu-rhel6hvm-intel  8 guest-stop        fail REGR. vs. 13668
>> >  test-amd64-i386-rhel6hvm-intel  8 guest-stop              fail REGR. vs. 13668
>> 
>> While it seems very likely that, if anything, one of my two changes
>> under test here has caused this, looking through the
>> logs I can't spot anything that would tell me what's wrong.
>> According to var-log-xen-qemu-dm-redhat.guest.osstest.log,
>> the guest went down (but it didn't even enter "dying" mode
>> yet according to the diagnostic output in the serial log). Nor
>> can I see any close relation between the behavior and the
>> changesets under test...
> 
> Looking at test-amd64-i386-rhel6hvm-amd in flight 13675 which succeeded
> the guest log ends:
>         Halting system...
>         type=1128 audit(1347348133.094:15071): user pid=1432 uid=0 auid=4294967295 ses=4294967295 msg='init: exe="/sbin/reboot" hostname=? addr=? terminal=console res=success'
>         md: stopping all md devices.
>         xenbus_dev_shutdown: device/vkbd/0: Initialising != Connected, skipping
>         xenbus_dev_shutdown: device/vbd/5632: Closed != Connected, skipping
>         ACPI: Preparing to enter system sleep state S5
>         Disabling non-boot CPUs ...
>         Broke affinity for irq 4
>         Broke affinity for irq 12
>         SMP alternatives: switching to UP code
>         Power down.
>         shutdown requested in cpu_handle_ioreq
>         Issued domain 2 poweroff
>         
> Whereas in the failing case it cuts off after "stopping all md devices".
> 
> 13675 failed another sequence, let's assume for unrelated reasons. The
> delta in the commits is just:
> 
> 25844:0a9a4549e6b9 powernow: Update P-state directly when _PSD's CoordType is DOMAIN_COORD_TYPE_HW_ALL
> 25843:51090fe1ab97 x86/HVM: assorted RTC emulation adjustments
> 
> The first is a host level thing which I doubt would so consistently
> affect HVM guests (and anyway, Intel tests are also failing). Which
> pretty much leaves 25843:51090fe1ab97 or some weird heisenbug.
> 
> Is it outside the realms of possibility that the guest has managed to
> limp along with the RTC being broken in some subtle way and only
> eventually trips up when we come to shut down?

That's certainly not impossible, but afaik Linux doesn't play with
the RTC unless told to by user space (whereas Windows, as we
know from the reporter of the problem that triggered putting
together these changes, does on its own at least under certain
circumstances, yet the Windows tests all go through fine).

Certainly this is the most likely candidate for having broken
something, and hence would be the prime candidate for reverting.
But before doing so, I'd want to see at least another run's results.
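
For comparing the guest console log of a passing flight against a failing
one once another run is in, a rough sketch along these lines might help
pin down where the shutdown sequence stops. The helper name
first_missing_line is hypothetical and not part of osstest; the passing
sample is taken from the 13675 log quoted above, while the failing sample
is only an assumption built from the statement that the log cuts off
after "stopping all md devices". Lines carrying timestamps (such as the
audit record) would differ between runs and need normalising first, which
this sketch does not attempt.

```python
def first_missing_line(passing_log, failing_log):
    """Walk the passing log in order and report where the failing log stops.

    Returns (last_common, first_missing): the last passing-log line that
    also appears in the failing log, and the first passing-log line that
    does not. first_missing is None if every line was found.
    """
    seen = {line.strip() for line in failing_log if line.strip()}
    last_common = None
    for line in passing_log:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if text in seen:
            last_common = text
        else:
            return last_common, text
    return last_common, None


# Shutdown milestones from the passing flight (13675), as quoted above.
passing = [
    "Halting system...",
    "md: stopping all md devices.",
    "ACPI: Preparing to enter system sleep state S5",
    "Disabling non-boot CPUs ...",
    "Power down.",
]

# Assumed shape of the failing log: it ends after the md line.
failing = [
    "Halting system...",
    "md: stopping all md devices.",
]

last_common, first_missing = first_missing_line(passing, failing)
print("last common line:  ", last_common)
print("first missing line:", first_missing)
```

With the samples above, the first line missing from the failing log is
the ACPI S5 preparation message, which would be consistent with the
guest stalling somewhere between stopping the md devices and powering
off.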

> Looking back at 13675, is it possible that:
> 
> 25842:a1f73e989c24 x86/hvm: don't give vector callback higher priority than NMI/MCE
> 
> is exposing a race in the guest or something? I very much doubt any NMI
> or MCE are being injected at all though.

So do I.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

