
Re: [Xen-devel] Regression, host crash with 4.5rc1



>>> On 23.11.14 at 02:28, <sflist@xxxxxxxxx> wrote:
> With mwait-idle=0:
> 
> (XEN) 'c' pressed -> printing ACPI Cx structures
> (XEN) ==cpu0==
> (XEN) active state:             C0
> (XEN) max_cstate:               C7
> (XEN) states:
> (XEN)     C1:   type[C1] latency[001] usage[00000000] method[  FFH] 
> duration[0]
> (XEN)     C2:   type[C0] latency[000] usage[00000000] method[ NONE] 
> duration[0]
> (XEN)     C3:   type[C3] latency[064] usage[00000000] method[  FFH] 
> duration[0]
> (XEN)     C4:   type[C3] latency[096] usage[00000000] method[  FFH] 
> duration[0]
> (XEN)    *C0:   usage[00000000] duration[46930624784]
> (XEN) PC2[0] PC3[0] PC6[0] PC7[0]
> (XEN) CC3[0] CC6[0] CC7[0]
>[...]

Very interesting - the hypervisor has C-state information, but never
entered any of them. That certainly explains the difference between
using/not using the mwait-idle driver, but puts us back to there being
a more general issue with C-state use on this CPU model. Possibly
related to C2 having entry method "NONE", but then again I can't
see how such a state could get entered into the table in the first place:
set_cx() bails upon check_cx() returning an error, and hence its
switch()'s default statement should never be reached. Plus even if an
array entry was set to "NONE", it should simply be ignored when
looking for a state to enter. I'll probably need to put together a
debugging patch to figure out what's going on here.
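
To make the oddity in the dump easier to spot across all CPUs, here is a
minimal sketch that scans 'c' debug-key output for states with entry
method "NONE" or type C0. It assumes the line layout quoted above; the
names parse_cx_states() and suspicious_states() are made up for
illustration and are not part of Xen.

```python
import re

# Matches one state line of the 'c' debug-key dump in the layout quoted above.
# The "*C0:" summary line has no type[] field and is therefore skipped.
CX_LINE = re.compile(
    r"C(?P<idx>\d+):\s+type\[C(?P<type>\d+)\]\s+latency\[(?P<latency>\d+)\]\s+"
    r"usage\[(?P<usage>\d+)\]\s+method\[\s*(?P<method>\w+)\]"
)


def parse_cx_states(dump):
    """Return one dict per C-state line found in the dump text."""
    states = []
    for line in dump.splitlines():
        m = CX_LINE.search(line)
        if m:
            states.append({
                "index": int(m.group("idx")),
                "type": int(m.group("type")),
                "latency": int(m.group("latency")),
                "usage": int(m.group("usage")),
                "method": m.group("method"),
            })
    return states


def suspicious_states(states):
    """Indices of states with entry method NONE or type C0 (hypothetical check)."""
    return [s["index"] for s in states if s["method"] == "NONE" or s["type"] == 0]


# State lines copied from the report above (wrapped duration[] fields omitted).
SAMPLE = """\
(XEN)     C1:   type[C1] latency[001] usage[00000000] method[  FFH]
(XEN)     C2:   type[C0] latency[000] usage[00000000] method[ NONE]
(XEN)     C3:   type[C3] latency[064] usage[00000000] method[  FFH]
(XEN)     C4:   type[C3] latency[096] usage[00000000] method[  FFH]
(XEN)    *C0:   usage[00000000] duration[46930624784]
"""

if __name__ == "__main__":
    states = parse_cx_states(SAMPLE)
    print("suspicious states:", suspicious_states(states))
    print("no state ever entered:", all(s["usage"] == 0 for s in states))
```

Feeding it the full serial log instead of SAMPLE should show whether the
same entry is odd on every CPU or only on some of them.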

In any event C2 being set to "NONE" and that information presumably
coming from firmware is an indication that there's a problem with C2
(note that the numbering doesn't really match up with what the
document says, this likely really is C1E) on that CPU. Which gets us
back to ...

> CPU information for one of the cores, 2.8 GHz is nominal, stepping is 2. 
> Not sure how to translate that stepping number into Intel's format:
> 
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 44
> model name      : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> stepping        : 2
>[...]
>> There are a couple potentially relevant errata (BC36, BC38, BC54,
>> BC77, BC110).
>>
>> To exclude BC36, a boot log with "apic-verbosity=debug" and debug
>> key 'i' output would be necessary.
> 
> Done, see the very end of the email.
> 
>> BC38 should not affect us since we don't enter C states from ISRs.
>>
>> BC54 is probably irrelevant since we meanwhile know that your
>> system doesn't really hang hard.
>>
>> For BC77 it would be worth trying to disable turbo mode instead of
>> disabling the mwait-idle driver ("xenpm disable-turbo-mode" right
>> after boot).
> 
> I looked up BC77 but as a result found this document[1], which seems to 
> relate to the i7.  Would this[2] not be the relevant document?
> 
> [1] 
> http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-ee-and-desktop-processor-series-32nm-spec-update.pdf
> 
> [2] 
> http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf

Indeed. I wasn't aware that there are family/model/stepping tuples
that can be both Xeon and desktop CPUs.
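
On translating the stepping number: a minimal sketch, assuming the
standard CPUID leaf 1 EAX layout (stepping in bits 3:0, model in 7:4,
family in 11:8, extended model in 19:16, extended family in 27:20) and
that /proc/cpuinfo reports the combined "display" family and model. The
function name cpuid_signature() is made up for illustration. The
resulting value should be comparable against the processor signatures
listed in the specification update.

```python
def cpuid_signature(family, model, stepping):
    """Rebuild the CPUID.1:EAX signature from /proc/cpuinfo style values.

    Processor type bits (13:12) are left at zero.
    """
    if not 0 <= stepping <= 0xF:
        raise ValueError("stepping must fit in 4 bits")

    # Families above 15 are encoded as base family 15 plus an extended part.
    if family > 0xF:
        base_family, ext_family = 0xF, family - 0xF
    else:
        base_family, ext_family = family, 0

    # The extended model field is only used for base family 6 or 15.
    if base_family in (0x6, 0xF):
        base_model, ext_model = model & 0xF, model >> 4
    elif model <= 0xF:
        base_model, ext_model = model, 0
    else:
        raise ValueError("model does not fit without the extended model field")

    if ext_model > 0xF or ext_family > 0xFF:
        raise ValueError("family or model out of range")

    return (
        (ext_family << 20)
        | (ext_model << 16)
        | (base_family << 8)
        | (base_model << 4)
        | stepping
    )


if __name__ == "__main__":
    # Values from the /proc/cpuinfo excerpt quoted above.
    print(hex(cpuid_signature(family=6, model=44, stepping=2)))
```

For the values quoted above (family 6, model 44, stepping 2) this gives
0x206c2.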

> As promised, below is the apic-verbosity=debug log, with 'i'. Thanks!

I'm sorry, I misspelled the option, it's really "apic_verbosity=debug".
The 'i' output at least already confirms that there are no ExtINT
entries among the IO-APIC RTEs.

Jan

