
Re: [Xen-devel] High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x



On 27/03/2013 14:31, Marek Marczykowski wrote:
> On 27.03.2013 09:52, Jan Beulich wrote:
>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>>> So vector e9 doesn't appear to be programmed in anywhere.
>> Quite obviously, as it's the 8259A vector for IRQ 9. The question
>> really is why an IRQ appears on that vector in the first place. The
>> 8259A resume code _should_ leave all IRQs masked on a fully
>> IO-APIC system (see my question raised yesterday).
>>
>> And that's also why I suggested, for an experiment, to fiddle with
>> the loop exit condition to exclude legacy vectors (which wouldn't
>> be a final solution, but would at least tell us whether the direction
>> is the right one). In the end, besides understanding why an
>> interrupt on vector E9 gets raised at all, we may also need to
>> tweak the IRQ migration logic to not do anything on legacy IRQs,
>> but that would need to happen earlier than in
>> smp_irq_move_cleanup_interrupt(). Considering that 4.3
>> apparently doesn't have this problem, we may need to go hunt for
>> a change that isn't directly connected to this, yet deals with the
>> problem as a side effect (at least I don't recall any particular fix
>> since 4.2). One aspect here is the double mapping of legacy IRQs
>> (once to their IO-APIC vector, and once to their legacy vector,
>> i.e. vector_irq[] having two entries pointing to the same IRQ).
> So tried change loop condition to LAST_DYNAMIC_VECTOR and it doesn't hit that
> BUG/ASSERT. But still it doesn't work - only CPU0 used by scheduler, also some
> errors from dom0 kernel, and errors about PCI devices used by domU(1).
>
> Messages from resume (different tries):
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>
> Also one time I've got fatal page fault error, earlier in resume (it isn't
> deterministic):
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>

This page fault is a NULL structure pointer dereference, likely of the
scheduling data.  At first glance, it looks related to the assertion
failures I have been seeing sporadically in testing but have been unable
to reproduce reliably.  There seems to be something quite dodgy in the
interaction between vcpu_wake and the scheduling loops.

The other logs indicate that dom0 appears to have a domain id of 1,
which is sure to cause problems.

As for locating the source of the legacy vectors, it might be a good idea
to stick a printk at the top of do_IRQ() that reports any interrupt
arriving on a vector between 0xe0 and 0xef.  This might at least indicate
whether legacy vectors are genuinely being delivered, or whether we have
some memory corruption causing these effects.
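
A rough, standalone sketch of the check described above, not the actual
Xen patch.  The helper names and the SKETCH_* macros are made up for
illustration, and fprintf() stands in for printk() so the fragment
compiles outside the hypervisor.  The mapping of vector to IRQ assumes
the legacy range starts at 0xe0, which matches e9 being IRQ 9.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed legacy 8259A vector range, taken from the 0xe0-0xef range above. */
#define SKETCH_LEGACY_FIRST 0xe0u
#define SKETCH_LEGACY_LAST  0xefu

/* Hypothetical predicate: does this vector fall in the legacy range? */
static bool is_legacy_vector(unsigned int vector)
{
    return vector >= SKETCH_LEGACY_FIRST && vector <= SKETCH_LEGACY_LAST;
}

/*
 * Hypothetical helper.  In the hypervisor the equivalent test would sit
 * at the very top of do_IRQ() and call printk() instead of fprintf().
 * Returns true if a message was logged.
 */
static bool report_legacy_vector(unsigned int cpu, unsigned int vector)
{
    if ( !is_legacy_vector(vector) )
        return false;

    /* With the assumed base of 0xe0, vector 0xe9 corresponds to IRQ 9. */
    fprintf(stderr, "CPU%u: interrupt on legacy vector %#x (IRQ %u)\n",
            cpu, vector, vector - SKETCH_LEGACY_FIRST);
    return true;
}
```

If messages like this show up across resume, the vectors really are being
delivered; if the BUG/ASSERT fires without any such message, corruption
becomes the more likely explanation.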

~Andrew



 

