
Re: NetBSD dom0 PVH: hardware interrupts stalls


  • To: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Jürgen Groß <jgross@xxxxxxxx>
  • Date: Tue, 24 Nov 2020 16:23:11 +0100
  • Delivery-date: Tue, 24 Nov 2020 15:23:17 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 24.11.20 15:59, Roger Pau Monné wrote:
On Tue, Nov 24, 2020 at 03:42:28PM +0100, Jan Beulich wrote:
On 24.11.2020 11:05, Jan Beulich wrote:
On 23.11.2020 18:39, Manuel Bouyer wrote:
On Mon, Nov 23, 2020 at 06:06:10PM +0100, Roger Pau Monné wrote:
OK, I'm afraid this is likely too verbose and messes with the timings.

I've been looking (again) into the code, and I found something weird
that I think could be related to the issue you are seeing, but haven't
managed to try to boot the NetBSD kernel provided in order to assert
whether it solves the issue or not (or even whether I'm able to
repro it). Would you mind giving the patch below a try?

With this, I get the same hang but XEN outputs don't wake up the interrupt
any more. The NetBSD counter shows only one interrupt for ioapic2 pin 2,
while I would have about 8 at the time of the hang.

So, now it looks like interrupts are blocked forever.

Which may be a good thing for debugging purposes, because now we have
a way to investigate what is actually blocking the interrupt's
delivery without having to worry about more output screwing the
overall picture.

At
http://www-soc.lip6.fr/~bouyer/xen-log5.txt
you'll find the output of the 'i' key.

(XEN)    IRQ:  34 vec:59 IO-APIC-level   status=010 aff:{0}/{0-7} in-flight=1 d0: 34(-MM)

(XEN)     IRQ 34 Vec 89:
(XEN)       Apic 0x02, Pin  2: vec=59 delivery=LoPri dest=L status=1 polarity=1 irr=1 trig=L mask=0 dest_id:00000001

Since it repeats in Manuel's latest dump, perhaps the odd combination
of status=1 and irr=1 is to tell us something? It is my understanding
that irr ought to become set only when delivery-status clears. Yet I
don't know what to take from this...

My reading of this is that one interrupt was accepted by the lapic
(irr=1) and that there's a further interrupt pending that hasn't yet
been accepted by the lapic (status=1) because it's still serving the
previous one. But that's all weird because there's no matching
vector in ISR, and hence the IRR bit on the IO-APIC has somehow become
stale or out of sync with the lapic state?
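
For reference while reading the dump, here is a minimal sketch of how the low 32 bits of an IO-APIC redirection table entry break down, following the field layout in the 82093AA datasheet. The struct, the function names and the 0xF959 test value are illustrative assumptions and are not taken from the Xen source; the value is simply what the fields printed above would encode, assuming vec=59 is hexadecimal (0x59 = 89, which matches "Vec 89").

```c
#include <stdint.h>

/* Low dword of an IO-APIC redirection table entry (82093AA layout).
 * Hypothetical struct for illustration, not Xen's own definition. */
struct rte_fields {
    unsigned vector;          /* bits 0-7 */
    unsigned delivery_mode;   /* bits 8-10, 1 = lowest priority */
    unsigned dest_mode;       /* bit 11, 1 = logical */
    unsigned delivery_status; /* bit 12, 1 = send pending */
    unsigned polarity;        /* bit 13, 1 = active low */
    unsigned remote_irr;      /* bit 14, set on lapic accept, cleared on EOI */
    unsigned trigger;         /* bit 15, 1 = level */
    unsigned mask;            /* bit 16, 1 = masked */
};

/* Split the raw low dword into its individual fields. */
static struct rte_fields rte_decode(uint32_t low)
{
    struct rte_fields f;

    f.vector          = low & 0xff;
    f.delivery_mode   = (low >> 8) & 0x7;
    f.dest_mode       = (low >> 11) & 0x1;
    f.delivery_status = (low >> 12) & 0x1;
    f.polarity        = (low >> 13) & 0x1;
    f.remote_irr      = (low >> 14) & 0x1;
    f.trigger         = (low >> 15) & 0x1;
    f.mask            = (low >> 16) & 0x1;
    return f;
}

/* Hypothetical check for the combination discussed above: a level-triggered
 * pin with a send still pending while remote IRR is already set. */
static int rte_pending_while_irr(const struct rte_fields *f)
{
    return f->trigger && f->delivery_status && f->remote_irr;
}
```

Decoding 0xF959 with rte_decode() yields vector 0x59, lowest-priority delivery, logical destination, and status, polarity, irr and trig all set with mask clear, i.e. the same state as Pin 2 in the dump.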

I'm also unsure how Xen has managed to reach this state; it
shouldn't be possible in the first place.

I don't think I can instrument the paths further with printfs because
it's likely to result in the behavior itself changing and console
spamming. I could however create a static buffer to trace relevant
actions and then dump them all together with the 'i' debug key output.

debugtrace is your friend here. It already has a debug key for printing
the buffer contents to console ('T').

As the buffer is wrap-around you can even add debug prints in the
related interrupt paths for finding out which paths have been called in
which order and on which cpu. Depending on the findings you might want
to use percpu buffers.
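
To illustrate the wrap-around idea in general terms, here is a minimal standalone sketch of such a trace buffer. This is not Xen's debugtrace implementation: the names trace_buf, trace_record and trace_dump and the slot sizes are assumptions made up for the example, and it does no locking, so in real interrupt paths it would need per-cpu instances or proper synchronisation.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define TRACE_SLOTS   8   /* illustrative size, assumed for this sketch */
#define TRACE_MSG_LEN 64

/* Hypothetical wrap-around trace buffer: the newest record overwrites
 * the oldest once all slots are in use. */
struct trace_buf {
    char msg[TRACE_SLOTS][TRACE_MSG_LEN];
    unsigned cpu[TRACE_SLOTS];
    unsigned long next;   /* total number of records ever written */
};

/* Store one formatted record, tagged with the cpu it came from. */
static void trace_record(struct trace_buf *tb, unsigned cpu,
                         const char *fmt, ...)
{
    unsigned slot = tb->next % TRACE_SLOTS;
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(tb->msg[slot], TRACE_MSG_LEN, fmt, ap);
    va_end(ap);
    tb->cpu[slot] = cpu;
    tb->next++;
}

/* Print the surviving records oldest first; returns how many were printed. */
static unsigned trace_dump(const struct trace_buf *tb, FILE *out)
{
    unsigned long count = tb->next < TRACE_SLOTS ? tb->next : TRACE_SLOTS;
    unsigned long i;

    for (i = tb->next - count; i < tb->next; i++) {
        unsigned slot = i % TRACE_SLOTS;

        fprintf(out, "%lu cpu%u: %s\n", i, tb->cpu[slot], tb->msg[slot]);
    }
    return (unsigned)count;
}
```

Because each record carries a sequence number and the cpu, the dump shows the order in which the paths were hit without printing anything at the time of the event.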


Juergen
