Re: NetBSD dom0 PVH: hardware interrupts stalls
On 23.11.2020 18:39, Manuel Bouyer wrote:
> On Mon, Nov 23, 2020 at 06:06:10PM +0100, Roger Pau Monné wrote:
>> OK, I'm afraid this is likely too verbose and messes with the timings.
>>
>> I've been looking (again) into the code, and I found something weird
>> that I think could be related to the issue you are seeing, but haven't
>> managed to try to boot the NetBSD kernel provided in order to assert
>> whether it solves the issue or not (or even whether I'm able to
>> repro it). Would you mind giving the patch below a try?
>
> With this, I get the same hang but XEN outputs don't wake up the interrupt
> any more. The NetBSD counter shows only one interrupt for ioapic2 pin 2,
> while I would have about 8 at the time of the hang.
>
> So, now it looks like interrupts are blocked forever.

Which may be a good thing for debugging purposes, because now we have
a way to investigate what is actually blocking the interrupt's delivery
without having to worry about more output screwing the overall picture.

> At
> http://www-soc.lip6.fr/~bouyer/xen-log5.txt
> you'll find the output of the 'i' key.

(XEN) IRQ: 34 vec:59 IO-APIC-level status=010 aff:{0}/{0-7} in-flight=1 d0: 34(-MM)

(XEN) IRQ 34 Vec 89:
(XEN) Apic 0x02, Pin 2: vec=59 delivery=LoPri dest=L status=1 polarity=1 irr=1 trig=L mask=0 dest_id:00000001

(XEN) ioapic 2 pin 2 gsi 34 vector 0x67
(XEN) delivery mode 0 dest mode 0 delivery status 0
(XEN) polarity 1 IRR 0 trig mode 1 mask 0 dest id 0

IOW from guest pov the interrupt is entirely idle (mask and irr clear),
while Xen sees it as both in-flight and irr also already having become
set again. I continue to suspect the EOI timer not doing its job. Yet
as said before, for it to have to do anything in the first place the
"guest" (really Dom0 here) would need to fail to EOI the IRQ within
the timeout period. Which in turn, given your description of how you
handle interrupts, cannot be excluded (i.e. the handling may simply
take "slightly" too long).

What we're missing is LAPIC information, since the masked status logged
is unclear: (-MM) isn't fully matching up with "mask=0". But of course
the former is just a software representation, while the latter is what
the RTE holds.

IOW for the interrupt to not get delivered, there needs to be this or a
higher ISR bit set (considering we don't use the TPR), or (I think we
can pretty much exclude this) we'd need to be running with IRQs off for
extended periods of time.

Jan
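
For reference, the blocking condition described above can be sketched as below. This is a simplified model of the local APIC priority rule as I understand it from the architecture manuals, not Xen code; `priority_class` and `is_deliverable` are hypothetical names made up for this illustration. The rule works on priority classes (the upper four bits of the vector), so an in-service vector in the same class blocks delivery as well, which is slightly broader than "this or a higher ISR bit". The vector value assumes "vec:59" in the dump is hexadecimal, which is consistent with "Vec 89" (decimal) a line later.

```python
from typing import Iterable


def priority_class(vector: int) -> int:
    """Upper four bits of an 8-bit vector, i.e. its priority class."""
    return (vector >> 4) & 0xF


def is_deliverable(vector: int, in_service: Iterable[int], tpr: int = 0) -> bool:
    """Simplified model: can the LAPIC dispatch `vector` to the CPU right now?

    Assumptions (not taken from Xen): the processor priority class is the
    larger of the TPR class and the class of the highest in-service vector,
    and a pending vector is dispatched only if its class is strictly higher.
    IRQs being disabled on the CPU is a separate condition, not modelled here.
    """
    highest_in_service = max(in_service, default=0)
    ppr_class = max(priority_class(tpr), priority_class(highest_in_service))
    return priority_class(vector) > ppr_class


# Vector Xen reports for IRQ 34 (assumed hex, see above).
XEN_VECTOR = 0x59

if __name__ == "__main__":
    for isr in ([], [0x59], [0x51], [0x41], [0xE0]):
        shown = [hex(v) for v in isr]
        print(f"ISR={shown}: deliverable={is_deliverable(XEN_VECTOR, isr)}")
```

With the TPR left at 0, as stated above, the only LAPIC-side reason for vector 0x59 not being dispatched in this model is an in-service vector of 0x50 or above, which is why a dump of the LAPIC ISR would be informative here.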