
Re: NetBSD dom0 PVH: hardware interrupts stalls


  • To: Jan Beulich <jbeulich@xxxxxxxx>, Manuel Bouyer <bouyer@xxxxxxxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 24 Nov 2020 15:59:27 +0100
  • Cc: <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Tue, 24 Nov 2020 15:00:43 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Nov 24, 2020 at 03:42:28PM +0100, Jan Beulich wrote:
> On 24.11.2020 11:05, Jan Beulich wrote:
> > On 23.11.2020 18:39, Manuel Bouyer wrote:
> >> On Mon, Nov 23, 2020 at 06:06:10PM +0100, Roger Pau Monné wrote:
> >>> OK, I'm afraid this is likely too verbose and messes with the timings.
> >>>
> >>> I've been looking (again) into the code, and I found something weird
> >>> that I think could be related to the issue you are seeing, but haven't
> >>> managed to try to boot the NetBSD kernel provided in order to assert
> >>> whether it solves the issue or not (or even whether I'm able to
> >>> repro it). Would you mind giving the patch below a try?
> >>
> >> With this, I get the same hang but XEN outputs don't wake up the interrupt
> >> any more. The NetBSD counter shows only one interrupt for ioapic2 pin 2,
> >> while I would have about 8 at the time of the hang.
> >>
> >> So, now it looks like interrupts are blocked forever.
> > 
> > Which may be a good thing for debugging purposes, because now we have
> > a way to investigate what is actually blocking the interrupt's
> > delivery without having to worry about more output screwing the
> > overall picture.
> > 
> >> At
> >> http://www-soc.lip6.fr/~bouyer/xen-log5.txt
> >> you'll find the output of the 'i' key.
> > 
> > (XEN)    IRQ:  34 vec:59 IO-APIC-level   status=010 aff:{0}/{0-7} 
> > in-flight=1 d0: 34(-MM)
> > 
> > (XEN)     IRQ 34 Vec 89:
> > (XEN)       Apic 0x02, Pin  2: vec=59 delivery=LoPri dest=L status=1 
> > polarity=1 irr=1 trig=L mask=0 dest_id:00000001
> 
> Since it repeats in Manuel's latest dump, perhaps the odd combination
> of status=1 and irr=1 is to tell us something? It is my understanding
> that irr ought to become set only when delivery-status clears. Yet I
> don't know what to take from this...

My reading of this is that one interrupt was accepted by the lapic
(irr=1) and that there's a further interrupt pending that hasn't yet
been accepted by the lapic (status=1) because it's still servicing the
previous one. But that's all weird, because there's no matching
vector in ISR, so the IRR bit on the IO-APIC has presumably become
stale or out of sync with the lapic state.
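
For readers following the dump, the fields printed by the 'i' key
correspond to the low 32 bits of an IO-APIC redirection table entry.
The following is only a minimal sketch of decoding them, assuming the
standard 82093AA bit layout; the struct and function names are
hypothetical and are not Xen's own. The vec= value in the dump is
hexadecimal, so vec=59 is the same vector as "Vec 89".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded view of the low 32 bits of an IO-APIC RTE. */
struct rte_fields {
    unsigned int vector;          /* bits 0-7 */
    unsigned int delivery_mode;   /* bits 8-10, 1 = lowest priority */
    unsigned int dest_mode;       /* bit 11, 1 = logical */
    unsigned int delivery_status; /* bit 12, 1 = send pending */
    unsigned int polarity;        /* bit 13, 1 = active low */
    unsigned int remote_irr;      /* bit 14, meaningful for level only */
    unsigned int trigger;         /* bit 15, 1 = level */
    unsigned int mask;            /* bit 16, 1 = masked */
};

/* Split the low RTE word into its individual fields. */
static struct rte_fields rte_decode(uint32_t lo)
{
    struct rte_fields f;

    f.vector          = lo & 0xff;
    f.delivery_mode   = (lo >> 8) & 0x7;
    f.dest_mode       = (lo >> 11) & 0x1;
    f.delivery_status = (lo >> 12) & 0x1;
    f.polarity        = (lo >> 13) & 0x1;
    f.remote_irr      = (lo >> 14) & 0x1;
    f.trigger         = (lo >> 15) & 0x1;
    f.mask            = (lo >> 16) & 0x1;
    return f;
}

/*
 * The combination discussed in this thread: a level-triggered pin
 * with a send still pending while remote IRR is also set.
 */
static int rte_pending_with_irr(const struct rte_fields *f)
{
    assert(f != NULL);
    return f->trigger && f->delivery_status && f->remote_irr;
}
```

Feeding it a word built from the quoted line (vec=59 delivery=LoPri
dest=L status=1 polarity=1 irr=1 trig=L mask=0) flags the entry as
being in that odd state.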

I'm also unsure how Xen has managed to reach this state; it
shouldn't be possible in the first place.

I don't think I can instrument the paths further with printfs, because
that's likely to change the behavior itself and spam the console. I
could however create a static buffer to trace relevant actions and
then dump them all together with the 'i' debug key output.
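
As an illustration of the static trace buffer idea, here is a minimal
sketch in plain C. It is not the actual instrumentation: the event
names, record layout, buffer size, and the trace_irq()/trace_dump()
helpers are all assumptions, and it takes no locks, so in the
hypervisor it would need per-CPU buffers or atomic index updates.
Recording is a few stores with no console output, which is the point
of not disturbing the timings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TRACE_ENTRIES 256u /* fixed size, oldest records are overwritten */

/* Hypothetical set of actions worth recording. */
enum trace_event { TRACE_INJECT, TRACE_MASK, TRACE_UNMASK, TRACE_EOI };

struct trace_rec {
    uint64_t seq;    /* global sequence number of this record */
    uint16_t irq;
    uint8_t  vector;
    uint8_t  event;
};

static struct trace_rec trace_buf[TRACE_ENTRIES];
static uint64_t trace_next; /* sequence number of the next record */

/* Record one action; cheap enough to call from interrupt paths. */
static void trace_irq(enum trace_event ev, unsigned int irq,
                      unsigned int vector)
{
    struct trace_rec *r = &trace_buf[trace_next % TRACE_ENTRIES];

    r->seq = trace_next++;
    r->irq = (uint16_t)irq;
    r->vector = (uint8_t)vector;
    r->event = (uint8_t)ev;
}

/* Print the retained records oldest first; returns how many were printed. */
static unsigned int trace_dump(FILE *out)
{
    uint64_t start = trace_next > TRACE_ENTRIES
                     ? trace_next - TRACE_ENTRIES : 0;
    unsigned int n = 0;

    assert(out != NULL);
    for (uint64_t i = start; i < trace_next; i++, n++) {
        const struct trace_rec *r = &trace_buf[i % TRACE_ENTRIES];

        fprintf(out, "%llu: ev=%u irq=%u vec=%02x\n",
                (unsigned long long)r->seq, (unsigned int)r->event,
                (unsigned int)r->irq, (unsigned int)r->vector);
    }
    return n;
}
```

A debug key handler would then call trace_dump() once, after the hang,
so that all the collected records come out in a single burst.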

Sorry Manuel, you seem to have hit some kind of weird bug regarding
interrupt management. If you want to progress further with NetBSD PVH
dom0 it's likely to work on a different box, but I would ask if you
can keep the current box in order for us to continue debugging.

Roger.
