
Re: [RFC] Xen crashes on ASSERT on suspend/resume, suggested fix


  • To: Stefano Stabellini <stefano.stabellini@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 22 May 2023 15:03:55 +0200
  • Cc: andrew.cooper3@xxxxxxxxxx, roger.pau@xxxxxxxxxx, xenia.ragiadakou@xxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 22 May 2023 13:04:11 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 19.05.2023 01:44, Stefano Stabellini wrote:
> Hi all,
> 
> After many PVH Dom0 suspend/resume cycles we are seeing the following
> Xen crash (it is random and doesn't reproduce reliably):
> 
> (XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
> (XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
> (XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
> (XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
> (XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
> (XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
> (XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
> (XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
> (XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> (XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
> (XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
> (XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
> (XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> (XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
> (XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
> (XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
> (XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> (XEN) [555.043067]
> (XEN) [555.469861]
> (XEN) [555.471855] ****************************************
> (XEN) [555.477315] Panic on CPU 9:
> (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
> (XEN) [555.489882] ****************************************
> 
> Looking at the code in question, the ASSERT looks wrong to me.
> 
> Specifically, looking at send_cleanup_vector and
> irq_move_cleanup_interrupt, it is entirely possible to have old_vector
> still valid and move_in_progress still set, but only some of the
> per_cpu(vector_irq, me)[vector] entries cleared. It seems to me that
> this is especially likely when an MSI has a large old_cpu_mask.
> 
> While the CPUs in old_cpu_mask clear their per_cpu(vector_irq,
> me)[vector] entries one by one, there is a window in which not all of
> the entries have been cleared yet, while old_vector is still set.
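> 
> (For illustration, each CPU that takes the cleanup IPI clears only its
> own slot; roughly, and simplified from what I understand
> irq_move_cleanup_interrupt to do:
> 
>     unsigned int me = smp_processor_id();
> 
>     /* Scan this CPU's own vector table for moved-away vectors. */
>     for ( vector = FIRST_DYNAMIC_VECTOR;
>           vector <= LAST_HIPRIORITY_VECTOR; vector++ )
>     {
>         int irq = per_cpu(vector_irq, me)[vector];
>         /* ... checks that this vector is an old, moved-from one ... */
>         per_cpu(vector_irq, me)[vector] = ~irq;
>     }
> 
> Slots on the other CPUs in old_cpu_mask keep holding irq until those
> CPUs take the IPI themselves.)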
> 
> If at this point we enter _clear_irq_vector, we are going to hit the
> ASSERT above.
> 
> My suggestion was to turn the ASSERT into an if. Any better ideas?
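> 
> In code terms, what I have in mind is roughly the following change to
> the cleanup loop containing the assertion in _clear_irq_vector() (a
> sketch only; "old_mask" stands for whatever mask of old CPUs the loop
> actually iterates over):
> 
>     for_each_cpu ( cpu, old_mask )
>     {
>         /*
>          * Tolerate slots that irq_move_cleanup_interrupt has already
>          * cleared, instead of asserting on them.
>          */
>         if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
>             per_cpu(vector_irq, cpu)[old_vector] = ~irq;
>     }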

While I'm fully willing to believe there is something that needs fixing,
we need to understand what's going wrong before applying any change.
Even more so when the suggestion is the simplest possible one,
converting an assertion to an if(). There's no explanation at all of
what happens in the "else" case: are we (silently) leaking vectors
then? Any other undue side effects?

What I'm particularly missing is any connection to suspend/resume. There
my primary suspect would be an issue with/in fixup_irqs().

What might be relevant to the investigation is what value is actually
found in the array slot. In particular, if it was already ~irq, that
may hint at where the issue is coming from and/or that the assertion
merely needs a little relaxing. Allowing for that value in an array
slot would, in particular, not raise any "what if not" questions (as
mentioned above), because that's already the value we store there
anyway.
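
In code terms such a relaxation might look like (a sketch only, and
only valid if ~irq is indeed what's found in the slot):

    /* Tolerate slots the cleanup IPI has already taken care of. */
    ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq ||
           per_cpu(vector_irq, cpu)[old_vector] == ~irq);
    per_cpu(vector_irq, cpu)[old_vector] = ~irq;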

Jan
