[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC] Xen crashes on ASSERT on suspend/resume, suggested fix


  • To: Stefano Stabellini <stefano.stabellini@xxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 23 May 2023 16:50:06 +0200
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, jbeulich@xxxxxxxx, andrew.cooper3@xxxxxxxxxx, xenia.ragiadakou@xxxxxxx
  • Delivery-date: Tue, 23 May 2023 14:50:39 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, May 23, 2023 at 03:54:36PM +0200, Roger Pau Monné wrote:
> On Thu, May 18, 2023 at 04:44:53PM -0700, Stefano Stabellini wrote:
> > Hi all,
> > 
> > After many PVH Dom0 suspend/resume cycles we are seeing the following
> > Xen crash (it is random and doesn't reproduce reliably):
> > 
> > (XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
> > (XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
> > (XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
> > (XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
> > (XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
> > (XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
> > (XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
> > (XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
> > (XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> > (XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
> > (XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
> > (XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
> > (XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> > (XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
> > (XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
> > (XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
> > (XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> > (XEN) [555.043067]
> > (XEN) [555.469861]
> > (XEN) [555.471855] ****************************************
> > (XEN) [555.477315] Panic on CPU 9:
> > (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
> > (XEN) [555.489882] ****************************************
> > 
> > Looking at the code in question, the ASSERT looks wrong to me.
> > 
> > Specifically, if you look at send_cleanup_vector() and
> > irq_move_cleanup_interrupt(), it is entirely possible to have old_vector
> > still valid and move_in_progress still set, but with only some of the
> > per_cpu(vector_irq, me)[vector] entries cleared. It seems to me that
> > this could happen especially when an MSI has a large old_cpu_mask.
> 
> I guess the only way to get into such a situation would be if you happen
> to execute _clear_irq_vector() with a cpu_online_map smaller than the
> one in old_cpu_mask, at which point you will leave some old_vector
> fields not updated.
> 
> Maybe somehow you get into this situation when doing suspend/resume?
> 
> Could you try adding:
> 
> ASSERT(cpumask_equal(tmp_mask, desc->arch.old_cpu_mask));
> 
> before the `for_each_cpu(cpu, tmp_mask)` loop?

I see that the old_cpu_mask is cleared in release_old_vec(), so that
suggestion is not very useful.

Does the crash happen at specific points, for example just after
resume or before suspend?

Roger.



 

