Xen project Mailing List

Re: [PATCH for-4.20 1/2] x86/shutdown: quiesce devices ahead of AP shutdown

To: Roger Pau Monne <roger.pau@xxxxxxxxxx>

Date: Wed, 29 Jan 2025 11:13:09 +0100

Autocrypt: addr=jbeulich@xxxxxxxx

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Wed, 29 Jan 2025 10:13:27 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 28.01.2025 17:27, Roger Pau Monne wrote:
> The current shutdown logic in smp_send_stop() will first disable the APs,
> and then attempt to disable (some) of the interrupt sources.
>
> There are two issues with this approach; the first one being that MSI
> interrupt sources are not disabled, the second one is the APs are stopped
> before interrupts are disabled. On AMD systems this can lead to the
> triggering of local APIC errors:
>
> APIC error on CPU0: 00(08), Receive accept error
>
> Such error message can be printed in a loop, thus blocking the system from
> rebooting. I assume this loop is created by the error being triggered by
> the console interrupt, which is further triggered by the ESR reporting
> write to the console.
>
> Intel SDM states:
>
> "Receive Accept Error.
>
> Set when the local APIC detects that the message it received was not
> accepted by any APIC on the APIC bus, including itself. Used only on P6
> family and Pentium processors."
>
> So the error shouldn't trigger on any Intel CPU supported by Xen.
>
> However AMD doesn't make such claims, and indeed the error is broadcasted
> to all local APIC when for example an interrupt targets a CPU that's
> offline.
>
> To prevent the error from triggering, move the masking of IO-APIC pins
> ahead of stopping the APs. Also introduce a new function that disables
> MSI and MSI-X on all PCI devices. Remove the call to fixup_irqs() since
> there's no point in attempting to move interrupts: all sources will be
> either masked or disabled.
>
> For the NMI crash path also call the newly introduced function, with the
> hope that disabling MSI and MSI-X will make it easier for the (possible)
> crash kernel to boot, as it could otherwise receive the same "Receive
> accept error" upon re-enabling interrupts.
>
> Note that this will have the side-effect of preventing further IOMMU
> interrupts from being delivered, that's expected and at that point in the
> shutdown process no further interaction with the IOMMU should be relevant.

This is at most for AMD only. Shouldn't we similarly disable VT-d's
interrupt(s)? (It's only one right now, as we still don't use the QI
completion one.) Even for AMD I'm uncertain: It has separate
hw_irq_controller instances, and its set_iommu_interrupt_handler() is
custom as well. Will pci_disable_msi_all() really affect it? (Hmm, yes,
from amd_iommu_msi_enable() it looks like it will.)

> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -1248,6 +1248,20 @@ void pci_cleanup_msi(struct pci_dev *pdev)
>      msi_free_irqs(pdev);
>  }
>
> +static int cf_check disable_msi(struct pci_dev *pdev, void *arg)
> +{
> +    msi_set_enable(pdev, 0);
> +    msix_set_enable(pdev, 0);
> +
> +    return 0;
> +}
> +
> +void pci_disable_msi_all(void)
> +{
> +    /* Disable MSI and/or MSI-X on all devices. */
> +    pci_iterate_devices(disable_msi, NULL);
> +}

That's going to be all devices we know of. I.e. best effort only. Maybe
the comment should be adjusted to this effect.

> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -358,14 +358,15 @@ void smp_send_stop(void)
>  {
>      unsigned int cpu = smp_processor_id();
>
> +    local_irq_disable();
> +    disable_IO_APIC();
> +    pci_disable_msi_all();
> +    local_irq_enable();
> +
>      if ( num_online_cpus() > 1 )
>      {
>          int timeout = 10;
>
> -        local_irq_disable();
> -        fixup_irqs(cpumask_of(cpu), 0);
> -        local_irq_enable();
> -
>          smp_call_function(stop_this_cpu, NULL, 0);
>
>          /* Wait 10ms for all other CPUs to go offline. */
> @@ -376,7 +377,6 @@ void smp_send_stop(void)
>      if ( cpu_online(cpu) )
>      {
>          local_irq_disable();
> -        disable_IO_APIC();
>          hpet_disable();

Like IOMMUs, HPET also has custom interrupt management. I think this call
needs pulling up, too (much like it is also there in nmi_shootdown_cpus()).
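
To make the suggested ordering concrete, here is a rough, standalone
sketch. It is not Xen code: every stub_*() function, record(),
sketch_send_stop() and sketch_trace() are hypothetical stand-ins that only
log the order of calls, and the position of the HPET step relative to the
other two quiescing steps is an assumption (the reply only asks for it to
move ahead of stopping the APs).

```c
#include <assert.h>
#include <string.h>

/* Records the order in which the stubs below run. */
static char trace[64];

static void record(const char *step)
{
    /* Room for the step name, the ';' separator and the terminator. */
    assert(strlen(trace) + strlen(step) + 2 <= sizeof(trace));
    strcat(trace, step);
    strcat(trace, ";");
}

/* Hypothetical stubs standing in for the Xen functions of similar name. */
static void stub_disable_IO_APIC(void)     { record("ioapic"); }
static void stub_hpet_disable(void)        { record("hpet"); }
static void stub_pci_disable_msi_all(void) { record("msi"); }
static void stub_stop_aps(void)            { record("stop_aps"); }

void sketch_send_stop(unsigned int nr_online)
{
    trace[0] = '\0';

    /* In Xen this part would run with local interrupts disabled. */
    stub_disable_IO_APIC();
    stub_hpet_disable();          /* pulled up ahead of stopping the APs */
    stub_pci_disable_msi_all();

    /* Only after all interrupt sources are quiesced are the APs stopped. */
    if ( nr_online > 1 )
        stub_stop_aps();
}

const char *sketch_trace(void)
{
    return trace;
}
```

Calling sketch_send_stop() with more than one online CPU logs the three
quiescing steps first and the AP stop last; with a single CPU the AP stop
is skipped altogether.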

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1803,6 +1803,38 @@ int iommu_do_pci_domctl(
>      return ret;
>  }
>
> +struct segment_iter {
> +    int (*handler)(struct pci_dev *pdev, void *arg);
> +    void *arg;
> +};
> +
> +static int cf_check iterate_all(struct pci_seg *pseg, void *arg)
> +{
> +    const struct segment_iter *iter = arg;
> +    struct pci_dev *pdev;
> +
> +    list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
> +    {
> +        int rc = iter->handler(pdev, iter->arg);
> +
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    return 0;
> +}
> +
> +int pci_iterate_devices(int (*handler)(struct pci_dev *pdev, void *arg),
> +                        void *arg)
> +{
> +    struct segment_iter iter = {
> +        .handler = handler,
> +        .arg = arg,
> +    };
> +
> +    return pci_segments_iterate(iterate_all, &iter);
> +}

For the specific purpose during shutdown it may be okay to do all of this
without locking (but see below) and without preemption checks. Yet then a
warning will want putting here to indicate that from other environments
this isn't okay to use as-is. This use then also requires that
msi{,x}_set_enable() paths never gain lock-related assertions.

Talking of the lack of locking: Since you invoke the disabling before
bringing down APs, we're ending up in kind of a chicken and egg problem
here: Without APs quiesced, there may be operations in progress there
which conflict with the disabling done here. Hence why so far we brought
down APs first.

With this special-purpose use I further wonder whether iterate_all()
wouldn't better continue despite an error coming back from a callback
(and also arrange for pci_segments_iterate() to continue, by merely
recording any possible error in struct segment_iter), and only accumulate
the error code to eventually return. The more devices we manage to
quiesce, the better our chances of rebooting cleanly.

Jan
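
The "continue despite an error" idea from the last paragraph could look
roughly like the standalone sketch below. It is a toy model, not the Xen
implementation: struct toy_dev, struct toy_seg, toy_segments_iterate(),
toy_iterate_devices() and toy_quiesce() are hypothetical names, arrays
replace Xen's device lists, and the segment walker is assumed to stop at
the first non-zero return. Keeping the first error (rather than the last)
is also just one possible choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Xen's struct pci_dev and struct pci_seg. */
struct toy_dev { int id; };
struct toy_seg { struct toy_dev *devs; size_t nr; };

struct segment_iter {
    int (*handler)(struct toy_dev *dev, void *arg);
    void *arg;
    int rc; /* First error reported by any callback, 0 if none. */
};

/* Visit every device of one segment; an error never ends the walk. */
static int iterate_all(struct toy_seg *seg, void *arg)
{
    struct segment_iter *iter = arg;
    size_t i;

    for ( i = 0; i < seg->nr; i++ )
    {
        int rc = iter->handler(&seg->devs[i], iter->arg);

        if ( rc && !iter->rc )
            iter->rc = rc;
    }

    /* Always 0, so the segment walk below carries on as well. */
    return 0;
}

/* Assumed behaviour of the segment walker: stop at the first non-zero rc. */
static int toy_segments_iterate(struct toy_seg *segs, size_t nr,
                                int (*fn)(struct toy_seg *seg, void *arg),
                                void *arg)
{
    size_t i;
    int rc = 0;

    for ( i = 0; !rc && i < nr; i++ )
        rc = fn(&segs[i], arg);

    return rc;
}

int toy_iterate_devices(struct toy_seg *segs, size_t nr,
                        int (*handler)(struct toy_dev *dev, void *arg),
                        void *arg)
{
    struct segment_iter iter = { .handler = handler, .arg = arg, .rc = 0 };

    assert(handler != NULL);
    toy_segments_iterate(segs, nr, iterate_all, &iter);

    /* The accumulated error is reported only once everything was visited. */
    return iter.rc;
}

/* Demo callback: counts visits in *arg and fails for odd device ids. */
int toy_quiesce(struct toy_dev *dev, void *arg)
{
    ++*(int *)arg;

    return (dev->id & 1) ? -dev->id : 0;
}
```

Since iterate_all() always returns 0, a failing callback on one device no
longer prevents later devices, or later segments, from being visited; the
caller still learns that something went wrong from the returned code.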
