[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 2/2] x86/boot: attempt to print trace and panic on AP bring up stall
On Wed, May 21, 2025 at 07:16:18PM +0100, Andrew Cooper wrote: > On 21/05/2025 5:55 pm, Roger Pau Monne wrote: > > With the current AP bring up code Xen can get stuck indefinitely if an AP > > You want a comma between "code, Xen" to make the sentence easier to parse. > > > freezes during boot after the 'callin' step. Introduce a 10s timeout while > > waiting for APs to finish startup. > > 5s is the timeout used in other parts of AP bringup. I'd suggest being > consistent here. > > > > On failure of an AP to complete startup send an NMI to trigger the printing > > Again, a comma between "startup, send" would go a long way. > > > of a stack backtrace on the stuck AP and panic on the BSP. > > > > The sending of the NMI re-uses the code already present in fatal_trap(), by > > moving it to a separate function. > > I'd be tempted to split the patch in two. > > > > > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> > > --- > > It may be worth nothing that this came from the ICX143 investigation, > even if it wasn't relevant in the end? > > > diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c > > index 48ce996ba414..77dce3e3e22b 100644 > > --- a/xen/arch/x86/smpboot.c > > +++ b/xen/arch/x86/smpboot.c > > @@ -1388,10 +1389,17 @@ int __cpu_up(unsigned int cpu) > > time_latch_stamps(); > > > > set_cpu_state(CPU_STATE_ONLINE); > > + start = NOW(); > > while ( !cpu_online(cpu) ) > > { > > cpu_relax(); > > process_pending_softirqs(); > > + if ( NOW() > start + SECONDS(10) ) > > (NOW() - start) > SECONDS(10) > > It has one fewer boundary conditions, even if it is rather unlikely that > start + 10s will overflow. > > > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c > > index c94779b4ad4f..9b9e3726e2fb 100644 > > --- a/xen/arch/x86/traps.c > > +++ b/xen/arch/x86/traps.c > > @@ -734,6 +736,40 @@ static int cf_check nmi_show_execution_state( > > return 1; > > } > > > > +void show_execution_state_nmi(const cpumask_t *mask, bool show_all) > > +{ > > + unsigned int msecs, pending; > > + > > + force_show_all = show_all; > > + > > + watchdog_disable(); > > + console_start_sync(); > > + > > + cpumask_copy(&show_state_mask, mask); > > + set_nmi_callback(nmi_show_execution_state); > > + /* Ensure new callback is set before sending out the NMI. */ > > + smp_wmb(); > > I know this is only moving code, but this is wrong. So is the smp_mb() > in the x2apic drivers. > > It would only be correct in principle for xAPIC (which is an MMIO > store), except it's UC and is strongly ordered anyway. Furthermore, the > sequence point for the send_IPI_mask() prevents the compiler from > reordering this unsafely. > > The x2APIC drivers need LFENCE;MFENCE on Intel, and just MFENCE on AMD, > and this (critically) is not smp_mb(), which is now just a locked operation. > > I bet these aren't the only examples of incorrect barriers WRT IPIs. I > guess we should fix those separately. Thanks, I will remove the smp_wmb() ahead of moving the code, but other instances in the APIC drivers I will leave for a different series, I don't want to delay the work here on those fixes. Regards, Roger.
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |