[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 2/2] x86/boot: attempt to print trace and panic on AP bring up stall



On Wed, May 21, 2025 at 07:16:18PM +0100, Andrew Cooper wrote:
> On 21/05/2025 5:55 pm, Roger Pau Monne wrote:
> > With the current AP bring up code Xen can get stuck indefinitely if an AP
> 
> You want a comma between "code, Xen" to make the sentence easier to parse.
> 
> > freezes during boot after the 'callin' step.  Introduce a 10s timeout while
> > waiting for APs to finish startup.
> 
> 5s is the timeout used in other parts of AP bringup.  I'd suggest being
> consistent here.
> 
> 
> > On failure of an AP to complete startup send an NMI to trigger the printing
> 
> Again, a comma between "startup, send" would go a long way.
> 
> > of a stack backtrace on the stuck AP and panic on the BSP.
> >
> > The sending of the NMI re-uses the code already present in fatal_trap(), by
> > moving it to a separate function.
> 
> I'd be tempted to split the patch in two.
> 
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > ---
> 
> It may be worth nothing that this came from the ICX143 investigation,
> even if it wasn't relevant in the end?
> 
> > diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> > index 48ce996ba414..77dce3e3e22b 100644
> > --- a/xen/arch/x86/smpboot.c
> > +++ b/xen/arch/x86/smpboot.c
> > @@ -1388,10 +1389,17 @@ int __cpu_up(unsigned int cpu)
> >      time_latch_stamps();
> >  
> >      set_cpu_state(CPU_STATE_ONLINE);
> > +    start = NOW();
> >      while ( !cpu_online(cpu) )
> >      {
> >          cpu_relax();
> >          process_pending_softirqs();
> > +        if ( NOW() > start + SECONDS(10) )
> 
> (NOW() - start) > SECONDS(10)
> 
> It has one fewer boundary conditions, even if it is rather unlikely that
> start + 10s will overflow.
> 
> > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> > index c94779b4ad4f..9b9e3726e2fb 100644
> > --- a/xen/arch/x86/traps.c
> > +++ b/xen/arch/x86/traps.c
> > @@ -734,6 +736,40 @@ static int cf_check nmi_show_execution_state(
> >      return 1;
> >  }
> >  
> > +void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
> > +{
> > +    unsigned int msecs, pending;
> > +
> > +    force_show_all = show_all;
> > +
> > +    watchdog_disable();
> > +    console_start_sync();
> > +
> > +    cpumask_copy(&show_state_mask, mask);
> > +    set_nmi_callback(nmi_show_execution_state);
> > +    /* Ensure new callback is set before sending out the NMI. */
> > +    smp_wmb();
> 
> I know this is only moving code, but this is wrong.  So is the smp_mb()
> in the x2apic drivers.
> 
> It would only be correct in principle for xAPIC (which is an MMIO
> store), except it's UC and is strongly ordered anyway.  Furthermore, the
> sequence point for the send_IPI_mask() prevents the compiler from
> reordering this unsafely.
> 
> The x2APIC drivers need LFENCE;MFENCE on Intel, and just MFENCE on AMD,
> and this (critically) is not smp_mb(), which is now just a locked operation.
> 
> I bet these aren't the only examples of incorrect barriers WRT IPIs.  I
> guess we should fix those separately.

Thanks, I will remove the smp_wmb() ahead of moving the code, but
other instances in the APIC drivers I will leave for a different
series, I don't want to delay the work here on those fixes.

Regards, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.