
Re: Problems with APIC on versions 4.9 and later (4.8 works)



On Mon, Jan 25, 2021 at 06:38, Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 23.01.2021 00:36, Claudemir Todo Bom wrote:
> > On Wed, Jan 20, 2021 at 12:13, Jürgen Groß <jgross@xxxxxxxx> wrote:
> >>
> >> On 20.01.21 09:50, Jan Beulich wrote:
> >>> On 19.01.2021 20:36, Claudemir Todo Bom wrote:
> >>>> I do not have serial output on this setup, so I recorded a video with
> >>>> boot_delay=50 in order to be able to get all the kernel messages:
> >>>> https://youtu.be/y95h6vqoF7Y
> >>>
> >>> This doesn't show any badness afaics.
> >>>
> >>>> This is running 4.14 from debian bullseye (testing).
> >>>>
> >>>> I'm also attaching the dmesg output when booting xen 4.8 with  the same
> >>>> kernel version and same parameters.
> >>>>
> >>>> I visually compared all the messages, and the only thing I noticed was
> >>>> that 4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the
> >>>> kernel with "clocksource=xen" and the problem is happening with that also.
> >>>
> >>> There's some confusion here I suppose: The clock source you talk
> >>> about is the kernel's, not Xen's. I didn't think this would
> >>> change for the same kernel version with different Xen underneath,
> >>> but the Linux maintainers of the Xen code there may know better.
> >>> Cc-ing them.
> >>
> >> This might depend on CPUID bits given to dom0 by Xen, e.g. regarding
> >> TSC stability.
> >>
> >
> > Looks like the CPUID changes I observed and wrote about in the other
> > messages are a separate problem I may end up facing. I narrowed down the
> > cause of the problem with booting dom0 with more than 1 core to the
> > following commit:
> >
> > https://github.com/xen-project/xen/commit/63e1d01b8fd948b3e0fa3beea494e407668aa43b
> >
> > Code built from this commit doesn't boot; code built from its parent boots.
>
> Odd.
>
> > Now, is there something I can do on the command line to make it boot?
> > Or does it need to be fixed in the code?
>
> That's too early to ask. We first need to understand what's going
> on. There are two things I'd like you to try: One is to use
> "clocksource=tsc" on the Xen (not the kernel) command line, and
> the other (without that option) is to try the debugging patch
> below. Of course that patch is only going to be useful when you
> can somehow record Xen's log messages up to the point where the
> system hangs. (Both ideally on as new a Xen as you can arrange
> for.)
>
> Jan
>
> --- unstable.orig/xen/arch/x86/time.c
> +++ unstable/xen/arch/x86/time.c
> @@ -1799,9 +1799,11 @@ static void time_calibration(void *unuse
>      cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);
>
>      /* @wait=1 because we must wait for all cpus before freeing @r. */
> +printk("TSC: %ps\n", time_calibration_rendezvous_fn);//temp
>      on_selected_cpus(&r.cpu_calibration_map,
>                       time_calibration_rendezvous_fn,
>                       &r, 1);
> +printk("TSC: end rendezvous\n");//temp
>  }
>
>  static struct cpu_time_stamp ap_bringup_ref;
> @@ -2043,6 +2045,7 @@ static int __init verify_tsc_reliability
>       * While with constant-rate TSCs the scale factor can be shared, when TSCs
>       * are not marked as 'reliable', re-sync during rendezvous.
>       */
> +printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>           !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
>          time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
> @@ -2056,6 +2059,7 @@ int __init init_xen_time(void)
>  {
>      tsc_check_writability();
>
> +printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
>      open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
>
>      /* NB. get_wallclock_time() can take over one second to execute. */
>

I've managed to get the debug messages on the screen using
vga=text-80x50,keep and disabling all messages from the kernel. Two
images are attached with the output running the debug patch.

About the version I used for testing: since 4.14 shows that other
bug with the detection of CPU features, which I mentioned in the other
subthread, I chose to work on 4.11, which doesn't show that behaviour.

Booting with the clocksource option on the Xen command line changed nothing.

I don't know if this part of the code is meant to execute many
times, but when starting with dom0_max_vcpus=1, the system boots up
and keeps showing the messages.

I've also tested with commit 63e1d01 reverted, and with the revert the
system boots up. I've not tested it with any virtual machine yet.
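
To help read the "TSC: c=... r=..." lines printed by the debug patch quoted
above, here is a minimal standalone sketch of the condition shown in the
verify_tsc_reliability() hunk. The enum and the functions select_rendezvous(),
rendezvous_name() and print_selection() are hypothetical names made up for
this sketch and are not Xen code; only the condition itself and the name
time_calibration_tsc_rendezvous are taken from the patch context. What the
function pointer holds when the condition is false is not visible in the
patch, so the sketch only calls it "default".

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names for this sketch; they mirror the quoted hunk, not Xen's API. */
enum rendezvous_kind {
    RENDEZVOUS_DEFAULT,    /* time_calibration_rendezvous_fn is left unchanged */
    RENDEZVOUS_TSC_RESYNC  /* time_calibration_tsc_rendezvous is selected */
};

/*
 * Same condition as in the verify_tsc_reliability() hunk:
 * a constant-rate TSC that is not marked reliable gets re-synced
 * during the calibration rendezvous.
 */
static enum rendezvous_kind select_rendezvous(bool constant_tsc, bool tsc_reliable)
{
    if (constant_tsc && !tsc_reliable)
        return RENDEZVOUS_TSC_RESYNC;
    return RENDEZVOUS_DEFAULT;
}

/* Human-readable label for the selected rendezvous variant. */
static const char *rendezvous_name(enum rendezvous_kind kind)
{
    return kind == RENDEZVOUS_TSC_RESYNC ? "time_calibration_tsc_rendezvous"
                                         : "default (left unchanged)";
}

/* Interpret one "TSC: c=%d r=%d" line from the debug output. */
static void print_selection(int c, int r)
{
    printf("c=%d r=%d -> %s\n", c, r,
           rendezvous_name(select_rendezvous(c != 0, r != 0)));
}
```

Calling print_selection(1, 0) from a main() reports that
time_calibration_tsc_rendezvous would be selected; every other combination of
c and r leaves the default in place.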

Best regards,
Claudemir

Attachment: IMG_20210125_161303.jpg
Description: JPEG image

Attachment: IMG_20210125_162313.jpg
Description: JPEG image


 

