Re: Problems with APIC on versions 4.9 and later (4.8 works)
On Mon, 25 Jan 2021 at 06:38, Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 23.01.2021 00:36, Claudemir Todo Bom wrote:
> > On Wed, 20 Jan 2021 at 12:13, Jürgen Groß <jgross@xxxxxxxx> wrote:
> >>
> >> On 20.01.21 09:50, Jan Beulich wrote:
> >>> On 19.01.2021 20:36, Claudemir Todo Bom wrote:
> >>>> I do not have serial output on this setup, so I recorded a video with
> >>>> boot_delay=50 in order to be able to get all the kernel messages:
> >>>> https://youtu.be/y95h6vqoF7Y
> >>>
> >>> This doesn't show any badness afaics.
> >>>
> >>>> This is running 4.14 from debian bullseye (testing).
> >>>>
> >>>> I'm also attaching the dmesg output when booting xen 4.8 with the same
> >>>> kernel version and same parameters.
> >>>>
> >>>> I visually compared all the messages, and the only thing I noticed was
> >>>> that 4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the
> >>>> kernel with "clocksource=xen" and the problem is happening with that also.
> >>>
> >>> There's some confusion here I suppose: The clock source you talk
> >>> about is the kernel's, not Xen's. I didn't think this would
> >>> change for the same kernel version with different Xen underneath,
> >>> but the Linux maintainers of the Xen code there may know better.
> >>> Cc-ing them.
> >>
> >> This might depend on CPUID bits given to dom0 by Xen, e.g. regarding
> >> TSC stability.
> >>
> >
> > Looks like the CPUID changes I observed and wrote on the other
> > messages are another problem I may end up with. I narrowed down the
> > cause of the problem on booting of dom0 with more than 1 core on the
> > following commit:
> >
> > https://github.com/xen-project/xen/commit/63e1d01b8fd948b3e0fa3beea494e407668aa43b
> >
> > Code built from this commit doesn't boot, built from the parent of it,
> > boots.
>
> Odd.
>
> > Now, there is something I can do on the command line to make it boots?
> > Or its needed to fix on the code?
>
> That's too early to ask. We first need to understand what's going
> on. There are two things I'd like you to try: One is to use
> "clocksource=tsc" on the Xen (not the kernel) command line, and
> the other (without that option) is to try the debugging patch
> below. Of course that patch is only going to be useful when you
> can somehow record Xen's log messages up to the point where the
> system hangs. (Both ideally on as new a Xen as you can arrange
> for.)
>
> Jan
>
> --- unstable.orig/xen/arch/x86/time.c
> +++ unstable/xen/arch/x86/time.c
> @@ -1799,9 +1799,11 @@ static void time_calibration(void *unuse
>      cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);
>
>      /* @wait=1 because we must wait for all cpus before freeing @r. */
> +printk("TSC: %ps\n", time_calibration_rendezvous_fn);//temp
>      on_selected_cpus(&r.cpu_calibration_map,
>                       time_calibration_rendezvous_fn,
>                       &r, 1);
> +printk("TSC: end rendezvous\n");//temp
>  }
>
>  static struct cpu_time_stamp ap_bringup_ref;
> @@ -2043,6 +2045,7 @@ static int __init verify_tsc_reliability
>       * While with constant-rate TSCs the scale factor can be shared, when TSCs
>       * are not marked as 'reliable', re-sync during rendezvous.
>       */
> +printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>           !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
>          time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
> @@ -2056,6 +2059,7 @@ int __init init_xen_time(void)
>  {
>      tsc_check_writability();
>
> +printk("TSC: c=%d r=%d\n", !!boot_cpu_has(X86_FEATURE_CONSTANT_TSC), !!boot_cpu_has(X86_FEATURE_TSC_RELIABLE));//temp
>      open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
>
>      /* NB. get_wallclock_time() can take over one second to execute. */
>

I managed to get the debug messages on the screen by using
vga=text-80x50,keep and disabling all messages from the kernel. Two
images showing the output with the debug patch applied are attached.

About the version I used for testing: since 4.14 shows that other bug
with the detection of CPU features that I mentioned in the other
subthread, I chose to work on 4.11, which doesn't show that behaviour.

Passing the clocksource option on the Xen command line changed nothing.

I don't know whether this part of the code is meant to execute many
times, but when starting with dom0_max_vcpus=1 the system boots up and
keeps showing the messages.

I also tried reverting the code from commit 63e1d01, and with that the
system boots up. I have not checked with any virtual machine yet.

Best regards,
Claudemir

Attachment: IMG_20210125_161303.jpg
Attachment: IMG_20210125_162313.jpg
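
As a reading aid for the "c=" and "r=" values that the quoted debug patch prints, here is a minimal standalone sketch in C of the condition from the verify_tsc_reliability() hunk. It is not Xen code: the enum and the helpers pick_rendezvous(), rendezvous_name() and format_debug_line() are hypothetical names made up for illustration, and the sketch assumes that the standard rendezvous stays selected whenever the condition in the hunk is false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the two rendezvous variants; not Xen's code. */
enum rendezvous_kind { RENDEZVOUS_STD, RENDEZVOUS_TSC };

/*
 * Mirrors the condition in the quoted verify_tsc_reliability() hunk:
 * the TSC re-syncing rendezvous is chosen only when the TSC is
 * constant-rate (c=1) but not marked reliable (r=0).
 */
static enum rendezvous_kind pick_rendezvous(bool constant_tsc, bool tsc_reliable)
{
    if (constant_tsc && !tsc_reliable)
        return RENDEZVOUS_TSC;
    return RENDEZVOUS_STD; /* assumption: the default is left in place otherwise */
}

/* Illustrative labels only; the real patch prints the symbol via %ps. */
static const char *rendezvous_name(enum rendezvous_kind kind)
{
    return kind == RENDEZVOUS_TSC ? "tsc_rendezvous" : "std_rendezvous";
}

/* Formats a line in the same "c=%d r=%d" shape as the debug printk. */
static int format_debug_line(char *buf, size_t len, bool c, bool r)
{
    return snprintf(buf, len, "TSC: c=%d r=%d -> %s",
                    c, r, rendezvous_name(pick_rendezvous(c, r)));
}
```

Under these assumptions, only the combination c=1 r=0 selects the TSC re-syncing variant; every other combination keeps the standard one.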