
Re: Problems with APIC on versions 4.9 and later (4.8 works)



On 20.01.21 09:50, Jan Beulich wrote:
On 19.01.2021 20:36, Claudemir Todo Bom wrote:
I do not have serial output on this setup, so I recorded a video with
boot_delay=50 in order to be able to get all the kernel messages:
https://youtu.be/y95h6vqoF7Y

This doesn't show any badness afaics.

This is running 4.14 from debian bullseye (testing).

I'm also attaching the dmesg output when booting xen 4.8 with the same
kernel version and same parameters.

I visually compared all the messages, and the only thing I noticed was that
4.14 used tsc as clocksource and 4.8 used xen. I tried to boot the kernel
with "clocksource=xen" and the problem is happening with that also.

There's some confusion here I suppose: The clock source you talk
about is the kernel's, not Xen's. I didn't think this would
change for the same kernel version with different Xen underneath,
but the Linux maintainers of the Xen code there may know better.
Cc-ing them.

This might depend on CPUID bits given to dom0 by Xen, e.g. regarding
TSC stability.
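
For comparing the two boots, the kernel's clocksource choice can be read from sysfs on the setup that does come up (the 4.8 boot here). The following is only a rough sketch: the function name show_clocksource is made up for illustration, and the optional directory argument exists only so the function can be pointed at a copy of the files.

```shell
# Print the kernel's active and available clocksources.
# show_clocksource is a hypothetical helper name; the default path is the
# standard sysfs location for the first clocksource device.
show_clocksource() {
    dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    for f in current_clocksource available_clocksource; do
        if [ -r "$dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s: not readable\n' "$f"
        fi
    done
}
```

Calling show_clocksource without arguments in dom0 prints both values. If "xen" is missing from the available list under one Xen version, that would point in the direction of the CPUID/TSC differences mentioned above.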


The "start" of the problem is that when the kernel gets to the "Freeing
unused kernel image (initmem) memory: 2380K" it hangs and stays there for a
while. After a few minutes it shows that a process (swapper) is blocked for
some time (image attached).

Now that's pretty unusual - the call trace seen in the screen
shot you had attached indicates the kernel didn't even make it
past its own initialization just yet. Just to have explored that
possibility - could you enable Xen's NMI watchdog (simply
"watchdog" on the Xen command line)? Among the boot messages
there ought to be one indicating whether it actually works on
your system. Without a serial console you wouldn't see anything
if it triggers, but the system would then never make it to the
kernel side issue.

As far as making sure we at least see all kernel messages -
do you have "ignore_loglevel" in place? I don't think I've
been able to spot the kernel command line anywhere in the video.

I'm afraid there's no real way around seeing the full Xen
messages, i.e. including possible ones while Dom0 already boots
(and allowing some debug keys to be issued, as the rcu_barrier
on the stack may suggest there's an issue with one of the
secondary CPUs). You could try whether "vga=keep" on the Xen
command line allows you to see more, but this option may have
quite severe an effect on the timing of Dom0's booting, which
may make an already bad situation worse.

Alternatively the kernel may need instrumenting to figure out what
exactly it is that prevents forward progress.

There's one other wild guess you may want to try: "cpuidle=no"
on the Xen command line.

Other wild guesses are:

- add "sched=credit" to the Xen command line

or

- add "xen.fifo_events=0" to the dom0 command line
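
In case it helps, here is roughly how the options from this thread could be set on a Debian system. This is only a sketch and assumes the boot entries are generated by GRUB 2's 20_linux_xen script from /etc/default/grub; the variable names are the ones that script reads, and the values are just the options mentioned above. It is probably best to change one thing per boot so the effect can be attributed.

```shell
# Sketch of an /etc/default/grub fragment; try one change per boot.
# Assumption: GRUB 2's 20_linux_xen script generates the Xen boot entries.

# Options for the Xen hypervisor itself.
GRUB_CMDLINE_XEN_DEFAULT="watchdog"
# Alternatives to try on later boots:
# GRUB_CMDLINE_XEN_DEFAULT="cpuidle=no"
# GRUB_CMDLINE_XEN_DEFAULT="sched=credit"

# Options for the dom0 kernel when booted under Xen. This replaces
# GRUB_CMDLINE_LINUX_DEFAULT for the Xen entries, so repeat anything
# that is still needed (boot_delay=50 is taken from the original report).
GRUB_CMDLINE_LINUX_XEN_REPLACE_DEFAULT="boot_delay=50 ignore_loglevel"
# Alternative to try on a later boot:
# GRUB_CMDLINE_LINUX_XEN_REPLACE_DEFAULT="boot_delay=50 ignore_loglevel xen.fifo_events=0"
```

After editing, run update-grub and reboot. On a boot that comes up, /proc/cmdline shows what the dom0 kernel actually received, and the xen_commandline line of "xl info" shows what Xen received.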


Juergen



 

