Re: Problems with APIC on versions 4.9 and later (4.8 works)
On 28.01.2021 14:08, Claudemir Todo Bom wrote:
> On Thu, 28 Jan 2021 at 06:49, Jan Beulich <jbeulich@xxxxxxxx>
> wrote:
>>
>> On 28.01.2021 10:47, Jan Beulich wrote:
>>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
>>>> If this information is good for more tests, please send the patch and
>>>> I will test it!
>>>
>>> Here you go. For simplifying analysis it may be helpful if you
>>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
>>> at least "smt=0". Provided the problem still reproduces with
>>> such options, of course.
>>
>> Speaking of command line options - it doesn't look like you have
>> told us what else you have on the Xen command line, and without
>> a serial log this isn't visible (e.g. in your video).
>
> All tests are done with xen command line:
>
> dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
> smt=false vga=text-80x50,keep
>
> and kernel command line:
>
> loglevel=0 earlyprintk=xen nomodeset
>
> this way I can get all xen messages on console.
>
> Attached are the frames I captured from a video, I manually selected
> them starting from the first readable frame.

Okay, so we seem to be hitting two previously noticed issues, neither
of which so far was necessary to address directly (because there was
always something else to be tweaked such that the problems went away).

For one, the boot CPU has a TSC value that lags by more than a second
compared to all secondary CPUs. The way time_calibration_tsc_rendezvous()
works, together with the way we calculate system time from the TSC
(get_s_time_fixed() - this is where the known issue here is: the
function breaks when trying to scale a negative delta, hence the
absurdly high s= values in the screenshots you've provided), allows
for small negative deltas between CPUs, but expects to bring all CPUs'
TSCs forward (i.e. over the 1s interval between rendezvous' the lagging
CPUs are assumed to have made enough progress to be ahead of the more
towards the future timestamps on the previous run). Secondary lagging
behind the boot CPU more than this could also be dealt with, but on
your system the situation is the other way around.

Btw - what kind of BIOS do you have on this system? This way of the
TSCs being set is pretty odd, and must be - unless you run other
pre-boot software or an unusual boot loader - caused by the BIOS.

And then this points out (again, afaic at least) that the way we
kickstart the rendezvous handling is likely inappropriate. Especially
when TSCs are skewed like they are here, it is unhelpful to launch
Dom0 before having brought the TSC in sync. (Related to this, I also
don't think we should arm the respective timer before AP bringup was
done, or else we risk the first rendezvous instance to not hit all
CPUs.)

I'll work on addressing both, hoping that in particular for the first
one you'll be ready to help with testing (through perhaps multiple
iterations).

> As a sidenote, I managed to get the system working with the parameter
> "tsc=unstable", performance looks satisfactory but I do not know what
> problems I may end with this parameter.

I _think_ you'd be running into trouble if you removed dom0_vcpus_pin
(which imo really no-one should use without reporting a bug, despite
all the hits to the contrary that one gets when searching the web),
and if you ran any guests (PV at least) without pinning their vCPU-s
to pCPU-s.

Jan
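
To illustrate why scaling a negative TSC delta can produce absurdly
high system time values, here is a minimal sketch in C. It is not
Xen's actual get_s_time_fixed() code: the function names, the 32.32
fixed-point scale factor and the parameter layout are all assumptions
made for the example. It only shows the general effect of treating
"tsc - stamp_tsc" as unsigned when the reading is behind the stamp,
next to one possible sign-aware variant.

```c
#include <stdint.h>

/*
 * Hypothetical fixed-point scaling helper (an assumption, not Xen's):
 * returns floor(delta * mul_frac / 2^32), computed in two halves so
 * the intermediate products fit in 64 bits.
 */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac)
{
    uint64_t hi = (delta >> 32) * mul_frac;
    uint64_t lo = ((delta & 0xffffffffu) * mul_frac) >> 32;

    return hi + lo;
}

/*
 * Naive variant: the delta is always treated as unsigned. If tsc is
 * behind stamp_tsc, the subtraction wraps to a value close to 2^64
 * and the scaled result is enormous.
 */
static uint64_t s_time_naive(uint64_t tsc, uint64_t stamp_tsc,
                             uint64_t stamp_ns, uint32_t mul_frac)
{
    return stamp_ns + scale_delta(tsc - stamp_tsc, mul_frac);
}

/*
 * Sign-aware variant: scale the magnitude of the delta and then add
 * or subtract it, so a small lag gives a time slightly before the
 * stamp.
 */
static uint64_t s_time_signed(uint64_t tsc, uint64_t stamp_tsc,
                              uint64_t stamp_ns, uint32_t mul_frac)
{
    if (tsc >= stamp_tsc)
        return stamp_ns + scale_delta(tsc - stamp_tsc, mul_frac);

    return stamp_ns - scale_delta(stamp_tsc - tsc, mul_frac);
}
```

With mul_frac = 0x80000000 (0.5 ns per tick, i.e. an assumed 2 GHz
TSC), a stamp of 5 s at TSC 1000 and a reading of TSC 900, the naive
variant returns a value near 2^63 ns, while the sign-aware variant
returns 50 ns before the stamp.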