Re: Problems with APIC on versions 4.9 and later (4.8 works)
On Fri, 29 Jan 2021 at 11:21, Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 28.01.2021 14:08, Claudemir Todo Bom wrote:
> > On Thu, 28 Jan 2021 at 06:49, Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >>
> >> On 28.01.2021 10:47, Jan Beulich wrote:
> >>> On 26.01.2021 14:03, Claudemir Todo Bom wrote:
> >>>> If this information is good for more tests, please send the patch and
> >>>> I will test it!
> >>>
> >>> Here you go. For simplifying analysis it may be helpful if you
> >>> could limit the number of CPUs in use, e.g. by "maxcpus=4" or
> >>> at least "smt=0". Provided the problem still reproduces with
> >>> such options, of course.
> >>
> >> Speaking of command line options - it doesn't look like you have
> >> told us what else you have on the Xen command line, and without
> >> a serial log this isn't visible (e.g. in your video).
> >
> > All tests are done with xen command line:
> >
> > dom0_mem=1024M,max:2048M dom0_max_vcpus=4 dom0_vcpus_pin=true
> > smt=false vga=text-80x50,keep
> >
> > and kernel command line:
> >
> > loglevel=0 earlyprintk=xen nomodeset
> >
> > this way I can get all xen messages on console.
> >
> > Attached are the frames I captured from a video, I manually selected
> > them starting from the first readable frame.
>
> Okay, so we seem to be hitting two previously noticed issues, neither
> of which so far was necessary to address directly (because there was
> always something else to be tweaked such that the problems went away).
>
> For one, the boot CPU has a TSC value that lags by more than a
> second compared to all secondary CPUs. The way
> time_calibration_tsc_rendezvous() works, together with the way we
> calculate system time from the TSC (get_s_time_fixed() - this is
> where the known issue here is: the function breaks when trying to
> scale a negative delta, hence the absurdly high s= values in the
> screenshots you've provided), allows for small negative deltas
> between CPUs, but expects to bring all CPUs' TSCs forward (i.e. over
> the 1s interval between rendezvous' the lagging CPUs are assumed to
> have made enough progress to be ahead of the more towards the future
> timestamps on the previous run). Secondary lagging behind the boot
> CPU more than this could also be dealt with, but on your system the
> situation is the other way around.
>
> Btw - what kind of BIOS do you have on this system? This way of the
> TSCs being set is pretty odd, and must be - unless you run other
> pre-boot software or an unusual boot loader - caused by the BIOS.

It is a generic mainboard acquired from China... it is very lame! I
was already thinking the big issue is the BIOS. Unfortunately I don't
know how to upgrade it.

> And then this points out (again, afaic at least) that the way we
> kickstart the rendezvous handling is likely inappropriate.
> Especially when TSCs are skewed like they are here, it is unhelpful
> to launch Dom0 before having brought the TSC in sync. (Related to
> this, I also don't think we should arm the respective timer before
> AP bringup was done, or else we risk the first rendezvous instance
> to not hit all CPUs.)
>
> I'll work on addressing both, hoping that in particular for the
> first one you'll be ready to help with testing (through perhaps
> multiple iterations).

I can help you a little more until the end of next week. After that I
will move the host to another address and I will not have quick
"hands on" access to it.

> > As a sidenote, I managed to get the system working with the parameter
> > "tsc=unstable", performance looks satisfactory but I do not know what
> > problems I may end with this parameter.
>
> I _think_ you'd be running into trouble if you removed dom0_vcpus_pin
> (which imo really no-one should use without reporting a bug, despite
> all the hits to the contrary that one gets when searching the web),
> and if you ran any guests (PV at least) without pinning their vCPU-s
> to pCPU-s.

I just tested it without the CPU pin, and it worked. I stress-tested
both dom0 and a PV guest with the "yes method" described here:
https://linuxconfig.org/how-to-stress-test-your-cpu-on-linux

Best regards,
Claudemir
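
For readers following the negative-delta point above, here is a minimal
sketch of why scaling a negative TSC delta can yield an absurdly large
time value. It is a simplified model, not Xen's actual code: the function
names, the 32.32 fixed-point layout, and the 2 GHz TSC frequency are all
assumptions made for illustration. The idea is only that a delta which is
really negative, once treated as an unsigned 64-bit quantity, wraps to a
huge positive number before it is scaled.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic


def scale_delta_unsigned(delta_ticks, mul_frac, shift=0):
    """Scale a tick delta to ns, treating the delta as unsigned 64-bit.

    Hypothetical model: ns = ((delta << shift) * mul_frac) >> 32.
    """
    delta = delta_ticks & MASK64  # a negative delta wraps to a huge value
    if shift >= 0:
        delta = (delta << shift) & MASK64
    else:
        delta >>= -shift
    return ((delta * mul_frac) >> 32) & MASK64


def scale_delta_signed(delta_ticks, mul_frac, shift=0):
    """Same scaling, but scale the magnitude and reapply the sign."""
    if delta_ticks < 0:
        return -scale_delta_unsigned(-delta_ticks, mul_frac, shift)
    return scale_delta_unsigned(delta_ticks, mul_frac, shift)


# Assumed 2 GHz TSC: 0.5 ns per tick, i.e. mul_frac = 0.5 * 2**32.
MUL_FRAC_2GHZ = 1 << 31
ONE_SECOND_TICKS = 2_000_000_000

forward = scale_delta_unsigned(ONE_SECOND_TICKS, MUL_FRAC_2GHZ)
lagging = scale_delta_unsigned(-ONE_SECOND_TICKS, MUL_FRAC_2GHZ)
fixed = scale_delta_signed(-ONE_SECOND_TICKS, MUL_FRAC_2GHZ)

print("1 s ahead  :", forward, "ns")
print("1 s behind :", lagging, "ns (wrapped)")
print("sign-aware :", fixed, "ns")
```

In this model a CPU whose TSC is one second behind the reference gets a
timestamp close to 2**63 ns instead of minus one second, which is the same
kind of failure as the s= values in the screenshots. Handling the sign
separately, as in scale_delta_signed(), is just one possible way to
illustrate a fix and is not necessarily what an actual patch would do.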