[Xen-devel] Large system boot problems
Here is some debug of the large memory / pmtimer issue.
(For background see [Xen-devel] Test results on Unisys ES7000 64x 256gb using unstable c/s 16693 on 3.2.0 Release Candidate from Jan 9, 2008.)

Symptom:
A system with lots of CPUs and memory can fail to boot up properly. Dom0 gets "time going backwards" errors and effectively hangs during initialization. Dom0's init fails because it uses bogus values for CPU0's speed, while the other CPUs have proper speed info.

Workarounds:
Increasing the memory retained by the hypervisor, with either a dom0_mem or a xenheap arg, will delay the start of dom0 enough (while that memory is scrubbed) that the HV CPU speed calculation will self-correct. Changing the timer used can also work (pit works for me), but basically it's a race and I expect that with the right hardware situation it could fail too.

Details:
With either pmtimer or pit the initial calculation done for CPU0 speed is bad (at least on a large system). If dom0 starts quickly enough that it reads the bad CPU speed data from the hypervisor shared area before the hypervisor corrects it, dom0 is in trouble.

Debug details:
When the Xen boot has sized memory, detected and booted all the CPUs, and gets to the point of

(XEN) ENABLING IO-APIC IRQs

init_percpu_time gets called for CPU0 and the cpu_time values recorded are:

(XEN) dump_cpu_time cpu0 addr ffff828c801ca520
(XEN) local_tsc_stamp 1691332805
(XEN) stime_master_stamp 0
(XEN) stime_local_stamp 0
(XEN) Platform timer overflows in 234 jiffies.
(XEN) Platform timer is 3.579MHz ACPI PM Timer

Then domain 0 is loaded, and local_time_calibration for CPU0 gets called and actually does something. The "out count" below indicates that it was called 315 times and, due to

    if ( ((s64)stime_elapsed64 < (EPOCH / 2)) )

effectively did nothing on those calls.
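For readers who do not have the source open, here is a minimal sketch of that early-out. It is not the hypervisor code: EPOCH_NS (one second in nanoseconds), struct cpu_time_stamps and calibration_skipped are hypothetical names chosen for illustration, and the only thing taken from the post is the comparison against half an epoch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: the calibration epoch is one second, in nanoseconds. */
#define EPOCH_NS INT64_C(1000000000)

/* Hypothetical snapshot of the stamps printed by dump_cpu_time. */
struct cpu_time_stamps {
    uint64_t local_tsc_stamp;
    int64_t  stime_local_stamp;
    int64_t  stime_master_stamp;
};

/*
 * Returns true when a calibration pass would take the early "out" path:
 * less than half an epoch of platform (master) time has elapsed since
 * the stamp recorded in prev.
 */
static bool calibration_skipped(const struct cpu_time_stamps *prev,
                                int64_t curr_master_stime)
{
    assert(prev != NULL);
    int64_t stime_elapsed = curr_master_stime - prev->stime_master_stamp;
    return stime_elapsed < (EPOCH_NS / 2);
}
```

Under these assumptions, with stime_master_stamp at 0 every call made before the platform timer reaches 0.5 s is skipped, so the boot-time local_tsc_stamp stays in place until the first call that gets past the check.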
The result of the calculations in local_time_calibration, with the huge difference in the TSC value, goes pretty badly wrong:

(XEN) local_time_calibration error factor cpu0 is 0x80000000. out count 315
(XEN) PRE0: tsc=1691332805 stime=0 master=0
(XEN) CUR0: tsc=33466953185 stime=9345455787 master=4641208868 -> -4704246919
(XEN) calibration_mul_frac 4ac8a18d tsc_shift -2

The bogus values here are then used by dom0 to incorrectly determine the frequency of CPU0, while all other CPUs have correct values:

Xen reported: 13692.820 MHz processor.

For the HV, this self-corrects, as the next time local_time_calibration gets called the data in cpu_time is properly set. But the damage has been done, and dom0 struggles to make progress and reports time going backwards, etc.

The reason that limiting the memory given to dom0 fixes the problem is that the loop that scrubs the memory that the HV is keeping (scrub_heap_pages) periodically calls process_pending_timers, and if there is enough memory there, then the correction happens before dom0 starts. This recalls a comment from a vendor a few months ago where they said you needed to add a xenheap arg to make large memory work.

When doing clocksource=pit a similar thing happens: the initial calc is bad, but it gets fixed before dom0 gets going (debug from PIT):

(XEN) dump_cpu_time cpu0 addr ffff828c801ca520
(XEN) local_tsc_stamp 226384274
(XEN) stime_master_stamp 0
(XEN) stime_local_stamp 0
(XEN) Platform timer overflows in 2 jiffies.
(XEN) Platform timer is 1.193MHz PIT

There are no "goto out"s taken; the next call to local_time_calibration does the bad calculation:

(XEN) Scrubbing Free RAM: .local_time_calibration error factor cpu0 is 0x80000000. out count 0
(XEN) PRE0: tsc=226384274 stime=0 master=0
(XEN) CUR0: tsc=35424564759 stime=10351900878 master=1052517641 -> -9299383237
(XEN) calibration_mul_frac 7a7b2a1a tsc_shift -5

The next call to local_time_calibration fixes it:
(XEN) calibration_mul_frac 969714d2 tsc_shift -1

and dom0 gets the right stuff:

Xen reported: 3399.956 MHz processor.

Looking for ideas or suggestions on how to solve this issue. Ideally we'd be able to prevent the bogus calculation in the first place.

Bill
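As a cross-check on the numbers above, the reported MHz figures can be reproduced from the debug output. This is a rough sketch under stated assumptions, not the hypervisor's or dom0's code. It assumes the scale pair is applied as ns = ((tsc << tsc_shift) * mul_frac) >> 32, with a negative tsc_shift meaning a right shift. It also assumes the scale is elapsed master time over elapsed TSC ticks, multiplied by the error factor read as a 32-bit binary fraction (0x80000000 = 0.5). The helper names mhz_from_scale and mhz_from_stamps are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode a (mul_frac, tsc_shift) pair into an apparent CPU frequency.
 * ASSUMPTION: ns = ((tsc << tsc_shift) * mul_frac) >> 32, with a negative
 * tsc_shift meaning a right shift.
 */
static double mhz_from_scale(uint32_t mul_frac, int tsc_shift)
{
    double ns_per_tick = (double)mul_frac / 4294967296.0; /* divide by 2^32 */
    for (; tsc_shift < 0; tsc_shift++)
        ns_per_tick /= 2.0;
    for (; tsc_shift > 0; tsc_shift--)
        ns_per_tick *= 2.0;
    return 1000.0 / ns_per_tick; /* ticks per microsecond == MHz */
}

/*
 * Rebuild the apparent frequency straight from the PRE/CUR stamps.
 * ASSUMPTION: the scale is elapsed master time over elapsed TSC ticks,
 * multiplied by the error factor (a 32-bit binary fraction).
 */
static double mhz_from_stamps(uint64_t prev_tsc, uint64_t curr_tsc,
                              int64_t prev_master_ns, int64_t curr_master_ns,
                              uint32_t error_factor)
{
    assert(curr_tsc > prev_tsc && curr_master_ns > prev_master_ns);
    double ns_per_tick = (double)(curr_master_ns - prev_master_ns)
                       / (double)(curr_tsc - prev_tsc);
    ns_per_tick *= (double)error_factor / 4294967296.0;
    return 1000.0 / ns_per_tick;
}
```

Under these assumptions, mhz_from_scale(0x4ac8a18d, -2) comes out at about 13692.82 MHz and mhz_from_scale(0x969714d2, -1) at about 3399.96 MHz, matching the two "Xen reported" lines. Feeding the pmtimer PRE0/CUR0 stamps and the 0x80000000 error factor to mhz_from_stamps gives the same bogus figure: roughly 6846 MHz from the stale TSC stamp against 4.64 s of master time, doubled by the 0.5 error factor.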