Re: [Xen-devel] Need help in debugging partially blocked hypervisor
> Very detailed explanation indeed. What you described is the same as I saw
> months ago.
> But unluckily, I do not know the root cause yet. It seems to me that
> unmasking of PMI in local APIC will immediately generate a new NMI in the
> system if one of the enabled counter is zero at that time.
> That is why I was asking you whether you could try to set that counter to
> some value other than zero (for example, 0x1) before unmasking(in your case,
> it is Fixed Counter 1 0x30a) PMI in vpmu_do_interrupt and see whether it
> helped.

OK I will try to set the counter after reading the 0 value to 1.
But some things remain fully unclear ...
Dietmar.

>
> When I met this problem, I remember that I tried two approaches:
> 1> Setting the counter to non-zero before unmasking PMI in vpmu_do_interrupt;
> 2> Remove unmasking PMI from vpmu_do_interrupt and unmask *physical PMI* when
> guest vcpu unmasks virtual PMI.
> I remember that approach 2 can fix this issue. But I do not remember the
> result of approach 1, since I met this about one year ago.
> It is my understanding that approach 2 is quite same as approach 1, since
> normally guest will set the counter to some negative value (for example,
> -100000) before unmasking virtual PMI.
> However, approach 2 looks cleaner and more reasonable.
>
> Can you have a try and let me know the result? If both can not work, there
> might be some problems that I have not met before.
>
> BTW: Sorry, I did not see your patch to enable NHM vpmu before. So, there is
> no need for me to work on that now. :)
>
> Haitao
>
>
> Dietmar Hahn wrote:
> > Hi Haitao,
> >
> >> Can I know how you enabled vPMU on Nehalem? This is not supported in
> >> current Xen.
> >
> > http://lists.xensource.com/archives/html/xen-devel/2009-09/msg00829.html
> >
> >>
> >> Concerning vpmu support, I totally agree that we can disable this
> >> feature by default. If anyone really wants to use it, he can use boot
> >> options to turn it on.
> >
> > Yes, that's OK for me.
> >
> >> I am preparing a patch for that. And I will
> >> send a patch to enable NHM vpmu together.
> >>
> >> For the problem that Dietmar met, I think I once met this before. Can
> >> you add some code in vpmu_do_interrupt that sets the counter you are
> >> using to a value other than zero? Please let me know if that can
> >> help.
> >
> > I don't set the counter to zero. I use 0-val to set the counter.
> > Actually I tested on Nehalem with
> > - General Perf-counter #2 (0xc3) with CPU_CLK_UNHALTED and val=1100000
> > - Fixed counter #1 (0x30a) and val=1100000
> > The thing is that in the normal case the overflows of both counters appear
> > nearly at the same time.
> > As described I added some extra tracers for xentrace in
> > core2_vpmu_do_interrupt() so the code looks like:
> >
> >     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, msr_content);      -> 1. Step
> >     {
> >         uint32_t HAHN_l, HAHN_h;
> >         HAHN_l = (uint32_t) msr_content;
> >         HAHN_h = (uint32_t) (msr_content >> 32);
> >         HVMTRACE_3D(HAHN_TR2, v, 1, HAHN_h, HAHN_l);       -> 2. Step
> >     }
> >     if ( !msr_content )
> >         return 0;
> >     core2_vpmu_cxt->global_ovf_status |= msr_content;
> >     msr_content = 0xC000000700000000 | ((1 << core2_get_pmc_count()) - 1);
> >     wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, msr_content);    -> 3. Step
> >
> >     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, msr_content);      -> 4. Step
> >     {
> >         uint32_t HAHN_l, HAHN_h;
> >         HAHN_l = (uint32_t) msr_content;
> >         HAHN_h = (uint32_t) (msr_content >> 32);
> >         HVMTRACE_3D(HAHN_TR2, v, 0xa, HAHN_h, HAHN_l);     -> 5. Step
> >
> >         rdmsrl(0xc3, msr_content);                         -> 6. Step, General counter #2
> >         HAHN_l = (uint32_t) msr_content;
> >         HAHN_h = (uint32_t) (msr_content >> 32);
> >         HVMTRACE_3D(HAHN_TR2, v, 0xc3, HAHN_h, HAHN_l);
> >         rdmsrl(0x30a, msr_content);                        -> 7. Step, Fixed counter #1
> >         HAHN_l = (uint32_t) msr_content;
> >         HAHN_h = (uint32_t) (msr_content >> 32);
> >         HVMTRACE_3D(HAHN_TR2, v, 0x30a, HAHN_h, HAHN_l);
> >     }
> >
> > With these tracers I got the following output:
> >
> > Last good NMI:
> > Both counters cause the NMI. Resetting works OK.
> > The counters themselves kept running.
> > 2. Step: par1 = 0x01,  high = 0x0002, low = 0x0004 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
> > 5. Step: par1 = 0x0a,  high = 0x0000, low = 0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
> > 6. Step: par1 = 0xc3,  high = 0x0000, low = 0x03c4 ] rdmsrl(0xc3)  -> #2 general counter
> > 7. Step: par1 = 0x30a, high = 0x0000, low = 0x02da ] rdmsrl(0x30a) -> #1 fixed counter
> >
> > NMI from where things go wrong:
> > Both counters cause the NMI. Resetting does NOT work correctly, only for the
> > general counter!
> > The general counter (caused the NMI) seems to be stopped!
> > 2. Step: par1 = 0x01,  high = 0x0002, low = 0x0004 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
> > 5. Step: par1 = 0x0a,  high = 0x0002, low = 0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
> > 6. Step: par1 = 0xc3,  high = 0x0000, low = 0x00ec ] rdmsrl(0xc3)  -> #2 general counter
> > 7. Step: par1 = 0x30a, high = 0x0000, low = 0x0000 ] rdmsrl(0x30a) -> #1 fixed counter
> >
> > Wrong NMI:
> > Only the fixed counter causes the NMI (which was not reset during
> > NMI handling above!) Both counters seem to be stopped!
> > 2. Step: par1 = 0x01,  high = 0x0002, low = 0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
> > 5. Step: par1 = 0x0a,  high = 0x0002, low = 0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
> > 6. Step: par1 = 0xc3,  high = 0x0000, low = 0x00ec ] rdmsrl(0xc3)  -> #2 general counter
> > 7. Step: par1 = 0x30a, high = 0x0000, low = 0x0000 ] rdmsrl(0x30a) -> #1 fixed counter
> >
> > And this state remains forever!
> > I hope my explanations are understandable ;-)
> >
> > Until now I can see this behavior only on a Nehalem processor.
> >
> > Thanks.
> > Dietmar
> >
> >>
> >> Best Regards
> >> Shan Haitao
> >>
> >> 2009/10/30 Keir Fraser <keir.fraser@xxxxxxxxxxxxx>:
> >>> On 30/10/2009 12:20, "Dietmar Hahn" <dietmar.hahn@xxxxxxxxxxxxxx>
> >>> wrote:
> >>>
> >>>> I searched the intel processor spec but couldn't find any help.
> >>>> So my question is, what is wrong here?
> >>>> Can anybody with more knowledge point me in the right direction,
> >>>> what can I still do to find the real cause of this?
> >>>
> >>> You should probably Cc one of the Intel guys who implemented this
> >>> stuff -- I've added Haitao Shan.
> >>>
> >>> Meanwhile I'd be interested to know whether things work okay for
> >>> you, minus performance counters and the hypervisor hang, if you
> >>> return immediately from vpmu_initialise(). Really at minimum we
> >>> need such a fix, perhaps with a boot parameter to re-enable the
> >>> feature, for 3.4.2 release; allowing guests to hose the hypervisor
> >>> like this is of course not on.
> >>>
> >>> -- Keir
>

--
Dietmar Hahn
TSP ES&S SWE OS                  Telephone: +49 (0) 89 636 40274
Fujitsu Technology Solutions     Email: dietmar.hahn@xxxxxxxxxxxxxx
Otto-Hahn-Ring 6                 Internet: http://ts.fujitsu.com
D-81739 München                  Company details: ts.fujitsu.com/imprint.html

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel