
RE: [Xen-devel] Need help in debugging partially blocked hypervisor



Hi, Dietmar,

Please review the attached patch. Any comments?

Haitao


Dietmar Hahn wrote:
>> I suspect the guest will reproduce this PMI loop if it behaves as
>> you describe in this email. But as far as I know, VTune and oprofile
>> do not behave like that.
>> Of course, this approach is still a workaround (unless I get
>> confirmation that the HW requires it). It is preferable
>> because it does not change the contents of the MSRs. Thus, we have no
>> impact on guest software that relies on reading the correct value
>> from HW. Approach 1 existed only because we knew that in event-based
>> sampling the counter value on receiving a PMI was not used by
>> OProfile/VTune at all, so it was safe to set the counter to some
>> non-zero value.
>> 
>> Haitao
>> 
> 
> OK, then will you send a patch?
> Dietmar.
> 
>> 
>> Dietmar Hahn wrote:
>>> Please see below.
>>> 
>>>> See my comments embedded. :)
>>>> 
>>>> Haitao
>>>> 
>>>> 
>>>> Dietmar Hahn wrote:
>>>>> The conclusion is that this seems to be a workaround for the
>>>>> endless NMI loop. PMIs are very rare events, so this should
>>>>> not cause a performance problem.
>>>> I totally agree that this is only a workaround for approach 1.
>>>> 
>>>>> 
>>>>> I didn't try your second approach
>>>>>> 2> Remove unmasking PMI from vpmu_do_interrupt and unmask
>>>>>> *physical PMI* when guest vcpu unmasks virtual PMI.
>>>>> but I have some questions.
>>>>> 
>>>>> - What if the 'physical PMI' is not unmasked in vpmu_do_interrupt
>>>>>   and a watchdog NMI would occur before the domU unmasks it?
>>>> I think the second NMI will be lost.
>>>> 
>>>>> - Is it possible that after handling the NMI (and not unmasking)
>>>>>   another domU got running on this CPU and therefore PMIs got
>>>>>   lost?
>>>> The LVTPC entry in the physical local APIC is saved/restored by Xen
>>>> on VCPU switches. So unmasking (or not) the PMI of one vcpu should
>>>> have no impact on another vcpu. When developing vPMU, I treated both
>>>> the PMU MSRs and the LVTPC entry in the local APIC as vPMU context.
>>>> The vPMU context is saved/restored on physical HW when vcpus are
>>>> scheduled, either actively or lazily (depending on the PMU
>>>> usage at the time of the switch).
>>>> 
>>>>> 
>>>>> But the real cause of the problem is unknown. As said I saw this
>>>>> only on Nehalem. Maybe there is a problem together with the
>>>>> hardware? Perhaps your hardware colleagues know something more ;-)
>>>> When I found this problem, I thought it might be a corner case
>>>> that only happens on my box (of course, I only see this on NHM,
>>>> too). I will ping the HW guys to see if there is any explanation,
>>>> since it has proven to be a general problem on NHM.
>>>> 
>>>> But before everything is clear, I think approach 2 is a better
>>>> solution now.
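To make approach 2 concrete, here is a toy C model of the masking behaviour discussed above. It is only a sketch under stated assumptions: the struct and every toy_* name are hypothetical and are not Xen's real code. The only hardware facts it relies on are that bit 16 is the mask bit of a local APIC LVT entry and that the local APIC sets that bit in LVTPC when it delivers a PMI.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_LVT_MASKED (1u << 16)   /* mask bit of a local APIC LVT entry */

/* Hypothetical, simplified per-vcpu state; NOT Xen's actual structures. */
struct toy_vpmu {
    uint32_t phys_lvtpc;     /* value held in the physical LVTPC register */
    uint32_t virt_lvtpc;     /* value the guest programmed in its virtual APIC */
    unsigned pending_vpmi;   /* virtual PMIs queued for the guest */
};

/* Hardware delivers a PMI: the local APIC masks the LVTPC entry. */
static void toy_hw_raise_pmi(struct toy_vpmu *v)
{
    v->phys_lvtpc |= TOY_LVT_MASKED;
}

/* Hypervisor PMI handler under approach 2: queue the virtual PMI,
 * but leave the physical LVTPC masked. */
static void toy_vpmu_do_interrupt(struct toy_vpmu *v)
{
    v->pending_vpmi++;
}

/* Guest writes its virtual LVTPC: the physical mask bit follows it. */
static void toy_guest_write_lvtpc(struct toy_vpmu *v, uint32_t val)
{
    v->virt_lvtpc = val;
    if (val & TOY_LVT_MASKED)
        v->phys_lvtpc |= TOY_LVT_MASKED;
    else
        v->phys_lvtpc &= ~TOY_LVT_MASKED;
}

static bool toy_phys_pmi_masked(const struct toy_vpmu *v)
{
    return (v->phys_lvtpc & TOY_LVT_MASKED) != 0;
}
```

In this model no further physical PMI can arrive between the hypervisor handler and the guest's own unmask, which is the window in which the endless loop was observed. Whether the guest has reprogrammed the counter by the time it unmasks is still up to the guest.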
>>> 
>>> What would be the effect if the guest unmasks the PMI (which leads
>>> to unmasking the 'physical PMI') but doesn't reset the counter to a
>>> value != 0? Is the guest able to produce the nmi endless loop?
>>> 
>>> Dietmar.
>>> 
>>>> 
>>>>> 
>>>>> Thanks
>>>>> Dietmar
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> When I met this problem, I remember that I tried two approaches:
>>>>>>> 1> Setting the counter to non-zero before unmasking PMI in
>>>>>>>    vpmu_do_interrupt;
>>>>>>> 2> Remove unmasking PMI from vpmu_do_interrupt and unmask
>>>>>>>    *physical PMI* when guest vcpu unmasks virtual PMI.
>>>>>>> I remember that approach 2 fixed this issue. But I do not
>>>>>>> remember the result of approach 1, since I met this about one
>>>>>>> year ago. It is my understanding that approach 2 is much the same
>>>>>>> as approach 1, since normally the guest will set the counter to
>>>>>>> some negative value (for example, -100000) before unmasking the
>>>>>>> virtual PMI. However, approach 2 looks cleaner and more reasonable.
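The "negative value" remark can be illustrated with a little arithmetic. The sketch below assumes a 48-bit counter (TOY_COUNTER_WIDTH is an assumption, as are both helper names, which are hypothetical); it shows that loading the counter with -period makes it overflow, and hence raise a PMI, after exactly `period` events.

```c
#include <stdint.h>

#define TOY_COUNTER_WIDTH 48   /* assumed counter width in bits */

/* Value to load so that the counter overflows after `period` events. */
static uint64_t toy_sampling_start_value(uint64_t period)
{
    const uint64_t mask = (UINT64_C(1) << TOY_COUNTER_WIDTH) - 1;
    return (0 - period) & mask;   /* two's-complement negation, truncated */
}

/* Number of events until a counter holding `counter` wraps to zero. */
static uint64_t toy_events_until_overflow(uint64_t counter)
{
    const uint64_t mask = (UINT64_C(1) << TOY_COUNTER_WIDTH) - 1;
    return (mask - (counter & mask)) + 1;
}
```

For example, toy_sampling_start_value(100000) yields a value from which toy_events_until_overflow() reports 100000 events. A counter left at or near zero is the case the thread is worried about: it sits right after an overflow, and nothing in the handler has re-armed it.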
>>>>>>> 
>>>>>>> Can you give it a try and let me know the result? If neither
>>>>>>> works, there might be some problem that I have not met before.
>>>>>>> 
>>>>>>> BTW: Sorry, I did not see your patch to enable NHM vpmu before.
>>>>>>> So, there is no need for me to work on that now. :)
>>>>>>> 
>>>>>>> Haitao
>>>>>>> 
>>>>>>> 
>>>>>>> Dietmar Hahn wrote:
>>>>>>>> Hi Haitao,
>>>>>>>> 
>>>>>>>>> Can I know how you enabled vPMU on Nehalem? This is not
>>>>>>>>> supported in current Xen.
>>>>>>>> 
>>>>>>>> http://lists.xensource.com/archives/html/xen-devel/2009-09/msg00829.html
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Concerning vpmu support, I totally agree that we can disable
>>>>>>>>> this feature by default. If anyone really wants to use it, he
>>>>>>>>> can use boot options to turn it on.
>>>>>>>> 
>>>>>>>> Yes, that's OK for me.
>>>>>>>> 
>>>>>>>>> I am preparing a patch for that. And I will
>>>>>>>>> send a patch to enable NHM vpmu together.
>>>>>>>>> 
>>>>>>>>> For the problem that Dietmar met, I think I once met this
>>>>>>>>> before. Can you add some code in vpmu_do_interrupt that sets
>>>>>>>>> the counter you are using to a value other than zero? Please
>>>>>>>>> let me know if that can help.
>>>>>>>> 
>>>>>>>> I don't set the counter to zero. I use 0-val to set the
>>>>>>>> counter. Actually I tested on Nehalem with
>>>>>>>> - General Perf-counter #2 (0xc3) with CPU_CLK_UNHALTED and
>>>>>>>>   val=1100000
>>>>>>>> - Fixed counter #1 (0x30a) and val=1100000
>>>>>>>> The thing is that in the normal case the overflows of both
>>>>>>>> counters appear at nearly the same time. As described, I added
>>>>>>>> some extra tracers for xentrace in core2_vpmu_do_interrupt() so
>>>>>>>> the code looks like:
>>>>>>>> 
>>>>>>>>     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, msr_content);     -> 1. Step
>>>>>>>>     {
>>>>>>>>         uint32_t HAHN_l, HAHN_h;
>>>>>>>>         HAHN_l = (uint32_t) msr_content;
>>>>>>>>         HAHN_h = (uint32_t) (msr_content >> 32);
>>>>>>>>         HVMTRACE_3D(HAHN_TR2, v, 1, HAHN_h, HAHN_l);      -> 2. Step
>>>>>>>>     }
>>>>>>>>     if ( !msr_content )
>>>>>>>>         return 0;
>>>>>>>>     core2_vpmu_cxt->global_ovf_status |= msr_content;
>>>>>>>>     msr_content = 0xC000000700000000 | ((1 << core2_get_pmc_count()) - 1);
>>>>>>>>     wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, msr_content);   -> 3. Step
>>>>>>>> 
>>>>>>>>     rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, msr_content);     -> 4. Step
>>>>>>>>     {
>>>>>>>>         uint32_t HAHN_l, HAHN_h;
>>>>>>>>         HAHN_l = (uint32_t) msr_content;
>>>>>>>>         HAHN_h = (uint32_t) (msr_content >> 32);
>>>>>>>>         HVMTRACE_3D(HAHN_TR2, v, 0xa, HAHN_h, HAHN_l);    -> 5. Step
>>>>>>>> 
>>>>>>>>         rdmsrl(0xc3, msr_content);                        -> 6. Step General counter #2
>>>>>>>>         HAHN_l = (uint32_t) msr_content;
>>>>>>>>         HAHN_h = (uint32_t) (msr_content >> 32);
>>>>>>>>         HVMTRACE_3D(HAHN_TR2, v, 0xc3, HAHN_h, HAHN_l);
>>>>>>>>         rdmsrl(0x30a, msr_content);                       -> 7. Step Fixed counter #1
>>>>>>>>         HAHN_l = (uint32_t) msr_content;
>>>>>>>>         HAHN_h = (uint32_t) (msr_content >> 32);
>>>>>>>>         HVMTRACE_3D(HAHN_TR2, v, 0x30a, HAHN_h, HAHN_l);
>>>>>>>>     }
>>>>>>>> 
>>>>>>>> With these tracers I got the following output:
>>>>>>>> 
>>>>>>>> Last good NMI:
>>>>>>>> Both counters caused the NMI. Resetting works OK.
>>>>>>>> The counters themselves kept running.
>>>>>>>> 2. Step: par1 = 0x01,  high = 0x0002, low =  0x0004 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
>>>>>>>> 5. Step: par1 = 0x0a,  high = 0x0000, low =  0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
>>>>>>>> 6. Step: par1 = 0xc3,  high = 0x0000, low =  0x03c4 ] rdmsrl(0xc3) -> #2 general counter
>>>>>>>> 7. Step: par1 = 0x30a, high = 0x0000, low =  0x02da ] rdmsrl(0x30a) -> #1 fixed counter
>>>>>>>> 
>>>>>>>> NMI where things go wrong:
>>>>>>>> Both counters caused the NMI. Resetting does NOT work correctly,
>>>>>>>> only for the general counter! The general counter (which caused
>>>>>>>> the NMI) seems to be stopped!
>>>>>>>> 2. Step: par1 = 0x01,  high = 0x0002, low =  0x0004 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
>>>>>>>> 5. Step: par1 = 0x0a,  high = 0x0002, low =  0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
>>>>>>>> 6. Step: par1 = 0xc3,  high = 0x0000, low =  0x00ec ] rdmsrl(0xc3) -> #2 general counter
>>>>>>>> 7. Step: par1 = 0x30a, high = 0x0000, low =  0x0000 ] rdmsrl(0x30a) -> #1 fixed counter
>>>>>>>> 
>>>>>>>> Wrong NMI:
>>>>>>>> Only the fixed counter causes the NMI (it was not reset
>>>>>>>> during the NMI handling above!). Both counters seem to be stopped!
>>>>>>>> 2. Step: par1 = 0x01,  high = 0x0002, low =  0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
>>>>>>>> 5. Step: par1 = 0x0a,  high = 0x0002, low =  0x0000 ] rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
>>>>>>>> 6. Step: par1 = 0xc3,  high = 0x0000, low =  0x00ec ] rdmsrl(0xc3) -> #2 general counter
>>>>>>>> 7. Step: par1 = 0x30a, high = 0x0000, low =  0x0000 ] rdmsrl(0x30a) -> #1 fixed counter
>>>>>>>> 
>>>>>>>> And this state remains forever!
>>>>>>>> I hope my explanations are understandable ;-)
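The high/low pairs in the traces above can be read mechanically. The following C sketch rejoins the two 32-bit halves and tests the overflow bits, assuming the architectural layout of the global status MSR (general-purpose counter n at bit n, fixed counter n at bit 32+n). All toy_* names are hypothetical helpers for illustration; toy_ovf_clear_mask() mirrors the constant in the quoted code, with the number of general-purpose counters passed in as a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rejoin the two 32-bit halves that the tracer logged separately. */
static uint64_t toy_join(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* True if general-purpose counter `idx` has its overflow bit set. */
static bool toy_gp_overflowed(uint64_t status, unsigned idx)
{
    return ((status >> idx) & 1) != 0;
}

/* True if fixed counter `idx` has its overflow bit set (bit 32 + idx). */
static bool toy_fixed_overflowed(uint64_t status, unsigned idx)
{
    return ((status >> (32 + idx)) & 1) != 0;
}

/* Mask written to the overflow-control MSR in the quoted handler:
 * the two top bits, three fixed-counter bits, and one bit per
 * general-purpose counter. */
static uint64_t toy_ovf_clear_mask(unsigned gp_count)
{
    return UINT64_C(0xC000000700000000) | ((UINT64_C(1) << gp_count) - 1);
}
```

Read this way, high = 0x0002, low = 0x0004 means general counter #2 and fixed counter #1 both overflowed, and high = 0x0002, low = 0x0000 after the write in step 3 means the fixed counter's overflow bit stayed set even though the mask (0xC00000070000000F for an assumed four general-purpose counters) covers it.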
>>>>>>>> 
>>>>>>>> Until now I can see this behavior only on a Nehalem processor.
>>>>>>>> 
>>>>>>>> Thanks.
>>>>>>>> Dietmar
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best Regards
>>>>>>>>> Shan Haitao
>>>>>>>>> 
>>>>>>>>> 2009/10/30 Keir Fraser <keir.fraser@xxxxxxxxxxxxx>:
>>>>>>>>>> On 30/10/2009 12:20, "Dietmar Hahn"
>>>>>>>>>> <dietmar.hahn@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I searched the Intel processor spec but couldn't find any
>>>>>>>>>>> help. So my question is, what is wrong here?
>>>>>>>>>>> Can anybody with more knowledge point me in the right
>>>>>>>>>>> direction? What can I still do to find the real cause of
>>>>>>>>>>> this?
>>>>>>>>>> 
>>>>>>>>>> You should probably Cc one of the Intel guys who implemented
>>>>>>>>>> this stuff -- I've added Haitao Shan.
>>>>>>>>>> 
>>>>>>>>>> Meanwhile I'd be interested to know whether things work okay
>>>>>>>>>> for you, minus performance counters and the hypervisor hang,
>>>>>>>>>> if you return immediately from vpmu_initialise(). Really at
>>>>>>>>>> minimum we need such a fix, perhaps with a boot parameter to
>>>>>>>>>> re-enable the feature, for 3.4.2 release; allowing guests to
>>>>>>>>>> hose the hypervisor like this is of course not on.
>>>>>>>>>> 
>>>>>>>>>>  -- Keir
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>>>> http://lists.xensource.com/xen-devel

Attachment: unmask_vPMI.patch
Description: unmask_vPMI.patch