
Re: [Xen-devel] VPMU interrupt unreliability



On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
<boris.ostrovsky@xxxxxxxxxx> wrote:
> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@xxxxxxxxxxxx> wrote:
>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>>> <boris.ostrovsky@xxxxxxxxxx> wrote:
>>>>>> One thing I noticed is that the workaround doesn't appear to be
>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>>>>>> or architectural). Of course, without knowing what the actual problem
>>>>>> was it's hard to say whether this was intentional.
>>>>> handle_pmc_quirk appears to loop through all the counters ...
>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>>>> value one by one and so it is looking at all bits.
>>>>
>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>>>>>>> applied and giving the guest the correct counter value.
>>>>>> We'd have to keep track of whether the counter has been reset (by the
>>>>>> quirk) since the last MSR write.
>>>>> Yes.
>>>>>
>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>>>>>>> on the relevant hardware.
>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>>>>> GLOBAL_OVF_CTRL.
>>>> Wouldn't it be better to wait until the counter is reloaded?
>>> Maybe!  I haven't thought through it a lot.  It's still not clear to
>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>>> interrupt in any way or whether it just resets the bits in
>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>>> all that's required to reenable it.
>>>
>>> - Kyle
>> I wonder if it would be reasonable to just remove the workaround
>> entirely at some point.  The set of people using 1) several-year-old
>> hardware, 2) an up-to-date Xen, and 3) the off-by-default performance
>> counters is probably rather small.
>
> We'd probably want to only enable this for affected processors, not
> remove it outright. But the problem is that we still don't know for sure
> whether this issue affects NHM only, do we?
>
> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
> is the original message)

Yes, the basic problem is that we don't know where to draw the line.
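
To make "drawing the line" concrete, here is a rough model (plain Python, not Xen code) of gating the quirk on CPU family and model. The names decode_family_model, quirk_needed and NEHALEM_MODELS are made up for illustration. The model list only covers the Nehalem family-6 model numbers; whether the problem is confined to those parts is exactly the open question, so treat the set as a placeholder.

```python
# Hypothetical sketch, not Xen code: decide per CPU whether to apply the
# PMC quirk. NEHALEM_MODELS holds Intel family-6 model numbers for Nehalem
# parts; it is an assumed starting point, not a confirmed list of affected CPUs.
NEHALEM_MODELS = frozenset({0x1A, 0x1E, 0x1F, 0x2E})


def decode_family_model(eax):
    """Decode the display family and model from CPUID leaf 1 EAX."""
    base_family = (eax >> 8) & 0xF
    base_model = (eax >> 4) & 0xF
    ext_family = (eax >> 20) & 0xFF
    ext_model = (eax >> 16) & 0xF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model only applies to base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model


def quirk_needed(eax, affected_models=NEHALEM_MODELS):
    """Return True if the (assumed) affected-model list covers this CPU."""
    family, model = decode_family_model(eax)
    return family == 6 and model in affected_models
```

Keeping the affected set as a parameter means the line can be moved later without touching the decode logic, e.g. quirk_needed(eax, NEHALEM_MODELS | other_models) if more parts turn out to be affected.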

- Kyle
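
For anyone following the quoted discussion, a small model (plain Python, not Xen code) of two points raised above: walking MSR_CORE_PERF_GLOBAL_STATUS bit by bit, and option 2, remembering that the quirk touched a counter since the last guest write so that a guest read can be corrected. It assumes the architectural layout (general-purpose overflow bits from bit 0, fixed-counter overflow bits from bit 32) and assumes the quirk bumps an overflowed counter that reads zero to 1. VirtCounter, quirk_bias and the method names are hypothetical.

```python
# Hypothetical sketch, not Xen code.
FIXED_CTR_SHIFT = 32  # fixed-counter overflow bits start at bit 32


def overflowed_counters(global_status, num_gp, num_fixed):
    """Return (general-purpose, fixed) counter indices with overflow set."""
    gp = []
    val = global_status
    for i in range(num_gp):
        if val & 1:
            gp.append(i)
        val >>= 1  # shift one bit at a time so every counter is examined

    fixed = []
    val = global_status >> FIXED_CTR_SHIFT
    for i in range(num_fixed):
        if val & 1:
            fixed.append(i)
        val >>= 1
    return gp, fixed


class VirtCounter:
    """Model of one counter whose value the quirk may have adjusted."""

    def __init__(self):
        self.hw_value = 0    # what the hardware MSR would hold
        self.quirk_bias = 0  # amount added by the quirk since last guest write

    def guest_write(self, value):
        # A guest write reloads the counter, so any earlier adjustment is moot.
        self.hw_value = value
        self.quirk_bias = 0

    def apply_quirk(self):
        # Assumed quirk behaviour: an overflowed counter reading 0 is set to 1.
        if self.hw_value == 0:
            self.hw_value = 1
            self.quirk_bias += 1

    def count(self, events):
        # Stand-in for the hardware counting further events.
        self.hw_value += events

    def guest_read(self):
        # Hide the quirk's adjustment from the guest.
        return self.hw_value - self.quirk_bias
```

Resetting quirk_bias on guest_write corresponds to tracking "since the last MSR write"; whether real hardware tolerates this is not something the sketch can show.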

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

