Xen project Mailing List

Re: [Xen-devel] VPMU interrupt unreliability

On Thu, Oct 19, 2017 at 11:20 AM, Meng Xu <xumengpanda@xxxxxxxxx> wrote:
> On Thu, Oct 19, 2017 at 11:40 AM, Andrew Cooper
> <andrew.cooper3@xxxxxxxxxx> wrote:
>>
>> On 19/10/17 16:09, Kyle Huey wrote:
>> > On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
>> > <boris.ostrovsky@xxxxxxxxxx> wrote:
>> >> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>> >>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@xxxxxxxxxxxx> wrote:
>> >>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>> >>>> <boris.ostrovsky@xxxxxxxxxx> wrote:
>> >>>>>>> One thing I noticed is that the workaround doesn't appear to be
>> >>>>>>> complete: it is only checking PMC0 status and not other counters
>> >>>>>>> (fixed or architectural). Of course, without knowing what the
>> >>>>>>> actual problem was it's hard to say whether this was intentional.
>> >>>>>> handle_pmc_quirk appears to loop through all the counters ...
>> >>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>> >>>>> value one by one and so it is looking at all bits.
>> >>>>>
>> >>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>> >>>>>>>> applied and giving the guest the correct counter value.
>> >>>>>>> We'd have to keep track of whether the counter has been reset (by the
>> >>>>>>> quirk) since the last MSR write.
>> >>>>>> Yes.
>> >>>>>>
>> >>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on
>> >>>>>>>> that counter until the guest acks via GLOBAL_OVF_CTRL, assuming
>> >>>>>>>> that works on the relevant hardware.
>> >>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>> >>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>> >>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>> >>>>>> GLOBAL_OVF_CTRL.
>> >>>>> Wouldn't it be better to wait until the counter is reloaded?
>> >>>> Maybe! I haven't thought through it a lot. It's still not clear to
>> >>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>> >>>> interrupt in any way or whether it just resets the bits in
>> >>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>> >>>> all that's required to reenable it.
>> >>>>
>> >>>> - Kyle
>> >>> I wonder if it would be reasonable to just remove the workaround
>> >>> entirely at some point. The set of people using 1) several year old
>> >>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
>> >>> counters is probably rather small.
>> >> We'd probably want to only enable this for affected processors, not
>> >> remove it outright. But the problem is that we still don't know for sure
>> >> whether this issue affects NHM only, do we?
>> >>
>> >> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
>> >> is the original message)
>> > Yes, the basic problem is that we don't know where to draw the line.
>>
>> vPMU is disabled by default for security reasons,
>
>
> Is there any document about the possible attack via the vPMU? The
> document I found (such as [1] and XSA-163) just briefly say that the
> vPMU should be disabled due to security concern.
>
>
> [1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

Cross-guest information leaks, presumably.

>>
>> and also broken, in a
>> way which demonstrates that vPMU isn't getting much real-world use.
>
> I also noticed that AWS seems support part of the vPMU
> functionalities, which were used by Netflix to optimize their
> applications' performance, according to
> http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html .
>
> I guess the security issue should be solved by AWS? However, without
> knowing how the attack could be conducted, I'm not sure how AWS avoids
> the attack concern for vPMU.

AWS only allows you to use the vPMU if you have the entire physical
machine your VM is running on dedicated to yourself. Cross-guest
information leaks are not a big deal if the same tenant controls all
the guests.

>>
>> As far as I'm concerned, all options (including rm -rf and start from
>> scratch) are acceptable, especially if this ends up giving us a better
>> overall subsystem.
>>
>> Do we know how other hypervisors work around this issue?
>
> Maybe the solution of AWS is a choice? I'm not sure. I'm just thinking aloud.
> :)
>
> Thanks,
>
> Meng
>
> --
> Meng Xu
> Ph.D. Candidate in Computer and Information Science
> University of Pennsylvania
> http://www.cis.upenn.edu/~mengxu/

- Kyle

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
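
Editorial sketch (not from the thread, and not the Xen source): two ideas
discussed above can be illustrated with a small user-space simulation. The
first is walking every overflow bit of a GLOBAL_STATUS-style value by shifting
it one bit at a time. The second is option 2, remembering whether the quirk
touched a counter since the last guest write so that a guest read can be
corrected. Every name here (sim_pmu, sim_handle_quirk, quirk_delta and so on),
the counter counts, and the specific adjustment (nudging a zero counter to 1)
are assumptions made for illustration only. The one hardware detail relied on
is that the fixed-counter overflow bits start at bit 32 of the status value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names and sizes below are hypothetical; none come from the Xen source. */
#define NR_GP_COUNTERS     4   /* assumed number of general-purpose counters */
#define NR_FIXED_COUNTERS  3   /* assumed number of fixed counters */
#define FIXED_STATUS_SHIFT 32  /* fixed-counter overflow bits start at bit 32 */

struct sim_counter {
    uint64_t value;        /* stand-in for the hardware counter MSR */
    uint64_t quirk_delta;  /* amount added by the quirk since the last guest write */
};

struct sim_pmu {
    struct sim_counter gp[NR_GP_COUNTERS];
    struct sim_counter fixed[NR_FIXED_COUNTERS];
};

/* Assumed quirk behaviour: nudge an overflowed counter that reads zero to 1. */
static bool apply_quirk(struct sim_counter *c)
{
    if (c->value != 0)
        return false;
    c->value = 1;
    c->quirk_delta += 1;
    return true;
}

/* Walk one bank of counters, consuming one status bit per counter. */
static unsigned int scan_bank(struct sim_counter *bank, unsigned int n,
                              uint64_t bits)
{
    unsigned int adjusted = 0;

    assert(n <= 32);
    for (unsigned int i = 0; i < n; i++, bits >>= 1)
        if ((bits & 1) && apply_quirk(&bank[i]))
            adjusted++;
    return adjusted;
}

/* Scan both banks; returns how many counters the quirk adjusted. */
unsigned int sim_handle_quirk(struct sim_pmu *pmu, uint64_t global_status)
{
    return scan_bank(pmu->gp, NR_GP_COUNTERS, global_status) +
           scan_bank(pmu->fixed, NR_FIXED_COUNTERS,
                     global_status >> FIXED_STATUS_SHIFT);
}

/* Intercepted guest read: hide whatever the quirk added. */
uint64_t sim_guest_read(const struct sim_counter *c)
{
    return c->value - c->quirk_delta;
}

/* Intercepted guest write: the guest's value is authoritative again. */
void sim_guest_write(struct sim_counter *c, uint64_t v)
{
    c->value = v;
    c->quirk_delta = 0;
}
```

Calling sim_handle_quirk() on a zero-initialised sim_pmu with status bits 1
and 32 set adjusts gp[1] and fixed[0]; sim_guest_read() on either still
reports 0 until the guest writes the counter again. Whether real hardware
would tolerate this, and which processors need the quirk at all, is exactly
the open question in the thread.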
