Re: [Xen-devel] [PATCH v1 00/13] x86/PMU: Xen PMU PV support

On 09/12/2013 05:39 AM, George Dunlap wrote:
On 11/09/13 19:22, Boris Ostrovsky wrote:
On 09/11/2013 01:01 PM, George Dunlap wrote:
On 10/09/13 16:47, Boris Ostrovsky wrote:
On 09/10/2013 11:34 AM, Jan Beulich wrote:
On 10.09.13 at 17:20, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx> wrote:
This version has the following limitations:
* For accurate profiling of dom0/Xen, dom0 VCPUs should be pinned.
* Hypervisor code is only profiled on processors that have running dom0 VCPUs
on them.
With that I assume this is an RFC rather than full-fledged submission?

I was thinking that this would be something like stage 1 implementation (and
probably should have mentioned this in the cover letter).

For this stage I wanted to confine all changes on the Linux side to the xen subtrees. Properly addressing the above limitations would likely require changes in non-xen
sources (change in perf file format, remote MSR access etc.).

I think having the vpmu stuff for PV guests is a great idea, and from a quick skim through I don't have any problems with the general approach. (Obviously some more detailed review will be needed.)

However, I'm not a fan of this method of collecting perf stuff for Xen and other VMs together in the cpu buffers for dom0. I think it's ugly, fragile, and non-scalable, and I would prefer to see if we could implement the same feature (allowing perf to analyze Xen and other vcpus) some other way. And I would rather not use it as a "stage 1", for fear that it would become entrenched.

I can see how collecting samples for other domains may be questionable now (DOM0_PRIV mode) since at this stage there is no way to distinguish between samples for non-privileged domains.

But why do you think that getting data for both dom0 and Xen is problematic? Someone has to process Xen's samples and who would do this if not dom0? We could store samples in separate files (e.g. perf.data.dom0 and perf.data.xen) but that's toolstack's job.

It's not so much about dom0 collecting the samples and passing them on to the analysis tools; this is already what xenalyze does, in essence. It's about the requirement of having the dom0 vcpus pinned 1-1 to physical cpus: both limiting the flexibility for scheduling, and limiting the configuration flexibility wrt having dom0 vcpus < pcpus. That is what seems an ugly hack to me -- having dom0 sort of try to do something that requires hypervisor-level privileges and making a bit of a mess of it.

I probably should have explained the limitations better in the
original message.


The only reason this version requires pinning is that I haven't
provided hooks in Linux perf code to store both PCPU and VCPU of the
sample in the perf_sample_data. And I didn't do this because it
would need to be done outside of arch/x86/xen and I decided not to go
there for this stage. So for now perf still only knows about CPUs, not

Note that the hypervisor already provides information about both P/VCPUs to
dom0 (*), so when I fix what I described above in Linux (kernel and perf
toolstack) the right association of P/VCPUs will start working.

And pinning is not really *required*. If you don't pin you will not
get an accurate distribution of hypervisor samples in perf.
For instance, if Xen's foo() was sampled on PCPU0 and then PCPU1 while
dom0's VCPU0 was running on each of them, perf will assume that both
samples were taken on CPU0. (Note again: CPU0, not P- or VCPU0.)
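
The misattribution described above can be illustrated with a small model. This is only a sketch: the `Sample` record and its field names are hypothetical and are not taken from the patch series or from perf itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    symbol: str  # function that was sampled, e.g. Xen's foo()
    pcpu: int    # physical CPU the sample was really taken on
    vcpu: int    # dom0 VCPU that was running on that PCPU at the time


def perf_view(samples):
    """What perf sees today: one 'CPU' per sample, which is the dom0 VCPU."""
    return Counter((s.symbol, s.vcpu) for s in samples)


def true_view(samples):
    """What an accurate per-PCPU distribution would look like."""
    return Counter((s.symbol, s.pcpu) for s in samples)


# Unpinned dom0: VCPU0 moves from PCPU0 to PCPU1 between the two samples.
samples = [Sample("foo", pcpu=0, vcpu=0), Sample("foo", pcpu=1, vcpu=0)]

print(perf_view(samples))  # both samples land on "CPU 0"
print(true_view(samples))  # one sample on each physical CPU
```

With pinning, `pcpu` and `vcpu` are always equal, so the two views coincide.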


This is different from pinning. The issue here is that tools (e.g. perf) need to
access the PMU's MSRs. And they do it with something like wrmsr(msr, value),
and they assume that they are programming the PMU on the current processor. So
if a dom0 VCPU never runs on some PCPU, it currently cannot program the
PMU there. One way to address this could be to have wrmsr_cpu(cpu, msr,
value). And presumably on bare metal this will be patched over with regular

(*) Well, it doesn't. Because I forgot to add this to the code (it's one line, really)
but I will in the next version.

I'm unfortunately not familiar enough with the perf system to know exactly what it is that Linux needs to do (why, for example, you think it would need remote MSR access if dom0 weren't pinned),

Remote MSR access is needed not because of pinning but because the tool (perf, or any other tool for that matter) needs to program the PMU on non-dom0 processors.
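
To make the difference between a local and a remote MSR write concrete, here is a toy model. It is a sketch under assumptions: the `Machine` class, the extra `current_pcpu` argument, and `DEMO_MSR` are illustrative only, and `wrmsr_cpu` here merely mimics the interface suggested in this thread rather than any existing Xen or Linux function.

```python
class Machine:
    """Toy model: every physical CPU has its own private set of PMU MSRs."""

    def __init__(self, num_pcpus):
        self.msrs = [dict() for _ in range(num_pcpus)]

    def wrmsr(self, current_pcpu, msr, value):
        # A plain wrmsr always lands on the processor that executes it.
        self.msrs[current_pcpu][msr] = value

    def wrmsr_cpu(self, current_pcpu, target_pcpu, msr, value):
        # Hypothetical remote variant: when the target is the current
        # processor it degenerates to a plain wrmsr; otherwise the write
        # has to be carried out on behalf of the caller on the target.
        if target_pcpu == current_pcpu:
            self.wrmsr(current_pcpu, msr, value)
        else:
            self.msrs[target_pcpu][msr] = value  # stands in for a remote request


DEMO_MSR = 0x186      # illustrative MSR index
dom0_pcpus = [0, 1]   # PCPUs on which dom0 VCPUs ever run (assumed)

# Local writes only: PMUs on PCPUs 2 and 3 are never programmed.
local = Machine(4)
for pcpu in dom0_pcpus:
    local.wrmsr(pcpu, DEMO_MSR, 1)
local_programmed = [p for p, regs in enumerate(local.msrs) if DEMO_MSR in regs]

# Remote writes issued from PCPU0 reach every processor.
remote = Machine(4)
for target in range(4):
    remote.wrmsr_cpu(0, target, DEMO_MSR, 1)
remote_programmed = [p for p, regs in enumerate(remote.msrs) if DEMO_MSR in regs]

print(local_programmed, remote_programmed)
```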

and how hard it would be for Xen just to do that work, and provide an "adapter" that would translate Xen-specific stuff into something perf could consume. Would it be possible, for example, for dom0 to specify what needed to be collected, for Xen to generate the samples in a Xen-specific format, and then have something in dom0 that would separate the samples into one file per domain that look similar enough to a trace file that the perf system could consume it?

Perf calculates the sampling period on each sample and writes the resulting value into the counter MSR (I haven't looked yet at how it uses other performance facilities such as PEBS, IBS and such).
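
One common way to program a sampling counter is to load it with the negated period, so that it overflows and raises the PMU interrupt after that many events. The sketch below assumes this scheme and a 48-bit counter; neither detail is stated in this thread, and the function names are hypothetical.

```python
COUNTER_WIDTH = 48  # assumed counter width in bits


def reload_value(period, width=COUNTER_WIDTH):
    """Value to write into the counter so it overflows after `period` events."""
    if not 0 < period < (1 << width):
        raise ValueError("period does not fit in the counter")
    return (1 << width) - period


def events_until_overflow(value, width=COUNTER_WIDTH):
    """Number of increments before a counter holding `value` wraps around."""
    return (1 << width) - value


value = reload_value(100000)
print(hex(value), events_until_overflow(value))
```

Because this value is rewritten on every sample, the write has to reach the counter of the processor that took the sample, which is where the remote MSR access problem comes from.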

Processing sample data is done by the toolstack and is relatively easy; we don't need a Xen-specific format (once we fix the pinning issue so we know to whom a sample belongs).
Programming the PMU HW from existing perf code is the challenge.
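
As a rough illustration of why the toolstack side is the easy part: once each sample carries its owner, splitting a mixed stream into per-domain files is a simple grouping step. This is a sketch only; the `(domain, ip)` tuple layout and both helper names are assumptions, and only the perf.data.dom0 / perf.data.xen file names come from earlier in this thread.

```python
from collections import defaultdict


def split_by_domain(samples):
    """Group (domain, ip) samples by the domain that owns them."""
    per_domain = defaultdict(list)
    for domain, ip in samples:
        per_domain[domain].append(ip)
    return dict(per_domain)


def output_name(domain):
    """File name for one domain's samples, e.g. perf.data.xen."""
    return f"perf.data.{domain}"


mixed = [("dom0", 0x1000), ("xen", 0x2000), ("dom0", 0x1008)]
for domain, ips in split_by_domain(mixed).items():
    print(output_name(domain), [hex(ip) for ip in ips])
```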

