Re: [Xen-devel] [PATCH] xen/arm: introduce vwfi parameter



Hi Dario,

On 21/02/2017 09:09, Dario Faggioli wrote:
On Tue, 2017-02-21 at 07:59 +0000, Julien Grall wrote:
On 20/02/2017 22:53, Dario Faggioli wrote:
For instance, as you say, executing a WFI from a guest directly on hardware only makes sense if we have 1:1 static pinning. Which means it can't just be done by default, or with a boot parameter, because we need to check and enforce that there's only 1:1 pinning around.

I agree it cannot be done by default. Similarly, the poll mode cannot be done by default, neither platform-wide nor per domain, because you need to know that all vCPUs will be in polling mode.

No, that's the big difference. Polling (which, as far as this patch
goes, is yielding, in this case) is generic in the sense that, no
matter the pinned or non-pinned state, things work. Power is wasted,
but nothing breaks.

Not trapping WF* is not generic, in the sense that, if you do it in the pinned case, it (probably) works. If you lift the pinning, but leave the direct WF* execution in place, everything breaks.

This is all I'm saying: that if you say that not trapping is an alternative to this patch, well, it is not. Not trapping _plus_ measures for preventing things from breaking is an alternative.

Am I nitpicking? Perhaps... In which case, sorry. :-P

I am sorry, but I still don't understand why you say things will break if you don't trap WFI/WFE. Can you give more details?


But as I said, if vCPUs are not pinned, this patch has very little advantage, because you may context switch between them when yielding.

Smaller advantage, sure. How much smaller, hard to tell. That is the reason why I see some potential value in this patch, especially if converted to doing its thing per-domain, as George suggested. One can try (and, when that happens, we'll show a big WARNING about wasting power and heating up the CPUs!), and decide whether the result is good or not for the specific use case.

I even think there will be no advantage at all in the multiple-vCPU case, because I would not be surprised if the overhead of blocking a vCPU comes from switching back and forth to the idle vCPU, which requires saving/restoring the context of the same vCPU.

Anyway, having numbers here would help to confirm this.

My concern with a per-domain solution, or even a system-wide one, is that you may have an idle vCPU on which you don't expect any interrupt to come. In that case, the vCPU will waste power, and an unmodified app (e.g. one that is not Xen-aware) cannot avoid that, as there is no way to suspend a vCPU on Xen today.
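
To make that trade-off concrete, here is a rough sketch of the two policies for the WFI trap handler. The handler name and the vwfi_yield knob are made up for illustration; vcpu_block()/vcpu_yield() mirror Xen's common scheduler interface, which acts on the current vCPU:

    #include <stdbool.h>

    void vcpu_block(void);   /* Xen scheduler: block the current vCPU */
    void vcpu_yield(void);   /* Xen scheduler: yield the current vCPU */

    static bool vwfi_yield;  /* hypothetical policy knob */

    static void handle_guest_wfi(void)
    {
        if ( vwfi_yield )
            vcpu_yield();    /* stay runnable: wastes power, wakes up fast */
        else
            vcpu_block();    /* sleep: saves power, pays a full context switch */
    }

As far as I understand it, the patch under discussion essentially turns the vcpu_block() path into the vcpu_yield() one.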


Is it possible to decide whether to trap and emulate WFI, or just execute it, online, and change such decision dynamically? And even if yes, how would the whole thing work? When the direct execution is enabled for a domain, do we automatically enforce 1:1 pinning for that domain, and kick all the other domains out of its pCPUs? What if they have their own pinning, what if they also have 'direct WFI' behavior enabled?

It can be changed online; the WFI/WFE trapping is per pCPU (see HCR_EL2.{TWE,TWI}).

Ok, thanks for the info. Not bad. With added logic (perhaps in the nop
scheduler), this looks like it could be useful.
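
For reference, toggling those per-pCPU bits could look something like the sketch below. READ_SYSREG/WRITE_SYSREG, isb(), register_t and the HCR_TWI/HCR_TWE constants follow Xen's arch/arm naming, but the helper itself is hypothetical:

    /* Hypothetical helper: enable/disable WFI/WFE trapping on the
     * current pCPU by toggling HCR_EL2.TWI and HCR_EL2.TWE. */
    static void set_wfi_wfe_trapping(bool trap)
    {
        register_t hcr = READ_SYSREG(HCR_EL2);

        if ( trap )
            hcr |= (HCR_TWI | HCR_TWE);   /* WFI/WFE trap to the hypervisor */
        else
            hcr &= ~(HCR_TWI | HCR_TWE);  /* WFI/WFE execute directly */

        WRITE_SYSREG(hcr, HCR_EL2);
        isb();  /* make sure the new trap configuration takes effect */
    }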

These are just examples, my point being that, in theory, if we consider a very specific use case or set of use cases, there's a lot we can do. But when you say "why don't you let the guest directly execute WFI", in response to a patch and a discussion like this, people may think that you are actually proposing doing it as a solution, which is not possible without figuring out all the open questions above (actually, probably, more) and without introducing a lot of cross-subsystem policing inside Xen, which is often something we don't want.

I made this response because the patch sent by Stefano has a very specific use case that can be solved the same way. Everyone here is suggesting polling, but it has its own disadvantage: power consumption.

Anyway, I still think in both cases we are solving a specific problem without looking at what matters, i.e. why the scheduler takes so much time to block/unblock.

Well, TBH, we still are not entirely sure who the culprit is for high
latency. There are spikes in Credit2, and I'm investigating that. But
apart from them? I think we need other numbers with which we can
compare the numbers that Stefano has collected.

I think the problem is that we save/restore the vCPU state when switching to the idle vCPU.

Let's say only one vCPU can run on the pCPU; when that vCPU issues a WFI, the following steps happen:
     * WFI trapped and vcpu blocked
     * save vCPU state
     * run idle_loop
-> Interrupt incoming for the guest
     * restore vCPU state
     * back to the guest

Saving/restoring on ARM requires context switching all the state of the VM (it is not saved in memory when entering the hypervisor). This includes things like system registers, the interrupt controller state, the FPU...

Context switching the interrupt controller and the FPU can take some time, as there are a lot of registers and some are only accessible through a memory-mapped interface (see GICv2 for instance).

So a context switch will likely hurt the performance of blocking a vCPU in the case where only one vCPU runs per pCPU.
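
To give an idea of why the GICv2 part is expensive, here is a rough sketch of saving the virtual interface state. The GICH_* offsets come from the GICv2 architecture specification, but the structure and function are illustrative, not Xen's actual GIC save/restore code:

    #include <stdint.h>

    /* GICv2 virtual interface control registers (offsets per the spec). */
    #define GICH_HCR   0x000
    #define GICH_VMCR  0x008
    #define GICH_APR   0x0f0
    #define GICH_LR0   0x100

    struct gich_state {
        uint32_t hcr, vmcr, apr;
        uint32_t lr[64];             /* one slot per list register */
    };

    static void gicv2_save(volatile uint32_t *gich, struct gich_state *s,
                           unsigned int nr_lrs)
    {
        unsigned int i;

        /* Every access below is an MMIO read over the memory interface,
         * much slower than a system register move; paying this on every
         * switch to/from the idle vCPU adds up. */
        for ( i = 0; i < nr_lrs; i++ )
            s->lr[i] = gich[(GICH_LR0 / 4) + i];

        s->apr  = gich[GICH_APR / 4];
        s->vmcr = gich[GICH_VMCR / 4];
        s->hcr  = gich[GICH_HCR / 4];
    }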


I'll send code for the nop scheduler, and we will compare with what we'll get with it. Another interesting data point would be knowing what the numbers look like on baremetal, on the same platform and under comparable conditions.

And I guess there are other components and layers, in the Xen architecture, that may be causing increased latency, which we may not have identified yet.

Anyway, nop scheduler is probably first thing we want to check. I'll
send the patches soon.

So, yes, in the end the guest will waste its slot.

Did I say it already that this concept of "slots" does not apply
here?
:-D

Sorry, I forgot about this :/. I guess you use the term credit? If so, the guest will use its credits for nothing.

If the guest is alone, or in general the system is undersubscribed, it would indeed burn its credits by continuously yielding in a busy loop, but that doesn't matter, because there are enough pCPUs to run even vCPUs that are out of credits.

If the guest is not alone, and the system is oversubscribed, it would
use a very tiny amount of its credits, every now and then, i.e., the
ones that are necessary to execute a WFI, and, for Xen, to issue a call
to sched_yield(). But after that, we will run someone else. This to say
that the problem of this patch might be that, in the oversubscribed
case, it relies too much on the behavior of yield, but not that it does
nothing.

But maybe I'm nitpicking again. Sorry. I don't get to talk about these
inner (and very interesting, to me at least) scheduling details too
often, and when it happens, I tend to get excited and exaggerate! :-P

Let's take a step aside. The ARM ARM describes WFI as a "hint instruction that permits the processor to enter a low-power state until one of a number of asynchronous events occurs". Entering a low-power state means there will be an impact (maybe small) on interrupt latency, because the CPU has to leave the low-power state.

A baremetal application that uses WFI is aware of the impact and wishes to save power. If that application really cares about interrupt latency, it will use polling and not WFI. It depends on how much interrupt latency you can tolerate.

Now, the same baremetal app running as a Xen guest will expect the same behavior. This is why WFI is implemented with blocking, but it has a high impact today (see above for a possible explanation). Moving to yield may have the same high impact because, as you said, the implementation will depend on the scheduler, and when multiple vCPUs are running on the same pCPU you have to context switch, which has a cost.

A user who wants to move his baremetal app into a guest will have to pay the price of virtualization overhead plus extra power consumption if he wants good interrupt latency, even when using WFI. I would be surprised if that looks appealing to anyone.

If you want good interrupt latency with virtualization, you should pin your vCPU and ensure no other vCPU can run on this pCPU. And then you can play with the scheduler to optimize it (e.g. avoiding pointless context switches...).
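
As a concrete (and hypothetical) example, such a setup could look like this in an xl domain config, assuming a board with at least four pCPUs; the CPU numbers are made up:

    # Two vCPUs, each pinned 1:1 to a dedicated pCPU.
    vcpus = 2
    cpus = ["2", "3"]    # vCPU0 -> pCPU2, vCPU1 -> pCPU3

Keeping every other domain (dom0 included) off pCPUs 2-3 still has to be arranged separately, e.g. through their own pinning.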

So for me, implementing guest WFI as polling looks like an attempt to muddy the waters. It is not going to solve the real problem: that the context switch takes time.

Cheers,

--
Julien Grall
