
Re: [Xen-devel] [Help] Trigger Watchdog when adding an IPI in vcpu_wake



On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
>[using xendevel correct address]
>
>On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
>> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
>> > 
>> > I'm not surprised by that. Yet, I'd be interested in hearing more
>> > about this profiling you have done (things like, how you captured
>> > the data, what workloads you are exactly considering, how you
>> > determined what is the bottleneck, etc).
>> Let me try to explain this.
>> 
>> Workload:      a. Blue Screen in Windows guests
>>                b. some client using Windows to do some video
>>                   processing, which needs precise timestamps (we are
>>                   not sure of the root cause, but we see the system
>>                   is very slow)
>>
>Do you mind sharing just a bit more, such as:
> - number of pcpus
> - number of vcpus of the various VMs

160 pcpus
16 vcpus in VM and 8 VMs

>
>I also am not sure what "a. Blue screen in Windows guests" above
>means... is there a workload called "Blue Screen"? Or is it that you
>are hitting a BSOD in some of your guests (which ones, what were they
>doing)? Or is it that you _want_ to provoke a BSOD on some of your
>guests? Or is that something else? :-P
>

Yes, the "Blue Screen" is how we mimic the behavior seen at the client.

In my mind, a "Blue Screen" will force the hypervisor to do load balancing.

>> Capture the data: lock profiling
>> Bottleneck check: the schedule lock wait time is really high
>> 
>Ok, cool. Interesting that lock profiling works on 4.1! :-O
>
>> > The scheduler tries to see whether the v->processor of the waking
>> > vcpu can be re-used, but that's not at all guaranteed, and again,
>> > on a very loaded system, not even that likely!
>> > 
>> Hmm... I may have missed something.
>> 
>> Taking your assumption below as an example.
>> In my mind, the process looks like this:
>> 
>>     csched_vcpu_wake()
>>         __runq_insert(),  insert the vcpu in pcpu's 6 runq
>>         __runq_tickle(),  raise SCHEDULE_SOFTIRQ on pcpu 6 or other
>> (1)
>>     
>Well, yes. More precisely, at least in current staging,
>SCHEDULE_SOFTIRQ is raised for pcpu 6:
> - if pcpu 6 is idle, or
> - if pcpu 6 is not idle but there actually isn't any idle vcpu, and
>   the waking vcpu is higher in priority than what is running on pcpu 6
>
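[Editorial note: to make that condition concrete, here is a toy Python model of the tickling rule just described; the function and parameter names are made up for illustration, this is not Xen code.]

```python
# Toy model of the __runq_tickle() rule described above (current
# staging): SCHEDULE_SOFTIRQ is raised on the target pcpu if it is
# idle, or if there is no idler at all and the waking vcpu outranks
# what the target pcpu is currently running.

def should_tickle_target(target_is_idle, any_idler, waking_prio, running_prio):
    if target_is_idle:
        return True
    # Busy target: only preempt it when no idler can absorb the work.
    return (not any_idler) and waking_prio > running_prio

# A busy pcpu 6 is only tickled when there is no idler around:
print(should_tickle_target(False, False, 10, 5))  # True
print(should_tickle_target(False, True, 10, 5))   # False
print(should_tickle_target(True, True, 1, 0))     # True (pcpu 6 idle)
```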
>>     __do_softirq()
>>         schedule()
>>             csched_schedule(),  for pcpu 6, it may wake up d2v2 based
>> on it's
>>                                 priority
>>
>Yes, but it is pcpu 6 that will run csched_schedule() only if the
>conditions mentioned above are met. If not, it will be some other pcpu
>that will do so (or none!).
>
>But I just realized that the fact that you are actually working on 4.1
>is going to be an issue. In fact, the code that it is now in Xen has
>changed **a lot** since 4.1. In fact, you're missing all the soft-
>affinity work (which you may or may not find useful) but also
>improvements and bugfixing.
>

That's true...

>I'll have a quick look at how __runq_tickle() looks like in Xen 4.1,
>but there's very few chances I'll be available to provide detailed
>review, advice, testing, etc., on such an old codebase. :-(
>
>> By looking at the code, what I missed may be in __runq_tickle():
>> in case there are idle pcpus, SCHEDULE_SOFTIRQ is also raised on
>> them. By doing so, those idle pcpus would steal tasks from other
>> busy pcpus and might, by chance, get d2v2?
>> 
>Yes, it looks like, in Xen 4.1, this is more or less what happens. The
>idea is to always tickle pcpu 6, if the priority of the waking vcpu is
>higher than what is running there. 
>

Hmm... in case there are idle pcpus and the priority of the waking vcpu is
higher than what is running on pcpu 6, would pcpu 6 have a better chance of
running it? Would another idle pcpu steal it from pcpu 6? Or do they have
equal chances?

>If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
> - if the waking vcpu preempted what was running on pcpu 6, an idler
>   can pick-up ("steal") such preempted vcpu and run it;
> - if the waking vcpu ended up in the runqueue, an idler can steal it
>

Hmm... I don't get the difference between these two cases.

It looks like, in both of them, it is an idler that steals the vcpu.

>> BTW, if the system is under heavy load, how much chance is there of
>> having an idle pcpu?
>> 
>It's not very likely that there will be idle pcpus in a very busy
>system, I agree.
>
>> > > We can divide this in three steps:
>> > > a) Send IPI to the target CPU and tell it which vcpu we want it
>> > > to 
>> > > wake up.
>> > > b) The interrupt handler on target cpu insert vcpu to a percpu
>> > > queue 
>> > > and raise
>> > >    softirq.
>> > > c) softirq will dequeue the vcpu and wake it up.
>> > > 
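[Editorial note: for concreteness, the three steps a)-c) could be sketched roughly as below. This is a Python model with made-up names; the IPI delivery and softirq machinery are stand-ins, and vcpu_wake is passed in as a callback.]

```python
from collections import deque
from threading import Lock

class PCpu:
    """Stand-in for a pcpu with the proposed per-cpu wakeup state."""
    def __init__(self):
        self.wake_vcpu_lock = Lock()     # protects the deferred list
        self.deferred_wakeups = deque()
        self.softirq_pending = False

def notify_remote_vcpu_wake(target, vcpu):
    # a) on the waking pcpu: queue the vcpu on the target, then IPI it
    with target.wake_vcpu_lock:
        target.deferred_wakeups.append(vcpu)
    wakeup_ipi_handler(target)           # models the WAKEUP IPI

def wakeup_ipi_handler(cpu):
    # b) on the target pcpu, in interrupt context: just raise a softirq
    cpu.softirq_pending = True

def wakeup_softirq(cpu, vcpu_wake):
    # c) softirq handler: drain the queue and do the real wakeups
    #    locally (vcpu_wake() then takes the local runqueue lock)
    cpu.softirq_pending = False
    while True:
        with cpu.wake_vcpu_lock:
            if not cpu.deferred_wakeups:
                return
            v = cpu.deferred_wakeups.popleft()
        vcpu_wake(v)
```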
>> > I'm not sure I see how you think this would improve the situation.
>> > 
>> > So, quickly, right now:
>> > 
>> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
>> > - let's assume d2v2->processor = 6
>> > - let's assume the wakeup is happening on pcpu 4
>> > 
>> > Right now:
>> > 
>> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2,
>> >   which is the runqueue lock of pcpu 6 (in Credit1, there is 1
>> >   runqueue per pcpu, and locks are per-cpu already)
>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>> > - still executing on pcpu 4, __runq_tickle() is called, and it
>> >   determines on which pcpu d2v2 should run
>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
>> >   following two cases:
>> >    a) if it was determined that d2v2 should run on pcpu 6:
>> >        - in a (little, hopefully) while, pcpu 6 will schedule
>> >        - it takes its own runqueue lock
>> >        - it finds d2v2 in there, and runs it
>> Even in this case, whether it would be run right away depends on the
>> priority of d2v2. Right?
>> 
>Yes. But both in 4.1 and in current staging, we only raise
>SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the waking
>vcpu is higher and it should preempt the currently running vcpu.
>

Oh, you are right. I didn't catch this either.

This means that, in case
a) the priority is lower than the current one, and
b) there is no idler in the system,
the vcpu will sit quietly in the runq, waiting for the next schedule.

>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>> >        - in a (little, hopefully) while, pcpu 9 will schedule
>> >        - it takes its own runqueue lock
>> >        - if it has nothing in its runqueue (likely, if
>> >          __runq_tickle() chose it, but not necessarily and always
>> >          true, especially under high load), it looks around to
>> >          see if other pcpus have any vcpu that it can execute
>> >        - it will try to take the locks of the various pcpu's
>> >          runqueues it comes across
>> >        - if it manages to take the lock of pcpu's 6 runqueue, it
>> >          sees d2v2 in it, steal it and run it.
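[Editorial note: the stealing pass in case b) can be modeled roughly as below. This is a toy Python sketch, not the real load-balancing code; the point it illustrates is the trylock-and-skip walk over the peers' runqueues.]

```python
from threading import Lock

class Runq:
    """Stand-in for a pcpu's runqueue and its per-cpu lock."""
    def __init__(self, vcpus=()):
        self.lock = Lock()
        self.vcpus = list(vcpus)

def steal_work(peers):
    # An idle pcpu walks its peers' runqueues, trylocking each one;
    # a busy lock is skipped rather than waited on. Note it may take
    # a peer's lock and still find nothing runnable to steal.
    for peer in peers:
        if not peer.lock.acquire(blocking=False):
            continue
        try:
            if peer.vcpus:
                return peer.vcpus.pop(0)
        finally:
            peer.lock.release()
    return None
```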
>> Oh, my understanding matches yours :-)
>> 
>> BTW, we also found that csched_load_balance() may hold the schedule
>> lock but not find a proper vcpu to steal. Maybe this would be a
>> potential optimization point.
>> 
>Mmm... I actually don't understand what you mean here... what is the
>possible optimization?
>
>> > With your modifications, AFAICT:
>> > 
>> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock
>> >   of pcpu 6 and queue d2v2 in pcpu's 6 deferred wakeup list
>> > - poke pcpu 6 with an IPI
>> > - in a (little, hopefully) while, pcpu 6 react to the IPI and,
>> >   _I_think_, call vcpu_wake(d2v2) ? [*]
>> You are almost right. Here, the interrupt handler does two things:
>> 1. echo back to notify_remote_vcpu_wake() that it has finished the job
>> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
>> 
>> > 
>> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>> > - on pcpu 6, __runq_tickle() is called, and it determines on which
>> >   pcpu d2v2 should run
>> At this place, the behavior of __runq_tickle() is what I mentioned
>> above: raise softirq on pcpu 6 and on the idle pcpus. Whoever is
>> faster gets d2v2.
>>
>Sort of, yes. In general, you can expect pcpu 6 going through
>csched_schedule(), and hence setting d2v2 to run, to be faster than
>SCHEDULE_SOFTIRQ being notified to someone remote, and that someone
>scheduling and getting to work stealing from pcpu 6's runqueue.
>
>However, that is the case only if d2v2 had higher priority than what is
>running on pcpu 6. If it does not, you:
> - _don't_ tickle pcpu 6
> - tickle one or more idlers, if any.
>
>So, again, it's not that you always wake and run it on pcpu 6
>

I see what you mean :-)

So I am curious why we add the vcpu to pcpu 6's runq at all when we already
know, in __runq_tickle(), that its priority is low and we won't tickle that
pcpu; we are only sure that someone, at some point in the future, will grab
pcpu 6's schedule lock and take it.

Or, in the other case -- when there are idlers and the vcpu's priority is
higher -- instead of letting pcpu 6 wake it, why not let an idler take it?
Ask someone who is free to help the busy one.

>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
>> >   following two cases:
>> >    a) if it was determined that d2v2 should run on pcpu 6
>> >       (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
>> >        - at the first softirq check, pcpu 6 will schedule
>> >        - it takes its own runqueue lock
>> >        - it finds d2v2 in there, and runs it
>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>> >        - in a (little, hopefully) while, pcpu 9 will schedule
>> >        - it takes its own runqueue lock
>> >        - if it has nothing in its runqueue (likely, if
>> >          __runq_tickle() chose it, but not necessarily and always
>> >          true, especially under high load), it looks around to
>> >          see if other pcpus have any vcpu that it can execute
>> >        - it will try to take the locks of the various pcpu's
>> >          runqueues it comes across
>> >        - if it manages to take the lock of pcpu's 6 runqueue, it
>> >          sees d2v2 in it, steal it and run it.
>> > 
>> > [*] I don't see in the code you shared what happens in reaction to
>> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually
>> > calls vcpu_wake()
>> > 
>> > So, basically, it looks to me that you're adding a level of
>> > indirection, I'm not sure for what benefit.
>
>> In my mind, we are trying to reduce the contention on the schedule
>> lock from two aspects:
>> 1. for those vcpus that would run on another pcpu, it will take
>>    wake_vcpu_lock instead of the schedule lock
>>
>I don't see how you can say "instead". It looks to me that, for those
>vcpus that would run on another pcpu, we need to take wake_vcpu_lock
>_in_addition_ to the runqueue lock.
>
>> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which
>>    also reduces the cache-coherency traffic between pcpus
>> 
>Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a
>remote lock, though.

Hmm... you are right.

>
>> > In fact, in the best case (i.e., d2v2 will actually be run on its
>> > v->processor), in the original code there is:
>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
>> > In your case, there is:
>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
>> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
>> > 
>> Our idea may look like this:
>>  - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>>  - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>>  - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
>> 
>So, it's even worse than how I imagined! :-P
>
>> > Also, as far as locking within the wakeup path is concerned, in the
>> > original code:
>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>> >   (i.e., for the time of __runq_insert() and __runq_tickle().
>> > In your case:
>> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the
>> >   deferred wakeup
>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>> > 
>> The wake_vcpu_lock is held in the first step above.
>> The runqueue lock (I think you mean the schedule lock?) won't be
>> taken in this case.
>> 
>It will be taken, for executing vcpu_wake(). From your own description
>of the mechanism above:
>
>"Our idea may look like this:
>   1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>   2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the
>      vcpu.
>   3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
>
>1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
>2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
>3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
>   csched_schedule()
>

Oh, you are right.

We try to let pcpu 6's schedule lock be grabbed only by pcpu 6 itself,
while the wake_vcpu_lock is the one that is grabbed by others.

I am thinking our change may help in a case like this.

The situation is:
* vcpu's priority is higher
* pcpu 9 is idle
* SCHEDULE_SOFTIRQ is raised on both pcpu 6 and pcpu 9

At this moment pcpu 4 is holding pcpu 6's runqueue lock, so neither pcpu 6
nor pcpu 9 can do its job until pcpu 4 releases that lock.

In our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 and pcpu 9
can do their jobs immediately. I know, this is a very corner case, and I am
not sure about the effect.
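[Editorial note: one trivial way to summarize that difference is to list which pcpu holds pcpu 6's runqueue lock during such a wakeup in each scheme; the values below are illustrative only, following the d2v2/pcpu 4/pcpu 6 example used in this thread.]

```python
def runq_lock_holders(scheme):
    """Which pcpu takes pcpu 6's runqueue lock while d2v2 is woken there."""
    if scheme == "original":
        # pcpu 4 holds it across __runq_insert()/__runq_tickle(),
        # then pcpu 6 takes it again in csched_schedule()
        return [4, 6]
    if scheme == "deferred":
        # only pcpu 6 itself: once for vcpu_wake(), once for
        # csched_schedule(); remote pcpus touch wake_vcpu_lock only
        return [6, 6]
    raise ValueError(scheme)
```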

BTW, I suddenly found that softirq is implemented as an IPI, so I am
wondering whether it is safe to raise a softirq to oneself, given that it
looks like we already hold the runqueue lock at that point. Or is it fine
because we have turned IRQs off in vcpu_wake()?

>> > 
>> > (i.e., for the time of __runq_insert() and __runq_tickle().
>> > 
>> > Which, to me, looks like both more inter-CPU communication overhead
>> > and more chances for lock contention.
>> Hmm... yes, more inter-CPU communication, but less lock contention,
>> I think.
>> 
>More inter-CPU communication, the same level of local lock contention,
>plus some additional, moderately-to-lightly contended, remote locking,
>AFAICT.
>
>> > So, again, more IPIs and more (potential) lock contention.
>> > 
>> > True the queueing into the deferred wakeup list will be very simple
>> > and quick, and hence the critical section needed for implementing
>> > it very short. But the problem is that you are not shortening
>> > something long, or splitting something long in two shorter chunks,
>> > you are _adding_ something, and thus you can't possibly win. :-/
>> > 
>> Yes, we are moving something around; let me digest it for a while.
>> 
>Sure. Let me also add that I'm not saying it's not going to work, or
>that it's not going to improve the situation.
>
>What I'm saying is that, architecturally, it's not too different from
>the current situation, so I would not expect wonders from just this.
>
>What I'd do would be to try figuring out where it is that the most
>waiting time is being accumulated, i.e., which *specific* locking
>attempts of the scheduler's runqueues' locks are the most contended,
>and focus on those first.
>
>If you're on a big host, I think the way in which Credit1 does load
>balancing (i.e., by work stealing) may be the thing to blame. But all
>these are things that only profiling and benchmarking can really tell!
>:-)
>
>As a side note, we're also working on something which, at the end of
>the day, is rather similar, although for different purposes. In fact,
>see here:
>https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html
>
>The goal is not having to disable IRQs during scheduling, and I'm
>looking at keeping the per-pcpu deferred wakeup list lockless, in order
>not to add more lock contention.
>
>If you're interested in knowing more or working on that, just let me
>know. But, of course, this has to be done on Xen upstream, not on an
>old version.
>
>Regards,
>Dario
>-- 
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Richard Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

