
Re: [Xen-devel] [Help] Trigger Watchdog when adding an IPI in vcpu_wake



On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote:
>On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
>> On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
>> > 
>> > Do you mind sharing just a bit more, such as:
>> >  - number of pcpus
>> >  - number of vcpus of the various VMs
>> 160 pcpus
>> 16 vcpus in VM and 8 VMs
>> 
>So, 16x8=128, which means you're not even oversubscribing. Maybe some
>of the pcpus are hyperthreads and not full cores (are they?), but
>still, this is not something I'd describe as "super intensive cpu load".
>
>Oh, well, there's dom0 of course. So, if dom0 has more than 160-128=32
>vcpus (is this the case?), you technically are oversubscribing. But
>then, what does dom0 do? Is it very busy doing some stuff on its own,
>or running the backends/etc for the VMs? Can you check that with top,
>xentop, and alike?
>
>If the system is not overbooked, it's a bit strange that the scheduler
>is the bottleneck.

I looked at the original data again, and I don't see any detailed data
describing the dom0 configuration.

We don't have access to the client's exact user model, but we guess it looks
like this.


     +--------+     +-----------+         +-------------+
     |Timer   | --->|Coordinator| ---+--->|Worker       |
     +--------+     +-----------+    |    +-------------+
                                     |
                                     |    +-------------+
                                     +--->|Worker       |
                                     |    +-------------+
                                     |
                                     |    +-------------+
                                     +--->|Worker       |
                                          +-------------+

One Coordinator drives several Workers based on a high resolution timer.
Periodically, the Workers are woken up by the Coordinator, so at one moment
many Workers are woken up and trigger vcpu_wake() in Xen.

Could this be a possible reason for the burst of vcpu_wake() calls?
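
To make the guess a bit more concrete, below is a rough and purely hypothetical
sketch of the pattern we imagine inside the guest. The pthread/timer details
and all the names are our assumptions, nothing we have actually seen in the
client's code:

    #include <pthread.h>
    #include <time.h>

    #define NR_WORKERS 16

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  tick = PTHREAD_COND_INITIALIZER;
    static unsigned long   generation;

    static void *worker(void *arg)
    {
        unsigned long seen = 0;
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (generation == seen)           /* sleep until the next tick */
                pthread_cond_wait(&tick, &lock);
            seen = generation;
            pthread_mutex_unlock(&lock);
            /* ... do the periodic work ... */
        }
        return NULL;
    }

    static void coordinator(void)
    {
        struct timespec period = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1ms */
        for (;;) {
            nanosleep(&period, NULL);            /* the high resolution timer */
            pthread_mutex_lock(&lock);
            generation++;
            pthread_mutex_unlock(&lock);
            pthread_cond_broadcast(&tick);       /* wake ALL workers at once */
        }
    }

    int main(void)
    {
        pthread_t t;
        for (int i = 0; i < NR_WORKERS; i++)
            pthread_create(&t, NULL, worker, NULL);
        coordinator();
        return 0;
    }

Every broadcast makes all the blocked Workers runnable at the same instant,
which on the host side would indeed show up as a burst of vcpu_wake() calls.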

>
>> Yes, the "Blue Screen" is what we want to mimic the behavior from
>> client.
>> 
>> The "Blue Screen" will force the hypervisor to do load balance in my
>> mind.
>> 
>I've no idea what this means (but I don't know much of Windows and of
>what happens when it crashes with a blue screen).
>
>> > I'll have a quick look at how __runq_tickle() looks like in Xen
>> > 4.1,
>> > but there's very few chances I'll be available to provide detailed
>> > review, advice, testing, etc., on such an old codebase. :-(
>> > 
>> > > 
>> > > By looking at the code, what I missed may be in __runq_tickle():
>> > > in case there are idle pcpus, SCHEDULE_SOFTIRQ is also raised on
>> > > them. By doing so, those idle pcpus would steal other busy tasks
>> > > and may by chance get d2v2?
>> > > 
>> > Yes, it looks like, in Xen 4.1, this is more or less what happens.
>> > The
>> > idea is to always tickle pcpu 6, if the priority of the waking vcpu
>> > is
>> > higher than what is running there. 
>> > 
>> Hmm... in case there are idle pcpus and the priority of the waking
>> vcpu is
>> higher than what is running on pcpu 6, would pcpu 6 have more chance
>> to run it?
>> or would another idle pcpu steal it from pcpu6? or do they have equal
>> chance?
>> 
>No, it works like this:
>
> - pcpu 6 is either idle or it is running someone already (say d3v4)
>   + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ)
>     pcpu 6 itself, and we're done

OK, it looks like the behavior differs between 4.1 and current upstream.
Upstream just raises the softirq on pcpu6 when it is idle, while 4.1 raises
the softirq on both pcpu6 and other idlers even when pcpu6 is idle.

I think current upstream is cleverer.

>   + if pcpu 6 is running d3v4 and there is no other idle pcpu:
>     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 and we're done
>     * if prio(d2v2) < prio(d3v4) we just don't tickle anyone
>   + if pcpu 6 is running d3v4 and there are some idle pcpus:
>     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 _AND_ one or  [1]
>       more of the idle pcpus
>     * if prio(d2v2) < prio(d3v4) we _ONLY_ tickle one or more   [2]
>       of the idle pcpus
>
>Idea behind [1] is that d2v2 should preempt d3v4 on pcpu 6, and that's
>why we tickle pcpu 6. However, that would mean that d3v4 would be put
>back in pcpu's 6 runqueue. So, if there are idle pcpus, we tickle them
>so that one can come and steal d3v4.
>
>Idea behind [2] is that d3v4 should continue to run on pcpu 6, while
>d2v2 will be put in pcpu's 6 runqueue. However, if there are idle
>pcpus, we tickle them so that one can come and steal d2v2.
>
>What really happens is not always what is expected, because it's
>possible that, even if prio(d2v2)>prio(d3v4), an idler is quicker in
>waking up and stealing d2v2 from pcpu's 6 runqueue where it was stashed
>while pcpu 6 itself reschedules and enacts the preemption... But that's
>unlikely and, all in all, not catastrophic (although, of course, worse
>in terms of overhead, locality, etc)
>
>> > If pcpu 6 was not idle, we also tickle one or more idle pcpus so
>> > that:
>> >  - if the waking vcpu preempted what was running on pcpu 6, an
>> > idler
>> >    can pick-up ("steal") such preempted vcpu and run it;
>> >  - if the waking vcpu ended up in the runqueue, an idler can steal
>> > it
>> > 
>> Hmm... I don't get the difference between these two cases.
>> 
>> Looks both are an idler steal the vcpu.
>> 
>I hope it's more clear now. :-)
>

Yep, thanks. :-)
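
Just to double check my understanding, here is how I would write down the
decision you described, as a standalone sketch. This is my own simplification
with made-up names, not the real __runq_tickle() from 4.1:

    #include <stdbool.h>
    #include <stdio.h>

    struct tickle_result {
        bool tickle_target;   /* raise SCHEDULE_SOFTIRQ on the target pcpu */
        bool tickle_idlers;   /* raise SCHEDULE_SOFTIRQ on one or more idlers */
    };

    /* target_idle: is the pcpu the vcpu was queued on (pcpu 6 above) idle?
     * prio_new/prio_cur: priority of the waking vcpu / of what runs there.
     * have_idlers: are there other idle pcpus in the system? */
    static struct tickle_result
    tickle_sketch(bool target_idle, int prio_new, int prio_cur, bool have_idlers)
    {
        struct tickle_result r = { false, false };

        if (target_idle) {
            /* Tickle the idle target and we are done (per my earlier
             * reading, 4.1 may also tickle other idlers here, while
             * current upstream does not). */
            r.tickle_target = true;
            return r;
        }

        if (prio_new > prio_cur) {
            /* Case [1]: the new vcpu should preempt; idlers are tickled
             * too, so one may come and steal the preempted vcpu. */
            r.tickle_target = true;
            r.tickle_idlers = have_idlers;
        } else {
            /* Case [2]: the new vcpu waits in the runqueue; only idlers
             * are tickled, so one may come and steal the new vcpu. */
            r.tickle_idlers = have_idlers;
        }
        return r;
    }

    int main(void)
    {
        /* Example: target busy, waking vcpu has higher priority, idlers exist. */
        struct tickle_result r = tickle_sketch(false, 2, 1, true);
        printf("tickle target: %d, tickle idlers: %d\n",
               r.tickle_target, r.tickle_idlers);
        return 0;
    }

If I got the [1]/[2] cases right, nothing at all is tickled when the priority
is lower and there is no idler, which matches what we discuss below.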

>> > Yes. But both in 4.1 and in current staging, we only raise
>> > SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the
>> > waking
>> > vcpu is higher and it should preempt the currently running vcpu.
>> > 
>> Oh, you are right. I didn't catch this either.
>> 
>> This means in case 
>> a) the priority is lower than current one 
>> b) no idler in system
>> 
>> The vcpu will sit quietly in the runq, waiting for the next schedule.
>> 
>Well, yes. It will stay in pcpu's 6 runqueue until either:
> - what's running on pcpu 6 blocks, or finishes its credits, etc.,
>   and pcpu 6 reschedules and picks it up
> - some other pcpu becomes idle and, when poking around the various
>   pcpus for stealing work, picks it up
>
>Of course, both depend on in what position in pcpu's 6 runqueue the
>vcpu is when the specific event happens.
>
>This is a limit of the Credit1 scheduler, especially of the version
>that you find in 4.1. Newer code is a bit better wrt this (although not
>dramatically... that has not been possible! :-/).
>
>> > However, that is the case only if d2v2 had higher priority than
>> > what is
>> > running on pcpu 6. If it does not, you:
>> >  - _don't_ tickle pcpu 6
>> >  - tickle one or more idlers, if any.
>> > 
>> > So, again, it's not that you always wake and run it on pcpu 6
>> > 
>> I see what you mean :-)
>> 
>> So I am curious about why we add this in pcpu's 6 queue, when we know
>> the
>> priority is low in __runq_tickle() and won't tickle that. And we are
>> sure
>> someone in the future will grab pcpu's 6 schedule lock and take it.
>> 
>Yes, but what should we do? We've woken it up and its v->processor
>points to pcpu 6, so the lock that vcpu_schedule_lock() takes is pcpu's
>6 runqueue lock.
>
>To queue it on some other pcpu we'd need to basically migrate it (i.e.,
>at least changing the value of v->processor, but also other things,
>like checking hard affinity, etc). And we also would need to take the
>runqueue lock of another pcpu's runqueue. All this in the wakeup path,
>which will become slow and complex. And we'd need all the work stealing
>logic (in load_balance(), etc) to still exist, to cover the case of
>work needing to be stolen when wakeups are not involved (i.e., it's not
>that we make other part of the scheduler's code less complex).
>
>So, basically, of course there probably were alternatives, but the
>basic model, pretty much everywhere, is "queue where it was and tickle
>others to come pick it up if suitable", and that's what's done here too
>(and consistency is a very good thing :-D).
>
>> Or in this case -- when there are idlers and vcpu's priority is
>> higher,
>> instead of let pcpu 6 to wake this, but let an idler to take it? Ask
>> someone
>> who is free to help a busy one.
>> 
>And in fact, in more recent version of Xen, I made the code do
>something very close to what you're suggesting. Here's the comment that
>you can find where all this logic is implemented (it's way more
>complicated than this, and than the code in 4.1, because it takes soft-
>affinity into account).
>
>    /*
>     * If the pcpu is idle, or there are no idlers and the new
>     * vcpu is a higher priority than the old vcpu, run it here.
>     *
>     * If there are idle cpus, first try to find one suitable to run
>     * new, so we can avoid preempting cur.  If we cannot find a
>     * suitable idler on which to run new, run it here, but try to
>     * find a suitable idler on which to run cur instead.
>     */
>

I like this idea.
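
For my own notes, the policy in that comment would look roughly like the
sketch below (invented names such as find_suitable_idler(), soft affinity
ignored; this is not the actual upstream code):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helper: return an idle pcpu suitable for the given vcpu,
     * or -1 if there is none.  Stubbed out here. */
    static int find_suitable_idler(int vcpu_id)
    {
        (void)vcpu_id;
        return -1;                  /* pretend no suitable idler was found */
    }

    /* Returns the pcpu that should run the waking vcpu "new" (-1 means new
     * just waits in this pcpu's runqueue).  If "cur" has to be preempted,
     * *steal_cur_idler is set to an idler that should be tickled to come
     * and pick it up, or -1 if there is none. */
    static int
    place_new_sketch(int this_pcpu, bool this_pcpu_idle, bool have_idlers,
                     int new_id, int new_prio, int cur_id, int cur_prio,
                     int *steal_cur_idler)
    {
        *steal_cur_idler = -1;

        /* "If the pcpu is idle, or there are no idlers and the new vcpu
         *  is a higher priority than the old vcpu, run it here." */
        if (this_pcpu_idle || (!have_idlers && new_prio > cur_prio))
            return this_pcpu;

        if (have_idlers) {
            /* "If there are idle cpus, first try to find one suitable to
             *  run new, so we can avoid preempting cur." */
            int idler = find_suitable_idler(new_id);
            if (idler >= 0)
                return idler;

            /* "If we cannot find a suitable idler on which to run new,
             *  run it here, but try to find a suitable idler on which to
             *  run cur instead." */
            *steal_cur_idler = find_suitable_idler(cur_id);
            return this_pcpu;
        }

        /* No idlers and new's priority is not higher: new simply waits in
         * this pcpu's runqueue and nobody is tickled. */
        return -1;
    }

    int main(void)
    {
        int steal = -1;
        int where = place_new_sketch(6, false, true, /* new: */ 22, 2,
                                     /* cur: */ 34, 1, &steal);
        printf("run new on pcpu %d, tickle idler %d for cur (-1: none)\n",
               where, steal);
        return 0;
    }

The part I like is exactly what you said: when possible it asks an idler to
take the new vcpu instead of preempting the one already running.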

>> > It will be taken, for executing vcpu_wake(). From your own
>> > description
>> > of the mechanism above:
>> > 
>> > "Our idea may looks like this:
>> >    1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>> >    2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the
>> >       vcpu.
>> >    3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
>> > 
>> > 1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
>> > 2) runs on pcpu 6 and takes pcpu'6 6 runqueue lock, for vcpu_wake()
>> > 3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
>> >    csched_schedule()
>> > 
>> Oh, you are right.
>> 
>> We try to let pcpu's 6 schedule lock be grabbed only by pcpu 6 itself,
>> while it looks like the wake_vcpu_lock can be grabbed by other pcpus.
>> 
>I understand that, and I can't say that this is not going to improve
>things for you. All I'm saying is that you're potentially reducing the
>overhead of one "operation", but at the same time adding some overhead
>in other "operations".
>
>Depending on a multiplicity of aspects, the net effect could be
>positive or negative.
>
>You asked for general advice on the solution, and help with the
>code. I can't help much debugging hangs in Xen 4.1. My opinion on the
>solution is that, architecturally, it is not something that I'd call an
>absolute improvement.
>
>Let me put it this way: if you'd send a patch series implementing what
>we're talking about for upstream Xen, I would not Ack it without solid
>data, coming from benchmarks run on different platform, with different
>workloads and under different load conditions, that clearly shows
>performance are improved.
>
>Translated to your case: AFAICT, you need something on
>top of 4.1. If you can hack this up quickly and try it, then good. If
>this is taking a lot of time, I'm not sure I'd invest such time on this
>(e.g., I'd rather try to narrow down even more the cause of the issue
>you're seeing and, after that, look at other solutions). But of course
>I'm not you, this is only my opinion, and you have absolutely no reason to
>follow my advice. :-) :-)
>

Actually, I appreciate your advice :-)

We found the schedule lock to be a major hot spot in our scenario, so the
direct thought was to try to avoid grabbing it. Hmm... admittedly our idea is
not that sparkling.
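
To make our idea more concrete, this is roughly the scheme we had in mind, in
a much simplified form. wake_vcpu_lock and WAKEUP_TASK_VECTOR come from my
earlier description; the list handling, the vector number and the helper names
here are just illustrative:

    #include <pthread.h>
    #include <stddef.h>

    struct vcpu {
        struct vcpu *next;                   /* linkage for the wakeup list */
        /* ... */
    };

    struct pcpu {
        pthread_spinlock_t wake_vcpu_lock;   /* taken by remote pcpus       */
        struct vcpu       *wakeup_list;      /* woken but not yet queued    */

        pthread_spinlock_t schedule_lock;    /* taken only by this pcpu     */
        struct vcpu       *runq;             /* (very) simplified runqueue  */
    };

    #define WAKEUP_TASK_VECTOR 0xf2          /* made-up vector number */

    static void send_ipi(struct pcpu *t, int vector)
    {
        (void)t; (void)vector;               /* stand-in for the real IPI */
    }

    static void raise_schedule_softirq_self(void)
    {
        /* stand-in for setting the local SCHEDULE_SOFTIRQ pending bit */
    }

    /* Step 1: runs on the waking pcpu (e.g. pcpu 4). */
    void wakeup_remote(struct pcpu *target, struct vcpu *v)
    {
        pthread_spin_lock(&target->wake_vcpu_lock);
        v->next = target->wakeup_list;       /* push onto the wakeup list */
        target->wakeup_list = v;
        pthread_spin_unlock(&target->wake_vcpu_lock);

        send_ipi(target, WAKEUP_TASK_VECTOR);
    }

    /* Steps 2 and 3: run on the target pcpu (e.g. pcpu 6), out of the IPI. */
    void drain_wakeup_list(struct pcpu *self)
    {
        struct vcpu *v;

        pthread_spin_lock(&self->wake_vcpu_lock);
        v = self->wakeup_list;               /* detach the whole list */
        self->wakeup_list = NULL;
        pthread_spin_unlock(&self->wake_vcpu_lock);

        pthread_spin_lock(&self->schedule_lock);   /* local lock only */
        while (v != NULL) {
            struct vcpu *next = v->next;
            /* ... insert v into self->runq here (omitted) ... */
            v = next;
        }
        pthread_spin_unlock(&self->schedule_lock);

        raise_schedule_softirq_self();       /* then actually reschedule */
    }

The hope was that the short wake_vcpu_lock critical section replaces the
remote acquisition of the schedule lock, which then is only ever taken by its
own pcpu. As you point out, whether that is a net win depends on how contended
wake_vcpu_lock itself becomes.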

>> I am thinking our change may benefit like this .
>> 
>> The situation is:
>> * vcpu's priority is higher
>> * pcpu 9 is idle
>> * pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ
>> 
>> At this moment pcpu 4 is holding pcpu 6 runqueue lock, neither pcpu 6
>> nor pcpu 9 could do their job until pcpu 4 release pcpu 6 runqueue
>> lock.
>> 
>Yes, but pcpu 4 releases pcpu's 6 lock right after having raised the
>softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to grab
>their own locks for running csched_schedule(), after having reacted to
>the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.
>
>BTW, you say that pcpu 9 is idle, but then that "it can't do his job"
>because pcpu 4 holds the lock of pcpu 6... I don't think I understand
>what you mean with this.

After pcpu9 gets the softirq from pcpu4, it tries to schedule and do load
balancing, which would take pcpu6's schedule lock.

As I thought previously, pcpu6's schedule lock is held by pcpu4, so pcpu9 has
to wait until pcpu4 releases it. This is what I meant by "it can't do its job
immediately".

But, as you mentioned, compared with pcpu9, pcpu4 would release pcpu6's
schedule lock earlier. :-)

A little complicated; I hope I described it clearly. :-)

>
>> While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6
>> and pcpu 9
>> could do their job immediately. Ah I know, very corner, not sure
>> about the
>> effect.
>> 
>Again, I don't get what "do their job" and "immediately" means.
>
>In both cases the sequence of events is as follows:
> 1. queue the task in pcpu's 6 queue (either the runq, in the original
>    code, or the wakeup list, with your modifications)
> 2. tickle pcpu 6
> 3. pcpu 6 reacts to being tickled
> 4. pcpu 6 starts running the new vcpu
>
>In the original code, 1 and 4 are serialized by the same lock; with your
>modifications, half of 1 can happen using a different lock. But 1 and 4
>almost never overlap anyway, so I don't think you're gaining
>much parallelism.
>
>What (maybe) you're achieving is that 1 interferes a little bit less
>with other things that want to happen on pcpu 6 (e.g., because of
>other vcpus wrt the one waking up), which may be a good thing, if there
>are a lot of wakeups. But then, again, if there are a lot of wakeups,
>you probably will see contention on the wake_vcpu_lock!
>
>> BTW, I suddenly found softirq is implemented as an IPI. So I am
>> thinking
>> whether it is safe to raise a softirq to itself?
>>
>Softirqs are implemented by means of IPIs, but when they're raised for
>other processors. Self-raising a softirq does not involve IPIs, just
>setting a bit in the pending mask (see raise_softirq() in
>xen/common/softirq.c).
>
>> Since it looks we already hold
>> the runqueue lock. Or because we have turned the irq off in
>> vcpu_wake()?
>> 
>We hold the runqueue lock, so what? Should we schedule that
>immediately? That would probably be possible, but it's again an
>architectural decision, and a matter of consistency. In Xen, we
>schedule when we find the SCHEDULE_SOFTIRQ to be pending.
>
>Anyhow, I would not expect wonders by embedding the re-scheduling code
>directly here either.
>
>Regards,
>Dario
>-- 
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
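
Thanks for the raise_softirq() pointer. The way I picture the difference now
is something like the generic sketch below (made-up names, not the code in
xen/common/softirq.c):

    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_CPUS 160

    static _Atomic unsigned long softirq_pending[NR_CPUS];

    static unsigned int this_cpu(void)
    {
        return 0;                            /* pretend we run on cpu 0 */
    }

    static void send_softirq_ipi(unsigned int cpu)
    {
        printf("IPI -> cpu %u\n", cpu);      /* stand-in for the real IPI */
    }

    /* Raising on the local cpu only sets a bit in the per-cpu pending
     * mask; no IPI is involved.  The bit is noticed later, e.g. on the
     * way back to the guest or in the idle loop. */
    static void raise_softirq_local(unsigned int nr)
    {
        atomic_fetch_or(&softirq_pending[this_cpu()], 1UL << nr);
    }

    /* Raising on another cpu sets the bit and kicks that cpu with an IPI
     * so it notices the pending work. */
    static void raise_softirq_on(unsigned int cpu, unsigned int nr)
    {
        atomic_fetch_or(&softirq_pending[cpu], 1UL << nr);
        if (cpu != this_cpu())
            send_softirq_ipi(cpu);
    }

    int main(void)
    {
        raise_softirq_local(1);              /* no IPI printed */
        raise_softirq_on(6, 1);              /* one IPI printed, to cpu 6 */
        return 0;
    }

So self-raising from vcpu_wake() while holding the runqueue lock looks fine:
it only sets a bit, and the actual rescheduling happens later, when the
pending bit is processed, as you said.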



-- 
Richard Yang\nHelp you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

