[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
>>> On 07.07.16 at 11:09, <george.dunlap@xxxxxxxxxx> wrote: > On 07/07/16 08:29, Jan Beulich wrote: >>>>> On 06.07.16 at 18:21, <george.dunlap@xxxxxxxxxx> wrote: >>> On Mon, Jun 20, 2016 at 9:02 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote: >>>>>>> On 18.06.16 at 01:12, <dario.faggioli@xxxxxxxxxx> wrote: >>>>> This really should not happen, but: >>>>> 1. it does happen! Investigation is ongoing here: >>>>> http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html >>>>> 2. even when 1 will be fixed it makes sense and is easy enough >>>>> to have a 'safety catch' for it. >>>>> >>>>> The reason why this is particularly bad for Credit2 is that >>>>> negative values of delta mean out of scale high load (because >>>>> of the conversion to unsigned). This, for instance in the >>>>> case of runqueue load, results in a runqueue having its load >>>>> updated to values of the order of 10000% or so, which in turns >>>>> means that the load balancer will migrate everything off from >>>>> the pCPUs in the runqueue, and leave them idle until the load >>>>> gets back to something sane... which may indeed take a while! >>>>> >>>>> This is not a fix for the problem of time going backwards. In >>>>> fact, if that happens a lot, load tracking accuracy is still >>>>> compromized, but at least the effect is a lot less bad than >>>>> before. >>>>> >>>>> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx> >>>>> --- >>>>> Cc: George Dunlap <george.dunlap@xxxxxxxxxx> >>>>> Cc: Anshul Makkar <anshul.makkar@xxxxxxxxxx> >>>>> Cc: David Vrabel <david.vrabel@xxxxxxxxxx> >>>>> --- >>>>> xen/common/sched_credit2.c | 12 ++++++++++++ >>>>> 1 file changed, 12 insertions(+) >>>>> >>>>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c >>>>> index 50f8dfd..b73d034 100644 >>>>> --- a/xen/common/sched_credit2.c >>>>> +++ b/xen/common/sched_credit2.c >>>>> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops, >>>>> else >>>>> { >>>>> delta = now - rqd->load_last_update; >>>>> + if ( unlikely(delta < 0) ) >>>>> + { >>>>> + d2printk("%s: Time went backwards? now %"PRI_stime" llu >>> %"PRI_stime"\n", >>>>> + __func__, now, rqd->load_last_update); >>>>> + delta = 0; >>>>> + } >>>>> >>>>> rqd->avgload = >>>>> ( ( delta * ( (unsigned long long)rqd->load << >>> prv->load_window_shift ) ) >>>>> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops, >>>>> else >>>>> { >>>>> delta = now - svc->load_last_update; >>>>> + if ( unlikely(delta < 0) ) >>>>> + { >>>>> + d2printk("%s: Time went backwards? now %"PRI_stime" llu >>> %"PRI_stime"\n", >>>>> + __func__, now, svc->load_last_update); >>>>> + delta = 0; >>>>> + } >>>>> >>>>> svc->avgload = >>>>> ( ( delta * ( (unsigned long long)vcpu_load << >>> prv->load_window_shift ) ) >>>> >>>> Do the absolute times really matter here? I.e. wouldn't it be more >>>> useful to simply log the value of delta? >>>> >>>> Also, may I ask you to use the L modifier in favor of the ll one, for >>>> being one byte shorter (and hence, even if just very slightly, >>>> reducing both image size and cache pressure)? >>>> >>>> And finally, instead of logging function names, could the two >>>> messages be made distinguishable by other means resulting in less >>>> data issued to the log (and potentially needing transmission over >>>> a slow serial line)? >>> >>> The reason this is under a "d2printk" is because it's really only to >>> help developers in debugging. In-tree this warning isn't even on with >>> debug=y; you have to go to the top of the file and change the #define >>> to make it even exist. >>> >>> Given that, I don't think the quibbles over the code size or the >>> length of what's logged really matter. I think we should just take it >>> as it is. >>> >>> Reviewed-by: George Dunlap <george.dunlap@xxxxxxxxxx> >> >> Oh, okay - I agree on those two parts then. But the question on the >> usefulness of absolute vs relative times remains. > > What is the usefulness of printing the relative time? If you have the > absolute time, you have some chance of catching mistakes like one of the > times being 0 or something like that; or of being able to correlate it > with another time printed somewhere else (for instance, a timestamp from > a trace record). Well, having had to deal with time going backwards elsewhere (both in the past and recently) I have always found it cumbersome to work out the delta from huge (far into the billions) absolute numbers, and therefore consider logging the delta more useful - apart from seeing at the first glance whether a particular delta is positive or negative, this also allows at almost the first glance to at least recognize the magnitude of the difference. But anyway ... > In any case, I think it's really a bike shed. Dario is the one who has > used this error message to find an actual bug recently, so I'll let him > decide what he thinks the most useful thing to print here is. ... fine with me; it was just a question after all. Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |