
Re: [Xen-devel] [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards



>>> On 18.06.16 at 01:12, <dario.faggioli@xxxxxxxxxx> wrote:
> This really should not happen, but:
>  1. it does happen! Investigation is ongoing here:
>     http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html 
>  2. even once 1 is fixed, it makes sense and is easy enough
>     to have a 'safety catch' for it.
> 
> The reason why this is particularly bad for Credit2 is that
> negative values of delta turn into absurdly high load values
> (because of the conversion to unsigned). For instance, in the
> case of runqueue load, this results in a runqueue having its load
> updated to values of the order of 10000% or so, which in turn
> means that the load balancer will migrate everything off
> the pCPUs in the runqueue, and leave them idle until the load
> gets back to something sane... which may indeed take a while!
> 
> This is not a fix for the problem of time going backwards. In
> fact, if that happens a lot, load tracking accuracy is still
> compromised, but at least the effect is a lot less bad than
> before.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> ---
> Cc: George Dunlap <george.dunlap@xxxxxxxxxx>
> Cc: Anshul Makkar <anshul.makkar@xxxxxxxxxx>
> Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
> ---
>  xen/common/sched_credit2.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 50f8dfd..b73d034 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
>      else
>      {
>          delta = now - rqd->load_last_update;
> +        if ( unlikely(delta < 0) )
> +        {
> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> +                     __func__, now, rqd->load_last_update);
> +            delta = 0;
> +        }
>  
>          rqd->avgload =
>              ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
>      else
>      {
>          delta = now - svc->load_last_update;
> +        if ( unlikely(delta < 0) )
> +        {
> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> +                     __func__, now, svc->load_last_update);
> +            delta = 0;
> +        }
>  
>          svc->avgload =
>              ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )

Do the absolute times really matter here? I.e. wouldn't it be more
useful to simply log the value of delta?
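
For illustration only, here is a minimal standalone sketch of why a negative delta blows up in the load expression, and of what logging just delta could look like. It is not Xen code: s_time_t is modelled as int64_t, printf stands in for d2printk, and clamp_delta() and scaled() are hypothetical helpers that only mirror the shape of the patched code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: stand-in for Xen's signed time type. */
typedef int64_t s_time_t;

/* Hypothetical helper: clamp a negative delta to 0, logging only delta. */
static s_time_t clamp_delta(s_time_t now, s_time_t last_update)
{
    s_time_t delta = now - last_update;

    if ( delta < 0 )
    {
        /* printf stands in for d2printk here. */
        printf("Time went backwards? delta %" PRId64 "\n", delta);
        delta = 0;
    }

    return delta;
}

/*
 * Same shape as the patched expression: a signed delta multiplied by an
 * unsigned long long. The usual arithmetic conversions turn delta into
 * an unsigned value first, so -1 becomes 2^64 - 1.
 */
static unsigned long long scaled(s_time_t delta, unsigned long long load,
                                 unsigned int shift)
{
    assert(shift < 64);
    return delta * (load << shift);
}
```

With a shift of 18, scaled(-1, 1, 18) evaluates to 2^64 - 2^18 rather than a small negative number, while scaled(clamp_delta(100, 200), 1, 18) is 0.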

Also, may I ask you to use the L modifier instead of the ll one, for
being one byte shorter (and hence, even if just very slightly,
reducing both image size and cache pressure)?

And finally, instead of logging function names, could the two
messages be made distinguishable by other means resulting in less
data issued to the log (and potentially needing transmission over
a slow serial line)?
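
One possible way to do that, sketched under assumptions rather than taken from the patch: replace the function name with a one-character tag per call site. The helper name, the tags ('r' for the runqueue path, 'v' for the vCPU path) and the message text are all hypothetical, and snprintf stands in for the hypervisor's logging.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: stand-in for Xen's signed time type. */
typedef int64_t s_time_t;

/*
 * Hypothetical helper: format a short message that identifies the call
 * site by a single character instead of the full function name.
 * Returns what snprintf returns (the length the message needs).
 */
static int format_backwards_msg(char *buf, size_t len, char tag,
                                s_time_t delta)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%c: time went backwards, delta %" PRId64 "\n",
                    tag, delta);
}
```

The runqueue path would pass 'r' and the vCPU path 'v', so the two messages stay distinguishable while each line stays short.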

Jan


