[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[xen stable-4.12] credit2: avoid vCPUs to ever reach lower credits than idle



commit 4c69d1c2dbe0ad1f5343852d9e596fe870892a6c
Author:     Dario Faggioli <dfaggioli@xxxxxxxx>
AuthorDate: Thu Apr 9 09:36:08 2020 +0200
Commit:     Jan Beulich <jbeulich@xxxxxxxx>
CommitDate: Thu Apr 9 09:36:08 2020 +0200

    credit2: avoid vCPUs to ever reach lower credits than idle
    
    There have been report of stalls of guest vCPUs, when Credit2 was used.
    It seemed like these vCPUs were not getting scheduled for very long
    time, even under light load conditions (e.g., during dom0 boot).
    
    Investigations led to the discovery that --although rarely-- it can
    happen that a vCPU manages to run for very long timeslices. In Credit2,
    this means that, when runtime accounting happens, the vCPU will lose a
    large quantity of credits. This in turn may lead to the vCPU having less
    credits than the idle vCPUs (-2^30). At this point, the scheduler will
    pick the idle vCPU, instead of the ready to run vCPU, for a few
    "epochs", which often times is enough for the guest kernel to think the
    vCPU is not responding and crashing.
    
    An example of this situation is shown here. In fact, we can see d0v1
    sitting in the runqueue while all the CPUs are idle, as it has
    -1254238270 credits, which is smaller than -2^30 = â??1073741824:
    
        (XEN) Runqueue 0:
        (XEN)   ncpus              = 28
        (XEN)   cpus               = 0-27
        (XEN)   max_weight         = 256
        (XEN)   pick_bias          = 22
        (XEN)   instload           = 1
        (XEN)   aveload            = 293391 (~111%)
        (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
        (XEN)   tickled: 
00,00000000,00000000,00000000,00000000,00000000,00000000
        (XEN)   fully idle cores: 
00,00000000,00000000,00000000,00000000,00000000,0fffffff
        [...]
        (XEN) Runqueue 0:
        (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
        (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
        [...]
        (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
        (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
        (XEN) RUNQ:
        (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 
(~100%)
    
    We certainly don't want, under any circumstance, this to happen.
    Let's, therefore, define a minimum amount of credits a vCPU can have.
    During accounting, we make sure that, for however long the vCPU has
    run, it will never get to have less than such minimum amount of
    credits. Then, we set the credits of the idle vCPU to an even
    smaller value.
    
    NOTE: investigations have been done about _how_ it is possible for a
    vCPU to execute for so much time that its credits becomes so low. While
    still not completely clear, there are evidence that:
    - it only happens very rarely,
    - it appears to be both machine and workload specific,
    - it does not look to be a Credit2 (e.g., as it happens when
      running with Credit1 as well) issue, or a scheduler issue.
    
    This patch makes Credit2 more robust to events like this, whatever
    the cause is, and should hence be backported (as far as possible).
    
    Reported-by: Glen <glenbarney@xxxxxxxxx>
    Reported-by: Tomas Mozes <hydrapolic@xxxxxxxxx>
    Signed-off-by: Dario Faggioli <dfaggioli@xxxxxxxx>
    Reviewed-by: George Dunlap <george.dunlap@xxxxxxxxxx>
    master commit: 36f3662f27dec32d76c0edb4c6b62b9628d6869d
    master date: 2020-04-03 10:45:43 +0200
---
 tools/xentrace/formats     |  2 +-
 tools/xentrace/xenalyze.c  |  5 ++---
 xen/common/sched_credit2.c | 53 ++++++++++++++++++++++++----------------------
 3 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index d6e7e3f800..8f126f65f1 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -55,7 +55,7 @@
 0x00022204  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_add
 0x00022205  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle_check   [ 
dom:vcpu = 0x%(1)08x, credit = %(2)d, score = %(3)d ]
 0x00022206  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle         [ cpu = 
%(1)d ]
-0x00022207  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_reset   [ 
dom:vcpu = 0x%(1)08x, cr_start = %(2)d, cr_end = %(3)d, mult = %(4)d ]
+0x00022207  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_reset   [ 
dom:vcpu = 0x%(1)08x, cr_start = %(2)d, cr_end = %(3)d ]
 0x00022208  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:sched_tasklet
 0x00022209  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:update_load
 0x0002220a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_assign    [ 
dom:vcpu = 0x%(1)08x, rq_id = %(2)d ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index aa894673ad..d3c8368e9d 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7718,13 +7718,12 @@ void sched_process(struct pcpu_info *p)
                 struct {
                     unsigned int vcpuid:16, domid:16;
                     int credit_start, credit_end;
-                    unsigned int multiplier;
                 } *r = (typeof(r))ri->d;
 
                 printf(" %s csched2:reset_credits d%uv%u, "
-                       "credit_start = %d, credit_end = %d, mult = %u\n",
+                       "credit_start = %d, credit_end = %d\n",
                        ri->dump_header, r->domid, r->vcpuid,
-                       r->credit_start, r->credit_end, r->multiplier);
+                       r->credit_start, r->credit_end);
             }
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 8):  /* SCHED_TASKLET    */
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index e24c1f7762..4e7f0aa29e 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -228,11 +228,21 @@
  */
 #define CSCHED2_CREDIT_INIT          MILLISECS(10)
 /*
+ * Minimum amount of credits VMs can have. Ideally, no VM would get
+ * close to this (unless a vCPU manages to execute for really long
+ * time uninterrupted). In case it happens, it makes no sense to
+ * track even deeper undershoots.
+ *
+ * NOTE: If making this smaller than -CSCHED2_CREDIT_INIT, adjust
+ * reset_credit() accordingly.
+ */
+#define CSCHED2_CREDIT_MIN           (-CSCHED2_CREDIT_INIT)
+/*
  * Amount of credit the idle vcpus have. It never changes, as idle
  * vcpus does not consume credits, and it must be lower than whatever
  * amount of credit 'regular' vcpu would end up with.
  */
-#define CSCHED2_IDLE_CREDIT          (-(1U<<30))
+#define CSCHED2_IDLE_CREDIT          (CSCHED2_CREDIT_MIN-1)
 /*
  * Carryover: How much "extra" credit may be carried over after
  * a reset.
@@ -775,10 +785,15 @@ static int get_fallback_cpu(struct csched2_vcpu *svc)
 static void t2c_update(struct csched2_runqueue_data *rqd, s_time_t time,
                           struct csched2_vcpu *svc)
 {
-    uint64_t val = time * rqd->max_weight + svc->residual;
+    int64_t val = time * rqd->max_weight + svc->residual;
 
     svc->residual = do_div(val, svc->weight);
-    svc->credit -= val;
+    /* Getting to lower credit than CSCHED2_CREDIT_MIN makes no sense. */
+    val = svc->credit - val;
+    if ( unlikely(val < CSCHED2_CREDIT_MIN) )
+        svc->credit = CSCHED2_CREDIT_MIN;
+    else
+        svc->credit = val;
 }
 
 static s_time_t c2t(struct csched2_runqueue_data *rqd, s_time_t credit, struct 
csched2_vcpu *svc)
@@ -1618,28 +1633,25 @@ static void reset_credit(const struct scheduler *ops, 
int cpu, s_time_t now,
 {
     struct csched2_runqueue_data *rqd = c2rqd(ops, cpu);
     struct list_head *iter;
-    int m;
+    int reset = CSCHED2_CREDIT_INIT;
 
     /*
      * Under normal circumstances, snext->credit should never be less
      * than -CSCHED2_MIN_TIMER.  However, under some circumstances, a
      * vcpu with low credits may be allowed to run long enough that
-     * its credits are actually less than -CSCHED2_CREDIT_INIT.
+     * its credits are actually much lower than that.
      * (Instances have been observed, for example, where a vcpu with
      * 200us of credit was allowed to run for 11ms, giving it -10.8ms
      * of credit.  Thus it was still negative even after the reset.)
      *
      * If this is the case for snext, we simply want to keep moving
-     * everyone up until it is in the black again.  This fair because
-     * none of the other vcpus want to run at the moment.
-     *
-     * Rather than looping, however, we just calculate a multiplier,
-     * avoiding an integer division and multiplication in the common
-     * case.
+     * everyone up until it is in the black again. This means that,
+     * since CSCHED2_CREDIT_MIN is -CSCHED2_CREDIT_INIT, we need to
+     * actually add 2*CSCHED2_CREDIT_INIT.
      */
-    m = 1;
-    if ( snext->credit < -CSCHED2_CREDIT_INIT )
-        m += (-snext->credit) / CSCHED2_CREDIT_INIT;
+    ASSERT(snext->credit >= CSCHED2_CREDIT_MIN);
+    if ( unlikely(snext->credit == CSCHED2_CREDIT_MIN) )
+        reset += CSCHED2_CREDIT_INIT;
 
     list_for_each( iter, &rqd->svc )
     {
@@ -1670,15 +1682,7 @@ static void reset_credit(const struct scheduler *ops, 
int cpu, s_time_t now,
         }
 
         start_credit = svc->credit;
-
-        /*
-         * Add INIT * m, avoiding integer multiplication in the common case.
-         */
-        if ( likely(m==1) )
-            svc->credit += CSCHED2_CREDIT_INIT;
-        else
-            svc->credit += m * CSCHED2_CREDIT_INIT;
-
+        svc->credit += reset;
         /* "Clip" credits to max carryover */
         if ( svc->credit > CSCHED2_CREDIT_INIT + CSCHED2_CARRYOVER_MAX )
             svc->credit = CSCHED2_CREDIT_INIT + CSCHED2_CARRYOVER_MAX;
@@ -1690,19 +1694,18 @@ static void reset_credit(const struct scheduler *ops, 
int cpu, s_time_t now,
             struct {
                 unsigned vcpu:16, dom:16;
                 int credit_start, credit_end;
-                unsigned multiplier;
             } d;
             d.dom = svc->vcpu->domain->domain_id;
             d.vcpu = svc->vcpu->vcpu_id;
             d.credit_start = start_credit;
             d.credit_end = svc->credit;
-            d.multiplier = m;
             __trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
                         sizeof(d),
                         (unsigned char *)&d);
         }
     }
 
+    ASSERT(snext->credit > 0);
     SCHED_STAT_CRANK(credit_reset);
 
     /* No need to resort runqueue, as everyone's order should be the same. */
--
generated by git-patchbot for /home/xen/git/xen.git#stable-4.12



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.