[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 34/47] xen/sched: add fall back to idle vcpu when scheduling unit



On 24.09.19 12:53, Jan Beulich wrote:
On 14.09.2019 10:52, Juergen Gross wrote:
When scheduling an unit with multiple vcpus there is no guarantee all
vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back to
idle vcpu of the current cpu in that case. This requires to store the
correct schedule_unit pointer in the idle vcpu as long as it used as
fallback vcpu.

In order to modify the runstates of the correct vcpus when switching
schedule units merge sched_unit_runstate_change() into
sched_switch_units() and loop over the affected physical cpus instead
of the unit's vcpus. This in turn requires an access function to the
current variable of other cpus.

Today context_saved() is called in case previous and next vcpus differ
when doing a context switch. With an idle vcpu being capable to be a
substitute for an offline vcpu this is problematic when switching to
an idle scheduling unit. An idle previous vcpu leaves us in doubt which
schedule unit was active previously, so save the previous unit pointer
in the per-schedule resource area. If it is NULL the unit has not
changed and we don't have to set the previous unit to be not running.

When running an idle vcpu in a non-idle scheduling unit use a specific
guest idle loop not performing any tasklets and livepatching in order
to avoid populating the cpu caches with memory used by other domains
(as far as possible). Softirqs are considered to be save.

Aiui "tasklets" here means only ones run in (idle) vCPU context, whereas
"softirqs" includes tasklets run in softirq context. I think it would
help if the description made this distinction. With this I then wonder
whether the cache related argumentation above still holds: VT-d's IOMMU
fault handling tasklet runs in softirq context, for example, and
hvm_assert_evtchn_irq(), being handed a struct vcpu *, does too. Of
course this could be considered covered by "(as far as possible)" ...

I'll write "... not performing any non-softirq tasklets ...".


@@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
static void noreturn continue_idle_domain(struct vcpu *v)
  {
+    /* Idle vcpus might be attached to non-idle units! */
+    if ( !is_idle_domain(v->sched_unit->domain) )
+        reset_stack_and_jump_nolp(guest_idle_loop);

The construct and comment make me wonder - did you audit all uses
of is_idle_{domain,vcpu}() for being in line with this new usage
mode?

Yes.


@@ -1767,33 +1774,66 @@ static void sched_switch_units(struct sched_resource 
*sd,
                                 struct sched_unit *next, struct sched_unit 
*prev,
                                 s_time_t now)
  {
-    sd->curr = next;
-
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
-             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0, prev->next_time);
+    int cpu;

unsigned int?

Okay.


      ASSERT(unit_running(prev));
- TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
-             next->domain->domain_id, next->unit_id);
+    if ( prev != next )
+    {
+        sd->curr = next;
+        sd->prev = prev;
- sched_unit_runstate_change(prev, false, now);
+        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
+                 prev->unit_id, now - prev->state_entry_time);
+        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
+                 next->unit_id,
+                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
+                 (now - next->state_entry_time) : 0, prev->next_time);
+        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
+                 next->domain->domain_id, next->unit_id);
- ASSERT(!unit_running(next));
-    sched_unit_runstate_change(next, true, now);
+        ASSERT(!unit_running(next));
- /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+        /*
+         * NB. Don't add any trace records from here until the actual context
+         * switch, else lost_records resume will not work properly.
+         */
+
+        ASSERT(!next->is_running);
+        next->is_running = 1;
+        next->state_entry_time = now;
+
+        if ( is_idle_unit(prev) )
+        {
+            prev->runstate_cnt[RUNSTATE_running] = 0;
+            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+        }
+        if ( is_idle_unit(next) )
+        {
+            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_runnable] = 0;
+        }

Is this correct when some of the vCPU-s a substitute idle ones?

Yes, as this affects idle units only. An idle vcpu running as a
substitute will _not_ result in the related idle unit runstate_cnt
to be updated.


@@ -1849,29 +1889,39 @@ static struct sched_unit *do_schedule(struct sched_unit 
*prev, s_time_t now,
      if ( prev->next_time >= 0 ) /* -ve means no limit */
          set_timer(&sd->s_timer, now + prev->next_time);
- if ( likely(prev != next) )
-        sched_switch_units(sd, next, prev, now);
+    sched_switch_units(sd, next, prev, now);
return next;
  }
-static void context_saved(struct vcpu *prev)
+static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
  {
-    struct sched_unit *unit = prev->sched_unit;
-
      /* Clear running flag /after/ writing context to memory. */
      smp_wmb();
- prev->is_running = 0;
+    if ( vprev != vnext )
+        vprev->is_running = 0;
+}

With this "vnext" could be const qualified as it seems. And "false"
instead of "0" perhaps, as you touch this anyway?

Okay.


@@ -2015,7 +2079,8 @@ static void sched_slave(void)
pcpu_schedule_unlock_irq(lock, cpu); - sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
+    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
+                         is_idle_unit(next) && !is_idle_unit(prev), now);
  }
/*
@@ -2075,7 +2140,8 @@ static void schedule(void)
      pcpu_schedule_unlock_irq(lock, cpu);
vnext = sched_unit2vcpu_cpu(next, cpu);
-    sched_context_switch(vprev, vnext, now);
+    sched_context_switch(vprev, vnext,
+                         !is_idle_unit(prev) && is_idle_unit(next), now);
  }

As a minor remark, I think between such constructs it would be good
if there was no difference, unless there's a reason to have one. Yet
if there was a reason, it surely would want to be spelled out.

I guess you mean changing the parameters of sched_context_switch()? I
can do that.


@@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
  # define CHECK_FOR_LIVEPATCH_WORK ""
  #endif
-#define reset_stack_and_jump(__fn) \
+#define switch_stack_and_jump(fn, instr)                                \

Is there any dependency on "instr" to just be a single instruction?
I'm inclined to ask for it to be named "extra" or some such.

Fine with me.


--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -76,6 +76,9 @@ void set_nr_sockets(void);
  /* Representing HT and core siblings in each socket. */
  extern cpumask_t **socket_cpumask;
+#define get_cpu_current(cpu) \
+    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)

I don't think this can go without a comment clarifying under what
(pretty narrow I think) conditions this is legitimate to use.

Okay. I'll add a comment like: "to be used only while no context switch
can occur on the cpu".


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.