[Xen-devel] [PATCH, RFC] x86/HVM: batch vCPU wakeups



Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed. Considering that Windows
broadcasts IPIs from its timer interrupt, which at least at certain
times can run at 1kHz, it is clear that this can result in bad
guest behavior. In fact, on said hardware guests with significantly
more than 40 vCPU-s simply hung when e.g. ServerManager got started.
With the patch here, average broadcast times for a 64-vCPU guest went
down to a measured maximum of 310us (which is still quite a lot).

This doesn't just help reduce the number of ICR writes when the
host APICs run in clustered mode; it also reduces them by
suppressing the sends altogether when - by the time
cpu_raise_softirq_batch_finish() is reached - the remote CPU has
already managed to handle the softirq. Plus, when using MONITOR/MWAIT,
the update of softirq_pending(cpu), being on the monitored cache line,
should make the remote CPU wake up ahead of the ICR being sent,
allowing the wait-for-ICR-idle latencies to be reduced (perhaps to a
large part due to overlapping the wakeups of multiple CPUs).
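
To illustrate the intended begin/raise/finish pattern and the IPI
suppression, here is a minimal stand-alone user-space model of the
batching. It is only a sketch: NR_CPUS, send_ipi() and the plain arrays
stand in for Xen's per-CPU data and smp_send_event_check_mask(), and the
real code additionally skips the local CPU and falls back to immediate
sends when in IRQ context.

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 8

static unsigned long softirq_pending[NR_CPUS]; /* pending softirq bits, per CPU */
static bool batch_mask[NR_CPUS];               /* CPUs whose IPI got deferred   */
static unsigned int batching;                  /* batch nesting depth           */

static void send_ipi(unsigned int cpu)
{
    printf("event-check IPI -> CPU%u\n", cpu);
}

/* Raise softirq 'nr' on 'cpu': IPI immediately unless inside a batch. */
static void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
    if ( softirq_pending[cpu] & (1UL << nr) )
        return;                      /* already pending - no IPI needed     */
    softirq_pending[cpu] |= 1UL << nr;
    if ( !batching )
        send_ipi(cpu);
    else
        batch_mask[cpu] = true;      /* defer the IPI to ..._batch_finish() */
}

static void cpu_raise_softirq_batch_begin(void)
{
    ++batching;
}

static void cpu_raise_softirq_batch_finish(void)
{
    for ( unsigned int cpu = 0; cpu < NR_CPUS; cpu++ )
    {
        /* Suppress the IPI if the remote CPU already ran its softirqs. */
        if ( batch_mask[cpu] && softirq_pending[cpu] )
            send_ipi(cpu);
        batch_mask[cpu] = false;
    }
    --batching;
}

int main(void)
{
    cpu_raise_softirq_batch_begin();
    for ( unsigned int cpu = 1; cpu < NR_CPUS; cpu++ )
        cpu_raise_softirq(cpu, 0);    /* mass wakeup, e.g. from vlapic_ipi() */
    softirq_pending[3] = 0;           /* pretend CPU3 handled its softirqs   */
    cpu_raise_softirq_batch_finish(); /* sends IPIs to all but CPU3          */
    return 0;
}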

Of course this necessarily increases the latency of each individual
remote CPU wakeup at least slightly. To balance the two effects, the
condition for enabling batching in vlapic_ipi() may need further tuning.

Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
---
RFC for two reasons:
1) I started to consider elimination of the event-check IPIs in a more
   general way when MONITOR/MWAIT is in use: As long as the remote CPU
   is known to be MWAITing (or in the process of resuming after MWAIT),
   the write of softirq_pending(cpu) ought to be sufficient to wake it
   (see the sketch after this list). This would render the patch here
   pointless on MONITOR/MWAIT-capable hardware.
2) The condition when to enable batching in vlapic_ipi() is already
   rather complex, but it is nevertheless unclear whether it would
   benefit from further tuning (as mentioned above).
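
Roughly, the idea in 1) could look like the sketch below. The per-CPU
flag ('mwaiting') and the assumption that the idle loop sets it before
MWAITing on the softirq_pending(cpu) cache line and clears it on resume
are illustration only; they are not part of this patch.

/* Hypothetical sketch of 1); 'mwaiting' and its maintenance by the idle
 * loop are assumed, not implemented here. */
static DEFINE_PER_CPU(bool_t, mwaiting);

void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
    if ( test_and_set_bit(nr, &softirq_pending(cpu)) ||
         cpu == smp_processor_id() )
        return;

    /*
     * The write above touched the monitored cache line, which by itself
     * suffices to bring the remote CPU out of MWAIT - no IPI needed.
     */
    if ( per_cpu(mwaiting, cpu) )
        return;

    smp_send_event_check_cpu(cpu);
}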

--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -447,12 +447,30 @@ void vlapic_ipi(
 
     default: {
         struct vcpu *v;
-        for_each_vcpu ( vlapic_domain(vlapic), v )
+        const struct domain *d = vlapic_domain(vlapic);
+        bool_t batch = d->max_vcpus > 2 &&
+                       (short_hand
+                        ? short_hand != APIC_DEST_SELF
+                        : vlapic_x2apic_mode(vlapic)
+                          ? dest_mode ? hweight16(dest) > 1
+                                      : dest == 0xffffffff
+                          : dest_mode
+                            ? hweight8(dest &
+                                       GET_xAPIC_DEST_FIELD(
+                                           vlapic_get_reg(vlapic,
+                                                          APIC_DFR))) > 1
+                            : dest == 0xff);
+
+        if ( batch )
+            cpu_raise_softirq_batch_begin();
+        for_each_vcpu ( d, v )
         {
             if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
                                    short_hand, dest, dest_mode) )
                 vlapic_accept_irq(v, icr_low);
         }
+        if ( batch )
+            cpu_raise_softirq_batch_finish();
         break;
     }
     }
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
 
 static softirq_handler softirq_handlers[NR_SOFTIRQS];
 
+static DEFINE_PER_CPU(cpumask_t, batch_mask);
+static DEFINE_PER_CPU(unsigned int, batching);
+
 static void __do_softirq(unsigned long ignore_mask)
 {
     unsigned int i, cpu;
@@ -70,22 +73,56 @@ void open_softirq(int nr, softirq_handle
 
 void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
 {
-    int cpu;
-    cpumask_t send_mask;
+    unsigned int cpu = smp_processor_id();
+    cpumask_t send_mask, *raise_mask;
+
+    if ( !per_cpu(batching, cpu) || in_irq() )
+    {
+        cpumask_clear(&send_mask);
+        raise_mask = &send_mask;
+    }
+    else
+        raise_mask = &per_cpu(batch_mask, cpu);
 
-    cpumask_clear(&send_mask);
     for_each_cpu(cpu, mask)
         if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
-            cpumask_set_cpu(cpu, &send_mask);
+            cpumask_set_cpu(cpu, raise_mask);
 
-    smp_send_event_check_mask(&send_mask);
+    if ( raise_mask == &send_mask )
+        smp_send_event_check_mask(raise_mask);
 }
 
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
-    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
-         && (cpu != smp_processor_id()) )
+    unsigned int this_cpu = smp_processor_id();
+
+    if ( test_and_set_bit(nr, &softirq_pending(cpu))
+         || (cpu == this_cpu) )
+        return;
+
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
         smp_send_event_check_cpu(cpu);
+    else
+        set_bit(nr, &per_cpu(batch_mask, this_cpu));
+}
+
+void cpu_raise_softirq_batch_begin(void)
+{
+    ++per_cpu(batching, smp_processor_id());
+}
+
+void cpu_raise_softirq_batch_finish(void)
+{
+    unsigned int cpu, this_cpu = smp_processor_id();
+    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
+
+    ASSERT(per_cpu(batching, this_cpu));
+    for_each_cpu ( cpu, mask )
+        if ( !softirq_pending(cpu) )
+            cpumask_clear_cpu(cpu, mask);
+    smp_send_event_check_mask(mask);
+    cpumask_clear(mask);
+    --per_cpu(batching, this_cpu);
 }
 
 void raise_softirq(unsigned int nr)
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
 void raise_softirq(unsigned int nr);
 
+void cpu_raise_softirq_batch_begin(void);
+void cpu_raise_softirq_batch_finish(void);
+
 /*
  * Process pending softirqs on this CPU. This should be called periodically
  * when performing work that prevents softirqs from running in a timely manner.

