Xen project Mailing List

Re: [Xen-devel] [PATCH, RFC] x86/HVM: batch vCPU wakeups

To: Tim Deegan <tim@xxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Tue, 09 Sep 2014 23:37:23 +0100

Cc: Ian Campbell <Ian.Campbell@xxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>

Delivery-date: Tue, 09 Sep 2014 22:37:17 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 09/09/2014 22:29, Tim Deegan wrote: > At 09:33 +0100 on 09 Sep (1410251617), Jan Beulich wrote: >> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time, >> especially when many of the remote pCPU-s are in deep C-states. For >> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware, >> accumulated times of over 2ms were observed. Considering that Windows >> broadcasts IPIs from its timer interrupt > Bleargh. :( > >> RFC for two reasons: >> 1) I started to consider elimination of the event-check IPIs in a more >> general way when MONITOR/MWAIT is in use: As long as the remote CPU >> is known to be MWAITing (or in the process of resuming after MWAIT), >> the write of softirq_pending(cpu) ought to be sufficient to wake it. >> This would yield the patch here pointless on MONITOR/MWAIT capable >> hardware. > That's a promising line. I guess the number of 64-way systems that > don't support MONITOR/MWAIT is pretty small. :) > >> 2) The condition when to enable batching in vlapic_ipi() is already >> rather complex, but it is nevertheless unclear whether it would >> benefit from further tuning (as mentioned above). > The test you chose seems plausible enough to me. I suppose we care > enough about single-destination IPIs on idle hardware that we need > this check at all; but I doubt that IPIs to exactly two destinations, > for example, are worth worrying about. > > The implementation looks OK in general; a few nits below. > >> --- a/xen/arch/x86/hvm/vlapic.c >> +++ b/xen/arch/x86/hvm/vlapic.c >> @@ -447,12 +447,30 @@ void vlapic_ipi( >> >> default: { >> struct vcpu *v; >> - for_each_vcpu ( vlapic_domain(vlapic), v ) >> + const struct domain *d = vlapic_domain(vlapic); >> + bool_t batch = d->max_vcpus > 2 && >> + (short_hand >> + ? short_hand != APIC_DEST_SELF >> + : vlapic_x2apic_mode(vlapic) >> + ? dest_mode ? hweight16(dest) > 1 >> + : dest == 0xffffffff >> + : dest_mode >> + ? hweight8(dest & >> + GET_xAPIC_DEST_FIELD( >> + vlapic_get_reg(vlapic, >> + APIC_DFR))) > 1 >> + : dest == 0xff); > Yoiks. Could you extract some of this into a helper function, say > is_unicast_dest(), just for readability? > >> void cpu_raise_softirq(unsigned int cpu, unsigned int nr) >> { >> - if ( !test_and_set_bit(nr, &softirq_pending(cpu)) >> - && (cpu != smp_processor_id()) ) >> + unsigned int this_cpu = smp_processor_id(); > The name clashes with the this_cpu() macro (obviously OK for compiling > but harder to read) and can be dropped since... > >> + >> + if ( test_and_set_bit(nr, &softirq_pending(cpu)) >> + || (cpu == this_cpu) ) >> + return; >> + >> + if ( !per_cpu(batching, this_cpu) || in_irq() ) > this_cpu(batching) is the preferred way of addressing this on Xen > anyway - though I guess maybe it's _slightly_ less efficient to > recalculate get_cpu_info() here than to indirect through the > per_cpu_offsets[]? I haven't looked at the asm. It depends whether the compiler decides to do any stack manipulation. Certainly with the older implementation, if gcc leaves the stack pointer alone, it will coalesce calls to get_cpu_info(), but will recall it if any pushes/pops happen between basic blocks. I believe my newer implementation is coalesced across small stack changes, but it still recalculated more than I would expect in some of the later functions. (Guess who spent a long time looking at generated asm while working on that patch) > >> smp_send_event_check_cpu(cpu); >> + else >> + set_bit(nr, &per_cpu(batch_mask, this_cpu)); >> +} >> + >> +void cpu_raise_softirq_batch_begin(void) >> +{ >> + ++per_cpu(batching, smp_processor_id()); > This one should definitely be using this_cpu(). Agreed in this case... > >> +} >> + >> +void cpu_raise_softirq_batch_finish(void) >> +{ >> + unsigned int cpu, this_cpu = smp_processor_id(); >> + cpumask_t *mask = &per_cpu(batch_mask, this_cpu); > Again, this_cpu()? ...But disagree here. Multiple uses of this_cpu($FOO) cannot be coalesced due to RELOC_HIDE() deliberately preventing optimisation. For multiple uses, pulling it out by pointer to start with results in rather more efficient code. ~Andrew > >> + ASSERT(per_cpu(batching, this_cpu)); >> + for_each_cpu ( cpu, mask ) >> + if ( !softirq_pending(cpu) ) >> + cpumask_clear_cpu(cpu, mask); >> + smp_send_event_check_mask(mask); >> + cpumask_clear(mask); >> + --per_cpu(batching, this_cpu); >> } > One other thing that occurs to me is that we might do the batching in > the caller (by gathering a mask of pcpus in the vlapic code) but that > sounds messy. And it's not like softirq is so particularly > latency-sensitive that we'd want to avoid this much overhead in the > common case. So on second thoughts this way is better. :) > > Cheers, > > Tim. > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > http://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.