
Re: [Xen-devel] [PATCH, RFC] x86/HVM: batch vCPU wakeups



On 09/09/2014 22:29, Tim Deegan wrote:
> At 09:33 +0100 on 09 Sep (1410251617), Jan Beulich wrote:
>> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
>> especially when many of the remote pCPU-s are in deep C-states. For
>> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
>> accumulated times of over 2ms were observed. Considering that Windows
>> broadcasts IPIs from its timer interrupt
> Bleargh. :(
>
>> RFC for two reasons:
>> 1) I started to consider elimination of the event-check IPIs in a more
>>    general way when MONITOR/MWAIT is in use: As long as the remote CPU
>>    is known to be MWAITing (or in the process of resuming after MWAIT),
>>    the write of softirq_pending(cpu) ought to be sufficient to wake it.
>>    This would yield the patch here pointless on MONITOR/MWAIT capable
>>    hardware.
> That's a promising line.  I guess the number of 64-way systems that
> don't support MONITOR/MWAIT is pretty small. :)
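
For concreteness, point 1) amounts to something like the sketch below.
cpu_is_in_mwait() is a hypothetical predicate, not an existing Xen
interface; the RFC patch does not implement this.

    void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
    {
        if ( test_and_set_bit(nr, &softirq_pending(cpu)) ||
             cpu == smp_processor_id() )
            return;

        /*
         * If the remote CPU is known to be MWAITing on its
         * softirq_pending word (or in the process of resuming after
         * MWAIT), the store done by test_and_set_bit() above is already
         * sufficient to wake it, and the event-check IPI can be skipped.
         */
        if ( !cpu_is_in_mwait(cpu) )
            smp_send_event_check_cpu(cpu);
    }
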
>
>> 2) The condition when to enable batching in vlapic_ipi() is already
>>    rather complex, but it is nevertheless unclear whether it would
>>    benefit from further tuning (as mentioned above).
> The test you chose seems plausible enough to me.  I suppose we care
> enough about single-destination IPIs on idle hardware that we need
> this check at all; but I doubt that IPIs to exactly two destinations,
> for example, are worth worrying about.
>
> The implementation looks OK in general; a few nits below.
>
>> --- a/xen/arch/x86/hvm/vlapic.c
>> +++ b/xen/arch/x86/hvm/vlapic.c
>> @@ -447,12 +447,30 @@ void vlapic_ipi(
>>  
>>      default: {
>>          struct vcpu *v;
>> -        for_each_vcpu ( vlapic_domain(vlapic), v )
>> +        const struct domain *d = vlapic_domain(vlapic);
>> +        bool_t batch = d->max_vcpus > 2 &&
>> +                       (short_hand
>> +                        ? short_hand != APIC_DEST_SELF
>> +                        : vlapic_x2apic_mode(vlapic)
>> +                          ? dest_mode ? hweight16(dest) > 1
>> +                                      : dest == 0xffffffff
>> +                          : dest_mode
>> +                            ? hweight8(dest &
>> +                                       GET_xAPIC_DEST_FIELD(
>> +                                           vlapic_get_reg(vlapic,
>> +                                                          APIC_DFR))) > 1
>> +                            : dest == 0xff);
> Yoiks.  Could you extract some of this into a helper function, say
> is_unicast_dest(), just for readability?
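
Such a helper might look like the sketch below -- just the condition
from the hunk above re-rolled, under whatever name/polarity ends up
being chosen:

    static bool_t is_multicast_dest(const struct vlapic *vlapic,
                                    unsigned int short_hand,
                                    uint32_t dest, bool_t dest_mode)
    {
        if ( short_hand )
            return short_hand != APIC_DEST_SELF;

        if ( vlapic_x2apic_mode(vlapic) )
            return dest_mode ? hweight16(dest) > 1
                             : dest == 0xffffffff;

        if ( dest_mode )
            return hweight8(dest &
                            GET_xAPIC_DEST_FIELD(
                                vlapic_get_reg(vlapic, APIC_DFR))) > 1;

        return dest == 0xff;
    }

with the call site reducing to

    bool_t batch = d->max_vcpus > 2 &&
                   is_multicast_dest(vlapic, short_hand, dest, dest_mode);
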
>
>>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>>  {
>> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
>> -         && (cpu != smp_processor_id()) )
>> +    unsigned int this_cpu = smp_processor_id();
> The name clashes with the this_cpu() macro (obviously OK for compiling
> but harder to read) and can be dropped since...
>
>> +
>> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
>> +         || (cpu == this_cpu) )
>> +        return;
>> +
>> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
> this_cpu(batching) is the preferred way of addressing this on Xen
> anyway - though I guess maybe it's _slightly_ less efficient to
> recalculate get_cpu_info() here than to indirect through the
> per_cpu_offsets[]?  I haven't looked at the asm.

It depends on whether the compiler decides to do any stack manipulation.
Certainly with the older implementation, if gcc leaves the stack pointer
alone, it will coalesce calls to get_cpu_info(), but will re-call it if
any pushes/pops happen between basic blocks.  I believe my newer
implementation is coalesced across small stack changes, but it still
recalculates more than I would expect in some of the later functions.
(Guess who spent a long time looking at generated asm while working on
that patch.)

>
>>          smp_send_event_check_cpu(cpu);
>> +    else
>> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));
>> +}
>> +
>> +void cpu_raise_softirq_batch_begin(void)
>> +{
>> +    ++per_cpu(batching, smp_processor_id());
> This one should definitely be using this_cpu().

Agreed in this case...

>
>> +}
>> +
>> +void cpu_raise_softirq_batch_finish(void)
>> +{
>> +    unsigned int cpu, this_cpu = smp_processor_id();
>> +    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
> Again, this_cpu()?

...But disagree here.  Multiple uses of this_cpu($FOO) cannot be
coalesced, because RELOC_HIDE() deliberately prevents that optimisation.
When the same per-CPU variable is accessed several times, pulling out a
pointer to it up front results in rather more efficient code.
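
For example (struct foo_stats and the per-CPU variable foo are made-up
stand-ins):

    /*
     * Each this_cpu() access goes through RELOC_HIDE(), so the compiler
     * re-derives the per-CPU address every time and cannot merge these:
     */
    this_cpu(foo).count++;
    this_cpu(foo).total += delta;

    /* Pulling the pointer out once lets every later access reuse it: */
    struct foo_stats *f = &this_cpu(foo);
    f->count++;
    f->total += delta;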

~Andrew

>
>> +    ASSERT(per_cpu(batching, this_cpu));
>> +    for_each_cpu ( cpu, mask )
>> +        if ( !softirq_pending(cpu) )
>> +            cpumask_clear_cpu(cpu, mask);
>> +    smp_send_event_check_mask(mask);
>> +    cpumask_clear(mask);
>> +    --per_cpu(batching, this_cpu);
>>  }
> One other thing that occurs to me is that we might do the batching in
> the caller (by gathering a mask of pcpus in the vlapic code) but that
> sounds messy.  And it's not like softirq is so particularly
> latency-sensitive that we'd want to avoid this much overhead in the
> common case.  So on second thoughts this way is better. :)
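
For reference, the caller side in vlapic_ipi() wraps the wakeup loop
roughly as below -- abridged, with the loop body reconstructed from
context rather than quoted from the patch:

    if ( batch )
        cpu_raise_softirq_batch_begin();
    for_each_vcpu ( d, v )
        if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
                               short_hand, dest, dest_mode) )
            vlapic_accept_irq(v, icr_low);
    if ( batch )
        cpu_raise_softirq_batch_finish();
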
>
> Cheers,
>
> Tim.


