
Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 3 Jul 2025 11:01:12 +0200
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 03 Jul 2025 09:01:39 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 02.07.2025 16:41, Andrew Cooper wrote:
> In order to elide IPIs, we must be able to identify whether a target CPU is in
> MWAIT at the point it is woken up.  i.e. the store to wake it up must also
> identify the state.
> 
> Create a new in_mwait variable beside __softirq_pending, so we can use a
> CMPXCHG to set the softirq while also observing the status safely.  Implement
> an x86 version of arch_pend_softirq() which does this.
> 
> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
> precisely what it means.  X86_BUG_MONITOR can be accounted for simply by not
> advertising in_mwait.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> ---
> CC: Jan Beulich <JBeulich@xxxxxxxx>
> CC: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> CC: Anthony PERARD <anthony.perard@xxxxxxxxxx>
> CC: Michal Orzel <michal.orzel@xxxxxxx>
> CC: Julien Grall <julien@xxxxxxx>
> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> 
> This is modelled after Linux's TIF_NEED_RESCHED (single-bit equivalent of all
> of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
> 
> In Linux, they're both in the same flags field, which adds complexity.  In
> Xen, __softirq_pending is already unsigned long for everything other than x86,
> so adding an arch-neutral "in_mwait" would need wider changes.

Why would the flag need to be arch-neutral? A pretty straightforward way to
achieve what you want would seem to be to define an x86-only MWAIT_SOFTIRQ,
which we'd never raise nor handle, but which instead we'd advertise to common
code as an always-ignore mask (to be OR-ed in by __do_softirq()). However,
setting and clearing such a bit would of course require LOCK-ed insns,
which clearly is less desirable than the simple MOVB you're using.
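
Just to illustrate the direction (a rough sketch only; ARCH_IGNORE_SOFTIRQS is
a made-up name, other architectures would define it as 0, and I haven't checked
how best to plumb it into common code):

    /* x86 asm/softirq.h */
    #define MWAIT_SOFTIRQ          (NR_COMMON_SOFTIRQS + 5)
    #define NR_ARCH_SOFTIRQS       6

    /* Softirq bits common code should never act upon itself. */
    #define ARCH_IGNORE_SOFTIRQS   (1UL << MWAIT_SOFTIRQ)

    /* common/softirq.c:__do_softirq() */
    ignore_mask |= ARCH_IGNORE_SOFTIRQS;

The CPU going idle would then set/clear its own MWAIT_SOFTIRQ bit, which is
where the LOCK-ed accesses mentioned above would come into play.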

> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -439,7 +439,21 @@ __initcall(cpu_idle_key_init);
>  void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>  {
>      unsigned int cpu = smp_processor_id();
> -    const unsigned int *this_softirq_pending = &softirq_pending(cpu);
> +    irq_cpustat_t *stat = &irq_stat[cpu];
> +    const unsigned int *this_softirq_pending = &stat->__softirq_pending;
> +
> +    /*
> +     * By setting in_mwait, we promise to other CPUs that we'll notice changes
> +     * to __softirq_pending without being sent an IPI.  We achieve this by
> +     * either not going to sleep, or by having hardware notice on our behalf.
> +     *
> +     * Some errata exist where MONITOR doesn't work properly, and the
> +     * workaround is to force the use of an IPI.  Cause this to happen by
> +     * simply not advertising outselves as being in_mwait.

Nit: ourselves

> +     */
> +    alternative_io("movb $1, %[in_mwait]",
> +                   "", X86_BUG_MONITOR,
> +                   [in_mwait] "=m" (stat->in_mwait));
>  
>      monitor(this_softirq_pending, 0, 0);
>      smp_mb();

Unlike alternative(), alternative_io() has no memory clobber. To the compiler
the two resulting asm() therefore have no dependency operand-wise[1]. The
sole ordering criterion then is that they're both volatile asm()-s. It not
being explicitly said (anywhere that I could find) that volatile guarantees
such ordering, I wonder if we wouldn't better have an operand-based
dependency between the two. Otoh it may well be that we rely on such ordering
elsewhere already, in which case having one more instance probably would be
okay(ish).

[1] It's actually worse than that, I think: monitor() also doesn't specify
the location as a (memory) input.
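
E.g. (merely a sketch, not accounting for monitor()'s actual shape) the monitor
side could name both locations as memory inputs, which would give the compiler
the missing dependency and also address [1]:

    asm volatile ( "monitor"
                   :: "a" (this_softirq_pending), "c" (0), "d" (0),
                      "m" (*this_softirq_pending),
                      /* Unused by the insn; orders against the in_mwait write. */
                      "m" (stat->in_mwait) );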

> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>          mwait(eax, ecx);
>          spec_ctrl_exit_idle(info);
>      }
> +
> +    alternative_io("movb $0, %[in_mwait]",
> +                   "", X86_BUG_MONITOR,
> +                   [in_mwait] "=m" (stat->in_mwait));

This doesn't strictly need to be an alternative, does it? We can save a
memory write in the buggy case, but that likely makes at most a marginal
difference.
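
I.e. (unless I'm overlooking a reason the alternative is needed for
correctness) simply

    ACCESS_ONCE(stat->in_mwait) = false;

with the stray write in the X86_BUG_MONITOR case being benign.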

> --- a/xen/arch/x86/include/asm/hardirq.h
> +++ b/xen/arch/x86/include/asm/hardirq.h
> @@ -5,7 +5,19 @@
>  #include <xen/types.h>
>  
>  typedef struct {
> -    unsigned int __softirq_pending;
> +    /*
> +     * The layout is important.  Any CPU can set bits in __softirq_pending,
> +     * but in_mwait is a status bit owned by the CPU.  softirq_mwait_raw must
> +     * cover both, and must be in a single cacheline.
> +     */
> +    union {
> +        struct {
> +            unsigned int __softirq_pending;
> +            bool in_mwait;
> +        };
> +        uint64_t softirq_mwait_raw;
> +    };

To guard against someone changing e.g. __softirq_pending to unsigned long
while ignoring the comment, how about adding a BUILD_BUG_ON() checking both
parts of the union are the same size (which of course would require naming
either the union field within the containing struct or its struct member)?
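
E.g. (assuming, purely for the sake of the check, that the inner struct gained
a name, say "s")

    BUILD_BUG_ON(sizeof(((irq_cpustat_t *)NULL)->s) !=
                 sizeof(((irq_cpustat_t *)NULL)->softirq_mwait_raw));

placed e.g. in arch_pend_softirq().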

> @@ -9,4 +11,36 @@
>  #define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
>  #define NR_ARCH_SOFTIRQS       5
>  
> +/*
> + * Ensure softirq @nr is pending on @cpu.  Return true if an IPI can be
> + * skipped, false if the IPI cannot be skipped.
> + *
> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
> + * set softirq @nr while also observing in_mwait in a race-free way.
> + */
> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
> +{
> +    uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
> +    uint64_t old, new;
> +    unsigned int softirq = 1U << nr;
> +
> +    old = ACCESS_ONCE(*ptr);
> +
> +    do {
> +        if ( old & softirq )
> +            /* Softirq already pending, nothing to do. */
> +            return true;
> +
> +        new = old | softirq;
> +
> +    } while ( (old = cmpxchg(ptr, old, new)) != new );

Don't you mean

    } while ( (new = cmpxchg(ptr, old, new)) != old );

here (with new latched ahead of the loop and old set from new inside it),
like we have it elsewhere when we avoid the use of yet another variable,
e.g. in inc_linear_{entries,uses}()?
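
I.e., if I'm reading your intention right, something along the lines of

    new = ACCESS_ONCE(*ptr);

    do {
        old = new;

        if ( old & softirq )
            /* Softirq already pending, nothing to do. */
            return true;

        new = old | softirq;
    } while ( (new = cmpxchg(ptr, old, new)) != old );

On exit "new" then holds the pre-update value, which is fine for the in_mwait
check below, as that bit isn't altered by the update.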

> +    /*
> +     * We have caused the softirq to become pending.  If in_mwait was set, the
> +     * target CPU will notice the modification and act on it.
> +     */
> +    return new & (1UL << 32);

Hmm, this requires the layout to allow for even less re-arrangement than the
comment ahead of the union says: You strictly require in_mwait to follow
__softirq_pending immediately, and the (assembly) write to strictly write 1
into the field. Could I talk you into at least

    return new & (1UL << (offsetof(..., in_mwait) * 8));

? This way the field being inspected would also be mentioned in the access
itself (i.e. become grep-able beyond the mentioning in the comment). And I
sincerely dislike hard-coded literal numbers when they (relatively) easily
can be expressed another way.
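
With the current (anonymous union/struct) layout I'd expect that to come out as

    return new & (1UL << (offsetof(irq_cpustat_t, in_mwait) * 8));

assuming in_mwait remains directly visible as a member of irq_cpustat_t.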

Jan



 

