
Re: [PATCH] x86: serializing of non-serializing MSR writes


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 29 Feb 2024 10:02:27 +0100
  • Cc: Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 29 Feb 2024 09:02:47 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 29.02.2024 02:10, Andrew Cooper wrote:
> On 28/02/2024 2:48 pm, Jan Beulich wrote:
>> Linux commit 25a068b8e9a4e ("x86/apic: Add extra serialization for non-
>> serializing MSRs") explains why an MFENCE+LFENCE pair is generally
>> needed ahead of ICR writes in x2APIC mode, and also why at least in
>> theory such is also needed ahead of TSC_DEADLINE writes. A comment of
>> our own in send_IPI_mask_x2apic_phys() further explains a condition
>> under which the LFENCE can be avoided.
>>
>> Further Linux commit 04c3024560d3 ("x86/barrier: Do not serialize MSR
>> accesses on AMD") explains that this barrier isn't needed on AMD or
>> Hygon, and is in fact hampering performance in a measurable way.
>>
>> Introduce a similarly named helper function, but with a parameter
>> allowing callers to specify whether a memory access will follow, thus
>> permitting the LFENCE to be omitted.
>>
>> Putting an instance in apic_wait_icr_idle() is to be on the safe side.
>> The one case where it was clearly missing is in send_IPI_shortcut(),
>> which is also used in x2APIC mode when called from send_IPI_mask().
>>
>> Function comment shamelessly borrowed (but adapted) from Linux.
>>
>> Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>> ---
>> I question the need for a fence ahead of writing TSC_DEADLINE: The Linux
>> commit message talks about LVT being accessed via MMIO in xAPIC mode,
>> but that should not be relevant here: It's all the local CPU, so there
>> ought not to be visibility concerns (much like for a self-IPI no fence
>> is needed ahead of the ICR write). If that wasn't needed, we could
>> further use alternatives patching to remove the fence also from
>> apic_wait_icr_idle() when in xAPIC mode. (And only then would I agree to
>> have APIC in the feature identifier, like Linux has it.)
>>
>> A number of apic_write() calls may better be turned into apic_mem_write(), in
>> particular e.g. the ones in send_IPI_mask_{flat,phys}(). That way it
>> would be quite a bit easier to spot paths taken only in xAPIC mode.
>>
>> The INIT-INIT-SIPI sequence for AP startup doesn't use any barrier, also
>> not in Linux afaics. I can't explain the lack thereof, though.
> 
> I have some opinions about what Linux did here...  I don't think a
> single vendor/arch-neutral helper can possibly be right.
> 
> It is vendor and uarch dependent which WRMSRs transparently degrade to
> WRMSRNS, and it is vendor dependent which serialising sequence to use
> (if any).  e.g. AMD have recently (Zen2 uarch I believe) retroactively
> defined FS/GS_BASE to be non-serialising.  (And this is another CPUID
> bit we need to OR backwards in time.)
> 
> Furthermore, IIRC AMD still owe us an update to the APM; the APM
> currently says that a serialising sequence is needed for ICR.  I'm told
> this isn't actually true, but I'm also very wary making an adjustment
> which is directly contradicted by the docs.

I can see you wanting the doc to be corrected. What I'm having trouble
with is that you indicated (long ago) that we can avoid this fence on
AMD, yet now effectively object to me (finally) getting around to
actually doing so.

> The Linux change explains why in principle the IPI can be emitted before
> the stores are visible.
> 
> This does actually explain TSC_DEADLINE too.  Setting a deadline in the
> past gets you an interrupt immediately, and if you combine that with a
> WRMSR being ordered ahead of an MFENCE, then causality is violated. 
> You'll take the interrupt on whichever instruction boundary has most
> recently retired, which will can be the wrong side of the WRMSR
> triggering the interrupts, at which point you'll livelock taking timer
> interrupts and re-arming the timer in the past.

Are you saying the interrupt is raised ahead of the insn retiring? That
would be concerning, imo.

Irrespective of that, as to live-locking: if that were really a
possible issue, moving lapic_timer_on() ahead of local_irq_enable() in
acpi_processor_idle() and mwait_idle() would avoid it; the sole other
use of reprogram_timer() already runs with IRQs off.

> Now, for fencing, things are more complicated.  AMD define MFENCE as
> architecturally serialising.  Intel do not, hence why apparently a WRMSR
> can possibly move across it.  The LFENCE is added for its new
> speculative property of dispatch serialising.
> 
> We don't actually care about architecturally serialising.  If someone is
> interacting with these MSRs with relevant data in WC memory, then they
> get to keep all resulting pieces.
> 
> However, we do care about plain stores, and for that we do need an
> MFENCE;LFENCE on Intel.  (The jury is out on whether a single
> SERIALIZE[sic] would be better, but it should have the correct semantics
> architecturally speaking.)
> 
> In particular, ...
> 
>> --- a/xen/arch/x86/genapic/x2apic.c
>> +++ b/xen/arch/x86/genapic/x2apic.c
>> @@ -97,15 +97,15 @@ static void cf_check send_IPI_mask_x2api
>>  
>>      /*
>>       * Ensure that any synchronisation data written in program order by this
>> -     * CPU is seen by notified remote CPUs. The WRMSR contained within
>> -     * apic_icr_write() can otherwise be executed early.
>> +     * CPU is seen by notified remote CPUs. The WRMSR contained in the loop
>> +     * below can otherwise be executed early.
>>       * 
>> -     * The reason smp_mb() is sufficient here is subtle: the register arguments
>> +     * The reason MFENCE is sufficient here is subtle: the register arguments
>>       * to WRMSR must depend on a memory read executed after the barrier.  This
>>       * is guaranteed by cpu_physical_id(), which reads from a global array (and
>>       * so cannot be hoisted above the barrier even by a clever compiler).
>>       */
>> -    smp_mb();
>> +    weak_wrmsr_fence(true);
>>  
>>      local_irq_save(flags);
>>  
>> @@ -130,7 +130,7 @@ static void cf_check send_IPI_mask_x2api
>>      const cpumask_t *cluster_cpus;
>>      unsigned long flags;
>>  
>> -    smp_mb(); /* See above for an explanation. */
>> +    weak_wrmsr_fence(true); /* See above for an explanation. */
>>  
>>      local_irq_save(flags);
>>  
>> --- a/xen/arch/x86/include/asm/cpufeatures.h
>> +++ b/xen/arch/x86/include/asm/cpufeatures.h
>> @@ -24,7 +24,7 @@ XEN_CPUFEATURE(APERFMPERF,        X86_SY
>>  XEN_CPUFEATURE(MFENCE_RDTSC,      X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
>>  XEN_CPUFEATURE(XEN_SMEP,          X86_SYNTH(10)) /* SMEP gets used by Xen itself */
>>  XEN_CPUFEATURE(XEN_SMAP,          X86_SYNTH(11)) /* SMAP gets used by Xen itself */
>> -/* Bit 12 unused. */
>> +XEN_CPUFEATURE(NO_WRMSR_FENCE,    X86_SYNTH(12)) /* No MFENCE{,+LFENCE} ahead of certain WRMSR. */
>>  XEN_CPUFEATURE(IND_THUNK_LFENCE,  X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */
>>  XEN_CPUFEATURE(IND_THUNK_JMP,     X86_SYNTH(14)) /* Use IND_THUNK_JMP */
>>  XEN_CPUFEATURE(SC_NO_BRANCH_HARDEN, X86_SYNTH(15)) /* (Disable) Conditional branch hardening */
>> --- a/xen/arch/x86/include/asm/msr.h
>> +++ b/xen/arch/x86/include/asm/msr.h
>> @@ -97,6 +97,25 @@ static inline void msr_split(struct cpu_
>>      regs->rax = (uint32_t)val;
>>  }
>>  
>> +/*
>> + * Make previous memory operations globally visible before a WRMSR.  Most
>> + * WRMSRs are full serializing instructions themselves and do not require this
>> + * barrier.  This may only be required for the TSC_DEADLINE and x2APIC MSRs.
>> + *
>> + * MFENCE makes writes visible, but only affects load/store instructions.
>> + * WRMSR is unfortunately not a load/store instruction and is unaffected by
>> + * MFENCE.
> 
> [ On Intel.   AMD didn't end up with this (mis)behaviour. ]
> 
>>   The LFENCE ensures that the WRMSR is not reordered, but callers
>> + * can indicate to avoid it when they have a suitable memory access between
>> + * the invocation of this function and the WRMSR in question.
> 
> ... this makes no sense.
> 
> We need the LFENCE for dispatch serialising properties, not its load
> ordering properties.  What use will other memory accesses have, when
> the entire problem is that WRMSR doesn't interact with them?

Are you suggesting the comment (and code) in send_IPI_mask_x2apic_*()
(left visible in context further up) are wrong then? I consider it
correct (and look forward to seeing you prove it wrong), and with that,
having a way to avoid the LFENCE looks correct to me. Plus the comment
here doesn't say "load ordering" anywhere. It's strictly execution
ordering, guaranteed by a memory access the WRMSR input is dependent
upon. For load ordering, MFENCE alone would be enough.

> Worse, it's a Spectre-v1 gadget and we're now acutely familiar with how
> the CPU will find its way around these.  So even expressing "I
> critically need the LFENCE" still gets you pot luck on whether it has
> any effect against a causality-violating WRMSR.

Hmm, besides me possibly taking this as "drop this patch" (which could
do with being made explicit, if that is what was meant), I'm afraid I
can't view this remark as actionable in any way. Yet I firmly expect an
IPI to not be raised speculatively (and, as said above, neither an LAPIC
timer interrupt), so I'm even having trouble seeing how this would form
a Spectre-v1 gadget.

Jan



 

