
Re: S3 resume issue in xstate_init



On Tue, Aug 17, 2021 at 02:29:20PM +0100, Andrew Cooper wrote:
> On 17/08/2021 14:21, Jan Beulich wrote:
> > On 17.08.2021 15:06, Andrew Cooper wrote:
> >> On 17/08/2021 13:53, Andrew Cooper wrote:
> >>> On 17/08/2021 13:21, Jan Beulich wrote:
> >>>> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
> >>>> write, due to the cached value matching the to-be-written one, yet
> >>>> the cache having gone stale across S3.
> >>> This is a rats nest, and area for concern, but ...
> >>>
> >>>>  I think this is to be expected
> >>>> for previously parked CPUs, as those don't have their per-CPU data
> >>>> de-allocated (and hence also not re-setup, and thus also not starting
> >>>> out as zero).
> >>> ... we don't deallocate regular APs either, so I don't see why the smt=
> >>> setting would make a difference in this case.
> >>>
> >>> To be clear - I think your guess about set_xcr0() skipping the write is
> >>> correct, because 0x240 is correct for xcr0=X87, but I don't see why smt=
> >>> makes a difference.
> > Right - as per my later reply to Marek I don't see an immediate reason
> > anymore either. I could see an indirect reason via different scheduler
> > decisions, affecting what ran last on the respective CPUs.
> 
> Modern Linux has stripped all MPX support, so won't set
> %xcr0.bnd{reg,csr} in the first place, and will differ from Xen's
> default setting.
> 
> Therefore, I suppose we have a real difference in Xen's idea of the
> lazily-cached value depending on whether we're in half or full idle context.

:(

> >>>>  I guess an easy fix would be to write 0 to
> >>>> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
> >>>> to the early "if ( bsp )".
> >>>>
> >>>> I'm not sure though this would be a good fix longer term, as there
> >>>> might easily be other similar issues elsewhere. IOW we may need to
> >>>> see whether per-CPU data initialization wouldn't want changing.
> >>> We've got other registers too, like MSR_TSC_AUX, but I don't think we
> >>> want to be doing anything as drastic as changing how the initialisation
> >>> works.
> >>>
> >>> The S3 path needs to explicitly write every register we do lazy context
> >>> switching of.
> >> Actually no - that's a dumb suggestion because the APs don't know
> >> better, and we don't want for_each_cpu() loops running from the BSP.
> >>
> >> Perhaps we want the cpu_down() logic to explicitly invalidate their
> >> lazily cached values?
> > I'd rather do this on the cpu_up() path (no point clobbering what may
> > get further clobbered, and then perhaps not to a value of our liking),
> > yet then we can really avoid doing this from a notifier and instead do
> > it early enough in xstate_init() (taking care of XSS at the same time).

Funny you mention notifiers. Apparently the cpufreq driver uses one to
initialize things, and fails to do so:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) CPU0: xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
(XEN) Enabling non-boot CPUs  ...
(XEN) CPU1: xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
(XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d04024ad2b>] vcpu_runstate_get+0x153/0x244
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff830049667c50   rcx: 0000000000000001
(XEN) rdx: 000000321d74d000   rsi: ffff830049667c50   rdi: ffff83025dcc0000
(XEN) rbp: ffff830049667c40   rsp: ffff830049667c10   r8:  ffff83020511a820
(XEN) r9:  ffff82d04057ef78   r10: 0180000000000000   r11: 8000000000000000
(XEN) r12: ffff83025dcc0000   r13: ffff830205118c60   r14: 0000000000000001
(XEN) r15: 0000000000000010   cr0: 000000008005003b   cr4: 00000000003526e0
(XEN) cr3: 0000000049656000   cr2: 0000000000000028
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d04024ad2b> (vcpu_runstate_get+0x153/0x244):
(XEN)  48 8b 14 ca 48 8b 04 02 <4c> 8b 70 28 e9 01 ff ff ff 4c 8d 3d dd 64 32 00
(XEN) Xen stack trace from rsp=ffff830049667c10:
(XEN)    0000000000000180 ffff83025dcbd410 ffff83020511bf30 ffff830205118c60
(XEN)    0000000000000001 0000000000000010 ffff830049667c80 ffff82d04024ae73
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff830049667cb8 ffff82d0402560a9
(XEN)    ffff830205118320 0000000000000001 ffff83020511bf30 ffff83025dc7a6f0
(XEN)    0000000000000000 ffff830049667d58 ffff82d040254cb1 00000001402e9f74
(XEN)    0000000000000000 ffff830049667d10 ffff82d040224eda 000000000025dc81
(XEN)    000000321d74d000 ffff82d040571278 0000000000000001 ffff830049667d28
(XEN)    ffff82d040228b44 ffff82d0403102cf 0000000000000000 ffff82d0402283a4
(XEN)    ffff82d040459688 ffff82d040459680 ffff82d040459240 0000000000000004
(XEN)    0000000000000000 ffff830049667d68 ffff82d04025510e ffff830049667db0
(XEN)    ffff82d040221ba4 0000000000000000 0000000000000001 0000000000000001
(XEN)    0000000000000000 ffff830049667e00 0000000000000001 ffff82d04058a5c0
(XEN)    ffff830049667dc8 ffff82d040203867 0000000000000001 ffff830049667df0
(XEN)    ffff82d040203c51 ffff82d040459400 0000000000000001 0000000000000010
(XEN)    ffff830049667e20 ffff82d040203e26 ffff830049667ef8 0000000000000000
(XEN)    0000000000000003 0000000000000200 ffff830049667e50 ffff82d040270bac
(XEN)    ffff83020116a640 ffff830258ff6000 0000000000000000 0000000000000000
(XEN)    ffff830049667e70 ffff82d0402056aa ffff830258ff61b8 ffff82d0405701b0
(XEN)    ffff830049667e88 ffff82d04022963c ffff82d0405701a0 ffff830049667eb8
(XEN) Xen call trace:
(XEN)    [<ffff82d04024ad2b>] R vcpu_runstate_get+0x153/0x244
(XEN)    [<ffff82d04024ae73>] F get_cpu_idle_time+0x57/0x59
(XEN)    [<ffff82d0402560a9>] F cpufreq_statistic_init+0x191/0x210
(XEN)    [<ffff82d040254cb1>] F cpufreq_add_cpu+0x3cc/0x5bb
(XEN)    [<ffff82d04025510e>] F cpufreq.c#cpu_callback+0x27/0x32
(XEN)    [<ffff82d040221ba4>] F notifier_call_chain+0x6c/0x96
(XEN)    [<ffff82d040203867>] F cpu.c#cpu_notifier_call_chain+0x1b/0x36
(XEN)    [<ffff82d040203c51>] F cpu_up+0xaf/0xc8
(XEN)    [<ffff82d040203e26>] F enable_nonboot_cpus+0x6b/0x1f8
(XEN)    [<ffff82d040270bac>] F power.c#enter_state_helper+0x152/0x60a
(XEN)    [<ffff82d0402056aa>] F domain.c#continue_hypercall_tasklet_handler+0x4c/0xb9
(XEN)    [<ffff82d04022963c>] F tasklet.c#do_tasklet_work+0x76/0xac
(XEN)    [<ffff82d040229920>] F do_tasklet+0x58/0x8a
(XEN)    [<ffff82d0402e6607>] F domain.c#idle_loop+0x74/0xdd
(XEN) 
(XEN) Pagetable walk from 0000000000000028:
(XEN)  L4[0x000] = 000000025dce1063 ffffffffffffffff
(XEN)  L3[0x000] = 000000025dce0063 ffffffffffffffff
(XEN)  L2[0x000] = 000000025dcdf063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000028
(XEN) ****************************************

This is after adding a brute-force `this_cpu(xcr0) = 0` in xstate_init().

> What we actually want to do is read the hardware register into the
> cached location.  %xcr0 is possibly the only lazy register we also do
> extra sanity checks on.

Yes, it is better to load the actual XCR0 value into the cache instead
of 0 (although in this very case it will get overwritten immediately).
I've added a similar cache init for XSS; this one should be safe-ish,
since get_msr_xss() is not used anywhere.
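
To illustrate the failure mode discussed above, here is a rough,
self-contained simulation. It is not Xen code: every sim_* name is
hypothetical, the "hardware" register is just a variable, and the
assumption that XCR0 comes back as 1 (x87 only) after S3 is mine.
The 0x1f value is taken from the "states: 0x1f" line in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for a single CPU: the "hardware" XCR0 and the
 * software cache of it (the role this_cpu(xcr0) plays in the thread).
 */
static uint64_t sim_hw_xcr0 = 1;      /* assumed reset value: x87 only */
static uint64_t sim_cached_xcr0;      /* survives S3, like per-CPU data */
static unsigned int sim_hw_writes;    /* counts real register writes */

/*
 * Lazy write: skip the register write when the cache already holds the
 * requested value. Returns true if the hardware was actually written.
 */
static bool sim_set_xcr0(uint64_t val)
{
    assert(val & 1);                  /* XCR0 bit 0 (x87) must be set */

    if (sim_cached_xcr0 == val)
        return false;                 /* skipped: cache claims no change */

    sim_hw_xcr0 = val;
    sim_cached_xcr0 = val;
    sim_hw_writes++;
    return true;
}

/* S3 cycle: the hardware register is reset, the cache is left alone. */
static void sim_s3_cycle(void)
{
    sim_hw_xcr0 = 1;
}

/* Resync on the bring-up path: read the hardware into the cache. */
static void sim_resync_cache(void)
{
    sim_cached_xcr0 = sim_hw_xcr0;
}
```

Calling sim_set_xcr0(0x1f), then sim_s3_cycle(), then sim_set_xcr0(0x1f)
again skips the second write and leaves the simulated register at 1.
With sim_resync_cache() in between, the second write goes through.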

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



 

