[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Yet another S3 issue in Xen 4.14



On Fri, Oct 02, 2020 at 04:42:50PM +0100, Andrew Cooper wrote:
> On 02/10/2020 16:39, Jan Beulich wrote:
> > On 02.10.2020 17:08, Marek Marczykowski-Górecki wrote:
> >> I've done another bisect on the commit broken up in separate changes
> >> (https://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/dbg-s3)
> >> and the bad part seems to be this:
> >>
> >> From dbdb32f8c265295d6af7cd4cd0aa12b6d04a0430 Mon Sep 17 00:00:00 2001
> >> From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> >> Date: Fri, 2 Oct 2020 15:40:22 +0100
> >> Subject: [PATCH 1/1] CR4
> > Interesting - I was wild guessing so yesterday, but couldn't come
> > up with even a vague reason why this would be. I think you could
> > further split it up:
> >
> >> --- a/xen/arch/x86/acpi/power.c
> >> +++ b/xen/arch/x86/acpi/power.c
> >> @@ -195,7 +195,6 @@ static int enter_state(u32 state)
> >>      unsigned long flags;
> >>      int error;
> >>      struct cpu_info *ci;
> >> -    unsigned long cr4;
> >>  
> >>      if ( (state <= ACPI_STATE_S0) || (state > ACPI_S_STATES_MAX) )
> >>          return -EINVAL;
> >> @@ -270,15 +269,15 @@ static int enter_state(u32 state)
> >>  
> >>      system_state = SYS_STATE_resume;
> >>  
> >> -    /* Restore CR4 and EFER from cached values. */
> >> -    cr4 = read_cr4();
> >> -    write_cr4(cr4 & ~X86_CR4_MCE);
> >> +    /* Restore EFER from cached value. */
> >>      write_efer(read_efer());
> > This one should be possible to leave in place despite ...
> >
> >>      device_power_up(SAVED_ALL);
> >>  
> >>      mcheck_init(&boot_cpu_data, false);
> >> -    write_cr4(cr4);
> >> +
> >> +    /* Restore CR4 from cached value, now MCE is set up. */
> >> +    write_cr4(read_cr4());
> > ... this change.
> >
> > Further, while I can't see how the set_in_cr4() in mcheck_init()
> > could badly interact with the CR4 writes here, another option
> > might be to suppress it when system_state == SYS_STATE_resume
> > && c == &boot_cpu_data (or !bsp && c == &boot_cpu_data).
> >
> >> --- a/xen/arch/x86/acpi/suspend.c
> >> +++ b/xen/arch/x86/acpi/suspend.c
> >> @@ -23,7 +23,4 @@ void save_rest_processor_state(void)
> >>  void restore_rest_processor_state(void)
> >>  {
> >>      load_system_tables();
> >> -
> >> -    /* Restore full CR4 (inc MCE) now that the IDT is in place. */
> >> -    write_cr4(mmu_cr4_features);
> >>  }
> > This one should be possible to leave in place despite the other
> > changes.
> 
> We're continuing to debug in private.  mmu_cr4_features and read_cr4()
> are equivalent (as expected), but very different from MINIMAL_CR4 which
> is what the trampoline configures, so I think we're suffering a
> CR4-related #UD/#GP somewhere in device_power_up() or mcheck_init().
> 
> Its not INVPCID.  Trying others.

Some update:
It's OSFXSR + probably other flags, the crash happens in
enter_state()->device_power_up()->time_resume()->efi_get_time()

This also explains the difference between legacy / UEFI boot.

Disabling efi_get_time() or setting CR4 earlier solves _this_ issue, but
applied on top of stable-4.14 still doesn't work. Looks like there is
yet another S3 breakage in between. I'm bisecting it further...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

Attachment: signature.asc
Description: PGP signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.