
Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode



On Wed, May 26, 2021 at 02:29:53PM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On 5/26/21 12:40 AM, Anchal Agarwal wrote:
> > On Tue, May 25, 2021 at 06:23:35PM -0400, Boris Ostrovsky wrote:
> >>
> >>
> >>
> >> On 5/21/21 1:26 AM, Anchal Agarwal wrote:
> >>>>> What I meant there wrt VCPU info was that the VCPU info is not
> >>>>> unregistered during hibernation, so Xen still remembers the old physical
> >>>>> addresses for the VCPU info that were registered by the booting kernel.
> >>>>> But the hibernation kernel may have different physical addresses for the
> >>>>> VCPU info, and if a mismatch happens, it may cause issues with resume.
> >>>>> During hibernation, the VCPU info register hypercall is not invoked
> >>>>> again.
> >>>> I still don't think that's the cause but it's certainly worth having a 
> >>>> look.
> >>>>
> >>> Hi Boris,
> >>> Apologies for picking this up after last year.
> >>> I did a deep dive on the above statement, and that is indeed what is
> >>> happening. I did some debugging around KASLR and hibernation using reboot
> >>> mode. I observed in my debug prints that whenever the vcpu_info* address
> >>> assigned to a secondary vcpu in xen_vcpu_setup() at boot is different from
> >>> what is in the image, resume gets stuck for that vcpu in bringup_cpu().
> >>> That means we have different addresses for &per_cpu(xen_vcpu_info, cpu)
> >>> at boot and after control jumps into the image.
> >>>
> >>> I failed to get any prints after it got stuck in bringup_cpu(), and I do
> >>> not have an option to send a sysrq signal to the guest or to get a kdump.
> >>
> >> xenctx and xen-hvmctx might be helpful.
> >>
> >>
> >>> This behavior is not observed in every hibernate-resume cycle. I am not
> >>> sure if this is a bug or expected behavior. I am also contemplating the
> >>> idea that it may be a bug in Xen code that gets triggered only when KASLR
> >>> is enabled, but I do not have substantial data to prove that.
> >>> Is it a coincidence that this always happens for the 1st vcpu?
> >>> Moreover, since the hypervisor is not aware that the guest hibernated, and
> >>> reboot mode looks like a regular shutdown to dom0, is re-registering
> >>> vcpu_info for the secondary vcpus even plausible?
> >>
> >> I think I am missing how this is supposed to work (maybe we've talked 
> >> about this but it's been many months since then). You hibernate the guest 
> >> and it writes the state to swap. The guest is then shut down? And what's 
> >> next? How do you wake it up?
> >>
> >>
> >> -boris
> >>
> > To resume a guest, the guest boots up as a fresh guest, and then
> > software_resume() is called; if it finds a stored hibernation image, it
> > quiesces the devices and loads the memory contents from the image. Control
> > then transfers to the target kernel. This further disables the non-boot
> > cpus, and the syscore_suspend/resume callbacks are invoked, which set up
> > the shared_info, pvclock, grant tables, etc. Since the vcpu_info pointer
> > for each non-boot cpu is already registered, the hypercall does not happen
> > again when bringing up the non-boot cpus. This leads to the inconsistencies
> > pointed out earlier when KASLR is enabled.
> 
> 
> I'd think the 'if' condition in the code fragment below should always fail,
> since the hypervisor is creating a new guest, resulting in the hypercall.
> Just like in the case of save/restore.
>
That condition only fails during boot, not after control jumps into the image.
The non-boot cpus are brought offline (freeze_secondary_cpus) and then back
online via the cpu hotplug path. In that case xen_vcpu_setup() doesn't invoke
the hypercall again.
> 
> Do you call xen_vcpu_info_reset() on resume? That will re-initialize 
> per_cpu(xen_vcpu). Maybe you need to add this to xen_syscore_resume().
> 
Yes, coincidentally I did. The registration of vcpu_info then fails with error
-22 (-EINVAL), basically because nobody unregistered the old vcpu_info and Xen
does not know that the guest hibernated in the first place.

Moreover, syscore_resume is also called in the hibernation path, i.e. after the
image is created: everything is resumed and thawed back before the final
writing of the image and the machine shutdown. So syscore_resume should only
invoke xen_vcpu_info_reset() when it is actually resuming from the image.
Luckily, I was able to use the in_suspend variable to detect that.
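For reference, what I tried looks roughly like this (rough, untested kernel-style sketch, not a working fix; xen_vcpu_info_reset() and for_each_possible_cpu() are existing kernel interfaces, but the wiring here is my experiment, and in_suspend currently lives in kernel/power/ so it would need to be exposed). As noted above, the re-registration that this triggers still fails with -22:

```c
/* Sketch only: in_suspend is __nosavedata in kernel/power/hibernate.c, so it
 * is nonzero on the image-creation side of hibernation_snapshot() but zero
 * once control is in a kernel restored from the image. That distinguishes
 * the two syscore_resume invocations. */
static void xen_syscore_resume(void)
{
	int cpu;

	if (!in_suspend) {
		/* Real resume from the image: drop the stale per-cpu
		 * xen_vcpu pointers so xen_vcpu_setup() re-issues the
		 * registration hypercall when the non-boot cpus are
		 * brought back online. (This is where the hypercall
		 * currently comes back with -22.) */
		for_each_possible_cpu(cpu)
			xen_vcpu_info_reset(cpu);
	}

	/* ... existing shared_info / grant table / pvclock restore ... */
}
```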

Another line of thought is to work around this problem the way kexec does:
abuse soft_reset and issue it during syscore_resume, or maybe before the image
gets loaded. I haven't experimented with that yet, as I am assuming there has
to be a way to re-register the vcpus during resume.

Thanks,
Anchal
> 
> -boris
> 
> 
> >
> > Thanks,
> > Anchal
> >>
> >>>  I could definitely use some advice to debug this further.
> >>>
> >>>
> >>> Some printk's from my debugging:
> >>>
> >>> At Boot:
> >>>
> >>> xen_vcpu_setup: xen_have_vcpu_info_placement=1 cpu=1, 
> >>> vcpup=0xffff9e548fa560e0, info.mfn=3996246 info.offset=224,
> >>>
> >>> Image Loads:
> >>> It ends up in the condition:
> >>>
> >>>  xen_vcpu_setup()
> >>>  {
> >>>  ...
> >>>  if (xen_hvm_domain()) {
> >>>         if (per_cpu(xen_vcpu, cpu) == &per_cpu(xen_vcpu_info, cpu))
> >>>                 return 0;
> >>>  }
> >>>  ...
> >>>  }
> >>>
> >>> xen_vcpu_setup: checking mfn on resume cpu=1, info.mfn=3934806 
> >>> info.offset=224, &per_cpu(xen_vcpu_info, cpu)=0xffff9d7240a560e0
> >>>
> >>> This was tested on a c4.2xlarge [8 vcpu, 15 GB mem] instance with a 5.10
> >>> kernel running in the guest.
> >>>
> >>> Thanks,
> >>> Anchal.
> >>>> -boris
> >>>>
> >>>>



 

