
Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 22 Aug 2022 11:53:50 +0200
  • Cc: Juergen Gross <jgross@xxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Jürgen Groß <jgross@xxxxxxxx>
  • Delivery-date: Mon, 22 Aug 2022 09:54:02 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:
> On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki wrote:
>> On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
>>> On Tue, Sep 29, 2020 at 05:27:48PM +0200, Jürgen Groß wrote:
>>>> On 29.09.20 17:16, Marek Marczykowski-Górecki wrote:
>>>>> On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
>>>>>> On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
>>>>>>> On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:
>>>>>>>> On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
>>>>>>>>> [Adding Juergen]
>>>>>>>>>
>>>>>>>>> On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
>>>>>>>>>> On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
>>>>>>>>>>> On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> In my test setup (inside KVM with nested virt enabled), I rather
>>>>>>>>>>>> frequently get a Xen crash on resume from S3. Full message below.
>>>>>>>>>>>>
>>>>>>>>>>>> This is Xen 4.13.0, with some patches, including "sched: fix
>>>>>>>>>>>> resuming
>>>>>>>>>>>> from S3 with smt=0".
>>>>>>>>>>>>
>>>>>>>>>>>> Contrary to the previous issue, this one does not always happen -
>>>>>>>>>>>> I would say in about 40% of cases on this setup, but very rarely
>>>>>>>>>>>> on a physical setup.
>>>>>>>>>>>>
>>>>>>>>>>>> This is _without_ core scheduling enabled, and also with smt=off.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you think it would be any different on xen-unstable? I can
>>>>>>>>>>>> try, but it isn't trivial in this setup, so I'd ask first.
>>>>>>>>>>>>
>>>>>>>>> Well, Juergen has fixed quite a few issues.
>>>>>>>>>
>>>>>>>>> Most of them were triggering with core-scheduling enabled, and I
>>>>>>>>> don't recall any of them looking similar or related to this.
>>>>>>>>>
>>>>>>>>> Still, it's possible that the same issue causes different symptoms,
>>>>>>>>> and hence that one of the patches might fix this too.
>>>>>>>>
>>>>>>>> I've tested on master (d094e95fb7c), and reproduced exactly the same
>>>>>>>> crash (pasted below for completeness).
>>>>>>>> But there is more: in most (all?) cases after resume I've also got a
>>>>>>>> soft lockup in Linux dom0 in smp_call_function_single() - see below.
>>>>>>>> It didn't happen before, and the only change was Xen 4.13 -> master.
>>>>>>>>
>>>>>>>> Xen crash:
>>>>>>>>
>>>>>>>> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133
>>>>>>>
>>>>>>> Juergen, any idea about this one? This is also happening on the current
>>>>>>> stable-4.14 (28855ebcdbfa).
>>>>>>>
>>>>>>
>>>>>> Oh, sorry I didn't come back to this issue.
>>>>>>
>>>>>> I suspect this is related to stop_machine_run() being called during
>>>>>> suspend(), as I'm seeing very sporadic issues when offlining and then
>>>>>> onlining cpus with core scheduling being active (it seems as if the
>>>>>> dom0 vcpu doing the cpu online activity is sometimes using an old
>>>>>> vcpu state).
>>>>>
>>>>> Note this is a default Xen 4.14 start, so core scheduling is _not_ active:
>>>>
>>>> The similarity in the two failure cases is that multiple cpus are
>>>> affected by the operations during stop_machine_run().
>>>>
>>>>>
>>>>>      (XEN) Brought up 2 CPUs
>>>>>      (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
>>>>>      (XEN) Adding cpu 0 to runqueue 0
>>>>>      (XEN)  First cpu on runqueue, activating
>>>>>      (XEN) Adding cpu 1 to runqueue 1
>>>>>      (XEN)  First cpu on runqueue, activating
>>>>>
>>>>>> I wasn't able to catch the real problem despite having tried lots
>>>>>> of approaches using debug patches.
>>>>>>
>>>>>> Recently I suspected the whole problem could be somehow related to
>>>>>> RCU handling, as stop_machine_run() relies on tasklets, which execute
>>>>>> in idle context, and RCU handling is done in idle context, too. So
>>>>>> there might be some kind of use-after-free scenario in case some
>>>>>> memory is freed via RCU despite still being used by a tasklet.
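
For reference, a minimal sketch of that suspected ordering problem, written
with Xen-style tasklet/RCU primitives - struct my_state and all functions
below are hypothetical, purely illustrative:

    /* Illustration only - not actual Xen code.  A tasklet and an RCU
     * callback both run from idle context, with nothing ordering them. */
    #include <xen/rcupdate.h>
    #include <xen/tasklet.h>
    #include <xen/xmalloc.h>

    struct my_state {
        struct rcu_head rcu;
        int payload;
    };

    static struct my_state *state;
    static struct tasklet my_tasklet;

    static void free_state(struct rcu_head *head)
    {
        xfree(container_of(head, struct my_state, rcu));
    }

    static void my_tasklet_fn(void *arg)
    {
        struct my_state *s = arg;

        /* Tasklets run from the idle vCPU.  If the RCU callback below has
         * already run - RCU processing also happens in idle context - then
         * 's' points at freed memory: use-after-free. */
        s->payload++;
    }

    static void demo(void)
    {
        state = xzalloc(struct my_state);
        tasklet_init(&my_tasklet, my_tasklet_fn, state);

        tasklet_schedule(&my_tasklet);      /* deferred to idle context */
        call_rcu(&state->rcu, free_state);  /* also fires from idle */
    }

Nothing explicitly orders the two idle-context activities at the end, which
is the kind of window described above.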
>>>>>
>>>>> That sounds plausible, even though I don't really know this area of Xen.
>>>>>
>>>>>> I "just" need to find some time to verify this suspicion. Any help doing
>>>>>> this would be appreciated. :-)
>>>>>
>>>>> I do have a setup where I can easily-ish reproduce the issue. If there
>>>>> is some debug patch you'd like me to try, I can do that.
>>>>
>>>> Thanks. I might come back to that offer, as you are seeing a crash which
>>>> will be much easier to analyze. Catching my error case is much harder, as
>>>> it surfaces some time after the real problem, in a non-destructive way
>>>> (usually I'm seeing a program fail to load a library even though it just
>>>> did its job using exactly that library).
>>>
>>> Hi,
>>>
>>> I'm resurrecting this thread as it was recently mentioned elsewhere. I
>>> can still reproduce the issue on the recent staging branch (9dc687f155).
>>>
>>> It fails after the first resume (not always, but frequently enough to
>>> debug it). At least one guest needs to be running - with just (PV) dom0
>>> the crash doesn't happen (at least for the ~8 times in a row I tried).
>>> If the first resume works, the second will (almost?) always fail, but
>>> with different symptoms - dom0 kernel lockups (of at least some of its
>>> vcpus). I haven't debugged this one yet at all.
>>>
>>> Any help will be appreciated, I can apply some debug patches, change
>>> configuration etc.
>>
>> This still happens on 4.14.3. Maybe it is related to freeing percpu
>> areas, as that has caused other issues with suspend too? Just a thought...
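
For context: per-cpu areas are indeed freed via RCU when a CPU goes offline
(which S3 suspend does for all APs). Roughly paraphrased from
xen/arch/x86/percpu.c - the identifiers below are from memory and may differ
between versions:

    /* Rough paraphrase of the per-cpu teardown pattern (see
     * xen/arch/x86/percpu.c; details may differ between versions). */
    struct free_info {
        unsigned int cpu;
        struct rcu_head rcu;
    };
    static DEFINE_PER_CPU(struct free_info, free_info);

    static void _free_percpu_area(struct rcu_head *head)
    {
        struct free_info *info = container_of(head, struct free_info, rcu);
        unsigned int cpu = info->cpu;

        /* Runs only once an RCU grace period has elapsed. */
        free_xenheap_pages(__per_cpu_start + __per_cpu_offset[cpu],
                           PERCPU_ORDER);
        __per_cpu_offset[cpu] = INVALID_PERCPU_AREA;
    }

    /* Invoked from the CPU_DEAD notifier: defer the actual free. */
    static void free_percpu_area(unsigned int cpu)
    {
        struct free_info *info = &per_cpu(free_info, cpu);

        info->cpu = cpu;
        call_rcu(&info->rcu, _free_percpu_area);
    }

Anything still holding a pointer into the old area across suspend/resume
would then be reading freed (and possibly reallocated) memory.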
> 
> I have reproduced this on current staging (*), and I can reproduce it
> reliably. I have also got an (I believe) closely related crash with the
> credit1 scheduler.
> 
> (*) It isn't plain staging; it's one with my xhci console patches on
> top, including an attempt to make it survive S3. I believe the only
> relevant part there is sticking set_timer() into the console resume path
> (or just having a timer with a rather short delay registered). The
> actual tree is at https://github.com/marmarek/xen/tree/master-xue2-debug,
> including quite a lot of debug prints and debug hacks.
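
For concreteness, a hedged sketch of what such a resume-path change could
look like - xhci_console_resume, xhci_poll_timer_fn and the 50ms period are
made up; only init_timer()/set_timer()/NOW()/MILLISECS() are the real Xen
timer API:

    #include <xen/timer.h>
    #include <xen/time.h>

    static struct timer xhci_poll_timer;

    /* Hypothetical polling callback; it re-arms itself. */
    static void xhci_poll_timer_fn(void *unused)
    {
        /* ... poll the console hardware ... */
        set_timer(&xhci_poll_timer, NOW() + MILLISECS(50));
    }

    /* Hypothetical resume hook: what seems to matter is simply having
     * a timer with a short expiry registered across S3 resume. */
    static void xhci_console_resume(void)
    {
        init_timer(&xhci_poll_timer, xhci_poll_timer_fn, NULL, 0);
        set_timer(&xhci_poll_timer, NOW() + MILLISECS(50));
    }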
> 
> Specific crash with credit2:

Are you sure this is Credit2? Both ...

>     (XEN) Assertion 'sched_unit_master(currunit) == cpu' failed at common/sched/credit.c:928

... here and ...

>     (XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
>     (XEN) CPU:    0
>     (XEN) RIP:    e008:[<ffff82d0402434bf>] credit.c#csched_tick+0x2d4/0x494
>     (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d0v4)
>     (XEN) rax: ffff82d0405c4298   rbx: 0000000000000002   rcx: 0000000000000002
>     (XEN) rdx: ffff8302517f64d0   rsi: ffff8302515c0fc0   rdi: 0000000000000002
>     (XEN) rbp: ffff830256227e38   rsp: ffff830256227de0   r8:  0000000000000004
>     (XEN) r9:  ffff8302517ac820   r10: ffff830251745068   r11: 00000088cb734887
>     (XEN) r12: ffff83025174de50   r13: ffff8302515c0fa0   r14: ffff83025174df40
>     (XEN) r15: ffff8302515c0cc0   cr0: 0000000080050033   cr4: 0000000000372660
>     (XEN) cr3: 00000001bacbd000   cr2: 000077e5ec02a318
>     (XEN) fsb: 000077e5fe533700   gsb: ffff888255700000   gss: 0000000000000000
>     (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>     (XEN) Xen code around <ffff82d0402434bf> (credit.c#csched_tick+0x2d4/0x494):
>     (XEN)  01 00 00 e9 2a 01 00 00 <0f> 0b 0f 0b 0f 0b 48 8b 41 20 0f b7 00 89 45 cc
>     (XEN) Xen stack trace from rsp=ffff830256227de0:
>     (XEN)    ffff830256227fff 0000000000000000 0000000256227e10 ffff82d04035be90
>     (XEN)    ffff830256227ef8 ffff830251745000 ffff82d0405c3280 ffff82d0402431eb
>     (XEN)    0000000000000002 00000088c9ba9534 0000000000000000 ffff830256227e60
>     (XEN)    ffff82d04022ee53 ffff82d0405c3280 ffff8302963e1320 ffff8302515c0fc0
>     (XEN)    ffff830256227ea0 ffff82d04022f73f ffff830256227e80 ffff82d0405c9f00
>     (XEN)    ffffffffffffffff ffff82d0405c9f00 ffff830256227fff 0000000000000000
>     (XEN)    ffff830256227ed8 ffff82d04022d26c ffff830251745000 0000000000000000
>     (XEN)    0000000000000000 ffff830256227fff 0000000000000000 ffff830256227ee8
>     (XEN)    ffff82d04022d2ff 00007cfda9dd80e7 ffff82d0402f03c6 ffff88810c005c00
>     (XEN)    0000000000000031 0000000000000100 00000000fffffe00 0000000000000031
>     (XEN)    0000000000000031 ffffffff82d45d28 0000000000000e2e 0000000000000000
>     (XEN)    0000000000000032 00000000ffffef31 0000000000000000 ffff88812244a700
>     (XEN)    0000000000000005 ffff88812244a780 000000fa00000000 ffffffff818db55f
>     (XEN)    000000000000e033 0000000000000246 ffffc900409b7c50 000000000000e02b
>     (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
>     (XEN)    0000e01000000000 ffff830251745000 0000000000000000 0000000000372660
>     (XEN)    0000000000000000 800000025620b002 000e030300000001 0000000000000000
>     (XEN) Xen call trace:
>     (XEN)    [<ffff82d0402434bf>] R credit.c#csched_tick+0x2d4/0x494
>     (XEN)    [<ffff82d04022ee53>] F timer.c#execute_timer+0x45/0x5c
>     (XEN)    [<ffff82d04022f73f>] F timer.c#timer_softirq_action+0x71/0x278
>     (XEN)    [<ffff82d04022d26c>] F softirq.c#__do_softirq+0x94/0xbe
>     (XEN)    [<ffff82d04022d2ff>] F do_softirq+0x13/0x15
>     (XEN)    [<ffff82d0402f03c6>] F x86_64/entry.S#process_softirqs+0x6/0x20

... here the only references are to credit.c, i.e. Credit1 code.

Jan
