
Re: [BUG] Core scheduling patches causing deadlock in some situations



----- On 29 May 2020 at 18:12, Tamas K Lengyel tamas.k.lengyel@xxxxxxxxx wrote:

> On Fri, May 29, 2020 at 8:48 AM Tamas K Lengyel
> <tamas.k.lengyel@xxxxxxxxx> wrote:
>>
>> On Fri, May 29, 2020 at 7:51 AM Michał Leszczyński
>> <michal.leszczynski@xxxxxxx> wrote:
>> >
>> > ----- On 29 May 2020 at 15:15, Jürgen Groß jgross@xxxxxxxx wrote:
>> >
>> > > On 29.05.20 14:51, Michał Leszczyński wrote:
>> > >>>> ----- On 29 May 2020 at 14:44, Jürgen Groß jgross@xxxxxxxx wrote:
>> > >>
>> > >>> On 29.05.20 14:30, Michał Leszczyński wrote:
>> > >>>> Hello,
>> > >>>>
>> > >>>> I'm running DRAKVUF on a Dell Inc. PowerEdge R640/08HT8T server with an
>> > >>>> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz.
>> > >>>> After upgrading from Xen RELEASE 4.12 to 4.13, we have noticed some
>> > >>>> stability problems with Dom0 (Debian Buster) freezing:
>> > >>>>
>> > >>>> ---
>> > >>>>
>> > >>>> maj 27 23:17:02 debian kernel: rcu: INFO: rcu_sched self-detected 
>> > >>>> stall on CPU
>> > >>>> maj 27 23:17:02 debian kernel: rcu: 0-....: (5250 ticks this GP)
>> > >>>> idle=cee/1/0x4000000000000002 softirq=11964/11964 fqs=2515
>> > >>>> maj 27 23:17:02 debian kernel: rcu: (t=5251 jiffies g=27237 q=799)
>> > >>>> maj 27 23:17:02 debian kernel: NMI backtrace for cpu 0
>> > >>>> maj 27 23:17:02 debian kernel: CPU: 0 PID: 643 Comm: z_rd_int_1 
>> > >>>> Tainted: P OE
>> > >>>> 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
>> > >>>> maj 27 23:17:02 debian kernel: Hardware name: Dell Inc. PowerEdge 
>> > >>>> R640/08HT8T,
>> > >>>> BIOS 2.1.8 04/30/2019
>> > >>>> maj 27 23:17:02 debian kernel: Call Trace:
>> > >>>> maj 27 23:17:02 debian kernel: <IRQ>
>> > >>>> maj 27 23:17:02 debian kernel: dump_stack+0x5c/0x80
>> > >>>> maj 27 23:17:02 debian kernel: nmi_cpu_backtrace.cold.4+0x13/0x50
>> > >>>> maj 27 23:17:02 debian kernel: ? 
>> > >>>> lapic_can_unplug_cpu.cold.29+0x3b/0x3b
>> > >>>> maj 27 23:17:02 debian kernel: nmi_trigger_cpumask_backtrace+0xf9/0xfb
>> > >>>> maj 27 23:17:02 debian kernel: rcu_dump_cpu_stacks+0x9b/0xcb
>> > >>>> maj 27 23:17:02 debian kernel: rcu_check_callbacks.cold.81+0x1db/0x335
>> > >>>> maj 27 23:17:02 debian kernel: ? tick_sched_do_timer+0x60/0x60
>> > >>>> maj 27 23:17:02 debian kernel: update_process_times+0x28/0x60
>> > >>>> maj 27 23:17:02 debian kernel: tick_sched_handle+0x22/0x60
>> > >>>>
>> > >>>> ---
>> > >>>>
>> > >>>> This usually results in the machine becoming completely unresponsive
>> > >>>> and performing an automated reboot after some time.
>> > >>>>
>> > >>>> I've bisected the commits between 4.12 and 4.13, and it seems this is
>> > >>>> the patch that introduced the bug:
>> > >>>> https://github.com/xen-project/xen/commit/7c7b407e77724f37c4b448930777a59a479feb21
>> > >>>>
>> > >>>> Enclosed is the `xl dmesg` log (attachment: dmesg.txt) from a fresh
>> > >>>> boot of the machine on which the bug was reproduced.
>> > >>>>
>> > >>>> I'm also attaching the `xl info` output from this machine:
>> > >>>>
>> > >>>> ---
>> > >>>>
>> > >>>> release : 4.19.0-6-amd64
>> > >>>> version : #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
>> > >>>> machine : x86_64
>> > >>>> nr_cpus : 14
>> > >>>> max_cpu_id : 223
>> > >>>> nr_nodes : 1
>> > >>>> cores_per_socket : 14
>> > >>>> threads_per_core : 1
>> > >>>> cpu_mhz : 2593.930
>> > >>>> hw_caps :
>> > >>>> bfebfbff:77fef3ff:2c100800:00000121:0000000f:d19ffffb:00000008:00000100
>> > >>>> virt_caps : pv hvm hvm_directio pv_directio hap shadow
>> > >>>> total_memory : 130541
>> > >>>> free_memory : 63591
>> > >>>> sharing_freed_memory : 0
>> > >>>> sharing_used_memory : 0
>> > >>>> outstanding_claims : 0
>> > >>>> free_cpus : 0
>> > >>>> xen_major : 4
>> > >>>> xen_minor : 13
>> > >>>> xen_extra : -unstable
>> > >>>> xen_version : 4.13-unstable
>> > >>>> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 
>> > >>>> hvm-3.0-x86_32p
>> > >>>> hvm-3.0-x86_64
>> > >>>> xen_scheduler : credit2
>> > >>>> xen_pagesize : 4096
>> > >>>> platform_params : virt_start=0xffff800000000000
>> > >>>> xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty
>> > >>>
>> > >>> Which is your original Xen base? This output is clearly obtained at the
>> > >>> end of the bisect process.
>> > >>>
>> > >>> There have been quite a few corrections since the release of Xen 4.13,
>> > >>> so please make sure you are running the most recent version (4.13.1).
>> > >>>
>> > >>>
>> > >>> Juergen
>> > >>
>> > >> Sure, we have tested both RELEASE 4.13 and RELEASE 4.13.1. 
>> > >> Unfortunately these
>> > >> corrections didn't help and the bug is still reproducible.
>> > >>
>> > >> From our testing it turns out that:
>> > >>
>> > >> Known working revision: 997d6248a9ae932d0dbaac8d8755c2b15fec25dc
>> > >> Broken revision: 6278553325a9f76d37811923221b21db3882e017
>> > >> First bad commit: 7c7b407e77724f37c4b448930777a59a479feb21
>> > >
>> > > Would it be possible to test xen unstable, too?
>> > >
>> > > I could imagine e.g. commit b492c65da5ec5ed or 99266e31832fb4a4 to have
>> > > an impact here.
>> > >
>> > >
>> > > Juergen
>> >
>> >
>> > I've tried the b492c65da5ec5ed revision, but it seems that there is some
>> > problem with ALTP2M support, so I can't launch anything at all.
>> >
>> > maj 29 15:45:32 debian drakrun[1223]: Failed to set HVM_PARAM_ALTP2M, RC: 
>> > -1
>> > maj 29 15:45:32 debian drakrun[1223]: VMI_ERROR: xc_altp2m_switch_to_view
>> > returned rc: -1
>>
>> Ugh, great, that's another regression in 4.14-unstable. I ran into it
>> myself but couldn't spend the time to figure out whether it's just
>> something in my configuration or not, so I reverted to 4.13.1. Now we
>> know it's a real bug.
> 
> This was a bug in libxl; I've sent in a patch that fixes it, but you
> can also grab it from https://github.com/tklengyel/xen/tree/libxl_fix.
> There is also an update to DRAKVUF that needs to be applied, since the
> recently added altp2m visibility option now has to be specified; you
> can grab that from https://github.com/tklengyel/drakvuf/tree/4.14.
> 
> Tamas


After checking out 99266e31832fb4a4 and applying a patch from 
https://github.com/tklengyel/xen/tree/libxl_fix, it's again possible to 
successfully launch DRAKVUF, but the deadlock caused by the scheduler is still 
reproducible and the whole machine freezes just a few seconds after 
starting the analysis.

So I would suppose that from 7c7b407e77724f37c4b448930777a59a479feb21 through 
99266e31832fb4a4 there is still a bug in the scheduler which causes a freeze 
when using DRAKVUF on some machines, or at least that some default behavior 
of the Xen hypervisor has changed in an improper way.
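
For reference, the bisect workflow used earlier in this thread to pinpoint the first bad commit can be sketched as below. This is a minimal demonstration on a throwaway repository with a trivial pass/fail check; in the real case the repository is xen.git, the endpoints are 997d6248a9ae932d0dbaac8d8755c2b15fec25dc (good) and 6278553325a9f76d37811923221b21db3882e017 (bad), and the `git bisect run` script would instead boot the candidate hypervisor and run the DRAKVUF workload.

```shell
#!/bin/sh
# Minimal git-bisect sketch on a throwaway repo. In the real case the
# repo is xen.git and the endpoints are the revisions quoted above.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email editor@example.com
git config user.name editor
# Create five commits; pretend commit 3 introduced the regression.
for i in 1 2 3 4 5; do
  echo "$i" > marker
  git add marker
  git commit -qm "commit $i"
done
good=$(git rev-list --reverse HEAD | head -n 1)   # oldest commit, known good
git bisect start HEAD "$good"                     # HEAD is known bad
# The run script exits 0 on good commits and non-zero on bad ones.
git bisect run sh -c 'test "$(cat marker)" -lt 3' >/dev/null
first_bad=$(git rev-parse refs/bisect/bad)        # first bad commit found
git bisect reset >/dev/null
echo "first bad commit: $first_bad"
```

With a real hypervisor regression the run script typically builds and installs the candidate revision, reboots into it, and reports good/bad based on whether the freeze reproduces.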


Best regards
Michał Leszczyński
CERT Polska



 

