
Re: [BUG] Core scheduling patches causing deadlock in some situations



----- On 29 May 2020 at 18:12, Tamas K Lengyel tamas.k.lengyel@xxxxxxxxx wrote:

> On Fri, May 29, 2020 at 8:48 AM Tamas K Lengyel
> <tamas.k.lengyel@xxxxxxxxx> wrote:
>>
>> On Fri, May 29, 2020 at 7:51 AM Michał Leszczyński
>> <michal.leszczynski@xxxxxxx> wrote:
>> >
>> > ----- On 29 May 2020 at 15:15, Jürgen Groß jgross@xxxxxxxx wrote:
>> >
>> > > On 29.05.20 14:51, Michał Leszczyński wrote:
>> > >>>> ----- On 29 May 2020 at 14:44, Jürgen Groß jgross@xxxxxxxx wrote:
>> > >>
>> > >>> On 29.05.20 14:30, Michał Leszczyński wrote:
>> > >>>> Hello,
>> > >>>>
>> > >>>> I'm running DRAKVUF on a Dell Inc. PowerEdge R640/08HT8T server with an
>> > >>>> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz.
>> > >>>> After upgrading from Xen RELEASE 4.12 to 4.13, we have noticed some
>> > >>>> stability problems with Dom0 (Debian Buster) freezing:
>> > >>>>
>> > >>>> ---
>> > >>>>
>> > >>>> maj 27 23:17:02 debian kernel: rcu: INFO: rcu_sched self-detected 
>> > >>>> stall on CPU
>> > >>>> maj 27 23:17:02 debian kernel: rcu: 0-....: (5250 ticks this GP)
>> > >>>> idle=cee/1/0x4000000000000002 softirq=11964/11964 fqs=2515
>> > >>>> maj 27 23:17:02 debian kernel: rcu: (t=5251 jiffies g=27237 q=799)
>> > >>>> maj 27 23:17:02 debian kernel: NMI backtrace for cpu 0
>> > >>>> maj 27 23:17:02 debian kernel: CPU: 0 PID: 643 Comm: z_rd_int_1 
>> > >>>> Tainted: P OE
>> > >>>> 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
>> > >>>> maj 27 23:17:02 debian kernel: Hardware name: Dell Inc. PowerEdge 
>> > >>>> R640/08HT8T,
>> > >>>> BIOS 2.1.8 04/30/2019
>> > >>>> maj 27 23:17:02 debian kernel: Call Trace:
>> > >>>> maj 27 23:17:02 debian kernel: <IRQ>
>> > >>>> maj 27 23:17:02 debian kernel: dump_stack+0x5c/0x80
>> > >>>> maj 27 23:17:02 debian kernel: nmi_cpu_backtrace.cold.4+0x13/0x50
>> > >>>> maj 27 23:17:02 debian kernel: ? 
>> > >>>> lapic_can_unplug_cpu.cold.29+0x3b/0x3b
>> > >>>> maj 27 23:17:02 debian kernel: nmi_trigger_cpumask_backtrace+0xf9/0xfb
>> > >>>> maj 27 23:17:02 debian kernel: rcu_dump_cpu_stacks+0x9b/0xcb
>> > >>>> maj 27 23:17:02 debian kernel: rcu_check_callbacks.cold.81+0x1db/0x335
>> > >>>> maj 27 23:17:02 debian kernel: ? tick_sched_do_timer+0x60/0x60
>> > >>>> maj 27 23:17:02 debian kernel: update_process_times+0x28/0x60
>> > >>>> maj 27 23:17:02 debian kernel: tick_sched_handle+0x22/0x60
>> > >>>>
>> > >>>> ---
>> > >>>>
>> > >>>> This usually results in the machine becoming completely unresponsive
>> > >>>> and performing an automated reboot after some time.
>> > >>>>
>> > >>>> I've bisected the commits between 4.12 and 4.13, and it seems this is
>> > >>>> the patch that introduced the bug:
>> > >>>> https://github.com/xen-project/xen/commit/7c7b407e77724f37c4b448930777a59a479feb21
>> > >>>>
>> > >>>> Enclosed is the `xl dmesg` log (attachment: dmesg.txt) from a fresh
>> > >>>> boot of the machine on which the bug was reproduced.
>> > >>>>
>> > >>>> I'm also attaching the `xl info` output from this machine:
>> > >>>>
>> > >>>> ---
>> > >>>>
>> > >>>> release : 4.19.0-6-amd64
>> > >>>> version : #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
>> > >>>> machine : x86_64
>> > >>>> nr_cpus : 14
>> > >>>> max_cpu_id : 223
>> > >>>> nr_nodes : 1
>> > >>>> cores_per_socket : 14
>> > >>>> threads_per_core : 1
>> > >>>> cpu_mhz : 2593.930
>> > >>>> hw_caps :
>> > >>>> bfebfbff:77fef3ff:2c100800:00000121:0000000f:d19ffffb:00000008:00000100
>> > >>>> virt_caps : pv hvm hvm_directio pv_directio hap shadow
>> > >>>> total_memory : 130541
>> > >>>> free_memory : 63591
>> > >>>> sharing_freed_memory : 0
>> > >>>> sharing_used_memory : 0
>> > >>>> outstanding_claims : 0
>> > >>>> free_cpus : 0
>> > >>>> xen_major : 4
>> > >>>> xen_minor : 13
>> > >>>> xen_extra : -unstable
>> > >>>> xen_version : 4.13-unstable
>> > >>>> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 
>> > >>>> hvm-3.0-x86_32p
>> > >>>> hvm-3.0-x86_64
>> > >>>> xen_scheduler : credit2
>> > >>>> xen_pagesize : 4096
>> > >>>> platform_params : virt_start=0xffff800000000000
>> > >>>> xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty
>> > >>>
>> > >>> Which is your original Xen base? This output is clearly obtained at the
>> > >>> end of the bisect process.
>> > >>>
>> > >>> There have been quite a few corrections since the release of Xen 4.13,
>> > >>> so please make sure you are running the most recent version (4.13.1).
>> > >>>
>> > >>>
>> > >>> Juergen
>> > >>
>> > >> Sure, we have tested both RELEASE 4.13 and RELEASE 4.13.1. 
>> > >> Unfortunately these
>> > >> corrections didn't help and the bug is still reproducible.
>> > >>
>> > >> From our testing it turns out that:
>> > >>
>> > >> Known working revision: 997d6248a9ae932d0dbaac8d8755c2b15fec25dc
>> > >> Broken revision: 6278553325a9f76d37811923221b21db3882e017
>> > >> First bad commit: 7c7b407e77724f37c4b448930777a59a479feb21
>> > >
>> > > Would it be possible to test xen unstable, too?
>> > >
>> > > I could imagine e.g. commit b492c65da5ec5ed or 99266e31832fb4a4 to have
>> > > an impact here.
>> > >
>> > >
>> > > Juergen
>> >
>> >
>> > I've tried the b492c65da5ec5ed revision, but it seems that there is some
>> > problem with ALTP2M support, so I can't launch anything at all.
>> >
>> > maj 29 15:45:32 debian drakrun[1223]: Failed to set HVM_PARAM_ALTP2M, RC: 
>> > -1
>> > maj 29 15:45:32 debian drakrun[1223]: VMI_ERROR: xc_altp2m_switch_to_view
>> > returned rc: -1
>>
>> Ugh, great, that's another regression in 4.14-unstable. I ran into it
>> myself but couldn't spend the time to figure out whether it's just
>> something in my configuration or not, so I reverted to 4.13.1. Now we
>> know it's a real bug.
> 
> This was a bug in libxl; I've sent in a patch that fixes it, but you
> can also grab it from https://github.com/tklengyel/xen/tree/libxl_fix.
> There is also an update to DRAKVUF that needs to be applied, since the
> recently added altp2m visibility option now has to be specified; you
> can grab that from https://github.com/tklengyel/drakvuf/tree/4.14.
> 
> Tamas


After checking out 99266e31832fb4a4 and applying a patch from 
https://github.com/tklengyel/xen/tree/libxl_fix, it's again possible to 
successfully launch DRAKVUF, but the deadlock caused by the scheduler is still 
reproducible and the whole machine freezes just a few seconds after 
starting the analysis.

So I would suppose that from 7c7b407e77724f37c4b448930777a59a479feb21 through 
99266e31832fb4a4 there is still a bug in the scheduler which causes a freeze 
when using DRAKVUF on some machines, or at least that some default behavior 
of the Xen hypervisor has changed in an improper way.
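
For reference, the bisect workflow used earlier in this thread to pinpoint the first bad commit can be sketched as below. This is a minimal demonstration on a throwaway repository with a trivial pass/fail check; in the real case the repository is xen.git, the endpoints are 997d6248a9ae932d0dbaac8d8755c2b15fec25dc (good) and 6278553325a9f76d37811923221b21db3882e017 (bad), and the `git bisect run` script would instead boot the candidate hypervisor and run the DRAKVUF workload.

```shell
#!/bin/sh
# Minimal git-bisect sketch on a throwaway repo. In the real case the
# repo is xen.git and the endpoints are the revisions quoted above.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email editor@example.com
git config user.name editor
# Create five commits; pretend commit 3 introduced the regression.
for i in 1 2 3 4 5; do
  echo "$i" > marker
  git add marker
  git commit -qm "commit $i"
done
good=$(git rev-list --reverse HEAD | head -n 1)   # oldest commit, known good
git bisect start HEAD "$good"                     # HEAD is known bad
# The run script exits 0 on good commits and non-zero on bad ones.
git bisect run sh -c 'test "$(cat marker)" -lt 3' >/dev/null
first_bad=$(git rev-parse refs/bisect/bad)        # first bad commit found
git bisect reset >/dev/null
echo "first bad commit: $first_bad"
```

With a real hypervisor regression the run script typically builds and installs the candidate revision, reboots into it, and reports good/bad based on whether the freeze reproduces.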


Best regards
Michał Leszczyński
CERT Polska



 

