[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] cpupool / credit2 misuse of xfree() (was: Re: [BUG] Xen causes a host hang by using xen-hptool cpu-offline)
On 27.07.2022 03:19, Gao, Ruifeng wrote: > Problem Description: > Trying to execute "/usr/local/sbin/xen-hptool cpu-offline <cpuid>", the host > will hang immediately. > > Version-Release and System Details: > Platform: Ice Lake Server > Host OS: Red Hat Enterprise Linux 8.3 (Ootpa) > Kernel: 5.19.0-rc6 > HW: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz > Xen Version: 4.17-unstable(ab2977b027-dirty) > > Reproduce Steps: > 1. Boot from Xen and check the information: > [root@icx-2s1 ~]# xl info > host : icx-2s1 > release : 5.19.0-rc6 > xen_version : 4.17-unstable > xen_caps : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p > hvm-3.0-x86_64 > platform_params : virt_start=0xffff800000000000 > xen_changeset : Thu Jul 14 19:45:36 2022 +0100 git:ab2977b027-dirty > 2. Execute the cpu-offline command, here cpuid is 48 as an example: > [root@icx-2s1 ~]# /usr/local/sbin/xen-hptool cpu-offline 48 > > Actual Results: > The host will hang immediately. Well, it crashes (which is an important difference). Also you've hidden the important details (allowing to easily identify what area the issue is in) quite well in the attachment. Jürgen (and possibly George / Dario), this (XEN) Xen call trace: (XEN) [<ffff82d04023be76>] R xfree+0x150/0x1f7 (XEN) [<ffff82d040248795>] F common/sched/credit2.c#csched2_free_udata+0xc/0xe (XEN) [<ffff82d040259169>] F schedule_cpu_rm+0x38d/0x4b3 (XEN) [<ffff82d0402430ca>] F common/sched/cpupool.c#cpupool_unassign_cpu_finish+0x17e/0x22c (XEN) [<ffff82d04021d402>] F common/sched/cpupool.c#cpu_callback+0x3fb/0x4dc (XEN) [<ffff82d040229fc3>] F notifier_call_chain+0x6b/0x96 (XEN) [<ffff82d040204df7>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x33 (XEN) [<ffff82d040204e33>] F common/cpu.c#_take_cpu_down+0x24/0x2b (XEN) [<ffff82d040204e43>] F common/cpu.c#take_cpu_down+0x9/0x10 (XEN) [<ffff82d040231517>] F common/stop_machine.c#stopmachine_action+0x86/0x96 (XEN) [<ffff82d040231cc5>] F common/tasklet.c#do_tasklet_work+0x72/0xa5 (XEN) [<ffff82d040231f42>] F do_tasklet+0x58/0x8a (XEN) [<ffff82d040320b60>] F arch/x86/domain.c#idle_loop+0x8d/0xee (XEN) (XEN) (XEN) **************************************** (XEN) Panic on CPU 48: (XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:704 (XEN) **************************************** is pointing at the problem quite clearly. Conceptually I think it has always been wrong to call xfree() from stop-machine context. It just so happened that we got away with that so far, because the CPU being brought down was the only one using respective functions (and hence there was no other risk of locking issues). Question is whether we want to continue building upon this (and hence the involved assertion would need to "learn" to ignore stop-machine context) or whether instead the freeing of the memory here can be deferred, e.g. to be taken care of by the CPU driving the offlining process. Jan
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |