Re: cpupool / credit2 misuse of xfree() (was: Re: [BUG] Xen causes a host hang by using xen-hptool cpu-offline)
On 27.07.22 08:32, Jan Beulich wrote:
> On 27.07.2022 03:19, Gao, Ruifeng wrote:
>> Problem Description:
>> Trying to execute "/usr/local/sbin/xen-hptool cpu-offline <cpuid>",
>> the host will hang immediately.
>>
>> Version-Release and System Details:
>> Platform: Ice Lake Server
>> Host OS: Red Hat Enterprise Linux 8.3 (Ootpa)
>> Kernel: 5.19.0-rc6
>> HW: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
>> Xen Version: 4.17-unstable (ab2977b027-dirty)
>>
>> Reproduce Steps:
>> 1. Boot from Xen and check the information:
>> [root@icx-2s1 ~]# xl info
>> host              : icx-2s1
>> release           : 5.19.0-rc6
>> xen_version       : 4.17-unstable
>> xen_caps          : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
>> platform_params   : virt_start=0xffff800000000000
>> xen_changeset     : Thu Jul 14 19:45:36 2022 +0100 git:ab2977b027-dirty
>>
>> 2. Execute the cpu-offline command, here cpuid is 48 as an example:
>> [root@icx-2s1 ~]# /usr/local/sbin/xen-hptool cpu-offline 48
>>
>> Actual Results:
>> The host will hang immediately.
>
> Well, it crashes (which is an important difference). Also you've hidden
> the important details (allowing to easily identify what area the issue
> is in) quite well in the attachment.
> Jürgen (and possibly George / Dario), this
>
> (XEN) Xen call trace:
> (XEN)    [<ffff82d04023be76>] R xfree+0x150/0x1f7
> (XEN)    [<ffff82d040248795>] F common/sched/credit2.c#csched2_free_udata+0xc/0xe
> (XEN)    [<ffff82d040259169>] F schedule_cpu_rm+0x38d/0x4b3
> (XEN)    [<ffff82d0402430ca>] F common/sched/cpupool.c#cpupool_unassign_cpu_finish+0x17e/0x22c
> (XEN)    [<ffff82d04021d402>] F common/sched/cpupool.c#cpu_callback+0x3fb/0x4dc
> (XEN)    [<ffff82d040229fc3>] F notifier_call_chain+0x6b/0x96
> (XEN)    [<ffff82d040204df7>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x33
> (XEN)    [<ffff82d040204e33>] F common/cpu.c#_take_cpu_down+0x24/0x2b
> (XEN)    [<ffff82d040204e43>] F common/cpu.c#take_cpu_down+0x9/0x10
> (XEN)    [<ffff82d040231517>] F common/stop_machine.c#stopmachine_action+0x86/0x96
> (XEN)    [<ffff82d040231cc5>] F common/tasklet.c#do_tasklet_work+0x72/0xa5
> (XEN)    [<ffff82d040231f42>] F do_tasklet+0x58/0x8a
> (XEN)    [<ffff82d040320b60>] F arch/x86/domain.c#idle_loop+0x8d/0xee
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 48:
> (XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:704
> (XEN) ****************************************
>
> is pointing at the problem quite clearly. Conceptually I think it has
> always been wrong to call xfree() from stop-machine context. It just so
> happened that we got away with that so far, because the CPU being
> brought down was the only one using the respective functions (and hence
> there was no other risk of locking issues). Question is whether we want
> to continue building upon this (and hence the involved assertion would
> need to "learn" to ignore stop-machine context) or whether instead the
> freeing of the memory here can be deferred, e.g. to be taken care of by
> the CPU driving the offlining process.

This is even more complicated. I think ASSERT_ALLOC_CONTEXT() will
trigger more often, especially with core scheduling enabled.
In fact I think this is the reason why I've seen very rare strange
failures with core scheduling when trying cpu hotplug operations, as
there are even xmalloc() calls in stop-machine context.

I'm seeing the following possibilities:

1) Pre-allocating the needed data and deferring freeing of no longer
   needed data when taking a cpu down. Apart from some refactoring in
   common/sched/cpupool.c and common/sched/core.c this should be doable.

2) In case stop_machine() is called for action on only one cpu, allow
   memory allocations and freeing with interrupts off and flush the
   TLBs locally when enabling interrupts again. This would require
   rather limited changes, but wouldn't be as clean as the other
   approach.

Any preferences? I'd be fine with both variants and could write the
patches.


Juergen
Attachment: OpenPGP_0xB0DE9DD628BF132F.asc
Attachment: OpenPGP_signature