Xen project Mailing List

cpupool / credit2 misuse of xfree() (was: Re: [BUG] Xen causes a host hang by using xen-hptool cpu-offline)

To: "Gao, Ruifeng" <ruifeng.gao@xxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>

Date: Wed, 27 Jul 2022 08:32:46 +0200

Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>

Delivery-date: Wed, 27 Jul 2022 06:33:14 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.07.2022 03:19, Gao, Ruifeng wrote:
> Problem Description:
> Trying to execute "/usr/local/sbin/xen-hptool cpu-offline <cpuid>", the host
> will hang immediately.
>
> Version-Release and System Details:
> Platform: Ice Lake Server
> Host OS: Red Hat Enterprise Linux 8.3 (Ootpa)
> Kernel: 5.19.0-rc6
> HW: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
> Xen Version: 4.17-unstable(ab2977b027-dirty)
>
> Reproduce Steps:
> 1. Boot from Xen and check the information:
> [root@icx-2s1 ~]# xl info
> host : icx-2s1
> release : 5.19.0-rc6
> xen_version : 4.17-unstable
> xen_caps : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p
> hvm-3.0-x86_64
> platform_params : virt_start=0xffff800000000000
> xen_changeset : Thu Jul 14 19:45:36 2022 +0100 git:ab2977b027-dirty
> 2. Execute the cpu-offline command, here cpuid is 48 as an example:
> [root@icx-2s1 ~]# /usr/local/sbin/xen-hptool cpu-offline 48
>
> Actual Results:
> The host will hang immediately.

Well, it crashes (which is an important difference). Also you've hidden the
important details (allowing to easily identify what area the issue is in)
quite well in the attachment.

Jürgen (and possibly George / Dario), this

(XEN) Xen call trace:
(XEN)    [<ffff82d04023be76>] R xfree+0x150/0x1f7
(XEN)    [<ffff82d040248795>] F common/sched/credit2.c#csched2_free_udata+0xc/0xe
(XEN)    [<ffff82d040259169>] F schedule_cpu_rm+0x38d/0x4b3
(XEN)    [<ffff82d0402430ca>] F common/sched/cpupool.c#cpupool_unassign_cpu_finish+0x17e/0x22c
(XEN)    [<ffff82d04021d402>] F common/sched/cpupool.c#cpu_callback+0x3fb/0x4dc
(XEN)    [<ffff82d040229fc3>] F notifier_call_chain+0x6b/0x96
(XEN)    [<ffff82d040204df7>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x33
(XEN)    [<ffff82d040204e33>] F common/cpu.c#_take_cpu_down+0x24/0x2b
(XEN)    [<ffff82d040204e43>] F common/cpu.c#take_cpu_down+0x9/0x10
(XEN)    [<ffff82d040231517>] F common/stop_machine.c#stopmachine_action+0x86/0x96
(XEN)    [<ffff82d040231cc5>] F common/tasklet.c#do_tasklet_work+0x72/0xa5
(XEN)    [<ffff82d040231f42>] F do_tasklet+0x58/0x8a
(XEN)    [<ffff82d040320b60>] F arch/x86/domain.c#idle_loop+0x8d/0xee
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 48:
(XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:704
(XEN) ****************************************

is pointing at the problem quite clearly. Conceptually I think it has always
been wrong to call xfree() from stop-machine context. It just so happened
that we got away with that so far, because the CPU being brought down was
the only one using respective functions (and hence there was no other risk
of locking issues).

Question is whether we want to continue building upon this (and hence the
involved assertion would need to "learn" to ignore stop-machine context)
or whether instead the freeing of the memory here can be deferred, e.g. to
be taken care of by the CPU driving the offlining process.

Jan
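
To illustrate the second option (deferring the free), here is a minimal
stand-alone sketch in plain C. It is not Xen code: model_xfree(),
struct cpu_rm_data, cpu_rm_defer(), cpu_rm_finish() and demo_offline() are
hypothetical names made up for this example, and the two globals merely
stand in for local_irq_is_enabled() and num_online_cpus(). The idea is that
the stop-machine phase only records the pointer in storage set up
beforehand, and the CPU driving the offlining frees it once interrupts are
enabled again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_PENDING 4

/* Hypothetical model of context state; Xen reads this from the real CPU. */
bool irqs_enabled = true;        /* stands in for local_irq_is_enabled() */
unsigned int online_cpus = 2;    /* stands in for num_online_cpus() */
unsigned int frees_done;         /* counts frees, for the demo only */

/* Same shape as the assertion in the trace, without the in_irq() part. */
bool free_is_safe(void)
{
    return irqs_enabled || online_cpus <= 1;
}

/* Stand-in for xfree(): asserts it is called from an allowed context. */
void model_xfree(void *p)
{
    assert(free_is_safe());
    free(p);
    frees_done++;
}

/* Set up before stop-machine, so nothing is allocated with IRQs off. */
struct cpu_rm_data {
    void *pending[MAX_PENDING];
    unsigned int count;
};

/* Stop-machine phase: record the pointer, never touch the allocator. */
bool cpu_rm_defer(struct cpu_rm_data *d, void *p)
{
    if (d->count >= MAX_PENDING)
        return false;
    d->pending[d->count++] = p;
    return true;
}

/* Driving CPU, after stop-machine: IRQs are on again, freeing is fine. */
void cpu_rm_finish(struct cpu_rm_data *d)
{
    while (d->count > 0)
        model_xfree(d->pending[--d->count]);
}

/* Walks through one offlining; returns the number of frees performed. */
unsigned int demo_offline(void)
{
    struct cpu_rm_data d = { .count = 0 };
    void *udata = malloc(32);    /* plays the role of scheduler udata */
    bool ok;

    frees_done = 0;
    irqs_enabled = false;        /* enter stop-machine context */
    ok = cpu_rm_defer(&d, udata);
    irqs_enabled = true;         /* leave stop-machine context */

    if (!ok) {
        free(udata);
        return 0;
    }
    cpu_rm_finish(&d);
    return frees_done;
}
```

Calling model_xfree() directly between the two irqs_enabled assignments
would trip the assert in this model, which is the analogue of the panic
above. Whether a fixed-size record like this fits what schedule_cpu_rm()
actually needs to free is an open question the sketch does not answer.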
