
cpupool / credit2 misuse of xfree() (was: Re: [BUG] Xen causes a host hang by using xen-hptool cpu-offline)


  • To: "Gao, Ruifeng" <ruifeng.gao@xxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 27 Jul 2022 08:32:46 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>
  • Delivery-date: Wed, 27 Jul 2022 06:33:14 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.07.2022 03:19, Gao, Ruifeng wrote:
> Problem Description:
> Trying to execute "/usr/local/sbin/xen-hptool cpu-offline <cpuid>", the host 
> will hang immediately.
> 
> Version-Release and System Details:
> Platform: Ice Lake Server
> Host OS: Red Hat Enterprise Linux 8.3 (Ootpa)
> Kernel: 5.19.0-rc6
> HW: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz 
> Xen Version: 4.17-unstable(ab2977b027-dirty)
> 
> Reproduce Steps:
> 1. Boot from Xen and check the information:
> [root@icx-2s1 ~]# xl info
> host                   : icx-2s1
> release                : 5.19.0-rc6
> xen_version            : 4.17-unstable
> xen_caps               : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : Thu Jul 14 19:45:36 2022 +0100 git:ab2977b027-dirty
> 2. Execute the cpu-offline command, here cpuid is 48 as an example:
> [root@icx-2s1 ~]# /usr/local/sbin/xen-hptool cpu-offline 48
> 
> Actual Results:
> The host will hang immediately.

Well, it crashes (which is an important difference). Also, you've hidden
the important details (the ones that make it easy to identify which area
the issue is in) quite well in the attachment.

Jürgen (and possibly George / Dario),

this

(XEN) Xen call trace:
(XEN)    [<ffff82d04023be76>] R xfree+0x150/0x1f7
(XEN)    [<ffff82d040248795>] F common/sched/credit2.c#csched2_free_udata+0xc/0xe
(XEN)    [<ffff82d040259169>] F schedule_cpu_rm+0x38d/0x4b3
(XEN)    [<ffff82d0402430ca>] F common/sched/cpupool.c#cpupool_unassign_cpu_finish+0x17e/0x22c
(XEN)    [<ffff82d04021d402>] F common/sched/cpupool.c#cpu_callback+0x3fb/0x4dc
(XEN)    [<ffff82d040229fc3>] F notifier_call_chain+0x6b/0x96
(XEN)    [<ffff82d040204df7>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x33
(XEN)    [<ffff82d040204e33>] F common/cpu.c#_take_cpu_down+0x24/0x2b
(XEN)    [<ffff82d040204e43>] F common/cpu.c#take_cpu_down+0x9/0x10
(XEN)    [<ffff82d040231517>] F common/stop_machine.c#stopmachine_action+0x86/0x96
(XEN)    [<ffff82d040231cc5>] F common/tasklet.c#do_tasklet_work+0x72/0xa5
(XEN)    [<ffff82d040231f42>] F do_tasklet+0x58/0x8a
(XEN)    [<ffff82d040320b60>] F arch/x86/domain.c#idle_loop+0x8d/0xee
(XEN) 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 48:
(XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:704
(XEN) ****************************************

is pointing at the problem quite clearly. Conceptually I think it
has always been wrong to call xfree() from stop-machine context. It
just so happened that we got away with that so far, because the CPU
being brought down was the only one using the respective functions
(and hence there was no other risk of locking issues).
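
To make the failing condition concrete, here is a minimal stand-alone
model of the predicate quoted in the panic message. It is only a sketch
and not Xen code: struct ctx_state and xfree_context_ok() are
hypothetical names, and it assumes that the stop-machine callback runs
with interrupts disabled while other CPUs are still online.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the assertion inspects. */
struct ctx_state {
    bool in_irq;              /* models in_irq() */
    bool irqs_enabled;        /* models local_irq_is_enabled() */
    unsigned int online_cpus; /* models num_online_cpus() */
};

/*
 * Same shape as the asserted condition:
 * !in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)
 */
static bool xfree_context_ok(const struct ctx_state *s)
{
    return !s->in_irq && (s->irqs_enabled || s->online_cpus <= 1);
}
```

With in_irq false, irqs_enabled false and more than one CPU online (the
assumed stop-machine case), the predicate evaluates to false, which
matches the assertion failure in the log above.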

The question is whether we want to continue building upon this (in
which case the assertion involved would need to "learn" to ignore
stop-machine context), or whether the freeing of the memory here can
instead be deferred, e.g. to be taken care of by the CPU driving the
offlining process.
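
The second option could look roughly like the following stand-alone
sketch. It is not a proposed patch: struct deferred_frees, defer_free()
and flush_deferred() are hypothetical names, the fixed-size array is an
arbitrary simplification, and plain free() stands in for xfree(). The
idea is that the restricted context only records the pointer, and the
CPU driving the offlining releases the memory later from a context
where freeing is allowed.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_DEFERRED 8 /* arbitrary capacity for this sketch */

/* Hypothetical container for pointers whose freeing is postponed. */
struct deferred_frees {
    void *ptr[MAX_DEFERRED];
    size_t count;
};

/*
 * Called from the context where freeing is not allowed.
 * Only records the pointer; returns 0 on success, -1 if the list is full.
 */
static int defer_free(struct deferred_frees *d, void *p)
{
    if (d->count >= MAX_DEFERRED)
        return -1;
    d->ptr[d->count++] = p;
    return 0;
}

/*
 * Called later by the CPU driving the operation, from a context where
 * freeing is permitted. Returns the number of pointers released.
 */
static size_t flush_deferred(struct deferred_frees *d)
{
    size_t n = d->count;

    for (size_t i = 0; i < n; i++) {
        free(d->ptr[i]);
        d->ptr[i] = NULL;
    }
    d->count = 0;
    return n;
}
```

A real implementation would also have to decide who owns the list and
how it is handed from the CPU going down to the one driving the
offlining; the sketch leaves that open.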

Jan



 

