
Re: Linux: balloon_process() causing workqueue lockups?


  • To: Juergen Gross <jgross@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 27 Aug 2021 11:44:43 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 27 Aug 2021 09:45:00 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.08.2021 11:29, Juergen Gross wrote:
> On 27.08.21 11:01, Jan Beulich wrote:
>> ballooning down Dom0 by about 16G in one go once in a while causes:
>>
>> BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 64s!
>> Showing busy workqueues and worker pools:
>> workqueue events: flags=0x0
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
>>      in-flight: 229:balloon_process
>>      pending: cache_reap
>> workqueue events_freezable_power_: flags=0x84
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>      pending: disk_events_workfn
>> workqueue mm_percpu_wq: flags=0x8
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>      pending: vmstat_update
>> pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=64s workers=3 idle: 2222 43
>>
>> I've tried to double check that this isn't related to my IOMMU work
>> in the hypervisor, and I'm pretty sure it isn't. Looking at the
>> function I see it has a cond_resched(), but aiui this won't help
>> with further items in the same workqueue.
>>
>> Thoughts?
> 
> I'm seeing two possible solutions here:
> 
> 1. After some time (1 second?) in balloon_process(), set up a new
>     workqueue activity and return (similar to EAGAIN, but without
>     increasing the delay).
> 
> 2. Don't use a workqueue for the ballooning activity; use a kernel
>     thread instead.
> 
> I have a slight preference for 2, even if the resulting patch will
> be larger. 1 is only working around the issue and it is hard to
> find a really good timeout value.
> 
> I'd be fine to write a patch, but would prefer some feedback on
> which way to go.
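
For concreteness, a minimal sketch of what 1 might boil down to; this is
purely illustrative, with more_ballooning_to_do() / do_one_ballooning_step()
as made-up stand-ins for the real credit / reservation logic, not the actual
drivers/xen/balloon.c code:

#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

/* Made-up stand-ins for the driver's credit / reservation logic. */
static bool more_ballooning_to_do(void);
static void do_one_ballooning_step(void);

static void balloon_process(struct work_struct *work);
static DECLARE_DELAYED_WORK(balloon_worker, balloon_process);

static void balloon_process(struct work_struct *work)
{
	unsigned long deadline = jiffies + HZ;	/* roughly 1 second budget */

	while (more_ballooning_to_do()) {
		do_one_ballooning_step();

		/*
		 * cond_resched() only lets other runnable tasks preempt us;
		 * the pool worker itself stays busy, so other work items
		 * queued on the same per-CPU pool remain pending.
		 */
		cond_resched();

		if (time_after(jiffies, deadline)) {
			/*
			 * Re-queue with no extra delay and return, so the
			 * pool can process the other pending items first.
			 */
			schedule_delayed_work(&balloon_worker, 0);
			return;
		}
	}
}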

Was there a particular reason that a workqueue was used in the first
place? If not, using a kernel thread would indeed look like the way
to go. The presence of cond_resched() already hints at such an
intention anyway.
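
A rough sketch of 2, with the same made-up stand-ins as above and not
meant as the eventual patch:

#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/sched.h>

/* Same made-up stand-ins as in the sketch of option 1. */
static bool more_ballooning_to_do(void);
static void do_one_ballooning_step(void);

static DECLARE_WAIT_QUEUE_HEAD(balloon_thread_wq);
static bool balloon_request;	/* set by whoever asks for a size change */

static int balloon_thread(void *unused)
{
	for (;;) {
		wait_event_interruptible(balloon_thread_wq,
					 balloon_request ||
					 kthread_should_stop());
		if (kthread_should_stop())
			return 0;
		balloon_request = false;

		/*
		 * Running in our own thread, we only need to stay
		 * preemptible; no other work items are queued behind us.
		 */
		while (more_ballooning_to_do()) {
			do_one_ballooning_step();
			cond_resched();
		}
	}
}

/*
 * Created once at init time, e.g.:
 *	kthread_run(balloon_thread, NULL, "xen-balloon");
 */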

Jan