
Re: Linux: balloon_process() causing workqueue lockups?


  • To: Juergen Gross <jgross@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 27 Aug 2021 11:44:43 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 27 Aug 2021 09:45:00 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.08.2021 11:29, Juergen Gross wrote:
> On 27.08.21 11:01, Jan Beulich wrote:
>> ballooning down Dom0 by about 16G in one go once in a while causes:
>>
>> BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 64s!
>> Showing busy workqueues and worker pools:
>> workqueue events: flags=0x0
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
>>      in-flight: 229:balloon_process
>>      pending: cache_reap
>> workqueue events_freezable_power_: flags=0x84
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>      pending: disk_events_workfn
>> workqueue mm_percpu_wq: flags=0x8
>>    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>      pending: vmstat_update
>> pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=64s workers=3 idle: 2222 43
>>
>> I've tried to double check that this isn't related to my IOMMU work
>> in the hypervisor, and I'm pretty sure it isn't. Looking at the
>> function I see it has a cond_resched(), but aiui this won't help
>> with further items in the same workqueue.
>>
>> Thoughts?
> 
> I'm seeing two possible solutions here:
> 
> 1. After some time (1 second?) in balloon_process(), set up a new
>     workqueue activity and return (similar to EAGAIN, but without
>     increasing the delay).
> 
> 2. Don't use a workqueue for the ballooning activity; use a kernel
>     thread instead.
> 
> I have a slight preference for 2, even if the resulting patch will
> be larger. 1 is only working around the issue and it is hard to
> find a really good timeout value.
> 
> I'd be fine to write a patch, but would prefer some feedback on
> which way to go.
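
For concreteness, a minimal sketch of what 1 might boil down to; this is
purely illustrative, with more_ballooning_to_do() / do_one_ballooning_step()
as made-up stand-ins for the real credit / reservation logic, not the actual
drivers/xen/balloon.c code:

#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

/* Made-up stand-ins for the driver's credit / reservation logic. */
static bool more_ballooning_to_do(void);
static void do_one_ballooning_step(void);

static void balloon_process(struct work_struct *work);
static DECLARE_DELAYED_WORK(balloon_worker, balloon_process);

static void balloon_process(struct work_struct *work)
{
	unsigned long deadline = jiffies + HZ;	/* roughly 1 second budget */

	while (more_ballooning_to_do()) {
		do_one_ballooning_step();

		/*
		 * cond_resched() only lets other runnable tasks preempt us;
		 * the pool worker itself stays busy, so other work items
		 * queued on the same per-CPU pool remain pending.
		 */
		cond_resched();

		if (time_after(jiffies, deadline)) {
			/*
			 * Re-queue with no extra delay and return, so the
			 * pool can process the other pending items first.
			 */
			schedule_delayed_work(&balloon_worker, 0);
			return;
		}
	}
}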

Was there a particular reason that a workqueue was used in the first
place? If not, using a kernel thread would indeed look like the way
to go. The presence of cond_resched() already hints at such an
intention anyway.
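
A rough sketch of 2, with the same made-up stand-ins as above and not
meant as the eventual patch:

#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/sched.h>

/* Same made-up stand-ins as in the sketch of option 1. */
static bool more_ballooning_to_do(void);
static void do_one_ballooning_step(void);

static DECLARE_WAIT_QUEUE_HEAD(balloon_thread_wq);
static bool balloon_request;	/* set by whoever asks for a size change */

static int balloon_thread(void *unused)
{
	for (;;) {
		wait_event_interruptible(balloon_thread_wq,
					 balloon_request ||
					 kthread_should_stop());
		if (kthread_should_stop())
			return 0;
		balloon_request = false;

		/*
		 * Running in our own thread, we only need to stay
		 * preemptible; no other work items are queued behind us.
		 */
		while (more_ballooning_to_do()) {
			do_one_ballooning_step();
			cond_resched();
		}
	}
}

/*
 * Created once at init time, e.g.:
 *	kthread_run(balloon_thread, NULL, "xen-balloon");
 */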

Jan