[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] xen/sched: fix sched_move_domain()


  • To: Juergen Gross <jgross@xxxxxxxx>
  • From: Henry Wang <Henry.Wang@xxxxxxx>
  • Date: Thu, 19 Oct 2023 11:49:06 +0000
  • Accept-language: zh-CN, en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=arm.com; dmarc=pass action=none header.from=arm.com; dkim=pass header.d=arm.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=PXin2IsumT0ldkqAtVH7Y+bxdFf/GfBf2HlSmfwZ1hA=; b=lp8xmFXbpZSCFsFULHyBJjnjRDaTVPyTnvFaYSI7FIBymGrKgxvnr1IH8Je3g+BRJrlMu3MvwnejS8E6Jas6/KRsWdvrIZ7jkle68F3SJPuNZaF+FxpkLQLRTe9FvgTdnm1Z0cD55CvMxD9cA7PLDTVyakSg11m6QeP+du7MFOWnrfMjqAlVL0X/fAJa5BGxbH/WaZLKzDMVVDIo/UarvUmKEZwS16h5ETLdzF59Rx7ClyXkgGM8BBU0nFf7T0OF/5ELIw12Tz+OANM0nUcvcdHIzecv56dPl8tqyuS9MGvy5pgTLOl1Lt6lIYIUoDC8OhE2ALMODywtd8r2sQb7oQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=XMP6PLzEjroQKNtdiPI03kbqg6MKptpECwBcItQjSX0CWMvUTbocotNl6focW1+N9l6cS765uaGOQ/BPZMcgpQMlNTYsGA/EOI9mTbshqN9cij1IZnYERN393PYpxHialJzibpRwFPoi5VlW8QZy+7j79Udgf3PCJfgeOY0g29zIS8E7k3re8hsjVmKdidMboWFMX+SM3y9et1AYp9oDE9LCNQtgq8tvg1oW77sbRizQ8ae4Fp5UZ17F/T6YgQTchaQE0nvBhOURTUG9Qx8V+q1AV7IBPEpwILZmw4JYu/ToUeF6UA0Xh0XaSpOSUfY135ApgLjlHCEdWOaadpbI3A==
  • Authentication-results-original: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=arm.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>
  • Delivery-date: Thu, 19 Oct 2023 11:50:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Nodisclaimer: true
  • Original-authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=arm.com;
  • Thread-index: AQHaAn6ufpdY+57bqkWZN4b9/h1wsrBQ+nMAgAACLYCAAALdAA==
  • Thread-topic: [PATCH] xen/sched: fix sched_move_domain()

Hi Juergen,

> On Oct 19, 2023, at 19:38, Juergen Gross <jgross@xxxxxxxx> wrote:
> 
> On 19.10.23 13:31, Henry Wang wrote:
>> Hi Juergen,
>>> On Oct 19, 2023, at 19:23, Juergen Gross <jgross@xxxxxxxx> wrote:
>>> 
>>> When moving a domain out of a cpupool running with the credit2
>>> scheduler and having multiple run-queues, the following ASSERT() can
>>> be observed:
>>> 
>>> (XEN) Xen call trace:
>>> (XEN)    [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
>>> (XEN)    [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
>>> (XEN)    [<ffff82d040234cf7>] S 
>>> cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
>>> (XEN)    [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
>>> (XEN)    [<ffff82d040206513>] S domain_kill+0xa5/0x116
>>> (XEN)    [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
>>> (XEN)    [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
>>> (XEN)    [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
>>> (XEN)    [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 1:
>>> (XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at 
>>> common/sched/credit2.c:1159
>>> (XEN) ****************************************
>>> 
>>> This is happening as sched_move_domain() is setting a different cpu
>>> for a scheduling unit without telling the scheduler. When this unit is
>>> removed from the scheduler, the ASSERT() will trigger.
>>> 
>>> In non-debug builds the result is usually a clobbered pointer, leading
>>> to another crash a short time later.
>>> 
>>> Fix that by swapping the two involved actions (setting another cpu and
>>> removing the unit from the scheduler).
>>> 
>>> Cc: Henry Wang <Henry.Wang@xxxxxxx>
>> Emmm, I think ^ this CC is better to me moved to the scissors line, otherwise
>> if this patch is committed, this line will be shown in the commit message...
>>> Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools 
>>> with different granularity")
>>> Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>>> ---
>>> This fixes a regression introduced in Xen 4.15. The fix is very simple
>>> and it will affect only configurations with multiple cpupools. I think
>>> whether to include it in 4.18 should be decided by the release manager
>>> based on the current state of the release (I think I wouldn't have
>>> added it that late in the release while being the release manager).
>> Thanks for the reminder :)
>> Please correct me if I am wrong, if this is fixing the regression introduced 
>> in
>> 4.15, shouldn’t this patch being backported to 4.15, 4.16, 4.17 and soon
>> 4.18? So honestly I think at least for 4.18 either add this patch now or
>> later won’t make much difference…I am ok either way I guess.
> 
> You are right, the patch needs to be backported.
> 
> OTOH nobody noticed the regression until a few days ago. So delaying the fix
> for a few weeks would probably not hurt too many people.

I am planning to branch next week, so I would say this patch probably will be
delayed max 1 week I guess.

> Each patch changing
> code is a risk to introduce another regression, so its your decision whether
> you want to take this risk. Especially changes like this one touching the
> core scheduling code always have a latent risk to open up a small race window
> (this could be said for the original patch I'm fixing here, too :-) ).

That said, given the fact that this patch is fixing a regression years ago and
will be backported to the stable releases, I am more leaning towards merging
this patch when the tree reopens (unless others strongly object), so that this
patch will get the opportunity to be tested properly, and we won’t take too much
risk to delay the 4.18 release. Also this decision is consistent with your (an
ex release manager) above words in the scissors line :)

Hopefully you will also be ok with that.

Kind regards,
Henry

> 
> 
> Juergen
> <OpenPGP_0xB0DE9DD628BF132F.asc>


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.