Xen project Mailing List

Re: [PATCH] xen/vcpu: remove vcpu_set_singleshot_timer flags field

To: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Wed, 19 Apr 2023 11:18:38 +0200

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Henry Wang <Henry.Wang@xxxxxxx>, Community Manager <community.manager@xxxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Wed, 19 Apr 2023 09:19:02 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 19.04.2023 11:02, Roger Pau Monné wrote:
> On Wed, Apr 19, 2023 at 09:07:41AM +0200, Jan Beulich wrote:
>> On 18.04.2023 17:54, Andrew Cooper wrote:
>>> On 18/04/2023 4:42 pm, Roger Pau Monne wrote:
>>>> The addition of the flags field in the vcpu_set_singleshot_timer in
>>>> 505ef3ea8687 is an ABI breakage, as the size of the structure is
>>>> increased.
>>>>
>>>> Remove such field addition and drop the implementation of the
>>>> VCPU_SSHOTTMR_future flag. If a timer provides an expired timeout
>>>> value just inject the timer interrupt.
>>>>
>>>> Bump the Xen interface version, and keep the flags field and
>>>> VCPU_SSHOTTMR_future available for guests using the old interface.
>>>>
>>>> Note the removal of the field from the vcpu_set_singleshot_timer
>>>> struct allows removing the compat translation of the struct.
>>>>
>>>> Fixes: 505ef3ea8687 ('Add flags field to VCPUOP_set_singlsehot_timer.')
>>>> Reported-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>>>> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>>>
>>> While everything said is true, this isn't the reason to to get rid of
>>> VCPU_SSHOTTMR_future
>>>
>>> It 505ef3ea8687 does appear to have been an ABI break, that's
>>> incidental. It dates from 2007 so whatever we have now is the de-facto
>>> ABI, whether we like it or not.
>>>
>>> The reason to delete this is because it is a monumentality and entirely
>>> stupid idea which should have been rejected outright at the time, and
>>> the only guest we can find which uses it also BUG_ON()'s in response to
>>> -ETIME.
>>
>> The instance in Linux (up to 4.6) that I could find was
>>
>> BUG_ON(ret != 0 && ret != -ETIME);
>>
>> i.e. not really dying when getting back -ETIME. (And if there really was
>> a BUG_ON(ret) somewhere despite setting the flag, it would be a bug there,
>> not something to "fix" in Xen.) I'm afraid I also disagree on "stupid
>> idea" as well as ...
>
> The logic in old Linux is indeed 'fine' in the sense that it doesn't
> hit a BUG_ON.
>
> The problem we are seeing is that when logdirty is enabled on a guest
> with 32vCPUs (and without any kind of logdirty hardware assistance)
> the contention on the p2m lock is so high that by the time
> VCPUOP_set_singleshot_timer has copied the hypercall data from HVM
> context the provided timeout has already expired, leading to:
>
> [ 65.543736] hrtimer: interrupt took 10817714 ns
> [ 65.514171] CE: xen increased min_delta_ns to 150000 nsec
> [ 65.514171] CE: xen increased min_delta_ns to 225000 nsec
> [ 65.514171] CE: xen increased min_delta_ns to 337500 nsec
> [ 65.566495] CE: xen increased min_delta_ns to 150000 nsec
> [ 65.514171] CE: xen increased min_delta_ns to 506250 nsec
> [ 65.573088] CE: xen increased min_delta_ns to 150000 nsec
> [ 65.572884] CE: xen increased min_delta_ns to 150000 nsec
> [ 65.514171] CE: xen increased min_delta_ns to 759375 nsec
> [ 65.638644] CE: xen increased min_delta_ns to 150000 nsec
> [ 65.566495] CE: xen increased min_delta_ns to 225000 nsec
> [ 65.514171] CE: xen increased min_delta_ns to 1000000 nsec
> [ 65.572884] CE: xen increased min_delta_ns to 225000 nsec
> [ 65.573088] CE: xen increased min_delta_ns to 225000 nsec
> [ 65.630224] CE: xen increased min_delta_ns to 150000 nsec
> ...
>
> xenrt1062821 login: [ 82.752788] CE: Reprogramming failure. Giving up
> [ 82.779470] CE: xen increased min_delta_ns to 1000000 nsec
> [ 82.793075] CE: Reprogramming failure. Giving up
> [ 82.779470] CE: Reprogramming failure. Giving up
> [ 82.821864] CE: xen increased min_delta_ns to 506250 nsec
> [ 82.821864] CE: xen increased min_delta_ns to 759375 nsec
> [ 82.821864] CE: xen increased min_delta_ns to 1000000 nsec
> [ 82.821864] CE: Reprogramming failure. Giving up
> [ 82.856256] CE: Reprogramming failure. Giving up
> [ 84.566279] CE: Reprogramming failure. Giving up
> [ 84.649493] Freezing user space processes ...
> [ 130.604032] INFO: rcu_sched detected stalls on CPUs/tasks: { 14} (detected by 10, t=60002 jiffies, g=4006, c=4005, q=14130)
> [ 130.604032] Task dump for CPU 14:
> [ 130.604032] swapper/14 R running task 0 0 1 0x00000000
> [ 130.604032] Call Trace:
> [ 130.604032] [<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
> [ 130.604032] [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
> [ 130.604032] [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
> [ 130.604032] [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
> [ 130.604032] [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
> [ 130.604032] [<ffffffff900000d5>] ? start_cpu+0x5/0x14
> [ 549.654536] INFO: rcu_sched detected stalls on CPUs/tasks: { 26} (detected by 24, t=60002 jiffies, g=6922, c=6921, q=7013)
> [ 549.655463] Task dump for CPU 26:
> [ 549.655463] swapper/26 R running task 0 0 1 0x00000000
> [ 549.655463] Call Trace:
> [ 549.655463] [<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
> [ 549.655463] [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
> [ 549.655463] [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
> [ 549.655463] [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
> [ 549.655463] [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
> [ 549.655463] [<ffffffff900000d5>] ? start_cpu+0x5/0x14
> [ 821.888478] INFO: rcu_sched detected stalls on CPUs/tasks: { 26} (detected by 24, t=60002 jiffies, g=8499, c=8498, q=7664)
> [ 821.888596] Task dump for CPU 26:
> [ 821.888622] swapper/26 R running task 0 0 1 0x00000000
> [ 821.888677] Call Trace:
> [ 821.888712] [<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
> [ 821.888771] [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
> [ 821.888818] [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
> [ 821.888865] [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
> [ 821.888917] [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
> [ 821.888966] [<ffffffff900000d5>] ? start_cpu+0x5/0x14
>
> At some point Linux simply gives up trying to reprogram the timer, and
> that obviously lead to CPU stalls.

And that's all with old enough Linux then, I suppose?

> Ignoring the VCPU_SSHOTTMR_future flag allows the guest to survive, by
> not returning ETIME and just injecting the expired interrupt.
>
> Overall I'm not sure how useful VCPU_SSHOTTMR_future is at least when
> implemented as done currently in Linux.
>
> Instead of trying to reprogram the timer Linux should do the
> equivalent of self-inject a timer interrupt in order to cope with the
> fact that the selected timeout has already expired.

Indeed - that's what I was expecting would be happening. But I didn't go
check their code ... Yet them getting it wrong still isn't a reason to
ignore the request, at least not unconditionally. OSes could be getting it
right, and they could then benefit from the avoided event.

As to "unconditionally": Introducing a per-guest control is likely too
much overhead for something that, aiui, isn't commonly used (anymore). But
tying this to a command line option might make sense - engaging it
shouldn't (hopefully) lead to misbehavior in guests properly using the
flag, so ought to be okay to enable in a system-wide manner. I vaguely
recall considerations for similar overrides to hypercall behavior in other
areas, so such an option - if made extensible - might find further uses
down the road.

Jan
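
For reference, the "CE: xen increased min_delta_ns" values in the quoted log follow a regular pattern: each one is 1.5 times the previous value, capped at 1000000 nsec, after which only "Reprogramming failure. Giving up" is reported. Below is a minimal C sketch that reproduces that sequence. It is modelled on the log output only, not copied from the Linux clockevents code; the helper name, the return convention and the 100000 nsec starting value are assumptions (the log only shows values from 150000 onwards).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cap: 1 ms, the largest value that appears in the quoted log. */
#define MIN_DELTA_LIMIT_NS 1000000ULL

/*
 * Hypothetical helper: grow the minimum programmable delta by 50%,
 * clamped to the cap. Returns -1 if the cap had already been reached
 * (the "Giving up" case in the log), 0 otherwise.
 */
static int increase_min_delta(uint64_t *min_delta_ns)
{
    assert(min_delta_ns != NULL);

    if (*min_delta_ns >= MIN_DELTA_LIMIT_NS)
        return -1;

    /* x + x/2, i.e. 1.5 * x using integer arithmetic. */
    *min_delta_ns += *min_delta_ns >> 1;
    if (*min_delta_ns > MIN_DELTA_LIMIT_NS)
        *min_delta_ns = MIN_DELTA_LIMIT_NS;

    return 0;
}
```

Starting from the assumed 100000 nsec, six successful calls step through 150000, 225000, 337500, 506250, 759375 and 1000000, which matches the values logged with the [ 65.514171] timestamp; the seventh call fails.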

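The behaviours discussed in this thread (returning -ETIME for an already expired timeout when the flag is set, injecting the interrupt right away when it is not, and a system-wide override that ignores the flag) can be put side by side in a small model. This is only an illustrative sketch, not Xen code: the function, the enum, the `ignore_future` parameter standing in for the suggested command line option, and the flag value are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for VCPU_SSHOTTMR_future; the value here is an assumption. */
#define SSHOTTMR_FUTURE 1u

enum timer_action {
    TIMER_ARMED,      /* timeout is in the future, timer programmed */
    TIMER_INJECT_NOW  /* timeout already passed, deliver the event now */
};

/*
 * Hypothetical model of the hypervisor-side decision.
 * ignore_future models a system-wide override of the flag.
 * *action is written only when 0 is returned.
 */
static int set_singleshot_timer(uint64_t now_ns, uint64_t timeout_ns,
                                uint32_t flags, bool ignore_future,
                                enum timer_action *action)
{
    assert(action != NULL);

    if (timeout_ns >= now_ns) {
        *action = TIMER_ARMED;
        return 0;
    }

    /* Expired: honour the flag unless the override is engaged. */
    if ((flags & SSHOTTMR_FUTURE) && !ignore_future)
        return -ETIME;

    *action = TIMER_INJECT_NOW;
    return 0;
}
```

In this model a guest that sets the flag and gets -ETIME back is expected to treat the timeout as already expired and run its timer handling directly, rather than retrying with a larger delta. With the override engaged, such a guest simply receives the event it would otherwise have avoided.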