[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Hi, @Julien Grall
On Tue, Mar 11, 2025 at 11:04 PM Julien Grall <julien@xxxxxxx> wrote: > > Hi Mykola, > > On 05/03/2025 09:11, Mykola Kvach wrote: > > From: Mykola Kvach <mykola_kvach@xxxxxxxx> > > > > This patch implements suspend/resume helpers for the watchdog. > > While a domain is suspended its watchdogs must be paused. Otherwise, > > if the domain stays in the suspend state for a longer period of time > > compared to the watchdog period, the domain would be shutdown on resume. > > Proper solution to this problem is to stop (suspend) the watchdog timers > > after the domain suspends and to restart (resume) the watchdog timers > > before the domain resumes. The suspend/resume of watchdog timers is done > > in Xen and is invisible to the guests. > > > > Signed-off-by: Mirela Simonovic <mirela.simonovic@xxxxxxxxxx> > > Signed-off-by: Saeed Nowshadi <saeed.nowshadi@xxxxxxxxxx> > > Signed-off-by: Mykyta Poturai <mykyta_poturai@xxxxxxxx> > > Signed-off-by: Mykola Kvach <mykola_kvach@xxxxxxxx> > > --- > > Changes in v3: > > - cover the code with CONFIG_SYSTEM_SUSPEND > > > > Changes in v2: > > - drop suspended field from timer structure > > - drop the call of watchdog_domain_resume from ctxt_switch_to > > --- > > xen/common/sched/core.c | 39 +++++++++++++++++++++++++++++++++++++++ > > xen/include/xen/sched.h | 9 +++++++++ > > 2 files changed, 48 insertions(+) > > > > diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c > > index b1c6b6b9fa..6c2231826a 100644 > > --- a/xen/common/sched/core.c > > +++ b/xen/common/sched/core.c > > @@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d) > > kill_timer(&d->watchdog_timer[i].timer); > > } > > > > +#ifdef CONFIG_SYSTEM_SUSPEND > > The config option is Arm specific, yet this is common code. As x86, > already suspend/resume, then shouldn't the config option be common? > > But more importantly, why do we need to save/restore the watchdogs for > Arm but not x86? Is this a latent issue or design choice? I’ve looked into this a bit. Here's what I've found: A watchdog timer is created and initialized (but not started) for each domain during the domain_create call. This timer can be triggered either by the Linux kernel (I refer to Linux kernel–based operating systems because I have access to the code and can confirm that the Xen watchdog timer is implemented there. I don’t have knowledge about the existence or implementation of the Xen watchdog driver in other operating systems) Xen watchdog driver or by the xenwatchdogd service. Case 1: Watchdog started by the Linux kernel driver (I hope that operating systems not based on the Linux kernel also implement proper handling for the Xen watchdog timer driver during suspend and resume) When the Xen watchdog is started by the Linux kernel driver, everything works as expected. The driver correctly handles system suspend events: https://elixir.bootlin.com/linux/v6.14.8/source/drivers/watchdog/xen_wdt.c#L169 Case 2: Watchdog started by xenwatchdogd service However, when the watchdog is started by the xenwatchdogd service, neither the underlying OS nor the daemon takes care of stopping the watchdog timer during suspend: https://elixir.bootlin.com/xen/v4.20.0/source/tools/hotplug/Linux/init.d/xen-watchdog.in https://elixir.bootlin.com/xen/v4.20.0/source/tools/hotplug/NetBSD/rc.d/xen-watchdog Behavior on x86 during suspend: - Linux guest is configured with xenwatchdogd, and the Xen watchdog is started at boot - the OS initiates suspend (we request) - at that moment, there's an active watchdog timer in Xen for the domain, set to, say, 15 seconds - after suspend preparations, domain_shutdown() is called with the SHUTDOWN_suspend argument inside Xen hypervisor internals - inside this function, the is_shutting_down flag is set in the domain structure - when the watchdog timer expires, the Xen handler skips the reset action because the domain is marked as shutting down: https://elixir.bootlin.com/xen/v4.20.0/source/xen/common/sched/core.c#L1539 So far, everything behaves correctly. BUT for the second case there is another flow. The domain starts resuming from suspend. As part of the resume process, the is_shutting_down flag inside the domain is cleared, which re-enables normal watchdog behavior. However, the watchdog timer—set before suspend—has nearly expired. Because the OS and its user-space services (such as the watchdog pinging daemon) have not yet fully resumed and restarted, the watchdog timeout occurs before the ping can be sent. As a result, the watchdog triggers a reset or shutdown (as far as i know can't be another action of watchdog expiry, but we aren't interested in these options right now) before the service has a chance to take control again — effectively making a clean resume impossible. It's also unclear how common this situation is on x86 systems — specifically, whether xenwatchdogd is typically used in domU or dom0, or whether the kernel driver is more commonly relied upon instead. --- I think proper handling should be added to the relevant services to avoid unexpected triggers. What’s your take on this? > > > + > > +void watchdog_domain_suspend(struct domain *d) > > +{ > > + unsigned int i; > > + > > + spin_lock(&d->watchdog_lock); > > + > > + for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ ) > > + { > > + if ( test_bit(i, &d->watchdog_inuse_map) ) > > + { > > + stop_timer(&d->watchdog_timer[i].timer); > > + } > > + } > > + > > + spin_unlock(&d->watchdog_lock); > > +} > > + > > +void watchdog_domain_resume(struct domain *d) > > +{ > > + unsigned int i; > > + > > + spin_lock(&d->watchdog_lock); > > + > > + for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ ) > > + { > > + if ( test_bit(i, &d->watchdog_inuse_map) ) > > + { > > + set_timer(&d->watchdog_timer[i].timer, > > + NOW() + SECONDS(d->watchdog_timer[i].timeout)); > > + } > > + } > > + > > + spin_unlock(&d->watchdog_lock); > > +} > > + > > +#endif /* CONFIG_SYSTEM_SUSPEND */ > > + > > /* > > * Pin a vcpu temporarily to a specific CPU (or restore old pinning state if > > * cpu is NR_CPUS). > > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > > index d0d10612ce..caab4aad93 100644 > > --- a/xen/include/xen/sched.h > > +++ b/xen/include/xen/sched.h > > @@ -1109,6 +1109,15 @@ void scheduler_disable(void); > > void watchdog_domain_init(struct domain *d); > > void watchdog_domain_destroy(struct domain *d); > > > > +#ifdef CONFIG_SYSTEM_SUSPEND > > +/* > > + * Suspend/resume watchdogs of domain (while the domain is suspended its > > + * watchdogs should be on pause) > > + */ > > +void watchdog_domain_suspend(struct domain *d); > > +void watchdog_domain_resume(struct domain *d); > > +#endif /* CONFIG_SYSTEM_SUSPEND */ > > + > > /* > > * Use this check when the following are both true: > > * - Using this feature or interface requires full access to the hardware > > Cheers, > > -- > Julien Grall > Kind regards, Mykola
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |