Re: [Xen-devel] Prepping a bugfix push
Not a patch, but I've just tried out xm save -c again with the latest xen changes, and while I no longer see the grant table version panic, the guest's devices (aside from the console) appear to be wedged on resume. Is anyone else seeing this?

After a while on the console I see messages like this:

  INFO: task syslogd:2219 blocked for more than 120 seconds.

which I assume is trouble with the block device.

On Thursday, 03 December 2009 at 11:26, Jeremy Fitzhardinge wrote:
> I'm preparing a general bugfix push for Linus, targeted at both current
> linux-2.6.git and stable. The list of patches I have lined up (in the
> "bugfix" branch) are below. Is there anything I've overlooked? Are
> there any patches I've forgotten to apply altogether?
>
> (Note, this is all domU stuff; dom0 things will need to mature a bit.)
>
> Thanks,
>     J
>
> commit b4606f2165153833247823e8c04c5e88cb3d298b
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 11:47:15 2009 +0000
>
>     xen: explicitly create/destroy stop_machine workqueues outside
>     suspend/resume region.
>
>     I have observed cases where the implicit stop_machine_destroy() done by
>     stop_machine() hangs while destroying the workqueues, specifically in
>     kthread_stop(). This seems to be because timer ticks are not restarted
>     until after stop_machine() returns.
>
>     Fortunately stop_machine provides a facility to pre-create/post-destroy
>     the workqueues so use this to ensure that workqueues are only destroyed
>     after everything is really up and running again.
>
>     I only actually observed this failure with 2.6.30. It seems that newer
>     kernels are somehow more robust against doing kthread_stop() without timer
>     interrupts (I tried some backports of some likely looking candidates but
>     did not track down the commit which added this robustness). However this
>     change seems like a reasonable belt&braces thing to do.
>
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit 65f63384b391bf4d384327d8a7c6de9860290b5c
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 11:47:14 2009 +0000
>
>     xen: improve error handling in do_suspend.
>
>     The existing error handling has a few issues:
>     - If freeze_processes() fails it exits with shutting_down =
>       SHUTDOWN_SUSPEND.
>     - If dpm_suspend_noirq() fails it exits without resuming xenbus.
>     - If stop_machine() fails it exits without resuming xenbus or calling
>       dpm_resume_end().
>     - xs_suspend()/xs_resume() and dpm_suspend_noirq()/dpm_resume_noirq()
>       were not nested in the obvious way.
>
>     Fix by ensuring each failure case goto's the correct label. Treat a
>     failure of stop_machine() as a cancelled suspend in order to follow the
>     correct resume path.
>
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit fed5ea87e02aaf902ff38c65b4514233db03dc09
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 16:15:30 2009 +0000
>
>     xen: don't leak IRQs over suspend/resume.
>
>     On resume irq_info[*].evtchn is reset to 0 since event channel mappings
>     are not preserved over suspend/resume. The other contents of irq_info
>     is preserved to allow rebind_evtchn_irq() to function.
>
>     However when a device resumes it will try to unbind from the
>     previous IRQ (e.g. blkfront goes blkfront_resume() -> blkif_free() ->
>     unbind_from_irqhandler() -> unbind_from_irq()). This will fail due to the
>     check for VALID_EVTCHN in unbind_from_irq() and the IRQ is leaked. The
>     device will then continue to resume and allocate a new IRQ, eventually
>     leading to find_unbound_irq() panic()ing.
>
>     Fix this by changing unbind_from_irq() to handle teardown of interrupts
>     which have type!=IRQT_UNBOUND but are not currently bound to a specific
>     event channel.
>
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit f6eafe3665bcc374c66775d58312d1c06c55303f
> Author: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Date:   Wed Nov 25 14:12:08 2009 +0000
>
>     xen: call clock resume notifier on all CPUs
>
>     tick_resume() is never called on secondary processors. Presumably this
>     is because they are offlined for suspend on native and so this is
>     normally taken care of in the CPU onlining path. Under Xen we keep all
>     CPUs online over a suspend.
>
>     This patch papers over the issue for me but I will investigate a more
>     generic, less hacky, way of doing to the same.
>
>     tick_suspend is also only called on the boot CPU which I presume should
>     be fixed too.
>
>     Signed-off-by: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>     Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> commit 6aaf5d633bb6cead81b396d861d7bae4b9a0ba7e
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Wed Nov 25 13:15:38 2009 -0800
>
>     xen: use iret for return from 64b kernel to 32b usermode
>
>     If Xen wants to return to a 32b usermode with sysret it must use the
>     right form. When using VCGF_in_syscall to trigger this, it looks at
>     the code segment and does a 32b sysret if it is FLAT_USER_CS32.
>     However, this is different from __USER32_CS, so it fails to return
>     properly if we use the normal Linux segment.
>
>     So avoid the whole mess by dropping VCGF_in_syscall and simply use
>     plain iret to return to usermode.
>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Acked-by: Jan Beulich <jbeulich@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit 922cc38ab71d1360978e65207e4a4f4988987127
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 09:58:49 2009 -0800
>
>     xen: don't call dpm_resume_noirq() with interrupts disabled.
>
>     dpm_resume_noirq() takes a mutex, so it can't be called from a
>     no-interrupt context. Don't call it from within the stop-machine
>     function, but just afterwards, since we're resuming anyway, regardless
>     of what happened.
>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit 499d19b82b586aef18727b9ae1437f8f37b66e91
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 09:38:25 2009 -0800
>
>     xen: register runstate info for boot CPU early
>
>     printk timestamping uses sched_clock, which in turn relies on runstate
>     info under Xen. So make sure we set it up before any printks can
>     be called.
>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit 028896721ac04f6fa0697f3ecac3f98761746363
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Nov 24 09:32:48 2009 -0800
>
>     xen: register runstate on secondary CPUs
>
>     The commit "xen: re-register runstate area earlier on resume" caused us
>     to never try and setup the runstate area for secondary CPUs. Ensure that
>     we do this...
>
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit f350c7922faad3397c98c81a9e5658f5a1ef0214
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Nov 24 10:16:23 2009 +0000
>
>     xen: register timer interrupt with IRQF_TIMER
>
>     Otherwise the timer is disabled by dpm_suspend_noirq() which in turn
>     prevents correct operation of stop_machine on multi-processor systems
>     and breaks suspend.
>
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit fa24ba62ea2869308ffc9f0b286ac9650b4ca6cb
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Sat Nov 21 11:32:49 2009 +0000
>
>     xen: correctly restore pfn_to_mfn_list_list after resume
>
>     pvops kernels >= 2.6.30 can currently only be saved and restored once. The
>     second attempt to save results in:
>
>         ERROR Internal error: Frame# in pfn-to-mfn frame list is not in pseudophys
>         ERROR Internal error: entry 0: p2m_frame_list[0] is 0xf2c2c2c2, max 0x120000
>         ERROR Internal error: Failed to map/save the p2m frame list
>
>     I finally narrowed it down to:
>
>     commit cdaead6b4e657f960d6d6f9f380e7dfeedc6a09b
>     Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Date:   Fri Feb 27 15:34:59 2009 -0800
>
>         xen: split construction of p2m mfn tables from registration
>
>         Build the p2m_mfn_list_list early with the rest of the p2m table, but
>         register it later when the real shared_info structure is in place.
>
>         Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>
>     The unforeseen side-effect of this change was to cause the mfn list list
>     to not be rebuilt on resume. Prior to this change it would have been
>     rebuilt via xen_post_suspend() -> xen_setup_shared_info() ->
>     xen_setup_mfn_list_list().
>
>     Fix by explicitly calling xen_build_mfn_list_list() from
>     xen_post_suspend().
>
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit 3905bb2aa7bb801b31946b37a4635ebac4009051
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Sat Nov 21 08:46:29 2009 +0800
>
>     xen: restore runstate_info even if !have_vcpu_info_placement
>
>     Even if have_vcpu_info_placement is not set, we still need to set up
>     the runstate area on each resumed vcpu.
>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit be012920ecba161ad20303a3f6d9e96c58cf97c7
> Author: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Date:   Sat Nov 21 08:35:55 2009 +0800
>
>     xen: re-register runstate area earlier on resume.
>
>     This is necessary to ensure the runstate area is available to
>     xen_sched_clock before any calls to printk which will require it in
>     order to provide a timestamp.
>
>     I chose to pull the xen_setup_runstate_info out of xen_time_init into
>     the caller in order to maintain parity with calling
>     xen_setup_runstate_info separately from calling xen_time_resume.
>
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
> commit ae7888012969355a548372e99b066d9e31153b62
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:39 2009 +0200
>
>     xen: wait up to 5 minutes for device connetion
>
>     Increases the device timeout from 10s to 5 minutes, giving the user a
>     visual indication during that time in case there are problems. The patch
>     is a backport of changesets 144 and 150 in the Xenbits tree.
>
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>
> commit f8dc33088febc63286b7a60e6b678de8e064de8e
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:38 2009 +0200
>
>     xen: improvement to wait_for_devices()
>
>     When printing a warning about a timed-out device, print the
>     current state of both ends of the device connection (i.e., backend as
>     well as frontend). This backports half of changeset 146 from the
>     Xenbits tree.
>
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>
> commit c6e1971139be1342902873181f3b80a979bfb33b
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:37 2009 +0200
>
>     xen: fix is_disconnected_device/exists_disconnected_device
>
>     The logic of is_disconnected_device/exists_disconnected_device is wrong
>     in that they are used to test whether a device is trying to connect (i.e.
>     connecting). For this reason the patch fixes them to not consider a
>     Closing or Closed device to be connecting. At the same time the patch
>     also renames the functions according to what they really do; you could
>     say a closed device is "disconnected" (the old name), but not "connecting"
>     (the new name).
>
>     This patch is a backport of changeset 909 from the Xenbits tree.
>
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>
> commit db05fed0ad72f264e39bcb366795f7367384ec92
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 16:41:47 2009 -0800
>
>     xen/xenbus: make DEVICE_ATTR()s static
>
>     They don't need to be global, and may cause linker clashes.
>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
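
For anyone following the "improve error handling in do_suspend" commit quoted above, here is a rough userspace sketch of the label ordering it describes. This is not the kernel code: every kernel call is replaced by a stub that only records its name, so the unwind order can be inspected for each failure point. The names xs_suspend_cancel and thaw_processes, the fail_point enum, model_do_suspend() and the "shutting_down_reset" marker are assumptions made for illustration and are not taken from the commit message; steps the message does not name are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical selector for which step fails in one run of the model. */
enum fail_point { FAIL_NONE, FAIL_FREEZE, FAIL_SUSPEND_NOIRQ, FAIL_STOP_MACHINE };

static char trace[512];

/* Append a step name to the trace so the call order can be checked. */
static void record(const char *name)
{
    assert(strlen(trace) + strlen(name) + 2 <= sizeof(trace));
    if (trace[0] != '\0')
        strcat(trace, " ");
    strcat(trace, name);
}

/* Record a step; return -1 if this step is the one chosen to fail. */
static int step(const char *name, int fails)
{
    record(name);
    return fails ? -1 : 0;
}

/* Model of the suspend path: returns the recorded call order. */
const char *model_do_suspend(enum fail_point failing)
{
    int cancelled = 1;
    int err;

    trace[0] = '\0';

    err = step("freeze_processes", failing == FAIL_FREEZE);
    if (err)
        goto out;               /* only the shutdown state needs undoing */

    step("xs_suspend", 0);

    err = step("dpm_suspend_noirq", failing == FAIL_SUSPEND_NOIRQ);
    if (err)
        goto out_resume;        /* xenbus is suspended, so resume it */

    err = step("stop_machine", failing == FAIL_STOP_MACHINE);
    if (!err)
        cancelled = 0;          /* a stop_machine failure stays "cancelled" */

    /* Pairs with dpm_suspend_noirq; runs whether or not stop_machine failed. */
    step("dpm_resume_noirq", 0);

out_resume:
    /* xs_suspend_cancel is an assumed name for the cancelled-resume path. */
    record(cancelled ? "xs_suspend_cancel" : "xs_resume");
    record("dpm_resume_end");
    record("thaw_processes");   /* assumed counterpart of freeze_processes */
out:
    record("shutting_down_reset");  /* never leave SHUTDOWN_SUSPEND set */
    return trace;
}
```

Calling model_do_suspend(FAIL_FREEZE) returns "freeze_processes shutting_down_reset", while the other failure points fall through the resume steps in reverse order of the suspend steps that had already run.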