
Re: [Xen-devel] Prepping a bugfix push



Not a patch, but I've just tried out xm save -c again with the latest
xen changes, and while I no longer see the grant table version panic,
the guest's devices (aside from the console) appear to be wedged on
resume. Is anyone else seeing this?

After a while on the console I see messages like this:

INFO: task syslogd:2219 blocked for more than 120 seconds.

which I assume is trouble with the block device.

On Thursday, 03 December 2009 at 11:26, Jeremy Fitzhardinge wrote:
> I'm preparing a general bugfix push for Linus, targeted at both current
> linux-2.6.git and stable.  The list of patches I have lined up (in the
> "bugfix" branch) are below.  Is there anything I've overlooked?  Are
> there any patches I've forgotten to apply altogether?
> 
> (Note, this is all domU stuff; dom0 things will need to mature a bit.)
> 
> Thanks,
>     J
> 
> commit b4606f2165153833247823e8c04c5e88cb3d298b
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 11:47:15 2009 +0000
> 
>     xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region.
>     
>     I have observed cases where the implicit stop_machine_destroy() done by
>     stop_machine() hangs while destroying the workqueues, specifically in
>     kthread_stop(). This seems to be because timer ticks are not restarted
>     until after stop_machine() returns.
>     
>     Fortunately stop_machine provides a facility to pre-create/post-destroy
>     the workqueues so use this to ensure that workqueues are only destroyed
>     after everything is really up and running again.
>     
>     I only actually observed this failure with 2.6.30. It seems that newer
>     kernels are somehow more robust against doing kthread_stop() without timer
>     interrupts (I tried some backports of some likely looking candidates but
>     did not track down the commit which added this robustness). However this
>     change seems like a reasonable belt&braces thing to do.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 65f63384b391bf4d384327d8a7c6de9860290b5c
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 11:47:14 2009 +0000
> 
>     xen: improve error handling in do_suspend.
>     
>     The existing error handling has a few issues:
>     - If freeze_processes() fails it exits with shutting_down = SHUTDOWN_SUSPEND.
>     - If dpm_suspend_noirq() fails it exits without resuming xenbus.
>     - If stop_machine() fails it exits without resuming xenbus or calling
>       dpm_resume_end().
>     - xs_suspend()/xs_resume() and dpm_suspend_noirq()/dpm_resume_noirq() were not
>       nested in the obvious way.
>     
>     Fix by ensuring each failure case goto's the correct label. Treat a failure of
>     stop_machine() as a cancelled suspend in order to follow the correct resume
>     path.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
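
For anyone following along, the goto-label unwinding this commit describes can be sketched in plain userspace C. This is only a model under my own assumptions: the step names are borrowed from the commit message, do_suspend_model() and its fail_at parameter are hypothetical, and none of this is the actual drivers/xen code. The point is just that each failure jumps to the label that undoes exactly what has already been done, and that a stop_machine() failure takes the same path as a cancelled suspend.

```c
#include <string.h>

/* Userspace model only: names are borrowed from the commit message,
 * but every function here is a hypothetical stand-in. */

static char trace[128];     /* undo actions, in the order they ran */
int shutting_down;          /* 1 while a suspend is in progress */

static void note(const char *what)
{
    strcat(trace, what);
    strcat(trace, ";");
}

/* fail_at picks the step (1..4) that reports an error; 0 means none. */
int do_suspend_model(int fail_at)
{
    int err = 0;

    trace[0] = '\0';
    shutting_down = 1;

    /* step 1: freeze_processes() */
    if (fail_at == 1) { err = -1; goto out; }
    /* step 2: first device suspend phase */
    if (fail_at == 2) { err = -1; goto out_thaw; }
    /* xenbus suspend happens here and cannot fail in this model */
    /* step 3: dpm_suspend_noirq() */
    if (fail_at == 3) { err = -1; goto out_resume; }
    /* step 4: stop_machine(); a failure is treated as a cancelled
     * suspend, so it falls through the normal resume path */
    if (fail_at == 4)
        err = -1;

    note("dpm_resume_noirq");
out_resume:
    note("xs_resume");
    note("dpm_resume_end");
out_thaw:
    note("thaw_processes");
out:
    shutting_down = 0;      /* never left set, whichever path was taken */
    return err;
}
```

Calling do_suspend_model(3), for example, leaves trace holding the xenbus resume, dpm_resume_end and thaw steps but not dpm_resume_noirq, which is the nesting the commit message asks for.
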
> commit fed5ea87e02aaf902ff38c65b4514233db03dc09
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 16:15:30 2009 +0000
> 
>     xen: don't leak IRQs over suspend/resume.
>     
>     On resume irq_info[*].evtchn is reset to 0 since event channel mappings
>     are not preserved over suspend/resume. The other contents of irq_info
>     are preserved to allow rebind_evtchn_irq() to function.
>     
>     However when a device resumes it will try to unbind from the
>     previous IRQ (e.g.  blkfront goes blkfront_resume() -> blkif_free() ->
>     unbind_from_irqhandler() -> unbind_from_irq()). This will fail due to the
>     check for VALID_EVTCHN in unbind_from_irq() and the IRQ is leaked. The
>     device will then continue to resume and allocate a new IRQ, eventually
>     leading to find_unbound_irq() panic()ing.
>     
>     Fix this by changing unbind_from_irq() to handle teardown of interrupts
>     which have type!=IRQT_UNBOUND but are not currently bound to a specific
>     event channel.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
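
To make the leak concrete, here is a toy userspace model of the behaviour described in this commit. It is a sketch under assumptions, not the kernel's event channel code: the table size, reset_model(), model_resume(), device_resume_cycle() and the _old/_fixed suffixes are all hypothetical, and find_unbound_irq() returns -1 where the real one would panic.

```c
#include <stdbool.h>

#define NR_IRQS 8               /* tiny table so exhaustion is easy to hit */

enum irq_type { IRQT_UNBOUND = 0, IRQT_EVTCHN };

struct irq_info {
    enum irq_type type;
    unsigned int evtchn;        /* 0 means no event channel is bound */
};

static struct irq_info irq_info[NR_IRQS];
static unsigned int next_evtchn = 1;

void reset_model(void)
{
    for (int irq = 0; irq < NR_IRQS; irq++) {
        irq_info[irq].type = IRQT_UNBOUND;
        irq_info[irq].evtchn = 0;
    }
}

/* Returns a free IRQ, or -1 where the real function would panic. */
int find_unbound_irq(void)
{
    for (int irq = 0; irq < NR_IRQS; irq++)
        if (irq_info[irq].type == IRQT_UNBOUND)
            return irq;
    return -1;
}

int bind_evtchn_to_irq(void)
{
    int irq = find_unbound_irq();
    if (irq >= 0) {
        irq_info[irq].type = IRQT_EVTCHN;
        irq_info[irq].evtchn = next_evtchn++;
    }
    return irq;
}

/* Resume: event channel mappings are lost, the rest of irq_info survives. */
void model_resume(void)
{
    for (int irq = 0; irq < NR_IRQS; irq++)
        irq_info[irq].evtchn = 0;
}

/* Old behaviour: nothing is torn down unless the event channel is valid. */
void unbind_from_irq_old(int irq)
{
    if (irq_info[irq].evtchn != 0) {
        irq_info[irq].evtchn = 0;
        irq_info[irq].type = IRQT_UNBOUND;
    }
}

/* Fixed behaviour: also release slots that are bound but have no channel. */
void unbind_from_irq_fixed(int irq)
{
    irq_info[irq].evtchn = 0;           /* would close the channel if valid */
    irq_info[irq].type = IRQT_UNBOUND;  /* always give the IRQ slot back */
}

int count_bound(void)
{
    int n = 0;
    for (int irq = 0; irq < NR_IRQS; irq++)
        if (irq_info[irq].type != IRQT_UNBOUND)
            n++;
    return n;
}

/* One suspend/resume cycle of a single device; returns its new IRQ. */
int device_resume_cycle(int irq, bool fixed)
{
    model_resume();
    if (fixed)
        unbind_from_irq_fixed(irq);
    else
        unbind_from_irq_old(irq);
    return bind_evtchn_to_irq();
}
```

With the old unbind, each cycle leaves one more slot marked as bound until find_unbound_irq() has nothing left to hand out. With the fixed one, the device keeps getting the same slot back.
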
> commit f6eafe3665bcc374c66775d58312d1c06c55303f
> Author: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Date:   Wed Nov 25 14:12:08 2009 +0000
> 
>     xen: call clock resume notifier on all CPUs
>     
>     tick_resume() is never called on secondary processors. Presumably this
>     is because they are offlined for suspend on native and so this is
>     normally taken care of in the CPU onlining path. Under Xen we keep all
>     CPUs online over a suspend.
>     
>     This patch papers over the issue for me but I will investigate a more
>     generic, less hacky, way of doing the same.
>     
>     tick_suspend is also only called on the boot CPU which I presume should
>     be fixed too.
>     
>     Signed-off-by: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>     Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> 
> commit 6aaf5d633bb6cead81b396d861d7bae4b9a0ba7e
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Wed Nov 25 13:15:38 2009 -0800
> 
>     xen: use iret for return from 64b kernel to 32b usermode
>     
>     If Xen wants to return to a 32b usermode with sysret it must use the
>     right form.  When using VGCF_in_syscall to trigger this, it looks at
>     the code segment and does a 32b sysret if it is FLAT_USER_CS32.
>     However, this is different from __USER32_CS, so it fails to return
>     properly if we use the normal Linux segment.
>     
>     So avoid the whole mess by dropping VGCF_in_syscall and simply use
>     plain iret to return to usermode.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Acked-by: Jan Beulich <jbeulich@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 922cc38ab71d1360978e65207e4a4f4988987127
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 09:58:49 2009 -0800
> 
>     xen: don't call dpm_resume_noirq() with interrupts disabled.
>     
>     dpm_resume_noirq() takes a mutex, so it can't be called from a no-interrupt
>     context.  Don't call it from within the stop-machine function, but just
>     afterwards, since we're resuming anyway, regardless of what happened.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 499d19b82b586aef18727b9ae1437f8f37b66e91
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 09:38:25 2009 -0800
> 
>     xen: register runstate info for boot CPU early
>     
>     printk timestamping uses sched_clock, which in turn relies on runstate
>     info under Xen.  So make sure we set it up before any printks can
>     be called.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 028896721ac04f6fa0697f3ecac3f98761746363
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Nov 24 09:32:48 2009 -0800
> 
>     xen: register runstate on secondary CPUs
>     
>     The commit "xen: re-register runstate area earlier on resume" caused us
>     to never try and setup the runstate area for secondary CPUs. Ensure that
>     we do this...
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit f350c7922faad3397c98c81a9e5658f5a1ef0214
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Nov 24 10:16:23 2009 +0000
> 
>     xen: register timer interrupt with IRQF_TIMER
>     
>     Otherwise the timer is disabled by dpm_suspend_noirq() which in turn prevents
>     correct operation of stop_machine on multi-processor systems and breaks
>     suspend.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit fa24ba62ea2869308ffc9f0b286ac9650b4ca6cb
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Sat Nov 21 11:32:49 2009 +0000
> 
>     xen: correctly restore pfn_to_mfn_list_list after resume
>     
>     pvops kernels >= 2.6.30 can currently only be saved and restored once. The
>     second attempt to save results in:
>     
>         ERROR Internal error: Frame# in pfn-to-mfn frame list is not in pseudophys
>         ERROR Internal error: entry 0: p2m_frame_list[0] is 0xf2c2c2c2, max 0x120000
>         ERROR Internal error: Failed to map/save the p2m frame list
>     
>     I finally narrowed it down to:
>     
>         commit cdaead6b4e657f960d6d6f9f380e7dfeedc6a09b
>             Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>             Date:   Fri Feb 27 15:34:59 2009 -0800
>     
>                 xen: split construction of p2m mfn tables from registration
>     
>                 Build the p2m_mfn_list_list early with the rest of the p2m table, but
>                 register it later when the real shared_info structure is in place.
>     
>                 Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     
>     The unforeseen side-effect of this change was to cause the mfn list list to not
>     be rebuilt on resume. Prior to this change it would have been rebuilt via
>     xen_post_suspend() -> xen_setup_shared_info() -> xen_setup_mfn_list_list().
>     
>     Fix by explicitly calling xen_build_mfn_list_list() from xen_post_suspend().
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 3905bb2aa7bb801b31946b37a4635ebac4009051
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Sat Nov 21 08:46:29 2009 +0800
> 
>     xen: restore runstate_info even if !have_vcpu_info_placement
>     
>     Even if have_vcpu_info_placement is not set, we still need to set up
>     the runstate area on each resumed vcpu.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit be012920ecba161ad20303a3f6d9e96c58cf97c7
> Author: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Date:   Sat Nov 21 08:35:55 2009 +0800
> 
>     xen: re-register runstate area earlier on resume.
>     
>     This is necessary to ensure the runstate area is available to
>     xen_sched_clock before any calls to printk which will require it in
>     order to provide a timestamp.
>     
>     I chose to pull the xen_setup_runstate_info out of xen_time_init into
>     the caller in order to maintain parity with calling
>     xen_setup_runstate_info separately from calling xen_time_resume.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit ae7888012969355a548372e99b066d9e31153b62
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:39 2009 +0200
> 
>     xen: wait up to 5 minutes for device connection
>     
>     Increases the device timeout from 10s to 5 minutes, giving the user a
>     visual indication during that time in case there are problems.  The patch
>     is a backport of changesets 144 and 150 in the Xenbits tree.
>     
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> 
> commit f8dc33088febc63286b7a60e6b678de8e064de8e
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:38 2009 +0200
> 
>     xen: improvement to wait_for_devices()
>     
>     When printing a warning about a timed-out device, print the
>     current state of both ends of the device connection (i.e., backend as
>     well as frontend).  This backports half of changeset 146 from the
>     Xenbits tree.
>     
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> 
> commit c6e1971139be1342902873181f3b80a979bfb33b
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:37 2009 +0200
> 
>     xen: fix is_disconnected_device/exists_disconnected_device
>     
>     The logic of is_disconnected_device/exists_disconnected_device is wrong
>     in that they are used to test whether a device is trying to connect (i.e.
>     connecting).  For this reason the patch fixes them to not consider a
>     Closing or Closed device to be connecting.  At the same time the patch
>     also renames the functions according to what they really do; you could
>     say a closed device is "disconnected" (the old name), but not "connecting"
>     (the new name).
>     
>     This patch is a backport of changeset 909 from the Xenbits tree.
>     
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> 
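
The distinction this commit draws between "disconnected" and "connecting" can be sketched as two predicates over the xenbus state. This is a simplified illustration and an assumption on my part: the real functions take a device rather than a bare state, is_device_connecting() is my guess at the new name, and the enum values are the xenbus state numbers as I remember them from the public interface header.

```c
#include <stdbool.h>

/* xenbus states, numbered as I recall them from the public header. */
enum xenbus_state {
    XenbusStateUnknown      = 0,
    XenbusStateInitialising = 1,
    XenbusStateInitWait     = 2,
    XenbusStateInitialised  = 3,
    XenbusStateConnected    = 4,
    XenbusStateClosing      = 5,
    XenbusStateClosed       = 6
};

/* Old test, matching the old name: anything not Connected counts. */
bool is_disconnected_device(enum xenbus_state state)
{
    return state != XenbusStateConnected;
}

/* New test: only states on the way to Connected count, so a device
 * that is Closing or Closed is no longer waited for. */
bool is_device_connecting(enum xenbus_state state)
{
    return state < XenbusStateConnected;
}
```

A Closed device satisfies the old predicate but not the new one, so a boot-time wait loop built on the new test would not stall on it.
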
> commit db05fed0ad72f264e39bcb366795f7367384ec92
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 16:41:47 2009 -0800
> 
>     xen/xenbus: make DEVICE_ATTR()s static
>     
>     They don't need to be global, and may cause linker clashes.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
> 



 

