
[PATCH][RESEND]RE: [Xen-devel] [PATCH] Fix softlockup issue after vcpu hotplug


  • To: "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Tian, Kevin" <kevin.tian@xxxxxxxxx>
  • Date: Wed, 31 Jan 2007 14:17:45 +0800
  • Delivery-date: Tue, 30 Jan 2007 22:17:39 -0800
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: AcdESFqDCWsISfq5RGeHgxcxVzRqmQACelaDAAAZiDAAAP4q2wADxf1QAAFVV3AAAMUYXAAACCkwAACFZyAAAV9OMAABEoucACBxTEA=
  • Thread-topic: [PATCH][RESEND]RE: [Xen-devel] [PATCH] Fix softlockup issue after vcpu hotplug

>From: Keir Fraser [mailto:Keir.Fraser@xxxxxxxxxxxx]
>Sent: January 30, 2007 22:23
>
>On 30/1/07 2:11 pm, "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
>> BTW, do you think it's worth removing the vcpu from the
>> scheduler when it goes down and then re-initializing it in the
>> scheduler when it comes back up? I don't know whether this would
>> affect the scheduler's accounting. Actually, domain save/restore
>> doesn't show this bug, and one obvious difference compared to vcpu
>> hotplug is that the domain is restored in a new context...
>
>I wouldn't expect this to make any significant difference to scheduling
>accounting, certainly over a multi-second time period.
>
>Does the time you hot-unplug the vcpu for make a difference to how
>often you see this problem? Did you try repro'ing with a 2.6.16 kernel?
>
> -- Keir

Hi, Keir,
        I verified that the attached patch does fix the issue by restricting
the max timeout to 1s. Both vcpu unplug/plug and suspend cancel work fine.
In fact, the domain kept running well for several hours after intensive testing.
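
        A minimal sketch of the idea of restricting the timeout to 1s; this
is not the attached patch itself. The helper name clamp_timeout and the hz
parameter (ticks per second) are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from the attached patch: cap a requested
 * timeout (in ticks) so it never exceeds one second's worth of ticks.
 * hz is the number of ticks per second (an assumed parameter).
 */
static unsigned long clamp_timeout(unsigned long delta, unsigned long hz)
{
    assert(hz > 0);
    return delta > hz ? hz : delta;
}
```

With hz = 100, for example, a requested delta of 951 ticks would be cut
down to 100, while a delta of 42 would pass through unchanged.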

        I also tried 2.6.16, and it's immune to this issue. I added some debug
output to both 2.6.16 and 2.6.18 to print the delta value whenever delta > 1s.
The results further support our analysis.

        In 2.6.16, all the prints are:
                Delta 101 > HZ for cpuN
                Delta 101 > HZ for cpuN
                Delta 101 > HZ for cpuN
                ...

        While in 2.6.18, the prints look something like:
                Delta 199 > HZ for cpuN
                Delta 156 > HZ for cpuN
                Delta 192 > HZ for cpuN
                Delta 102 > HZ for cpuN
                ...
        After unplugging/plugging a cpu:
                Delta 951 > HZ for cpuN
                ...
        And then the softlockup warning appears.
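
        For reference, a debug check of this kind could look roughly like
the user-space sketch below. The function name format_large_delta is
hypothetical, and the real debug code would print from the kernel rather
than into a buffer; only the message format is taken from the prints above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical user-space stand-in for the debug print: when delta (in
 * ticks) exceeds hz (ticks per second, i.e. more than 1s), format the
 * message into buf and return 1; otherwise leave buf empty and return 0.
 */
static int format_large_delta(char *buf, size_t len,
                              unsigned long delta, unsigned long hz, int cpu)
{
    assert(buf != NULL && len > 0);
    if (delta <= hz) {
        buf[0] = '\0';
        return 0;
    }
    snprintf(buf, len, "Delta %lu > HZ for cpu%d", delta, cpu);
    return 1;
}
```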

        So in 2.6.16, the watchdog thread itself guarantees a max timeout
of about 1s by hooking a timer, while in 2.6.18 the max timeout
value is volatile.

        So I'm inclined to consider it a fix, since there's no easy way
to deduce an appropriate timeout without explicit, hard-coded knowledge
of requirements such as the watchdog thread's. What do you think? :-)

P.S. The warning reported by Simon on 2.6.16 may be fixed by my 
previous patch, due to the late check.

Thanks,
Kevin

Attachment: fix_softlockup_2618.patch
Description: fix_softlockup_2618.patch

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

