
Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.



Hi Mauro, 

This question from Jan is for you:

> Philippe, could you clarify again what CPU model(s) this is being observed on
> (the long times between individual steps forward with this problem perhaps
> warrant repeating the basics each time, as it's otherwise quite cumbersome
> to always look up old pieces of information).

Can you provide the following information?
        cat /proc/cpuinfo
        cat /proc/meminfo
        hardware information (manufacturer, model, URLs, ...)

Thanks, Philippe


> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Thursday, November 08, 2012 10:40 AM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> Cc: 599161@xxxxxxxxxxxxxxx; mrsanna1@xxxxxxxxx; Ian Campbell;
> xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50
> minutes" bug.
> 
> >>> On 07.11.12 at 18:40, Keir Fraser <keir@xxxxxxx> wrote:
> > On 07/11/2012 13:22, "Ian Campbell" <ijc@xxxxxxxxxxxxxx> wrote:
> >
> >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> >>>> tsc_stamp=e3839fcb0273
> >>>
> >>> (below is the complete xm dmesg output)
> >>>
> >>> did that help you ? do you need more info ?
> >>
> >> I'll leave this to Keir (who wrote the debugging patch) to answer but
> >> it looks to me like it should be useful!
> >
> > I'm scratching my head. plt_wrap is earlier than plt_now, which should
> > be impossible. plt_stamp64 oddly has low 32 bits identical to
> > new_stamp. That seems very very improbable!
> 
> Is it? My understanding was that plt_stamp64 is just a software extension to
> the more narrow HW counter, and hence the low plt_mask bits would always
> be expected to be identical.
> 
> The plt_wrap < plt_now thing of course is entirely unexplainable to me too:
> Considering that plt_scale doesn't change at all post-boot, apart from
> memory corruption I could only see a memory access ordering problem to
> be the reason (platform_timer_stamp and/or stime_platform_stamp
> changing despite platform_timer_lock being held). So maybe taking a
> snapshot of all three static values involved in the calculation in
> __read_platform_stime() between acquiring the lock and the first call to
> __read_platform_stime(), and printing them together with the "live" values
> in a second printk() after the one your original patch added, could rule
> that out.
> 
> But the box doesn't even seem to be NUMA (of course it also doesn't help
> that the log level was kept restricted - hint, hint, Philippe), nor does there
> appear to be any S3 cycle or pCPU bring-up/-down in between...
> 
> Philippe, could you clarify again what CPU model(s) this is being observed on
> (the long times between individual steps forward with this problem perhaps
> warrant repeating the basics each time, as it's otherwise quite cumbersome
> to always look up old pieces of information).
> 
> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform
> > counter being at least somewhat likely to overflow before a softirq
> > occurs -- it seems lots of systems are using 14MHz HPET, and that
> > gives us a couple of minutes for the plt_overflow softirq to do its work
> > before overflow occurs.
> > I think we would notice that outage in other ways. :)
> 
> IIRC we added this for a good reason - to cover the, however unlikely, event
> of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed a
> wraparound time on the order of minutes should make this superfluous, but
> as the case here shows, that code did spot a severe anomaly (whatever that
> may turn out to be).
> 
> Also recall that there are HPET implementations around that tick at a much
> higher frequency than 14MHz.
> 
> So unless we finally reach the understanding that the code is flawed, I would
> rather want to keep it.
> 
> Jan
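
For reference, a minimal C sketch of the two points discussed above: extending
a narrow, wrapping hardware counter into a 64-bit software count (which is why
the low plt_mask bits of plt_stamp64 would be expected to equal new_stamp), and
the time a 32-bit counter takes to wrap. This is not Xen's actual code. The
function names extend_stamp() and wrap_seconds() are hypothetical, and
14318180 Hz is an assumed value for the "14MHz" HPET mentioned in the thread.

```c
#include <stdint.h>

/*
 * Extend a narrow, wrapping hardware counter into a 64-bit software count.
 * 'mask' is the counter width mask (plt_mask=ffffffff in the log above).
 * Hypothetical helper for illustration, not taken from the Xen sources.
 */
static uint64_t extend_stamp(uint64_t stamp64, uint64_t old_stamp,
                             uint64_t new_stamp, uint64_t mask)
{
    /* The masked subtraction yields the forward distance even when the
     * hardware counter wrapped once between the two reads. */
    return stamp64 + ((new_stamp - old_stamp) & mask);
}

/*
 * Seconds until a counter of the given width wraps at freq_hz.
 * 'bits' must be less than 64.
 */
static double wrap_seconds(unsigned int bits, double freq_hz)
{
    return (double)(1ULL << bits) / freq_hz;
}
```

As long as the low bits of the 64-bit value matched old_stamp before the
update, they match new_stamp afterwards. With the values from the log
(old_stamp=35c7c, new_stamp=800366a5, plt_mask=ffffffff), a previous 64-bit
value of 15b00035c7c (derived by subtraction here, not printed in the log)
gives 15b800366a5, the logged plt_stamp64. At the assumed 14318180 Hz,
wrap_seconds(32, 14318180.0) comes out just under 300 seconds.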


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

