Re: [Xen-devel] Workaround for buggy PIT
On 15.3.2006 13:32, Keir Fraser wrote:
>
> On 14 Mar 2006, at 18:05, Tomas Kopal wrote:
>
>> Well, in my case, I traced the problem down to a buggy chipset. The
>> VIA686a PIT timer randomly loses its programming and needs to be
>> reset. The Linux kernel has a workaround for this, but this does not get
>> used when Xen comes into play, as the hypervisor takes over control of
>> the PIT.
>> I have implemented a similar workaround in the Xen hypervisor. So far I
>> have been running it for about three weeks and the server is perfectly
>> stable.
>>
>> I am interested in your comments, and I would be happy if you could
>> apply this patch to the Xen sources.
>
> Do you have any details on what mode the timer enters when it loses its
> programming, whether this affects all PIT channels, etc?

Well, there is not much info on this. There is no official VIA info,
only speculation. Most of what I found is on LKML. The best summaries
I found are here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0111.0/1613.html
and
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.3/1068.html
One of the initial problem descriptions:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.2/1405.html

It seems to affect only one channel AFAIK, but it's not always the same
one (the Linux kernel uses channel 0, Xen uses channel 2, and the
problem is the same for both). It's probably not affecting all channels
together, as a bug on channel 1 could be quite disastrous to the memory
contents.

But similar problems may exist in other chipsets too:
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q274323
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q266344
So having a bit more "robust" PIT handling should generally help.

> The patch is
> potentially okay -- it differs from Linux in that we free-run channel 2
> (we don't periodically and automatically re-latch) and so the Linux test
> for count > latch does not work.
> The test you use (diff > 2*latch) is
> kind of weird, even if it does seem to work for you: I wonder what kind
> of mode it enters where readings make it look like it is running at
> three times normal speed?

I think that the mode is not changed, just the current value in the
timer. My explanation is that the timer sometimes (probably when the
system is under heavy load, such as during domU shutdown) makes a
"random jump", probably by resetting the current timer value to some
other, random one, and then continues counting. If this happens during a
calibration call, the calibrated values are completely off, and the
system time starts to run away because invalid calibration data is used.
Together with xntpd it can get even messier. (Just for the record, I
tried turning xntpd off in dom0, but the problem remained.) But this is
not backed up by any real evidence, so take it with heaps of salt :-).

The test for diff > 2*latch is a bit of a heuristic :-). You are right
that this differs from Linux: Xen does not reset the counter to latch
but lets it free-run. But the diff between subsequent values should
always be near the latch value, as the reads are driven by channel 0,
which is set to interrupt once per latch period. I printed out the real
diff values (tracking min and max over periods of time) and they varied
by about 40% around the latch value. I didn't want to get too many false
positives, so I set the threshold to double the expected value. As the
problematic values tend to be quite high, I think this is a safe
threshold.

>
> Also, although you detect and fix up channel 2 problems, all that code
> is driven off the channel 0 timer interrupt handler. What happens if ch0
> loses its programming? :-)

Don't know. Either it does not lose it, or the effect of losing it is
not that obvious. Do you know of any easy way to detect this (i.e. to
detect missing or late interrupts)? We can't use channel 2, as we can't
trust it. Maybe we can use the TSC?
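To make the diff > 2*latch heuristic concrete, here is a minimal sketch
of the check in isolation. It is not the code from the patch: the names
(pit_elapsed, pit_jump_suspected), the 1193182 Hz input clock and the
100 Hz tick rate are assumptions chosen for illustration, and the real
hypervisor code reads the counter from hardware rather than taking it
as an argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only (not taken from the patch). */
#define PIT_CLOCK_HZ 1193182u  /* nominal PIT input clock */
#define TICK_HZ      100u      /* assumed channel 0 interrupt rate */
#define LATCH        ((PIT_CLOCK_HZ + TICK_HZ / 2) / TICK_HZ)

/*
 * Ticks elapsed between two reads of a free-running 16-bit down-counter.
 * The subtraction is done modulo 2^16, so a wrap from 0 back to 0xFFFF
 * between the two reads is handled without a special case.
 */
static uint16_t pit_elapsed(uint16_t prev, uint16_t now)
{
    return (uint16_t)(prev - now);
}

/*
 * Heuristic from the thread: channel 2 is sampled once per channel 0
 * interrupt, so the elapsed count should stay near LATCH. Anything above
 * twice that is treated as a suspected counter jump.
 */
static bool pit_jump_suspected(uint16_t prev, uint16_t now)
{
    /* The threshold must fit inside the 16-bit counter range. */
    assert(2u * LATCH < 65536u);
    return pit_elapsed(prev, now) > 2u * LATCH;
}
```

With these assumed values LATCH is 11932, so a reading that is 40% above
nominal (about 16700 ticks) still passes, while a counter value that
appears to move backwards shows up as a very large elapsed count and is
flagged. A late channel 0 interrupt would also produce a large diff, so
this check alone cannot tell the two cases apart.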
As I said, I expect the timer to continue counting, so if I am right,
the only problem it can cause is that the timer interrupt will come a
bit late. Apart from timekeeping, this should not be a big deal, or is
it? Now that I think about it, the cause may even be that the counter
problem is in channel 0 only. Then the timer interrupt would come a lot
later and the difference between channel 2 values could overflow to
negative values?

>
> Really I want to understand this problem rather better before committing
> a patch for a six-year-old chipset.
>
> -- Keir

Yes, the chipset is quite old. We were already thinking about replacing
it, but after this fix, it will probably have to serve a bit longer :-).
I share your desire to understand the problem, but I still don't
understand it, and it seems that the people on LKML didn't completely
understand it either. And according to the Microsoft support articles,
it may be quite widespread, even on newer chipsets...

Feel free to make it a compile-time option, or just move it to contrib.
But if it can save anyone the trouble I had to go through, it would
definitely be beneficial to have it in mainline, especially as it does
not add any penalty on fault-free systems.

Thanks a lot

Tomas

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel