
Re: [Xen-devel] [PATCH 1/1] xentrace: Add TRC_HW_VCHIP

On 03/28/14 07:45, Jan Beulich wrote:
On 28.03.14 at 12:25, <dslutz@xxxxxxxxxxx> wrote:
This adds a set of trace events that track the setup of various
virtual chips related to timers in domU.

This set is hpet, pit (i8253, i8254), rtc (MC146818), apic (lapic),
and pic (i8259).  The pmtimer is not traced since it does not have a
changeable rate.
But you're not saying anything about why this would be useful
(considering that it wasn't needed before), and hence don't
provide a reason for taking this change.

Thank you for asking.  I am assuming from this that some
patches (like this one) should have this.

This is an area that I am very weak on.  No simple statement
comes to mind.  So here is the story of how this patch came about.

Months ago, over a period of about 2 to 3 days, one server would
have its single domU hang after first boot (about 1 time in 10), with
the first interesting message on the domU console being:

..MP-BIOS bug: 8254 timer not connected to IO-APIC

Since I know that this message was added to deal with certain
bad motherboards and that xen does not have this issue, I
started looking into this.

I considered adding code like:

HVM_DBG_LOG(DBG_LEVEL_VLAPIC_TIMER, "value[0x%016"PRIx64"]", value);

but I was not sure that this would not change the timing enough to
stop the bug from happening.  So I wrote this patch instead.

I then spent a lot of time trying to reproduce this issue.  I was
not able to, and neither was the person who reported it, even on the
same server.  This was under various configurations:

1) No change.

2) debug=y xen build

3) debug=n + patch

4) debug=y + patch

Using a few of the trace files and the source code of the domU's
kernel, I was able to determine that the hpet.c code was involved.

Using this knowledge, I made a patch to xen to simulate various
values of "diff" (tn_cmp - cur_tick).  With this debug code I was able
to generate the hang on demand.  This work is what caused me to
post the patch:

hpet: Act more like real hardware

I now know that patch is not complete.  More testing since then
has shown that 'diff > 0' will also cause this report if diff is large
enough.  Armed with this, I went back to a few saved traces that I
had and was able to determine that the first interval in the calls to
create_periodic_time() (i.e. diff) had a very high variance.  I no
longer have the actual data, but my memory is that the
hpet_tick_to_ns(h, diff) values ranged from 23,696ns to 955,456ns.
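
As a rough illustration of the arithmetic behind those numbers, here is a
minimal sketch.  It is not the code in hpet.c.  The names sketch_diff32()
and sketch_tick_to_ns() are made up for this example, and the 16ns tick
period is an assumption (both remembered values happen to be multiples
of 16, which is at least consistent with it).  Treating diff as a signed
32-bit quantity is also just one possible interpretation.

```c
#include <stdint.h>

/* Assumption for illustration only: one HPET tick is 16ns. */
#define SKETCH_NS_PER_TICK 16ULL

/*
 * Distance from the current counter value to the comparator for a
 * 32-bit timer.  The unsigned subtraction wraps modulo 2^32; the cast
 * then treats a comparator that is "behind" the counter as negative.
 */
static int64_t sketch_diff32(uint32_t tn_cmp, uint32_t cur_tick)
{
    return (int32_t)(tn_cmp - cur_tick);
}

/* Convert a non-negative tick count to nanoseconds. */
static uint64_t sketch_tick_to_ns(uint64_t ticks)
{
    return ticks * SKETCH_NS_PER_TICK;
}
```

Under that assumed period, the remembered range of 23,696ns to 955,456ns
would correspond to a first interval of 1,481 to 59,716 ticks.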

Looking further into Linux in this area, and learning about the hpet
hardware and specification, leads me to conclude that this should not
be happening.

I am still working on the set of changes to the hpet.c code to fix the
set of bugs that I think are there.

So I only know that this patch did provide very useful data to me. I
would think that it would be a help to developers in the future.

I think one could write a complex analysis program for the current
trace data and infer what some of this trace data would have been.
This to me is a lot harder.  Since this is a new, independently
selectable trace class, a developer can decide to include or exclude
these events.

This is long (and not a good reason for taking this change) but I
hope it helps.

   -Don Slutz

