[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Serial console stuck during boot, unblocked with xl debug-key



On Thu, Jan 04, 2024 at 12:59:28PM +0100, Jan Beulich wrote:
> On 29.12.2023 10:50, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > This is continuation from matrix chat. There is an occasional failure on
> > qubes-hw2 gitlab runner that console become stuck during boot. I can now
> > reproduce it _much_ more often on another system, and the serial console 
> > output
> > ends with:
> > 
> >     (XEN) Allocated console ring of 256 KiB.
> >     (XEN) Using HWP for cpufreq
> >     (XEN) mwait-idle: does not run on family 6
> > 
> > It should be:
> > 
> >     (XEN) Allocated console ring of 256 KiB.
> >     (XEN) Using HWP for cpufreq
> >     (XEN) mwait-idle: does not run on family 6 model 183
> >     (XEN) VMX: Supported advanced features:
> >     (XEN)  - APIC MMIO access virtualisation
> >     (XEN)  - APIC TPR shadow
> >     ...
> > 
> > 
> > Otherwise the system works perfectly fine, the logs are available in
> > full via `xl dmesg` etc. Doing (any?) `xl debug-key` unblocks the
> > console and missing logs gets dumped there too. I narrowed it down to
> > the serial console tx buffer and collected some info with the attacked
> > patch (it collects info still during boot, after the place where it
> > usually breaks). When it works, I get:
> > 
> >     (XEN) SERIAL DEBUG: txbufc: 0x5b5, txbufp: 0x9f7, uart intr_works: 1, 
> > serial_txbufsz: 0x4000, tx_ready: 0, lsr_mask: 0x20, msi: 0, io_size: 8, 
> > skipped_interrupts: 0
> > 
> > And when it breaks, I get:
> > 
> >     (XEN) SERIAL DEBUG: txbufc: 0x70, txbufp: 0x9fd, uart intr_works: 1, 
> > serial_txbufsz: 0x4000, tx_ready: 16, lsr_mask: 0x20, msi: 0, io_size: 8, 
> > skipped_interrupts: 0
> 
> The only meaningful difference is tx_ready then. Looking at
> ns16550_tx_ready() I wonder whether the LSR reports inconsistent
> values on successive reads (there are at least three separate calls
> to the function out of serial_tx_interrupt() alone). What you didn't
> log is the LSR value itself; from the tx_ready value one can conclude
> though that in the bad case fifo_size was returned, while in the good
> case 0 was passed back. At the first glance this looks backwards, or
> in other words I can't explain why it would be this way round. (I
> assume you've had each case multiple times, and the output was
> sufficiently consistent; that doesn't go without saying as your
> invocation of serial_debug() is competing with the asynchronous
> transmitting of data [if any].) It being this way round might suggest
> that we lost an interrupt.

That is my current hypothesis too. Either at the hw level (being masked
for some reason at some point?) or on sw level (somehow not calling the
handler - that's why adding skipped_interrupts).

> Is this a real serial port, or one mimicked
> by a BMC (SoL or alike)?

This one is a real serial port. It isn't fully reproducible, but
happened sufficiently often that I'm quite sure of the above info.
Yes, my serial_debug() can interfere with data transfer, but I
intentionally added it significantly later than the issue happens (I
realize that console output end may not directly coincide with actual
time of the problem due to async sending, but still IMO should
be good enough). I later moved it to keyhandler, but that didn't give
any more info.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: signature.asc
Description: PGP signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.