Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
On Wed, Nov 19, 2014 at 12:16:44PM +0100, Sander Eikelenboom wrote:
>
> Wednesday, November 19, 2014, 2:55:41 AM, you wrote:
>
> > On Tue, Nov 18, 2014 at 11:12:54PM +0100, Sander Eikelenboom wrote:
> >>
> >> Tuesday, November 18, 2014, 9:56:33 PM, you wrote:
> >>
> >> >>
> >> >> Uhmm i thought i had these switched off (due to problems earlier
> >> >> and then forgot about them .. however looking at the earlier
> >> >> reports these lines were also in those reports).
> >> >>
> >> >> The xen-syms and these last runs are all with a pristine xen tree
> >> >> cloned today (staging branch), so the qemu-xen and seabios defined
> >> >> with that were also freshly cloned and had a new default seabios
> >> >> config. (just to rule out anything stale in my tree)
> >> >>
> >> >> If you don't see those messages .. perhaps your seabios and qemu
> >> >> trees (and at least the seabios config) are not the most recent
> >> >> (they don't get updated automatically when you just do a git pull
> >> >> on the main tree) ?
> >> >>
> >> >> In /tools/firmware/seabios-dir/.config i have:
> >> >> CONFIG_USB=y
> >> >> CONFIG_USB_UHCI=y
> >> >> CONFIG_USB_OHCI=y
> >> >> CONFIG_USB_EHCI=y
> >> >> CONFIG_USB_XHCI=y
> >> >> CONFIG_USB_MSC=y
> >> >> CONFIG_USB_UAS=y
> >> >> CONFIG_USB_HUB=y
> >> >> CONFIG_USB_KEYBOARD=y
> >> >> CONFIG_USB_MOUSE=y
> >> >>
> >>
> >> > I seem to have the same thing. Perhaps it is my XHCI controller
> >> > being wonky.
> >>
> >> >> And this is all just from a:
> >> >> - git clone git://xenbits.xen.org/xen.git -b staging
> >> >> - make clean && ./configure && make -j6 && make -j6 install
> >>
> >> > Aye.
> >> > .. snip..
> >> >> > 1) test_and_[set|clear]_bit sometimes return unexpected values.
> >> >> > [But this might be invalid as the addition of the ffff8303faaf25a8
> >> >> > might be correct - as the second dpci the softirq is processing
> >> >> > could be the MSI one]
> >> >>
> >> >> Would there be an easy way to stress test this function separately
> >> >> in some debugging function to see if it indeed is returning
> >> >> unexpected values ?
> >>
> >> > Sadly no. But you got me looking in the right direction when you
> >> > mentioned 'timeout'.
> >> >>
> >> >> > 2) INIT_LIST_HEAD operations on the same CPU are not honored.
> >> >>
> >> >> Just curious, have you also tested the patches on AMD hardware ?
> >>
> >> > Yes. To reproduce this the first thing I did was to get an AMD box.
> >>
> >> >>
> >> >>
> >> >> >> When i look at the combination of (2) and (3), It seems it could
> >> >> >> be an interaction between the two passed through devices and/or
> >> >> >> different IRQ types.
> >> >>
> >> >> > Could be - as in it is causing this issue to show up faster than
> >> >> > expected. Or it is the one that triggers more than one dpci
> >> >> > happening at the same time.
> >> >>
> >> >> Well that didn't seem to be it (see separate amendment i mailed
> >> >> previously)
> >>
> >> > Right, the current theory I have is that the interrupts are not being
> >> > Acked within 8 milliseconds and we reset the 'state' - and at the same
> >> > time we get an interrupt and schedule it - while we are still processing
> >> > the same interrupt. This would explain why the 'test_and_clear_bit'
> >> > got the wrong value.
> >>
> >> > In regards to the list poison - following this thread of logic - with
> >> > the 'state = 0' set we open the floodgates for any CPU to put the same
> >> > 'struct hvm_pirq_dpci' on its list.
> >>
> >> > We do reset the 'state' on _every_ GSI that is mapped to a guest - so
> >> > we also reset the 'state' for the MSI one (XHCI).
> >> > Anyhow in your case:
> >>
> >> > CPUX:                            CPUY:
> >> > pt_irq_time_out:
> >> > state = 0;
> >> > [out of timer code, the          raise_softirq
> >> > pirq_dpci is on the dpci_list]   [adds the pirq_dpci as state == 0]
> >>
> >> > softirq_dpci                     softirq_dpci:
> >> >   list_del
> >> >   [entries poison]
> >> >                                    list_del <= BOOM
> >> >
> >> > Is what I believe is happening.
> >>
> >> > The INTX device - once I put a load on it - does not trigger
> >> > any pt_irq_time_out, so that would explain why I cannot hit this.
> >>
> >> > But I believe your card hits these "hiccups".
> >>
> >>
> >> Hi Konrad,
> >>
> >> I just tested your 5 patches and as a result i still got an(other) host
> >> crash:
> >> (complete serial log attached)
> >>
> >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not tainted ]----
> >> (XEN) [2014-11-18 21:55:41.591] CPU: 0
> >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not tainted ]----
> >> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08012c7e7>]CPU: 2
> >> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08014a461>] hvm_do_IRQ_dpci+0xbd/0x13c
> >> (XEN) [2014-11-18 21:55:41.591] RFLAGS: 0000000000010006 _spin_unlock+0x1f/0x30CONTEXT: hypervisor
>
> > Duh!
>
> > Here is another patch on top of the five you have (attached and inline).
>
> Hi Konrad,
>
> Happy to report it has been running with this additional patch for 2 hours
> now without any problems. I think you nailed it :-)

Could you also do an 'xl debug-keys k' and send that please?

> More than happy to test the definitive patch as well.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
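
The CPUX/CPUY interleaving quoted above can be replayed in a few lines. What follows is only a rough sketch in Python, not the Xen code: `ListNode`, `PirqDpci`, `raise_softirq_for` and `replay` are hypothetical names made up for illustration, the 'state' word is reduced to a single boolean, and the two CPUs are replayed sequentially in one particular order. It assumes, as the theory in the thread suggests, that a timeout clears the state while the entry is still queued, so a second CPU can queue the same entry and the second list_del then hits the poisoned pointers.

```python
POISON = object()  # stand-in for the poison values list_del leaves behind


class ListNode:
    """Minimal circular doubly linked list node (also used as a list head)."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self


def list_add_tail(entry, head):
    entry.prev = head.prev
    entry.next = head
    head.prev.next = entry
    head.prev = entry


def list_del(entry):
    if entry.prev is POISON or entry.next is POISON:
        raise RuntimeError("list_del on poisoned entry: " + entry.name)
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.prev = entry.next = POISON


class PirqDpci:
    """Hypothetical, heavily reduced stand-in for 'struct hvm_pirq_dpci'."""

    def __init__(self):
        self.scheduled = False  # models the 'state' guard as one flag
        self.node = ListNode("pirq_dpci")


def raise_softirq_for(dpci, cpu_list):
    """Queue dpci on cpu_list unless it is already marked as scheduled."""
    if dpci.scheduled:
        return False
    dpci.scheduled = True
    list_add_tail(dpci.node, cpu_list)
    return True


def replay(reset_state_on_timeout):
    cpux_list, cpuy_list = ListNode("cpux"), ListNode("cpuy")
    dpci = PirqDpci()

    raise_softirq_for(dpci, cpux_list)  # CPUX queues the entry
    if reset_state_on_timeout:
        dpci.scheduled = False  # timeout: state = 0 while still queued
    queued_twice = raise_softirq_for(dpci, cpuy_list)  # CPUY: new interrupt

    # Both softirq handlers pick up their first entry before either unlinks.
    x_entry = cpux_list.next
    y_entry = cpuy_list.next if queued_twice else None

    list_del(x_entry)  # CPUX: unlinks and poisons the entry
    if y_entry is None:
        return "ok"
    try:
        list_del(y_entry)  # CPUY: same entry, already poisoned
    except RuntimeError:
        return "BOOM"
    return "ok"


if __name__ == "__main__":
    print(replay(reset_state_on_timeout=True))   # BOOM
    print(replay(reset_state_on_timeout=False))  # ok
```

With the reset left out, the second raise is refused because the entry is still marked as scheduled, so only one CPU ever unlinks it.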