Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
Wednesday, November 19, 2014, 4:04:59 PM, you wrote:
> On Wed, Nov 19, 2014 at 12:16:44PM +0100, Sander Eikelenboom wrote:
>>
>> Wednesday, November 19, 2014, 2:55:41 AM, you wrote:
>>
>> > On Tue, Nov 18, 2014 at 11:12:54PM +0100, Sander Eikelenboom wrote:
>> >>
>> >> Tuesday, November 18, 2014, 9:56:33 PM, you wrote:
>> >>
>> >> >> Uhmm i thought i had these switched off (due to problems earlier and then forgot
>> >> >> about them .. however looking at the earlier reports these lines were also in
>> >> >> those reports).
>> >> >>
>> >> >> The xen-syms and these last runs are all with a pristine xen tree cloned today
>> >> >> (staging branch), so the qemu-xen and seabios defined with that were also
>> >> >> freshly cloned and had a new default seabios config. (just to rule out
>> >> >> anything stale in my tree)
>> >> >>
>> >> >> If you don't see those messages .. perhaps your seabios and qemu trees (and at
>> >> >> least the seabios config) are not the most recent (they don't get updated
>> >> >> automatically when you just do a git pull on the main tree) ?
>> >> >>
>> >> >> In /tools/firmware/seabios-dir/.config i have:
>> >> >> CONFIG_USB=y
>> >> >> CONFIG_USB_UHCI=y
>> >> >> CONFIG_USB_OHCI=y
>> >> >> CONFIG_USB_EHCI=y
>> >> >> CONFIG_USB_XHCI=y
>> >> >> CONFIG_USB_MSC=y
>> >> >> CONFIG_USB_UAS=y
>> >> >> CONFIG_USB_HUB=y
>> >> >> CONFIG_USB_KEYBOARD=y
>> >> >> CONFIG_USB_MOUSE=y
>> >> >>
>> >> > I seem to have the same thing. Perhaps it is my XHCI controller being wonky.
>> >>
>> >> >> And this is all just from a:
>> >> >> - git clone git://xenbits.xen.org/xen.git -b staging
>> >> >> - make clean && ./configure && make -j6 && make -j6 install
>> >>
>> >> > Aye.
>> >> > .. snip..
>> >> >> > 1) test_and_[set|clear]_bit sometimes return unexpected values.
>> >> >> > [But this might be invalid as the addition of the ffff8303faaf25a8
>> >> >> > might be correct - as the second dpci the softirq is processing
>> >> >> > could be the MSI one]
>> >> >>
>> >> >> Would there be an easy way to stress test this function separately in some
>> >> >> debugging function to see if it indeed is returning unexpected values ?
>> >>
>> >> > Sadly no. But you got me looking in the right direction when you mentioned
>> >> > 'timeout'.
>> >> >>
>> >> >> > 2) INIT_LIST_HEAD operations on the same CPU are not honored.
>> >> >>
>> >> >> Just curious, have you also tested the patches on AMD hardware ?
>> >>
>> >> > Yes. To reproduce this the first thing I did was to get an AMD box.
>> >> >> >>
>> >> >> >> When i look at the combination of (2) and (3), It seems it could be an
>> >> >> >> interaction between the two passed through devices and/or different
>> >> >> >> IRQ types.
>> >> >>
>> >> >> > Could be - as in it is causing this issue to show up faster than
>> >> >> > expected. Or it is the one that triggers more than one dpci happening
>> >> >> > at the same time.
>> >> >>
>> >> >> Well that didn't seem to be it (see separate amendment i mailed previously)
>> >>
>> >> > Right, the current theory I have is that the interrupts are not being
>> >> > Acked within 8 milliseconds and we reset the 'state' - and at the same
>> >> > time we get an interrupt and schedule it - while we are still processing
>> >> > the same interrupt. This would explain why the 'test_and_clear_bit'
>> >> > got the wrong value.
>> >>
>> >> > In regards to the list poison - following this thread of logic - with
>> >> > the 'state = 0' set we open the floodgates for any CPU to put the same
>> >> > 'struct hvm_pirq_dpci' on its list.
>> >>
>> >> > We do reset the 'state' on _every_ GSI that is mapped to a guest - so
>> >> > we also reset the 'state' for the MSI one (XHCI).
>> >> > Anyhow in your case:
>> >>
>> >> > CPUX:                               CPUY:
>> >> > pt_irq_time_out:
>> >> > state = 0;
>> >> > [out of timer code, the             raise_softirq
>> >> >  pirq_dpci is on the dpci_list]     [adds the pirq_dpci as state == 0]
>> >>
>> >> > softirq_dpci                        softirq_dpci:
>> >> >   list_del
>> >> >   [entries poison]
>> >> >                                     list_del <= BOOM
>> >> >
>> >> > Is what I believe is happening.
>> >>
>> >> > The INTX device - once I put a load on it - does not trigger
>> >> > any pt_irq_time_out, so that would explain why I cannot hit this.
>> >>
>> >> > But I believe your card hits these "hiccups".
>> >>
>> >> Hi Konrad,
>> >>
>> >> I just tested your 5 patches and as a result i still got an(other) host crash:
>> >> (complete serial log attached)
>> >>
>> >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not tainted ]----
>> >> (XEN) [2014-11-18 21:55:41.591] CPU: 0
>> >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not tainted ]----
>> >> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08012c7e7>]CPU: 2
>> >> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08014a461>] hvm_do_IRQ_dpci+0xbd/0x13c
>> >> (XEN) [2014-11-18 21:55:41.591] RFLAGS: 0000000000010006 _spin_unlock+0x1f/0x30CONTEXT: hypervisor
>>
>> > Duh!
>>
>> > Here is another patch on top of the five you have (attached and inline).
>>
>> Hi Konrad,
>>
>> Happy to report it has been running with this additional patch for 2 hours now
>> without any problems. I think you nailed it :-)

> Could you also do an 'xl debug-keys k' and send that please?
Sure:

(XEN) [2014-11-19 17:26:05.839] CPU00:
(XEN) [2014-11-19 17:26:05.839] d16 OK-softirq 1msec ago, state:1, 751216 count, [prev:ffff82d0802e7e70, next:ffff82d0802e7e70] ffff8303fab608a8 22c258
(XEN) [2014-11-19 17:26:05.839] d16 OK-raise 1msec ago, state:1, 751216 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c257
(XEN) [2014-11-19 17:26:05.839] d16 OK-raise 347977msec ago, state:1, 61 count, [prev:ffff82d080329160, next:ffff82d080329160] ffff8303fab608a8 203775
(XEN) [2014-11-19 17:26:05.839] d16 OK-reset 1msec ago, state:0, 258049 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c256
(XEN) [2014-11-19 17:26:05.839] d16 OK-timeout 1msec ago, state:0, 258049 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c254
(XEN) [2014-11-19 17:26:05.839] d16 OK-timeout 1msec ago, state:0, 258049 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c255
(XEN) [2014-11-19 17:26:05.839] d16 Z-softirq 5746msec ago, state:6, 669 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22b871
(XEN) [2014-11-19 17:26:05.839] d16 Z-raise 5746msec ago, state:4, 669 count, [prev:ffff82d080329160, next:ffff82d080329160] ffff8303fab608a8 22b86f
(XEN) [2014-11-19 17:26:05.839] CPU01:
(XEN) [2014-11-19 17:26:05.839] CPU02:
(XEN) [2014-11-19 17:26:05.839] CPU03:
(XEN) [2014-11-19 17:26:05.839] CPU04:
(XEN) [2014-11-19 17:26:05.840] CPU05:

>> More than happy to test the definitive patch as well.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel