[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
Wednesday, November 19, 2014, 2:55:41 AM, you wrote: > On Tue, Nov 18, 2014 at 11:12:54PM +0100, Sander Eikelenboom wrote: >> >> Tuesday, November 18, 2014, 9:56:33 PM, you wrote: >> >> >> >> >> Uhmm i thought i had these switched off (due to problems earlier and then >> >> forgot >> >> about them .. however looking at the earlier reports these lines were >> >> also in >> >> those reports). >> >> >> >> The xen-syms and these last runs are all with a prestine xen tree cloned >> >> today (staging >> >> branch), so the qemu-xen and seabios defined with that were also freshly >> >> cloned >> >> and had a new default seabios config. (just to rule out anything stale in >> >> my tree) >> >> >> >> If you don't see those messages .. perhaps your seabios and qemu trees >> >> (and at least the >> >> seabios config) are not the most recent (they don't get updated >> >> automatically >> >> when you just do a git pull on the main tree) ? >> >> >> >> In /tools/firmware/seabios-dir/.config i have: >> >> CONFIG_USB=y >> >> CONFIG_USB_UHCI=y >> >> CONFIG_USB_OHCI=y >> >> CONFIG_USB_EHCI=y >> >> CONFIG_USB_XHCI=y >> >> CONFIG_USB_MSC=y >> >> CONFIG_USB_UAS=y >> >> CONFIG_USB_HUB=y >> >> CONFIG_USB_KEYBOARD=y >> >> CONFIG_USB_MOUSE=y >> >> >> >> > I seem to have the same thing. Perhaps it is my XHCI controller being >> > wonky. >> >> >> And this is all just from a: >> >> - git clone git://xenbits.xen.org/xen.git -b staging >> >> - make clean && ./configure && make -j6 && make -j6 install >> >> > Aye. >> > .. snip.. >> >> > 1) test_and_[set|clear]_bit sometimes return unexpected values. >> >> > [But this might be invalid as the addition of the ffff8303faaf25a8 >> >> > might be correct - as the second dpci the softirq is processing >> >> > could be the MSI one] >> >> >> >> Would there be an easy way to stress test this function separately in >> >> some >> >> debugging function to see if it indeed is returning unexpected values ? >> >> > Sadly no. But you got me looking in the right direction when you mentioned >> > 'timeout'. >> >> >> >> > 2) INIT_LIST_HEAD operations on the same CPU are not honored. >> >> >> >> Just curious, have you also tested the patches on AMD hardware ? >> >> > Yes. To reproduce this the first thing I did was to get an AMD box. >> >> >> >> >> >> >> >> When i look at the combination of (2) and (3), It seems it could be an >> >> >> interaction between the two passed through devices and/or different >> >> >> IRQ types. >> >> >> >> > Could be - as in it is causing this issue to show up faster than >> >> > expected. Or it is the one that triggers more than one dpci happening >> >> > at the same time. >> >> >> >> Well that didn't seem to be it (see separate amendment i mailed >> >> previously) >> >> > Right, the current theory I've is that the interrupts are not being >> > Acked within 8 milisecond and we reset the 'state' - and at the same >> > time we get an interrupt and schedule it - while we are still processing >> > the same interrupt. This would explain why the 'test_and_clear_bit' >> > got the wrong value. >> >> > In regards to the list poison - following this thread of logic - with >> > the 'state = 0' set we open the floodgates for any CPU to put the same >> > 'struct hvm_pirq_dpci' on its list. >> >> > We do reset the 'state' on _every_ GSI that is mapped to a guest - so >> > we also reset the 'state' for the MSI one (XHCI). Anyhow in your case: >> >> > CPUX: CPUY: >> > pt_irq_time_out: >> > state = 0; >> > [out of timer coder, the raise_softirq >> > pirq_dpci is on the dpci_list] [adds the pirq_dpci as state == 0] >> >> > softirq_dpci softirq_dpci: >> > list_del >> > [entries poison] >> > list_del <= BOOM >> > >> > Is what I believe is happening. >> >> > The INTX device - once I put a load on it - does not trigger >> > any pt_irq_time_out, so that would explain why I cannot hit this. >> >> > But I believe your card hits these "hiccups". >> >> >> Hi Konrad, >> >> I just tested you 5 patches and as a result i still got an(other) host crash: >> (complete serial log attached) >> >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not >> tainted ]---- >> (XEN) [2014-11-18 21:55:41.591] CPU: 0 >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not >> tainted ]---- >> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08012c7e7>]CPU: 2 >> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08014a461>] >> hvm_do_IRQ_dpci+0xbd/0x13c >> (XEN) [2014-11-18 21:55:41.591] RFLAGS: 0000000000010006 >> _spin_unlock+0x1f/0x30CONTEXT: hypervisor > Duh! > Here is another patch on top of the five you have (attached and inline). Hi Konrad, Happy to report it has been running with this additional patch for 2 hours now without any problems. I think you nailed it :-) More than happy to test the definitive patch as well. -- Sander > From 18008650f10199e2b83f74394f634a97086e34b8 Mon Sep 17 00:00:00 2001 > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> > Date: Tue, 18 Nov 2014 20:48:43 -0500 > Subject: [PATCH] debug: Whether we want to clear the 'dom' field or not when > changing the 'state' bit. > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> > --- > xen/drivers/passthrough/io.c | 16 ++++++++++------ > 1 file changed, 10 insertions(+), 6 deletions(-) > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c > index aad5607..24e2bd6 100644 > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -243,13 +243,14 @@ bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci > *pirq_dpci) > } > > /* > - * Reset the pirq_dpci->dom parameter to NULL. > + * Reset the pirq_dpci->dom parameter to NULL if 'clear_dom' is set. > * > * This function checks the different states to make sure it can do it > * at the right time. If it unschedules the 'hvm_dirq_assist' from running > * it also refcounts (which is what the softirq would have done) properly. > */ > -static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci) > +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci, > + unsigned int clear_dom) > { > struct domain *d = pirq_dpci->dom; > struct __debug *debug = &__get_cpu_var(_d); > @@ -276,7 +277,8 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci > *pirq_dpci) > * in local variable before it sets STATE_RUN - and therefore will > not > * dereference '->dom' which would crash. > */ - pirq_dpci->>dom = NULL; > + if (clear_dom) > + pirq_dpci->dom = NULL; > break; > } > } > @@ -295,7 +297,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct > hvm_pirq_dpci *pirq_dpci, > &pirq_dpci->flags) ) > { > _record(&debug->clear, pirq_dpci); > - pt_pirq_softirq_reset(pirq_dpci); > + pt_pirq_softirq_reset(pirq_dpci, 0 /* leave ->dom /); > /* pirq_dpci->state = 0; <= OUCH! */ > pirq_dpci->pending = 0; > pirq_guest_eoi(dpci_pirq(pirq_dpci)); > @@ -333,9 +335,11 @@ static void pt_irq_time_out(void *data) > pt_pirq_iterate(irq_map->dom, pt_irq_guest_eoi, NULL); > > spin_unlock(&irq_map->dom->event_lock); > +#if 0 > console_start_sync(); > dump_debug((char)0); > console_end_sync(); > +#endif > } > > struct hvm_irq_dpci *domain_get_irq_dpci(const struct domain *d) > @@ -446,7 +450,7 @@ int pt_irq_create_bind( > * in the queue. > */ > pirq_dpci->line = __LINE__; > - pt_pirq_softirq_reset(pirq_dpci); > + pt_pirq_softirq_reset(pirq_dpci, 1 /* clear dom */); > } > } > if ( unlikely(rc) ) > @@ -701,7 +705,7 @@ int pt_irq_destroy_bind( > * call to pt_pirq_softirq_reset. > */ > pirq_dpci->line = __LINE__; > - pt_pirq_softirq_reset(pirq_dpci); > + pt_pirq_softirq_reset(pirq_dpci, 1 /* clear ->dom */); > > pirq_cleanup_check(pirq, d); > } _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |