
Re: [Xen-devel] [xen-devel] create irq failed due to move_cleanup_count always being set




> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: Friday, January 06, 2012 8:18 PM
> To: Liuyongan
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Keir (Xen.org); Qianhuibin
> Subject: Re: [xen-devel] create irq failed due to move_cleanup_count
> always being set
> 
> On 06/01/12 11:50, Liuyongan wrote:
> >
> >> -----Original Message-----
> >> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> >> Sent: Friday, January 06, 2012 7:01 PM
> >> To: Liuyongan
> >> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Keir (Xen.org); Qianhuibin
> >> Subject: Re: [xen-devel] create irq failed due to move_cleanup_count
> >> always being set
> >>
> >> Could you please avoid top posting.
> >>
> >> On 06/01/12 06:04, Liuyongan wrote:
> >>>    As only 33 domains were successfully created (and destroyed)
> >>> before the problem occurred, there should be enough free IRQ
> >>> numbers and vectors to allocate (even supposing that irqs and
> >>> vectors failed to be deallocated). And destroy_irq() will clear
> >>> move_in_progress, so move_cleanup_count must be set? Is this the
> >>> case?
> >>
> >> Is it repeatably 33 domains, or was that a one-off experiment?  Can
> >> you
> >   No, it's not repeatable; this occurred 2 times, the other one was
> >   after 152 domains.
> 
> Can you list all the failures you have seen with the number of domains?
> So far it seems that it has been 33 twice but many more some of the
> time, which doesn't lend itself to saying "33 domains is a systematic
> failure" for certain at the moment.

  Sorry, to make it clear: this problem occurred twice, once after 33
  domains and once after 152 domains. I'm not quite expressive in
  English.

> 
> >> confirm exactly which version of Xen you are using, including
> >> changeset if you know it?  Without knowing your hardware, it is
> >> hard to say if there are actually enough free IRQs, although I do
> >> agree that what you are currently seeing is buggy behavior.
> >>
> >> The per-cpu IDT functionality introduced in Xen-4.0 is fragile at
> >> the best of times, and has had several bugfixes and tweaks to it
> >> which I am not certain have actually found their way back to
> >> Xen-4.0.  Could you try with Xen-4.1 and see if the problem
> >> persists?
> >>
> >> ~Andrew
> >   As I could not make it re-occur in xen-4.0, trying xen-4.1 seems
> >   useless.
> > I noticed a scenario:
> 
> I am confused.  Above, you say that the problem is repeatable, but here
> you say it is not.
> 
> >    1) move_in_progress occurs;
> >    2) the IPI IRQ_MOVE_CLEANUP_VECTOR interrupt is sent;
> >    3) the irq is destroyed, so cfg->vector is cleared, etc.;
> >    4) the IRQ_MOVE_CLEANUP_VECTOR interrupt is serviced.
> >
> >   In xen-4.1, at step 3, vector_irq of old_cpu_mask/old_domain is
> >   also reset, so in step 4) move_cleanup_count will fail to be
> >   decremented, finally leading to create_irq failure (right?);
> >
> >   In xen-4.0, at step 3 (and in my code), vector_irq is not reset
> >   (this is a bug as you've mentioned); I still could not figure out
> >   why create_irq should fail.
> 
> The first point of debugging should be to see how create_irq is
> failing.  Is it failing because of find_unassigned_irq() or because of
> __assign_irq_vector()?
> 
> Another piece of useful information would be what your guests are and
> what they are trying to do with interrupts.  Are you using PCI
> passthrough?
> 
> ~Andrew

  Thanks for your suggestion. I think I've got the reason. Digging into
  the details:
  1) a vector move starts (move_in_progress is set);
  2) a new interrupt arrives on the new cpu(s), so the IPI
     IRQ_MOVE_CLEANUP_VECTOR interrupt is sent;
  3) the irq is destroyed, so __clear_irq_vector() is called;
  4) the IRQ_MOVE_CLEANUP_VECTOR interrupt is serviced by
     smp_irq_move_cleanup_interrupt().
  
  In step 3), code with the patch applied ("cpus_and(tmp_mask,
  cfg->old_domain, cpu_online_map);") also clears vector_irq on
  old_cpu_mask/old_domain, so in step 4):
        irq = __get_cpu_var(vector_irq)[vector];

        if (irq == -1)
            continue;
  will skip the irq (cfg) that still needs its cleanup accounted for.
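
  To illustrate the "stuck counter" case, here is a tiny stand-alone
  model (not the real Xen code; the table size, the irq number 42, the
  vector 0x30 and the single counter are all made up for the example):

    #include <stdio.h>

    #define NR_VECTORS 256
    #define IRQ_UNUSED (-1)

    /* one CPU's vector_irq[] table, modelled as a plain array */
    static int vector_irq[NR_VECTORS];

    int main(void)
    {
        unsigned int move_cleanup_count = 1; /* one CPU still to clean up   */
        int old_vector = 0x30;               /* old vector of the moved irq */
        int v;

        for (v = 0; v < NR_VECTORS; v++)
            vector_irq[v] = IRQ_UNUSED;
        vector_irq[old_vector] = 42;         /* irq 42 still owns its old vector */

        /* step 3 with the patch: __clear_irq_vector() also wipes the
         * old-domain entries, so the table no longer points at the irq */
        vector_irq[old_vector] = IRQ_UNUSED;

        /* step 4: the cleanup handler scans this CPU's vectors and skips
         * every -1 entry, so the counter is never decremented */
        for (v = 0; v < NR_VECTORS; v++) {
            if (vector_irq[v] == IRQ_UNUSED)
                continue;
            move_cleanup_count--;
        }

        /* a later "if (cfg->move_cleanup_count) return -EAGAIN;" style
         * check in __assign_irq_vector() now fails forever */
        printf("move_cleanup_count = %u\n", move_cleanup_count); /* prints 1 */
        return 0;
    }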

  In step 3), code without the patch (this is my case) does not clear
  vector_irq on old_cpu_mask, so in step 4) the irq (cfg) is found
  correctly, but at:
         if (vector == cfg->vector && cpu_isset(me, cfg->domain))
            goto unlock;
  there is a chance that vector would equal cfg->vector and me would be
  in cfg->domain; but because the irq has been destroyed (cfg->vector is
  already cleared), the "goto unlock" is not taken, so
         cfg->move_cleanup_count--;
  executes unexpectedly, leaving cfg->move_cleanup_count at 255 in the
  end.
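
  A tiny stand-alone demo of why the value ends up as 255 (assuming
  move_cleanup_count is an 8-bit unsigned field, which is what the
  observed value 255 suggests; the struct below is only a model, not the
  real struct irq_cfg):

    #include <stdio.h>
    #include <stdint.h>

    /* simplified stand-in for the two fields that matter here */
    struct model_cfg {
        int     vector;              /* cleared to -1 by destroy_irq()     */
        uint8_t move_cleanup_count;  /* assumed 8-bit, hence the wrap      */
    };

    int main(void)
    {
        struct model_cfg cfg = { .vector = -1, .move_cleanup_count = 0 };
        int vector = 0x30;           /* vector the cleanup IPI is handling */
        int me_in_domain = 0;        /* cfg->domain no longer contains me  */

        /* the guard no longer matches because cfg->vector was already
         * cleared, so the decrement runs on a counter that is 0 */
        if (!(vector == cfg.vector && me_in_domain))
            cfg.move_cleanup_count--;

        printf("move_cleanup_count = %d\n", cfg.move_cleanup_count); /* 255 */
        return 0;
    }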

  So I think the loop in smp_irq_move_cleanup_interrupt() should be
  based on irqs, not vectors, to find the struct irq_cfg to clean up;
  see the sketch below.
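
  Roughly what I have in mind (untested, and only a sketch against a
  simplified model; NR_IRQS_MODEL, model_cfg and cleanup_on_cpu are
  stand-in names, not the real Xen ones, and the real handler would also
  have to keep its IRR check, locking and vector_irq clearing):

    #define NR_IRQS_MODEL 64

    struct model_cfg {
        unsigned int  move_cleanup_count;  /* CPUs still to clean up     */
        unsigned long old_domain;          /* bitmask of those CPUs      */
    };

    /* walk the per-irq state instead of this CPU's vector_irq[] table,
     * so a stale or already-wiped vector entry can neither hide an irq
     * with a pending cleanup nor attribute the wrong one to it */
    static void cleanup_on_cpu(struct model_cfg *cfgs, unsigned int me)
    {
        unsigned int irq;

        for (irq = 0; irq < NR_IRQS_MODEL; irq++) {
            struct model_cfg *cfg = &cfgs[irq];

            if (!cfg->move_cleanup_count)
                continue;                  /* nothing pending here        */
            if (!(cfg->old_domain & (1UL << me)))
                continue;                  /* this CPU not in the old set */

            cfg->old_domain &= ~(1UL << me);
            cfg->move_cleanup_count--;     /* one decrement per CPU       */
        }
    }

  Whether this plays well with the locking in the real handler I have
  not checked yet.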

  Is that right? Drowsy weekend head; if my analysis is right, I'll
  submit a patch on Monday :)
   
> 
> >>>> -----Original Message-----
> >>>> From: Liuyongan
> >>>> Sent: Thursday, January 05, 2012 2:14 PM
> >>>> To: Liuyongan; xen-devel@xxxxxxxxxxxxxxxxxxx;
> >>>> andrew.cooper3@xxxxxxxxxx; keir@xxxxxxx
> >>>> Cc: Qianhuibin
> >>>> Subject: RE: [xen-devel] create irq failed due to
> >>>> move_cleanup_count always being set
> >>>>
> >>>>> On 04/01/12 11:38, Andrew Cooper wrote:
> >>>>>> On 04/01/12 04:37, Liuyongan wrote:
> >>>>>> Hi, all
> >>>>>>
> >>>>>>     I'm using xen-4.0 to do a test. And when I create a domain,
> >>>>>> it failed due to a create_irq() failure. As only 33 domains
> >>>>>> were successfully created and destroyed before I got the
> >>>>>> continuous failures, and the domain just before the failure was
> >>>>>> properly destroyed (at least destroy_irq() was properly called,
> >>>>>> which will clear move_in_progress, according to the printk
> >>>>>> message), I can conclude for certain that __assign_irq_vector
> >>>>>> failed due to move_cleanup_count always being set.
> >>>>> Is it always 33 domains it takes to cause the problem, or does
> >>>>> it vary?
> >>>>> If it varies, then I think you want this patch
> >>>>> http://xenbits.xensource.com/hg/xen-unstable.hg/rev/68b903bb1b01
> >>>>> which corrects the logic which works out which moved vectors it
> >>>>> should clean up.  Without it, stale irq numbers build up in the
> >>>>> per-cpu irq_vector tables, leading to __assign_irq_vector
> >>>>> failing with -ENOSPC as it can't find a vector to allocate.
> >>>>   Yes, I've noticed this patch. As only 33 domains were created
> >>>> before the failures, the vectors of a given cpu should not have
> >>>> been used up. Besides, I got this problem after 143 domains were
> >>>> created another time. But I could not repeat this problem
> >>>> manually, as 4000+ domains were created successfully without this
> >>>> problem.
> >>>>
> >>>>>> //this is the normal case when creating and destroying the
> >>>>>> //domain whose id is 31;
> >>>>>> (XEN) irq.c:1232:d0 bind pirq 79, irq 77, share flag:0
> >>>>>> (XEN) irq.c:1377: dom31: pirq 79, irq 77 force unbind
> >>>>>> (XEN) irq.c:1593: dom31: forcing unbind of pirq 79
> >>>>>> (XEN) irq.c:223, destroy irq 77
> >>>>>>
> >>>>>> //domain id 32 is created and destroyed correctly also.
> >>>>>> (XEN) irq.c:1232:d0 bind pirq 79, irq 77, share flag:0
> >>>>>> (XEN) irq.c:1377: dom32: pirq 79, irq 77 force unbind
> >>>>>> (XEN) irq.c:1593: dom32: forcing unbind of pirq 79
> >>>>>> (XEN) irq.c:223, destroy irq 77
> >>>>>>
> >>>>>> //all the subsequent domain creations failed; below lists only
> >>>>>> //3 of them:
> >>>>>> (XEN) physdev.c:88: dom33: can't create irq for msi!
> >>>>>> (XEN) physdev.c:88: dom34: can't create irq for msi!
> >>>>>> (XEN) physdev.c:88: dom35: can't create irq for msi!
> >>>>>>
> >>>>>>      I think this might be a bug and might have been fixed, so
> >>>>>> I compared my code with 4.1.2 and searched the mailing list for
> >>>>>> potential patches. The message at
> >>>>>> (http://xen.markmail.org/search/?q=move_cleanup_count#query:move_cleanup_count+page:6+mid:fpkrafqbeyiauvhs+state:results)
> >>>>>> submits a patch which adds locks in __assign_irq_vector. Can
> >>>>>> anybody explain why this lock is needed? Or is there a patch
> >>>>>> that might fix my bug? Thx.
> >>>>> This patch fixes a problem where IOAPIC line level interrupts
> >>>>> cease for a while.  It has nothing to do with MSI interrupts.
> >>>>> (Also, there are no locks altered, and xen-4.0-testing seems to
> >>>>> have gained an additional hunk in hvm/vmx code unrelated to the
> >>>>> original patch.)
> >>>>>
> >>>>>>     Additional information: my board is x86; no domains were
> >>>>>> left when it failed to create new ones; the create_irq failures
> >>>>>> lasted one day until I rebooted the board; and the irq number
> >>>>>> being allocated is certainly used for an MSI device.
> >>>>>> Yong an Liu
> >>>>>> 2012.1.4
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Xen-devel mailing list
> >>>>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>>>> http://lists.xensource.com/xen-devel
> >> --
> >> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> >> T: +44 (0)1223 225 900, http://www.citrix.com
> 
> --
> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

