
Re: [Xen-devel] [xen-devel] create irq failed due to move_cleanup_count always being set




> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-
> bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Liuyongan
> Sent: Saturday, January 07, 2012 6:34 PM
> To: Andrew Cooper
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Keir (Xen.org); Qianhuibin
> Subject: Re: [Xen-devel] [xen-devel] create irq failed due to
> move_cleanup_count always being set
> 
> 
> 
> > -----Original Message-----
> > From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> > Sent: Friday, January 06, 2012 8:18 PM
> > To: Liuyongan
> > Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Keir (Xen.org); Qianhuibin
> > Subject: Re: [xen-devel] create irq failed due to move_cleanup_count
> > always being set
> >
> > On 06/01/12 11:50, Liuyongan wrote:
> > >
> > >> -----Original Message-----
> > >> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> > >> Sent: Friday, January 06, 2012 7:01 PM
> > >> To: Liuyongan
> > >> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Keir (Xen.org); Qianhuibin
> > >> Subject: Re: [xen-devel] create irq failed due to
> > >> move_cleanup_count always being set
> > >>
> > >> Could you please avoid top posting.
> > >>
> > >> On 06/01/12 06:04, Liuyongan wrote:
> > >>>    As only 33 domains were successfully created (and destroyed)
> > >>> before the problem occurred, there should be enough free IRQ
> > >>> numbers and vectors to allocate (supposing that irqs and vectors
> > >>> had failed to be deallocated). And destroy_irq() will clear
> > >>> move_in_progress, so move_cleanup_count must be set?  Is this the
> > >>> case?
> > >>
> > >> Is it repeatedly 33 domains, or was that a one-off experiment?  Can
> > >> you
> > >   No, it's not repeatable; this occurred twice, the other time was
> > > after 152 domains.
> >
> > Can you list all the failures you have seen with the number of
> > domains?  So far it seems that it has been 33 twice but many more some
> > of the time, which doesn't lend itself to saying "33 domains is a
> > systematic failure" for certain at the moment.
> 
>   Sorry, to make it clear: this problem occurred twice, once after 33
>   domains and once after 152 domains. I'm not very expressive in
>   English.
> 
> >
> > >> confirm exactly which version of Xen you are using, including
> > >> changeset if you know it?  Without knowing your hardware, it is hard
> > >> to say if there are actually enough free IRQs, although I do agree
> > >> that what you are currently seeing is buggy behavior.
> > >>
> > >> The per-cpu IDT functionality introduced in Xen-4.0 is fragile at
> > >> the best of times, and has had several bugfixes and tweaks to it
> > >> which I am not certain have actually found their way back to
> > >> Xen-4.0.  Could you try with Xen-4.1 and see if the problem persists?
> > >>
> > >> ~Andrew
> > >   As I could not make it recur in xen-4.0, trying xen-4.1 seems
> > > useless.
> > > I noticed a scenario:
> >
> > I am confused.  Above, you say that the problem is repeatable, but
> > here you say it is not.
> >
> > >    1) move_in_progress occurs;
> > >    2) an IPI IRQ_MOVE_CLEANUP_VECTOR interrupt is sent;
> > >    3) the irq is destroyed, so cfg->vector is cleared, etc.;
> > >    4) the IRQ_MOVE_CLEANUP_VECTOR interrupt is handled.
> > >
> > >   In xen-4.1, at step 3, the vector_irq entries for
> > > old_cpu_mask/old_domain are also reset, so in step 4)
> > > move_cleanup_count fails to be decremented, finally leading to the
> > > create_irq failure (right?);
> > >
> > >   In xen-4.0, at step 3, and in my code, vector_irq is not reset
> > > (this is a bug as you've mentioned), and I still could not figure
> > > out why create_irq should fail.
> >
> > The first point of debugging should be to see how create_irq is
> > failing.  Is it failing because of find_unassigned_irq() or because
> > of __assign_irq_vector()?
> >
> > Another piece of useful information would be what your guests are and
> > what they are trying to do with interrupts.  Are you using PCI
> > passthrough?
> >
> > ~Andrew
> 
>   Thanks for your suggestion. I think I've got the reason. Digging into
>   the details:
>   1) move_in_progress occurs;
>   2) a new interrupt arrives on the new cpu, so the IPI
>      IRQ_MOVE_CLEANUP_VECTOR interrupt is sent;
>   3) the irq is destroyed, so __clear_irq_vector is called;
>   4) the IRQ_MOVE_CLEANUP_VECTOR interrupt is handled by
>      smp_irq_move_cleanup_interrupt();
> 
>   In step 3), the code with the patch ("cpus_and(tmp_mask,
>   cfg->old_domain, cpu_online_map);") will clear the vector_irq entries
>   of old_cpu_mask/old_domain, so in step 4) the loop:
>         irq = __get_cpu_var(vector_irq)[vector];
> 
>         if (irq == -1)
>             continue;
>   will miss the irq (cfg) that needs to be cleaned up.
    Because move_in_progress is cleared right after the cleanup IPI is
    sent, the chance of old_domain's vector_irq entries being cleared in
    this window is small, yet the chance does exist, so a loop based on
    irq would solve this problem.
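    To illustrate (a minimal standalone model, not Xen source: the names
    vector_irq, irq_cfg and move_cleanup_count follow the snippets above,
    but the types and the allocation check are simplified stand-ins): once
    the per-cpu vector_irq entry has already been reset to -1, a cleanup
    pass that walks vectors can never reach the cfg again, so
    move_cleanup_count stays nonzero and, as described in this thread, a
    later allocation that tests it keeps failing.

        /* Toy model, not Xen code: a vector-indexed cleanup scan skips an
         * entry already reset to -1, so the pending cleanup count is
         * never dropped and a later allocation check keeps failing. */
        #include <stdio.h>

        #define NR_VECTORS 256

        struct irq_cfg {
            int vector;                      /* current vector, -1 when destroyed */
            unsigned int move_cleanup_count; /* cleanups still pending */
        };

        static int vector_irq[NR_VECTORS];   /* per-cpu vector -> irq table */
        static struct irq_cfg cfg69 = { .vector = -1, .move_cleanup_count = 1 };

        static void cleanup_scan_by_vector(void)   /* stand-in for step 4 */
        {
            for (int vector = 0; vector < NR_VECTORS; vector++) {
                int irq = vector_irq[vector];
                if (irq == -1)
                    continue;   /* entry already cleared: cfg69 is never found */
                /* ... the real handler would decrement move_cleanup_count here */
            }
        }

        int main(void)
        {
            for (int v = 0; v < NR_VECTORS; v++)
                vector_irq[v] = -1;     /* step 3 already reset the old entries */

            cleanup_scan_by_vector();   /* step 4 finds nothing to clean up */

            /* stand-in for the allocation-time check on move_cleanup_count */
            if (cfg69.move_cleanup_count)
                printf("allocation keeps failing: move_cleanup_count=%u\n",
                       cfg69.move_cleanup_count);
            return 0;
        }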
> 
>   In step 3), the code without the patch (this is my case) does not
>   clear vector_irq for old_cpu_mask, so in step 4) the irq (cfg) is
>   found correctly, but at the code:
>          if (vector == cfg->vector && cpu_isset(me, cfg->domain))
>             goto unlock;
>   there is a chance that vector would equal cfg->vector and me would be
>   in cfg->domain, but because the irq was destroyed the "goto unlock" is
>   not taken, so
>          cfg->move_cleanup_count--;
>   executes unexpectedly, finally leaving cfg->move_cleanup_count = 255.

    This needs a scenario like the following, with two irqs moving
    concurrently from/to one cpu: irq 69 moves from cpu5 to cpu6, and
    irq 70 moves from cpu6 to cpu7. If cpu6 receives the cleanup IPI
    because irq 70's move has completed and, at the same time, irq 69 is
    destroyed, then irq 69's cfg->move_cleanup_count may end up with an
    invalid value of 255.

    The root cause of this problem is that the cpu which receives the
    cleanup IPI cannot tell which vector's move has completed when two
    moves are in flight concurrently.
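    The value 255 is consistent with that decrement underflowing a counter
    that was already 0. Assuming the field is effectively 8 bits wide in
    this tree (an assumption based only on the value observed, not on the
    actual struct definition), a trivial standalone demonstration:

        /* Decrementing an 8-bit counter that is already 0 wraps to 255.
         * The uint8_t here is an assumption inferred from the reported
         * value, not taken from the Xen headers. */
        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint8_t move_cleanup_count = 0;  /* no cleanup actually pending */
            move_cleanup_count--;            /* the unexpected decrement */
            printf("move_cleanup_count = %u\n", move_cleanup_count);  /* 255 */
            return 0;
        }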
    
> 
>   So I think the loop in smp_irq_move_cleanup_interrupt should be based
>   on irqs, not vectors, to find the struct cfg.
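    A rough sketch of that idea, again using toy types rather than the
    real Xen structures (the per-irq locking, the re-checks and the cpu
    masks the real handler uses are all omitted; this is not a proposed
    patch, only the shape of an irq-indexed walk):

        /* Toy sketch: walk the irq_cfg entries directly instead of the
         * per-cpu vector_irq[] table, so a cfg with a pending cleanup is
         * still found even after its vector_irq entries were reset. */
        #include <stdio.h>

        #define NR_IRQS 128
        #define ME      6                    /* pretend we are cpu6 */

        struct irq_cfg {
            int vector;                      /* -1 once the irq is destroyed */
            int old_cpu;                     /* simplified old_cpu_mask */
            unsigned int move_cleanup_count; /* cleanups still pending */
        };

        static struct irq_cfg cfgs[NR_IRQS];

        static void cleanup_scan_by_irq(void)
        {
            for (int irq = 0; irq < NR_IRQS; irq++) {
                struct irq_cfg *cfg = &cfgs[irq];

                if (!cfg->move_cleanup_count)
                    continue;                /* nothing pending for this irq */
                if (cfg->old_cpu != ME)
                    continue;                /* cleanup is not for this cpu */

                cfg->move_cleanup_count--;   /* found via irq, not via vector */
            }
        }

        int main(void)
        {
            /* irq 70 moved away from this cpu and still has a cleanup
             * pending, even though its old vector_irq entry may already
             * have been reset elsewhere. */
            cfgs[70] = (struct irq_cfg){ .vector = 10, .old_cpu = ME,
                                         .move_cleanup_count = 1 };

            cleanup_scan_by_irq();
            printf("irq 70 move_cleanup_count = %u\n",
                   cfgs[70].move_cleanup_count);   /* now 0 */
            return 0;
        }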
> 
>   Is that right? My head is drowsy on the weekend; if my analysis is
>   right, I'll submit the patch on Monday :)
> 
> >
> > >>>> -----Original Message-----
> > >>>> From: Liuyongan
> > >>>> Sent: Thursday, January 05, 2012 2:14 PM
> > >>>> To: Liuyongan; xen-devel@xxxxxxxxxxxxxxxxxxx;
> > >>>> andrew.cooper3@xxxxxxxxxx; keir@xxxxxxx
> > >>>> Cc: Qianhuibin
> > >>>> Subject: RE: [xen-devel] create irq failed due to
> > >>>> move_cleanup_count always being set
> > >>>>
> > >>>>> On 04/01/12 11:38, Andrew Cooper wrote:
> > >>>>>> On 04/01/12 04:37, Liuyongan wrote:
> > >>>>>> Hi, all
> > >>>>>>
> > >>>>>>     I'm using xen-4.0 to do a test. When I create a domain, it
> > >>>>>> fails due to a create_irq() failure. Only 33 domains were
> > >>>>>> successfully created and destroyed before I got the continuous
> > >>>>>> failures, and the domain just before the failure was properly
> > >>>>>> destroyed (at least destroy_irq() was properly called, which
> > >>>>>> will clear move_in_progress, according to the printk messages).
> > >>>>>> So I can conclude for certain that __assign_irq_vector failed
> > >>>>>> due to move_cleanup_count always being set.
> > >>>>> Is it always 33 domains it takes to cause the problem, or does
> > >>>>> it vary?
> > >>>>> If it varies, then I think you want this patch
> > >>>>> http://xenbits.xensource.com/hg/xen-unstable.hg/rev/68b903bb1b01
> > >>>>> which corrects the logic which works out which moved vectors it
> > >>>>> should clean up.  Without it, stale irq numbers build up in the
> > >>>>> per-cpu irq_vector tables, leading to __assign_irq_vector failing
> > >>>>> with -ENOSPC as it cannot find a vector to allocate.
> > >>>>   Yes, I've noticed this patch. As only 33 domains were created
> > >>>> before the failures, the vectors of a given cpu should not have
> > >>>> been used up. Besides, I got this problem another time after 143
> > >>>> domains were created. But I could not repeat this problem manually,
> > >>>> as 4000+ domains were created successfully without hitting it.
> > >>>>
> > >>>>>> //this is the normal case when creating and destroying the
> > >>>>>> //domain whose id is 31;
> > >>>>>> (XEN) irq.c:1232:d0 bind pirq 79, irq 77, share flag:0
> > >>>>>> (XEN) irq.c:1377: dom31: pirq 79, irq 77 force unbind
> > >>>>>> (XEN) irq.c:1593: dom31: forcing unbind of pirq 79
> > >>>>>> (XEN) irq.c:223, destroy irq 77
> > >>>>>>
> > >>>>>> //domain id 32 is created and destroyed correctly also.
> > >>>>>> (XEN) irq.c:1232:d0 bind pirq 79, irq 77, share flag:0
> > >>>>>> (XEN) irq.c:1377: dom32: pirq 79, irq 77 force unbind
> > >>>>>> (XEN) irq.c:1593: dom32: forcing unbind of pirq 79
> > >>>>>> (XEN) irq.c:223, destroy irq 77
> > >>>>>>
> > >>>>>> //all the subsequent domain creations failed; below are only
> > >>>>>> //3 of them:
> > >>>>>> (XEN) physdev.c:88: dom33: can't create irq for msi!
> > >>>>>> (XEN) physdev.c:88: dom34: can't create irq for msi!
> > >>>>>> (XEN) physdev.c:88: dom35: can't create irq for msi!
> > >>>>>>
> > >>>>>>      I think this might be a bug that may already have been
> > >>>>>> fixed, so I compared my code with 4.1.2 and searched the mailing
> > >>>>>> list for potential patches. The mail at
> > >>>>>> (http://xen.markmail.org/search/?q=move_cleanup_count#query:move_cleanup_count+page:6+mid:fpkrafqbeyiauvhs+state:results)
> > >>>>>> submits a patch which adds locks in __assign_irq_vector. Can
> > >>>>>> anybody explain why this lock is needed? Or is there a patch
> > >>>>>> that might fix my bug? Thanks.
> > >>>>> This patch fixes a problem where IOAPIC line level interrupts
> > >>>>> cease for a while.  It has nothing to do with MSI interrupts.
> > >>>>> (Also, there are no locks altered, and xen-4.0-testing seems to
> > >>>>> have gained an additional hunk in hvm/vmx code unrelated to the
> > >>>>> original patch.)
> > >>>>>
> > >>>>>>     Additional information: my board is x86; no domains were
> > >>>>>> left when creating new ones failed; the create_irq failures
> > >>>>>> lasted one day until I rebooted the board; and the irq number
> > >>>>>> being allocated is certainly used for an MSI device.
> > >>>>>> Yong an Liu
> > >>>>>> 2012.1.4
> > >>>>>>
> > >> --
> > >> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> > >> T: +44 (0)1223 225 900, http://www.citrix.com
> >
> > --
> > Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> > T: +44 (0)1223 225 900, http://www.citrix.com
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

