
Re: [Xen-devel] [xen-devel] create irq failed due to move_cleanup_count always being set




> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: Friday, January 06, 2012 8:18 PM
> To: Liuyongan
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Keir (Xen.org); Qianhuibin
> Subject: Re: [xen-devel] create irq failed due to move_cleanup_count
> always being set
> 
> On 06/01/12 11:50, Liuyongan wrote:
> >
> >> -----Original Message-----
> >> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> >> Sent: Friday, January 06, 2012 7:01 PM
> >> To: Liuyongan
> >> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Keir (Xen.org); Qianhuibin
> >> Subject: Re: [xen-devel] create irq failed due to move_cleanup_count
> >> always being set
> >>
> >> Could you please avoid top posting.
> >>
> >> On 06/01/12 06:04, Liuyongan wrote:
> >>>    As only 33 domains were successfully created (and destroyed)
> >>> before the problem occurred, there should be enough free IRQ
> >>> numbers and vectors to allocate (even supposing that irqs and
> >>> vectors failed to be deallocated). And destroy_irq() will clear
> >>> move_in_progress, so move_cleanup_count must be set? Is this the
> >>> case?
> >>
> >> Is it repeatably 33 domains, or was that a one-off experiment?  Can
> >> you
> >   No, it's not repeatable; this occurred 2 times, the other one was
> >   after 152 domains.
> 
> Can you list all the failures you have seen with the number of domains?
> So far it seems that it has been 33 twice but many more some of the
> time, which doesn't lend itself to saying "33 domains is a systematic
> failure" for certain at the moment.

  Sorry, to make it clear: this problem occurred twice, once after 33
  domains and once after 152 domains. I'm not quite expressive in
  English.

> 
> >> confirm exactly which version of Xen you are using, including
> >> changeset if you know it?  Without knowing your hardware, it is
> >> hard to say if there are actually enough free IRQs, although I do
> >> agree that what you are currently seeing is buggy behavior.
> >>
> >> The per-cpu IDT functionality introduced in Xen-4.0 is fragile at
> >> the best of times, and has had several bugfixes and tweaks to it
> >> which I am not certain have actually found their way back to
> >> Xen-4.0.  Could you try with Xen-4.1 and see if the problem
> >> persists?
> >>
> >> ~Andrew
> >   As I could not make it re-occur in xen-4.0, trying xen-4.1 seems
> >   useless.
> > I noticed a scenario:
> 
> I am confused.  Above, you say that the problem is repeatable, but here
> you say it is not.
> 
> >    1) move_in_progress occurs;
> >    2) the IPI IRQ_MOVE_CLEANUP_VECTOR interrupt is sent;
> >    3) the irq is destroyed, so cfg->vector is cleared, etc.;
> >    4) the IRQ_MOVE_CLEANUP_VECTOR interrupt is serviced.
> >
> >   In xen-4.1, at step 3, vector_irq of old_cpu_mask/old_domain is
> >   also reset, so in step 4) move_cleanup_count will fail to be
> >   decremented, finally leading to create_irq failure (right?);
> >
> >   In xen-4.0, at step 3 (and in my code), vector_irq is not reset
> >   (this is a bug as you've mentioned); I still could not figure out
> >   why create_irq should fail.
> 
> The first point of debugging should be to see how create_irq is
> failing.  Is it failing because of find_unassigned_irq() or because of
> __assign_irq_vector()?
> 
> Another piece of useful information would be what your guests are and
> what they are trying to do with interrupts.  Are you using PCI
> passthrough?
> 
> ~Andrew

  Thanks for your suggestion. I think I've got the reason. Digging into
  the details:
  1) a vector move starts (move_in_progress is set);
  2) a new interrupt arrives on the new cpu(s), so the IPI
     IRQ_MOVE_CLEANUP_VECTOR interrupt is sent;
  3) the irq is destroyed, so __clear_irq_vector() is called;
  4) the IRQ_MOVE_CLEANUP_VECTOR interrupt is serviced by
     smp_irq_move_cleanup_interrupt().
  
  In step 3), code with the patch applied ("cpus_and(tmp_mask,
  cfg->old_domain, cpu_online_map);") also clears vector_irq on
  old_cpu_mask/old_domain, so in step 4):
        irq = __get_cpu_var(vector_irq)[vector];

        if (irq == -1)
            continue;
  will skip the irq (cfg) that still needs its cleanup accounted for.
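
  To illustrate the "stuck counter" case, here is a tiny stand-alone
  model (not the real Xen code; the table size, the irq number 42, the
  vector 0x30 and the single counter are all made up for the example):

    #include <stdio.h>

    #define NR_VECTORS 256
    #define IRQ_UNUSED (-1)

    /* one CPU's vector_irq[] table, modelled as a plain array */
    static int vector_irq[NR_VECTORS];

    int main(void)
    {
        unsigned int move_cleanup_count = 1; /* one CPU still to clean up   */
        int old_vector = 0x30;               /* old vector of the moved irq */
        int v;

        for (v = 0; v < NR_VECTORS; v++)
            vector_irq[v] = IRQ_UNUSED;
        vector_irq[old_vector] = 42;         /* irq 42 still owns its old vector */

        /* step 3 with the patch: __clear_irq_vector() also wipes the
         * old-domain entries, so the table no longer points at the irq */
        vector_irq[old_vector] = IRQ_UNUSED;

        /* step 4: the cleanup handler scans this CPU's vectors and skips
         * every -1 entry, so the counter is never decremented */
        for (v = 0; v < NR_VECTORS; v++) {
            if (vector_irq[v] == IRQ_UNUSED)
                continue;
            move_cleanup_count--;
        }

        /* a later "if (cfg->move_cleanup_count) return -EAGAIN;" style
         * check in __assign_irq_vector() now fails forever */
        printf("move_cleanup_count = %u\n", move_cleanup_count); /* prints 1 */
        return 0;
    }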

  In step 3), code without the patch (this is my case) does not clear
  vector_irq on old_cpu_mask, so in step 4) the irq (cfg) is found
  correctly, but at:
         if (vector == cfg->vector && cpu_isset(me, cfg->domain))
            goto unlock;
  there is a chance that vector would equal cfg->vector and me would be
  in cfg->domain; but because the irq has been destroyed (cfg->vector is
  already cleared), the "goto unlock" is not taken, so
         cfg->move_cleanup_count--;
  executes unexpectedly, leaving cfg->move_cleanup_count at 255 in the
  end.
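
  A tiny stand-alone demo of why the value ends up as 255 (assuming
  move_cleanup_count is an 8-bit unsigned field, which is what the
  observed value 255 suggests; the struct below is only a model, not the
  real struct irq_cfg):

    #include <stdio.h>
    #include <stdint.h>

    /* simplified stand-in for the two fields that matter here */
    struct model_cfg {
        int     vector;              /* cleared to -1 by destroy_irq()     */
        uint8_t move_cleanup_count;  /* assumed 8-bit, hence the wrap      */
    };

    int main(void)
    {
        struct model_cfg cfg = { .vector = -1, .move_cleanup_count = 0 };
        int vector = 0x30;           /* vector the cleanup IPI is handling */
        int me_in_domain = 0;        /* cfg->domain no longer contains me  */

        /* the guard no longer matches because cfg->vector was already
         * cleared, so the decrement runs on a counter that is 0 */
        if (!(vector == cfg.vector && me_in_domain))
            cfg.move_cleanup_count--;

        printf("move_cleanup_count = %d\n", cfg.move_cleanup_count); /* 255 */
        return 0;
    }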

  So I think the loop in smp_irq_move_cleanup_interrupt() should be
  based on irqs, not vectors, to find the struct irq_cfg to clean up;
  see the sketch below.
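
  Roughly what I have in mind (untested, and only a sketch against a
  simplified model; NR_IRQS_MODEL, model_cfg and cleanup_on_cpu are
  stand-in names, not the real Xen ones, and the real handler would also
  have to keep its IRR check, locking and vector_irq clearing):

    #define NR_IRQS_MODEL 64

    struct model_cfg {
        unsigned int  move_cleanup_count;  /* CPUs still to clean up     */
        unsigned long old_domain;          /* bitmask of those CPUs      */
    };

    /* walk the per-irq state instead of this CPU's vector_irq[] table,
     * so a stale or already-wiped vector entry can neither hide an irq
     * with a pending cleanup nor attribute the wrong one to it */
    static void cleanup_on_cpu(struct model_cfg *cfgs, unsigned int me)
    {
        unsigned int irq;

        for (irq = 0; irq < NR_IRQS_MODEL; irq++) {
            struct model_cfg *cfg = &cfgs[irq];

            if (!cfg->move_cleanup_count)
                continue;                  /* nothing pending here        */
            if (!(cfg->old_domain & (1UL << me)))
                continue;                  /* this CPU not in the old set */

            cfg->old_domain &= ~(1UL << me);
            cfg->move_cleanup_count--;     /* one decrement per CPU       */
        }
    }

  Whether this plays well with the locking in the real handler I have
  not checked yet.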

  Is that right? Drowsy weekend head; if my analysis is right, I'll
  submit a patch on Monday :)
   
> 
> >>>> -----Original Message-----
> >>>> From: Liuyongan
> >>>> Sent: Thursday, January 05, 2012 2:14 PM
> >>>> To: Liuyongan; xen-devel@xxxxxxxxxxxxxxxxxxx;
> >>>> andrew.cooper3@xxxxxxxxxx; keir@xxxxxxx
> >>>> Cc: Qianhuibin
> >>>> Subject: RE: [xen-devel] create irq failed due to
> >>>> move_cleanup_count always being set
> >>>>
> >>>>> On 04/01/12 11:38, Andrew Cooper wrote:
> >>>>>> On 04/01/12 04:37, Liuyongan wrote:
> >>>>>> Hi, all
> >>>>>>
> >>>>>>     I'm using xen-4.0 to do a test. And when I create a domain,
> >>>>>> it failed due to a create_irq() failure. As only 33 domains
> >>>>>> were successfully created and destroyed before I got the
> >>>>>> continuous failures, and the domain just before the failure was
> >>>>>> properly destroyed (at least destroy_irq() was properly called,
> >>>>>> which will clear move_in_progress, according to the printk
> >>>>>> message), I can conclude for certain that __assign_irq_vector
> >>>>>> failed due to move_cleanup_count always being set.
> >>>>> Is it always 33 domains it takes to cause the problem, or does
> >>>>> it vary?
> >>>>> If it varies, then I think you want this patch
> >>>>> http://xenbits.xensource.com/hg/xen-unstable.hg/rev/68b903bb1b01
> >>>>> which corrects the logic which works out which moved vectors it
> >>>>> should clean up.  Without it, stale irq numbers build up in the
> >>>>> per-cpu irq_vector tables, leading to __assign_irq_vector
> >>>>> failing with -ENOSPC as it can't find a vector to allocate.
> >>>>   Yes, I've noticed this patch. As only 33 domains were created
> >>>> before the failures, the vectors of a given cpu should not have
> >>>> been used up. Besides, I got this problem after 143 domains were
> >>>> created another time. But I could not repeat this problem
> >>>> manually, as 4000+ domains were created successfully without this
> >>>> problem.
> >>>>
> >>>>>> //this is the normal case when creating and destroying the
> >>>>>> //domain whose id is 31;
> >>>>>> (XEN) irq.c:1232:d0 bind pirq 79, irq 77, share flag:0
> >>>>>> (XEN) irq.c:1377: dom31: pirq 79, irq 77 force unbind
> >>>>>> (XEN) irq.c:1593: dom31: forcing unbind of pirq 79
> >>>>>> (XEN) irq.c:223, destroy irq 77
> >>>>>>
> >>>>>> //domain id 32 is created and destroyed correctly also.
> >>>>>> (XEN) irq.c:1232:d0 bind pirq 79, irq 77, share flag:0
> >>>>>> (XEN) irq.c:1377: dom32: pirq 79, irq 77 force unbind
> >>>>>> (XEN) irq.c:1593: dom32: forcing unbind of pirq 79
> >>>>>> (XEN) irq.c:223, destroy irq 77
> >>>>>>
> >>>>>> //all the subsequent domain creations failed; below lists only
> >>>>>> //3 of them:
> >>>>>> (XEN) physdev.c:88: dom33: can't create irq for msi!
> >>>>>> (XEN) physdev.c:88: dom34: can't create irq for msi!
> >>>>>> (XEN) physdev.c:88: dom35: can't create irq for msi!
> >>>>>>
> >>>>>>      I think this might be a bug and might have been fixed, so
> >>>>>> I compared my code with 4.1.2 and searched the mailing list for
> >>>>>> potential patches. The message at
> >>>>>> (http://xen.markmail.org/search/?q=move_cleanup_count#query:move_cleanup_count+page:6+mid:fpkrafqbeyiauvhs+state:results)
> >>>>>> submits a patch which adds locks in __assign_irq_vector. Can
> >>>>>> anybody explain why this lock is needed? Or is there a patch
> >>>>>> that might fix my bug? Thx.
> >>>>> This patch fixes a problem where IOAPIC line level interrupts
> >>>>> cease for a while.  It has nothing to do with MSI interrupts.
> >>>>> (Also, there are no locks altered, and xen-4.0-testing seems to
> >>>>> have gained an additional hunk in hvm/vmx code unrelated to the
> >>>>> original patch.)
> >>>>>
> >>>>>>     Additional information: my board is x86; no domains were
> >>>>>> left when it failed to create new ones; the create_irq failures
> >>>>>> lasted one day until I rebooted the board; and the irq number
> >>>>>> being allocated is certainly used for an MSI device.
> >>>>>> Yong an Liu
> >>>>>> 2012.1.4
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Xen-devel mailing list
> >>>>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>>>> http://lists.xensource.com/xen-devel
> >> --
> >> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> >> T: +44 (0)1223 225 900, http://www.citrix.com
> 
> --
> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
> T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

