
Re: [Xen-devel] [PATCH v1 0/2] xen/arm: maintenance_interrupt SMP fix



I moved to 4.4.0-rc1, which already has the necessary irq patches,
and applied only one patch: "cpumask_of(0) in gic_route_irq_to_guest".

I see that the hypervisor hangs very often. Unfortunately, I don't
currently have a debugger to localize the failing code path, so I have
to use prints, and that may take some time.
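The kind of trace I mean is just something like the following. This is a
minimal sketch only: printk() and smp_processor_id() are the usual Xen
helpers, but the message and its placement inside maintenance_interrupt()
are purely an example.

    /* Example trace point at the top of maintenance_interrupt() in gic.c:
     * show which CPU took the maintenance interrupt and which list
     * registers the guest has EOIed.                                     */
    printk("MAINT: cpu=%u EISR0=%08x EISR1=%08x\n",
           smp_processor_id(), GICH[GICH_EISR0], GICH[GICH_EISR1]);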

On Thu, Jan 30, 2014 at 7:18 PM, Stefano Stabellini
<stefano.stabellini@xxxxxxxxxxxxx> wrote:
> On Thu, 30 Jan 2014, Oleksandr Tyshchenko wrote:
>> 1.2 I have also checked the solution where the on_selected_cpus call was
>> moved out of the interrupt handler. Unfortunately, it doesn't work.
>>
>> I almost immediately see the following error:
>> (XEN) Assertion 'this_cpu(eoi_irq) == NULL' failed, line 981, file gic.c
>> (XEN) Xen BUG at gic.c:981
>> (XEN) CPU1: Unexpected Trap: Undefined Instruction
>> (XEN) ----[ Xen-4.4-unstable  arm32  debug=y  Not tainted ]----
>> (XEN) CPU:    1
>> (XEN) PC:     00241ee0 __bug+0x2c/0x44
>> (XEN) CPSR:   2000015a MODE:Hypervisor
>> (XEN)      R0: 0026770c R1: 00000000 R2: 3fd2fd00 R3: 00000fff
>> (XEN)      R4: 00263248 R5: 00264384 R6: 000003d5 R7: 4003d000
>> (XEN)      R8: 00000001 R9: 00000091 R10:00000000 R11:40037ebc R12:00000001
>> (XEN) HYP: SP: 40037eb4 LR: 00241ee0
>> (XEN)
>> (XEN)   VTCR_EL2: 80002558
>> (XEN)  VTTBR_EL2: 00010000deffc000
>> (XEN)
>> (XEN)  SCTLR_EL2: 30cd187f
>> (XEN)    HCR_EL2: 0000000000002835
>> (XEN)  TTBR0_EL2: 00000000d2014000
>> (XEN)
>> (XEN)    ESR_EL2: 00000000
>> (XEN)  HPFAR_EL2: 0000000000482110
>> (XEN)      HDFAR: fa211f00
>> (XEN)      HIFAR: 00000000
>> (XEN)
>> (XEN) Xen stack trace from sp=40037eb4:
>> (XEN)    00000000 40037efc 00247e1c 002e6610 002e6610 002e6608 002e6608 00000001
>> (XEN)    00000000 40015000 40017000 40005f60 40017014 40037f58 00000019 00000000
>> (XEN)    40005f60 40037f24 00249068 00000009 00000019 00404000 40037f58 00000000
>> (XEN)    00405000 00004680 002e7694 40037f4c 00248b80 00000000 c5b72000 00000091
>> (XEN)    00000000 c700d4e0 c008477c 000000f1 00000001 40037f54 0024f6c0 40037f58
>> (XEN)    00251a30 c700d4e0 00000001 c008477c 00000000 c5b72000 00000091 00000000
>> (XEN)    c700d4e0 c008477c 000000f1 00000001 00000001 c5b72000 ffffffff 0000a923
>> (XEN)    c0077ac4 60000193 00000000 b6eadaa0 c0578f40 c00138c0 c5b73f58 c036ab90
>> (XEN)    c0578f4c c00136a0 c0578f58 c0013920 00000000 00000000 00000000 00000000
>> (XEN)    00000000 00000000 00000000 80000010 60000193 a0000093 80000193 00000000
>> (XEN)    00000000 0c41e00c 450c2880
>> (XEN) Xen call trace:
>> (XEN)    [<00241ee0>] __bug+0x2c/0x44 (PC)
>> (XEN)    [<00241ee0>] __bug+0x2c/0x44 (LR)
>> (XEN)    [<00247e1c>] maintenance_interrupt+0x2e8/0x328
>> (XEN)    [<00249068>] do_IRQ+0x138/0x198
>> (XEN)    [<00248b80>] gic_interrupt+0x58/0xc0
>> (XEN)    [<0024f6c0>] do_trap_irq+0x10/0x14
>> (XEN)    [<00251a30>] return_from_trap+0/0x4
>> (XEN)
>
> Are you seeing more than one interrupt being EOI'ed within a single
> maintenance interrupt?
> I didn't think that was possible in practice.
> If so, we might have to turn eoi_irq into a list or an array.
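As a rough illustration of what "a list or an array" could look like here
(a sketch only, not an actual patch: eoi_irq comes from the series under
test, while the struct, the queue size and the queue_eoi() helper below are
hypothetical; DEFINE_PER_CPU, this_cpu, ARRAY_SIZE and BUG_ON are the usual
Xen primitives):

    /* Instead of a single deferred descriptor per CPU, e.g.
     *     static DEFINE_PER_CPU(struct irq_desc *, eoi_irq);
     * keep a small per-CPU queue of IRQs that still need an EOI.        */
    struct eoi_queue {
        unsigned int count;
        struct irq_desc *desc[4];       /* 4 is an arbitrary upper bound */
    };
    static DEFINE_PER_CPU(struct eoi_queue, eoi_queue);

    static void queue_eoi(struct irq_desc *desc)
    {
        struct eoi_queue *q = &this_cpu(eoi_queue);

        BUG_ON(q->count >= ARRAY_SIZE(q->desc)); /* replaces the old ASSERT */
        q->desc[q->count++] = desc;
    }

The maintenance handler would then push every descriptor that needs an EOI
through queue_eoi() instead of asserting this_cpu(eoi_irq) == NULL, and
whatever currently consumes eoi_irq would drain the whole queue.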
>
>
>> 2. The "simultaneous cross-interrupts" issue doesn't occur if I use
>> the following solution:
>>
>> diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
>> index e6257a7..af96a31 100644
>> --- a/xen/arch/arm/gic.c
>> +++ b/xen/arch/arm/gic.c
>> @@ -776,8 +795,7 @@ int gic_route_irq_to_guest(struct domain *d, const struct dt_irq *irq,
>>
>>      level = dt_irq_is_level_triggered(irq);
>>
>> -    gic_set_irq_properties(irq->irq, level, cpumask_of(smp_processor_id()),
>> -                           0xa0);
>> +    gic_set_irq_properties(irq->irq, level, cpumask_of(0), 0xa0);
>>
>>      retval = __setup_irq(desc, irq->irq, action);
>>      if (retval) {
>> So, as a result, I don't see the deadlock in on_selected_cpus().
>
> As I stated before I think this is a good change to have in 4.4.
>
>
>> But, rarely, I see deadlocks in other parts related to interrupt handling.
>> As noted by Julien, I am using the old version of the interrupt patch series.
>> I completely agree.
>>
>> We are based on the following Xen commit:
>> 48249a1 libxl: Avoid realloc(,0) when libxl__xs_directory returns empty list
>>
>> Also, we have some patches that we cherry-picked when we urgently needed them:
>> 6bba1a3 xen/arm: Keep count of inflight interrupts
>> 33a8aa9 xen/arm: Only enable physical IRQs when the guest asks
>> b6a4e65 xen/arm: Rename gic_irq_{startup, shutdown} to gic_irq_{mask, unmask}
>> 5dbe455 xen/arm: Don't reinject the IRQ if it's already in LRs
>> 1438f03 xen/arm: Physical IRQ is not always equal to virtual IRQ
>>
>> I have to apply the following patches and check with them:
>> 88eb95e xen/arm: disable a physical IRQ when the guest disables the corresponding IRQ
>> a660ee3 xen/arm: implement gic_irq_enable and gic_irq_disable
>> 1dc9556 xen/arm: do not add a second irq to the LRs if one is already present
>> d16d511 xen/arm: track the state of guest IRQs
>>
>> I'll report back with the results. I hope to do it today.
>
> I am looking forward to reading your report.
> Cheers,
>
> Stefano
>
>> Many thanks to all.
>>
>> On Thu, Jan 30, 2014 at 5:35 PM, Stefano Stabellini
>> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
>> > Given that we don't deactivate the interrupt (writing to GICC_DIR) until
>> > the guest EOIs it, I can't understand how you manage to get a second
>> > interrupt notification before the guest EOIs the first one.
>> >
>> > Do you set GICC_CTL_EOI in GICC_CTLR?
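For context, this refers to the GICv2 "split EOI" mode: with the EOImode
bit set in GICC_CTLR, a write to GICC_EOIR only performs the priority drop,
and the physical interrupt stays active until a separate write to GICC_DIR
deactivates it. A minimal sketch of that flow, using GICC[] accessors and
GICC_CTL_* names in the style of Xen 4.4's gic.c (treat the exact constant
names as assumptions; 0x3ff is the GICv2 interrupt ID field):

    /* CPU interface setup: enable it with split EOI, so that priority
     * drop (GICC_EOIR) and deactivation (GICC_DIR) are decoupled.      */
    GICC[GICC_CTLR] = GICC_CTL_ENABLE | GICC_CTL_EOI;

    /* Taking an interrupt: acknowledge, then drop the priority early so
     * other interrupts can be delivered while this one is still active. */
    uint32_t iar = GICC[GICC_IAR];
    uint32_t irq = iar & 0x3ff;            /* interrupt ID field */
    GICC[GICC_EOIR] = iar;

    /* Only once the guest has EOIed the corresponding virtual IRQ is the
     * physical interrupt finally deactivated.                           */
    GICC[GICC_DIR] = irq;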
>> >
>> > On Thu, 30 Jan 2014, Oleksandr Tyshchenko wrote:
>> >> According to DT it is a level irq (DT_IRQ_TYPE_LEVEL_HIGH)
>> >>
>> >> On Thu, Jan 30, 2014 at 3:24 PM, Stefano Stabellini
>> >> <stefano.stabellini@xxxxxxxxxxxxx> wrote:
>> >> > Is it a level or an edge irq?
>> >> >
>> >> > On Wed, 29 Jan 2014, Julien Grall wrote:
>> >> >> Hi,
>> >> >>
>> >> >> It's weird, a physical IRQ should not be injected twice...
>> >> >> Were you able to print the IRQ number?
>> >> >>
>> >> >> In any case, you are using the old version of the interrupt patch series.
>> >> >> Your new error may come from a race condition in this code.
>> >> >>
>> >> >> Can you try to use a newer version?
>> >> >>
>> >> >> On 29 Jan 2014 18:40, "Oleksandr Tyshchenko" <oleksandr.tyshchenko@xxxxxxxxxxxxxxx> wrote:
>> >> >>       > Right, that's why changing it to cpumask_of(0) shouldn't make any
>> >> >>       > difference for xen-unstable (it should make things clearer, if
>> >> >>       > nothing else) but it should fix things for Oleksandr.
>> >> >>
>> >> >>       Unfortunately, it is not enough for stable operation.
>> >> >>
>> >> >>       I tried to use cpumask_of(smp_processor_id()) instead of cpumask_of(0)
>> >> >>       in gic_route_irq_to_guest(). As a result, I don't see the situation
>> >> >>       that causes the deadlock in on_selected_cpus() (as expected).
>> >> >>       But the hypervisor sometimes hangs somewhere else (I have not yet
>> >> >>       identified where this is happening), or I sometimes see traps like
>> >> >>       the following ("WARN_ON(p->desc != NULL)" in maintenance_interrupt()
>> >> >>       leads to them):
>> >> >>
>> >> >>       (XEN) CPU1: Unexpected Trap: Undefined Instruction
>> >> >>       (XEN) ----[ Xen-4.4-unstable  arm32  debug=y  Not tainted ]----
>> >> >>       (XEN) CPU:    1
>> >> >>       (XEN) PC:     00242c1c __warn+0x20/0x28
>> >> >>       (XEN) CPSR:   200001da MODE:Hypervisor
>> >> >>       (XEN)      R0: 0026770c R1: 00000001 R2: 3fd2fd00 R3: 00000fff
>> >> >>       (XEN)      R4: 00406100 R5: 40020ee0 R6: 00000000 R7: 4bfdf000
>> >> >>       (XEN)      R8: 00000001 R9: 4bfd7ed0 R10:00000001 R11:4bfd7ebc R12:00000002
>> >> >>       (XEN) HYP: SP: 4bfd7eb4 LR: 00242c1c
>> >> >>       (XEN)
>> >> >>       (XEN)   VTCR_EL2: 80002558
>> >> >>       (XEN)  VTTBR_EL2: 00020000dec6a000
>> >> >>       (XEN)
>> >> >>       (XEN)  SCTLR_EL2: 30cd187f
>> >> >>       (XEN)    HCR_EL2: 00000000000028b5
>> >> >>       (XEN)  TTBR0_EL2: 00000000d2014000
>> >> >>       (XEN)
>> >> >>       (XEN)    ESR_EL2: 00000000
>> >> >>       (XEN)  HPFAR_EL2: 0000000000482110
>> >> >>       (XEN)      HDFAR: fa211190
>> >> >>       (XEN)      HIFAR: 00000000
>> >> >>       (XEN)
>> >> >>       (XEN) Xen stack trace from sp=4bfd7eb4:
>> >> >>       (XEN)    0026431c 4bfd7efc 00247a54 00000024 002e6608 002e6608 00000097 00000001
>> >> >>       (XEN)    00000000 4bfd7f54 40017000 40005f60 40017014 4bfd7f58 00000019 00000000
>> >> >>       (XEN)    40005f60 4bfd7f24 00248e60 00000009 00000019 00404000 4bfd7f58 00000000
>> >> >>       (XEN)    00405000 000045f0 002e7694 4bfd7f4c 00248978 c0079a90 00000097 00000097
>> >> >>       (XEN)    00000000 fa212000 ea80c900 00000001 c05b8a60 4bfd7f54 0024f4b8 4bfd7f58
>> >> >>       (XEN)    00251830 ea80c950 00000000 00000001 c0079a90 00000097 00000097 00000000
>> >> >>       (XEN)    fa212000 ea80c900 00000001 c05b8a60 00000000 e9879e3c ffffffff b6efbca3
>> >> >>       (XEN)    c03b29fc 60000193 9fffffe7 b6c0bbf0 c0607500 c03b3140 e9879eb8 c007680c
>> >> >>       (XEN)    c060750c c03b32c0 c0607518 c03b3360 00000000 00000000 00000000 00000000
>> >> >>       (XEN)    00000000 00000000 3ff6bebf a0000113 800b0193 800b0093 40000193 00000000
>> >> >>       (XEN)    ffeffbfe fedeefff fffd5ffe
>> >> >>       (XEN) Xen call trace:
>> >> >>       (XEN)    [<00242c1c>] __warn+0x20/0x28 (PC)
>> >> >>       (XEN)    [<00242c1c>] __warn+0x20/0x28 (LR)
>> >> >>       (XEN)    [<00247a54>] maintenance_interrupt+0xfc/0x2f4
>> >> >>       (XEN)    [<00248e60>] do_IRQ+0x138/0x198
>> >> >>       (XEN)    [<00248978>] gic_interrupt+0x58/0xc0
>> >> >>       (XEN)    [<0024f4b8>] do_trap_irq+0x10/0x14
>> >> >>       (XEN)    [<00251830>] return_from_trap+0/0x4
>> >> >>       (XEN)
>> >> >>
>> >> >>       Also I am posting maintenance_interrupt() from my tree:
>> >> >>
>> >> >>       static void maintenance_interrupt(int irq, void *dev_id, struct
>> >> >>       cpu_user_regs *regs)
>> >> >>       {
>> >> >>           int i = 0, virq, pirq;
>> >> >>           uint32_t lr;
>> >> >>           struct vcpu *v = current;
>> >> >>           uint64_t eisr = GICH[GICH_EISR0] | (((uint64_t) GICH[GICH_EISR1]) << 32);
>> >> >>
>> >> >>           while ((i = find_next_bit((const long unsigned int *) &eisr,
>> >> >>                                     64, i)) < 64) {
>> >> >>               struct pending_irq *p, *n;
>> >> >>               int cpu, eoi;
>> >> >>
>> >> >>               cpu = -1;
>> >> >>               eoi = 0;
>> >> >>
>> >> >>               spin_lock_irq(&gic.lock);
>> >> >>               lr = GICH[GICH_LR + i];
>> >> >>               virq = lr & GICH_LR_VIRTUAL_MASK;
>> >> >>
>> >> >>               p = irq_to_pending(v, virq);
>> >> >>               if ( p->desc != NULL ) {
>> >> >>                   p->desc->status &= ~IRQ_INPROGRESS;
>> >> >>                   /* Assume only one pcpu needs to EOI the irq */
>> >> >>                   cpu = p->desc->arch.eoi_cpu;
>> >> >>                   eoi = 1;
>> >> >>                   pirq = p->desc->irq;
>> >> >>               }
>> >> >>               if ( !atomic_dec_and_test(&p->inflight_cnt) )
>> >> >>               {
>> >> >>                   /* Physical IRQ can't be reinjected */
>> >> >>                   WARN_ON(p->desc != NULL);
>> >> >>                   gic_set_lr(i, p->irq, GICH_LR_PENDING, p->priority);
>> >> >>                   spin_unlock_irq(&gic.lock);
>> >> >>                   i++;
>> >> >>                   continue;
>> >> >>               }
>> >> >>
>> >> >>               GICH[GICH_LR + i] = 0;
>> >> >>               clear_bit(i, &this_cpu(lr_mask));
>> >> >>
>> >> >>               if ( !list_empty(&v->arch.vgic.lr_pending) ) {
>> >> >>                   n = list_entry(v->arch.vgic.lr_pending.next, typeof(*n), lr_queue);
>> >> >>                   gic_set_lr(i, n->irq, GICH_LR_PENDING, n->priority);
>> >> >>                   list_del_init(&n->lr_queue);
>> >> >>                   set_bit(i, &this_cpu(lr_mask));
>> >> >>               } else {
>> >> >>                   gic_inject_irq_stop();
>> >> >>               }
>> >> >>               spin_unlock_irq(&gic.lock);
>> >> >>
>> >> >>               spin_lock_irq(&v->arch.vgic.lock);
>> >> >>               list_del_init(&p->inflight);
>> >> >>               spin_unlock_irq(&v->arch.vgic.lock);
>> >> >>
>> >> >>               if ( eoi ) {
>> >> >>                   /* this is not racy because we can't receive another irq of the
>> >> >>                    * same type until we EOI it.  */
>> >> >>                   if ( cpu == smp_processor_id() )
>> >> >>                       gic_irq_eoi((void*)(uintptr_t)pirq);
>> >> >>                   else
>> >> >>                       on_selected_cpus(cpumask_of(cpu),
>> >> >>                                        gic_irq_eoi, (void*)(uintptr_t)pirq, 0);
>> >> >>               }
>> >> >>
>> >> >>               i++;
>> >> >>           }
>> >> >>       }
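One way to avoid calling on_selected_cpus() from the maintenance handler at
all would be to hand the EOI over to the owning CPU via a softirq instead of
a synchronous cross-call. The following is only a sketch of that idea, not
the fix from this patch series: EOI_SOFTIRQ would have to be added to Xen's
softirq enum, and the per-CPU bitmap and helper names are hypothetical;
open_softirq(), cpu_raise_softirq(), DECLARE_BITMAP() and the bit helpers
are existing Xen interfaces, and gic_irq_eoi() is the helper used above.

    /* Per-CPU bitmap of physical IRQs whose EOI must happen on that CPU
     * (1024 covers the GICv2 interrupt ID space).                        */
    struct eoi_pending {
        DECLARE_BITMAP(bits, 1024);
    };
    static DEFINE_PER_CPU(struct eoi_pending, eoi_pending);

    /* Runs in softirq context on the CPU that owns the physical IRQ,
     * i.e. outside the maintenance interrupt handler.                    */
    static void eoi_softirq_action(void)
    {
        struct eoi_pending *p = &this_cpu(eoi_pending);
        unsigned int irq;

        for ( irq = find_first_bit(p->bits, 1024);
              irq < 1024;
              irq = find_next_bit(p->bits, 1024, irq + 1) )
        {
            clear_bit(irq, p->bits);
            gic_irq_eoi((void *)(uintptr_t)irq);
        }
    }

    /* Called from maintenance_interrupt() in place of on_selected_cpus():
     * mark the IRQ and kick the other CPU, without spinning on it.       */
    static void defer_eoi_to(unsigned int cpu, unsigned int pirq)
    {
        set_bit(pirq, per_cpu(eoi_pending, cpu).bits);
        cpu_raise_softirq(cpu, EOI_SOFTIRQ);
    }

    /* Registered once at init time, e.g.:
     *     open_softirq(EOI_SOFTIRQ, eoi_softirq_action);                 */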
>> >> >>
>> >> >>
>> >> >>       Oleksandr Tyshchenko | Embedded Developer
>> >> >>       GlobalLogic
>> >> >>
>> >> >>
>> >> >>




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

