Xen project Mailing List

Re: [Xen-devel] [PATCH v3] xen/arm: fix rank/vgic lock inversion bug

From: Stefano Stabellini <sstabellini@xxxxxxxxxx>

Date: Thu, 2 Feb 2017 14:56:14 -0800 (PST)

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Andre Przywara <Andre.Przywara@xxxxxxx>

Delivery-date: Thu, 02 Feb 2017 22:56:37 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Thu, 2 Feb 2017, Julien Grall wrote: > Hi Stefano, > > On 01/02/17 23:23, Stefano Stabellini wrote: > > On Wed, 1 Feb 2017, Julien Grall wrote: > > > On 31/01/2017 23:49, Stefano Stabellini wrote: > > > > On Fri, 27 Jan 2017, Julien Grall wrote: > > > > > On 03/01/17 23:29, Stefano Stabellini wrote: > > > So we have to worry about SPIs and LPIs (thought they are not yet > > > supported). > > > From the note in 3.2.3: > > > "For any processor, if an interrupt is active and pending, the GIC does > > > not > > > signal an interrupt exception request > > > for the interrupt to any processor until the active status is cleared." > > > > > > Given that SPIs have an activate state and will be shared, this scenario > > > cannot happen. > > > > Yes, but just above: > > > > A GICv1 implementation might ensure that only one processor can make a > > 1-N interrupt active, removing the requirement for a lock on the ISR. > > This is not required by the architecture, and generic GIC code must not > > rely on this behavior. > > That's for a GICv1 implementation. However, only GICv2 and onwards support > virtualization. > > > > > then, the spec also says: > > > > The GIC maintains a state machine for each supported interrupt on each > > CPU interface. > > > > So I am thinking that the statement you quoted wouldn't apply, because > > the second interrupt wouldn't mark the state of the interrupt as "active > > and pending" because, the interrupt being targeted to another processor, > > it would simply mark it as "pending" in the CPU interface of the other > > processor. > > I don't have the proper quote in the spec, but I can confirm the interrupt can > only be active on a single CPU at the time. It is also implied in the > description of the 1-N model (see 1.4.3): " Only one processor handles this > interrupt". Thanks, in that case I feel much better about this approach :-) > > > For LPIs, there is no activate state. So as soon as they are EOIed, they > > > might > > > come up again. Depending on how will we handle irq migration, your > > > scenario > > > will become true. I am not sure if we should take into account LPIs right > > > now. > > > > > > To be honest, I don't much like the idea of kicking the other vCPU. But I > > > don't have a better idea in order to clear the LRs. What if we skip the interrupt if it's an LPI, and we kick the other vcpu and wait if it's an SPI? Software should be more tolerant of lost interrupts in case of LPIs. We are also considering rate-limiting them anyway, which implies the possibility of skipping some LPIs at times. > > Me neither, that's why I was proposing a different solution instead. We > > still have the option to take the right lock in vgic_migrate_irq: > > > > http://marc.info/?l=xen-devel&m=148237289620471 > > > > The code is more complex, but I think it's safe in all cases. > > It is not only complex but also really confusing as we would have a variable > protected by two locks, both lock does not need to be taken at the same time. Yes, but there is a large in-code comment about it :-) > I may have an idea to avoid completely the lock in vgic_get_target_vcpu. The > lock is only here to read the target vcpu in the rank, the rest does not need > a lock, right? So could not we read the target vcpu atomically instead? Yes, I think that could solve the lock inversion bug: - remove the spin_lock in vgic_get_target_vcpu and replace it with an atomic read - replace rank->vcpu writes with atomic writes However, it would not solve the other issu affecting the current code: http://marc.info/?l=xen-devel&m=148218667104072, which is related to the problem you mentioned about irq_set_affinity and list_del_init in gic_update_one_lr not being separated by a barrier. Arguably, that bug could be solved separately. It would be easier to solve that bug by one of these approaches: 1) use the same (vgic) lock in gic_update_one_lr and vgic_migrate_irq 2) remove the irq_set_affinity call from gic_update_one_lr where 1) is the approach taken by v2 of this series and 2) is the approach taken by this patch. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.