[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH RFC] pass-through: sync pir to irr after msix vector been updated



On 12.09.2019 20:03, Joe Jin wrote:
> With the testcase below, the guest kernel reported "No irq handler for vector":
>   1). Pass through a mlx IB VF to 2 PVHVM guests.
>   2). Start rds-stress between the 2 guests.
>   3). Scale down both guests' vCPUs from 32 to 6 at the same time.
> 
> After repeating the above test for several iterations, the guest kernel
> reported "No irq handler for vector", and IB traffic dropped to zero,
> caused by lost interrupts.
> 
> When a vCPU goes offline, the kernel disables local IRQs, migrates the
> IRQ to another CPU, updates the MSI-X table, and re-enables IRQs. If a
> new interrupt arrives after local IRQs have been disabled but before the
> MSI-X table has been updated, the interrupt still uses the old vector
> and destination CPU info, so when local IRQs are enabled again the
> interrupt is sent to the wrong CPU and vector.
> 
> It looks like syncing PIR to IRR after the MSI-X table has been updated
> helps with this issue.

I'm having trouble making the connection, which quite possibly simply
means the description needs to be further extended: syncing PIR to
IRR has nothing to do with a vector change. It would only help if
nothing else caused this bitmap propagation, and if an interrupt was
lost (rather than delivered through the wrong vector, or to the wrong
CPU).
Furthermore with vector and destination being coupled, after a CPU has
been offlined this would generally mean
- if there was just a single destination permitted, lack of delivery
  altogether,
- if there were multiple destinations permitted, delivery to one of
  the other CPUs, at which point the vector would still be valid.

An interesting aspect would be on which CPU the log message was
observed, and how this correlates with the destination sets of the
CPUs that have got offlined. From there it would then further be
interesting to understand why the interrupt made it to that CPU,
since - as said - destination and vector get changed together, and
hence with things going wrong it would be of interest to know whether
the CPU receiving the IRQ is within the new destination set, or some
(random?) other one.

> BTW, I could not reproduce this issue if I disabled apicv.

Which, I agree, is a fair hint of something APIC-V-specific to be
amiss, but which (due to the vastly different timing) isn't a
reliable indicator.

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>                  pirq_dpci->gmsi.gflags = gflags;
>              }
> +
> +            if ( hvm_funcs.sync_pir_to_irr )
> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);

If the need for this change can be properly explained, then it
still wants converting to alternative_vcall(), like the other
caller of this hook. Or perhaps even better, move vlapic.c's
wrapper (suitably renamed) into hvm.h, and use it here.
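To illustrate the suggestion (this is only a sketch against Xen
internals, not a compile-tested fragment; the final name and placement
would be up to the patch author), the moved wrapper in hvm.h could look
roughly like:

```c
/* Sketch only: a renamed version of vlapic.c's local wrapper,
 * hypothetically placed in hvm.h so other callers can share it. */
static inline void hvm_sync_pir_to_irr(struct vcpu *v)
{
    if ( hvm_funcs.sync_pir_to_irr )
        alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
}
```

The call site in pt_irq_create_bind() would then simply invoke this
wrapper instead of open-coding the hvm_funcs check and call.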

Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
(right after your code insertion) allows for the field to be
invalid, which I think you need to guard against.

Also, just to remind you: Please follow patch submission rules:
They get sent _to_ the list, with maintainers etc _cc_-ed.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
