
Re: [Xen-devel] [PATCH for-4.13 v4 2/3] x86/passthrough: fix migration of MSI when using posted interrupts



On 14.11.2019 15:42, Roger Pau Monné wrote:
> On Thu, Nov 14, 2019 at 02:35:56PM +0100, Jan Beulich wrote:
>> On 13.11.2019 16:59, Roger Pau Monne wrote:
>>> +    for ( id = find_first_bit(vcpus, d->max_vcpus);
>>> +          id < d->max_vcpus;
>>> +          id = find_next_bit(vcpus, d->max_vcpus, id + 1) )
>>> +    {
>>> +        if ( d->vcpu[id] != current )
>>> +            vcpu_pause(d->vcpu[id]);
>>
>> Isn't this setting us up for a deadlock if two parties come here
>> for the same domain, and both on a vCPU belonging to that domain
>> (and with the opposite one's bit set in the bitmap)? But it looks
>> like d would never be the current domain here - you will want to
>> assert and comment on this, though. At that point the comparisons
>> against current can then go away as well afaict.
> 
> The above is true for syncs triggered by MSI changes that bounce to
> QEMU and then get forwarded to Xen as DOMCTLs, but AFAICT syncs that
> result from a vIO-APIC entry write (patch #3) will have v ==
> current.

Ah, yes. But let's please handle the needs of the vIO-APIC code in
that other patch. That'll also ...

> vIO-APIC writes however use the d->arch.hvm.irq_lock, so it's not
> possible to process multiple vCPUs' vIO-APIC accesses at the same
> time. I'm afraid I don't know which kind of assert should be added
> here. I could add a comment, but that seems fragile.

... help with the assertions to add. You'd put the strict one in
here, as suggested, and then relax it there as needed. One option
would be to require callers in the context of the current domain
to acquire the lock you mention (holding of which you could assert
here, in case current->domain == d).
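Roughly, as a sketch only (untested, and assuming the vIO-APIC path
from patch #3 indeed takes d->arch.hvm.irq_lock around the call):

    /*
     * Callers from a foreign domain must have paused (or be about to
     * pause) all destination vCPUs; same-domain callers would instead
     * be required to hold the vIO-APIC lock.
     */
    ASSERT(current->domain != d ||
           spin_is_locked(&d->arch.hvm.irq_lock));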

>>> @@ -420,20 +384,16 @@ int pt_irq_create_bind(
>>>          delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
>>>                                    XEN_DOMCTL_VMSI_X86_DELIV_MASK);
>>>  
>>> -        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>>> +        hvm_intr_get_dests(d, dest, dest_mode, delivery_mode, dest_vcpus);
>>> +        dest_vcpu_id = bitmap_weight(dest_vcpus, d->max_vcpus) != 1 ?
>>> +            -1 : find_first_bit(dest_vcpus, d->max_vcpus);
>>>          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>>>          spin_unlock(&d->event_lock);
>>>  
>>>          pirq_dpci->gmsi.posted = false;
>>>          vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
>>> -        if ( iommu_intpost )
>>> -        {
>>> -            if ( delivery_mode == dest_LowestPrio )
>>> -                vcpu = vector_hashing_dest(d, dest, dest_mode,
>>> -                                           pirq_dpci->gmsi.gvec);
>>> -            if ( vcpu )
>>> -                pirq_dpci->gmsi.posted = true;
>>> -        }
>>> +        if ( vcpu && iommu_intpost )
>>> +            pirq_dpci->gmsi.posted = true;
>>
>> One aspect that I'm curious about: How much posting opportunity do
>> we lose in practice by no longer posting when the guest uses lowest
>> priority mode with multiple destinations?
> 
> Linux seems to use dest_Fixed exclusively, and the same goes for
> FreeBSD.

That's for recent Linux. Go back to 3.0 (just as an example), and
you'll see it was different back then. But yes, our decision here
would of course be guided more by recent guest OS versions.

>>> @@ -442,6 +402,9 @@ int pt_irq_create_bind(
>>>              pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
>>>                             info, pirq_dpci->gmsi.gvec);
>>>  
>>> +        if ( hvm_funcs.deliver_posted_intr )
>>> +            domain_sync_vlapic_pir(d, prev_vcpus);
>>
>> Accessing hvm_funcs here looks like a layering violation. This
>> wants either moving into the function or (seeing the other use)
>> abstracting away. Seeing the conditional here (and below) I also
>> notice that you calculate prev_vcpus in vain when there's no
>> interrupt posting in use.
> 
> I could indeed only fill prev_vcpus when posting is in use.
> 
>> I guess together with the variable size issue mentioned above a
>> possible solution would be:
>> - have one bitmap hanging off of pirq_dpci->gmsi,
>> - have one bitmap per pCPU,
>> - populate the new destination bits into the per-pCPU one,
>> - issue the PIR->IRR sync,
>> - exchange the per-pCPU and per-DPCI pointers.
>> You could then leave the pointers at NULL when no posting is to
>> be used, addressing the apparent layering violation here at the
>> same time.
> 
> Right, the above option avoids having to calculate the possible
> destinations twice (once on setup and once on teardown), however it
> expands the size of gmsi.

Well, size of dynamically allocated structures is clearly secondary
compared to size of objects we put on the stack. (It had bothered
me from the beginning that the multiple destination information was
simply discarded, yet not discarding it would likely have meant
allocating such bitmaps anyway.)
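To illustrate the exchange idea (a rough sketch only; gmsi.dest_map
and the per-pCPU scratch variable are invented names, and allocation
as well as the NULL handling for the non-posting case are left out):

    static DEFINE_PER_CPU(unsigned long *, dpci_dest_scratch);

    /* Populate the new destination set into this pCPU's scratch bitmap. */
    unsigned long *new_map = this_cpu(dpci_dest_scratch);

    hvm_intr_get_dests(d, dest, dest_mode, delivery_mode, new_map);

    /*
     * Sync PIR->IRR for the previous destinations, which at this
     * point are still recorded in the per-DPCI bitmap.
     */
    domain_sync_vlapic_pir(d, pirq_dpci->gmsi.dest_map);

    /*
     * Exchange the pointers: the DPCI now records the new
     * destinations, and the old bitmap becomes this pCPU's scratch
     * space for the next update.
     */
    this_cpu(dpci_dest_scratch) = pirq_dpci->gmsi.dest_map;
    pirq_dpci->gmsi.dest_map = new_map;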

> While here, I've also realized that interrupts injected using
> XEN_DMOP_inject_msi will also be posted, and I'm afraid there's no way
> to track and flush those unless we provide a posted-flush hypercall or
> some such, so that emulators can request a PIR flush when interrupts
> of fully emulated devices are reconfigured.
> 
> OTOH, maybe XEN_DMOP_inject_msi should pause the vCPU, set the bit in
> IRR and unpause?

Might be an option, yes. This op is - afaict - not for production
use anyway, so ought to be fine to be slow. (The pausing wouldn't
be needed when not posting the interrupt, iiuc.) Another option
(without having looked at the code) might be to suppress posting,
if that can be suitably requested through the interface layers.
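For reference, the pause based variant might look about like this
(just a sketch; v and vector stand for the destination vCPU and the
vector decoded from the MSI data):

    /*
     * With the vCPU paused, set the vector directly in the vIRR
     * instead of posting it, so nothing can be left pending in the
     * PIR that would later need flushing.
     */
    vcpu_pause(v);
    vlapic_set_vector(vector, &vcpu_vlapic(v)->regs->data[APIC_IRR]);
    vcpu_unpause(v);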

Jan
