[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH RFC] pass-through: sync pir to irr after msix vector been updated



On Thu, Sep 26, 2019 at 01:33:42PM -0700, Joe Jin wrote:
> On 9/24/19 8:42 AM, Roger Pau Monné wrote:
> > On Fri, Sep 13, 2019 at 09:50:34AM -0700, Joe Jin wrote:
> >> On 9/13/19 3:33 AM, Roger Pau Monné wrote:
> >>> On Thu, Sep 12, 2019 at 11:03:14AM -0700, Joe Jin wrote:
> >>>> With below testcase, guest kernel reported "No irq handler for vector":
> >>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
> >>>>   2). Start rds-stress between 2 guests.
> >>>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
> >>>>
> >>>> Repeat above test several iteration, guest kernel reported "No irq 
> >>>> handler
> >>>> for vector", and IB traffic downed to zero which caused by interrupt 
> >>>> lost.
> >>>>
> >>>> When vcpu offline, kernel disabled local IRQ, migrate IRQ to other cpu,
> >>>> update MSI-X table, enable IRQ. If any new interrupt arrived after
> >>>> local IRQ disabled also before MSI-X table been updated, interrupt still 
> >>>> used old vector and dest cpu info, and when local IRQ enabled again, 
> >>>> interrupt been sent to wrong cpu and vector.
> >>>
> >>> Yes, but that's something Linux shoulkd be able to handle, according
> >>> to your description there's a window where interrupts can be delivered
> >>> to the old CPU, but that's something expected.
> >>
> >> Actually, kernel will check APIC IRR when do migration, if any pending
> >> IRQ, will retrigger IRQ to new destination, but Xen does not set the
> >> bit.
> > 
> > Right, because the interrupt is pending in the PIRR posted descriptor
> > field, it has not yet reached the vlapic IRR.
> > 
> >>>
> >>>>
> >>>> Looks sync PIR to IRR after MSI-X been updated is help for this issue.
> >>>
> >>> AFAICT the sync that you do is still using the old vcpu id, as
> >>> pirq_dpci->gmsi.dest_vcpu_id gets updated a little bit below. I'm
> >>> unsure about why does this help, I would expect the sync between pir
> >>> and irr to happen anyway, and hence I'm not sure why is this helping.
> >>
> >> As my above update, IRQ retriggered by old cpu, so Xen need to set IRR
> >> for old cpu but not of new.
> > 
> > AFAICT you are draining any pending data from the posted interrupt
> > PIRR field into the IRR vlapic field, so that no stale interrupts are
> > left behind after the MSI fields have been updated by the guest. I
> > think this is correct, I wonder however whether you also need to
> > trigger a vcpu re-scheduling (pause/unpause the vpcu), so that pending
> > interrupts in IRR are injected by vmx_intr_assist.
> > 
> > Also, I think you should do this syncing after the pi_update_irte
> > call, or else there's still a window (albeit small) where you can
> > still get posted interrupts delivered to the old vcpu.
> 
> I agree with you that we need to take care of this issue as well.

Can you give a try to the patch above? I don't have the hardware to
test any of this ATM, so your help would be appreciated.

Thanks, Roger.
---8<---
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 9466258d6f..d255ad8db7 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -106,7 +106,7 @@ static void vlapic_clear_irr(int vector, struct vlapic 
*vlapic)
     vlapic_clear_vector(vector, &vlapic->regs->data[APIC_IRR]);
 }
 
-static void sync_pir_to_irr(struct vcpu *v)
+void vlapic_sync_pir_to_irr(struct vcpu *v)
 {
     if ( hvm_funcs.sync_pir_to_irr )
         alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
@@ -114,7 +114,7 @@ static void sync_pir_to_irr(struct vcpu *v)
 
 static int vlapic_find_highest_irr(struct vlapic *vlapic)
 {
-    sync_pir_to_irr(vlapic_vcpu(vlapic));
+    vlapic_sync_pir_to_irr(vlapic_vcpu(vlapic));
 
     return vlapic_find_highest_vector(&vlapic->regs->data[APIC_IRR]);
 }
@@ -1493,7 +1493,7 @@ static int lapic_save_regs(struct vcpu *v, 
hvm_domain_context_t *h)
     if ( !has_vlapic(v->domain) )
         return 0;
 
-    sync_pir_to_irr(v);
+    vlapic_sync_pir_to_irr(v);
 
     return hvm_save_entry(LAPIC_REGS, v->vcpu_id, h, vcpu_vlapic(v)->regs);
 }
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index b292e79382..88188d2d7f 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -341,7 +341,7 @@ int pt_irq_create_bind(
     {
         uint8_t dest, delivery_mode;
         bool dest_mode;
-        int dest_vcpu_id;
+        int dest_vcpu_id, prev_vcpu_id = -1;
         const struct vcpu *vcpu;
         uint32_t gflags = pt_irq_bind->u.msi.gflags &
                           ~XEN_DOMCTL_VMSI_X86_UNMASKED;
@@ -411,6 +411,7 @@ int pt_irq_create_bind(
 
                 pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
                 pirq_dpci->gmsi.gflags = gflags;
+                prev_vcpu_id = pirq_dpci->gmsi.dest_vcpu_id;
             }
         }
         /* Calculate dest_vcpu_id for MSI-type pirq migration. */
@@ -440,7 +441,8 @@ int pt_irq_create_bind(
         /* Use interrupt posting if it is supported. */
         if ( iommu_intpost )
             pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
-                           info, pirq_dpci->gmsi.gvec);
+                           info, pirq_dpci->gmsi.gvec,
+                           prev_vcpu_id >= 0 ? d->vcpu[prev_vcpu_id] : NULL);
 
         if ( pt_irq_bind->u.msi.gflags & XEN_DOMCTL_VMSI_X86_UNMASKED )
         {
@@ -729,7 +731,9 @@ int pt_irq_destroy_bind(
             what = "bogus";
     }
     else if ( pirq_dpci && pirq_dpci->gmsi.posted )
-        pi_update_irte(NULL, pirq, 0);
+        pi_update_irte(NULL, pirq, 0,
+                       pirq_dpci->gmsi.dest_vcpu_id >= 0
+                       ? d->vcpu[pirq_dpci->gmsi.dest_vcpu_id] : NULL);
 
     if ( pirq_dpci && (pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) &&
          list_empty(&pirq_dpci->digl_list) )
diff --git a/xen/drivers/passthrough/vtd/intremap.c 
b/xen/drivers/passthrough/vtd/intremap.c
index bf846195c4..ad03abb4da 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -951,7 +951,7 @@ void intel_iommu_disable_eim(void)
  * when guest changes MSI/MSI-X information.
  */
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
-    const uint8_t gvec)
+    const uint8_t gvec, struct vcpu *prev)
 {
     struct irq_desc *desc;
     struct msi_desc *msi_desc;
@@ -964,8 +964,8 @@ int pi_update_irte(const struct pi_desc *pi_desc, const 
struct pirq *pirq,
     msi_desc = desc->msi_desc;
     if ( !msi_desc )
     {
-        rc = -ENODEV;
-        goto unlock_out;
+        spin_unlock_irq(&desc->lock);
+        return -ENODEV;
     }
     msi_desc->pi_desc = pi_desc;
     msi_desc->gvec = gvec;
@@ -974,10 +974,9 @@ int pi_update_irte(const struct pi_desc *pi_desc, const 
struct pirq *pirq,
 
     ASSERT(pcidevs_locked());
 
-    return msi_msg_write_remap_rte(msi_desc, &msi_desc->msg);
-
- unlock_out:
-    spin_unlock_irq(&desc->lock);
+    rc = msi_msg_write_remap_rte(msi_desc, &msi_desc->msg);
+    if ( !rc && prev )
+         vlapic_sync_pir_to_irr(prev);
 
     return rc;
 }
diff --git a/xen/include/asm-x86/hvm/vlapic.h b/xen/include/asm-x86/hvm/vlapic.h
index dde66b4f0f..b0017d1dae 100644
--- a/xen/include/asm-x86/hvm/vlapic.h
+++ b/xen/include/asm-x86/hvm/vlapic.h
@@ -150,4 +150,6 @@ bool_t vlapic_match_dest(
     const struct vlapic *target, const struct vlapic *source,
     int short_hand, uint32_t dest, bool_t dest_mode);
 
+void vlapic_sync_pir_to_irr(struct vcpu *v);
+
 #endif /* __ASM_X86_HVM_VLAPIC_H__ */
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 85741f7c96..314dcfbe47 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -119,7 +119,7 @@ static inline void iommu_disable_x2apic(void)
 extern bool untrusted_msi;
 
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
-                   const uint8_t gvec);
+                   const uint8_t gvec, struct vcpu *prev);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.