[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.



Jan, thanks for your comments.

> On January 15, 2016 at 9:10, <JBeulich@xxxxxxxx> wrote:
> >>> On 23.12.15 at 09:25, <quan.xu@xxxxxxxxx> wrote:
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu
> > *iommu,  static int invalidate_sync(struct iommu *iommu)  {
> >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > +    int rc;
> >
> >      if ( qi_ctrl->qinval_maddr )
> > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > +    {
> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > +        if ( rc )
> > +        {
> > +            if ( rc == -ETIMEDOUT )
> > +                panic("Queue invalidate wait descriptor was
> > + timeout.\n");
> 


> Unless I'm overlooking something this doesn't replace and existing panic(), 
> and I
> think I had indicated quite clearly that this series shouldn't add any new 
> ones,
> unless the alternative of crashing the owning domain is completely unfeasible.
> 


I will remove it in next v5.

[...]

> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
> >      return 0;
> >  }
> >
> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > +                                         u16 seg, u8 bus, u8 devfn)
> {
> > +    struct domain *d;
> > +    struct pci_dev *pdev;
> > +
> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > +    ASSERT(d);
> 
> Don't you need to acquire some lock in order to safely assert this?

Referred to the other use case of 'rcu_lock_domain_by_id()', Xen checks whether
The domain is 'NULL'. 
Could I also replace the 'ASSERT(d)' with
          + If ( d == NULL )
          +   return;
?

> Also note that unused slots hold zero, i.e. there's at least a theoretical 
> risk of
> problems here when you don't also look at
> iommu->domid_bitmap.
> 
I am not clear how to fix this point. Do you have good idea?
Add a lock to 'iommu->domid_bitmap'?


> > +    for_each_pdev(d, pdev)
> > +    {
> > +        if ( (pdev->seg == seg) &&
> > +             (pdev->bus == bus) &&
> > +             (pdev->devfn == devfn) )
> > +        {
> > +            if ( pdev->domain )
> 
> If the device is on the domain's list, can this reasonably be false?

I can't find the false case.

> I.e. don't you rather mean ASSERT() here?

I'd better replace 'if' with ASSERT() for this case.

> 
> > +            {
> > +                list_del(&pdev->domain_list);
> 
> This should happen under pcidevs_lock - you need to either acquire it or
> ASSERT() it being held.
> 

I should check whether it is under pcidevs_lock -- with 
spin_is_locked(&pcidevs_lock)
If it is not under pcidevs_lock, I should acquire it.
I think ASSERT() is not a good idea. Hypervisor acquires this lock and then 
remove the resource.



> > +    }
> > +
> > +    if ( !is_hardware_domain(d) )
> > +        domain_crash(d);
> 
> I wonder whether the device hiding shouldn't also be avoided if the device is
> owned by hwdom.
> 

I think it should be avoided if the device is owned by hwdom.
Otherwise, it is quite similar to panic..


> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >          queue_invalidate_iotlb(iommu,
> >                                 type >>
> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >                                 dw, did, size_order, 0, addr);
> > +
> > +        /*
> > +         * Synchronize with hardware for invalidation request descriptors
> > +         * submitted before Device-TLB invalidate descriptor.
> > +         */
> > +        rc = invalidate_sync(iommu);
> > +        if ( rc )
> > +             return rc;
> > +
> >          if ( flush_dev_iotlb )
> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> > -        rc = invalidate_sync(iommu);
> > +
> >          if ( !ret )
> >              ret = rc;
> >      }
> 
> This change is because of ...?
> 
As Kevin's comments,
I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb(). 
Then i can know which device is flush timeout.



> > --- a/xen/drivers/passthrough/vtd/x86/ats.c
> > +++ b/xen/drivers/passthrough/vtd/x86/ats.c
> > @@ -162,6 +162,19 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16
> did,
> >              return -EOPNOTSUPP;
> >          }
> >
> > +        /*
> > +         * Synchronize with hardware for Device-TLB invalidate
> > +         * descriptor.
> > +         */
> > +        ret = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
> > +                                        pdev->bus, pdev->devfn);
> > +        if ( ret )
> > +        {
> > +            dprintk(XENLOG_WARNING VTDPREFIX,
> > +                    "Device-TLB flush timeout.\n");
> 
> Are you aware that dprintk() compiles to nothing in non-debug builds?


To be honest, I was not aware.

> Plus logging the error code and device coordinates might turn out helpful in
> such cases. But first of all - if there was a timeout, we'd make it here only 
> for
> Dom0. Perhaps the printing should move into an else to the domain_crash()?
> And if there was another error, the message would be outright wrong.
> 
IMO, the print for Dom0 is enough.
I think it is not a good idea to move into an else to domain_crash(). 
Domain_crash is quite a common part.
Anyway I can improve it in low priority.

Jan, thanks..
-Quan




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.