[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.



> On January 20, 2016 at 7:29 pm,  <JBeulich@xxxxxxxx> wrote:
> >>> On 20.01.16 at 11:26, <quan.xu@xxxxxxxxx> wrote:
> >> On January 15, 2016 at 9:10, <JBeulich@xxxxxxxx> wrote:
> >> >>> On 23.12.15 at 09:25, <quan.xu@xxxxxxxxx> wrote:
> >> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
> >> >      return 0;
> >> >  }
> >> >
> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> >> > +                                         u16 seg, u8 bus, u8
> >> > +devfn)
> >> {
> >> > +    struct domain *d;
> >> > +    struct pci_dev *pdev;
> >> > +
> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> > +    ASSERT(d);
> >>
> >> Don't you need to acquire some lock in order to safely assert this?
> >
> > Referred to the other use case of 'rcu_lock_domain_by_id()', Xen
> > checks whether The domain is 'NULL'.
> > Could I also replace the 'ASSERT(d)' with
> >           + If ( d == NULL )
> >           +   return;
> > ?
> 
> At a first glance this doesn't look right, but in the end that's something 
> you need
> to answer.
> 

Is it quite similar to whether the domain has been destroyed when Device-TLB is 
flushing?
Correct me if I still doesn't get you point..


> >> Also note that unused slots hold zero, i.e. there's at least a
> >> theoretical
> > risk of
> >> problems here when you don't also look at
> >> iommu->domid_bitmap.
> >>
> > I am not clear how to fix this point. Do you have good idea?
> > Add a lock to 'iommu->domid_bitmap'?
> 
> How would a lock help avoiding mistaking an unused slot to mean Dom0? As
> already suggested - I think you simply need to consult the bitmap along with 
> the
> domain ID array.
> 

Try to get domain id with iommu->domid_map[did] ?


> >> > +            {
> >> > +                list_del(&pdev->domain_list);
> >>
> >> This should happen under pcidevs_lock - you need to either acquire it
> >> or
> >> ASSERT() it being held.
> >>
> >
> > I should check whether it is under pcidevs_lock -- with
> > spin_is_locked(&pcidevs_lock)
> > If it is not under pcidevs_lock, I should acquire it.
> > I think ASSERT() is not a good idea. Hypervisor acquires this lock and
> > then remove the resource.
> 
> I don't understand this last sentence.
> 
For example: in 
pci_remove_device()
{
...
    spin_lock(&pcidevs_lock);
    ..
    iommu_remove_device()..
    ..
    spin_unlock(&pcidevs_lock);
}

Device-TLB is maybe flush error in iommu_remove_device()..
Then it is under pcidevs_lock..
In this case, I need to check whether it is under pcidevs_lock.
If not, I need to acquire the pcidevs_lock.

> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >> >          queue_invalidate_iotlb(iommu,
> >> >                                 type >>
> >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >> >                                 dw, did, size_order, 0, addr);
> >> > +
> >> > +        /*
> >> > +         * Synchronize with hardware for invalidation request
> descriptors
> >> > +         * submitted before Device-TLB invalidate descriptor.
> >> > +         */
> >> > +        rc = invalidate_sync(iommu);
> >> > +        if ( rc )
> >> > +             return rc;
> >> > +
> >> >          if ( flush_dev_iotlb )
> >> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order,
> type);
> >> > -        rc = invalidate_sync(iommu);
> >> > +
> >> >          if ( !ret )
> >> >              ret = rc;
> >> >      }
> >>
> >> This change is because of ...?
> >>
> > As Kevin's comments,
> > I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb().
> > Then i can know which device is flush timeout.
> 
> I don't understand: How is your reply related to you moving up the invocation 
> of
> invalidate_sync()?

This invalidate_sync() is to synchronize with hardware for invalidation request 
descriptors
submitted before Device-TLB invalidate descriptor.

If I don't add this invalidate_sync().. 
dev_invalidate_iotlb_timeout() is not clear whether the flush time out is about 
Device-TLB flush error or the other previous iotlb/iec/context flush error. 


> 
> >> Plus logging the error code and device coordinates might turn out
> >> helpful in such cases. But first of all - if there was a timeout,
> >> we'd make it here only for Dom0. Perhaps the printing should move into an
> else to the domain_crash()?
> >> And if there was another error, the message would be outright wrong.
> >>
> > IMO, the print for Dom0 is enough.
> > I think it is not a good idea to move into an else to domain_crash().
> > Domain_crash is quite a common part.
> > Anyway I can improve it in low priority.
> 
> At the very least the message should not end up being actively misleading.
> 
What about the below?
+                printk(XENLOG_ERR
+                       "Device %04x:%02x:%02x is flush error.",
+                       seg, bus, devfn);



Quan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.