[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.



> On December 25 2015, 11:06 AM, <Tian, Kevin> wrote: 
> > From: Xu, Quan
> > Sent: Wednesday, December 23, 2015 4:26 PM
> >
> > Now if IOTLB/Context/IETC flush is timeout, panic hypervisor.
> > The coming patch set will fix it.
> 
> again, please adjust comment to reflect what this patch is doing, i.e. only 
> about
> improvement on Device-TLB flush.
> 

Agreed.

> >
> > If Device-TLB flush is timeout, we'll hide the target ATS device and
> > crash the domain owning this ATS device.
> >
> > If impacted domain is hardware domain, just throw out a warning.
> >
> > The hided Device will be disallowed to be further assigned to any
> > domain.
> 
> hided Device -> hidden device
> 

Agreed.

[...]

> > diff --git a/xen/drivers/passthrough/vtd/qinval.c
> > b/xen/drivers/passthrough/vtd/qinval.c
> > index b227e4e..7330c5d 100644
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu
> > *iommu,  static int invalidate_sync(struct iommu *iommu)  {
> >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > +    int rc;
> >
> >      if ( qi_ctrl->qinval_maddr )
> > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > +    {
> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > +        if ( rc )
> > +        {
> > +            if ( rc == -ETIMEDOUT )
> > +                panic("Queue invalidate wait descriptor was
> timeout.\n");
> > +            return rc;
> > +        }
> > +    }
> > +
> 
> why do you still do panic here? w/ PATCH 1/3 you should return rc to caller
> directly, right?
> 

This panic is original in queue_invalidate_wait(). Patch 1/3 is just for 
Device-TLB flush error ( Actually it covers iotlb flush error).
If I fix VT-d iotlb/context/iec flush error to propagated callers, then I can 
remove this panic and return rc to propagated caller directly.



> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > +                                         u16 seg, u8 bus, u8 devfn)
> {
> > +    struct domain *d;
> > +    struct pci_dev *pdev;
> > +
> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > +    ASSERT(d);
> > +    for_each_pdev(d, pdev)
> > +    {
> > +        if ( (pdev->seg == seg) &&
> > +             (pdev->bus == bus) &&
> > +             (pdev->devfn == devfn) )
> > +        {
> > +            if ( pdev->domain )
> > +            {
> > +                list_del(&pdev->domain_list);
> > +                pdev->domain = NULL;
> > +            }
> > +
> > +            if ( pci_hide_device(bus, devfn) )
> > +            {
> > +                printk(XENLOG_ERR
> > +                       "IOMMU hide device %04x:%02x:%02x error.",
> > +                       seg, bus, devfn);
> > +                break;
> > +            }
> 
> I'm not sure whether using pci_hide_device is enough or not break other code
> path here? For example, now you have errors propagated to callers.

More information, after call pci_hide_device(),
..
         pdev->domain = dom_xen;
         list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
..

Hidden PCI devices are associated with 'dom_xen'.
Now I found it may not clear rmrr mapping / context mapping. .i.e 
iommu_remove_device() -- [hd->platform_ops is NULL for dom_xen].
But I think it is acceptable, IMO dom_xen is a part hypervisor, and there 
mappings would not a security issue.
Or I can remove these mappings before to hide this device.

So I think it will not break other code break other code.

> What's the
> final policy against flush error? 
>

If Device-TLB flush is timeout, we'll hide the target ATS device and crash the 
domain owning this ATS device.
If impacted domain is hardware domain, just throw out a warning, instead of 
crash the hardware domain.
The hided Device will be disallowed to be further assigned to any domain.



>If they will finally go to cleanup some more state
> relying on pdev->domain, then that code path might be broken. If it's fine to 
> do
> cleanup at this point, then just hidden is not enough. If you look at
> pci_remove_device, there's quite some state to be further cleaned up...
> 


As the pdev->domain is 'dom_xen';
So for this case, it is still working. Correct me if it is not correct.
BTW, a quick question, which case to call pci_remove_device()?


> I didn't think about the full logic thoroughly now. But it would always be 
> good
> to not hide device now until a point where all related states have been 
> cleaned
> up in error handling path chained up.
> 
 
Quan



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.