
Re: [Xen-devel] [PATCH v3 0/2] VT-d flush issue

>On 12.12.2015 at 9:22pm, <quan.xu@xxxxxxxxx> wrote:
> These patches are based on Kevin Tian's previous discussion 'Revisit
> VT-d asynchronous flush issue'.
> Fix the current timeout concern and also allow limited ATS support in a light way:

> 2. Fix VT-d flush timeout issue.
>     If an IOTLB/Context/IEC flush times out, we should assume that all
> devices under this IOMMU cannot function correctly.
>     So for each device under this IOMMU we'll mark it as unassignable
> and kill the domain owning the device.


After some research and investigation: when an IEC/IOTLB/Context flush fails
(i.e. VT-d itself is buggy), IMO a panic is unavoidable. Here are my reasons:

1. Below is the general platform topology, as illustrated in the VT-d spec.
VT-d is a key component of the platform infrastructure in virtualization usage,
providing DMA/interrupt remapping capabilities.
If such a key component is buggy, it cannot reliably record and report
DMA/interrupt errors to the VMM, errors that may otherwise corrupt memory or
impact VM isolation.

       Processor  ... Processor
       ---------      ---------

             North Bridge
            --------------      <---> DRAM
           DMA/Intr Remapping

                |||| PCIe Devices

2. If VT-d is buggy, can the hardware_domain continue to work well with PCIe
devices / DRAM despite DMA remapping errors?
   I think not. Furthermore, I think the VMM can NOT run even a normal HVM
domain without device passthrough.

3. There are many triggers for IEC/IOTLB/Context flushes, e.g. MSI, EPT, ...
   They are distributed across the VMM source code, so it is a challenge to
make sure callers actually honor errors and check them all the way up the
call trees. It looks like rewriting the VMM source.

4. In more detail, some flush errors are very tricky, e.g. how should we deal
with an MSI free when the IEC flush fails:
   restore it or ignore it?
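
To make points 3 and 4 concrete, here is a minimal sketch of what honoring a
flush error in one caller could look like. It is not Xen code: every
demo_* name, the struct layout, and the choice to restore the entry on
failure are assumptions made for illustration. The only real facility used
is the GCC/Clang warn_unused_result attribute, which makes the compiler warn
when a caller drops the return value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* GCC/Clang attribute: warn when a caller ignores the return value. */
#define MUST_CHECK __attribute__((warn_unused_result))

/* Hypothetical stand-in for one IOMMU; not a Xen structure. */
struct demo_iommu {
    bool flush_times_out;            /* simulates broken hardware */
    unsigned int msi_entries_in_use; /* simulated remapping entries */
};

/* Hypothetical IEC flush: 0 on success, -ETIMEDOUT on timeout. */
MUST_CHECK static int demo_flush_iec(struct demo_iommu *iommu)
{
    assert(iommu != NULL);
    return iommu->flush_times_out ? -ETIMEDOUT : 0;
}

/*
 * Hypothetical MSI teardown. It clears an entry, then flushes the IEC.
 * Point 4 asks whether to restore or ignore on a failed flush; this
 * sketch picks "restore" and reports the error, purely as an assumption.
 */
MUST_CHECK static int demo_msi_free(struct demo_iommu *iommu)
{
    int rc;

    assert(iommu != NULL);
    if (iommu->msi_entries_in_use == 0)
        return -ENOENT;

    iommu->msi_entries_in_use--;      /* clear the remapping entry */

    rc = demo_flush_iec(iommu);
    if (rc) {
        iommu->msi_entries_in_use++;  /* roll back the software state */
        return rc;                    /* caller must handle this too */
    }
    return 0;
}
```

Even in this toy case the caller of demo_msi_free() now has to decide what a
failed free means, and so does its caller. Repeating that for every flush
site is the rewrite concern raised in point 3.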

Your comments are welcome; correct me if I am wrong. Thanks.


>     If a Device-TLB flush times out, we'll mark the target ATS device as
> unassignable and kill the domain owning this device.
>     If the impacted domain is the hardware domain, just throw out a warning.
> It's an open question here whether we want to kill the hardware domain
> (or directly panic the hypervisor). Comments are welcome.
>     A device marked as unassignable will be disallowed from being further
> assigned to any domain.
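
The Device-TLB timeout policy quoted above can be sketched roughly as
follows. This is an illustration only, not the patch itself: the demo_*
types and functions are hypothetical, and marking the device unassignable
even when the owner is the hardware domain is my reading of the
description rather than something it states outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the hypothetical handler decided to do with the owning domain. */
enum demo_action { DEMO_KILL_DOMAIN, DEMO_WARN_ONLY };

/* Hypothetical stand-ins; not Xen's struct domain / struct pci_dev. */
struct demo_domain {
    bool is_hardware_domain;
    bool crashed;
};

struct demo_device {
    struct demo_domain *owner;
    bool unassignable;
};

/* Called when a Device-TLB flush for this ATS device has timed out. */
static enum demo_action demo_devtlb_flush_timeout(struct demo_device *dev)
{
    assert(dev != NULL && dev->owner != NULL);

    /* Assumption: the device is quarantined in both cases. */
    dev->unassignable = true;

    /* Hardware domain: only warn (the open question in the cover letter). */
    if (dev->owner->is_hardware_domain)
        return DEMO_WARN_ONLY;

    /* Any other domain owning the device is killed. */
    dev->owner->crashed = true;
    return DEMO_KILL_DOMAIN;
}

/* An unassignable device may not be assigned to any domain again. */
static bool demo_can_assign(const struct demo_device *dev)
{
    assert(dev != NULL);
    return !dev->unassignable;
}
```

For a guest-owned device the handler returns DEMO_KILL_DOMAIN and
demo_can_assign() then reports false; for a device owned by the hardware
domain it returns DEMO_WARN_ONLY and the domain keeps running.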

Xen-devel mailing list


