Re: [Xen-devel] [PATCH v3 0/2] VT-d flush issue

On 20/12/15 13:57, Xu, Quan wrote:
>> On 12.12.2015 at 9:22pm, <quan.xu@xxxxxxxxx> wrote:
>> These patches are based on Kevin Tian's previous discussion 'Revisit
>> VT-d asynchronous flush issue'.
>> They fix the current timeout concern and also allow limited ATS support
>> in a lightweight way:
>> 2. Fix VT-d flush timeout issue.
>>     If an IOTLB/Context/IETC flush times out, we should assume that no
>> device under this IOMMU can function correctly.
>>     So for each device under this IOMMU we'll mark it as unassignable
>> and kill the domain owning the device.
> Hi, 
> Through research and investigation, when IEC/Iotlb/Context are flush 
> error(VT-d is bug),
> IMO it is unavoidable to panic. The following are some reasons:
> 1. The below is the general platform topology, illustrated by VT-d spec.
> VT-d is a key component of the platform infrastructure in virtualization 
> usage,
> providing DMA/Intr remapping capabilities.
> If such a key component VT-d is bug, it can't provide reliability for 
> recording
> and reporting of DMA/Intr error to VMM that may otherwise corrupt memory or
> impact VM isolation.
>        Processor  ... Processor
>        ---------      ---------
>                   ^
>                   |
>              North Bridge
>             --------------      <---> DRAM
>            DMA/Intr Remapping
>                 ^^^^
>                 |||| PCIe Devices
>                 vvvv
> 2. If VT-d is bug, does the hardware_domain continue to work with PCIe 
> Devices / DRAM well with DMA remapping error?
>    I think not. Furthermore, I think the VMM can NOT run a normal HVM domain
> without device-passthrough.
> 3. There are many reasons for an IEC/IOTLB/Context flush, e.g. MSI/EPT...
> updates.
>    These flushes are distributed across the VMM source code, so it is a
> challenge to make sure callers actually honor errors
>    and check all the way up the call trees. It looks like rewriting the VMM
> source code.
> 4. In more detail, some flush errors are very tricky, e.g. how to deal
> with freeing an MSI after an IEC flush error:
>    restore or ignore it?
> Your comments are welcome; please correct me if I am wrong. Thanks.
> -Quan

I believe everywhere you say "is bug", you mean "is buggy".

I agree that, if the remapping engine itself is buggy, Xen can't be
certain about anything else functioning correctly.

However, there are many errors the remapping engine could generate which
are not caused by the engine itself being buggy.  These could be because
of a downstream device misbehaving, or because the remapping engine was
incorrectly programmed.  These cases should not be able to directly
cause a crash.
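
To make the distinction concrete, here is a minimal sketch of separating
the cause of a failed flush from the response to it.  Every name below
(fault kinds, responses, respond_to_fault()) is hypothetical and invented
for illustration; none of it is existing Xen code, and how the cause would
actually be determined is left open.

```c
/* Hypothetical classification of why a flush failed; not Xen's real types. */
enum flush_fault {
    FAULT_NONE,           /* flush completed normally */
    FAULT_DEVICE,         /* a downstream device is misbehaving */
    FAULT_MISPROGRAMMED,  /* software programmed the engine incorrectly */
    FAULT_ENGINE,         /* the remapping engine itself is buggy */
};

/* Hypothetical responses, ordered from least to most drastic. */
enum flush_response {
    RESP_CONTINUE,        /* nothing to do */
    RESP_ISOLATE_DEVICE,  /* leave the device stuck, carry on without it */
    RESP_FAIL_OPERATION,  /* return an error to the caller */
    RESP_DISABLE_ENGINE,  /* stop using this engine, if that is safe */
};

/* Map each cause to a response; only an engine fault reaches the last resort. */
static enum flush_response respond_to_fault(enum flush_fault fault)
{
    switch (fault) {
    case FAULT_NONE:
        return RESP_CONTINUE;
    case FAULT_DEVICE:
        return RESP_ISOLATE_DEVICE;
    case FAULT_MISPROGRAMMED:
        return RESP_FAIL_OPERATION;
    case FAULT_ENGINE:
        return RESP_DISABLE_ENGINE;
    }
    /* Unknown cause: assumed here to be treated as conservatively as possible. */
    return RESP_DISABLE_ENGINE;
}
```

The point is only that the first three cases never escalate to taking the
host down.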

In the case of a buggy device, it doesn't matter if it, and all its
resources, are left in a stuck state; it can be ignored and the rest of
the host can try to continue normally.

On point 3 specifically: the code is indeed currently broken, and
needs fixing, one way or another.  It is sad that we are still suffering
from lots of very poor quality code submitted in the past.
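
As a rough illustration of what "callers honoring errors all the way up
the call tree" means, here is a small sketch.  The fake_* functions and
struct fake_iommu are invented for this example and are not Xen
interfaces; MUST_CHECK is a local macro wrapping the GCC/Clang
warn_unused_result attribute, assumed here so that the compiler flags any
caller which drops the return value.

```c
#include <errno.h>

/* GCC/Clang extension: warn when a caller ignores the return value. */
#define MUST_CHECK __attribute__((warn_unused_result))

/* Hypothetical stand-in for an IOMMU; not a Xen structure. */
struct fake_iommu {
    int flush_times_out;  /* knob for the example: make flushes fail */
    int flush_calls;      /* how many flushes were attempted */
};

/* Lowest level: report a timeout instead of hiding it. */
static MUST_CHECK int fake_iotlb_flush(struct fake_iommu *iommu)
{
    iommu->flush_calls++;
    return iommu->flush_times_out ? -ETIMEDOUT : 0;
}

/* Middle level: the mapping update would happen here; pass the error up. */
static MUST_CHECK int fake_update_mapping(struct fake_iommu *iommu)
{
    return fake_iotlb_flush(iommu);
}

/* Top level: the caller sees the failure and can decide what to do. */
static MUST_CHECK int fake_assign_device(struct fake_iommu *iommu)
{
    int rc = fake_update_mapping(iommu);

    if (rc)
        return rc;  /* e.g. fail the operation or isolate the device */

    return 0;
}
```

This deliberately says nothing about whether to restore or ignore state
after a failed flush; it only shows the error reaching a level that is
able to make that choice.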

Even in the case that we discover that a remapping engine itself is
buggy, it is likely not to be the only remapping engine in the system. 
If it can be safely disabled, the host has a good chance of being able
to continue sensibly.
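
A sketch of that last idea, under the assumption that each engine can
report whether it is safe to disable (how that would be determined is not
addressed here).  struct fake_engine and fake_retire_engine() are
hypothetical names for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-engine state; not a Xen structure. */
struct fake_engine {
    bool enabled;
    bool can_disable_safely;  /* assumed to be known by some other means */
};

/*
 * Disable one faulty engine and report how many engines remain enabled.
 * Returns -1, changing nothing, if the index is out of range or the
 * engine cannot be safely disabled.
 */
static int fake_retire_engine(struct fake_engine *engines, size_t count,
                              size_t faulty)
{
    size_t i;
    int remaining = 0;

    if (faulty >= count || !engines[faulty].can_disable_safely)
        return -1;

    engines[faulty].enabled = false;

    for (i = 0; i < count; i++)
        if (engines[i].enabled)
            remaining++;

    return remaining;
}
```

A non-negative result means the host still has working engines to carry
on with; -1 is the case where something more drastic has to be
considered.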

