[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] VT-d flush timeout

>>> On 25.09.14 at 03:02, <eddie.dong@xxxxxxxxx> wrote:
>> >I don't get your point. What's the different between the 1s and the
>> >large enough value? Since hardware completed the flush quickly, it will
>> >never spin more than real flush time in normal case. 1s spin only
>> >happens in the abnormal case.
>> Right, but we can't ignore this abnormal case. In particular we can't 
> exclude
>> that a DomU with a device assigned may have ways to (perhaps indirectly)
>> affect the completion time of the flushes.
>> >My only concern is that, for QI flush, the spin time relies on the
>> >length of the queue. I am not sure whether 1s is enough for worst case
>> >and I think we should remove the 1s in QI flush. And I think this also
>> >the same reason for Linux don't use timeout mechanism in QI flush.
>> First of all I think both Linux and Xen in the majority of cases waits for
>> completion of just individual queue entries. I.e. I'm not sure if the 
> practical
>> worst case really is equal to the theoretical one. And then removing a
>> timeout just to allow _longer_ spinning isn't really a step forward. If the
>> timeout isn't big enough, the only solution is to immediately replace it 
> with
>> asynchronous handling.
> Giving this path of long time wait-loop only happens at the case when the 
> hardware fails, I don't care if it enters panic in 1 seconds or in 10ms 
> seconds. But the software compatibility (for all existing platforms and 
> potential future platforms) is much more important.

I agree for the paths leading to a panic(). But there's one such case
where it doesn't: snb_vtd_ops_preamble().

> Replacing wait-loop with an asynchronous handling mechanism is definitely 
> good -- maybe put an TODO list for the IOMMU stuff which I believe requiring 
> not small effort. But the main goal should be targeting the normal code path, 
> i.e. success of the IOTLB operation no matter it is 1ms or 10ms. 
> However, as for the timeout code path, given that the specification doesn't 
> say what the hardware WORST case is, using practical smaller number is not a 
> good choice to me. Nobody is able to test on all platforms, and it may fail 
> in future platform. Further more, the result of the mis-prediction to the 
> hardware behavior is so serious-->leading to hypervisor panic.  Of course 
> 1second is not a good value too, but at least it is verified in the past 
> years.

Once again - the ATS spec talks of 60 seconds (with possibly another
30 seconds added on top depending on how you read it). So while for
the non-ATS case, provided the one case pointed out above gets dealt
with, I agree we don't need to bother changing the code, the ATS case
clearly makes asynchronous handling necessary. It is for that reason
that we decided to disable ATS support by default. Albeit now that I
think about it again, I'm not sure that was an appropriate action for
the AMD side of things - Andrew, what do you think?


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.