
Re: [Xen-devel] Revisit VT-d asynchronous flush issue

On 02/11/15 08:03, Tian, Kevin wrote:
> Let's start a new thread with a summary of the previous discussion,
> then our latest experiment data and updated proposal.
>
> From previous discussions, the suggestion is that a spin model is
> acceptable only when the spin timeout doesn't exceed the order of a
> scheduling time slice, or of other blocking operations such as WBINVD.
> Otherwise an async-flush model is preferred, to prevent misbehaving
> guests from taking long spins and impacting the whole system.
>
> Below are some thresholds to be considered:
>
> 1) The scheduling time slice in Credit is 1ms.

1ms is the minimum scheduling timeslice.  30ms is the current default
(not that this should affect the following reasoning).

> 2) The WBINVD cost is 4.6ms in the worst case on an IVT platform (32
> cores, 10GB NIC assigned to the VM, running iperf). Detailed data is
> appended at the bottom. The actual cost varies across platforms, due
> to differences in cache size/layout. For example, we also heard from
> other colleagues about costs on the order of 10ms on another platform.
>
> 3) The PCI SIG strongly recommends that the Completion Timeout
> mechanism not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device
> Capabilities 2 Register). This means a CPU MMIO read might already
> take >10ms without us having noticed.
>
> Based on the above, a timeout in the range [1ms, 10ms] would likely
> not introduce bad system behaviour. Or, conservatively, we can define
> the spin timeout default as 1ms, while allowing a boot-time override
> up to 10ms for more flexibility.
> Then, regarding VT-d flushes:
>
> - For context/iotlb/iec flushes, our measurements show worst cases of
> <10us. We also confirmed with the hardware team that 1ms is large
> enough for IOMMU-internal flushes.
>
> - For ATS device-TLB flushes, the PCI spec allows up to 60s, but:
>
>       * Our hardware team confirms that 1ms should be enough for
> integrated PCI devices w/ ATS.
>
>       * For discrete PCI devices w/ ATS, it's uncertain whether 1ms or
> 10ms is too restrictive, but there are only a few such devices on the
> market now.
>
> Based on the above, we propose to continue with the spin-timeout
> model, with some adjustments which fix the current timeout concern and
> also allow limited ATS support in a lightweight way:
>
> 1) Reduce the spin timeout to 1ms, boot-time changeable up to 10ms.

If this is going to be command line configurable, don't have an upper limit.

Given the uncertainty with external devices, it might be necessary to
experiment with timeouts greater than 10ms.

> 2) If the timeout expires, kill the VM to which the target device is
> assigned. Optionally, the hypervisor may mark the device
> non-assignable.
>
> This works for devices w/o ATS, and for integrated devices w/ ATS. It
> might or might not work for discrete devices w/ ATS, but we can
> re-evaluate the gain vs. the software complexity of async flush if and
> when many discrete devices break the timeout assumptions.
>
> Thoughts?

As presented, this is probably an improvement, but I am concerned about
the case of external devices.

Then again, as none of this currently works at all, we are not in a
worse state.
