
[Xen-devel] Revisit VT-d asynchronous flush issue



Let's start a new thread with a summary of previous discussion, and 
then our latest experiment data and updated proposal.

From previous discussions, the suggestion was that a spin model is 
acceptable only when the spin timeout does not exceed the order of a 
scheduling time slice, or of other blocking operations such as WBINVD. 
Otherwise an async-flush model is preferred, so that misbehaving guests 
cannot cause long spins that impact the whole system.

Below are some thresholds to be considered:

1) scheduling time slice in Credit is 1ms.

2) WBINVD cost is 4.6ms in the worst case on an IVT platform (32 cores, 
10GB NIC assigned to the VM, running iperf). Detailed data is appended 
at the bottom. Actual cost varies across platforms, due to different 
cache size/layout. For example, we also heard from other colleagues 
about a cost on the order of 10ms on another platform.

3) PCI SIG strongly recommends that the Completion Timeout mechanism
not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device Capabilities
2 Register). This means a CPU MMIO read might already take >10ms today, 
which we simply have not noticed.

Based on the above information, a timeout in the range [1ms, 10ms] 
would likely not introduce bad system behavior. Conservatively, we 
can define the default spin timeout as 1ms, while allowing a boot-time 
override of up to 10ms for more flexibility.
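
The default-plus-override rule above could be sketched roughly as 
follows. This is only an illustration: the macro and function names 
are hypothetical (not existing Xen symbols), and clamping requests 
below 1ms up to the default is an assumption, since the text only 
specifies the upper bound.

```c
#include <stdint.h>

/* Hypothetical names; the values come from the thresholds discussed above. */
#define FLUSH_TIMEOUT_DEFAULT_US  1000u   /* 1ms default spin timeout */
#define FLUSH_TIMEOUT_MAX_US     10000u   /* 10ms cap on the boot-time override */

/*
 * Map a boot-time requested timeout (in microseconds) to the value
 * actually used. A request of 0 means "no override given".
 * Assumption: requests below the default are raised to the default,
 * so the effective range is always [1ms, 10ms].
 */
static uint32_t flush_timeout_us(uint32_t requested_us)
{
    if (requested_us < FLUSH_TIMEOUT_DEFAULT_US)
        return FLUSH_TIMEOUT_DEFAULT_US;
    if (requested_us > FLUSH_TIMEOUT_MAX_US)
        return FLUSH_TIMEOUT_MAX_US;
    return requested_us;
}
```

With this rule a request of 5000us is used as-is, while a request 
matching the 60s ATS limit would be capped at 10ms.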

Then, regarding VT-d flush:

- For context/iotlb/iec flush, our measurements show worst cases
<10us. We also confirmed with the hardware team that 1ms is large 
enough for IOMMU internal flush.

- For ATS device-TLB flush, the PCI spec defines up to 60s, but:

        * Our hardware team confirms that 1ms should be enough for 
          integrated PCI devices w/ ATS.

        * For discrete PCI devices w/ ATS, it's uncertain whether 1ms 
          or 10ms is too restrictive for them, but there are only a 
          few such devices on the market now.

Based on the above information, we propose to continue with the 
spin-timeout model w/ some adjustments, which fixes the current 
timeout concern and also allows limited ATS support in a lightweight 
way:

1) reduce the spin timeout to 1ms, which can be raised at boot time
up to 10ms.

2) if the timeout expires, kill the VM which the target device is 
assigned to. Optionally the hypervisor may mark the device 
non-assignable.

This works for devices w/o ATS. It works for integrated devices w/ ATS.
It might or might not work for discrete devices w/ ATS, but we can
defer re-evaluating the gain vs. software complexity of async flush 
until we see many discrete devices breaking the timeout assumptions 
in the future.
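
The proposed spin-timeout behavior could be sketched as below. This 
is a minimal illustration, not the actual Xen VT-d code: every name 
(flush_ops, spin_wait_flush, the sim_* helpers) is hypothetical, and 
the simulated device with a fake clock that advances 1us per read 
exists only so the logic can be exercised outside a hypervisor.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical and only illustrate the proposal. */
enum flush_result { FLUSH_DONE, FLUSH_TIMED_OUT };

struct flush_ops {
    bool     (*flush_complete)(void *ctx); /* poll the completion status */
    uint64_t (*now_us)(void *ctx);         /* monotonic time in microseconds */
    void     (*kill_domain)(void *ctx);    /* kill the VM owning the device */
};

/* Spin until the flush completes or timeout_us elapses; on timeout,
 * kill the owning VM as in step 2) of the proposal. */
static enum flush_result
spin_wait_flush(const struct flush_ops *ops, void *ctx, uint64_t timeout_us)
{
    uint64_t start = ops->now_us(ctx);

    while (!ops->flush_complete(ctx)) {
        if (ops->now_us(ctx) - start >= timeout_us) {
            ops->kill_domain(ctx);
            return FLUSH_TIMED_OUT;
        }
    }
    return FLUSH_DONE;
}

/* Simulated device: the flush completes once the fake clock reaches
 * done_at_us; the clock advances by 1us on every time read. */
struct sim_dev {
    uint64_t clock_us;
    uint64_t done_at_us;
    bool     killed;
};

static bool sim_complete(void *ctx)
{
    struct sim_dev *d = ctx;
    return d->clock_us >= d->done_at_us;
}

static uint64_t sim_now(void *ctx)
{
    struct sim_dev *d = ctx;
    return d->clock_us++;
}

static void sim_kill(void *ctx)
{
    struct sim_dev *d = ctx;
    d->killed = true;
}

static const struct flush_ops sim_ops = { sim_complete, sim_now, sim_kill };

/* Run one simulated flush that needs done_at_us to complete. */
static enum flush_result sim_flush(uint64_t done_at_us, uint64_t timeout_us)
{
    struct sim_dev d = { 0, done_at_us, false };
    return spin_wait_flush(&sim_ops, &d, timeout_us);
}
```

For instance, sim_flush(8, 1000) models a flush close to the measured 
iec worst case and completes normally, while sim_flush(60000000, 1000) 
models a device using the full 60s ATS allowance and ends in the 
kill path.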

Thoughts?

----
<detail data>
Operation   Min(us)    Max(us)    Average(us)
context        5.24       5.49       5.36
iotlb          1.90       2.07       2.03
iec            5.54       7.86       6.58
wbinvd      2721.42    4655.71    3571.43

Platform info:
1. Base Board Information
        Manufacturer: Intel Corporation
        Product Name: S2600CP
        Version: E99552-561

2. CPU:
        cpu family : 6
        model : 62
        model name : Genuine Intel(R) CPU  @ 2.80GHz

Thanks
Kevin
