
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.



Snabbswitch (a virtualized switch) also encountered a similar problem:
https://groups.google.com/forum/#!topic/snabb-devel/xX0yFzeXylI

Thanks
Anshul Makkar

-----Original Message-----
From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx] 
Sent: 01 December 2015 10:34
To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Cc: Jan Beulich <JBeulich@xxxxxxxx>; Kevin Tian <kevin.tian@xxxxxxxxx>; 
yang.z.zhang@xxxxxxxxx; Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>; Anshul 
Makkar <anshul.makkar@xxxxxxxxxx>; xen-devel@xxxxxxxxxxxxx
Subject: Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for 
Sandybridge and earlier processors.

On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do 
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes 
>>>>> substantially worse performance than 4K translations.
>>>> Btw - how does this get explained? At first glance, even if 2MB
>>>> translations don't get entered into the TLB, it should still be one
>>>> less page table level for the IOMMU to walk, and should hence
>>>> nevertheless be a benefit. Yet you report even _substantially_ worse
>>>> performance.
>>> There is an IOTLB for 4K translations, so if you only use 4K
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2MB translation then a page table walk has to be
>>> performed every time there's a DMA access to that region of the BFN
>>> address space.
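
A rough back-of-the-envelope model of what Malcolm describes (the walk
depths, access count and miss behaviour below are illustrative assumptions,
not measured values):

    /* Illustrative model only: compare page-table memory references for a
     * stream of DMA accesses when 4K leaf translations are cached in the
     * IOTLB versus 2MB leaf translations that are never cached. */
    #include <stdio.h>

    #define WALK_LEVELS_4K  4   /* assumed 4-level walk to a 4K leaf */
    #define WALK_LEVELS_2M  3   /* assumed 3-level walk to a 2MB leaf */

    int main(void)
    {
        unsigned long accesses = 1000;  /* DMA accesses to one mapped region */

        /* 4K case: the first access misses the IOTLB and walks; later
         * accesses to the same page hit the cached translation (no walk). */
        unsigned long refs_4k = WALK_LEVELS_4K;

        /* 2MB case: the translation is never cached, so every access pays
         * a full (if one level shorter) walk. */
        unsigned long refs_2m = accesses * WALK_LEVELS_2M;

        printf("4K (cached):    %lu page-table reads\n", refs_4k);
        printf("2MB (uncached): %lu page-table reads\n", refs_2m);
        return 0;
    }

The shorter walk for a 2MB leaf never makes up for losing the IOTLB hit on
every subsequent access.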
>> Also remember that a high-level DMA access (from the point of view of
>> a driver) will be fragmented at the PCIe max packet size, which is
>> typically 256 bytes.
>>
>> So by not caching the 2MB translation, a DMA access of 4K may undergo
>> 16 pagetable walks, one for each PCIe packet.
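
The arithmetic behind that figure: with a 256-byte PCIe max payload, a 4K
driver-level transfer becomes 4096 / 256 = 16 packets, each needing its own
translation when nothing is cached. A small sketch (the payload and transfer
sizes are just the example values from above):

    /* Sketch of the fragmentation arithmetic: number of PCIe packets
     * (and hence worst-case uncached page-table walks) per DMA transfer. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int max_payload = 256;   /* typical PCIe max payload, bytes */
        unsigned int dma_size    = 4096;  /* one 4K DMA access */

        /* Round up: each packet carries at most max_payload bytes. */
        unsigned int packets = (dma_size + max_payload - 1) / max_payload;

        printf("%u-byte DMA -> %u PCIe packets -> up to %u walks\n",
               dma_size, packets, packets);
        return 0;
    }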
>>
>> We observed that using 2MB mappings results in a 40% overhead,
>> compared to using 4K mappings, from the point of view of a sample
>> network workload.
> How did you observe this? I am mighty curious what kind of performance
> tools you used to find this, as I would love to figure out whether some
> of the issues we have seen are related to it.

The 40% difference is just in terms of network throughput of a VF, given a 
workload which can normally saturate line rate on the card.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

