
Re: [Xen-devel] Shattering superpages impact on IOMMU in Xen



On Tue, Apr 4, 2017 at 12:28 PM, Oleksandr Tyshchenko
<olekstysh@xxxxxxxxx> wrote:
> Hi, Stefano.
>
> On Mon, Apr 3, 2017 at 11:33 PM, Stefano Stabellini
> <sstabellini@xxxxxxxxxx> wrote:
>> On Mon, 3 Apr 2017, Oleksandr Tyshchenko wrote:
>>> On Mon, Apr 3, 2017 at 9:06 PM, Julien Grall <julien.grall@xxxxxxx> wrote:
>>> > Hi Andrew,
>>> >
>>> >
>>> > On 03/04/17 18:16, Andrew Cooper wrote:
>>> >>
>>> >> On 03/04/17 18:02, Julien Grall wrote:
>>> >>>
>>> >>> Hi Andrew,
>>> >>>
>>> >>> On 03/04/17 17:42, Andrew Cooper wrote:
>>> >>>>
>>> >>>> On 03/04/17 17:24, Oleksandr Tyshchenko wrote:
>>> >>>>>
>>> >>>>> Hi, all.
>>> >>>>>
>>> >>>>> While playing with a non-shared IOMMU in Xen on ARM I noticed an
>>> >>>>> interesting thing: superpages get shattered during the domain life
>>> >>>>> cycle.
>>> >>>>> This is the result of mapping foreign pages, ballooning memory, the
>>> >>>>> domain mapping Xen shared pages, etc.
>>> >>>>> I am not bothered by the memory fragmentation at the moment, but
>>> >>>>> shattering does bother me from the IOMMU point of view.
>>> >>>>> As Xen owns the IOMMU, it may manipulate the IOMMU page tables while
>>> >>>>> a passthrough/protected device is doing DMA in Linux. It is hard to
>>> >>>>> detect when no DMA transaction is in progress in order to prevent
>>> >>>>> this race. So, if a device has an in-flight transaction when we
>>> >>>>> change an IOMMU mapping, we may get into trouble. Unfortunately, the
>>> >>>>> faulting transaction cannot be restarted in all cases. The chance of
>>> >>>>> hitting the problem increases during shattering.
>>> >>>>>
>>> >>>>> I ran the following test:
>>> >>>>> The dom0 on my setup contains an Ethernet IP that is protected by
>>> >>>>> the IOMMU. What is more, as the IOMMU I am playing with supports
>>> >>>>> superpages (2M, 1G), the IOMMU driver takes these capabilities into
>>> >>>>> account when building page tables. As I gave 256 MB to dom0, the
>>> >>>>> IOMMU mapping was built from 2M memory blocks only. As I am using
>>> >>>>> NFS for both dom0 and domU, the Ethernet IP performs DMA
>>> >>>>> transactions almost all the time.
>>> >>>>> Sometimes I see IOMMU page faults while creating a guest domain. I
>>> >>>>> think they happen while Xen is shattering 2M mappings into 4K
>>> >>>>> mappings (it unmaps dom0 pages one 4K page at a time, then maps
>>> >>>>> domU pages there for copying domU images).
>>> >>>>> However, I don't see any page faults when the IOMMU page table is
>>> >>>>> built from 4K pages only.
>>> >>>>>
>>> >>>>> I had a talk with Julien on IRC and we came to the conclusion that
>>> >>>>> the safest way would be to use 4K pages to prevent shattering, i.e.
>>> >>>>> the IOMMU driver shouldn't report the superpage capability.
>>> >>>>> On the other hand, if we build the IOMMU tables from 4K pages only,
>>> >>>>> we will get a performance drop (when building and walking page
>>> >>>>> tables), TLB pressure, etc.
>>> >>>>> Another possible solution Julien suggested is to always balloon in
>>> >>>>> 2M or 1G chunks, never 4K. That would help us avoid the shattering
>>> >>>>> effect.
>>> >>>>> The discussion has moved to the ML since this seems to be a generic
>>> >>>>> issue and the right solution should be thought through.
>>> >>>>>
>>> >>>>> What do you think is the right way forward? Use 4K pages and not
>>> >>>>> bother with shattering, or try to optimize? And if the idea of
>>> >>>>> making the balloon mechanism smarter makes sense, how do we teach
>>> >>>>> the balloon driver to do so?
>>> >>>>> Thank you.
>>> >>>>
>>> >>>>
>>> >>>> Ballooning and foreign mappings are terrible for trying to retain
>>> >>>> superpage mappings.  No OS, not even Linux, can sensibly provide victim
>>> >>>> pages in a useful way to avoid shattering.
>>> >>>>
>>> >>>> If you care about performance, don't ever balloon.  Foreign mappings in
>>> >>>> translated guests should start from the top of RAM, and work upwards.
>>> >>>
>>> >>>
>>> >>> I am not sure I understand this. Can you elaborate?
>>> >>
>>> >>
>>> >> I am not sure what is unclear.  Handing random frames of RAM back to the
>>> >> hypervisor is what exacerbates host superpage fragmentation, and all
>>> >> balloon drivers currently do it.
>>> >>
>>> >> If you want to avoid host superpage fragmentation, don't use a
>>> >> scattergun approach of handing frames back to Xen.  However, because
>>> >> even Linux doesn't provide enough hooks into the physical memory
>>> >> management logic, the only solution is to not balloon at all, and to use
>>> >> already-unoccupied frames for foreign mappings.
>>> >
>>> >
>>> > Do you have any pointers to the Linux code?
>>> >
>>> >
>>> >>
>>> >>>
>>> >>>>
>>> >>>>
>>> >>>> As for the IOMMU specifically, things are rather easier.  It is the
>>> >>>> guest's responsibility to ensure that frames offered up for
>>> >>>> ballooning or foreign mappings are unused.  Therefore, if anything
>>> >>>> cares about the specific 4K region becoming non-present in the IOMMU
>>> >>>> mappings, it is the guest kernel's fault for offering up a frame
>>> >>>> already in use.
>>> >>>>
>>> >>>> For the shattering, however, it is Xen's responsibility to ensure
>>> >>>> that all other mappings stay valid at all points.  The correct way
>>> >>>> to do this is to construct a new L1 table, mirroring the L2
>>> >>>> superpage but lacking the specific 4K mapping in question, then
>>> >>>> atomically replace the L2 superpage entry with the new L1 table,
>>> >>>> then issue an IOMMU TLB invalidation to remove any cached mappings.
>>> >>>>
>>> >>>> By following that procedure, all DMA within the 2M region, but not
>>> >>>> hitting the 4K frame, won't observe any interim lack of mappings.
>>> >>>> It appears from your description that Xen isn't following that
>>> >>>> procedure.
>>> >>>
>>> >>>
>>> >>> Xen is following what the ARM ARM mandates. For shattering a page
>>> >>> table entry, we have to follow the break-before-make sequence, i.e.:
>>> >>>     - Invalidate the L2 entry
>>> >>>     - Flush the TLBs
>>> >>>     - Add the new L1 table
>>> >>> See D4-1816 in ARM DDI 0487A.k_iss10775 for details. So we end up
>>> >>> with a small window where there is no valid mapping. It is easy to
>>> >>> trap a data abort from the processor and restart it, but not for
>>> >>> device memory transactions.
>>> >>>
>>> >>> By default Xen shares the stage-2 page tables between the IOMMU and
>>> >>> the MMU. However, from the discussion I had with Oleksandr, they are
>>> >>> not sharing page tables and still see the problem. I am not sure how
>>> >>> they are updating the page table here. Oleksandr, can you provide
>>> >>> more details?
>>> >>
>>> >>
>>> >> Are you saying that ARM has no way of making atomic updates to the IOMMU
>>> >> mappings?  (How do I get access to that document?  Google gets me to
>>> >>
>>> >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.architecture.reference/index.html,
>>> >> but
>>> >> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a.k/index.html
>>> >> which looks like the document you specified results in 404.)
>>> >
>>> >
>>> > Here is a link; I am not sure why Google does not find it:
>>> >
>>> > http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a.k_10775/index.html
>>> >
>>> >>
>>> >> If so, that is an architecture bug IMO.  By design, the IOMMU is out of
>>> >> control of guest software, and the hypervisor should be able to make
>>> >> atomic modifications without guest cooperation.
>>> >
>>> >
>>> > I think you misread what I meant: the IOMMU supports atomic
>>> > operations. However, if you share the page tables we have to apply
>>> > break-before-make when shattering a superpage. This is mandatory if
>>> > you want to get Xen running on all micro-architectures.
>>> >
>>> > Some IOMMUs may cope with BBM, some may not. I haven't seen any issue
>>> > so far (which does not mean there are none).
>>> >
>>> > The IOMMU used by Oleksandr (e.g. the VMSA IPMMU) is an IP from
>>> > Renesas which I have never used myself. In his case he needs separate
>>> > page tables because the layouts are not the same.
>>> >
>>> > Oleksandr, looking at the code you provided, the superpages are split
>>> > the way Andrew said, i.e.:
>>> >         1) allocate a level 3 table minus the 4K mapping
>>> >         2) replace the level 2 entry with the new table
>>> >
>>> > Am I right?
>>>
>>> It seems so, yes. When walking the page table down while trying to
>>> unmap, we bump into a leaf entry (the 2M mapping), so the 2M-minus-4K
>>> mappings are inserted at the next level and after that the page table
>>> entry is replaced.
>>
>> Let me premise that Andrew pointed out well what the right approach to
>> dealing with this issue should be. However, if we have to use
>> break-before-make for IOMMU pagetables, then it means we cannot do
>> atomic updates to IOMMU mappings, as Andrew wrote. Therefore, we have
>> to make a choice: we either disable superpage IOMMU mappings or
>> ballooning. I would disable IOMMU superpage mappings, on the grounds
>> that supporting superpage mappings without supporting atomic shattering
>> or restartable transactions is not really supporting superpage
>> mappings.
>
> Sounds reasonable. As Julien mentioned too, "using 4K pages only" is
> the safest way.
> At least until I find the reason why DMA faults take place despite the
> fact that the shattering is done in an atomic way.
>
>>
>> However, you are not doing break-before-make here. I would investigate
>> if break-before-make is required by VMSA-IPMMU. If it is not required,
>> why are you seeing DMA faults?
>
> Unfortunately, I can't say anything about the break-before-make
> sequence for the IPMMU at the moment. The TRM says nothing about it.
>
> --
> Regards,
>
> Oleksandr Tyshchenko

Hi, guys.

It seems it was just my fault. The issue wasn't exactly in shattering;
shattering only increased the probability of IOMMU page faults
occurring. I wasn't doing a clean_dcache on the page table entry after
updating it. With clean_dcache added, I don't see page faults when
shattering superpages!
BTW, can I configure the domheap pages (which I am using for the IOMMU
page tables) to be uncached? What do you think?

-- 
Regards,

Oleksandr Tyshchenko

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

