
Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

To: Dan Williams <dan.j.williams@xxxxxxxxx>
From: David Hildenbrand <david@xxxxxxxxxx>
Date: Wed, 23 Oct 2019 09:26:17 +0200
Cc: Kate Stewart <kstewart@xxxxxxxxxxxxxxxxxxx>, linux-hyperv@xxxxxxxxxxxxxxx, Michal Hocko <mhocko@xxxxxxxx>, Radim Krčmář <rkrcmar@xxxxxxxxxx>, KVM list <kvm@xxxxxxxxxxxxxxx>, Pavel Tatashin <pavel.tatashin@xxxxxxxxxxxxx>, KarimAllah Ahmed <karahmed@xxxxxxxxx>, Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, Alexander Duyck <alexander.duyck@xxxxxxxxx>, Michal Hocko <mhocko@xxxxxxxxxx>, Paul Mackerras <paulus@xxxxxxxxxx>, Linux MM <linux-mm@xxxxxxxxx>, Paul Mackerras <paulus@xxxxxxxxx>, Michael Ellerman <mpe@xxxxxxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Wanpeng Li <wanpengli@xxxxxxxxxxx>, "K. Y. Srinivasan" <kys@xxxxxxxxxxxxx>, Fabio Estevam <festevam@xxxxxxxxx>, Ben Chan <benchan@xxxxxxxxxxxx>, Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>, devel@xxxxxxxxxxxxxxxxxxxx, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>, "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx>, Joerg Roedel <joro@xxxxxxxxxx>, X86 ML <x86@xxxxxxxxxx>, YueHaibing <yuehaibing@xxxxxxxxxx>, Mike Rapoport <rppt@xxxxxxxxxxxxx>, Madhumitha Prabakaran <madhumithabiw@xxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Vlastimil Babka <vbabka@xxxxxxx>, Nishka Dasgupta <nishkadg.linux@xxxxxxxxx>, Anthony Yznaga <anthony.yznaga@xxxxxxxxxx>, Oscar Salvador <osalvador@xxxxxxx>, Dan Carpenter <dan.carpenter@xxxxxxxxxx>, "Isaac J. 
Manjarres" <isaacm@xxxxxxxxxxxxxx>, Matt Sickler <Matt.Sickler@xxxxxxxxxxxxxx>, Kees Cook <keescook@xxxxxxxxxxxx>, Anshuman Khandual <anshuman.khandual@xxxxxxx>, Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>, Simon Sandström <simon@xxxxxxxxxx>, Sasha Levin <sashal@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, kvm-ppc@xxxxxxxxxxxxxxx, Qian Cai <cai@xxxxxx>, Alex Williamson <alex.williamson@xxxxxxxxxx>, Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Nicholas Piggin <npiggin@xxxxxxxxx>, Andy Lutomirski <luto@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Todd Poynor <toddpoynor@xxxxxxxxxx>, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>, Allison Randal <allison@xxxxxxxxxxx>, Jim Mattson <jmattson@xxxxxxxxxx>, Christophe Leroy <christophe.leroy@xxxxxx>, Vandana BN <bnvandana@xxxxxxxxx>, Jeremy Sowden <jeremy@xxxxxxxxxx>, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>, Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>, Cornelia Huck <cohuck@xxxxxxxxxx>, Linux Kernel Mailing List <linux-kernel@xxxxxxxxxxxxxxx>, Sean Christopherson <sean.j.christopherson@xxxxxxxxx>, Rob Springer <rspringer@xxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, Johannes Weiner <hannes@xxxxxxxxxxx>, Paolo Bonzini <pbonzini@xxxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, linuxppc-dev <linuxppc-dev@xxxxxxxxxxxxxxxx>
Delivery-date: Wed, 23 Oct 2019 07:27:08 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 22.10.19 23:54, Dan Williams wrote:
> Hi David,
> 
> Thanks for tackling this!

Thanks for having a look :)

[...]


>> I am probably a little bit too careful (but I don't want to break things).
>> In most places (besides KVM and vfio that are nuts), the
>> pfn_to_online_page() check could most probably be avoided by a
>> is_zone_device_page() check. However, I usually get suspicious when I see
>> a pfn_valid() check (especially after I learned that people mmap parts of
>> /dev/mem into user space, including memory without memmaps. Also, people
>> could memmap offline memory blocks this way :/). As long as this does not
>> hurt performance, I think we should rather do it the clean way.
> 
> I'm concerned about using is_zone_device_page() in places that are not
> known to already have a reference to the page. Here's an audit of
> current usages, and the ones I think need to be cleaned up. The "unsafe"
> ones do not appear to have any protections against the device page
> being removed (get_dev_pagemap()). Yes, some of these were added by
> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> pages into anonymous memory paths and I'm not up to speed on how it
> guarantees 'struct page' validity vs device shutdown without using
> get_dev_pagemap().
> 
> smaps_pmd_entry(): unsafe
> 
> put_devmap_managed_page(): safe, page reference is held
> 
> is_device_private_page(): safe? gpu driver manages private page lifetime
> 
> is_pci_p2pdma_page(): safe, page reference is held
> 
> uncharge_page(): unsafe? HMM
> 
> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> 
> soft_offline_page(): unsafe
> 
> remove_migration_pte(): unsafe? HMM
> 
> move_to_new_page(): unsafe? HMM
> 
> migrate_vma_pages() and helpers: unsafe? HMM
> 
> try_to_unmap_one(): unsafe? HMM
> 
> __put_page(): safe
> 
> release_pages(): safe
> 
> I'm hoping all the HMM ones can be converted to
> is_device_private_page() directly and have that routine grow a nice
> comment about how it knows it can always safely de-reference its @page
> argument.
> 
> For the rest I'd like to propose that we add a facility to determine
> ZONE_DEVICE by pfn rather than page. The most straightforward way I
> can think of would be to just add another bitmap to mem_section_usage
> to indicate if a subsection is ZONE_DEVICE or not.

(it's a somewhat unrelated bigger discussion, but we can start discussing it in 
this thread)

I dislike this for three reasons:

a) It does not protect against any races; it does not really improve things.
b) We have the exact same problem with pfn_to_online_page(). As long as we
   don't hold the memory hotplug lock, memory can get offlined and removed
   at any time. Racy.
c) We mix ZONE-specific stuff into the core. It should be "just another zone".

What I propose instead (already discussed in 
https://lkml.org/lkml/2019/10/10/87)

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
2. Convert SECTION_IS_ACTIVE to a subsection bitmap
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

Especially, driver-reserved device memory will not get set active in
the subsection bitmap.
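
To make the pfn_active() / pfn_to_online_page() / pfn_to_device_page() part
of the list above more concrete, here is a minimal userspace sketch in C. It
is not kernel code: every model_/MODEL_ name is hypothetical, the 64-entry
bitmap and the 512-page subsection size are assumptions chosen for
illustration, and the zone is stored per subsection only to keep the model
small. The two lookup helpers return a bool instead of a struct page pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not kernel API: sizes are assumptions. */
#define MODEL_PAGES_PER_SUBSECTION 512UL
#define MODEL_NR_SUBSECTIONS       64UL

enum model_zone { MODEL_ZONE_NORMAL, MODEL_ZONE_DEVICE };

struct model_memmap {
	uint64_t active;                            /* one bit per subsection */
	enum model_zone zone[MODEL_NR_SUBSECTIONS]; /* zone of each subsection */
};

/* Mark every subsection touched by [pfn, pfn + nr_pages) as active. */
static void model_set_active(struct model_memmap *m, unsigned long pfn,
			     unsigned long nr_pages, enum model_zone zone)
{
	if (nr_pages == 0)
		return;
	unsigned long first = pfn / MODEL_PAGES_PER_SUBSECTION;
	unsigned long last = (pfn + nr_pages - 1) / MODEL_PAGES_PER_SUBSECTION;

	for (unsigned long i = first; i <= last && i < MODEL_NR_SUBSECTIONS; i++) {
		m->zone[i] = zone;
		m->active |= UINT64_C(1) << i;
	}
}

/* Clear the active bit of every subsection touched by the range. */
static void model_set_inactive(struct model_memmap *m, unsigned long pfn,
			       unsigned long nr_pages)
{
	if (nr_pages == 0)
		return;
	unsigned long first = pfn / MODEL_PAGES_PER_SUBSECTION;
	unsigned long last = (pfn + nr_pages - 1) / MODEL_PAGES_PER_SUBSECTION;

	for (unsigned long i = first; i <= last && i < MODEL_NR_SUBSECTIONS; i++)
		m->active &= ~(UINT64_C(1) << i);
}

/* pfn_active(): is the memmap for this pfn initialized and usable? */
static bool model_pfn_active(const struct model_memmap *m, unsigned long pfn)
{
	unsigned long idx = pfn / MODEL_PAGES_PER_SUBSECTION;

	if (idx >= MODEL_NR_SUBSECTIONS)
		return false;
	return (m->active >> idx) & 1;
}

/* pfn_to_online_page() analogue: active and not ZONE_DEVICE. */
static bool model_pfn_is_online(const struct model_memmap *m, unsigned long pfn)
{
	return model_pfn_active(m, pfn) &&
	       m->zone[pfn / MODEL_PAGES_PER_SUBSECTION] != MODEL_ZONE_DEVICE;
}

/* pfn_to_device_page() analogue: active and ZONE_DEVICE. */
static bool model_pfn_is_device(const struct model_memmap *m, unsigned long pfn)
{
	return model_pfn_active(m, pfn) &&
	       m->zone[pfn / MODEL_PAGES_PER_SUBSECTION] == MODEL_ZONE_DEVICE;
}
```

In this model, driver-reserved device memory is simply never passed to
model_set_active(), so both lookup helpers reject it.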

Now to the race. Taking the memory hotplug lock at random places is ugly. I do
wonder if we can use RCU:

The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():

        /* the memmap is guaranteed to remain active under RCU */
        rcu_read_lock();
        if (pfn_active(random_pfn)) {
                page = pfn_to_page(random_pfn);
                /* ... use the page, it stays valid ... */
        }
        rcu_read_unlock();

Memory offlining/memremap code:

        set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
        synchronize_rcu();
        /* all users saw the bitmap update, we can invalidate the memmap */
        remove_pfn_range_from_zone(zone, pfn, nr_pages);
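
The ordering argument (clear the bit, wait for all readers, only then tear
down the memmap) can be tried out in userspace with the rough C11 sketch
below. It does not use RCU: a plain atomic reader counter stands in for
rcu_read_lock()/rcu_read_unlock(), and a busy-wait on that counter stands in
for synchronize_rcu(). All model_ names are hypothetical and a single flag
represents the whole range.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one hotplugged range, not kernel API. */
struct model_range {
	atomic_bool active;   /* stands in for the subsection bit */
	atomic_int readers;   /* stands in for RCU read-side sections */
	bool memmap_valid;    /* stands in for the memmap contents */
};

static void model_online(struct model_range *r)
{
	r->memmap_valid = true;
	atomic_init(&r->readers, 0);
	atomic_init(&r->active, true);
}

/* Reader side: returns true if the page could be used safely. */
static bool model_reader_use_page(struct model_range *r)
{
	bool used = false;

	atomic_fetch_add(&r->readers, 1);  /* ~ rcu_read_lock() */
	if (atomic_load(&r->active)) {
		/* The writer cannot invalidate the memmap while we are here. */
		used = r->memmap_valid;
	}
	atomic_fetch_sub(&r->readers, 1);  /* ~ rcu_read_unlock() */
	return used;
}

/* Offlining side: same three steps as in the snippet above. */
static void model_offline(struct model_range *r)
{
	atomic_store(&r->active, false);   /* ~ set_subsections_inactive() */
	while (atomic_load(&r->readers) != 0)
		;                          /* ~ synchronize_rcu(), busy-wait */
	r->memmap_valid = false;           /* now safe to invalidate */
}
```

A reader that saw the range active entered before the flag was cleared, so
model_offline() waits for it; a reader that enters later sees the cleared
flag and never touches the memmap. Real RCU provides the same guarantee
without the shared counter, which is why it is attractive here.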

> 
>>
>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>> on x86-64 and PPC.
> 
> I'll give it a spin, but I don't think the kernel wants to grow more
> is_zone_device_page() users.

Let's recap. In this RFC, I introduce a total of 4 (!) users only.
The other parts can rely on pfn_to_online_page() only.

1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
- Basically never used with ZONE_DEVICE.
- We hold a reference!
- All it protects is a SetPageDirty(page);

2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
- Same as 1.

3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
- We come via virt_to_head_page(), not sure about references (I assume
  this should be fine as we don't come via random PFNs)
- We check that we don't mix Reserved (including device memory) and CMA 
  pages when crossing compound pages.
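
For readers not familiar with the check in item 3, the rule can be modelled
in a few lines of userspace C. This is a simplified sketch under stated
assumptions, not the code in mm/usercopy.c: each array entry stands for one
separate page allocation, the enum and function names are hypothetical, and
a page is treated as exactly one of normal, Reserved, or CMA.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of a page allocation, not kernel API. */
enum model_page_kind {
	MODEL_PAGE_NORMAL,
	MODEL_PAGE_RESERVED, /* would include device memory while it is PG_reserved */
	MODEL_PAGE_CMA,
};

/*
 * Returns true if an object covering pages[first..last] is acceptable:
 * it either stays within one allocation, or every allocation it touches
 * is Reserved, or every allocation it touches is CMA.
 */
static bool model_span_allowed(const enum model_page_kind *pages,
			       size_t first, size_t last)
{
	if (first == last)
		return true;            /* does not span allocations */

	enum model_page_kind kind = pages[first];

	if (kind == MODEL_PAGE_NORMAL)
		return false;           /* spanning normal allocations is rejected */

	for (size_t i = first + 1; i <= last; i++) {
		if (pages[i] != kind)
			return false;   /* mixed Reserved/CMA/normal span */
	}
	return true;
}
```

If hotplugged and device pages stop being PG_reserved, they move from the
MODEL_PAGE_RESERVED case to the MODEL_PAGE_NORMAL case in this model, which
is the behavior change the patch has to account for.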

I think we can drop 1. and 2., resulting in a total of 2 new users in
the same context. I think that is totally tolerable to finally clean
this up.


However, I think we also have to clarify if we need the change in 3 at all.
It comes from

commit f5509cc18daa7f82bcc553be70df2117c8eedc16
Author: Kees Cook <keescook@xxxxxxxxxxxx>
Date:   Tue Jun 7 11:05:33 2016 -0700

    mm: Hardened usercopy
    
    This is the start of porting PAX_USERCOPY into the mainline kernel. This
    is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
    work is based on code by PaX Team and Brad Spengler, and an earlier port
    from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
[...]
    - otherwise, object must not span page allocations (excepting Reserved
      and CMA ranges)

Not sure if we really have to care about ZONE_DEVICE at this point.


-- 

Thanks,

David / dhildenb

