
Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)



>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long
>>    as we don't hold the memory hotplug lock, memory can get offlined and
>>    removed at any time. Racy.
> 
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookups
> +  use rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
> 
>> c) We mix ZONE-specific stuff into the core. It should be "just
>>    another zone".
> 
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?

Most users should not care about the zone. pfn_active() would be enough
in most situations, especially for most PFN walkers - "this memmap is
valid and, e.g., contains a valid zone ...".
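
For example, a typical walker only has to check that the memmap is valid
before touching it. A minimal sketch (pfn_active() is the helper
proposed below, it does not exist yet):

        unsigned long pfn;
        struct page *page;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                if (!pfn_active(pfn))
                        continue;
                /* memmap is valid and initialized, safe to touch */
                page = pfn_to_page(pfn);
                /* e.g., look at page_zone(page) */
        }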

> 
>>
>> What I propose instead (already discussed in
>> https://lkml.org/lkml/2019/10/10/87):
> 
> Sorry I missed this earlier...
> 
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
> 
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?

That a real "pfn to zone" lookup without going via the struct page will
require to have more than just a single bitmap. IMHO, keeping the
information at a single place (memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure if there is a valid memmap".

>>
>> In particular, driver-reserved device memory will never be set active
>> in the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is
>> ugly. I do wonder if we can use RCU:
> 
> Ah, yes, exactly what I was thinking above.
> 
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>>         /* the memmap is guaranteed to remain active under RCU */
>>         rcu_read_lock();
>>         if (pfn_active(random_pfn)) {
>>                 page = pfn_to_page(random_pfn);
>>                 ... use the page, stays valid
>>         }
>>         rcu_read_unlock();
>>
>> Memory offlining/memremap code:
>>
>>         /* clears the subsection bits atomically */
>>         set_subsections_inactive(pfn, nr_pages);
>>         synchronize_rcu();
>>         /* all users saw the bitmap update, we can invalidate the memmap */
>>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> 
> Looks good to me.
> 
>>
>>>
>>>>
>>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>>> on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of only 4 (!) users.
>> Everything else can rely on pfn_to_online_page() alone.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page(), so I am not sure about references
>>   (I assume this should be fine, as we don't come via random PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>>   pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
> 
> ...but more is_zone_device_page() doesn't "finally clean this up".
> As we discussed above, it's the missing locking that's the real
> cleanup; the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
rely completely on pfn_to_online_page(). Yes, that needs fixing too,
but that's a different story.

The important part of this patch:

While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
racy as hell (especially while the memmap is still being initialized
concurrently).

This does improve the situation.
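
To make the difference concrete, a simplified before/after from the
point of view of a random-PFN walker:

        /* today (racy): the ZONE_DEVICE memmap - including PG_reserved -
         * might still be under initialization while we test the flag */
        if (PageReserved(pfn_to_page(random_pfn)))
                continue;       /* assume it's not ordinary memory */

        /* with this series: exclude ZONE_DEVICE explicitly instead of
         * relying on PG_reserved being set/stable on device pages */
        if (!pfn_to_online_page(random_pfn))
                continue;       /* offline or ZONE_DEVICE */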

>>
>> However, I think we also have to clarify if we need the change in 3 at all.
>> It comes from
>>
>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>> Author: Kees Cook <keescook@xxxxxxxxxxxx>
>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>
>>     mm: Hardened usercopy
>>
>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>     from Casey Schaufler. Additional non-slab page tests are from
>>     Rik van Riel.
>> [...]
>>     - otherwise, object must not span page allocations (excepting Reserved
>>       and CMA ranges)
>>
>> Not sure if we really have to care about ZONE_DEVICE at this point.
> 
> That check needs to be careful to ignore ZONE_DEVICE pages. There's
> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

Please note that the current check would *forbid* this (AFAIK, for a
single heap object). As discussed in the relevant patch, we might be
able to just stop doing that and limit the check to real PG_reserved
pages (without ZONE_DEVICE). I'd be happy to not introduce new
is_zone_device_page() users.
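
For reference, the check in question looks roughly like this (condensed
from check_page_span() in mm/usercopy.c):

        is_reserved = PageReserved(page);
        is_cma = is_migrate_cma_page(page);
        if (!is_reserved && !is_cma)
                return "<spans multiple pages>";

        for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
                page = virt_to_head_page(ptr);
                if (is_reserved && !PageReserved(page))
                        return "<spans Reserved and non-Reserved pages>";
                if (is_cma && !is_migrate_cma_page(page))
                        return "<spans CMA and non-CMA pages>";
        }

Today, device pages are PG_reserved, so a span into ordinary memory
trips the Reserved/non-Reserved mismatch; once they no longer are, it
would trip "<spans multiple pages>" instead. Forbidden either way,
unless we relax the check.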

-- 

Thanks,

David / dhildenb

