Re: [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
* Ryan Roberts <ryan.roberts@xxxxxxx> [250530 10:05]:
> Lazy mmu mode applies to the current task and permits pte modifications
> to be deferred and updated at a later time in a batch to improve
> performance. apply_to_page_range() calls its callback in lazy mmu mode
> and some of those callbacks call into the page allocator to either
> allocate or free pages.
>
> This is problematic with CONFIG_DEBUG_PAGEALLOC because
> debug_pagealloc_[un]map_pages() calls the arch implementation of
> __kernel_map_pages() which must modify the ptes for the linear map.
>
> There are two possibilities at this point:
>
>   - If the arch implementation modifies the ptes directly without first
>     entering lazy mmu mode, the pte modifications may get deferred until
>     the existing lazy mmu mode is exited. This could result in taking
>     spurious faults for example.
>
>   - If the arch implementation enters a nested lazy mmu mode before
>     modification of the ptes (many arches use apply_to_page_range()),
>     then the linear map updates will definitely be applied upon leaving
>     the inner lazy mmu mode. But because lazy mmu mode does not support
>     nesting, the remainder of the outer user is no longer in lazy mmu
>     mode and the optimization opportunity is lost.
>
> So let's just ensure that the page allocator is never called from within
> lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and
> apply_to_existing_page_range() are introduced which don't enter lazy mmu
> mode. Then users which need to call into the page allocator within their
> callback are updated to use the _nolazy variants.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> ---
>  include/linux/mm.h |  6 ++++++
>  kernel/bpf/arena.c |  6 +++---
>  mm/kasan/shadow.c  |  2 +-
>  mm/memory.c        | 54 +++++++++++++++++++++++++++++++++++-----------
>  4 files changed, 51 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e51dba8398f7..11cae6ce04ff 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
>  typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>                                 unsigned long size, pte_fn_t fn, void *data);
> +extern int apply_to_page_range_nolazy(struct mm_struct *mm,
> +                               unsigned long address, unsigned long size,
> +                               pte_fn_t fn, void *data);

We are removing externs as things are edited, so probably drop them here.

>  extern int apply_to_existing_page_range(struct mm_struct *mm,
>                                 unsigned long address, unsigned long size,
>                                 pte_fn_t fn, void *data);
> +extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +                               unsigned long address, unsigned long size,
> +                               pte_fn_t fn, void *data);
>
>  #ifdef CONFIG_PAGE_POISONING
>  extern void __kernel_poison_pages(struct page *page, int numpages);
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 0d56cea71602..ca833cfeefb7 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map)
>         /*
>          * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
>          * It unmaps everything from vmalloc area and clears pgtables.
> -        * Call apply_to_existing_page_range() first to find populated ptes and
> -        * free those pages.
> +        * Call apply_to_existing_page_range_nolazy() first to find populated
> +        * ptes and free those pages.
>          */
> -       apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +       apply_to_existing_page_range_nolazy(&init_mm, bpf_arena_get_kern_vm_start(arena),
>                                      KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
>         free_vm_area(arena->kern_vm);
>         range_tree_destroy(&arena->rt);
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..2325c5166c3a 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
>
>
>         if (flags & KASAN_VMALLOC_PAGE_RANGE)
> -               apply_to_existing_page_range(&init_mm,
> +               apply_to_existing_page_range_nolazy(&init_mm,
>                                         (unsigned long)shadow_start,
>                                         size, kasan_depopulate_vmalloc_pte,
>                                         NULL);
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..24436074ce48 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory);
>  static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>                               unsigned long addr, unsigned long end,
>                               pte_fn_t fn, void *data, bool create,
> -                             pgtbl_mod_mask *mask)
> +                             pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>         pte_t *pte, *mapped_pte;
>         int err = 0;
> @@ -2933,7 +2933,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>                 return -EINVAL;
>         }
>
> -       arch_enter_lazy_mmu_mode();
> +       if (lazy_mmu)
> +               arch_enter_lazy_mmu_mode();
>
>         if (fn) {
>                 do {
> @@ -2946,7 +2947,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>         }
>         *mask |= PGTBL_PTE_MODIFIED;
>
> -       arch_leave_lazy_mmu_mode();
> +       if (lazy_mmu)
> +               arch_leave_lazy_mmu_mode();
>
>         if (mm != &init_mm)
>                 pte_unmap_unlock(mapped_pte, ptl);
> @@ -2956,7 +2958,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>                               unsigned long addr, unsigned long end,
>                               pte_fn_t fn, void *data, bool create,
> -                             pgtbl_mod_mask *mask)
> +                             pgtbl_mod_mask *mask, bool lazy_mmu)

I am having a hard time understanding why other lazy mmus were more
self-contained, but arm has added arguments to generic code?

>  {
>         pmd_t *pmd;
>         unsigned long next;
> @@ -2983,7 +2985,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>                         pmd_clear_bad(pmd);
>                 }
>                 err = apply_to_pte_range(mm, pmd, addr, next,
> -                                        fn, data, create, mask);
> +                                        fn, data, create, mask, lazy_mmu);
>                 if (err)
>                         break;
>         } while (pmd++, addr = next, addr != end);
> @@ -2994,7 +2996,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>                               unsigned long addr, unsigned long end,
>                               pte_fn_t fn, void *data, bool create,
> -                             pgtbl_mod_mask *mask)
> +                             pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>         pud_t *pud;
>         unsigned long next;
> @@ -3019,7 +3021,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>                         pud_clear_bad(pud);
>                 }
>                 err = apply_to_pmd_range(mm, pud, addr, next,
> -                                        fn, data, create, mask);
> +                                        fn, data, create, mask, lazy_mmu);
>                 if (err)
>                         break;
>         } while (pud++, addr = next, addr != end);
> @@ -3030,7 +3032,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>                               unsigned long addr, unsigned long end,
>                               pte_fn_t fn, void *data, bool create,
> -                             pgtbl_mod_mask *mask)
> +                             pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>         p4d_t *p4d;
>         unsigned long next;
> @@ -3055,7 +3057,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>                         p4d_clear_bad(p4d);
>                 }
>                 err = apply_to_pud_range(mm, p4d, addr, next,
> -                                        fn, data, create, mask);
> +                                        fn, data, create, mask, lazy_mmu);
>                 if (err)
>                         break;
>         } while (p4d++, addr = next, addr != end);
> @@ -3065,7 +3067,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>
>  static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>                                  unsigned long size, pte_fn_t fn,
> -                                void *data, bool create)
> +                                void *data, bool create, bool lazy_mmu)
>  {
>         pgd_t *pgd;
>         unsigned long start = addr, next;
> @@ -3091,7 +3093,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>                         pgd_clear_bad(pgd);
>                 }
>                 err = apply_to_p4d_range(mm, pgd, addr, next,
> -                                        fn, data, create, &mask);
> +                                        fn, data, create, &mask, lazy_mmu);

This is annoying.  We now have a bool, bool passed through with mask
inserted in the middle.

>                 if (err)
>                         break;
>         } while (pgd++, addr = next, addr != end);
> @@ -3105,11 +3107,14 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  /*
>   * Scan a region of virtual memory, filling in page tables as necessary
>   * and calling a provided function on each leaf page table.
> + *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
>   */
>  int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>                         unsigned long size, pte_fn_t fn, void *data)
>  {
> -       return __apply_to_page_range(mm, addr, size, fn, data, true);
> +       return __apply_to_page_range(mm, addr, size, fn, data, true, true);

Please add something here to tell me what false, true is.

>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>
> @@ -3117,13 +3122,36 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
>   * Scan a region of virtual memory, calling a provided function on
>   * each leaf page table where it exists.
>   *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
> + *
>   * Unlike apply_to_page_range, this does _not_ fill in page tables
>   * where they are absent.
>   */
>  int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
>                                  unsigned long size, pte_fn_t fn, void *data)
>  {
> -       return __apply_to_page_range(mm, addr, size, fn, data, false);
> +       return __apply_to_page_range(mm, addr, size, fn, data, false, true);

every..

> +}
> +
> +/*
> + * As per apply_to_page_range() but fn() is not called in lazy mmu mode.
> + */
> +int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
> +                              unsigned long size, pte_fn_t fn, void *data)
> +{
> +       return __apply_to_page_range(mm, addr, size, fn, data, true, false);

one...

> +}
> +
> +/*
> + * As per apply_to_existing_page_range() but fn() is not called in lazy mmu
> + * mode.
> + */
> +int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +                              unsigned long addr, unsigned long size,
> +                              pte_fn_t fn, void *data)
> +{
> +       return __apply_to_page_range(mm, addr, size, fn, data, false, false);

adds confusion :)

These wrappers are terrible for readability and annoying for argument
lists too.

Could we do something like the pgtbl_mod_mask or zap_details and pass
through a struct or one unsigned int for create and lazy_mmu?  At least
we'd have better self-documenting code in the wrappers.. and if we ever
need a third boolean, we could avoid multiplying the wrappers again.

WDYT?

Cheers,
Liam
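For illustration, here is roughly what that suggestion could look like. This is
only a sketch, not part of the posted patch, and the APPLY_PGTBL_* flag names
are invented for the example:

/* Sketch only: replace the (create, lazy_mmu) bool pair with one flags word. */
#define APPLY_PGTBL_CREATE      0x1     /* fill in missing page tables */
#define APPLY_PGTBL_LAZY_MMU    0x2     /* run fn() in lazy mmu mode */

/*
 * __apply_to_page_range() and the apply_to_{p4d,pud,pmd,pte}_range() helpers
 * would thread "unsigned int flags" through instead of the two bools, and
 * apply_to_pte_range() would test:
 *
 *      if (flags & APPLY_PGTBL_LAZY_MMU)
 *              arch_enter_lazy_mmu_mode();
 */

int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
                        unsigned long size, pte_fn_t fn, void *data)
{
        return __apply_to_page_range(mm, addr, size, fn, data,
                                     APPLY_PGTBL_CREATE | APPLY_PGTBL_LAZY_MMU);
}

int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
                               unsigned long size, pte_fn_t fn, void *data)
{
        return __apply_to_page_range(mm, addr, size, fn, data,
                                     APPLY_PGTBL_CREATE);
}

int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
                                 unsigned long size, pte_fn_t fn, void *data)
{
        return __apply_to_page_range(mm, addr, size, fn, data,
                                     APPLY_PGTBL_LAZY_MMU);
}

int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
                                        unsigned long addr, unsigned long size,
                                        pte_fn_t fn, void *data)
{
        return __apply_to_page_range(mm, addr, size, fn, data, 0);
}

With something like this the wrappers read as self-documenting flag
combinations rather than a positional bool pair, and a future knob becomes one
more bit instead of another set of wrappers.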