[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers

To: Ryan Roberts <ryan.roberts@xxxxxxx>
From: Alexander Gordeev <agordeev@xxxxxxxxxxxxx>
Date: Tue, 11 Nov 2025 09:01:52 +0100
Cc: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@xxxxxxxxx>, Kevin Brodsky <kevin.brodsky@xxxxxxx>, linux-mm@xxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, Andreas Larsson <andreas@xxxxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Catalin Marinas <catalin.marinas@xxxxxxx>, Christophe Leroy <christophe.leroy@xxxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, "David S. Miller" <davem@xxxxxxxxxxxxx>, David Woodhouse <dwmw2@xxxxxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Jann Horn <jannh@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>, Madhavan Srinivasan <maddy@xxxxxxxxxxxxx>, Michael Ellerman <mpe@xxxxxxxxxxxxxx>, Michal Hocko <mhocko@xxxxxxxx>, Mike Rapoport <rppt@xxxxxxxxxx>, Nicholas Piggin <npiggin@xxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, Suren Baghdasaryan <surenb@xxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, Vlastimil Babka <vbabka@xxxxxxx>, Will Deacon <will@xxxxxxxxxx>, Yeoreum Yun <yeoreum.yun@xxxxxxx>, linux-arm-kernel@xxxxxxxxxxxxxxxxxxx, linuxppc-dev@xxxxxxxxxxxxxxxx, sparclinux@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, x86@xxxxxxxxxx
Delivery-date: Tue, 11 Nov 2025 08:02:57 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Nov 10, 2025 at 09:19:40AM +0000, Ryan Roberts wrote:
> On 10/11/2025 08:11, Alexander Gordeev wrote:
> > On Fri, Nov 07, 2025 at 03:22:54PM +0000, Ryan Roberts wrote:
> > 
> > Hi Ryan,
> > 
> >> On 07/11/2025 14:34, David Hildenbrand (Red Hat) wrote:
> >>>>>   #ifndef pte_batch_hint
> >>>>> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> >>>>> index 5d2a876035d6..c49b029d3593 100644
> >>>>> --- a/mm/kasan/shadow.c
> >>>>> +++ b/mm/kasan/shadow.c
> >>>>> @@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep,
> >>>>> unsigned long addr,
> >>>>>       pte_t pte;
> >>>>>       int index;
> >>>>>   -    arch_leave_lazy_mmu_mode();
> >>>>> +    lazy_mmu_mode_pause();
> >>>>
> >>>> I wonder if there really are use cases that *require* pause/resume? I 
> >>>> think
> >>>> these kasan cases could be correctly implemented using a new nest level 
> >>>> instead?
> >>>> Are there cases where the effects really need to be immediate or do the 
> >>>> effects
> >>>> just need to be visible when you get to where the resume is?
> >>>>
> >>>> If the latter, that could just be turned into a nested disable (e.g. a 
> >>>> flush).
> >>>> In this case, there is only 1 PTE write so no benefit, but I wonder if 
> >>>> other
> >>>> cases may have more PTE writes that could then still be batched. It 
> >>>> would be
> >>>> nice to simplify the API by removing pause/resume if we can?
> >>>
> >>> It has clear semantics, clearer than some nest-disable IMHO.
> >>>
> >>> Maybe you can elaborate how you would change ("simplify") the API in that
> >>> regard? What would the API look like?
> >>
> >> By simplify, I just meant can we remove lazy_mmu_mode_pause() and
> >> lazy_mmu_mode_resume() ?
> >>
> >>
> >> We currently have:
> >>
> >> apply_to_page_range
> >>   lazy_mmu_mode_enable()
> >>     kasan_populate_vmalloc_pte()
> >>       lazy_mmu_mode_pause()
> >>       <code>
> >>       lazy_mmu_mode_resume()
> >>   lazy_mmu_mode_disable()
> >>
> >> Where <code> is setting ptes. But if <code> doesn't need the effects to be
> >> visible until lazy_mmu_mode_resume(), then you could replace the block 
> >> with:
> >>
> >> apply_to_page_range
> >>   lazy_mmu_mode_enable()
> >>     kasan_populate_vmalloc_pte()
> >>       lazy_mmu_mode_enable()
> >>       <code>
> >>       lazy_mmu_mode_disable()
> >>   lazy_mmu_mode_disable()
> >>
> >> However, looking at this more closely, I'm not really clear on why we need 
> >> *any*
> >> special attention to lazy mmu inside of kasan_populate_vmalloc_pte() and
> >> kasan_depopulate_vmalloc_pte().
> >>
> >> I *think* that the original concern was that we were doing ptep_get(ptep) 
> >> inside
> >> of a lazy_mmu block? So we need to flush so that the getter returns the 
> >> most
> >> recent value? But given we have never written to that particular ptep 
> >> while in
> >> the lazy mmu block, there is surely no hazard in the first place?
> > 
> > There is, please see:
> > https://lore.kernel.org/linux-mm/cover.1755528662.git.agordeev@xxxxxxxxxxxxx/
> 
> I've stared at this for a while, but I'm afraid I still don't see the problem.
> This all looks safe to me. Could you explain exactly what this issue is?
> 
> If I've understood correctly, kasan_populate_vmalloc() is called during 
> virtual
> range allocation by vmalloc. This is not in a nested lazy mmu block (but it
> wouldn't matter if it was once we have Kevin's nested changes to ensure flush
> when exiting the nested scope). kasan_populate_vmalloc() calls
> apply_to_page_range(), which will walk the set of ptes, calling
> kasan_populate_vmalloc_pte() for each one. kasan_populate_vmalloc_pte() does a
> ptep_get() then, if none, calls set_pte_at().
> 
> That's not a hazard since you're calling get before the set and you only visit
> each pte once for the apply_to_page_range() lazy mmu block.

I have to admit I do not remember every detail and would have to recreate
the issue - which is specific to s390 lazy_mmu implementation I think.
Both kasan_populate_vmalloc_pte() and kasan_depopulate_vmalloc_pte() do:

apply_to_page_range()
{
    arch_enter_lazy_mmu_mode();

    kasan_de|populate_vmalloc_pte()
    {
        arch_leave_lazy_mmu_mode();             <--- remove?

        spin_lock(&init_mm.page_table_lock);
        <PTE update>
        spin_unlock(&init_mm.page_table_lock);  <--- PTE store should be done

        arch_enter_lazy_mmu_mode();             <--- remove?
    }

    arch_leave_lazy_mmu_mode();
}

Upon return from spin_unlock() both kasan callbacks expect the PTE contains
an updated value to be stored to pgtable. That is true unless we remove
arch_leave|enter_lazy_mmu_mode() brackets. If we do the value is continued
to be cached and only stored when the outer arch_leave_lazy_mmu_mode() is
called. That results in a race between concurrent PTE updaters.

> >> apply_to_existing_page_range() will only call 
> >> kasan_depopulate_vmalloc_pte()
> >> once per pte, right? So given we read the ptep before writing it, there 
> >> should
> >> be no hazard? If so we can remove pause/resume.
> > 
> > Unfortunately, we rather not, please see:
> > https://lore.kernel.org/linux-mm/d407a381-099b-4ec6-a20e-aeff4f3d750f@xxxxxxx/
> 
> Sorry but I don't see anything relavent to my point in this mail. Perhaps 
> there
> is some s390-specific detail that I'm failing to understand?

Sorry, with this message I meant the branch where it was discussed,
I will try to C&P some excerpts and summarize it here.

* lazy_mmu_mode_enable()

This helper is parameter-free, assuming the MMU unit does not need any
configuration other than turning it on/off. That is currently true, but
(as I noted in my other mail) I am going to introduce a friend enable
function that accepts parameters, creates an arch-specific state and
uses it while the lazy mmu mode is active:

static inline void arch_enter_lazy_mmu_mode_pte(struct mm_struct *mm,
                                                unsigned long addr,
                                                unsigned long end,
                                                pte_t *ptep)
{
        ...
}

* lazy_mmu_mode_resume() -> arch_enter_lazy_mmu_mode()

Conversely, this needs to be -> arch_resume_lazy_mmu_mode(). And it can not
be arch_enter_lazy_mmu_mode(), since a lazy_mmu_mode_resume() caller does
not know the parameters passed to the original lazy_mmu_mode_enable(...)-
friend.

> 
> Thanks,
> Ryan

Thanks!

> > 
> > The problem is kasan code invokes apply_to_page_range(), which enters 
> > lazy_mmu
> > mode unconditionally. I would claim that is rather an obstacle for the kasan
> > code, not a benefit. But it needs to be tackled.
> > > Should apply_to_page_range() had an option not to enter the lazy_mmu mode
> > (e.g. an extra "bool skip_lazy" parameter) - the pause/resume could have
> > been avoided.
> > 
> >> Thanks,
> >> Ryan
> > 
> > Thanks!

Follow-Ups:
- Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
  - From: Ryan Roberts

References:
- Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
  - From: Ryan Roberts
- Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
  - From: David Hildenbrand (Red Hat)
- Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
  - From: Ryan Roberts
- Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
  - From: Alexander Gordeev
- Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
  - From: Ryan Roberts

Prev by Date: Re: [PATCH v4] xen: Support LLVM raw profile versions 5, 6, 7, 8, 9, and 10
Next by Date: Re: [PATCH v3 09/28] xen/vm_event: consolidate CONFIG_VM_EVENT
Previous by thread: Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
Next by thread: Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.