[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest



On Thu, Nov 06, 2025 at 10:51:43AM +0000, Kevin Brodsky wrote:
> On 05/11/2025 16:12, Alexander Gordeev wrote:
> > On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
> >>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> >>> + * currently enabled.
> >>>   */
> >>>  #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> >>>  static inline void lazy_mmu_mode_enable(void)
> >>>  {
> >>> - arch_enter_lazy_mmu_mode();
> >>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> >>> +
> >>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> >>> + /* enable() must not be called while paused */
> >>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> >>> +
> >>> + if (state->nesting_level++ == 0) {
> >>> +         state->active = true;
> >>> +         arch_enter_lazy_mmu_mode();
> >>> + }
> >>>  }
> >> Some architectures disables preemption in their
> >> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
> >> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
> > Do you have some scenario in mind that could cause an issue?
> > IOW, what could go wrong if the process is scheduled to another
> > CPU before preempt_disable() is called?
> 
> I'm not sure I understand the issue either.
> 
> >>   static inline void lazy_mmu_mode_enable(void)
> >>   {
> >>  - arch_enter_lazy_mmu_mode();
> >>  + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> >>  +
> >>  + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> >>  + /* enable() must not be called while paused */
> >>  + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> >>  +
> >>  + if (state->nesting_level++ == 0) {
> >>  +         arch_enter_lazy_mmu_mode();
> >>  +         state->active = true;
> >>  + }
> >>   }
> >>
> >> ... I think it make more sense to enable the state after the arch_**
> >> call right.
> > But then in_lazy_mmu_mode() would return false if called from
> > arch_enter_lazy_mmu_mode(). Not big problem, but still..
> 
> The ordering of nesting_level/active was the way you expected in v3, but
> the conclusion of the discussion with David H [1] is that it doesn't
> really matter so I simplified the ordering in v4 - the arch hooks
> shouldn't call in_lazy_mmu_mode() or inspect lazy_mmu_state.
> arch_enter()/arch_leave() shouldn't need it anyway since they're called
> once per outer section (not in nested sections). arch_flush() could
> potentially do something different when nested, but that seems unlikely.
> 
> - Kevin
> 
> [1]
> https://lore.kernel.org/all/af4414b6-617c-4dc8-bddc-3ea00d1f6f3b@xxxxxxxxxx/

I might be misunderstand this conversation, but it looked to me as a discussion
about lazy_mmu_state::nesting_level value, not lazy_mmu_state::active.

I do use in_lazy_mmu_mode() (lazy_mmu_state::active) check from the arch-
callbacks. Here is the example (and likely the only case so far) where it hits:

static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
                                      void *_data)
{
        lazy_mmu_mode_pause();
        ...
        if (likely(pte_none(ptep_get(ptep)))) {

                /* Here set_pte() checks whether we are in lazy_mmu mode */
                set_pte_at(&init_mm, addr, ptep, pte);  <--- calls set_pte()
                data->pages[index] = NULL;
        }
        ...
        lazy_mmu_mode_resume();
        ...
}

So without in_lazy_mmu_mode() check above the arch-specific set_pte()
implementation enters a wrong branch, which ends up in:

[  394.503134] Call Trace:
[  394.503137]  [<00007fffe01333f4>] dump_stack_lvl+0xbc/0xf0 
[  394.503143]  [<00007fffe010298c>] vpanic+0x1cc/0x418 
[  394.503149]  [<00007fffe0102c7a>] panic+0xa2/0xa8 
[  394.503154]  [<00007fffe01e7a8a>] check_panic_on_warn+0x8a/0xb0 
[  394.503160]  [<00007fffe082d122>] end_report+0x72/0x110 
[  394.503166]  [<00007fffe082d3e6>] kasan_report+0xc6/0x100 
[  394.503171]  [<00007fffe01b9556>] ipte_batch_ptep_get+0x146/0x150 
[  394.503176]  [<00007fffe0830096>] kasan_populate_vmalloc_pte+0xe6/0x1e0 
[  394.503183]  [<00007fffe0718050>] apply_to_pte_range+0x1a0/0x570 
[  394.503189]  [<00007fffe07260fa>] __apply_to_page_range+0x3ca/0x8f0 
[  394.503195]  [<00007fffe0726648>] apply_to_page_range+0x28/0x40 
[  394.503201]  [<00007fffe082fe34>] __kasan_populate_vmalloc+0x324/0x340 
[  394.503207]  [<00007fffe076954e>] alloc_vmap_area+0x31e/0xbf0 
[  394.503213]  [<00007fffe0770106>] __get_vm_area_node+0x1a6/0x2d0 
[  394.503218]  [<00007fffe07716fa>] __vmalloc_node_range_noprof+0xba/0x260 
[  394.503224]  [<00007fffe0771970>] __vmalloc_node_noprof+0xd0/0x110 
[  394.503229]  [<00007fffe0771a22>] vmalloc_noprof+0x32/0x40 
[  394.503234]  [<00007fff604eaa42>] full_fit_alloc_test+0xb2/0x3e0 
[test_vmalloc] 
[  394.503241]  [<00007fff604eb478>] test_func+0x488/0x760 [test_vmalloc] 
[  394.503247]  [<00007fffe025ad68>] kthread+0x368/0x630 
[  394.503253]  [<00007fffe01391e0>] __ret_from_fork+0xd0/0x490 
[  394.503259]  [<00007fffe24e468a>] ret_from_fork+0xa/0x30 

I could have cached lazy_mmu_state::active as arch-specific data
and check it, but then what is the point to have it generalized?

Thanks!



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.