
Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest



On 10/11/2025 10:47, Kevin Brodsky wrote:
> On 07/11/2025 14:59, Ryan Roberts wrote:
>> On 29/10/2025 10:09, Kevin Brodsky wrote:
>>> [...]
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index b5fdf32c437f..e6064e00b22d 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
>>>   * of the lazy mode. So the implementation must assume preemption may be enabled
>>>   * and cpu migration is possible; it must take steps to be robust against this.
>>>   * (In practice, for user PTE updates, the appropriate page table lock(s) are
>>> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
>>> - * and the mode cannot be used in interrupt context.
>>> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
>>> + * in interrupt context.
>> "The mode cannot be used in interrupt context"; except it is for arm64. KFENCE
>> and/or DEBUG_PAGEALLOC will request the arch to change linear map permissions,
>> which will enter lazy mmu (now using the new generic API). This can happen in
>> softirq context.
> 
> Are you happy with the wording update in patch 12?

Yes!

> 
>>> + *
>>> + * The lazy MMU mode is enabled for a given block of code using:
>>> + *
>>> + *   lazy_mmu_mode_enable();
>>> + *   <code>
>>> + *   lazy_mmu_mode_disable();
>>> + *
>>> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
>>> + * A nested call to enable() has no functional effect; however disable() causes
>>> + * any batched architectural state to be flushed regardless of nesting. After a
>>> + * call to disable(), the caller can therefore rely on all previous page table
>>> + * modifications to have taken effect, but the lazy MMU mode may still be
>>> + * enabled.
>>> + *
>>> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
>>> + * This can be done using:
>>> + *
>>> + *   lazy_mmu_mode_pause();
>>> + *   <code>
>>> + *   lazy_mmu_mode_resume();
>>> + *
>>> + * This sequence must only be used if the lazy MMU mode is already enabled.
>>> + * pause() ensures that the mode is exited regardless of the nesting level;
>>> + * resume() re-enters the mode at the same nesting level. <code> must not modify
>>> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
>>> + * helpers).
>>> + *
>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>> + * currently enabled.
>>>   */
>> Nice documentation!
> 
> Thanks!
> 
>>>  #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>  static inline void lazy_mmu_mode_enable(void)
>>>  {
>>> -   arch_enter_lazy_mmu_mode();
>>> +   struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> +   VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>> +   /* enable() must not be called while paused */
>>> +   VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>> +
>>> +   if (state->nesting_level++ == 0) {
>> Hmm... for the arm64 case of calling this in an interrupt, Is it safe?
>>
>> If a task is calling this function and gets interrupted here, nesting_level==1
>> but active==false. The interrupt then calls this function and increments from 1
>> to 2 but arch_enter_lazy_mmu_mode() hasn't been called.
>>
>> More dangerously (I think), when the interrupt handler calls
>> lazy_mmu_mode_disable(), it will end up calling arch_flush_lazy_mmu_mode() which
>> could be an issue because as far as the arch is concerned, it's not in lazy mode.
>>
>> The current arm64 implementation works because setting and clearing the thread
>> flags is atomic.
>>
>> Perhaps you need to disable preemption around the if block?
> 
> As you found out this is addressed in patch 12, but indeed I hadn't
> realised that this patch leaves the generic API in an unsafe situation
> w.r.t. interrupts. We at least need to have in_interrupt() checks in the
> generic layer by the time we get to this patch.

Yeah that should solve it.

> 
>>> +           state->active = true;
>>> +           arch_enter_lazy_mmu_mode();
>>> +   }
>>>  }
>>>  
>>>  static inline void lazy_mmu_mode_disable(void)
>>>  {
>>> -   arch_leave_lazy_mmu_mode();
>>> +   struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> +   VM_WARN_ON_ONCE(state->nesting_level == 0);
>>> +   VM_WARN_ON(!state->active);
>>> +
>>> +   if (--state->nesting_level == 0) {
>>> +           state->active = false;
>>> +           arch_leave_lazy_mmu_mode();
>>> +   } else {
>>> +           /* Exiting a nested section */
>>> +           arch_flush_lazy_mmu_mode();
>>> +   }
>>>  }
>>>  
>>>  static inline void lazy_mmu_mode_pause(void)
>>>  {
>>> +   struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> +   VM_WARN_ON(state->nesting_level == 0 || !state->active);
>> nit: do you need the first condition? I think when nesting_level==0, we expect
>> to be !active?
> 
> I suppose this should never happen indeed - I was just being extra
> defensive.
> 
> Either way David suggested allowing pause()/resume() to be called
> outside of any section so the next version will bail out on
> nesting_level == 0.

Ignoring my current opinion that we don't need pause/resume at all for now: are
you suggesting that pause/resume will be completely independent of
enable/disable? I think that would be best. So enable/disable increment and
decrement the nesting_level counter regardless of whether we are paused:
nesting_level 0 => 1 enables if not paused, and nesting_level 1 => 0 disables if
not paused; pause() disables if nesting_level >= 1, and resume() enables if
nesting_level >= 1.

Perhaps we also need nested pause/resume? Then you just end up with 2 counters;
enable_count and pause_count. Sorry if this has already been discussed.

> 
>>> +
>>> +   state->active = false;
>>>     arch_leave_lazy_mmu_mode();
>>>  }
>>>  
>>>  static inline void lazy_mmu_mode_resume(void)
>>>  {
>>> +   struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> +   VM_WARN_ON(state->nesting_level == 0 || state->active);
>> Similar argument?
>>
>>> +
>>> +   state->active = true;
>>>     arch_enter_lazy_mmu_mode();
>>>  }
>>>  #else
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index cbb7340c5866..11566d973f42 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>>  
>>>     struct page_frag                task_frag;
>>>  
>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>> +   struct lazy_mmu_state           lazy_mmu_state;
>>> +#endif
>>> +
>>>  #ifdef CONFIG_TASK_DELAY_ACCT
>>>     struct task_delay_info          *delays;
>>>  #endif
>>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>>>     return task_index_to_char(task_state_index(tsk));
>>>  }
>>>  
>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>> +static inline bool in_lazy_mmu_mode(void)
>>> +{
>>> +   return current->lazy_mmu_state.active;
>>> +}
>>> +#else
>>> +static inline bool in_lazy_mmu_mode(void)
>>> +{
>>> +   return false;
>> Just pointing out that this isn't really a correct implementation:
>>
>> lazy_mmu_mode_enable()
>> ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
>> lazy_mmu_mode_disable()
>>
>> Although it probably doesn't matter in practice?
> 
> I'd say that the expectation is invalid - lazy MMU mode can only be
> enabled if the architecture supports it. In fact as you pointed out
> above the API may be called in interrupt context but it will have no
> effect, so this sequence would always fail in interrupt context.

Yep, but previously there was no way to query the current state so it didn't
matter. Now you have a query API so it might matter if/when people come along
and use it in unexpected ways.

> 
> Worth noting that in_lazy_mmu_mode() is only ever called from arch code
> where lazy MMU is implemented. I added the fallback as a matter of
> principle, but it isn't actually required.

Yes, I agree that's the intent. I'm just wondering if it's possible to enforce
that only arch code uses this. Perhaps add some docs to explain that it's only
intended for arches that implement lazy_mmu, and don't define it for arches that
don't, which would catch any generic users?

Thanks,
Ryan

> 
> - Kevin

