
Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs



On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
> On 17/01/25 17:11, Uladzislau Rezki wrote:
> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote:
> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >>
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >>
> > Probably I am missing something and need to have a look at your patches,
> > but how do you guarantee that no one maps the same area that you defer
> > for TLB flushing?
> >
> 
> That's the cool part: I don't :')
> 
Indeed, sounds unsafe :) Then we just do not need to free areas.

> For deferring instruction patching IPIs, I (well Josh really) managed to
> get instrumentation to back me up and catch any problematic area.
> 
> I looked into getting something similar for vmalloc region access in
> .noinstr code, but I didn't get anywhere. I even tried using emulated
> watchpoints on QEMU to watch the whole vmalloc range, but that went about
> as well as you could expect.
> 
> That left me with staring at code. AFAICT the only vmap'd thing that is
> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
> itself cannot be freed until the task exits - thus can't be subject to
> invalidation when a task is entering kernelspace.
> 
> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
> 
As noted before, we defer flushing for vmalloc. We have a lazy-threshold
which can be exposed (if you need it) via sysfs for tuning. So, we can add it.
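
A rough userspace model of the lazy-threshold idea, as a sketch only and not
kernel code: unmapped pages accumulate, and a batched flush is issued once the
pending count exceeds a threshold. All names here (struct lazy_purge,
lazy_threshold_pages(), lazy_unmap()) are hypothetical, and the default formula
(fls(ncpus) times 32 MiB worth of pages) is an assumption based on how
lazy_max_pages() in mm/vmalloc.c is commonly described, so check the tree
before relying on it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of lazy purging; not the kernel's data structures. */
struct lazy_purge {
	unsigned long pending_pages;   /* pages unmapped but not yet flushed */
	unsigned long threshold_pages; /* flush once pending exceeds this */
	unsigned long flushes;         /* number of batched flushes issued */
};

/* 1-based position of the highest set bit; 0 when x == 0. */
static unsigned int fls_ul(unsigned long x)
{
	unsigned int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Assumed default threshold: fls(ncpus) * 32 MiB, expressed in pages. */
static unsigned long lazy_threshold_pages(unsigned long ncpus,
					  unsigned long page_size)
{
	assert(page_size > 0);
	return fls_ul(ncpus) * (32UL * 1024 * 1024 / page_size);
}

static void lazy_purge_init(struct lazy_purge *lp, unsigned long threshold_pages)
{
	lp->pending_pages = 0;
	lp->threshold_pages = threshold_pages;
	lp->flushes = 0;
}

/*
 * Record an unmap of nr_pages. Returns true if this call pushed the
 * pending count over the threshold and therefore triggered a flush.
 */
static bool lazy_unmap(struct lazy_purge *lp, unsigned long nr_pages)
{
	lp->pending_pages += nr_pages;
	if (lp->pending_pages <= lp->threshold_pages)
		return false;

	/* A real implementation would purge areas and flush the TLB here. */
	lp->flushes++;
	lp->pending_pages = 0;
	return true;
}
```

A tunable exposed via sysfs would simply replace the value passed to
lazy_purge_init(): a lower threshold means more frequent but smaller flushes,
a higher one means fewer flushes at the cost of more stale mappings pending.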

--
Uladzislau Rezki



 

