Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
- To: Valentin Schneider <vschneid@xxxxxxxxxx>
- From: Uladzislau Rezki <urezki@xxxxxxxxx>
- Date: Mon, 20 Jan 2025 12:15:20 +0100
- Cc: Uladzislau Rezki <urezki@xxxxxxxxx>, Jann Horn <jannh@xxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, x86@xxxxxxxxxx, virtualization@xxxxxxxxxxxxxxx, linux-arm-kernel@xxxxxxxxxxxxxxxxxxx, loongarch@xxxxxxxxxxxxxxx, linux-riscv@xxxxxxxxxxxxxxxxxxx, linux-perf-users@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, kvm@xxxxxxxxxxxxxxx, linux-arch@xxxxxxxxxxxxxxx, rcu@xxxxxxxxxxxxxxx, linux-hardening@xxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, linux-kselftest@xxxxxxxxxxxxxxx, bpf@xxxxxxxxxxxxxxx, bcm-kernel-feedback-list@xxxxxxxxxxxx, Juergen Gross <jgross@xxxxxxxx>, Ajay Kaher <ajay.kaher@xxxxxxxxxxxx>, Alexey Makhalov <alexey.amakhalov@xxxxxxxxxxxx>, Russell King <linux@xxxxxxxxxxxxxxx>, Catalin Marinas <catalin.marinas@xxxxxxx>, Will Deacon <will@xxxxxxxxxx>, Huacai Chen <chenhuacai@xxxxxxxxxx>, WANG Xuerui <kernel@xxxxxxxxxx>, Paul Walmsley <paul.walmsley@xxxxxxxxxx>, Palmer Dabbelt <palmer@xxxxxxxxxxx>, Albert Ou <aou@xxxxxxxxxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>, Namhyung Kim <namhyung@xxxxxxxxxx>, Mark Rutland <mark.rutland@xxxxxxx>, Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>, Jiri Olsa <jolsa@xxxxxxxxxx>, Ian Rogers <irogers@xxxxxxxxxx>, Adrian Hunter <adrian.hunter@xxxxxxxxx>, "Liang, Kan" <kan.liang@xxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Josh Poimboeuf <jpoimboe@xxxxxxxxxx>, Pawan Gupta <pawan.kumar.gupta@xxxxxxxxxxxxxxx>, Sean Christopherson <seanjc@xxxxxxxxxx>, Paolo Bonzini <pbonzini@xxxxxxxxxx>, Andy Lutomirski <luto@xxxxxxxxxx>, Arnd Bergmann <arnd@xxxxxxxx>, Frederic Weisbecker <frederic@xxxxxxxxxx>, "Paul E. 
McKenney" <paulmck@xxxxxxxxxx>, Jason Baron <jbaron@xxxxxxxxxx>, Steven Rostedt <rostedt@xxxxxxxxxxx>, Ard Biesheuvel <ardb@xxxxxxxxxx>, Neeraj Upadhyay <neeraj.upadhyay@xxxxxxxxxx>, Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>, Josh Triplett <josh@xxxxxxxxxxxxxxxx>, Boqun Feng <boqun.feng@xxxxxxxxx>, Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>, Lai Jiangshan <jiangshanlai@xxxxxxxxx>, Zqiang <qiang.zhang1211@xxxxxxxxx>, Juri Lelli <juri.lelli@xxxxxxxxxx>, Clark Williams <williams@xxxxxxxxxx>, Yair Podemsky <ypodemsk@xxxxxxxxxx>, Tomas Glozar <tglozar@xxxxxxxxxx>, Vincent Guittot <vincent.guittot@xxxxxxxxxx>, Dietmar Eggemann <dietmar.eggemann@xxxxxxx>, Ben Segall <bsegall@xxxxxxxxxx>, Mel Gorman <mgorman@xxxxxxx>, Kees Cook <kees@xxxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>, Shuah Khan <shuah@xxxxxxxxxx>, Sami Tolvanen <samitolvanen@xxxxxxxxxx>, Miguel Ojeda <ojeda@xxxxxxxxxx>, Alice Ryhl <aliceryhl@xxxxxxxxxx>, "Mike Rapoport (Microsoft)" <rppt@xxxxxxxxxx>, Samuel Holland <samuel.holland@xxxxxxxxxx>, Rong Xu <xur@xxxxxxxxxx>, Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx>, Geert Uytterhoeven <geert@xxxxxxxxxxxxxx>, Yosry Ahmed <yosryahmed@xxxxxxxxxx>, "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>, "Masami Hiramatsu (Google)" <mhiramat@xxxxxxxxxx>, Jinghao Jia <jinghao7@xxxxxxxxxxxx>, Luis Chamberlain <mcgrof@xxxxxxxxxx>, Randy Dunlap <rdunlap@xxxxxxxxxxxxx>, Tiezhu Yang <yangtiezhu@xxxxxxxxxxx>
- Delivery-date: Mon, 20 Jan 2025 11:15:51 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
> On 17/01/25 17:11, Uladzislau Rezki wrote:
> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@xxxxxxxxxx>
> >> > wrote:
> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source
> >> >> of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be
> >> >> vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >>
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >>
> > Probably I am missing something and need to have a look at your patches,
> > but how do you guarantee that no one maps the same area that you defer
> > the TLB flush for?
> >
>
> That's the cool part: I don't :')
>
Indeed, that sounds unsafe :) In that case we simply should not free the areas.

> For deferring instruction patching IPIs, I (well Josh really) managed to
> get instrumentation to back me up and catch any problematic area.
>
> I looked into getting something similar for vmalloc region access in
> .noinstr code, but I didn't get anywhere. I even tried using emulated
> watchpoints on QEMU to watch the whole vmalloc range, but that went about
> as well as you could expect.
>
> That left me with staring at code. AFAICT the only vmap'd thing that is
> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
> itself cannot be freed until the task exits - thus can't be subject to
> invalidation when a task is entering kernelspace.
>
> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>
As noted before, we already defer flushing for vmalloc. There is a lazy
threshold, which can be exposed via sysfs for tuning if you need it. So, we
can add that.
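
To make the lazy threshold concrete, here is a rough userspace sketch. It
assumes the in-tree rule in mm/vmalloc.c is still "fls(num_online_cpus())
times 32MB worth of pages", and that a purge is kicked once the number of
lazily freed pages exceeds it. The *_model names and the 4 KiB page size are
made up for illustration; this is not the kernel code and not a proposed
interface.

```c
/* Assumption: 4 KiB pages; the kernel uses the real PAGE_SIZE. */
#define PAGE_SIZE_BYTES 4096UL

/*
 * Userspace stand-in for the kernel's fls(): 1-based index of the
 * highest set bit, 0 when no bit is set.
 */
static unsigned int fls_model(unsigned int x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/*
 * Hypothetical model of the lazy threshold: the number of lazily
 * freed pages tolerated before a purge (and its TLB flush) is due.
 */
static unsigned long lazy_max_pages_model(unsigned int online_cpus)
{
	return fls_model(online_cpus) * (32UL * 1024 * 1024 / PAGE_SIZE_BYTES);
}

/* Non-zero when the backlog of lazily freed pages exceeds the threshold. */
static int should_purge_model(unsigned long nr_lazy_pages,
			      unsigned int online_cpus)
{
	return nr_lazy_pages > lazy_max_pages_model(online_cpus);
}
```

With these assumptions, lazy_max_pages_model(4) is 3 * 8192 pages, so
should_purge_model() only reports a pending purge once more than 24576 pages
are waiting. A sysfs knob would in effect replace that computed value with an
administrator-chosen one.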
--
Uladzislau Rezki