Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21.10.2020 17:39, Andrew Cooper wrote:
> On 21/10/2020 14:56, Jan Beulich wrote:
>> On 21.10.2020 15:07, Andrew Cooper wrote:
>>> @@ -4037,6 +4037,9 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> +                    /* Paging structure maybe changed. Flush linear range. */
>>> +                    if ( !rc )
>>> +                        flush_flags_all |= FLUSH_TLB;
>>>                      break;
>>>
>>>                  case PGT_l3_page_table:
>>> @@ -4044,6 +4047,9 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> +                    /* Paging structure maybe changed. Flush linear range. */
>>> +                    if ( !rc )
>>> +                        flush_flags_all |= FLUSH_TLB;
>>>                      break;
>>>
>>>                  case PGT_l4_page_table:
>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> -                    if ( !rc && pt_owner->arch.pv.xpti )
>>> +                    /* Paging structure maybe changed. Flush linear range. */
>>> +                    if ( !rc )
>>>                      {
>>> -                        bool local_in_use = false;
>>> +                        bool local_in_use = mfn_eq(
>>> +                            pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>
>>> -                        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>> -                                    mfn) )
>>> -                        {
>>> -                            local_in_use = true;
>>> -                            get_cpu_info()->root_pgt_changed = true;
>>> -                        }
>>> +                        flush_flags_all |= FLUSH_TLB;
>>> +
>>> +                        if ( local_in_use )
>>> +                            flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>>
>> Aiui here (and in the code consuming the flags) you build upon
>> flush_flags_local, when not zero, always being a superset of
>> flush_flags_all. I think this is a trap to fall into when later
>> wanting to change this code, but as per below this won't hold
>> anymore anyway, I think. Hence here I think you want to set
>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>> set it in both variables. Or, if I'm wrong below, a comment to
>> that effect may help people avoid falling into this trap.
>>
>> An alternative would be to have
>>
>>     flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>
>> before doing the actual flush.
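Spelled out slightly more - a sketch only, reusing the patch's
variable names, untested - the alternative would amount to:

    /*
     * Merge the common TLB bit right before flushing, so that
     * flush_flags_local doesn't silently rely on being a superset
     * of flush_flags_all.
     */
    flush_flags_local |= (flush_flags_all & FLUSH_TLB);

    if ( flush_flags_local )
        flush_local(flush_flags_local);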
> Honestly, this is what I meant by stating that the asymmetry is a total
> mess.
>
> I originally named all 'remote', but this is even less accurate, it may
> still contain the current cpu.
>
> Our matrix of complexity:
>
>  * FLUSH_TLB for L2+ structure changes
>  * FLUSH_TLB_GLOBAL/FLUSH_ROOT_PGTBL for XPTI
>
> with optimisations to skip GLOBAL/ROOT_PGTBL on the local CPU if none of
> the updates hit the L4-in-use, and to skip the remote if we hold all
> references on the L4.
>
> Everything is complicated because pt_owner may not be current, for
> toolstack operations constructing a new domain.

Which is a case where I'm wondering why we flush in the first place.
The L4 under construction can't be in use yet, and hence updates to
it shouldn't need syncing. Otoh pt_owner->dirty_cpumask obviously is
empty at that point, i.e. special casing this likely isn't really
worth it.

>>> @@ -4173,18 +4180,36 @@ long do_mmu_update(
>>>      if ( va )
>>>          unmap_domain_page(va);
>>>
>>> -    if ( sync_guest )
>>> +    /*
>>> +     * Flushing needs to occur for one of several reasons.
>>> +     *
>>> +     * 1) An update to an L2 or higher occured. This potentially changes the
>>> +     *    pagetable structure, requiring a flush of the linear range.
>>> +     * 2) An update to an L4 occured, and XPTI is enabled. All CPUs running
>>> +     *    on a copy of this L4 need refreshing.
>>> +     */
>>> +    if ( flush_flags_all || flush_flags_local )
>>
>> Minor remark: At least in x86 code it is more efficient to use
>> | instead of || in such cases, to avoid relying on the compiler
>> to carry out this small optimization.
>
> This transformation should not be recommended in general. There are
> cases, including this one, where it is at best no effect, and at worst
> wrong.
>
> I don't care about people using ancient compilers. They've got far
> bigger (== more impactful) problems than (the absence of) this
> transformation, and the TLB flush will dwarf anything the compiler does
> here.
>
> However, the hand "optimised" case breaks a compiler spotting that the
> entire second clause is actually redundant for now.

Oh, you mean non-zero flush_flags_local implying non-zero
flush_flags_all? I'm not sure why a compiler recognizing this
shouldn't be able to optimize away | when it is able to optimize
away || in this case. Anyway - minor point, as said.

> I specifically didn't encode the dependency, to avoid subtle bugs
> if/when someone alters the logic.

Good point, but let me point out then that you encoded another subtle
dependency, which I didn't spell out completely before because with
the suggested adjustments it disappears: You may not omit FLUSH_TLB
from flush_flags_local when !local_in_use, as the L4 being changed
may be one recursively hanging off of the L4 we're running on.
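That is, with the adjustments suggested above, the L4 case would end
up along these lines (again just a sketch against the quoted hunk,
untested):

    rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                      cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
    if ( !rc )
    {
        /*
         * The linear range may change even when this L4 isn't the one
         * we're running on, e.g. when it hangs off of the running L4
         * via a recursive slot, so FLUSH_TLB can't be skipped locally.
         */
        flush_flags_all   |= FLUSH_TLB;
        flush_flags_local |= FLUSH_TLB;

        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table), mfn) )
            flush_flags_local |= FLUSH_ROOT_PGTBL;
    }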
>>>      {
>>> +        cpumask_t *mask = pt_owner->dirty_cpumask;
>>> +
>>>          /*
>>> -         * Force other vCPU-s of the affected guest to pick up L4 entry
>>> -         * changes (if any).
>>> +         * Local flushing may be asymmetric with remote. If there is local
>>> +         * flushing to do, perform it separately and omit the current CPU from
>>> +         * pt_owner->dirty_cpumask.
>>>          */
>>> -        unsigned int cpu = smp_processor_id();
>>> -        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
>>> +        if ( flush_flags_local )
>>> +        {
>>> +            unsigned int cpu = smp_processor_id();
>>> +
>>> +            mask = per_cpu(scratch_cpumask, cpu);
>>> +            cpumask_copy(mask, pt_owner->dirty_cpumask);
>>> +            __cpumask_clear_cpu(cpu, mask);
>>> +
>>> +            flush_local(flush_flags_local);
>>> +        }
>>>
>>> -        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
>>
>> I understand you're of the opinion that cpumask_copy() +
>> __cpumask_clear_cpu() is more efficient than cpumask_andnot()?
>> (I guess I agree for high enough CPU counts.)
>
> It's faster in all cases, even low CPU counts.

Is it? Data for BTR with a memory operand is available in the ORM
only for some Atom CPUs, and its latency and throughput aren't very
good there. Anyway - again a minor aspect, but I'll keep this in
mind for future code I write, as so far I've preferred the "andnot"
form in such cases.
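For reference, the two forms side by side (these are the existing
cpumask helpers, just lifted out of their surrounding context):

    /* One pass over the bitmap: mask = dirty_cpumask & ~cpumask_of(cpu). */
    cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));

    /*
     * vs a copy followed by clearing a single bit, the latter being
     * the BTR with a memory operand mentioned above on x86:
     */
    cpumask_copy(mask, pt_owner->dirty_cpumask);
    __cpumask_clear_cpu(cpu, mask);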
Jan