Re: [PATCH 10/16] x86/shadow: move OOS functions to their own file
On 22/03/2023 9:35 am, Jan Beulich wrote:
> --- /dev/null
> +++ b/xen/arch/x86/mm/shadow/oos.c
> @@ -0,0 +1,603 @@
> +/* SPDX-License-Identifier: GPL-2.0 */

GPL-2.0-only

The unqualified form is deprecated.

> +/******************************************************************************
> + * arch/x86/mm/shadow/oos.c
> + *
> + * Shadow code dealing with out-of-sync shadows.
> + * Parts of this code are Copyright (c) 2006 by XenSource Inc.
> + * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
> + * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
> + */
> +
> +#include "private.h"
> +
> +#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
> +
> +#include <xen/trace.h>
> +
> +#include <asm/shadow.h>
> +
> +/*
> + * From time to time, we let a shadowed pagetable page go out of sync
> + * with its shadow: the guest is allowed to write directly to the page,
> + * and those writes are not synchronously reflected in the shadow.
> + * This lets us avoid many emulations if the guest is writing a lot to a
> + * pagetable, but it relaxes a pretty important invariant in the shadow
> + * pagetable design.  Therefore, some rules:
> + *
> + * 1. Only L1 pagetables may go out of sync: any page that is shadowed
> + *    at at higher level must be synchronously updated.  This makes
> + *    using linear shadow pagetables much less dangerous.
> + *    That means that: (a) unsyncing code needs to check for higher-level
> + *    shadows, and (b) promotion code needs to resync.
> + *
> + * 2. All shadow operations on a guest page require the page to be brought
> + *    back into sync before proceeding.  This must be done under the
> + *    paging lock so that the page is guaranteed to remain synced until
> + *    the operation completes.
> + *
> + *    Exceptions to this rule: the pagefault and invlpg handlers may
> + *    update only one entry on an out-of-sync page without resyncing it.
> + *
> + * 3. Operations on shadows that do not start from a guest page need to
> + *    be aware that they may be handling an out-of-sync shadow.
> + *
> + * 4. Operations that do not normally take the paging lock (fast-path
> + *    #PF handler, INVLPG) must fall back to a locking, syncing version
> + *    if they see an out-of-sync table.
> + *
> + * 5. Operations corresponding to guest TLB flushes (MOV CR3, INVLPG)
> + *    must explicitly resync all relevant pages or update their
> + *    shadows.
> + *
> + * Currently out-of-sync pages are listed in a simple open-addressed
> + * hash table with a second chance (must resist temptation to radically
> + * over-engineer hash tables...)  The virtual address of the access
> + * which caused us to unsync the page is also kept in the hash table, as
> + * a hint for finding the writable mappings later.
> + *
> + * We keep a hash per vcpu, because we want as much as possible to do
> + * the re-sync on the save vcpu we did the unsync on, so the VA hint
> + * will be valid.
> + */
> +
> +#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES_FULL
> +void sh_oos_audit(struct domain *d)
> +{
> +    unsigned int idx, expected_idx, expected_idx_alt;
> +    struct page_info *pg;
> +    struct vcpu *v;
> +
> +    for_each_vcpu(d, v)
> +    {
> +        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
> +        {
> +            mfn_t *oos = v->arch.paging.shadow.oos;

Newline.

But the variable placement is weird.  oos ought to be one scope further
out to prevent recalculation in the for() loop, while pg and the two
expected could be at the innermost scope.
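i.e. something like (untested sketch, just to illustrate the scoping I
have in mind):

    for_each_vcpu(d, v)
    {
        mfn_t *oos = v->arch.paging.shadow.oos;

        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
        {
            unsigned int expected_idx, expected_idx_alt;
            struct page_info *pg;

            if ( mfn_eq(oos[idx], INVALID_MFN) )
                continue;
            ...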
> +            if ( mfn_eq(oos[idx], INVALID_MFN) )
> +                continue;
> +
> +            expected_idx = mfn_x(oos[idx]) % SHADOW_OOS_PAGES;
> +            expected_idx_alt = ((expected_idx + 1) % SHADOW_OOS_PAGES);
> +            if ( idx != expected_idx && idx != expected_idx_alt )
> +            {
> +                printk("%s: idx %x contains gmfn %lx, expected at %x or %x.\n",
> +                       __func__, idx, mfn_x(oos[idx]),
> +                       expected_idx, expected_idx_alt);
> +                BUG();
> +            }
> +            pg = mfn_to_page(oos[idx]);
> +            if ( !(pg->count_info & PGC_shadowed_pt) )
> +            {
> +                printk("%s: idx %x gmfn %lx not a pt (count %lx)\n",
> +                       __func__, idx, mfn_x(oos[idx]), pg->count_info);
> +                BUG();
> +            }
> +            if ( !(pg->shadow_flags & SHF_out_of_sync) )
> +            {
> +                printk("%s: idx %x gmfn %lx not marked oos (flags %x)\n",
> +                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
> +                BUG();
> +            }
> +            if ( (pg->shadow_flags & SHF_page_type_mask & ~SHF_L1_ANY) )
> +            {
> +                printk("%s: idx %x gmfn %lx shadowed as non-l1 (flags %x)\n",
> +                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
> +                BUG();
> +            }
> +        }
> +    }
> +}
> +#endif
> +
> +#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES
> +void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    struct vcpu *v;
> +    mfn_t *oos;
> +
> +    ASSERT(mfn_is_out_of_sync(gmfn));
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;

Same for oos and idx here, which would shrink this function overall.

As a tangent, do we really want all these modulo 3's all over the place?
It's a lot of divide instructions in paths that are fast-ish for shadow
guests.

> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +
> +        if ( mfn_eq(oos[idx], gmfn) )
> +            return;
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" marked OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +#endif
> +
> +/* Update the shadow, but keep the page out of sync. */
> +static inline void _sh_resync_l1(struct vcpu *v, mfn_t gmfn, mfn_t snpmfn)

inline can go.

> +{
> +    struct page_info *pg = mfn_to_page(gmfn);
> +
> +    ASSERT(mfn_valid(gmfn));
> +    ASSERT(page_is_out_of_sync(pg));
> +
> +    /* Call out to the appropriate per-mode resyncing function */
> +    if ( pg->shadow_flags & SHF_L1_32 )
> +        SHADOW_INTERNAL_NAME(sh_resync_l1, 2)(v, gmfn, snpmfn);
> +    else if ( pg->shadow_flags & SHF_L1_PAE )
> +        SHADOW_INTERNAL_NAME(sh_resync_l1, 3)(v, gmfn, snpmfn);
> +    else if ( pg->shadow_flags & SHF_L1_64 )
> +        SHADOW_INTERNAL_NAME(sh_resync_l1, 4)(v, gmfn, snpmfn);
> +}
> +
> +static int sh_remove_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
> +                                            mfn_t smfn, unsigned long off)
> +{
> +    ASSERT(mfn_valid(smfn));
> +    ASSERT(mfn_valid(gmfn));
> +
> +    switch ( mfn_to_page(smfn)->u.sh.type )
> +    {
> +    case SH_type_l1_32_shadow:
> +    case SH_type_fl1_32_shadow:
> +        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
> +            (d, gmfn, smfn, off);
> +
> +    case SH_type_l1_pae_shadow:
> +    case SH_type_fl1_pae_shadow:
> +        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
> +            (d, gmfn, smfn, off);
> +
> +    case SH_type_l1_64_shadow:
> +    case SH_type_fl1_64_shadow:
> +        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 4)
> +            (d, gmfn, smfn, off);
> +
> +    default:
> +        return 0;
> +    }
> +}
> +
> +/*
> + * Fixup arrays: We limit the maximum number of writable mappings to
> + * SHADOW_OOS_FIXUPS and store enough information to remove them
> + * quickly on resync.
> + */
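Coming back to the modulo tangent above: the "(idx + 1)" probe step at
least could use a simple wrap helper instead, e.g. (name hypothetical,
untested, and it does nothing for the initial hash calculation itself):

    static unsigned int oos_next_idx(unsigned int idx)
    {
        /* Wrap without a modulo; SHADOW_OOS_PAGES is small. */
        return idx + 1 < SHADOW_OOS_PAGES ? idx + 1 : 0;
    }

with the various "(idx + 1) % SHADOW_OOS_PAGES" becoming
"oos_next_idx(idx)".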
> +
> +static inline int oos_fixup_flush_gmfn(struct vcpu *v, mfn_t gmfn,
> +                                       struct oos_fixup *fixup)

inline

> +{
> +    struct domain *d = v->domain;
> +    int i;
> +    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
> +    {
> +        if ( !mfn_eq(fixup->smfn[i], INVALID_MFN) )
> +        {
> +            sh_remove_write_access_from_sl1p(d, gmfn,
> +                                             fixup->smfn[i],
> +                                             fixup->off[i]);
> +            fixup->smfn[i] = INVALID_MFN;
> +        }
> +    }
> +
> +    /* Always flush the TLBs. See comment on oos_fixup_add(). */
> +    return 1;
> +}

This looks suspiciously like it ought to be a void function.

[edit, yes - see later]

> +
> +void oos_fixup_add(struct domain *d, mfn_t gmfn,
> +                   mfn_t smfn, unsigned long off)
> +{
> +    int idx, next;
> +    mfn_t *oos;
> +    struct oos_fixup *oos_fixup;
> +    struct vcpu *v;
> +
> +    perfc_incr(shadow_oos_fixup_add);
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        oos_fixup = v->arch.paging.shadow.oos_fixup;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            int i;
> +            for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )

This is a case where "for ( int i = ..." would definitely read better.
Luckily, this example is simple enough that the compiler has already
optimised properly.

> +            {
> +                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn)
> +                     && (oos_fixup[idx].off[i] == off) )

Given that you mention style in the commit message, it would be nice to
move the && onto the previous line.

> +                    return;
> +            }
> +
> +            next = oos_fixup[idx].next;
> +
> +            if ( !mfn_eq(oos_fixup[idx].smfn[next], INVALID_MFN) )
> +            {
> +                TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_EVICT);
> +
> +                /* Reuse this slot and remove current writable mapping. */
> +                sh_remove_write_access_from_sl1p(d, gmfn,
> +                                                 oos_fixup[idx].smfn[next],
> +                                                 oos_fixup[idx].off[next]);
> +                perfc_incr(shadow_oos_fixup_evict);
> +                /*
> +                 * We should flush the TLBs now, because we removed a
> +                 * writable mapping, but since the shadow is already
> +                 * OOS we have no problem if another vcpu write to
> +                 * this page table. We just have to be very careful to
> +                 * *always* flush the tlbs on resync.
> +                 */
> +            }
> +
> +            oos_fixup[idx].smfn[next] = smfn;
> +            oos_fixup[idx].off[next] = off;
> +            oos_fixup[idx].next = (next + 1) % SHADOW_OOS_FIXUPS;
> +
> +            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_ADD);
> +            return;
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +static int oos_remove_write_access(struct vcpu *v, mfn_t gmfn,
> +                                   struct oos_fixup *fixup)
> +{
> +    struct domain *d = v->domain;
> +    int ftlb = 0;
> +
> +    ftlb |= oos_fixup_flush_gmfn(v, gmfn, fixup);

Oof yes.  With oos_fixup_flush_gmfn() changed to being void, it is now
obvious that ftlb is unconditionally 1 throughout this function, which
can be simplified to just:

    oos_fixup_flush_gmfn(v, gmfn, fixup);

    if ( sh_remove_write_access(d, gmfn, 0, 0) == -1 )
    {
        shadow_remove_all_shadows(d, gmfn);
        return 1;
    }

    guest_flush_tlb_mask(d, d->dirty_cpumask);
    return 0;

Maybe we don't want to go that far, but it is overly complex in its
current form.
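For reference, the void form of oos_fixup_flush_gmfn() I'm suggesting is
just (untested sketch, same body as above with the return dropped):

    static void oos_fixup_flush_gmfn(struct vcpu *v, mfn_t gmfn,
                                     struct oos_fixup *fixup)
    {
        struct domain *d = v->domain;
        unsigned int i;

        for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
        {
            if ( mfn_eq(fixup->smfn[i], INVALID_MFN) )
                continue;

            sh_remove_write_access_from_sl1p(d, gmfn, fixup->smfn[i],
                                             fixup->off[i]);
            fixup->smfn[i] = INVALID_MFN;
        }

        /* Callers must always flush the TLBs.  See oos_fixup_add(). */
    }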
> +
> +    switch ( sh_remove_write_access(d, gmfn, 0, 0) )
> +    {
> +    default:
> +    case 0:
> +        break;
> +
> +    case 1:
> +        ftlb |= 1;
> +        break;
> +
> +    case -1:
> +        /*
> +         * An unfindable writeable typecount has appeared, probably via a
> +         * grant table entry: can't shoot the mapping, so try to unshadow
> +         * the page.  If that doesn't work either, the guest is granting
> +         * his pagetables and must be killed after all.
> +         * This will flush the tlb, so we can return with no worries.
> +         */
> +        shadow_remove_all_shadows(d, gmfn);
> +        return 1;
> +    }
> +
> +    if ( ftlb )
> +        guest_flush_tlb_mask(d, d->dirty_cpumask);
> +
> +    return 0;
> +}
> +
> +static inline void trace_resync(int event, mfn_t gmfn)

inline, and this reminds me that I *still* need to sort my series to
avoid stack rubble leakage in the trace subsystem.

"int" event really ought to become unsigned int, but it doesn't matter
in this case because the timestamp (bit 31) doesn't need setting.

> +{
> +    if ( tb_init_done )
> +    {
> +        /* Convert gmfn to gfn */
> +        gfn_t gfn = mfn_to_gfn(current->domain, gmfn);
> +
> +        __trace_var(event, 0/*!tsc*/, sizeof(gfn), &gfn);
> +    }
> +}
> +
> +/* Pull all the entries on an out-of-sync page back into sync. */
> +static void _sh_resync(struct vcpu *v, mfn_t gmfn,
> +                       struct oos_fixup *fixup, mfn_t snp)
> +{
> +    struct page_info *pg = mfn_to_page(gmfn);
> +
> +    ASSERT(paging_locked_by_me(v->domain));
> +    ASSERT(mfn_is_out_of_sync(gmfn));
> +    /* Guest page must be shadowed *only* as L1 when out of sync. */
> +    ASSERT(!(mfn_to_page(gmfn)->shadow_flags & SHF_page_type_mask
> +             & ~SHF_L1_ANY));
> +    ASSERT(!sh_page_has_multiple_shadows(mfn_to_page(gmfn)));
> +
> +    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
> +
> +    /* Need to pull write access so the page *stays* in sync. */
> +    if ( oos_remove_write_access(v, gmfn, fixup) )
> +    {
> +        /* Page has been unshadowed. */
> +        return;
> +    }
> +
> +    /* No more writable mappings of this page, please */
> +    pg->shadow_flags &= ~SHF_oos_may_write;
> +
> +    /* Update the shadows with current guest entries. */
> +    _sh_resync_l1(v, gmfn, snp);
> +
> +    /* Now we know all the entries are synced, and will stay that way */
> +    pg->shadow_flags &= ~SHF_out_of_sync;
> +    perfc_incr(shadow_resync);
> +    trace_resync(TRC_SHADOW_RESYNC_FULL, gmfn);
> +}
> +
> +/* Add an MFN to the list of out-of-sync guest pagetables */
> +static void oos_hash_add(struct vcpu *v, mfn_t gmfn)
> +{
> +    int i, idx, oidx, swap = 0;
> +    mfn_t *oos = v->arch.paging.shadow.oos;
> +    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
> +    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
> +    struct oos_fixup fixup = { .next = 0 };
> +
> +    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
> +        fixup.smfn[i] = INVALID_MFN;
> +
> +    idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +    oidx = idx;
> +
> +    if ( !mfn_eq(oos[idx], INVALID_MFN)
> +         && (mfn_x(oos[idx]) % SHADOW_OOS_PAGES) == idx )
> +    {
> +        /* Punt the current occupant into the next slot */
> +        SWAP(oos[idx], gmfn);
> +        SWAP(oos_fixup[idx], fixup);
> +        swap = 1;
> +        idx = (idx + 1) % SHADOW_OOS_PAGES;
> +    }
> +    if ( !mfn_eq(oos[idx], INVALID_MFN) )
> +    {
> +        /* Crush the current occupant. */
> +        _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
> +        perfc_incr(shadow_unsync_evict);
> +    }
> +    oos[idx] = gmfn;
> +    oos_fixup[idx] = fixup;
> +
> +    if ( swap )
> +        SWAP(oos_snapshot[idx], oos_snapshot[oidx]);
> +
> +    copy_domain_page(oos_snapshot[oidx], oos[oidx]);
> +}
> +
> +/* Remove an MFN from the list of out-of-sync guest pagetables */
> +void oos_hash_remove(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    mfn_t *oos;
> +    struct vcpu *v;
> +
> +    SHADOW_PRINTK("d%d gmfn %lx\n", d->domain_id, mfn_x(gmfn));
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            oos[idx] = INVALID_MFN;
> +            return;
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    mfn_t *oos;
> +    mfn_t *oos_snapshot;
> +    struct vcpu *v;
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            return oos_snapshot[idx];
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +/* Pull a single guest page back into sync */
> +void sh_resync(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    mfn_t *oos;
> +    mfn_t *oos_snapshot;
> +    struct oos_fixup *oos_fixup;
> +    struct vcpu *v;
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        oos_fixup = v->arch.paging.shadow.oos_fixup;
> +        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            _sh_resync(v, gmfn, &oos_fixup[idx], oos_snapshot[idx]);
> +            oos[idx] = INVALID_MFN;
> +            return;
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +/*
> + * Figure out whether it's definitely safe not to sync this l1 table,
> + * by making a call out to the mode in which that shadow was made.
> + */
> +static int sh_skip_sync(struct vcpu *v, mfn_t gl1mfn)
> +{
> +    struct page_info *pg = mfn_to_page(gl1mfn);

Newline here, and ...

> +    if ( pg->shadow_flags & SHF_L1_32 )
> +        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 2)(v, gl1mfn);
> +    else if ( pg->shadow_flags & SHF_L1_PAE )
> +        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 3)(v, gl1mfn);
> +    else if ( pg->shadow_flags & SHF_L1_64 )
> +        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 4)(v, gl1mfn);

here IMO.

> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not shadowed as an l1\n",
> +           mfn_x(gl1mfn));
> +    BUG();
> +}
> +
> +/*
> + * Pull all out-of-sync pages back into sync.  Pages brought out of sync
> + * on other vcpus are allowed to remain out of sync, but their contents
> + * will be made safe (TLB flush semantics); pages unsynced by this vcpu
> + * are brought back into sync and write-protected.  If skip != 0, we try
> + * to avoid resyncing at all if we think we can get away with it.
> + */
> +void sh_resync_all(struct vcpu *v, int skip, int this, int others)

This is begging to become

    void sh_resync_all(struct vcpu *v, unsigned int flags)

because, if nothing else, it changes the callers to be:

    sh_resync_all(v, RESYNC_THIS | RESYNC_OTHERS);
    sh_resync_all(v, RESYNC_THIS);
    sh_resync_all(v, RESYNC_SKIP | RESYNC_THIS);

which is far more readable than the raw numbers currently there.
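With, for example (flag names and values purely illustrative, not part
of this patch), something like:

    /* Hypothetical flags mirroring the current skip/this/others booleans. */
    #define RESYNC_SKIP    (1u << 0)
    #define RESYNC_THIS    (1u << 1)
    #define RESYNC_OTHERS  (1u << 2)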
I don't mind doing the patch, but I'll probably wait until you've got
this committed to avoid churn.

~Andrew