
Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 28 Mar 2023 16:04:47 +0200
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Paul Durrant <paul@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Tue, 28 Mar 2023 14:05:15 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Mar 27, 2023 at 12:09:15PM +0200, Marek Marczykowski-Górecki wrote:
> In some cases, only a few registers on a page need to be write-protected.
> Examples include the USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole page either).
> The current API allows only marking whole pages read-only, which sometimes
> may cover other registers that the guest may need to write into.
> 
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it is either immediately crashed on the EPT violation (if
> HVM) or, if PV, it gets a #PF. In the case of Linux PV, if the access was
> from userspace (e.g. via /dev/mem), it will try to fix it up by updating
> the page tables (which Xen will again force to read-only) and will hit
> that #PF again, looping endlessly. Both behaviors are undesirable if the
> guest could actually be allowed the write.
> 
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables, do this via
> emulation (or simply the page fault handler for PV) that handles writes
> that are supposed to be allowed. Those writes require the page to be
> mapped in Xen, so the subpage_mmio_ro_add() function takes the fixmap
> index of the page. The page needs to be added to mmio_ro_ranges first,
> anyway.
> Sub-page ranges are stored using a rangeset for each added page, and
> those pages are stored on a plain list (as there aren't supposed to be
> many pages needing this precise r/o control).

Since mmio_ro_ranges is x86-only, it is possible to mutate
it to track address ranges instead of page frames.  The current type
is unsigned long, so that should be fine, and it would avoid having to
create a per-page rangeset just to track offsets.
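Something along these lines is what I had in mind (untested, and the
existing consumers of mmio_ro_ranges would of course need adjusting to
the new granularity):

    /* Untested sketch: track machine addresses (byte granularity). */
    rc = rangeset_add_range(mmio_ro_ranges,
                            mfn_to_maddr(mfn) + offset_s,
                            mfn_to_maddr(mfn) + offset_e);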

> The mechanism this API plugs into is slightly different for PV and HVM.
> For both paths, it's plugged into mmio_ro_emulated_write(). For PV, that
> is already called for a #PF on a read-only MMIO page. For HVM however, an
> EPT violation on a p2m_mmio_direct page results in a direct
> domain_crash(). To reach mmio_ro_emulated_write(), change how write
> violations for p2m_mmio_direct are handled - specifically, treat them
> similarly to p2m_ioreq_server. This makes the relevant ioreq handler be
> called, which finally ends up calling mmio_ro_emulated_write().
> Both of those paths need the MFN to which the guest tried to write (to
> check which part of the page is supposed to be read-only, and where the
> page is mapped for writes). This information currently isn't available
> directly in mmio_ro_emulated_write(), but in both cases it is already
> resolved somewhere higher in the call tree. Pass it down to
> mmio_ro_emulated_write() via the new mmio_ro_emulate_ctxt.mfn field.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> ---
> Shadow mode is not tested, but I don't expect it to work differently from
> HAP in areas related to this patch.
> The locking used should make it safe to use similarly to mmio_ro_ranges,
> but frankly the only use (introduced in the next patch) could go without
> locking at all, as subpage_mmio_ro_add() is called only before any
> domain is constructed and subpage_mmio_ro_remove() is never called.
> ---
>  xen/arch/x86/hvm/emulate.c      |   2 +-
>  xen/arch/x86/hvm/hvm.c          |   3 +-
>  xen/arch/x86/include/asm/mm.h   |  22 ++++-
>  xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
>  xen/arch/x86/pv/ro-page-fault.c |   1 +-
>  5 files changed, 207 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> index 95364deb1996..311102724dea 100644
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
>          .write      = mmio_ro_emulated_write,
>          .validate   = hvmemul_validate,
>      };
> -    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
>      struct hvm_emulate_ctxt ctxt;
>      const struct x86_emulate_ops *ops;
>      unsigned int seg, bdf;
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index d326fa1c0136..f1c928e3e4ee 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>       */
>      if ( (p2mt == p2m_mmio_dm) ||
>           (npfec.write_access &&
> -          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
> +          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
> +           p2mt == p2m_mmio_direct)) )
>      {
>          if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
>              hvm_inject_hw_exception(TRAP_gp_fault, 0);
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index db29e3e2059f..91937d556bac 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges;
>  void memguard_guard_stack(void *p);
>  void memguard_unguard_stack(void *p);
>  
> +/*
> + * Add more precise r/o marking for a MMIO page. The byte range specified here
> + * will still be R/O, but the rest of the page (not marked as R/O via another
> + * call) will have writes passed through. The write passthrough requires the
> + * caller to provide a fixmap entry.
> + * Since multiple callers can mark different areas of the same page, they might
> + * provide different fixmap entries (although that's very unlikely in
> + * practice). Only the one provided by the first caller will be used. The
> + * return value indicates whether this fixmap entry will be used, or a
> + * different one provided earlier (in which case the caller might decide to
> + * release it).

Why not use ioremap() to map the page instead of requiring a fixmap
entry?
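I.e. something like the below in subpage_mmio_ro_add() (untested sketch;
`addr` would be a new field in struct subpage_ro_range replacing
fixmap_idx):

        /* Sketch: map at registration time instead of using a fixmap entry. */
        entry->addr = ioremap(mfn_to_maddr(mfn), PAGE_SIZE);
        if ( !entry->addr )
        {
            rc = -ENOMEM;
            goto err_unlock;
        }

That would also avoid callers having to manage fixmap entries themselves.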

> + *
> + * Return values:
> + *  - negative: error
> + *  - 0: success, fixmap entry is claimed
> + *  - 1: success, fixmap entry set earlier will be used
> + */
> +int subpage_mmio_ro_add(mfn_t mfn, unsigned long offset_s,
> +                        unsigned long offset_e, int fixmap_idx);
> +int subpage_mmio_ro_remove(mfn_t mfn, unsigned long offset_s,
> +                           unsigned long offset_e);
> +
>  struct mmio_ro_emulate_ctxt {
>          unsigned long cr2;
>          unsigned int seg, bdf;
> +        mfn_t mfn;
>  };
>  
>  int cf_check mmio_ro_emulated_write(
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 0fe14faa5fa7..b50bdee40b6b 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
>  
>  struct rangeset *__read_mostly mmio_ro_ranges;
>  
> +/* Handling sub-page read-only MMIO regions */
> +struct subpage_ro_range {
> +    struct list_head list;
> +    mfn_t mfn;
> +    int fixmap_idx;
> +    struct rangeset *ro_bytes;
> +    struct rcu_head rcu;
> +};
> +
> +static LIST_HEAD(subpage_ro_ranges);
> +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> +static DEFINE_SPINLOCK(subpage_ro_lock);
> +
>  static uint32_t base_disallow_mask;
>  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
>  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return 0;
>  }
>  
> +int subpage_mmio_ro_add(
> +    mfn_t mfn,
> +    unsigned long offset_s,
> +    unsigned long offset_e,

Since those are page offsets, you can likely use unsigned int rather
than unsigned long?

I also wonder why not use [paddr_t start, paddr_t end] (or start and
size) and drop the mfn parameter?  You can certainly get the mfn from
the full address, and it seems more natural than having the caller
pass an mfn and two offsets.
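I.e. something like (just a sketch):

    int subpage_mmio_ro_add(paddr_t start, paddr_t end);

    /* ... with the page and offsets derived internally: */
    mfn_t mfn = maddr_to_mfn(start);
    unsigned int offset_s = start & ~PAGE_MASK;
    unsigned int offset_e = end & ~PAGE_MASK;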

> +    int fixmap_idx)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc;
> +
> +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> +    ASSERT(offset_s < PAGE_SIZE);
> +    ASSERT(offset_e < PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    list_for_each_entry( iter, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter==NULL marks it was a newly allocated entry */
> +        iter = NULL;
> +        entry = xmalloc(struct subpage_ro_range);
> +        rc = -ENOMEM;
> +        if ( !entry )
> +            goto err_unlock;
> +        entry->mfn = mfn;
> +        entry->fixmap_idx = fixmap_idx;
> +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> +                                       RANGESETF_prettyprint_hex);
> +        rc = -ENOMEM;

rc will already be -ENOMEM at this point, although doing error handling
this way is tricky...
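FWIW I find setting the error code in the failure path itself easier to
follow, e.g.:

        entry = xmalloc(struct subpage_ro_range);
        if ( !entry )
        {
            rc = -ENOMEM;
            goto err_unlock;
        }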

> +        if ( !entry->ro_bytes )
> +            goto err_unlock;
> +    }
> +
> +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> +    if ( rc < 0 )
> +        goto err_unlock;
> +
> +    if ( !iter )
> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> +        return 0;
> +    else
> +        return 1;
> +
> +err_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    if ( !iter )
> +    {
> +        if ( entry )
> +        {
> +            if ( entry->ro_bytes )
> +                rangeset_destroy(entry->ro_bytes);
> +            xfree(entry);
> +        }
> +    }
> +    return rc;
> +}
> +
> +static void subpage_mmio_ro_free(struct rcu_head *rcu)
> +{
> +    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);
> +
> +    rangeset_destroy(entry->ro_bytes);
> +    xfree(entry);
> +}
> +
> +int subpage_mmio_ro_remove(
> +    mfn_t mfn,
> +    unsigned long offset_s,
> +    unsigned long offset_e)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc;
> +
> +    ASSERT(offset_s < PAGE_SIZE);
> +    ASSERT(offset_e < PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    list_for_each_entry_rcu( iter, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    rc = -ENOENT;
> +    if ( !entry )
> +        goto out_unlock;
> +
> +    rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e);
> +    if ( rc < 0 )
> +        goto out_unlock;
> +
> +    rc = 0;

You can use `if ( rc ) goto out_unlock;` and that avoids having to
explicitly set rc to 0.
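I.e.:

    rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e);
    if ( rc )
        goto out_unlock;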

> +
> +    if ( !rangeset_is_empty(entry->ro_bytes) )
> +        goto out_unlock;
> +
> +    list_del_rcu(&entry->list);
> +    call_rcu(&entry->rcu, subpage_mmio_ro_free);
> +
> +out_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    return rc;
> +}
> +
> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned long offset,
> +    void *data,
> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;

const.

> +    void __iomem *addr;

Do we care about the access being aligned?
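If we do, something like the below might be wanted before performing the
write (just a sketch; where exactly to bail out to depends on the rest of
the handler):

    /* Sketch: possibly discard accesses which aren't naturally aligned. */
    if ( !IS_ALIGNED(offset, len) )
        goto write_ignored; /* hypothetical label in this handler */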

> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {

You need to take the spinlock at this point, since the contents of
entry->ro_bytes are not protected by the RCU.  The RCU only provides
safe iteration over the list, but not the content of the elements on
the list.
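I.e. roughly (untested; the rest of the handler isn't quoted here, so
`allowed` is just a placeholder local variable):

            /* Protect entry->ro_bytes with the spinlock. */
            spin_lock(&subpage_ro_lock);
            allowed = !rangeset_overlaps_range(entry->ro_bytes, offset,
                                               offset + len - 1);
            spin_unlock(&subpage_ro_lock);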

Thanks, Roger.



 

