
[PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only



In some cases, only a few registers on a page need to be write-protected.
Examples include the USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole page either).
The current API allows only marking whole pages read-only, which sometimes
covers other registers that the guest may legitimately need to write to.

Currently, when a guest tries to write to an MMIO page included in
mmio_ro_ranges, it is either crashed immediately on the EPT violation (if
it's HVM) or it gets a #PF (if PV). In the Linux PV case, if the access
came from userspace (e.g. via /dev/mem), the kernel tries to fix it up by
updating the page tables (which Xen again forces to be read-only), hits
the same #PF again, and loops endlessly. Both behaviors are undesirable
if the guest could actually be allowed to perform the write.

Introduce an API that allows marking only part of a page read-only. Since
page tables don't provide sub-page permissions, handle this in the
emulation layer (or, for PV, simply in the page fault handler), passing
through the writes that are supposed to be allowed. Such writes require
the page to be mapped in Xen, so the subpage_mmio_ro_add() function takes
the fixmap index of the page. The page needs to be added to
mmio_ro_ranges first, anyway.
Sub-page ranges are stored in a rangeset for each added page, and those
pages are kept on a plain list (as there aren't supposed to be many pages
needing this precise r/o control).
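
For illustration, a caller that wants to keep only the first 64 bytes of
a page read-only could look roughly like below (a minimal sketch:
FIX_EXAMPLE_MMIO and base_maddr are made-up names, and error handling is
omitted):

    mfn_t mfn = maddr_to_mfn(base_maddr);

    /* Map the page in Xen, so allowed writes can reach the hardware. */
    set_fixmap_nocache(FIX_EXAMPLE_MMIO, base_maddr);

    /* The whole page must be in mmio_ro_ranges first. */
    rangeset_add_singleton(mmio_ro_ranges, mfn_x(mfn));

    /* Keep bytes 0-63 read-only (offsets are inclusive); writes to the
     * rest of the page will be passed through to the device. */
    subpage_mmio_ro_add(mfn, 0, 63, FIX_EXAMPLE_MMIO);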

The mechanism this API plugs into is slightly different for PV and HVM.
Both paths go through mmio_ro_emulated_write(). For PV, that function is
already called on a #PF against a read-only MMIO page. For HVM, however,
an EPT violation on a p2m_mmio_direct page results in a direct
domain_crash(). To reach mmio_ro_emulated_write(), change how write
violations for p2m_mmio_direct are handled: specifically, treat them
similarly to p2m_ioreq_server. This makes the relevant ioreq handler be
called, which finally ends up calling mmio_ro_emulated_write().
Both of those paths need the MFN the guest tried to write to (to check
which part of the page is supposed to be read-only, and where the page
is mapped for writes). This information currently isn't available
directly in mmio_ro_emulated_write(), but in both cases it is already
resolved somewhere higher up the call tree. Pass it down to
mmio_ro_emulated_write() via the new mmio_ro_emulate_ctxt.mfn field.
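
In effect, the paths touched by this patch can be summarized as follows
(an informal sketch of the call chains, not literal code):

    HVM: EPT write violation on a p2m_mmio_direct page
      -> hvm_hap_nested_page_fault()    /* now treated like p2m_ioreq_server */
      -> handle_mmio_with_translation() /* ioreq/emulation machinery */
      -> mmio_ro_emulated_write()       /* gets the MFN via ctxt->mfn */
      -> subpage_mmio_write_emulate()   /* passes allowed writes through */

    PV: #PF on a read-only MMIO page
      -> mmio_ro_do_page_fault()        /* fills in mmio_ro_emulate_ctxt.mfn */
      -> x86_emulate()
      -> mmio_ro_emulated_write()
      -> subpage_mmio_write_emulate()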

Signed-off-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
---
Shadow mode is not tested, but I don't expect it to behave differently
from HAP in the areas related to this patch.
The locking used should make this safe to use similarly to
mmio_ro_ranges, but frankly the only user (introduced in the next patch)
could do without locking at all, as subpage_mmio_ro_add() is called only
before any domain is constructed and subpage_mmio_ro_remove() is never
called.
---
 xen/arch/x86/hvm/emulate.c      |   2 +-
 xen/arch/x86/hvm/hvm.c          |   3 +-
 xen/arch/x86/include/asm/mm.h   |  22 ++++-
 xen/arch/x86/mm.c               | 181 +++++++++++++++++++++++++++++++++-
 xen/arch/x86/pv/ro-page-fault.c |   1 +-
 5 files changed, 207 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 95364deb1996..311102724dea 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
         .write      = mmio_ro_emulated_write,
         .validate   = hvmemul_validate,
     };
-    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
+    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = _mfn(mfn) };
     struct hvm_emulate_ctxt ctxt;
     const struct x86_emulate_ops *ops;
     unsigned int seg, bdf;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index d326fa1c0136..f1c928e3e4ee 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
      */
     if ( (p2mt == p2m_mmio_dm) ||
          (npfec.write_access &&
-          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
+          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) ||
+           p2mt == p2m_mmio_direct)) )
     {
         if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) )
             hvm_inject_hw_exception(TRAP_gp_fault, 0);
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index db29e3e2059f..91937d556bac 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges;
 void memguard_guard_stack(void *p);
 void memguard_unguard_stack(void *p);
 
+/*
+ * Add a more precise r/o marking for an MMIO page. The byte range specified
+ * here will still be R/O, but the rest of the page (unless marked R/O by
+ * another call) will have writes passed through. The write passthrough
+ * requires the caller to provide a fixmap entry.
+ * Since multiple callers can mark different areas of the same page, they
+ * might provide different fixmap entries (although that's very unlikely in
+ * practice). Only the one provided by the first caller will be used. The
+ * return value indicates whether this fixmap entry will be used, or a
+ * different one provided earlier (in which case the caller might release it).
+ *
+ * Return values:
+ *  - negative: error
+ *  - 0: success, the fixmap entry is claimed
+ *  - 1: success, a fixmap entry set earlier will be used
+ */
+int subpage_mmio_ro_add(mfn_t mfn, unsigned long offset_s,
+                        unsigned long offset_e, int fixmap_idx);
+int subpage_mmio_ro_remove(mfn_t mfn, unsigned long offset_s,
+                           unsigned long offset_e);
+
 struct mmio_ro_emulate_ctxt {
         unsigned long cr2;
         unsigned int seg, bdf;
+        mfn_t mfn;
 };
 
 int cf_check mmio_ro_emulated_write(
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0fe14faa5fa7..b50bdee40b6b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
 
 struct rangeset *__read_mostly mmio_ro_ranges;
 
+/* Handling sub-page read-only MMIO regions */
+struct subpage_ro_range {
+    struct list_head list;
+    mfn_t mfn;
+    int fixmap_idx;
+    struct rangeset *ro_bytes;
+    struct rcu_head rcu;
+};
+
+static LIST_HEAD(subpage_ro_ranges);
+static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
+static DEFINE_SPINLOCK(subpage_ro_lock);
+
 static uint32_t base_disallow_mask;
 /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
 #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
@@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     return 0;
 }
 
+int subpage_mmio_ro_add(
+    mfn_t mfn,
+    unsigned long offset_s,
+    unsigned long offset_e,
+    int fixmap_idx)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    int rc;
+
+    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
+    ASSERT(offset_s < PAGE_SIZE);
+    ASSERT(offset_e < PAGE_SIZE);
+
+    spin_lock(&subpage_ro_lock);
+
+    list_for_each_entry( iter, &subpage_ro_ranges, list )
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    if ( !entry )
+    {
+        /* iter == NULL marks this as a newly allocated entry */
+        iter = NULL;
+        entry = xmalloc(struct subpage_ro_range);
+        rc = -ENOMEM;
+        if ( !entry )
+            goto err_unlock;
+        entry->mfn = mfn;
+        entry->fixmap_idx = fixmap_idx;
+        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
+                                       RANGESETF_prettyprint_hex);
+        rc = -ENOMEM;
+        if ( !entry->ro_bytes )
+            goto err_unlock;
+    }
+
+    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
+    if ( rc < 0 )
+        goto err_unlock;
+
+    if ( !iter )
+        list_add_rcu(&entry->list, &subpage_ro_ranges);
+
+    spin_unlock(&subpage_ro_lock);
+
+    if ( !iter || entry->fixmap_idx == fixmap_idx )
+        return 0;
+    else
+        return 1;
+
+err_unlock:
+    spin_unlock(&subpage_ro_lock);
+    if ( !iter )
+    {
+        if ( entry )
+        {
+            if ( entry->ro_bytes )
+                rangeset_destroy(entry->ro_bytes);
+            xfree(entry);
+        }
+    }
+    return rc;
+}
+
+static void subpage_mmio_ro_free(struct rcu_head *rcu)
+{
+    struct subpage_ro_range *entry = container_of(rcu, struct subpage_ro_range, rcu);
+
+    rangeset_destroy(entry->ro_bytes);
+    xfree(entry);
+}
+
+int subpage_mmio_ro_remove(
+    mfn_t mfn,
+    unsigned long offset_s,
+    unsigned long offset_e)
+{
+    struct subpage_ro_range *entry = NULL, *iter;
+    int rc;
+
+    ASSERT(offset_s < PAGE_SIZE);
+    ASSERT(offset_e < PAGE_SIZE);
+
+    spin_lock(&subpage_ro_lock);
+
+    list_for_each_entry_rcu( iter, &subpage_ro_ranges, list )
+    {
+        if ( mfn_eq(iter->mfn, mfn) )
+        {
+            entry = iter;
+            break;
+        }
+    }
+    rc = -ENOENT;
+    if ( !entry )
+        goto out_unlock;
+
+    rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e);
+    if ( rc < 0 )
+        goto out_unlock;
+
+    rc = 0;
+
+    if ( !rangeset_is_empty(entry->ro_bytes) )
+        goto out_unlock;
+
+    list_del_rcu(&entry->list);
+    call_rcu(&entry->rcu, subpage_mmio_ro_free);
+
+out_unlock:
+    spin_unlock(&subpage_ro_lock);
+    return rc;
+}
+
+static void subpage_mmio_write_emulate(
+    mfn_t mfn,
+    unsigned long offset,
+    void *data,
+    unsigned int len)
+{
+    struct subpage_ro_range *entry;
+    void __iomem *addr;
+
+    rcu_read_lock(&subpage_ro_rcu);
+
+    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
+    {
+        if ( mfn_eq(entry->mfn, mfn) )
+        {
+            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
+                goto out_unlock;
+
+            addr = fix_to_virt(entry->fixmap_idx) + offset;
+            switch ( len )
+            {
+            case 1:
+                writeb(*(uint8_t*)data, addr);
+                break;
+            case 2:
+                writew(*(uint16_t*)data, addr);
+                break;
+            case 4:
+                writel(*(uint32_t*)data, addr);
+                break;
+            case 8:
+                writeq(*(uint64_t*)data, addr);
+                break;
+            default:
+                /* mmio_ro_emulated_write() already validated the size */
+                ASSERT_UNREACHABLE();
+            }
+            goto out_unlock;
+        }
+    }
+    gdprintk(XENLOG_WARNING,
+             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
+             mfn_x(mfn), offset, len);
+
+out_unlock:
+    rcu_read_unlock(&subpage_ro_rcu);
+}
+
 int cf_check mmio_ro_emulated_write(
     enum x86_segment seg,
     unsigned long offset,
@@ -4911,6 +5090,8 @@ int cf_check mmio_ro_emulated_write(
         return X86EMUL_UNHANDLEABLE;
     }
 
+    subpage_mmio_write_emulate(mmio_ro_ctxt->mfn, offset & (PAGE_SIZE - 1), p_data, bytes);
+
     return X86EMUL_OKAY;
 }
 
diff --git a/xen/arch/x86/pv/ro-page-fault.c b/xen/arch/x86/pv/ro-page-fault.c
index 5963f5ee2d51..91caa2c8f520 100644
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -342,6 +342,7 @@ static int mmio_ro_do_page_fault(struct x86_emulate_ctxt *ctxt,
             return X86EMUL_UNHANDLEABLE;
     }
 
+    mmio_ro_ctxt.mfn = mfn;
     ctxt->data = &mmio_ro_ctxt;
    if ( pci_ro_mmcfg_decode(mfn_x(mfn), &mmio_ro_ctxt.seg, &mmio_ro_ctxt.bdf) )
         return x86_emulate(ctxt, &mmcfg_intercept_ops);
-- 
git-series 0.9.1