
Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 29 Mar 2023 10:50:20 +0200
  • Cc: Paul Durrant <paul@xxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 29 Mar 2023 08:50:41 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote:
> In some cases, only a few registers on a page need to be write-protected.
> Examples include the USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either).

Yet like the MSI-X table, the PBA is not permitted to share a page with
other registers (the table itself excluded). So a need there would
(again) only arise for devices violating the spec.

> The current API allows only marking whole pages read-only, which
> sometimes may cover other registers that the guest may need to write to.
> 
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on an EPT violation (if
> HVM), or it gets #PF (if PV). In the case of Linux PV, if the access was
> from userspace (like /dev/mem), it will try to fix up by updating page
> tables (which Xen will again force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if the guest
> could actually be allowed the write.
> 
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables, do this via
> emulation (or simply the page fault handler for PV) that handles writes
> that are supposed to be allowed. Those writes require the page to be
> mapped into Xen, so the subpage_mmio_ro_add() function takes the fixmap
> index of the page. The page needs to be added to mmio_ro_ranges first
> anyway. Sub-page ranges are stored using a rangeset for each added page,
> and those pages are stored on a plain list (as there aren't supposed to
> be many pages needing this precise r/o control).

While, unlike Roger, I don't want to see mmio_ro_ranges' granularity
changed, I wonder if a bitmap isn't going to be easier to use (even
if perhaps requiring a little more memory, but whether that's the case
depends on the intended granularity - I'm not convinced we need byte-
granular ranges). I'm also worried about this tying the USB3 debug
console driver (as of the next patch) yet more heavily to x86 (i.e.
unlike Roger I wouldn't take anything that's x86-only as justification
for making wider changes).

As to sub-page permissions: EPT has, as one of its extensions, support
for this. It might be worth at least mentioning, even if the feature
isn't intended to be used right here.

Taking both remarks together, limiting granularity to 16(?) bytes
would allow using the advanced EPT functionality down the road, and
would at the same time limit the suggested bitmap to just 256 bits /
32 bytes, which I think gets us below what even an empty rangeset
would require. Plus lookup would also be quite a bit more lightweight.
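What I have in mind is roughly this (just a sketch, with the names made
up here, assuming 16-byte granularity):

    #define SUBPAGE_MMIO_RO_ALIGN 16

    struct subpage_ro_range {
        struct list_head list;
        mfn_t mfn;
        int fixmap_idx;
        /* One bit per 16-byte chunk: 4096 / 16 = 256 bits = 32 bytes. */
        DECLARE_BITMAP(ro_elems, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN);
        struct rcu_head rcu;
    };

Marking a range read-only would then be just a bitmap_set() over the
covered chunks, and the lookup in the write handler a single test_bit().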

> The mechanism this API is plugged into is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on a read-only MMIO page. For HVM however,
> an EPT violation on a p2m_mmio_direct page results in a direct
> domain_crash(). To reach mmio_ro_emulated_write(), change how write
> violations for p2m_mmio_direct are handled - specifically, treat them
> similarly to p2m_ioreq_server. This makes the relevant ioreq handler be
> called, which finally ends up calling mmio_ro_emulated_write().

I wonder whether that isn't too heavy a change to existing behavior.
I understand that taking this path is necessary for the purpose of the
patch, but I don't think it is necessary for all p2m_mmio_direct pages.
Checking at least mmio_ro_ranges right in hvm_hap_nested_page_fault()
looks reasonable to me.
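
I.e. something along these lines (sketch only, to be placed near the
existing p2m type checks in hvm_hap_nested_page_fault()):

    /* Send writes to r/o MMIO pages through the emulation path. */
    if ( (p2mt == p2m_mmio_direct) && npfec.write_access &&
         rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
    {
        /* ... hand to hvm_emulate_one_mmio() ... */
    }

leaving all other p2m_mmio_direct faults to take their present path.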

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid;
>  
>  struct rangeset *__read_mostly mmio_ro_ranges;
>  
> +/* Handling sub-page read-only MMIO regions */
> +struct subpage_ro_range {
> +    struct list_head list;
> +    mfn_t mfn;
> +    int fixmap_idx;
> +    struct rangeset *ro_bytes;
> +    struct rcu_head rcu;
> +};
> +
> +static LIST_HEAD(subpage_ro_ranges);
> +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu);
> +static DEFINE_SPINLOCK(subpage_ro_lock);
> +
>  static uint32_t base_disallow_mask;
>  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. */
>  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & ~_PAGE_GLOBAL)
> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return 0;
>  }
>  
> +int subpage_mmio_ro_add(

As long as patch 2 is going to add the only users, __init please, and
there's no need for a "remove" counterpart.

> +    mfn_t mfn,
> +    unsigned long offset_s,
> +    unsigned long offset_e,
> +    int fixmap_idx)

enum fixed_addresses?

> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc;
> +
> +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
> +    ASSERT(offset_s < PAGE_SIZE);
> +    ASSERT(offset_e < PAGE_SIZE);
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    list_for_each_entry( iter, &subpage_ro_ranges, list )

Nit: Style (either you view list_for_each_entry as a [pseudo]keyword
for the purposes of what blanks should be present/absent, or you don't;
a mixture isn't reasonable; also elsewhere).
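I.e. either

    list_for_each_entry ( iter, &subpage_ro_ranges, list )

or

    list_for_each_entry(iter, &subpage_ro_ranges, list)

but not the mixed form used above.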

> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter==NULL marks it was a newly allocated entry */
> +        iter = NULL;
> +        entry = xmalloc(struct subpage_ro_range);
> +        rc = -ENOMEM;
> +        if ( !entry )
> +            goto err_unlock;
> +        entry->mfn = mfn;
> +        entry->fixmap_idx = fixmap_idx;
> +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
> +                                       RANGESETF_prettyprint_hex);
> +        rc = -ENOMEM;
> +        if ( !entry->ro_bytes )
> +            goto err_unlock;
> +    }
> +
> +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
> +    if ( rc < 0 )
> +        goto err_unlock;
> +
> +    if ( !iter )
> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
> +        return 0;
> +    else
> +        return 1;

If this case is really needed, this special return value (as documented
in the header) probably needs special handling in the callers - it's
not necessarily an error after all. But I wonder whether it wouldn't be
easier to check earlier on and fail right away (with e.g. -EBUSY), rather
than adding the range and _then_ (kind of, as per patch 2) failing.
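I.e. something like (sketch):

    if ( entry && entry->fixmap_idx != fixmap_idx )
    {
        rc = -EBUSY;
        goto err_unlock;
    }

ahead of the rangeset_add_range() call, rather than reporting the
mismatch only once the range was already inserted.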

> +err_unlock:

Nit: Style (labels indented by at least one space please, also elsewhere).

> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned long offset,

unsigned int

> +    void *data,

const

> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;
> +    void __iomem *addr;
> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {
> +            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
> +                goto out_unlock;

This case ...

> +            addr = fix_to_virt(entry->fixmap_idx) + offset;
> +            switch ( len )
> +            {
> +            case 1:
> +                writeb(*(uint8_t*)data, addr);
> +                break;
> +            case 2:
> +                writew(*(uint16_t*)data, addr);
> +                break;
> +            case 4:
> +                writel(*(uint32_t*)data, addr);
> +                break;
> +            case 8:
> +                writeq(*(uint64_t*)data, addr);
> +                break;
> +            default:
> +                /* mmio_ro_emulated_write() already validated the size */
> +                ASSERT_UNREACHABLE();

... as well as, in a release build, this one wants to ...

> +            }
> +            goto out_unlock;

... not use this path but ...

> +        }
> +    }
> +    gdprintk(XENLOG_WARNING,
> +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len 
> %u\n",
> +             mfn_x(mfn), offset, len);

... make it here. (By implication: I'm not convinced this wants to be a
gdprintk(), as I think at least one such warning would better surface in
release builds as well.)

At the same time I don't think any message should be issued for write
attempts to pages which don't have any part of them marked writable. We
deal with such writes silently right now, and this shouldn't change. In
fact even for ranges which don't overlap the writable portion of a page
we may want to consider remaining silent. The log message ought to be of
interest mainly for writes which we meant to permit, but which for
whatever reason we elect to suppress nevertheless.

As to the message - why do you split MFN from offset, rather than simply
logging an address ("... %" PRI_mfn "%03x ...")?
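I.e. e.g. (assuming offset also becomes unsigned int, as requested
further up):

    gprintk(XENLOG_WARNING,
            "ignoring write to R/O MMIO address %" PRI_mfn "%03x, len %u\n",
            mfn_x(mfn), offset, len);

(gprintk() rather than gdprintk() per the earlier remark).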

Like (iirc) Roger I think that misaligned accesses should be refused (and
hence also make it here).
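E.g. (sketch, using the existing IS_ALIGNED() helper):

    if ( !IS_ALIGNED(offset, len) )
        goto out_unlock; /* i.e. take the "ignoring write" path */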

Jan