
Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 29 Mar 2023 14:39:22 +0200
  • Cc: Paul Durrant <paul@xxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 29 Mar 2023 12:39:45 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 29.03.2023 12:51, Marek Marczykowski-Górecki wrote:
> On Wed, Mar 29, 2023 at 10:50:20AM +0200, Jan Beulich wrote:
>> On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote:
>>> In some cases, only a few registers on a page need to be write-protected.
>>> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
>>> PBA table (which doesn't need to span the whole page either).
>>
>> Yet like the MSI-X table the PBA is not permitted to share a page with
>> other registers (the table itself excluded). So a need there would
>> (again) only arise for devices violating the spec.
> 
> For PBA, indeed (and since I haven't seen a device needing such a hack, I
> don't do that for PBA yet). But the XHCI spec doesn't include such a
> limitation, so this case is perfectly valid.

My remark was merely because you mention PBA in the description.
Mentioning it is okay, but you would want to further qualify it, I
think.

>>> The current API allows only marking whole pages read-only, which
>>> sometimes may cover other registers that the guest may need to write into.
>>>
>>> Currently, when a guest tries to write to an MMIO page on the
>>> mmio_ro_ranges, it is either immediately crashed on an EPT violation (if
>>> HVM) or, if PV, it gets #PF. In the case of Linux PV, if the access was
>>> from userspace (e.g. via /dev/mem), it will try to fix it up by updating
>>> the page tables (which Xen will again force to read-only) and will hit
>>> that #PF again, looping endlessly. Both behaviors are undesirable if the
>>> guest could actually be allowed the write.
>>>
>>> Introduce an API that allows marking part of a page read-only. Since
>>> sub-page permissions are not a thing in page tables, do this via
>>> emulation (or simply the page fault handler, for PV) that handles writes
>>> that are supposed to be allowed. Those writes require the page to be
>>> mapped in Xen, so the subpage_mmio_ro_add() function takes the fixmap
>>> index of the page. The page needs to be added to mmio_ro_ranges first,
>>> anyway. Sub-page ranges are stored using a rangeset for each added page,
>>> and those pages are stored on a plain list (as there aren't supposed to
>>> be many pages needing this precise r/o control).
>>
>> While, unlike Roger, I don't want to see mmio_ro_ranges' granularity
>> changed, I wonder if a bitmap isn't going to be easier to use (even
>> if perhaps requiring a little more memory, but whether that's the case
>> depends on the intended granularity - I'm not convinced we need byte-
>> granular ranges). I'm also worried about this tying the USB3 debug
>> console driver (as of the next patch) yet more heavily to x86 (i.e.
>> unlike Roger I wouldn't take anything that's x86-only as justification
>> for making wider changes).
> 
> Well, it's still under the very same CONFIG_X86, and for the same
> purpose, so I don't think it's "more heavily tying".
> 
>> As to sub-page permissions: EPT has, as one of its extensions, support
>> for this. It might be worth at least mentioning, even if the feature
>> isn't intended to be used right here.
> 
> Ah, ok.
> 
>> Taking both remarks together, limiting granularity to 16(?) bytes
>> would allow using the advanced EPT functionality down the road, and
>> would at the same time limit the suggested bitmap to just 256 bits /
>> 32 bytes, which I think gets us below what even an empty rangeset
>> would require. Plus lookup would also be quite a bit more lightweight.
> 
> Indeed, in that case it makes sense.

Hmm, I've checked the SDM, and I was misremembering: Granularity is
128 bytes, which might be too large for the purposes here.

>>> @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>>      return 0;
>>>  }
>>>  
>>> +int subpage_mmio_ro_add(
>>
>> As long as patch 2 is going to add the only users, __init please, and
>> there's no need for a "remove" counterpart.
> 
> __init makes sense. But as for removing the "remove" part, I'm not sure. I
> realize it is dead code now, but it's easier to introduce it now to
> provide a complete API, than later by somebody else who would want to use
> it in other places. Is there some trick to make the compiler/linker
> optimize it out?

At the very least you could also mark it __init. There are also the .discard
and .discard.* sections we handle specially in the linker script. But no
matter what you do to retain the code without impacting the resulting binary,
iirc Misra tells us that there shouldn't be dead code.

>>> +    mfn_t mfn,
>>> +    unsigned long offset_s,
>>> +    unsigned long offset_e,
>>> +    int fixmap_idx)
>>
>> enum fixed_addresses?
> 
> If that parameter stays, yes.
> 
>>> +{
>>> +    struct subpage_ro_range *entry = NULL, *iter;
>>> +    int rc;
>>> +
>>> +    ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)));
>>> +    ASSERT(offset_s < PAGE_SIZE);
>>> +    ASSERT(offset_e < PAGE_SIZE);
>>> +
>>> +    spin_lock(&subpage_ro_lock);
>>> +
>>> +    list_for_each_entry( iter, &subpage_ro_ranges, list )
>>
>> Nit: Style (either you view list_for_each_entry as a [pseudo]keyword
>> for the purposes of what blanks should be present/absent, or you don't,
>> a mixture isn't reasonable; also elsewhere).
> 
> Which version would be your preference for list_for_each_entry?

I have no preference, and I'm sure we have ample examples of all forms
(including the mixed ones that we shouldn't have). There's a
for_each_online_cpu() in this file, using the non-keyword style, so
if anything this might be taken as reference. But I didn't check whether
there are any other uses of keyword-like identifiers which may use the
other style.

>>> +    {
>>> +        if ( mfn_eq(iter->mfn, mfn) )
>>> +        {
>>> +            entry = iter;
>>> +            break;
>>> +        }
>>> +    }
>>> +    if ( !entry )
>>> +    {
>>> +        /* iter==NULL marks it was a newly allocated entry */
>>> +        iter = NULL;
>>> +        entry = xmalloc(struct subpage_ro_range);
>>> +        rc = -ENOMEM;
>>> +        if ( !entry )
>>> +            goto err_unlock;
>>> +        entry->mfn = mfn;
>>> +        entry->fixmap_idx = fixmap_idx;
>>> +        entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio",
>>> +                                       RANGESETF_prettyprint_hex);
>>> +        rc = -ENOMEM;
>>> +        if ( !entry->ro_bytes )
>>> +            goto err_unlock;
>>> +    }
>>> +
>>> +    rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e);
>>> +    if ( rc < 0 )
>>> +        goto err_unlock;
>>> +
>>> +    if ( !iter )
>>> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
>>> +
>>> +    spin_unlock(&subpage_ro_lock);
>>> +
>>> +    if ( !iter || entry->fixmap_idx == fixmap_idx )
>>> +        return 0;
>>> +    else
>>> +        return 1;
>>
>> If this case is really needed, this special return value (as documented
>> in the header) probably needs special handling in the callers - it's
>> not necessarily an error after all. But I wonder whether it wouldn't be
>> easier to check earlier on and fail right away (with e.g. -EBUSY),
> 
> The idea is to allow multiple sub-ranges in a single page. Again, if
> using ioremap() internally instead of a fixmap provided externally, this
> case will go away.
> 
>> rather
>> than adding the range and _then_ (kind of, as per patch 2) failing.
> 
> Right, I missed "!= 0" there.

Hmm, adding "!= 0" won't make a difference, will it? Adding "< 0" would.

>>> +err_unlock:
>>
>> Nit: Style (labels indented by at least one space please, also elsewhere).
>>
>>> +static void subpage_mmio_write_emulate(
>>> +    mfn_t mfn,
>>> +    unsigned long offset,
>>
>> unsigned int
>>
>>> +    void *data,
>>
>> const
>>
>>> +    unsigned int len)
>>> +{
>>> +    struct subpage_ro_range *entry;
>>> +    void __iomem *addr;
>>> +
>>> +    rcu_read_lock(&subpage_ro_rcu);
>>> +
>>> +    list_for_each_entry_rcu( entry, &subpage_ro_ranges, list )
>>> +    {
>>> +        if ( mfn_eq(entry->mfn, mfn) )
>>> +        {
>>> +            if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + len - 1) )
>>> +                goto out_unlock;
>>
>> This case ...
>>
>>> +            addr = fix_to_virt(entry->fixmap_idx) + offset;
>>> +            switch ( len )
>>> +            {
>>> +            case 1:
>>> +                writeb(*(uint8_t*)data, addr);
>>> +                break;
>>> +            case 2:
>>> +                writew(*(uint16_t*)data, addr);
>>> +                break;
>>> +            case 4:
>>> +                writel(*(uint32_t*)data, addr);
>>> +                break;
>>> +            case 8:
>>> +                writeq(*(uint64_t*)data, addr);
>>> +                break;
>>> +            default:
>>> +                /* mmio_ro_emulated_write() already validated the size */
>>> +                ASSERT_UNREACHABLE();
>>
>> ... as well as, in a release build, this one wants to ...
>>
>>> +            }
>>> +            goto out_unlock;
>>
>> ... not use this path but ...
>>
>>> +        }
>>> +    }
>>> +    gdprintk(XENLOG_WARNING,
>>> +             "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len %u\n",
>>> +             mfn_x(mfn), offset, len);
>>
>> ... make it here. (By implication: I'm not convinced this wants to be a
>> gdprintk(), as I think at least one such warning would better surface in
>> release builds as well.)
> 
> Right, gprintk() would make more sense indeed.
> 
>> At the same time I don't think any message should be issued for write
>> attempts to pages which don't have any part of them marked writable. We
>> deal with such writes silently right now, and this shouldn't change.
> 
> At least for HVM domains, it isn't really silent. It's domain_crash()
> (before my change, not reaching this function at all).

Well, yes, but that's one instance in the lifetime of a domain.

Jan



 

