Xen project Mailing List

Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only

From: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>

Date: Wed, 29 Mar 2023 12:51:26 +0200

Cc: Paul Durrant <paul@xxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Wed, 29 Mar 2023 10:51:54 +0000

Feedback-id: i1568416f:Fastmail

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, Mar 29, 2023 at 10:50:20AM +0200, Jan Beulich wrote: > On 27.03.2023 12:09, Marek Marczykowski-Górecki wrote: > > In some cases, only few registers on a page needs to be write-protected. > > Examples include USB3 console (64 bytes worth of registers) or MSI-X's > > PBA table (which doesn't need to span the whole table either). > > Yet like the MSI-X table the PBA is not permitted to share a page with > other registers (the table itself excluded). So a need there would > (again) only arise for devices violating the spec. For PBA, indeed (and due to not seeing device needing such hack, I don't do that for PBA yet). But the XHCI spec doesn't include such limitation, so this case is perfectly valid. > > Current API allows only marking whole pages pages read-only, which > > sometimes may cover other registers that guest may need to write into. > > > > Currently, when a guest tries to write to an MMIO page on the > > mmio_ro_ranges, it's either immediately crashed on EPT violation - if > > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was > > from userspace (like, /dev/mem), it will try to fixup by updating page > > tables (that Xen again will force to read-only) and will hit that #PF > > again (looping endlessly). Both behaviors are undesirable if guest could > > actually be allowed the write. > > > > Introduce an API that allows marking part of a page read-only. Since > > sub-page permissions are not a thing in page tables, do this via > > emulation (or simply page fault handler for PV) that handles writes that > > are supposed to be allowed. Those writes require the page to be mapped > > to Xen, so subpage_mmio_ro_add() function takes fixmap index of the > > page. The page needs to be added to mmio_ro_ranges, first anyway. > > Sub-page ranges are stored using rangeset for each added page, and those > > pages are stored on a plain list (as there isn't supposed to be many > > pages needing this precise r/o control). > > While, unlike Roger, I don't want to see mmio_ro_ranges' granularity > changed, I wonder if a bitmap isn't going to be easier to use (even > if perhaps requiring a little more memory, but whether that's the case > depends on intended granularity - I'm not convinced we need byte- > granular ranges). I'm also worried of this yet more heavily tying to > x86 of (as of the next patch) the USB3 debug console driver (i.e. as > opposed to Roger I wouldn't take anything that's x86-only as > justification for making wider changes). Well, it's still under the very same CONFIG_X86, and for the same purpose, so I don't think it's "more heavily tying". > As to sub-page permissions: EPT has, as one of its extensions, support > for this. It might be worth at least mentioning, even if the feature > isn't intended to be used right here. Ah, ok. > Taking both remarks together, limiting granularity to 16(?) bytes > would allow using the advanced EPT functionality down the road, and > would at the same time limit the suggested bitmap to just 256 bits / > 32 bytes, which I think gets us below what even an empty rangeset > would require. Plus lookup would also be quite a bit more lightweight. Indeed, in that case it makes sense. > > The mechanism this API is plugged in is slightly different for PV and > > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV, > > it's already called for #PF on read-only MMIO page. For HVM however, EPT > > violation on p2m_mmio_direct page results in a direct domain_crash(). > > To reach mmio_ro_emulated_write(), change how write violations for > > p2m_mmio_direct are handled - specifically, treat them similar to > > p2m_ioreq_server. This makes relevant ioreq handler being called, > > that finally end up calling mmio_ro_emulated_write(). > > I wonder whether that isn't too heavy a change to existing behavior. > I understand that taking this path is necessary for the purpose of the > patch, but I don't think it is necessary for all p2m_mmio_direct pages. > Checking at least mmio_ro_ranges right in hvm_hap_nested_page_fault() > looks reasonable to me. Before this change, domU was crashed on write EPT violation to p2m_mmio_direct page (which, IIUC, can happen only for mmio_ro_ranges, as otherwise page is mapped R/W), so I don't think there are many accesses that would hit this path for other reasons. > > --- a/xen/arch/x86/mm.c > > +++ b/xen/arch/x86/mm.c > > @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid; > > > > struct rangeset *__read_mostly mmio_ro_ranges; > > > > +/* Handling sub-page read-only MMIO regions */ > > +struct subpage_ro_range { > > + struct list_head list; > > + mfn_t mfn; > > + int fixmap_idx; > > + struct rangeset *ro_bytes; > > + struct rcu_head rcu; > > +}; > > + > > +static LIST_HEAD(subpage_ro_ranges); > > +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu); > > +static DEFINE_SPINLOCK(subpage_ro_lock); > > + > > static uint32_t base_disallow_mask; > > /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. > > */ > > #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & > > ~_PAGE_GLOBAL) > > @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, > > XEN_GUEST_HANDLE_PARAM(void) arg) > > return 0; > > } > > > > +int subpage_mmio_ro_add( > > As long as patch 2 is going to add the only users, __init please, and > there's no need for a "remove" counterpart. __init makes sense. But as for removing "remove" part, I'm not sure. I realize it is a dead code now, but it's easier to introduce it now to provide complete API, than later by somebody else who would want to use it in other places. Is there some trick to make compiler/linker optimize it out? > > > + mfn_t mfn, > > + unsigned long offset_s, > > + unsigned long offset_e, > > + int fixmap_idx) > > enum fixed_addresses? If that parameter stays, yes. > > +{ > > + struct subpage_ro_range *entry = NULL, *iter; > > + int rc; > > + > > + ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn))); > > + ASSERT(offset_s < PAGE_SIZE); > > + ASSERT(offset_e < PAGE_SIZE); > > + > > + spin_lock(&subpage_ro_lock); > > + > > + list_for_each_entry( iter, &subpage_ro_ranges, list ) > > Nit: Style (either you view list_for_each_entry as a [pseudo]keyword > for the purposes of what blanks should be present/absent, or you don't, > a mixture isn't reasonable; also elsewhere). Which version would be your preference for list_for_each_entry? > > + { > > + if ( mfn_eq(iter->mfn, mfn) ) > > + { > > + entry = iter; > > + break; > > + } > > + } > > + if ( !entry ) > > + { > > + /* iter==NULL marks it was a newly allocated entry */ > > + iter = NULL; > > + entry = xmalloc(struct subpage_ro_range); > > + rc = -ENOMEM; > > + if ( !entry ) > > + goto err_unlock; > > + entry->mfn = mfn; > > + entry->fixmap_idx = fixmap_idx; > > + entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio", > > + RANGESETF_prettyprint_hex); > > + rc = -ENOMEM; > > + if ( !entry->ro_bytes ) > > + goto err_unlock; > > + } > > + > > + rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e); > > + if ( rc < 0 ) > > + goto err_unlock; > > + > > + if ( !iter ) > > + list_add_rcu(&entry->list, &subpage_ro_ranges); > > + > > + spin_unlock(&subpage_ro_lock); > > + > > + if ( !iter || entry->fixmap_idx == fixmap_idx ) > > + return 0; > > + else > > + return 1; > > If this case is really needed, this special return value (as documented > in the header) probably needs specially handling in the callers - it's > not necessarily an error after all. But I wonder whether it wouldn't be > easier to check earlier on and fail right away (with e.g. -EBUSY), The idea is to allow multiple sub-ranges in a single page. Again, if using ioremap() internally, instead of fixmap provided externally, this case will go away. > rather > than adding the range and _then_ (kind of, as per patch 2) failing. Right, I missed "!= 0" there. > > +err_unlock: > > Nit: Style (labels indented by at least one space please, also elsewhere). > > > +static void subpage_mmio_write_emulate( > > + mfn_t mfn, > > + unsigned long offset, > > unsigned int > > > + void *data, > > const > > > + unsigned int len) > > +{ > > + struct subpage_ro_range *entry; > > + void __iomem *addr; > > + > > + rcu_read_lock(&subpage_ro_rcu); > > + > > + list_for_each_entry_rcu( entry, &subpage_ro_ranges, list ) > > + { > > + if ( mfn_eq(entry->mfn, mfn) ) > > + { > > + if ( rangeset_overlaps_range(entry->ro_bytes, offset, offset + > > len - 1) ) > > + goto out_unlock; > > This case ... > > > + addr = fix_to_virt(entry->fixmap_idx) + offset; > > + switch ( len ) > > + { > > + case 1: > > + writeb(*(uint8_t*)data, addr); > > + break; > > + case 2: > > + writew(*(uint16_t*)data, addr); > > + break; > > + case 4: > > + writel(*(uint32_t*)data, addr); > > + break; > > + case 8: > > + writeq(*(uint64_t*)data, addr); > > + break; > > + default: > > + /* mmio_ro_emulated_write() already validated the size */ > > + ASSERT_UNREACHABLE(); > > ... as well as, in a release build, this one wants to ... > > > + } > > + goto out_unlock; > > ... not use this path but ... > > > + } > > + } > > + gdprintk(XENLOG_WARNING, > > + "ignoring write to R/O MMIO mfn %" PRI_mfn " offset %lx len > > %u\n", > > + mfn_x(mfn), offset, len); > > ... make it here. (By implication: I'm not convinced this wants to be a > gdprintk(), as I think at least one such warning would better surface in > release builds as well.) Right, gprintk() would make more sense indeed. > At the same time I don't think any message should be issued for write > attempts to pages which don't have part of it marked writable. We deal > with such silently right now, and this shouldn't change. At least for HVM domains, it isn't really silent. It's domain_crash() (before my change, not reaching this function at all). I think it is silent only for PV domains (or maybe even just hardware domain?). > In fact even > for ranges which don't overlap the writable portion of a page we may want > to consider remaining silent. The log message ought to be of interest > mainly for writes which we meant to permit, but which for whatever reason > we elect to suppress nevertheless. > > As to the message - why do you split MFN from offset, rather than simply > logging an address ("... %" PRI_mfn "%03x ...")? I could use such formatting trick indeed. > Like (iirc) Roger I think that misaligned accesses should be refused (and > hence also make it here). Ok, will change. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab

Attachment: signature.asc
Description: PGP signature

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.